Slide1
Simba: Efficient In-Memory Spatial Analytics
Dong Xie*, Feifei Li*, Bin Yao+, Gefei Li+, Liang Zhou+, Minyi Guo+
Slide2
Spatial Data is Ubiquitous!
Location-based Services
IoT Projects & Sensor Networks
Social Media
Slide3
Problems of Existing Systems
Disk-oriented -> low performance: Hadoop-GIS, SpatialHadoop, GeoMesa
No native support for spatial operators: Spark SQL, MemSQL
No sophisticated query planner & optimizer: SpatialSpark, GeoSpark
Slide4
Simba: Spatial In-Memory Big data Analytics
Simba is an extension of Spark SQL across the system stack!
1. Programming Interface
2. Table Indexing
3. Efficient Spatial Operators
4. New Query Optimizations
Slide5
Comparison with Existing Systems
Slide6
Query Workload in Simba
Life of a query in Simba
Slide7
Programming Interfaces
Extends both the SQL parser and the DataFrame API of Spark SQL
Makes spatial queries more natural
Achieves something that is impossible in Spark SQL
SELECT *
FROM points
SORT BY (x - 2)*(x - 2) + (y - 3)*(y - 3)
LIMIT 5
SELECT *
FROM points
WHERE POINT(x, y) IN KNN(POINT(2, 3), 5)
SELECT *
FROM queries q KNN JOIN pois p
ON POINT(p.x, p.y) IN KNN(POINT(q.x, q.y), 3)
Slide8
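As a rough illustration of what the KNN predicate in the queries above asks for, the sketch below ranks points by squared distance to a query point and keeps the closest k, mirroring the SORT BY ... LIMIT formulation. It is plain Python, not Simba code: the function name `knn` and the sample points are assumptions made for this example only.

```python
import heapq

def knn(points, query, k):
    """Return the k points nearest to `query`, closest first (brute force)."""
    qx, qy = query
    # Same ranking expression as the SORT BY ... LIMIT formulation.
    return heapq.nsmallest(
        k, points, key=lambda p: (p[0] - qx) ** 2 + (p[1] - qy) ** 2
    )

# Hypothetical sample data; the query mirrors POINT(2, 3) with k = 5.
points = [(0, 0), (2, 3), (2, 4), (6, 6), (3, 3), (10, 10), (1, 3)]
nearest = knn(points, (2, 3), 5)
print(nearest)
```

This brute-force version scans every point, which is exactly the cost that the indexing described later in the deck is meant to avoid.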
Programming Interfaces (cont’d)
Fully compatible with original Spark SQL operators.
Same level of flexibility for DataFrame
SELECT poi.id, count(*) AS c
FROM poi DISTANCE JOIN data
ON POINT(data.lat, data.long) IN CIRCLERANGE(POINT(poi.lat, poi.long), 3.0)
WHERE POINT(data.lat, data.long) IN RANGE(POINT(24.39, 66.88), POINT(49.38, 124.84))
GROUP BY poi.id
ORDER BY poi.id
poi.distanceJoin(data, Point(poi("lat"), poi("long")), Point(data("lat"), data("long")), 3.0)
  .range(Point(data("lat"), data("long")), Point(24.39, 66.88), Point(49.38, 124.84))
  .groupBy(poi("id"))
  .agg(count("*").as("c"))
  .sort(poi("id"))
  .show()
Slide9
Table Indexing
All Spark SQL operations are based on RDD scanning.
Inefficient for selective spatial queries!
In Spark SQL: Record -> Row, Table -> RDD[Row]
Solution in Simba: native two-level indexing over RDDs
Challenges:
RDD is not designed for random access
Achieve this without hurting Spark kernel and RDD abstraction
Slide10
Table Indexing (cont’d)
Two-level Indexing Framework: local + global indexing
[Figure: two-level indexing pipeline. RDD[Row] -> Partition -> Packing & Indexing -> IPartition[Row] (Array[Row] + Local Index) -> IndexRDD[Row]; Partition Info -> Global Index (on master node)]
CREATE INDEX idx_name ON R( ) USE idx_type
DROP INDEX idx_name ON table_name
Slide11
Table Indexing (cont’d)
Representation for Indexed Tables (RDDs) in Simba
case class IPartition[Type](data: Array[Type], index: Index)
type IndexRDD[Type] = RDD[IPartition[Type]]
Indexed tables are still RDDs!
Slide12
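To make the two-level layout concrete, here is a minimal Python sketch of the same shape: each partition packs its rows into an array with a local index, and a separate global index keeps one bounding box per partition. It is not Simba's implementation. The names `IndexedTable`, `bounding_box` and `build_indexed_table` are hypothetical, and the "local index" is just row positions sorted by x, standing in for a real spatial index. Partitions are assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

@dataclass
class IPartition:
    data: List[Point]   # rows of one partition, packed into an array
    index: List[int]    # stand-in local index: row positions ordered by x

@dataclass
class IndexedTable:
    partitions: List[IPartition]  # plays the role of IndexRDD[Row]
    global_index: List[Box]       # one bounding box per partition

def bounding_box(points: List[Point]) -> Box:
    """Minimum bounding rectangle of a non-empty list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def build_indexed_table(partitioned: List[List[Point]]) -> IndexedTable:
    """Pack each partition, index it locally, and collect the global index."""
    partitions = [
        IPartition(list(pts), sorted(range(len(pts)), key=lambda i: pts[i][0]))
        for pts in partitioned
    ]
    return IndexedTable(partitions, [bounding_box(pts) for pts in partitioned])

table = build_indexed_table([[(0, 0), (1, 2)], [(5, 5), (4, 7)]])
print(table.global_index)
```

The point of the layout is that the global index is small enough to consult centrally, while the local indexes stay with their data.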
Spatial operations
Indexing support -> efficient algorithms
Global Index: partition pruning
Local Index: parallel pruning within selected partitions
[Figure: a global index over partitions R1 to R6, each with its own local index; partition pruning on the master node, then parallel pruning on selected partitions]
Slide13
Spatial operations
Range query: two steps, global filtering + local processing
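The two steps can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Simba's code: partitions are modelled as (bounding box, points) pairs, the global step keeps only partitions whose box intersects the query box, and the local step is a plain scan standing in for a local index lookup. All names are hypothetical.

```python
def intersects(a, b):
    """True if boxes a and b, given as (min_x, min_y, max_x, max_y), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(box, p):
    """True if point p lies inside box (borders included)."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]

def range_query(partitions, low, high):
    query = (low[0], low[1], high[0], high[1])
    # Step 1, global filtering: drop partitions that cannot contain results.
    selected = [pts for box, pts in partitions if intersects(box, query)]
    # Step 2, local processing: check points only in the surviving partitions.
    return [p for pts in selected for p in pts if contains(query, p)]

# Hypothetical partitions: (bounding box, points).
partitions = [
    ((0, 0, 2, 2), [(0, 0), (1, 1), (2, 2)]),
    ((5, 5, 8, 8), [(5, 5), (8, 8)]),
    ((3, 0, 4, 1), [(3, 0), (4, 1)]),
]
result = range_query(partitions, (1, 0), (3, 2))
print(result)
```

In this example the second partition is never touched, because its bounding box lies entirely outside the query box.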
Slide14
Spatial Operations (cont’d)
k nearest neighbor query: key to achieving good performance:
Local indexes
Pruning bound that is sufficient to cover global kNN results.
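One way to obtain such a bound, sketched below in Python, is to collect k candidates from the partitions closest to the query point, use the k-th candidate distance as a radius, and then discard every partition whose bounding box lies farther away than that radius. This is an assumption about how the bound could be derived, not a transcription of Simba's algorithm. Partitions are modelled as (bounding box, points) pairs and all names are hypothetical.

```python
import heapq
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_dist(box, q):
    """Smallest distance from q to box (min_x, min_y, max_x, max_y)."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)

def knn_query(partitions, q, k):
    # Visit partitions nearest-first until k candidates are collected.
    ordered = sorted(partitions, key=lambda part: min_dist(part[0], q))
    candidates = []
    for _, pts in ordered:
        candidates.extend(pts)
        if len(candidates) >= k:
            break
    if len(candidates) < k:  # fewer than k points in total
        return sorted(candidates, key=lambda p: dist(p, q))
    # The k-th candidate distance is an upper bound on the true k-th distance.
    bound = sorted(dist(p, q) for p in candidates)[k - 1]
    # Any partition farther away than the bound cannot hold a true neighbor.
    survivors = [pts for box, pts in partitions if min_dist(box, q) <= bound]
    return heapq.nsmallest(
        k, (p for pts in survivors for p in pts), key=lambda p: dist(p, q)
    )

partitions = [
    ((0, 0, 2, 2), [(0, 0), (1, 1), (2, 2)]),
    ((5, 5, 6, 6), [(5, 5), (6, 6)]),
    ((2.5, 0, 3, 1), [(3, 0), (2.5, 1)]),
]
neighbors = knn_query(partitions, (2, 1), 3)
print(neighbors)
```

The bound is safe because every true neighbor is at most as far as the k-th candidate, so its partition's bounding box must also be within that distance.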
Slide15
More Sophisticated Spatial Operations
Distance join: our solution is the DJSpark algorithm
kNN join: our solution is the RKJSpark algorithm
Details in the paper…
Slide16
Query Optimizer
Index and spatial-awareness optimizations
Index scan optimization: for better index utilization
Selectivity estimation + cost-based optimization
Selectivity estimation over local indexes
Choose a proper plan: scan or use index.
Spatial predicates merging
Slide17
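The "scan or use index" decision can be illustrated with a small Python sketch. It estimates selectivity from per-node bounding boxes and row counts, assuming rows are spread uniformly inside each node, and picks the index only when the estimate is low. The uniformity assumption, the 0.2 threshold, and every function name here are hypothetical choices for illustration, not Simba's cost model.

```python
def area(box):
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)

def overlap_fraction(node_box, query):
    """Fraction of node_box covered by query; boxes are (min_x, min_y, max_x, max_y)."""
    inter = (max(node_box[0], query[0]), max(node_box[1], query[1]),
             min(node_box[2], query[2]), min(node_box[3], query[3]))
    if inter[0] > inter[2] or inter[1] > inter[3]:
        return 0.0  # no overlap
    node_area = area(node_box)
    # A degenerate (zero-area) node that touches the query counts fully.
    return 1.0 if node_area == 0 else area(inter) / node_area

def estimate_selectivity(index_nodes, query):
    """index_nodes: list of (bounding box, row count) taken from a local index."""
    total = sum(count for _, count in index_nodes)
    hits = sum(count * overlap_fraction(box, query) for box, count in index_nodes)
    return hits / total if total else 0.0

def choose_plan(selectivity, threshold=0.2):
    # Hypothetical rule: an index pays off only for selective predicates.
    return "index" if selectivity < threshold else "scan"

nodes = [((0, 0, 10, 10), 100), ((10, 0, 20, 10), 100)]
narrow = estimate_selectivity(nodes, (0, 0, 5, 5))
wide = estimate_selectivity(nodes, (0, 0, 20, 10))
print(choose_plan(narrow), choose_plan(wide))
```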
Query Optimizer (cont’d)
Partition size auto-tuning
Data locality
Load balancing
Memory fitness <- record-size estimator
Broadcast join optimization: small table joins large table
Logical partitioning optimization for RKJSpark: provides tighter pruning bounds
A good partitioner (e.g., STR partitioner)
Slide18
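For readers unfamiliar with STR (Sort-Tile-Recursive) packing, the two-dimensional idea fits in a short Python sketch: sort the points by x, cut them into vertical slabs, then sort each slab by y and cut it into partitions of a fixed capacity. This is a generic single-machine illustration of the technique, not Simba's distributed partitioner, and the function name and `capacity` parameter are assumptions.

```python
import math

def str_partition(points, capacity):
    """Group 2-D points into partitions of at most `capacity` points each."""
    n = len(points)
    if n == 0:
        return []
    num_parts = math.ceil(n / capacity)           # partitions needed
    num_slabs = math.ceil(math.sqrt(num_parts))   # vertical slabs
    slab_size = num_slabs * capacity              # points per slab
    by_x = sorted(points)                         # sort by x (ties by y)
    partitions = []
    for i in range(0, n, slab_size):
        # Within a slab, order by y and tile into fixed-size runs.
        slab = sorted(by_x[i:i + slab_size], key=lambda p: p[1])
        for j in range(0, len(slab), capacity):
            partitions.append(slab[j:j + capacity])
    return partitions

# A 4 x 4 grid of points, packed four per partition.
grid = [(x, y) for x in range(4) for y in range(4)]
parts = str_partition(grid, 4)
for part in parts:
    print(part)
```

Because nearby points end up in the same partition and partitions hold similar numbers of points, this kind of packing serves both data locality and load balancing.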
Comparison with Existing Systems (cont’d)
Single-relation operations: throughput
Single-relation operations: latency
Environment: a 9-node cluster with 54 cores and 135GB RAM
Query over 500M OpenStreetMap entries
Slide19
Comparison with Existing Systems (cont’d)
Join operations performance
Join between two 3M-entry tables
Slide20
Conclusion
Simba: a distributed in-memory spatial analytics engine
User-friendly SQL & DataFrame API
Indexing support for efficient query processing
Spatial operator implementation tailored towards Spark
Spatial and index-awareness optimizations
No changes to Spark kernel -> easier migration to higher versions of Spark
Superior performance compared against other systems
Now open sourced at: https://github.com/InitialDLab/Simba/
Under active development…
Slide21
Thanks for your attention!
dongx@cs.utah.edu
Slide22
Supported SQL & DataFrame API
Point wrapper
SQL: POINT(pois.x + 2, pois.y * 3)
DataFrame: Point(pois("x") + 2, pois("y") * 3)
Box range query
SQL: p IN RANGE(low, high)
DataFrame: range(base: Point, low: Point, high: Point)
Circle range query
SQL: p IN CIRCLERANGE(c, rd)
DataFrame: circleRange(base: Point, c: Point, rd: Double)
k nearest neighbor query
SQL: p IN KNN(q, k)
DataFrame: knn(base: Point, k: Int)
Slide23
Supported SQL & DataFrame API (cont’d)
kNN join
SQL: R KNN JOIN S ON s IN KNN(r, k)
DataFrame: knnJoin(target: DataFrame, left_key: Point, right_key: Point, k: Int)
Distance join
SQL: R DISTANCE JOIN S ON s IN CIRCLERANGE(r, )
DataFrame: distanceJoin(target: DataFrame, left_key: Point, right_key: Point, : Double)
Index management
SQL:
CREATE INDEX idx_name ON R( ) USE idx_type
DROP INDEX idx_name ON table_name
DROP INDEX ON table_name
SHOW INDEX ON R
DataFrame:
index(idx_type: IndexType, idx_name: String, attrs: Seq[Attribute])
dropIndex()
showIndex()
Slide24
Spatial Operations: Distance Join
Distance join: a general theta-join in Spark SQL -> Cartesian product!!!
Our solution: the DJSpark algorithm
[Figure: DJSpark. Labels: Global Join, Local Join, Local Index]
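The global join / local join split named on this slide can be illustrated with a small Python sketch. The global step pairs up partitions whose bounding boxes are within the distance threshold, and the local step compares points only inside those pairs. This is a sketch of the general two-phase idea under simplifying assumptions, not the DJSpark algorithm itself: the local step is a nested loop standing in for a local index probe, and all names, including `tau` for the threshold, are hypothetical.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def box_dist(a, b):
    """Smallest distance between boxes a and b, given as (min_x, min_y, max_x, max_y)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def distance_join(r_parts, s_parts, tau):
    # Global join: keep only partition pairs that could produce results.
    partition_pairs = [
        (r_pts, s_pts)
        for r_box, r_pts in r_parts
        for s_box, s_pts in s_parts
        if box_dist(r_box, s_box) <= tau
    ]
    # Local join: compare points within each surviving partition pair.
    return [
        (r, s)
        for r_pts, s_pts in partition_pairs
        for r in r_pts
        for s in s_pts
        if dist(r, s) <= tau
    ]

# Hypothetical partitions: (bounding box, points).
r_parts = [((0, 0, 1, 1), [(0, 0), (1, 1)]), ((10, 10, 11, 11), [(10, 10), (11, 11)])]
s_parts = [((1, 2, 2, 3), [(1, 2), (2, 3)]), ((20, 20, 21, 21), [(20, 20), (21, 21)])]
joined = distance_join(r_parts, s_parts, 1.5)
print(joined)
```

Of the four partition pairs in this example, only one survives the global step, so three quarters of the point comparisons a Cartesian product would perform are skipped.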
Slide25
Spatial Operations: kNN join
kNN join: solutions in Simba:
Block Nested Loop kNN join (BKJSpark-N)
Block Nested Loop kNN join with local R-Trees (BKJSpark-R)
Voronoi kNN join* (VKJSpark)
z-value kNN join+ (ZKJSpark) -> approximate kNN join
R-Tree kNN join (RKJSpark)
* W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient Processing of k Nearest Neighbor Joins using MapReduce. VLDB, 2012.
+ C. Zhang, F. Li, and J. Jestes. Efficient Parallel kNN Joins for Large Data in MapReduce. EDBT, 2012.
Slide26
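As a baseline for the kNN join variants listed above, the block nested loop idea can be sketched in plain Python: every block of R is compared against every block of S, and each point of R keeps a running list of its best k matches. This is a single-machine illustration of the general technique, not the BKJSpark-N implementation. It assumes the points of R are distinct, and the function and variable names are hypothetical.

```python
import heapq

def dist2(p, q):
    """Squared Euclidean distance (enough for ranking)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def bnl_knn_join(r_blocks, s_blocks, k):
    best = {}  # r -> up to k (squared distance, s) pairs seen so far
    for r_block in r_blocks:
        for s_block in s_blocks:
            for r in r_block:
                # Merge the running top-k with candidates from this S block.
                candidates = best.get(r, []) + [(dist2(r, s), s) for s in s_block]
                best[r] = heapq.nsmallest(k, candidates)
    return {r: [s for _, s in pairs] for r, pairs in best.items()}

# Hypothetical data, already split into blocks.
r_blocks = [[(0, 0)], [(10, 10)]]
s_blocks = [[(1, 0), (9, 9)], [(0, 2), (12, 10)]]
joined = bnl_knn_join(r_blocks, s_blocks, 2)
print(joined)
```

Every pair of blocks is visited, which is why the other variants in the list rely on indexes or partitioning to prune work.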
Spatial Operations: kNN join (cont’d)
R-Tree kNN join (RKJSpark)
Distributed hash-join like algorithm.
For each partition , find ,
Define as the centroid of partition
Take a uniform random sample , and suppose
For each partition :
Slide27
Query Optimizer
Index scan optimization: for better index utilization
[Figure: query plan rewrite. Before: Full Table Scan -> Filter By -> Result. After (Transform to DNF, Optimize): Table Scan using Index Operators With Predicate -> Filter By -> Result]
Slide28
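The "Transform to DNF" step can be illustrated with a short Python sketch that rewrites an AND/OR predicate tree into a list of conjunctions. Expressions are modelled as nested tuples such as ("and", left, right), with strings as atomic predicates. Negation is left out, and the representation is an assumption made for this example rather than Spark SQL's expression classes.

```python
def to_dnf(expr):
    """Return expr in disjunctive normal form as a list of conjunctions.

    Each conjunction is a list of atomic predicates (strings).
    """
    if isinstance(expr, str):
        return [[expr]]
    op, left, right = expr
    left_dnf, right_dnf = to_dnf(left), to_dnf(right)
    if op == "or":
        # A disjunction simply concatenates the alternatives.
        return left_dnf + right_dnf
    if op == "and":
        # A conjunction distributes over every pair of alternatives.
        return [a + b for a in left_dnf for b in right_dnf]
    raise ValueError(f"unsupported operator: {op}")

# (A OR B) AND C  ->  (A AND C) OR (B AND C)
predicate = ("and", ("or", "A", "B"), "C")
dnf = to_dnf(predicate)
print(dnf)
```

Once a filter is in this form, each conjunction can be examined separately to see whether it contains a predicate that an index can serve.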
Index Building Cost: Time
Index building time against data size
Index building time against dimension
Slide29
Index Building Cost: Space
Local index size
Global index size
Slide30
Performance against Spark SQL: Data Size
kNN query throughput
kNN query latency
Slide31
Performance of Joins: Data Size
Distance join performance
kNN join performance
Slide32
Support for multi-dimension
kNN throughput against dimension
kNN latency against dimension
Slide33
Support for multi-dimension joins
Distance join against dimension
kNN join against dimension
Slide34
Future Works
Native support for general geometric objects
Polygons, segments, etc.
Spatial join over predicates such as and
Data in very high dimensions (> 10d)
More sophisticated cost-based optimizations.