Simba: Efficient In-Memory Spatial Analytics

Uploaded On 2018-02-25



Presentation Transcript

Slide1

Simba: Efficient In-Memory Spatial Analytics

Dong Xie*, Feifei Li*, Bin Yao+, Gefei Li+, Liang Zhou+, Minyi Guo+

Slide2

Spatial Data is Ubiquitous!

Location-based Services
IoT Projects & Sensor Networks
Social Media

Slide3

Problems of Existing Systems

Disk-oriented -> low performance: Hadoop-GIS, SpatialHadoop, GeoMesa
No native support for spatial operators: Spark SQL, MemSQL
No sophisticated query planner & optimizer: SpatialSpark, GeoSpark

Slide4

Simba: Spatial In-Memory Big data Analytics

Simba is an extension of Spark SQL across the system stack!

1. Programming Interface
2. Table Indexing
3. Efficient Spatial Operators
4. New Query Optimizations

Slide5

Comparison with Existing Systems

Slide6

Query Workload in Simba

Life of a query in Simba

Slide7

Programming Interfaces

Extends both the SQL parser and the DataFrame API of Spark SQL
Makes spatial queries more natural
Achieves things that are impossible in Spark SQL

A nearest-neighbor query written with plain SQL operators:

    SELECT * FROM points
    SORT BY (x - 2)*(x - 2) + (y - 3)*(y - 3)
    LIMIT 5

The same query with Simba's KNN predicate:

    SELECT * FROM points
    WHERE POINT(x, y) IN KNN(POINT(2, 3), 5)
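To make the meaning of the two queries above concrete, here is a minimal brute-force sketch in Python. It is not Simba code: the function name `knn` and the sample points are assumptions made for illustration, and it only shows what result a 5-nearest-neighbor query around the point (2, 3) is expected to return.

```python
import heapq

def knn(points, q, k):
    """Return the k points closest to q, nearest first (brute force)."""
    qx, qy = q
    # Squared Euclidean distance preserves the ordering, so no sqrt is needed.
    # This is the same expression the SORT BY query uses.
    return heapq.nsmallest(
        k, points, key=lambda p: (p[0] - qx) ** 2 + (p[1] - qy) ** 2
    )

# Hypothetical sample data standing in for the "points" table.
points = [(0, 0), (2, 3), (3, 3), (10, 10), (2, 5), (1, 3), (6, 6)]
result = knn(points, (2, 3), 5)
print(result)
```

A full scan like this touches every row, which is the cost that the indexing slides later in the deck aim to avoid.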

A kNN join between two tables:

    SELECT * FROM queries q KNN JOIN pois p
    ON POINT(p.x, p.y) IN KNN(POINT(q.x, q.y), 3)

Slide8

Programming Interfaces (cont’d)

Fully compatible with original Spark SQL operators
Same level of flexibility for DataFrame

SQL:

    SELECT poi.id, count(*) AS c
    FROM poi DISTANCE JOIN data
    ON POINT(data.lat, data.long) IN CIRCLERANGE(POINT(poi.lat, poi.long), 3.0)
    WHERE POINT(data.lat, data.long) IN RANGE(POINT(24.39, 66.88), POINT(49.38, 124.84))
    GROUP BY poi.id
    ORDER BY poi.id

DataFrame:

    poi.distanceJoin(data, Point(poi("lat"), poi("long")),
                     Point(data("lat"), data("long")), 3.0)
       .range(Point(data("lat"), data("long")),
              Point(24.39, 66.88), Point(49.38, 124.84))
       .groupBy(poi("id"))
       .agg(count("*").as("c"))
       .sort(poi("id"))
       .show()

Slide9

Table Indexing

All Spark SQL operations are based on RDD scanning.
Inefficient for selective spatial queries!
In Spark SQL:
    Record -> Row
    Table -> RDD[Row]
Solution in Simba: native two-level indexing over RDDs
Challenges:
    RDD is not designed for random access
    Achieve this without hurting Spark kernel and RDD abstraction

Slide10

Table Indexing (cont’d)

Two-level Indexing Framework: local + global indexing

[Figure: two-level indexing framework. Labels: RDD[Row], Partition, Packing & Indexing, IPartition[Row] (Array[Row] + Local Index), IndexRDD[Row], Partition Info, Global Index (on master node)]

    CREATE INDEX idx_name ON R( ) USE idx_type
    DROP INDEX idx_name ON table_name

Slide11

Table Indexing (cont’d)

Representation for Indexed Tables (RDDs) in Simba

    case class IPartition[Type](data: Array[Type], index: Index)
    type IndexRDD[Type] = RDD[IPartition[Type]]
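The idea behind this representation can be sketched outside Spark. The Python below is an illustration only, under stated assumptions: `SortedIndex` is a hypothetical one-dimensional stand-in for a real local index, and `pack_partition` is an invented helper name. It shows each partition's rows packed into one array with an index stored next to them, so that an "indexed table" is still just a collection of partitions.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Any, List

@dataclass
class SortedIndex:
    """Hypothetical local index: sorted keys plus the row position of each key."""
    keys: List[float]
    positions: List[int]

    def lookup_range(self, low, high):
        # Binary search for the slice of keys that falls in [low, high].
        lo = bisect_left(self.keys, low)
        hi = bisect_right(self.keys, high)
        return self.positions[lo:hi]

@dataclass
class IPartition:
    data: List[Any]      # all rows of one partition, packed into one array
    index: SortedIndex   # local index over those rows

def pack_partition(rows, key):
    """Pack one partition's rows and build its local index."""
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    index = SortedIndex([key(rows[i]) for i in order], order)
    return IPartition(list(rows), index)

# The indexed table is a plain list of IPartition objects, one per partition.
partitions = [[(1, "a"), (7, "b"), (4, "c")], [(12, "d"), (9, "e")]]
indexed = [pack_partition(p, key=lambda row: row[0]) for p in partitions]

# Each partition answers the lookup through its own index, without a full scan.
hits = [part.data[i] for part in indexed for i in part.index.lookup_range(4, 9)]
print(hits)
```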

Indexed tables are still RDDs!

Slide12

Spatial operations

Indexing support -> efficient algorithms
Global index: partition pruning
Local index: parallel pruning within selected partitions

[Figure: partitions R1 to R6 with their local indexes and the global index. Partition pruning happens on the master node, followed by parallel pruning on the selected partitions]

Slide13

Spatial operations

Range query. Two steps: global filtering + local processing
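The two steps can be sketched as follows. This is a simplified illustration, not Simba's implementation: the global index is assumed to be a list of per-partition bounding boxes, and a linear scan stands in for the local index. All function names are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a partition."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def intersects(a, b):
    """True if two rectangles overlap (boundaries included)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(partitions, low, high):
    box = (low[0], low[1], high[0], high[1])
    # Step 1, global filtering: keep only partitions whose MBR overlaps the box.
    global_index = [mbr(p) for p in partitions]
    selected = [i for i, b in enumerate(global_index) if intersects(b, box)]
    # Step 2, local processing: examine rows of the selected partitions only.
    # A real local index would avoid this scan within each partition.
    result = [
        p
        for i in selected
        for p in partitions[i]
        if box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]
    ]
    return result, selected

partitions = [
    [(0, 0), (1, 2), (2, 1)],
    [(5, 5), (6, 7), (7, 6)],
    [(10, 0), (11, 1), (12, 2)],
]
result, selected = range_query(partitions, low=(1, 1), high=(6, 6))
print(result, selected)
```

In this example the third partition is never touched, because its bounding box lies entirely outside the query box.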

 Slide14

Spatial Operations (cont’d)

k nearest neighbor (kNN) query. Keys to good performance:
Local indexes
A pruning bound that is sufficient to cover the global kNN results
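One way to obtain such a pruning bound is sketched below. This is an assumption-laden illustration rather than the algorithm from the paper: it seeds candidates from the partitions closest to the query point, uses the distance to the k-th candidate as the bound, and skips every partition whose bounding box lies farther away than that bound. Function names are hypothetical, and brute-force selection stands in for the local indexes.

```python
import heapq
import math

def mbr(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def min_dist(q, box):
    """Smallest possible distance from q to any point inside box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return math.hypot(dx, dy)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def knn_query(partitions, q, k):
    if k < 1:
        raise ValueError("k must be at least 1")
    boxes = [mbr(p) for p in partitions]  # stand-in for the global index
    order = sorted(range(len(partitions)), key=lambda i: min_dist(q, boxes[i]))
    # Seed: read the nearest partitions until at least k candidates exist.
    candidates, seeded = [], 0
    for i in order:
        candidates.extend(partitions[i])
        seeded += 1
        if len(candidates) >= k:
            break
    best = heapq.nsmallest(k, candidates, key=lambda p: dist(p, q))
    # Every true kNN result lies within this radius of q.
    bound = dist(best[-1], q) if len(best) == k else math.inf
    visited = seeded
    for i in order[seeded:]:
        if min_dist(q, boxes[i]) > bound:
            continue  # cannot contain a point closer than the k-th candidate
        visited += 1
        candidates.extend(partitions[i])
    return heapq.nsmallest(k, candidates, key=lambda p: dist(p, q)), visited

partitions = [
    [(0, 0), (1, 1), (3, 0)],
    [(5, 5), (6, 5), (5, 6)],
    [(20, 20), (21, 21), (22, 20)],
]
result, visited = knn_query(partitions, q=(1, 0.25), k=2)
print(result, visited)
```

The bound is safe because the k-th seeded candidate is at least as far from the query as the true k-th neighbor, so no skipped partition can hold a better answer.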

 

 

 

 Slide15

More Sophisticated Spatial Operations

Distance join. Our solution: the DJSpark algorithm
kNN join. Our solution: the RKJSpark algorithm
Details in the paper…

 Slide16

Query Optimizer

Index and spatial-awareness optimizations
Index scan optimization: for better index utilization
Selectivity estimation + cost-based optimization
    Selectivity estimation over local indexes
    Choose a proper plan: scan or use index
Spatial predicates merging

Slide17

Query Optimizer (cont’d)

Partition size auto-tuning
    Data locality
    Load balancing
    Memory fitness <- record-size estimator
Broadcast join optimization: small table joins large table
Logical partitioning optimization for RKJSpark: provides tighter pruning bounds
A good partitioner (e.g., STR partitioner)

Slide18

Comparison with Existing Systems (cont’d)

[Figures: single-relation operations throughput; single-relation operations latency]
Environment: a 9-node cluster with 54 cores and 135 GB RAM
Query over 500M OpenStreetMap entries

Slide19

Comparison with Existing Systems (cont’d)

[Figure: join operations performance]
Join between two 3M-entry tables

Slide20

Conclusion

Simba: a distributed in-memory spatial analytics engine
User-friendly SQL & DataFrame API
Indexing support for efficient query processing
Spatial operator implementation tailored towards Spark
Spatial and index-awareness optimizations
No changes to Spark kernel -> easier migration to higher versions of Spark
Superior performance compared against other systems
Now open sourced at: https://github.com/InitialDLab/Simba/
Under active development…

Slide21

Thanks for your attention!
dongx@cs.utah.edu

Slide22

Supported SQL & DataFrame API

Point wrapper
    SQL: POINT(pois.x + 2, pois.y * 3)
    DataFrame: Point(pois("x") + 2, pois("y") * 3)
Box range query
    SQL: p IN RANGE(low, high)
    DataFrame: range(base: Point, low: Point, high: Point)
Circle range query
    SQL: p IN CIRCLERANGE(c, rd)
    DataFrame: circleRange(base: Point, c: Point, rd: Double)
k nearest neighbor query
    SQL: p IN KNN(q, k)
    DataFrame: knn(base: Point, k: Int)

Slide23

Supported SQL & DataFrame API (cont’d)

kNN join
    SQL: R KNN JOIN S ON s IN KNN(r, k)
    DataFrame: knnJoin(target: DataFrame, left_key: Point, right_key: Point, k: Int)
Distance join
    SQL: R DISTANCE JOIN S ON s IN CIRCLERANGE(r, )
    DataFrame: distanceJoin(target: DataFrame, left_key: Point, right_key: Point, : Double)
Index management
    SQL:
        CREATE INDEX idx_name ON R( ) USE idx_type
        DROP INDEX idx_name ON table_name
        DROP INDEX ON table_name
        SHOW INDEX ON R
    DataFrame:
        index(idx_type: IndexType, idx_name: String, attrs: Seq[Attribute])
        dropIndex()
        showIndex()

 Slide24

Spatial Operations: Distance Join

Distance join
General theta-join in Spark SQL -> Cartesian product!
Our solution: the DJSpark algorithm
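The gap between a Cartesian product and a partition-aware join can be illustrated with a short Python sketch. It is not the DJSpark algorithm itself, whose details are in the paper. It only assumes the two-phase shape that the slide's figure labels suggest: a global join that pairs up partitions whose bounding boxes are within the distance threshold, then a local join on each surviving pair. The nested loop in the local join stands in for a local index probe, and all names, including `tau`, are hypothetical.

```python
import math
from itertools import product

def mbr(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def box_dist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)

def distance_join(r_parts, s_parts, tau):
    r_boxes = [mbr(p) for p in r_parts]
    s_boxes = [mbr(p) for p in s_parts]
    # Global join: discard partition pairs that cannot produce any result.
    pairs = [
        (i, j)
        for i, j in product(range(len(r_parts)), range(len(s_parts)))
        if box_dist(r_boxes[i], s_boxes[j]) <= tau
    ]
    # Local join: compare rows only within the surviving partition pairs.
    out = []
    for i, j in pairs:
        for r in r_parts[i]:
            for s in s_parts[j]:
                if math.hypot(r[0] - s[0], r[1] - s[1]) <= tau:
                    out.append((r, s))
    return out, len(pairs)

r_parts = [[(0, 0), (1, 1)], [(10, 10), (11, 11)]]
s_parts = [[(0, 1), (2, 2)], [(10, 12), (30, 30)]]
joined, pairs_checked = distance_join(r_parts, s_parts, tau=1.5)
print(joined, pairs_checked)
```

Here only two of the four partition pairs reach the local join, while the output matches what a full Cartesian product followed by a distance filter would return.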

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[Figure: DJSpark. Labels: Global Join, Local Join, Local Index]

 

 

 Slide25

Spatial Operations: kNN join

kNN join. Solutions in Simba:
Block Nested Loop join (BKJSpark-N)
Block Nested Loop join with local R-Trees (BKJSpark-R)
Voronoi join* (VKJSpark)
Z-value join+ (ZKJSpark) -> approximate join
R-Tree join (RKJSpark)

* W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient Processing of k Nearest Neighbor Joins using MapReduce. VLDB, 2012.
+ C. Zhang, F. Li, and J. Jestes. Efficient Parallel kNN Joins for Large Data in MapReduce. EDBT, 2012.

Slide26

Spatial Operations: kNN join (cont’d)

R-Tree kNN join (RKJSpark)
Distributed hash-join like algorithm.
For each partition [...], find [...], [...]
Define [...] as the centroid of partition [...]
Take a uniform random sample [...], and suppose [...]
For each partition [...]: [...]

 Slide27

Query Optimizer

Index scan optimization: for better index utilization

[Figure: two query plans. One: Full Table Scan -> Filter By -> Result. The other: Table Scan using Index Operators (With Predicate) -> Filter By -> Result. Steps shown between them: Transform to DNF, Optimize]

Slide28

Index Building Cost: Time

[Figures: index building time against data size; index building time against dimension]

Slide29

Index Building Cost: Space

[Figures: local index size; global index size]

Slide30

Performance against Spark SQL: Data Size

[Figures: kNN query throughput; kNN query latency]

Slide31

Performance of Joins: Data Size

[Figures: distance join performance; kNN join performance]

Slide32

Support for multi-dimension

[Figures: kNN throughput against dimension; kNN latency against dimension]

 Slide33

Support for multi-dimension joins

[Figures: distance join against dimension; kNN join against dimension]

Slide34

Future Works

Native support for general geometric objects: polygons, segments, etc.
Spatial join over predicates such as [...] and [...]
Data in very high dimensions (> 10d)
More sophisticated cost-based optimizations