Spark SQL: What did you think of this paper?

Presentation Transcript

Slide 1

Spark SQL

Slide 2

What did you think of this paper?

Slide 3

This paper

Appeared at the “Industry” Track of SIGMOD

Lightly reviewed

Use-cases and impact more important than new technical contributions

Light on experiments

Light on details

Esp. on optimization

Slide 4

Key Benefits of Spark SQL

Bridging the gap between procedural and relational

Allowing analysts to mix both (see the sketch below)

Not just fully one or fully the other, but intermingled

At the same time, it doesn't force a single style of intermingling:

Can issue pure SQL

Can write pure procedural code

Not better than Impala, but beating Impala is not their contribution
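To make the intermingling concrete, here is a minimal Scala sketch, not taken from the paper; the file path, table name, and column names are invented for illustration. It starts procedurally, drops into SQL, and returns to procedural code on the result:

// Minimal sketch of mixing relational and procedural code in Spark SQL.
// Assumes a SparkSession `spark`; the path and columns are hypothetical.
val logs = spark.read.json("/data/logs.json")     // procedural: build a DataFrame
logs.createOrReplaceTempView("logs")              // expose it to SQL

val errors = spark.sql(                           // relational: plain SQL
  "SELECT url, time FROM logs WHERE status = 500")

val byHour = errors.rdd                           // procedural again: arbitrary Scala
  .map(row => (row.getAs[Long]("time") / 3600, 1))
  .reduceByKey(_ + _)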

Slide 5

Impala

From Cloudera, since 2012

SQL on Hadoop clusters

Open-source

Support for a Protocol Buffers-like format (Parquet)

C++ based: less overhead than Java/Scala

May circumvent MR by using a distributed query engine similar to parallel RDBMSs

Slide 6

History lesson: earliest example of “bridging the gap”

What’s the earliest example of “bridging the gap” between procedural and relational?

Slide 7

History lesson: earliest example of “bridging the gap”

What’s the earliest example of “bridging the gap” between procedural and relational?

UDFs

Been there since the early 90s

All the rage back then: object-relational databases

OOP was starting to pick up

Representing and reasoning about objects in databases

Postgres was one of the first to support them

Used to call custom code in the middle of SQL (see the sketch below)
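The same idea survives in Spark SQL today. A minimal Scala sketch, with a made-up helper function and table name, of registering custom code as a UDF and calling it from inside SQL:

// Minimal sketch: custom code invoked in the middle of SQL via a UDF.
// Assumes a SparkSession `spark` and a registered table `logs` with a `url` column.
// For "http://example.com/a", split("/") gives ("http:", "", "example.com", "a"),
// so index 2 is the host.
spark.udf.register("domain", (url: String) => url.split("/")(2))  // hypothetical helper

val hits = spark.sql(
  "SELECT domain(url) AS site, COUNT(*) AS n FROM logs GROUP BY domain(url)")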

Slide 8

RDDs and Spark

Slide 9

The paper itself

Great model for a systems paper

Talk about something that is useful and used by many real users

Argue not just that your techniques are good but also that your limitations are not fundamentally bad

Extensive experiments to back it up.

Awesome performance numbers always help.

Won the best paper award at NSDI ’12

Slide 10

Memory vs. Disk (borrowed)

L1 cache reference: 0.5 ns
Branch mispredict: 5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Compress 1K bytes with Zippy: 10,000 ns
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Disk seek: 10,000,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA -> Netherlands -> CA: 150,000,000 ns

Slide 11

Spark vs. Dremel

Similar to Dremel in that the focus is on interactive ad-hoc tasks

Caveat: Dremel is primarily aggregation, primarily read-only

Both move away from the drawbacks of MR (but in different ways):

Dremel uses column-store ideas + disk

Spark uses memory (Java objects) + avoiding checkpointing + persistence

Slide 12

Spark Primitives vs. MapReduce

Slide 13

Disadvantages of MapReduce

1. Extremely rigid data flow

Other flows constantly hacked in: Join, Union, Split, M-R-M-M-R-M chains

2. Common operations must be coded by hand

Join, filter, projection, aggregates, sorting, distinct (see the sketch after this list)

3. Semantics hidden inside map-reduce functions

Difficult to maintain, extend, and optimize
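For contrast, these same operations are single built-in calls in Spark. A minimal Scala sketch; the two pair RDDs are assumptions for illustration, not from the original slides:

// Minimal sketch: MapReduce's hand-coded operations as Spark primitives.
// Assumes visits: RDD[(String, Int)] and urlInfo: RDD[(String, String)], keyed by url.
val filtered   = visits.filter { case (_, n) => n > 10 }   // filter
val projected  = visits.map { case (url, _) => url }       // projection
val aggregated = visits.reduceByKey(_ + _)                 // aggregation
val joined     = visits.join(urlInfo)                      // join
val sorted     = visits.sortByKey()                        // sorting
val uniqueUrls = projected.distinct()                      // distinct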

Slide 14

Not the first time!

Similar proposals have been made to natively support other relational operators on top of MapReduce.

PIG: Imperative style, like Spark. From Yahoo!

Slide 15

Another Example: PIG

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
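For comparison, a minimal Scala sketch of the same pipeline in Spark's RDD style; this translation is ours, not from the slides, and it assumes comma-separated input files at the same hypothetical paths:

// Minimal sketch: the Pig pipeline above, written against Spark's RDD API.
// Assumes a SparkContext `sc`.
val visits = sc.textFile("/data/visits").map(_.split(","))       // (user, url, time)
val visitCounts = visits.map(f => (f(1), 1)).reduceByKey(_ + _)  // hits per url

val urlInfo = sc.textFile("/data/urlInfo").map(_.split(","))     // (url, category, pRank)
  .map(f => (f(0), f(1)))                                        // keep (url, category)

val topUrls = visitCounts.join(urlInfo)                          // url -> (count, category)
  .map { case (url, (count, category)) => (category, (url, count)) }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(-_._2).take(10))                     // top 10 urls per category

topUrls.saveAsTextFile("/data/topUrls")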

Slide 16

Another Example: DryadLINQ

string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(separator));
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));
var ordered = counts.OrderByDescending(x => x[2]);
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");

[Execution plan graph: Get -> SM -> G -> S -> O -> Take]

Slide 17

Not the first time!

Similar proposals have been made to natively support other relational operators on top of MapReduce. Unlike Spark, most of them cannot have datasets persist across queries.

PIG: Imperative style, like Spark. From Yahoo!

DryadLINQ: Imperative programming interface. From Microsoft.

HIVE: SQL-like. From Facebook.

HadoopDB: SQL-like (hybrid of MR + databases). From Yale.

Slide 18

Spark: Control

Spark leaves control of data, algorithms, persistence to the user. Is this a good idea?

Slide 19

Spark: Control

Spark leaves control of data, algorithms, persistence to the user. Is this a good idea?

Good idea:

User may know which datasets need to be used and how

Bad idea:

System may be able to optimize and schedule computation across nodes

The standard argument of declarative vs. imperative (a sketch of user-controlled persistence follows)
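To make the trade-off concrete, here is a minimal Scala sketch (the file path is invented) of the user, rather than the system, deciding what stays in memory:

// Minimal sketch: persistence is an explicit user decision in Spark.
// Assumes a SparkContext `sc`; the log path is hypothetical.
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("/data/logs").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)        // the user chooses to cache this

errors.filter(_.contains("timeout")).count()    // first action materializes the cache
errors.filter(_.contains("disk")).count()       // later queries reuse the cached data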

Slide 20

What are other ways Spark can be optimized?

Slide 21

What are other ways Spark can be optimized?

More Declarative than Imperative

Relational Query Optimization

Reordering predicates (see the sketch after this list)

Caching, fault-tolerance only when needed

Careful scheduling

Careful partitioning, co-location, and persistence

Indexes
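A minimal Scala sketch of why "more declarative" opens these doors (table and column names are invented): a predicate written as a DataFrame expression is visible to the Catalyst optimizer, which can reorder it or push it below a join, while the same predicate hidden inside an RDD closure is opaque:

// Minimal sketch: declarative predicates are visible to the optimizer; closures are not.
// Assumes DataFrames `visits` and `urlInfo` that share a `url` column.
import org.apache.spark.sql.functions.col

// Opaque: arbitrary Scala inside a closure; Spark cannot reorder or push this down.
val opaque = visits.rdd.filter(row => row.getAs[String]("url").startsWith("https"))

// Declarative: the optimizer sees the predicate and can push it below the join.
val declarative = visits.join(urlInfo, "url")
  .filter(col("url").startsWith("https"))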

Slide 22

Shark

Two key ideas:

Column store

Mid-query re-planning

+ Other tweaks

Bringing the power of relational databases to Spark

While this is not as much of a landmark paper by itself, it represents the evolution in thinking from imperative to declarative

Slide 23

Recall…

Mid-query replanning is not new given the work on adaptive query processing

Traditional database systems plan once, based on statistics:

distributions via histograms

data layout & locality

sizes of source relations

selectivities of predicates

intermediate sizes

These estimates can be notoriously bad!

Famous example: an unknown selectivity being estimated as 1/3.

Slide 24

Ways in which it can be used

Mid-way reoptimization if statistics differ significantly mid-plan

Use statistics from previous plans to optimize the current plan

Start multiple plans at the same time, converge on one

Route tuples to operators randomly

Adaptive sharing of common expressions

Pick plans with least “expected cost”

Slide 25

Adaptive QP

Still very much an unsolved problem…

No one technique is known to be best

For more details: see the survey by Deshpande, Ives, and Raman.