Slide 1: Spark SQL

Slide 2: What did you think of this paper?
Slide 3: This paper
- Appeared at the "Industry" Track of SIGMOD
- Lightly reviewed
- Use-cases and impact more important than new technical contributions
- Light on experiments
- Light on details, especially on optimization
Slide 4: Key Benefits of Spark SQL
- Bridges the gap between procedural and relational, allowing analysts to mix both
- Not just fully one style or fully the other, but intermingled
- At the same time, does not force a single style of intermingling:
  - Can issue pure SQL
  - Can issue pure procedural code
- Not faster than Impala, but raw speed is not their claimed contribution
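The intermingling described above can be sketched without Spark at all. The snippet below is only an analogy, using Python's built-in sqlite3 as a stand-in relational engine (not Spark SQL's actual API, and with made-up data): a relational step produces rows, arbitrary procedural code transforms them, and the result feeds another relational step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT, url TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [("a", "x.com/home"), ("b", "x.com/about"), ("a", "y.com")])

# Relational step: aggregate in SQL.
rows = conn.execute(
    "SELECT url, COUNT(*) AS cnt FROM visits GROUP BY url").fetchall()

# Procedural step: arbitrary Python logic over the intermediate result.
domains = {}
for url, cnt in rows:
    dom = url.split("/")[0]
    domains[dom] = domains.get(dom, 0) + cnt

# Feed the procedural result back into a relational step.
conn.execute("CREATE TABLE domain_counts (domain TEXT, cnt INT)")
conn.executemany("INSERT INTO domain_counts VALUES (?, ?)", domains.items())
top = conn.execute(
    "SELECT domain FROM domain_counts ORDER BY cnt DESC LIMIT 1").fetchone()[0]
print(top)  # x.com
```

The point is that neither style has to wrap the other: SQL and procedural code alternate freely, which is the usage pattern Spark SQL's DataFrames make first-class.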
Slide 5: Impala
- From Cloudera, since 2012
- SQL on Hadoop clusters
- Open-source
- Support for Protocol Buffers-like formats (Parquet)
- C++ based: less overhead than Java/Scala
- May circumvent MR by using a distributed query engine similar to a parallel RDBMS
Slide 6: History lesson: earliest example of "bridging the gap"
- What's the earliest example of "bridging the gap" between procedural and relational?
Slide 7: History lesson: earliest example of "bridging the gap"
- What's the earliest example of "bridging the gap" between procedural and relational?
- UDFs (user-defined functions)
  - Around since the early 90s
  - The rage back then: object-relational databases
    - OOP was starting to pick up
    - Representing and reasoning about objects in databases
  - Postgres was one of the first to support them
  - Used to call custom code in the middle of SQL
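The UDF idea survives almost unchanged today, so a minimal, runnable illustration of "custom code in the middle of SQL" fits in a few lines using Python's sqlite3 (its `create_function` registers a UDF; the `domain` function, table, and data here are invented for the example):

```python
import sqlite3

def domain(url):
    """Custom procedural code, callable from inside SQL."""
    return url.split("/")[0]

conn = sqlite3.connect(":memory:")
conn.create_function("domain", 1, domain)  # register the UDF
conn.execute("CREATE TABLE visits (url TEXT)")
conn.executemany("INSERT INTO visits VALUES (?)",
                 [("x.com/a",), ("x.com/b",), ("y.com/c",)])

# The UDF runs in the middle of a declarative query.
rows = conn.execute(
    "SELECT domain(url), COUNT(*) FROM visits "
    "GROUP BY domain(url) ORDER BY 1"
).fetchall()
print(rows)  # [('x.com', 2), ('y.com', 1)]
```

This is exactly the shape of the early-90s object-relational pitch: the query stays declarative, and the engine calls out to user code per tuple.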
Slide 8: RDDs and Spark
Slide 9: The paper itself
- A great model for a systems paper:
  - Talk about something that is useful and used by many real users
  - Argue not just that your techniques are good, but also that your limitations are not fundamentally bad
  - Back it up with extensive experiments
  - Impressive performance numbers always help
- Won the best paper award at NSDI'12
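The core RDD idea behind that paper is lineage: a dataset is defined by the recipe of transformations that produced it, is computed lazily, and can be recomputed from the recipe after a failure instead of being checkpointed. A toy, deliberately single-machine sketch (the `ToyRDD` class and its methods are invented for illustration and are not Spark's API):

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers lineage, computes lazily."""

    def __init__(self, compute):
        self._compute = compute  # lineage: how to (re)build this dataset

    @staticmethod
    def from_list(data):
        return ToyRDD(lambda: list(data))

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):
        return ToyRDD(lambda: [x for x in self._compute() if p(x)])

    def collect(self):
        # Nothing runs until an action; a "lost" result is simply
        # rebuilt by replaying the lineage from the stable input.
        return self._compute()

squares = ToyRDD.from_list(range(6)).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.collect())  # [0, 4, 16]
print(evens.collect())  # replayed from lineage: [0, 4, 16]
```

Real RDDs add partitioning, scheduling, and optional in-memory persistence, but the fault-tolerance story is this replay trick.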
Slide 10: Memory vs. Disk (borrowed)

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                           100 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy             10,000 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from network  10,000,000 ns
Read 1 MB sequentially from disk     30,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns
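A quick back-of-the-envelope from the table above shows the gap Spark's in-memory datasets exploit: sequentially reading 1 MB from disk is 120x slower than from memory, and a single seek costs as much as tens of MB of memory reads.

```python
# Ratios implied by the latency table above.
read_1mb_memory_ns = 250_000
read_1mb_disk_ns = 30_000_000
disk_seek_ns = 10_000_000

print(read_1mb_disk_ns / read_1mb_memory_ns)  # 120.0 -> disk is 120x slower
print(disk_seek_ns / read_1mb_memory_ns)      # 40.0  -> one seek ~= 40 MB of memory reads
```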
Slide 11: Spark vs. Dremel
- Similar to Dremel in that the focus is on interactive ad-hoc tasks
- Caveat: Dremel is primarily aggregation, primarily read-only
- Both move away from the drawbacks of MR, but in different ways:
  - Dremel uses column-store ideas + disk
  - Spark uses memory (Java objects) + avoiding checkpointing + persistence
Slide 12: Spark Primitives vs. MapReduce
Slide 13: Disadvantages of MapReduce
1. Extremely rigid data flow
   - Other flows constantly hacked in: join, union, split, M-R-M and M-M-R-M chains
2. Common operations must be coded by hand
   - Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
   - Difficult to maintain, extend, and optimize
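Point 2 above is easy to see concretely: even a plain equi-join must be hand-rolled in the MapReduce model. A toy, single-machine sketch of the standard reduce-side join (invented sample data; records are tagged by source relation and grouped on the join key):

```python
from collections import defaultdict
from itertools import product

visits = [("x.com", "alice"), ("y.com", "bob"), ("x.com", "carol")]
urlinfo = [("x.com", "news"), ("y.com", "shop")]

# Map phase: tag each record with its source relation, keyed by join key.
mapped = [(url, ("V", user)) for url, user in visits] + \
         [(url, ("U", cat)) for url, cat in urlinfo]

# Shuffle phase: group by key.
groups = defaultdict(list)
for key, val in mapped:
    groups[key].append(val)

# Reduce phase: cross the two sides within each group.
joined = []
for url, vals in groups.items():
    users = [v for tag, v in vals if tag == "V"]
    cats = [v for tag, v in vals if tag == "U"]
    joined.extend((url, u, c) for u, c in product(users, cats))

print(sorted(joined))
# [('x.com', 'alice', 'news'), ('x.com', 'carol', 'news'), ('y.com', 'bob', 'shop')]
```

All of this boilerplate (tagging, grouping, crossing) is what a one-line `JOIN` expresses declaratively, and it is invisible to any optimizer once buried inside map and reduce functions.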
Slide 14: Not the first time!
- Similar proposals have been made to natively support other relational operators on top of MapReduce
- PIG: imperative style, like Spark. From Yahoo!
Slide 15: Another Example: PIG

visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(urlVisits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';
Slide 16: Another Example: DryadLINQ

string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(separator));
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));
var ordered = counts.OrderByDescending(x => x[2]);
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");

[Figure: execution plan graph with stages Get, SM, G, S, O, Take]
Slide 17: Not the first time!
- Similar proposals have been made to natively support other relational operators on top of MapReduce
- Unlike Spark, most of them cannot have datasets persist across queries
- PIG: imperative style, like Spark. From Yahoo!
- DryadLINQ: imperative programming interface. From Microsoft
- HIVE: SQL-like. From Facebook
- HadoopDB: SQL-like (hybrid of MR + databases). From Yale
Slide 18: Spark: Control
- Spark leaves control of data, algorithms, and persistence to the user. Is this a good idea?
Slide 19: Spark: Control
- Spark leaves control of data, algorithms, and persistence to the user. Is this a good idea?
- Good idea: the user may know which datasets need to be used and how
- Bad idea: the system may be better placed to optimize and schedule computation across nodes
- This is the standard declarative vs. imperative argument
Slide 20: What are other ways Spark can be optimized?
Slide 21: What are other ways Spark can be optimized?
- Become more declarative than imperative
- Relational query optimization, e.g., reordering predicates
- Caching and fault-tolerance only when needed
- Careful scheduling
- Careful partitioning, co-location, and persistence
- Indexes
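Predicate reordering, listed above, needs no Spark machinery to demonstrate. Under the textbook assumption of independent predicates with known per-tuple costs and selectivities (the numbers below are made up), the classic rule is to apply filters in ascending order of cost / (1 - selectivity):

```python
def expected_cost(order, preds):
    """Expected per-tuple cost of applying filters in the given order.
    preds[name] = (cost, selectivity). A tuple stops at the first failing
    predicate, so later predicates run with lower probability."""
    total, pass_prob = 0.0, 1.0
    for name in order:
        cost, sel = preds[name]
        total += pass_prob * cost
        pass_prob *= sel
    return total

# (cost to evaluate, fraction of tuples that pass) -- assumed estimates.
preds = {"cheap_selective": (1.0, 0.1),
         "cheap_loose":     (1.0, 0.9),
         "expensive":       (10.0, 0.5)}

# Classic rule: ascending cost / (1 - selectivity).
best = sorted(preds, key=lambda n: preds[n][0] / (1 - preds[n][1]))
print(best)                                       # cheap_selective first
print(expected_cost(best, preds))                 # ~2.0 per tuple
print(expected_cost(list(reversed(best)), preds)) # ~10.95 per tuple: much worse
```

An imperative Spark program fixes the filter order the user happened to write; a declarative layer is free to apply exactly this kind of rewrite.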
Slide 22: Shark
- Two key ideas:
  - Column store
  - Mid-query re-planning
- Plus other tweaks
- Brings the power of relational databases to Spark
- While not as much of a landmark paper by itself, it represents the evolution in thinking from imperative to declarative
Slide 23: Recall...
- Mid-query replanning is not new, given the work on adaptive query processing
- Traditional database systems plan once, based on statistics:
  - Distributions via histograms
  - Data layout & locality
  - Sizes of source relations
  - Selectivities of predicates
  - Intermediate result sizes
- These estimates can be notoriously bad!
  - Famous example: an unknown selectivity is estimated as 1/3
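The 1/3 default goes wrong fast. With hypothetical numbers (invented for illustration), a fixed-guess selectivity misestimates an intermediate result size by orders of magnitude, and that error then drives every downstream join-order decision:

```python
# Hypothetical numbers: a predicate the optimizer knows nothing about.
table_rows = 1_000_000
default_selectivity = 1 / 3   # the classic fixed guess
actual_selectivity = 0.001    # the predicate is really very selective

estimated = table_rows * default_selectivity
actual = table_rows * actual_selectivity
print(round(estimated))    # 333333 rows expected
print(round(actual))       # 1000 rows in reality
print(estimated / actual)  # off by a factor of ~333
```

A plan chosen for a 333,333-row intermediate result (e.g., a sort-merge join) can be badly wrong for a 1,000-row one, which is the opening for adaptive approaches.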
Slide 24: Ways in which adaptive QP can be used
- Mid-way reoptimization if statistics observed mid-plan differ significantly from the estimates
- Using statistics from previous plans to optimize the current plan
- Starting multiple plans at the same time, then converging on one
- Routing tuples to operators randomly
- Adaptive sharing of common expressions
- Picking plans with the least "expected cost"
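The first bullet, mid-way reoptimization, reduces in spirit to a simple loop: plan on estimates, observe the actual cardinality at a materialization point, and replan the remainder if the two diverge beyond a threshold. A toy sketch with invented stats and a deliberately simplistic planner (build the hash table on the smaller join input):

```python
def choose_plan(left_rows, right_rows):
    """Toy planner: build the hash table on the smaller input."""
    return "hash_build_left" if left_rows <= right_rows else "hash_build_right"

def run_with_reoptimization(est_left, actual_left, right_rows, threshold=3.0):
    plan = choose_plan(est_left, right_rows)          # plan once, on estimates
    # ...the first stage runs; we then observe the real cardinality...
    ratio = max(actual_left, est_left) / min(actual_left, est_left)
    if ratio > threshold:                             # estimate was way off
        plan = choose_plan(actual_left, right_rows)   # replan the remainder
    return plan

# Estimated 1,000 rows, but 500,000 actually flow out of the filter:
print(run_with_reoptimization(1_000, 500_000, 50_000))  # hash_build_right
```

Shark's mid-query re-planning applies the same check at shuffle boundaries, where Spark materializes intermediate data anyway, making the observed statistics nearly free.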
Slide 25: Adaptive QP
- Still very much an unsolved problem...
- No one technique is known to be best
- For more details: survey by Deshpande, Ives, and Raman