Introduction to Spark Internals
Matei Zaharia, UC Berkeley
www.spark-project.org
Outline
- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute
Project Goals
- Generality: diverse workloads, operators, job sizes
- Low latency: sub-second
- Fault tolerance: faults shouldn't be a special case
- Simplicity: often comes from generality
Codebase Size
- Spark: 20,000 LOC
- Hadoop 1.0: 90,000 LOC
- Hadoop 2.0: 220,000 LOC
(non-test, non-example sources)
Codebase Details
- Spark core: 16,000 LOC
  - Operators: 2,000
  - Block manager: 2,700
  - Scheduler: 2,500
  - Networking: 1,200
  - Accumulators: 200
  - Broadcast: 3,500
- Interpreter: 3,300 LOC
- Standalone backend: 1,700 LOC
- Mesos backend: 700 LOC
- Hadoop I/O: 400 LOC
Outline
- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute
Components

Your program:
    sc = new SparkContext
    f = sc.textFile("…")
    f.filter(…).count()
    ...

Spark client (app master): RDD graph, Scheduler, Block tracker, Shuffle tracker
Spark worker: Task threads, Block manager
Storage: HDFS, HBase, …
Cluster manager
Example Job

    val sc = new SparkContext("spark://...", "MyJob", home, jars)

    // Resilient distributed datasets (RDDs)
    val file = sc.textFile("hdfs://...")
    val errors = file.filter(_.contains("ERROR"))
    errors.cache()

    // Action
    errors.count()
RDD Graph

Dataset-level view:
    file:   HadoopRDD (path = hdfs://...)
    errors: FilteredRDD (func = _.contains(…), shouldCache = true)

Partition-level view: one task per partition (Task 1, Task 2, ...)
Data Locality
- First run: data not in cache, so use HadoopRDD's locality prefs (from HDFS)
- Second run: FilteredRDD is in cache, so use its locations
- If something falls out of cache, go back to HDFS
In More Detail: Life of a Job
Scheduling Process

    rdd1.join(rdd2)
        .groupBy(…)
        .filter(…)

RDD Objects: build operator DAG
    | DAG
    v
DAGScheduler: split graph into stages of tasks; submit each stage as ready (agnostic to operators!)
    | TaskSet ("stage failed" events flow back up)
    v
TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks (doesn't know about stages)
    | tasks, via the cluster manager
    v
Worker: execute tasks; store and serve blocks (task threads, block manager)
RDD Abstraction
Goal: support a wide array of operators and let users compose them arbitrarily
- Don't want to modify the scheduler for each one
- How to capture dependencies generically?
RDD Interface
- Set of partitions ("splits")
- List of dependencies on parent RDDs
- Function to compute a partition given its parents
- Optional preferred locations
- Optional partitioning info (Partitioner)

Captures all current Spark operations!
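The five pieces above can be sketched as a small Scala trait. The names here are illustrative and heavily simplified, not Spark's exact internal API:

```scala
// Simplified sketch of the RDD interface described above.
// Names are illustrative, not Spark's exact internal types.
trait Split { def index: Int }

trait SketchRDD[T] {
  def partitions: Seq[Split]                     // set of "splits"
  def dependencies: Seq[SketchRDD[_]]            // parent RDDs
  def compute(split: Split): Iterator[T]         // compute one partition
  def preferredLocations(split: Split): Seq[String] = Nil  // optional
  def partitioner: Option[AnyRef] = None         // optional partitioning info
}
```

Each operator in the following slides (HadoopRDD, FilteredRDD, JoinedRDD) is just one implementation of these five members, which is why the scheduler needs no operator-specific logic.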
Example: HadoopRDD
- partitions = one per HDFS block
- dependencies = none
- compute(partition) = read corresponding block
- preferredLocations(part) = HDFS block location
- partitioner = none
Example: FilteredRDD
- partitions = same as parent RDD
- dependencies = "one-to-one" on parent
- compute(partition) = compute parent and filter it
- preferredLocations(part) = none (ask parent)
- partitioner = none
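A minimal, self-contained sketch of this pattern (toy types, not Spark's real classes): the filtered dataset reuses its parent's partitions and pipelines the predicate into compute.

```scala
// Toy sketch of the FilteredRDD pattern (not Spark's real classes).
case class ToySplit(index: Int)

class ToyParentRDD[T](data: Seq[Seq[T]]) {
  def partitions: Seq[ToySplit] = data.indices.map(ToySplit(_))
  def compute(s: ToySplit): Iterator[T] = data(s.index).iterator
}

class ToyFilteredRDD[T](parent: ToyParentRDD[T], pred: T => Boolean) {
  def partitions: Seq[ToySplit] = parent.partitions   // same as parent
  def compute(s: ToySplit): Iterator[T] =
    parent.compute(s).filter(pred)                    // compute parent, then filter
}
```

Because compute just wraps the parent's iterator, the filter runs in the same task as its parent. This is the "one-to-one" dependency the scheduler can later pipeline into a single stage.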
Example: JoinedRDD
- partitions = one per reduce task
- dependencies = "shuffle" on each parent
- compute(partition) = read and join shuffled data
- preferredLocations(part) = none
- partitioner = HashPartitioner(numTasks)

Spark will now know this data is hashed!
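What "knowing this data is hashed" buys: a hash partitioner is a deterministic key-to-partition function, so two datasets that advertise equal partitioners already have matching keys at matching partition indices, and a later join between them can skip the shuffle. A toy version, modeled on the idea behind HashPartitioner rather than Spark's exact code:

```scala
// Toy hash partitioner (modeled on the idea, not Spark's exact code).
case class ToyHashPartitioner(numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod   // keep result in [0, numPartitions)
  }
}
```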
Dependency Types

"Narrow" deps (each child partition depends on a bounded set of parent partitions):
- map, filter
- union
- join with inputs co-partitioned

"Wide" (shuffle) deps (each child partition depends on all parent partitions):
- groupByKey
- join with inputs not co-partitioned
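The distinction can be stated as a toy model (illustrative only): a narrow dependency lets child partition i read a fixed parent partition, while a wide dependency makes every child read from all parents, which forces a shuffle.

```scala
// Toy model of narrow vs. wide dependencies (illustrative only).
sealed trait ToyDep
// Child partition i reads only parent partition parentOf(i):
// map, filter, union, co-partitioned join.
case class NarrowDep(parentOf: Int => Int) extends ToyDep
// Every child partition reads from all parent partitions:
// groupByKey, non-co-partitioned join.
case object WideDep extends ToyDep

def needsShuffle(dep: ToyDep): Boolean = dep match {
  case _: NarrowDep => false
  case WideDep      => true
}
```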
DAG Scheduler
Interface: receives a "target" RDD, a function to run on each partition, and a listener for results
Roles:
- Build stages of Task objects (code + preferred locations)
- Submit them to TaskScheduler as ready
- Resubmit failed stages if outputs are lost
Scheduler Optimizations
- Pipelines narrow ops. within a stage
- Picks join algorithms based on partitioning (minimize shuffles)
- Reuses previously cached data

[Figure: a DAG of RDDs A through G divided into three stages at shuffle boundaries (groupBy, join); map and union are pipelined within a stage, and previously computed partitions are not recomputed]
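The first optimization, pipelining narrow ops within a stage, amounts to composing each operator's iterator transformation so a partition is traversed once with no intermediate collection. A minimal sketch:

```scala
// Sketch: a stage pipelines narrow ops by composing Iterator functions,
// so the partition is traversed once and nothing intermediate is stored.
def pipeline[A](ops: Seq[Iterator[A] => Iterator[A]]): Iterator[A] => Iterator[A] =
  ops.reduce(_ andThen _)

val ops: Seq[Iterator[Int] => Iterator[Int]] =
  Seq(_.map(_ * 2), _.filter(_ > 4))
val stage = pipeline(ops)
// stage(Iterator(1, 2, 3)) yields just 6, in a single pass
```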
Task Details
- Stage boundaries are only at input RDDs or "shuffle" operations
- So, each task looks like this: fetch map outputs and/or read from external storage, run the pipelined functions f1, f2, …, then write a map output file or return results to the master
- (Note: we write shuffle outputs to RAM/disk to allow retries)
Task Details
- Each Task object is self-contained
- Contains all transformation code up to its input boundary (e.g. HadoopRDD => filter => map)
- Allows Tasks on cached data to run even if the data falls out of cache
- Design goal: any Task can run on any node
- Only way a Task can fail is lost map output files
Event Flow

    runJob(targetRDD, partitions, func, listener)
        |
        v
    DAGScheduler: graph of stages; RDD partitioning; pipelining
        | submitTasks(taskSet)    ^ task finish & stage failure events
        v
    TaskScheduler: task placement; retries on failure; speculation; inter-job policy
        | Task objects
        v
    Cluster or local runner
TaskScheduler
Interface:
- Given a TaskSet (set of Tasks), run it and report results
- Report "fetch failed" errors when shuffle output is lost

Two main implementations:
- LocalScheduler (runs locally)
- ClusterScheduler (connects to a cluster manager using a pluggable "SchedulerBackend" API)
TaskScheduler Details
- Can run multiple concurrent TaskSets, but currently does so in FIFO order
  - Would be really easy to plug in other policies!
  - If someone wants to suggest a plugin API, please do
- Maintains one TaskSetManager per TaskSet that tracks its locality and failure info
- Polls these for tasks in order (FIFO)
Worker
- Implemented by the Executor class
- Receives self-contained Task objects and calls run() on them in a thread pool
- Reports results or exceptions to the master
  - Special case: FetchFailedException for shuffle
- Pluggable ExecutorBackend for clusters
Other Components
BlockManager
- "Write-once" key-value store on each worker
- Serves shuffle data as well as cached RDDs
- Tracks a StorageLevel for each block (e.g. disk, RAM)
- Can drop data to disk if running low on RAM
- Can replicate data across nodes
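The "write-once" contract can be sketched as a tiny key-value store that rejects overwrites. This is illustrative only; the real BlockManager also handles storage levels, disk spill, and replication:

```scala
import scala.collection.mutable

// Toy write-once block store (illustrative only).
class ToyBlockStore {
  private val blocks = mutable.Map[String, Array[Byte]]()

  def put(id: String, data: Array[Byte]): Boolean =
    if (blocks.contains(id)) false            // write-once: reject overwrites
    else { blocks(id) = data; true }

  def get(id: String): Option[Array[Byte]] = blocks.get(id)
}
```

Write-once semantics are what make it safe to serve cached RDD partitions and shuffle outputs to any task without coordination: a block's contents never change after publication.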
Other Components
CommunicationManager
- Asynchronous IO based networking library
- Allows fetching blocks from BlockManagers
- Allows prioritization / chunking across connections (would be nice to make this pluggable!)
- Fetch logic tries to optimize for block sizes
Other Components
MapOutputTracker
- Tracks where each "map" task in a shuffle ran
- Tells reduce tasks the map locations
- Each worker caches the locations to avoid refetching
- A "generation ID" passed with each Task allows invalidating the cache when map outputs are lost
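The generation-ID mechanism can be sketched as follows (illustrative; the names are made up): the worker tags its cached locations with the generation at which they were fetched, and a Task carrying a newer generation ID clears the stale cache before the lookup.

```scala
import scala.collection.mutable

// Toy sketch of generation-based cache invalidation (illustrative names).
class ToyMapOutputCache(fetchFromMaster: Int => Seq[String]) {
  private var cachedGen = -1L
  private val locs = mutable.Map[Int, Seq[String]]()

  // taskGen is the generation ID carried by an incoming Task.
  def getLocations(shuffleId: Int, taskGen: Long): Seq[String] = {
    if (taskGen > cachedGen) {   // map outputs changed; drop stale entries
      locs.clear()
      cachedGen = taskGen
    }
    locs.getOrElseUpdate(shuffleId, fetchFromMaster(shuffleId))
  }
}
```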
Outline
- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute
Extension Points
Spark provides several places to customize functionality:
- Extending RDD: add new input sources or transformations
- SchedulerBackend: add new cluster managers
- spark.serializer: customize object storage
What People Have Done
- New RDD transformations (sample, glom, mapPartitions, leftOuterJoin, rightOuterJoin)
- New input sources (DynamoDB)
- Custom serialization for memory and bandwidth efficiency
- New language bindings (Java, Python)
Possible Future Extensions
- Pluggable inter-job scheduler
- Pluggable cache eviction policy (ideally with priority flags on StorageLevel)
- Pluggable instrumentation / event listeners

Let us know if you want to contribute these!
As an Exercise
- Try writing your own input RDD from the local filesystem (say, one partition per file)
- Try writing your own transformation RDD (pick a Scala collection method not in Spark)
- Try writing your own action (e.g. product())
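As a starting point for the last exercise, here is the shape a product() action takes, modeled on local collections standing in for RDD partitions (a real action would compute the per-partition partials through runJob):

```scala
// Sketch of a product() action over partitioned data (toy, not Spark API).
// A real Spark action would compute each partition's partial product in a
// task, then combine the partial results on the driver.
def product(partitions: Seq[Seq[Int]]): Int = {
  val partials = partitions.map(_.product)   // one task per partition
  partials.product                           // combine results on the driver
}
```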
Outline
- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute
Development Process
- Issue tracking: spark-project.atlassian.net
- Development discussion: spark-developers
- Main work: "master" branch on GitHub
- Submit patches through GitHub pull requests
- Be sure to follow code style and add tests!
Build Tools
- SBT and Maven currently both work (but we are switching to Maven only)
- IntelliJ IDEA is the most common IDE; Eclipse may be made to work
Thanks!
Stay tuned for future developer meetups.