
Presentation Transcript

Introduction to Spark Internals

Matei Zaharia, UC Berkeley
www.spark-project.org

Outline

- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute

Project Goals

- Generality: diverse workloads, operators, job sizes
- Low latency: sub-second
- Fault tolerance: faults shouldn’t be special case
- Simplicity: often comes from generality

Codebase Size

- Spark: 20,000 LOC
- Hadoop 1.0: 90,000 LOC
- Hadoop 2.0: 220,000 LOC

(non-test, non-example sources)

Codebase Details

- Hadoop I/O: 400 LOC
- Mesos backend: 700 LOC
- Standalone backend: 1,700 LOC
- Interpreter: 3,300 LOC
- Spark core: 16,000 LOC
  - Operators: 2,000
  - Block manager: 2,700
  - Scheduler: 2,500
  - Networking: 1,200
  - Accumulators: 200
  - Broadcast: 3,500

Outline

- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute

Components

Your program:

  sc = new SparkContext
  f = sc.textFile(“…”)
  f.filter(…)
   .count()
  ...

[Architecture diagram: your program runs against a Spark client (app master), which holds the RDD graph, Scheduler, Block tracker, and Shuffle tracker. The client talks to the cluster manager, which launches Spark workers; each worker runs task threads and a Block manager backed by HDFS, HBase, …]

Example Job

  val sc = new SparkContext(
    “spark://...”, “MyJob”, home, jars)

  val file = sc.textFile(“hdfs://...”)

  val errors = file.filter(_.contains(“ERROR”))

  errors.cache()

  errors.count()

Here file and errors are resilient distributed datasets (RDDs); count() is an action.

RDD Graph

Dataset-level view:

  file:   HadoopRDD   (path = hdfs://...)
  errors: FilteredRDD (func = _.contains(…), shouldCache = true)

Partition-level view: each partition of errors is computed from the corresponding partition of file, and one task (Task 1, Task 2, ...) covers each such chain of partitions.

Data Locality

- First run: data not in cache, so use HadoopRDD’s locality prefs (from HDFS)
- Second run: FilteredRDD is in cache, so use its locations
- If something falls out of cache, go back to HDFS
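
A minimal sketch of that behavior from the user side (the app name and local master are illustrative; the HDFS path is left elided as in the slides): the first count() places tasks using HDFS block locations, and after cache() the second count() prefers the nodes holding the cached partitions.

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: illustrates the two scheduling runs described above.
  val sc = new SparkContext(new SparkConf().setAppName("locality-demo").setMaster("local[*]"))
  val errors = sc.textFile("hdfs://...").filter(_.contains("ERROR")).cache()
  errors.count()   // first run: locality comes from HadoopRDD's HDFS block locations
  errors.count()   // second run: locality comes from the cached FilteredRDD partitions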

In More Detail: Life of a Job

Scheduling Process

  rdd1.join(rdd2)
      .groupBy(…)
      .filter(…)

- RDD Objects: build the operator DAG
- DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (agnostic to operators!)
- TaskScheduler: launches each stage’s TaskSet via the cluster manager and retries failed or straggling tasks (doesn’t know about stages); a failed stage is reported back to the DAGScheduler
- Worker: executes tasks in threads and stores and serves blocks through its Block manager

RDD Abstraction

Goal: support a wide array of operators and let users compose them arbitrarily

- Don’t want to modify the scheduler for each one
- How to capture dependencies generically?

RDD Interface

- Set of partitions (“splits”)
- List of dependencies on parent RDDs
- Function to compute a partition given parents
- Optional preferred locations
- Optional partitioning info (Partitioner)

Captures all current Spark operations!
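
As a hedged sketch of how those five pieces surface in code, here is a toy RDD subclass written against today’s public-facing RDD API (method names such as getPartitions and compute follow current Spark and may differ from the 2012-era internals described in this talk; the class and its constant data are made up for illustration).

  import org.apache.spark.{Partition, Partitioner, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Toy illustration: an RDD with `numSlices` partitions, each producing a few strings.
  case class SimplePartition(index: Int) extends Partition

  class ConstantRDD(sc: SparkContext, numSlices: Int) extends RDD[String](sc, Nil) {
    // Set of partitions ("splits")
    override protected def getPartitions: Array[Partition] =
      (0 until numSlices).map(i => SimplePartition(i): Partition).toArray

    // List of dependencies on parent RDDs: none, hence the Nil passed to the constructor above

    // Function to compute a partition given parents (no parents here)
    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      (0 until 3).iterator.map(j => s"part-${split.index}-$j")

    // Optional preferred locations: none for this in-memory example
    override protected def getPreferredLocations(split: Partition): Seq[String] = Nil

    // Optional partitioning info
    override val partitioner: Option[Partitioner] = None
  }

With that in place, new ConstantRDD(sc, 4).collect() behaves like any other RDD, and the scheduler never needs to know what the operator does.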

Example: HadoopRDD

- partitions = one per HDFS block
- dependencies = none
- compute(partition) = read corresponding block
- preferredLocations(part) = HDFS block location
- partitioner = none

Example: FilteredRDD

- partitions = same as parent RDD
- dependencies = “one-to-one” on parent
- compute(partition) = compute parent and filter it
- preferredLocations(part) = none (ask parent)
- partitioner = none

Example: JoinedRDD

- partitions = one per reduce task
- dependencies = “shuffle” on each parent
- compute(partition) = read and join shuffled data
- preferredLocations(part) = none
- partitioner = HashPartitioner(numTasks)

Spark will now know this data is hashed!
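
To see why exposing the partitioner matters, here is a hedged illustration (data, app name, and partition count are made up): when both pair RDDs are pre-partitioned with the same HashPartitioner, a later join can reuse that layout instead of shuffling both inputs again.

  import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

  // Illustrative only: co-partition both sides so the join sees hashed inputs.
  val sc = new SparkContext(new SparkConf().setAppName("copartition-demo").setMaster("local[*]"))
  val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(8)).cache()
  val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(8)).cache()
  left.join(right).collect()   // join reuses the existing HashPartitioner; no new shuffle of left/right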

Dependency Types

“Narrow” deps (each parent partition feeds at most one child partition):
- map, filter
- union
- join with inputs co-partitioned

“Wide” (shuffle) deps:
- groupByKey
- join with inputs not co-partitioned
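
A hedged way to check this from code: RDD.dependencies exposes the dependency objects, so a narrow op reports a OneToOneDependency while a shuffling op reports a ShuffleDependency (the type-parameter arity of ShuffleDependency varies across Spark versions; the data below is illustrative).

  import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

  // Illustrative: inspect the dependency type behind a narrow op and a wide op.
  val sc = new SparkContext(new SparkConf().setAppName("deps-demo").setMaster("local[*]"))
  val nums = sc.parallelize(1 to 100, 4)

  val filtered = nums.filter(_ % 2 == 0)
  println(filtered.dependencies.head.isInstanceOf[OneToOneDependency[_]])      // narrow: true

  val grouped = nums.map(n => (n % 10, n)).groupByKey()
  println(grouped.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])  // wide: true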

DAG Scheduler

Interface: receives a “target” RDD, a function to run on each partition, and a listener for results

Roles:
- Build stages of Task objects (code + preferred loc.)
- Submit them to TaskScheduler as ready
- Resubmit failed stages if outputs are lost
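
From user code, the call that eventually reaches this interface is SparkContext.runJob, which takes a target RDD and a function to run on each partition (the example below just sums each partition; names are illustrative).

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative: actions like count() boil down to a runJob call of this shape.
  val sc = new SparkContext(new SparkConf().setAppName("runjob-demo").setMaster("local[*]"))
  val nums = sc.parallelize(1 to 1000, 4)
  val partitionSums: Array[Int] = sc.runJob(nums, (it: Iterator[Int]) => it.sum)
  println(partitionSums.mkString(", "))   // one result per partition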

Scheduler Optimizations

- Pipelines narrow ops. within a stage
- Picks join algorithms based on partitioning (minimize shuffles)
- Reuses previously cached data

[Diagram: an example DAG of RDDs A through G connected by map, union, groupBy, and join is split into Stage 1, Stage 2, and Stage 3; partitions that were previously computed are skipped, and one task is launched per remaining partition.]
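
One hedged way to observe the pipelining is RDD.toDebugString, which prints the lineage with its shuffle boundaries (data and names below are illustrative): map and filter stay in one stage, while groupByKey introduces a new one.

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative: print the lineage; indentation changes mark shuffle (stage) boundaries.
  val sc = new SparkContext(new SparkConf().setAppName("stages-demo").setMaster("local[*]"))
  val words = sc.parallelize(Seq("spark", "internals", "spark"), 2)
  val counts = words.map(w => (w, 1))        // narrow: pipelined
                    .filter(_._1.nonEmpty)   // narrow: pipelined into the same stage
                    .groupByKey()            // wide: shuffle starts a new stage
  println(counts.toDebugString)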

Task Details

- Stage boundaries are only at input RDDs or “shuffle” operations
- So, each task reads its input from external storage and/or by fetching map outputs, applies the pipelined functions f1, f2, …, and writes a map output file or returns a result to the master
- (Note: we write shuffle outputs to RAM/disk to allow retries)

Task Details

- Each Task object is self-contained
  - Contains all transformation code up to the input boundary (e.g. HadoopRDD => filter => map)
  - Allows Tasks on cached data to run even if they fall out of cache
- Design goal: any Task can run on any node
  - Only way a Task can fail is lost map output files

Event Flow

- runJob(targetRDD, partitions, func, listener) goes to the DAGScheduler, which owns the graph of stages, RDD partitioning, and pipelining
- The DAGScheduler calls submitTasks(taskSet) on the TaskScheduler, which handles task placement, retries on failure, speculation, and inter-job policy
- The TaskScheduler sends Task objects to the cluster or local runner
- Task finish and stage failure events flow back from the TaskScheduler to the DAGScheduler

TaskScheduler

Interface:
- Given a TaskSet (set of Tasks), run it and report results
- Report “fetch failed” errors when shuffle output is lost

Two main implementations:
- LocalScheduler (runs locally)
- ClusterScheduler (connects to a cluster manager using a pluggable “SchedulerBackend” API)

TaskScheduler Details

- Can run multiple concurrent TaskSets, but currently does so in FIFO order
  - Would be really easy to plug in other policies!
  - If someone wants to suggest a plugin API, please do
- Maintains one TaskSetManager per TaskSet that tracks its locality and failure info
- Polls these for tasks in order (FIFO)

Worker

- Implemented by the Executor class
- Receives self-contained Task objects and calls run() on them in a thread pool
- Reports results or exceptions to the master
  - Special case: FetchFailedException for shuffle
- Pluggable ExecutorBackend for cluster

Other Components

BlockManager
- “Write-once” key-value store on each worker
- Serves shuffle data as well as cached RDDs
- Tracks a StorageLevel for each block (e.g. disk, RAM)
- Can drop data to disk if running low on RAM
- Can replicate data across nodes
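
On the user side, the StorageLevel the BlockManager tracks is whatever persist() requests; a small hedged example (app name is illustrative, the HDFS path is left elided as in the slides):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  // Illustrative: MEMORY_AND_DISK lets the BlockManager spill to disk under memory
  // pressure; the "_2" levels additionally replicate each block on two nodes.
  val sc = new SparkContext(new SparkConf().setAppName("storage-demo").setMaster("local[*]"))
  val errors = sc.textFile("hdfs://...").filter(_.contains("ERROR"))
  errors.persist(StorageLevel.MEMORY_AND_DISK)
  // errors.persist(StorageLevel.MEMORY_ONLY_2)   // replicated variant (choose one level per RDD)
  errors.count()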

Other Components

CommunicationManager
- Asynchronous IO based networking library
- Allows fetching blocks from BlockManagers
- Allows prioritization / chunking across connections (would be nice to make this pluggable!)
- Fetch logic tries to optimize for block sizes

Other Components

MapOutputTracker
- Tracks where each “map” task in a shuffle ran
- Tells reduce tasks the map locations
- Each worker caches the locations to avoid refetching
- A “generation ID” passed with each Task allows invalidating the cache when map outputs are lost

Outline

- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute

Extension Points

Spark provides several places to customize functionality:
- Extending RDD: add new input sources or transformations
- SchedulerBackend: add new cluster managers
- spark.serializer: customize object storage
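
For example, spark.serializer can be pointed at Kryo; a hedged sketch using SparkConf (current Spark sets this through SparkConf, which may differ from the configuration mechanism of the era this talk describes):

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative: replace the default Java serialization with Kryo.
  val conf = new SparkConf()
    .setAppName("serializer-demo")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)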

What People Have Done

- New RDD transformations (sample, glom, mapPartitions, leftOuterJoin, rightOuterJoin)
- New input sources (DynamoDB)
- Custom serialization for memory and bandwidth efficiency
- New language bindings (Java, Python)
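
Several of those transformations later became built-ins; a quick hedged illustration of two of them (data and partition counts are made up):

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative: glom() gathers each partition into an array; mapPartitions() runs
  // a function once per partition rather than once per element.
  val sc = new SparkContext(new SparkConf().setAppName("transform-demo").setMaster("local[*]"))
  val nums = sc.parallelize(1 to 10, 3)
  println(nums.glom().collect().map(_.mkString("[", ",", "]")).mkString(" "))
  println(nums.mapPartitions(it => Iterator(it.max)).collect().mkString(", "))   // per-partition max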

Possible Future Extensions

- Pluggable inter-job scheduler
- Pluggable cache eviction policy (ideally with priority flags on StorageLevel)
- Pluggable instrumentation / event listeners

Let us know if you want to contribute these!

As an Exercise

- Try writing your own input RDD from the local filesystem (say one partition per file)
- Try writing your own transformation RDD (pick a Scala collection method not in Spark)
- Try writing your own action (e.g. product()); a rough sketch of this one follows below
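
A hypothetical sketch for the third exercise, written on top of existing RDD operations rather than as a new internal action (the function name product comes from the exercise; everything else is illustrative): fold each partition locally, then multiply the per-partition results.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  // Sketch: a product() "action" built from mapPartitions + reduce.
  def product(rdd: RDD[Double]): Double =
    rdd.mapPartitions(it => Iterator(it.foldLeft(1.0)(_ * _))).reduce(_ * _)

  val sc = new SparkContext(new SparkConf().setAppName("product-demo").setMaster("local[*]"))
  println(product(sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))))   // 24.0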

Outline

- Project goals
- Components
- Life of a job
- Extending Spark
- How to contribute

Development Process

- Issue tracking: spark-project.atlassian.net
- Development discussion: spark-developers
- Main work: “master” branch on GitHub
- Submit patches through GitHub pull requests
- Be sure to follow code style and add tests!

Build Tools

- SBT and Maven currently both work (but switching to only Maven)
- IntelliJ IDEA is the most common IDE; Eclipse may be made to work

Thanks!

Stay tuned for future developer meetups.