Resilient Distributed Datasets

Resilient Distributed Datasets - PowerPoint Presentation

Uploaded On 2017-01-18


A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. ID: 511115



Presentation Transcript

Slide1

Resilient Distributed Datasets
A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley

Presented by Qi Gao, adapted from Matei’s NSDI’12 presentation

EECS 582 – W16

Slide2

Overview

A new programming model (RDD)
  parallel/distributed computing
  in-memory sharing
  fault-tolerance
An implementation of RDD: Spark

Slide3

Motivation

MapReduce greatly simplifies “big data” analysis on large, unreliable clusters
  Simple interface: map and reduce
  Hides the details of parallelism, data partition, fault-tolerance, load-balancing...
Problems
  cannot support complex applications efficiently
  cannot support interactive applications efficiently
Root cause
  Inefficient data sharing

In MapReduce, the only way to share data across jobs is stable storage -> slow!

Slide4

Motivation

Slide5

Goal: In-Memory Data Sharing

Slide6

Challenges

In-memory sharing is 10-100x faster than network/disk, but how to achieve fault-tolerance efficiently?
  Data replication?
  Log fine-grained updates to mutable state?
Both are costly for data-intensive apps
  Network bandwidth is a scarce resource
  Disk I/O is slow

Slide7

Observation

Coarse-grained operations: in many distributed computations, the same operation is applied to multiple data items in parallel

Slide8

RDD Abstraction

Restricted form of distributed shared memory
  immutable, partitioned collection of records
  can only be built through coarse-grained deterministic transformations (map, filter, join...)
Efficient fault-tolerance using lineage
  Log coarse-grained operations instead of fine-grained data updates
  An RDD has enough information about how it was derived from other datasets
  Recompute lost partitions on failure

Slide9

Fault-tolerance
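A minimal sketch of lineage-based recovery in plain Scala, with no Spark dependency. The names `ToyRDD`, `Source`, `Mapped`, and `losePartition` are hypothetical and are not Spark's API. The sketch only illustrates the idea from the RDD Abstraction slide: a derived dataset stores its parent and the coarse-grained operation applied, so a lost partition can be recomputed instead of being restored from a replica.

```scala
// Toy model of lineage-based fault tolerance.
// All names here are illustrative assumptions, not Spark classes.
object LineageSketch {
  import scala.collection.mutable

  sealed trait ToyRDD[A] {
    def numPartitions: Int

    // Rebuild one partition from lineage (parent data + operation).
    def compute(p: Int): Vector[A]

    // In-memory copies of partitions; an entry can be lost on "failure".
    private val cache = mutable.Map.empty[Int, Vector[A]]

    // Return the cached partition, recomputing it from lineage if missing.
    def partition(p: Int): Vector[A] =
      cache.getOrElseUpdate(p, compute(p))

    // Simulate losing the in-memory copy of one partition.
    def losePartition(p: Int): Unit = { cache.remove(p); () }

    def isCached(p: Int): Boolean = cache.contains(p)

    // Coarse-grained transformation: records the operation, copies no data.
    def map[B](f: A => B): ToyRDD[B] = Mapped(this, f)

    def collect(): Vector[A] =
      (0 until numPartitions).flatMap(p => partition(p)).toVector
  }

  // Base dataset: stands in for data read from stable storage.
  final case class Source[A](parts: Vector[Vector[A]]) extends ToyRDD[A] {
    def numPartitions: Int = parts.length
    def compute(p: Int): Vector[A] = parts(p)
  }

  // Derived dataset: its lineage is just the parent and the function.
  final case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(p: Int): Vector[B] = parent.partition(p).map(f)
  }
}
```

For example, after `val doubled = LineageSketch.Source(Vector(Vector(1, 2), Vector(3, 4))).map(_ * 2)` and `doubled.collect()`, calling `doubled.losePartition(1)` drops only the cached copy; the next `doubled.partition(1)` recomputes that single partition from the parent.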

Slide10

Spark

Implements Resilient Distributed Datasets (RDDs)

Operations on RDDs
  Transformations: define a new dataset based on previous ones
  Actions: start a job to execute on the cluster
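The split between transformations and actions can be sketched in plain Scala as follows. `LazyDataset`, `fromSeq`, and the `jobsRun` counter are hypothetical names used only for illustration and are not part of Spark. The sketch assumes the usual reading of the slide: a transformation only describes a new dataset, and nothing executes until an action asks for a result.

```scala
// Toy model of lazy transformations vs. eager actions.
// All names here are illustrative assumptions, not Spark's API.
object LazyOpsSketch {
  // Counts how many "jobs" have run, i.e. how many actions were called.
  var jobsRun: Int = 0

  final class LazyDataset[A](private val build: () => Iterator[A]) {
    // Transformations: return a new dataset description; nothing runs yet.
    def map[B](f: A => B): LazyDataset[B] =
      new LazyDataset[B](() => build().map(f))

    def filter(p: A => Boolean): LazyDataset[A] =
      new LazyDataset[A](() => build().filter(p))

    // Actions: execute the whole chain and return a value to the caller.
    def count(): Long = { jobsRun += 1; build().size.toLong }

    def collect(): List[A] = { jobsRun += 1; build().toList }
  }

  def fromSeq[A](xs: Seq[A]): LazyDataset[A] =
    new LazyDataset[A](() => xs.iterator)
}
```

Chaining `filter` and `map` on a `LazyDataset` leaves `jobsRun` unchanged; only `count()` or `collect()` increments it. Unlike a persisted RDD, this toy version re-runs the full chain on every action.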

Well-designed interface to represent RDDs
  Makes it very easy to implement transformations
  Most Spark transformation implementations are < 20 LoC

Slide11

Simple Yet Powerful

WordCount Implementation: Hadoop vs. Spark
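The slide's side-by-side code is not in the transcript. As a rough stand-in, here is word count written with local Scala collections so that it runs without a cluster. The object name `WordCountSketch` is hypothetical. A Spark version would apply the same flatMap / map / reduce-by-key shape to an RDD (for example via `reduceByKey`); this sketch does not reproduce the code shown on the slide.

```scala
// Word count on local collections, mirroring the flatMap -> map -> reduce
// shape of the Spark version. This is a sketch, not the slide's code.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))           // split each line into words
      .filter(_.nonEmpty)                 // drop empty tokens from leading spaces
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupMapReduce(_._1)(_._2)(_ + _)  // sum the counts per word
}
```

Calling `WordCountSketch.wordCount(Seq("to be or", "not to be"))` yields a map in which "to" and "be" have count 2 and "or" and "not" have count 1.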

Pregel: iterative graph processing, 200 LoC using Spark
HaLoop: iterative MapReduce, 200 LoC using Spark

Slide12

Spark Example: Log Mining

Load error messages from a log into memory and run interactive queries

(Diagram: Master and Worker nodes)

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformation
errors.persist()                                    // keep errors in memory for reuse
errors.filter(_.contains("Foo")).count()            // action!
errors.filter(_.contains("Bar")).count()

Result: full-text search on 1TB data in 5-7sec vs. 170sec with on-disk data!

Slide13

Evaluation

10 iterations on 100GB data using 25-100 machines

Slide14

Evaluation

10 iterations on 54GB data with approximately 4M articles

(Figure labels: 2.4x, 7.4x)

Slide15

Evaluation

Running time for 10 iterations of k-means on 75 nodes; each iteration contains 400 tasks on 100GB of data

Slide16

Conclusion

RDDs offer a simple yet efficient programming model for a broad range of distributed applications
RDDs provide outstanding performance and efficient fault-tolerance

Slide17

Thank you!

Slide18

Backup

Slide19

Evaluation

Performance of logistic regression using 100GB data on 25 machines

Slide20

Example: PageRank

Slide21
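The PageRank slides are figure-only in this transcript, so the following is a minimal sketch of the algorithm on local Scala collections, not the code from the deck. `PageRankSketch`, the `damping` default of 0.85, and the uniform initial ranks are assumptions. The sketch also assumes every page appears as a key and has at least one outgoing link.

```scala
// Iterative PageRank on an in-memory link map.
// A sketch under stated assumptions; not the deck's or Spark's code.
object PageRankSketch {
  // links: page -> outgoing neighbours. Assumes every page is a key and
  // has at least one outgoing link (no dangling-node handling).
  def pageRank(links: Map[String, Seq[String]],
               iterations: Int,
               damping: Double = 0.85): Map[String, Double] = {
    val n = links.size
    var ranks: Map[String, Double] =
      links.keys.map(page => (page, 1.0 / n)).toMap

    for (_ <- 1 to iterations) {
      // Each page sends rank / outDegree to each of its neighbours.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
        case (page, outs) =>
          outs.map(dest => (dest, ranks(page) / outs.size))
      }
      // Sum the contributions received by each page.
      val sums = contribs.groupMapReduce(_._1)(_._2)(_ + _)
      // New rank: damped sum plus a uniform share.
      ranks = links.keys.map { page =>
        (page, (1 - damping) / n + damping * sums.getOrElse(page, 0.0))
      }.toMap
    }
    ranks
  }
}
```

The `links` map is read unchanged in every iteration while `ranks` is rebuilt each time. In Spark, `links` would be the dataset worth keeping in memory across iterations, which is the kind of reuse that MapReduce can only get by going through stable storage.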

Example: PageRank

A good example to show the advantage of in-memory sharing vs. MapReduce’s stable storage sharing

Slide22

Data Layout Optimization

Slide23

Another interesting evaluation
