A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
Slide1
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
University of California, Berkeley
Presented by Qi Gao, adapted from Matei’s NSDI’12 presentation
EECS 582 – W16
Slide2
Overview
A new programming model (RDD)
  parallel/distributed computing
  in-memory sharing
  fault-tolerance
An implementation of RDD: Spark
Slide3
Motivation
MapReduce greatly simplifies “big data” analysis on large, unreliable clusters
  Simple interface: map and reduce
  Hides the details of parallelism, data partitioning, fault-tolerance, load-balancing...
Problems
  cannot support complex applications efficiently
  cannot support interactive applications efficiently
Root cause
  Inefficient data sharing
In MapReduce, the only way to share data across jobs is stable storage -> slow!
Slide4
Motivation
Slide5
Goal: In-Memory Data Sharing
Slide6
Challenges
In-memory sharing is 10-100x faster than network/disk, but how can fault-tolerance be achieved efficiently?
  Data replication?
  Log fine-grained updates to mutable state?
Both are costly for data-intensive apps
  Network bandwidth is a scarce resource
  Disk I/O is slow
Slide7
Observation
Coarse-grained operations: in many distributed computations, the same operation is applied to many data items in parallel
Slide8
RDD Abstraction
Restricted form of distributed shared memory
  immutable, partitioned collection of records
  can only be built through coarse-grained deterministic transformations (map, filter, join...)
Efficient fault-tolerance using lineage
  Log coarse-grained operations instead of fine-grained data updates
  An RDD has enough information about how it’s derived from other datasets
  Recompute lost partitions on failure
Slide9
Fault-tolerance
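The lineage idea from the RDD Abstraction slide can be sketched as a toy model in plain Scala. This is only an illustration under stated assumptions: ToyRDD, Source, Mapped, Filtered, Cached and LineageDemo are hypothetical names, not Spark's classes, and a "partition" here is just an in-process Vector. The point it shows is that each dataset records the coarse-grained operation that produced it, so a lost partition can be recomputed from its parent instead of being restored from a replica.

```scala
// Toy model of lineage-based recovery. All names are hypothetical and are
// NOT Spark's API; partitions are plain in-process Vectors.
object LineageDemo {
  // A dataset is described by how to derive each partition, not by its data.
  sealed trait ToyRDD[A] {
    def numPartitions: Int
    // Derive one partition by replaying the recorded operations.
    def compute(p: Int): Vector[A]
    def map[B](f: A => B): ToyRDD[B] = Mapped(this, f)
    def filter(pred: A => Boolean): ToyRDD[A] = Filtered(this, pred)
  }

  // Stable input whose partitions can always be re-read.
  final case class Source[A](parts: Vector[Vector[A]]) extends ToyRDD[A] {
    def numPartitions: Int = parts.size
    def compute(p: Int): Vector[A] = parts(p)
  }

  // Lineage node: remembers the parent and the function applied to it.
  final case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(p: Int): Vector[B] = parent.compute(p).map(f)
  }

  final case class Filtered[A](parent: ToyRDD[A], pred: A => Boolean) extends ToyRDD[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(p: Int): Vector[A] = parent.compute(p).filter(pred)
  }

  // In-memory copies of computed partitions; an entry can be lost on failure.
  final class Cached[A](rdd: ToyRDD[A]) {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]
    var recomputations = 0

    // Return the cached partition, recomputing it from lineage if missing.
    def get(p: Int): Vector[A] =
      cache.getOrElseUpdate(p, { recomputations += 1; rdd.compute(p) })

    // Simulate a worker failure that drops one in-memory partition.
    def lose(p: Int): Unit = { cache.remove(p); () }

    def collect(): Vector[A] =
      (0 until rdd.numPartitions).toVector.flatMap(p => get(p))
  }

  def main(args: Array[String]): Unit = {
    val source = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    val cached = new Cached(source.map(_ * 10).filter(_ % 20 == 0))
    println(cached.collect())   // computes both partitions
    cached.lose(1)              // partition 1 is gone from memory
    println(cached.collect())   // only partition 1 is recomputed
    println(cached.recomputations)
  }
}
```

Because nothing but the operations is logged, the recovery cost is proportional to the lost partition, not to the size of the whole dataset.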
Slide10
Spark
Implements Resilient Distributed Datasets (RDDs)
Operations on RDDs
  Transformations: define a new dataset based on previous ones
  Actions: start a job to execute on the cluster
Well-designed interface to represent RDDs
  Makes it very easy to implement transformations
  Most Spark transformation implementations are < 20 LoC
Slide11
Simple Yet Powerful
WordCount Implementation: Hadoop vs. Spark
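The side-by-side code from this slide did not survive extraction. As a rough stand-in, the sketch below expresses WordCount on plain Scala collections, with no Spark dependency; WordCountDemo and wordCount are hypothetical names, and this is not the code shown in the original figure. The pipeline has the same shape as a Spark job built from flatMap, map and reduceByKey, which is why the Spark version stays short.

```scala
// WordCount on plain Scala collections. Hypothetical names; no Spark needed.
object WordCountDemo {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))     // split each line into words
      .filter(_.nonEmpty)           // drop empty tokens caused by leading whitespace
      .map(word => (word, 1))       // emit (word, 1) pairs, as a mapper would
      .groupBy(_._1)                // group pairs by word (stands in for the shuffle)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum counts per word

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be")).foreach(println)
}
```

On a cluster the grouping step is where data moves between machines; here it is a local groupBy, so the sketch shows the programming model only, not the performance.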
Pregel: iterative graph processing, 200 LoC using Spark
HaLoop: iterative MapReduce, 200 LoC using Spark
Slide12
Spark Example: Log Mining
Load error messages from a log into memory and run interactive queries
[Diagram: one Master and several Workers]
val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformation
errors.persist()                                    // keep errors in memory for reuse
errors.filter(_.contains("Foo")).count()            // action!
errors.filter(_.contains("Bar")).count()
Result: full-text search on 1TB data in 5-7sec vs. 170sec with on-disk data!
Slide13
Evaluation
10 iterations on 100GB data using 25-100 machines
Slide14
Evaluation
10 iterations on 54GB data with approximately 4M articles
[Chart labels: 2.4x, 7.4x]
Slide15
Evaluation
Running time for 10 iterations of k-means on 75 nodes; each iteration contains 400 tasks on 100GB of data
Slide16
Conclusion
RDDs offer a simple yet efficient programming model for a broad range of distributed applications
RDDs provide outstanding performance and efficient fault-tolerance
Slide17
Thank you!
Slide18
Backup
Slide19
Evaluation
Performance of logistic regression using 100GB data on 25 machines
Slide20
Example: PageRank
Slide21
Example: PageRank
A good example to show the advantage of in-memory sharing vs. MapReduce’s stable storage sharing
Slide22
Data Layout Optimization
Slide23
Another interesting evaluation