Matei Zaharia, Mosharaf Chowdhury, Tathagata Das,


Author : lindy-dunigan | Published Date : 2025-05-19



Transcript:
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

Spark: Fast, Interactive, Language-Integrated Cluster Computing
www.spark-project.org

Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining
Enhance programmability:
- Integrate into the Scala programming language
- Allow interactive use from the Scala interpreter
Question: why does the original MapReduce model not efficiently support these use cases?

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.

Solution: Resilient Distributed Datasets (RDDs)
- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
- Support a wide range of applications

Programming Model
Resilient distributed datasets (RDDs):
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
- Can be cached for efficient reuse
Actions on RDDs:
- count, reduce, collect, save, …

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(_.startsWith("ERROR"))
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram labels: Block 1, Block 2, Block 3; Cache 1, Cache 2, Cache 3; tasks; results. Callouts: Base RDD, Transformed RDD, Action]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.
Ex:

    messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

[Diagram: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]

Example: Logistic Regression
Goal: find best line separating two
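
To make the lineage idea from the RDD Fault Tolerance and Log Mining slides concrete, here is a minimal single-process sketch in plain Scala. It is not Spark's implementation and does not use the Spark API: `MiniRDD`, `Source`, `Mapped`, `Filtered`, `evict`, and `computeCalls` are hypothetical names introduced only for illustration, and the in-memory `Vector`s stand in for blocks in stable storage. Each dataset remembers its parent and the function that derives it, so a lost cached partition can be rebuilt on demand.

```scala
import scala.collection.mutable

object LineageSketch {
  // Toy model of an RDD: a partitioned dataset that can rebuild any
  // partition from its lineage. All names here are hypothetical.
  abstract class MiniRDD[A] {
    def numPartitions: Int
    // Rebuild partition p from this dataset's parent and function.
    protected def compute(p: Int): Vector[A]

    private val store = mutable.Map.empty[Int, Vector[A]]
    private var keep = false
    var computeCalls = 0 // how many times a partition was (re)built

    def cache(): this.type = { keep = true; this }

    // Simulate losing a cached partition, e.g. when a node fails.
    def evict(p: Int): Unit = store.remove(p)

    def partition(p: Int): Vector[A] = store.getOrElse(p, {
      computeCalls += 1
      val data = compute(p)
      if (keep) store(p) = data
      data
    })

    def map[B](f: A => B): MiniRDD[B] = new Mapped(this, f)
    def filter(f: A => Boolean): MiniRDD[A] = new Filtered(this, f)
    def count: Int = (0 until numPartitions).map(p => partition(p).size).sum
  }

  // Stands in for the blocks of a file in stable storage.
  class Source[A](blocks: Vector[Vector[A]]) extends MiniRDD[A] {
    def numPartitions: Int = blocks.size
    protected def compute(p: Int): Vector[A] = blocks(p)
  }

  class Mapped[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def numPartitions: Int = parent.numPartitions
    protected def compute(p: Int): Vector[B] = parent.partition(p).map(f)
  }

  class Filtered[A](parent: MiniRDD[A], f: A => Boolean) extends MiniRDD[A] {
    def numPartitions: Int = parent.numPartitions
    protected def compute(p: Int): Vector[A] = parent.partition(p).filter(f)
  }

  // Returns (first count, second count, partitions rebuilt after the loss).
  def demo(): (Int, Int, Int) = {
    val lines = new Source(Vector(
      Vector("ERROR\tdb\tfoo failed", "INFO\tweb\tok"),
      Vector("ERROR\tweb\tbar timeout", "ERROR\tdb\tfoo again")))
    val cachedMsgs =
      lines.filter(_.startsWith("ERROR")).map(_.split('\t')(2)).cache()

    val first = cachedMsgs.filter(_.contains("foo")).count // fills the cache
    val callsAfterFirst = cachedMsgs.computeCalls
    cachedMsgs.evict(1) // lose one cached partition
    val second = cachedMsgs.filter(_.contains("foo")).count
    (first, second, cachedMsgs.computeCalls - callsAfterFirst)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

In this sketch the second query returns the same count as the first, and only the evicted partition is recomputed; the surviving partition is served from the cache. A real system would also have to track which node holds each partition and schedule the recomputation, which this toy model leaves out.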
