Cloud Computing and MapReduce



Presentation Transcript

1. Cloud Computing and MapReduce
Uses slides from the RAD Lab at UC Berkeley about the cloud (http://abovetheclouds.cs.berkeley.edu/) and from Jimmy Lin's slides (http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html), licensed under the Creative Commons Attribution 3.0 License.

2. Cloud Computing
What is the "cloud"? Many answers; it is easier to explain with examples:
- Gmail is in the cloud
- Amazon Web Services (AWS) EC2 and S3 are the cloud
- Google AppEngine is the cloud
- Windows Azure is the cloud
- SimpleDB is in the cloud
The "network" (cloud) is the computer.

3. Cloud Computing
What about Wikipedia? "Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet)."

4. Cloud Computing
Computing as a "service" rather than a "product": everything happens in the "cloud", both storage and computing, and personal devices (laptops/tablets) simply interact with the cloud.
Advantages:
- Device agnostic: can seamlessly move from one device to another
- Efficiency/scalability: programming frameworks allow easy scalability (relatively speaking); there is an increasing need to handle "Big Data"
- Reliability
- Multi-tenancy (better for the cloud provider)
- Cost: "pay as you go" allows renting computing resources as needed, much cheaper than building your own systems

5. More
- Scalability: you (can) have effectively infinite resources and can handle an unlimited number of users
- Multi-tenancy enables sharing of resources and costs across a large pool of users. Lower cost, higher utilization... but other issues arise, e.g., security.
- Elasticity: you can add or remove compute nodes, and the end user will not be affected / will see the improvement quickly
- Utility computing (similar to the electrical grid)

6. X-as-a-Service

7. X-as-a-Service

8. Cloud Types

9. Data Centers
- The key infrastructure piece that enables cloud computing
- Everyone is building them
- Huge amount of work on deciding how to build/design them

10. Data Centers

11. Data Centers

12. Data Centers
Amazon data centers, some old data:
- An 8 MW data center can include about 46,000 servers
- Costs about $88 million to build (just the facility)
- Power is a pretty large portion of cost, but server costs still dominate
(source: James Hamilton presentation)
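To put those numbers in perspective (our arithmetic, not from the slides):

  $88M / 46,000 servers ≈ $1,900 of facility cost per server, before buying the server itself
  8 MW / 46,000 servers ≈ 175 W per server, including cooling and distribution overhead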

13. Data Centers
Power distribution:
- Almost 11% is lost in distribution; this starts mattering when total power consumption is in the millions
Modular and pre-fab designs:
- Fast and economical deployments, built in a factory
(source: James Hamilton presentation; slides from 4-5 years ago)

14. Data Centers
Networking equipment:
- Very, very expensive: server/storage prices are dropping fast, but networking is frozen in time (a vertically integrated ecosystem)
- A bottleneck: forces workload placement restrictions
Cooling/temperature/energy issues:
- Appropriate placement of vents, inlets, etc. is a key issue
- Thermal hotspots often appear and need to be worked around
- The overall cost of cooling is quite high, and so is the cost of running the computing equipment; both have driven work on energy-efficient computing
- PUE (Power Usage Effectiveness) is hard to optimize in small data centers, which may lead to very large data centers in the near future
- Ideally PUE should be 1; current numbers are around 1.07-1.2 (1.07 is a Facebook data center that does not have A/C)
(source: James Hamilton presentation)
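For reference, PUE compares what the facility draws to what the IT equipment actually uses (the worked reading below is ours):

  PUE = total facility power / IT equipment power

  At PUE = 1.07, every 1.07 W the facility draws delivers 1 W to the servers, i.e., only ~7% overhead for cooling, power distribution, and lighting; at PUE = 1.2 the overhead is 20%.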

15. MGHPCC
Massachusetts Green High Performance Computing Center (MGHPCC) in Holyoke, MA:
- Cost: $95M
- 8.6 acres, 10,000 high-end computers with hundreds of thousands of processor cores
- 10 MW power + 5 MW for cooling/lighting
- Close to electricity sources (hydroelectric plant) + solar
More: http://www.mghpcc.org/

16. Amazon Web Services
11 AWS regions worldwide

17. Virtualization
- Virtual machines (e.g., running Windows inside a Mac) have been around for a long time, but used to be very slow; only recently did they become efficient enough to be a key enabler of cloud computing
- Basic idea: run virtual machines on your servers and sell time on them; that's how Amazon EC2 runs
Many advantages:
- "Security": a virtual machine serves as an "almost" impenetrable boundary
- Multi-tenancy: can have multiple VMs on the same server
- Efficiency: replace many underpowered machines with a few high-powered machines

18. Virtualization
Consumer VM products include VMware, Parallels (for Mac), VirtualBox, etc.
Some tricky things to keep in mind:
- Harder to reason about performance (if you care)
- Identical VMs may deliver somewhat different performance
Much continuing work on virtualization technology itself.

19. Docker
- The hottest thing right now...
- Avoids the overheads of virtualization altogether

20. CLOUD COMPUTING ECONOMICS AND ELASTICITY

21. Cloud Application Demand
Many cloud applications have cyclical demand curves: daily, weekly, monthly, ...
[Figure: demand over time plotted against provisioned resources]

22. Economics of Cloud Users
- Pay by use instead of provisioning for peak
- Recall: a data center costs >$150M and takes 24+ months to design and build
[Figure: resources vs. time for a static data center and a data center in the cloud, showing capacity, demand, and unused resources]
How do you pick a capacity level?

23. Economics of Cloud Users
- Risk of over-provisioning: underutilization
- Huge sunk cost in infrastructure
[Figure: static data center capacity vs. demand over time (days 1-3), showing unused resources]

24. Utility Computing Arrives
Amazon Elastic Compute Cloud (EC2):
- "Compute unit" rental: $0.085-0.68/hour (originally $0.10-0.80)
- 1 CU ≈ 1.0-1.2 GHz 2007 AMD Opteron/Intel Xeon core
- No up-front cost, no contract, no minimum
- Billing rounded to nearest hour (also regional and spot pricing)

  Platform (price/hour)              Architecture  Compute Units  Memory    Disk
  Small ($0.085, was $0.10)          32-bit        1              1.7 GB    160 GB
  Large ($0.35, was $0.40)           64-bit        4              7.5 GB    850 GB (2 spindles)
  X Large ($0.68, was $0.80)         64-bit        8              15 GB     1690 GB (4 spindles)
  High CPU Med ($0.17, was $0.20)    64-bit        5              1.7 GB    350 GB
  High CPU Large ($0.68, was $0.80)  64-bit        20             7 GB      1690 GB
  High Mem X Large ($0.50)           64-bit        6.5            17.1 GB   1690 GB
  High Mem XXL ($1.20)               64-bit        13             34.2 GB   1690 GB
  High Mem XXXL ($2.40)              64-bit        26             68.4 GB   1690 GB

A quick cost sketch using these prices follows.
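A minimal sketch of the pay-as-you-go arithmetic in Python, using the hourly prices above; the instance mix and the 8-hours-per-day workload are made-up illustrations, not from the slides:

# Rough pay-as-you-go cost estimate using the EC2 prices listed above.
import math

HOURLY_RATE = {"small": 0.085, "large": 0.35, "xlarge": 0.68}

def monthly_cost(instance_type, hours_per_day, days=30):
    # Billing is rounded to the nearest hour per the slide; we round up.
    billed_hours = math.ceil(hours_per_day) * days
    return HOURLY_RATE[instance_type] * billed_hours

# Ten Large instances used 8 hours/day: you pay only for hours used,
# instead of buying ten machines that sit idle the other 16 hours.
print(10 * monthly_cost("large", 8))   # 840.0 dollars/month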

25. Utility Storage Arrives
Amazon S3 and Elastic Block Store (EBS) offer low-cost, contract-less storage

26.

27. Programming Frameworks
- The third key piece emerged from efforts to "scale out", i.e., distribute work over large numbers of machines (thousands of machines)
- Parallelism has been around for a long time, both in a single machine and as a cluster of computers, but it has always been considered very hard to program, especially the distributed kind
- Too many things to keep track of: how to parallelize, how to distribute the data, how to handle failures, etc.
- Google developed the MapReduce and BigTable frameworks and ushered in a new era

28. Programming Frameworks
Note the difference between "scale up" and "scale out":
- Scale up usually refers to using a larger machine (easier to do)
- Scale out refers to distributing over a large number of machines
Even with VMs, I still need to know how to distribute work across multiple VMs; Amazon's largest single instance may not be enough.

29. Cloud Computing Infrastructure
- Computation model: MapReduce*
- Storage model: HDFS*
- Other computation models: HPC/Grid Computing
- Network structure
*Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under the Creative Commons Attribution 3.0 License)

30. Cloud Computing Computation Models
Finding the right level of abstraction:
- von Neumann architecture vs. the cloud environment
- Hide system-level details from developers: no more race conditions, lock contention, etc.
Separating the what from the how:
- The developer specifies the computation that needs to be performed
- The execution framework ("runtime") handles the actual execution
- Similar to SQL!

31. Typical Large-Data Problem
- Iterate over a large number of records
- Extract something of interest from each
- Shuffle and sort intermediate results
- Aggregate intermediate results
- Generate final output
Key idea: provide a functional abstraction for two of these operations (extracting from each record = map; aggregating intermediate results = reduce): MapReduce (Dean and Ghemawat, OSDI 2004)

32. Roots in Functional Programming
[Figure: Map applies a function f to each input element independently; Fold aggregates the results with a function g]
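A minimal sketch of those functional roots in Python; the particular choices of f and g are our illustration, not from the slides:

# Map applies f to every element independently; fold (reduce) combines
# the results pairwise with g. MapReduce generalizes this pair of
# primitives to clusters of machines.
from functools import reduce

xs = [1, 2, 3, 4, 5]
f = lambda x: x * x           # the "map" function, applied per element
g = lambda acc, y: acc + y    # the "fold" function, combines results

mapped = list(map(f, xs))      # [1, 4, 9, 16, 25]
folded = reduce(g, mapped, 0)  # 55
print(mapped, folded)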

33. MapReduce
Programmers specify two functions:
- map (k, v) → <k', v'>*
- reduce (k', v') → <k', v''>*
All values with the same key are sent to the same reducer.
The execution framework handles everything else...

34. MapReduce
[Figure: four mappers read inputs (k1,v1)...(k6,v6) and emit intermediate pairs, e.g., (a,1)(b,2), (c,3)(c,6), (a,5)(c,2), (b,7)(c,8); "Shuffle and Sort" aggregates values by key into a:[1,5], b:[2,7], c:[2,3,6,8]; three reducers then produce (r1,s1), (r2,s2), (r3,s3)]

35. MapReduce
Programmers specify two functions:
- map (k, v) → <k', v'>*
- reduce (k', v') → <k', v'>*
All values with the same key are sent to the same reducer.
The execution framework handles everything else...
What's "everything else"?

36. MapReduce "Runtime"
- Handles scheduling: assigns workers to map and reduce tasks
- Handles "data distribution": moves processes to data
- Handles synchronization: gathers, sorts, and shuffles intermediate data
- Handles errors and faults: detects worker failures and automatically restarts them
- Handles speculative execution: detects "slow" workers and re-executes their work
Everything happens on top of a distributed file system (later).
Sounds simple, but many challenges!

37. MapReduce
Programmers specify two functions:
- map (k, v) → <k', v'>*
- reduce (k', v') → <k', v'>*
All values with the same key are reduced together.
The execution framework handles everything else...
Not quite... usually, programmers also specify:
- partition (k', number of partitions) → partition for k'
  Often a simple hash of the key, e.g., hash(k') mod R; divides up the key space for parallel reduce operations
- combine (k', v') → <k', v'>*
  Mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic
A sketch of both appears below.
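A minimal sketch of a partitioner and a combiner in Python, assuming a word-count-style workload; the function names and the choice of R = 4 are our illustration:

import hashlib
from collections import defaultdict

R = 4  # number of reduce partitions (an illustrative choice)

def partition(key, num_partitions=R):
    # hash(k') mod R, as on the slide; md5 is stable across processes,
    # unlike Python's built-in hash(), which is randomized.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_partitions

def combine(key, values):
    # A mini-reducer over the mapper's local output: pre-sum the counts
    # so less data crosses the network during the shuffle.
    yield key, sum(values)

map_output = [("cloud", 1), ("data", 1), ("cloud", 1)]
groups = defaultdict(list)
for k, v in map_output:
    groups[k].append(v)
combined = [kv for k, vs in groups.items() for kv in combine(k, vs)]
print(combined, [partition(k) for k, _ in combined])
# [('cloud', 2), ('data', 1)], each key routed to one of R partitions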

38. MapReduce
[Figure: the same dataflow with combiners and partitioners added; each mapper's output passes through a combiner (e.g., (c,3)(c,6) becomes (c,9)) and a partitioner before the shuffle; shuffle and sort aggregates values by key into a:[1,5], b:[2,7], c:[2,9,8], and three reducers emit (r1,s1), (r2,s2), (r3,s3)]

39. Two more details...
- There is a barrier between the map and reduce phases, but we can begin copying intermediate data earlier
- Keys arrive at each reducer in sorted order; there is no enforced ordering across reducers

40. MapReduce Overall Architecture
[Figure, adapted from (Dean and Ghemawat, OSDI 2004): the user program (1) submits a job to the master, which (2) schedules map and reduce tasks onto workers; map workers (3) read input splits (split 0-4) and (4) write intermediate files to local disk; reduce workers (5) remotely read the intermediate data and (6) write the output files (output file 0, output file 1)]

41. "Hello World" Example: Word Count
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
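A runnable single-machine sketch of the same word count in Python; the shuffle is simulated in memory, so this illustrates the model rather than Hadoop's actual API:

# Word count with the map / shuffle-and-sort / reduce structure explicit.
from collections import defaultdict

def map_fn(docid, text):
    for word in text.split():
        yield word, 1

def reduce_fn(term, values):
    yield term, sum(values)

docs = {"d1": "cloud data cloud", "d2": "data centers store data"}

# Map phase: run map_fn over every document.
intermediate = [kv for docid, text in docs.items() for kv in map_fn(docid, text)]

# Shuffle and sort: group all values by key (the framework's job).
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

# Reduce phase: one call per key, keys in sorted order as on slide 39.
output = [kv for k in sorted(groups) for kv in reduce_fn(k, groups[k])]
print(output)  # [('centers', 1), ('cloud', 2), ('data', 3), ('store', 1)]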

42. MapReduce can refer to...
- The programming model
- The execution framework (aka "runtime")
- The specific implementation
Usage is usually clear from context!

43. MapReduce Implementations
- Google has a proprietary implementation in C++, with bindings in Java and Python
- Hadoop is an open-source implementation in Java; development was led by Yahoo, it is used in production, and it is now an Apache project; a rapidly expanding software ecosystem, but still lots of room for improvement
- Lots of custom research implementations: for GPUs, Cell processors, etc.

44. Cloud Computing Storage, or how do we get data to the workers?
[Figure: compute nodes connected to NAS/SAN storage]
What's the problem here?

45. Distributed File System
Don't move data to workers... move workers to the data!
- Store data on the local disks of nodes in the cluster
- Start up the workers on the node that holds the data locally
Why?
- Network bisection bandwidth is limited
- Not enough RAM to hold all the data in memory
- Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer:
- GFS (Google File System) for Google's MapReduce
- HDFS (Hadoop Distributed File System) for Hadoop

46. GFS: Assumptions
- Choose commodity hardware over "exotic" hardware; scale "out", not "up"
- High component failure rates: inexpensive commodity components fail all the time
- A "modest" number of huge files: multi-gigabyte files are common, if not encouraged
- Files are write-once, mostly appended to (perhaps concurrently)
- Large streaming reads over random access; high sustained throughput over low latency
(GFS slides adapted from material by (Ghemawat et al., SOSP 2003))

47. GFS: Design Decisions
- Files stored as chunks: fixed size (64 MB)
- Reliability through replication: each chunk replicated across 3+ chunkservers
- Single master to coordinate access and keep metadata: simple, centralized management
- No data caching: little benefit due to large datasets and streaming reads
- Simplify the API: push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas implemented in Java); a sketch of the chunking and replication follows.
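A minimal Python sketch of the first two decisions, fixed-size chunking and 3-way replication; the round-robin placement policy is our simplification, since the real system also spreads replicas across racks and balances disk usage:

# Split a file into fixed 64 MB chunks and assign each chunk to three
# chunkservers, illustrating "files stored as chunks" plus
# "reliability through replication".
CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunks, as in GFS
REPLICATION = 3                 # each chunk on 3+ chunkservers

def place_chunks(file_size, servers):
    # Ceiling division: a 200 MB file needs 4 chunks.
    num_chunks = -(-file_size // CHUNK_SIZE)
    placement = {}
    for i in range(num_chunks):
        # Naive round-robin replica placement (illustrative only).
        placement[i] = [servers[(i + r) % len(servers)]
                        for r in range(REPLICATION)]
    return placement

print(place_chunks(200 * 1024 * 1024, ["cs1", "cs2", "cs3", "cs4", "cs5"]))
# {0: ['cs1','cs2','cs3'], 1: ['cs2','cs3','cs4'],
#  2: ['cs3','cs4','cs5'], 3: ['cs4','cs5','cs1']}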

48. From GFS to HDFS
Terminology differences:
- GFS master = Hadoop namenode
- GFS chunkservers = Hadoop datanodes
Functional differences:
- No file appends in HDFS (they were planned)
- HDFS performance is (likely) slower

49. HDFS Architecture
[Figure, adapted from (Ghemawat et al., SOSP 2003): an application uses the HDFS client, which sends (file name, block id) requests to the HDFS namenode and receives (block id, block location) responses; the namenode holds the file namespace (e.g., /foo/bar → block 3df2) and exchanges instructions and datanode state with the datanodes; the client then sends (block id, byte range) requests directly to an HDFS datanode, which serves block data from its local Linux file system]

50. Namenode Responsibilities
Managing the file system namespace:
- Holds the file/directory structure, metadata, file-to-block mapping, access permissions, etc.
Coordinating file operations:
- Directs clients to datanodes for reads and writes
- No data is moved through the namenode
Maintaining overall health:
- Periodic communication with the datanodes
- Block re-replication and rebalancing
- Garbage collection
A sketch of the namespace metadata appears below.
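A minimal Python sketch of the metadata the namenode keeps and the lookup shown in the architecture figure; the block IDs and datanode names are invented for illustration:

# The namenode holds only metadata: file -> blocks, block -> locations.
# Clients resolve (file name, block index) here, then fetch block data
# directly from a datanode; no data flows through the namenode.
file_to_blocks = {"/foo/bar": ["3df2", "81aa"]}        # file-to-block mapping
block_locations = {"3df2": ["dn1", "dn3", "dn4"],      # 3 replicas per block
                   "81aa": ["dn2", "dn3", "dn5"]}

def lookup(path, block_index):
    # Return (block id, datanodes holding a replica) for one block.
    block_id = file_to_blocks[path][block_index]
    return block_id, block_locations[block_id]

print(lookup("/foo/bar", 0))   # ('3df2', ['dn1', 'dn3', 'dn4'])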

51. Putting everything together...
[Figure: a Hadoop cluster; each slave node runs a datanode daemon over its local Linux file system plus a tasktracker; the namenode daemon runs on the namenode, and the jobtracker runs on the job submission node]

52. MapReduce/GFS Summary
- Simple but powerful programming model
- Scales to handle petabyte+ workloads
  - Google: six hours and two minutes to sort 1 PB (10 trillion 100-byte records) on 4,000 computers
  - Yahoo!: 16.25 hours to sort 1 PB on 3,800 computers
- Incremental performance improvement with more nodes
- Seamlessly handles failures, but possibly with performance penalties
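As a rough sanity check on those sort numbers (our arithmetic, not from the slides):

  10 trillion records × 100 bytes = 10^15 bytes = 1 PB
  1 PB / (6 h 2 min ≈ 21,700 s) ≈ 46 GB/s aggregate, or roughly 11-12 MB/s per machine across 4,000 computers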