DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop
Jason Ansel


[...] mit.edu
Kapil Arya and Gene Cooperman
College of Computer and Information Science, Northeastern University, Boston, MA
{kapil,gene}@ccs.neu.edu

Abstract

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed [...]


[...] available today. DMTCP tries to support both traditional high performance applications and typical desktop applications. With this in mind, DMTCP supports the critical feature of transparency: no re-compilation and no re-linking of user binaries. Because it supports a wide range of recent Linux kernels (2.6.9 through the current 2.6.28), it can be packaged as just one module in a larger application. The application binary needs no root privilege and need not be re-configured for new kernels. This also painlessly adds a “save/restore workspace” capability to an application, or even to a problem-solving environment with many threads or processes.

Ultimately, the novelty of DMTCP rests on its particular combination of features: user-level, multithreaded, distributed processes connected with sockets, and fast checkpoint times with negligible overhead while not checkpointing. Those features are designed to support a broad range of use cases for checkpointing, which go beyond the traditional use cases of today.

1.1. Use Cases

In this section, we present some uses of checkpointing that go beyond the traditional checkpointing of long-running batch processes. Many of these additional uses are motivated by desktop applications.

1) save/restore workspace: Interactive languages frequently include their own “save/restore workspace” commands. DMTCP eliminates that need.
2) “undump” capability: programs that would otherwise have long startup times often create a custom “dump/undump” facility. The software is then built, dumped after startup, and re-built to package a “checkpoint” along with an undump routine. One of the applications for which we are working with the developers, cmsRun, has exactly this problem: initialization takes 10 minutes to half an hour due to obtaining reasonably current data from a database, along with issues of linking approximately 400 dynamic libraries. This is unacceptable when many thousands of such runs are required.
3) a substitute for PRELINK: PRELINK is a Linux technology for prelinking an application, in order to save startup time when many large dynamic libraries are invoked. PRELINK must be maintained in sync with the changing Linux architecture.
4) debugging of distributed applications: all processes are checkpointed just before a bug and then restarted (possibly on a single host) for debugging.
5) checkpointed image as the “ultimate bug report”.
6) applications with CPU-intensive front-end and interactive analysis of results at back-end: run on a high performance host or cluster, and restart all processes on a single laptop.
7) traditional checkpointing of long-running distributed applications that may run under some dialect of MPI, or under a custom sockets package (e.g. iPython, used in SciPy/NumPy for parallel numerical applications).
8) robustness: upon detecting distributed deadlock or race, automatically revert to an earlier checkpoint image and restart in slower, “safe mode”, until beyond the danger point.

1.2. Outline

Section 2 covers related work. Section 3 describes DMTCP as seen by an end-user. Section 4 describes the software architecture. Section 5 presents experimental results. Finally, Section 6 presents the conclusions and future work.

2. Related Work

There is a long history of checkpointing packages (kernel- and user-level, coordinated and uncoordinated, single-threaded vs. multithreaded, etc.). Given the space limitations, we highlight only the most significant of the many other approaches.

DejaVu [29] (whose development overlapped that of DMTCP) also provides transparent user-level checkpointing of distributed processes based on sockets. However, DejaVu appears to be much slower than DMTCP. For example, in the Chombo benchmark, Ruscio et al. report executing ten checkpoints per hour with 45% overhead. In comparison, on a benchmark of similar scale DMTCP typically checkpoints in 2 seconds, with essentially zero overhead between checkpoints. Nevertheless, DejaVu is also able to checkpoint InfiniBand connections by using a customized version of MVAPICH. DejaVu takes a more invasive approach than DMTCP, by logging all communication and by using page protection to detect modification of memory pages between checkpoints. This accounts for additional overhead during normal program execution that is not present in DMTCP. Since DejaVu was not publicly available at the time of this writing, a direct timing comparison on a common benchmark was not possible.

The remaining work on distributed transparent checkpointing can be divided into two categories:

1) User-level MPI libraries for checkpointing [4], [5], [12], [14], [15], [32], [34], [36], [37]: works for distributed processes, but only if they communicate exclusively through MPI (Message Passing Interface). Typically restricted to a particular dialect of MPI.
2) Kernel-level (system-level) checkpointing [13], [16], [18], [19], [30], [31], [33]: modification of kernel; requirements on matching package version to kernel version.

[...]

These two layers are separate, with a small API between them. This two-layer user-level approach has a potential advantage in non-Linux operating systems, where DMTCP can be ported to run over other single-process checkpointing packages that may already exist.

Checkpointing is added to arbitrary applications by injecting a shared library at execution time. This library:

- Launches a checkpoint management thread in every user process which coordinates checkpointing.
- Adds wrappers around a small number of libc functions in order to record information about open sockets at their creation time.

System calls and the proc filesystem are also used to probe kernel state.

We use a coordinated checkpointing method, where all processes and threads cluster-wide are simultaneously suspended during checkpointing. Network data “on the wire” and in kernel buffers is flushed into the recipient process's memory and saved in its checkpoint image. After a checkpoint or restart, this network data is sent back to the original sender and retransmitted prior to resuming user threads. A more detailed account of our methodology can be found in Section 4.

The only global communication primitive used at checkpoint time is a barrier. At restart time, we additionally require a discovery service to discover the new addresses for processes migrated to new hosts.

4.2. Initialization of an application process under DMTCP

At startup of a new process, dmtcp_checkpoint injects dmtcphijack.so, the DMTCP library responsible for checkpointing, into the user program. Library injection is currently done using LD_PRELOAD. Library injection can also be done after program startup [35] and on other architectures [9]. Once injected into the user process, DMTCP loads mtcp.so, our single process checkpointer, and calls the MTCP setup routines to enable integration with DMTCP. MTCP creates the checkpoint manager thread in this setup routine. DMTCP also opens a TCP/IP connection to the checkpoint coordinator at this time. This results in a copy of our libraries and manager residing within each checkpointed process.

DMTCP adds wrappers around a small number of libc functions. This is done by overriding libc symbols with our library. For efficiency reasons, we avoid wrapping any frequently invoked system calls such as read and write.

The wrappers are necessary since DMTCP must be aware of all forked child processes, of all attempts to create remote processes (for example via an exec to an ssh process), and of the parameters by which all sockets are created. In the case of sockets, DMTCP needs to know whether the sockets are TCP/IP sockets (and whether they are listener or non-listener sockets), UNIX domain sockets, or pseudo-terminals. DMTCP places wrappers around the following functions: socket, connect, bind, listen, accept, setsockopt, fexecve, execve, execv, execvp, fork, close, dup2, socketpair, openlog, syslog, closelog, ptsname and ptsname_r. The rest of this section describes the purposes for these wrappers.

4.3. Checkpointing under DMTCP

Checkpointing proceeds through seven stages and six global barriers. Global barriers could be implemented efficiently through peer-to-peer communication or broadcast trees, but are currently centralized for simplicity of implementation.

The following is the DMTCP distributed algorithm for checkpointing an entire cluster. It is executed asynchronously in each user process. The only communication primitive used is a cluster-wide barrier. The following steps are depicted graphically in Figure 1.

1) Normal execution: The checkpoint manager thread in each process waits until a new checkpoint is requested by the coordinator. This is done by waiting at a special barrier that is not released until checkpoint time.
2) Suspend user threads: MTCP suspends all user threads, then DMTCP saves the owner of each file descriptor. DMTCP then waits until all application processes reach Barrier 2: “suspended”, then releases the barrier.
3) Elect shared FD leaders: DMTCP executes an election of a leader for each potentially shared file descriptor. We trick the operating system into electing a leader for us by misusing the F_SETOWN flag of fcntl. All processes set the owner, and the last one wins the election. In Step 4, a process can test if it is the election leader for a socket fd by testing if fcntl(fd, F_GETOWN) == getpid(). The original value for F_SETOWN is restored after kernel buffers are refilled. DMTCP then waits until all application processes reach Barrier 3: “election completed”, then releases the barrier.
4) Drain kernel buffers and perform handshakes with peers: For each socket, the corresponding election leader flushes that socket by sending a special token. It then drains that socket by receiving until there is no more available data and the special token is seen. DMTCP then performs handshakes with all socket peers to discover the globally unique ID of the remote side of all sockets. The connection information table is then written to disk. DMTCP then waits until all application processes reach Barrier 4: “drained”, then releases the barrier.

[...]

Figure 2: Steps for restarting the system checkpointed in Figure 1. The unified restart process and subsequent fork are required to recreate sockets and pipes shared between processes.

7) Resume user threads: The program continues executing Step 7 of checkpointing.

Step 2 above bears further explanation. Recall that prior to checkpointing, whenever a new connection was accepted, wrappers around the system calls connect and accept had transferred information about the connector to the acceptor. This information includes a globally unique socket ID that remains constant even if processes are relocated. At restart time, the acceptor for each socket advertises the address and port of its restart listener socket to the discovery service. When the connector receives this advertisement, it opens a new connection to the acceptor who sent the advertisement. The two sides then perform a handshake and agree on the socket being restored. Finally, dup2 is used on each side to move the socket descriptor to the correct location. This process continues asynchronously until all sockets are restored. Our methodology supports both sides of a socket migrating. It also supports loopback sockets.

4.5. Implementation Strategies

In the implementation, some less obvious issues arise in the support for pipes, shared memory (via mmap), and virtual pids.

Pipes present an issue because they are unidirectional. As seen in Sections 4.3 and 4.4, the strategy for checkpointing network data in a socket connection is for the receiver to drain the socket into user space, then write a checkpoint image, and finally re-send the network data through the same socket back to the sender. In order to support pipes, a wrapper around the pipe system call promotes pipes into sockets.

In the case of shared memory, if the backing file of a shared memory segment is missing and we have directory write permission, then we create a new backing file. Next, assuming the backing file is present and we have write access, we overwrite the shared memory segment with data from the checkpoint image. If two processes share this memory, they will both write to the same shared segment, but with the same data, since the segment was also shared at the time of checkpoint. If we do not have write access (for example, read-only access to certain system-wide data), then we map the memory segment by the current data of the file, and not the checkpoint image data.

In order to support virtual pids (process ids), one must worry about pid conflicts. The original pid [...] When a process is first created through a call to fork, its pid also becomes its virtual pid, and that virtual pid is maintained throughout succeeding generations of restarts. Hence, a new process may have pid A. After checkpoint and restart, a second process may be created with the same pid A. Our wrapper around fork detects this situation, terminates the child with the conflicting virtual pid, and forks once again.

5. Experimental Results

DMTCP is currently implemented for GNU/Linux. The software has been verified to work on recent versions of Ubuntu, Debian, OpenSuse, and Red Hat Linux with Linux kernels ranging from version 2.6.9 through version 2.6.28. DMTCP runs on x86, x86_64 and mixed (32-bit processes in 64-bit Linux) architectures.

Figure 3: Common shell-like languages and other applications. All are run on a single node with compression enabled. (a) Checkpoint/Restart timings. (b) Checkpoint sizes.

Experiments were run on two broad classes of programs: shell-like languages intended for a single computer (e.g. MATLAB, Perl, Python, Octave, etc.); and distributed programs across the nodes of a cluster (e.g. ParGeant4, iPython, MPICH2, OpenMPI, etc.). Reported checkpoint image sizes are after gzip compression (unless otherwise indicated), since DMTCP dynamically invokes gzip before saving, by default.

In Section 5.1, our goal was to demonstrate DMTCP on 20 common real-world applications. Shell-like languages were emphasized for their widespread usage, and for their tendency to invoke multiple processes and multiple threads in their implementation. The languages were chosen from the applications listed under “Interactive mode languages” (shell-like languages) in the article “List of programming languages by category” on Wikipedia.

Section 5.2 is concerned with testing for scalability. The parallel tools and benchmarks were chosen for their popularity in the computational science community. They were augmented with some computational packages that had already been configured and installed as tools used by our own working group.

5.1. Common Shell-Like Languages

These tests were conducted on a dual-socket, quad-core (8 total cores) Xeon E5320. This system was running 64-bit Debian “sid” GNU/Linux with kernel version 2.6.28.

To show breadth, we present checkpoint times, restart times, and checkpoint sizes on a wide variety of commonly used applications. These results are shown in Figure 3. These applications are: BC (1.06.94) – an arbitrary precision calculator language; Emacs (2.22) – a well known text editor; GHCi (6.8.2) – the Glasgow Haskell Compiler; Ghostscript (8.62) – PostScript and PDF language interpreter; GNUPlot (4.2) – an interactive plotting program; GST (3.0.3) – the GNU Smalltalk virtual machine; Lynx (2.8.7) – a command line web browser; Macaulay2 (2-1.1) – a system supporting research in algebraic geometry and commutative algebra; MATLAB (7.4.0) – a high-level language and interactive environment for technical computing; MZScheme (4.0.1) – the PLT Scheme implementation; OCaml (3.10.2) – the Objective Caml interactive shell; Octave (3.0.1) – a high-level interactive language for numerical computations; PERL (5.10.0) – Practical Extraction and Report Language interpreter; PHP (5.2.6) – an HTML-embedded scripting language; Python (2.5.2) – an interpreted, interactive, object-[...]

[...]

[13] Paul Hargrove and Jason Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics Conference Series, 46:494–499, September 2006.
[14] Thomas Herault, Pierre Lemarinier, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault-tolerant MPI. In Proceedings of International Symposium on High Performance Computing and Networking (SC2006), 2006.
[15] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) / 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems. IEEE Computer Society, March 2007.
[16] G. J. Janakiraman, J. R. Santos, D. Subhraveti, and Y. Turner. Application-transparent distributed checkpoint-restart on standard operating systems. In Dependable Systems and Networks (DSN-05), pages 260–269, 2005.
[17] Byoung-Jip Kim. Comparison of the existing checkpoint systems. Technical report, IBM Watson, October 2005.
[18] Oren Laadan and Jason Nieh. Transparent checkpoint-restart of multiple processes for commodity clusters. In 2007 USENIX Annual Technical Conference, pages 323–336, 2007.
[19] Oren Laadan, Dan Phung, and Jason Nieh. Transparent networked checkpoint-restart for commodity clusters. In 2005 IEEE International Conference on Cluster Computing. IEEE Press, 2005.
[20] Kai Li, Jeffrey F. Naughton, and James S. Plank. Real-time, concurrent checkpoint for parallel programs. In Proc. of Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 79–88, March 1990.
[21] Kai Li, Jeffrey F. Naughton, and James S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5:874–879, August 1994.
[22] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical report 1346, University of Wisconsin, Madison, Wisconsin, April 1997.
[23] Fernando Perez and Brian E. Granger. iPython: A system for interactive scientific computing. Computing in Science and Engineering, pages 21–29, May/June 2007. (See also http://ipython.scipy.org/moin/Parallel_Computing.)
[24] Eduardo Pinheiro. EPCKPT - a checkpoint utility for the Linux kernel. http://www.research.rutgers.edu/edpin/epckpt/.
[25] J. S. Plank, J. Xu, and R. H. B. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, August 1995.
[26] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. In Proc. of the USENIX Winter 1995 Technical Conference, pages 213–223, 1995.
[27] Michael Rieker, Jason Ansel, and Gene Cooperman. Transparent user-level checkpointing for the Native POSIX Thread Library for Linux. In Proc. of Parallel and Distributed Processing Techniques and Applications (PDPTA-06), pages 492–498, 2006.
[28] Eric Roman. A survey of checkpoint/restart implementations. Technical report, Lawrence Berkeley National Laboratory, November 2003.
[29] Joseph Ruscio, Michael Heffner, and Srinidhi Varadarajan. DejaVu: Transparent user-level checkpointing, migration, and recovery for distributed systems. In IEEE International Parallel and Distributed Processing Symposium, March 2007.
[30] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19:479–493, 2005.
[31] Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In ISCA '02: Proceedings of the 29th annual International Symposium on Computer Architecture, pages 123–134, Washington, DC, USA, 2002. IEEE Computer Society.
[32] Georg Stellner. Cocheck: Checkpointing and process migration for MPI. In IPPS '96: Proceedings of the 10th International Parallel Processing Symposium, pages 526–531, Washington, DC, USA, 1996. IEEE Computer Society.
[33] O. O. Sudakov, I. S. Meshcheriakov, and Y. V. Boyko. CHPOX: Transparent checkpointing system for Linux clusters. In IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, pages 159–164, 2007.
[34] Namyoon Woo, Soonho Choi, Hyungsoo Jung, Jungwhan Moon, Heon Y. Yeom, Taesoon Park, and Hyungwoo Park. MPICH-GF: Providing fault tolerance on Grid environments. The 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), the poster and research demo session, May 2003, Tokyo, Japan.
[35] Victor C. Zandy, Barton P. Miller, and Miron Livny. Process hijacking. In 8th IEEE International Symposium on High Performance Distributed Computing, pages 177–184, 1999.
[36] Youhui Zhang, Dongsheng Wong, and Weimin Zheng. User-level checkpoint and recovery for LAM/MPI. In ACM SIGOPS Operating Systems Review, volume 39, pages 72–81, 2005.
[37] Gengbin Zheng, Lixia Shi, and L. V. Kale. FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing (Fault-Tolerant Session), pages 93–103, 2004.
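
Illustrative sketch: draining and refilling a socket

The drain step of Section 4.3 (Step 4) and the re-send strategy recalled in Section 4.5 can be illustrated with a small single-process sketch. This is not DMTCP code. The token value, the length-prefixed framing used to return data, and the helper names (drain, send_back, retransmit, demo) are all assumptions made for illustration, since the text does not specify them. The sketch also assumes the token never occurs inside application data.

```python
import socket
import struct

# Hypothetical flush marker; the paper does not specify the token format.
TOKEN = b"\x00FLUSH-TOKEN\x00"


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return bytes(buf)


def drain(sock):
    """Receive until the flush token arrives; return the bytes before it."""
    buf = bytearray()
    while not buf.endswith(TOKEN):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before the token arrived")
        buf += chunk
    return bytes(buf[:-len(TOKEN)])


def send_back(sock, drained):
    """Receiver side of the refill: return drained bytes to the original sender.
    The 4-byte length prefix is an assumed framing, not taken from the paper."""
    sock.sendall(struct.pack("!I", len(drained)) + drained)


def retransmit(sock):
    """Sender side of the refill: read the returned bytes and send them again."""
    (size,) = struct.unpack("!I", recv_exact(sock, 4))
    sock.sendall(recv_exact(sock, size))


def demo():
    payload = b"in-flight payload"
    a, b = socket.socketpair()
    with a, b:
        a.sendall(payload)          # data still sitting in kernel buffers
        for end in (a, b):          # flush: each end sends the token
            end.sendall(TOKEN)
        saved_a, saved_b = drain(a), drain(b)
        # A checkpoint image containing saved_a / saved_b would be written here.
        send_back(a, saved_a)       # phase 1: receivers return drained data
        send_back(b, saved_b)
        retransmit(a)               # phase 2: original senders retransmit
        retransmit(b)
        return saved_b, recv_exact(b, len(payload))


if __name__ == "__main__":
    print(demo())
```

Both phases of the refill are run for every endpoint before any application data is read again. Interleaving them would let retransmitted payload arrive ahead of a length prefix on the same stream.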