mit.edu; Kapil Arya and Gene Cooperman, College of Computer and Information Science, Northeastern University, Boston, MA, {kapil,gene}@ccs.neu.edu

Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed
available today. DMTCP tries to support both traditional high-performance applications and typical desktop applications. With this in mind, DMTCP supports the critical feature of transparency: no re-compilation and no re-linking of user binaries. Because it supports a wide range of recent Linux kernels (2.6.9 through the current 2.6.28), it can be packaged as just one module in a larger application. The application binary needs no root privilege and need not be reconfigured for new kernels. This also painlessly adds a save/restore workspace capability to an application, or even to a problem-solving environment with many threads or processes.

Ultimately, the novelty of DMTCP rests on its particular combination of features: user-level, multithreaded, distributed processes connected with sockets, and fast checkpoint times with negligible overhead while not checkpointing. Those features are designed to support a broad range of use cases for checkpointing, which go beyond the traditional use cases of today.

1.1. Use Cases

In this section, we present some uses of checkpointing that go beyond the traditional checkpointing of long-running batch processes. Many of these additional uses are motivated by desktop applications.

1) save/restore workspace: Interactive languages frequently include their own save/restore workspace commands. DMTCP eliminates that need.
2) undump capability: Programs that would otherwise have long startup times often create a custom dump/undump facility. The software is then built, dumped after startup, and re-built to package a checkpoint along with an undump routine. One of the applications for which we are working with the developers, cmsRun, has exactly this problem: initialization takes 10 minutes to half an hour due to obtaining reasonably current data from a database, along with issues of linking approximately 400 dynamic libraries. This is unacceptable when many thousands of such runs are required.
3) a substitute for PRELINK: PRELINK is a Linux technology for prelinking an application, in order to save startup time when many large dynamic libraries are invoked. PRELINK must be maintained in sync with the changing Linux architecture.
4) debugging of distributed applications: All processes are checkpointed just before a bug and then restarted (possibly on a single host) for debugging.
5) checkpointed image as the ultimate bug report
6) applications with CPU-intensive front-end and interactive analysis of results at back-end: Run on a high-performance host or cluster, and restart all processes on a single laptop.
7) traditional checkpointing of long-running distributed applications that may run under some dialect of MPI, or under a custom sockets package (e.g., iPython, used in SciPy/NumPy for parallel numerical applications).
8) robustness: Upon detecting distributed deadlock or race, automatically revert to an earlier checkpoint image and restart in slower, safe mode, until beyond the danger point.

1.2. Outline

Section 2 covers related work. Section 3 describes DMTCP as seen by an end-user. Section 4 describes the software architecture. Section 5 presents experimental results. Finally, Section 6 presents the conclusions and future work.

2. Related Work

There is a long history of checkpointing packages (kernel- and user-level, coordinated and uncoordinated, single-threaded vs. multithreaded, etc.). Given the space limitations, we highlight only the most significant of the many other approaches.

DejaVu [29] (whose development overlapped that of DMTCP) also provides transparent user-level checkpointing of distributed processes based on sockets. However, DejaVu appears to be much slower than DMTCP. For example, in the Chombo benchmark, Ruscio et al. report executing ten checkpoints per hour with 45% overhead. In comparison, on a benchmark of similar scale DMTCP typically checkpoints in 2 seconds, with essentially zero overhead between checkpoints. Nevertheless, DejaVu is also able to checkpoint InfiniBand connections by using a customized version of MVAPICH. DejaVu takes a more invasive approach than DMTCP, by logging all communication and by using page protection to detect modification of memory pages between checkpoints. This accounts for additional overhead during normal program execution that is not present in DMTCP. Since DejaVu was not publicly available at the time of this writing, a direct timing comparison on a common benchmark was not possible.

The remaining work on distributed transparent checkpointing can be divided into two categories:

1) User-level MPI libraries for checkpointing [4], [5], [12], [14], [15], [32], [34], [36], [37]: works for distributed processes, but only if they communicate exclusively through MPI (Message Passing Interface). Typically restricted to a particular dialect of MPI.
2) Kernel-level (system-level) checkpointing [13], [16], [18], [19], [30], [31], [33]: modification of kernel; requirements on matching package version to kernel version.

These two layers are separate, with a small API between them. This two-layer user-level approach has a potential advantage in non-Linux operating systems, where DMTCP can be ported to run over other single-process checkpointing packages that may already exist.

Checkpointing is added to arbitrary applications by injecting a shared library at execution time. This library:

• Launches a checkpoint management thread in every user process, which coordinates checkpointing.
• Adds wrappers around a small number of libc functions in order to record information about open sockets at their creation time. System calls and the proc filesystem are also used to probe kernel state.

We use a coordinated checkpointing method, where all processes and threads cluster-wide are simultaneously suspended during checkpointing. Network data on the wire and in kernel buffers is flushed into the recipient process's memory and saved in its checkpoint image. After a checkpoint or restart, this network data is sent back to the original sender and retransmitted prior to resuming user threads. A more detailed account of our methodology can be found in Section 4.

The only global communication primitive used at checkpoint time is a barrier. At restart time, we additionally require a discovery service to discover the new addresses for processes migrated to new hosts.

4.2. Initialization of an application process under DMTCP

At startup of a new process, dmtcp_checkpoint injects dmtcphijack.so, the DMTCP library responsible for checkpointing, into the user program. Library injection is currently done using LD_PRELOAD. Library injection can also be done after program startup [35] and on other architectures [9]. Once injected into the user process, DMTCP loads mtcp.so, our single-process checkpointer, and calls the MTCP setup routines to enable integration with DMTCP. MTCP creates the checkpoint manager thread in this setup routine. DMTCP also opens a TCP/IP connection to the checkpoint coordinator at this time. This results in a copy of our libraries and manager residing within each checkpointed process.

DMTCP adds wrappers around a small number of libc functions. This is done by overriding libc symbols with our library. For efficiency reasons, we avoid wrapping any frequently invoked system calls such as read and write.

The wrappers are necessary since DMTCP must be aware of all forked child processes, of all attempts to create remote processes (for example via an exec to an ssh process), and of the parameters by which all sockets are created. In the case of sockets, DMTCP needs to know whether the sockets are TCP/IP sockets (and whether they are listener or non-listener sockets), UNIX domain sockets, or pseudo-terminals. DMTCP places wrappers around the following functions: socket, connect, bind, listen, accept, setsockopt, fexecve, execve, execv, execvp, fork, close, dup2, socketpair, openlog, syslog, closelog, ptsname and ptsname_r. The rest of this section describes the purposes for these wrappers.

4.3. Checkpointing under DMTCP

Checkpointing proceeds through seven stages and six global barriers. Global barriers could be implemented efficiently through peer-to-peer communication or broadcast trees, but are currently centralized for simplicity of implementation.

The following is the DMTCP distributed algorithm for checkpointing an entire cluster. It is executed asynchronously in each user process. The only communication primitive used is a cluster-wide barrier. The following steps are depicted graphically in Figure 1.

1) Normal execution: The checkpoint manager thread in each process waits until a new checkpoint is requested by the coordinator. This is done by waiting at a special barrier that is not released until checkpoint time.
2) Suspend user threads: MTCP suspends all user threads, then DMTCP saves the owner of each file descriptor. DMTCP then waits until all application processes reach Barrier 2: suspended, then releases the barrier.
3) Elect shared FD leaders: DMTCP executes an election of a leader for each potentially shared file descriptor. We trick the operating system into electing a leader for us by misusing the F_SETOWN flag of fcntl. All processes set the owner, and the last one wins the election. In Step 4, a process can test if it is the election leader for a socket fd by testing if fcntl(fd, F_GETOWN) == getpid(). The original value for F_SETOWN is restored after kernel buffers are refilled. DMTCP then waits until all application processes reach Barrier 3: election completed, then releases the barrier.
4) Drain kernel buffers and perform handshakes with peers: For each socket, the corresponding election leader flushes that socket by sending a special token. It then drains that socket by receiving until there is no more available data and the special token is seen. DMTCP then performs handshakes with all socket peers to discover the globally unique ID of the remote side of all sockets. The connection information table is then written to disk. DMTCP then waits until all application processes reach Barrier 4: drained, then releases the barrier.

[Figure 2: Steps for restarting the system checkpointed in Figure 1. The unified restart process and subsequent fork are required to recreate sockets and pipes shared between processes.]

7) Resume user threads: The program continues executing Step 7 of checkpointing.

Step 2 above bears further explanation. Recall that prior to checkpointing, whenever a new connection was accepted, wrappers around the system calls connect and accept had transferred information about the connector to the acceptor. This information includes a globally unique socket ID that remains constant even if processes are relocated.

At restart time, the acceptor for each socket advertises the address and port of its restart listener socket to the discovery service. When the connector receives this advertisement, it opens a new connection to the acceptor who sent the advertisement. The two sides then perform a handshake and agree on the socket being restored. Finally, dup2 is used on each side to move the socket descriptor to the correct location. This process continues asynchronously until all sockets are restored. Our methodology supports both sides of a socket migrating. It also supports loopback sockets.

4.5. Implementation Strategies

In the implementation, some less obvious issues arise in the support for pipes, shared memory (via mmap), and virtual pids.

Pipes present an issue because they are unidirectional. As seen in Sections 4.3 and 4.4, the strategy for checkpointing network data in a socket connection is for the receiver to drain the socket into user space, then write a checkpoint image, and finally re-send the network data through the same socket back to the sender. In order to support pipes, a wrapper around the pipe system call promotes pipes into sockets.

In the case of shared memory, if the backing file of a shared memory segment is missing and we have directory write permission, then we create a new backing file. Next, assuming the backing file is present and we have write access, we overwrite the shared memory segment with data from the checkpoint image. If two processes share this memory, they will both write to the same shared segment, but with the same data, since the segment was also shared at the time of checkpoint. If we do not have write access (for example, read-only access to certain system-wide data), then we map the memory segment from the current contents of the file, and not from the checkpoint image data.

In order to support virtual pids (process ids), one must worry about pid conflicts. The original pid When a process is first created through a call to fork, its pid also becomes its virtual pid, and that virtual pid is maintained throughout succeeding generations of restarts. Hence, a new process may have pid A. After checkpoint and restart, a second process may be created with the same pid A. Our wrapper around fork detects this situation, terminates the child with the conflicting virtual pid, and forks once again.

5. Experimental Results

DMTCP is currently implemented for GNU/Linux. The software has been verified to work on recent versions of Ubuntu, Debian, OpenSuse, and Red Hat Linux with Linux kernels ranging from version 2.6.9 through version 2.6.28. DMTCP runs on x86, x86_64 and mixed (32-bit processes in 64-bit Linux) architectures.

[Figure 3: Common shell-like languages and other applications. All are run on a single node with compression enabled. Panels: (a) Checkpoint/restart timings; (b) Checkpoint sizes.]

Experiments were run on two broad classes of programs: shell-like languages intended for a single computer (e.g., MATLAB, Perl, Python, Octave, etc.); and distributed programs across the nodes of a cluster (e.g., ParGeant4, iPython, MPICH2, OpenMPI, etc.). Reported checkpoint image sizes are after gzip compression (unless otherwise indicated), since DMTCP dynamically invokes gzip before saving, by default.

In Section 5.1, our goal was to demonstrate DMTCP on 20 common real-world applications. Shell-like languages were emphasized for their widespread usage, and for their tendency to invoke multiple processes and multiple threads in their implementation. The languages were chosen from the applications listed under Interactive mode languages (shell-like languages) in the article List of programming languages by category on Wikipedia.

Section 5.2 is concerned with testing for scalability. The parallel tools and benchmarks were chosen for their popularity in the computational science community. They were augmented with some computational packages that had already been configured and installed as tools used by our own working group.

5.1. Common Shell-Like Languages

These tests were conducted on a dual-socket, quad-core (8 total cores) Xeon E5320. This system was running 64-bit Debian sid GNU/Linux with kernel version 2.6.28.

To show breadth, we present checkpoint times, restart times, and checkpoint sizes on a wide variety of commonly used applications. These results are shown in Figure 3. These applications are: BC (1.06.94), an arbitrary precision calculator language; Emacs (2.22), a well known text editor; GHCi (6.8.2), the Glasgow Haskell Compiler; Ghostscript (8.62), PostScript and PDF language interpreter; GNUPlot (4.2), an interactive plotting program; GST (3.0.3), the GNU Smalltalk virtual machine; Lynx (2.8.7), a command line web browser; Macaulay2 (2-1.1), a system supporting research in algebraic geometry and commutative algebra; MATLAB (7.4.0), a high-level language and interactive environment for technical computing;
MZScheme (4.0.1), the PLT Scheme implementation; OCaml (3.10.2), the Objective Caml interactive shell; Octave (3.0.1), a high-level interactive language for numerical computations; PERL (5.10.0), Practical Extraction and Report Language interpreter; PHP (5.2.6), an HTML-embedded scripting language; Python (2.5.2), an interpreted, interactive, object-

[13] Paul Hargrove and Jason Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics Conference Series, 46:494-499, September 2006.
[14] Thomas Herault, Pierre Lemarinier, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault-tolerant MPI. In Proceedings of International Symposium on High Performance Computing and Networking (SC2006), 2006.
[15] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) / 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems. IEEE Computer Society, March 2007.
[16] G. J. Janakiraman, J. R. Santos, D. Subhraveti, and Y. Turner. Application-transparent distributed checkpoint-restart on standard operating systems. In Dependable Systems and Networks (DSN-05), pages 260-269, 2005.
[17] Byoung-Jip Kim. Comparison of the existing checkpoint systems. Technical report, IBM Watson, October 2005.
[18] Oren Laadan and Jason Nieh. Transparent checkpoint-restart of multiple processes for commodity clusters. In 2007 USENIX Annual Technical Conference, pages 323-336, 2007.
[19] Oren Laadan, Dan Phung, and Jason Nieh. Transparent networked checkpoint-restart for commodity clusters. In 2005 IEEE International Conference on Cluster Computing. IEEE Press, 2005.
[20] Kai Li, Jeffrey F. Naughton, and James S. Plank. Real-time, concurrent checkpoint for parallel programs. In Proc. of Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 79-88, March 1990.
[21] Kai Li, Jeffrey F. Naughton, and James S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5:874-879, August 1994.
[22] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical report 1346, University of Wisconsin, Madison, Wisconsin, April 1997.
[23] Fernando Perez and Brian E. Granger. iPython: A system for interactive scientific computing. Computing in Science and Engineering, pages 21-29, May/June 2007. (See also http://ipython.scipy.org/moin/Parallel_Computing.)
[24] Eduardo Pinheiro. EPCKPT: a checkpoint utility for the Linux kernel. http://www.research.rutgers.edu/edpin/epckpt/.
[25] J. S. Plank, J. Xu, and R. H. B. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, August 1995.
[26] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. In Proc. of the USENIX Winter 1995 Technical Conference, pages 213-323, 1995.
[27] Michael Rieker, Jason Ansel, and Gene Cooperman. Transparent user-level checkpointing for the Native POSIX Thread Library for Linux. In Proc. of Parallel and Distributed Processing Techniques and Applications (PDPTA-06), pages 492-498, 2006.
[28] Eric Roman. A survey of checkpoint/restart implementations. Technical report, Lawrence Berkeley National Laboratory, November 2003.
[29] Joseph Ruscio, Michael Heffner, and Srinidhi Varadarajan. DejaVu: Transparent user-level checkpointing, migration, and recovery for distributed systems. In IEEE International Parallel and Distributed Processing Symposium, March 2007.
[30] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19:479-493, 2005.
[31] Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In ISCA '02: Proceedings of the 29th annual International Symposium on Computer Architecture, pages 123-134, Washington, DC, USA, 2002. IEEE Computer Society.
[32] Georg Stellner. Cocheck: Checkpointing and process migration for MPI. In IPPS '96: Proceedings of the 10th International Parallel Processing Symposium, pages 526-531, Washington, DC, USA, 1996. IEEE Computer Society.
[33] O. O. Sudakov, I. S. Meshcheriakov, and Y. V. Boyko. CHPOX: Transparent checkpointing system for Linux clusters. In IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, pages 159-164, 2007.
[34] Namyoon Woo, Soonho Choi, Hyungsoo Jung, Jungwhan Moon, Heon Y. Yeom, Taesoon Park, and Hyungwoo Park. MPICH-GF: Providing fault tolerance on Grid environments. The 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2003), the poster and research demo session, May 2003, Tokyo, Japan.
[35] Victor C. Zandy, Barton P. Miller, and Miron Livny. Process hijacking. In 8th IEEE International Symposium on High Performance Distributed Computing, pages 177-184, 1999.
[36] Youhui Zhang, Dongsheng Wong, and Weimin Zheng. User-level checkpoint and recovery for LAM/MPI. In ACM SIGOPS Operating Systems Review, volume 39, pages 72-81, 2005.
[37] Gengbin Zheng, Lixia Shi, and L. V. Kale. FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing (Fault-Tolerant Session), pages 93-103, 2004.
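
Illustrative sketch for Section 4.3, Step 3 (leader election). The election relies on the fact that the owner set by F_SETOWN belongs to the shared open file description, so the last process to set it is the one that F_GETOWN reports. The following Python sketch is not DMTCP code; it only reproduces the idea on a socket shared across fork. The helper names (claim_leadership, is_leader, demo_election) and the pipe used for ordering are assumptions made for the illustration.

```python
import fcntl
import os
import socket


def claim_leadership(fd):
    """Overwrite the descriptor's owner with our pid; the last writer wins."""
    fcntl.fcntl(fd, fcntl.F_SETOWN, os.getpid())


def is_leader(fd):
    """True if this process was the last one to set the owner."""
    return fcntl.fcntl(fd, fcntl.F_GETOWN) == os.getpid()


def demo_election():
    """Share one socket between parent and child and let both claim it."""
    sock, peer = socket.socketpair()
    fd = sock.fileno()
    saved_owner = fcntl.fcntl(fd, fcntl.F_GETOWN)  # restored at the end
    ready_r, ready_w = os.pipe()  # hypothetical ordering aid for the demo

    child = os.fork()
    if child == 0:
        claim_leadership(fd)          # the child sets the owner first
        os.write(ready_w, b"x")
        os._exit(0)

    os.read(ready_r, 1)               # wait until the child has claimed
    leader_before = is_leader(fd)     # owner is currently the child
    claim_leadership(fd)              # the parent claims last
    leader_after = is_leader(fd)
    os.waitpid(child, 0)

    fcntl.fcntl(fd, fcntl.F_SETOWN, saved_owner)  # put the owner back
    os.close(ready_r)
    os.close(ready_w)
    sock.close()
    peer.close()
    return leader_before, leader_after
```

In this sketch the parent is not the leader while the child's claim stands, and becomes the leader once it sets the owner last. Restoring the saved owner afterwards mirrors the remark in Step 3 that the original value is restored once the kernel buffers are refilled.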
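
Illustrative sketch for Section 4.3, Step 4, and the refill described in Section 4 (drain and retransmit). The text describes flushing a socket with a special token, receiving until the token is seen, and later sending the drained data back to the original sender for retransmission. The Python sketch below shows one direction of that exchange inside a single process. The token value, the function names, and the way the length of the drained data reaches the sender are all assumptions for illustration; the source does not specify them.

```python
import socket

# Hypothetical marker; a real implementation must ensure that the token
# cannot be confused with application data.
TOKEN = b"\x00<hypothetical-drain-token>\x00"


def recv_exact(sock, n):
    """Receive exactly n bytes from sock."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return bytes(buf)


def flush(sock):
    """Sender side: push the token behind any data still in flight."""
    sock.sendall(TOKEN)


def drain(sock):
    """Receiver side: read until the token arrives; return the data before it."""
    buf = bytearray()
    while not buf.endswith(TOKEN):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before the token arrived")
        buf += chunk
    return bytes(buf[:-len(TOKEN)])


def refill(receiver, sender, drained):
    """After checkpoint or restart: return the data, then retransmit it."""
    receiver.sendall(drained)                          # back to the sender
    sender.sendall(recv_exact(sender, len(drained)))   # sender retransmits


def demo_drain():
    sender, receiver = socket.socketpair()
    sender.sendall(b"in-flight bytes")   # data still in kernel buffers
    flush(sender)
    drained = drain(receiver)            # would be saved in the image
    refill(receiver, sender, drained)
    redelivered = recv_exact(receiver, len(drained))
    sender.close()
    receiver.close()
    return drained, redelivered
```

Because user threads are suspended while draining, nothing can follow the token on the wire, which is why the sketch can stop reading as soon as the buffer ends with it. Sending the data back through the same socket is also why the text in Section 4.5 needs pipes to be promoted to bidirectional sockets.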
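
Illustrative sketch for the restart procedure (moving a restored socket with dup2). At restart, a freshly opened connection generally receives a different descriptor number than the one the application held at checkpoint time, and the text says dup2 is used to move it to the correct location. The Python sketch below shows only that last move. The names move_fd and demo_restore are hypothetical, and the free descriptor number is found with a probe that is safe only in a single-threaded demonstration.

```python
import os
import socket


def move_fd(current_fd, original_fd):
    """Make the open socket available at original_fd and drop the old number."""
    if current_fd != original_fd:
        os.dup2(current_fd, original_fd)
        os.close(current_fd)
    return original_fd


def demo_restore():
    restored, peer = socket.socketpair()   # stands in for a reconnected socket

    # Stand-in for the descriptor number recorded in the checkpoint image:
    # duplicate and close to learn a number that is currently unused.
    probe = os.dup(restored.fileno())
    os.close(probe)

    fd = move_fd(restored.detach(), probe)
    sock = socket.socket(fileno=fd)        # rewrap the moved descriptor
    peer.sendall(b"ping")
    data = sock.recv(4)
    sock.close()
    peer.close()
    return fd == probe, data
```

The connection itself is untouched by the move; only the number under which the process sees it changes, so user code that stored the old descriptor value keeps working after restart.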