/
USENIX Association 8th USENIX Symposium on Operating Systems Design an USENIX Association 8th USENIX Symposium on Operating Systems Design an

USENIX Association 8th USENIX Symposium on Operating Systems Design an - PDF document

cheryl-pisano
cheryl-pisano . @cheryl-pisano
Follow
487 views
Uploaded On 2015-10-10

USENIX Association 8th USENIX Symposium on Operating Systems Design an - PPT Presentation

R2AnApplicationLevelKernelforRecordandReplayZhenyuGuoXiWangJianTangXuezhengLiuZhileiXuMingWuMFransKaashoekZhengZhangMicrosoftResearchAsiaTsinghuaUniversityMITCSAILBSTRACTLibrarybasedrecordandrepla ID: 156273

R2:AnApplication-LevelKernelforRecordandReplayZhenyuGuoXiWangJianTangXuezhengLiuZhileiXuMingWuM.FransKaashoekZhengZhangMicrosoftResearchAsiaTsinghuaUniversityMITCSAILBSTRACTLibrary-basedrecordandrepla

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "USENIX Association 8th USENIX Symposium ..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation193 R2:AnApplication-LevelKernelforRecordandReplayZhenyuGuoXiWangJianTangXuezhengLiuZhileiXuMingWuM.FransKaashoekZhengZhangMicrosoftResearchAsiaTsinghuaUniversityMITCSAILBSTRACTLibrary-basedrecordandreplaytoolsaimtoreproduceanapplication’sexecutionbyrecordingtheresultsofse- 1948th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association Annotation Scope Description Section in parameter input(read-only)parameter out parameter output(mutable)parameter val parameter modiedsizeofanarraybuffer(valcanbeanyexpression) xpointerkind parameter addressallocatedinternally(kindcanbenullthread,orprocess preparekeybuf function prepareasynchronousdatatransfer commitkey function commitasynchronousdatatransfer 3 callback parameter callbackfunctionpointer(upcall) synckey function causalityamongsyscallsandupcalls(keycanbeanyexpression) 4 cache function cacheforreducinglogsize reproduce function reproduceI/Oforreducinglogsize 6 Table1:Annotationkeywords(fordatatransfer,executionorder,andoptimization).example,toensurefaithfulreplaythedevelopermustar-rangetorecordtheresultsofcorrectlybutitsresultsvaryfordifferentparameters.ForsuchcasesR2makesiteasyforadevelopertochooseaninter-faceconsistingofhigher-levelfunctionsthatcausethesameinteractionswiththeenvironment,butareeasiertorecordandreplay.Forexample,thedevelopermaychoose,whichcalls’sef-fectsareeasiertorecordandreplay.Thesecondreasonforallowingdeveloperstochoosetheinterfaceisthattheycanchooseaninterfacethatre-sultsinlowrecordingoverheadfortheirapplications.Lowoverheadisimportantbecausethedeveloperscanthenruntheirapplicationsinrecordingmodeevenduringdeployment,whichmayhelpindebuggingproblemsthatshowuprarely.Toreduceoverhead,adevelopermightchoosetorecordandreplaytheinteractionsatahigh-levelinterface(e.g.,MPIandSQLlibraryinterfacesuchasSQLite)becauselessinformationmustberecorded.Inaddition,thesehigher-levelinterfacesmaybeeasiertobereplayfaithful.Tolowertheimplementationeffortforintercepting,recording,andreplayingachoseninterface,R2gener-atesstubsforthecallsinthechoseninterfaceandar-rangesthatthesestubsarecalledwhentheapplicationinvokesthecalls.Thestubsperformtherecordingandthereplayofthecalls.Toensurethatthesestubsbehaveinwaythatisreplayfaithful,thedevelopermustanno-tatetheinterfacewithsimpleannotations(seeTable1)thatspecify,forexample,howdataistransferredacrosstheinterposedinterfaceforcallsthatchangememoryinadditiontohavingareturnvalue.ToreducetheeffortofannotatingR2reusesexistingannotationsfromSAL[13]forWindowsAPI.Inspiredbythekernel/userdivisioninoperatingsystems,R2usesamodebit,whichstubssaveandrestore,totrackifacallisonbehalfoftheapplica-tionandshouldberecorded.WehaveimplementedR2onWindows,andusedittorecordandreplayatthreeinterfaces(Win32,MPI,andSQLiteAPI).Ithassuccessfullyreplayedvarioussystemapplications(seeSection8),includingapplicationsthatcannotbereplayedwithpreviouslibrary-basedtools.R2hasalsoreplayedandhelpedtodebugtwodistributedsystems,andhasbeenusedasabuildingblockinothertools[20,31,22].Themaincontributionsofthepaperare:rst,arecordandreplaytoolthatallowsdeveloperstodecidewhichin-terfacetorecordandreplay;second,asetofannotationsthatallowsstrictseparationoftheapplicationabovetheinterposedinterfaceandtheimplementationbelowtheinterface,andthatreducesthemanualworkthatadevel-opermustdo;third,animplementationofarecordandreplaylibraryforWindows,whichiscapableofreplay-ingchallengingsystemapplicationswithlowrecordingoverhead.Therestofthepaperisorganizedasfollows.Sec-tion2givesanoverviewofthedesign.Section3and4describetheannotationsfordatatransfersandexecutionorders,respectively.Section5discusshowtorecordandreplaytheMPIandSQLiteinterfaces.Section6and7describeannotationsforoptimizationsandimplementa-tiondetails,respectively.WeevaluateR2inSection8,discussrelatedworkinSection9,andconcludeinSec-tion10.VERVIEWAgoalofR2istoreplayapplicationsfaithfully.Todosothecallstointerceptmustbecarefullychosenandstubsmusthandleseveralchallenges.Thissectionstartswithanexampletoillustratethechallenges,andthende-scribeshowR2addressesthem.2.1AnExampleandChallengesFaithfulreplayisparticularlychallengingforsystemapplications,whichinteractwiththeoperatingsys-temincomplicatedways.ConsiderFigure1,atyp-icalnetworkprogramonWindows:athreadbindsasockettoanI/Oport( USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation195 structiocb{OVERLAPPEDov;intmain(){HANDLEhPort=...;for(...)CreateThread(...,WorkerThread,hPort,...);1011SOCKETs=socket(...);12CreateIoCompletionPort(s,hPort,...);13structiocbcb=(structiocb14�cb-buf=malloc(BUFSIZ);15�cb-user_data=...;16BOOLfSucc=ReadFileEx(s,�cb-buf,BUFSIZ,17)&cb,0);18192021DWORDWINAPIWorkerThread(HANDLEhPort){22for(;;){23structiocb24DWORDsize;25GetQueuedCompletionStatus(hPort,&size,...,26)&cb,...);27buf=�cb-buf;28user_data=�cb-user_data;293031return0;32Figure1:Atypicalnetworkprogramusingasyn-chronousI/OandcompletionportonWindows.Thepat-ternisalsowidelyavailableonotherplatforms,suchasLinuxaio( etc.),Solariseventcomple-tion( etc.),andFreeBSDkqueue.line12),enqueuesanasynchronousI/Orequest,line16),andaworkerthreadwaitsontheI/OportforthecompletionoftheI/Orequest,line25).SimilarinterfacesareprovidedonotheroperatingsystemssuchasLinux(),Solaris(),andBSD(andareusedbypopularsoftwaresuchasthelighttpdwebserverthatpowersYouTubeandWikipedia.Therstchallengeadevelopermustaddressiswhatcallsarepartoftheinterfacethatwillberecordedandreplayed.Forexample,inFigure1,adevelopermightchoosebutnot.However,sinceduringreplaythecalltoisnotexecuted,thereturnedsocketdescriptorissimplyreadfromtherecordedlogratherthancreated.Sothechoicemaycrashduringreplayandfailtheapplication;thedevelopershouldchoosebothfunctions,oralowerlayerthatuses.Section2.2formulatesanum-berofrulesthatcanguidethedeveloper.R2generatesstubsforthefunctionsthatthedevel-operchoosestorecordandreplay,andarrangesthatin-vocationstothesefunctionswillbedirectedtothecor-respondingstubs.Toavoidreimplementingormodifyingtheimplementationoftheinterposedinterface,R2’sgoalisforthestubstocalltheoriginalinterceptedfunctionsandtorecordtheirresults.ThisapproachalsoallowsR2torecordandreplayfunctionsforwhichonlythebinaryversionsareavailable.Toachievethisimplementationgoalandtoen-surefaithfulreplay,thestubsmustaddressanum-berofimplementationchallenges.ConsiderthecaseinwhichthedeveloperselectsthefunctionsfromtheWin-dowsAPI(e.g.,,etc.)astheinterfacetobeinterposed.Duringarecordrun,thestubsmustrecordinalogthesocketdescriptorandthecompletionportasinte-gers,theoutputof(e.g.,thevalueofatline26andthecontentofatline27),alongwithothernecessaryinfor-mation,suchasthetimestampwhentheoperatingsystemstartsasanupcall(callback)viaanewthread.Duringareplayrun,thestubswillnotinvoketheinterceptedfunctionssuchasor,butinsteadwillreadtheresultssuchasdescriptors,thevalueofandthecontentforfromthelog.Thestubsmustalsocausethememorysideeffectstohappen(e.g.,copyingcontentinto).Finally,thereplayrunmustalsodeliverupcalls(e.g.,)attherecordedtimestamps.Theserequirementsraisethefollowingchallenges:Useofinterceptedfunctionsbytheimplementationoftheinterposedinterface.Forexample,theimple-mentationitselfmayinvokethefunctionandthoseinvocationsshouldnotberecorded.Functionsthathavesideeffects.Forexample,torecordandreplay,thestubsmustrecordthecontentofandllitduringreplay.Thestubformustknowthatthesecondargumenthassideeffects.Addressesreturnedbymustbeidenticalduringrecordingandreplay.ThecodeinFigure1requiresthatthevaluethatreceivesatline13notchangefromrecordtoreplay:thereplayrunreadsthevalueoffromthelogatline26andthatshouldbeequaltothevaluereturnedbyatline13;adifferentvalueformayleadtoacrashinfurtheruses(line27and28).Threadscreatedbytheimplementationofthein-terposedinterface.Theoperatingsystem,forexam-ple,mightcreatethreadstodelivereventstotheap-plication,ormightcreate“anonymous”threadstoperformhouseholdtasks.Theformershouldbere-createdduringreplay,butthelatternot. 1968th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association Executionorder.Dependenciesduringrecordingmustbepreservedduringreplay.Forexample,’sstartofanasynchronousI/OmusthappenbeforethecompletionofthatI/Oevent.2.2ChoosinganInterfaceAsastartingpointforchoosinganinterface,thedevel-opermustchoosefunctionsthatformacutinthecallgraph.ConsiderthecallgraphinFigure2.Thefunctioncallstwofunctionsand,andthosetwobothcallathirdfunction,whichinteractswiththeapplica-tion’senvironment.Thedevelopercannotchoosetohaveonly(oronly)betheinterposedinterface.Inthecaseofchoosing,theeffectsofinteractionsbywillberecordedonlywhencalledbybutnotbyDuringreplay,interactionscausedbywillbereadfromthelogbutcallstofromwillbere-executed;andwillseedifferentinteractionsduringreplay.Forfaithfulreplay,theinterposedinterfacemustformacutinthecallgraph,thustheinterfacecanbe(cut4)orbothand(cut1).Cuts3and4arealsone,butrequireR2totrackifwascalledfromitssideofthecutorfromtheotherside.Inthecaseofcut3,R2willnotrecordinvocationsofbybecauseinvocationswillberecordedandreplayed.R2supportsallfourcuts.Whentheinterposedinterfaceformsacutinthecallgraph,everyfunctioniseitherabovetheinterfaceorbe-lowtheinterface.Forexample,ifthedeveloperchoosesandastheinterposedinterface,thenaboveandisbelowtheinterface.Functionsabovetheinterposedinterfacewillbeexecutedduringreplay,whilefunctionsbelowtheinterposedinterfacewillnotbeexecutedduringreplay.Toensurefaithfulreplay,thecutmustadditionallysatisfytworules.1(ISOLATIONAllinstancesofunrecordedreadsandwritestoavariableshouldbeeitherbebeloworabovetheinterposedinterface.Followingtheisolationrulewilleliminateanysharedstatebetweencodeaboveandbelowtheinterface.Avari-ablebelowtheinterfacewillbeunobservabletofunc-tionsabovetheinterface;itisoutsideofthedebuggingscopeofadeveloper.Avariableabovetheinterfacewillbefaithfullyreplayed,executingalloperationsonit.Vi-olatingtheisolationrulewillresultinunfaithfulreplaybecausechangestoavariablemadebyfunctionsbelowtheinterfacewillnothappenduringthereplay.FortheWindowsAPI,theisolationruletypicallyholds.Forexample,alltheoperationsonaledescrip-torareperformedthroughfunctionsratherthandirectmemoryaccess.DuringrecordingR2recordsthelede-scriptorasaninteger.Duringreplay,R2retrievesthein- Figure2:Fourcutsinacallgraphforrecordandreplay.Thefunctioninteractswiththeenvironment.tegerfromthelogandreturnsittofunctionsabovetheinterposedinterface,withoutinvokingtheoperatingsys-temtoallocatedescriptors.AslongasR2interceptsthecompletesetoflefunctions,therecordedledescrip-torworkscorrectlywiththereplayedapplicationanden-suresreplayfaithfulness.2(NONDETERMINISMAnysourceofnondeter-minismshouldbebelowtheinterposedinterface.Ifanynondeterminismisbelowtheinterposedinter-face,theimpacttofunctionsabovetheinterfacewillbecapturedandreturnedtothem.Violatingthisrulewillresultinunfaithfulreplay,becausethebehaviorduringreplaywillbedifferentfromduringrecording.Thesourcesofnondeterminismareasfollows.1.Callsthatreceiveinputdatafromtheexternal(e.g.,environmentvariables,les,andnetwork).2.Interprocesscommunicationsthroughsharedmem-ory(e.g.,inWindowscommuni-cateswiththeCSRSSsystemserviceforstandardinput/outputthroughashared-memorysegment).3.Interactionsbetweenthreadsthroughsharedvari-ables(e.g.,spinlocks).R2canhandletherstsourceeasilyifthedeveloperfollowstheisolationrulebecauseinputdatafromtheex-ternalisreceivedthroughfunctions.FortheWindowsAPIthedevelopermustmarkthesefunctionsbeingpartoftheinterposedinterface,whicheliminatesthenonde-terminism.Forthesecondsource,R2canre-executeduringre-playifthereplayedapplicationonlyreadsfromsharedmemory.Formoregeneralcasesthedevelopermustmarkthehigher-levelfunctionthatenclosesthenonde-terminismofsharedmemoryaccessesasbeingpartoftheinterposedinterface(e.g.,Thethirdsourceofnondeterminismstemsfromvari-ablesthataresharedbetweenthreadsviadirectmem-oryaccessinstructionsratherthanfunctions.Asimilar USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation197 8771r1sonn 8771r1sonn 8771r1sonn 8772msonn i175t(dt3 ohh30e4 0fd) n06(4(0t) dthdrnefso h10instf 0ie)dfh10hfe()n0ie)df frof(0ie)df msfr0ie)df Figure3:R2overview.Theuserspaceissplitintotwospaces:R2runtimethatinterceptssyscallsandunderly-inglibrariesarerunninginR2systemspace;theapplica-tionexecutesinR2replayspace.caseistheinstructiononthex86architecturethatreadstheCPUtimestampcounter.Oftenthesein-structionsareenclosedbyhigher-levelinterfacefunc-tions(e.g.,lockandunlockofspinlocks).Developersmustannotatethemasbeingpartoftheinterposedin-terface.Inpractice,libraryAPIsaregoodcandidatesfortheinterposedinterface.First,libraryfunctionsusuallyhavariablessharedbetweeninternallibraryroutinesanditisdifculttoselectonlyasubsetofthemastheinter-posedinterface.Second,mostsourcesofnondetermin-ismarecontainedwithinsoftwarelibraries(e.g.,spinlocvariablesinalocklibrary),andtheyaffectexternalstateonlyvialibraryinterfaces.2.3IsolationR2mustaddresstheimplementationchallengeslistedinSection2.1forthestubsforthefunctionsthatarepartoftheappropriately-choseninterface.Astartingpointtohandlethesechallengesistoseparatetheapplicationthatisbeingrecordedandreplayedfromthecodebelowtheinterface.Inspiredbyisolationbetweenkernelspaceanduserspaceinoperatingsystems,R2denestwospaces(seeFigure3):replayspaceandsystemspace.Unlikeoper-atingsystems,however,thedevelopercandecidewhichinterfaceistheboundarybetweenreplayandsystemspace.Likeinoperatingsystems,werefertothefunc-tionsintheinterposedinterfaceassyscalls(unlessex-plicitlyspecied,allsyscallsmentionedbelowareR2syscallsinsteadofOSsyscalls).Syscallsmayregistercallbackfunctions,whichwecallupcalls,thatareissuedlaterintoreplayspacebysystemspaceruntime.Withtheseterminologies,wecandescribethespacesasfol-lows:Replayspace.Allthecodeanddatathatisabovethechosensyscallinterface.Systemspace.TheR2libraryandtheunderlyinglibraries,aswellasanyapplicationcodeanddatathatisbelowthechosensyscallinterface.R2recordstheoutputofsyscallsinvokedfromappli-cationspace,theinputofupcallsinvokedfromsystemspace,andtheirordering.Itfaithfullyreplaysthemdur-ingthereplay.R2doesnotrecordandreplayeventsinsystemspace.ConsiderthecodeinFigure1again.Thedevelopermayhavechosenandassyscalls.Theexecutionofandisinreplayspaceandtheexecutionofthesyscallsandtheunderlyinglibrariesisinsystemspace.Torecordandreplaysyscallsandupcalls,R2gener-atesstubsfromtheirfunctionprototypes.R2usesDe-tours[15]tointerceptsyscallsandupcalls,anddetourtheirexecutiontothegeneratedstub.Forsyscallsthattakeafunctionasanargument,R2dynamicallygener-atesastubforthefunctionandpassesontheaddressoftheupcallstubtothesystemlayer.Thiswaywhenlaterthesystemlayerinvokestheupcall,itwillinvoketheup-callstub.Forsyscallsthatreturndatathroughapointerargu-ment,R2mustrecordthedatathatisreturnedduringrecordingandcopythatdataintoapplicationspacedur-ingreplay.Todosocorrectly,thedevelopermustanno-tatepointerargumentssothatR2knowswhatdatashouldberecordedandhowthestubsmusttransferdataacrossthesyscallinterface.Section3describesthoseannota-tions.R2maintainsareplay/systemmodebitforeachthreadtoindicatewhetherthecurrentthreadisexecutinginreplayorsystemspace(analogtouser/kernelmodebit).Whentheapplicationinreplayspaceinvokesasyscall,thesyscallstubsetsthereplay/systemmodebittosystemspacemode,invokesthesyscall,recordsitsre-sults,andrestoresthemodebit.Similarly,anupcallstubrecordsthearguments,setsthemodebittoreplayspacemode,invokestheupcall,andrestoresthemodebitaftertheupcallreturns.ThisbitallowsR2tohandleacallfromsystemspacetoafunctionthatisasyscall;ifasyscalliscalledfromsystemspacethenitmustbeexecutedwithoutrecord-inganything(e.g.,acalltofromsystemspace).Similarly,ifasyscalliscalledfromsystemspaceandithasafunctionargument,thenR2willnotgenerateanupcallstubforthatargument.ItalsoallowsR2toapplydifferentpoliciestodifferentspaces(e.g.,allocatememoryinaseparatepoolforcodeinreplayspace).Forfunctionsthathavestatethatstraddlesthebound-arybetweensystemandreplayspace(e.g.,libc),thedevelopermaybeabletoadjusttheinterface 1988th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association toavoidsuchstate(seeSection2.2)ormaybeabletoduplicatethestatebylinkingastaticlibrary(e.g.,libc)ineachspace.2.4ExecutionControlTheseparationinreplayspaceandsystemspaceallowsR2tohandleanonymousthreadsandthreadsthat,forex-ample,theoperatingsystemcreatestodelivereventstotheapplication.R2startsasfollows.WhenauserinvokesR2withtheapplicationtoberecorded,R2’sinitialthreadstartsinsystemspace.Itloadstheapplication’sexecutableandtreatsthemainentryasanupcall(i.e.,thefunc-tionisturnedintoanupcallbygeneratinganupcallstub).Thestubsetsthereplay/systemmodebitofthecurrentthreadtoreplayspacemode,andinvokes.R2as-signsthethreadadeterministictag,whichthestubswillalsorecord.Bythismeans,R2putsthefunctionsinthecallgraphstartingfromtillthesyscallinterfaceintoreplayspace.Anonymousthreadsthatdonotinteractwiththeap-plicationwillbeexcludedfromreplay.Thesethreadswillnotcallsyscallsandupcallsandthusdonotgeneratelogentriesduringrecordingandarenotreplayed.However,ifathreadstartedbytheoperatingsystemperformsanupcall(e.g.,totriggeranapplicationregisteredCtrl-ChandleronWindows),thentheupcallstubwillsetthemodebit;thethreadwillenterreplayspace,anditsac-tionswillberecorded.Duringreplay,R2willreplaythisupcall,butthethreadfortheupcallmaynotexistduringreplay.R2solvesthisproblembycreatingthreadsondemand.Be-foreinvokinganupcall,R2willrstlookupifthecurrentthreadistheonethatrantheupcallduringrecording(bycomparingthedeterministictagassignedbyR2).Ifnot,R2willcreatethethread.Forfaithfulreplay,R2mustreplayallsyscallsandupcallsinthesameorderasduringarecordrun.Inmul-tithreadedprograms(andsingle-threadedprogramswithasynchronousI/O)theremaybedependenciesbetweensyscallsandupcalls.Section4introducesafewannota-tionsthatallowR2topreserveacorrectorder.2.5MemoryManagementIfadeveloperchoosesandassyscalls,R2mustensurethataddressesreturnedbydur-ingrecordingarealsoreturnedduringreplay.Ifthead-dressesreturnedbyduringreplayaredifferent,thenduringreplaythebugthatthedeveloperistrackingdownmightnotshowup(e.g.,invalidvalueinwillnotbereproducedifadifferentisreturnedinFigure1becausetheprogramcrashesbeforeitreachesthebuggystate).But,duringreplay,functionsinsystemspacethatcalledduringrecordingwillnotbecalledduringreplay,andsoduringreplayislikelytoreturnadifferentvalue.Toensurefaithfulreplayapplicationmusthaveanidenticaltraceofinvocationstoensurethataddressesduringrecordingandreplayarethesame.R2usesseparatememoryallocatorsforreplayandsys-temspace.Acalltoallocatesmemoryfromadedicatedpoolifitiscalledinreplayspace(i.e.,themodebitofthecurrentthreadisofreplayspacemode),whileitdelegatesthecalltotheoriginallibcimplemen-tationifitiscalledinsystemspace.Memoryaddressesreturnedinsystemspacemaychangeduetoinherentdifferencesbetweenrecordandreplay,butthoseaddressesarenotobservableinreplayspacesotheywillnotimposeanyproblemsduringre-play.Achallengeisaddressesallocatedinsystemspacebutreturnedtoreplayspace.Forexample,asyscallgetcwd(NULL,0)togetworkingdirectorypath-namemaycallinternallytoallocatememoryinsystemspaceandreturnitsaddresstoreplayspace.ToensurereplayfaithfulnessR2allocatesashadowcopyinthededicatedpoolforreplayspaceandreturnsittotheapplicationinstead.R2usestheannotationxpointerde-scribedinSection3toannotatesuchfunctions.SimilartoJockey[28],threadsthatmayexecuteinreplayspacehaveanextrastackallocatedfromthere-play’smemorypool,andR2switchesthetwostacksonanupcallorsyscall.Thisensuresthatthememoryad-dressesoflocalvariablesarethesameduringrecordingandreplay.Likeallotherlibrary-basedreplaytools,R2doesnotprotectagainstastraypointerintheapplicationwithwhichtheapplicationaccidentallyoverwritesmemoryinsystemspace.Suchpointersareusuallyexposedandxedearlyinthedevelopmentcycle.Resourcesotherthanmemory(e.g.,les,sockets)donotposethesamechallengesasmemory,aslongasthedeveloperhaschosentheinterposedinterfacewell.R2doesnothavetoallocatetheseresourcesduringreplay,becausetheexecutioninreplayspacewilltouchtheseresourcesonlyviasyscalls,whichR2records.Memoryincontrastischangedbymachineinstructions,whichR2cannotrecord.2.6Stubs,SlotsandCodeTemplatesR2generatesstubsfromcodetemplates.Wehavede-velopedcodetemplatesforrecording,replay,etc.,butde-veloperscanaddcodetemplatesforotheroperationsthattheywouldlikestubstoperform.Toallowforthisexten-sibility,astubismadeofanumberofslots,witheachslotcontainingafunctionthatperformsaspecicoper-ation.Forexample,thereisaslotforrecording,oneforreplay,etc. USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation199 intrecv([in]SOCKETs,[out,bsize(return)]char[in]intlen,[in]intflags);(a)annotatedsyscall/upcallprototype&#x?=$f;&#x-000;&#x?=$f;&#x-000;BEGIN_SLOT(record_&#x?=$f;&#x-000;&#x?=$f;&#x-000;)logger&#x?=$f;&#x-000;&#x?=$f;&#x-000;current_thr_tag;(())&#x?ifi;&#xs_sy;&#xscal;&#xl$f-;妀{?logger&#x?}?0;return_val;=is_syscall($f)?&#x?$di;&#xrect;&#xion-;妐’out’:’in’;?&#x?for;ຬh;&#x$f-0;(as$p){if&#x?for;ຬh;&#x$f-0;($p-has($direction)){if&#x?for;ຬh;&#x$f-0;($p-has(’bsize’))&#x?for;ຬh;&#x$f-0;{?101112else&#x?}-6;�{?13logger&#x?=$f;&#x-000;&#x?=$f;&#x-000;1415(b)recordslotfunctiontemplateBEGIN_SLOT(record_recv,recv)loggerrecv_signaturecurrent_thr_tag;loggerreturn_val;logger.write(buffer,return_val);(c)generatedrecordslotfunctionFigure4:Templates(inPHP[2])andSlots.R2usesrecord(andotherslikereplay)codetemplates(e.g.,(b))togeneratecorrespondingslotfunctions(e.g.,(c)).Figure4providesanoverviewofhowarecordslotfunctionisgeneratedforthesyscall.DevelopersannotatetheprototypeofwithkeywordsfromTa-ble1;forthisstepwillresultintheprototypeinFigure4(a).(OnWindowsthedeveloperdoesnothavetodoanyannotationfor,becauseR2reusestheSALannotations.)R2usesarecordtemplate(seeFig-ure4(b))toprocesstheannotatedprototypeandproducestherecordslotfunction(seeFigure4(c)).ATAR2providesasetofkeywordstodenethedatatrans-feratsyscallandupcallsboundaries.ThesekeywordshelpR2isolatethereplayandsystemspace.Thissec-tionpresentsthedatatransferannotationsanddiscusseshowR2usesthemtoensurereplayfaithfulness.3.1AnnotationsTheannotationsfordatatransfersfallintothefollowingthreecategories.Directionannotationsdenethesourceanddestina-tionofadatatransfer.InFigure4,keywordonandindicatesthattheyareread-onlyandtransferdataintofunction,whileoutonindicatesthatllsthememoryregionatandtransfers[prepare(lpOverlapped,lpBuffer)]ReadFileEx([in]HANDLEhFile,[out]LPVOIDlpBuffer,[in]DWORDnNumberOfBytesToRead,[in]LPOVERLAPPEDlpOverlapped,[in,callback]LPOVERLAPPED_COMPLETION_ROUTINEcompletionCb);1011typedefvoid12[commit(lpOverlapped,cbTransferred)]13FileIOCompletionRoutine)(14[in]DWORDdwErrorCode,15[in]DWORDcbTransferred,16[in]LPOVERLAPPEDlpOverlapped);171819[commit(lpOverlapped,cbTransferred)]2021[in]HANDLEhFile,22[in]LPOVERLAPPEDlpOverlapped,23[out]LPDWORDcbTransferred,24[in]BOOLbWait);Figure5:Asynchronyannotations:prepareindicatesthatissuesanasynchronousI/Orequestkeyedby,thecompletionofwhichisnotiedaseitherorcommitindicatesthere-questkeyedbyiscompletedandthetransferreddatasizeisdataoutofthefunction.ThereturnvalueofafunctionisimplicitlyannotatedasoutBufferannotationsdenehowR2shouldserializeanddeserializedatabeingtransferredforrecordandre-play.ForinacontiguousmemoryregioninFig-ure4,whichisfrequentlyseeninsystemscode,key-wordspeciesthesize,(e.g.,return),sothatR2canthenautomaticallyserializeanddeserial-izethebuffer.Forotherirregulardatastructuressuchaslinkedlists(e.g.,structhostent,thereturntypeof),R2requiresdeveloperstoprovidecustomizedserializationanddeserializationviaoperatooverloadingonstreams,whichisacommonC++idiom.Asynchronyannotationsdeneasynchronousdatatransfersthatnishintwocalls,ratherthaninone.Forexample,inFigure5issuesanasynchronousI/Orequestkeyedbywhichwecallarequestkey.Developersusekeywordpreparetoannotatethesyscallwiththerequestkeyandtheassociatedbuffer.Thecompletionoftherequestwillbenotiedviaeitheranupcallto(line13)orasyscall(line20),whentheas-sociatedbufferhasbeenlledinsystemspace.Inei-thercase,developersusekeywordcommittoanno-tateitwiththerequestkeyandtransferredbuffersize.R2canthenmatchitssizeviatherequestkeyforrecord 2008th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association andreplay.AsmentionedinSection2.5,somesyscallsallocateabufferinsystemspaceandtheapplicationmayusethebufferinreplayspace.R2providesthekeywordxpointertoannotatethisbuffer,andwillallocateashadowbufferinreplayspacefortheapplication,atbothrecordandreplaytime.Dataarecopiedtotheshadowbufferfromtherealbufferinsystemspaceduringrecording,andfromlogsduringreplay.Whiledatacopymayaddsomeoverheadduringrecording,thiskindofsyscallsisinfre-quentlyusedinpractice.Mostsuchsyscallsallocatenewbufferslocallyandusuallyhavepaired“free”syscalls(e.g.,andand).Someotherswithoutpaired“free”functionsmayreturnthread-specicorglobaldata,suchasand.Theyshouldbeannotatedwithxpointerthreadandxpointerprocess,respectively.3.2CodeGenerationFigure4illustratestherecordslotcodetemplateandthenalrecordslotfunctionfor.Therecordslotfunc-tionslogallthedatatransmittedfromR2systemspacetoR2replayspace.Theslottemplate(Figure4(b))gen-eratescodeforrecordingthereturnvalueonlywhenpro-cessingR2syscalls(line4).Whenscanningtheparame-ters,itwillrecordthedatatransferaccordingtotheeventtypeandannotateddirectionkeywords(line6and8).Specicallyforupcalls,theinputparametersandupcallfunctionpointersarerecordedsothatR2canexecutethesamecallbackwiththesameparameters(includingmemorypointers)duringreplay.Forprototypesannotatedwithprepare,therecordslottemplatewillskiprecordingthebuffer.Instead,R2usesanotherslottemplatetogeneratetwoextrarecordandreplayslotsforeachprototype.Oneisforrecordingthebuffer(includingthebufferpointers),andtheotherisforreplayingthebuffer,whichreadstherecordedbufferpointersandllsthemwiththerecordeddata.Thesetwoslotswillbepluggedintostubsforprototypeslabeledcommitattherecordandreplayphases,respec-tively.ThisapproachensuresthatthememoryaddressesduringreplayareidenticaltotheonesreturnedduringrecordingforasynchronousI/Ooperations.Forfaithfulreplay,R2mustreplayallsyscallsandup-callsinthesameorderasduringarecordrun.Forsingle-threadedprogramsasynchronousI/Oraisessomechal-lenges.Formultithreadedprogramsthatrunonmultipro-cessors,recordingtherightorderismorechallenging,becausesyscallsandupcallscanhappenconcurrently,butdependenciesbetweensyscallsandupcallsexecutedd()3ReleaseMutex([in]HANDLEhMutex););()8WaitForSingleObject([in]HANDLEhMutex,10[in]DWORDdwMilliseconds);Figure6:Syscall-syscallcausalityannotationsusingsync.R2serializesthesyscalleventswiththesamesynckeytoobtainaneventorder,andthecausali-tiesbetweentheseeventsareimplicitlyheldbytheeventsequence.bydifferentthreadsmustbemaintained.R2providesde-veloperswithtwoannotationstoexpresssuchdependen-cies.ThissectiondescribeshowR2handlestherecord-ingandreplayexecutionorder.4.1EventDenitionInR2therearethreeevents:syscalls,upcalls,andcausal-ities.R2usesthecausalityeventstoenforcethehappens-beforerelationbetweeneventsexecutedbydifferentthreads.Considerthefollowingscenario:onethreadusesasyscalltoputanobjectinaqueue,andlaterasecondthreadusesanothersyscalltoretrieveitfromthequeue.Duringreplaytherstsyscallmustalwayshappenbeforethesecondone;otherwise,thesecondsyscallwillreceiveanincorrectresult.Usingannotations,thesecausalitiesarecapturedincausalityevents.Acausalityeventhasasourceeventandadestinationevent,whichcap-tures.R2generatesaslotfunctionthatitstoresinbothand’sstubs,whichwillcausethecausalityeventtobereplayedwhenR2replaysandR2capturestwotypesofcausalityevents:syscall-syscall:asyscalldependsonanearlierone,e.g.,signalandwait;syscall-upcall:asyscallregistersacallbackthatisexecutedlaterasanupcall.Tocapturesyscall-syscallcausality,R2providesthekeywordsynctoannotatesyscallsthatoperateonthesameresource.Figure6presentsanexample,whereacalltothatacquiresamu-texdependsonanearliercalltothatreleasesit.Developerscanthenannotatethemwithsync.WecallthemutexsynckeyR2createscausalityeventsforsyscallswiththesamesynckey.Inadditiontomutexes,asynckeycanbeanyexpressionthatreferstoauniqueresource.ForasynchronousI/Ooperations(seeFigure5),R2usesasthesynckey. USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation201 HANDLECreateThread([in]LPSECURITY_ATTRIBUTESlpThreadAttributes,[in]SIZE_TdwStackSize,[in,callback]LPTHREAD_START_ROUTINElpStartCb,[in]LPVOIDlpParameter,[in]DWORDdwCreationFlags,[out]LPDWORDlpThreadId);Figure7:Asyscall-upcallcausalityannotationus-ingcallback.R2convertsthecallbackargumentintoanupcallstub;whentheupcallisde-livered,thesyscall-upcallcausalitywillbecaptured.procedurePDATELOCKtypeCAUSALITY thenmaxsourceoodelseendifendprocedureFigure8:Algorithmforcalculatingeventclocks.Theprocedureisinvokedwhenprocessingeveryevent.Forsyscall-upcallcausality,developerscanusethekeywordcallbacktomarkthedependency,asillustratedinFigure7.R2generatesacausalityeventforthesyscallandtheupcallto4.2RecordingEventOrderR2usesaLamportclock[18]totimestampalleventsandusesthattimestamptoreplay.Itassignseachthreadclockandeacheventaclock.Figure8showsR2’salgorithmtocalculatetheLamportclockforeachevent.Duringrecording,R2rstsetstheclockofthemainthreadto0(line1).Then,whenaneventisinvoked,R2calculatestheclockforthateventusingtheprocedurePDATELOCK.Fornon-causalityevents,R2simplyincreasesthecurrentthread’sclockbyone(line7)andthenassignsthatvaluetotheevent(line8).Foracausalityevent,R2updatesthecurrentthread’sclocktothegreatervalueofitselfandtheclockfromthesourceeventofthecausality(notethatsourcedestination),andincreasesitbyone(line4).R2alsoassignsthisvaluetothecausalityevent(line5).Whenathreadinvokesthedestinationeventofacausalityevent,R2rstrunstheslotfunctionforthecausalityevent,whichinvokesUPDATELOCKwiththecausalityeventasargument.Thiswillcausetheclockofthecausalityeventtobepropagatedtothethread(line7inFigure8).Thereareseveralpossibleexecutionordersthatpre-servethehappens-beforerelation,aswediscussnext.4.3ReplayingEventOrderR2canusetwodifferentorderstorecordandreplayevents:total-orderandcausal-order.Total-orderexecu-tioncanfaithfullyreplaytheapplication,butmayslowdownmultithreadedprogramsrunningonamultiproces-sor,andmayhideconcurrencybugs.Causal-orderexe-cutionallowstrueconcurrentexecution,butmayreplayincorrectlyiftheprogramhasraceconditions.4.3.1Total-OrderExecutionLikeliblog[10],intotal-orderexecutionmode,R2usesatokentoenforceatotalorderinreplayspace,in-cludingexecutionslicesthatpotentiallycouldbeexe-cutedconcurrentlybydifferentthreads.Duringrecord-ingwhenathreadentersreplayspace(i.e.,returningfromasyscallorinvokinganupcall),itmustacquirethetokenrstandcalculateatimestampifanupcallispresent.Whenathreadleavesreplayspace(i.e.,invokingasyscall),R2assignsatimestamptothesyscallandthenreleasesthetoken.OnatokenownershipswitchR2gen-eratesacausalityeventtorecordthatthetokenispassedfromonethreadtoanother.Thisdesignserializesexecu-tioninreplayspaceduringrecording,althoughthreadsexecutinginsystemspaceremainconcurrent.Theresultisatotalorderonallevents.Duringreplay,R2replaysintherecordedtotalor-der.Asitreplaystheevents,R2willdynamicallycreatenewthreadsforeventsexecutedbydifferentthreadsdur-ingrecording(asdescribedearlierinSection2.4).Itwillensurethatthesethreadsexecuteinthesameorderasenforcedbythetokenduringrecording.Thereasontousemultiplethreads,eventhoughtheexecutioninreplayspacehasbeenserialized,isthatdevelopersmaywanttopauseareplayanduseastandarddebuggertoinspectthelocalvariablesofaparticularthreadtounderstandhowtheprogramreachedthestateitisin.Inaddition,usingmultiplethreadsduringreplayensuresthatthread-specicstorageworkscorrectly.4.3.2Causal-OrderExecutionCausal-orderexecution,however,allowsthreadstoexecuteinparallelinbothreplayandsystemspace.R2doesnotimposeatotalorderinreplayspace,itjustcap-turesthecausalitiesofsyscall-syscallandsyscall-upcaThereforetheapplicationwillachievethesamespeedupincausal-orderexecution.Toimplementcausal-orderexecution,R2reusesthereplayfacilityfortotal-orderexecution.R2processesthcausal-ordereventlogbeforereplay,usesalogconvertertotranslatetheeventsequencesintoanytotalorderthatpreservesallcausalities,andreplaysusingthetotal-order 2028th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association execution.Iftheprogramhasanunrecordedcausality(e.g.,datarace),R2cannotguaranteetoreplaythesecausalitiesfaithfullyincausal-orderexecution.Wehavenotfullyimplementedthelogconverteryet,sinceourfo-cusisreplayingdistributedapplicationsandtotal-orderexecutionhasbeengoodenoughforthispurpose.EFININGWNWehaveannotatedalargesetofWin32APIcallsforR2tosupportmostWindowsapplicationswithoutanyeffortfromdevelopers,includingthoserequiredtobeannotatedwithxpointerpreparecommit,andsyncSometimesdevelopersmaywanttodenetheirownsyscalls,eithertoenclosenondeterminisminsyscalls,spinlockcasesinSection2.2)ortore-ducerecordingoverhead.InthissectionweuseMPIandSQLiteasexamplestoexplainhowtodothisingeneral.5.1MPIMPIisacommunicationprotocolforprogrammingpar-allelcomputingapplications.AnMPIlibraryusuallyhasnondeterminismthatcannotbecapturedbyinterceptingWin32functions(e.g.,MPICH[4]usesshared-memoryandspinlocksforinterprocesscommunication).Tore-playMPIapplications,thisnondeterminismmustbeen-capsulatedbyR2syscalls.Therefore,weannotateallMPIfunctionsassyscalls,makingtheentireMPIlibraryruninsystemspace.SincetheMPIlibraryiswellencap-sulatedbytheseMPIfunctions,doingthisensuresthatbothrulesinSection2.2aresatised.AnnotatingMPIfunctionsisaneasytask.Mostfunc-tionsonlyrequiretheandoutannotationsatpa-rameters.Several“non-blocking”MPIfunctions(e.g., and )useasynchronousdatatransfer,whichiseasilycapturedusingtheprepareandcommitannotations.Figure9showstheannotated and ,whichissueanasyn-chronousreceiveandwaitforthecompletionnotica-tion,respectively.Thesefunctionsareassociatedbasedontheparameterbytheannotations.Sec-tion8.2presentsthenumberofannotationsneeded.5.2SQLiteSQLite[3]isawidely-usedSQLdatabaselibrary.AclientaccessesthedatabasebyinvokingtheSQLiteAPI.UsingWin32levelsyscalls,R2canfaithfullyreplaySQLiteclientapplications.Additionally,developerscanaddtheSQLiteAPItoR2syscallssothatR2willrecordtheoutputsofSQLiteAPI,andavoidrecordingleop-erationsissuedbytheSQLitelibraryinsystemspace.Incertainscenarios,recordingattheSQLiteAPIlayercandramaticallyreducethelogsize,comparedwithrecordingattheWin32layer.Forexample,some[prepare(request,buf)]MPI_Irecv([out,bsize(MPISize(type,count))]void[in]intcount,[in]MPI_Datatypetype,[in]intsource,[in]inttag,[in]MPI_Commcomm,10[out]MPI_Requestrequest);1112133()14MPI_Wait(15[in]MPI_Request16[out]MPI_StatusFigure9:ExampleofasynchronyannotationsonMPIfunctions.Thesizeofthereturnedbufferatcommitisbydefaulttheregisteredprepare intsqlite3_prepare([in]sqlite3[in]constchar[in]nByte,[out]sqlite3_stmt[out]constcharpzTail);intsqlite3_column_int([in]sqlite3_stmt10[in]iCol);Figure10:ExampleofannotatedSQLitefunctions.queriesmayscanalargetablebutreturnonlyasmallportionofmatchedresults.Forthesequeries,recordingonlythenalresultsismoreefcientthanrecordingalldatafetchedfromdatabaseles.Section8.4showstheperformancebenetsofthisapproach.Figure10showstwoannotatedSQLitefunctions: and column aretypicalroutinesforcompilingaqueryandretrievingcol-umnresults,respectively.NNOTATIONSFORPTIMIZATIONInthissection,weintroducetwoadditionalannotationkeywordstooptimizeR2’sperformance.Cacheannotation.Byinspectinglogswendthatafewsyscallsareinvokedmuchmorefrequentlythanothers—morethantwoordersofmagnitude.Also,mostofthemreturnonlyastatuscodethatdoesnotchangefrequently(e.g.,onWindowsreturnszeroinmostcases).ToimproverecordingperformanceR2introduceskeywordcachetoannotatesuchsyscalls.Everytimeasyscallannotatedwithcachereturnsastatuscode,R2comparesthevaluewiththecachedonefromthesamesyscall;onlywhenitchangeswillR2recordthenewvalueinthelogandupdatethecache.AnApacheexperimentinSection8.5.1showsthatthisoptimizationreducesthelogsizebyafactorof17.66%. USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation203 ManuallyCodedModules kloc annotationparser&codegenerator 4.1core(interception,isolation,slot) 1.3upcall(callback 0.7causality(sync 1.9aio(preparecommit 1.3record-replay(memory,data,event) 10.2 Total 19.5 AutomaticallyGeneratedModules kloc callback.Win32 3.6causality.Win32 2.9aio.Win32 1.9R2.Win32 102.0R2.app.specic - Total 110.2 Table2:R2modules.Reproduceannotation.Someapplicationdatacanbereproducedatreplaytimewithoutrecording.ConsideraBitTorrentnodethatreceivesdatafromotherpeersandwritesthemtodisk.Italsoreadsthedownloadeddatafromdiskandsendsthemtootherpeers.Itissafetorecordallinputdataforreplay,i.e.,bothreceivingfromnetworkandreadingfromdisk.However,R2neednotrecordthelatter.DeveloperscanusekeywordreproducetoannotateleI/Osyscallsinthiscase.R2thengener-atesstubsfromaspeciccodetemplate,tore-executeorsimulateI/Ooperations,insteadofrecordingandfeed-ing.NetworkI/O,suchasintra-groupcommunicationsofanMPIapplication,canbereducedsimilarly[33].Section8.5.2and8.5.3showthatthisoptimizationcanreducelogsizesrangingfrom13.7%to99.4%forBit-TorrentandMPIexperiments.MPLEMENTATIONR2isdecomposedintoanumberofreusablemodules.Table2listseachmoduleandlinesofcode(loc).Insum,wehavemanuallywritten19kloc;110klocaregener-atedautomaticallyforR2’sWin32layerimplementation.WehaveannotatedmorethanonethousandWin32APIcalls(seestatisticsinTable4).Althoughthiswellcoverscommonly-usedones,Win32hasamuchwiderinterface,andwemaystillhavemissedsomeusedbyap-plications.Therefore,wehavebuiltanAPIcheckerthatscanstheapplication’simporttableandsymbolletodetectmissingAPIcallswhenanapplicationstarts.DuringreplayR2maystillfailbecauseofsomeun-recordednon-determinisms,e.g.,dataracesnotenclosedbyR2syscalls.Sincenon-determinismsusuallyleadtodifferentcontrolowchoicesthusdifferentsyscallinvo-cationsequences,R2recordsthesyscallsignature(e.g., Category SoftwarePackage webserver Apache,lighttpd,NullHTTPddatabase SQLite,BerkeleyDB,MySQLdistributedsystem libtorrent,Nyx,PacicAvirtualmachine Lua,Parrot,Pythonnetworkclient cURL,PuTTY,Wgetmisc. zip,MPICH Table3:Softwarepackagessuccessfullyreplayed.name)andchecksitduringreplay(checkwhetherthecurrentinvokedsyscallhasthesamesignaturewiththatfromthelog).Bythismeans,R2canefcientlydetectthemismatchatthersttimewhenR2gainscontrolafterthedeviation.Whenamismatchisfound,R2reportsthecur-rentLamportclockandthemismatch.ThedevelopercanthenreplaytheapplicationagainwithabreakpointsetattheLamportclockvalueminusone.Whenthebreakpointishit,thedevelopercanthenexaminehowtheprogramreachedadifferentstateduringreplay,andxtheprob-lem(e.g.,byadjustingtheinterposedinterface).WehavefoundthatthisapproachworkswelltodebugtheR2in-terface.VALUATIONWehaveusedR2tosuccessfullyreplaymanyreal-worldapplications.Table3summarizesanincompletelist.Mostoftheapplicationsarepopularsystemap-plications,suchasmulti-threadedApacheandMySQLservers,whichwebelieveR2isthersttoreplay.Nyx[7]isasocialnetworkcomputationengineforMSNandPacicA[19]isastructuredstoragesystemsimilartoGoogleBigtable[6],bothofwhicharelarge,complexdistributedsystemsandhaveusedR2forreplaydebug-ging.TheimplementationoftheseapplicationsrequiresaddressingthechallengesmentionedinSection2.Thissectionanswersthefollowingquestions.Howmucheffortisrequiredtoannotatethesyscall-upcallinterface?Howimportantareannotationstosuccessfulreplayofapplications?HowmuchdoesR2slowdownapplicationsduringrecording?Howeffectivearecustomsyscalllayersandanno-tations(cacheandreproduce)inreducinglogsizeandoptimizingperformance?Similartopreviousreplaywork,wedonotevalu-atethereplayperformance,becausereplayisusuallyaninteractiveprocess.However,thereplayedapplica-tionwithoutanydebugginginteractionrunsmuchfaster 2048th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association Interface #func in out bsize cb xptr pr ci sync cache reproduce #serial kloc Win32 1,301 1,100 631 168 53 11 17 4 30 2 3 7 110.2MPI 191 171 150 20 6 4 6 4 1 0 4 5 22.2 153 150 16 4 19 3 0 0 7 0 0 0 15.7 Table4:Informationaboutannotationsandcodegeneration.Columnswithannotationkeywordsshowthenumberoffunctionsforeachkeyword.Keywordscallbackxpointerpreparecommit,areabbreviatedtoxprprrespectively.Thelasttwocolumnslistthenumberoffunctionsthatrequirecustomized(de)-serialization,andlinesofautomaticallygeneratedcode,respectively.thanwhenrecording(e.g.,areplayrunofBitTorrentledownloadingis13xfaster).8.1ExperimentalSetupAllexperimentsareconductedonmachineswith2.0GHzXeondual-coreCPU,4GBmemory,two250GB,7200/sdisks,runningWindowsServer2003ServicePack2,andinterconnectedviaa1Gbpsswitch.Unlessexplicitlyspecied,theapplicationdataandR2loglesarekeptonthesamedisk;therecordrunusestotal-orderexecution;alloptimizations(i.e.,cacheandreproduce)areturnedoff.8.2AnnotationEffortWeannotatedtherstset(500+functions)oftheWin32syscallinterfacewithinoneperson-week,andthenanno-tatedasneeded.Wereusedout,andcallbackfromtheWindowsPlatformSDK,andmanuallyaddedtheotherannotations(i.e.,xpointerpreparecommitsync);wemanuallyannotatedonly62functions.Wefoundthatoncewedecidedhowtoannotateafewfunctionsforaparticularprogrammingconcept(e.g.,asynchronousI/O,orsynchronization),thenwecouldannotatetheremainingfunctionsquickly.Forex-ample,afterweannotatedthele-relatedasynchronousI/Ofunctions,wequicklywentthroughallthesocketre-latedasynchronousI/Ofunctions.Forthetwoothersyscallinterfaces,MPIandSQLite(discussedinSection5),wespenttwoperson-daysan-notatingeachbeforeR2couldreplayMPIandSQLiteapplications.Therstfourkeywords(out,andcallback)aretrivialandcostuslittletime,andwemainlyspentourtimeonotherannotationsandwritingcus-tomized(de)-serializationfunctions.Table4listsforthethreesyscallinterfacetheanno-tationsused,howmanyfunctionsusedthem,thenumberoffunctionsthatneededcustomized(de)-serialization,andthelinesofcodeautogenerated(approximately148kloc).ThetableshowsthattheannotationsareimportantforR2;withoutthemitwouldhavebeenatediousanderror-pronejobtomanuallywritesomanystubs. Conguration Request#/s Slowdown Lograte native 1242.23 - stubonly 1241.75 0.04% log 1125.58 1.34% 0.760causal-order 1197.52 3.60% 1.114total-order 1129.94 9.04% 0.781 Table5:ApacheperformanceunderdifferentR2con-gurations(cacheon).LograteismeasuredinKB/req.Clientconcurrencylevelis50andthedownloadlesizeis64KB.8.3PerformanceWemeasuretherecordingperformanceofR2usingtheApachewebserver2.2.4withitsdefaultcongura-tion(250threads)andthestandardApacheBenchclient,whichisincludedinthesamepackage.Table5showsthereductioninrequestthroughputandthelogoverheadunderdifferentR2congurations.WeuseApacheBenchtomimic50concurrentclients,allofwhichdownload64KBstaticles(whichisatypicalwebpagesize).Eachcongurationinthetableexecutes500,000requests.Aswecansee,thestub,thelogger,andthecausal-orderexecutionhavelittleperformanceim-pact;thetotal-orderexecutionimposesaslowdownupto9.04%,whichwebelievetobeacceptableforthepurposeofdebugging.Thelogproducedforeachrequestisap-proximately0.8KB,slightlybiggerforcausal-orderex-ecutionmodesinceitneedstologmorecausalityevents.Figure11showstheresultsfortotal-orderandcausal-ordercongurations,withavariednumberofconcurrentclientsandlesize.Weseethatwhenthedownloadleislarger,theslowdownissmaller.ThisisbecausethelargerlesizemeansthattheexecutioninreplayspacecostslessCPUtime,andtheslowdownimposedbytotalorderexecutionisless.Forthesmallestsizeof16KBwetested,theaverageslowdownforallconcurrencylevelsis11.2%undertotal-ordercongurationand4.9%undercausal-orderconguration.InadditiontoApache,wehavealsomeasuredtheperformanceofmanyotherapplicationswhilerecording.Theslowdownformostcasesismoderate(e.g.,9%on USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation205 1500 200025003000 dr2 16K/native 16K/co log 16K/to log 64K/native 1000 1500 102030405080100 dr2 d 64K/native 64K/co-log 64K/to log 256K/native 256K/co-log 256K/to log Figure11:Apacheperformanceusingtotal-orderandcausal-ordercongurations,withavariednumberofconcurrentclientsandlesize. 1001000 logdsized(M1) win32 sqlite file mem logdsized(M1) timed(s) win32 sqlite native file timed(s) Figure12:SQLitelogsizeandexecutiontimeatWin32andSQLiteinterfacesusingFILEandMEMcongura-tions.averageforthestandardMySQLbenchmark[12]).Thereareexceptions,suchastheSQLitecasebelow.Theper-formanceoftheseexceptions,however,canbeimprovedusingeithercustomizedsyscalllayeroroptimizationan-notations.8.4CustomizedSyscallLayersThissectionevaluatestheperformanceofR2forSQLiteusingtwosyscalllayers,i.e.,Win32andSQLite,whichwasdiscussedinSection5.2.WeadoptedabenchmarkfromNyx[7],whichcal-culatesthedegreedistributionofuserconnectionsinasocialnetworkgraph.Thecalculationisexpressedasaquery:SELECTCOUNT()FROMedgeGROUPBYsrc .Weperformthequery10times,andmea-surelogsizeaswellasexecutiontime.Thedatasetcon-tains156,068edgesandisstoredinSQLite;thelesizeisapproximately3MB.BydefaultSQLitestorestemporarytablesandin-dicesinles;itcanstoretheminmemorybysettinganoption.WeuseFILEandMEMtonamethetwocong-urations.Otheroptionsaredefaultvalues.Figure12showslogsizeandexecutiontimeforrecordingatthetwosyscalllayers,respectively.Inei-therconguration,recordingattheWin32interfacepro-ducesmuchlargerlogs(890MBand35MB),comparedtorecordingattheSQLiteinterface(only3MB).The CachedSyscalls Call# Miss# HitRatio GetLastError 618,015 99,948 83.82% 150,016 2 99.99% 150,003 1 99.99% 100,147 2 99.99% 100,014 4 99.99% Total 1,118,195 99,957 91.06% Table6:Apachecachedsyscallswithcachemissandhitstatistics.Clientconcurrencylevelis50anddownloadlesizeis64KB.slowdownfactorsofrecordingatthetwointerfacesare126.3%and9.6%underFILE,17.8%and7.3%underMEM,respectively.NotethattherecordingattheSQLiteinterfacepro-ducesthesamesizeoflogforthetwocongurations,becausetheSQLlayerdoesnotinvolveleI/Oandthelogsizeisnoteffectedbythecongurations.FromtheseresultswecanseethatrecordingattheSQLitelayercanreducelogoverheadandimproveper-formance,ifforaquerySQLitemustperformI/Ofre-quently.8.5OptimizationAnnotationsAsdiscussedinSection6,R2introducestwoannotationkeywordstoimproveitsperformance.Weevaluatetheminthissection.8.5.1ThecacheAnnotationWeusetheApachebenchmarkagaintoevaluatethecacheannotation.TheexperimentrunsR2intotal-orderexecutionmode.Theclient’sconcurrencylevelis50andtheledownloadedis64KB.ProlingApacheshowsthat5outof61syscallscontributemorethan50%ofsyscall.Weusethecacheannotationforthesesyscalls.Table6showshowmanytimesthesesyscallswerein-vokedanddidnothitthecacheinonetestrun.Weseethatthereturnvaluesofthesesyscallsweremostlyinthecache,andthattheaveragehitratiois91.06%.Thisre-ducedthelogsizefrom21.99MBto18.1MB(approx-imately17.66%reduction).Weappliedthecacheopti-mizationtoonlyvesyscalls,butwecouldgainmorebenetsifweannotatedmoresyscalls.8.5.2ReproducedFileI/OAsdiscussedinSection6,whenthereproduceanno-tationisusedforleI/OwhenrecordingBitTorrent,thelecontentthatisreadfromadiskisnotrecorded,andtherelatedlesyscallsarere-executedduringreplay.WeuseapopularC++BitTorrentimplementationlibtorrent[1]tomeasuretheimpactofthisannotation.Theexperimentwasconductedon11machines,withone 2068th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association finish time (s) nodes (sorted) n(htmr srodse srodse )srfsdeior Figure13:Finishtimeof10BitTorrentnodesinrunsofnative,recordwithoutandwithreproducedI/Ooptimiza-tion.seedand10downloaders.Theseedlesizewas4GB;theuploadbandwidthwaslimitedto8MB/s.R2ranintotal-ordermodewithcacheoptimizationoff.Theaveragelogsizesoftherecordingrunwith-outandwiththereproduceannotationare17.1GBand5.4GB,respectively.Theoptimizationreducesthelogsizeby68.2%.Relativetothe4GBlesize,thetwocasesintroduce297.5%and26.4%logsizeoverhead,re-spectively.Figure13presentsthenishtimeofthe10down-loadersforanativerun,aswellforrecordedrunswith-outandwiththereproduceannotation.Onaverage,theslowdownfactorsofrecordingwithoutandwiththean-notationare28%and3%,respectively.WecanseethatthereproduceannotationiseffectivewhenrecordingI/Ointensiveapplications,reducingboththelogandperfor-manceoverhead.8.5.3ReproducedNetworkI/OWeusetheMPIsyscalllayertoevaluatethebene-tofthereproduceannotationfornetworkI/O.Theex-perimentwasconductedinourMPI-replayproject[33],whichusesR2.WeannotatedMPIfunctionsusingthereproduceannotationsothatthemessagesarenotrecordedbutreproducedduringreplay.Table7showstheeffectivenessfortwotypicalMPIbenchmarks:GE[14]andPU[11].WeseethattheclientprocessofPUgainsmuchbenetfromthiskeyword;thelogsizeisreducedbymorethan99.4%.ForGE,italsoresultsinalogsizereduction,butofabout13.7%.ELATEDR2borrowsmanytechniquesfrompreviousreplaytools,inparticularfromlibrary-basedones.ThissectionrelateR2totheminmoredetail.Library-basedreplay.Severalreplaytoolsusealibrary-basedapproach.TheclosestworkisJockey[28]andli- w/oopt(MB) w/opt(MB) Ratio node0 node1 node0 node1 GE 55.3 55.3 47.7 47.7 86.3%PU 4.5 1170.0 5.7 1.7 0.6% Table7:ReproducednetworkI/OoptimizationonMPI.“Ratio”isthelogsizewiththisoptimizationcomparedthatwithout.TheothereldsareR2logsizeoneachnode.blog[10],wherearuntimeuser-modelibraryisinjectedintoatargetapplicationforrecordandreplay.Webor-rowmanyideasfromthesetools(e.g.,usingatokentoensuretotal-orderexecution)butextendthelibrary-baseapproachtoawiderrangeofapplicationsusingstricterisolation(inspiredbyoperatingsystemkernelideas),byexiblecustomizationoftherecordandreplayinterface,byannotations,andautomaticgenerationofstubs.R2alsoisolatestheapplicationfromthetoolinadifferentway.Forexample,Jockeytriestoguaranteethattheapplicationbehavesthesamewithandwithoutrecordandreplay,andliblogsharesthesamegoal.Con-sequently,theybothsendthememoryrequestsfromthetooltoadedicatedmemoryregiontoavoidchangingthememoryfootprintoftheapplication.AsdiscussedinSection1,R2aimsforreplayfaithfulnessinstead,anditmanagesmemoryrequestsfromtheapplicationJockeyandlibloghaveaxedinterfaceforrecordandreplay(amixofsystemandlibccalls);anynonde-terminismthatisnotcoveredwillcausereplaytofail.R2enablesdeveloperstoannotatesuchcasesusingkey-wordsonfunctionsofhigher-levelinterfacestoenclosenondeterminism.Ontheimplementationside,bothJockeyandli-bloghavemanuallyimplementedmanystubs(100+);R2’smoreautomaticapproachmakesiteasiertosup-portawiderrangeofsyscalls.Forexample,Jockeydoesnotsupportmultithreading;liblogalsodoesnotsupportasynchronousI/Oandotherfunctions.RecPlay[25]capturescausalitiesamongthreadsbytrackingsynchronizationprimitives.R2usesthatideatoo,butalsocapturesothercausalities(e.g.,syscall-upcallcausalities).RecPlayusesavectorclockduringreplayandcandetectdataraces.ThisfeaturecouldbeusefultoR2too.Anotherlibrary-basedapproachbutlessrelatedisFlashback[29],whichmodiesthekernelandrecordstheinputoftheapplicationatsystemcalllevel.Sinceitisimplementedasakerneldriver,itislesseasytodeployandusethanR2.Domain-specicreplay.Therearealargenumberofreplaytoolsfocusingonapplicationsusingrestrictedprogrammingmodels,suchasdistributedsharedmem- USENIX Association 8th USENIX Symposium on Operating Systems Design and Implementation207 ory[27]orMPI[26],orinspecicprogramminglan-guagessuchasStandardML[30]orJava[17].Thisap-proachisnotsuitableforthesystemapplicationsthatR2targets.Infact,webuiltareplaytoolbefore[21],whichreliesontheprogrammerstodeveloptheirapplicationsusingourownhome-grownAPI.ThelimitationofthisworkpropelledustodesignandbuildR2.Wholesystemreplay.Adirectwaytosupportlegacyapplicationsistoreplaythewholesystem,includingtheoperatingsystemandthetargetapplications.Asetofre-playtoolsaimatthistarget,eitherusinghardwaresup-port[32,24,23]orvirtualmachines[8,16,5].Theycanreplayalmosteveryaspectofanapplication’senvi-ronmentfaithfully,includingschedulingdecisionsinsidtheoperatingsystem,whichmakesthemsuitabletode-bugproblemssuchasraceconditions.TheycanachievesimilarperformanceasR2;ReVirt[8],forinstance,hasaslowdownof8%forrebuildingthekernelorrunningSpecWeb99benchmarks.However,theycanbeinconve-nientandexpensivetodeploy.Forexample,developersmustcreateavirtualmachineandinstallacopyoftheoperatingsystemtorecordandreplayanapplication.Annotations.Annotationsonfunctionsarewidelyusedinmanyelds.Forexample,aprojectinsideMi-crosoftusesSALandstaticanalysistondbufferoverows[13].Instead,SafeDrive[34]insertsruntimecheckswherestaticanalysisisinsufcientaccordingtotheannotations.Whiletheyallfocusonndingbugs,R2usesannotationstounderstandfunctionsideeffects,andgeneratescodetorecordandreplaythem.Enforcingisolationwithbinaryinstrumentation.XFI[9]isaprotectionsystemwhichusesacombina-tionofstaticanalysiswithinlinesoftwareguardsthatperformchecksatruntime.Itensuresmemoryisolationbyintroducingexternalcheckingmodulestochecksus-piciousmemoryaccessesatruntime.BecauseXFImon-itorsthememoryaccessatinstructionlevel,itsoverheadvariesfrom5%toafactoroftwo,dependingonhowthestaticanalysisworksandalsothebenchmark.R2isolatesatfunctioninterfacesandtargetsreplay,whichallowsittobemorelooseinitsisolationinsomeways(i.e.,itdoesnothavetoprotectagainstattacks),butmorestrictinotherways(i.e.,memoryaddressescannotchangefromrecordingtoreplay).10CR2useskernelideastosplitanapplication’saddressspaceintoareplayandasystemspace,allowingstrictseparationbetweentheapplicationandthereplaytool.Withhelpfromthedeveloper,whospeciessomean-notationsonthesyscallinterface,R2carefullyman-agestransitionsbetweenreplayandsystemspaceatthesyscallinterface,andisolatesresources(e.g.,threadsandmemory)withinaspace.TheannotationsalsoallowR2togeneratesyscallandupcallstubsfromcodetemplatesautomatically,andmakeiteasyfordeveloperstochoosedifferentsyscall/upcallinterfaces(e.g.,MPIorSQLite).Italsoallowsdeveloperstoenclosenondeterminismandavoidsharedstatebetweenreplayandsystemspace.Annota-tionsforoptimizationscanreducetherecordlogsizeandimproveperformance.ByusingtheseideasR2extendsrecordingandreplaytoapplicationsthatstate-of-the-artlibrary-basedreplaytoolscannothandle.R2hasbecomeanimportanttoolfordebuggingapplications,especiallydistributedones,andabuildingblockforotherdebuggingtools,suchasrun-timehangcure[31],distributedpredicatechecking[20],taskhierarchyinference[22],andmodelchecking.CKNOWLEDGMENTWethankAlvinCheung,EvanJones,JohnMcCullough,RobertMorris,StefanSavage,AlexSnoeren,GeoffreyVoelker,ourshepherd,DavidLie,andtheanonymousreviewersfortheirinsightfulcomments.ThankstoourcolleaguesMatthewCallcut,TracyChen,RuiniXue,andLidongZhouforvaluablefeedback.[1]libtorrent0.11.[2]PHP:Hypertextpreprocessor.[3]SQLite3.5.8.[4]D.AshtonandJ.Krishna.MPICH2WindowsDevelopmentGuide.ArgonneNationalLaboratory,2008.[5]S.Bhansali,W.-K.Chen,S.deJong,A.Edwards,R.Murray,M.Drinic,D.Mihocka,andJ.Chau.Frameworkforinstruction-leveltracingandanalysisofprogramexecutions.In,2006.[6]F.Chang,J.Dean,S.Ghemawat,W.C.Hsieh,D.A.Wallach,M.Burrows,T.Chandra,A.Fikes,andR.E.Gruber.Bigtable:Adistributedstoragesystemforstructureddata.InOSDI,2006.[7]Y.Chen,T.Chen,M.Chen,andZ.Zhang.IslandsintheMSNMessengerbuddynetwork.InSocialNets,2008.[8]G.W.Dunlap,S.T.King,S.Cinar,M.Basrai,andP.M.Chen.ReVirt:Enablingintrusionanalysisthroughvirtual-machineloggingandreplay.InOSDI,2002. 2088th USENIX Symposium on Operating Systems Design and ImplementationUSENIX Association [9]U.Erlingsson,M.Abadi,M.Vrable,M.Budiu,andG.C.Necula.XFI:Softwareguardsforsystemaddressspaces.InOSDI,2006.[10]D.Geels,G.Altekar,S.Shenker,andI.Stoica.Replaydebuggingfordistributedapplications.InUSENIX,2006.[11]W.Gropp,E.Lusk,N.Doss,andA.Skjellum.Ahigh-performance,portableimplementationoftheMPImessagepassinginterfacestandard.ParallelComputing,22(6):789–828,1996.[12]Z.Guo,X.Wang,X.Liu,W.Lin,andZ.Zhang.Towardspragmaticlibrary-basedreplay.TechnicalReportMSR-TR-2008-02,MicrosoftResearch,2008.[13]B.Hackett,M.Das,D.Wang,andZ.Yang.Modularcheckingforbufferoverowsinthelarge.,2006.[14]Z.Huang,M.K.Purvis,andP.Werstein.Performanceevaluationofview-orientedparallelprogramming.In,2005.[15]G.HuntandD.Brubacher.Detours:BinaryinterceptionofWin32functions.InUSENIXWindowsNTSymposium,1999.[16]S.T.King,G.W.Dunlap,andP.M.Chen.Debuggingoperatingsystemswithtime-travelingvirtualmachines.InUSENIX,2005.[17]R.Konuru,H.Srinivasan,andJ.-D.Choi.DeterministicreplayofdistributedJavaapplications.In,2000.[18]L.Lamport.Time,clocksandtheorderingofeventsinadistributedsystem.CACM21(7):558–565,1978.[19]W.Lin,M.Yang,L.Zhang,andL.Zhou.PacicA:Replicationinlog-baseddistributedstoragesystems.TechnicalReportMSR-TR-2008-25,MicrosoftResearch,2008.[20]X.Liu,Z.Guo,X.Wang,F.Chen,X.Lian,J.Tang,M.Wu,M.F.Kaashoek,andZ.Zhang.S:Debuggingdeployeddistributedsystems.InNSDI,2008.[21]X.Liu,W.Lin,A.Pan,andZ.Zhang.WiDSchecker:Combatingbugsindistributedsystems.NSDI,2007.[22]H.Mai,C.Gao,X.Liu,X.Wang,andG.M.Voelker.Towardsautomaticinferenceoftaskhierarchiesincomplexsystems.InHotDep,2008.[23]S.Narayanasamy,C.Pereira,andB.Calder.RecordingsharedmemorydependenciesusingStrata.InASPLOS,2006.[24]S.Narayanasamy,G.Pokam,andB.Calder.BugNet:Continuouslyrecordingprogramexecutionfordeterministicreplaydebugging.In,2005.[25]M.RonsseandK.D.Bosschere.RecPlay:Afullyintegratedpracticalrecord/replaysystem.TOCS17(2):133–152,1999.[26]M.Ronsse,K.D.Bosschere,andJ.C.deKergommeaux.ExecutionreplayforanMPI-basedmulti-threadedruntimesystem.InParCo,1999.[27]M.RonsseandW.Zwaenepoel.Executionreplayfortreadmarks.In,1997.[28]Y.Saito.Jockey:Auserspacelibraryforrecord-replaydebugging.InAADEBUG,2005.[29]S.Srinivasan,C.Andrews,S.Kandula,andY.Zhou.Flashback:Alight-weightextensionforrollbackanddeterministicreplayforsoftwaredebugging.InUSENIX,2004.[30]A.TolmachandA.W.Appel.AdebuggerforStandardML.JournalofFunctionalProgramming,5(2):155–200,1995.[31]X.Wang,Z.Guo,X.Liu,Z.Xu,H.Lin,X.Wang,andZ.Zhang.Hanganalysis:Fightingresponsivenessbugs.InEuroSys,2008.[32]M.Xu,R.Bodik,andM.D.Hill.A“ightdatarecorder”forenablingfull-systemmultiprocessordeterministicreplay.In,2003.[33]R.Xue,X.Liu,M.Wu,Z.Guo,W.Chen,W.Zheng,Z.Zhang,andG.M.Voelker.MPIWiz:SubgroupreproduciblereplayofMPIapplications.PPoPP,2009.[34]F.Zhou,J.Condit,Z.Anderson,I.Bagrak,R.Ennals,M.Harren,G.Necula,andE.Brewer.SafeDrive:Safeandrecoverableextensionsusinglanguage-basedtechniques.InOSDI,2006.