Deterministic Galois: On-demand, Portable and Parameterless
Donald Nguyen, Andrew Lenharth

Uploaded On 2014-12-14


Abstract

Nondeterminism in program execution can make program development and debugging difficult. In this paper, we argue that solutions to this problem should be on-demand, portable and parameterless. On-demand means that the programming mo…


Programming models such as DPJ [9], revisions with deterministic merges [11], StreamIt [27], and nested data-parallel programs [8] are deterministic by construction, but they are incompatible with on-demand determinism. Other programming models like Grace [5] are not deterministic by construction, but the output of a program may depend on the number of threads used to execute the program. There are deterministic programming models that have “escapes” that allow non-deterministic programs to be written, such as DPJ with non-determinism [10] and PBBS [7], but there is no convenient method for producing deterministic executions on demand from such programs.

When determinism is provided by scheduling, it is usually not portable. In hardware systems like RCDC [17] and Calvin [21], the number of threads is a fundamental part of program representation, and any deterministic execution is always with respect to a particular number of threads. At the OS and compiler level, it is possible to virtualize the concept of threads to achieve portability, but current systems such as Determinator [2], CoreDet [3] and dOS [4] do not do this. Runtime approaches are replacements for non-deterministic program libraries. Both DThreads [24] and Kendo [25] model the behavior of pthreads, an explicitly threaded library, and are not portable.

Finally, some of the above systems have user-tunable scheduling parameters that affect execution performance and output: examples are RCDC, CoreDet, Kendo, and PBBS.

In this paper, we present a deterministic parallel system that is on-demand, portable and parameter-free. It takes high-level, non-deterministic programs written in the Galois programming model [26] as input. These programs can be executed non-deterministically with the Galois runtime, or they can be executed deterministically using a technique that we call deterministic interference graph (DIG) scheduling. The application program does not have to change when switching between non-deterministic and deterministic scheduling since the desired scheduler is specified through a command-line parameter.

To evaluate the quality of our deterministic programs, we compare their performance with handwritten deterministic parallel programs from the PBBS benchmark suite [7]. Our results show a median performance of 0.62X compared to the handwritten PBBS implementations. To highlight the importance of on-demand determinism, we compare the performance of non-deterministic Galois programs with that of deterministic PBBS programs. The non-deterministic programs achieve a median speedup of 2.4X over the PBBS implementations.

In addition, we compare our DIG scheduling approach with a prior deterministic system, CoreDet [3], on non-deterministic PBBS programs. CoreDet can make any threaded program deterministic. The CoreDet execution achieves scaling for only one out of the four applications. This is because unlike the PARSEC benchmarks [6] commonly used to evaluate deterministic parallel systems, our benchmarks have much more synchronization and smaller task sizes, which severely taxes systems with high scheduling overheads.

    foreach Task t in P:
      atomic:
        t()              // execute task
        enqueue(S(t))    // enqueue new tasks, if any

(a) Non-deterministic program.

    foreach Task t in P:
      if writeMarks(0, id(t), L(t)):
        // all writes successful
        t()
        enqueue(S(t))
      else:
        enqueue(t)
      writeMarks(id(t), 0, L(t))

(b) Scheduling non-deterministic programs.

Figure 1: Non-deterministic program and scheduling (see Figure 3 for auxiliary function definitions).

The rest of this paper is organized as follows. In Section 2, we describe the non-deterministic Galois programming model and provide a high-level overview of scheduling with interference graphs. In Section 3, we discuss the implementation of this approach in the Galois system. Section 4 summarizes the benchmarks used in the evaluation. In Section 5, we measure the performance of various deterministic and non-deterministic programs, as well as the performance of the programs with CoreDet. Section 6 summarizes related work, and Section 7 concludes the paper.

2. Galois programming model

The programming model used is an abstract version of the Galois programming model for unordered algorithms [26]. There are no explicitly parallel constructs like threads and locks; instead, parallelism is specified implicitly through the use of the iterator shown in Figure 1a. Key features are the following.

- P is a pool of tasks that can be performed in any order. The program terminates when all tasks have been executed.
- When task t is completed, it may create a set of new tasks S(t), which are added to the task pool. Task t is the parent of the tasks it creates; the transitive closure of the parent relation is called the ancestor relation.
- Each task performs computation and reads and writes shared-memory locations. The set of locations read R(t) and written W(t) by a task t is said to constitute its neighborhood, which is denoted by $L(t) = R(t) \cup W(t)$.
- Tasks are required to be cautious: that is, a task must read all of the locations in its neighborhood before it can write to any of them.
- A conflict occurs between tasks $t_1$ and $t_2$ if (i) neither task is an ancestor of the other, and (ii) one of them writes to the neighborhood of the other ($W(t_1) \cap L(t_2) \neq \emptyset$).
- If there can be no conflicts between tasks, parallel execution is straightforward since the program is a generalized data-parallel loop in which the iteration range can grow dynamically. In the presence of conflicts, a correct parallel schedule for the program should be serializable: it must appear as if all tasks were performed atomically in some order that respects the ancestor relation.

Although neighborhoods can be defined as sets of concrete memory locations, it is better to define them as sets of abstract memory locations; for example, for a graph algorithm, these abstract locations might correspond to graph elements such as nodes and edges of the graph, rather than to the concrete memory locations implementing this abstract data type. The Galois runtime system implements synchronization by associating abstract locks or marks with abstract locations (§2.1).

2.1 Non-deterministic scheduling

In most programs, neighborhoods of tasks are not known statically. One parallelization strategy is to use speculative execution: a free thread selects an arbitrary task from the task pool P and executes it, rolling back the task if a conflict with another concurrently executing task is detected. Tracking of neighborhoods can be done using a transactional-memory-like approach over abstract memory locations [19, 20].

For programming models with cautious tasks, conflict detection and correction can be done using much lighter weight mechanisms because the synchronization problem reduces to the well-known dining philosopher's problem [12]. Conceptually, each abstract location can be acquired by an owner. The execution of a task can be divided into two phases: in the first phase, a task reads locations but does not write to any of
them, acquiring ownership of these locations, and in the second phase, the task writes to some locations, but it does not write to any location that it did not read in the first phase. The point between the first and second phase is called the failsafe point. For cautious tasks, conflicts are detected in the first phase, and rollback is implemented simply by releasing ownership of all locations. Once the failsafe point has been crossed, global data structures can be updated in place without the need for backup copies of modified data.

Figure 1b implements this idea for non-deterministic task scheduling. For each abstract memory location l, we maintain a mark location Mark(l). Each task t has a unique id id(t), and there is an id 0 that is distinct from all other ids. For this implementation, ids need only be unique. For the deterministic scheduler below (§3), ids must also have a total order, and 0 must be less than any other id. Mark locations initially contain the value 0. The scheduler chooses a task t and tries to acquire ownership of all the locations in the neighborhood of that task, using compare-and-set instructions for example. If successful, the task executes and adds new tasks S(t) to the pool. If unsuccessful, there is a conflict, and the scheduler adds t back to the pool for re-execution. In either case, any marks acquired are reset back to 0.

3. Deterministic scheduling

In this section, we describe our implementation of deterministic scheduling. It is based on finding independent sets in an implicitly constructed interference graph of tasks. We first describe the high-level idea (§3.1), then its implementation (§3.2), and finally some important optimizations (§3.3).

3.1 DIG scheduling

Definition 1. Given a set of tasks P, an interference graph for P is an undirected graph $G_P = (V_P, E_P)$ in which there is a distinct node in $V_P$ representing each task in P, and there is an undirected edge $(v_1, v_2) \in E_P$ if the tasks represented by $v_1$ and $v_2$ have a conflict.

The interference graph for a set of tasks can be built by executing each task up to its failsafe point while tracking its neighborhood and putting a conflict edge between two tasks if their neighborhoods overlap. This is a conservative approach since it puts a conflict edge between two tasks even if they both read a location that neither of them writes to. Program analysis can be used to determine conflicts more accurately; for the purposes of this paper, any conservative interference graph is adequate.

Interference graphs can be used to schedule tasks as follows. The tasks in the task pool P are executed in rounds. In each round, the scheduler performs the following activities:

- inspect: build an interference graph $G_P$ for the tasks in P,
- select: find an independent set I of nodes in $G_P$ and remove the corresponding tasks from P, and
- execute: execute the tasks in I in parallel, adding any newly created tasks to P.

Scheduling is completed when all tasks have been executed. During the select phase, it is desirable but not necessary to find a maximal independent set of nodes in the graph.

A subtle point is that the interference graph must in general be rebuilt from scratch each round since the neighborhood of a task is relative to the global state, which is modified by tasks in the execute phase.

There is one enhancement to this basic scheme that is useful for reducing the overhead of interference graph construction. Note that the scheduling strategy works correctly even if, in each round, the interference graph is constructed only for a subset of tasks in the pool; the remaining tasks are simply delayed to later rounds. This windowing scheme can reduce the overhead of interference graph construction
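The inspect, select and execute rounds of Section 3.1 can be illustrated with a minimal Python sketch. It is not the Galois implementation and rests on several assumptions: tasks are plain integer ids, neighborhoods come from a hypothetical `neighborhood` callback rather than from running each task to its failsafe point, the ancestor exception in the conflict definition is ignored (which only makes the graph more conservative), and the "parallel" execute phase is a sequential loop. The greedy selection in id order is just one deterministic way to pick an independent set; all function names here are illustrative.

```python
from itertools import combinations

def interference_graph(tasks, neighborhood):
    """Inspect: conservative graph with an edge wherever neighborhoods overlap."""
    edges = {t: set() for t in tasks}
    for a, b in combinations(tasks, 2):
        if neighborhood(a) & neighborhood(b):
            edges[a].add(b)
            edges[b].add(a)
    return edges

def independent_set(tasks, edges):
    """Select: greedy maximal independent set, visiting tasks in id order."""
    chosen, blocked = [], set()
    for t in sorted(tasks):
        if t not in blocked:
            chosen.append(t)
            blocked |= edges[t]
    return chosen

def dig_schedule(pool, neighborhood, run, window=None):
    """Run tasks in rounds; returns the list of tasks executed in each round.

    run(t) executes task t and returns the ids of any tasks it creates
    (assumed unique). window, if given, limits how many pending tasks are
    inspected per round; the rest are delayed to later rounds.
    """
    pending = sorted(pool)
    rounds = []
    while pending:
        cur = pending if window is None else pending[:window]
        rest = [] if window is None else pending[window:]
        edges = interference_graph(cur, neighborhood)   # rebuilt every round
        chosen = independent_set(cur, edges)
        created = []
        for t in chosen:            # sequential stand-in for parallel execution
            created.extend(run(t))
        done = set(chosen)
        deferred = [t for t in cur if t not in done]
        pending = sorted(deferred + rest + created)
        rounds.append(chosen)
    return rounds

if __name__ == "__main__":
    hoods = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"e"}}
    print(dig_schedule(hoods, hoods.__getitem__, lambda t: []))
    print(dig_schedule(hoods, hoods.__getitem__, lambda t: [], window=2))
```

Because the selection depends only on task ids and neighborhoods, never on thread timing, the sequence of rounds is the same on every run; a smaller window inspects fewer tasks per round at the price of more rounds.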
locations of a task does not contain its id, the task is not part of the independent set, and it is placed in the next set to be executed in a future round. In either case, the marks written by a task are cleared in preparation for the next round. Execution continues in rounds until there are no tasks left in next. If there are no tasks in the todo set, the scheduler terminates; otherwise these tasks are moved to next, and execution continues. Note that in each round, the task in cur with maximum id is guaranteed to execute, so each round executes at least one task.

Before enqueued tasks can be scheduled, they must be assigned a unique id. The assignment of ids must also be deterministic. Ids are assigned as follows. The initial tasks are given ids based on the iteration order of the C++ iterator that contains the tasks. When task t creates task u, the scheduler stores with task u the id of the task that created it, id(t), and a number k indicating whether it was the first, second, third, etc. task created by t. In the sort function, tasks are sorted lexicographically based on the pair (id(t), k), and the scheduler uses the position in the total order defined by the sort as the id for the new tasks.

The performance of this scheduler depends critically on the window size, so we implemented an adaptive algorithm that grows and shrinks the window size each round depending on the number of tasks that successfully committed in the previous round. The getWindowOfTasks and calculateWindow functions in Figure 2 implement this functionality. The calculateWindow function computes the window size for the current round based on the fraction of tasks that committed in the previous round. If the commit ratio is less than some target threshold, the next window size is scaled down proportionally. If the commit ratio is above the threshold, the window size is doubled. The getWindowOfTasks function simply returns this prefix of tasks in cur and postpones the remainder to next. Since the number of tasks that commit in a round is independent of the number of executing threads, this heuristic is portable across machines.

To implement the deterministic marking scheme in the Galois system, we made two changes to the existing system. First, the default mark values in the Galois system are not ordered. We modified the marking code to keep track of the id of a task and to use that value appropriately when writing mark values.

Second, neighborhoods are not explicitly maintained by the Galois system. Marks are acquired incrementally during execution via user code calls to a data structure library. The only way to get the neighborhood of a task is to execute the task and observe which marks are acquired. To implement the inspect phase, we simply execute a task, which, by its normal execution, marks locations in its neighborhood. When the task reaches its failsafe point (the first write to a global location), it immediately returns. To implement the selectAndExec phase, we re-execute the task from the beginning, and instead of writing marks, we check whether the marks that we would have written match the values that have been written. This implements line 11 of Figure 3. If a task reads a mark value that is not its id, we go to line 17.

This baseline implementation is sufficient to deterministically schedule any program written in the programming model of Figure 1a.

3.3 Optimizations

The baseline deterministic scheduler described in Section 3.2 contains several inefficiencies, which we now address.

First, it redundantly executes the prefix of a task up to its failsafe point when a task is selected and executed. A more efficient method would be to suspend execution of a task at the failsafe point during the inspect phase and to resume execution in the commit phase. On resumption, the task must check that all the mark values still match its id (Figure 3 line 11). The capability to pause and resume execution can be achieved generally using additional threads or creating continuations. In our optimized implementation, we use a more ad-hoc approach, which simulates the effect of forming a continuation without implementing a full compiler transform. We provide a library function that allows users to allocate objects in the inspect phase which can be recalled during the commit phase. Programmers can use this functionality to manually achieve the same effect as task suspend and resume.

To make sure that resumed tasks are valid to commit, we make a small change to the protocol in the inspect phase. Instead of just writing the maximum mark value, a task t checks if the previous value of the mark location is not 0 and not id(t); if so, by writing its mark, task t will prevent the task u that corresponds to the current mark value from committing. Normally, task u detects this case when the scheduler executes line 11, or in the case of the baseline scheduler, when task u is executed a second time. When using the continuation optimization, t is now responsible for preventing u from executing. It does this by writing to a flag variable that u checks before resuming execution.

Second, the performance of the scheduler is very sensitive to initial task order. Applications that exploit temporal locality execute tasks with overlapping neighborhoods close in time. This typically translates to those tasks being close together in iteration order, which, in the baseline scheduler implementation, means that they typically will be executed in the same round, where they will certainly conflict with each other. This leads to the perverse situation where the scheduler needs to reduce locality to improve performance. We address this issue by assuming that tasks placed close together in iteration order have high locality and place those tasks in separate rounds if possible.

Third, the cost of sorting enqueued tasks can be large relative to the application time. There is a common special case where a task enqueues tasks, but those tasks are drawn from a fixed set of tasks. In this case, tasks can be assigned unique ids before parallel execution, and the programmer
can pass these ids to the scheduler, which uses them directly instead of generating new ids via the sort function.

3.4 Comparison of non-deterministic and deterministic schedulers

Compared to the non-deterministic scheduling in Section 2.1, DIG scheduling adds several overheads, whose performance impact we evaluate in Section 5.

- As seen in Figures 1b and 2, the deterministic scheduler executes many more instructions.
- The deterministic scheduler introduces a concept of rounds that is not present in the original program. These rounds are implemented using global synchronization. Rounds extend the critical path length of a program because the scheduler cannot proceed to the next round until all of the tasks are processed for the current round.
- The scheduler executes tasks according to a particular schedule, but that schedule may not be the best performing one among possible program schedules.
- The execution of a task is broken into two parts, the inspect phase and the execution phase, separated by a barrier. The memory locations accessed during the inspect phase of a task are very likely to be accessed by the execution phase of the same task, but under DIG scheduling, these two phases are temporally separated by a factor that is a function of the number of tasks attempted during a round, which is typically very large. Conversely, increasing locality by reducing the number of tasks attempted in a round increases the number of rounds executed, which increases the critical path length of the program.

4. Experimental setup

4.1 Applications

The benchmarks in our study are drawn from three different sources: the PARSEC (v2.1) benchmark suite [6], the problem based benchmark suite (PBBS) (v0.1) [7], and the Lonestar (v2.1.5) benchmark suite [22].

PARSEC. The PARSEC benchmark suite has been used in previous evaluations of deterministic scheduling [3, 17, 24]. It contains twelve applications or kernels. Most are parallelized using the pthreads library. We chose the three benchmarks that have OpenMP implementations: blackscholes, bodytrack and freqmine.

PBBS. The PBBS programs [7] are organized by problem, and each problem has one or more solution programs, at least one of which is deterministic. There are a total of sixteen problems, but many of these programs are data-parallel or nested data-parallel, and their performance depends largely on factors like good load balancing, which is not the subject of this paper. We therefore excluded them from our study. We chose deterministic programs that solved the four remaining problems: breadth-first search (bfs), Delaunay triangulation (dt), Delaunay mesh refinement (dmr), and maximal independent set (mis). We exclude maximal matching because of its similarity to maximal independent set. In these codes, determinism is ensured by application-specific techniques customized to each application, and they typically involve bulk-synchronous execution in rounds. The PBBS maximal independent set program is data-parallel, but we have included it in our study for comparison with a non-deterministic maximal independent set program that exists in the Lonestar suite.

Lonestar. From the Lonestar benchmark suite, we selected four programs that solve the same problems as those we included from PBBS, using the same algorithms, and an implementation of the preflow-push algorithm (pfp) that uses the global relabeling heuristic to improve convergence [13]. We automatically generate deterministic implementations of all Lonestar programs by applying the DIG scheduling of Section 3, including the optimizations of Section 3.3.

There is one small difference between the PBBS and Lonestar implementations of Delaunay triangulation (dt). The algorithmic complexity of Delaunay triangulation depends on the order in which points are inserted, and random insertion order has been shown to be optimal [14]. In the PBBS implementation, points are randomized offline. In the Lonestar implementation, points are reordered online using the biased randomized insertion order algorithm [1]. For comparison purposes, we do not include the reordering time in either implementation. As mentioned above, the Lonestar maximal independent set program is non-deterministic while the PBBS version is data-parallel.

Application variants. In the experimental results, the variant g-n denotes the original non-deterministic Lonestar application, and the deterministic variant generated from DIG scheduling is called g-d. The variant PBBS denotes the PBBS version of the application.

4.2 Data-sets

For the PARSEC benchmarks, we use the simlarge inputs for blackscholes and freqmine and the native input for bodytrack. The performance of the PBBS and Lonestar benchmarks can vary significantly with the type of input. In our experience, the behavior across random inputs for an application is largely similar, so we choose a single representative input for each application. These inputs are largely drawn from the evaluation of Blelloch et al. [7]. For bfs, we use a random graph of 10 million nodes where each node is connected to five randomly selected nodes. For dmr, we use a Delaunay triangulated mesh of 2.5 million randomly selected points from the unit square. For dt, we use 10 million points randomly selected from a unit square. For mis, we use the same input as bfs. For pfp, we use a random graph of $2^{23}$ nodes with each node connected to 4 random neighbors.

communicate relatively infrequently. Deterministic scheduling for these kinds of applications can be supported using relatively heavyweight mechanisms since the overhead of the system is a small fraction of the overall execution time. However, these mechanisms may not be useful for applications with very lightweight tasks that communicate frequently. On a shared-memory system, the concept of communication is less well-defined compared to a distributed system, but one approximation is the number of atomic updates an application performs.

Figures 4 and 5 show task execution rates, abort ratios, and atomic update rates for our applications on 1 thread and 40 threads on machine m4x10. For the deterministic variants, the number of rounds is also shown. For PBBS variants, this is the number of bulk-synchronous rounds of the handwritten deterministic scheduling.

First, we see that the PBBS and Lonestar benchmarks have very fine-grain tasks. For example, the g-n version of dmr, running on one thread, commits 0.26 tasks per microsecond (see Figure 4), which translates to 3.8 microseconds per task (this is the parallel version of the code with synchronization, running on one thread), which is on the order of a thousand cycles. On 40 threads, this parallel program commits roughly 9 tasks per microsecond, which translates to a throughput of roughly 0.11 microseconds per task.

Second, we see that the abort ratios of the g-n variants of all applications are essentially zero even at 40 threads. Conflicts between tasks in the non-deterministic variants are very rare: this is because there are a large number of tasks compared to the number of threads. The deterministic variants g-d and PBBS have larger abort ratios because in each round, the number of tasks whose neighborhoods are inspected is typically larger than the number of threads. Conflicts can also happen with only one thread when two tasks with overlapping neighborhoods are inspected in the same round.

Third, the PARSEC benchmarks (blackscholes, bodytrack and freqmine), which are frequently used to evaluate deterministic schedulers, have orders of magnitude fewer atomic updates than the irregular algorithms of the PBBS and Lonestar suites (see Figure 5). For example, blackscholes at 40 threads performs atomic updates at a rate of about 1 update per microsecond, while the mis g-n variant performs atomic updates at the rate of 100 updates per microsecond.

These qualitative differences in application characteristics significantly impact the design of deterministic schedulers, as we show in the following sections.

5.2 Deterministic thread scheduling

In this section, we present performance results from using CoreDet, a deterministic thread scheduler, on our benchmark applications. Unlike DIG scheduling, CoreDet runs on unmodified pthread programs.

Figure 6: Speedup with (solid lines) and without (dotted lines) CoreDet system on non-deterministic programs. Speedup baselines are in Figure 8. Some dmr and dt runs on numa8x4 timed out after 10 minutes.

Ideally, we would like to run the PARSEC and g-n non-deterministic programs with CoreDet to make them deterministic. Unfortunately, the CoreDet compiler is based on the older LLVM 2.6 compiler, and it is unable to compile any of the g-n programs. To get around this problem, we exploit the fact that bfs, dmr and dt in PBBS are deterministic implementations of non-deterministic algorithms. To make them non-deterministic, we transform the programs by hand to match the non-deterministic program pattern shown in Figure 1 and run these with CoreDet. We leave the mis benchmark as a data-parallel program.

We use the CoreDet system in low-overhead, synchronization-only mode, which reduces the system to an implementation of the Kendo algorithm [25] and requires all synchronization between threads to use the pthread library.

Figure 11: Samples of DRAM access performance counter on machine m4x10.
Figure 12: Effect of ordering input for mis. Speedup is relative to the best variant with one thread among the two inputs (g-n, 4.3 seconds). The ordered input is a graph of a 2D mesh. The nodes are sorted according to a space-filling curve.

The determinism guarantee of these systems is still quite fragile, because the insertion of a single instruction will produce a program that generates different outputs. Also, performance is sensitive to the task length. Devietti et al. show that system overheads can vary between 160%–250% depending on the task size parameter [17].

In contrast, systems like Grace and DThreads form their tasks based on synchronization instructions, which means that adding non-synchronization instructions will not change the decomposition of the program into tasks. However, this flexibility comes at a cost as tasks are now quite long, and load balancing becomes an issue. DThreads uses a sequential token passing algorithm to deterministically process synchronization events, so the entire sequence of instructions bounded by synchronization instructions is blocked waiting for the token. Kendo, which breaks tasks up into smaller pieces, can extract more parallelism by executing a prefix of instructions before the synchronization instruction. More recently, Cui et al. have proposed that users add performance hints akin to thread barriers to improve the load balancing of a deterministic scheduler [15].

Kendo, CoreDet, Determinator and some PBBS programs [7] have a tunable parameter that controls the task or round size, but no method to adaptively set that parameter based on observed execution. dOS uses instruction-based task formation, but it uses an adaptive algorithm like the one described in Section 3.2 to deterministically adjust the task size based on observed parallelism. Calvin uses a standard hardware two-bit predictor to dynamically increase task size when there is no synchronization in a task.

7. Conclusion

Deterministic execution of parallel programs has certain advantages such as reproducibility of results, which makes debugging easier. There is a substantial body of recent work on enforcing determinism at the programming language and programming model level, and at the system level through deterministic scheduling. We believe that any such system should provide three features: on-demand determinism, portability and parameter-freedom across platforms, which no existing system currently does.

In this paper, we considered the problem of ensuring deterministic execution for irregular programs, which are particularly challenging because task sizes are smaller and tasks communicate more frequently than in conventional parallel programs like the PARSEC benchmarks. Irregular programs have not been studied much in this context, and we showed that these programs do not scale when executed on a current determinism-by-scheduling system. Our solution takes high-level non-deterministic programs written in the Galois model and implements determinism automatically by runtime scheduling. In many instances, the resulting programs have performance comparable to handwritten deterministic programs.

However, we showed on three different platforms that non-deterministic programs perform substantially better than their handwritten deterministic counterparts, demonstrating that there is a performance penalty for enforcing deterministic execution. This performance difference arises because the deterministic versions have less intra-task and inter-task locality. Therefore, we believe that determinism on demand, which leaves the choice between performance and determinism to the application user, is a reasonable design point.