DieCast: Testing Distributed Systems with an Accurate Scale Model

Diwaker Gupta, Kashi V. Vishwanath, and Amin Vahdat
University of California, San Diego
{dgupta, kvishwanath, vahdat}@cs.ucsd.edu

Abstract

Large-scale network services can consist of tens of thousands of machines running thousands of unique software configurations spread across hundreds of physical networks. Testing such services for complex performance problems and configuration errors remains a difficult problem. Existing testing techniques, such as simulation or running smaller instances of a service, have limitations in predicting overall service behavior.

Although technically and economically infeasible at this time, testing should ideally be performed at the same scale and with the same configuration as the deployed service. We present DieCast, an approach to scaling network services in which we multiplex all of the nodes in a given service configuration as virtual machines (VMs) spread across a much smaller number of physical machines in a test harness. CPU, network, and disk are then accurately scaled to provide the illusion that each VM matches a machine from the original service in terms of both available computing resources and communication behavior to remote service nodes. We present the architecture and evaluation of a system to support such experimentation and discuss its limitations. We show that for a variety of services (including a commercial, high-performance, cluster-based file system) and resource utilization levels, DieCast matches the behavior of the original service while using a fraction of the physical resources.

1 Introduction

Today, more and more services are being delivered by complex systems consisting of large ensembles of machines spread across multiple physical networks and geographic regions. Economies of scale, incremental scalability, and good fault isolation properties have made clusters the preferred architecture for building planetary-scale services. A single logical request may touch dozens of machines on multiple networks, all providing instances of services transparently replicated across multiple machines. Services consisting of tens of thousands of machines are commonplace [11].

Economic considerations have pushed service providers to a regime where individual service machines must be made from commodity components; saving an extra $500 per node in a 100,000-node service is critical. Similarly, nodes run commodity operating systems, with only moderate levels of reliability, and custom-written applications that are often rushed to production because of the pressures of "Internet Time." In this environment, failure is common [24] and it becomes the responsibility of higher-level software architectures, usually employing custom monitoring infrastructures and significant service and data replication, to mask individual, correlated, and cascading failures from end clients.

One of the primary challenges facing designers of modern network services is testing their dynamically evolving system architecture. In addition to the sheer scale of the target systems, challenges include: heterogeneous hardware and software, dynamically changing request patterns, complex component interactions, failure conditions that only manifest under high load [21], the effects of correlated failures [20], and bottlenecks arising from complex network topologies. Before upgrading any aspect of a networked service (the load balancing/replication scheme, individual software components, the network topology), architects would ideally create an exact copy of the system, modify the single component to be upgraded, and then subject the entire system to both historical and worst-case workloads. Such testing must include subjecting the system to a variety of controlled failure and attack scenarios since problems with a particular upgrade will often only be revealed under certain specific conditions.

Creating an exact copy of a modern networked service for testing is often technically challenging and economically infeasible.
The architecture of many large-scale networked services can be characterized as "controlled chaos," where it is often impossible to know exactly what the hardware, software, and network topology of the system look like at any given time. Even when the precise hardware, software, and network configuration of the system is known, the resources to replicate the production environment might simply be unavailable, particularly for large services. And yet, reliable, low overhead, and economically feasible testing of network services remains critical to delivering robust higher-level services.

The goal of this work is to develop a testing methodology and architecture that can accurately predict the behavior of modern network services while employing an order of magnitude fewer hardware resources. For example, consider a service consisting of 10,000 heterogeneous machines, 100 switches, and hundreds of individual software configurations. We aim to configure a smaller number of machines (e.g., 100-1000 depending on service characteristics) to emulate the original configuration as closely as possible and to subject the test infrastructure to the same workload and failure conditions as the original service. The performance and failure response of the test system should closely approximate the real behavior of the target system. Of course, these goals are infeasible without giving something up: if it were possible to capture the complex behavior and overall performance of a 10,000-node system on 1,000 nodes, then the original system should likely run on 1,000 nodes.

A key insight behind our work is that we can trade time for system capacity while accurately scaling individual system components to match the behavior of the target infrastructure. We employ time dilation to accurately scale the capacity of individual systems by a configurable factor [19]. Time dilation fully encapsulates operating systems and applications such that the rate at which time passes can be modified by a constant factor. A time dilation factor (TDF) of 10 means that for every second of real time, all software in a dilated frame believes that time has advanced by only 100 ms. If we wish to subject a target system to a one-hour workload when scaling the system by a factor of 10, the test would take 10 hours of real time. For many testing environments, this is an appropriate tradeoff. Since the passage of time is slowed down while the rate of external events (such as network I/O) remains unchanged, the system appears to have substantially higher processing power and faster network and disk.
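To make the TDF arithmetic concrete, here is a minimal sketch (ours, not part of DieCast) of the conversion between real time and the time perceived inside a dilated frame:

```python
# Minimal sketch (not DieCast code): converting between real time and the
# time perceived inside a dilated frame, for a given time dilation factor.

def perceived_elapsed(real_seconds: float, tdf: float) -> float:
    """Time that software inside the dilated frame believes has passed."""
    return real_seconds / tdf

def real_duration(perceived_seconds: float, tdf: float) -> float:
    """Real (wall-clock) time needed to present a given perceived duration."""
    return perceived_seconds * tdf

# With TDF = 10, one real second is perceived as 100 ms, and a one-hour
# workload takes 10 hours of real time, matching the examples above.
assert perceived_elapsed(1.0, 10) == 0.1
assert real_duration(3600.0, 10) == 36000.0
```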
In this paper, we present DieCast, a complete environment for building accurate models of network services (Section 2). Critically, we run the actual operating systems and application software of some target environment on a fraction of the hardware in that environment. This work makes the following contributions. First, we extend our original implementation of time dilation [19] to support fully virtualized as well as paravirtualized hosts. To support complete system evaluations, our second contribution shows how to extend dilation to disk and CPU (Section 3). In particular, we integrate a full disk simulator into the virtual machine monitor (VMM) to consider a range of possible disk architectures. Finally, we conduct a detailed system evaluation, quantifying DieCast's accuracy for a range of services, including a commercial storage system (Sections 4 and 5). The goals of this work are ambitious, and while we cannot claim to have addressed all of the myriad challenges associated with testing large-scale network services (Section 6), we believe that DieCast shows significant promise as a testing vehicle.

2 System Architecture

We begin by providing an overview of our approach to scaling a system down to a target test harness. We then discuss the individual components of our architecture.

2.1 Overview

Figure 1 gives an overview of our approach. On the left (Figure 1(a)) is an abstract depiction of a network service. A load balancing switch sits in front of the service and redirects requests among a set of front-end HTTP servers. These requests may in turn travel to a middle tier of application servers, who may query a storage tier consisting of databases or network attached storage. Figure 1(b) shows how a target service can be scaled with DieCast. We encapsulate all nodes from the original service in virtual machines and multiplex several of these VMs onto physical machines in the test harness. Critically, we employ time dilation in the VMM running on each physical machine to provide the illusion that each virtual machine has, for example, as much processing power, disk I/O, and network bandwidth as the corresponding host in the original configuration despite the fact that it is sharing underlying resources with other VMs.

[Figure 1: Scaling a network service to the DieCast infrastructure. (a) Original System; (b) Test System.]

DieCast configures VMs to communicate through a network emulator to reproduce the characteristics of the original system topology. We then initialize the test system using the setup routines of the original system and subject it to appropriate workloads and fault-loads to evaluate system behavior. The overall goal is to improve predictive power. That is, runs with DieCast on smaller machine configurations should accurately predict the performance and fault tolerance characteristics of some larger production system. In this manner, system developers may experiment with changes to system architecture, network topology, software upgrades, and new functionality before deploying them in production. Successful runs with DieCast should improve confidence that any changes to the target service will be successfully deployed. Below, we discuss the steps in applying our general approach of DieCast scaling to target systems.

2.2 Choosing the Scaling Factor

The first question to address is the desired scaling factor. One use of DieCast is to reproduce the scale of an original service in a test cluster. Another application is to scale existing test harnesses to achieve more realism than possible from the raw hardware. For instance, if 100 nodes are already available for testing, then DieCast might be employed to scale to a thousand-node system with a more complex communication topology. While the DieCast system may still fall short of the scale of the original service, it can provide more meaningful approximations under more intense workloads and failure conditions than might have otherwise been possible.

Overall, the goal is to pick the largest scaling factor possible while still obtaining accurate predictions from DieCast, since the prediction accuracy will naturally degrade with increasing scaling factors. This maximum scaling factor depends on the characteristics of the target system. Section 6 highlights the potential limitations of DieCast scaling. In general, scaling accuracy will degrade with: i) application sensitivity to the fine-grained timing behavior of external hardware devices; ii) capacity-constrained physical resources; and iii) system devices not amenable to virtualization.

In the first category, application interaction with I/O devices may depend on the exact timing of requests and responses. Consider for instance a fine-grained parallel application that assumes all remote instances are co-scheduled. A DieCast run may mispredict performance if target nodes are not scheduled at the time of a message transmission to respond to a blocking read operation. If we could interleave at the granularity of individual instructions, then this would not be an issue. However, context switching among virtual machines means that we must pick time slices on the order of milliseconds. Second, DieCast cannot scale the capacity of hardware components such as main memory, processor caches, and disk. Finally, the original service may contain devices such as load balancing switches that are not amenable to virtualization or dilation. Even with these caveats, we have successfully applied scaling factors of 10 to a variety of services with near-perfect accuracy as discussed in Sections 4 and 5.

Of the above limitations to scaling, we consider capacity limits for main memory and disk to be most significant. However, we do not believe this to be a fundamental limitation. For example, one partial solution is to configure the test system with more memory and storage than the original system.
While this will reduce some of the economic benefits of our approach, it will not erase them. For instance, doubling a machine's memory will not typically double its hardware cost. More importantly, it will not substantially increase the typically dominant human cost of administering a given test infrastructure, because the number of required administrators for a given test harness usually grows with the number of machines in the system rather than with the total memory of the system.

Looking forward, ongoing research in VMM architectures has the potential to reclaim some of the memory [32] and storage overhead [33] associated with multiplexing VMs on a single physical machine. For instance, four nearly identically configured Linux machines running the same web server will overlap significantly in terms of their memory and storage footprints. Similarly, consider an Internet service that replicates content for improved capacity and availability. When scaling the service down, multiple machines from the original configuration may be assigned to a single physical machine. A VMM capable of detecting and exploiting available redundancy could significantly reduce the incremental storage overhead of multiplexing multiple VMs.

2.3 Cataloging the Original System

The next task is to configure the appropriate virtual machine images onto our test infrastructure. Maintaining a catalog of the hardware and software configuration that comprises an Internet service is challenging in its own right. However, for the purposes of this work, we assume that such a catalog is available. This catalog would consist of all of the hardware making up the service, the network topology, and the software configuration of each node. The software configuration includes the operating system, installed packages and applications, and the initialization sequence run on each node after booting.

The original service software may or may not run on top of virtual machines. However, given the increasing benefits of employing virtual machines in data centers for service configuration and management, and the popularity of VM-based appliances that are pre-configured to run particular services [7], we assume that the original service is in fact VM-based. This assumption is not critical to our approach, but it also partially addresses any baseline performance differential between a node running on bare hardware in the original service and the same node running on a virtual machine in the test system.
2.4 Configuring the Virtual Machines

With an understanding of appropriate scaling factors and a catalog of the original service configuration, DieCast then configures individual physical machines in the test system with multiple VM images reflecting, ideally, a one-to-one map between physical machines in the original system and virtual machines in the test system. With a scaling factor of 10, each physical node in the target system would host 10 virtual machines. The mapping from physical machines to virtual machines should account for: similarity in software configurations, per-VM memory and disk requirements, and the capacity of the hardware in the original and test systems. In general, a solver may be employed to determine a near-optimal matching [26]. However, given the VM migration capabilities of modern VMMs and DieCast's controlled network emulation environment, the actual location of a VM is not as significant as in the original system.

DieCast then configures the VMs such that each VM appears to have resources identical to a physical machine in the original system. Consider a physical machine hosting 10 VMs. DieCast would run each VM with a scaling factor of 10, but allocate each VM only 10% of the actual physical resource. DieCast employs a non-work-conserving scheduler to ensure that each virtual machine receives no more than its allotted share of resources even when spare capacity is available. Suppose a CPU-intensive task takes 100 seconds to finish on the original machine. The same task would now take 1,000 seconds (of real time) on a dilated VM, since it can only use a tenth of the CPU. However, since the VM is running under time dilation, it only perceives that 100 seconds have passed. Thus, in the VM's time frame, resources appear equivalent to the original machine. We only explicitly scale CPU and disk I/O latency on the host; scaling of network I/O happens via network emulation as described next.

2.5 Network Emulation

The final step in the configuration process is to match the network configuration of the original service using network emulation. We configure all VMs in the test system to route all their communication through our emulation environment. Note that DieCast is not tied to any particular emulation technology: we have successfully used DieCast with Dummynet [27], ModelNet [31], and Netem [3] where appropriate.

It is likely that the bisection bandwidth of the original service topology will be larger than that available in the test system. Fortunately, time dilation is of significant value here. Convincing a virtual machine scaled by a factor of 10 that it is receiving data at 1 Gbps only requires forwarding data to it at 100 Mbps. Similarly, it may appear that latencies in an original cluster-based service may be low enough that the additional software forwarding overhead associated with the emulation environment could make it difficult to match the latencies in the original network. To our advantage, maintaining accurate latency with time dilation actually requires increasing the real-time delay of a given packet; e.g., a 100 μs delay network link in the original network should be delayed by 1 ms when dilating by a factor of 10.

Note that the scaling factor need not match the TDF. For example, if the original network topology is so large/fast that even with a TDF of 10 the network emulator is unable to keep up, it is possible to employ a time dilation factor of 20 while maintaining a scaling factor of 10. In such a scenario, there would still on average be 10 virtual machines multiplexed onto each physical machine, however the VMM scheduler would allocate only 5% of the physical machine's resources to individual machines (meaning that 50% of CPU resources will go idle). The TDF of 20, however, would deliver additional capacity to the network emulation infrastructure to match the characteristics of the original system.
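The following minimal sketch (ours, not DieCast code; the function name and return fields are our own) shows how the per-VM CPU cap and emulated link parameters follow from a scaling factor S and a TDF that may exceed it:

```python
# Minimal sketch (not DieCast code) of how test-harness parameters follow
# from a scaling factor S and a time dilation factor TDF >= S.

def harness_parameters(scale: int, tdf: int,
                       orig_bw_mbps: float, orig_latency_ms: float) -> dict:
    """Per-VM CPU share and emulated link parameters for one physical host."""
    assert tdf >= scale, "TDF must be at least the scaling factor"
    return {
        "vms_per_machine": scale,                      # S original nodes per host
        "cpu_share": 1.0 / tdf,                        # non-work-conserving CPU cap
        "emulated_bw_mbps": orig_bw_mbps / tdf,        # dilated VMs perceive full rate
        "emulated_latency_ms": orig_latency_ms * tdf,  # preserve perceived delay
    }

# Scaling factor 10 with TDF 20: 10 VMs per host, each capped at 5% of the
# CPU; a 1 Gbps, 0.1 ms link is emulated at 50 Mbps with 2 ms of delay.
print(harness_parameters(scale=10, tdf=20, orig_bw_mbps=1000, orig_latency_ms=0.1))
```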
2.6 Workload Generation

Once DieCast has prepared the test system to be resource equivalent to the original system, we can subject it to an appropriate workload. These workloads will in general be application-specific. For instance, Monkey [15] shows how to replay a measured TCP request stream sent to a large-scale network service. For this work, we use application-specific workload generators where available and in other cases write our own workload generators that both capture normal behavior as well as stress the service under extreme conditions.

To maintain a target scaling factor, clients should also ideally run in DieCast-scaled virtual machines. This approach has the added benefit of allowing us to subject a test service to a high level of perceived load using relatively few resources. Thus, DieCast scales not only the capacity of the test harness but also the workload generation infrastructure.

3 Implementation

We have implemented DieCast support on several versions of Xen [10]: v2.0.7, v3.0.4, and v3.1 (both paravirtualized and fully virtualized VMs). Here we focus on the Xen 3.1 implementation. We begin with a brief overview of time dilation [19] and then describe the new features required to support DieCast.

3.1 Time Dilation

Critical to time dilation is a VMM's ability to modify the perception of time within a guest OS. Fortunately, most VMMs already have this functionality, for example, because a guest OS may develop a backlog of "lost ticks" if it is not scheduled on the physical processor when it is due to receive a timer interrupt. Since the guest OS running in a VM does not run continuously, VMMs periodically synchronize the guest OS time with the physical machine's clock. The only requirement for a VMM to support time dilation is this ability to modify the VM's perception of time. In fact, as we demonstrate in Section 5, the concept of time dilation can be ported to other (non-virtualized) environments.

Operating systems employ a variety of time sources to keep track of time, including timer interrupts (e.g., the Programmable Interrupt Timer or PIT), specialized counters (e.g., the TSC on Intel platforms), and external time sources such as NTP. Time dilation works by intercepting the various time sources and scaling them appropriately to fully encapsulate the OS in its own time frame.

Our original modifications to Xen for paravirtualized hosts [19] therefore appropriately scale time values exposed to the VM by the hypervisor. Xen exposes two notions of time to VMs. Real time is the number of nanoseconds since boot, and wall clock time is the traditional Unix time since epoch. While Xen allows the guest OS to maintain and update its own notion of time via an external time source (such as NTP), the guest OS often relies solely on Xen to maintain accurate time. Real and wall clock time pass between the Xen hypervisor and the guest operating system via a shared data structure. Dilation uses a per-domain TDF variable to appropriately scale real time and wall clock time. It also scales the frequency of timer interrupts delivered to a guest OS, since these timer interrupts often drive the internal timekeeping of a guest. Given these modifications to Xen, our earlier work showed that network dilation matches undilated baselines for complex per-flow TCP behavior in a variety of scenarios [19].
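As a rough illustration of this interception, the sketch below (ours, not Xen code; the class name is our invention) shows the two scalings a dilated guest would observe, assuming only that every time source and the timer interrupt rate are divided by the per-domain TDF:

```python
# Minimal sketch (not Xen code) of what a dilated guest observes: every
# time source is divided by the per-domain TDF, and timer interrupts are
# delivered at 1/TDF of the native frequency.

import time

class DilatedClock:
    """Presents a time source whose rate is slowed by a factor of tdf."""

    def __init__(self, tdf: float):
        self.tdf = tdf
        self.boot = time.monotonic()  # stands in for host boot time

    def guest_real_time(self) -> float:
        """'Real time' (seconds since boot) as the dilated guest perceives it."""
        return (time.monotonic() - self.boot) / self.tdf

    def timer_interrupt_hz(self, native_hz: float) -> float:
        """Timer interrupt rate to deliver so guest timekeeping stays consistent."""
        return native_hz / self.tdf

clock = DilatedClock(tdf=10)
# A guest expecting 100 Hz timer ticks receives them at 10 Hz of real time.
assert clock.timer_interrupt_hz(100) == 10
```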
3.2 Support for OS Diversity

Our original time dilation implementation only worked with paravirtualized machines, with two major drawbacks: it supported only Linux as the guest OS, and the guest kernel required modifications. Generalizing to other platforms would have required code modifications to the respective OS. To be widely applicable, DieCast must support a variety of operating systems.

To address these limitations, we ported time dilation to support fully virtualized (FV) VMs, enabling DieCast to support unmodified OS images. Note that FV VMs require platforms with hardware support for virtualization, such as Intel VT or AMD SVM. While Xen support for fully virtualized VMs differs significantly from the paravirtualized VM support in several key areas such as I/O emulation, access to hardware registers, and time management, the general idea behind the implementation remains the same: we want to intercept all sources of time and scale them.

In particular, our implementation scales the PIT, the TSC register (on x86), the RTC (Real Time Clock), the ACPI power management timer, and the High Performance Event Timer (HPET). As in the original implementation, we also scale the number of timer interrupts delivered to a fully virtualized guest. We allow each VM to run with an independent scaling factor. Note, however, that the scaling factor is fixed for the lifetime of a VM; it cannot be changed at runtime.

3.3 Scaling Disk I/O and CPU

Time dilation as described in [19] did not scale disk performance, making it unsuitable for services that perform significant disk I/O. Ideally, we would scale individual disk requests at the disk controller layer. The complexity of modern drive architectures, particularly the fact that much low-level functionality is implemented in firmware, makes such implementations challenging. Note that simply delaying requests in the device driver is not sufficient, since disk controllers may re-order and batch requests for efficiency. On the other hand, functionality embedded in hardware or firmware is difficult to instrument and modify. Further complicating matters are the different I/O models in Xen: one for paravirtualized (PV) VMs and one for fully virtualized (FV) VMs. DieCast provides mechanisms to scale disk I/O for both models.

For FV VMs, DieCast integrates a highly accurate and efficient disk system simulator, Disksim [17], which gives us a good trade-off between realism and accuracy. Figure 2(a) depicts our integration of Disksim into the fully virtualized I/O model: for each VM, a dedicated user-space process (ioemu) in Domain-0 performs I/O emulation by exposing a "virtual disk" to the VM (the guest OS is unaware that a real disk is not present). A special file in Domain-0 serves as the backend storage for the VM's disk. To allow ioemu to interact with Disksim, we wrote a wrapper around the simulator for inter-process communication.

After servicing each request (but before returning), ioemu forwards the request to Disksim, which then returns the time, rt, the request would have taken in its simulated disk. Since we are effectively layering a software disk on top of ioemu, each request should ideally take exactly time rt in the VM's time frame, or tdf · rt in real time. If delay is the amount by which this request is delayed, the total time spent in ioemu becomes delay + dt + st, where st is the time taken to actually serve the request (Disksim only simulates I/O characteristics; it does not deal with the actual disk content) and dt is the time taken to invoke Disksim itself. The required delay is then (tdf · rt) − dt − st.
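A minimal sketch of this delay computation (ours, not the actual ioemu wrapper) follows directly from the definitions above:

```python
# Minimal sketch (not the DieCast ioemu wrapper) of the per-request delay
# for the FV I/O path. rt is the service time reported by Disksim, st the
# time actually spent serving the request from the backing file, and dt the
# time spent invoking Disksim itself (all in seconds).

def ioemu_delay(rt: float, st: float, dt: float, tdf: int) -> float:
    """Extra delay so the request takes tdf * rt of real time overall,
    which the dilated guest perceives as exactly rt."""
    delay = tdf * rt - dt - st
    # If simulation plus actual service already exceeded the real-time
    # budget, there is nothing left to wait for.
    return max(delay, 0.0)

# Example: simulated service time 5 ms, actual service 1 ms, Disksim
# invocation 0.1 ms, TDF 10 -> inject 48.9 ms of additional delay.
assert abs(ioemu_delay(0.005, 0.001, 0.0001, 10) - 0.0489) < 1e-9
```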
[Figure 2: Scaling Disk I/O. (a) I/O Model for FV VMs; (b) I/O Model for PV VMs; (c) DBench throughput under Disksim.]

The architecture of Disksim, however, is not amenable to integration with the PV I/O model (Figure 2(b)). In this "split I/O" model, a front-end driver in the VM (blkfront) forwards requests to a back-end driver in Domain-0 (blkback), which are then serviced by the real disk device driver. Thus PV I/O is largely a kernel activity, while Disksim runs entirely in user space. Further, a separate Disksim process would be required for each simulated disk, whereas there is a single back-end driver for all VMs.

For these reasons, for PV VMs, we inject the appropriate delays in the blkfront driver. This approach has the additional advantage of containing the side effects of such delays to individual VMs: blkback can continue processing other requests as usual. Further, it eliminates the need to modify disk-specific drivers in Domain-0. We emphasize that this is functionally equivalent to per-request scaling in Disksim: the key difference is that scaling in Disksim is much closer to the (simulated) hardware. Overall, our implementation of disk scaling for PV VMs is simpler though less accurate and somewhat less flexible, since it requires the disk subsystem in the testing hardware to match the configuration in the target system.

We have validated both our implementations using several micro-benchmarks. For brevity, we only describe one of them here. We run DBench [29], a popular hard-drive and file-system benchmark, under different dilation factors and plot the reported throughput. Figure 2(c) shows the results for the FV I/O model with Disksim integration (results for the PV implementation can be found in a separate technical report [18]). Ideally, the throughput should remain constant as a function of the dilation factor. We first run the benchmark without scaling disk I/O or CPU, and we can see that the reported throughput increases almost linearly, an undesirable behavior. Next, we repeat the experiment and scale the CPU alone (thus, at TDF 10 the VM only receives 10% of the CPU). While the increase is no longer linear, in the absence of disk dilation it is still significantly higher than the expected value. Finally, with disk dilation in place we can see that the throughput closely tracks the expected value.

However, as the TDF increases, we start to see some divergence. After further investigation, we found that this deviation results from the way we scaled the CPU. Recall that we scale the CPU by bounding the amount of CPU available to each VM. Initially, we simply used Xen's Credit scheduler to allocate an appropriate fraction of CPU resources to each VM in non-work-conserving mode. However, simply scaling the CPU does not govern how those CPU cycles are distributed across time. With the original Credit scheduler, if a VM does not consume its full time slice, it can be scheduled again in subsequent time slices. For instance, if a VM is set to be dilated by a factor of 10 and if it consumes less than 10% of the CPU in each time slice, then it will run in every time slice, since in aggregate it never consumes more than its hard bound of 10% of the CPU. This potential to run continuously distorts the performance of I/O-bound applications under dilation; in particular, they will have a different timing distribution than they would in the real time frame. This distortion increases with increasing TDF. Thus, we found that, for some workloads, we may actually wish to enforce that the VM's CPU consumption be spread more uniformly across time.

We modified the Credit CPU scheduler in Xen to support this mode of operation as follows: if a VM runs for the entire duration of its time slice, we ensure that it does not get scheduled for the next (tdf − 1) time slices. If a VM voluntarily yields the CPU or is pre-empted before its time slice expires, it may be re-scheduled in a subsequent time slice. However, as soon as it consumes a cumulative total of a time slice's worth of runtime (carried over from the previous time it was descheduled), it will be pre-empted and not allowed to run for another (tdf − 1) time slices. The final line in Figure 2(c) shows the results of the DBench benchmark using this modified scheduler. As we can see, the throughput remains consistent even at higher TDFs. Note that unlike in this benchmark, DieCast typically runs multiple VMs per machine, in which case this "spreading" of CPU cycles occurs naturally as VMs compete for CPU.
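The sketch below (ours, in Python rather than the scheduler's C, with invented names) captures the slice-spreading rule just described, under the simplifying assumption of a single VM and discrete slice indices:

```python
# Minimal sketch (not the Xen Credit scheduler) of the slice-spreading rule:
# a VM that consumes a full slice's worth of runtime is kept off the run
# queue for the next (tdf - 1) slices, so its cycles cannot bunch up in time.

SLICE = 1.0  # one scheduler time slice, in arbitrary units

class VMState:
    def __init__(self, tdf: int):
        self.tdf = tdf
        self.runtime = 0.0        # runtime accumulated toward the current slice
        self.blocked_until = 0    # slice index before which the VM may not run

    def runnable(self, slice_index: int) -> bool:
        return slice_index >= self.blocked_until

    def account(self, slice_index: int, ran_for: float) -> None:
        """Charge runtime; once a cumulative slice is consumed (carried over
        across descheduling), rest for the next (tdf - 1) slices."""
        self.runtime += ran_for
        if self.runtime >= SLICE:
            self.runtime -= SLICE
            self.blocked_until = slice_index + self.tdf

vm = VMState(tdf=10)
vm.account(slice_index=0, ran_for=SLICE)   # ran its full slice in slice 0...
assert not vm.runnable(5)                  # ...so it stays descheduled
assert vm.runnable(10)                     # eligible again after 9 idle slices
```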
4 Evaluation

We seek to answer the following questions with respect to DieCast scaling: i) Can we configure a smaller number of physical machines to match the CPU capacity, complex network topology, and I/O rates of a larger service? ii) How well does the performance of a scaled service running on fewer resources match the performance of a baseline service running with more resources? We consider three different systems: i) BitTorrent, a popular peer-to-peer file sharing program; ii) RUBiS, an auction service prototyped after eBay; and iii) Isaac, our configurable three-tier network service that allows us to generate a range of workload scenarios.

4.1 Methodology

To evaluate DieCast for a given system, we first establish the baseline performance: this involves determining the configuration(s) of interest, fixing the workload, and benchmarking the performance. We then scale the system down by an order of magnitude and compare the DieCast performance to the baseline. While we have extensively evaluated DieCast implementations for several versions of Xen, we only present the results for the Xen 3.1 implementation here. A detailed evaluation for Xen 3.0.4 can be found in our technical report [18].

Each physical machine in our testbed is a dual-core 2.3 GHz Intel Xeon with 4 GB RAM. Note that since the Disksim integration only works with fully virtualized VMs, for a fair evaluation it is required that even the baseline system run on VMs; ideally the baseline would be run on physical machines directly (for the paravirtualized setup, we do have an evaluation with physical machines as the baseline; we refer the reader to [18] for details). We configure Disksim to emulate a Seagate ST3217 disk drive. For the baseline, Disksim runs as usual (no requests are scaled) and with DieCast, we scale each request as described in Section 3.3.

We configure each virtual machine with 256 MB RAM and run Debian Etch on Linux 2.6.17. Unless otherwise stated, the baseline configuration consists of 40 physical machines hosting a single VM each. We then compare the performance characteristics to runs with DieCast on four physical machines hosting 10 VMs each, scaled by a factor of 10. We use ModelNet for the network emulation, and appropriately scale the link characteristics for DieCast. For allocating CPU, we use our modified Credit CPU scheduler as described in Section 3.3.

4.2 BitTorrent

We begin by using DieCast to evaluate BitTorrent [1], a popular P2P application. For our baseline experiments, we run BitTorrent (version 3.4.2) on a total of 40 virtual machines. We configure the machines to communicate across a ModelNet-emulated dumbbell topology (Figure 3), with varying bandwidth and latency values for the access link (A) from each client to the dumbbell and the dumbbell link itself (C). We vary the total number of clients, the file size, the network topology, and the version of the BitTorrent software. We use the distribution of file download times across all clients as the metric for comparing performance. The aim here is to observe how closely DieCast-scaled experiments reproduce behavior of the baseline case for a variety of scenarios.

The first experiment establishes the baseline, where we compare different configurations of BitTorrent sharing a file across a 10 Mbps dumbbell link and constrained access links of 10 Mbps. All links have a one-way latency of 5 ms. We run a total of 40 clients (with half on each side of the dumbbell). Figure 5 plots the cumulative distribution of transfer times across all clients for different file sizes (10 MB and 50 MB). We show the baseline case using solid lines and use dashed lines to represent the DieCast-scaled case. With DieCast scaling, the distribution of download times closely matches the behavior of the original system. For instance, well-connected clients on the same side of the dumbbell as the randomly chosen seeder finish more quickly than the clients that must compete for scarce resources across the dumbbell.
Having established a reasonable baseline, we next consider sensitivity to changing system configurations. We first vary the network topology by leaving the dumbbell link unconstrained (1 Gbps), with results in Figure 5. The graph shows the effect of removing the bottleneck on the finish times compared to the constrained dumbbell-link case for the 50-MB file: all clients finish within a small time difference of each other as shown by the middle pair of curves.

Next, we consider the effect of varying the total number of clients. Using the topology from the baseline experiment, we repeat the experiments for 80 and 200 simultaneous BitTorrent clients. Figure 6 shows the results. The curves for the baseline and DieCast-scaled versions almost completely overlap each other for 80 clients (left pair of curves) and show minor deviation from each other for 200 clients (right pair of curves). Note that with 200 clients, the bandwidth contention increases to the point where the dumbbell bottleneck becomes less important.

Finally, we consider an experiment that demonstrates the flexibility of DieCast to reproduce system performance under a variety of resource configurations starting with the same baseline. Figure 7 shows that in addition to matching 1:10 scaling using 4 physical machines hosting 10 VMs each, we can also match an alternate configuration of 8 physical machines hosting five VMs each with a dilation factor of five. This demonstrates that even if it is necessary to vary the number of physical machines available for testing, it may still be possible to find an appropriate scaling factor to match performance characteristics.

[Figure 3: Topology for BitTorrent experiments.]
[Figure 4: RUBiS Setup.]
[Figure 5: Performance with varying file sizes.]
[Figure 6: Varying number of clients.]
[Figure 7: Different configurations.]

This graph also has a fourth curve, labeled "No DieCast," corresponding to running the experiment with 40 VMs on four physical machines, each with a dilation factor of 1: disk and network are not scaled (thus they match the baseline configuration), and all VMs are allocated equal shares of the CPU. This corresponds to the approach of simply multiplexing a number of virtual machines on physical machines without using DieCast. The graph shows that the behavior of the system under such a naive approach varies widely from actual behavior.

4.3 RUBiS

Next, we investigate DieCast's ability to scale a fully functional Internet service. We use RUBiS [6], an auction site prototype designed to evaluate scalability and application server performance. RUBiS has been used by other researchers to approximate realistic Internet services [12-14].

We use the PHP implementation of RUBiS running Apache as the web server and MySQL as the database. For consistent results, we re-create the database and pre-populate it with 100,000 users and items before each experiment. We use the default read-write transaction table for the workload, which exercises all aspects of the system such as adding new items, placing bids, adding comments, and viewing and browsing the database. The RUBiS workload generators warm up for 60 seconds, followed by a session runtime of 600 seconds and a ramp-down of 60 seconds.

We emulate a topology of 40 nodes consisting of 8 database servers, 16 web servers, and 16 workload generators as shown in Figure 4. A 100 Mbps network link connects two replicas of the service spread across the wide area at two sites. Within a site, 1 Gbps links connect all components. For reliability, half of the web servers at each site use the database servers in the other site. There is one load generator per web server and all load generators share a 100 Mbps access link. Each system component (servers, workload generators) runs in its own Xen VM.

We now evaluate DieCast's ability to predict the behavior of this RUBiS configuration using fewer resources. Figures 8(a) and 8(b) compare the baseline performance with the scaled system for overall system throughput and average response time (across all client-web server combinations) on the y-axis as a function of the number of simultaneous clients (offered load) on the x-axis. In both cases, the performance of the scaled service closely tracks that of the baseline. We also show the performance for the "No DieCast" configuration: regular VM multiplexing with no DieCast scaling. Without DieCast to offset the resource contention, the aggregate throughput drops with a substantial increase in response times. Interestingly, for one of our initial tests, we ran with an unintended misconfiguration of the RUBiS database: the workload had commenting-related operations enabled, but the relevant tables were missing from the database. This led to an approximately 25% error rate, with similar timings in the responses to clients in both the baseline and DieCast configurations. These types of configuration errors are one example of the types of testing that we wish to enable with DieCast.

[Figure 8: Comparing RUBiS application performance: Baseline vs. DieCast. (a) Throughput; (b) Response Time.]
[Figure 9: Comparing resource utilization for RUBiS: DieCast can accurately emulate the baseline system behavior. (a) CPU profile; (b) Memory profile; (c) Network profile.]

Next, Figures 9(a) and 9(b) compare CPU and memory utilizations for both the scaled and unscaled experiments as a function of time for the case of 4,800 simultaneous user sessions: we pick one node of each type (DB server, web server, load generator) at random from the baseline, and use the same three nodes for comparison with DieCast. One important question is whether the average performance results in earlier figures hide significant incongruities in per-request performance. Here, we see that resource utilization in the DieCast-scaled experiments closely tracks the utilization in the baseline on a per-node and per-tier (client, web server, database) basis. Similarly, Figure 9(c) compares the network utilization of individual links in the topology for the baseline and DieCast-scaled experiment. We sort the links by the amount of data transferred per link in the baseline case. This graph demonstrates that DieCast closely tracks and reproduces variability in network utilization for various hops in the topology. For instance, hops 86 and 87 in the figure correspond to access links of clients and show the maximum utilization, whereas individual access links of web servers are moderately loaded.

4.4 Exploring DieCast Accuracy

While we were encouraged by DieCast's ability to scale RUBiS and BitTorrent, they represent only a few points in the large space of possible network service configurations, for instance, in terms of the ratios of computation to network communication to disk I/O. Hence, we built Isaac, a configurable multi-tier network service to stress the DieCast methodology on a range of possible configurations.

[Figure 10: Architecture of Isaac.]

Figure 10 shows Isaac's architecture. Requests originating from a client (C) travel to a unique front-end server (FS) via a load balancer (LB). The FS makes a number of calls to other services through application servers (AS). These application servers in turn may issue read and write calls to a database backend (DB) before building a response and transmitting it back to the front-end server, which finally responds to the client. Isaac is written in Python and allows configuring the service to a given interconnect topology, computation, communication, and I/O pattern. A configuration describes, on a per-request-class basis, the computation, communication, and I/O characteristics across multiple service tiers. In this manner, we can configure experiments to stress different aspects of a service and to independently push the system to capacity along multiple dimensions. We use MySQL for the database tier to reflect a realistic transactional storage tier.
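The paper does not publish Isaac's code; the following is a minimal sketch of our own, in the spirit of the description above, with invented class and field names. It shows how a per-request-class configuration could drive per-tier computation (SHA-1 hashing), database I/O, and AS fan-out:

```python
# Minimal sketch (not Isaac's actual code) of a per-request-class
# configuration driving computation, database I/O, and AS fan-out.
# All names here are our own invention.

import hashlib
import random
from dataclasses import dataclass

@dataclass
class RequestClass:
    as_fanout: int      # application servers touched per request
    db_reads: int       # 1 KB reads per AS call
    db_writes: int      # 1 KB writes per AS call
    as_hashes: int      # SHA-1 iterations per AS response
    fs_hashes: int      # SHA-1 iterations over the concatenated results

class FakeDB:
    """In-memory stand-in for the MySQL backend tier."""
    def read(self, n: int) -> bytes:
        return b"r" * n
    def write(self, data: bytes) -> None:
        self.last = data

def compute(payload: bytes, iterations: int) -> bytes:
    """The configurable computation stage: repeated SHA-1 hashing."""
    for _ in range(iterations):
        payload = hashlib.sha1(payload).digest()
    return payload

def handle_request(cfg: RequestClass, dbs: list) -> bytes:
    """FS path: fan out to ASs, each touching a random DB, then combine."""
    parts = []
    for _ in range(cfg.as_fanout):
        db = random.choice(dbs)            # DB chosen randomly at runtime
        for _ in range(cfg.db_reads):
            db.read(1024)
        for _ in range(cfg.db_writes):
            db.write(b"w" * 1024)
        parts.append(compute(b"as-response", cfg.as_hashes))
    return compute(b"".join(parts), cfg.fs_hashes)

# The first experiment below: five ASs per request, ten 1 KB reads and two
# 1 KB writes per AS, 500 AS-side and 5,000 FS-side SHA-1 hashes.
response = handle_request(RequestClass(5, 10, 2, 500, 5000),
                          [FakeDB() for _ in range(4)])
```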
For our first experiment, we configure Isaac with four DBs, four ASs, four FSs, and 28 clients. The clients generate requests, wait for responses, and sleep for some time before generating new requests. Each client generates 20 requests and each such request touches five ASs (randomly selected at runtime) after going through the FS. Each request from the AS involves 10 reads from and 2 writes to a database, each of size 1 KB. The database server is also chosen randomly at runtime. Upon completing its database queries, each AS computes 500 SHA-1 hashes of the response before sending it back to the FS. Each FS then collects responses from all five ASs and finally computes 5,000 SHA-1 hashes on the concatenated results before replying to the client. In later experiments, we vary both the amount of computation and I/O to quantify sensitivity to varying resource bottlenecks.

We perform this 40-node experiment both with and without DieCast. For brevity, we do not show the results of initial tests validating DieCast accuracy (in all cases, performance matched closely in both the dilated and baseline cases). Rather, we run a more complex experiment where a subset of the machines fail and then recover. Our goal is to show that DieCast can accurately match application performance before the failure occurs, during the failure scenario, and during the application's recovery. After 200 seconds, we fail half of the database servers (chosen at random) by stopping the MySQL servers on the corresponding nodes. As a result, client requests accessing failed databases will not complete, slowing the rate of completed requests. After one minute of downtime, we restart the MySQL servers, and soon after we expect to see the request completion rate regain its original value. Figure 11 shows the fraction of requests completed on the y-axis as a function of time since the start of the experiment on the x-axis. DieCast closely matches the baseline application behavior with a dilation factor of 10. We also compare the percentage of time spent in each of the three tiers of Isaac averaged across all requests. Figure 12 shows that in addition to the end-to-end response time, DieCast closely tracks the system behavior on a per-tier basis.

[Figure 11: Request completion time.]
[Figure 12: Tier breakdown.]

Encouraged by the results of the previous experiment, we next attempt to saturate individual components of Isaac to explore the limits of DieCast's accuracy. First, we evaluate DieCast's ability to scale network services when database access dominates per-request service time. Figure 13 shows the completion time for requests, where each service issues a 100-KB (rather than 1-KB) write to the database with all other parameters remaining the same. This amounts to a total of 1 MB of database writes for every request from a client. Even for these larger data volumes, DieCast faithfully reproduces system performance. While for this workload we are able to maintain good accuracy, the evaluation of disk dilation summarized in Figure 2(c) suggests that there will certainly be points where disk dilation inaccuracy will affect overall DieCast accuracy.

[Figure 13: Stressing DB/CPU.]

Next, we evaluate DieCast accuracy when one of the components in our architecture saturates the CPU.
Specifically, we configure our front-end servers such that prior to sending each response to the client, they compute SHA-1 hashes of the response 500,000 times to artificially saturate the CPU of this tier. The results of this experiment too are shown in Figure 13. We are encouraged overall as the system does not significantly diverge even to the point of CPU saturation. For instance, the CPU utilization for nodes hosting the FS in this experiment varied from 50-80% for the duration of the experiment, and even under such conditions DieCast closely matched the baseline system performance. The "No DieCast" lines plot the performance of the stress-DB and stress-CPU configurations with regular VM multiplexing without DieCast scaling. As with BitTorrent and RUBiS, we see that without DieCast, the test infrastructure fails to predict the performance of the baseline system.

5 Commercial System Evaluation

While we were encouraged by DieCast's accuracy for the applications we considered in Section 4, all of the experiments were designed by the DieCast authors and were largely academic in nature. To understand the generality of our system, we consider its applicability to a large-scale commercial system.

Panasas [4] builds scalable storage systems targeting Linux cluster computing environments. It has supplied solutions to several government agencies, oil and gas companies, media companies, and several commercial HPC enterprises. A core component of Panasas's products is the PanFS parallel file system (henceforth referred to as PanFS): an object-based cluster file system that presents a single, cache-coherent, unified namespace to clients.

To meet customer requirements, Panasas must ensure its systems can deliver appropriate performance under a range of client access patterns. Unfortunately, it is often impossible to create a test environment that reflects the setup at a customer site. Since Panasas has several customers with very large super-computing clusters and limited test infrastructure at its disposal, its ability to perform testing at scale is severely restricted by hardware availability; exactly the type of situation DieCast targets. For example, the Los Alamos National Lab has deployed PanFS with its Roadrunner peta-scale supercomputer [5]. The Roadrunner system is designed to deliver a sustained performance level of one petaflop at an estimated cost of $90 million. Because of the tremendous scale and cost, Panasas cannot replicate this computing environment for testing purposes.

Porting Time Dilation. In evaluating our ability to apply DieCast to PanFS, we encountered one primary limitation. PanFS clients use a Linux kernel module to communicate with the PanFS server. The client-side code runs on recent versions of Xen, and hence, DieCast supported them with no modifications. However, the PanFS server runs in a custom operating system derived from an older version of FreeBSD that does not support Xen. The significant modifications to the base FreeBSD operating system made it impossible to port PanFS to a more recent version of FreeBSD that does support Xen. Ideally, it would be possible to simply encapsulate the PanFS server in a fully virtualized Xen VM. However, recall that this requires virtualization support in the processor, which was unavailable in the hardware Panasas was using.
Even if we had the hardware, Xen did not support FreeBSD on FV VMs until recently due to a well-known bug [2]. Thus, unfortunately, we could not easily employ the existing time dilation techniques with PanFS on the server side. However, since we believe DieCast concepts are general and not restricted to Xen, we took this opportunity to explore whether we could modify the PanFS OS to support DieCast, without any virtualization support.

To implement time dilation in the PanFS kernel, we scale the various time sources and, consequently, the wall clock. The TDF can be specified at boot time as a kernel parameter. As before, we need to scale down resources available to PanFS such that its perceived capacity matches the baseline. For scaling the network, we use Dummynet [27], which ships as part of the PanFS OS. However, there was no mechanism for limiting the CPU available to the OS, or for slowing the disk.

The PanFS OS does not support non-work-conserving CPU allocation. Further, simply modifying the CPU scheduler for user processes is insufficient because it would not throttle the rate of kernel processing. For CPU dilation, we had to modify the kernel as follows. We created a CPU-bound task, idle, in the kernel and statically assigned it the highest scheduling priority. We scale the CPU by maintaining the required ratio between the runtimes of the idle task and all remaining tasks. If the idle task consumes sufficient CPU, it is removed from the run queue and the regular CPU scheduler kicks in. If not, the scheduler always picks the idle task because of its priority.
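A minimal sketch of this scheduling decision (ours, not PanFS kernel code; the function name is our invention) makes the ratio explicit: to present a 1/tdf machine, the idle task must absorb (tdf − 1)/tdf of all runtime:

```python
# Minimal sketch (not PanFS kernel code) of the idle-task trick: a
# highest-priority CPU hog runs until it has absorbed a (tdf - 1)/tdf share
# of all runtime, so everything else gets only 1/tdf of the machine.

def pick_next(idle_runtime: float, other_runtime: float, tdf: int) -> str:
    """Return which task class the scheduler should run next."""
    total = idle_runtime + other_runtime
    # Target: the idle task owns (tdf - 1) parts out of tdf of all runtime.
    if total == 0 or idle_runtime < (tdf - 1) / tdf * total:
        return "idle"     # idle task keeps top priority until the ratio holds
    return "regular"      # ratio satisfied: the normal scheduler kicks in

# With TDF = 10, after the idle task has consumed 90 units against 10 units
# of real work, regular tasks become eligible again.
assert pick_next(idle_runtime=89.0, other_runtime=10.0, tdf=10) == "idle"
assert pick_next(idle_runtime=90.0, other_runtime=10.0, tdf=10) == "regular"
```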
For disk dilation, we were faced with the complication that multiple hardware and software components interact in PanFS to service clients. For performance, there are several parallel data paths and many operations are either asynchronous or cached. Accurately implementing disk dilation would require accounting for all of the possible code paths as well as modeling the disk drives with high fidelity. In an ideal implementation, if the physical service time for a disk request is s and the TDF is t, then the request should be delayed by time (t − 1) · s such that the total physical service time becomes t · s, which under dilation would be perceived as the desired value of s.

Unfortunately, the Panasas operating system only provides coarse-grained kernel timers. Consequently, sleep calls with small durations tend to be inaccurate. Using a number of micro-benchmarks, we determined that the smallest sleep interval that could be accurately implemented in the PanFS operating system was 1 ms.

This limitation affects the way disk dilation can be implemented. For I/O-intensive workloads, the rate of disk requests is high. At the same time, the service time of each request is relatively modest. In this case, delaying each request individually is not an option, since the overhead of invoking sleep dominates the injected delay and gives unexpectedly large slowdowns. Thus, we chose to aggregate delays across some number of requests whose service times sum to more than 1 ms and to periodically inject delays rather than injecting a delay for each request. Another practical limitation is that it is often difficult to accurately bound the service time of a disk request. This is a result of the various I/O paths that exist: requests can be synchronous or asynchronous, they can be serviced from the cache or not, and so on.

While we realize that this implementation is imperfect, it works well in practice and can be automatically tuned for each workload. A perfect implementation would have to accurately model the low-level disk behavior and improve the accuracy of the kernel sleep function. Because operating systems and hardware will increasingly support native virtualization, we feel that our simple disk dilation implementation targeting individual PanFS workloads is reasonable in practice to validate our approach.
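The delay-aggregation workaround can be sketched as follows (ours, not PanFS code; the class name and 1 ms constant placement are our own framing of the constraint described above):

```python
# Minimal sketch (not PanFS code) of the delay-aggregation workaround: each
# request owes (tdf - 1) * s of extra delay, but the kernel can only sleep
# accurately in >= 1 ms units, so debts are pooled and paid in batches.

MIN_SLEEP = 0.001  # smallest accurate sleep in the PanFS kernel, in seconds

class DiskDilator:
    def __init__(self, tdf: int):
        self.tdf = tdf
        self.debt = 0.0  # accumulated delay not yet injected

    def on_request_complete(self, service_time: float) -> float:
        """Accumulate the ideal per-request delay; return the sleep to
        inject now (0 if the debt is still below the 1 ms floor)."""
        self.debt += (self.tdf - 1) * service_time
        if self.debt < MIN_SLEEP:
            return 0.0           # too small to sleep accurately; defer
        injected, self.debt = self.debt, 0.0
        return injected          # pay the whole debt in one accurate sleep

dilator = DiskDilator(tdf=10)
# Ten requests of 20 us each at TDF 10: the debt crosses 1 ms at the sixth
# request, so one aggregated sleep is injected instead of ten tiny ones.
sleeps = [dilator.on_request_complete(20e-6) for _ in range(10)]
assert sum(s > 0 for s in sleeps) >= 1
```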
Validation. We first wish to establish DieCast accuracy by running experiments on bare hardware and comparing them against DieCast-scaled virtual machines. We start by setting up a storage system consisting of a PanFS server with 20 disks of capacity 250 GB each (5 TB total storage). We evaluate two benchmarks from the standard bandwidth test suite used by Panasas. The first benchmark involves 10 clients (each on a separate machine) running IOZone [23]. The second benchmark uses the Message Passing Interface (MPI) across 100 clients (again, on separate machines) [28]. For DieCast scaling, we repeat the experiment with our modifications to the PanFS server configured to enforce a dilation factor of 10. Thus, we allocate 10% of the CPU to the server and dilate the network using Dummynet to 10% of the physical bandwidth and 10 times the latency (to preserve the bandwidth-delay product). On the client side, we have all clients running in separate virtual machines (10 VMs per physical machine), each receiving 10% of the CPU with a dilation factor of 10.

[Figure 14: Validating DieCast on PanFS.]

Figure 14 plots the aggregate client throughput for both experiments on the y-axis as a function of the data block size on the x-axis. Circles mark the read throughput while triangles mark write throughput. We use solid lines for the baseline and dashed lines for the DieCast-scaled configuration. For both reads and writes, DieCast closely follows baseline performance, never diverging by more than 5% even for unusually large block sizes.

    Aggregate          Number of clients
    Throughput    10          250         1000
    Write         370 MB/s    403 MB/s    398 MB/s
    Read          402 MB/s    483 MB/s    424 MB/s

Table 1: Aggregate read/write throughputs from the IOZone benchmark with block size 16M. PanFS performance scales gracefully with larger client populations.

Scaling. With sufficient faith in the ability of DieCast to reproduce performance for real-world application workloads, we next aim to push the scale of the experiment beyond what Panasas can easily achieve with its existing infrastructure.

We are interested in the scalability of PanFS as we increase the number of clients by two orders of magnitude. To achieve this, we design an experiment similar to the one above, but this time we fix the block size at 16 MB and vary the number of clients. We use 10 VMs each on 25 physical machines to support 250 clients running the IOZone benchmark. We further scale the experiment by using 10 VMs each on 100 physical machines to go up to 1,000 clients. In each case, all VMs are running at a TDF of 10. The PanFS server also runs at a TDF of 10 and all resources (CPU, network, disk) are scaled appropriately. Table 1 shows the performance of PanFS with increasing client population. Interestingly, we find relatively little increase in throughput as we increase the client population. Upon investigating further, we found that a single PanFS server configuration is limited to 4 Gb/s (500 MB/s) of aggregate bisection bandwidth between the servers and clients (including any IP and file system overhead). While our network emulation accurately reflected this bottleneck, we did not catch the bottleneck until we ran our experiments. We leave a performance evaluation with this bottleneck removed to future work.

We would like to emphasize that prior to our experiment, Panasas had been unable to perform experiments at this scale. This is in part due to the fact that such a large number of machines might not be available at any given time for a single experiment. Further, even if machines are available, blocking a large number of machines results in significant resource contention because several other smaller experiments are then blocked on availability of resources. Our experiments demonstrate that DieCast can leverage existing resources to work around these types of problems.
6 DieCast Usage Scenarios

In this section, we discuss DieCast's applicability and limitations for testing large-scale network services in a variety of environments.

DieCast aims to reproduce the performance of an original system configuration and is well suited for predicting the behavior of the system under a variety of workloads. Further, because the test system can be subjected to a variety of realistic and projected client access patterns, DieCast may be employed to verify that the system can maintain the terms of Service Level Agreements (SLAs).

DieCast runs in a controlled and partially emulated network environment. Thus, it is relatively straightforward to consider the effects of revamping a service's network topology (e.g., to evaluate whether an upgrade can alleviate a communication bottleneck). DieCast can also systematically subject the system to failure scenarios. For example, system architects may develop a suite of fault-loads to determine how well a service maintains response times, data quality, or recovery time metrics. Similarly, because DieCast controls workload generation, it is appropriate for considering a variety of attack conditions. For instance, it can be used to subject an Internet service to large-scale Denial-of-Service attacks. DieCast may enable evaluation of various DoS mitigation strategies or software architectures.

Many difficult-to-isolate bugs result from system configuration errors (e.g., at the OS, network, or application level) or inconsistencies that arise from "live upgrades" of a service. The resulting faults may only manifest as errors in a small fraction of requests, and even then only after a specific sequence of operations. Operator errors and misconfigurations [22, 24] are also known to account for a significant fraction of service failures. DieCast makes it possible to capture the effects of misconfigurations and upgrades before a service goes live.

At the same time, DieCast will not be appropriate for certain service configurations. As discussed earlier, DieCast is unable to scale down the memory or storage capacity of a service. Services that rely on multi-petabyte data sets or saturate the physical memories of all of their machines with little to no cross-machine memory/storage redundancy may not be suitable for DieCast testing. If system behavior depends heavily on the behavior of the processor cache, and if multiplexing multiple VMs onto a single physical machine results in significant cache pollution, then DieCast may under-predict the performance of certain application configurations.

DieCast may change the fine-grained timing of individual events in the test system. Hence, DieCast may not be able to reproduce certain race conditions or timing errors in the original service. Some bugs, such as memory leaks, will only manifest after running for a significant period of time. Given that we inflate the amount of time required to carry out a test, it may take too long to isolate these types of errors using DieCast.

Multiplexing multiple virtual machines onto a single physical machine, running with an emulated network, and dilating time will introduce some error into the projected behavior of target services. This error has been small for the network services and scenarios we evaluate in this paper. In general, however, DieCast's accuracy will be service- and deployment-specific. We have not yet established an overall limit to DieCast's scaling ability. In separate experiments not reported in this paper, we have successfully run with scaling factors of 100. However, in these cases, the limitation of time itself becomes significant. Waiting 10 times longer for an experiment to configure is often reasonable, but waiting 100 times longer becomes difficult.

Some services employ a variety of custom hardware, such as load balancing switches, firewalls, and storage appliances. In general, it may not be possible to scale such hardware in our test environment. Depending on the architecture of the hardware, one approach is to wrap the various operating systems for such cases in scaled virtual machines. Another approach is to run the hardware itself and to build custom wrappers to intercept requests and responses, scaling them appropriately.
A final option is to run such hardware unscaled in the test environment, introducing some error in system performance. Our work with PanFS shows that it is feasible to scale unmodified services into the DieCast environment with relatively little work on the part of the developer.

7 Related Work

Our work builds upon previous efforts in a number of areas. We discuss each in turn below.

Testing scaled systems. SHRiNK [25] is perhaps most closely related to DieCast in spirit. SHRiNK aims to evaluate the behavior of faster networks by simulating slower ones. For example, their "scaling hypothesis" states that the behavior of 100 Mbps flows through a 1 Gbps pipe should be similar to 10 Mbps flows through a 100 Mbps pipe. When this scaling hypothesis holds, it becomes possible to run simulations more quickly and with a lower memory footprint. Relative to this effort, we show how to scale fully operational computer systems, considering complex interactions among CPU, network, and disk spread across many nodes and topologies.

Testing through simulation and emulation. One popular approach to testing complex network services is through building a simulation model of system behavior under a variety of access patterns. While such simulations are valuable, we argue that simulation is best suited to understanding coarse-grained performance characteristics of certain configurations. Simulation is less suited to configuration errors or to capturing the effects of unexpected component interactions, failures, etc.

Superficially, emulation techniques (e.g., Emulab [34] or ModelNet [31]) offer a more realistic alternative to simulation because they support running unmodified applications and operating systems. Unfortunately, such emulation is limited by the capacity of the available physical hardware and hence is often best suited to considering wide-area network conditions (with smaller bisection bandwidths) or smaller system configurations. For instance, multiplexing 1,000 instances of an overlay across 50 physical machines interconnected by Gigabit Ethernet may be feasible when evaluating a file sharing service on clients with cable modems. However, the same 50 machines will be incapable of emulating the network or CPU characteristics of 1,000 machines in a multi-tier network service consisting of dozens of racks and high-speed switches.

Time dilation. DieCast leverages earlier work on time dilation [19] to assist with scaling the network configuration of a target service. This earlier work focused on evaluating network protocols on next-generation networking topologies, e.g., the behavior of TCP on 10 Gbps Ethernet while running on 1 Gbps Ethernet. Relative to this previous work, DieCast improves upon time dilation to scale down a particular network configuration. In addition, we demonstrate that it is possible to trade time for compute resources while accurately scaling CPU cycles, complex network topologies, and disk I/O. Finally, we demonstrate the efficacy of our approach end-to-end for complex, multi-tier network services.

Detecting performance anomalies. There have been a number of recent efforts to debug performance anomalies in network services, including Pinpoint [14], Magpie [9], and Project5 [8]. Each of these initiatives analyzes the communication and computation across multiple tiers in modern Internet services to locate performance anomalies. These efforts are complementary to ours as they attempt to locate problems in deployed systems. Conversely, the goal of our work is to test particular software configurations at scale to locate errors before they affect a live service.

Modeling Internet services. Finally, there have been many efforts to model the performance of network services to, for example, dynamically provision them in response to changing request patterns [16, 30] or to reroute requests in the face of component failures [12]. Once again, these efforts typically target already running services, relative to our goal of testing service configurations. Alternatively, such modeling could be used to feed simulations of system behavior or to verify at a coarse granularity DieCast performance predictions.
8 Conclusion

Testing network services remains difficult because of their scale and complexity. While not technically or economically feasible, a comprehensive evaluation would require running a test system identically configured to, and at the same scale as, the original system. Such testing should enable finding performance anomalies, failure recovery problems, and configuration errors under a variety of workloads and failure conditions before triggering corresponding errors during live runs.

In this paper, we present a methodology and framework to enable system testing to more closely match both the configuration and scale of the original system. We show how to multiplex multiple virtual machines, each configured identically to a node in the original system, across individual physical machines. We then dilate individual machine resources, including CPU cycles, network communication characteristics, and disk I/O, to provide the illusion that each VM has as much computing power as the corresponding physical nodes in the original system. By trading time for resources, we enable more realistic tests involving more hosts and more complex network topologies than would otherwise be possible on the underlying hardware. While our approach does add necessary storage and multiplexing overhead, an evaluation with a range of network services, including a commercial file system, demonstrates our accuracy and the potential to significantly increase the scale and realism of testing network services.

Acknowledgements

The authors would like to thank Tejasvi Aswathanarayana, Jeff Butler, and Garth Gibson at Panasas for their guidance and support in porting DieCast to their systems. We would also like to thank Marvin McNett and Chris Edwards for their help in managing some of the infrastructure. Finally, we would like to thank our shepherd, Steve Gribble, and our anonymous reviewers for their time and insightful comments; they helped tremendously in improving the paper.

References

[1] BitTorrent. http://www.bittorrent.com.
[2] FreeBSD bootloader stops with BTX halted in hvm domU. http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=622.
[3] Netem. http://linux-net.osdl.org/index.php/Netem.
[4] Panasas. http://www.panasas.com.
[5] Panasas ActiveScale Storage Cluster Will Provide I/O for World's Fastest Computer. http://panasas.com/press_release_111306.html.
[6] RUBiS. http://rubis.objectweb.org.
[7] VMware appliances. http://www.vmware.com/vmtn/appliances/.
[8] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating System Principles, 2003.
[9] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, 2004.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating System Principles, 2003.
[11] L. A. Barroso, J. Dean, and U. Holzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
[12] J. M. Blanquer, A. Batchelli, K. Schauser, and R. Wolski. Quorum: Flexible Quality of Service for Internet Services. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2005.
[13] E. Cecchet, J. Marguerite, and W. Zwaenepoel. Performance and scalability of EJB applications. In Proceedings of the 17th ACM Conference on Object-Oriented Programming, Systems, Languages and Applications, 2002.
[14] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem Determination in Large, Dynamic Internet Services. In Proceedings of the 32nd International Conference on Dependable Systems and Networks, 2002.
[15] Y.-C. Cheng, U. Hoelzle, N. Cardwell, S. Savage, and G. M. Voelker. Monkey See, Monkey Do: A Tool for TCP Tracing and Replaying. In Proceedings of the USENIX Annual Technical Conference, 2004.
[16] R. Doyle, J. Chase, O. Asad, W. Jen, and A. Vahdat. Model-Based Resource Provisioning in a Web Service Utility. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, 2003.
[17] G. R. Ganger and contributors. The DiskSim Simulation Environment. http://www.pdl.cmu.edu/DiskSim/index.html.
[18] D. Gupta, K. V. Vishwanath, and A. Vahdat. DieCast: Testing Network Services with an Accurate 1/10 Scale Model. Technical Report CS2007-0910, University of California, San Diego, 2007.
[19] D. Gupta, K. Yocum, M. McNett, A. C. Snoeren, G. M. Voelker, and A. Vahdat. To Infinity and Beyond: Time-Warped Network Emulation. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2006.
[20] A. Haeberlen, A. Mislove, and P. Druschel. Glacier: Highly durable, decentralized storage despite massive correlated failures. In Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2005.
[21] J. Mogul. Emergent (Mis)behavior vs. Complex Software Systems. In Proceedings of the first EuroSys Conference, 2006.
[22] K. Nagaraja, F. Oliveira, R. Bianchini, R. P. Martin, and T. D. Nguyen. Understanding and Dealing with Operator Mistakes in Internet Services. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, 2004.
[23] W. Norcott and D. Capps. IOzone Filesystem Benchmark. http://www.iozone.org/.
[24] D. Oppenheimer, A. Ganapathi, and D. Patterson. Why do Internet services fail, and what can be done about it? In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, 2003.
[25] R. Pan, B. Prabhakar, K. Psounis, and D. Wischik. SHRiNK: A Method for Scalable Performance Prediction and Efficient Network Simulation. In IEEE INFOCOM, 2003.
[26] R. Ricci, C. Alfeld, and J. Lepreau. A Solver for the Network Testbed Mapping Problem. In SIGCOMM Computer Communications Review, volume 33, 2003.
[27] L. Rizzo. Dummynet and Forward Error Correction. In Proceedings of the USENIX Annual Technical Conference, 1998.
[28] The MPI Forum. MPI: A Message Passing Interface. Pages 878-883, Nov. 1993.
[29] A. Tridgell. Emulating Netbench. http://samba.org/ftp/tridge/dbench/.
[30] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource overbooking and application profiling in shared hosting platforms. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.
[31] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostić, J. Chase, and D. Becker. Scalability and Accuracy in a Large-Scale Network Emulator. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.
[32] C. A. Waldspurger. Memory Resource Management in VMware ESX Server. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.
[33] A. Warfield, R. Ross, K. Fraser, C. Limpach, and S. Hand. Parallax: Managing Storage for a Million Machines. In Proceedings of the 10th Workshop on Hot Topics in Operating Systems, 2005.
[34] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation, 2002.