Emulating Goliath Storage Systems with David

Nitin Agrawal†, Leo Arulraj, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
NEC Laboratories America†, University of Wisconsin–Madison
nitin@nec-labs.com, {arulraj, dusseau, remzi}@cs.wisc.edu

Abstract

Benchmarking file and storage systems on large file-system images is important, but difficult and often infeasible. Typically, running benchmarks on such large disk setups is a frequent source of frustration for file-system evaluators; the scale alone acts as a strong deterrent against using larger albeit realistic benchmarks. To address this problem, we develop David: a system that makes it practical to run large benchmarks using modest amounts of storage or memory capacities readily available on most computers. David creates a "compressed" version of the original file-system image by omitting all file data and laying out metadata more efficiently; an online storage model determines the runtime of the benchmark workload on the original uncompressed image. David works under any file system as demonstrated in this paper with ext3 and btrfs. We find that David reduces storage requirements by orders of magnitude; David is able to emulate a 1 TB target workload using only an 80 GB available disk, while still modeling the actual runtime accurately. David can also emulate newer or faster devices, e.g., we show how David can effectively emulate a multi-disk RAID using a limited amount of memory.

1 Introduction

File and storage systems are currently difficult to benchmark. Ideally, one would like to use a benchmark workload that is a realistic approximation of a known application. One would also like to run it in a configuration representative of real world scenarios, including realistic disk subsystems and file-system images.

In practice, realistic benchmarks and their realistic configurations tend to be much larger and more complex to set up than their trivial counterparts. File system traces (e.g., from HP Labs [17]) are good examples of such workloads, often being large and unwieldy. Developing scalable yet practical benchmarks has long been a challenge for the storage systems community [16]. In particular, benchmarks such as GraySort [1] and SPECmail2009 [22] are compelling yet difficult to set up and use currently, requiring around 100 TB for GraySort and anywhere from 100 GB to 2 TB for SPECmail2009.

Benchmarking on large storage devices is thus a frequent source of frustration for file-system evaluators; the scale acts as a deterrent against using larger albeit realistic benchmarks [24], but running toy workloads on small disks is not sufficient. One obvious solution is to continually upgrade one's storage capacity. However, it is an expensive, and perhaps an infeasible, solution to justify the costs and overheads solely for benchmarking.

Storage emulators such as Memulator [10] prove extremely useful for such scenarios – they let us prototype the "future" by pretending to plug in bigger, faster storage systems and run real workloads against them. Memulator, in fact, makes a strong case for storage emulation as the performance evaluation methodology of choice. But emulators are particularly tough: if they are to be big, they have to use existing storage (and thus are slow); if they are to be fast, they have to be run out of memory (and thus they are small).

The challenge we face is how can we get the best of both worlds? To address this problem, we have developed David, a "scale down" emulator that allows one to run large workloads by scaling down the storage requirements transparently to the workload. David makes it practical to experiment with benchmarks that were otherwise infeasible to run on a given system.

Our observation is that in many cases, the benchmark application does not care about the contents of individual files, but only about the structure and properties of the metadata that is being stored on disk. In particular, for the purposes of benchmarking, many applications do not write or read file contents at all (e.g., fsck); the ones that do often do not care what the contents are as long as some valid content is made available (e.g., backup software). Since file data constitutes a significant fraction of
the total file system size, ranging anywhere from 90 to 99% depending on the actual file-system image [3], avoiding the need to store file data has the potential to significantly reduce the required storage capacity during benchmarking.

The key idea in David is to create a "compressed" version of the original file-system image for the purposes of benchmarking. In the compressed image, unneeded user data blocks are omitted using novel classification techniques to distinguish data from metadata at scale; file system metadata blocks (e.g., inodes, directories and indirect blocks) are stored compactly on the available backing store. The primary benefit of the compressed image is to reduce the storage capacity required to run any given workload. To ensure that applications remain unaware of this interposition, whenever necessary, David synthetically generates file data on the fly; metadata I/O is redirected and accessed appropriately. David works under any file system; we demonstrate this using ext3 [25] and btrfs [26], two file systems very different in design.

Since David alters the original I/O patterns, it needs to model the runtime of the benchmark workload on the original uncompressed image. David uses an in-kernel model of the disk and storage stack to determine the runtimes of all individual requests as they would have executed on the uncompressed image. The model pays special attention to accurately modeling the I/O request queues; we find that modeling the request queues is crucial for overall accuracy, especially for applications issuing bursty I/O.

The primary mode of operation of David is the timing-accurate mode, in which, after modeling the runtime, an appropriate delay is inserted before returning to the application. A secondary speedup mode is also available, in which the storage model returns instantaneously after computing the time taken to run the benchmark on the uncompressed disk; in this mode David offers the potential to reduce application runtime and speed up the benchmark itself. In this paper we discuss and evaluate David in the timing-accurate mode.

David allows one to run benchmark workloads that require file-system images orders of magnitude larger than the available backing store while still reporting the runtime as it would have taken on the original image. We demonstrate that David even enables emulation of faster and multi-disk systems like RAID using a small amount of memory. David can also aid in running large benchmarks on storage devices that are expensive or not even available in the market as it requires only a model of the non-existent storage device; for example, one can use a modified version of David to run benchmarks on a hypothetical 1 TB SSD.

We believe David will be useful to file and storage developers, application developers, and users looking to benchmark these systems at scale. Developers often like to evaluate a prototype implementation at larger scales to identify performance bottlenecks, fine-tune optimizations, and make design decisions; analyses at scale often reveal interesting and critical insights into the system [16]. David can help obtain approximate performance estimates within limits of its modeling error. For example, how does one measure performance of a file
system on a multi-disk, multi-TB mirrored RAID configuration without having access to one? An end-user looking to select an application that works best at larger scale may also use David for emulation. For example, which anti-virus application scans a terabyte file system the fastest?

One challenge in building David is how to deal with scale as we experiment with larger file systems containing many more files and directories. Figure 1 shows the percentage of storage space occupied by metadata alone as compared to the total size of the file-system image written; the different file-system images for this experiment were generated by varying the file size distribution using Impressions [2]. Using publicly available data on file-system metadata [4], we analyzed how file-size distribution changes with file systems of varying sizes. We found that larger file systems not only had more files, they also had larger files. For this experiment, the parameters of the lognormal distribution controlling the file sizes were changed along the x-axis to generate progressively larger file systems with larger files therein. The relatively small fraction belonging to metadata (roughly 1 to 10%) as shown on the y-axis demonstrates the potential savings in storage capacity made possible if only metadata blocks are stored; David is designed to take advantage of this observation.

Figure 1: Capacity Savings. Shows the savings in storage capacity if only metadata is stored, with varying file-size distribution modeled by the (μ, σ) parameters of a lognormal distribution, (7.53, 2.48) and (8.33, 3.18) for the two extremes.

For workloads like PostMark, mkfs, Filebench WebServer, Filebench VarMail, and other microbenchmarks, we find that David delivers on its promise in reducing the required storage size while still accurately predicting the benchmark runtime for both ext3 and btrfs. The storage model within David is fairly accurate in spite of operating in real-time within the kernel, and for most workloads predicts a runtime within 5% of the actual runtime. For example, for the Filebench webserver workload, David provides a 1000-fold reduction in required storage capacity and predicts a runtime within 0.08% of the actual.
Figure 2: Metadata Remapping and Data Squashing in David. The figure shows how metadata gets remapped and data blocks are squashed. The disk image above David is the target and the one below it is the available.

2 David Overview

2.1 Design Goals for David

Scalability: Emulating a large device requires David to maintain additional data structures and mimic several operations; our goal is to ensure that it works well as the underlying storage capacity scales.

Model accuracy: An important goal is to model a storage device and accurately predict performance. The model should not only characterize the physical characteristics of the drive but also the interactions under different workload patterns.

Model overhead: Equally important to being accurate is that the model imposes minimal overhead; since the model is inside the OS and runs concurrently with workload execution, it is required to be fairly fast.

Emulation flexibility: David should be able to emulate different disks, storage subsystems, and multi-disk systems through appropriate use of backing stores.

Minimal application modification: It should allow applications to run unmodified without knowing the significantly less capacity of the storage system underneath; modifications can be performed in limited cases only to improve ease of use but never as a necessity.

2.2 David Design

David exports a fake storage stack including a fake device of a much higher capacity than available. For the rest of the paper, we use the term target to denote the hypothetical larger storage device, and available to denote the physically available system on which David is running, as shown in Figure 2. It also shows a schematic of how David makes use of metadata remapping and data squashing to free up a large percentage of the required storage space; a much smaller backing store can now service the requests of the benchmark.

David is implemented as a pseudo-device driver that is situated below the file system and above the backing store, interposing on all I/O requests. Since the driver appears as a regular device, a file system can be created and mounted on it. Being a loadable module, David can be used without any change to the application, file system or the kernel. Figure 3 presents the architecture of David with all the significant components and also shows the different types of requests that are handled within. We now describe the components of David.

First, the Block Classifier is responsible for classifying blocks addressed in a request as data or metadata and preventing I/O requests to data blocks from going to the backing store. David intercepts all writes to data blocks, records the block address if necessary, and discards the actual write using the Data Squasher. I/O requests to metadata blocks are passed on to the Metadata Remapper.

Second, the Metadata Remapper is responsible for laying out metadata blocks more efficiently on the backing store. It intercepts all write requests to metadata blocks, generates a remapping for the set of blocks addressed, and writes out the metadata blocks to the remapped locations. The remapping is stored in the Metadata Remapper to service subsequent reads.

Third, writes to data blocks are not saved, but reads to these blocks could still be issued by the application; in order to allow applications to run transparently, the Data Generator is responsible for generating synthetic content to service subsequent reads to data blocks that were written earlier and discarded. The Data Generator contains a number of built-in schemes to generate different kinds of content and also allows the application to provide hints to generate more tailored content (e.g., binary files).

Finally, by performing the above-mentioned tasks David modifies the original I/O request stream. These modifications in the I/O traffic substantially change the application runtime, rendering it useless for benchmarking. The Storage Model carefully models the (potentially different) target storage subsystem underneath to predict the benchmark runtime on the target system. By doing so in an online fashion with little overhead, the Storage Model makes it feasible to run large workloads in a space- and time-efficient manner.
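The per-request control flow implied by these components can be summarized in a short sketch. The code below is an illustrative user-space rendering of the dispatch logic, not David's in-kernel implementation; all names, the classification rule, and the constant service times are placeholders. The key point it captures is that the Storage Model always sees the original target-addressed request, so the computed delay is independent of how the request is squashed or remapped.

```c
/* david_dispatch.c: user-space sketch of David-style request handling.
 * Illustrative only; all names and policies here are placeholders.
 * Build: cc -o david_dispatch david_dispatch.c */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLK 4096

enum req_dir { REQ_READ, REQ_WRITE };

struct request {
    enum req_dir dir;
    uint64_t     target_block;   /* block address on the large target device */
    char         buf[BLK];
};

/* --- stand-ins for the four components of Section 2.2 ------------------- */

/* Block Classifier: a fake rule (pretend the first 1024 blocks are metadata). */
static bool is_metadata(uint64_t b) { return b < 1024; }

/* Metadata Remapper: trivial remap into a small region of the available disk. */
static uint64_t remap(uint64_t b) { return b % 1024; }

/* Data Generator: deterministic pseudo-random content keyed by block number. */
static void generate_content(uint64_t b, char *buf)
{
    uint64_t x = b * 6364136223846793005ULL + 1442695040888963407ULL;
    for (int i = 0; i < BLK; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        buf[i] = (char)x;
    }
}

/* Storage Model: a placeholder constant service time (milliseconds). */
static double model_service_time(const struct request *r)
{
    return r->dir == REQ_READ ? 8.5 : 9.0;
}

/* Available backing store (here just a log). */
static void backing_store_io(enum req_dir d, uint64_t avail_block, char *buf)
{
    (void)buf;
    printf("  backing store %s of available block %llu\n",
           d == REQ_READ ? "read" : "write", (unsigned long long)avail_block);
}

/* --- the dispatch path --------------------------------------------------- */

/* Squash or remap the request, but always feed the original request to the
 * Storage Model so that the returned delay reflects the target device. */
static double handle_request(struct request *r)
{
    double target_ms = model_service_time(r);

    if (is_metadata(r->target_block)) {
        backing_store_io(r->dir, remap(r->target_block), r->buf);  /* persist */
    } else if (r->dir == REQ_WRITE) {
        /* Data Squasher: the data write is simply dropped. */
    } else {
        generate_content(r->target_block, r->buf);   /* fabricate read content */
    }
    return target_ms;   /* timing-accurate mode would delay by this amount */
}

int main(void)
{
    struct request rq = { REQ_WRITE, 5000000ULL, {0} };
    printf("modeled delay: %.1f ms\n", handle_request(&rq));
    rq.dir = REQ_READ;
    printf("modeled delay: %.1f ms\n", handle_request(&rq));
    return 0;
}
```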
The individual components are discussed in detail in §3 through §6.

2.3 Choice of Available Backing Store

David is largely agnostic to the choice of the backing store for available storage: HDDs, SSDs, or memory can be used depending on the performance and capacity requirements of the target device being emulated. Through a significant reduction in the number of device I/Os, David compensates for its internal book-keeping overhead and also for small mismatches between the emulated and available device. However, if one wishes to emulate a device much faster than the available device, using memory is a safer option. For example, as shown in §6.3, David successfully emulates a RAID-1 configuration using a limited amount of memory. If the performance mismatch is not significant, a hard disk as backing store provides much greater scale in terms of storage capacity. Throughout the paper, "available storage" refers to the backing store in a generic sense.

Figure 3: David Architecture. Shows the components of David and the flow of requests handled within.

3 Block Classification

The primary requirement for David to prevent data writes using the Data Squasher is the ability to classify a block as metadata or data. David provides both implicit and explicit block classification. The implicit approach is more laborious but provides a flexible approach to run unmodified applications and file systems. The explicit notification approach is straightforward and much simpler to implement, albeit at the cost of a small modification in the operating system or the application; both are available in David and can be chosen according to the requirements of the evaluator. The implicit approach is demonstrated using ext3 and the explicit approach using btrfs.

3.1 Implicit Type Detection

For ext2 and ext3, the majority of the blocks are statically assigned for a given file system size and configuration at the time of file system creation; the allocation for these blocks doesn't change during the lifetime of the file system. Blocks that fall in this category include the superblock, group descriptors, inode and data bitmaps, inode blocks and blocks belonging to the journal; these blocks are relatively straightforward to classify based on their on-disk location, or their Logical Block Address (LBA). However, not all blocks are statically assigned; dynamically allocated blocks include directory, indirect (single, double, or triple indirect) and data blocks. Unless all blocks contain some self-identification information, in order to accurately classify a dynamically allocated block, the system needs to track the inode pointing to the particular block to infer its current status.

Implicit classification is based on prior work on Semantically-Smart Disk Systems (SDS) [21]; an SDS employs three techniques to classify blocks: direct and indirect classification, and association. With direct classification, blocks are identified simply by their location on disk. With indirect classification, blocks are identified only with additional information; for example, to identify directory data or indirect blocks, the corresponding inode must also be examined. Finally, with association, a data block and its inode are connected.

There are two significant additional challenges David must address. First, as opposed to SDS, David has to ensure that no metadata blocks are ever misclassified. Second, benchmark scalability introduces additional memory pressure to handle delayed classification. In this paper we only discuss our new contributions (the original SDS paper provides details of the basic block-classification mechanisms).
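As an illustration of direct classification, the statically assigned ext3 blocks can be recognized purely from block-number arithmetic. The sketch below assumes a made-up, simplified group layout; a real classifier would derive the geometry from the superblock and group descriptors, and dynamically allocated blocks still fall through to the inode-based path described next.

```c
/* static_classify.c: direct (location-based) classification sketch for an
 * ext3-like layout.  The geometry below is illustrative, not a real disk's. */
#include <stdint.h>
#include <stdio.h>

struct ext3_geom {
    uint64_t blocks_per_group;     /* e.g. 32768 for 4 KB blocks            */
    uint64_t gdt_blocks;           /* group-descriptor blocks per group     */
    uint64_t inode_table_blocks;   /* inode-table blocks per group          */
    uint64_t journal_start, journal_blocks;   /* journal extent             */
};

enum blk_type { STATIC_METADATA, NEEDS_INODE /* dynamic: dir/indirect/data */ };

static enum blk_type classify(const struct ext3_geom *g, uint64_t blk)
{
    if (blk >= g->journal_start && blk < g->journal_start + g->journal_blocks)
        return STATIC_METADATA;                      /* journal block       */

    uint64_t off = blk % g->blocks_per_group;
    /* Simplified per-group layout: [superblock + GDT copy][block bitmap]
     * [inode bitmap][inode table][data ...]; sparse_super etc. ignored.    */
    if (off < 1 + g->gdt_blocks + 2 + g->inode_table_blocks)
        return STATIC_METADATA;
    return NEEDS_INODE;   /* must be resolved via the owning inode (3.1.1)  */
}

int main(void)
{
    struct ext3_geom g = { 32768, 16, 512, 1024, 8192 };   /* made-up numbers */
    uint64_t probes[] = { 0, 100, 5000, 40000, 33368 };
    for (unsigned i = 0; i < sizeof probes / sizeof probes[0]; i++)
        printf("block %8llu -> %s\n", (unsigned long long)probes[i],
               classify(&g, probes[i]) == STATIC_METADATA ? "static metadata"
                                                          : "needs inode");
    return 0;
}
```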
3.1.1 Unclassified Block Store

To infer when a file or directory is allocated and deallocated, David tracks writes to inode blocks, inode bitmaps and data bitmaps; to enumerate the indirect and directory blocks that belong to a particular file or directory, it uses the contents of the inode. It is often the case that the blocks pointed to by an inode are written out before the corresponding inode block; if a classification attempt is made when a block is being written, an indirect or directory block will be misclassified as an ordinary data block. This transient error is unacceptable for David since it leads to the "metadata" block being discarded prematurely and could cause irreparable damage to the file system. For example, if a directory or indirect block is accidentally discarded, it could lead to file system corruption.

To rectify this problem, David temporarily buffers in memory writes to all blocks which are as yet unclassified, inside the Unclassified Block Store (UBS). These write requests remain in the UBS until a classification is made possible upon the write of the corresponding inode. When a corresponding inode does get written, blocks that are classified as metadata are passed on to the Metadata Remapper for remapping; they are then written out to persistent storage at the remapped location. Blocks classified as data are discarded at that time. All entries in the UBS corresponding to that inode are also removed.

The UBS is implemented as a list of block I/O (bio) request structures. An extra reference to the memory pages pointed to by these bio structures is held by David as long as they remain in the UBS; this reference ensures that these pages are not mistakenly freed until the UBS is able to classify and persist them on disk, if needed. In order to reduce the inode parsing overhead otherwise imposed for each inode write, David maintains a list of recently written inode blocks that need to be processed and uses a separate kernel thread for parsing.

3.1.2 Journal Snooping

Storing unclassified blocks in the UBS can cause a strain on available memory in certain situations. In particular, when ext3 is mounted on top of David in ordered journaling mode, all the data blocks are written to disk at journal-commit time but the metadata blocks are written to disk only at the checkpoint time which occurs much less frequently. This results in a temporary yet precarious buildup of data blocks in the UBS even though they are bound to be squashed as soon as the corresponding inode is written; this situation is especially true when large files (e.g., 10s of GB) are written. In order to ensure the overall scalability of David, handling large files and the consequent explosion in memory consumption is critical. To achieve this without any modification to the ext3 file system, David performs Journal Snooping in the block device driver.

David snoops on the journal commit traffic for inodes and indirect blocks logged within a committed transaction; this enables block classification even prior to checkpoint. When a journal-descriptor block is written as part of a transaction, David records the blocks that are being logged within that particular transaction. In addition, all journal writes within that transaction are cached in memory until the transaction is committed. After that, the inodes and their corresponding direct and indirect blocks are processed to allow block classification; the identified data blocks are squashed from the UBS and the identified metadata blocks are remapped and stored persistently. The challenge in implementing Journal Snooping was to handle the continuous stream of unordered journal blocks and reconstruct the journal transaction.
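The deferred-classification idea behind the UBS can be pictured with the following user-space sketch (these are illustrative data structures, not David's kernel lists of bio structures and page references): writes are parked until the owning inode, seen either at inode write-back or via Journal Snooping, identifies them as metadata to persist or data to squash.

```c
/* ubs_sketch.c: deferred classification in the spirit of the UBS.
 * Illustrative user-space code; David actually buffers kernel bio structures
 * and takes page references, which is not shown here. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct pending {                  /* one buffered, unclassified block write */
    uint64_t block;
    struct pending *next;
};

static struct pending *ubs_head;  /* the Unclassified Block Store */

static void ubs_add(uint64_t block)
{
    struct pending *p = malloc(sizeof *p);
    if (!p) return;
    p->block = block;
    p->next = ubs_head;
    ubs_head = p;
}

static bool in_list(const uint64_t *a, int n, uint64_t b)
{
    for (int i = 0; i < n; i++) if (a[i] == b) return true;
    return false;
}

/* Called when the owning inode is written (or seen via Journal Snooping):
 * 'meta' holds its indirect/directory blocks, 'data' its ordinary data
 * blocks.  Blocks not owned by this inode stay buffered. */
static void ubs_resolve(const uint64_t *meta, int nmeta,
                        const uint64_t *data, int ndata)
{
    struct pending **pp = &ubs_head;
    while (*pp) {
        struct pending *p = *pp;
        if (in_list(meta, nmeta, p->block)) {
            printf("remap + persist block %llu\n", (unsigned long long)p->block);
        } else if (in_list(data, ndata, p->block)) {
            printf("squash block %llu\n", (unsigned long long)p->block);
        } else {
            pp = &p->next;            /* still unclassified: keep buffering */
            continue;
        }
        *pp = p->next;                /* classified: remove from the UBS    */
        free(p);
    }
}

int main(void)
{
    ubs_add(900); ubs_add(901); ubs_add(902);   /* block writes arrive first */
    uint64_t meta[] = { 901 };                  /* inode: 901 is an indirect */
    uint64_t data[] = { 900 };                  /* inode: 900 is file data   */
    ubs_resolve(meta, 1, data, 1);              /* 902 remains in the UBS    */
    return 0;
}
```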
Figure 4: Memory usage with Journal Snooping.

Figure 4 compares the memory pressure with and without Journal Snooping, demonstrating its effectiveness. It shows the number of 4 KB block I/O requests resident in the UBS sampled at 10 sec intervals during the creation of a 24 GB file on ext3; the file system is mounted on top of David in ordered journaling mode with a commit interval of 5 secs. This experiment was run on a dual core machine with 2 GB memory. Since this workload is data-write intensive, without Journal Snooping the system runs out of memory when around 450,000 bio requests are in the UBS (occupying roughly 1.8 GB of memory). Journal Snooping ensures that the memory consumed by outstanding bio requests does not go beyond a maximum of 240 MB.

3.2 Explicit Metadata Notification

David is meant to be useful for a wide variety of file systems; explicit metadata notification provides a mechanism to rapidly adopt a file system for use with David. Since data writes can come only from the benchmark application in user-space whereas metadata writes are issued by the file system, our approach is to identify the data blocks before they are even written to the file system. Our implementation of explicit notification is thus file-system agnostic – it relies on a small modification to the page cache to collect additional information. We demonstrate the benefits of this approach using btrfs, a file system quite unlike ext3 in design.

When an application writes to a file, David captures the pointers to the in-memory pages where the data content is stored, as it is being copied into the page cache. Subsequently, when the writes reach David, they are compared against the captured pointer addresses to decide whether the write is to metadata or data. Once the presence is tested, the pointer is removed from the list since the same page can be reused for metadata writes in the future.
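A stripped-down rendering of this pointer-tracking scheme is shown below. It is a user-space approximation only; the real mechanism hooks the copy into the kernel page cache and compares struct page pointers, with locking and bio plumbing omitted here.

```c
/* page_track.c: sketch of explicit data-block identification by page pointer.
 * A user-space illustration of the Section 3.2 idea, not the kernel code. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_TRACKED 1024
static const void *data_pages[MAX_TRACKED];  /* pages last written from user space */
static int ntracked;

/* Called when application data is copied into the page cache. */
static void note_data_page(const void *page)
{
    if (ntracked < MAX_TRACKED)
        data_pages[ntracked++] = page;
}

/* Called when a block write reaches the driver; returns true for data and
 * removes the entry, since the page may later be reused for metadata. */
static bool write_is_data(const void *page)
{
    for (int i = 0; i < ntracked; i++)
        if (data_pages[i] == page) {
            data_pages[i] = data_pages[--ntracked];
            return true;
        }
    return false;   /* never seen from user space: treat as metadata */
}

int main(void)
{
    void *app_page = aligned_alloc(4096, 4096);  /* stands in for a data page */
    void *fs_page  = aligned_alloc(4096, 4096);  /* e.g. a file-system node   */
    note_data_page(app_page);                    /* write() copies user data  */
    printf("app page -> %s\n", write_is_data(app_page) ? "data" : "metadata");
    printf("fs page  -> %s\n", write_is_data(fs_page)  ? "data" : "metadata");
    free(app_page); free(fs_page);
    return 0;
}
```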
There are certainly other ways to implement explicit notification. One way is to capture the checksum of the contents of the in-memory pages instead of the pointer to track data blocks. One can also modify the file system to explicitly flag the metadata blocks, instead of identifying data blocks with the page cache modification. We believe our approach is easier to implement, does not require any file system modification, and is also easier to extend to software RAID since parity blocks are automatically classified as metadata and not discarded.

4 Metadata Remapping

Since David exports a target pseudo device of much higher capacity to the file system than the available storage device, the bio requests issued to the pseudo device will have addresses in the full target range and thus need to be suitably remapped. For this purpose, David maintains a remap table called the Metadata Remapper which maps "target" addresses to "available" addresses. The Metadata Remapper can contain an entry either for one metadata block (e.g., superblock), or a range of metadata blocks (e.g., group descriptors); by allowing an arbitrary range of blocks to be remapped together, the Metadata Remapper provides an efficient translation service that also provides scalability. Range remapping in addition preserves sequentiality of the blocks if a disk is used as the backing store. In addition to the Metadata Remapper, a remap bitmap is maintained to keep track of free and used blocks on the available physical device; the remap bitmap supports allocation both of a single remapped block and a range of remapped blocks.

The destination (or remapped) location for a request is determined using a simple algorithm which takes as input the number of contiguous blocks that need to be remapped and finds the first available chunk of space from the remap bitmap. This can be done statically or at runtime; for the ext3 file system, since most of the blocks are statically allocated, the remapping for these blocks can also be done statically to improve performance. Subsequent writes to other metadata blocks are remapped dynamically; when metadata blocks are deallocated, corresponding entries from the Metadata Remapper and the remap bitmap are removed. From our experience, this simple algorithm lays out blocks on disk quite efficiently. More sophisticated allocation algorithms based on locality of reference can be implemented in the future.
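The remapping machinery can be pictured as a range-granular map plus a first-fit allocator over the remap bitmap, roughly as sketched below. This is a simplification for illustration; it omits deallocation, static pre-mapping, and persistence of the map itself.

```c
/* remap_sketch.c: range-based metadata remapping with a first-fit bitmap.
 * A simplification of Section 4 for illustration, not David's actual code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define AVAIL_BLOCKS (1u << 20)     /* available device: 1M blocks of 4 KB */
#define MAX_EXTENTS  4096

struct extent { uint64_t target_start, len, avail_start; };

static struct extent map[MAX_EXTENTS];   /* the Metadata Remapper */
static int           nmap;
static unsigned char remap_bitmap[AVAIL_BLOCKS / 8];   /* 1 = in use */

static bool bit(uint64_t i) { return remap_bitmap[i / 8] >> (i % 8) & 1; }
static void set(uint64_t i) { remap_bitmap[i / 8] |= 1 << (i % 8); }

/* First-fit: find 'len' contiguous free blocks on the available device. */
static int64_t alloc_run(uint64_t len)
{
    uint64_t run = 0;
    for (uint64_t i = 0; i < AVAIL_BLOCKS; i++) {
        run = bit(i) ? 0 : run + 1;
        if (run == len) {
            uint64_t start = i - len + 1;
            for (uint64_t j = start; j <= i; j++) set(j);
            return (int64_t)start;
        }
    }
    return -1;                      /* available device is full */
}

/* Map a contiguous range of target metadata blocks; returns available start. */
static int64_t remap_range(uint64_t target_start, uint64_t len)
{
    for (int i = 0; i < nmap; i++)  /* already covered by an extent? reuse it */
        if (target_start >= map[i].target_start &&
            target_start + len <= map[i].target_start + map[i].len)
            return map[i].avail_start + (target_start - map[i].target_start);

    int64_t avail = alloc_run(len);
    if (avail < 0 || nmap == MAX_EXTENTS) return -1;
    map[nmap++] = (struct extent){ target_start, len, (uint64_t)avail };
    return avail;
}

int main(void)
{
    /* e.g. a run of group descriptors near the end of a large target device
     * lands near the front of the small available disk, kept sequential. */
    printf("target 244140625+256 -> available %lld\n",
           (long long)remap_range(244140625ULL, 256));
    printf("target 244140700     -> available %lld\n",
           (long long)remap_range(244140700ULL, 1));
    return 0;
}
```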
5 Data Generator

David services the requirements of systems oblivious to file content with data squashing and metadata remapping. However, many real applications care about file content; the Data Generator within David is responsible for generating synthetic content to service read requests to data blocks that were previously discarded. Different systems can have different requirements for the file content and the Data Generator has various options to choose from; Figure 5 shows some examples of the different types of content that can be generated.

Figure 5: Examples of content generation by Data Generator. The figure shows a randomly generated text file, a text file with semantically meaningful content, a well-formatted PDF file, and a config file with precise syntax to be stored in the RAW Store.

Many systems that read back previously written data do not care about the specific content within the files as long as there is some content (e.g., a file-system backup utility, or the Postmark benchmark). Much in the same way as failure-oblivious computing generates values to service reads to invalid memory while ignoring invalid writes [18], David randomly generates content to service out-of-bound read requests.

Some systems may expect file contents to have valid syntax or semantics; the performance of these systems depends on the actual content being read (e.g., a desktop search engine for a file system, or a spell-checker). For such systems, naive content generation would either crash the application or give poor benchmarking results. David produces valid file content leveraging prior work on generating file-system images [2].

Finally, some systems may expect to read back data exactly as they wrote it earlier (i.e., a read-after-write or RAW dependency) or expect a precise structure that cannot be generated arbitrarily (e.g., a binary file or a configuration file). David provides additional support to run these demanding applications using the RAW Store, designed as a cooperative resource visible to the user and configurable to suit the needs of different applications. Our current implementation of the RAW Store is very simple: in order to decide which data blocks need to be stored persistently, David requires the application to supply a list of the relevant file paths. David then looks up the inode numbers of the files and tracks all data blocks pointed to by these inodes, writing them out to disk using the Metadata Remapper just as any metadata block. In the future, we intend to support more nuanced ways to maintain the RAW Store; for example, specifying directories instead of files, or by using Memoization [14].

For applications that must exactly read back a significant fraction of what they write, the scalability advantage of David diminishes; in such cases the benefits are primarily from the ability to emulate new devices.
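For the common case where any content suffices, generation only has to be cheap. The sketch below additionally makes the bytes a deterministic function of the block number so that repeated reads of the same squashed block are self-consistent; that determinism is an assumption made here (the text above only states that content is generated randomly), and the text-like variant merely stands in for the richer Impressions-based and RAW Store schemes.

```c
/* datagen_sketch.c: synthesize content for a read of a squashed data block.
 * Illustrative; seeding by block number is an assumption, not David's
 * documented policy. */
#include <stdint.h>
#include <stdio.h>

#define BLK 4096

/* xorshift-style generator keyed by block number: same block, same bytes. */
static void fill_random(uint64_t block, unsigned char *buf)
{
    uint64_t x = block * 0x9E3779B97F4A7C15ULL + 1;
    for (int i = 0; i < BLK; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        buf[i] = (unsigned char)x;
    }
}

/* Variant for applications that want text-like content (e.g. grep or a
 * spell-checker): printable characters and spaces instead of raw bytes. */
static void fill_text(uint64_t block, unsigned char *buf)
{
    fill_random(block, buf);
    for (int i = 0; i < BLK; i++)
        buf[i] = (buf[i] % 8 == 0) ? ' ' : 'a' + buf[i] % 26;
}

int main(void)
{
    unsigned char a[BLK], b[BLK];
    fill_random(123456, a);
    fill_random(123456, b);
    printf("repeat read identical: %s\n",
           (a[0] == b[0] && a[BLK - 1] == b[BLK - 1]) ? "yes" : "no");
    fill_text(123456, a);
    printf("text sample: %.40s\n", (char *)a);
    return 0;
}
```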
6 Storage Model and Emulation

Not having access to the target storage system requires David to precisely capture the behavior of the entire storage stack with all its dependencies through a model. The storage system modeled by David is the target system and the system on which it runs is the available system. David emulates the behavior of the target disk by sending requests to the available disk (for persistence) while simultaneously sending the target request stream to the Storage Model; the model computes the time that the request would have taken to finish on the target system and introduces an appropriate delay in the actual request stream before returning control. Figure 3, presented in §2, shows this setup more clearly.

As a general design principle, to support low-overhead modeling without compromising accuracy, we avoid using any technique that either relies on storing empirical data to compute statistics or requires table-based approaches to predict performance [6]; the overheads for such methods are directly proportional to the amount of runtime statistics being maintained, which in turn depends on the size of the disk. Instead, wherever applicable, we have adopted and developed analytical approximations that did not slow the system down; our resulting models are sufficiently lean while being fairly accurate. To ensure portability of our models, we have refrained from making device-specific optimizations to improve accuracy; we believe current models in David are fairly accurate. The models are also adaptive enough to be easily configured for changes in disk drives and other parameters of the storage stack. We next present some details of the disk model and the storage stack model.

Table 1: Storage Model Parameters in David. Lists important parameters obtained to model the disks Hitachi HDS728080PLA380 (H1) and Hitachi HDS721010KLA330 (H2). The last three parameters describe the I/O request queue (IORQ).

  Parameter               H1                       H2
  Disk size               80 GB                    1 TB
  Rotational speed        7200 RPM                 7200 RPM
  Number of cylinders     88283                    147583
  Number of zones         30                       30
  Sectors per track       567 to 1170              840 to 1680
  Cylinders per zone      1444 to 1521             1279 to 8320
  On-disk cache size      2870 KB                  300 MB
  Disk cache segment      260 KB                   600 KB
  Cache segments          11                       500
  Cache R/W partition     Varies                   Varies
  Bus transfer            133 MBps                 133 MBps
  Seek profile (long)     3800 + (cyl*116)/10^3    3300 + (cyl*5)/10^6
  Seek profile (short)    300 + sqrt(cyl*2235)     700 + sqrt(cyl)
  Head switch             1.4 ms                   1.4 ms
  Cylinder switch         1.6 ms                   1.6 ms
  Dev driver req queue    128-160                  128-160
  Req scheduling          FIFO                     FIFO
  Req queue timeout       3 ms (unplug)            3 ms (unplug)

6.1 Disk Model

David's disk model is based on the classical model proposed by Ruemmler and Wilkes [19], henceforth referred to as the RW model. The disk model contains information about the disk geometry (i.e., platters, cylinders and zones) and maintains the current disk head position; using these sources it models the disk seek, rotation, and transfer times for any request. The disk model also keeps track of the effects of disk caches (track prefetching, write-through and write-back caches). In the future, it will be interesting to explore using Disksim for the disk model. Disksim is a detailed user-space disk simulator which allows for greater flexibility in the types of device properties that can be simulated along with their degree of detail; we will need to ensure it does not appreciably slow down the emulation when used without memory as backing store.

6.1.1 Disk Drive Profile

The disk model requires a number of drive-specific parameters as input, a list of which is presented in the first column of Table 1; currently David contains models for two disks: the Hitachi HDS728080PLA380 80 GB disk, and the Hitachi HDS721010KLA330 1 TB disk. We have verified the parameter values for both these disks through carefully controlled experiments. David is envisioned for use in environments where the target drive itself may not be available; if users need to model additional drives, they need to supply the relevant parameters. Disk seeks, rotation time and transfer times are modeled much in the same way as proposed in the RW model. The actual parameter values defining the above properties are specific to a drive; empirically obtained values for the two disks we model are shown in Table 1.
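Reading the Table 1 seek profiles as functions of the cylinder distance, the positioning cost of a request can be modeled along the following lines. The short/long crossover point and the microsecond units are assumptions made for this illustration; what matters is the RW-model structure of seek plus rotation plus transfer, with head position, zoning, switch times, and the cache handled separately.

```c
/* seek_model.c: positioning-time flavor of an RW-style disk model, using the
 * H1 (80 GB Hitachi) seek profile from Table 1.  The crossover cylinder count
 * and the microsecond units are assumptions for illustration only. */
#include <math.h>
#include <stdio.h>

#define RPM            7200.0
#define SHORT_SEEK_MAX 600.0          /* assumed short/long crossover (cylinders) */

/* Seek time in microseconds as a function of cylinder distance (H1 column). */
static double seek_us(double cyl)
{
    if (cyl <= 0) return 0.0;
    if (cyl < SHORT_SEEK_MAX)
        return 300.0 + sqrt(cyl * 2235.0);          /* short-seek profile */
    return 3800.0 + (cyl * 116.0) / 1e3;            /* long-seek profile  */
}

/* Expected rotational delay: half a revolution on average. */
static double rotation_us(void) { return 0.5 * 60e6 / RPM; }

/* Transfer time for n sectors on a track holding 'spt' sectors. */
static double transfer_us(int n, int spt) { return (60e6 / RPM) * n / spt; }

int main(void)
{
    double cyl_dist = 20000;          /* example: a fairly long seek */
    double t = seek_us(cyl_dist) + rotation_us() + transfer_us(8, 1170);
    printf("modeled service time: %.0f us (seek %.0f, rot %.0f, xfer %.0f)\n",
           t, seek_us(cyl_dist), rotation_us(), transfer_us(8, 1170));
    return 0;
}
```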
6.1.2 Disk Cache Modeling

The drive cache is usually small (a few hundred KB to a few MB) and serves to cache reads from the disk media to service future reads, or to buffer writes. Unfortunately, the drive cache is one of the least specified components as well; the cache management logic is low-level firmware code which is not easy to model.

David models the number and size of segments in the disk cache and the number of disk sector-sized slots in each segment. Partitioning of the cache segments into read and write caches, if any, is also part of the information contained in the disk model. David models the read cache with a FIFO eviction policy. To model the effects of write caching, the disk model maintains statistics on the current size of writes pending in the disk cache and the time needed to flush these writes out to the media. Write buffering is simulated by periodically emptying a fraction of the contents of the write cache during idle periods of the disk in between successive foreground requests. The cache is modeled with a write-through policy and is partitioned into a sufficiently large read cache to match the read-ahead behavior of the disk drive.

6.2 Storage Stack Model

David also models the I/O request queues (IORQs) maintained in the OS; Table 1 lists a few of their important parameters. While developing the Storage Model, we found that accurately modeling the behavior of the IORQs is crucial to predict the target execution time correctly. The IORQs usually have a limit on the maximum number of requests that can be held at any point; processes that try to issue an I/O request when the IORQ is full are made to wait. Such waiting processes are woken up when an I/O issued to the disk drive completes, thereby creating an empty slot in the IORQ. Once woken up, the process is also granted the privilege to batch a constant number of additional I/O requests even when the IORQ is full, as long as the total number of requests is within a specified upper limit. Therefore, for applications issuing bursty I/O, the time spent by a request in the IORQ can outweigh the time spent at the disk by several orders of magnitude; modeling the IORQs is thus crucial for overall accuracy.

Disk requests arriving at David are first enqueued into a replica queue maintained inside the Storage Model. While being enqueued, the disk request is also checked for a possible merge with other pending requests: a common optimization that reduces the number of total requests issued to the device. There is a limit on the number of disk requests that can be merged into a single disk request; eventually merged disk requests are dequeued from the replica queue and dispatched to the disk model to obtain the service time spent at the drive. The replica queue uses the same request scheduling policy as the target IORQ.
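The replica queue can be pictured as a bounded queue in which an arriving request is first tested for a contiguous merge with a pending request; the depth bound is what makes queueing delay visible to bursty workloads. The sketch below is a simplified rendering (no batching privilege for woken processes, no unplug timer, and illustrative merge limits), not David's implementation.

```c
/* iorq_sketch.c: simplified replica I/O request queue with merging.
 * An illustration of the Section 6.2 behavior; the queue limit comes from
 * Table 1, the rest of the policy is simplified. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_LIMIT 128    /* device-driver request queue bound (Table 1) */
#define MERGE_LIMIT 16     /* assumed cap on bios merged into one request */

struct dreq { uint64_t start, nblocks; int merged; };

static struct dreq q[QUEUE_LIMIT];
static int qlen;

/* Try to merge a new bio with a pending request it extends contiguously. */
static bool try_merge(uint64_t start, uint64_t nblocks)
{
    for (int i = 0; i < qlen; i++) {
        if (q[i].merged >= MERGE_LIMIT) continue;
        if (q[i].start + q[i].nblocks == start) {            /* back merge  */
            q[i].nblocks += nblocks; q[i].merged++; return true;
        }
        if (start + nblocks == q[i].start) {                 /* front merge */
            q[i].start = start; q[i].nblocks += nblocks; q[i].merged++;
            return true;
        }
    }
    return false;
}

/* Enqueue; returns false if the issuing process would have to block, which
 * is where bursty workloads accumulate queueing delay. */
static bool enqueue(uint64_t start, uint64_t nblocks)
{
    if (try_merge(start, nblocks)) return true;
    if (qlen == QUEUE_LIMIT)       return false;   /* queue full: caller waits */
    q[qlen++] = (struct dreq){ start, nblocks, 0 };
    return true;
}

int main(void)
{
    enqueue(1000, 8);
    enqueue(1008, 8);                  /* contiguous: merges with the first */
    enqueue(5000, 8);
    printf("pending requests: %d (first covers %llu blocks)\n",
           qlen, (unsigned long long)q[0].nblocks);
    return 0;
}
```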
6.3 RAID Emulation

David can also provide effective RAID emulation. To demonstrate simple RAID configurations with David, each component disk is emulated using a memory-backed "compressed" device underneath software RAID. David exports multiple block devices with separate major and minor numbers; it differentiates requests to different devices using the major number. For the purpose of performance benchmarking, David uses a single memory-based backing store for all the compressed RAID devices. Using multiple threads, the Storage Model maintains separate state for each of the devices being emulated. Requests are placed in a single request queue tagged with a device identifier; individual Storage Model threads for each device fetch one request at a time from this request queue based on the device identifier. Similar to the single-device case, the servicing thread calculates the time at which a request to the device should finish and notifies completion using a callback.

David currently only provides mechanisms for simple software RAID emulation that do not need a model of a software RAID itself. New techniques might be needed to emulate more complex commercial RAID configurations, for example, commercial RAID settings using a hardware RAID card.

7 Evaluation

We seek to answer four important questions. First, what is the accuracy of the Storage Model? Second, how accurately does David predict benchmark runtime and what storage space savings does it provide? Third, can David scale to large target devices including RAID? Finally, what is the memory and CPU overhead of David?

7.1 Experimental Platform

We have developed David for the Linux operating system. The hard disks currently modeled are the 1 TB Hitachi HDS721010KLA330 (referred to as D1TB) and the 80 GB Hitachi HDS728080PLA380 (referred to as D80GB); Table 1 lists their relevant parameters. Unless specified otherwise, the following hold for all the experiments: (1) the machine used has a quad-core Intel processor and 4 GB RAM running Linux 2.6.23.1; (2) the ext3 file system is mounted in ordered-journaling mode with a commit interval of 5 sec; (3) microbenchmarks were run directly on the disk without a file system; (4) David predicts the benchmark runtime for a target D1TB while in fact running on the available D80GB; (5) to validate accuracy, David was instead run directly on D1TB.

7.2 Storage Model Accuracy

First, we validate the accuracy of the Storage Model in predicting the benchmark runtime on the target system. Since our aim is to validate the accuracy of the Storage Model alone, we run David in a model-only mode where we disable block classification, remapping and data squashing. David just passes down the requests that it receives to the available request queue below. We run David on top of D1TB and set the target drive to be the same.

Figure 6: Storage Model accuracy for Sequential and Random Reads and Writes. The graph shows the cumulative distribution of measured and modeled times for sequential and random reads and writes.

Figure 7: Storage Model accuracy. The graphs show the cumulative distribution of measured and modeled times for the following workloads from left to right: Postmark, Webserver, Varmail and Tar.
Note that the available system is the same as the target system for these experiments since we only want to compare the measured and modeled times to validate the accuracy of the Storage Model. Each block request is traced along its path from David to the disk drive and back. This is done in order to measure the total time that the request spends in the available IORQ and the time spent getting serviced at the available disk. These measured times are then compared with the modeled times obtained from the Storage Model.

Figure 6 shows the Storage Model accuracy for four micro-workloads: sequential and random reads, and sequential and random writes; these micro-workloads have demerit figures of 24.39, 5.51, 0.08, and 0.02 respectively, as computed using the Ruemmler and Wilkes methodology [19]. The large demerit for sequential reads is due to a variance in the available disk's cache-read times; modeling the disk cache in greater detail in the future could potentially avoid this situation. However, sequential read requests do not contribute to a measurably large error in the total modeled runtime; they often hit the disk cache and have service times less than 500 microseconds while other types of disk requests take around 20 to 35 milliseconds to get serviced. Any inaccuracy in the modeled times for sequential reads is negligible when compared to the service times of other types of disk requests; we thus chose to not make the disk-cache model more complex for the sake of sequential reads.

Figure 7 shows the accuracy for four different macro workloads and application kernels: Postmark [13], webserver (generated using FileBench [15]), Varmail (a mail server workload using FileBench), and a Tar workload (copy and untar of the linux kernel of size 46 MB). The FileBench Varmail workload emulates an NFS mail server, similar to Postmark, but is multi-threaded instead. The Varmail workload consists of a set of open/read/close, open/append/close and deletes in a single directory, in a multi-threaded fashion. The FileBench webserver workload comprises a mix of open/read/close of multiple files in a directory tree. In addition, to simulate a web server log, a file append operation is also issued. The workload consists of 100 threads issuing 16 KB appends to the weblog every 10 reads.

Overall, we find that storage modeling inside David is quite accurate for all workloads used in our evaluation. The total modeled time as well as the distribution of the individual request times are close to the total measured time and the distribution of the measured request times.

7.3 David Accuracy

Next, we want to measure how accurately David predicts the benchmark runtime. Table 2 lists the accuracy and storage space savings provided by David for a variety of benchmark applications for both ext3 and btrfs. We have chosen a set of benchmarks that are commonly used and also stress various paths that disk requests take within David. The first and second columns of the table show the storage space consumed by the benchmark workload without and with David. The third column shows the percentage savings in storage space achieved by using David. The fourth column shows the original benchmark runtime without David on D1TB. The fifth column shows the benchmark runtime with David on D80GB. The sixth column shows the percentage error in the prediction of the benchmark runtime by David. The final three columns show the original and modeled runtime, and the percentage error, for the btrfs experiments; the storage space savings are roughly the same as for ext3. The sr, rr, sw, and rw workloads are run directly on the raw device and hence are independent of the file system.

Table 2: David Performance and Accuracy. Shows savings in capacity, accuracy of runtime prediction, and the overhead of storage modeling for different workloads. Webserver and varmail are generated using FileBench; virusscan using AVG. Columns 2-7 use implicit classification on ext3; the last three columns use explicit notification on btrfs.

  Workload    Original      David      Storage    Original   David     Runtime   Original   David     Runtime
              Storage (KB)  Storage    Savings    Runtime    Runtime   Error     Runtime    Runtime   Error
                            (KB)       (%)        (Secs)     (Secs)    (%)       (Secs)     (Secs)    (%)
  mkfs        976762584     7900712    99.19      278.66     281.81    1.13      -          -         -
  imp         11224140      18368      99.84      344.18     339.42    -1.38     327.294    324.057   0.99
  tar         21144         628        97.03      257.66     255.33    -0.9      146.472    135.014   7.8
  grep        -             -          -          250.52     254.40    1.55      141.960    138.455   2.47
  virusscan   -             -          -          55.60      47.95     -13.75    27.420     31.555    15.08
  find        -             -          -          26.21      26.60     1.5       -          -         -
  du          -             -          -          102.69     101.36    -1.29     -          -         -
  postmark    204572        404        99.80      33.23      29.34     -11.69    22.709     22.243    2.05
  webserver   3854828       3920       99.89      127.04     126.94    -0.08     125.611    126.504   0.71
  varmail     7852          3920       50.07      126.66     126.27    -0.31     126.019    126.478   0.36
  sr          -             -          -          40.32      44.90     11.34     40.32      44.90     11.34
  rr          -             -          -          913.10     935.46    2.45      913.10     935.46    2.45
  sw          -             -          -          57.28      58.96     2.93      57.28      58.96     2.93
  rw          -             -          -          308.74     291.40    -5.62     308.74     291.40    -5.62

mkfs creates a file system with a 4 KB block size over the 1 TB target device exported by David. This workload only writes metadata and David remaps writes issued by mkfs sequentially starting from the beginning of D80GB; no data squashing occurs in this experiment.
imp creates a realistic file-system image of size 10 GB using the publicly available Impressions tool [2]. A total of 5000 regular files and 1000 directories are created with an average of 10.2 files per directory. This workload is a data-write intensive workload and most of the issued writes end up being squashed by David.

Figure 8: Storage Space Savings and Model Accuracy. The "Space" lines show the savings in storage space achieved when using David for the Impressions workload with file-system images of varying sizes until 800 GB; the "Time" lines show the accuracy of runtime prediction for the same workload. WOD: space/time without David, D: space/time with David.

tar uses the GNU tar utility to create a gzipped archive of the file-system image of size 10 GB created by imp; it writes the newly created archive in the same file system. This workload is a data read and data write intensive workload. The data reads are satisfied by the Data Generator without accessing the available disk, while the data writes end up being squashed. grep uses the GNU grep utility to search for the expression "nothing" in the content generated by both imp and tar. This workload issues significant amounts of data reads and small amounts of metadata reads. virusscan runs the AVG virus scanner on the file-system image created by imp. find and du run the GNU find and GNU du utilities over the content generated by both imp and tar. These two workloads are metadata-read-only workloads.

David works well under both the implicit and explicit approaches, demonstrating its usefulness across file systems. Table 2 shows how David provides tremendous savings in the required storage capacity, upwards of 99% (a 100-fold or more reduction) for most workloads. David also predicts benchmark runtime quite accurately. Prediction error for most workloads is less than 3%, although for a few it is just over 10%. The errors in the predicted runtimes stem from the relative simplicity of our in-kernel Disk Model; for example, it does not capture the layout of physical blocks on the magnetic media accurately. This information is not published by the disk manufacturers and experimental inference is not possible for ATA disks that do not have a command similar to the SCSI mode page.

7.4 David Scalability

David is aimed at providing scalable emulation using commodity hardware; it is important that accuracy is
not compromised at larger scale. Figure 8 shows the accuracy and storage space savings provided by David while creating file-system images of 100s of GB. Using an available capacity of only 10 GB, David can model the runtime of Impressions in creating a realistic file-system image of 800 GB; in contrast to the linear scaling of the target capacity demanded, David barely requires any extra available capacity. David also predicts the benchmark runtime within a maximum of 2.5% error even with the huge disparity between target and available disks at the 800 GB mark, as shown in Figure 8. The reason we limit these experiments to a target capacity of less than 1 TB is because we had access to only a terabyte-sized disk against which we could validate the accuracy of David. Extrapolating from this experience, we believe David will enable one to emulate disks of 10s or 100s of TB given the 1 TB disk.

7.5 David for RAID

We present a brief evaluation and validation of software RAID-1 configurations using David. Table 3 shows a simple experiment where David emulates a multi-disk software RAID-1 (mirrored) configuration; each device is emulated using a memory-disk as backing store. However, since the multiple disks contain copies of the same block, a single physical copy is stored, further reducing the memory footprint. In each disk setup, a set of threads, equal in number to the number of disks, issues a total of 20000 requests. David is able to accurately emulate the software RAID-1 setup up to 3 disks; more complex RAID schemes are left as part of future work.

Table 3: David Software RAID-1 Emulation. Shows IOPS for a software RAID-1 setup using David with memory as backing store; the workload issues 20000 read and write requests through concurrent processes which equal the number of disks in the experiment. The 1-disk experiments run without RAID-1.

              Num Disks   RandR    RandW   SeqR     SeqW
  Measured    3           232.77   72.37   119.29   119.98
              2           156.76   72.02   119.11   119.33
              1           78.66    71.88   118.65   118.71
  Modeled     3           238.79   73.77   119.44   119.40
              2           159.36   72.21   119.16   119.21
              1           79.56    72.15   118.95   118.83

7.6 David Overhead

David is designed to be used for benchmarking and not as a production system, thus scalability and accuracy are the more relevant metrics of evaluation; we do however want to measure the memory and CPU overhead of using David on the available system to ensure it is practical to use. All memory usage within David is tracked
using several counters; David provides support to measure the memory usage of its different components using ioctls. To measure the CPU overhead of the Storage Model alone, David is run in the model-only mode where block classification, remapping and data squashing are turned off.

In our experience with running different workloads, we found that the memory and CPU usage of David is acceptable for the purposes of benchmarking. As an example, Figure 9 shows the CPU and memory consumption by David captured at 5 second intervals while creating a 10 GB file-system image using Impressions. For this experiment, the Storage Model consumes less than 1 MB of memory; the average memory consumed in total by David is less than 90 MB, of which the pre-allocated cache used by the Journal Snooping to temporarily store the journal writes itself contributes 80 MB. The amount of CPU used by the Storage Model alone is insignificant; however, implicit classification by the Block Classifier is the primary consumer of CPU, using 10% on average with occasional spikes. The CPU overhead is not an issue at all if one uses explicit notification.

Figure 9: David CPU and Memory Overhead. Shows the memory and percentage CPU consumption by David while creating a 10 GB file-system image using Impressions. WOD CPU: CPU without David, SM CPU: CPU with Storage Model alone, D CPU: total CPU with David, SM Mem: Storage Model memory alone, D Mem: total memory with David.

8 Related Work

Memulator [10] makes a great case for why storage emulation provides the unique ability to explore nonexistent storage components and take end-to-end measurements. Memulator is a "timing-accurate" storage emulator that allows a simulated storage component to be plugged into a real system running real applications. Memulator can use the memory of either a networked machine or the local machine as the storage media of the emulated
gevaluated,tostressthesysteminmeaningfulways[8].AlthoughusefulfordiskandI/Osystems,theself-scalingbenchmarksarenotdirectlyapplicableforlesystems.TheevaluationoftheXFSlesys-temfromSiliconGraphicsusesanumberofbenchmarksspecicallyintendedtotestitsscalability[23];suchanevaluationcanbenetfromDavidtoemployevenlargerbenchmarkswithgreaterease;SpecSFS[27]alsocon-tainssometechniquesforscalableworkloadgeneration.Similartoouremulationofscaleinastoragesystem,Guptaetal.fromUCSDproposeatechniquecalledtimedilationforemulatingnetworkspeedsordersofmag-nitudefasterthanavailable[11].Timedilationallowsonetoexperimentwithunmodiedapplicationsrunningoncommodityoperatingsystemsbysubjectingthemtomuchfasternetworkspeedsthanactuallyavailable.AkeychallengeinDavidistheabilitytoidentifydataandmeta-datablocks.BesidesSDS[21],XN,thestablestoragesystemfortheXokexokernel[12]dealtwithsim-ilarissues.XNemployedatemplateofmetadatatrans-lationfunctionscalledUDFsspecictoeachletype.TheresponsibilityofprovidingUDFsrestedwiththelesystemdeveloper,allowingthekerneltohandlearbitrarymetadatalayoutswithoutunderstandingthelayoutitself.Specifyinganencodingoftheon-diskschemecanbetrickyforalesystemsuchasReiserFSthatusesdy-namicallocation;however,inthefuture,David'smeta-dataclassicationschemecanbenetfromamorefor-mallyspeciedon-disklayoutperle-system.9ConclusionDavidisbornoutofthefrustrationindoinglarge-scaleexperimentationonrealisticstoragehardware–aprob-lemmanyinthestoragecommunityface.Davidmakesitpracticaltoexperimentwithbenchmarksthatwereoth-erwiseinfeasibletorunonagivensystem,bytranspar-entlyscalingdownthestoragecapacityrequiredtoruntheworkload.TheavailablebackingstoreunderDavidcanbeordersofmagnitudesmallerthanthetargetde-vice.Davidensuresaccuracyofbenchmarkingresultsbyusingadetailedstoragemodeltopredicttheruntime.Inthefuture,weplantoextendDavidtoincludesupportforanumberofotherusefulstoragedevicesandcongu-rations.Inparticular,theStorageModelcanbeextendedtosupportash-basedSSDsusinganexistingsimulationmodel[5].WebelieveDavidwillbeausefulemulatorforleandstoragesystemevaluation.10AcknowledgmentsWethanktheanonymousreviewersandRobRoss(ourshepherd)fortheirfeedbackandcomments,whichhavesubstantiallyimprovedthecontentandpresentationofthispaper.TherstauthorthanksthemembersoftheStorageSystemsGroupatNECLabsfortheircommentsandfeedback.ThismaterialisbaseduponworksupportedbytheNationalScienceFoundationunderthefollowinggrants:CCF-0621487,CNS-0509474,CNS-0834392,CCF-0811697,CCF-0811697,CCF-0937959,aswellasbygenerousdonationsfromNetApp,SunMicrosystems,andGoogle.Anyopinions,ndings,andconclusionsorrecommendationsexpressedinthismaterialarethoseoftheauthorsanddonotnecessarilyreecttheviewsofNSForotherinstitutions.12 
References

[1] GraySort Benchmark. http://sortbenchmark.org/FAQ.htm#gray.

[2] N. Agrawal, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Generating Realistic Impressions for File-System Benchmarking. In Proceedings of the 7th Conference on File and Storage Technologies (FAST '09), San Francisco, CA, February 2009.

[3] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A Five-Year Study of File-System Metadata. In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST '07), San Jose, California, February 2007.

[4] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A five-year study of file-system metadata: Microsoft longitudinal dataset. http://iotta.snia.org/traces/list/Static, 2007.

[5] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy. Design Tradeoffs for SSD Performance. In Proceedings of the USENIX Annual Technical Conference (USENIX '08), Boston, MA, June 2008.

[6] E. Anderson. Simple table-based modeling of storage devices. Technical Report HPL-SSP-2001-04, HP Laboratories, July 2001.

[7] J. S. Bucy and G. R. Ganger. The DiskSim Simulation Environment Version 3.0 Reference Manual. Technical Report CMU-CS-03-102, Carnegie Mellon University, January 2003.

[8] P. M. Chen and D. A. Patterson. A New Approach to I/O Performance Evaluation – Self-Scaling I/O Benchmarks, Predicted I/O Performance. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '93), pages 1–12, Santa Clara, California, May 1993.

[9] G. R. Ganger and Y. N. Patt. Using system-level models to evaluate I/O subsystem designs. IEEE Transactions on Computers, 47(6):667–678, 1998.

[10] J. L. Griffin, J. Schindler, S. W. Schlosser, J. S. Bucy, and G. R. Ganger. Timing-accurate Storage Emulation. In Proceedings of the 1st USENIX Symposium on File and Storage Technologies (FAST '02), Monterey, California, January 2002.

[11] D. Gupta, K. Yocum, M. McNett, A. C. Snoeren, A. Vahdat, and G. M. Voelker. To infinity and beyond: time-warped network emulation. In Proceedings of the 3rd Conference on Networked Systems Design and Implementation (NSDI '06), San Jose, CA, 2006.

[12] M. F. Kaashoek, D. R. Engler, G. R. Ganger, H. Briceño, R. Hunt, D. Mazières, T. Pinckney, R. Grimm, J. Jannotti, and K. Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97), pages 52–65, Saint-Malo, France, October 1997.

[13] J. Katcher. PostMark: A New File System Benchmark. Technical Report TR-3022, Network Appliance Inc., October 1997.

[14] J. Mayfield, T. Finin, and M. Hall. Using automatic memoization as a software engineering tool in real-world AI systems. In Conference on Artificial Intelligence for Applications, page 87, 1995.

[15] R. McDougall. Filebench: Application level file system benchmark. http://www.solarisinternals.com/si/tools/filebench/index.php.

[16] E. L. Miller. Towards scalable benchmarks for mass storage systems. In 5th NASA Goddard Conference on Mass Storage Systems and Technologies, 1996.

[17] E. Riedel, M. Kallahalla, and R. Swaminathan. A Framework for Evaluating Storage System Security. In Proceedings of the 1st USENIX Symposium on File and Storage Technologies (FAST '02), pages 14–29, Monterey, California, January 2002.

[18] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, Jr. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, CA, December 2004.

[19] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, 27(3):17–28, March 1994.

[20] E. Shriver. Performance modeling for realistic storage devices. PhD thesis, New York, NY, USA, 1997.

[21] M. Sivathanu, V. Prabhakaran, F. I. Popovici, T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Semantically-Smart Disk Systems. In
Proceedings of the 2nd USENIX Symposium on File and Storage Technologies (FAST '03), pages 73–88, San Francisco, California, April 2003.

[22] Standard Performance Evaluation Corporation. SPECmail2009 Benchmark. http://www.spec.org/mail2009/.

[23] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS File System. In Proceedings of the USENIX Annual Technical Conference (USENIX '96), San Diego, California, January 1996.

[24] A. Traeger and E. Zadok. How to cheat at benchmarking. In USENIX FAST Birds of a Feather session, San Francisco, CA, February 2009.

[25] S. C. Tweedie. Journaling the Linux ext2fs File System. In The Fourth Annual Linux Expo, Durham, North Carolina, May 1998.

[26] Wikipedia. Btrfs. en.wikipedia.org/wiki/Btrfs, 2009.

[27] M. Wittle and B. E. Keith. LADDIS: The next generation in NFS file server benchmarking. In USENIX Summer, pages 111–128, 1993.

[28] E. Zadok. File and storage systems benchmarking workshop. UC Santa Cruz, CA, May 2008.