Fig. 1: Overview of the Argonne IBM Blue Gene/P (Intrepid) computing environment and storage services. This figure highlights how our proposed tier of burst buffers (green boxes) would integrate with the existing I/O nodes.

... the system (Figure 1). With this tier of burst buffers, applications can push data out of memory and return to computation without waiting for data to be moved to its final resting place on an external, parallel file system.

We begin (in Section II) by describing the motivating factors that lead us to investigate augmenting storage systems with burst buffers, and we present our design. We study this storage system architecture using the CODES parallel discrete-event storage system simulator (described in Section III). We evaluate several common I/O workloads found in scientific applications to determine the appropriate design parameters for storage systems that include burst buffers (presented in Section IV). We discuss research related to our recent work (in Section V). We conclude this paper (in Section VI) by enumerating the contributions generated by our work, in particular better quantifying the requirements of a burst buffer implementation and the degree to which external storage hardware requirements might be reduced using this approach.

II. MANAGING BURSTY I/O

Bursty application I/O behavior is a well-known phenomenon. This behavior has been observed in prior studies for HPC applications performing periodic checkpoints [10], [19], [28], [32], [37], [43] and for the aggregate I/O activity across all applications executing within large HPC data centers [8], [17]. To better understand the viability of the burst buffer approach, we need quantitative data on application I/O bursts so that we can accurately represent this behavior in our simulated I/O workloads. In this section, we present our analysis of bursty application I/O behavior that we observed on a large-scale HPC storage system. First, we analyze the most bursty and write-intensive applications we observed over a one-month period on a large-scale HPC system. Next, we describe how these trends hinder the performance of current systems. Then, we discuss how to manage this behavior through the use of burst buffers.

A. Study of Bursty Applications

The Argonne Leadership Computing Facility maintains the Intrepid IBM Blue Gene/P system. Intrepid is a 557 TF leadership-class computational platform and provides access to multiple petabytes of GPFS and PVFS external storage. Figure 1 provides an overview of Intrepid and the external storage services integrated with the system. Systems such as Intrepid host a diverse set of applications from many scientific domains, including climate, physics, combustion, and Earth sciences. Workloads from these scientific domains are often characterized by periodic bursts of intense write activity. These bursts result from defensive I/O strategies (e.g., checkpoints that can be used to restart calculations following a system fault) or storage of simulation output for subsequent analysis (e.g., recording time series data for use in visualization).

To quantify this behavior on Intrepid, we analyzed one month of production I/O activity from December 2011 using the Darshan lightweight I/O characterization tool [9]. Darshan captures application-level access pattern information with per-process and per-file granularity. It then produces a summary of that information in a compact format for each job. In December 2011, Darshan instrumented approximately 52% of all production core-hours consumed on Intrepid. We identified the four most write-intensive applications for which we had complete data and analyzed the largest production example of each application. The results of this analysis are shown in Table I. Project names have been generalized to indicate the science or engineering domain of the project.

TABLE I: Top four write-intensive jobs on Intrepid, December 2011. For each job, write bursts in which more than 1 GiB was written are summarized by count, average size (total, per compute node, and per I/O node), and subsequent idle time.

Project       | Procs   | Nodes  | Total    | Run Time (hours) | Count | Size      | Size/Node | Size/ION  | Idle Time (sec)
PlasmaPhysics | 131,072 | 32,768 | 67.0 TiB | 10.4             | 1     | 33.5 TiB  | 1.0 GiB   | 67.0 GiB  | 7554
              |         |        |          |                  | 1     | 33.5 TiB  | 1.0 GiB   | 67.0 GiB  | end of job
Turbulence1   | 131,072 | 32,768 | 8.9 TiB  | 11.5             | 5     | 128.2 GiB | 4.0 MiB   | 256.4 MiB | 70
              |         |        |          |                  | 1     | 128.2 GiB | 4.0 MiB   | 256.4 MiB | end of job
              |         |        |          |                  | 421   | 19.6 GiB  | 627.2 KiB | 39.2 MiB  | 70
AstroPhysics  | 32,768  | 8,096  | 8.8 TiB  | 17.7             | 1     | 550.9 GiB | 68.9 MiB  | 4.3 GiB   | end of job
              |         |        |          |                  | 8     | 423.4 GiB | 52.9 MiB  | 3.3 GiB   | 240
              |         |        |          |                  | 37    | 131.5 GiB | 16.4 MiB  | 1.0 GiB   | 322
              |         |        |          |                  | 140   | 1.6 GiB   | 204.8 KiB | 12.8 MiB  | 318
Turbulence2   | 4,096   | 4,096  | 5.1 TiB  | 11.6             | 21    | 235.8 GiB | 59.0 MiB  | 3.7 GiB   | 1.2
              |         |        |          |                  | 1     | 235.8 GiB | 59.0 MiB  | 3.7 GiB   | end of job

We discovered examples of production applications that generated as much as 67 TiB of data in a single execution. Two of the top four applications (Turbulence1 and AstroPhysics) illustrate the classic HPC I/O behavior in which data is written in several bursts throughout the job execution, each followed by a significant period of idle time for the I/O system. The PlasmaPhysics application diverged somewhat in that it produced only two bursts of significant write activity; the first burst was followed by an extended idle period, while the second burst occurred at the end of execution. The Turbulence2 application exhibited a series of rapid bursts that occurred nearly back-to-back at the end of execution. On a per compute node basis, the average write requests range from 0.03% to 50% of the memory size for these applications. We expect the write request per compute node to be limited by the physical memory of the node (2 GiB on Intrepid). From the I/O node perspective, the write burst sizes range from 40 MiB to 67 GiB (or 0.05% to 52.3% of the total amount of memory available to the application). Also, the observed idle time between two write bursts varies among the applications from a couple of minutes to as long as two hours. They all feature an end-of-job write burst. We used these statistics to guide the parameterization of our simulated application I/O workloads.
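To make the Size/Node and Size/ION columns of Table I concrete, the short sketch below recomputes them from the burst totals, assuming (as stated for Intrepid in Section III-C) 64 compute nodes per I/O node; small discrepancies with the table are rounding.

```python
# Recompute Table I's Size/Node and Size/ION columns from the burst totals.
# Assumes Intrepid's 64:1 compute-node-to-I/O-node ratio.

TiB = 1024**4
GiB = 1024**3
MiB = 1024**2

bursts = [
    # (project, nodes, average burst size in bytes)
    ("PlasmaPhysics", 32768, 33.5 * TiB),
    ("Turbulence1",   32768, 128.2 * GiB),
    ("Turbulence2",    4096, 235.8 * GiB),
]

for project, nodes, size in bursts:
    ions = nodes // 64                    # 64 compute nodes share one ION
    per_node = size / nodes
    per_ion = size / ions
    print(f"{project:14s} {per_node / MiB:10.1f} MiB/node "
          f"{per_ion / GiB:8.2f} GiB/ION")

# PlasmaPhysics: ~1.0 GiB/node and ~67 GiB/ION, matching Table I.
```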
B. Impact on Storage Systems

External storage systems, such as the one integrated with Intrepid, are designed to quickly and efficiently store application checkpoint data. This design point leads to storage systems that may be idle during application computation phases and saturated during application I/O phases. We have observed extreme cases of bursty application I/O patterns and the lack of storage system utilization in prior work [8]. We made several observations over a two-month period that indicate that the achievable throughput to Intrepid's external storage system is often not realized, except for sporadic bursts of I/O activity that correspond with high storage system utilization. In prior work, Intrepid's storage bandwidth was observed to be at 33% or less utilization for 99.2% of the time over a two-month period. Over this same period, Intrepid's external storage system operated at 5% or less of its expected peak bandwidth for 69.2% of the time. While the maximum observed throughput over this interval was 35.05 GiB/s, the average throughput was 1.93 GiB/s. These observations indicate that a lower-bandwidth external storage system could provide the required level of service if bursty I/O traffic could be spread out over a longer period of time (while allowing applications to resume computation). Given the decreasing cost of solid-state storage, this approach is likely to be more economical.

C. Absorbing Bursty I/O Patterns

One solution to handling I/O bursts in large-scale HPC systems is to absorb the I/O bursts at an intermediate storage layer consisting of burst buffers. Burst buffers are high-throughput, low-capacity storage devices that act as a staging area or a write-behind cache for HPC storage systems. A compelling approach to incorporating burst buffers is to place these buffers on I/O nodes that connect to the external storage system and to manage these buffers as part of the I/O forwarding services. Figure 1 illustrates how burst buffers could be integrated within Intrepid's existing infrastructure. We envision that burst buffers will integrate at HPC I/O nodes and will be managed by I/O forwarding software. If the burst buffers are sufficiently large and fast, they can absorb the relatively infrequent I/O bursts observed during our studies.

From the data we gathered during our recent workload study, we see that three of the top four production applications on Intrepid could benefit greatly from a burst buffer architecture in terms of potential reduction in perceived I/O time. The majority of existing applications exhibit several periods of idle I/O activity between I/O bursts. By aggregating and absorbing the I/O requests into the burst buffer layer, applications can overlap computations that follow I/O bursts while bleeding the data from the burst buffer to external storage. Without these burst buffers, applications would block until all I/O requests had completed and would allow no potential for optimization or overlapping computation and I/O activity. The Turbulence2 application would likely benefit the least because all of its data is written at the end of execution, allowing no opportunity to overlap buffer flushing activity with application computation. However, if data were allowed to trickle out of the burst buffer after the job was complete, the Turbulence2 application would still see a reduction in overall run time.
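As a rough illustration of this overlap argument (not the CODES model itself), the sketch below compares perceived time for a periodic compute/write cycle with and without a burst buffer; the bandwidths, burst size, and phase count are placeholder values chosen only for the example, and the buffer is assumed large enough to absorb one full burst.

```python
# Back-of-the-envelope model of the overlap benefit described above.
# Illustrative only: all numeric parameters below are placeholders.

def runtime(phases, burst_bytes, compute_sec, ext_bw, bb_bw=None):
    """Total run time for `phases` compute+write cycles.

    Without a burst buffer (bb_bw is None) the application blocks for the
    full external write. With one, it blocks only for the transfer into
    the buffer; the flush to external storage overlaps the next compute
    phase, and only a flush longer than that phase delays the application.
    """
    total, pending_flush = 0.0, 0.0
    for _ in range(phases):
        total += max(compute_sec, pending_flush)   # drain previous flush
        if bb_bw is None:
            total += burst_bytes / ext_bw          # blocking external write
            pending_flush = 0.0
        else:
            total += burst_bytes / bb_bw           # absorb into burst buffer
            pending_flush = burst_bytes / ext_bw   # drains in background
    return total + pending_flush                   # final end-of-job drain

GiB = 1024**3
burst = 64 * GiB            # placeholder burst size per cycle
print(runtime(4, burst, 300.0, ext_bw=2 * GiB))                  # no burst buffer
print(runtime(4, burst, 300.0, ext_bw=2 * GiB, bb_bw=16 * GiB))  # with one
```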
III. MODELING HPC STORAGE SYSTEMS AND APPLICATIONS

Our simulation tools are built on top of the Rensselaer Optimistic Simulation System (ROSS). Using ROSS, we have implemented and validated [20] a storage system simulator for the CODES exascale storage system project. This simulator models the storage system hardware and software protocols used by the ALCF's Intrepid IBM Blue Gene/P system. As part of this work, we extended this storage system simulator to include a burst buffer tier of storage and updated the I/O forwarding software protocols to manage data stored in the burst buffers. In the remainder of this section, we present the tools used in the simulation component of our burst buffer study.

Fig. 2: Integration of the burst buffer model into the CODES storage system write request model; dotted arrows represent asynchronous I/O when the burst buffer is in use.

A. Parallel Discrete-Event Simulation

ROSS is a massively parallel discrete-event simulator that has demonstrated the ability to process billions of events per second by leveraging large-scale HPC systems [4], [21]. A parallel discrete-event simulation (PDES) system consists of a collection of logical processes, or LPs, each modeling a distinct component of the system being modeled (e.g., a file server). LPs communicate by exchanging timestamped event messages (e.g., denoting the arrival of a new I/O request at that server). The goal of PDES is to efficiently process all events in a global timestamp order while minimizing any processor synchronization overheads. Two well-established approaches toward this goal are broadly called conservative processing and optimistic processing. ROSS supports both approaches. All results presented in this paper use the conservative approach.
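For readers unfamiliar with the LP abstraction, the following minimal sketch shows the core mechanic: LPs exchange timestamped events that are processed in global timestamp order. It is a sequential teaching sketch, not ROSS's parallel engine, and the file-server LP with a fixed service time is an invented example component.

```python
# Minimal sequential discrete-event core illustrating the LP abstraction.
# A real PDES engine partitions LPs across processors and synchronizes
# them conservatively or optimistically; none of that is modeled here.

import heapq
import itertools

class FileServerLP:
    """A logical process modeling a file server with a fixed service time."""
    def __init__(self, name, service_time):
        self.name, self.service_time, self.busy_until = name, service_time, 0.0

    def handle(self, now, event, schedule):
        if event == "io_request":
            start = max(now, self.busy_until)          # FIFO queueing
            self.busy_until = start + self.service_time
            schedule(self.busy_until, self, "io_complete")
        else:
            print(f"{now:8.2f}s {self.name}: request completed")

def run(initial_events):
    queue, seq = [], itertools.count()        # seq breaks timestamp ties
    def schedule(ts, lp, ev):
        heapq.heappush(queue, (ts, next(seq), lp, ev))
    for ts, lp, ev in initial_events:
        schedule(ts, lp, ev)
    while queue:                              # process in global timestamp order
        ts, _, lp, ev = heapq.heappop(queue)
        lp.handle(ts, ev, schedule)

server = FileServerLP("fs0", service_time=0.5)
run([(0.0, server, "io_request"), (0.1, server, "io_request")])
```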
B. CODES Storage System Model

The end-to-end storage system model is composed of several component models that capture the interactions of the system software and hardware for I/O operations. The models used in our simulations include networks, hardware devices, and software protocols. The storage system model also provides several configuration parameters that dictate the execution behavior of application I/O requests.

We abstracted the common features of each Blue Gene/P hardware component into compute node (CN), I/O node (ION), and parallel file system (PFS) models. The PFS model includes file server and enterprise storage submodels. These models are the logical processes in our end-to-end storage system model, which are the most basic units of our parallel discrete-event model. The various BG/P networks are modeled as the links connecting each LP. Each LP includes three buffers. The incoming buffer is used to model the queuing effects from multiple LPs trying to send messages to the same LP. The outgoing buffer is used to model queuing effects when an LP tries to send multiple messages to different LPs. The processing buffer is used to model queuing effects caused by a processing unit, such as a CPU, DMA engine, storage controller, or router processor. The units process incoming messages in FIFO order. The network connection between two LPs is modeled as messages transmitted between the two LPs, where each LP's incoming buffer is connected to the other LP's outgoing buffer. Furthermore, the commodity networks (Ethernet and Myrinet) are modeled by the links connecting the IONs with the storage servers. If we increase the fidelity of our models in the future, additional network components, such as routers and switches, can be modeled as LPs.

In prior work, we evaluated the accuracy of these models [20]. The storage system models were developed to be software protocol-level accurate. We opted for protocol-level fidelity over cycle-level fidelity to provide a simulation framework that is high-performance and accurate enough for our experiments. We validated these models using data collected on Argonne's Intrepid Blue Gene/P system while Intrepid was being deployed. Our simulated results accurately reflected the performance variations reported in prior work [18] from 2,048 to 131,072 client processes, at roughly a 10% error rate.

C. Burst Buffer Support in the CODES Storage System Model

We made several modifications to the existing CODES storage system simulator to support a tier of burst buffer storage. First, we developed a simple model of a solid-state storage device, identified the parameters of interest for modeling these devices, and consulted the literature for current SSD products to define reasonable values for these parameters. Next, we updated the I/O node hardware model in the CODES simulator to support the integration of burst buffers. Then, we updated the write request protocol of our simulator so that the I/O forwarding software layer managed I/O requests stored in the burst buffers.

$T = L_{\mathrm{data}} / B_{BB} + T_{BB}$    (1)

Equation 1 describes the analytical model used by our simulator to compute the burst buffer data access costs. We define the access time (T) such that it is influenced by the data transfer size (L_data), the device throughput (B_BB), and the data access latency (T_BB). This is a naïve model of SSD devices; it assumes the same costs for read and write operations, and it ignores decreases in device reliability and write endurance [6], [33]. However, this model is sufficient for approximating the general performance profile of an end-to-end storage system using these devices, since the fidelity of this model is on a par with the other hardware models used by our simulator.

To identify realistic parameters for the burst buffer device model, we investigated the parameters and characteristics of several solid-state storage devices. Currently, several solid-state storage devices are appropriate for use as burst buffers. Table II summarizes the capacity, latency, and throughput parameters for some of these devices. In general, these devices provide between 0.25 TiB and 1.4 TiB of storage capacity, write throughputs ranging from 0.21 GiB/s to 1.3 GiB/s, and access latencies between 15 µs and 80 µs, which are on a par with the practical application requirements described in Table I.

TABLE II: Summary of relevant SSD device parameters and technology available as of January 2012.

Vendor   | Size (TiB) | NAND | Write BW (GiB/s) | Read BW (GiB/s) | Write Latency (µs) | Read Latency (µs)
FusionIO | 0.40       | SLC  | 1.30             | 1.40            | 15                 | 47
FusionIO | 1.20       | MLC  | 1.20             | 1.30            | 15                 | 68
Intel    | 0.25       | MLC  | 0.32             | 0.50            | 80                 | 65
Virident | 0.30       | SLC  | 1.10             | 1.40            | 16                 | 47
Virident | 1.40       | MLC  | 0.60             | 1.30            | 19                 | 62
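A direct transcription of Equation 1, parameterized with the FusionIO SLC row of Table II (write path), gives a feel for the modeled access costs; the request sizes below are arbitrary examples, not values from the paper.

```python
# Equation 1: burst buffer access cost T = L_data / B_BB + T_BB.
# Parameters taken from the FusionIO SLC row of Table II (write path);
# the request sizes are arbitrary examples.

GiB = 1024**3
MiB = 1024**2

B_BB = 1.30 * GiB        # device write throughput (bytes/sec)
T_BB = 15e-6             # device write access latency (sec)

def access_time(l_data):
    """Modeled time to write l_data bytes to the burst buffer."""
    return l_data / B_BB + T_BB

for size in (4 * MiB, 256 * MiB, 4 * GiB):
    print(f"{size / MiB:8.0f} MiB -> {access_time(size) * 1e3:10.3f} ms")
```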
On Intrepid the compute node to I/O node ratio is 64:1, with each compute node having 2 GiB of main memory. The smallest solid-state device in our survey has a capacity of 0.25 TiB, which is more than enough to hold all the data in main memory on all associated compute nodes in a single burst. We used the range of device capacity, access latency, and throughput parameters to dictate possible storage system configurations in our simulations.

Figure 2 illustrates how the burst buffer model integrates with the existing CODES storage models. At the compute node level, the write request kernel is translated to a simulator trigger event, and control is passed to the underlying simulation engine. All the following events (boxes) represent the model details of the protocols used in the storage system. In the original protocol, the compute nodes forward application I/O requests to the I/O nodes, and the I/O forwarding software replays these requests to the file system. To support burst buffer storage at the I/O nodes, we modified the I/O forwarding write request protocol to manage application data cached in these devices. Our burst buffer data management protocol model is similar to a write-behind cache policy. First, the I/O forwarding software attempts to reserve space in the burst buffer for the application data. The I/O forwarding software then receives the application data and deposits this data into the burst buffer. Once the data is buffered, the I/O forwarding software signals the application that the write operation completed. The I/O forwarding software then transfers the buffered data to the file system. No other adjustments were made to the write request protocol. The application client can force all data in the burst buffer to be flushed to the file system through a commit method. The details of the file server and disk-level models are not addressed in Figure 2. Additional details of these models are documented in our prior work [20].

Several parameters can influence application I/O behavior in this model. The solid-state device parameters include latency, bandwidth, and capacity. Latency determines the data access rates to these devices, and capacity is the amount of buffer space that the devices can provide. The memory copy bandwidth on the I/O nodes also limits the rate at which the application data payload can be transferred between RAM and the burst buffers. The storage network throughput and the disk-based storage system throughput dictate how quickly the burst buffers can be flushed. The compute node network controls how quickly we can fill the burst buffers with application data. If sufficient space is available to buffer the application data, applications will transfer data directly to the burst buffer and avoid other storage system protocol or hardware costs. However, the cost of flushing the burst buffers and the protocol interactions with the external storage system are still accounted for and visible to the applications. If space is not available to buffer the application data, the original write request protocol (without burst buffer support) is executed.
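The write-behind protocol described above can be summarized in executable form. The sketch below names its own hypothetical helpers (handle_write, _flush, pfs_write); it mirrors the four protocol steps from the text but is not the CODES implementation, which is a discrete-event model rather than threaded code.

```python
# Sketch of the modeled write-behind protocol at the I/O forwarder.
# Function and class names here are hypothetical labels for the steps
# described in the text, not CODES APIs.

import threading

class BurstBufferForwarder:
    def __init__(self, capacity_bytes, pfs_write):
        self.free = capacity_bytes       # remaining burst buffer space
        self.lock = threading.Lock()
        self.pfs_write = pfs_write       # callback writing to external storage

    def handle_write(self, data):
        """Returns once the write is durable in the buffer (or on the PFS)."""
        with self.lock:
            reserved = self.free >= len(data)   # step 1: reserve space
            if reserved:
                self.free -= len(data)
        if not reserved:
            self.pfs_write(data)                # fall back to original protocol
            return
        buffered = bytes(data)                  # step 2: deposit into buffer
        # step 3: the application sees the write as complete when we return;
        # step 4: the flush to the file system proceeds in the background.
        threading.Thread(target=self._flush, args=(buffered,)).start()

    def _flush(self, buffered):
        self.pfs_write(buffered)
        with self.lock:
            self.free += len(buffered)          # space reclaimed after flush

bb = BurstBufferForwarder(4 * 1024**3, pfs_write=lambda d: None)
bb.handle_write(b"x" * (4 * 1024**2))           # 4 MiB write returns promptly
```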
D. CODES I/O Workloads

To drive the simulator, we used several workloads derived from the I/O patterns of synthetic benchmarks and scientific applications. In our prior work [20], the I/O workloads were tightly integrated with our simulator. To further generalize our storage simulator and to allow us to evaluate multiple types of I/O workloads, we developed a small I/O description language and interpreter. We integrated the language interpreter into the CODES simulator. The language and interpreter allow us to describe a variety of application workloads; provide features to dictate the placement of applications within the system; and define I/O workloads consisting of multiple, parallel, concurrently executing applications. For the experiments we present in this paper, we developed several I/O kernels based on the IOR synthetic benchmark, the FLASH astrophysics application, and general I/O kernels that mimic the I/O patterns described in Table I.

The first I/O kernel we developed was for the IOR benchmark. IOR is a suitable proxy for some HPC I/O workloads and is often used for evaluating supercomputer I/O performance [40]. The IOR I/O pattern represented in our I/O kernel consists of many processes concurrently writing large blocks of data (multiple megabyte chunks) into a shared file. We validated the results generated by this kernel against results generated during our prior work [20].

Next, we evaluated the I/O workload for the FLASH astrophysics code [36] and developed an I/O kernel for this code. FLASH is a scientific application used to study nuclear flashes that occur on the surfaces of white dwarfs and neutron stars. It is an extremely scalable scientific application that has successfully reached scales of at least 65,536 cores on Argonne's Intrepid Blue Gene/P system [19]. For our experiments, we distilled two phases of the FLASH I/O workload: the checkpoint I/O phase and the plot file I/O phase. During the checkpoint I/O phase of FLASH, each process interleaves several large blocks of variable data stored in double-precision floating-point format into a single, shared data file. During the plot file I/O phase, each process interleaves several medium-sized blocks of variable data stored in single-precision floating-point format into a single, shared data file. FLASH writes data for each variable as a contiguous buffer into the checkpoint and plot files. The high-level I/O library used and the I/O optimizations enabled by this library can dictate the I/O pattern used to describe the data generated by the application. For example, using the HDF5 high-level I/O library with collective I/O optimizations disabled will force all application processes to write several small blocks (between 100 and 4,000 bytes) to the file before writing the larger variable data blocks. Many of the file accesses are unaligned.

We evaluated the FLASH I/O kernel at a scale of 65,536 application processes in our storage system simulator. Our I/O kernel was configured to mimic the HDF5 I/O behavior of FLASH when collective I/O optimizations are disabled. In prior work [18], we observed on Intrepid that FLASH checkpoint files could be stored at 21.05 GiB/s and the FLASH plot file could be stored at 6.4 GiB/s. Initially, we observed that the simulator stored the FLASH checkpoint file data at 30.77 GiB/s and the plot file data at 15.51 GiB/s. We discovered that the simulator did not correctly account for the saturation of the commodity storage network and did not correctly handle small HDF5 I/O requests (100s to 1000s of bytes) written to the enterprise storage model. We adjusted our commodity network model to account for network saturation when using more than 65,536 processes and penalized small I/O requests written to the enterprise storage device model. With these changes, the simulator computed the checkpoint file data storage rate as 20.01 GiB/s and the plot file data storage rate as 6.715 GiB/s.
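The paper does not reproduce the description language's syntax, so the following is a purely hypothetical sketch of the kind of information such a description must carry (application placement, phase structure, request sizes), expressed as a Python structure rather than the actual CODES language. The phase parameters echo the Table I-derived kernels described in Section IV-B.

```python
# Hypothetical illustration of what an I/O workload description must capture.
# This is NOT the CODES description language (its syntax is not given in the
# paper); it only mirrors the features the text attributes to it: multiple
# concurrent applications, placement, and per-phase request patterns.

MiB = 1024**2

workloads = [
    {   # modeled on the PlasmaPhysics pattern from Table I
        "name": "plasma_physics",
        "placement": {"partition": 0, "nodes": 8192},
        "phases": [
            {"op": "write", "reqs": 2, "req_size": 256 * MiB,
             "interval_sec": 7200},
        ],
    },
    {   # modeled on the Turbulence1 pattern from Table I
        "name": "turbulence1",
        "placement": {"partition": 1, "nodes": 8192},
        "phases": [
            {"op": "write", "reqs": 220, "req_size": 4 * MiB,
             "interval_sec": 70},
        ],
    },
]

for app in workloads:
    total = sum(p["reqs"] * p["req_size"] for p in app["phases"])
    print(f'{app["name"]}: {total / MiB:.0f} MiB written per process')
```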
IV. BURST BUFFER STUDY

In this section, we explore the burst buffer storage system design using our simulator. First, we investigate how burst buffers influence individual application I/O behavior. Next, we explore how burst buffers influence the performance and design of the external storage system when simulating multiple, concurrent applications.

A. Single I/O Workload Case Studies

To understand how burst buffers influence application I/O performance, we evaluated several I/O workloads in our simulator. In these experiments, we focused on exploring the parameter space of the burst buffers and application I/O access patterns. We limited these analyses to a single application I/O workload so that we could observe the impact of burst buffers from the application's perspective.

The goal of our first experiment was to quantify the I/O acceleration that burst buffers can provide to an application. These experiments used the IOR I/O kernel we described in Section III-D. We configured the I/O kernel in these experiments to write four consecutive chunks of data 4 MiB in length. We configured the simulated storage system to use 4 MiB stripes. Thus, the I/O kernel will write stripe-aligned data requests to the storage system and should achieve the best possible throughput for the I/O kernel. The configuration of the simulated storage system was similar to experiments presented in our prior work [20]. This configuration included 123 file servers and 123 enterprise storage LUNs connected to the Blue Gene/P system through a commodity storage network. File system clients were hosted on the I/O nodes, and groups of 256 compute processes (on 64 compute nodes) shared access to a single I/O node. In addition to these configuration parameters, we made two adjustments to the experimental setup of our simulation. First, we configured an additional external storage system that consists of 64 file servers and 64 enterprise storage LUNs. This additional storage system configuration is approximately half the storage system provided by the existing ALCF computing environment. Next, we configured each I/O node to use a 4 GiB burst buffer with a transfer rate of 1.5 GiB/s. While a 4 GiB burst buffer is substantially smaller than the devices presented in Table II, this size can fit all the application data generated by this experiment. Increasing the burst buffer size does not affect the performance of the model or the simulation.

Fig. 3: Simulated performance of IOR for various storage system and burst buffer configurations.

Figure 3 illustrates the perceived I/O bandwidth at the client processes for a variety of scales. This bandwidth accounts for the time for the application to complete its I/O requests. It does not account for the time to open or close the file, or the time to ensure the durability of the application data on the external storage system. Thus, the results of this experiment illustrate the maximum achievable application I/O bandwidth for storage systems with and without burst buffers. The full storage system without a burst buffer achieves sustained bandwidth similar to our prior study [20] because all application I/O requests interact with the external storage system. The half storage system configuration without a burst buffer exhibits similar I/O performance trends but achieves only half the bandwidth of the full storage system. When burst buffers are enabled for either external storage system configuration, we achieve linear scaling of I/O bandwidth. The burst buffers essentially cache the application I/O requests and suppress the cost of interacting with the external storage system. In this test, the application I/O performance is limited by the bandwidth of the Blue Gene/P tree network that connects the compute nodes with the I/O nodes. This network has a maximum bandwidth of 700 MiB/s for 4 MiB I/O requests.
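Two sanity checks on this setup, computed below from the stated parameters: one ION's share of the first experiment's data exactly fits the 4 GiB burst buffer, and the per-ION ingest is bounded by the 700 MiB/s tree link.

```python
# Sanity checks for the first IOR experiment, using only parameters stated
# in the text: 256 processes per ION, four 4 MiB chunks per process, a 4 GiB
# burst buffer per ION, and a 700 MiB/s tree link per ION.

MiB = 1024**2
GiB = 1024**3

procs_per_ion = 256
data_per_ion = procs_per_ion * 4 * (4 * MiB)   # four 4 MiB chunks per process

print(data_per_ion / GiB, "GiB per ION")       # 4.0 -> fills the buffer exactly
assert data_per_ion <= 4 * GiB                 # burst buffer holds one burst

# With the buffer absorbing the burst, ingest is tree-network limited:
print(data_per_ion / (700 * MiB), "sec to fill at 700 MiB/s")  # ~5.9 s
```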
In order to observe the benefits of the burst buffers when accounting for data flushes, applications must overlap computations with the burst buffer flush. We modified our initial IOR experiment to account for the cost of flushing burst buffer data so that we could quantify the benefit of computation and I/O overlap. This new experiment measures application run times at various scales for storage systems configured with and without burst buffers. In this experiment, we used the full storage system configuration consisting of 123 file servers. We added a 20 GiB burst buffer to each I/O node.

Fig. 4: Results of the burst buffer and application I/O workload parameter space investigations. (a) Simulated application run time performance for various I/O and burst buffer configurations. (b) Simulated IOR performance for various burst buffer configurations and I/O workloads.

Figure 4a illustrates the simulated application run time (computation time plus I/O time) at various scales. We collected data for three application configurations. Each application executed two computation phases that consumed five minutes of application run time. First, we measured the application run time when the application executed no I/O operations. This experiment represents the best possible case for the application because it does not interact with the storage system and is not affected by external storage system costs. Next, we measured the run time for an application that performs I/O directly to the external storage system and does not use a burst buffer. After each computation phase, each process enters an I/O phase and writes four 20 MiB chunks of data. For this application, this test case represents the worst possible case for the application run time, since the I/O is not overlapped with application execution. Then, we measured the application run time with burst buffers enabled. Similar to the previous test case, each process writes four 20 MiB chunks of data after each computation phase. However, the I/O forwarding layer completes the application I/O request once the data is transferred to the burst buffer. Thus, the application computation can overlap the burst buffer data transfers to the external storage system. This test case highlights how the application run time decreases when the application computation is overlapped with the I/O forwarding layer writing burst buffer data to external storage.

Next, we investigated how the application bandwidth is affected by the burst buffer capacity and application I/O request sizes. The results of this experiment are illustrated in Figure 4b. For this experiment, we configured the IOR I/O kernel to issue four I/O requests ranging from 1 MiB to 16 MiB. We evaluated this I/O pattern using 32,768 IOR processes, 123 file servers, and a 4 GiB burst buffer on each I/O node. We measured the perceived application I/O bandwidth and ignored the cost of flushing the burst buffer data to the external storage system. When the burst buffer is disabled, the application I/O performance is limited by the external storage system performance for all I/O request sizes. When the burst buffer is enabled, the application I/O performance is limited by the time to transfer data from the compute node into the burst buffer while the burst buffer still has free space. In this experiment, this occurs at the 4 MiB request size test. After this point, the application I/O performance is limited by the external storage system because the I/O requests overflow the burst buffer.

B. Multiapplication Case Study

An interesting question that arises when burst buffers are introduced into the system is the degree to which multiple applications running simultaneously might conflict with one another, and in what ways. I/O bursts that saturate the storage system can delay or starve data accesses generated by other applications competing for access to the same storage resources [23]. Interleaved I/O requests from multiple, unrelated sources can lead to random I/O workloads for storage devices to handle and often require additional software to effectively reorder these requests [24], [30]. As part of our study we simulated the behavior of the system with multiple applications. We note that the current I/O forwarding and burst buffer models do not perform any intelligent reordering or transformation on the I/O request stream; rather, they replay operations in order on each I/O forwarder.

For this study we initialized the simulator to have 32K compute nodes and 512 IONs. The SLC NAND FusionIO card parameters were used for the burst buffer model (i.e., 400 GiB capacity, 1.3 GiB/sec write rate). Two external storage configurations were tested: 128 file servers with 16 enterprise storage racks (full storage system) and 64 file servers with 8 enterprise storage racks (half storage system).
We separated the compute nodes into three partitions: two 8K node partitions and one 16K node partition. From the data in Table I we generated three application I/O patterns that reflect the patterns seen in the PlasmaPhysics, AstroPhysics, and Turbulence1 applications. The PlasmaPhysics kernel was configured so that each application process issued two large (256 MiB) write request bursts separated by a two-hour interval. The AstroPhysics kernel was set up to execute three small I/O phases followed by a large I/O phase, where each phase was separated by a five-minute interval. This pattern was repeated eleven times. The Turbulence1 I/O kernel executed 220 small I/O phases separated by a 70-second interval. We configured the simulation to run for 5 simulated hours, enough time for the "applications" to reach steady state and for us to observe I/O activities and conflicts. We simulated test cases with burst buffers enabled and disabled. Additionally, we conducted tests using the full and half external storage system configurations.

The results of these simulations are illustrated in Figures 5 and 6. During these simulations, we collected 2,000 samples of the total amount of data transferred throughout the simulator at ten-second intervals. The ten-second average data transfer rates are reported in Figures 5 and 6.

One of the observations we made from the multiapplication experiment is that burst buffers accelerate the application-perceived throughput under mixed I/O workloads. The 400 GiB burst buffer was large enough to buffer the data requests generated by all three workloads. Additionally, decreasing the size of the storage system while using burst buffers had no noticeable impact on the mixed I/O workloads' performance. Without burst buffers, decreasing the external storage system by half its original size impacted the applications' I/O performance. Figures 5a and 5b show that the peak bandwidth of the storage system is slightly less in the half storage system configuration, and the time to complete all the I/O requests is extended. When burst buffers are enabled, decreasing the external storage system still provides exceptional I/O bandwidth for the applications. Figures 5c and 5d indicate no observable difference in aggregate I/O performance when the storage system is decreased by half its original size.

Figures 6a through 6d illustrate simulated disk I/O statistics monitored during the same time period. When the burst buffer is enabled, the disk I/O requests appear more condensed, resulting in better disk utilization. Comparing Figures 6c and 6d, we find that job execution time is approximately the same, which shows an even higher disk utilization in the half storage system test case. The burst buffer is capable of feeding the storage system with large enough I/O requests, and these requests are fully digested by the storage system in the time intervals between the write requests. This indicates the potential of saving storage resources with the burst buffer approach while hitting performance targets.
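A rough feasibility check of that last observation, sketched below: a PlasmaPhysics-style burst absorbed by the burst buffers can be drained well within the two-hour gap before the next burst, even at a conservative external-storage rate. The 1.93 GiB/s figure is the average throughput reported in Section II-B; the partition assignment and one-process-per-node layout are assumptions made only for this example.

```python
# Can the external storage digest a buffered burst before the next one?
# Illustrative check: assume the PlasmaPhysics-style kernel runs on an
# 8K-node partition (assignment not stated in the text) with one 256 MiB
# burst per process and a two-hour gap between bursts. The drain rate of
# 1.93 GiB/s is the average external throughput reported in Section II-B.

MiB, GiB = 1024**2, 1024**3

procs = 8192                     # assumed partition size (one process/node)
burst = procs * 256 * MiB        # total data absorbed by the burst buffers
gap_sec = 2 * 3600               # interval before the next burst

drain_sec = burst / (1.93 * GiB)
print(f"burst: {burst / GiB:.0f} GiB, drain: {drain_sec:.0f} s "
      f"({drain_sec / gap_sec:.0%} of the gap)")
# -> 2048 GiB drains in ~18 minutes, a small fraction of the two-hour gap.
```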
V. RELATED WORK

Three areas of work are related to our research. First, a significant amount of research has been devoted to accurate and scalable simulation methods and frameworks for HPC systems, with a relative increase in activity to generate simulation tools for designing exascale computing systems. The second area of related research involves asynchronous I/O methods for HPC systems; this area focuses on identifying techniques for processing HPC application I/O requests in the background of application execution. The third area of related work focuses on integrating NVRAM or solid-state storage into HPC systems for use as a fast tier of storage.

As part of the exascale co-design process, significant interest has arisen in understanding how parallel system software such as MPI and the associated supercomputing applications will scale on future architectures. For example, Perumalla's system [34] will allow MPI programs to be transparently executed on top of the MPI modeling layer and simulate the MPI messages. A number of universities and national laboratories have joined together to create the Structural Simulation Toolkit (SST) [35]. SST includes a collection of hardware component models, including processors, memory, and networks, at different levels of accuracy. These models use parallel, component-based discrete-event simulation based on MPI. BigSim [44] focuses on modeling and predicting the behavior of sequential execution blocks of large-scale parallel applications. Our simulator differs from these projects because our tools support the construction of storage system models and address how to represent the storage system software protocols in parallel discrete-event simulators.

Researchers have also developed a number of parallel file system simulators. The IMPIOUS simulator [26] was developed for fast evaluation of parallel file system designs. It simulates PVFS, PanFS, and Ceph file systems based on user-provided file system specifications, including data placement strategies, replication strategies, locking disciplines, and caching strategies. The HECIOS simulator [38] is an OMNeT++ simulator of PVFS; it was used to evaluate scalable metadata operations and file-data-caching strategies for PVFS. PFSsim [22] is an OMNeT++ PVFS simulator that allows researchers to explore I/O scheduling algorithm design. PVFS and ext3 file systems have been simulated by using colored Petri nets [29]; this simulation method yielded low simulation error, with less than 10% error reported for some simulations. Checkpoint workloads and storage system configurations were recently evaluated with the SIMCAN simulator, an OMNeT++ simulator for HPC architectures [31]; from the SIMCAN simulations, the authors concluded that increasing the performance of the external storage system and storage networks were effective methods for improving application checkpoint performance. The focus of CODES sets it apart from these related simulation tools. One of the goals of CODES is to accurately and quickly simulate large-scale storage systems. To date, CODES has been used to simulate up to 131,072 application processes, 512 PVFS file system clients, and 123 PVFS file servers. The existing studies limited their simulations to smaller parallel systems (up to 10,000 application processes and up to 100 file servers).

An abundance of research focuses on asynchronous file I/O, data staging, and I/O offloading for HPC systems. Many of the recent challenges associated with this research area focus on how to minimize the impact of asynchronous file I/O network activity on application communication over shared interconnects. GLEAN [42] and DART [11] provide data offloading and staging capabilities for use by HPC applications in data-center and wide-area computing environments. Both of these tools ship application I/O requests to data staging nodes and allow these nodes to manage the I/O requests on behalf of the application.
Fig. 5: Ten-second average data transfer rate for the compute nodes observed during the multiapplication simulations: (a) burst buffer turned off, full storage system in use, 128 file servers; (b) burst buffer turned off, half storage system in use, 64 file servers; (c) burst buffer turned on, full storage system in use, 128 file servers; (d) burst buffer turned on, half storage system in use, 64 file servers.

The ADIOS and DataStager teams have done extensive research on how to adapt and isolate applications from several sources of interference, including sources that impact asynchronous I/O capabilities [1] and usage of heavily utilized file system resources [23]. Our work is complementary to that existing work. Those asynchronous tools provide the interfaces and mechanisms to provide asynchronous I/O capabilities to the application, whereas our work provides insight on how those tools will work on large-scale systems with a dedicated buffer space.

Several recent activities have focused on the usage of solid-state devices in HPC systems. This work includes evaluating the use of solid-state storage for data-intensive computations [13], [25], investigating the use of solid-state devices in storage systems [2], designing data-intensive HPC systems that can host large data sets near processing elements [16], and using SSDs as a medium for temporarily storing application checkpoint data [14], [30]. Of particular interest and relevance to our work are storage system designs that use nonvolatile burst buffers [5]. These system designs integrate nonvolatile media between the main memory accessible to the processing elements and the slower spinning disk media that hold durable copies of application data. Thus, the nonvolatile media is used as an intermediate storage tier located between the processing elements and the external storage system. In this paper, we investigate the role of burst buffers in large-scale HPC systems. We adopt the usage model of nonvolatile burst buffers [5], define several large-scale HPC storage system configurations integrated with burst buffers, and evaluate application I/O workloads using these system configurations through the CODES storage system simulator.

VI. CONCLUSIONS

This study explores the potential for burst buffers in HPC storage systems and highlights their potential impact through simulations of an existing large-scale HPC system, the Argonne IBM Blue Gene/P "Intrepid," including application I/O patterns and an enhanced storage system. We gather data from the production computing system to provide an accurate model of application I/O bursts in today's systems, and we survey current nonvolatile storage products to provide realistic parameters for our burst buffer model. We provide a parallel discrete-event model for burst buffers in an HPC system based on our previous work on a leadership-class storage system model. The model is tested through a series of I/O benchmarks and mixed workloads, and simulations are performed with a reduced configuration of external storage.
Fig. 6: Ten-second average data transfer rate for the external storage system observed during the multiapplication simulations: (a) burst buffer turned off, full storage system in use, 128 file servers; (b) burst buffer turned off, half storage system in use, 64 file servers; (c) burst buffer turned on, full storage system in use, 128 file servers; (d) burst buffer turned on, half storage system in use, 64 file servers.

This study clearly indicates that burst buffers are practical in the context of this example system and set of applications. While current systems operate effectively without burst buffers, we show that the use of burst buffers can allow for a much less capable external storage system with no major impact on perceived application I/O rates. For today's systems this means that storage system costs could likely be significantly reduced: fewer file servers, racks of storage, and external switch ports are needed. For systems in the 2020 timeframe, burst buffers are likely to be a mandatory component if peak I/O rates are to be attained.

Future work will follow along two lines. First, in order to further improve the efficiency of our simulations on large-scale systems, we will adapt our models to optimistic execution. Second, we continue to improve the fidelity of the model. We are incorporating models of the torus and tree networks as part of this activity, and we are investigating methods to better simulate the disk drives in our enterprise storage model.

ACKNOWLEDGMENTS

This work was supported in part by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357 and partially by Contract DE-SC0005428 and the LANL/UCSC Institute for Scalable Scientific Data Management (ISSDM). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.

REFERENCES

[1] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan, and F. Zheng. DataStager: Scalable data staging services for petascale applications. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, June 2009.
[2] S. Alam, H. El-Harake, K. Howard, N. Stringfellow, and F. Verzelloni. Parallel I/O and the metadata wall. In Proceedings of the 6th Parallel Data Storage Workshop (PDSW '11), November 2011.
[3] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan. Scalable I/O forwarding framework for high-performance computing systems. In IEEE International Conference on Cluster Computing (Cluster 2009), New Orleans, LA, September 2009.
[4] D. W. Bauer Jr., C. D. Carothers, and A. Holder. Scalable time warp on Blue Gene supercomputers. In Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation, pages 35-44, Washington, DC, USA, 2009. IEEE Computer Society.
[5] J. Bent and G. Grider. Usability at Los Alamos National Lab. In The 5th DOE Workshop on HPC Best Practices: File Systems and Archives, September 2011.
[6] S. Boboila and P. Desnoyers. Write endurance in flash drives: Measurements and analysis. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, February 2010.
[7] D. L. Brown and P. Messina. Scientific grand challenges: Cross-cutting technologies for computing at the exascale. Technical report, Department of Energy Office of Advanced Scientific Computing Research and Office of Advanced Simulation and Computing, February 2010.
[8] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross. Understanding and improving computational science storage access through continuous characterization. Trans. Storage, 7:1-26, October 2011.
[9] P. Carns, R. Latham, R. Ross, K. Iskra, S. Lang, and K. Riley. 24/7 characterization of petascale I/O workloads. In Proceedings of the First Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS), New Orleans, LA, September 2009.
[10] J. Dennis and R. Loft. Optimizing high-resolution climate variability experiments on the Cray XT4 and XT5 systems at NICS and NERSC. In Proceedings of the 51st Cray User Group Conference (CUG), 2009.
[11] C. Docan, M. Parashar, and S. Klasky. Enabling high-speed asynchronous data extraction and transfer using DART. Concurrency and Computation: Practice and Experience, 22(9):1181-1204, June 2010.
[12] J. Dongarra. Impact of architecture and technology for extreme scale on software and algorithm design. Presented at the Department of Energy Workshop on Cross-cutting Technologies for Computing at the Exascale, February 2010.
[13] B. Van Essen, R. Pearce, S. Ames, and M. Gokhale. On the role of NVRAM in data-intensive architectures: An evaluation. In International Symposium on Parallel and Distributed Processing (to appear), 2012.
[14] L. Gomez, M. Maruyama, F. Cappello, and S. Matsuoka. Distributed diskless checkpoint for large scale systems. In Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing (CCGrid '10), May 2010.
[15] G. Grider. Exa-scale FSIO - Can we get there? Can we afford to? Presented at the 7th IEEE International Workshop on Storage Network Architecture and Parallel I/O, May 2011.
[16] J. He, J. Bennett, and A. Snavely. DASH-IO: An empirical study of flash-based IO for HPC. In Proceedings of TeraGrid '10, August 2010.
[17] Y. Kim, R. Gunasekaran, G. Shipman, D. Dillow, Z. Zhang, and B. Settlemyer. Workload characterization of a leadership class storage cluster. In Proceedings of the 5th Parallel Data Storage Workshop (PDSW '10), November 2010.
[18] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, and W. Allcock. I/O performance challenges at leadership scale. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 40. ACM, 2009.
[19] R. Latham, C. Daley, W. K. Liao, K. Gao, R. Ross, A. Dubey, and A. Choudhary. A case study for scientific I/O: Improving the FLASH astrophysics code. Under review, Journal of Computational Science and Discovery, 2012.
[20] N. Liu, C. Carothers, J. Cope, P. Carns, R. Ross, A. Crume, and C. Maltzahn. Modeling a leadership-scale storage system. In Proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, 2011.
[21] N. Liu and C. D. Carothers. Modeling billion-node torus networks using massively parallel discrete-event simulation. In Proceedings of the 2011 IEEE Workshop on Principles of Advanced and Distributed Simulation, PADS '11, pages 1-8, Washington, DC, USA, 2011. IEEE Computer Society.
[22] Y. Liu, R. Figueiredo, D. Clavijo, Y. Xu, and M. Zhao. Towards simulation of parallel file system scheduling algorithms with PFSsim. In Proceedings of the 7th IEEE International Workshop on Storage Network Architectures and Parallel I/O, May 2011.
[23] J. Lofstead, F. Zheng, Q. Liu, S. Klasky, R. Oldfield, T. Kordenbrock, K. Schwan, and M. Wolf. Managing variability in the IO performance of petascale storage systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2010 (SC10), November 2010.
[24] T. Madhyastha, G. Gibson, and C. Faloutsos. Informed prefetching of collective input/output requests. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (SC '99), November 1999.
[25] N. Master, M. Andrews, J. Hick, S. Canon, and N. Wright. Performance analysis of commodity and enterprise class flash devices. In Proceedings of the 5th Parallel Data Storage Workshop (PDSW '10), November 2010.
[26] E. Molina-Estolano, C. Maltzahn, J. Bent, and S. Brandt. Building a parallel file system simulator. In Journal of Physics: Conference Series, volume 180, 2009.
[27] J. Moreira, M. Brutman, J. Castaños, T. Engelsiepen, M. Giampapa, T. Gooding, R. Haskin, T. Inglett, D. Lieber, P. McCarthy, M. Mundy, J. Parker, and B. Wallenfelt. Designing a highly-scalable operating system: The Blue Gene/L story. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, November 2006.
[28] H. Naik, R. Gupta, and P. Beckman. Analyzing checkpointing trends for applications on petascale systems. In Second International Workshop on Parallel Programming Models and Systems Software (P2S2) for High-End Computing, 2009.
[29] H. Q. Nguyen. File system simulation: Hierarchical performance measurement and modeling. PhD thesis, University of Arkansas, 2011.
[30] P. Nowoczynski, N. Stone, J. Yanovich, and J. Sommerfield. Zest: Checkpoint storage system for large supercomputers. In 3rd Petascale Data Storage Workshop, November 2008.
[31] A. Nunez, J. Fernandez, J. Carretero, L. Prada, and M. Blaum. Optimizing distributed architectures to improve performance on checkpointing applications. In Proceedings of the 13th IEEE International Conference on High Performance Computing and Communications (HPCC '11), September 2011.
[32] R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela, R. Riesen, and P. Roth. Modeling the impact of checkpoints on next-generation systems. In 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pages 30-46, September 2007.
[33] Y. Pan, G. Dong, and T. Zhang. Exploiting memory device wear-out dynamics to improve NAND flash memory system performance. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, February 2011.
[34] K. S. Perumalla. µπ: A scalable and transparent system for simulating MPI programs. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, SIMUTools '10, pages 62:1-62:6, ICST, Brussels, Belgium, 2010.
[35] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Riesen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and B. Jacob. The Structural Simulation Toolkit. SIGMETRICS Perform. Eval. Rev., 38:37-42, March 2011.
[36] R. Rosner, A. Calder, J. Dursi, B. Fryxell, D. Lamb, J. Niemeyer, K. Olson, P. Ricker, F. Timmes, J. Truran, H. Tufo, Y. Young, M. Zingale, E. Lusk, and R. Stevens. Flash code: Studying astrophysical thermonuclear flashes. In Computing in Science and Engineering, volume 2, pages 33-41, 2000.
[37] P. Roth. Characterizing the I/O behavior of scientific applications on the Cray XT. In Proceedings of the 2nd Parallel Data Storage Workshop (PDSW '07), November 2007.
[38] B. W. Settlemyer. A Study of Client-side Caching in Parallel File Systems. PhD thesis, Clemson University, Clemson, South Carolina, USA, 2009.
[39] J. Shalf. Exascale computing technology challenges. Presented at the HEC FSIO Workshop 2010, August 2010.
[40] H. Shan, K. Antypas, and J. Shalf. Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 42:1-42:12, Piscataway, NJ, USA, 2008. IEEE Press.
[41] G. Shipman, D. Dillow, S. Oral, F. Wang, D. Fuller, J. Hill, and Z. Zhang. Lessons learned in deploying the world's largest scale Lustre file system. In Proceedings of the 52nd Cray User Group Conference (CUG), May 2010.
[42] V. Vishwanath, M. Hereld, V. Morozov, and M. Papka. Topology-aware data movement and staging for I/O acceleration on Blue Gene/P supercomputing systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2011 (SC11), November 2011.
[43] F. Wang, Q. Xin, B. Hong, S. Brandt, E. Miller, D. Long, and T. McLarty. File system workload analysis for large scale scientific computing applications. Technical report, Lawrence Livermore National Laboratory, January 2004.
[44] G. Zheng, G. Gupta, E. Bohm, I. Dooley, and L. V. Kale. Simulating large scale parallel applications using statistical models for sequential execution blocks. In Proceedings of the 16th International Conference on Parallel and Distributed Systems (ICPADS 2010), number 10-15, December 2010.