extensions (1) make accelerators into first class schedulable entities and (2) support scheduling methods that enable efficient use of both the general purpose and accelerator cores of heterogeneous hardware platforms. Specifically, for platforms comprised of x86 CPUs connected to NVIDIA GPUs, these extensions can be used to manage all of the platform's processing resources, to address the broad range of needs of GPGPU (general purpose computation on graphics processing units) applications, including the high throughput requirements of compute intensive web applications like the image processing code outlined above and the low latency requirements of computational finance [24] or similarly computationally intensive high performance codes. For high throughput, platform resources can be shared across many applications and/or clients. For low latency, resource management with such sharing also considers individual application requirements, including those of the inter-dependent pipeline-based codes employed for the financial and image processing applications.

The Pegasus hypervisor extensions described in Sections 3 and 5 do not give applications direct access to accelerators [28], nor do they hide them behind a virtual file system layer [5, 15]. Instead, similar to past work on self-virtualizing devices [29], Pegasus exposes to applications a virtual accelerator interface, and it supports existing GPGPU applications by making this interface identical to NVIDIA's CUDA programming API [13]. As a result, whenever a virtual machine attempts to use the accelerator by calling this API, control reverts to the hypervisor. This means, of course, that the hypervisor 'sees' the application's accelerator accesses, thereby getting an opportunity to regulate (schedule) them. A second step taken by Pegasus is to then explicitly coordinate how VMs use general purpose and accelerator resources. With the Xen implementation [7] of Pegasus shown in this paper, this is done by explicitly scheduling guest VMs' accelerator accesses in Xen's Dom0, while at the same time controlling those VMs' use of general purpose processors, the latter exploiting Dom0's privileged access to the Xen hypervisor and its VM scheduler.

Pegasus elevates accelerators to first class schedulable citizens in a manner somewhat similar to the way it is done in the Helios operating system [26], which uses satellite kernels with standard interfaces for XScale-based IO cards. However, given the fast rate of technology development in accelerator chips, we consider it premature to impose a common abstraction across all possible heterogeneous processors. Instead, Pegasus uses a more loosely coupled approach in which it assumes systems to have different 'scheduling domains', each of which is adept at controlling its own set of resources, e.g., accelerator vs. general purpose cores. Pegasus scheduling, then, coordinates when and to what extent VMs use the resources managed by these multiple scheduling domains. This approach leverages notions of 'cellular' hypervisor structures [11] or federated schedulers that have been shown useful in other contexts [20]. Concurrent use of both CPU and GPU resources is one class of coordination methods Pegasus implements, with other methods aimed at delivering both high performance and fairness in terms of VM usage of platform resources.

Pegasus relies on application developers or tool chains to identify the right target processors for different computational tasks and to generate such tasks with the appropriate instruction set architectures (ISAs). Further, its current implementation does not interact with tool chains or runtimes, but we recognize that such interactions could improve the effectiveness of its runtime methods for resource management [8]. An advantage derived from this lack of interaction, however, is that Pegasus does not depend on certain tool chains or runtimes, nor does it require internal information about accelerators [23]. As a result, Pegasus can operate with both 'closed' accelerators like NVIDIA GPUs and with 'open' ones like IBM Cell [14], and its approach can easily be extended to support other APIs like OpenCL [19].
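As a concrete illustration of the API-level interposition described above, the following minimal C sketch shows how a guest-side wrapper with a cudaMalloc-like signature could marshal a call into a buffer shared with Dom0 instead of entering the driver directly. This is not Pegasus source code; the names (vcuda_malloc, call_packet, shared_ring) and the packet layout are hypothetical.

/* Hypothetical sketch of API-level interposition in the spirit of the
 * Pegasus frontend: instead of entering the NVIDIA runtime, a guest-side
 * "CUDA" call is marshalled into a call buffer shared with Dom0.          */
#include <stdio.h>
#include <stdint.h>

enum call_id { CALL_MALLOC, CALL_MEMCPY, CALL_LAUNCH };

struct call_packet {              /* one marshalled API call                */
    enum call_id id;
    uint64_t     arg[4];          /* sizes, handles, guest pointers, ...    */
};

#define RING_SLOTS 64
struct shared_ring {              /* per-guest buffer shared with Dom0      */
    struct call_packet slot[RING_SLOTS];
    unsigned prod, cons;          /* produced by guest, consumed by Dom0    */
};

static struct shared_ring ring;   /* stands in for a granted shared page    */

static int ring_put(struct shared_ring *r, const struct call_packet *p)
{
    if (r->prod - r->cons == RING_SLOTS)
        return -1;                            /* buffer full                */
    r->slot[r->prod % RING_SLOTS] = *p;
    r->prod++;                    /* a poller thread in Dom0 drains this    */
    return 0;
}

/* Interposed allocation call with a cudaMalloc-like signature: the request
 * is queued for the Dom0 backend rather than issued to the driver here.   */
int vcuda_malloc(void **dev_ptr, size_t size)
{
    struct call_packet p = { .id = CALL_MALLOC };
    p.arg[0] = (uint64_t)size;
    p.arg[1] = (uint64_t)(uintptr_t)dev_ptr;  /* where to return the handle */
    return ring_put(&ring, &p);
}

int main(void)
{
    void *dev = NULL;
    if (vcuda_malloc(&dev, 1 << 20) == 0)
        printf("queued a 1 MiB allocation request (%u pending)\n",
               ring.prod - ring.cons);
    return 0;
}

Because the guest links against an interposed library of this kind rather than the vendor runtime, every accelerator access becomes visible to, and schedulable by, the virtualization layer.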
Summarizing, the Pegasus hypervisor extensions make the following contributions:

Accelerators as first class schedulable entities—accelerators (accelerator physical CPUs or aPCPUs) can be managed as first class schedulable entities, i.e., they can be shared by multiple tasks, and task mappings to processors are dynamic, within the constraints imposed by the accelerator software stacks.

Visible heterogeneity—Pegasus respects the fact that aPCPUs differ in capabilities, have different modes of access, and sometimes use different ISAs. Rather than hiding these facts, Pegasus exposes heterogeneity to the applications and the guest virtual machines (VMs) that are capable of exploiting it.

Diversity in scheduling—accelerators are used in multiple ways, e.g., to speed up parallel codes, to increase throughput, or to improve a platform's power/performance properties. Pegasus addresses differing application needs by offering a diversity of methods for scheduling accelerator and general purpose resources, including co-scheduling for concurrency constraints.

'Coordination' as the basis for resource management—internally, accelerators use specialized execution environments with their own resource managers [14, 27]. Pegasus uses coordinated scheduling methods to align accelerator resource usage with platform-level management. While coordination applies external controls to control the use of 'closed' accelerators, i.e., accelerators with resource managers that do not export coordination interfaces, it could interact more intimately with 'open' managers as per their internal scheduling methods.

Figure 1: Logical view of Pegasus architecture

How can heterogeneous resources be managed?: Hardware heterogeneity goes beyond varying compute speeds to include differing interconnect distances, different and possibly disjoint memory models, and potentially different or non-overlapping ISAs. This makes it difficult to assimilate these accelerators into one common platform. Exacerbating these hardware differences are software challenges, like those caused by the fact that there is no general agreement about programming models and runtimes for accelerator-based systems [19, 28].

Are there efficient methods to utilize heterogeneous resources?: The hypervisor has limited control over how the resources internal to closed accelerators are used, and whether sharing is possible in time, space, or both, because there is no direct control over scheduler actions beyond the proprietary interfaces. The concrete question, then, is whether and to what extent the coordinated scheduling approach adopted by Pegasus can succeed. Pegasus therefore allows schedulers to run resource allocation policies that offer diversity in how they maximize application performance and/or fairness in resource sharing.

3.1 Accelerator Virtualization

With GViM [13], we outline methods for low-overhead virtualization of GPUs for the Xen hypervisor, addressing heterogeneous hardware with general purpose and accelerator cores, used by VMs with suitable codes (e.g., for Larrabee or Tolapai cores, codes that are IA instruction set compatible vs. non-IA compatible codes for NVIDIA or Cell accelerators). Building on this approach and acknowledging the current off-chip nature of accelerators, Pegasus assumes these hardware resources to be managed by both the hypervisor and Xen's 'Dom0' management (and driver) domain. Hence, Pegasus uses frontend/backend split drivers [3] to mediate all accesses to GPUs connected via PCIe. Specifically, the requests for GPU usage issued by guest VMs (i.e., CUDA tasks) are contained in call buffers shared between guests and Dom0, as shown in Figure 2, using a separate buffer for each guest. Buffers are inspected by 'poller' threads that pick call packets from per-guest buffers and issue them to the actual CUDA runtime/driver resident in Dom0. These poller threads can be woken up whenever a domain has call requests waiting. This model of execution is well-matched with the ways in which guests use accelerators, typically wishing to utilize their computational capabilities for some time and with multiple calls.
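To illustrate the backend side of this split-driver path, here is a minimal, hypothetical sketch of a Dom0 poller thread that drains one guest's call buffer and hands each packet to the CUDA runtime, which is stubbed out here with a print statement; the structures mirror the frontend sketch above and, like it, are illustrative rather than the actual Pegasus implementation.

/* Hypothetical Dom0-side poller sketch: one thread per guest picks call
 * packets from that guest's shared buffer and issues them to the CUDA
 * runtime/driver (stubbed out below).                                      */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

struct call_packet { int id; uint64_t arg[4]; };
#define RING_SLOTS 64
struct shared_ring { struct call_packet slot[RING_SLOTS]; unsigned prod, cons; };

struct guest {                      /* backend view of one guest VM         */
    int                 id;
    struct shared_ring *ring;       /* buffer shared with that guest        */
    volatile int        runnable;   /* set/cleared by the backend scheduler */
};

static void issue_to_cuda(int guest_id, const struct call_packet *p)
{
    /* placeholder: the real backend invokes the NVIDIA runtime here */
    printf("guest %d: issuing call %d\n", guest_id, p->id);
}

static void *poller(void *arg)
{
    struct guest *g = arg;
    for (;;) {
        if (!g->runnable || g->ring->cons == g->ring->prod) {
            usleep(1000);           /* the real thread blocks until woken   */
            continue;
        }
        struct call_packet p = g->ring->slot[g->ring->cons % RING_SLOTS];
        g->ring->cons++;
        issue_to_cuda(g->id, &p);   /* per-guest call ordering is preserved */
    }
    return NULL;
}

int main(void)
{
    static struct shared_ring ring;
    struct guest g = { .id = 1, .ring = &ring, .runnable = 1 };
    pthread_t t;
    pthread_create(&t, NULL, poller, &g);
    sleep(1);                       /* nothing is queued in this toy driver */
    return 0;
}

Which guest's poller is allowed to run, and for how long, is what the accelerator scheduling described in the following sections decides.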
For general purpose cores, a VCPU as the (virtual) CPU representation offered to a VM embodies the state representing the execution of the VM's threads/processes on physical CPUs (PCPUs). As a similar abstraction, Pegasus introduces the notion of an accelerator VCPU (aVCPU), which embodies the VM's state concerning the execution of its calls to the accelerator. For the Xen/NVIDIA implementation, this abstraction is a combination of state allocated on the host and on the accelerator (i.e., Dom0 polling thread, CUDA calls, and driver context form the execution context, while the data that is operated upon forms the data portion, when compared with the VCPUs). By introducing aVCPUs, Pegasus can then explicitly schedule them, just like their general purpose counterparts. Further, and as seen from Section 6, virtualization costs are negligible or low, and with this API-based approach to virtualization, Pegasus leaves the use of resources on the accelerator hardware up to the application, ensures portability, and independence from low-level changes in NVIDIA drivers and hardware.

3.2 Resource Management Framework

For VMs using both VCPUs and aVCPUs, resource management can explicitly track and schedule their joint use of both general purpose and accelerator resources. Technically, such management involves scheduling their VCPUs and aVCPUs to meet desired Service Level Objectives (SLOs), concurrency constraints, and to ensure fairness in different guest VMs' resource usage.
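A rough, illustrative sketch of the per-domain bookkeeping such joint tracking implies is given below. The field names are invented for the example (they are not the Pegasus data structures), but they correspond to the state just described: Xen credit-scheduler credits, accelerator credits, an SLO target, and per-accelerator aVCPU state combining the Dom0 poller, the pending CUDA calls, and the driver context.

/* Illustrative (not Pegasus) per-domain bookkeeping for coordinated
 * scheduling of general purpose and accelerator resources.                */
#include <stdio.h>
#include <pthread.h>

struct avcpu {                     /* accelerator VCPU: host + device state */
    pthread_t poller;              /* Dom0 polling thread for this guest    */
    void     *driver_ctx;          /* CUDA driver context on the GPU        */
    void     *pending_calls;       /* queued CUDA calls; the data operated
                                      on forms the data portion             */
    int       accel_id;            /* accelerator this aVCPU is bound to    */
};

struct domain_sched_state {
    int          dom_id;
    int          cpu_credits;      /* Xen credit-scheduler share            */
    int          acc_credits;      /* accelerator credit share              */
    double       slo_wait_ms;      /* wait-time target used for SLO checks  */
    int          n_vcpus, n_avcpus;
    struct avcpu avcpus[4];        /* one per accelerator in use            */
};

int main(void)
{
    struct domain_sched_state d = { .dom_id = 1, .cpu_credits = 256,
                                    .acc_credits = 512, .slo_wait_ms = 30.0,
                                    .n_vcpus = 1, .n_avcpus = 1 };
    printf("Dom%d: %d CPU credits, %d accelerator credits\n",
           d.dom_id, d.cpu_credits, d.acc_credits);
    return 0;
}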
For high performance, Pegasus distinguishes two phases in accelerator request scheduling. First, the accelerator selection module runs in the Accelerator Domain—which in our current implementation is Dom0—henceforth called DomA. This module associates a domain, i.e., a guest VM, with an accelerator that has available resources, by placing the domain into an 'accelerator ready queue', as shown in Figure 2. Domains are selected from this queue when they are ready to issue requests. Second, it is only after this selection that actual usage requests are forwarded to, i.e., scheduled and run on, the selected accelerator. There are multiple reasons for this difference in accelerator vs. CPU scheduling. (1) An accelerator like the NVIDIA GPU has limited memory, and it associates a context with each 'user' (e.g., a thread) that locks some of the GPU's resources. (2) Memory swapping between host and accelerator memory over an interconnect like PCIe is expensive, which means that it is costly to dynamically change the context currently running on the GPU. In response, Pegasus GPU scheduling restricts the number of domains simultaneously scheduled on each accelerator and, in addition, it permits each such domain to use the accelerator for some extensive time duration. The following parameters are used for accelerator selection.

Figure 2: Logical view of the resource management framework in Pegasus

Accelerator profile and queue—accelerators vary in terms of clock speed, memory size, in-out bandwidths and other such physical characteristics. These are static or hardware properties that can identify capability differences between various accelerators connected in the system. There also are dynamic properties like allocated memory, number of associated domains, etc., at any given time. This static and dynamic information is captured in an 'accelerator profile'. An 'accelerator weight' computed from this profile information determines current hardware capabilities and load characteristics for the accelerator. These weights are used to order accelerators in a priority queue maintained within the DomA Scheduler, termed the 'accelerator queue'. For example, the more an accelerator is used, the lower its weight becomes so that it does not get oversubscribed. The accelerator with the highest weight is the most capable and is the first to be considered when a domain requests accelerator use.

Domain profile—domains may be more or less demanding of accelerator resources and more vs. less capable of using them. The 'domain profiles' maintained by Pegasus describe these differences, and they also quantitatively capture domain requirements. Concretely, the current implementation expects credit assignments [7] for each domain that gives it proportional access to the accelerator. Another example is to match the domain's expected memory requirements against the available memory on an accelerator (with CUDA, it is possible to determine this from application metadata). Since the execution properties of domains change over time, domain execution characteristics should be determined dynamically, which would then cause the runtime modification of a domain's accelerator credits and/or access privileges to accelerators. Automated methods for doing so, based on runtime monitoring, are the subject of our future work, with initial ideas reported in [8]. This paper lays the groundwork for such research: (1) we show coordination to be a fundamentally useful method for managing future heterogeneous systems, and (2) we demonstrate the importance of these runtime-based techniques and performance advantages derived from their use in a coordinated scheduling environment.
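The selection step these profiles feed can be sketched as follows; note that the text above does not define the weight formula, so the one used here is invented purely for illustration, as are the structure and function names.

/* Illustrative accelerator-selection sketch: a weight combines static and
 * dynamic profile data, and the most capable accelerator that still fits
 * the domain's expected memory needs is chosen.  The weighting formula is
 * invented for this example and is not taken from Pegasus.                */
#include <stdio.h>
#include <stddef.h>

struct accel_profile {
    const char *name;
    double clock_ghz;              /* static properties                     */
    size_t mem_total_mb;
    size_t mem_alloc_mb;           /* dynamic properties                    */
    int    domains_attached;
};

static double accel_weight(const struct accel_profile *a)
{
    double free_frac = 1.0 - (double)a->mem_alloc_mb / (double)a->mem_total_mb;
    /* heavier use (more attached domains, less free memory) => lower weight */
    return a->clock_ghz * free_frac / (1.0 + a->domains_attached);
}

static int pick_accelerator(const struct accel_profile *acc, int n, size_t need_mb)
{
    int best = -1;
    double best_w = -1.0;
    for (int i = 0; i < n; i++) {
        if (acc[i].mem_total_mb - acc[i].mem_alloc_mb < need_mb)
            continue;              /* cannot satisfy the domain's profile   */
        double w = accel_weight(&acc[i]);
        if (w > best_w) { best_w = w; best = i; }
    }
    return best;                   /* index of the highest-weight candidate */
}

int main(void)
{
    struct accel_profile gpus[2] = {
        { "gpu0", 1.5, 512, 400, 3 },   /* heavily used accelerator         */
        { "gpu1", 1.5, 512, 100, 1 },   /* lightly loaded accelerator       */
    };
    int sel = pick_accelerator(gpus, 2, 128);  /* domain expects ~128 MB    */
    printf("selected %s\n", sel >= 0 ? gpus[sel].name : "none");
    return 0;
}

In a real priority queue the weights would be recomputed as dynamic properties change, so that heavily used accelerators sink toward the tail, as described above.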
Once a domain has been associated with an accelerator, the DomA Scheduler in Figure 2 schedules execution of individual domain requests per accelerator by activating the corresponding domain's aVCPU. For all domains in its ready queue, the 'DomA Scheduler' has complete control over which domain's requests are submitted to the accelerator(s), and it can make such decisions in coordination with the hypervisor's VCPU scheduler, by exchanging relevant accelerator and schedule data. Scheduling in this second phase can thus be enhanced by coordinating the actions of the hypervisor and DomA scheduler(s) present on the platform, as introduced in Figure 1. In addition, certain coordination policies can use the monitoring/feedback module, which currently tracks the average values of wait times for accelerator requests, the goal being to detect SLO (service level objective) violations for guest requests. Various policies supported by the DomA scheduler are described in the following section.

4 Resource Management Policies for Heterogeneity-aware Hypervisors

Pegasus contributes its novel, federated, and heterogeneity-aware scheduling methods to the substantive body of past work in resource management. The policies described below, and implemented by the DomA scheduler, are categorized based on their level of interaction with the hypervisor's scheduler. They range from simple and easily implemented schemes offering basic scheduling properties to coordination-based policies that exploit information sharing between the hypervisor and accelerator subsystems. Policies are designed to demonstrate the range of achievable coordination between the two scheduler subsystems and the benefits seen by such coordination for various workloads. The specific property offered by each policy is indicated in square brackets.

Algorithm 1: Simplified Representation of Scheduling Data and Functions for Credit-based Schemes
  /* D  = Domain being considered                 */
  /* X  = Domain cpu or accelerator credits       */
  /* T  = Scheduler timer period                  */
  /* Tc = Ticks assigned to next D                */
  /* Tm = maximum ticks D gets based on X         */
  Data: Ready queue RQA of domains (D)             /* RQ is ordered by X */
  Data: Accelerator queue AccQ of accelerators     /* AccQ is ordered by accelerator weight */
  InsertDomainforScheduling(D)
      if D not in RQA then
          Tc ← 1, Tm ← X/Xmin
          A ← PickAccelerator(AccQ, D)
          InsertDomainInRQ_CreditSorted(RQA, D)
      else   /* D already in some RQA */
          if ContextEstablished then
              Tc ← Tm
          else
              Tc ← 1
  DomASchedule(RQA)
      InsertDomainforScheduling(Curr_Dom)
      D ← RemoveHeadandAdvance(RQA)
      Set D's timer period to Tc; Curr_dom ← D

4.1 Hypervisor Independent Policies

The simplest methods do not support scheduler federation, limiting their scheduling logic to DomA.

No scheduling in backend (None) [first-come-first-serve]—provides base functionality that assigns domains to accelerators in a round robin manner, but relies on NVIDIA's runtime/driver layer to handle all request scheduling. The DomA scheduler plays no role in domain request scheduling. This serves as our baseline.

AccCredit (AccC) [proportional fair-share]—recognizing that domains differ in terms of their desire and ability to use accelerators, accelerator credits are associated with each domain, based on which different domains are polled for different time periods. This makes the time given to a guest proportional to how much it desires to use the accelerator, as apparent in the pseudo-code shown in Algorithm 1, where the requests from the domain at the head of the queue are handled until it finishes its awarded number of ticks. For instance, with credit assignments (Dom1, 1024), (Dom2, 512), (Dom3, 256), and (Dom4, 512), the number of ticks will be 4, 2, 1, and 2, respectively.

Because the accelerators used with Pegasus require their applications to explicitly allocate and free accelerator state, it is easy to determine whether or not a domain currently has context (state) established on an accelerator.
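A small C sketch of this credit-to-tick mapping (cf. Algorithm 1) is shown below, using the credit assignments from the example above. It assumes, as the example suggests, that the awarded ticks are simply the domain's accelerator credits normalized by the smallest credit value, and that a domain without an established context receives the minimum single tick; the helper names are illustrative.

/* Illustrative AccC-style tick assignment (cf. Algorithm 1): a domain at
 * the head of the credit-sorted ready queue is polled for a number of
 * scheduler ticks proportional to its accelerator credits; a domain with
 * no context established on the accelerator gets the minimum single tick. */
#include <stdio.h>

struct dom { int id; int acc_credits; int context_established; };

static int ticks_for(const struct dom *d, int min_credits)
{
    if (!d->context_established)
        return 1;                            /* Tc = 1                      */
    return d->acc_credits / min_credits;     /* Tc = Tm = X / Xmin          */
}

int main(void)
{
    struct dom doms[4] = {
        { 1, 1024, 1 }, { 2, 512, 1 }, { 3, 256, 1 }, { 4, 512, 1 },
    };
    int min_credits = 256;                   /* smallest assigned credit    */
    for (int i = 0; i < 4; i++)
        printf("Dom%d gets %d ticks\n",
               doms[i].id, ticks_for(&doms[i], min_credits));
    /* output: 4, 2, 1, and 2 ticks, matching the example in the text */
    return 0;
}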
The DomA scheduler, therefore, interprets a domain in a ContextEstablished state as one that is actively using the accelerator. When in a NoContextEstablished state, a minimum time tick (1) is assigned to the domain for the next scheduling cycle (see Algorithm 1).

Algorithm 2: Simplified Representation of CoSched and AugC Schemes
  /* RQcpu = Per CPU ready q in hypervisor        */
  /* HS    = VCPU-PCPU schedule for next period   */
  /* X     = domain credits                       */
  HypeSchedule(RQcpu)
      Pick VCPUs for all PCPUs in system
      ∀D: AugCreditD = RemainingCredit
      Pass HS to DomA scheduler
  DomACoSchedule(RQA, HS)
      /* To handle #cpus > #accelerators */
      ∀D ∈ (RQA ∩ HS): Pick D with highest X
      if D = null then
          /* To improve GPU utilization */
          Pick D with highest X in RQA
  DomAAugSchedule(RQA, HS)
      foreach D ∈ RQA do
          Pick D with highest (AugCredit + X)

4.2 Hypervisor Controlled Policy

The rationale behind coordinating VCPUs and aVCPUs is that the overall execution time of an application (comprised of both host and accelerator portions) can be reduced if its communicating host and accelerator tasks are scheduled at the same time. We implement one such method, described next.

Strict co-scheduling (CoSched) [latency reduction by occasional unfairness]—an alternative to the accelerator-centric policies shown above, this policy gives complete control over scheduling to the hypervisor. Here, accelerator cores are treated as slaves to host cores, so that VCPUs and aVCPUs are scheduled at the same time. This policy works particularly well for latency-sensitive workloads like certain financial processing codes [24] or barrier-rich parallel applications. It is implemented by permitting the hypervisor scheduler to control how DomA schedules aVCPUs, as shown in Algorithm 2. For 'singular VCPUs', i.e., those without associated aVCPUs, scheduling reverts to using a standard credit-based scheme.

4.3 Hypervisor Coordinated Policies

A known issue with co-scheduling is potential unfairness. The following methods have the hypervisor actively participate in making scheduling decisions rather than governing them:

Augmented credit-based scheme (AugC) [throughput improvement by temporary credit boost]—going

6 Experimental Evaluation

Key contributions of Pegasus are (1) accelerators as first class schedulable entities and (2) coordinated scheduling to provide applications with the high levels of performance sought by use of heterogeneous processing resources. This section first shows that the Pegasus way of virtualizing accelerators is efficient, next demonstrates the importance of coordinated resource management, and finally, presents a number of interesting insights about how diverse coordination (i.e., scheduling) policies can be used to address workload diversity.

Testbed: All experimental evaluations are conducted on a system comprised of (1) a 2.5GHz Xeon quad-core processor with 3GB memory and (2) an NVIDIA 9800GTX card with 2 GPUs and the v169.09 GPU driver. The Xen 3.2.1 hypervisor and the 2.6.18 Linux kernel are used in Dom0 and guest domains. Guest domains use 512MB memory and 1 VCPU each, the latter pinned to certain physical cores, depending on the experiments being conducted.

6.1 Benchmarks and Applications

Pegasus is evaluated with an extensive set of benchmarks and with emulations of more complex computationally expensive enterprise codes like the web-based image processing application mentioned earlier. Benchmarks include (1) parallel codes requiring low levels of deviation for highly synchronous execution, and (2) throughput-intensive codes. A complete listing appears in Table 1, identifying them as belonging to either the parboil benchmark suite [30] or the CUDA SDK 1.1. Benchmark-based performance studies go beyond running individual codes to using representative code mixes that have varying needs and differences in behavior due to different data set sizes, data transfer times, iteration complexity, and numbers of iterations executed for certain computations. The latter two are a good measure of GPU 'kernel' size and the degree of coupling between CPUs orchestrating accelerator use and the GPUs running these kernels, respectively. Depending on their outputs and the number of CUDA calls made, (1) throughput-sensitive benchmarks are MC, BOp, PI, (2) latency-sensitive benchmarks include FWT and scientific, and (3) some benchmarks are both, e.g., BS, CP. A benchmark is throughput-sensitive when its performance is best evaluated as the number of some quantity processed or calculated per second, and a benchmark is latency-sensitive when it makes frequent CUDA calls and its execution time is sensitive to potential virtualization overhead and/or delays or 'noise' in accelerator scheduling.
The image processing application, termed PicSer, emulates web codes like PhotoSynth. BlackScholes represents financial codes like those run by option trading companies [24].

Category          Source           Benchmarks
Financial         SDK              Binomial (BOp), BlackScholes (BS), Monte-Carlo (MC)
Media processing  SDK or parboil   ProcessImage (PI) = matrix multiply + DXTC, MRIQ, Fast Walsh Transform (FWT)
Scientific        parboil          CP, TPACF, RPES
Table 1: Summary of Benchmarks

6.2 GPGPU Virtualization

Virtualization overheads when using Pegasus are depicted in Figures 3(a)–(c), using the benchmarks listed in Table 1. Results show the overhead (or speedup) when running the benchmark in question in a VM vs. when running it in Dom0. The overhead is calculated as the time it takes the benchmark to run in a VM divided by the time to run it in Dom0. We show the overhead (or speedup) for the average total execution time (TotalTime) and the average time for CUDA calls (CudaTime) across 50 runs of each benchmark. CudaTime is calculated as the time to execute all CUDA calls within the application. Running the benchmark in Dom0 is equivalent to running it in a non-virtualized setting. For the 1VM numbers in Figure 3(a) and (c), all four cores are enabled, and to avoid scheduler interactions, Dom0 and the VM are pinned on separate cores. The experiments reported in Figure 3(b) have only 1 core enabled and the execution times are not averaged over multiple runs, with a backend restart for every run. This is done for reasons explained next. All cases use an equal number of physical GPUs, and Dom0 tests are run with as many cores as the Dom0–1VM case.

An interesting observation about these results is that sometimes, it is better to use virtualized rather than non-virtualized accelerators. This is because (1) the Pegasus virtualization software can benefit from the concurrency seen from using different cores for the guest vs. Dom0 domains, and (2) further advantages are derived from additional caching of data due to a constantly running—in Dom0—backend process and NVIDIA driver. This is confirmed in Figure 3(b), which shows higher overheads when the backend is stopped before every run, wiping out any driver cache information. Also of interest is the speedup seen by, say, BOp or PI vs. the performance seen by, say, BS or RPES, in Figure 3(a). This is due to an increase in the number of calls per application, seen in BOp/PI vs. BS/RPES, emphasizing the virtualization overhead added to each executed CUDA call. In these cases, the benefits from caching and the presence of multiple cores are outweighed by the per call overhead multiplied by the number of calls made.
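As a purely hypothetical worked example of this overhead metric (the numbers are invented, not measured): if a benchmark's CudaTime averages 240ms in a guest VM and 200ms in Dom0, the reported overhead is 240/200 = 1.2, i.e., a 20% slowdown; a value below 1.0 means the virtualized run was actually faster than the non-virtualized one, which is the speedup case discussed above.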
Scheduling is needed when sharing accelerators: Figure 3(c) shows the overhead of sharing the GPU when applications are run both in Dom0 and in virtualized guests. In the figure, the 1VM quantities refer to overhead (or speedup) seen by a benchmark running in 1 VM vs. when it is run non-virtualized in Dom0. 2dom0 and 2VM values are similarly normalized with respect to the Dom0 values. 2dom0 values indicate execution times observed for a benchmark when it shares a GPU running in Dom0, i.e., in the absence of GPU virtualization, and 2VM values indicate similar values when run in two guest VMs sharing the GPU. For the 2VM case, the Backend implements RR, a scheduling policy that is completely fair to both VMs, and their CPU credits are set to 256 for equal CPU sharing. These measurements show that (1) while the performance seen by applications suffers from sharing (due to reduced accelerator access), (2) a clear benefit is derived for most benchmarks from using even a simple scheduling method for accelerator access. This is evident from the virtualized case that uses a round robin scheduler, which shows better performance compared with the non-virtualized runs in Dom0 for most benchmarks, particularly the ones with lower numbers of CUDA call invocations. This shows that scheduling is important to reduce contention in the NVIDIA driver and thus helps minimize the resulting performance degradation. Measurements report CudaTime and TotalTime, which is the metric used in Figures 3(a)–(b). We speculate that sharing overheads could be reduced further if Pegasus was given more control over the way GPU resources are used. Additional benefits may arise from improved hardware support for sharing the accelerator, as expected for future NVIDIA hardware [27].

Coordination can improve performance: With encouraging results from the simple RR scheduler, we next experiment with the more sophisticated policies described in Section 4. In particular, we use BlackScholes (outputs options and hence its throughput is given by Options/sec) which, with more than 512 compute kernel launches and a large number of CUDA calls, has a high degree of CPU-GPU coupling. This motivates us to also report the latency numbers seen by BS. An important insight from these experiments is that coordination in scheduling is particularly important for tightly coupled codes, as demonstrated by the fact that our base case, None, shows large variations and worse overall performance, whereas AugC and CoSched show the best performance due to their higher degrees of coordination. Figures 4(a)–(c) show that these policies perform well even when domains have equal credits. The BlackScholes run used in this experiment generates 2 million options over 512 iterations in all our domains. Figure 4(a) shows the distribution of throughput values in Million options/sec, as explained earlier. While XC and SLAF see high variation due to higher dependence on driver scheduling and no attempt for CPU and GPU coscheduling, they still perform at least 33% better than None when comparing the medians. AugC and CoSched add an additional 4%–20% improvement as seen from Figure 4(a). The higher performance seen with Dom1 and Dom3 for total work done in Figure 4(b) in case of AugC and CoSched is because of the lower signaling latency seen by the incoming and outgoing domain backend threads, due to their co-location with the scheduling thread and hence, the affected call ordering done by the NVIDIA driver (which is beyond our control).

Beyond the improvements shown above, future deployment scenarios in utility data centers suggest the importance of supporting prioritization of domains. This is seen by experiments in which we modify the credits assigned to a domain, which can further improve performance (see Figure 5). We again use BlackScholes, but with domain credits as (1) (Dom1, 256), (2) (Dom2, 512), (3) (Dom3, 1024), and (4) (Dom4, 256), respectively. The effects of such scheduling are apparent from the fact that, as shown in Figure 5(b), Dom3 succeeds in performing 2.4X or 140% more work when compared with None, with its minimum and maximum throughput values showing 3X to 2X improvement, respectively.
This is because domains sometimes complete early (e.g., Dom3 completes its designated runs before Dom1), which then frees up the accelerator for other domains (e.g., Dom1) to complete their work in a mode similar to non-shared operation, resulting in high throughput. The 'work done' metric captures this because average throughput is calculated for the entire application run. Another important point seen from Figure 5(c) is that the latency seen by Dom4 varies more as compared to Dom2 for, say, AugC because of the temporary unfairness resulting from the difference in credits between the two domains. A final interesting note is that scheduling becomes less important when accelerators are not highly utilized, as evident from other measurements not reported here.

Coordination respects proportional credit assignments: The previous experiments use equal amounts of accelerator and CPU credits, but in general, not all guest VMs need equal accelerator vs. general purpose processor resources. We demonstrate the effects of discretionary credit allocations using the BS benchmark, since it is easily configured for variable CPU and GPU execution times, based on the expected number of call and put options and the number of iterations, denoted by BS⟨#options, #iterations⟩. Each domain is assigned different GPU and CPU credits, denoted by Dom#⟨AccC, XC, SLA proportion⟩. This results in the configuration for this experiment being: Dom1⟨1024, 256, 0.2⟩ running BS⟨2mi, 128⟩,

Different Xen and accelerator credits for domains
Figure 6: Performance of different scheduling schemes [BS]
a) PicSer work done  b) Latency for GPU processing
Figure 7: Performance of selected scheduling schemes for real world benchmark
a) All Doms = 256  b) Dom1 = 256, Dom2 = 1024, Dom3 = 2048, Dom4 = 256
Figure 8: Average latencies seen for [FWT]
Figure 9: [CP] with sharing

Using two, three, and four domains assigned equal credits, with a mix of different workloads, our measurements show that in general, scheduling works well and exhibits little variation, especially in the absence of accelerator sharing. While those results are omitted due to lack of space, we do report the worst case scheduling overheads seen per scheduler call in Table 2, for different scheduling policies. MS in the table refers to the Monitor and Sweep thread responsible for monitoring credit value changes for guest VMs and cleaning out state for non-existing VMs. Xen kernel refers to the changes made to the hypervisor CPU scheduling method. Acc0 and Acc1 refer to the schedulers (for timer based schemes like RR, XC, SLAF) in our dual accelerator testbed. Hype refers to the user level thread run for policies like AugC and CoSched for coordinating CPU and GPU activities. As seen from the table, the Pegasus backend components have low overhead. For example, XC sees 0.5ms per scheduler call per accelerator, compared to a typical execution time of CUDA applications of between 250ms to 5000ms and with typical scheduling periods of 30ms. The most expensive component, with an overhead of 1ms, is MS, which runs once every second.

Policy     MS       Xen Kernel   Acc0/Hype   Acc1
           (μsec)   (μsec)       (μsec)      (μsec)
None       272      0.85         0           0
XC         1119     0.85         507         496
AugC       1395     0.9          3.36        0
SLAF       1101     0.95         440         471
CoSched    1358     0.825        2.71        0
Table 2: Backend scheduler overhead

Scheduling complex workloads: When evaluating scheduling policies with the PicSer application, we run three dual-core, 512MB guests on our testbed. One VM (Dom2) is used for priority service and hence given 1024 credits and 1 GPU, while the remaining two are assigned 256 credits, and they share the second GPU. VM2 is latency-sensitive, and all of the VMs care about throughput.
Scheduling is important because CPUs are shared by multiple VMs. Figure 7(a) shows the average throughput (Pixels/sec, to incorporate different image sizes) seen by each VM with four different policies. We choose AugC and CoSched to highlight the co-scheduling differences. None is to provide a baseline, and SLAF is an enhanced version of all of the credit based schemes. AugC tries to improve the throughput of all VMs, which results in a somewhat lower value for Dom2. CoSched gives priority to Dom2 and can penalize other VMs, as evident from the GPU latencies shown in Figure 7(b). 'No scheduling' does not perform well. More generally, it is clear that coordinated scheduling can be effective in meeting the requirements of multi-VM applications sharing CPU and GPU resources.

6.4 Discussion

Experimental results show that the Pegasus approach efficiently virtualizes GPUs and, in addition, can effectively schedule their use. Even basic accelerator request scheduling can improve sharing performance, with additional benefits derived from active scheduling coordination schemes. Among these methods, XC can perform quite well, but fails to capitalize on CPU-GPU coordination opportunities for tightly coupled benchmarks. SLAF, when applied to CPU credits, has a smoothing effect on the high variations of XC, because of its feedback loop. For most benchmarks, especially those with a high degree of coupling, AugC and CoSched perform significantly better than other schemes, but require small changes to the hypervisor. More generally, scheduling schemes work well in the absence of over-subscription, helping regulate the flow of calls to the GPU. Regulation also results in lowering the degrees of variability caused by un-coordinated use of the NVIDIA driver.

AugC and CoSched, in particular, constitute an interesting path toward realizing our goal of making accelerators first class citizens, and further improvements to those schemes can be derived from gathering additional information about accelerator resources. There is not, however, a single 'best' scheduling policy. Instead, there is a clear need for diverse policies geared to match different system goals and to account for different application characteristics.

Pegasus scheduling uses global platform knowledge available at hypervisor level, and its implementation benefits from hypervisor-level efficiencies in terms of resource access and control. As a result, it directly addresses enterprise and cloud computing systems in which virtualization is prevalent. Yet, clearly, methods like those in Pegasus can also be realized at OS level, particularly for the high performance domain where hypervisors are not yet in common use. In fact, we are currently constructing a CUDA interposer library for non-virtualized, native guest OSes, which we intend to use to deploy scheduling solutions akin to those realized in Pegasus at large scale on the Keeneland machine.

7 Related Work

The importance of dealing with the heterogeneity of future multi-core platforms is widely recognized. Cypress [10] has expressed the design principles for hypervisors actually realized in Pegasus (e.g., partitioning, localization, and customization), but Pegasus also articulates and evaluates the notion of coordinated scheduling. Multikernel [4] and Helios [26] change system structures for multicores, advocating distributed system models and satellite kernels for processor groups, respectively. In comparison, Pegasus retains the existing operating system stack, then uses virtualization to adapt to diverse underlying hardware, and finally, leverages the federation approach shown scalable in other contexts to deal with multiple resource domains.

Prior work on GPU virtualization has used the OpenGL API [21] or 2D-3D graphics virtualization (DirectX, SVGA) [9]. In comparison, Pegasus operates on entire computational kernels more readily co-scheduled with VCPUs running on general purpose CPUs. This approach to GPU virtualization is outlined in an earlier workshop paper, termed GViM [13], which also presents some examples that motivate the need for QoS-aware scheduling.
In comparison, this paper thoroughly evaluates the approach, and develops and explores at length the notion of coordinated scheduling and the scheduling methods we have found suitable for GPGPU use and for latency- vs. throughput-intensive enterprise codes.

While similar in concept, Pegasus differs from coordinated scheduling at the data center level, in that its deterministic methods with predictable behavior are more appropriate at the fine-grained hypervisor level than the loosely-coordinated control-theoretic or statistical techniques used in data center control [20]. Pegasus co-scheduling differs in implementation from traditional gang scheduling [36] in that (1) it operates across multiple scheduling domains, i.e., GPU vs. CPU scheduling, without direct control over how each of those domains schedules its resources, and (2) it limits the idling of GPUs, by running workloads from other aVCPUs when a currently scheduled VCPU does not have any aVCPUs to run. This is appropriate because Pegasus co-scheduling schemes can afford some skew between CPU and GPU components, since their aim is not to solve the traditional locking issue.

Recent efforts like Qilin [23] and predictive runtime code scheduling [16] both aim to better distribute tasks across CPUs and GPUs. Such work is complementary and could be combined with the runtime scheduling methods of Pegasus. Upcoming hardware support for accelerator-level contexts, context isolation, and context-switching [27] may help in terms of load balancing opportunities and, more importantly, it will help improve accelerator sharing [9].

8 Conclusions and Future Work

This paper advocates making all of the diverse cores of heterogeneous manycore systems into first class schedulable entities. The Pegasus virtualization-based approach for doing so is to abstract accelerator interfaces through virtualization and then devise scheduling methods that coordinate accelerator use with that of general purpose host cores. The approach is applied to a combined NVIDIA- and x86-based GPGPU multicore prototype, enabling multiple guest VMs to efficiently share heterogeneous platform resources. Evaluations using a large set of representative GPGPU benchmarks and computationally intensive web applications result in insights that include: (1) the need for coordination when sharing accelerator resources, (2) its critical importance for applications that frequently interact across the CPU-GPU boundary, and (3) the need for diverse policies when co-