PTask: Operating System Abstractions to Manage GPUs as Compute Devices


Christopher J. Rossbach (Microsoft Research, crossbac@microsoft.com), Jon Currey (Microsoft Research, jcurrey@microsoft.com), Mark Silberstein (Technion, marks@cs.technion.ac.il), Baishakhi Ray (University of Texas at Austin, bray@cs.utexas.edu), Emmett Witchel (University of Texas at Austin)


Because the OS manages GPUs as peripherals rather than as shared compute resources, the OS leaves resource management for GPUs to vendor-supplied drivers and user-mode run-times. With no role in GPU resource-management, the OS cannot provide guarantees of fairness and performance isolation. For applications that rely on such guarantees, GPUs are consequently an impractical choice.

This paper proposes a set of kernel-level abstractions for managing interactive, high-compute devices. GPUs represent a new kind of peripheral device, whose computation and data bandwidth exceed that of the CPU. The kernel must expose enough hardware detail of these peripherals to allow programmers to take advantage of their enormous processing capabilities. But the kernel must hide programmer inconveniences like memory that is non-coherent between the CPU and GPU, and must do so in a way that preserves performance. GPUs must be promoted to first-class computing resources, with traditional OS guarantees such as fairness and isolation, and the OS must provide abstractions that allow programmers to write code that is both modular and performant.

Our new abstractions, collectively called the PTask API, provide a dataflow programming model in which the programmer writes code to manage a graph-structured computation. The vertices in the graph are called ptasks (short for parallel task), which are units of work such as a shader program that runs on a GPU, or a code fragment that runs on the CPU or another accelerator device. PTask vertices in the graph have input and output ports exposing data sources and sinks in the code, and are connected by channels, which represent a dataflow edge in the graph. The graph expresses both data movement and potential concurrency directly, which can greatly simplify programming. The programmer must express only where data must move, but not how or when, allowing the system to parallelize execution and optimize data movement without any additional code from the programmer. For example, two sibling ptasks in a graph can run concurrently in a system with multiple GPUs without additional GPU management code, and double buffering is eliminated when multiple ptasks that run on a single accelerator are dependent and sequentially ordered. Under current GPU programming models, such optimizations require direct programmer intervention, but with the PTask API, the same code adapts to run optimally on different hardware substrates.

A PTask graph consists of OS-managed objects, so the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation. The PTask runtime tracks GPU usage and provides a state machine for ptasks that allows the kernel to schedule them in a way similar to processes. Under current GPU frameworks, GPU scheduling is completely hidden from the kernel by vendor-provided driver code, and often implements simplistic policies such as round-robin. These simple policies can thwart kernel scheduling priorities, undermining fairness and inverting priorities, often in a dramatic way.

Kernel-level ptasks enable data movement optimizations that are impossible with current GPU programming frameworks. For example, consider an application that uses the GPU to accelerate real-time image processing for data coming from a peripheral like a camera. Current GPU frameworks induce excessive data copy by causing data to migrate back and forth across the user-kernel boundary, and by double-buffering in driver code. A PTask graph, conversely, provides the OS with precise information about data's origin(s) and destination(s). The OS uses this information to eliminate unnecessary data copies. In the case of real-time processing of image data from a camera, the PTask graph enables the elimination of two layers of buffering. Because data flows directly from the camera driver to the GPU driver, an intermediate buffer is unnecessary, and a copy to user space is obviated.
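As an illustration of the graph structure just described, the following sketch assembles two ptasks connected by a channel. The construction calls (create_graph, add_ptask, connect_input, connect, connect_output) are hypothetical names used only for illustration; the ptask and port names follow the gestural-interface example developed later in the paper.

    graph   g      = create_graph();
    ptask   xform  = add_ptask(g, xform_shader);    /* GPU code: raw image -> point cloud */
    ptask   filter = add_ptask(g, filter_shader);   /* GPU code: point cloud -> gestures  */
    channel raw_in = connect_input(g, xform, "rawimg_0");                 /* data source  */
    connect(g, output_port(xform, "cloud_0"), input_port(filter, "i_0")); /* dataflow edge */
    channel out    = connect_output(g, filter, "out_0");                  /* data sink    */
    /* The programmer states only where data moves; the runtime decides how and when. */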
We have implemented the full PTask API for Windows 7 and PTask scheduling in Linux. Our experience using PTask to accelerate a gestural interface in Windows and a FUSE-based encrypted file system in Linux shows that kernel-level support for GPU abstractions provides system-wide guarantees, enables significant performance gains, and can make GPU acceleration practical in application domains where previously it was not. This paper makes the following contributions.

- Provides quantitative evidence that modern OS abstractions are insufficient to support a class of "interactive" applications that use GPUs, showing that simple GPU programs can reduce the response times for a desktop that uses the GPU by nearly an order of magnitude.
- Provides a design for OS abstractions to support a wide range of GPU computations with traditional OS guarantees like fairness and isolation.
- Provides a prototype of the PTask API and a GPU-accelerated gestural interface, along with evidence that PTasks enable "interactive" applications that were previously impractical, while providing fairness and isolation guarantees that were previously absent from the GPGPU ecosystem. The dataflow programming model supported by the PTask API delivers throughput improvements up to 4x across a range of microbenchmarks and a 5x improvement for our prototype gestural interface.
- Demonstrates a prototype of GPU-aware scheduling in the Linux kernel that forces GPU-using applications to respect kernel scheduling priorities.

2. MOTIVATION

This paper focuses on GPU support for interactive applications like gesture-based interfaces, neural interfaces (also called brain-computer interfaces or BCIs) [48], encrypting file systems, and real-time audio/visual interfaces such as speech recognition. These tasks are computationally demanding, have real-time performance and latency constraints, and feature many data-independent phases of computation. GPUs are an ideal compute substrate for these tasks to achieve their latency deadlines, but lack of kernel support forces designers of these applications to make difficult and often untenable tradeoffs to use the GPU.

To motivate our new kernel abstractions we explore the problem of interactive gesture recognition as a case study. A gestural interface turns a user's hand motions into OS input events such as mouse movements or clicks [36]. Forcing the user to wear special gloves makes gesture recognition easier for the machine, but it is unnatural. The gestural interface we consider does not require the user to wear any special clothing. Such a system must be tolerant to visual noise on the hands, like poor lighting and rings, and must use cheap, commodity cameras to do the gesture sensing. A gestural interface workload is computationally demanding, has real-time latency constraints, and is rich with data-parallel algorithms, making it a natural fit for GPU-acceleration. Gesture recognition is similar to the computational task performed by Microsoft's Kinect, though that system has fewer cameras, lower data rates and grosser features. Kinect only runs a single application at a time (the current game), which can use all available GPU resources. An operating system must multiplex competing applications.

Figure 2 shows a basic decomposition of a gesture recognition system. The system consists of some number of cameras (in this example, photogrammetric sensors [28]), and software to analyze images captured from the cameras. Because such a system functions as a user input device, gesture events recognized by the system must be multiplexed across applications by the OS; to be us…

Figure 4: The effect of GPU-bound work on CPU-bound tasks. The graph shows the frequency (in Hz) with which the OS is able to deliver mouse movement events over a period of 60 seconds during which a program makes heavy use of the GPU. Average CPU utilization over the period is under 25%.
…asynchronous buffer copy, CUDA streams (a generalization of the latter), and pinning of memory buffers to tolerate data movement latency by overlapping computation and communication. However, to use such features, a programmer must understand OS-level issues like memory mapping. For example, CUDA provides APIs to pin allocated memory buffers, allowing the programmer to avoid a layer of buffering above DMA transfer. The programmer is cautioned to use this feature sparingly as it reduces the amount of memory available to the system for paging [59].

Using streams effectively requires static knowledge of which transfers can be overlapped with which computations; such knowledge may not always be available statically. Moreover, streams can only be effective if there is available communication to perform that is independent of the current computation. For example, copying data for stream a1 to or from the device for execution by kernel A can be overlapped with the execution of kernel B; attempts to overlap with execution of A will cause serialization. Consequently, modules that offload logically separate computation to the GPU must be aware of each other's computation and communication patterns to maximize the effectiveness of asynchrony.

New architectures may alter the relative difficulty of managing data across GPU and CPU memory domains, but software will retain an important role, and optimizing data movement will remain important for the foreseeable future. AMD's Fusion integrates the CPU and GPU onto a single die, and enables coherent yet slow access to the shared memory by both processors. However, high performance is only achievable via non-coherent accesses or by using private GPU memory, leaving data placement decisions to software. Intel's Sandy Bridge, another CPU/GPU combination, is further indication that the coming years will see various forms of integrated CPU/GPU hardware coming to market. New hybrid systems, such as NVIDIA Optimus, have a power-efficient on-die GPU and a high-performance discrete GPU. Despite the presence of a combined CPU/GPU chip, such systems still require explicit data management. While there is evidence that GPUs with coherent access to shared memory may eventually become common, even a completely integrated virtual memory system requires system support for minimizing data copies.

Figure 5: The effect of CPU-bound work on GPU-bound tasks. H→D is a CUDA workload that has communication from the host to the GPU device, while H←D has communication from the GPU to the host, and H↔D has bidirectional communication.
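The stream-overlap constraint described above can be made concrete with a small CUDA sketch: the copy for stream a1, destined for kernel A, can hide behind the execution of an unrelated kernel B issued on another stream, but not behind kernel A itself. The kernels, buffer sizes, and launch dimensions below are illustrative placeholders.

    #include <cuda_runtime.h>

    __global__ void kernelA(float *d, int n) { /* ... */ }
    __global__ void kernelB(float *d, int n) { /* ... */ }

    void overlap(float *pinned_hostA, float *devA, float *devB, int n) {
        cudaStream_t a1, b1;
        cudaStreamCreate(&a1);
        cudaStreamCreate(&b1);
        /* The host buffer must be pinned (cudaMallocHost/cudaHostRegister)
           for this copy to proceed asynchronously. */
        cudaMemcpyAsync(devA, pinned_hostA, n * sizeof(float),
                        cudaMemcpyHostToDevice, a1);        /* copy for kernel A ...      */
        kernelB<<<(n + 255) / 256, 256, 0, b1>>>(devB, n);  /* ... overlaps with kernel B */
        kernelA<<<(n + 255) / 256, 256, 0, a1>>>(devA, n);  /* serialized after its copy  */
        cudaStreamSynchronize(a1);
        cudaStreamSynchronize(b1);
        cudaStreamDestroy(a1);
        cudaStreamDestroy(b1);
    }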
2.3 The scheduling problem

Modern OSes cannot currently guarantee fairness and performance for systems that use GPUs for computation. The OS does not treat GPUs as a shared computational resource, like a CPU, but rather as an I/O device. This design becomes a severe limitation when the OS needs to use the GPU for its own computation (e.g., as Windows 7 does with the Aero user interface). Under the current regime, watchdog timers ensure that screen refresh rates are maintained, but OS scheduling priorities are easily undermined by the GPU driver.

GPU work causes system pauses. Figure 4 shows the impact of GPU-bound work on the frequency with which the system can collect and deliver mouse movements. In our experiments, significant GPU work at high frame rates causes Windows 7 to be unresponsive for seconds at a time. To measure this phenomenon, we instrument the OS to record the frequency of mouse events delivered through the HID class driver over a 60 second period. When no concurrent GPU work is executing, the system is able to deliver mouse events at a stable 120 Hz. However, when the GPU is heavily loaded, the mouse event rate plummets, often to below 20 Hz. The GPU-bound task is console-based (does not update the screen) and performs unrelated work in another process context. Moreover, CPU utilization is below 25%, showing that the OS has compute resources available to deliver events.

A combination of factors is at work in this situation. GPUs are not preemptible, with the side-effect that in-progress I/O requests cannot be canceled once begun. Because Windows relies on cancelation to prioritize its own work, its priority mechanism fails. The problem is compounded because the developers of the GPU runtime use request batching to improve throughput for GPU programs. Ultimately, Windows is unable to interrupt a large number of GPU invocations submitted in batch, and the system appears unresponsive. The inability of the OS to manage the GPU as a first-class resource inhibits its ability to load balance the entire system effectively.

CPU work interferes with GPU throughput. Figure 5 shows the inability of Windows 7 to load balance a system that has concurrent, but fundamentally unrelated, work on the GPU and CPUs. The data in the figure were collected on a machine with 64-bit Windows 7, an Intel Core 2 Quad at 2.66GHz, 8GB RAM, and an NVIDIA GeForce GT230 GPU. The figure shows the impact of a CPU-bound process (using all 4 cores to increment counter variables) on the frame rate of a shader program (the xform program from our prototype implementation). The frame rate of the GPU program drops by 2x, despite the near absence of CPU work in the
program: xform uses the CPU only to trigger the next computation on the GPU device.

These results suggest that GPUs need to be treated as a first-class computing resource and managed by the OS scheduler like a normal CPU. Such abstractions will allow the OS to provide system-wide properties like fairness and performance isolation. User programs should interact with GPUs using abstractions similar to threads and processes. Current OSes provide no abstractions that fit this model. In the following sections, we propose abstractions to address precisely this problem.

3. DESIGN

We propose a set of new OS abstractions to support GPU programming called the PTask (Parallel Task) API. The PTask API consists of interfaces and runtime library support to simplify the offloading of compute-intensive tasks to accelerators such as GPUs. PTask supports a dataflow programming model in which individual tasks are assembled by the programmer into a directed acyclic graph: vertices, called ptasks, are executable code such as shader programs on the GPU, code fragments on other accelerators (e.g. a SmartNIC), or callbacks on the CPU. Edges in the graph represent data flow, connecting the inputs and outputs of each vertex. PTask is best suited for applications that have significant computational demands, feature both task- and data-level parallelism, and require both high throughput and low latency.

PTask was developed with three design goals. (1) Bring GPUs under the purview of a single (perhaps federated) resource manager, allowing that entity to provide meaningful guarantees for fairness and isolation. (2) Provide a programming model that simplifies the development of code for accelerators by abstracting away code that manages devices, performs I/O, and deals with disjoint memory spaces. In a typical DirectX or CUDA program, only a fraction of the code implements algorithms that run on the GPU, while the bulk of the code manages the hardware and orchestrates data movement between CPU and GPU memories. In contrast, PTask encapsulates device-specific code, freeing the programmer to focus on application-level concerns such as algorithms and data flow. (3) Provide a programming environment that allows code to be both modular and fast. Because current GPU programming environments promote a tight coupling between device-memory management code and GPU-kernel code, writing reusable code to leverage a GPU means writing both algorithm code to run on the GPU and code to run on the host that transfers the results of a GPU-kernel computation when they are needed. This approach often translates to sub-optimal data movement, higher latency, and undesirable performance artifacts.

3.1 Integrating PTask scheduling with the OS

The two chief benefits of coordinating OS scheduling with the GPU are efficiency and fairness (design goals (1) and (3)). By efficiency we mean both low latency between when a ptask is ready and when it is scheduled on the GPU, and scheduling enough work on the GPU to fully utilize it. By fairness we mean that the OS scheduler provides OS priority-weighted access to processes contending for the GPU, and balances GPU utilization with other system tasks like user interface responsiveness.

Separate processes can communicate through, or share, a graph. For example, processes A and B may produce data that is input to the graph, and another process C can consume the results. The scheduler must balance thread-specific scheduling needs with PTask-specific scheduling needs. For example, gang scheduling the producer and consumer threads for a given PTask graph will maximize system throughput.
    matrix gemm(A, B) {
        matrix res = new matrix();
        copyToDevice(A);
        copyToDevice(B);
        invokeGPU(gemm_kernel, A, B, res);
        copyFromDevice(res);
        return res;
    }
    matrix modularSlowAxBxC(A, B, C) {
        matrix AxB = gemm(A, B);
        matrix AxBxC = gemm(AxB, C);
        return AxBxC;
    }
    matrix nonmodularFastAxBxC(A, B, C) {
        matrix intermed = new matrix();
        matrix res = new matrix();
        copyToDevice(A);
        copyToDevice(B);
        copyToDevice(C);
        invokeGPU(gemm_kernel, A, B, intermed);
        invokeGPU(gemm_kernel, intermed, C, res);
        copyFromDevice(res);
        return res;
    }

Figure 6: Pseudo-code to offload matrix computation (A×B)×C to a GPU. The modular approach uses the gemm subroutine to compute both A×B and (A×B)×C, forcing an unnecessary round-trip from GPU to main memory for the intermediate result.

3.2 Efficiency vs. modularity

Consider the pseudo-code in Figure 6, which reuses a matrix multiplication subroutine called gemm to implement ((A×B)×C). GPUs typically have private memory spaces that are not coherent with main memory and not addressable by the CPU. To offload computation to the GPU, the gemm implementation must copy input matrices A and B to GPU memory. It then invokes a GPU-kernel called gemm_kernel to perform the multiplication, and copies the result back to main memory. If the programmer reuses the code for gemm to compose the product ((A×B)×C) as gemm(gemm(A,B),C) (modularSlowAxBxC in Figure 6), the intermediate result (A×B) is copied back from the GPU at the end of the first invocation of gemm only to be copied from main memory to GPU memory again for the second invocation. The performance costs for data movement are significant. The problem can be trivially solved by writing code specialized to the problem, such as nonmodularFastAxBxC in the figure. However, the code is no longer as easily reused.

Within a single address space, such code modularity issues can often be addressed with a layer of indirection and encapsulation for GPU-side resources. However, the problem of optimizing data movement inevitably becomes an OS-level issue as other devices and resources interact with the GPU or GPUs. With OS-level support, computations that involve GPUs and OS-managed resources such as cameras, network cards, and file systems can avoid problems like double-buffering.

By decoupling dataflow from algorithm, PTask eliminates difficult tradeoffs between modularity and performance (design goal (3)): the run-time automatically avoids unnecessary data movement (design goal (2)). With PTask, matrix multiplication is expressed as a graph with A and B as inputs to one gemm node; the output of that node becomes an input to another gemm node that also takes C as input. The programmer expresses only the structure of the computation, and the system is responsible for materializing a consistent view of the data in a memory domain only when it is actually needed.
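To make the dataflow alternative concrete, the sketch below expresses ((A×B)×C) as a two-node graph. The graph-construction calls are hypothetical, illustration-only names (only sys_push and sys_pull are calls discussed in this paper); the key point is that the intermediate result never returns to main memory, because the output port of the first gemm node feeds the second directly.

    graph   g     = create_graph();
    ptask   gemm1 = add_ptask(g, gemm_kernel);        /* computes A x B       */
    ptask   gemm2 = add_ptask(g, gemm_kernel);        /* computes (A x B) x C */
    channel inA   = connect_input(g, gemm1, "A");
    channel inB   = connect_input(g, gemm1, "B");
    channel inC   = connect_input(g, gemm2, "B");
    /* Internal channel: gemm1's output feeds gemm2's input, so the
       intermediate matrix stays in GPU memory. */
    connect(g, output_port(gemm1, "res"), input_port(gemm2, "A"));
    channel out   = connect_output(g, gemm2, "res");
    sys_push(inA, A); sys_push(inB, B); sys_push(inC, C);
    matrix AxBxC = sys_pull(out);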
…in the Occupied state until a new datablock is written into its channel. The PTask run-time maintains a thread-pool from which it assigns threads to ptasks as necessary: port data movement (as well as GPU dispatch) is performed by threads from this pool. Ports have a ptags member which indicates whether the port is bound to GPU-side resources or is consumed by the run-time. The ptags also indicate whether datablocks flowing through that port are treated as in/out parameters by GPU code.

Channels have parameterizable (non-zero) capacity, which is the number of datablocks that the channel may queue between consuming them from its source and passing them to its destination in FIFO order. An application pushes data (using sys_push) into channels. If the channel is not ready (because it already has datablocks queued up to its capacity), the sys_push call blocks until the channel has available capacity. Likewise, a sys_pull call on a channel will block until a datablock arrives at that channel.

4.1.1 Datablocks and Templates

Data flows through a graph as discrete datablocks, even if the external input to and/or output from the graph is a continuous stream of data values. Datablocks refer to and are described by template objects (see below), which are meta-data describing the dimensions and layout of data in the block. The datablock abstraction provides a coherent view on data that may migrate between memory spaces. Datablocks encapsulate buffers in multiple memory spaces using a buffer-map property whose entries map memory spaces to device-specific buffer objects. The buffer-map tracks which buffer(s) represent the most up-to-date view(s) of the underlying data, enabling a datablock to materialize views in different memory spaces on demand. For example, a datablock may be created based on a buffer in CPU memory. When a ptask is about to execute using that datablock, the runtime will notice that no corresponding buffer exists in the GPU memory space where the ptask has been scheduled, and will create that view accordingly. The converse occurs for data written by the GPU: buffers in the CPU memory domain will be populated lazily based on the GPU version only when a request for that data occurs. Datablocks contain a record-count member, used to help manage downstream memory allocation for computations that work with record streams or variable-stride data (see below). Datablocks can be pushed concurrently into multiple channels, can be shared across processes, and are garbage-collected based on reference counts.

Iteration Space.

GPU hardware executes in a SIMT (Single Instruction Multiple Thread) fashion, allocating a hardware thread for each point in an iteration space.² Hence, the data items on which a particular GPU thread operates must be deduced in GPU code from a unique identifier assigned to each thread. For example, in vector addition, each hardware thread sums the elements at a single index; the iteration space is the set of all vector indices, and each thread multiplies its identifier by the element stride to find the offset of the elements it will add. To execute code on the GPU, the PTask run-time must know the iteration space to correctly configure the number of GPU threads. For cases where the mapping between GPU threads and the iteration space is not straightforward (e.g. because threads compute on multiple points in the iteration space, or because input elements do not have fixed stride), the sys_set_geometry call allows the programmer to specify GPU thread parameters explicitly.

² The iteration space is the set of all possible assignments of control variables in a loop nest. Conceptually, GPUs execute subsets of an iteration space in parallel.
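A minimal CUDA-style kernel (an illustration, not PTask code) shows this mapping: each hardware thread derives the elements it operates on from its unique identifier, and the launch configuration must cover one thread per point of the iteration space.

    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        /* unique identifier for this hardware thread = one point in the iteration space */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                /* guard: the launch may be rounded up past n */
            c[i] = a[i] + b[i];   /* the identifier (scaled by the element stride) locates the data */
    }
    /* The launch must cover the whole iteration space, e.g.
         vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
       With PTask, the run-time infers this configuration from templates, or the
       programmer supplies it with sys_set_geometry. */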
In the common case, the run-time can infer the iteration space and GPU thread configuration from templates.

Templates.

Templates provide meta-data that describes the raw data in a datablock's buffers. A template contains a dimensions member (stride and three-dimensional array bounds), and a dbtags member that indicates whether the data should be treated as an array of fixed-stride elements, a stream of variable-sized elements, or an opaque byte array. The dbtags also indicate the type(s) of resource(s) the data will be bound to at execution time: examples include buffers in GPU global memory, buffers in GPU constant memory, or formal parameters. Templates serve several purposes. First, they allow the run-time to infer the iteration space when it is not specified by the programmer. In the common case, the iteration space is completely described by the product of the stride and array bounds, and the GPU should launch a hardware thread for every point in the iteration space. Second, templates bound to ports enable the run-time to allocate datablocks that will be consumed by internal and output channels in the graph. Finally, templates enable the run-time to give reasonable feedback to the programmer when API calls are misused. For example, constructing a channel requires a template; channel creation fails if either the source or destination port has a template that specifies an incompatible geometry. Similarly, attempts to push datablocks into channels also fail on any template mismatch.

4.1.2 Handling irregular data

Computations on data that lack a fixed stride require templates that describe a variable geometry, meaning that data layout is only available dynamically. In such cases, a template fundamentally cannot carry sufficient information for the run-time to deduce the iteration space. Use of a variable-geometry channel requires an additional meta-data channel containing per-element geometry information. A meta-data channel carries, at some fixed stride, information that can be used by hardware threads as a map for data in the other channel. For example, if one input channel carries datablocks with records of variable length, its meta-data channel carries datablocks of integers indicating the offset and length of each record in datablocks received on the first channel.

PTask must use templates to allocate datablocks for downstream channels, because the programmer does not write code to allocate datablocks (except for those pushed into input channels), and because the runtime materializes device-side and host-side views of buffers on demand, typically after the datablock itself is allocated. For fixed-stride computations, allocating output datablocks is straightforward: the buffer size is the product of the stride and dimensions of the template. For computations that may produce a variable-length output (such as a select or join), the runtime needs additional information. To address this need, a template on an OutputPort with its variable-size record stream ptags set contains a pointer to an InputPort which is designated to provide output size information. That InputPort's ptags must indicate that it is bound to a run-time input, indicating it is used by the runtime and not by GPU-side code. The run-time generates an output size for each upstream invocation.

Computations with dynamically determined output sizes require this structure because memory allocated on the GPU must be allocated by a call from the host: dynamic memory allocation in GPU code is generally not possible.³

³ Although some support for device-side memory allocation has arrived with CUDA 4.0, GPUs and their programming frameworks in general do not support it.

Figure 8: A dataflow graph for the gesture recognition system using the ptask, port, and channel abstractions.

Consequently, the canonical approach is to first run a computation on the GPU that determines
output size and computes a map of offsets where each hardware thread writes its output. The map of offsets is used to allocate output buffers and is consumed as an additional input to the original computation [35, 30, 34]. In short, we argue that all variable-length GPU computations follow this type of structure, and while the pattern may be burdensome to the programmer, that burden is fundamental to GPU programming, and is not imposed by the PTask programming model. The PTask API enables any variable-geometry structures that are possible with other GPU programming frameworks.

4.2 PTask invocation

A ptask can be in one of four states: Waiting (for inputs), Queued (inputs available, waiting for a GPU), Executing (running on the GPU), or Completed (finished execution, waiting to have its outputs consumed). When all of a ptask's input ports are Occupied, the runtime puts the ptask on a run queue and transitions it from Waiting to Queued. A ptask is invoked when it is at the head of the run queue and a GPU is available that is capable⁴ of running it: invocation is the transition from the Queued to Executing state. When a ptask is invoked, the runtime reads the Datablocks occupying that ptask's InputPorts. For any non-sticky InputPort, this will remove the datablock from the port and will cause the port to pull from its upstream channel; the port goes to the Unoccupied state if the upstream channel is empty. In contrast, a StickyPort remains in the Occupied state when the runtime reads its datablock. When the runtime identifies that it has compute resources available, it chooses a ptask in the Queued state. Scheduling algorithms are considered in more detail in Section 5.1. Upon completion of an invocation, the runtime sets a ptask's state to Completed, and moves its output datablocks to its OutputPorts, if and only if all the output ports are in the Unoccupied state. The ptask remains in the Completed state until all its output ports are occupied, after which it is returned to the Waiting state. The Completed state is necessary because channels have finite capacity. As a result it is possible for execution on the GPU to complete before a downstream channel drains.

⁴ A system may have multiple GPUs with different features, and a ptask can only run on GPUs that support all features it requires.

4.3 Gestural Interface PTask Graph

The gestural interface system can be expressed as a PTask graph (see Figure 8), yielding multiple advantages. First, the graph eliminates unnecessary communication. A channel connects USB source ports (usbsrc_0, usbsrc_1) to image input ports (rawimg_0, rawimg_1). Data transfer across this channel eliminates double buffering by sharing a buffer between the USB device driver, PTask run-time, and GPU driver, or with hardware support, going directly
from the USB device to GPU memory, rather than taking an unnecessary detour through system memory. A channel connecting the output ports of xform (cloud_*) to the input ports (i_*) of the filter ptask can avoid data copying altogether by reusing the output of one ptask as the input of the next. Because the two xform ptasks and the filter ptask run on the GPU, the system can detect that the source and destination memory domains are the same and elide any data movement as a result.

This PTask-based design also minimizes involvement of host-based user-mode applications to coordinate common GPU activities. For example, the arrival of data at the raw image input of the xform program can trigger the computation for the new frame using interrupt handlers in the OS, rather than waiting for a host-based program to be scheduled to start the GPU-based processing of the new frame. The only application-level code required to cause data to move through the system is the sys_pull call on the output channel of the hidinput process.

Under this design, the graph expresses concurrency that the run-time exploits without requiring the programmer to write code with explicit threads. Data captured from different camera perspectives can be processed in parallel. When multiple GPUs are present, or a single GPU with support for concurrent kernels [5], the two xform PTasks can execute concurrently. Regardless of what GPU-level support for concurrency is present in the system, the PTask design leverages the pipeline parallelism expressed in the graph, for example, by performing data movement along channels in parallel with ptask execution on both the host and the GPU. No code modifications are required by the programmer for the system to take advantage of any of these opportunities for concurrency.

5. IMPLEMENTATION

We have implemented the PTask design described in Section 3 on Windows 7, and integrated it both into a stand-alone user-mode library, and into the device driver for the photogrammetric cameras used in the gestural interface. The stand-alone library allows us to evaluate the benefits of the model in isolation from the OS. The user-mode framework supports ptasks coded in HLSL (DirectX), CUDA, and OpenCL, implementing dataflow graph support on top of DirectX 11, the CUDA 4.0 driver API, and the OpenCL implementation provided with NVIDIA's GPU Computing Toolkit 4.0. The driver-integrated version emulates kernel-level support for ptasks. When ptasks run in the driver, we assign a range of ioctl codes in 1:1 correspondence with the system call interface shown in Table 1, allowing applications other than the gestural interface to use the PTask API by opening a handle to the camera driver and calling ioctl (DeviceIoControl in Windows). The driver-level implementation supports only ptasks coded in HLSL, and is built on top of DXGI, which is the system call interface Windows 7 provides to manage the graphics pipeline.

GPU drivers are vendor-specific and proprietary, so no kernel-facing interface exists to control GPUs. While this remains the case, kernel-level management of GPUs must involve some user-mode component. The Windows Driver Foundation [7] enables a layered design for drivers, allowing us to implement the camera driver as a combination kernel-mode (KMDF) and user-mode (UMDF) driver, where responsibilities of the composite driver are split between kernel- and user-mode components. The component that manages the cameras runs in kernel mode, while the component that implements PTask runs in user mode. The two components avoid data copy across the user/kernel boundary by mapping the memory used to buffer raw data from the cameras into the address space of the user-mode driver. When the kernel-mode component has captured a new set of frames from a camera, it signals the user-mode component, which can begin working directly on the captured data without requiring buffer copy.
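The sketch below illustrates the driver-integrated path described above: a user-mode application opens the camera device and issues a PTask call through DeviceIoControl. The device path and the ioctl code are hypothetical placeholders; the actual ioctl numbering corresponds to the system call interface of Table 1, which is not reproduced here.

    #include <windows.h>
    #include <winioctl.h>

    /* Hypothetical ioctl code standing in for one entry of the PTask interface. */
    #define IOCTL_PTASK_SYS_PUSH \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x900, METHOD_BUFFERED, FILE_ANY_ACCESS)

    int ptask_push_via_driver(void *datablock, DWORD size) {
        /* hypothetical device name for the photogrammetric camera driver */
        HANDLE h = CreateFileA("\\\\.\\PTaskCamera", GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return -1;
        DWORD bytes = 0;
        BOOL ok = DeviceIoControl(h, IOCTL_PTASK_SYS_PUSH,
                                  datablock, size,   /* input buffer: datablock to push */
                                  NULL, 0,           /* no output buffer                */
                                  &bytes, NULL);
        CloseHandle(h);
        return ok ? 0 : -1;
    }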
The PTask library comprises roughly 8000 lines of C and C++ code, and the camera driver is implemented in about 3000 lines of C. Code to assemble and manage the ptask graph for the gestural interface introduces approximately an additional 400 LOC.

5.1 PTask Scheduling

PTask scheduling faces several challenges. First, GPU hardware cannot currently be preempted or context-switched, ruling out traditional approaches to time-slicing hardware. Second, true integration with the process scheduler is not currently possible due to lack of an OS-facing interface to control the GPU in Windows. Third, when multiple GPUs are present in a system, data locality becomes the primary determinant of performance. Parallel execution on multiple GPUs may not always be profitable, because the increased latency due to data migration can be greater than the latency reduction gained through concurrency.

Our prototype implements four scheduling modes: first-available, fifo, priority, and data-aware. In first-available mode, every ptask is assigned a manager thread, and those threads compete for available accelerators. Ready ptasks are not queued in this mode, so when ready ptasks outnumber available accelerators, access is arbitrated by locks on the accelerator data structures. In the common case, with only a single accelerator, this approach is somewhat reasonable because dataflow signaling will wake up threads that need the lock anyway. The fifo policy enhances the first-available policy with queuing.

In priority mode, ptasks are enhanced with a static priority and a proxy priority. The proxy priority is the OS priority of the thread managing its invocation and data flows. Proxy priority allows the system to avoid priority laundering, where the priority of the requesting process is ineffective because requests run with the priority of threads in the PTask run-time rather than with the priority of the requester. Proxy priority avoids this by enabling a ptask's manager thread to assume the priority of a requesting process. A ptask's static and proxy priority can both be set with the sys_set_ptask_prio system call.

The scheduler manages a ready queue of ptasks, and a list of available accelerators. When any ptask transitions into Queued, Executing, or Completed state, a scheduler thread wakes up and computes an effective priority value for each ptask in the queue. The effective priority is the weighted sum of the ptask's static priority and boost values derived from its current wait time, its average wait time (computed with an exponential moving average), average run time, and its proxy priority. Weights are chosen such that, in general, a ptask's effective priority will increase if a) it has longer than average wait time, b) it has lower than average GPU run time, or c) its proxy priority is high. Boosting priority in response to long waits avoids starvation, boosting in response to short run times increases throughput by preferring low-latency PTasks, and boosting for high proxy priority helps the PTask scheduler respect the priority of the OS process scheduler. Pseudo-code for computing effective priority is shown in Figure 9. When the effective priority update is complete, the scheduler sorts the ready queue in descending order of effective priority.

To assign an accelerator to a ptask, the scheduler first considers the head of the queue, and chooses from the list of available accelerators based on fitness and strength. An accelerator's fitness is a function of whether the accelerator supports the execution environment and feature set required by the ptask: unfit accelerators are simply eliminated from the pool of candidates. The strength of the accelerator is the product of the number of cores, the core clock speed, and the memory clock speed: in our experience, this is an imperfect but effective heuristic for ranking accelerators such that low-latency execution is preferred.
    void update_eff_prio(ptasks) {
        avg_gpu   = avg_gpu_time(ptasks);
        avg_cwait = avg_current_wait(ptasks);
        avg_dwait = avg_decayed_wait(ptasks);
        avg_pprio = avg_proxy_prio(ptasks);
        foreach(p in ptasks) {
            boost  = W_0 * (p->last_cwait - avg_cwait);
            boost += W_1 * (p->avg_wait   - avg_dwait);
            boost += W_2 * (p->avg_gpu    - avg_gpu);
            boost += W_3 * (p->proxy_prio - avg_pprio);
            p->eff_prio = p->prio + boost;
        }
    }
    gpu match_gpu(ptask) {
        gpu_list = available_gpus();
        remove_unfit(gpu_list, ptask);
        sort(gpu_list);  /* by descending strength */
        return remove_head(gpu_list);
    }
    void schedule() {
        update_eff_prio(ptasks);
        sort(ptasks);    /* by descending eff prio */
        while (gpus_available() && size(ptasks) > 0) {
            foreach(p in ptasks) {
                best_gpu = match_gpu(p);
                if (best_gpu != null) {
                    remove(ptasks, p);
                    p->dispatch_gpu = best_gpu;
                    signal_dispatch(p);
                    return;
                }
            }
        }
    }

Figure 9: Pseudo-code for algorithms used by PTask's priority scheduling algorithm.

The scheduler always chooses the strongest accelerator when a choice is available. If the scheduler is unable to assign an accelerator to the ptask at the head of the queue, it iterates over the rest of the queue until an assignment can be made. If no assignment can be made, the scheduler blocks. On a successful assignment, the scheduler removes the ptask from the queue, assigns the accelerator to it, moves the ptask to Executing state, and signals the ptask's manager thread that it can execute. The scheduler thread repeats this process until it runs out of available accelerators or ptasks on the run queue, and then blocks waiting for the next scheduler-relevant event. Pseudo-code for scheduling and matching accelerators to ptasks is shown in Figure 9 as the schedule and match_gpu functions respectively.

The data-aware mode uses the same effective priority system that the priority policy uses, but alters the accelerator selection algorithm to consider the memory spaces where a ptask's inputs are currently up-to-date. If a system supports multiple GPUs, the inputs required by a ptask may have been most recently written in the memory space of another GPU. The data-aware policy finds the accelerator where the majority of a ptask's inputs are up-to-date, designating it the ptask's preferred accelerator. If the preferred accelerator is available, the scheduler assigns it. Otherwise, the scheduler examines the ptask's effective priority to decide whether to schedule it on an accelerator that requires data migration. PTasks with high effective priority relative to a parameterizable threshold (empirically determined) will be assigned to the strongest fit available accelerator. If a ptask has low effective priority, the scheduler will leave it on the queue in hopes that its preferred accelerator will become available again soon. The policy is not work-conserving: it is possible that Queued ptasks do not execute even when accelerators are available.
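The strength heuristic and the data-aware choice can be summarized in the same pseudo-code style as Figure 9. This is a sketch of the policy as described, not the implementation itself; MIGRATION_THRESHOLD stands in for the empirically chosen threshold.

    int strength(gpu) {
        /* imperfect but effective ranking heuristic */
        return gpu->num_cores * gpu->core_clock * gpu->mem_clock;
    }
    gpu data_aware_match(ptask) {
        preferred = accel_with_most_current_inputs(ptask);  /* where most inputs are up-to-date */
        if (is_available(preferred))
            return preferred;
        if (ptask->eff_prio > MIGRATION_THRESHOLD)           /* high priority: pay to migrate    */
            return strongest_fit_available(ptask);
        return null;                                         /* low priority: wait for preferred */
    }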
…memory hierarchy and threading models. Like other SDF languages [50] such as LUSTRE [31] and ESTEREL [18], Sponge provides a static schedule, while PTask does not. More importantly, Sponge is primarily concerned with optimizing compiler-generated GPU-side code, while PTask addresses systems-level issues. A PTask-based system could benefit from Sponge support, and vice-versa.

Liquid Metal [39] and Lime [10] provide programming environments for heterogeneous targets such as systems comprising CPUs and FPGAs. Lime's filters and I/O containers allow a computation to be expressed (by the compiler, in intermediate form) as a pipeline, while PTask's graph-structured computation is expressed explicitly by the programmer. Lime's buffer objects provide a similar encapsulation of data movement across memory domains to that provided by PTask's channels and datablocks. Flextream [37] is a compilation framework for the SDF model that dynamically adapts applications to target architectures in the face of changing availability of FPGA, GPU, or CPU resources. Like PTask, Flextream applications are represented as a graph. As language-level tools, Liquid Metal, Lime and Flextream provide no OS-level support and therefore cannot address isolation/fairness guarantees, and cannot address data sharing across processes or in contexts where accelerators do not have the required language-level support. PTask is not coupled with a particular language or user-mode runtime.

I/O and data movement. PacketShader [32] is a software router that accelerates packet processing on GPUs, and SSLShader [42] accelerates a secure sockets layer server by offloading AES and RSA computations to GPUs. Both SSLShader and PacketShader rely heavily on batching (along with overlap of computation/communication with the GPU) to address the overheads of I/O to the GPU and concomitant kernel/user switches. The PTask system could help by reducing kernel/user switches and eliminating double buffering between the GPU and NIC. IO-Lite [60] supports unified buffering and caching to minimize data movement. Unlike IO-Lite's buffer aggregate abstraction, PTask's datablocks are mutable, but PTask's channel implementations share IO-Lite's technique of eliminating double-buffering with memory-mapping (also similar to fbufs [25], Container Shipping [62], and zero-copy mechanisms proposed by Thadani et al. [65]). While IO-Lite addresses data movement across protection domains, it does not address the problem of data movement across disjoint/incoherent memory spaces, such as those private to a GPU or other accelerator.

8. CONCLUSION

This paper proposes a new set of OS abstractions for accelerators such as GPUs called the PTask API. PTasks expose only as much hardware detail as is required to enable programmers to achieve good performance and low latency, while providing abstractions that preserve modularity and composability. The PTask API promotes GPUs to a general-purpose, shared compute resource, managed by the OS, which can provide fairness and isolation.

9. ACKNOWLEDGEMENTS

We thank Ashwin Prasad for implementation of the fdtd and grpby microbenchmarks. This research is supported by NSF CAREER award CNS-0644205, NSF award CNS-1017785, and a 2010 NVIDIA research grant. We thank our shepherd, Steve Hand, for his valuable and detailed feedback.

10. REFERENCES

[1] IBM 709 electronic data-processing system: advance description. I.B.M., White Plains, NY, 1957.
[2] The Imagine Stream Processor, 2002.
[3] Recommendation for block cipher modes of operation: the XTS-AES mode for confidentiality on block-oriented storage devices. National Institute of Standards and Technology, Special Publication 800-38E, 2009.
[4] NVIDIA GPUDirect. 2011.
[5] NVIDIA's Next Generation CUDA Compute Architecture: Fermi. 2011.
[6] Top500 supercomputer sites. 2011.
[7] Windows Driver Foundation (WDF). 2011.
[8] M. Andrecut. Parallel GPU Implementation of Iterative PCA Algorithms. ArXiv e-prints, Nov. 2008.
[9]
M. Andrecut. Parallel GPU Implementation of Iterative PCA Algorithms. Journal of Computational Biology, 16(11), Nov. 2009.
[10] J. S. Auerbach, D. F. Bacon, P. Cheng, and R. M. Rabbah. Lime: a java-compatible and synthesizable language for heterogeneous architectures. In OOPSLA. ACM, 2010.
[11] C. Augonnet and R. Namyst. StarPU: A Unified Runtime System for Heterogeneous Multi-core Architectures.
[12] C. Augonnet, S. Thibault, R. Namyst, and M. Nijhuis. Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System. In SAMOS '09, pages 329–339, 2009.
[13] R. M. Badia, J. Labarta, R. Sirvent, J. M. Pérez, J. M. Cela, and R. Grima. Programming Grid Applications with GRID Superscalar. Journal of Grid Computing, 1:2003, 2003.
[14] C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert. Scheduling strategies for master-slave tasking on heterogeneous processor platforms. 2004.
[15] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP, 2009.
[16] A. Bayoumi, M. Chu, Y. Hanafy, P. Harrell, and G. Refai-Ahmed. Scientific and Engineering Computing Using ATI Stream Technology. Computing in Science and Engineering, 11(6):92–97, 2009.
[17] P. Bellens, J. M. Perez, R. M. Badia, and J. Labarta. CellSs: a programming model for the Cell BE architecture. In SC 2006.
[18] G. Berry and G. Gonthier. The Esterel synchronous programming language: design, semantics, implementation. Sci. Comput. Program., 19:87–152, November 1992.
[19] B. N. Bershad, S. Savage, P. Pardyak, E. G. Sirer, M. E. Fiuczynski, D. Becker, C. Chambers, and S. Eggers. Extensibility, safety and performance in the SPIN operating system. SIGOPS Oper. Syst. Rev., 29:267–283, December 1995.
[20] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics, 2004.
[21] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon. Stream computations organized for reconfigurable execution (SCORE). FPL '00, 2000.
[22] S. C. Chiu, W.-k. Liao, A. N. Choudhary, and M. T. Kandemir. Processor-embedded distributed smart disks for I/O-intensive workloads: architectures, performance models and evaluation. J. Parallel Distrib. Comput., 65(4):532–551, 2005.
[23] C. H. Crawford, P. Henning, M. Kistler, and C. Wright. Accelerating computing with the Cell Broadband Engine processor. In CF 2008, 2008.
[24] A. Currid. TCP offload to the rescue. Queue, 2(3):58–65, 2004.
[25] P. Druschel and L. L. Peterson. Fbufs: a high-bandwidth cross-domain transfer facility. SIGOPS Oper. Syst. Rev., 27:189–202, December 1993.
[26] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov. Parallel Computing Experiences with CUDA. IEEE Micro, 28(4):13–27, 2008.