Appeared in Proceedings of the 26th International Symposium on Computer Architecture, ISCA 1999

PipeRench: A Coprocessor for Streaming Multimedia Acceleration

Seth Copen Goldstein†, Herman Schmit, Matthew Moe, Mihai Budiu†, Srihari Cadambi, R. Reed Taylor, Ronald Laufer

School of Computer Science† and Department of ECE
Carnegie Mellon University
Pittsburgh, PA 15213
†{seth,mihaib}@cs.cmu.edu
{herman,moe,cadambi,rt2i,rel}@ece.cmu.edu

Abstract

Future computing workloads will emphasize an architecture's ability to perform relatively simple calculations on massive quantities of mixed-width data. This paper describes a novel reconfigurable fabric architecture, PipeRench, optimized to accelerate these types of computations. PipeRench enables fast, robust compilers, supports forward compatibility, and virtualizes configurations, thus removing the fixed size constraint present in other fabrics. For the first time we explore how the bit-width of processing elements affects performance and show how the PipeRench architecture has been optimized to balance the needs of the compiler against the realities of silicon. Finally, we demonstrate extreme performance speedup on certain computing kernels (up to 190x versus a modern RISC processor), and analyze how this acceleration translates to application speedup.

1. Introduction

Workloads for computing devices are rapidly changing. On the desktop, the integration of digital media has made real-time media processing the primary challenge for architects [10]. Embedded and wireless computing devices need to process copious data streaming from sensors and receivers. These changes emphasize simple, regular computations on large sets of small data elements. There are two important respects in which this need does not match the processing strengths of conventional processors. First, the size of the data elements underutilizes the processor's wide datapath. Second, the instruction bandwidth is much higher than it needs to be to perform regular, dataflow-dominated computations on large data sets.

Both of these problems are being addressed through processor architecture. Most recent ISAs have multimedia instruction set extensions that allow a wide datapath to be switched into SIMD operation [17]. The instruction bandwidth issue has created renewed interest in vector processing [14, 24].

A fundamentally different way of addressing these problems is to configure connections between programmable logic elements and registers in order to construct an efficient, highly parallel implementation of the processing kernel. This interconnected network of processing elements is called a reconfigurable fabric, and the data set used to program the interconnect and processing elements is a configuration. After a configuration is loaded into a reconfigurable fabric, there is no further instruction bandwidth required to perform the computation. Furthermore, because the operations are composed of small basic elements, the size of the processing elements can closely match the required data size. This approach is called reconfigurable computing.

Despite reports of amazing performance [11], reconfigurable computing has not been accepted as a mainstream computing technology because most previous efforts were based upon, or inspired by, commercial FPGAs and fail to meet the requirements of the marketplace. The problems inherent in using standard FPGAs include:

1. Logic granularity: FPGAs are designed for logic replacement. The granularity of the functional units is optimized to replace random logic, not to perform multimedia computations.
2. Configuration time: The time it takes to load a configuration in the fabric is called configuration time. In commercial FPGAs, configuration times range from hundreds of microseconds to hundreds of milliseconds. To show a performance improvement this start-up latency must be amortized over huge data sets, which limits the applicability of the technique.
3. Forward-compatibility: FPGAs require redesign or recompilation to gain benefit from future generations of the chip.
4. Hard constraints: FPGAs can implement only kernels of a fixed and relatively small size. This is part of the reason that compilation is difficult: everything must fit. It also causes large and unpredictable discontinuities between kernel size and performance.
5. Compilation time: Currently the synthesis, placement and routing phases of designs take hundreds of times longer than what the compilation of the same kernel would take for a general-purpose processor.

This paper describes PipeRench, a reconfigurable fabric designed to increase performance on future computing workloads. PipeRench realizes the performance promises of reconfigurable computing while solving the problems outlined above. PipeRench uses a technique called pipeline reconfiguration to solve the problems of compilability, reconfiguration time, and forward-compatibility. The architectural parameters of PipeRench, including the logic block granularity, were selected to optimize the performance of a suite of kernels, balancing the needs of a compiler against design realities in deep-submicron process technology.

PipeRench is currently used as an attached processor. This places significant limitations on the types of applications that can realize speedup, due to limited bandwidth between PipeRench, the main memory and the processor. We believe this represents the initial phase in the evolution of reconfigurable processors. Just as floating-point computation migrated from software emulation, to attached processors, to coprocessors, and finally to full incorporation into processor ISAs, so will reconfigurable computing eventually be integrated into the CPU.

In the next section, we use several examples to illustrate the advantages and architectural requirements of reconfigurable fabrics. We introduce the idea of pipeline reconfiguration in Section 3, and describe how this technique solves the practical problems faced by reconfigurable computing. Section 4 describes a class of architectures that can implement pipelined reconfiguration. We evaluate these architectures in Section 5. We cover related work in Section 6, and in Section 7 we summarize and discuss future research.

2. Reconfigurable Computing

2.1. Attributes of Target Kernels

Functions for which a reconfigurable fabric can provide a significant benefit exhibit one or more of the following features:

1. The function operates on bit-widths that are different from the processor's basic word size.
2. The data dependencies in the function allow multiple function units to operate in parallel.
3. The function is composed of a series of basic operations that can be combined into a single specialized operation.
4. The function can be pipelined.
5. Constant propagation can be performed, reducing the complexity of the operations.
6. The input values are reused many times within the computation.

These functions take two forms. Stream-based functions process a large data input stream and produce a large data output stream, while custom instructions take a few inputs and produce a few outputs. After presenting a simple example of each type of function to illustrate how a reconfigurable fabric can improve performance, we discuss the ways in which a fabric can be integrated into a complete system.

2.2. A Stream-Based Function: FIR

A reconfigurable fabric can be most effective when used to implement entire pipelines from applications. Here we investigate a simple but prototypical pipeline for implementing a finite-impulse response (FIR) filter. The FIR filter exhibits all but feature 3 from the requirement list in Section 2.1. Figure 1 shows the C code and a hardware implementation. When FIR is mapped to a reconfigurable fabric, the general-purpose multipliers shown in the hardware description are implemented as constant multipliers, where the constants are the w[i] values. This results in less hardware and fewer cycles than a general-purpose multiplier.

    /* Loop bounds are not legible in the source; numOutputs and
       numTaps are placeholder names. */
    for (int i = 0; i < numOutputs; i++) {
        y[i] = 0;
        for (int j = 0; j < numTaps; j++)
            y[i] = y[i] + x[i + j] * w[j];
    }

Figure 1. C code for a FIR filter and a pipelined version for a three-tap filter.

Figure 2 compares an 8-bit FIR using 12-bit coefficients running on a particular instance of PipeRench to implementations on a Xilinx FPGA using parallel distributed arithmetic (shown as Xilinx PDA in Figure 2) and double-rate bit-serial distributed arithmetic (shown as Xilinx DDA). Both the PipeRench chip and Xilinx FPGA are implemented in 100 mm^2 of silicon using a 0.35 micron process.
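The constant-multiplier point can be illustrated with a minimal C sketch. It assumes hypothetical coefficient values (3, 5, 10) and a shift-and-add decomposition; the text does not say how PipeRench builds its constant multipliers, so this is one common approach rather than the paper's method. The names fir_generic, fir_const, TAPS, and W are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed coefficients, chosen only for illustration. */
enum { TAPS = 3 };
static const uint32_t W[TAPS] = { 3, 5, 10 };

/* Generic FIR in the style of Figure 1: one general-purpose multiply
   per tap. x must hold n_out + TAPS - 1 samples. */
void fir_generic(const uint8_t *x, uint32_t *y, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        y[i] = 0;
        for (size_t j = 0; j < TAPS; j++)
            y[i] += (uint32_t)x[i + j] * W[j];
    }
}

/* Same filter with each multiply specialized to its constant, so only
   shifts and adds remain (assumed decomposition, see lead-in). */
void fir_const(const uint8_t *x, uint32_t *y, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        uint32_t a = x[i], b = x[i + 1], c = x[i + 2];
        y[i] = ((a << 1) + a)           /* 3a  = 2a + a  */
             + ((b << 2) + b)           /* 5b  = 4b + b  */
             + ((c << 3) + (c << 1));   /* 10c = 8c + 2c */
    }
}
```

Both functions produce identical outputs, but the second needs no general-purpose multiplier, which is the kind of saving that constant propagation (feature 5 in Section 2.1) makes available to a fabric.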
Figure 2. Performance on 8-bit FIR filters: PipeRench, Xilinx FPGA using parallel and serial arithmetic, and Texas Instruments DSP. (Plot: megasamples per second (MSPS) versus number of FIR filter taps; series: PipeRench, Xilinx PDA, Xilinx DDA, TI DSP.)

    /* Signature and loop bound are only partly legible in the source;
       an unsigned 32-bit argument is assumed here. */
    int popCount(unsigned int n) {
        int sum = 0, i;
        for (i = 0; i < 32; i++)
            sum += (n >> i) & 1;
        return sum;
    }

Figure 3. C code and its hardware implementation for population count.

The FPGA runs at approximately 60 MHz for both applications, while PipeRench's clock is 100 MHz. PipeRench outperforms both Xilinx implementations over a broader range of filter sizes. Similarly, PipeRench outperforms the Texas Instruments TMS320C6201, a commercial DSP that runs at 200 MHz and contains two 16x16-bit integer multipliers, on filters larger than a few taps. PipeRench exhibits the same high level of performance as the FPGA. Due to its support for hardware virtualization, as described in Section 3, PipeRench exhibits the same graceful degradation of performance as the DSP.

2.3. Custom Instructions: Population Count Instruction

Most processors, with the exception of vector supercomputers, do not include a native population count instruction and thus it must be implemented in software (see Figure 3). Using a reconfigurable fabric, popCount() can be implemented as a custom instruction giving a raw performance improvement of more than an order of magnitude. The function exhibits three of the qualifying features (1, 2,
and 3) from Section 2.1. The reconfigurable computing solution replaces the O(n) loop with an adder tree of height O(log n). Furthermore, the adders used are significantly narrower than the adders on the processor. The circuit can also be pipelined, so that when executed on a vector it retires one result every cycle.

In evaluating a reconfigurable fabric, it is important to take into account both configuration time and the communication latency and bandwidth between the processor and the fabric. If popCount() is called only once, it makes little sense to configure the fabric to perform the operation since the time to configure the fabric will be larger than the savings obtained by executing popCount() on the fabric. When popCount() is used outside of a loop and data dependencies require that the result be used immediately after it is computed, the fabric needs direct access to the processor registers. On the other hand, if popCount() is used in a loop, where there are no immediate dependencies on the results, performance can be better if the fabric can directly access memory. In this paper we concentrate on the latter case.

2.4. The Fabric's Place

Reconfigurable fabrics provide the computational datapath with more flexibility. Their utility and applicability are influenced by the manner in which they are integrated into the datapath. We recognize three basic ways in which a fabric may be integrated into a system: as an attached processor on the I/O or memory bus, as a coprocessor, or as a functional unit on the main CPU. (See Figure 4.)

Figure 4. Possible locations for reconfigurable fabric in memory hierarchy. Bandwidth figures are typical for a 300 MHz Sun UltraSPARC-II. (Diagram labels: CPU, L1, L2, main memory, memory bus, I/O bus, reconfigurable fabric, functional unit, loosely coupled, tightly coupled, coprocessor; bandwidths shown: 800 MB/sec, 1.3 GB/sec, 5 GB/sec, 133 MB/sec.)

Attached-processor systems, e.g. PAM [1], Splash [4], and DISC [25], have no direct access to the processor. Rather, they are controlled over a bus. The primary feature of attached processors is that they are easy to add to
existing computer systems. However, due to the bandwidth and latency constraints imposed by the bus, they can enhance only computations that have a high computation-to-memory-bandwidth ratio. Thus, they are most suited to stream-based functions that require little or no communication with the host processor.

In coprocessor architectures, there is a low-latency, high-bandwidth connection between the processor and the reconfigurable fabric, which increases the number of stream-based functions that can profitably be run on the fabric. Recent examples of such systems include Garp [13] and Napa-1000 [19].

Further specialization occurs when the fabric is on the main processor's datapath, as in functional-unit architectures like P-RISC [18], Chimaera [12], and OneChip [26]. All of these allow custom instructions to be executed. The reconfigurable unit is on the processor datapath and has access to registers. However, these implementations restrict the applicability of the reconfigurable unit by disallowing state to be stored in the fabric and in some cases by disallowing direct access to memory, essentially eliminating their usefulness for stream-based processing.

In this paper we describe pipelined reconfigurable architectures, which can be used in any of the fashions described above. However, in order to describe the system we are currently building, we limit ourselves to describing how we would apply it as an attached-processor system. The natural evolution of this fabric to a coprocessor or a function unit would only enhance its applicability.

3. Pipelined Reconfigurable Architectures

In the previous section, we described how application-specific configurations of reconfigurable fabrics can be used to accelerate certain applications. The computation is embedded in a single static configuration rather than in a sequence of instructions, thereby reducing the instruction bandwidth.

The static nature of these configurations, however, causes two significant problems. First, the computation may require more hardware than is available. Second, given more hardware, there is no way that a single hardware design can exploit the additional resources that will inevitably become available in future process generations. In this section, we review a technique called pipeline reconfiguration [20], that allows a large logical design to
be implemented on a small piece of hardware through rapid reconfiguration of that hardware. With this technique, the compiler is no longer responsible for satisfying fixed hardware constraints. In addition, the performance of a design improves in proportion to the amount of hardware allocated to that design; as future process technology makes more transistors available, the same hardware designs achieve higher levels of performance.

Pipeline reconfiguration is a method of virtualizing pipelined hardware application designs by breaking the single static configuration into pieces that correspond to pipeline stages in the application. These configurations are then loaded, one per cycle, into the fabric. This makes it possible to perform the computation even though the whole configuration is never present in the fabric at one time.

The virtualization process is illustrated in Figure 5, which shows a five-stage pipeline being virtualized on a three-stage fabric. The top portion of this figure shows the five-stage application and the state of each of the stages of the pipeline in five consecutive cycles. The bottom half of the figure shows the state of the physical stages in the fabric that is executing this application. An effective metaphor for this procedure is scrolling on a text window. Once the pipeline is full, every five cycles generates two results from the pipeline. In general, when a v-stage application is virtualized on a device with a capacity of p stages (p < v), the throughput of the implementation is proportional to (p - 1)/v. Throughput is a linear function of the capacity of the device; therefore performance improves due to both increases in clock frequency and decreases in feature size, without any redesign, until p = v. Thereafter, applications' performance continues to gain only through increased clock speed.

Because the configuration of stages happens concurrently with the execution of other stages, there is no loss in performance due to reconfiguration. As the pipeline is filling with data, stages of the computation are being configured ahead of that data. Even if there is no virtualization, configuration time is equivalent to the pipeline fill time of the application and therefore does not reduce the maximum throughput of the application.

In order for this virtualization process to work, the state in any pipeline stage must be a function only
of the current state of that stage and the current state of the previous stage. In other words, cyclic dependencies must fit within one stage of the pipeline. Interconnect that directly skips over one or more stages is not allowed, nor are connections from one stage to a previous stage. Fortunately, many computations on streaming data can be pipelined within these constraints. Furthermore, by including structures we call pass registers, it is possible to create virtual connections between distant stages.

The primary challenge facing pipeline reconfiguration is configuring a computationally significant pipeline stage in one clock cycle. To do this, we connect a wide on-chip configuration buffer (either SRAM or DRAM) to the nearby fabric, allowing a pipeline stage to be configured in one cycle. We use the word stripe to describe both the physical stages in the fabric (the physical stripes), and the configuration words that are written into them (the virtual stripes). Any virtual stripe can be written into any physical stripe. Therefore, all physical stripes must have identical functionality and interconnect.

Figure 5. Pipeline Reconfiguration. This diagram shows the process of virtualizing a five-stage pipeline on a three-stage device. (Diagram: virtual pipestages 1 to 5 and physical pipestages 1 to 3 over cycles 1 to 6; legend: configuring, executing.)

Before a physical stripe is reconfigured with a new virtual stripe, the state of the resident virtual stripe, if any, must be stored outside of the fabric. Conversely, when a virtual stripe is returned to the fabric, any stored state for the stripe must be restored within the physical stripe [5].

4. PipeRench

In this section, we describe a class of pipeline reconfigurable fabrics, called PipeRench devices, and define critical architectural parameters for this class of fabrics. These architectural parameters are the subject of the performance evaluation described in Section 5.

An abstract view of the PipeRench architectural class is shown in Figure 6. The device is composed of a set of physical pipeline stages, or stripes. Each stripe is composed of interconnect and processing elements (PEs), which contain registers and ALUs. An ALU is composed of look-up tables (LUTs) and extra circuitry for carry-chains, zero-detection, etc. The PEs have access to a global I/O bus. Through the interconnect network, the PEs can access operands from registered outputs of the previous stripe as well as registered or unregistered outputs of the other PEs in the stripe. There are no busses that go to a previous stripe; this is required by hardware virtualization (as discussed in [5]) and makes long feedback loops impossible, since any feedback must be contained within one stripe.

Figure 6. PipeRench Architecture: PEs and interconnect. (Diagram labels: stripes n, n+1, and n+2; PE 0 through PE N-1, each with an ALU and a register file; interconnect network; global buses.)

The global I/O busses are required because the pipeline stages in an application may be physically located in any of the stripes in the fabric; inputs to and outputs from the application must use a global bus to get to their destination.¹ All PipeRench devices have four global busses. Two of these busses are dedicated to storing and restoring stripe state during hardware virtualization. The other two are used for input and output. Combinational logic is implemented using a set of N B-bit wide ALUs. The ALU operation is static while a particular virtual stripe is located in the physical stripe. The carry lines of PipeRench's ALUs may be cascaded to construct wider ALUs. Furthermore, ALUs may be chained together via the interconnect network to build complex combinational functions.

¹ By limiting the set of physical stripes that may hold a particular virtual stripe one can eliminate the global busses. This reduces utilization, but may increase clock frequency sufficiently to make it worthwhile.

4.1. Pass Register File

We organize each stripe as an array of processing elements (PEs). Each PE contains one ALU and a pass register file. As described in Section 3, there can be no unregistered interconnect between stripes. Furthermore, any state caused by registered feedback within the stripe must be saved and restored. The pass register is designed to provide efficient pipelined (registered) interstripe connections. Each pass register file has one dedicated register that can be used for intra-stripe feedback and therefore must have its state stored and restored.

As illustrated in Figure 7, the output of the ALU can be written to any one of the P registers in the pass register file. If the register is not written by the ALU, the value in the pass register is loaded from the value in the corresponding pass register in the previous stripe. This reduces the amount of state that can be contained in the pass register file to a single register, because data that travels through the pipeline does not need to be saved and restored. The pass register file also provides a way to route intermediate results computed on one stripe to a stripe somewhere down the pipeline, without wasting ALUs or the interconnect network within the stripe. Like the ALU operation, the specific registers that are written to and read from the pass register file are static while a virtual stripe is resident; different PEs can read and write different registers, but the registers that a particular PE accesses change only when a different virtual stripe configures the physical stripe.

Figure 7. The pass register interconnect. (Diagram labels: ALU, read port, write port, B-bit pass register file with registers 1 to P, stripe n, stripe n+1.)

Figure 8. Complete architectural class. (Diagram labels: global busses, interconnect network, P pass registers, output from previous stripe, two barrel shifters, ALU with inputs A and B, B-1 bits from next PE, B-1 bits to next PE, control/carry bits, to interconnect network, to global output bus.)

4.2. Interconnect Network

The pass register file provides pipelined interconnect from a PE in one stripe to the corresponding PE in subsequent stripes. If data values need to move laterally within the stripe, they must use the interconnect network, which is illustrated as a horizontal bar in Figure 6. In each stripe, the interconnect network accepts inputs from each of the PEs in that stripe, as well as one of the registered values from the previous stripe. Like the ALU operations and the pass register files, the interconnect network is programmed during configuration and remains unchanged during the lifetime of the virtual stripe.

The interconnect we evaluate in Section 5 is a full crossbar. This is expensive in terms of hardware, but it makes every design easily placeable by the compiler. Furthermore, a rich network is necessary to achieve good utilization in a reconfigurable fabric [9]. In fact, most fabrics use over 50% of their available area on interconnect. As shown in Section 5,
even with a full crossbar we use less than 50% of the area for the interstripe interconnect. Though we use a full crossbar, it connects only PEs to PEs; i.e., it is a B-bit wide, N x N crossbar, as opposed to an (N x B) x (N x B) crossbar. A key to making this interconnect useful is that each PE has a barrel shifter that can shift its inputs up to B - 1 bits to the left (see Figure 8). This allows our architecture to do data alignments that are necessary for word-based arithmetic as described in [6].

4.3. Physical Implementation

Currently we are planning to design this system in 100 mm^2 of silicon in a 0.25 micron process. Half of that area is for the reconfigurable fabric, while the other half is for memory to store virtual stripes, control, and chip I/O. Fifty square millimeters of silicon provides approximately 500 kb of virtual configuration storage, which is adequate for very large applications.

4.4. Architectural Parameters

Figure 8 summarizes one of the N PEs in a stripe for our parameterized architecture. In the following section, we explore the following three architectural parameters:

N: the number of PEs in the stripe;
B: the width, in bits, of each PE;
P: the number of B-bit wide registers in the pass register file per PE.

5. Evaluation

In this section we explore the design space of pipelined reconfigurable architectures. Using a compiler and CAD tools, we look at how several kernels perform on implementations of the fabric that differ in the parameters described in Section 4.4.

5.1. Kernels and Applications

Performance and utilization data were gathered for PipeRench implementations of various kernels. The kernels
were chosen based on demand for the applications in the present and near future, their recognition as industry performance benchmarks, and their ability to fit into our computational model.

ATR implements the shapesum kernel of the Sandia algorithm for automatic target recognition [22]. This algorithm is used to find an instance of a template image in a larger image, and to distinguish between images that contain different templates.

Cordic is a 12 stage implementation of the Honeywell timing benchmark for Cordic vector rotations [15]. Given a vector in rectangular coordinates and a rotation angle in degrees, the algorithm finds a close approximation to the resultant rotation.

DCT is a one-dimensional, eight-point discrete cosine transform [16]. DCT-2D, a two-dimensional DCT, is an important algorithm in digital signal processing and is the core of JPEG image compression.

FIR is described in Section 2.2. Here we implement a FIR filter with 20 taps and 8-bit coefficients.

IDEA implements a complete eight-round International Data Encryption Algorithm with the key compiled into the configuration [21]. IDEA is the heart of Phil Zimmermann's Pretty Good Privacy (PGP) data encryption.

Nqueens is an evaluator for the N queens problem on an 8x8 board. Given the coordinates of chess queens on a chessboard, it determines whether any of the queens can attack each other.

Over implements the Porter-Duff over operator [2]. This is a method of joining two images based on a mask of transparency values for each pixel.

PopCount is described in Section 2.3.

We also evaluate the performance of PipeRench on two complete applications, JPEG and PGP. In each of these applications we assume PipeRench is integrated into the system on the PCI bus, which has a peak memory bandwidth of 132 MB/sec.

5.2. Methodology

Our approach is to use CAD tools to synthesize a stripe based on the parameters N, B, and P. We join this automatically synthesized layout with a custom layout for the interconnect. Using the final layout we determine the number of physical stripes that can fit in our silicon budget of 50 mm^2 (5 mm x 10 mm) and the delay characteristics of the components of the stripe (e.g., LUTs, carry-chain, interconnect, etc.). The delay characteristics and number of registers are then used by the compiler to create configurations for each of the architectural instances, yielding a design of a certain number of stripes at a particular frequency. We can then determine the overall speed of the kernel, in terms of throughput, for each architectural instance.

The CAD tool flow synthesizes each design point and automatically places and routes the final design. Although the automatic tool flow does not yield the optimal design, we assume that the various points are equally non-optimal, allowing us to compare the designs. Preliminary analysis showed the CAD tools doing quite well, except for the interconnect, which we hand optimize.

The kernels are written in a single-assignment C-like language, DIL, which is intended both for programmers and as an intermediate language for a high-level language compiler that targets reconfigurable architectures. The DIL compiler automatically synthesizes and places and routes our largest designs in a few seconds [3]. It is parameterizable so that we can generate configurations for any pipelined reconfigurable architecture as described in Section 4.

5.3. The Fabric

There are two main constraints that determine which parameters generate realizable fabrics: the width of a stripe and the number of vertical wires that must pass over each stripe. The width of a stripe is influenced by the size and number of the PEs and the number of registers allocated to each PE. We limit the width of a stripe to 4.9 mm in order to allow two of them to be placed side by side.² The second constraint is to accommodate the number of vertical wires that pass over the stripes within two metal layers. These wires include those for the global busses, the pass registers, and the configuration bits.

² Virtualization requires that data be allowed to flow between any two stripes, including the last physical one and the first physical one. To obtain consistent routing delay times we arrange the stripes in two columns: in one column the data flows down and in the other it flows up. This avoids a long path from the last to the first physical stripe.

We explore the region of the space bounded by PE bit-widths (B) of 2, 4, 8, 16, and 32 bits; stripe widths (N x B) of between 64 bits and 256 bits; and registers (P) of 2, 4, 8 and 16.³ Figure 9 shows the computational density (bit-ops/area-time) of the realizable parameters when four and eight registers are allocated to each PE. Interestingly, the result is essentially independent of stripe width. The reason for this is that as the stripe width increases, the amount of area per stripe devoted to interconnect increases, but the total number of stripes decreases, yielding a constant amount of total area devoted to interconnect. In fact, the total area devoted to interstripe interconnect is less than 50% of the area devoted to the fabric. The total delay from the output of one stripe into the PE of the next stripe remains approximately constant because the wire capacitance of the interstripe interconnect (5 mm long in all cases) dominates the transistor delays.

³ Some of the wider stripes can be implemented only with eight registers.

Figure 9. Computational density. (Two panels, four-register and eight-register computational density: megabit operations/mm^2-sec versus stripe width, 64 to 256 bits, for PE bit-widths of 2, 4, 8, 16, and 32.)

The computational density does not seem to have a monotonic relationship with PE width. This seems counter-intuitive; as PE size increases, the overhead of configuration decreases and the ability to optimize the PE increases. Therefore, computational density should increase. But our delay metric includes the delay associated with the carry chain of one PE, which increases with PE width. The increased carry chain delay counters the reduction in size per bit of the wider PEs, causing the computational density to remain relatively constant. On the other hand, if we were to use only logical operations to measure delay, we would observe a near-linear increase in computational density as PE size increases.

Because registers consume substantial area, density goes down as the number of registers increases (compare the two graphs in Figure 9). In fact, since we use registers mainly to implement pipelined interstripe interconnect, they contribute little to computational density. However, as we will see, they are extremely useful in compiling kernels to the fabric.

The last effect we examine is the size of the configuration word. The configuration word size approximately halves as PE widths double. On the other hand, as the width of the stripe increases, the configuration word increases slightly. For 128-bit stripes, the configuration bits for a stripe range from 1280 bits for a 4-bit PE to 164 bits for a 32-bit PE.

5.4. The Compiler

The compilation process maps source written in a dataflow intermediate language (DIL) to a particular instance of PipeRench. DIL is a single-assignment language with C operators and a type system that allows the bit-width of variables to be specified. The compiler converts the source into a dataflow graph and then, through many transformations, creates a final configuration. The important transformations for this study are operator decomposition, operator recomposition, fitting, and place-and-route.

The operator decomposition pass breaks up operators so that they can execute within the target cycle time. For example, a wide adder needs to be broken up into several smaller adders due to the carry-chain delays. The decomposition must also create new operators that handle the routing of the carry bits between the partial sums. For operations that require carry bits, the decomposed version is significantly larger and has additional routing constraints. Thus, as PE size decreases, the penalty for decomposition increases. Currently, the interaction between operator decomposition and place-and-route requires each stripe to have at least six PEs.

The naive decomposition for an operator routes the carry signal on the interstripe interconnect. It also results in sign-extending the single carry bit to the size of the smaller adders. To compensate for this, the operator recomposition pass uses pattern matching to find subgraphs that can be mapped to parameterized modules designed to take advantage of architecture-specific routing and PE capabilities. Most importantly for this study, this slightly reduces the overhead of decomposed carry operations.

The fitting pass matches the wire and operator widths to the size of a PE. This can require the insertion of sign-extension operators to increase the width of wires that are not multiples of a PE width. As the PE width increases, this causes both underutilization of PEs and a larger percentage of sign-extension PEs. Furthermore, routing operations become more complex, as extracting bits from wires that are not PE-aligned often involves using an extra PE.

Place-and-route is the key to the compiler. It places and
routes the operators in the graph onto stripes under the timing constraint imposed by the target cycle time. Thus, as the clock rate or the delay through the PE increases, the utilization of each stripe can decrease, unless the kernel has sufficient parallelism so that independent operators can be placed in a stripe. This is particularly true of stripes with many PEs.

In addition to assigning operators to PEs and wires to the interconnect, the place-and-route pass assigns wires to the pass registers. If there are insufficient pass registers, the compiler will time-multiplex wires on registers. Time-multiplexing slows the circuit down in order to allow multiple values to reside in a single register. For example, if two wires are assigned to a single register, then the register holds one wire on the odd cycles and another on the even ones. While time-multiplexing does not increase the circuit size significantly, it does reduce the throughput by a constant factor. For architectures with few registers this is a severe penalty, as time-multiplexing factors of more than ten may be required.

One of the goals for the DIL compiler was compilation speed. It achieves high speed compilation in part by trading off result quality for faster compilation. This affects the results by introducing more time-multiplexing than necessary.

5.5. Compiler/Fabric Interaction

The real question, of course, is not the raw hardware performance available, but how well it can be utilized. Using the parameterizable compiler we compiled configurations for each kernel. Before evaluating the effects of overall width, number of PEs, or number of bits per PE, we narrow down the design space by examining the effect of pass registers.

Figure 10. The harmonic mean of the throughput for all fabric parameters as a function of registers.

For a given stripe width and bits-per-PE, as the number of registers increases, the computational density decreases. However, since pass registers make up an important component of the interstripe interconnect, reducing the number of pass registers increases routing pressure, which decreases stripe utilization and causes the compiler to time-multiplex the values on the registers. As Figure 10 shows, the best balance of computation density with utilization is most often achieved at eight registers. The average time-multiplexing factor
for all the kernels, averaged across all the fabrics, ranges from over 60 for two registers, to 12 at four registers, 2 at eight registers, and 1 at sixteen registers. IDEA and Nqueens have higher factors at eight registers than the other kernels. The rest of the evaluation occurs with eight pass registers per PE, i.e. P = 8.

Figure 11 shows the throughput achieved for various stripe widths and PE sizes at eight registers per PE. As can be seen, though the wider PE sizes create fabrics with higher computational density, the natural data sizes of the kernels are smaller, causing 32-bit PEs to be underutilized. At the other end of the spectrum, 2-bit PEs are not competitive due to increased times for arithmetic operations, the lack of raw computational density, and the increased number of configuration bits needed per application.

If we examine the performance of the individual kernels (see Figure 11), we can see that the characteristics of the kernels greatly influence which parameters are best. For example, DCT needs at least 8 PEs in the stripe (see footnote 4), ruling out 32-bit PEs for all but the widest stripe. The peak at 128 bits occurs because there is a sufficient number of PEs to eliminate time-multiplexing. Wider stripes can also be utilized because there is sufficient parallelism in the DCT algorithm.

FIR operates mostly on 8-bit and wider numbers. This makes 4-bit PEs less attractive due to the carry-chain delay associated with crossing PEs. There is enough parallelism to keep the wider stripes busy. These stripes have fewer registers, which increases the number of stripes in the implementation, thereby reducing its overall throughput.

IDEA takes wide inputs, so stripes narrower than 96 bits require substantial time-multiplexing. Unlike DCT and FIR, there is not enough parallelism to utilize the wider stripes.

In summary, we need to choose a fabric that is at least 128 bits wide. We also want at least 12 PEs in the stripe. Since not all kernels have sufficient parallelism to utilize wide stripes, we want to choose the narrowest stripe to which all kernels can be compiled. Thus, we choose a 128-bit wide fabric made up of eight-bit PEs with eight registers each.

Footnote 4: Eight PEs are required to transpose the data for the two-dimensional DCT.

Figure 11. The throughput for various kernels on a 100 MHz PipeRench. The kernels use up to 8 registers. (Four panels: Throughput for IDEA, Throughput for FIR, Throughput for DCT, and Harmonic Mean of Throughput. X-axis: stripe width in bits, 64 to 256. Y-axis: millions of inputs/second. One curve per PE bit-width, B = 2, 4, 8, 16, 32.)

5.6. Performance Versus General-Purpose Processors

Using the 128-bit stripe of eight-bit PEs with eight registers each, we compare the performance of PipeRench to that of a general-purpose processor, the UltraSparc-II running at 300 MHz. Figure 12 shows the raw speedup for all kernels. This performance is hard to achieve with PipeRench connected via the I/O bus, but a large fraction of the raw speedup is achievable.

Table 1 shows the speedup from using PipeRench versus doing the entire application on the main processor. For PGP, we replace code for IDEA (accounting for 12% of the application) with invocations of PipeRench, reducing the time for this portion of the code to zero and yielding an average speedup of almost 12%. For JPEG, by running the two-dimensional DCT kernel on PipeRench, we obtain an average improvement of about 7.2%. We also find that the PCI bus imposes no serious bottlenecks on the performance of these applications.

Figure 12. Speedup of eight-bit PE, eight registers per PE, 128-bit wide stripe.

Application   Input Size (MB)   Speedup
PGP           0.14              1.07
PGP           9.2               1.12
PGP           18.39             1.12
PGP           27.59             1.12
JPEG          2.02              1.06
JPEG          10.13             1.08
JPEG          11.74             1.07
JPEG          11.75             1.07

Table 1. Speedups for PGP and JPEG using a 100 MHz 128-bit 53-stripe PipeRench on a 32-bit 33 MHz PCI bus compared to a 330 MHz UltraSparc-II.

6. Related Work

Numerous other architectural research efforts are focused on efficiently harnessing huge numbers of transistors for media-centric computing workloads. The lineage of these systems derives from either FPGAs or existing computer architectures. Those descended from FPGAs are termed "reconfigurable computing systems", and include PRISC [18], DISC [25], NAPA [19], GARP [13], Chimaera [12], One-Chip [26], RAW [23], and RaPiD [8].

None of these reconfigurable computing systems supports an architectural abstraction like virtual hardware. In every case, the compiler must be aware of all the system constraints, and if it violates any constraint, it has failed. This makes compilation difficult, slow, and unpredictable. Furthermore, there is no facility in these architectures for forward compatibility, so every application needs to be recompiled for every new chip. PipeRench offers hardware virtualization, forward compatibility, and easier compilation. Like most of the aforementioned architectures, PipeRench differs from FPGAs in that its basic word size is more than one or two bits and that its interconnect is less general and more efficient for computation.

PipeRench addresses many of the problems faced by other computer architectures. We focus on uniprocessor systems because PipeRench exploits fine-grained parallelism. The most insightful comparisons are to MMX, VLIW, and vector machines.

The mismatch between application data size and native operating data size has been addressed by extending the ISAs of microprocessors to allow a wide data path to be split into multiple parallel data paths, as in Intel's MMX [17]. Obtaining SIMD parallelism to utilize the parallel data paths is nontrivial, and works only for very regular computations where the cost of data alignment does not overwhelm the gain in parallelism. PipeRench has a rich interconnect to provide for alignment and allows PEs to have different configurations, so that parallelism need not be SIMD.

VLIW architectures are designed to exploit dataflow parallelism that can be determined at compile time [7]. VLIWs have extremely high instruction bandwidth demands. A single PipeRench stripe is similar to a VLIW processor using many small, simple functional units. But in PipeRench, after the stripe is configured, it is used to perform the same computation on a large data set, thereby amortizing the instructions over more data.

The instruction bandwidth issue has been addressed by vector microprocessors such as T0 [24] and IRAM [14]. The problem with vector architectures is that the vector register file is a physical or logical bottleneck that limits scalability. Allocating additional functional units in a vector processor requires an additional port on the vector register file. The physical bottleneck of the register file can be ameliorated by providing direct forwarding paths to allow chained operations to bypass the register file, as in T0 [24]. This places large demands on the issue hardware. A logical bottleneck is caused by the limited namespace of the register file. This can be addressed by implementing register renaming to avoid false dependencies. Thus, vector microprocessors are subject to the same complexities in issue and control hardware design as modern superscalar processors. All connections in PipeRench are local, and there is no central logical or physical bottleneck. Therefore, the number of functional units can grow without increasing the complexity of the issue and control hardware.

7. Future Work and Conclusions

In this paper we have described a new reconfigurable computing architecture, PipeRench, which emphasizes performance on future computing workloads. PipeRench uses pipelined reconfiguration to overcome many of the difficulties faced by previous attempts to use reconfigurable computing to tackle these important applications. PipeRench enables fast, robust compilers; supports forward compatibility; and virtualizes hardware, removing the fixed size constraint present in other fabrics. As a result, the designer base is broadened, development cycles are shortened, and application developers can amortize the cost of development over multiple process generations.

We first examined the computational density of the fabric by automatically synthesizing hardware based on a number of architectural parameters, including the size of the PE, the number of PEs, and the number of registers. Raw computational density is relatively flat across the space of architectures. The architectural parameters could only be tuned once we had a retargetable compiler and could measure the amount of exploitable computational power in the fabric.

Using the compiler and hardware synthesis flow in tandem, we found that PEs with bit-widths of eight are the best compromise between flexibility and efficiency across a broad range of kernels. When these PEs are arranged in moderately wide stripes (e.g. 128 bits wide), we can obtain significant performance improvements over general-purpose processors, in some cases achieving improvement of two orders of magnitude. These performance numbers are conservative. Both hardware performance and compiler efficiency can be significantly optimized.

We are currently building a PCI-based board that will include one or more PipeRench chips. Although PipeRench is currently being built into a system as an attached processor, we are examining how to move it closer to the processor. We expect that just as the computing demands of the past decades forced floating-point processors to become floating-point units, the computing workloads of the near future will cause PipeRench to move from an attached processor to a reconfigurable unit.

8. Acknowledgements

The authors wish to thank the reviewers for their helpful comments. This work was supported by DARPA contract DABT63-96-C-0083. We also received financial support from Altera Corporation, and technical support from STMicroelectronics.

References

[1] P. Bertin and H. Touati. PAM programming environments: Practice and experience. In D. A. Buell and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 133-138, Napa, CA, Apr. 1994.
[2] J. Blinn. Fugue for MMX. IEEE Computer Graphics and Applications, pages 88-93, March-April 1997.
[3] M. Budiu and S. Goldstein. Fast compilation for pipelined reconfigurable fabrics. In Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays (FPGA '99), Monterey, CA, Feb. 1999.
[4] D. Buell, J. Arnold, and P. Athanas. SPLASH 2: FPGAs in a custom computing machine. AW, 196.
[5] S. Cadambi, J. Weener, S. Goldstein, H. Schmit, and D. Thomas. Managing pipeline-reconfigurable FPGAs. In Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, February 1998.
[6] D. Cherepacha and D. Lewis. A datapath oriented architecture for FPGAs. In Second International ACM/SIGDA Workshop on Field Programmable Gate Arrays, 1994.
[7] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. B. Papworth, and P. K. Rodman. A VLIW architecture for a trace scheduling compiler. In Proceedings of ASPLOS-II, pages 180-192, Mar. 1987.
[8] D. Cronquist, P. Franklin, S. Berg, and C. Ebeling. Specifying and compiling applications for RaPiD. In K. Pocek and J. Arnold, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 116-127, Napa, CA, Apr. 1998. IEEE Computer Society, IEEE Computer Society Press.
[9] A. DeHon. Reconfigurable Architectures for General-Purpose Computing. PhD thesis, Massachusetts Institute of Technology, September 1996.
[10] K. Diefendorff and R. Dubey. How multimedia workloads will change processor design. IEEE Computer, 30(9):43-45, September 1997.
[11] S. Hauck. The roles of FPGAs in reprogrammable systems. Proceedings of the IEEE, pages 615-638, Apr. 1998.
[12] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao. The Chimaera reconfigurable functional unit. In IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), pages 87-96, April 1997.
[13] J. Hauser and J. Wawrzynek. Garp: A MIPS processor with a reconfigurable coprocessor. In IEEE Symposium on FPGAs for Custom Computing Machines, pages 24-33, April 1997.
[14] C. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable processors in the billion-transistor era: IRAM. IEEE Computer, pages 75-78, September 1997.
[15] S. Kumar et al. Timing sensitivity stressmark. Technical Report CDRL A001, Honeywell, Inc., January 1997. http://www.htc.honeywell.com/projects/acsbench/.
[16] C. Loeffler, A. Ligtenberg, and G. Moschytz. Practical fast 1-D DCT algorithms with 11 multiplications. In Proc. International Conference on Acoustics, Speech, and Signal Processing 1989 (ICASSP '89), pages 988-991, 1989.
[17] A. Peleg, S. Wilkie, and U. Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):24-38, 1997.
[18] R. Razdan and M. Smith. A high-performance microarchitecture with hardware-programmable functional units. In MICRO-27, pages 172-180, November 1994.
[19] C. Rupp, M. Landguth, T. Garverick, E. Gomersall, H. Holt, J. Arnold, and M. Gokhale. The NAPA adaptive processing architecture. In IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '98), April 1998.
[20] H. Schmit. Incremental reconfiguration for pipelined applications. In J. Arnold and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 47-55, Napa, CA, Apr. 1997.
[21] B. Schneier. The IDEA encryption algorithm. Dr. Dobb's Journal, 18(13):50, 52, 54, 56, December 1993.
[22] J. Villasenor, B. Schoner, K. Chia, and C. Zapata. Configurable computing solutions for automatic target recognition. In J. Arnold and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 70-79, Napa, CA, Apr. 1996.
[23] E. Waingold, M. Taylor, D. Srikrishna, et al. Baring it all to software: Raw machines. IEEE Computer, pages 86-93, September 1997.
[24] J. Wawrzynek, K. Asanovic, B. Kingsbury, J. Beck, D. Johnson, and N. Morgan. Spert-II: A vector microprocessor system. IEEE Computer, 29(3):79-86, March 1996.
[25] M. J. Wirthlin and B. L. Hutchings. A dynamic instruction set computer. In P. Athanas and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 99-107, Napa, CA, Apr. 1995.
[26] R. Wittig and P. Chow. OneChip: An FPGA processor with reconfigurable logic. In IEEE Symposium on FPGAs for Custom Computing Machines, 1996.
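
Note on the application speedups of Section 5.6. The PGP figure quoted there (IDEA accounts for 12% of the application, and its time is reduced to zero) can be sanity-checked with Amdahl's law. The paper does not state this formula, so the sketch below is an assumption about how such an estimate could be made, not the authors' measurement method. The function name `overall_speedup` and the finite kernel speedup of 10 are hypothetical, chosen only for illustration.

```python
import math


def overall_speedup(fraction_accelerated: float, kernel_speedup: float) -> float:
    """Amdahl's-law estimate of whole-application speedup (illustrative only).

    fraction_accelerated: share of the original run time spent in the kernel (0..1).
    kernel_speedup: speedup of that kernel alone; math.inf models
                    "time for this portion reduced to zero".
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # Run time left after acceleration, as a fraction of the original run time.
    remaining = (1.0 - fraction_accelerated) + fraction_accelerated / kernel_speedup
    if remaining == 0.0:
        # The whole application was accelerated to zero time.
        return math.inf
    return 1.0 / remaining


if __name__ == "__main__":
    # 12% of PGP is IDEA (from the text); kernel time assumed to drop to zero.
    ideal = overall_speedup(0.12, math.inf)
    # Hypothetical finite kernel speedup, for comparison only.
    finite = overall_speedup(0.12, 10.0)
    print(f"ideal bound: {ideal:.3f}, with 10x kernel: {finite:.3f}")
```

Under these assumptions the ideal bound for PGP is about 1.14, slightly above the 1.12 reported in Table 1. The model ignores any cost of moving data to and from the coprocessor, so it should be read as an upper bound rather than a prediction.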