Reconfigurable Computing: A Survey of Systems and Software

KATHERINE COMPTON
Northwestern University

SCOTT HAUCK
University of Washington

Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. In this survey, we explore the hardware aspects of reconfigurable computing machines, from single chip architectures to multi-chip systems, including internal structures and external coupling. We also focus on the software that targets these machines, such as compilation tools that map high-level algorithms directly to the reconfigurable substrate. Finally, we consider the issues involved in run-time reconfigurable systems, which reuse the configurable hardware during program execution.

Categories and Subject Descriptors: A.1 [Introductory and Survey]; B.6.1 [Logic Design]: Design Style - logic arrays; B.6.3 [Logic Design]: Design Aids; B.7.1 [Integrated Circuits]: Types and Design Styles - gate arrays

General Terms: Design, Performance

Additional Key Words and Phrases: Automatic design, field-programmable, FPGA, manual design, reconfigurable architectures, reconfigurable computing, reconfigurable systems

This research was supported in part by Motorola, Inc., DARPA, and NSF. K. Compton was supported by an NSF fellowship. S. Hauck was supported in part by an NSF CAREER award and a Sloan Research Fellowship.

Authors' addresses: K. Compton, Department of Electrical and Computer Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208-3118; e-mail: kati@ece.northwestern.edu; S. Hauck, Department of Electrical Engineering, The University of Washington, Box 352500, Seattle, WA 98195; e-mail: hauck@ee.washington.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or permissions@acm.org.

2002 ACM 0360-0300/02/0600-0171 $5.00

ACM Computing Surveys, Vol. 34, No. 2, June 2002, pp. 171-210.

1. INTRODUCTION

There are two primary methods in conventional computing for the execution of algorithms. The first is to use hardwired technology, either an Application Specific Integrated Circuit (ASIC) or a group of individual components forming a
board-level solution, to perform the operations in hardware. ASICs are designed specifically to perform a given computation, and thus they are very fast and efficient when executing the exact computation for which they were designed. However, the circuit cannot be altered after fabrication. This forces a redesign and refabrication of the chip if any part of its circuit requires modification. This is an expensive process, especially when one considers the difficulties in replacing ASICs in a large number of deployed systems. Board-level circuits are also somewhat inflexible, frequently requiring a board redesign and replacement in the event of changes to the application.

The second method is to use software-programmed microprocessors, a far more flexible solution. Processors execute a set of instructions to perform a computation. By changing the software instructions, the functionality of the system is altered without changing the hardware. However, the downside of this flexibility is that the performance can suffer, if not in clock speed then in work rate, and is far below that of an ASIC. The processor must read each instruction from memory, decode its meaning, and only then execute it. This results in a high execution overhead for each individual operation. Additionally, the set of instructions that may be used by a program is determined at the fabrication time of the processor. Any other operations that are to be implemented must be built out of existing instructions.

Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits. These elements, sometimes known as logic blocks, are connected using a set of routing resources that are also programmable. In this way, custom digital circuits can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the logic blocks, and using the configurable routing to connect the blocks together to form the necessary circuit.

FPGAs and reconfigurable computing have been shown to accelerate a variety of applications. Data encryption, for example, is able to leverage both parallelism and fine-grained data manipulation. An implementation of the Serpent Block Cipher in the Xilinx Virtex XCV1000 shows a throughput increase by a factor of over 18 compared to a Pentium Pro PC running at 200 MHz [Elbirt and Paar 2000]. Additionally, a reconfigurable computing implementation of sieving for factoring large numbers (useful in breaking encryption schemes) was accelerated by a factor of 28 over a 200-MHz UltraSparc workstation [Kim and Mangione-Smith 2000]. The Garp architecture shows a comparable speed-up for DES [Hauser and Wawrzynek 1997], as does an FPGA implementation of an elliptic curve cryptography application [Leung et al.].

Other recent applications that have been shown to exhibit significant speed-ups using reconfigurable hardware include: automatic target recognition [Rencher and Hutchings 1997], string pattern matching [Weinhardt and Luk 1999], Golomb Ruler Derivation [Dollas et al. 1998; Sotiriades et al. 2000], transitive closure of dynamic graphs [Huelsbergen 2000], Boolean satisfiability [Zhong et al. 1998], data compression [Huang et al. 2000], and genetic algorithms for the travelling salesman problem [Graham and Nelson 1996].

In order to achieve these performance benefits, yet support a wide range of applications, reconfigurable systems are usually formed with a combination of reconfigurable logic and a general-purpose microprocessor. The processor performs the operations that cannot be done efficiently in the reconfigurable logic, such as data-dependent control and possibly memory accesses, while the computational cores are mapped to the reconfigurable hardware. This reconfigurable logic can be
composed of either commercial FPGAs or custom configurable hardware.

Compilation environments for reconfigurable hardware range from tools to assist a programmer in performing a hand mapping of a circuit to the hardware, to complete automated systems that take a circuit description in a high-level language to a configuration for a reconfigurable system. The design process involves first partitioning a program into sections to be implemented on hardware, and those which are to be implemented in software on the host processor. The computations destined for the reconfigurable hardware are synthesized into a gate level or register transfer level circuit description. This circuit is mapped onto the logic blocks within the reconfigurable hardware during the technology mapping phase. These mapped blocks are then placed into the specific physical blocks within the hardware, and the pieces of the circuit are connected using the reconfigurable routing. After compilation, the circuit is ready for configuration onto the hardware at run-time. These steps, when performed using an automatic compilation system, require very little effort on the part of the programmer to utilize the reconfigurable hardware. However, performing some or all of these operations by hand can result in a more highly optimized circuit for performance-critical applications.

Since FPGAs must pay an area penalty because of their reconfigurability, device capacity can sometimes be a concern. Systems that are configured only at power-up are able to accelerate only as much of the program as will fit within the programmable structures. Additional areas of a program might be accelerated by reusing the reconfigurable hardware during program execution. This process is known as run-time reconfiguration (RTR). While this style of computing has the benefit of allowing for the acceleration of a greater portion of an application, it also introduces the overhead of configuration, which limits the amount of acceleration possible. Because configuration can take milliseconds or longer, rapid and efficient configuration is a critical issue. Methods such as configuration compression and the partial reuse of already programmed configurations can be used to reduce this overhead.

This article presents a survey of current research in hardware and
software systems for reconfigurable computing, as well as techniques that specifically target run-time reconfigurability. We lead off this discussion by examining the technology required for reconfigurable computing, followed by a more in-depth examination of the various hardware structures used in reconfigurable systems. Next, we look at the software required for compilation of algorithms to configurable computers, and the trade-offs between hand-mapping and automatic compilation. Finally, we discuss run-time reconfigurable systems, which further utilize the intrinsic flexibility of configurable computing platforms by optimizing the hardware not only for different applications, but for different operations within a single application as well.

This survey does not seek to cover every technique and research project in the area of reconfigurable computing. Instead, it hopes to serve as an introduction to this rapidly evolving field, bringing interested readers quickly up to speed on developments from the last half-decade. Those interested in further background can find coverage of older techniques and systems elsewhere [Rose et al. 1993; Hauck and Agarwal 1996; Vuillemin et al. 1996; Mangione-Smith et al. 1997; Hauck].

2. TECHNOLOGY

Reconfigurable computing as a concept has been in existence for quite some time [Estrin et al. 1963]. Even general-purpose processors use some of the same basic ideas, such as reusing computational components for independent computations, and using multiplexers to control the routing between these components. However, the term reconfigurable computing, as it is used in current research (and within this survey), refers to systems incorporating some form of hardware programmability: customizing how the hardware is used using a number
of physical control points. These control points can then be changed periodically in order to execute different applications using the same hardware.

The recent advances in reconfigurable computing are for the most part derived from the technologies developed for FPGAs in the mid-1980s. FPGAs were originally created to serve as a hybrid device between PALs and Mask-Programmable Gate Arrays (MPGAs). Like PALs, FPGAs are fully electrically programmable, meaning that the physical design costs are amortized over multiple application circuit implementations, and the hardware can be customized nearly instantaneously. Like MPGAs, they can implement very complex computations on a single chip, with devices currently in production containing the equivalent of over a million gates. Because of these features, FPGAs had been primarily viewed as glue-logic replacement and rapid-prototyping vehicles. However, as we show throughout this article, the flexibility, capacity, and performance of these devices have opened up completely new avenues in high-performance computation, forming the basis of reconfigurable computing.

Fig. 1. A programming bit for SRAM-based FPGAs [Xilinx 1994] (left) and a programmable routing connection (right).

Most current FPGAs and reconfigurable devices are SRAM-programmable (Figure 1 left), meaning that SRAM bits are connected to the configuration points in the FPGA, and programming the SRAM bits configures the FPGA.
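The statement that programming SRAM bits configures the device can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions, not a model of any real FPGA: the class `RoutingPoint` and its methods are hypothetical names, each candidate source wire is assumed to reach a shared destination wire through one switch controlled by one SRAM bit, and all bits are assumed to be off at power-up.

```python
from typing import List, Optional, Sequence


class RoutingPoint:
    """Toy model (hypothetical): SRAM-controlled switches sharing one destination wire."""

    def __init__(self, num_sources: int) -> None:
        # One configuration bit per candidate source wire.
        # Assumption: every bit is off (disconnected) at power-up.
        self.sram_bits: List[bool] = [False] * num_sources

    def program(self, bits: Sequence[bool]) -> None:
        """Load a new configuration, as a write to the configuration memory would."""
        if len(bits) != len(self.sram_bits):
            raise ValueError("configuration length mismatch")
        if sum(bool(b) for b in bits) > 1:
            raise ValueError("at most one source may drive the wire")
        self.sram_bits = [bool(b) for b in bits]

    def output(self, sources: Sequence[int]) -> Optional[int]:
        """Value on the destination wire, or None when no switch is turned on."""
        for bit, value in zip(self.sram_bits, sources):
            if bit:
                return value
        return None


point = RoutingPoint(num_sources=2)
undriven = point.output([1, 0])      # no switch on yet
point.program([True, False])
from_first = point.output([1, 0])    # first source wire is connected
point.program([False, True])
from_second = point.output([1, 0])   # same hardware, second source connected
```

The same `RoutingPoint` object produces different connections purely as a result of what was last written into `sram_bits`, which is the sense in which the hardware is customized through its configuration memory.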
Footnote: The term "SRAM" is technically incorrect for many FPGA architectures, given that the configuration memory may or may not support random access. In fact, the configuration memory tends to be continually read in order to perform its function. However, this is the generally accepted term in the field and correctly conveys the concept of static volatile memory using an easily understandable label.

Thus, these chips can be programmed and reprogrammed about as easily as a standard static RAM. In fact, one research project, the PAM project [Vuillemin et al. 1996], considers a group of one or more FPGAs to be a RAM unit that performs computation between the memory write (sending the configuration information and input data) and memory read (reading the results of the computation). This leads some to use the term Programmable Active Memory, or PAM.

One example of how the SRAM configuration points can be used is to control routing within a reconfigurable device [Chow et al. 1999a]. To configure the routing on an FPGA, typically a pass gate structure is employed (see Figure 1 right). Here the programming bit will turn on a routing connection when it is configured with a true value, allowing a signal to flow from one wire to another, and will disconnect these resources when the bit is set to false. With a proper interconnection of these elements, which may include millions of routing choice points within a single device, a rich routing fabric can be created.

Another example of how these configuration bits may be used is to control multiplexers, which will choose between the output of different logic resources within the array. For example, to provide optional state-holding elements, a D flip-flop (DFF) may be included with a multiplexer selecting whether to forward the latched or unlatched signal value (see Figure 2 left). Thus, for systems that require state-holding, the programming bits controlling the multiplexer would be configured to select the DFF output, while systems that do not need this function would choose the bypass route that sends the input directly to the output. Similar structures
can choose between other on-chip functionalities, such as fixed-logic computation elements, memories, carry chains, or other functions.

Fig. 2. D flip-flop with optional bypass (left) and a 3-input LUT (right).

Finally, the configuration bits may be used as control signals for a computational unit or as the basis for computation itself. As a control signal, a configuration bit may determine whether an ALU performs an addition, subtraction, or other logic computations. On the other hand, with a structure such as a lookup table (LUT), the configuration bits themselves form the result of the computation (see Figure 2 right). These elements are essentially small memories provided for computing arbitrary logic functions. LUTs can compute any function of N inputs (where N is the number of control signals for the LUT's multiplexer) by programming the 2^N programming bits with the truth table of the desired function. Thus, if all programming bits except the one corresponding to the input pattern 111 were set to zero, a 3-input LUT would act as a 3-input AND gate, while programming it with all ones except in 111 would compute a NAND.

3. HARDWARE

Reconfigurable computing systems use FPGAs or other programmable hardware to accelerate algorithm execution by mapping compute-intensive calculations to the reconfigurable substrate. These hardware resources are frequently coupled with a general-purpose microprocessor that is responsible for controlling the reconfigurable logic and executing program code that cannot be efficiently accelerated. In very closely coupled systems, the reconfigurability lies within customizable functional units on the regular datapath of the microprocessor. On the other hand, a reconfigurable computing system can be as loosely coupled as a networked stand-alone unit. Most reconfigurable systems are categorized somewhere between these two extremes, frequently with the reconfigurable hardware acting as a coprocessor to a host microprocessor. The programmable array itself can be composed of one or more commercially available FPGAs, or can be a custom device designed specifically for reconfigurable computing.

The design of the actual computation blocks within the reconfigurable hardware varies from system to system. Each unit of computation, or logic block, can be as simple as a 3-input
lookup table (LUT), or as complex as a 4-bit ALU. This difference in block size is commonly referred to as the granularity of the logic block, where the 3-bit LUT is an example of a very fine-grained computational element, and a 4-bit ALU is an example of a quite coarse-grained unit. The finer-grained blocks are useful for bit-level manipulations, while the coarse-grained blocks are better optimized for standard datapath applications. Some architectures employ different sizes or types of blocks within a single reconfigurable array in order to efficiently support different types of computation. For example, memory is frequently embedded within the reconfigurable hardware to provide temporary data storage, forming a heterogeneous structure composed of both logic blocks and memory blocks [Ebeling et al. 1996; Altera 1998; Lucent 1998; Marshall et al. 1999; Xilinx 1999].

The routing between the logic blocks within the reconfigurable hardware is also of great importance. Routing contributes significantly to the overall area of the reconfigurable hardware. Yet, when the percentage of logic blocks used in an FPGA becomes very high, automatic routing tools frequently have difficulty achieving the necessary connections between the blocks. Good routing structures are therefore essential to ensure that a design can be successfully placed and routed onto the reconfigurable hardware.

Once a circuit has been programmed onto the reconfigurable hardware, it is ready to be used by the host processor during program execution. The run-time operation of a reconfigurable system occurs in two distinct phases: configuration and execution. The programming of the reconfigurable hardware is under the control of the host processor. This host processor directs a stream of configuration data to the reconfigurable hardware, and this configuration data is used to define the actual operation of the hardware. Configurations can be loaded solely at start-up of a program, or periodically during runtime, depending on the design of the system. More concepts involved in run-time reconfiguration (the dynamic reconfiguration of devices during computation execution) are discussed in a later section.

The actual execution model of the reconfigurable hardware varies from system to system. For example, the NAPA system [Rupp et al. 1998] by default suspends the execution of the host processor during execution on the reconfigurable hardware. However, simultaneous computation can occur with the use of fork-and-join primitives, similar to multiprocessor programming. REMARC [Miyamori and Olukotun 1998] is a reconfigurable system that uses a pipelined set of execution phases within the reconfigurable hardware. These pipeline stages overlap with the pipeline stages of the host processor, allowing for simultaneous execution. In the Chimaera system [Hauck et al. 1997], the reconfigurable hardware is constantly executing based upon the input values held in a subset of the host processor's registers. A call to the Chimaera unit is in actuality only a fetch of the result value. This value is stable and valid after the correct input values have been written to the registers and have filtered through the computation.

In the next sections, we consider in greater depth the hardware issues in reconfigurable computing, including both logic and routing. To support the computation demands of reconfigurable computing, we consider the logic block architectures of these devices, including possibly the integration of heterogeneous logic resources within a device. Heterogeneity also extends between chips, where one of the most important concerns is the coupling of the reconfigurable logic with standard, general-purpose processors. However, reconfigurable devices are more than just logic devices; the routing resources are at least as important as logic resources, and thus we consider interconnect structures, including 1D-oriented devices that are beginning to appear.

3.1. Coupling

Frequently, reconfigurable hardware is coupled with a traditional microprocessor. Programmable logic tends to be inefficient at implementing certain types of operations, such as variable-length loops and branch control. In order to run an application in a reconfigurable computing system most efficiently, the areas of the program that cannot be easily mapped to the reconfigurable logic are executed on a host microprocessor. Meanwhile, the areas with a high density of computation that can benefit from implementation in hardware are mapped to the reconfigurable logic. For the systems that use a microprocessor in conjunction with reconfigurable logic, there are several ways in which these two computation structures may be coupled, as Figure 3 shows.

Fig. 3. Different levels of coupling in a reconfigurable system. Reconfigurable logic is shaded.

First, reconfigurable hardware can be used solely to provide reconfigurable functional units within a host processor [Razdan and Smith 1994; Hauck et al. 1997]. This allows for a traditional programming environment with the addition of custom instructions that may change over time. Here, the reconfigurable units execute as functional units on the main microprocessor datapath, with registers used to hold the input and output operands.

Second, a reconfigurable unit may be used as a coprocessor [Wittig and Chow 1996; Hauser and Wawrzynek 1997; Miyamori and Olukotun 1998; Rupp et al. 1998; Chameleon 2000]. A coprocessor is, in general, larger than a functional unit, and is able to perform computations without the constant supervision of the host processor. Instead, the processor initializes the reconfigurable hardware and either sends the necessary data to the logic, or provides information on where this data might be found in memory. The reconfigurable unit performs the actual computations independently of the main processor, and returns the results after completion. This type of coupling allows the reconfigurable logic to operate for a large number of cycles without intervention from the host processor, and generally permits the host processor and the reconfigurable logic to execute simultaneously. This reduces the overhead incurred by the use of the reconfigurable logic, compared to a reconfigurable functional unit that must communicate with the host processor each time a reconfigurable "instruction" is used. One idea that is somewhat of a hybrid between the first and second coupling methods is the use of programmable hardware within a configurable cache [Kim et al. 2000]. In this situation, the reconfigurable logic is embedded into the data cache. This cache can then be used as either a regular cache or as an additional computing resource depending on the target application.

Third, an attached reconfigurable processing unit [Vuillemin et al. 1996; Annapolis 1998; Laufer et al. 1999] behaves as if it is an additional processor in a multiprocessor system or an additional compute engine accessed semifrequently through external I/O. The host processor's data cache is not visible to the attached reconfigurable processing unit. There is, therefore, a higher delay in communication between the host processor and the reconfigurable hardware, such as when communicating configuration information, input data, and results. This communication is performed through specialized primitives similar to multiprocessor systems. However, this type of reconfigurable hardware does allow for a great deal of computation independence, by shifting large chunks of a computation over to the reconfigurable hardware.

Finally, the most loosely coupled form of reconfigurable hardware is that of an external stand-alone processing unit [Quickturn 1999a, 1999b]. This type of reconfigurable hardware communicates infrequently with a host processor (if present). This model is similar to that of networked workstations, where processing may occur for very long periods of time without a great deal of communication. In the case of the Quickturn systems, however, this hardware is geared more towards emulation than reconfigurable computing.

Each of these styles has distinct benefits and drawbacks. The tighter the integration of the reconfigurable hardware, the more frequently it can be used within an application or set of applications due to a lower communication overhead. However, the hardware is unable to operate for significant portions of time without intervention from a host processor, and the amount of reconfigurable logic available is often quite limited. The more loosely coupled styles allow for greater parallelism in program execution, but suffer from higher communications overhead. In applications that require a great deal of communication, this can reduce or remove any acceleration benefits gained through this type of reconfigurable hardware.

3.2. Traditional FPGAs

Before discussing the detailed architecture design of reconfigurable devices in general, we will first describe the logic and routing of FPGAs. These concepts apply directly to reconfigurable systems using commercial FPGAs, such as PAM [Vuillemin et al. 1996] and Splash 2 [Arnold et al. 1992; Buell et al. 1996], and many also extend to architectures designed specifically for reconfigurable computing. Hardware concepts applying specifically to architectures designed for reconfigurable computing, as well as variations on the generic FPGA description provided here, are discussed following this section. More detailed surveys of FPGA architectures themselves can be found elsewhere [Brown et al. 1992a; Rose et al.].

Since the introduction of FPGAs in the mid-1980s, there have been many different investigations into what computation element(s) should be built into the array [Rose et al. 1993]. One could consider FPGAs that were created with PAL-like product term arrays, or multiplexer-based functionality, or even basic fixed functions such as simple NAND and XOR gates. In fact, many such architectures have been built. However, it seems to be fairly well established that the best function block for a standard FPGA, a device whose primary role is the implementation of random digital logic, is the one found in the first devices deployed: the lookup table (Figure 2 right). As described in the previous section, an N-input LUT is basically a memory that, when programmed appropriately, can compute any function of up to N inputs. This flexibility, with relatively simple routing requirements (each input need only be routed to a single multiplexer control input), turns out to be very powerful for logic implementation. Although it is less area-efficient than fixed logic blocks, such as a standard NAND gate, the truth is that most current FPGAs use less than 10% of their chip area for logic, devoting the majority of the silicon real estate for routing resources.

Fig. 4. A basic logic block, with a 4-input LUT, carry chain, and a D-type flip-flop with bypass.

The typical FPGA has a logic block with one or more 4-input LUT(s), optional D flip-flops (DFF), and some form of fast carry logic (Figure 4). The LUTs allow any function to be implemented, providing generic logic. The flip-flop can be used for pipelining, registers, state-holding functions for finite state machines, or any other situation where clocking is required. Note that the flip-flops will typically include programmable set/reset lines and clock signals, which may come from global signals routed on special resources, or could be routed via the standard interconnect structures from some other input or logic block. The fast carry logic is a special resource provided in the cell to speed up carry-based computations, such as addition, parity, wide AND operations, and other functions. These resources will bypass the general routing structure, connecting instead directly between neighbors in the same column. Since there are very few routing choices in the carry chain, and thus less delay on the computation, the inclusion of these resources can significantly speed up carry-based computations.

Just as there has been a great deal of experimentation in FPGA logic block architectures, there has been equally as much investigation into interconnect structures. As logic blocks have basically standardized on LUT-based structures, routing resources have become primarily island-style, with logic surrounded by general routing channels.

Fig. 5. A generic island-style FPGA routing architecture.

Most FPGA architectures organize their routing structures as a relatively smooth sea of routing resources, allowing fast and efficient communication along the rows and columns of logic blocks. As shown in Figure 5, the logic blocks are embedded in a general routing structure, with input and output signals attaching to the routing fabric through connection blocks. The connection blocks provide programmable multiplexers, selecting which of the signals in the given routing channel will be connected to the logic block's terminals. These blocks also connect shorter local wires to longer-distance routing resources. Signals flow from the logic block into the connection block, and then along longer wires within the routing channels. At the switchboxes, there are connections between the horizontal and vertical routing resources to allow signals to change their routing direction. Once the signal has traversed through routing resources and intervening switchboxes, it arrives at the destination logic block through one of its local connection blocks. In this manner, relatively arbitrary interconnections can be achieved between the logic blocks in the system.

Within a given routing channel, there may be a number of different lengths of routing resources. Some local interconnections may only move between adjacent logic blocks (carry chains are a good example of this), providing high-speed local interconnect. Medium length lines may run the width of several logic blocks, providing for some longer distance interconnect. Finally, long lines that run the entire chip width or height may provide for more global signals. Also, many architectures contain special "global lines" that provide high-speed, and often low-skew, connections to all of the logic blocks in the array. These are primarily used for clocks, resets, and other truly global signals.

While the routing architecture of an FPGA is typically quite complex (the connection blocks and switchboxes surrounding a single logic block typically have thousands of programming points), they are designed to be able to support fairly arbitrary interconnection patterns. Most users ignore the exact details of these architectures and allow the automatic physical design tools to choose appropriate resources to use in order to achieve a given interconnect pattern.

3.3. Logic Block Granularity

Most reconfigurable hardware is based upon a set of computation structures that are repeated to form an array. These structures, commonly called logic blocks, vary in complexity from a very
small and simple block that can calculate a function of only three inputs, to a structure that is essentially a 4-bit ALU. Some of these block types are configurable, in that the actual operation is determined by a set of loaded configuration data. Other blocks are fixed structures, and the configurability lies in the connections between them. The size and complexity of the basic computing blocks is referred to as the block's granularity.

Fig. 6. The functional unit from a Xilinx 6200 cell [Xilinx 1996].

An example of a very fine-grained logic block can be found in the Xilinx 6200 series of FPGAs [Xilinx 1996]. The functional unit from one of these cells, as shown in Figure 6, can implement any two-input function and some three-input functions. However, although this type of architecture is useful for very fine-grained bit manipulation, it can be too fine-grained to efficiently implement many types of circuits, such as multipliers. Similarly, finite state machines are frequently too complex to easily map to a reasonable number of very fine-grained logic blocks. However, finite state machines are also too dependent upon single bit values to be efficiently implemented in a very coarse-grained architecture. This type of circuit is more suited to an architecture that provides more connections and computational power per logic block, yet still provides sufficient capability for bit-level manipulation.

The logic cell in the Altera FLEX 10K architecture [Altera 1998] is a fine-grained structure that is somewhat coarser than the 6200. This architecture mainly consists of a single 4-input LUT with a flip-flop. Additionally, there is specialized carry-chain circuitry that helps to accelerate addition, parity, and other operations that use a carry chain. These types of logic blocks are useful for fine-grained bit-level manipulation of data, as can frequently be found in encryption and image processing applications. Also, because the cells are fine-grained, computation structures of arbitrary bit widths can be created. This can be useful for implementing datapath circuits that are based on data widths not implemented on the host processor (5-bit multiply, 18-bit addition, etc.). Reconfigurable hardware can take advantage not only of small bit widths, but also of large data widths. When a program uses bit widths in excess of what is normally available in a host processor, the processor must perform the computations using a number of extra steps in order to handle the full data width. A fine-grained architecture would be able to implement the full bit width in a single step, without the fetching, decoding, and execution of additional instructions, as long as enough logic cells are available.

A number of reconfigurable systems use a granularity of logic block that we categorize as medium-grained [Xilinx 1994; Hauser and Wawrzynek 1997; Haynes and Cheung 1998; Lucent 1998; Marshall et al. 1999]. For example, Garp [Hauser and Wawrzynek 1997] is designed to perform a number of different operations on up to four 2-bit inputs. Another medium-grained structure was designed specifically to be embedded inside of a general-purpose FPGA to implement multipliers of a configurable bit width [Haynes and Cheung 1998]. The logic block used in the multiplier FPGA is capable of implementing a 4 × 4 multiplication, or can be cascaded into larger structures. The CHESS architecture [Marshall et al. 1999] also operates on 4-bit values, with each of its cells acting as a 4-bit ALU. Medium-grained logic blocks may be used to implement datapath circuits of varying bit widths, similar to the fine-grained structures. However, with the ability to perform more complex operations of a greater number of inputs, this type of structure can be used efficiently to implement a wider variety of operations.
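The remark that a 4 × 4 multiplier block can be cascaded into larger structures can be illustrated arithmetically. The sketch below does not reproduce the circuit of Haynes and Cheung; it only shows the standard decomposition of a wide unsigned multiply into 4-bit by 4-bit partial products, which is the kind of composition such blocks make possible. The function names `mul4x4`, `mul_cascaded`, and `blocks_needed` are hypothetical, and the assumption is that one block handles each pair of 4-bit digits.

```python
def mul4x4(a: int, b: int) -> int:
    """Model of one medium-grained block: a 4-bit by 4-bit unsigned multiply."""
    assert 0 <= a < 16 and 0 <= b < 16, "operands must fit in 4 bits"
    return a * b


def mul_cascaded(a: int, b: int, width: int = 8) -> int:
    """Multiply two unsigned `width`-bit values using only 4x4 block multiplies.

    `width` must be a multiple of 4. Each operand is cut into 4-bit digits,
    every pair of digits goes through one block, and the partial products
    are shifted into position and summed.
    """
    assert width % 4 == 0, "width must be a multiple of 4"
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    digits = width // 4
    total = 0
    for i in range(digits):
        a_digit = (a >> (4 * i)) & 0xF
        for j in range(digits):
            b_digit = (b >> (4 * j)) & 0xF
            total += mul4x4(a_digit, b_digit) << (4 * (i + j))
    return total


def blocks_needed(width: int) -> int:
    """Number of 4x4 blocks this simple decomposition uses for a width-bit multiply."""
    assert width % 4 == 0, "width must be a multiple of 4"
    return (width // 4) ** 2


product_8bit = mul_cascaded(200, 123)             # uses four 4x4 partial products
product_16bit = mul_cascaded(40000, 50000, 16)    # uses sixteen
```

Under this decomposition the block count grows with the square of the operand width in digits, which gives a rough feel for why the width of a configurable multiplier is limited by the number of blocks available.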
Fig. 7. One cell in the RaPiD-I reconfigurable architecture [Ebeling et al. 1996]. The registers, RAM, ALUs, and multiplier all operate on 16-bit values. The multiplier outputs a 32-bit result, split into the high 16 bits and the low 16 bits. All routing lines shown are 16-bit wide busses. The short parallel lines on the busses represent configurable bus connectors.

Very coarse-grained architectures are primarily intended for the implementation of word-width datapath circuits. Because the logic blocks used are optimized for large computations, they will perform these operations much more quickly (and consume less chip area) than a set of smaller cells connected to form the same type of structure. However, because their composition is static, they are unable to leverage optimizations in the size of operands. For example, the RaPiD architecture [Ebeling et al. 1996], shown in Figure 7, as well as the Chameleon architecture [Chameleon 2000], are examples of this very coarse-grained type of design. Each of these architectures is composed of word-sized adders, multipliers, and registers. If only three 1-bit values are required, then the use of these architectures suffers an unnecessary area and speed overhead, as all of the bits in the full word size are computed. However, these coarse-grained architectures can be much more efficient than fine-grained architectures for implementing functions closer to their basic word size.

An alternate form of a coarse-grained system is one in which the logic blocks are actually very small processors, potentially each with its own instruction memory and/or data values. The REMARC architecture [Miyamori and Olukotun 1998] is composed of an 8 × 8 array of 16-bit processors. Each of these processors uses its own instruction memory in conjunction with a global program counter. This style of architecture closely resembles a single-chip multiprocessor, although with much simpler component processors because the system is intended to be coupled with a host processor. The RAW project [Moritz et al. 1998] is a further example of a reconfigurable architecture based on a multiprocessor design.

The granularity of the FPGA also has a potential effect on the reconfiguration time of the device. This is an important issue for run-time reconfiguration, which is discussed in further depth in a later section. A fine-grained array has many configuration points to perform very small computations, and thus requires more data bits during configuration.

3.4. Heterogeneous Arrays

In order to provide greater performance or flexibility in computation, some reconfigurable systems provide a heterogeneous structure, where the capabilities of the logic cells are not the same throughout the system. One use of heterogeneity in reconfigurable systems is to provide multiplier function blocks embedded within the reconfigurable hardware [Haynes and Cheung 1998; Chameleon 2000; Xilinx 2001]. Because multiplication is one of the more difficult computations to implement efficiently in a traditional FPGA structure, the custom multiplication hardware embedded within a reconfigurable array allows a system to perform even that function well.

Another use of heterogeneous structures is to provide embedded memory blocks scattered throughout the reconfigurable hardware. This allows storage of frequently used data and variables, and allows for quick access to these values due to the proximity of the memory to the logic blocks that access it. Memory structures embedded into the reconfigurable fabric come in two forms. The first is simply the use of available LUTs as RAM structures, as can be done in the Xilinx 4000 series [Xilinx 1994] and Virtex [Xilinx 1999] FPGAs. Although making these very small blocks into a larger RAM structure introduces overhead to the memory system, it does provide local, variable-width memory structures.

Some architectures include dedicated memory blocks within their array, such as the Xilinx Virtex series [Xilinx 1999, 2001] and Altera [Altera 1998] FPGAs, as well as the CS2000 RCP (reconfigurable communications processor) device from Chameleon Systems, Inc. [Chameleon 2000]. These memory blocks have greater performance in large sizes than similar-sized structures built from many small LUTs. While these structures are somewhat less flexible than the LUT-based memories, they can also provide some customization. For example, the Altera FLEX 10K FPGA [Altera 1998] provides embedded memories that have a limited total number of wires, but allow a trade-off between the number of address lines and the data bit width.

When embedded memories are not used for data storage by a particular configuration, the area that they occupy does not necessarily have to be wasted. By using the address lines of the memory as function inputs and the values stored in the memory as function outputs, logical expressions of a large number of inputs can be emulated [Altera 1998; Cong and Xu 1998; Wilton 1998; Heile and Leaver 1999]. In fact, because there may be more than one value output from the memory on a read operation, the memory structure may be able to perform multiple different computations (one for each bit of data output), provided that all necessary inputs appear on the address lines. In this manner, the embedded RAM behaves the same as a very large LUT. Therefore, embedded memory allows a programmer or a synthesis tool to perform a trade-off between logic and memory usage in order to achieve higher area efficiency.

Furthermore, a few of the commercial FPGA companies have announced plans to include entire microprocessors as embedded structures within their FPGAs. Altera has demonstrated a preliminary ARM9-based Excalibur device, which combines reconfigurable hardware with an embedded ARM9 processor core [Altera 2001]. Meanwhile, Xilinx is working with IBM to include a PowerPC processor core within the Virtex-II FPGA [Xilinx 2000]. By contrast, Adaptive Silicon's focus is to provide reconfigurable logic cores to customers for embedding in their own system-on-a-chip (SoC) devices [Adaptive 2001].

3.5. Routing Resources

Interconnect resources are provided in a reconfigurable architecture to connect together the device's programmable logic elements. These resources are usually configurable, where the path of a signal is determined at compile or run-time rather than fabrication time. This flexible interconnect between logic blocks or computational elements allows for a wide variety of circuit structures, each with its own interconnect requirements, to be mapped to the reconfigurable hardware. For example, the routing for FPGAs is generally island-style, with logic surrounded by routing channels, which contain several wires, potentially of varying lengths. Within this type of routing architecture, however, there are still variations. Some of these differences include the ratio of wires to logic in the system, how long each of the wires should be, and whether they should be connected in a segmented or hierarchical manner.

Fig. 8. Segmented (left) and hierarchical (right) routing structures. The white boxes are logic blocks, while the dark boxes are connection switches.

A step in the design of efficient routing structures for FPGAs and reconfigurable systems therefore involves examining the logic vs. routing area trade-off within reconfigurable architectures. One group has argued that the interconnect should constitute a much higher proportion of area in order to allow for successful routing under high-logic utilization conditions [Takahara et al. 1998]. However, for FPGAs, high-LUT utilization may not necessarily be the most desirable situation, but rather efficient routing usage may be of more importance [DeHon 1999]. This is because the routing resources occupy a much larger part of the area of an FPGA than the logic resources, and therefore the most area-efficient designs will be those that optimize their use of the routing resources rather than the logic resources. The amount of required routing does not grow linearly with the amount of logic present; therefore, larger devices require even greater amounts of routing per logic block than small ones [Trimberger et al.].

There are two primary methods to provide both local and global routing resources, as shown in Figure 8. The first is the use of segmented routing [Betz and Rose 1999; Chow et al. 1999a]. In segmented routing, short wires accommodate local communications traffic. These short wires can be connected together using switchboxes to emulate longer wires. Frequently, segmented routing structures also contain longer wires to allow signals to travel efficiently over long distances without passing through a great number of switches.

Hierarchical routing [Aggarwal and Lewis 1994; Lai and Wang 1997; Tsu et al. 1999] is the second method to provide both local and global communication. Routing within a group (or cluster) of logic blocks is at the local level, only connecting within that cluster. At the boundaries of these clusters, however, longer wires connect the different clusters together. This is potentially repeated at a number of levels. The idea behind the use of hierarchical structures is that, provided a good placement has been made onto the hardware, most communication should be local and only a limited amount of communication will traverse long distances. Therefore, the wiring is designed to fit this model, with a greater number of local routing wires in a cluster than distance routing wires between clusters.

Because routing can occupy a large part of the area of a reconfigurable device, the type of routing used must be carefully considered. If the wires available are much longer than what is required to route a signal, the excess wire length is wasted. On the other hand, if the wires available are much shorter than necessary, the signal must pass through switchboxes that connect the short wires together into a longer wire, or through levels of the routing hierarchy. This induces additional delay and slows the overall operation of the circuit. Furthermore, the switchbox circuitry occupies area that might be better used for additional logic or wires.

Fig. 9. A traditional two-dimensional island-style routing structure (left) and a one-dimensional routing structure (right). The white boxes represent logic elements.

There are a few alternatives to the island-style of routing resources. Systems such as RaPiD [Ebeling et al. 1996] use segmented bus-based routing, where signals are full word-sized in width. This is most common in the one-dimensional type of architecture, as discussed in the next section.

3.6. One-Dimensional Structures

Most current FPGAs are of the two-dimensional variety, as shown in Figure 9. This allows for a great deal of flexibility, as any signal can be routed on a nearly arbitrary path. However, providing this level of routing flexibility requires a great deal of routing area. It also complicates the placement and routing software, as the software must consider a very large number of possibilities.

One solution is to use a more one-dimensional style of architecture, also depicted in Figure 9. Here, placement is restricted along one axis. With a more limited set of choices, the placement can be performed much more quickly. Routing is also simplified, because it is generally along a single dimension as well, with the other dimension generally only used for calculations requiring a shift operation. One drawback of the one-dimensional routing is that if there are not enough
routing resources in a particular area of a mapped circuit, routing that circuit becomes actually more difficult than on a two-dimensional array that provides more alternatives.

A number of different reconfigurable systems have been designed in this manner. Both Garp [Hauser and Wawrzynek 1997] and Chimaera [Hauck et al. 1997] are structures that provide cells that compute a small number of bit positions, and a row of these cells together computes the full data word. A row can only be used by a single configuration, making these designs one-dimensional. In this manner, each configuration occupies some number of complete rows. Although multiple narrow-width computations can fit within a single row, these structures are optimized for word-based computations that occupy the entire row. The NAPA architecture [Rupp et al. 1998] is similar, with a full column of cells acting as the atomic unit for a configuration, as is PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].

In some systems, the computation blocks in a one-dimensional structure operate on word-width values instead of single bits. Therefore, busses are routed instead of individual values. This also decreases the time required for routing, as the bits of a bus can be considered together rather than as separate routes. As shown previously in Figure 7, RaPiD [Ebeling et al. 1996] is basically a one-dimensional design that only includes word-width processing elements. The different computation units are organized in a single dimension along the horizontal
axis. The general flow of information follows this layout, with the major routing busses also laid out in a horizontal manner. Additionally, all routing is of word-sized values, and therefore all routing is of busses, not individual wires. A few vertical resources are included in the architecture to allow signals to transfer between busses, or to travel from a bus to a computation node. However, the majority of the routing in this architecture is one-dimensional.

3.7. Multi-FPGA Systems

Reconfigurable systems that are composed of multiple FPGA chips interconnected on a single processing board have additional hardware concerns over single-chip systems. In particular, there is a need for an efficient connection scheme between the chips, as well as to external memory and the system bus. This is to provide for circuits that are too large to fit within a single FPGA, but may be partitioned over the multiple FPGAs available. A number of different interconnection schemes have been explored [Butts and Batcheller 1991; Hauck et al. 1998a; Hauck 1998; Khalid 1999] including meshes and crossbars, as shown in Figure 10.

Fig. 10. Mesh (left) and partial crossbar (right) interconnect topologies for multi-FPGA systems.

A mesh connects the nearest-neighbors in the array of FPGA chips. This allows for efficient communication between the neighbors, but may require that some signals pass through an FPGA simply to create a connection between non-neighbors. Although this is possible, it uses valuable I/O resources on the FPGA that forms the routing bridge.

One system that uses a mesh topology with additional board-level column and row busses is the P1 system developed within the PAM project [Vuillemin et al. 1996]. This architecture uses a central array of 16 commercial FPGAs with connections to nearest-neighbors. However, four 16-bit row busses and four 16-bit column busses run the length of the array and facilitate communication between non-neighbor FPGAs.

A crossbar attempts to remove this problem by using special routing-only chips to connect each FPGA potentially to any other FPGA. The inter-chip delays are more uniform, given that a signal travels the exact same "distance" to get from one FPGA to another, regardless of where those FPGAs are located. However, a crossbar interconnect does not scale easily with an increase in the number of FPGAs. The crossbar pattern of the chips is fixed at fabrication of the multi-FPGA board.

Variants on these two basic topologies attempt to remove some of the problems encountered in mesh and crossbar topologies [Arnold et al. 1992; Varghese et al. 1993; Buell et al. 1996; Vuillemin et al. 1996; Lewis et al. 1997; Khalid and Rose 1998]. One of these variants can be found in the Splash 2 system [Arnold et al. 1992; Buell et al. 1996]. The predecessor, Splash 1, used a linear systolic communication method. This type of connection was found to work quite well for a variety of applications. However, this highly constrained communication model made some types of computations difficult or even impossible. Therefore, Splash 2 was designed to include not only the linear connections of Splash 1 that were found to be useful for many applications, but also a crossbar network to allow any FPGA to communicate with any other FPGA on the same board.

For multi-FPGA systems, because of the need for efficient communication between the FPGAs, determining the inter-chip routing topology is a very important step in the design process. More details on multi-FPGA system architectures can be found elsewhere [Hauck 1998b; Khalid 1999].

3.8. Hardware Summary

The design of reconfigurable hardware varies wildly from system to system. The reconfigurable logic may be used as a configurable functional unit, or may be a multi-FPGA stand-alone unit. Within the reconfigurable logic itself, the complexity of the core computational units, or logic blocks, varies from very simple to extremely complex, some implementing a 4-bit ALU or even a 16 × 16 multiplication. These blocks are not required to be uniform throughout the array, as the use of different types of blocks can add high-performance functionality in the case of specialized computation circuitry, or expanded storage in the case of embedded memory blocks. Routing resources also offer a variety of choices, primarily in amount, length, and organization of the wires. Systems have been developed that fit into many different points within this design space, and no true "best" system has yet been agreed upon.

4. SOFTWARE

Although reconfigurable hardware has been shown to have significant performance benefits for some applications, it may be ignored by application programmers unless they are able to easily incorporate its use into their systems. This requires a software design environment that aids in the creation of configurations for the reconfigurable hardware. This software can range from a software assist in manual circuit creation to a complete automated circuit design system. Manual circuit description is a powerful method for the creation of high-quality circuit designs. However, it requires a great deal of background knowledge of the particular reconfigurable system employed, as well as a significant amount of design time. On the other end of the spectrum, an automatic compilation system provides a quick and easy way to program for reconfigurable systems. It therefore makes the use of reconfigurable hardware more accessible to general application programmers, but quality may suffer.

Fig. 11. Three possible design flows for algorithm implementation on a reconfigurable system. Grey stages indicate manual effort on the part of the designer, while white stages are done automatically. The dotted lines represent paths to improve the resulting circuit. It should be noted that the middle design cycle is only one of the possible compromises between automatic and manual design.

For both manual and automatic circuit creation, the design process proceeds through a number of distinct phases, as indicated in Figure 11. Circuit specification is the process of describing the functions that are to be placed on the reconfigurable hardware. This can be done as simply as by writing a program in C that represents the functionality of the algorithm to be implemented in hardware. On the other hand, this can also be as complex as specifying the inputs, outputs, and operation of each basic building block in the reconfigurable system. Between these two methods is the specification of the circuit using generic complex components, such as adders and multipliers, which will be mapped to the actual hardware later in the design process. For descriptions in a high-level language (HLL), such as C/C++ or Java, or ones using complex building blocks, this code must be compiled into a netlist of gate-level components. For the HLL implementations, this involves
generating computational components to perform the arithmetic and logic operations within the program, and separate structures to handle the program control, such as loop iterations and branching operations. Given a structural description, either generated from an HLL or specified by the user, each complex structure is replaced with a network of the basic gates that perform that function.

Once a detailed gate- or element-level description of the circuit has been created, these structures must be translated to the actual logic elements of the reconfigurable hardware. This stage is known as technology mapping, and is dependent upon the exact target architecture. For a LUT-based architecture, this stage partitions the circuit into a number of small subfunctions, each of which can be mapped to a single LUT [Brown et al. 1992a; Abouzeid et al. 1993; Sangiovanni-Vincentelli et al. 1993; Hwang et al. 1994; Chang et al. 1996; Hauck and Agarwal 1996; Yi and Jhon 1996; Chowdhary and Hayes 1997; Lin et al. 1997; Cong and Wu 1998; Pan and Lin 1998; Togawa et al. 1998; Cong et al. 1999]. Some architectures, such as the Xilinx 4000 series [Xilinx 1994], contain multiple LUTs per logic cell. These LUTs can be used either separately to generate small functions, or together to generate some wider-input functions [Inuani and Saul 1997; Cong and Hwang 1998]. By taking advantage of multiple LUTs and the internal routing within a single logic cell, functions with more inputs than can be implemented using a single LUT can efficiently be mapped into the FPGA architecture. Figure 12 shows one example of a wide function mapped to a multi-LUT FPGA logic cell.

Fig. 12. A wide function implemented with multiple LUTs.

For reconfigurable structures that include embedded memory blocks, the mapping stage may also consider using these memories as logic units when they are not being used for data storage. The memories act as very large LUTs, where the number of inputs is equal to the number of address lines. In order to use these memories as logic, the mapping software must analyze how much of the memory blocks is actually used as storage in a given mapping. It must then determine which are available in order to implement logic, and what part or parts of the circuit are best mapped to the memory [Cong and Xu 1998; Wilton].

After the circuit has been mapped, the resulting blocks must be placed onto the reconfigurable hardware. Each of these blocks is assigned to a specific location within the hardware, hopefully close to the other logic blocks with which it communicates. As FPGA capacities increase, the placement phase of circuit mapping becomes more and more time consuming. Floorplanning is a technique that can be used to alleviate some of this cost. A floorplanning algorithm first partitions the logic cells into clusters, where cells with a large amount of communication are grouped together. These clusters are then placed as units onto regions of the reconfigurable hardware. Once this global placement is complete, the actual placement algorithm performs detailed placement of the individual logic blocks within the boundaries assigned to the cluster [Sankar and Rose 1999].

The use of a floorplanning tool is particularly helpful for situations where the circuit structure being mapped is of a datapath type. Large computational components or macros that are found in datapath circuits are frequently composed of highly regular logic. These structures are placed as entire units, and their component cells are restricted to the floorplanned location [Shi and Bhatia 1997; Emmert and Bhatia 1999]. This encourages the placer to find a very regular placement of these logic cells, resulting in a higher performance layout of the circuit. Another technique for the mapping and placement of datapath elements is to perform both of these steps simultaneously [Callahan et al. 1998].
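The placement goal stated above, putting each block close to the blocks it communicates with, is usually expressed as a cost that a placer tries to minimize. The sketch below uses total Manhattan wire length over two-point connections as that cost. This metric, the grid coordinates, and the names `wire_length`, `nets`, `scattered`, and `clustered` are illustrative assumptions; they are not taken from the survey or from any particular placement tool.

```python
def wire_length(placement, nets):
    """Total Manhattan distance over all two-point connections.

    placement maps a block name to its (row, column) location;
    nets is a list of (source_block, sink_block) pairs.
    """
    total = 0
    for src, dst in nets:
        (r1, c1), (r2, c2) = placement[src], placement[dst]
        total += abs(r1 - r2) + abs(c1 - c2)
    return total


# Four blocks forming a chain a -> b -> c -> d on a 2 x 2 array.
nets = [("a", "b"), ("b", "c"), ("c", "d")]

# Placement that ignores communication between blocks.
scattered = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
# Placement in which every connected pair sits in adjacent locations.
clustered = {"a": (0, 0), "b": (0, 1), "c": (1, 1), "d": (1, 0)}
```

With these inputs, `wire_length(scattered, nets)` is 5 and `wire_length(clustered, nets)` is 3, so a placer driven by this cost would prefer the clustered arrangement.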
Performing mapping and placement simultaneously also exploits the regularity of the datapath elements to generate mappings and placements quickly and efficiently.

Floorplanning is also important when dealing with hierarchically structured reconfigurable designs. In these architectures, the available resources have been grouped by the logic or routing hierarchy of the hardware. Because performance is best when routing lengths are minimized, the cells to be placed should be grouped such that cells that require a great deal of communication or which are on a critical path are placed together within a logic cluster on the hardware [Krupnova et al. 1997; Senouci et al. 1998].

After floorplanning, the individual logic blocks are placed into specific logic cells. One algorithm that is commonly used is the simulated annealing technique [Shahookar and Mazumder 1991; Betz and Rose 1997; Sankar and Rose 1999]. This method takes an initial placement of the system, which can be generated (pseudo-)randomly, and performs a series of "moves" on that layout. A move is simply the changing of the location of a single logic cell, or the exchanging of locations of two logic cells. These moves are attempted one at a time using random target locations. If a move improves the layout, then the layout is changed to reflect that move. If a move is considered to be undesirable, then it is only accepted a small percentage of the time. Accepting a few "bad" moves helps to avoid any local minima in the placement space. Other algorithms exist that rely less on random movements [Gehring and Ludwig 1996], although such an approach searches a smaller area of the placement space for a solution, and therefore may be unable to find a solution which meets performance requirements if a design uses a high percentage of the reconfigurable resources.

Finally, the different reconfigurable components comprising the application circuit are connected during the routing stage. Particular signals are assigned to specific portions of the routing resources of the reconfigurable hardware. This can become difficult if the placement causes many connected components to be placed far from one another, as the signals that travel long distances use more routing resources than those that travel shorter ones. A good placement is therefore essential to the routing process. One of the challenges in routing for FPGAs and reconfigurable systems is that the available routing resources are limited. In general hardware design, the goal is to minimize the number of routing tracks used in a channel between rows of computation units, but the channels can be made as wide as necessary. In reconfigurable systems, however, the number of available routing tracks is determined at fabrication time, and therefore the routing software must perform within these boundaries. Thus, FPGA routing concentrates on minimizing congestion within the available tracks [Brown et al. 1992b; McMurchie and Ebeling 1995; Alexander and Robins 1996; Chan and Schlag 1997; Lee and Wu 1997; Thakur et al. 1997; Wu and Marek-Sadowska 1997; Swartz et al. 1998; Nam et al. 1999]. Because routing is one of the more time-intensive portions of the design cycle, it can be helpful to determine if a placed circuit can be routed before actually performing the routing step. This quickly informs the designer if changes need to be made to the layout or a larger reconfigurable structure is required [Wood and Rutenbar 1997; Swartz et al. 1998].

Each of the design phases mentioned above may be implemented either manually or automatically using compiler tools. The operation of some of these individual steps is described in greater depth in the following sections.

4.1. Hardware-Software Partitioning

For systems that include both reconfigurable hardware and a traditional microprocessor, the program must first be partitioned into sections to be executed on the reconfigurable hardware and sections to be executed in software on the microprocessor. In general, complex control sequences such as variable-length loops are more efficiently implemented in software,
while fixed datapath operations may be more efficiently executed in hardware.

Most compilers presented for reconfigurable systems generate only the hardware configuration for the system, rather than both hardware and software. In some cases, this is because the reconfigurable hardware may not be coupled with a host processor, so only a hardware configuration is necessary. For cases where reconfigurable hardware does operate alongside a host microprocessor, some systems currently require that the hardware compilation be performed separately from the software compilation, and special functions are called from within the software in order to configure and control the reconfigurable hardware. However, this requires effort on the part of the designer to identify the sections that should be mapped to hardware, and to translate these into special hardware functions. In order to make the use of the reconfigurable hardware transparent to the designer, the partitioning and programming of the hardware should occur simultaneously in a single programming environment.

For compilers that manage both the hardware and software aspects of application design, the hardware/software partitioning can be performed either manually, or automatically by the compiler itself. When the partitioning is performed by the programmer, compiler directives are used to mark sections of program code for hardware compilation. The NAPA C language [Gokhale and Stone 1998] provides statements to allow a programmer to specify whether a section of code is to be executed in software on the Fixed Instruction Processor (FIP), or in hardware on the Adaptive Logic Processor (ALP). Cardoso and Neto [1999] present another compiler that requires the user to specify (using information gained through the use of profiling tools) which areas of code to map to the reconfigurable hardware.

Alternately, the hardware/software partitioning can be done automatically [Chichkov and Almeida 1997; Kress et al. 1997; Callahan et al. 2000; Li et al. 2000a]. In this case, the compiler will use cost functions based upon the amount of acceleration gained through the execution of a code fragment in hardware to determine whether the cost of configuration is overcome by the benefits of hardware execution.

4.2. Circuit Specification

In order to use the reconfigurable
hardware, designers must somehow be able to specify the operation of their custom circuits. Before high-level compilation tools are developed for a specific reconfigurable system, this is done through hand mapping of the circuit, where the designer specifies the operation of the components in the configurable system directly. Here, the designers utilize the basic building blocks of the reconfigurable system to create the desired circuit. This style of circuit specification is primarily useful only when a software front-end for circuit design is unavailable, or for the design of small circuits or circuits with very high performance requirements. This is due to the great amount of time involved in manual circuit creation. However, for circuits that can be reasonably hand mapped, this provides potentially the smallest and fastest implementation.

Because not all designers can be intimately familiar with every reconfigurable architecture, some design tools abstract the specifics of the target architecture. Creating a circuit using a structural design language involves describing a circuit using building blocks such as gates, flip-flops and latches [Bellows and Hutchings 1998; Gehring and Ludwig 1998; Hutchings et al. 1999]. The compiler then maps these modules to one or more basic components of the architecture of the reconfigurable system. Structural VHDL is one example of this type of programming, and commercial tools are available for compiling from this language into vendor-specific FPGAs [Synplicity].

However, these two methods require that the designer possess either an intimate knowledge of the targeted reconfigurable hardware, or at least a working knowledge of the concepts involved
in hardware design. In order to allow a greater number of software developers to take advantage of reconfigurable computing, tools that allow for behavioral circuit descriptions are being developed. These systems trade some area and performance quality for greater flexibility and ease of use.

Behavioral circuit design is similar to software design because the designer indicates the steps a hardware subsystem must go through in order to perform the desired computation rather than the actual composition of the circuit. These behavioral descriptions can be either in a generic hardware description language such as VHDL or Verilog, or a general-purpose high-level language such as C/C++ or Java. The eventual goal of this type of compilation is to allow users to write programs in commonly used languages that compile equally well, without modification, to both a traditional software executable and to an executable which leverages reconfigurable hardware.

Working towards this direction, Transmogrifier C [Galloway 1995] allows a subset of the C language to be used to describe hardware circuits. While multiplication, division, pointers, arrays, and a few other C language specifics are not supported, this system provides a behavioral method of circuit description using a primitive form of the C language. Similarly, the C programming environment used for the P1 system [Vuillemin et al. 1996] provides a hybrid method of description, using a combination of behavioral and structural design. Synopsys' CoCentric compiler [Synopsys 2000], which can be targeted to the Xilinx Virtex series of FPGA, uses SystemC to provide for behavioral compilation of C/C++ with the assistance of a set of additional hardware-defining classes. Other compilers, such as Nimble [Li et al. 2000a] and the Garp compiler [Callahan et al. 2000], are fully behavioral C compilers, handling the full set of the ANSI C language.

Although behavioral description, and HLL description in particular, provides a convenient method for the programming of reconfigurable systems, it does suffer from the drawback that it tends to produce larger and slower designs than those generated by a structural description or hand-mapping. Behavioral descriptions can leave many aspects of the circuit unspecified. For example, a compiler that encounters a loop must
generate complicated control structures in order to allow for an unspecified number of iterations. Also, in many HLL implementations, optimizations based upon the bit width of operands cannot be performed. The compiler is generally unaware of any application-specific limitations on the operand size; it only sees the programmer's choice of data format in the program. Problems such as these might be solved through additional programmer effort to replace variable-length loops with fixed-length loops whenever possible, and to use compiler directives to indicate exact sizes of operands [Galloway 1995; Gokhale and Stone 1998]. This method of hardware design falls between structural description and behavioral description in complexity, because although the programmers do not need to know a great deal about hardware design, they are required to follow additional guidelines that are not required for software-only implementations.

4.3. Circuit Libraries

The use of circuit or macro libraries can greatly simplify and speed the design process. By predesigning commonly used structures such as adders, multipliers, and counters, circuit creation for configurable systems becomes largely the assembly of high-level components, and only application-specific structures require detailed design. The actual architecture of the reconfigurable device can be abstracted, provided only library components are used, as these low-level details will already have been encapsulated within the library structures. Although the users of the circuit library may not know the intricacies of the destination architecture, they are still able to make use of architecture-specific optimizations, such as specialized carry
chains. This is because designers very familiar with the details of the target architecture create the components within a circuit library. They can take advantage of architecture specifics when creating the modules to make these components faster and smaller than a designer unfamiliar with the architecture likely would. An added benefit of the architecture abstraction is that the use of library components can also facilitate design migration from one architecture to another, because designers are not required to learn a new architecture, but only to indicate the new target for the library components. However, this does require that a circuit library contain implementations for more than one architecture.

One method for using library components is to simply instantiate them within an HDL design [Xilinx 1997; Altera 1999]. However, circuit libraries can also be used in general language compilers by comparing the dataflow graph of the application to the dataflow graphs of the library macros [Cadambi and Goldstein 1999]. If a dataflow representation of a macro matches a portion of the application graph, the corresponding macro is used for that portion.

Another benefit of circuit design with library macros is that of fast compilation. Because the library structures may have been premapped, preplaced, and prerouted (at least within the macro boundaries), the actual compile time is reduced to the time required to place the library components and route between them. For example, fast configuration was one of the main motivations for the creation of libraries for circuit design in the DISC reconfigurable image processing system [Hutchings 1997].

4.4. Circuit Generators

Circuit generators fulfill a role similar to circuit libraries, in that they provide optimized high-level structures for use within larger applications. Again, designers are not required to understand the low-level details of particular architectures. However, circuit generators create semicustomized high-level structures automatically at compile time, as opposed to circuit libraries that only provide static structures. For example, a circuit generator can create an adder structure of the exact bit width required by the designer, whereas a circuit library is likely to contain a limited number of adder structures, none of which may be of the correct size. Circuit generators are therefore more flexible than circuit libraries because of the customization they allow.

Some circuit generators, such as MacGen [Yasar et al. 1996], are executed at the command line using custom description files to generate physical design layout data files. Newer circuit generators, however, are functions or methods called from high-level language programs. PAM-Blox [Mencer et al. 1998], for example, is a set of circuit generators executed in C that generate structures for use with the PCI Pamette reconfigurable processing board. The circuit generator presented by Chu et al. [1998] contains a number of Java classes to allow a programmer to generate arbitrarily sized arithmetic and logical components for a circuit. Although the examples presented in that paper were mapped to a Xilinx 4000 series FPGA, the generator uses architecture-specific libraries for module generation. The target architecture can therefore be changed through the use of a different design library. The Carry Look-Ahead circuit generator described by Stohmann and Barke [1996] is also retargetable, because it maps to an FPGA logic cell architecture defined by the user.

One drawback of the circuit generators is that they depend on a regular logic and routing structure. Hierarchical routing structures (such as those present in the Xilinx 6200 series [Xilinx 1996]) and specialized heterogeneous logic blocks are frequently not accounted for. Therefore, some optimized features of a particular architecture may be unused. For these cases, a circuit macro from a library may provide a more highly optimized structure than one created with a circuit generator,
provided that the library macro fits the needs of the application.

4.5. Partial Evaluation

Functions that are to be implemented on the reconfigurable array should occupy as little area as possible, so as to maximize the number of functions that can be mapped to the hardware. This, combined with the minimization of the delay incurred by each circuit, increases the overall acceleration of the application. Partial evaluation is the process of reducing hardware requirements for a circuit structure through optimization based upon known static inputs. Specifically, if an input is known to be constant, that value can potentially be propagated through one or more gates in the structure at compile time, and only the portions of a circuit that depend on time-varying inputs need to be mapped to the reconfigurable structure.

One example of the usefulness of this operation is that of constant coefficient multipliers. If one input to a multiplier is constant, a multiplier object can be reduced from a general-purpose multiplier to a set of additions with static-length shifts between them corresponding to the locations of 1s in the binary constant. This type of reduction leads to a lower area requirement for the circuit, and potentially higher performance due to fewer gate delays encountered on the critical path.

Partial evaluation can also be performed in conjunction with circuit generation, where the constants passed to the generator function are used to simplify the created hardware circuit [Wang and Lewis 1997; Chu et al. 1998]. Other examples of this type of optimization for specific algorithms include the partial evaluation of DES encryption circuits [Leonard and Mangione-Smith 1997], and the partial evaluation of constant multipliers and fixed polynomial division circuits [Payne ...]

4.6. Memory Allocation

As with traditional software programs, it may be necessary in reconfigurable computing to allocate memories to hold variables and other data. Off-chip memories may be added to the reconfigurable system. Alternately, if a reconfigurable system includes memory blocks embedded into the reconfigurable logic, these may be used, provided that the storage requirements do not surpass the available embedded memory. If multiple off-chip memories are available to a reconfigurable system, variables used in parallel should be placed into different memory structures, such that they can be accessed simultaneously [Gokhale and Stone 1999]. When smaller embedded memory units are used, larger memories can be created from the smaller ones. However, in this case, it is desirable to ensure that each smaller memory is close to the computation that most requires its contents [Babb et al. 1999]. As mentioned earlier, the small embedded memories that are not allocated for data storage may be used to perform logic [...]

4.7. Parallelization

One of the benefits of reconfigurable computing is the ability to execute multiple operations in parallel. In cases where circuits are specified using a structural hardware description language, the user specifies all structures and timing, and therefore either implicitly or explicitly specifies any parallel operation. However, for behavioral and HLL descriptions, there are two methods to incorporate parallelism: manual parallelization through special instructions or compiler directives, and automatic parallelization by the [...]

To manually incorporate parallelism within an application, the programmer can specifically mark sections of code that should run as parallel threads, and use similar operations to those used in traditional parallel compilers [Cronquist et al. 1998; Gokhale and Stone 1998]. For example, a signal/wait technique can be used to perform synchronization of the different threads of the computation. The RaPiD-B language [Cronquist et al. 1998] is one that uses this methodology.
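The signal/wait technique mentioned above can be illustrated in ordinary software. The following is only a minimal sketch in Python, not RaPiD-B or NAPA C syntax; the names `run_signal_wait`, `producer`, and `consumer` are hypothetical and introduced here purely for illustration.

```python
import threading


def run_signal_wait(values):
    """Synchronize two threads with a signal/wait pair.

    The consumer thread waits until the producer signals that the shared
    buffer is ready, then uses the producer's results.
    """
    buffer = []
    results = []
    ready = threading.Event()  # the object that carries the "signal"

    def producer():
        # First stage of the computation: fill the shared buffer.
        buffer.extend(v * 2 for v in values)
        ready.set()  # signal: the data is now available

    def consumer():
        ready.wait()  # wait: block until the producer has signaled
        # Second stage: consume the first stage's output.
        results.append(sum(buffer))

    # Start the consumer first to show that ordering is enforced by the
    # signal/wait pair rather than by thread start order.
    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]
```

Calling `run_signal_wait([1, 2, 3])` doubles each value in one thread and sums the doubled values in the other; the wait guarantees the sum is never taken over a partially filled buffer.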
Although the NAPA C compiler [Gokhale and Stone 1998] requires programmers to mark the areas of code in which the host processor and the reconfigurable hardware execute in parallel, it also detects and exploits fine-grained parallelism within computations destined for the reconfigurable hardware.

Automatic parallelization of inner loops is another common technique in reconfigurable hardware compilers to attempt to maximize the use of the reconfigurable hardware. The compiler will select the innermost loop level to be completely unrolled for parallel execution in hardware, potentially creating a heavily pipelined structure [Cronquist et al. 1998; Weinhardt and Luk 1999]. For these cases, outer loops may not have multiple iterations executing simultaneously. Any loop reordering to improve the parallelism of the circuit must be done by the programmer. However, some compiler systems have taken this procedure a step further and focus on the parallelization of all loops within the program, not just the inner loops [Wang and Lewis 1997; Budiu and Goldstein 1999]. This type of compiler generates a control flow graph based upon the entire program source code. Loop unrolling is used in order to increase the available parallelism, and the graph is then used to schedule parallel operations in the hardware.

4.8. Multi-FPGA System Software

When reconfigurable systems use more than one FPGA to form the complete reconfigurable hardware, there are additional compilation issues to deal with [Hauck and Agarwal 1996]. The design must first be partitioned into the different FPGA chips [Hauck 1995; Acock and Dimond 1997; Vahid 1997; Brasen and Saucier 1998; Khalid 1999]. This is generally done by placing each highly connected portion of a circuit into a single chip. Multi-FPGA systems have a limited number of I/O pins that connect the chips together, and therefore their use must be minimized in the overall circuit mapping. Also, by minimizing the amount of routing required between the FPGAs, the number of paths with a high (inter-chip) delay is reduced, and the circuit may have an overall higher performance. Similarly, those sections of the circuit that require a short delay time must be placed upon the same chip. Global placement then determines which of the actual FPGAs in the multi-FPGA system will contain each of the partitions.

After the circuit has been partitioned into the different FPGA chips, the connections between the chips must be routed [Mak and Wong 1997; Ejnioui and Ranganathan 1999]. A global routing algorithm determines at a high level the connections between the FPGA chips. It first selects a region of output pins on the source FPGA for a given signal, and determines which (if any) routing switches or additional FPGAs the signal must pass through to get to the destination FPGA. Detailed routing and pin assignment [Slimane-Kade et al. 1994; Hauck and Borriello 1997; Mak and Wong 1997; Ejnioui and Ranganathan 1999] are then used to assign signals to traces on an existing multi-FPGA board, or to create traces for a multi-FPGA board that is to be created specifically to implement the given [...]

Because multi-FPGA systems use inter-chip connections to allow the circuit partitions to communicate, they frequently require a higher proportion of I/O resources vs. logic in each chip than is normally required in single-FPGA use. For this reason, some research has focused on methods to allow pins of the FPGAs to be reused for multiple signals. This procedure is referred to as Virtual Wires [Babb et al. 1993; Agarwal 1995; Selvidge et al. 1995], and allows for a flexible trade-off between logic and I/O within a given multi-FPGA system. Signals are multiplexed onto a single wire by using multiple virtual clock cycles, one per multiplexed signal, within a user clock cycle, thus pipelining the communication. In this manner, the I/O requirements of a circuit can be reduced, while the logic requirements (because of the added circuitry used for the multiplexing) are increased.
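The time-multiplexing behind Virtual Wires can be sketched in a few lines of Python. This is an idealized illustration rather than the published Virtual Wires scheduling algorithm: the function names are hypothetical, and the pin-count formula assumes every physical pin can carry exactly one signal per virtual clock cycle with no scheduling constraints between signals.

```python
import math


def pins_required(num_signals, virtual_cycles):
    """Idealized physical pin count when each pin carries one signal per
    virtual clock cycle (assumption: no dependencies between signals)."""
    return math.ceil(num_signals / virtual_cycles)


def multiplex(signals):
    """Sender side: place one signal value on the shared wire per virtual
    clock cycle, returning the (cycle, value) trace seen on that wire."""
    return [(cycle, value) for cycle, value in enumerate(signals)]


def demultiplex(wire_trace, num_signals):
    """Receiver side: latch each value back into its own signal slot,
    using the virtual cycle number to identify the signal."""
    received = [None] * num_signals
    for cycle, value in wire_trace:
        received[cycle] = value
    return received
```

Under these assumptions, 100 inter-chip signals shared over 8 virtual cycles would need `pins_required(100, 8)` physical pins instead of 100, at the cost of the multiplexing and latching logic modeled by `multiplex` and `demultiplex`.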
4.9. Design Testing

After compilation, an application needs to be tested for correct operation before deployment. For hardware configurations that have been generated from behavioral descriptions, this is similar to the debugging of a software application. However, structurally and manually created circuits must be simulated and debugged with techniques based upon those from the design of general hardware circuits. For these structures, simulation and debugging are critical not only to ensure proper circuit operation, but also to prevent possible incorrect connections from causing a short within the circuit, which can damage the reconfigurable hardware.

There are several different methods of observing the behavior of a configuration during simulation. The contents of memory structures within the design can be viewed, modified, or saved. This allows on-the-fly customization of the simulated execution environment of the reconfigurable hardware, as well as a method for examining the computation results. The input and output values of circuit structures and substructures can also be viewed either on a generated schematic drawing or with a traditional waveform output. By examining these values, the operation of the circuit can be verified for correctness, and conflicts on individual wires can be seen.

A number of simulation and debugging software systems have been developed that use some or all of these techniques [Arnold et al. 1992; Buell et al. 1996; Gehring and Ludwig 1996; Lysaght and Stockwood 1996; Bellows and Hutchings 1998; Hutchings et al. 1999; McKay and Singh 1999; Vasilko and Cabanis 1999].

4.10. Software Summary

Reconfigurable hardware systems require software compilation tools to allow programmers to harness the benefits of reconfigurable computing. On one end of the spectrum, circuits for reconfigurable systems can be designed manually, leveraging all application-specific and architecture-specific optimizations available to generate a high-performance application. However, this requires a great deal of time and effort on the part of the designer. At the opposite end of the spectrum is fully automatic compilation of a high-level language. Using the automatic tools, a software programmer can transparently utilize the reconfigurable hardware without the need for direct intervention. The circuits created using this method, while quickly and easily created, are generally larger and slower than manually created versions. The actual tools available for compilation onto reconfigurable systems fall at various points within this range, where many are partially automated but require some amount of manual aid. Circuit designers for reconfigurable systems therefore face a trade-off between the ease of design and the quality of the final layout.

5. RUN-TIME RECONFIGURATION

Frequently, the areas of a program that can be accelerated through the use of reconfigurable hardware are too numerous or complex to be loaded simultaneously onto the available hardware. For these cases, it is beneficial to be able to swap different configurations in and out of the reconfigurable hardware as they are needed during program execution (Figure 13). This concept is known as run-time reconfiguration (RTR).

Run-time reconfiguration is based upon the concept of virtual hardware, which is similar to virtual memory. Here, the physical hardware is much smaller than the sum of the resources required by each of the configurations. Therefore, instead of reducing the number of configurations that are mapped, we instead swap them in and out of the actual hardware as they are needed. Because run-time reconfiguration allows more sections of an application to be mapped into hardware than can be fit in a non-run-time reconfigurable system, a greater portion of the program can be accelerated. This provides potential for an overall improvement in performance.
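The virtual-hardware analogy can be made concrete with a small simulation. The sketch below is a hypothetical model, not a method taken from the survey: it treats the reconfigurable array as a pool of area units, loads each configuration on first use, and evicts the least recently used configuration when space runs out. The LRU policy, the function name, and the area figures are all assumptions, and physical placement and contiguity are ignored.

```python
from collections import OrderedDict


def count_reconfigurations(trace, sizes, capacity):
    """Count how many times a configuration must be loaded onto hardware
    of the given capacity while executing the given usage trace.

    trace    -- sequence of configuration names in order of use
    sizes    -- mapping from configuration name to area required
    capacity -- total area available on the (simulated) hardware
    """
    resident = OrderedDict()  # name -> area, least recently used first
    used = 0
    loads = 0
    for name in trace:
        if name in resident:
            resident.move_to_end(name)  # already loaded: mark as recent
            continue
        need = sizes[name]
        if need > capacity:
            raise ValueError(f"configuration {name!r} cannot fit at all")
        # Swap out least recently used configurations until there is room.
        while used + need > capacity:
            _, freed = resident.popitem(last=False)
            used -= freed
        resident[name] = need
        used += need
        loads += 1
    return loads
```

With three configurations of area 2 each and the trace A, B, A, C, B, a hardware capacity of 4 forces four loads (B is evicted to make room for C and then reloaded), whereas a capacity of 6 needs only the three initial loads.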
Fig. 13. Applications which are too large to entirely fit on the reconfigurable hardware can be partitioned into two or more smaller configurations that can occupy the hardware at different times.

During a single program's execution, configurations are swapped in and out of the reconfigurable hardware. Some of these configurations will likely require access to the results of other configurations. Configurations that are active at different periods in time therefore must be provided with a method to communicate with one another. Primarily, this can be done through the use of registers [Ebeling et al. 1996; Cadambi et al. 1998; Rupp et al. 1998; Scalera and Vazquez 1998], the contents of which can remain intact between reconfigurations. This allows one configuration to store a value, and a later configuration to read back that value for use in further computations. An alternative for reconfigurable systems that do not include state-holding devices is to write the result back to registers or memory external to the reconfigurable array, which is then read back by successive configurations [Hauck et al. 1997].

There are a few different configuration memory styles that can be used with reconfigurable systems. A single context device is a serially programmed chip that requires a complete reconfiguration in order to change any of the programming bits. A multicontext device has multiple layers of programming bits, each of which can be active at a different point in time. Devices that can be selectively programmed without a complete reconfiguration are called partially reconfigurable. These different types of configuration memory are described in more detail later. An advantage of the multicontext FPGA over a single context architecture is that it allows for an extremely fast context switch (on the order of nanoseconds), whereas the single context may take milliseconds or more to reprogram. The partially reconfigurable architecture is also more suited to run-time reconfiguration than the single context, because small areas of the array can be modified without requiring that the entire logic array be reprogrammed.

For all of these run-time reconfigurable architectures, there are also a number of compilation issues that are not encountered in systems that only configure at the beginning of an application. For example, run-time reconfigurable systems are able to optimize based on values that are known only at run-time. Furthermore, compilers must consider the run-time reconfigurability when generating the different circuit mappings, not only to be aware of the increase in time-multiplexed capacity, but also to schedule reconfigurations so as to minimize the overhead that they incur. These software issues, as well as an overview of methods to perform fast configuration, will be explored in the sections that follow.

5.1. Reconfigurable Models

Traditional FPGA structures have been single context, only allowing one full-chip configuration to be loaded at a time. However, designers of reconfigurable systems have found this style of configuration to be too limiting or slow to efficiently implement run-time reconfiguration. The following discussion defines the single context device, and further considers newer FPGA designs (multicontext and partially reconfigurable), along with their impact on run-time reconfiguration.

Fig. 14. The different basic models of reconfigurable computing: single context, multicontext, and partially reconfigurable. Each of these designs is shown performing a reconfiguration.

5.1.1. Single Context. Current single context FPGAs are programmed using a serial stream of configuration information. Because only sequential access is supported, any change to a configuration on this type of FPGA requires a complete reprogramming of the entire chip. Although this does simplify the reconfiguration hardware, it does incur a high overhead when only a small part of the configuration memory needs to be changed. Many commercial FPGAs are of this style, including the Xilinx 4000 series [Xilinx 1994], the Altera Flex10K series [Altera 1998], and Lucent's Orca series [Lucent 1998]. This type of FPGA is therefore more suited for applications that can benefit from reconfigurable computing without run-time reconfiguration. A single context FPGA is depicted in Figure 14.

In order to implement run-time reconfiguration onto a single context FPGA, the configurations must be grouped into contexts, and each full context is swapped in and out of the FPGA as needed. Because each of these swap operations involves reconfiguring the entire FPGA, a good partitioning of the configurations between contexts is essential in order to minimize the total reconfiguration delay. If all the configurations used within a certain time period are present in the same context, no reconfiguration will be necessary. However, if a number of successive configurations are each partitioned into different contexts, several reconfigurations will be needed, slowing the operation of the run-time reconfigurable system.

5.1.2. Multicontext. A multicontext FPGA includes multiple memory bits for each programming bit location [DeHon 1996; Trimberger et al. 1997a; Scalera and Vazquez 1998; Chameleon 2000]. These memory bits can be thought of as multiple planes of configuration information, as shown in Figure 14. One plane of configuration information can be active at a given moment, but the device can quickly switch between different planes, or contexts, of already-programmed configurations. In this manner, the multicontext device can be considered a multiplexed set of single context devices, which requires that a context be fully reprogrammed to perform any modification. This system does allow for the background loading of a context, where one plane is active and in execution while an inactive plane is in the process of being programmed. Figure 15 shows a multicontext memory bit, as used in [Trimberger et al. 1997a]. A commercial product that uses this technique is the CS2000 RCP series from Chameleon, Inc. [Chameleon 2000]. This device provides
two separate planes of programming information. At any given time, one of these planes is controlling current execution on the reconfigurable fabric, and the other plane is available for background loading of the next needed configuration.

Fig. 15. A four-bit multicontexted programming bit [Trimberger et al. 1997a]. P0-P3 are the stored programming bits, while C0-C3 are the chip-wide control lines that select the context to program or activate.

Fast switching between contexts makes the grouping of the configurations into contexts slightly less critical, because if a configuration is on a different context than the one that is currently active, it can be activated in a time on the order of nanoseconds, as opposed to milliseconds or longer. However, it is likely that the number of contexts within a given program is larger than the number of contexts available in the hardware. In this case, the partitioning again becomes important to ensure that configurations occurring in close temporal proximity are in a set of contexts that are loaded into the multicontext device at the same time. More aspects involving temporal partitioning for single- and multicontext devices will be discussed in the section on compilers for run-time reconfigurable systems.

5.1.3. Partially Reconfigurable. In some cases, configurations do not occupy the full reconfigurable hardware, or only a part of a configuration requires modification. In both of these situations, a partial reconfiguration of the array is required, rather than the full reconfiguration required by a single- or multicontext device. In a partially reconfigurable FPGA, the underlying programming bit layer operates like a RAM device. Using addresses to specify the target location of the configuration data allows for selective reconfiguration of the array. Frequently, the undisturbed portions of the array may continue execution, allowing the overlap of computation with reconfiguration. This has the benefit of potentially hiding some of the reconfiguration latency.

When configurations do not require the entire area available within the array, a number of different configurations may be loaded into unused areas of the hardware at different times. Since only part of the array is reconfigured at a given point in time, the entire array does not require reprogramming. Additionally, some applications require the updating of only a portion of a mapped circuit, while the rest should remain intact, as shown in Figure 14. For example, in a filtering operation in signal processing, a set of constant values that change slowly over time may be reinitialized to a new value, yet the overall computation in the circuit remains static. Using this selective reconfiguration can greatly reduce the amount of configuration data that must be transferred to the FPGA. Several run-time reconfigurable systems are based upon a partially reconfigurable design, including Chimaera [Hauck et al. 1997], PipeRench [Cadambi et al. 1998; Goldstein et al. 2000], NAPA [Rupp et al. 1998], and the Xilinx 6200 and Virtex FPGAs [Xilinx 1996, 1999].

Unfortunately, since address information must be supplied with configuration data, the total amount of information transferred to the reconfigurable hardware may be greater than what is required with a single context design. This makes a full reconfiguration of the entire array slower than the single context version. However, a partially reconfigurable design is intended for applications in which the size of the configurations is small enough that more than one can fit on the available hardware simultaneously. Plus, as we discuss in subsequent sections, a number of fast configuration methods have been explored for partially reconfigurable systems in order to help reduce the configuration data traffic requirements.

5.1.4. Pipeline Reconfigurable. A modification of the partially reconfigurable FPGA design is one in which the partial
reconfiguration occurs in increments of pipeline stages. This style of reconfigurable hardware is called pipeline reconfigurable, or sometimes a striped FPGA [Luk et al. 1997b; Cadambi et al. 1998; Deshpande and Somani 1999; Goldstein et al. 2000]. Each stage is configured as a whole. This is primarily used in datapath-style computations, where more pipeline stages are used than can fit simultaneously on available hardware. Figure 16 shows an example of a pipeline reconfigurable array implementing more pipeline stages than can fit on the available hardware.

Fig. 16. A timeline of the configuration and reconfiguration of pipeline stages on a pipeline reconfigurable FPGA. This example shows three physical pipeline stages implementing five virtual pipeline stages [Cadambi et al. 1998].

In a pipeline-reconfigurable FPGA, there are two primary execution possibilities. Either the number of hardware pipeline stages available is greater than or equal to the number of pipeline stages of the designed circuit (virtual pipeline stages), or the number of virtual pipeline stages will exceed the number of hardware pipeline stages. The first case is straightforward: the circuit is simply mapped to the array, and some hardware stages may go unused. The second case is more complex and is the one that requires run-time reconfiguration. The pipeline stages are configured one by one, from the start of the pipeline, through the end of the available hardware stages (steps 1, 2, and 3 in Figure 16). After each stage is programmed, it begins computation. In this manner, the configuration of a stage is exactly one step ahead of the flow of data. Once the hardware pipeline has been completely filled, reuse of the hardware pipeline stages begins. Configuration of the next virtual stage begins at the first pipeline location in the hardware (step 4), overwriting the first virtual pipeline stage. The reconfiguration of the hardware pipeline stages continues until the last virtual pipeline stage has been programmed (step 7), at which point the first stage of the virtual pipeline is again configured onto the hardware for the next data set. These structures also allow for the overlap of configuration and execution, as one pipeline stage is configured while the others are executing. Therefore, N - 1 data values are processed each time the virtual pipeline is fully traversed on an N-stage hardware system.

5.2. Run-Time Partial Evaluation

One of the advantages that a run-time reconfigurable device has over a system that is only programmed at the beginning of an application is the ability to perform hardware optimizations based upon values determined at run-time. Partial evaluation was already discussed in this article in reference to compilation optimizations for general reconfigurable systems. Run-time partial evaluation allows for the further exploitation of "constants" because the configurations can be modified based not only on completely static values, but also those that change slowly over time [Burns et al. 1997; Luk et al. 1997a; Payne 1997; Wirthlin and Hutchings 1997; Chu et al. 1998; McKay and Singh 1999]. This gives reconfigurable circuits the potential to achieve an even higher performance than an ASIC, which must retain generality in these situations. The circuit in the reconfigurable system can be customized to the application at a given time, rather than to the application as a category. For example, where an ASIC may have to include a generic multiplier, a reconfigurable system could instantiate a constant coefficient multiplier that changes over time. Additionally, partial evaluation can be used in encryption systems [Leonard and Mangione-Smith 1997]. A key-specific reconfigurable encrypter or decrypter is optimized for the particular key being used, but retains the ability to use more than one key over the lifetime of the hardware (unlike a key-specialized ASIC) or during actual run-time.

Although partial evaluation can be used to reduce the overall area requirements of a circuit by removing potentially extraneous hardware within the implementation, occasionally it is preferable to reserve sufficient area for the largest case, and have all mappings occupy that area. This allows the partially evaluated portion of a given configuration to be reconfigured, while leaving the remainder of the circuit intact. For example, if a constant coefficient multiplier within a larger configuration requires that the constant be changed, only the area occupied by the multiplier requires reconfiguration. This is true even if the new constant coefficient multiplier is a larger structure than the previous one
, because the reserved area for it is based upon the largest possibility [McKay and Singh 1999]. Although partial evaluation does not minimize the area occupied by the circuit in this case, the speed of configuration is improved by making the multiplier a modular replaceable component. Additionally, this method retains the speed benefits of partial reconfiguration because it still minimizes the logic and routing actually used to implement the structure.

5.3. Compilation and Configuration Scheduling

For some reconfigurable systems, a configuration requires programming the reconfigurable hardware only at the start of its execution. On the other hand, in a run-time reconfigurable system, the circuits loaded on the hardware change over time. If the user must specify by hand the loading and execution of the circuits in the reconfigurable hardware, then the compilers must include methods to indicate these operations. JHDL [Bellows and Hutchings 1998; Hutchings et al. 1999] is one such compiler. It provides for the instantiation of configurations through the use of Java constructors, and the removal of the circuits from the hardware by using a destructor on the circuit objects. This allows the programmer to indicate exactly the loading pattern of the configurations.

Alternately, the compiler can automate the use of the run-time reconfigurable hardware. For a single context or multicontext device, configurations must be temporally partitioned into a number of different full contexts of configuration information. This involves determining which configurations are likely to be used near in time to one another, and which configurations are able to fit together onto the reconfigurable hardware. Ideally, the number of reconfigurations that are to be performed is minimized. By reducing the number of reconfigurations, the proportion of time spent in reconfiguration (compared to the time spent in useful computation) is reduced.

The problem of forming and scheduling single- and multiconfiguration contexts for use in single context or multicontext FPGA designs has been discussed by a number of groups [Chang and Marek-Sadowska 1998; Trimberger 1998; Liu and Wong 1999; Purna and Bhatia 1999; Li et al. 2000a]. In particular, a single circuit that is too large to fit within the reconfigurable hardware may be partitioned over time to form a sequential set of configurations. This involves examining the control flow graph of the circuit and dividing the circuit into distinct computation nodes. The nodes can then be grouped together within contexts, based upon their proximity to one another within the control flow graph. If possible, those configurations that are used in quick succession will be placed within the same group. These groups are finally mapped into full contexts, to be loaded into the reconfigurable hardware at run-time.

Nimble [Li et al. 2000a] is one of the compilers that perform this type of operation. This compiler focuses on mapping core loops within C code to reconfigurable hardware. Hardware models for the candidate loops that will fit within the reconfigurable hardware are first extracted from the C application. Then these loops are grouped into individual configurations using a partitioning method in order to encourage the hardware loops that are used in close temporal proximity to be mapped to the same configuration, reducing configuration overhead.

For partially reconfigurable designs, the compiler must determine a good placement in order to prevent configurations that are used together in close temporal proximity from occupying the same resources. Again, through minimizing the number of reconfigurations, the overall performance of the system is increased, as configuration is a slow process [Li et al. 2000b]. An alternative approach, which allows the final placement of a configuration to be determined at run-time, is also discussed within the Fast Configuration section of this article.

5.4. Fast Configuration

Because run-time reconfigurable systems involve reconfiguration during program execution, the reconfiguration must be done as efficiently and as quickly as possible. This is in order to ensure that the overhead of the reconfiguration does not eclipse the benefit gained by hardware acceleration. Stalling execution of either the host processor or the reconfigurable hardware because of configuration is clearly undesirable. In the DISC II system, from 25% [Wirthlin and Hutchings 1996] to 71% [Wirthlin and Hutchings 1995] of execution time is spent in reconfiguration, while in the UCLA ATR work this figure can rise to over 98.5% [Mangione-Smith 1999]. If the delays caused by reconfiguration are reduced, performance can be greatly increased. Therefore, fast configuration is an important area of research for run-time reconfigurable systems.

There are a number of different tactics for reducing the configuration overhead. First, loading of the configurations can be timed such that the configuration overlaps as much as possible with the execution of instructions by the host processor. Second, compression techniques can be introduced to decrease the amount of configuration data that must be transferred to the system. Third, specialized hardware can be used to adjust the physical location of configurations at run-time based on where the free area on the hardware is located at any given time. Finally, the actual process of transferring the data from the host processor to the reconfigurable hardware can be modified to include a configuration cache, which would provide a faster [...]

5.4.1. Configuration Prefetching. Performance is improved when the actual configuration of the hardware is overlapped with computations performed by the host processor, because programming the reconfigurable hardware requires from milliseconds to seconds to accomplish. Overlapping configuration and execution prevents the host processor from stalling while it is waiting for the configuration to finish, and hides the configuration time from the program execution. Configuration prefetching [Hauck 1998a] attempts to leverage this overlap by determining when to initiate reconfiguration of the hardware in order to maximize overlap with useful computation on the host processor. It also seeks to minimize the chance that a configuration will be prefetched falsely, overwriting the configuration that is actually used next.

5.4.2. Configuration Compression. Unfortunately, there will always be cases in which the configuration overheads cannot be successfully hidden using a prefetching technique. This can occur when a conditional branch occurs immediately before the use of a configuration, potentially making a 100% correct prefetch prediction impossible, or when multiple configurations or contexts must be loaded in quick succession. In these cases, the delay incurred is minimized when the amount
of data transferred from the host processor to the reconfigurable array is minimized. Configuration compression can be used to compact this configuration information [Hauck et al. 1998b; Hauck and Wilson 1999; Li and Hauck 1999; Dandalis and Prasanna 2001].

One form of configuration compression has already been implemented in a commercial system. The Xilinx 6200 series of FPGA [Xilinx 1996] contains wildcarding hardware, which provides a method to program multiple logic cells with a single address and data value. This is accomplished by setting a special register to indicate which of the address bits should behave as "don't-care" values, so that one address resolves to multiple addresses for configuration. For example, suppose two configuration addresses, 00010 and 00110, are both to be programmed with the same value. By setting the wildcard register to 00100, the address value sent is interpreted as 00X10, and both locations are programmed in a single operation using either of the two addresses above. Hauck et al. [1998b] discuss the benefits of this hardware, while Li and Hauck [1999] cover a potential extension to the concept, where "don't care" values in the configuration stream can be used to allow areas with similar but not identical configuration data values to also be programmed simultaneously.

Within partially reconfigurable systems, there is an added potential to effectively compress the amount of data sent to the reconfigurable hardware. A configuration can possibly reuse configuration information already present on the array, such that only the areas differing in configuration values must be reprogrammed. Therefore, configuration time can be reduced through the identification of these common components and the calculation of the incremental configurations that must be loaded [Luk et al. 1997a; Shirazi et al. 1998].

Alternatively, similar operations can be grouped together to form a single configuration that contains extra control circuitry in order to implement the various functions within the group [Kastrup et al. 1999]. By creating larger configurations out of groups of smaller configurations, the configuration overhead of partial reconfiguration is reduced because more operations can be present on chip simultaneously. However, there are some area and execution penalties imposed by this method, creating a trade-off between reduced reconfiguration overhead and faster execution with a smaller area.

5.4.3. Relocation and Defragmentation in Partially Reconfigurable Systems. Partially reconfigurable systems have the advantage over single context systems in that they allow a new configuration to be written to the programmable logic while the configurations not occupying that same area remain intact and available for future use. Because these configurations will not have to be reconfigured onto the array, and because the programming of a single configuration can require the transfer of far less configuration data than the programming of an entire context, a partially reconfigurable system can incur less configuration overhead than a single context FPGA.

However, inefficiencies can arise if two partial configurations are supposed to be located at overlapping physical locations on the FPGA. If these configurations are repeatedly used one after another, they must be swapped in and out of the array each time. This type of conflict could negate much of the benefit achieved by partially reconfigurable systems. A better solution to this problem is to allow the final placement of the configurations to occur at run-time, allowing for run-time relocation of those configurations [Li et al. 2000b; Compton et al. 2002]. Using relocation, a new configuration may be placed onto the reconfigurable array where it will cause minimum conflict with other needed configurations already present on the hardware. A number of different systems support run-time relocation, including Chimaera [Hauck et al. 1997], Garp [Hauser and Wawrzynek 1997], and PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].

Even with relocation, partially reconfigurable hardware can still suffer from some
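
The wildcard addressing described above for the Xilinx 6200 series can be illustrated with a small software model. The following is only a sketch under stated assumptions: the function names `wildcard_targets` and `wildcard_write`, the 5-bit address width, and the dictionary standing in for configuration memory are all hypothetical and are not taken from the XC6200 documentation. It simply enumerates every address that matches once the bits set in the wildcard register are treated as don't-cares.

```python
def wildcard_targets(address: int, wildcard: int, width: int = 5) -> list[int]:
    """Return every address matched when the bits set in `wildcard` are don't-cares.

    Hypothetical model: a 1 bit in `wildcard` marks that address bit as "X".
    """
    # Positions of the don't-care bits, lowest bit first.
    free_bits = [b for b in range(width) if (wildcard >> b) & 1]
    # Clear the don't-care bits so only the fixed bits of the address remain.
    base = address & ~wildcard
    targets = []
    # Enumerate every 0/1 assignment of the don't-care bits.
    for combo in range(1 << len(free_bits)):
        addr = base
        for i, bit in enumerate(free_bits):
            if (combo >> i) & 1:
                addr |= 1 << bit
        targets.append(addr)
    return sorted(targets)


def wildcard_write(memory: dict[int, int], address: int, wildcard: int,
                   value: int, width: int = 5) -> None:
    """Model one configuration write that lands on all matching addresses."""
    for addr in wildcard_targets(address, wildcard, width):
        memory[addr] = value


# Example from the text: wildcard register 00100 turns 00010 into 00X10.
config_memory: dict[int, int] = {}
wildcard_write(config_memory, 0b00010, 0b00100, 0xA5)
```

With the values from the text, sending either 00010 or 00110 with the wildcard register set to 00100 reaches the same two locations, which is why a single write suffices.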
placement conflicts that could be avoided by using an additional hardware optimization. Over time, as a partially reconfigurable device loads and unloads configurations, the location of the unoccupied area on the array is likely to become fragmented, similar to what occurs in memory systems when RAM is allocated and deallocated. There may be enough empty area on the device to hold an incoming configuration, but it may be distributed throughout the array. A configuration normally requires a contiguous region of the chip, so it would have to overwrite a portion of a valid configuration in order to be placed onto the reconfigurable hardware. A system that incorporates the ability to perform defragmentation of the reconfigurable array, however, would be able to consolidate the unused area by moving valid configurations to new locations [Diessel and El Gindy 1997; Compton et al. 2002]. This area can then be used by incoming configurations, potentially without overwriting any of the moved configurations.

5.4.4. Configuration Caching. Because a great deal of the delay caused by configuration is due to the distance between the host processor and the reconfigurable hardware, as well as the reading of the configuration data from a file or main memory, a configuration cache can potentially reduce the costs of reconfiguration [Deshpande et al. 1999; Li et al. 2000b]. By storing the configurations in fast memory near to the reconfigurable array, the data transfer during reconfiguration is accelerated, and the overall time required is reduced. Additionally, a special configuration cache can allow for specialized direct output to the reconfigurable hardware [Compton et al. 2000]. This output can leverage the close proximity of the cache by providing high-bandwidth communications that would facilitate wide parallel loading of the configuration data, further reducing configuration times.

5.5. Potential Problems with RTR

Partial reconfiguration involves selectively programming portions of the reconfigurable array. However, in many architectures, there are some routing resources that traverse long distances, and may traverse areas allocated to different configurations. Care must be taken such that different configurations do not attempt to drive these wires simultaneously, as multiple drivers on a wire can potentially damage the hardware. Therefore, systems such as the Xilinx 6200 [Xilinx 1996] and Chimaera [Hauck et al. 1997] have specially designed routing resources that prevent multiple drivers. LEGO [Chow et al. 1999b] includes an additional control signal preventing conflicts during the span of time between startup and actual programming of the hardware.

An additional difficulty in using run-time reconfigurable systems occurs when the host processor runs multiple threads or processes. These threads or processes may each have their own sets of configurations that are to be mapped to the reconfigurable hardware. Issues such as the correct use of memory protection and virtual memory must be considered during memory accesses by the reconfigurable hardware [Chien and Byun 1999; Jacob and Chow 1999; Jean et al. 1999]. Another problem can occur when one thread or process configures the hardware, which is then reconfigured by a different thread or process. Threads and processes must be prevented from incorrectly calling hardware functions that no longer appear on the reconfigurable hardware. This requires that the state of the reconfigurable hardware be set to "dirty" on a main processor context switch, or re-loaded with the correct configuration context.

Partially reconfigurable systems must also protect against inter-process or inter-thread conflicts within the array. Even if each application has ensured that its own configurations can safely co-exist, a combination of configurations from different applications re-introduces the possibility of inadvertently causing an electrical short within the reconfigurable hardware. This particular issue can be solved through the use of an architecture that does not have "bad" configurations, such as the 6200 series [Xilinx 1996] and
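
The fragmentation and defragmentation behavior discussed in Section 5.4.3 can be made concrete with a deliberately simplified model. This sketch assumes a one-dimensional array in which each configuration occupies a contiguous, non-overlapping range of rows; that model, the `Config` class, and the functions `largest_free_block` and `defragment` are illustrative assumptions and are not the mechanisms of the cited systems. It shows how sliding resident configurations together can turn scattered free rows into one block large enough for an incoming configuration.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    start: int  # first row occupied (hypothetical 1-D row model)
    rows: int   # number of contiguous rows occupied


def largest_free_block(configs: list[Config], total_rows: int) -> int:
    """Size of the largest contiguous run of unoccupied rows."""
    best, cursor = 0, 0
    for c in sorted(configs, key=lambda c: c.start):
        best = max(best, c.start - cursor)  # gap before this configuration
        cursor = c.start + c.rows
    return max(best, total_rows - cursor)   # gap after the last configuration


def defragment(configs: list[Config]) -> list[Config]:
    """Slide configurations toward row 0, preserving their order."""
    cursor = 0
    moved = []
    for c in sorted(configs, key=lambda c: c.start):
        moved.append(Config(c.name, cursor, c.rows))
        cursor += c.rows
    return moved


# Ten-row array with five free rows split into blocks of 2, 2 and 1.
resident = [Config("A", 0, 2), Config("B", 4, 2), Config("C", 8, 1)]
before = largest_free_block(resident, total_rows=10)
packed = defragment(resident)
after = largest_free_block(packed, total_rows=10)
```

In this example a four-row configuration cannot be placed before compaction without overwriting a resident configuration, but it fits afterwards. A real system would also have to account for the cost of moving each configuration, which this sketch ignores.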
Chimaera [Hauck et al. 1997]. The potential for this type of conflict also introduces the possibility of extremely destructive configurations that can destroy the system's underlying hardware.

5.6. Run-Time Reconfiguration Summary

We have discussed how run-time reconfiguration can increase the benefits gained through reconfigurable computing. Different configurations may be used at different phases of a program's execution, customizing the hardware not only for the application, but also for the different stages of the application. Run-time reconfiguration also allows configurations larger than the available reconfigurable hardware to be implemented, as these circuits can be split into several smaller ones that are used in succession. Because of the delays associated with configuration, this style of computing requires that reconfiguration be performed in a very efficient manner. Multicontext and partially reconfigurable FPGAs are both designed to reduce the time required for reconfiguration. Hardware optimizations, such as wildcarding, run-time relocation, and defragmentation, further decrease configuration overhead in a partially reconfigurable design. Software techniques to enable fast configuration, including prefetching and incremental configuration calculation, were also discussed.

6. CONCLUSION

Reconfigurable computing is becoming an important part of research in computer architectures and software systems. By placing the computationally intense portions of an application onto the reconfigurable hardware, that application can be greatly accelerated. This is because reconfigurable computing combines many of the benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible, and can be changed over the lifetime of the system or even the lifetime of the application. Similar to an ASIC, reconfigurable systems provide a method to map circuits into hardware. Reconfigurable systems therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute cycle of traditional microprocessors as well as possibly exploiting a greater degree of parallelism.

Reconfigurable hardware systems come in many forms, from a configurable functional unit integrated directly into a CPU, to a reconfigurable coprocessor coupled with a host microprocessor, to a multi-FPGA stand-alone unit. The level of coupling, granularity of computation structures, and form of routing resources are all key points in the design of reconfigurable systems. The use of heterogeneous structures can also greatly add to the overall performance of the final design.

Compilation tools for reconfigurable systems range from simple tools that aid in the manual design and placement of circuits, to fully automatic design suites that use program code written in a high-level language to generate circuits and the controlling software. The variety of tools available allows designers to choose between manual and automatic circuit creation for any or all of the design steps. Although automatic tools greatly simplify the design process, manual creation is still important for performance-driven applications. Circuit libraries and circuit generators are additional software tools that enable designers to quickly create efficient designs. These tools attempt to aid the designer in gaining the benefits of manual design without entirely sacrificing the ease of automatic circuit creation.

Finally, run-time reconfiguration provides a method to accelerate a greater portion of a given application by allowing the configuration of the hardware to change over time. Apart from the benefits of added capacity through the use of virtual hardware, run-time reconfiguration also allows for circuits to be optimized based on run-time conditions. In this manner, performance of a reconfigurable system can approach or even surpass that of an ASIC.

Reconfigurable computing systems have shown the ability to accelerate program
execution greatly, providing a high-performance alternative to software-only implementations. However, no one hardware design has emerged as the clear pinnacle of reconfigurable design. Although general-purpose FPGA structures have standardized into LUT-based architectures, groups designing hardware for reconfigurable computing are currently also exploring the use of heterogeneous structures and word-width computational elements. Those designing compiler systems face the task of improving automatic design tools to the point where they may achieve mappings comparable to manual design for even high-performance applications. Within both of these research categories lies the additional topic of run-time reconfiguration. While some work has been done in this field as well, research must continue in order to be able to perform faster and more efficient reconfiguration. Further study into each of these topics is necessary in order to harness the full potential of reconfigurable computing.

REFERENCES

, P., BABBA, P., DAULET, M. C., AUCIER G. 1993. Input-driven partitioning methods and application to synthesis on table-lookup-based FPGA's. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 12, 7, 913–925.
, S. J. B., K. R. 1997. Automatic mapping of algorithms onto multiple FPGA-SRAM Modules. Field-Programmable Logic and Applications, W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Lecture Notes in Computer Science, vol. 1304, Springer-Verlag, Berlin, Germany, 255–264.
ADAPTIVE SILICON, INC. 2001. MSA 2500 Programmable Logic Cores. Adaptive Silicon, Inc., Los Gatos, CA.
AGARWAL, A. 1995. VirtualWires: A Technology for Massive Multi-FPGA Systems. Available online at http://www.ikos.com/products/virtual-wires.ps.
GGARWAL, A., D. 1994. Routing architectures for hierarchical field programmable gate arrays. In Proceedings of the IEEE International Conference on Computer Design, 475–478.
, M. J., G. 1996. New performance-driven FPGA routing algorithms. IEEE Trans. CAD Integ. Circ. Syst. 15, 12, 1505–
ALTERA CORPORATION. 1998. Data Book. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 1999. Altera MegaCore. Available online at http://www.altera.com/html/tools/megacore.html. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 2001. Press Release: Altera Unveils First Complete System-on-a-Programmable-Chip Solution at Embedded Systems Conference. Altera Corporation, San Jose, CA.
ANNAPOLIS MICROSYSTEMS, INC. 1998. Wildfire Reference Manual. Annapolis Microsystems, Inc, Annapolis, MD.
, J. M., B, D. A., AVIS, E. G. 1992. Splash 2. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, 316–
, J., R, M., M, C. A., L, W., F M., BARUA, R., , S. 1999. Parallelizing applications into silicon. IEEE Symposium on Field-Programmable Custom Computing Machines, 70–80.
, J., T, R., AGARWAL, A. 1993. Virtual wires: Overcoming pin limitations in FPGA-based logic emulators. In IEEE Workshop on FPGAs for Custom Computing Machines.
, P., B. 1998. JHDL - An HDL for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–184.
, V., J. 1997. VPR: A new packing, placement and routing tool for FPGA research. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 213–222.
, V., J. 1999. FPGA routing architecture: Segmentation and buffering to optimize speed and density. ACM/SIGDA International Symposium on FPGAs, 59–68.
, D. R., AUCIER, G. 1998. Using cone structures for circuit partitioning into FPGA packages. IEEE Trans. CAD Integ. Circ. Syst. 17, 7, 592–600.
, S. D., F, R. J., R, J., Z. G. 1992a. Field-Programmable Gate Arrays, Kluwer Academic Publishers, Boston, MA.
, S., R, J., , Z. G. 1992b. A detailed router for field-programmable gate arrays. IEEE Trans. Comput. Aid. Desi. 11, 5, 620–
, M., S. C. 1999. Fast compilation for pipelined reconfigurable fabrics. ACM/SIGDA International Symposium on FPGAs, 195–205.
, D., A, S. M., , W. J. SPLASH 2 FPGAs in a Custom Computing Machine, IEEE Computer Society Press, Los Alamitos, CA.
, J., D, A., H, J., S, S., , M. 1997. A dynamic reconfiguration run-time system. IEEE Symposium on Field-Programmable Custom Computing Machines, 66–75.
, M. ATCHELLER, J. 1991. Method of using electronically reconfigurable logic circuits. US Patent 5
, S., S. C. 1999. CPR: A configuration profiling tool. IEEE Symposium on Field-Programmable Custom Computing Machines, 104–113.
CADAMBI, S., W, J., G, S. C., S, H., , D. E. 1998. Managing pipeline-reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 55–64.
, T. J., C, P., D, A., WAWRZYNEK J. 1998. Fast Module Mapping and Placement for Datapaths in FPGAs. ACM/SIGDA International Symposium on FPGAs, 123–132.
, T. J., HAUSER, J. R., WAWRZYNEK, J. 2000. The Garp architecture and C compiler. IEEE Comput. 33, 4, 62–69.
, J. M. P. ETO, H. C. 1999. Macro-based hardware compilation of Java codes into a dynamic reconfigurable computing. IEEE Symposium on Field-Programmable Custom Computing Machines, 2–11.
CHAMELEON SYSTEMS, INC. 2000. CS2000 Advance Product Specification. Chameleon Systems, Inc., San Jose, CA.
, P. K., M. D. F. 1997. Acceleration of an FPGA router. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–181.
, D., M. 1998. Partitioning sequential circuits on dynamically reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 161–167.
, S. C., M, M., WANG, T. T. 1996. Technology mapping for TLU FPGA's based on decomposition of binary decision diagrams. IEEE Trans. CAD Integ. Circ. Syst. 15, 10, 1226–1248.
HICHKOV, A. V., C. B. 1997. An hardware/software partitioning algorithm for custom computing machines. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 274–283.
CHIEN, A. A., BYUN, J. H. 1999. Safe and protected execution for the morph/AMRM reconfigurable processor. IEEE Symposium on Field-Programmable Custom Computing Machines.
CHOW, P., S, S. O., R, J., C, K., P G., AHARDJA, I. 1999a. The design of an SRAM-based field-programmable Gate Array - Part I: Architecture. IEEE Trans. VLSI Syst. 7, 2, 191–197.
CHOW, P., S, S. O., R, J., C, K., P G., AHARDJA, I. 1999b. The design of an SRAM-based field-programmable Gate Array - Part II: Circuit Design and Layout. IEEE Trans. VLSI Syst. 7, 3, 321–330.
, A. AYES, J. P. 1997. General modeling and technology-mapping technique for LUT-based FPGAs. ACM/SIGDA International Symposium on FPGAs, 43–49.
, M., WEAVER, N., S, K., D, A., WAWRZYNEK, J. 1998. Object oriented circuit-generators in Java. IEEE Symposium on Field-Programmable Custom Computing Machines.
COMPTON, K., C, J., K, S., HAUCK S. 2000. Configuration relocation and defragmentation for FPGAs, Northwestern University Technical Report, Available online at http://www.ece.nwu.edu/kati/publications.html.
COMPTON, K., L, Z., C, J., K, S., HAUCK S. 2002. Configuration relocation and defragmentation for run-time reconfigurable computing. IEEE Trans. VLSI Syst., to appear.
, J. WANG, Y. Y. 1998. Boolean matching for complex PLBs in LUT-based FPGAs with application to architecture evaluation. ACM/SIGDA International Symposium on FPGAs.
, J., C. 1998. An efficient algorithm for performance-optimal FPGA technology mapping with retiming. IEEE Trans. CAD Integr. Circ. Syst. 17, 9, 738–748.
, J., W, C., , Y. 1999. Cut ranking and pruning enabling a general and efficient FPGA mapping solution. ACM/SIGDA International Symposium on FPGAs, 29–35.
, J., S. 1998. Technology mapping for FPGAs with embedded memory blocks. ACM/SIGDA International Symposium on FPGAs, 179–188.
, D. C., F, P., B, S. G., , C. 1998. Specifying and compiling applications for RaPiD. IEEE Symposium on Field-Programmable Custom Computing Machines, 116–125.
DANDALIS, A., PRASANNA, V. K. 2001. Configuration compression for FPGA-based embedded systems. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 173–182.
, A. 1996. DPGA Utilization and Application. ACM/SIGDA International Symposium on FPGAs, 115–121.
, A. 1999. Balancing interconnect and computation in a reconfigurable computing array (or, why you don't really want 100% LUT utilization). ACM/SIGDA International Symposium on FPGAs, 69–78.
DESHPANDE, D., S, A. K., YAGI, A. 1999. Configuration caching vs data caching for striped FPGAs. ACM/SIGDA International Symposium on FPGAs, 206–214.
DIESSEL, O., EL GINDY, H. 1997. Run-time compaction of FPGA designs. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 131–140.
, A., SOTIRIADES, E., , A. 1998. Architecture and design of GE1, A FCCM for golomb ruler derivation. IEEE Symposium on Field-Programmable Custom Computing Machines, 48–56.
, C., C, D. C., , P. 1996. RaPiD - Reconfigurable pipelined datapath. Lecture Notes in Computer Science 1142 - Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein, M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 126–135.
, A. ANGANATHAN, N. 1999. Multi-terminal net routing for partial crossbar-based multi-FPGA systems. ACM/SIGDA International Symposium on FPGAs, 176–184.
, A. J., C. 2000. An FPGA implementation and performance evaluation of the serpent block cipher. ACM/SIGDA International Symposium on FPGAs, 33–40.
, J. M. HATIA, D. 1999. A methodology for fast FPGA floorplanning. ACM/SIGDA International Symposium on FPGAs, 47–56.
, G., B, B., T, R., , J. 1963. Parallel processing in a restructurable computer system. IEEE Trans. Elect. Comput.
GALLOWAY, D. 1995. The transmogrifier C hardware description language and compiler for FPGAs. IEEE Symposium on FPGAs for Custom Computing Machines, 136–144.
, S., S. 1996. The trianus system and its application to custom computing. Lecture Notes in Computer Science 1142 - Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 176–184.
, S. W., S. H. M. 1998. Fast integrated tools for circuit design with FPGAs. ACM/SIGDA International Symposium on FPGAs, 133–139.
, M. B. TONE, J. M. 1998. NAPA C: Compiling for a hybrid RISC/FPGA architecture. IEEE Symposium on Field-Programmable Custom Computing Machines, 126–135.
, M. B. TONE, J. M. 1999. Automatic allocation of arrays to memories in FPGA processors with multiple memory banks. IEEE Symposium on Field-Programmable Custom Computing Machines, 63–69.
GOLDSTEIN, S. C., S, H., B, M., C S., M, M., TAYLOR, R. 2000. PipeRench: A Reconfigurable Architecture and Compiler, IEEE Computer, vol. 33, No. 4.
, P., B. 1996. Genetic algorithms in software and in hardware - A performance analysis of workstations and custom computing machine implementations. Symposium on FPGAs for Custom Computing Machines, 216–225.
HAUCK, S. 1995. Multi-FPGA systems. Ph.D. dissertation, Univ. Washington, Dept. of C.S.&E.
HAUCK, S. 1998a. Configuration prefetch for single context reconfigurable coprocessors. ACM/SIGDA International Symposium on FPGAs.
HAUCK, S. 1998b. The roles of FPGAs in reprogrammable systems. Proc. IEEE 86, 4, 615–638.
HAUCK, S., AGARWAL A. 1996. Software technologies for reconfigurable systems. Dept. of ECE Technical Report, Northwestern Univ. Available online at http://www.ee.washington.edu/faculty/hauck/publications.html.
HAUCK, S., G. 1997. Pin assignment for multi-FPGA systems. IEEE Trans. Comput. Aid. Desi. Integ. Circ. Syst. 16, 9, 956–964.
HAUCK, S., B, G., , C. 1998a. Mesh routing topologies for multi-FPGA systems. IEEE Trans. VLSI Syst. 6, 3, 400–408.
HAUCK, S., F, T. W., H, M. M., , J. P. 1997. The Chimaera reconfigurable functional unit. IEEE Symposium on Field-Programmable Custom Computing Machines, 87–96.
HAUCK, S., L, Z., CHWABE, E. 1998b. Configuration compression for the Xilinx XC6200 FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 138–146.
HAUCK, S., WILSON, W. D. 1999. Runlength compression techniques for FPGA configurations. Dept. of ECE Technical Report, Northwestern Univ. Available online at http://www.ee.washington.edu/faculty/hauck/publications.
HAUSER, J. R., WAWRZYNEK, J. 1997. Garp: A MIPS processor with a reconfigurable coprocessor. IEEE Symposium on Field-Programmable Custom Computing Machines, 12–21.
AYNES, S. D., P. Y. K. 1998. A reconfigurable multiplier array for video image processing tasks, suitable for embedding in an FPGA structure. IEEE Symposium on Field-Programmable Custom Computing Machines.
, F. EAVER, A. 1999. Hybrid product term and LUT based architectures using embedded memory blocks. ACM/SIGDA International Symposium on FPGAs, 13–16.
UANG, W. J., S, N., , E. J. 2000. A reliable LZ data compressor on reconfigurable coprocessors. IEEE Symposium on Field-Programmable Custom Computing Machines, 249–258.
, L. 2000. A representation for dynamic graphs in reconfigurable hardware and its application to fundamental graph
algorithms. ACM/SIGDA International Symposium on FPGAs, 105–115.
, B. L. 1997. Exploiting reconfigurability through domain-specific systems. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 193–
, B., B, P., HAWKINS, J., H S., N, B., , M. 1999. A CAD suite for high-performance FPGA design. Symposium on Field-Programmable Custom Computing Machines, 12–24.
WANG, T. T., O, R. M., I, M. J., , K. H. 1994. Logic synthesis for field-programmable gate arrays. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 13, 10, 1280–
NUANI, M. K. AUL, J. 1997. Technology mapping of heterogeneous LUT-based FPGAs. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 223–234.
JACOB, J. A., CHOW, P. 1999. Memory interfacing and instruction specification for reconfigurable processors. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 145–
JEAN, J. S. N., T, K., YAVAGAL, V., S, J., R. 1999. Dynamic reconfiguration to support concurrent applications. IEEE Trans. Comput. 48, 6, 591–602.
KASTRUP, B., B, A., , J. 1999. ConCISe: A compiler-driven CPLD-based instruction set accelerator. IEEE Symposium on Field-Programmable Custom Computing Machines, 92–101.
, M. A. S. 1999. Routing architecture and layout synthesis for multi-FPGA systems. Ph.D. dissertation, Dept. of ECE, Univ. Toronto.
, M. A. S., J. 1998. A hybrid complete-graph partial-crossbar routing architecture for multi-FPGA systems. ACM/SIGDA International Symposium on FPGAs, 45–54.
, H. J., W. H. 2000. Factoring large numbers with programmable hardware. ACM/SIGDA International Symposium on FPGAs, 41–48.
, H. S., S, A. K., YAGI, A. 2000. A reconfigurable multi-function computing cache architecture. ACM/SIGDA International Symposium on FPGAs, 85–94.
, R., H, R. W., , U. 1997. An operating system for custom computing machines based on the Xputer paradigm. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 304–313.
KRUPNOVA, H., R, C., AUCIER, G. 1997. Synthesis and floorplanning for large hierarchical FPGAs. ACM/SIGDA International Symposium on FPGAs, 105–111.
, Y. T., P. T. 1997. Hierarchical interconnection structures for field programmable gate arrays. IEEE Trans. VLSI Syst. 5, 2, 186–
AUFER, R., TAYLOR, R. R., , H. 1999. PCI-PipeRench and the SwordAPI: A system for stream-based reconfigurable computing. Symposium on Field-Programmable Custom Computing Machines, 200–208.
, Y. S., A. C. H. 1997. A performance and routability-driven router for FPGA's considering path delays. IEEE Trans. CAD Integ. Circ. Syst. 16, 2, 179–185.
, J., W. H. 1997. A case study of partially evaluated hardware circuits: Key-specific DES. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 151–160.
, K. H., M, K. W., W, W. K., P. H. W. 2000. FPGA Implementation of a microcoded elliptic curve cryptographic processor. IEEE Symposium on Field-Programmable Custom Computing Machines, 68–76.
, D. M., GALLOWAY, D. R., VAN, M., R J., , P. 1997. The Transmogrifier-2: A 1 million gate rapid prototyping system. ACM/SIGDA International Symposium on FPGAs, 53–61.
, Y., C, T., D, E., H, R., K U., TOCKWOOD, J. 2000a. Hardware-software co-design of embedded reconfigurable architectures. Design Automation Conference.
LI, Z., COMPTON, K., HAUCK, S. 2000b. Configuration caching for FPGAs. IEEE Symposium on Field-Programmable Custom Computing Machines, 22–36.
LI, Z., HAUCK, S. 1999. Don't care discovery for FPGA configuration compression. ACM/SIGDA International Symposium on FPGAs, 91–98.
, X., D, E., , A. 1997. Technology mapping of LUT based FPGAs for delay optimisation. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany.
, H., D. F. 1999. Circuit partitioning for dynamically reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs.
LUCENT TECHNOLOGIES, INC. 1998. FPGA Data. Lucent Technologies, Inc., Allentown, PA.
LUK, W., S, N., , P. Y. K. 1997a. Compilation tools for run-time reconfigurable
designs. IEEE Symposium on Field-Programmable Custom Computing Machines, 56–65.
LUK, W., S, N., G, S. R., , P. Y. K. 1997b. Pipeline morphing and virtual pipelines. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 111–
, P. TOCKWOOD, J. 1996. A simulation tool for dynamically reconfigurable field programmable gate arrays. IEEE Trans. VLSI Syst. 4, 3, 381–390.
, W. K., D. F. 1997. Board-level multinet routing for FPGA-based logic. ACM Trans. Des. Automat. Elect. Syst. 2, 2, 151–167.
, W. H. 1999. ATR from UCLA. Personal Commun.
, W. H., H, B., A, D., , A., E, C., H, R., M O., M, J., P, K., P, V. K., PAANENBURG, H. A. E. 1997. Seeking solutions in configurable computing. IEEE Comput., 12, 38–43.
, A., STANSFIELD, T., KOSTARNOV, I., V J., , B. 1999. A reconfigurable arithmetic array for multimedia applications. ACM/SIGDA International Symposium on FPGAs, 135–143.
CKAY, N., S. 1999. Debugging techniques for dynamically reconfigurable hardware. IEEE Symposium on Field-Programmable Custom Computing Machines, 114–122.
, L., C. 1995. Pathfinder: A negotiation-based performance-driven router for FPGAs. ACM/SIGDA International Symposium on FPGAs, 111–117.
, O., M, M., LYNN, M. J. 1998. PAM-blox: High performance FPGA design for adaptive computing. IEEE Symposium on Field-Programmable Custom Computing Machines.
IYAMORI, T. LUKOTUN, K. 1998. A quantitative analysis of reconfigurable coprocessors for multimedia applications. IEEE Symposium on Field-Programmable Custom Computing Machines, 2–11.
, C. A., Y, D., AGARWAL, A. 1998. Exploring optimal cost performance designs for Raw microprocessors. IEEE Symposium on Field-Programmable Custom Computing Machines, 12–27.
, G. J., S, K. A., UTENBAR R. A. 1999. Satisfiability-based layout revisited: detailed routing of complex FPGAs via search-based boolean SAT. ACM/SIGDA International Symposium on FPGAs, 167–
, P., C. C. 1998. A new retiming-based technology mapping algorithm for LUT-based FPGAs. ACM/SIGDA International Symposium on FPGAs, 35–42.
AYNE, R. 1997. Run-time parameterised circuits for the Xilinx XC6200. Lecture Notes in Computer Science 1304 - Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 161–172.
, K. M. G. HATIA, D. 1999. Temporal partitioning and scheduling data flow graphs for reconfigurable computers. IEEE Trans. Comput. 48, 6, 579–590.
QUICKTURN, A CADENCE COMPANY. 1999a. Available online at http://www.quickturn.com/products/systemrealizer.htm. Quickturn, A Cadence Company, San Jose, CA.
QUICKTURN, A CADENCE COMPANY. 1999b. Design Verification System Technology Backgrounder. Available online at http://www.quickturn.com/products/mercury backgro-under.htm. Quickturn, A Cadence Company, San Jose, CA, 1999.
, R., M. D. 1994. A high-performance microarchitecture with hardware-programmable functional units. Symposium on Microarchitecture, 172–180.
, M., B. L. 1997. Automated target recognition on SPLASH 2. Symposium on Field-Programmable Custom Computing Machines, 192–200.
, J., E, A., ANGIOVANNI A. 1993. Architecture of field-programmable gate arrays. Proc. IEEE 81, 7, 1013–1029.
, C. R., L, M., G, T., G E., HOLT, H., A, J. M., , M. 1998. The NAPA adaptive processing architecture. IEEE Symposium on Field-Programmable Custom Computing Machines, 28–37.
ANGIOVANNI, A., E, A., J. 1993. Synthesis methods for field programmable gate arrays. Proc. IEEE 81, 7, 1057–
, Y., J. 1999. Trading quality for compile time: Ultra-fast placement for FPGAs. ACM/SIGDA International Symposium on FPGAs, 157–166.
CALERA, S. M., J. R. 1998. The design and implementation of a context switching FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines.
ELVIDGE, C., AGARWAL, A., D, M., 1995. TIERS: Topology IndependEnt Pipelined Routing and Scheduling for VirtualWire. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
, S. A., A, A., KRUPNOVA, H., AUCIER G. 1998. Timing driven floorplanning on programmable hierarchical targets. ACM/SIGDA International Symposium on FPGAs, 85–92.
, K., P. 1991. VLSI cell placement techniques. ACM Comput. Surv. 23, 2, 145–220.
, J. HATIA, D. 1997. Performance driven floorplanning for FPGA based designs. ACM/SIGDA International Symposium on FPGAs, 112–118.
SHIRAZI, N., L, W., , P. Y. K. 1998. Automating production of run-time reconfigurable designs. IEEE Symposium on Field-Programmable Custom Computing Machines.
, M., B, D., AUCIER, G. 1994. A fast-FPGA prototyping system that uses inexpensive high-performance FPIC. ACM/SIGDA Workshop on Field-Programmable Gate Arrays.
SOTIRIADES, E., D, A., , P. 2000. Hardware-software codesign and parallel implementation of a Golomb ruler derivation engine. IEEE Symposium on Field-Programmable Custom Computing Machines, 227–235.
TOHMANN, J., E. 1996. An universal CLA adder generator for SRAM-based FPGAs. Lecture Notes in Computer Science 1142 - Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 44–54.
WARTZ, J. S., B, V., , J. 1998. A fast routability-driven router for FPGAs. ACM/SIGDA International Symposium on FPGAs.
SYNOPSYS, INC. 2000. CoCentric SystemC Compiler. Synopsys, Inc., Mountain View, CA.
SYNPLICITY, INC. 1999. Synplify User Guide Release. Synplicity, Inc., Sunnyvale, CA.
, A., MIYAZAKI, T., M, T., KATAYAMA, M., AYASHI, K., T, A., I, T., K. 1998. More wires and fewer LUTs: A design methodology for FPGAs. ACM/SIGDA International Symposium on FPGAs, 12–19.
, S., C, Y. W., W, D. F., , S. 1997. Algorithms for an FPGA switch module routing problem with application to global routing. IEEE Trans. CAD Integ. Circ. Syst. 16, 1, 32–46.
OGAWA, N., YANAGISAWA, M., , T. 1998. Maple-OPT: A performance-oriented simultaneous technology mapping, placement, and global routing algorithm for FPGA's. IEEE Trans. CAD Integ. Circ. Syst. 17, 9, 803–818.
, S. 1998. Scheduling designs into a time-multiplexed FPGA. ACM/SIGDA International Symposium on FPGAs, 153–160.
, S., C, D., J, A., , J. 1997a. A time-multiplexed FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 22–28.
, S., D, K., , B. 1997b. Architecture issues and solutions for a high-capacity FPGA. ACM/SIGDA International Symposium on FPGAs, 3–9.
, W., M, K., J, A., HUANG, R., W, N., , T., R, O., G, V., WAWRZYNEK J., , A. 1999. HSRA: High-speed, hierarchical synchronous reconfigurable array. ACM/SIGDA International Symposium on FPGAs, 125–134.
, F. 1997. I/O and performance tradeoffs with the FunctionBus during multi-FPGA partitioning. ACM/SIGDA International Symposium on FPGAs, 27–34.
, J., B, M., ATCHELLER, J. 1993. An efficient logic emulation system. IEEE Trans. VLSI Syst. 1, 2, 171–174.
, M. ABANIS, D. 1999. Improving simulation accuracy in design methodologies for dynamically reconfigurable logic systems. Sympos. Field-Prog. Cust. Comput. Mach.
, J., B, P., R, D., S, M., OUATI, H., OUCARD, P. 1996. Programmable active memories: Reconfigurable systems come of age. IEEE Trans. VLSI Syst. 4, 1, 56–69.
, Q., D. M. 1997. Automated field-programmable compute accelerator design using partial evaluation. IEEE Symposium on Field-Programmable Custom Computing Machines.
, M., W. 1999. Pipeline vectorization for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 52–62.
ILTON, S. J. E. 1998. SMAP: Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays. ACM/SIGDA International Symposium on FPGAs, 171–178.
, M. J., B. L. 1995. A dynamic instruction set computer. IEEE Symposium on FPGAs for Custom Computing Machines, 99–107.
, M. J., B. L. 1996. Sequencing run-time reconfigured hardware with software. ACM/SIGDA International Symposium on FPGAs, 122–128.
, M. J., B. L. 1997. Improving functional density through run-time constant propagation. ACM/SIGDA International Symposium on FPGAs, 86–92.
, R. D., P. 1996. OneChip: An FPGA processor with reconfigurable logic. Symposium on FPGAs for Custom Computing Machines, 126–135.
, R. G. UTENBAR, R. A. 1997. FPGA routing and routability estimation via Boolean satisfiability. ACM/SIGDA International Symposium on FPGAs, 119–125.
, Y. L., M. 1997. Routing for array-type FPGA's. IEEE Trans. CAD Integ. Circ. Syst. 16, 5, 506–518.
XILINX, INC. 1994. The Programmable Logic Data. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1996. XC6200: Advance Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1997. LogiBLOX: Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1999. Virtex 2.5V Field Programmable Gate Arrays: Advance Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 2000. Press Release: IBM and Xilinx Team to Create New Generation of Integrated. Xilinx, Inc., San Jose, CA.
XILINX, INC. 2001. Virtex-II 1.5V Field Programmable Gate Arrays: Advance Product. Xilinx, Inc., San Jose, CA.
, G., D, J., T, Y., STADTLANDER G., , E. 1996. Growable FPGA macro generator. Lecture Notes in Computer Science 1142 - Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 307–
, K., C. S. 1996. A new FPGA technology mapping approach by cluster merging. Lecture Notes in Computer Science 1142 - Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 366–370.
, P., M, M., A, P., , S. 1998. Accelerating Boolean satisfiability with configurable hardware. IEEE Symposium on Field-Programmable Custom Computing Machines, 186–195.

Received May 2000; revised October 2001 and January 2002; accepted February 2002

ACM Computing Surveys, Vol. 34, No. 2, June 2002.