Efficient Multi-Ported Memories for FPGAs

Charles Eric LaForest and J. Gregory Steffan
Department of Electrical and Computer Engineering
University of Toronto
{laforest,steffan}@eecg.toronto.edu

ABSTRACT

Multi-ported memories are challenging to implement with FPGAs since the provided block RAMs typically have only two ports. We present a thorough exploration of the design space of FPGA-based soft multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example, we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implementation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2x smaller.

Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Style - Shared Memory

General Terms
Design, Performance

Keywords
FPGA, memory, multi-port, parallel

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA'10, February 21–23, 2010, Monterey, California, USA. Copyright 2010 ACM 978-1-60558-911-4/10/02...$10.00.

1. INTRODUCTION

As FPGAs continue to increase in transistor density, designers are using them to build larger and more complex systems-on-chip that require frequent sharing, communication, queueing, and synchronization among distributed functional units and compute nodes. For ASIC implementations these mechanisms would often be implemented with multi-ported memories (memories that allow multiple reads and writes to occur simultaneously), since they can avoid serialization and contention. For example, processors normally require a multi-ported register file: more register file ports allow the processor to exploit a greater amount of instruction-level parallelism (ILP), where multiple instructions are being executed at the same time. However, FPGA-based soft processors have so far exploited little ILP, limited mainly to simple instruction pipelines. This is partly due to the fact that multi-ported memories are particularly inefficient to implement using the resources typically provided by FPGAs.

Figure 1: A multi-ported memory implemented with FPGA logic blocks, having D single-word storage locations (S), m write (W) ports, and n read (R) ports (encoded as mW/nR), and n temporary registers r. Only read and write data lines are shown (i.e., not address lines).

1.1 Conventional Approaches

It is possible to implement a multi-ported memory using only the basic logic elements of an FPGA, as illustrated in Figure 1, which shows a D-location memory with m write ports and n read ports. As shown, we require D m-to-one decoders to steer writes to the appropriate memory locations, and n D-to-one multiplexers to allow each read to access any memory location. Note also that the read outputs are registered (r) to implement a synchronous memory where the output is held stable between clock edges. The problem is that this circuit scales very poorly, with area increasing rapidly with memory depth and the decoding/multiplexing severely limiting the maximum operating frequency.

It is normally more efficient to implement memories on FPGAs using the provided block RAMs, each of which can be quite large (e.g., 9 Kbits) while supporting high operating frequencies (e.g., 580 MHz). However, FPGA block RAMs currently provide only two ports for reading and/or writing. Note that Altera's Mercury line of Programmable Logic Devices (PLDs) [2] previously provided quad-port RAMs to support gigabit telecom applications; however, this feature has not been supported in any other Altera device, likely due to the high hardware cost.

System designers have hence used one or a combination of three conventional techniques for increasing the effective number of ports of FPGA block RAMs, as shown in Figure 2. The first is replication, which can increase the number of read ports by maintaining a replica of the memory for each additional read port. However, this technique alone cannot support more than one write port, since the one external write port must be routed to each block RAM to keep it up-to-date. The second is banking, which divides memory locations among multiple block RAMs (banks), allowing each additional bank to support an additional read and write port. However, with this approach each read or write port can only access its corresponding memory division; hence a pure banked design does not truly support sharing across ports. The third we call "multipumping", where any memory design is clocked at a multiple of the external clock, providing the illusion of a multiple of the number of ports. For example, a 1W/1R memory can be internally clocked at 2X the external frequency to give the illusion of being a 2W/2R memory. A multipumped design must also include multiplexers and registers to temporarily hold the addresses and data of pending reads and writes, and must carefully define the semantics of the ordering of reads and writes. While reasonably straightforward, the drawback of a multipumped design is that each increase in the number of ports dramatically reduces the maximum external operating frequency of the memory.

Figure 2: Three conventional techniques for providing more ports given a 1W/1R memory (read and write address values are not depicted, only data values): (a) Replication maintains an extra copy of the memory to support each additional read port, but is limited to supporting only one write port; (b) Banking divides data across multiple memories, but each read or write port can only access one specific memory; (c) Multipumping multiplies the number of read/write ports of a memory by adding internal data and address multiplexers and temporary registers (r), and internally clocking the memory at a multiple of the external clock (which quickly degrades the maximum external operating frequency).

1.2 A More Efficient Approach

In this paper we propose a new design for true multi-ported memories that capitalizes on FPGA block RAMs while providing (i) substantially better area scaling than a pure logic-based approach, and (ii) higher frequencies than the multipumping approach. The key to our approach is a form of indirection through a structure called the Live Value Table (LVT), which is itself a small multi-ported memory implemented in reconfigurable logic similar to Figure 1. Essentially, the LVT allows a banked design to behave like a true multi-ported design by directing reads to appropriate banks based on which bank holds the most recent or "live" write value. The intuition for why an LVT-based design is more efficient, even though the LVT is purely implemented in logic elements, is that the LVT is much narrower than the actual memory banks since it only holds bank numbers rather than full data values; thus the lines that are decoded/multiplexed are also much narrower and hence more efficiently placed and routed. An LVT-based design also leverages block RAMs, which implement memory more efficiently, and has an operating frequency closer to that of the block RAMs themselves. Additionally, LVT-based design and multipumping are complementary, and we will show that with multipumping we can reduce the area of an LVT-based design by halving its maximum operating frequency. With these techniques we can support soft solutions for multi-ported memories without expensive hardware block RAMs with more than two ports.

1.3 Related Work

There are several prior attempts to implement multi-ported memories in the context of FPGAs, mainly for the purpose of soft processor register files. Most soft uniprocessors exploit replication to provide the 1W/2R register file required to support a three-operand ISA [6–8,13,17]. Jones et al. [9] implement a VLIW soft processor where additional register file ports support a zero-overhead interface to custom hardware functions. However, their multi-ported register file is implemented entirely in the FPGA's reconfigurable logic and limits the operating frequency of their soft processor. Saghir et al. [14,15] implement a multi-ported register file for a VLIW soft processor by exploiting both replication and banking; however, this requires that the compiler schedule register accesses such that there are not two simultaneous reads or writes to the same bank. Nonetheless, this approach is sufficient to support multi-threading [10,12,13] since each thread need only read/write its own division of the register file. Manjikian exploits an aggressive form of multipumping by performing reads and writes on consecutive rising and falling clock edges within a processor cycle [11]. His approach avoids Write-After-Read (WAR) violations by performing all writes before reads. Unfortunately this design requires that the entire system use multiple-phase clocking.

1.4 Contributions

This paper makes the following contributions: (i) we present the first thorough exploration of the design space of FPGA-based soft multi-ported memories; (ii) we evaluate conventional methods of building such memories and confirm that they do not scale well; (iii) we introduce the Live Value Table (LVT), an efficient mechanism for implementing multi-ported memories with an arbitrary number of read and write ports; (iv) we demonstrate that LVT-based designs are smaller and faster than pure reconfigurable logic implementations, as well as faster and more scalable than pure multipumping implementations; (v) we evaluate the impact of multipumping on LVT-based designs, and demonstrate that they are complementary.

2. EXPERIMENTAL FRAMEWORK

Memory Designs: We consider only memories of 32-bit element width as this is the common case in many computing systems. We consider a range of multi-ported memory designs that have at least one write port and two read ports (1W/2R) such that all ports are usable simultaneously within a single external cycle. We do not consider one-write-one-read (1W/1R) memories as they are trivial to implement with a single FPGA block RAM. We also do not consider memories that may stall (e.g., take multiple cycles to return
read values should they conflict with concurrent writes), although such designs would be compelling future work. Additionally, we assume that multiple writes to the same address are prevented by the system using the multi-ported memory, and that the result of doing so is undefined. Each design is wrapped in a test harness such that all paths begin and end at registers, allowing us to ensure proper timing analysis and to test each design for correctness. The Verilog sources are generic and do not contain any Altera-specific modules or annotations.

CAD Flow: We use Altera's Quartus 9.0 to target the Altera Stratix III EP3SL340F1760C2, a large and fast device that allows us to compare with published results for the Nios II soft processor [5]. We do not bias the synthesis process to favour area or speed, nor perform any circuit transformations such as retiming. We configured the place and route process to make a standard effort at fitting with only two constraints: (i) to avoid I/O pin registers to prevent artificially long paths that would affect the clock frequency, and (ii) to set the target clock frequency to 1 GHz to optimize circuit layout for speed (footnote 1). We report maximum operating frequency by averaging the result of place/routes across ten different random seeds.

Measuring Area: We report area as the total equivalent area, which estimates the actual silicon area of a design point: we calculate the sum of all the Adaptive Logic Modules (ALMs) plus the area of the block RAMs counted as their equivalent area in ALMs (footnote 2). Each ALM can contain unrelated logic and registers, avoiding an inflated logic utilization measure due to underused ALMs.

3. STRATIX III ARCHITECTURE

The following describes the basic components provided by the Altera Stratix III architecture. Although our work targets Altera's Stratix III FPGAs, the following concepts generally translate to the devices of other FPGA vendors. For example, other than the different capacities, the block RAMs in Xilinx's Virtex-6 FPGAs would function identically.

Adaptive Logic Module (ALM) Memory: FPGAs can implement memory using their generic reconfigurable logic composed of Adaptive Logic Modules (ALMs). The Stratix III ALMs each contain two registers, some adder logic, and Look-Up Tables (LUTs). ALM memory has virtually no constraints on capacity, configuration, and number of ports, but pays a large area and speed penalty (Figure 1). The CAD tools may also require a prohibitive amount of time (over an hour) to place and route such a memory.

Block RAM (BRAM) Memory: FPGAs implement block RAMs directly on their silicon substrate. Block RAMs have two ports that can each function either as a read or a write port. These memories use less area and run at a higher frequency than ones created from the FPGA's reconfigurable logic, but do so at the expense of having a fixed storage capacity and number of ports. The Stratix III FPGA devices mostly contain M9K block RAMs (footnote 3), which hold nine kilobits of information in various widths and depths. At a width of 32 bits, an M9K holds 256 elements.

Footnote 1: This approach was recommended by an experienced user of Quartus as more practical than iterated guessing.
Footnote 2: Altera graciously provided the confidential area equivalence of BRAMs for Stratix II. We extrapolated the unavailable Stratix III area numbers from the Stratix II data and other confidential data.
Footnote 3: They also contain larger M144K block RAMs which each hold 144 kilobits (as 4k×32 for example), but exist in much fewer numbers than M9Ks and target bulk RAM instead.
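The M9K geometry stated above (256 elements at a width of 32 bits) is enough for a quick sizing estimate. The following is a minimal sketch under that single assumption; the function name `m9ks_for_depth` and the constant name are hypothetical and not taken from the authors' sources, and the calculation ignores replication and any other block RAM configuration.

```python
# Stated in the text: at a width of 32 bits, one M9K holds 256 elements.
M9K_WORDS_AT_32_BITS = 256


def m9ks_for_depth(depth, words_per_m9k=M9K_WORDS_AT_32_BITS):
    """Return how many M9Ks one copy of a 32-bit-wide memory needs."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    # Ceiling division: a partially filled M9K still occupies a whole block.
    return -(-depth // words_per_m9k)


for depth in (32, 256, 257, 1024):
    print(depth, m9ks_for_depth(depth))
```

Every depth up to 256 elements maps to a single M9K per copy, which is consistent with the observation that the smallest possible M9K designs all have a capacity of 256 elements.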
Figure 3: Comparison of the speed and area of various ALM, M9K, and MLAB implementations of 32-bit 1W/2R memories of varying depth (as indicated by the number at each data point). The prefix denotes the implementation technique: "Pure" for pure logic, "Repl" for replication, and "MP" for pure multipumping. The smallest possible M9K designs have a capacity of 256 elements, hence the two M9K designs are all overlapping. Placement/routing fail for MLAB designs of depth greater than 64.

Memory Logic Array Block (MLAB) Memory: The Stratix III FPGA architecture clusters its ALMs into Logic Array Blocks (LABs), each containing ten ALMs. Some of the LABs can function either as a group of ALMs or as a single small block of memory, or Memory LAB (MLAB). MLABs provide a halfway point between ALM and BRAM implementations: they are small, numerous, and widely distributed like ALMs, but implement memory in a denser, constrained manner like BRAMs. A single MLAB holds up to 20 words of 16 bits. Unlike other memories, which perform all operations on the rising edge of the clock, MLABs read on the rising edge and write on the falling edge. MLABs best implement small shift registers and FIFO buffers and not arbitrarily deep memories.

4. CONVENTIONAL MULTI-PORTING

A simple two-ported memory, with one read and one write port (1W/1R), defines the basic conceptual and physical unit of storage from which we build multi-ported memories. We assume that each port may access any one location per cycle, and if a read and write to the same location occur in the same cycle, the read port obtains the current contents of the location and the write port overwrites the contents at the end of the cycle ("Write-After-Read" (WAR) operation).

The simplest multi-ported memory that we consider is a 1W/2R memory. This memory is interesting because it is not naturally supported by FPGA structures but is commonly used, for example for soft processor register files. Figure 3 plots the area and operating frequency of 1W/2R memories of varying depth (where the depth is indicated by the number next to each point), and of varying implementation. We use these results to discuss the following conventional techniques for building multi-ported memories on FPGAs:

Pure ALMs: A straightforward method for constructing a multi-ported memory on an FPGA is to do so directly in ALMs, i.e., a design like that shown in Figure 1. We evaluate such designs in Figure 3, shown as the Pure-ALM series of points. From the figure we see that even a 32-entry 1W/2R memory requires 864 ALMs for this design. As we increase depth, area increases rapidly and operating frequency drops significantly. This trend motivates the need to use block RAMs for more efficient multi-ported memories.

Replication: Replication (Figure 2(a)) is an easy way to increase the number of read ports of a simple memory (i.e., to 1W/nR): simply provide as many copies of the memory as you require read ports, and route the write port to all copies to keep them up-to-date. We evaluate replication in Figure 3 for both M9Ks (Repl-M9K) and MLABs (Repl-MLAB). All of the Repl-M9K designs fit into two M9K BRAMs, such that those points are all co-located in the figure. Replication requires no additional control logic, hence these designs are very efficient. For 1W/2R memories with a depth greater than 256 elements, another pair of M9Ks would be added at every depth increment of 256 elements, resulting in a relatively slow increase in area as memory depth increases. We also consider replicated designs composed of MLABs (Repl-MLAB). Unfortunately, Quartus could not place and route any MLAB-based memory with more than 64 elements. Since each MLAB stores the equivalent of 160 ALMs, the Repl-MLAB implementation requires much less interconnect than the Pure-ALM implementation but considerably more than the Repl-M9K implementation. For example, the 32-entry Repl-MLAB 1W/2R memory requires only 198 equivalent ALMs, but still suffers a lower operating speed of 376 MHz. The replicated M9K designs (Repl-M9K) are evidently far superior to the alternatives, with an area of 90 equivalent ALMs and maximum operating frequency of 564 MHz. However, the drawback to this approach is that there is no way to provide additional write ports with replication alone: we must pursue other techniques to provide more write ports.

Banking: Banking (Figure 2(b)) is similar to replication, except that the memory copies are not kept coherent; each additional memory now supports an additional read and write port, providing an easy way to increase ports arbitrarily (mW/mR). The conventional way to use banking is to divide memory locations evenly among the banks, such that each read and write port are tied to a certain memory division. However, a memory with only banking is not truly multi-ported, since only one read from a certain division is possible in a given cycle. For this reason we do not evaluate banked-only memories, although a close estimate of the Fmax/area of a mW/mR banked memory is the corresponding 1W/mR replicated design.

Multipumping: Multipumping (Figure 2(c)) internally uses an integer multiple of the external system clock to multiplex a multi-ported memory with fewer ports, giving the external appearance of a larger number of ports (mW/nR). This requires the addition of multiplexers and registers to hold temporary states, as well as the generation of an internal clock, and careful management of the timing of read-write operations. We further describe the details of implementing a multipumped design in the next section.

4.1 Multipumping Implementations

Since multipumped memories multiplex ports over time, the order of read/write operations must be carefully managed: violating the precedence of reads and writes would break the external appearance of them occurring at the same time. In particular, writes must be performed at the end to avoid Write-After-Read (WAR) violations where an earlier internal write updates a value before it has been read by a subsequent internal read.

For non-multipumped designs, each block RAM port supports either a read or a write, hence we use the block RAMs in "simple dual-port" mode where a port is statically defined to be for reading or writing. Since multipumped designs time-multiplex the block RAM ports we can potentially exploit "true dual-port" mode, where a block RAM port can be dynamically configured for reading or writing. For the simplest multipumped design consisting of a single block RAM, true dual-port mode can allow us to configure both ports for reads and perform pairs of reads until all are done, then configure both ports as writes and perform pairs of writes until all are done.

A larger but more aggressive multipumped design can also exploit banking to reduce the number of cycles required to perform reads: each bank can perform two unique reads, and all banks can operate in parallel; when reads are completed, one pair of writes can be performed across all banks each cycle until all writes are performed. In other words, the block RAMs are read like a banked memory and are written like a replicated memory. Similar techniques have
been published by Xilinx [16] and Actel [1] but only for certain forms of quad-port memories, whereas our implementation supports arbitrary numbers of read and write ports.

True dual-port mode is not free: for Stratix III FPGAs [3] an M9K block RAM in simple dual-port mode has 256 locations of 32 bits, while in true dual-port mode it has 512 locations of 16 bits since the RAM output drivers are split to support two reads. Therefore true dual-port mode requires two M9K block RAMs to create a 32-bit-wide memory. Despite this doubling, the number of block RAMs required remains practical: even an 8W/16R purely multipumped memory would need only one block RAM pair to support each read port, for a total of 32.

The following summarizes the design of a pure multi-ported memory using true dual-port mode for the block RAMs. Given an arbitrary mW/nR memory, the number of cycles required to perform all the m writes and n reads follows $\lceil m/2 + n/(2x) \rceil$, where x counts the number of block RAMs. The $m/2$ term stems from each write being replicated to all the block RAMs to avoid data fragmentation, making the whole memory appear to have only two write ports. The $n/(2x)$ term comes from each block RAM being able to service any two reads at once since the writes replicate their data to all block RAMs. The ceiling function handles cases where there are either more internal ports than there are external read or write ports, or the number of internal ports does not evenly divide the number of external ports. A fractional number of cycles in a term implies that, for one of the cycles, some ports remain free and some writes might be done simultaneously with the last reads. The typical case is when the number of block RAMs equals the number of read ports, allowing all reads to be performed in one cycle while leaving half the ports available for one of the writes, which may save one cycle in certain port configurations. Larger numbers of block RAMs will not further reduce the number of cycles.

As a simple example, in Figure 3 we implement 1W/2R memories by double-pumping M9Ks (MP-M9K 2X) and MLABs (MP-MLAB 2X) (footnote 4). While 2X multipumping does halve the number of M9Ks or MLABs used, the overhead of the required control circuitry negates any area savings for memories with so few ports. The maximum external operating frequencies of the double-pumped designs are also a little under half those of the replicated
designs (186 MHz for MP-MLAB 2X, and 279 MHz for MP-M9K 2X). As we will demonstrate later, multipumping can be an important technique to reduce area when building memories with larger numbers of ports.

Footnote 4: Again, due to Quartus' difficulty with MLABs, the multipumping implementation uses simple dual-port MLABs only. For 1W/2R only, this does not affect the area or external operation.

Figure 4: A generalized mW/nR memory implemented using a Live Value Table (LVT). Each write updates its own replicated memory bank (M) and updates its entry at the same address in the LVT. For each read, the LVT selects the memory bank that holds the most recently written value for the requested memory address.

4.2 Summary

A 1W/2R memory can easily be extended to have more read ports by increasing the amount of replication, but this technique cannot be used to add more write ports. While banking easily allows multiple write ports, such designs must map reads and writes to divisions of the memory, and do not allow true sharing. A multi-ported memory implemented purely in ALMs scales poorly. Multipumping by itself causes a large drop in operating frequency. In the next section, we introduce a method for transparently managing and keeping coherent banked memories to effectively allow multiple read and write ports.

5. LVT-BASED MULTI-PORTED MEMORIES

We propose a new approach to implementing multi-ported memories on FPGAs that can exploit the strengths of all three conventional techniques for adding ports. Our approach comprises banks of replicated block RAMs where a mechanism of indirection steers each read to the bank holding the most-recent write value. Multipumping is orthogonal to our approach, and can be applied to reduce the area of a memory in cases where a slower operating frequency can be tolerated, as we demonstrate later in Section 7. We name our indirection mechanism the Live Value Table (LVT), since it tracks which bank contains the "live" or most-recently updated value for each memory location. A brief outline of this approach is described by Altera [4], but it provides no details of operation, no comparisons, and limits itself to only four ports.

5.1 The Basic Idea

Figure 4 illustrates an LVT-based multi-ported memory. The memory is composed of m banks ($M_0$ to $M_{m-1}$), each of which contains a 1W/nR memory (constructed via replication of block RAMs) such that n is equal to the desired number of read ports ($R_0$ to $R_{n-1}$). Each write port writes to its own bank, and each read port can read from any of the banks via its multiplexer. The banked memory allows for arbitrary concurrent writes, while the replication within each bank supports arbitrary concurrent reads. The LVT is a mW/nR multi-ported memory implemented using ALMs.

At a high level, the design operates as follows. During a write to a given address, the write port updates that location in its block RAM bank with the new value, and the LVT simultaneously updates its corresponding location with the bank number (0 to $m-1$). During a read, the read port sends the address to every bank and to the LVT. All the banks return their value for that location and the LVT returns the number of the write port which last updated that location, driving the multiplexer of the read port to select the output of the proper block RAM bank.

Figure 5: A Live Value Table (LVT) for a multi-ported memory of depth D with m write ports (W) and n read ports (R), shown for (a) write operation and (b) read operation. Each LVT location corresponds to a memory location, and tracks the bank number of the memory bank that holds the most recent write value. Every write updates the corresponding location with the destination bank number, and every read is directed to the appropriate bank by the bank number stored in the corresponding LVT location. The width (b) of the bank numbers is $\log_2(m)$. The width (d) of the addresses is $\log_2(D)$.

5.2 Implementing the LVT

Figure 5 illustrates the overall structure and operation of an LVT for a multi-ported memory of depth D with m write ports (W) and n read ports (R). Each LVT location corresponds to a memory location, and tracks the bank number of the memory bank that holds the most recent write value for that memory location. Despite being implemented entirely in ALMs, the area of an LVT remains tractable due to its narrow width $b = \log_2(m)$. For example, compared to the 864 ALMs of the 32-element 1W/2R Pure-ALM memory in Figure 3, an LVT of the same depth with 2W/2R ports uses only 75 ALMs (footnote 5). Even with 8W/16R ports, the corresponding LVT consumes only 649 ALMs.

During writes, the LVT uses the memory write addresses to update the corresponding locations with the numbers of the ports performing the writes. These numbers identify the block RAM banks that hold the written values. During reads, the LVT uses the read addresses to fetch the bank numbers that in turn steer the outputs of those banks to the read ports. All addresses are of width $d = \log_2(D)$.

5.3 LVT Operation

As an example of the operation of a Live Value Table, Figure 6 depicts two writes and two reads to a multi-ported memory similar to the one depicted in Figure 4. The memory contains one memory bank for each write port (W0 and W1). Each memory bank is a replicated block RAM memory with enough ports for each read port (R0 and R1). The LVT at the top is implemented using ALMs only, has the same depth as each memory bank, but stores the much narrower bank numbers. The write ports place their bank number in the LVT at the same address at which they write their data to the banks. The LVT controls the output multiplexer of each read port. The memory begins empty or otherwise uninitialized.

Footnote 5: A 2W/2R LVT is the smallest meaningful case here, as a memory with a single write port does not need an LVT.
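The write and read behaviour described above can be sketched as a small behavioural model. This is an illustrative Python sketch, not the authors' Verilog: the class name `LVTMemory`, the `cycle` method, and the tuple format for writes are all hypothetical. It assumes the Write-After-Read semantics from Section 4 (reads see the contents from before the current cycle's writes), rejects concurrent writes to one address, which the paper leaves undefined, and ignores the one-cycle latency of the registered read outputs.

```python
class LVTMemory:
    """Behavioural model of an mW/nR memory built from banks plus an LVT."""

    def __init__(self, depth, num_write_ports, num_read_ports):
        self.num_read_ports = num_read_ports
        # One bank per write port. In hardware each bank is a replicated
        # 1W/nR block RAM memory; replication only adds read bandwidth,
        # so one list per bank is enough to model its contents.
        self.banks = [[None] * depth for _ in range(num_write_ports)]
        # The LVT stores, per address, the number of the bank written last.
        self.lvt = [None] * depth

    def cycle(self, writes=(), reads=()):
        """Simulate one external cycle.

        writes: iterable of (write_port, address, value) tuples.
        reads:  iterable of addresses, one per read port in use.
        Returns the list of values seen by the read ports.
        """
        writes, reads = list(writes), list(reads)
        if len(reads) > self.num_read_ports:
            raise ValueError("more reads than read ports")
        ports = [port for port, _, _ in writes]
        addrs = [addr for _, addr, _ in writes]
        if len(set(ports)) != len(ports):
            raise ValueError("each write port can write once per cycle")
        if len(set(addrs)) != len(addrs):
            raise ValueError("concurrent writes to one address are undefined")

        # Reads first: each read uses the LVT entry to pick the live bank.
        results = []
        for addr in reads:
            bank = self.lvt[addr]
            results.append(None if bank is None else self.banks[bank][addr])

        # Writes last: update the port's own bank and record it in the LVT.
        for port, addr, value in writes:
            self.banks[port][addr] = value
            self.lvt[addr] = port
        return results


# The 2W/2R scenario of Figure 6: W0 writes 42 to address 3, W1 writes 23
# to address 2, then R0 reads address 2 and R1 reads address 3.
mem = LVTMemory(depth=4, num_write_ports=2, num_read_ports=2)
mem.cycle(writes=[(0, 3, 42), (1, 2, 23)])
print(mem.cycle(reads=[2, 3]))  # [23, 42]
```

The model keeps only one copy of each bank because replication changes how many reads a bank can serve per cycle, not what it stores; the per-address bank number in `lvt` is the only state needed to steer a read.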
Figure 6: Example operation of a 2W/2R LVT-based multi-ported memory: during write operation, W0 writes 42 to address 3 and W1 writes 23 to address 2, and the LVT records for each address the bank that was last written; during read operation, R0 reads address 2 and R1 reads address 3, and the LVT selects the appropriate bank for each read address.

Figure 6(a) shows the state of the memory banks and the LVT after port W0 writes the value 42 to address 3 and port W1 writes 23 to address 2. The values are stored into the separate memory banks of ports W0 and W1, while the LVT stores their bank numbers at the same addresses.

An access from any read port will simultaneously send the address to the LVT and to each memory bank. The bank number returned by the LVT directs the output multiplexer to select the output of the block RAM memory bank containing the most current value for the second memory element. In Figure 6(b), port R1 reads from address 3 and thus gets 42 from bank 0, while port R0 reads from address 2 and gets 23 from bank 1.

5.4 Block RAM Requirements

Having memory banks which can hold the entire memory contents for each write port and having each of these banks internally replicated once for each read port means that the total number of block RAMs within all the banks equals the product of the number of write ports and read ports, times the number of block RAMs necessary to hold the entire memory contents in a single bank. For example, the rather large case of a 32-bit 8W/16R multi-ported memory requires 128 block RAMs for depths of up to 256 elements. Even the smallest Stratix III FPGA (EP3SL50) contains 108 M9K block RAMs, while mid-range devices contain 275 to 355. Also, the relatively large depth of the M9K block RAMs allows correspondingly large multi-ported memories to be implemented. Larger memories would likely require the use of deeper block RAMs such as the Stratix M144K. In Section 7, we will demonstrate how multipumping can reduce the number of required block RAMs.

5.5 Recursive LVT Implementation

An LVT implements a multi-ported memory using ALMs and thus grows proportionately with depth; however, since each location stores only the few bits required to encode a memory bank number, the memory size remains practical. It would seem desirable to repeat this area-saving and implement the LVT itself using block RAMs, managed by a still smaller, inner
LVT. However, we cannot avoid implementing an LVT using ALMs since FPGAs do not provide any suitable multi-ported block RAMs with enough write ports and the narrow width of an LVT. Ideally, a number of mW/1R block RAMs could be used as a replicated memory to create a mW/nR LVT without the use of ALM-based storage, but no such block RAMs exist on FPGAs. Additionally, any inner LVT used to coordinate block RAMs implementing a larger outer LVT would necessarily be implemented using ALMs and would have the same depth and control the same number of banks and ports as the outer LVT it sought to replace. This inner LVT would thus have the same area as the outer LVT, and hence is not worth it.

6. LVT PERFORMANCE

While an LVT does solve the problem of adding write ports to a memory, it also introduces additional delay due to the bank number look-up and the read port multiplexers, and increases the area due to internal replication of each memory bank. In this section and the next, we demonstrate that the LVT-based approach provides (i) substantially better area scaling than a pure logic-based approach, and (ii) higher frequencies than multipumping approaches.

6.1 Speed vs. Area

Figure 7(a) and Figure 8(a) plot the average maximum operating frequency (Fmax) versus area for 2W/4R and 4W/8R memories of increasing depth (denoted by the number next to the data point). It is apparent that the pure ALM implementation (Pure-ALM) is inefficient: for the 4W/8R memory, 32 elements requires 3213 ALMs and 256 elements requires 23767 ALMs. The larger of these pure ALM designs are likely impractically large for most applications.

Looking at the MLAB-based LVT implementations (LVT-MLAB) for 2W/4R, the designs are smaller but achieve a slower Fmax than the corresponding pure ALM designs. For the 4W/8R designs, the MLAB-based LVT implementations are both larger and slower than the corresponding pure ALM designs. Furthermore, the MLAB-based designs cannot support memories deeper than 64 elements since Quartus cannot place and route them. Overall the MLAB-based designs are uncompelling, except for providing an area-Fmax trade-off relative to the pure ALM designs for 2W/4R memories.

From the figures it is evident that the M9K-based implementations are superior. The area of the 2W/4R and 4W/8R LVT-M9K implementations increases much more slowly with depth than the pure ALM implementation. Furthermore, as an indication of their usability, these designs achieve a clock frequency close to or better than the 290 MHz clock frequency of a Nios II/f soft processor on the same Stratix III device [5]. For example, the 4W/8R version has an operating frequency ranging from 361 MHz at 32 elements, down to 281 MHz for 256 elements, with enough ports to support four such soft processors.

6.2 Area Breakdown

Figure 7(b) and Figure 8(b) display the total equivalent area of various implementations of the same 2W/4R and 4W/8R memories, broken down into their components. The Pure-ALM implementation is a single multi-ported memory without any specified subcomponents: the synthesis process implements all of the multiplexers, decoders, and storage implicitly. These increase in proportion with the depth of the memory and rapidly become impractically large.

The LVT-MLAB implementation, despite using denser memory, suffers from higher interconnect area overhead. The area of the LVT-MLAB memory banks increases quickly with the memory depth since each MLAB can only store 20 words of 16 bits. Also, Quartus could not place and route MLAB-based memories deeper than 64 elements. The absence of output multiplexers for the 64-element 2W/4R memory is due to a fortuitous synthesis optimization by Quartus: each register in an ALM has two load lines, which may eliminate the multiplexer when there are only two sources.

Figure 7: Speed and area for Pure-ALM, LVT-MLAB, and LVT-M9K implementations of a 2W/4R memory with an increasing number of memory elements. Panels: (a) Fmax vs. Area; (b) Area Breakdown.

The LVT-M9K block RAM Memory Banks have the lowest area
0 5000 10000 15000 20000 25000 150 200 250 300 350 400 326412825632643264128256 LVT-M9K (a)FmaxvsArea PureALM LVTMLAB32 elements LVTM9K PureALM LVTMLAB64 elements LVTM9K PureALM 128 elements LVTM9K PureALM 256 elements LVTM9K 0 1000 2000 3000 4000 5000 6000 7000 2376711456 Memory Banks (b)AreaBreakdown Figure8:SpeedandareaforPure-ALM,LVT-MLAB,andLVT-M9Kimplementationsofa4W/8Rmemorywithanincreasingnumberofmemoryelements.duetotheirhigherdensityandlowerinterconnectrequirements.MostofthemultiplexinganddecodingoverheadinthePure-ALMandLVT-MLABimplementationsbecomesimplicitinthecircuitryoftheM9KblockRAMs.TheareaoftheLVT-M9K4W/8RMem-oryBanksremainsconstantat1446equivalentALMssinceallofthememorydepthstintothesamenumberofblockRAMs.Evenwiththenon-trivialoverheadoftheLVT,theLVT-M9Kimplemen-tationsconsumemuchlesstotalareathanthealternatives.TheLVTsoftheLVT-MLABandLVT-M9KimplementationshavetheexactsameinternalstructureandthesamedepthasthecorrespondingPure-ALMmemoryimplementationandthusalsoscaleproportionatelywiththedepthofthememory.However,theLVTsonlystoretheoneortwobitsrequiredtoidentifyamemorybank,reducingtheirgrowthtotractablelevels.Asanexample,theareaoftheLVToftheLVT-M9K4W/8Rmemoryrangesfrom280ALMsupto1977ALMs:approximatelyone-tenththeareaofthecorrespondingPure-ALMmemory.Theareaofthe4W/8Routputmultiplexers,whenpresent,remainsconstantat256ALMssincethenumberofbanksintheLVT-MLABandLVT-M9Kmemoriesalsoremainsconstant.Forthe2W/4Rmemory,themultiplexerareauctuatesbetween77and93ALMs,likelyduetooptimizationsmadepossiblewhenanALMhasinputsfromonlytwobanks.7.MULTIPUMPINGPERFORMANCEIntheprevioussectionweobservedthatM9KimplementationsofLVT-basedmulti-portedmemoriesarefasterandsmallerthanthealternatives—forsomeapplicationstheachievableFmaxispo-tentiallyoverkill.Insuchcaseswecouldapplymultipumping(in-troducedearlierinSection4)totradeFmaxforreducedareaastheapplicationallows.Inthissectionwedescribeandmeasuremul-tipumpingappliedtoLVT-baseddesigns,andalsocomparewithpuremultipumping-basedmulti-por
tedmemorydesigns.7.1Speedvs.AreaMultipumpingcanbringaboutausefulreductioninareaifthespeedoftheoriginalmemoryissignicantlyhigherthanrequiredbythesurroundingsystem.Figure9(a)andFigure10(a)comparethemaximumexternaloperatingfrequency(Fmax)andthetotalareaofM9K-basedLVT2W/4Rand4W/8Rmemorieswith2Xand4Xmultipumping,alongwiththeequivalentpuremultipump- 200 300 400 500 600 700 800 900 1000 50 100 150 200 250 300 350 400 450 32641282563264128256326412825632 to 256 LVT 1X (a)FmaxvsArea LVT1X LVT2X 32 elements LVT4X MP2X LVT1X LVT2X 64 elements LVT4X MP2X LVT1X LVT2X 128 elements LVT4X MP2X LVT1X LVT2X 256 elements LVT4X MP2X 0 200 400 600 800 1000 Memory Banks (b)AreaBreakdown Figure9:SpeedandareaforM9K-based2W/4Rmultipumpedmemories:anLVTmemorywithmultipumpingfactorsof1X(a2W/4Rmemorywithnomultipumping),2X(a2W/2Rmemorywithtwointernalcycles),and4X(a2W/1Rmemorywithfourinternalcycles),andapuremultipumpingmemory(MP2X). 0 500 1000 1500 2000 2500 3000 3500 4000 50 100 150 200 250 300 350 400 32641282563264128256326412825632 to 256 LVT 1X (a)FmaxvsArea LVT1X LVT2X 32 elements LVT4X MP3X LVT1X LVT2X 64 elements LVT4X MP3X LVT1X LVT2X 128 elements LVT4X MP3X LVT1X LVT2X 256 elements LVT4X MP3X 0 500 1000 1500 2000 2500 3000 3500 4000 Memory Banks (b)AreaBreakdown 
Figure10:SpeedandareaforM9K-based4W/8Rmultipumpedmemories:anLVTmemorywithmultipumpingfactorsof1X(a4W/8Rmemorywithnomultipumping),2X(a4W/4Rmemorywithtwointernalcycles),and4X(a4W/2Rmemorywithfourinternalcycles),andapuremultipumpingmemory(MP3X).ing(MP)implementations.Forallcases,theinternaloperatingfrequencyremainsapproximatelyequaltotheFmaxoftheoriginalbaselinememorypriortomultipumping,whichrangesfortheLVT4W/8Rmemoryfrom361MHzto281MHzasthedepthincreases,and523MHzforalldepthsoftheMP3X4W/8Rmemory.Despitethehighinternaloperatingfrequencies,dividingthembyamultipumpingfactordoesbringaboutaharshexternalspeedpenalty.Forexample,the4W/8RLVT2Xmultipumpedimplemen-tationsinFigure10(a)operateexternallyatfrequenciesrangingfrom176MHzto149MHz,whichmaystillbepracticalspeeds.TheMP3Ximplementationsalsoholdat174MHz.Foreitherim-plementation,itisevidentthatonlysmallmultipumpingfactorscanbeusedbeforethedropinFmaxbecomestoogreattobepractical.Althoughwehavetestedmultipumpingfactorsofuptoeight,weexpectthatmostdesignswilluseafactoroftwoorthree.Furthermore,althoughthepuremultipumping(MP)implemen-tationsseemtohavebetterperformanceandagreatlyreducedarea,amultipumpingfactoroftwoisonlypossiblefor1W/2R(Figure3)and2W/4Rmemories(Figure9(a)).Puremultipumpingmemorieswithmoreportswillalwaysrequireamultipumpingfactorofatleastthreeorfour,whichquicklydropstheFmax.Bycomparison,amultipumpingfactoroftwoisalwaysfeasibleforanyLVTmem-orywithanevennumberofreadports.TheslowerdropinspeedofanLVTmemoryasthenumberofportsincreases(Figure9(a)vs.Figure10(a))isaconsequenceofitsinternalparallelism,insteadofthemostlyserialoperationofapuremultipumpingmemory.7.2AreaBreakdownTheprimarybenetofmultipumpingisreducingtheareaofthememorybanksattheexpenseofclockfrequency.Althoughtheareaofthememorybanksreducesproportionallytotheamountofmultipumping,theLVTdoesnotscaledownasmuchandlimitstheoverallareareduction.AsdiscussedinSection5.4,thenumberofblockRAMsinamulti-portedLVTmemoryisequaltotheproductofthenumberofreadandwriteports.Sincemultipu
mpingdividesthenumberof internalreadports,thenumberofblockRAMsperbankisreducedbythesamefactor6.ThenumberofreadportsontheLiveValueTablereducestomatch,asdoesthenumberofoutputmultiplexers.Figure9(b)andFigure10(b)showhowmultipumpingaffectstheareaofeachofthesecomponentsforthesameLVT2W/4Rand4W/8Rmemorieswhenusingfactorsoftwo(2X)andfour(4X),comparedtoafactorofone(1X)asthebaselinenon-multipumpedcase,whichisidenticaltotheLVT-M9KbarsofFigure7(b)andFigure8(b).Theguresalsoshowtheareabreakdownoftheequiv-alentpuremultipumping(MP)memories.ForLVTmemories,themultipumpingfactorexactlydividestheareaofthememorybanksbyitselfsincenowonlyone-halforone-quarterthenumberofinternalreadportsexists,whichalsoreducestheareaoftheoutputmultiplexersbythesameratio.Forthe4W/8Rmemory,theareaoftheLiveValueTableshrinksbyonly24%for2Xand36%for4Xonaveragesinceitsnumberofwriteportsremainsunchanged7.The“MultipumpingOverhead”fractioncontainstheadditionaloverheadofmultipumpingsuchastheMultipumpingController,internalmultiplexers,andtemporaryregisters.Regardlessofthedepthofthememory,multipumpingin-troducesasmall,nearlyconstantoverhead:145ALMsfor4W/8RLVT2Xmultipumping,and219ALMsfor4W/8RLVT4Xonav-erage.Summedtogether,theseindividualchangestothe4W/8RLVTmemoriesreducethetotalareabyanaverageof36%for2Xmultipumping,and54%for4X.TheunchangednumberofwriteportsintheLVTprimarilylimitshowmuchwecanreducethearea.Thepuremultipumpingmemories(MP)usemuchlessareasincetheydonotrequireaLiveValueTableorOutputMultiplexers,noruseasmanyblockRAMssincetheirbanksarenotreplicated.Forexample,the4W/8RMP3XmemoryinFigure10(b)usesonlyeightM9KblockRAMsinsideatotalequivalentareaof511ALMs,ofwhich105aremultipumpingoverhead.Unfortunately,puremultipumpingmemoriestendtohavehigherminimummulti-pumpingfactorsandthusslowerFmaxthanLVTmemoriesasthenumberofportsincreases.InSection8.1,wewillexploretheideaofusingpuremultipumpingmemorieswithasmallnumberofportstopotentiallyimprovetheefciencyofLVT-basedmemories.8.MOREAGGRESSIVEDESIGNSInthissectionwedescribepotential
design avenues that are more aggressive than those we have presented: a way to build an even more efficient LVT-based design, and relaxing read/write ordering to ease constraints on the design of the multi-ported memory.

8.1 LVT-Based Memory Based on Pure Multipumped Banks

If even moderately multi-ported block RAMs became available on FPGAs, some very significant area improvements to LVT-based multi-ported memories would follow. For example, doubling the number of read and write ports on a block RAM would mean needing only half as many memory banks to support the write ports of an LVT-based memory, with each bank containing only half as many replicated memories to service the read ports, resulting in needing only a quarter of the number of block RAMs to construct a given LVT-based multi-ported memory. Furthermore, halving the number of banks reduces the width of the LVT by one bit, which is significant since a typical LVT is only three bits wide or less.

Footnote 6: This assumes the multipumping factor can evenly divide the number of read ports. For example, a 4W/8R LVT memory supports factors of two, four, or eight only.
Footnote 7: This fact suggests that the narrower but more numerous write port multiplexers and decoders have the largest impact on the area of pure ALM memories.

Although most FPGAs do not provide block RAMs with more than two ports, some of the smaller pure multipumping memories might provide usable substitutes. This speculation is supported by the interesting performance of the "MP 2X" 2W/4R pure multipumping design from Figure 9: 255 equivalent ALMs at 279 MHz, using four M9K block RAMs. If we used this memory to construct the banks of the "LVT 1X" 4W/8R LVT-based memory in Figure 10, two banks would be required instead of four, with each bank internally replicated once to support the read ports, for a total of four 2W/4R memories. This sums to only 16 M9K block RAMs instead of 32, and even with the additional area overhead of multipumping (footnote 8) the area of the memory banks would decrease by 29%, while the area of the LVT would be halved. It is easy to see from Figure 10(b) that these changes would significantly reduce the area of the 256-element 4W/8R "LVT 1X" implementation. The impact on speed is harder to predict due to the large changes in the structure of the memory banks, but it is conceivable that the operating frequency would remain near that of the underlying 2W/4R pure multipumping memory.

8.2 Relaxed Read/Write Ordering

The primary obstacle to getting the most area benefit from multipumping is the relatively small area reduction of the LVT, since the number of write ports cannot be divided. The writes must all occur together after the reads to prevent WAR violations. If we relax the read/write ordering and allow writes to occur before all of the reads have completed, then time-multiplexing the internal write ports becomes possible. The multipumping factor can now divide both the number of internal memory banks and the number of write ports on the Live Value Table, further improving the area reduction.

For example, with a multipumping factor of two and the read/write ordering preserved, our 4W/8R multi-ported memory example internally becomes a 4W/4R memory. Halving the number of read ports only halves the size of the memory banks and reduces the size of the LVT to a lesser degree. By comparison, if we allow relaxed read/write ordering, then the multipumping factor can also divide the number of write ports (footnote 9), which will in turn divide the number of memory banks in addition to their size and further reduce the area of the LVT. In effect, except for the small overhead of the multipumping control circuitry, the entire 4W/8R memory would internally reduce to a 2W/4R instance which uses about 75% less hardware. This quadratic area reduction is immediately visible when comparing the LVT entries in Figure 9(b) and Figure 10(b), as well as the LVT-M9K entries in Figure 7(b) and Figure 8(b).

Relaxing the read/write ordering requires the designer to schedule the reads and writes to the multi-ported memory to avoid WAR violations, which would corrupt data. For example, given our 4W/8R example multi-ported memory with a multipumping factor of two and relaxed read/write ordering, the reads and writes will internally execute as two consecutive 2W/4R sets, each using one half of the external ports. If the designer wants to simultaneously read and write to the same location within a system cycle, both operations must be grouped in the same read/write set by performing them on the appropriate external ports. If the designer cannot rearrange them, then the write operations must explicitly occur after the conflicting reads, either by placing them in the following read/write set or in the next system cycle. Fortunately, this problem is identical to dependence analysis for optimizing software loops.

Footnote 8: This is pessimistic. For example, all of the multipumped memories could share a single multipumping controller.
Footnote 9: This assumes that the multipumping factor can evenly divide the number of write ports.

9. CONCLUSIONS

FPGA systems provide efficient block RAMs, but with only two ports. Conventional approaches to building memories on FPGAs with a larger number of ports are either very area inefficient, slow, or both. We introduced a smaller and faster implementation for multi-ported memories based on the Live Value Table (LVT): a small, narrow, multi-ported memory implemented in logic elements that coordinates read and write accesses such that a banked memory design behaves like a true multi-ported design. The resulting multi-ported memories provide true Write-After-Read (WAR) random access to any value, from an arbitrary number of ports, without the need to schedule reads and writes.

For example, using an LVT controlling 32 M9K block RAMs, we were able to implement a 256-element 12-ported (4W/8R) multi-ported memory which operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over the equivalent pure ALM implementation, and a 61% speed improvement over a pure multipumping implementation, despite being 7.2x larger.

The higher speeds of our LVT-based designs presented the possibility of exchanging speed for area by applying multipumping. On average, 2X multipumping reduced the total area by 36%, while 4X did so by 54%. Our designs also allowed for lower and more practical multipumping factors than pure multipumping implementations as the number of ports increased.

We also proposed two potential avenues for further increasing the efficiency of LVT-based designs: (i) relaxing the ordering of reads and writes that avoids WAR violations would increase the area reduction from multipumping to about 75% at 2X, minus the overhead of multipumping, at no additional cost in speed; (ii) implementing the memory banks of a 4W/8R LVT-based memory using 2W/4R pure multipumping memories could reduce the area of the memory banks by 29% and halve the area of the LVT while conceivably keeping the operating frequency in a useful range.

In summary, our exploration of the design space led us to three main conclusions: (i) LVT-based multi-ported memories are superior to logic-element-based designs in both area and speed; (ii) LVT-based implementations are faster than pure multipumping implementations, although with an area cost; (iii) pure multipumping implementations can be sufficient if the number of required ports or the external operating frequency is modest.

10. REFERENCES

[1] Implementing Multi-Port Memories in ProASICPLUS Devices. http://www.actel.com/documents/APA_MultiPort_AN.pdf, July 2003. Application Note AC176, Accessed Sept. 2009.
[2] Mercury Programmable Logic Device Family Data Sheet. http://www.altera.com/literature/ds/dsmercury.pdf, Jan. 2003. Version 2.2, Accessed Sept. 2009.
[3] Stratix III Device Handbook Volume 1, Chapter 4: TriMatrix Embedded Memory Blocks in Stratix III Devices. http://www.altera.com/literature/hb/stx3/stx3_siii51004.pdf, May 2008. Version 1.8, Accessed Sept. 2009.
[4] Advanced Synthesis Cookbook: A Design Guide for Stratix II, Stratix III, and Stratix IV Devices. http://www.altera.com/literature/manual/stx_cookbook.pdf, July 2009. Version 5.0, Accessed Nov. 2009.
[5] Nios II Performance Benchmarks. http://www.altera.com/literature/ds/ds_nios2_perf.pdf, June 2009. Version 4.0, Accessed Sept. 2009.
[6] Nios II Processor Reference Handbook. http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf, March 2009. Version 9.0, Accessed Sept. 2009.
[7] CARLI, R. Flexible MIPS Soft Processor Architecture. Tech. rep., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, June 2008.
[8] FORT, B., CAPALIJA, D., VRANESIC, Z., AND BROWN, S. A Multithreaded Soft Processor for SoPC Area Reduction. In IEEE Symposium on Field-Programmable Custom Computing Machines (April 2006), pp. 131–142.
[9] JONES, A. K., HOARE, R., KUSIC, D., FAZEKAS, J., AND FOSTER, J. An FPGA-based VLIW processor with custom hardware execution. In International Symposium on Field-Programmable Gate Arrays (2005).
[10] LABRECQUE, M., AND STEFFAN, J. Improving Pipelined Soft Processors with Multithreading. In International Conference on Field Programmable Logic and Applications (Aug. 2007), pp. 210–215.
[11] MANJIKIAN, N. Design Issues for Prototype Implementation of a Pipelined Superscalar Processor in Programmable Logic. In PACRIM 2003: IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (Aug. 2003), vol. 1, pp. 155–158.
[12] MOUSSALI, R., GHANEM, N., AND SAGHIR, M. Microarchitectural Enhancements for Configurable Multi-Threaded Soft Processors. In International Conference on Field Programmable Logic and Applications (Aug. 2007), pp. 782–785.
[13] MOUSSALI, R., GHANEM, N., AND SAGHIR, M. A. R. Supporting multithreading in configurable soft processor cores. In CASES '07: Proceedings of the 2007 international conference on Compilers, Architecture, and Synthesis for Embedded Systems (New York, NY, USA, 2007), ACM, pp. 155–159.
[14] SAGHIR, M., AND NAOUS, R. A Configurable Multi-ported Register File Architecture for Soft Processor Cores. In ARC 2007: Proceedings of the 2007 International Workshop on Applied Reconfigurable Computing (March 2007), Springer-Verlag, pp. 14–25.
[15] SAGHIR, M. A. R., EL-MAJZOUB, M., AND AKL, P. Datapath and ISA Customization for Soft VLIW Processors. In ReConFig 2006: IEEE International Conference on Reconfigurable Computing and FPGAs (Sept. 2006), pp. 1–10.
[16] SAWYER, N., AND DEFOSSEZ, M. Quad-Port Memories in Virtex Devices. http://www.xilinx.com/support/documentation/application_notes/xapp228.pdf, September 2002. XAPP228 (v1.0), Accessed Sept. 2009.
[17] YIANNACOURAS, P., STEFFAN, J. G., AND ROSE, J. Application-specific customization of soft processor microarchitecture. In FPGA '06: Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field Programmable Gate Arrays (New York, NY, USA, 2006), ACM, pp. 201–210.
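
The port arithmetic used in Sections 7.2 and 8.2 (block RAM count as the product of internal write and read ports, multipumping dividing the read ports, and relaxed ordering also dividing the write ports) can be checked with a short Python sketch. This is only an illustrative model, not the authors' tooling: the function names are hypothetical, each replicated memory is assumed to fit in a single block RAM, and the external frequency is assumed to be exactly the internal frequency divided by the multipumping factor, which the measured designs only approximate.

```python
def internal_ports(writes, reads, factor=1, relaxed=False):
    """Internal (write, read) port counts of an LVT memory multipumped by `factor`.

    With read/write ordering preserved, only the read ports are divided.
    With relaxed ordering (Section 8.2), the write ports are divided as well.
    """
    if reads % factor != 0:
        raise ValueError("factor must evenly divide the number of read ports")
    internal_reads = reads // factor
    internal_writes = writes
    if relaxed:
        if writes % factor != 0:
            raise ValueError("factor must evenly divide the number of write ports")
        internal_writes = writes // factor
    return internal_writes, internal_reads


def block_ram_count(writes, reads, factor=1, relaxed=False):
    """Block RAMs needed: one per (internal write port, internal read port) pair.

    Assumes each replicated memory fits in one block RAM.
    """
    w, r = internal_ports(writes, reads, factor, relaxed)
    return w * r


def lvt_width_bits(banks):
    """Bits needed per LVT entry to identify one of `banks` memory banks."""
    return (banks - 1).bit_length()


def external_fmax(internal_mhz, factor):
    """Idealised external clock: internal clock divided by the multipumping factor."""
    return internal_mhz / factor


if __name__ == "__main__":
    for label, kwargs in [
        ("4W/8R 1X", dict(factor=1)),
        ("4W/8R 2X", dict(factor=2)),
        ("4W/8R 4X", dict(factor=4)),
        ("4W/8R 2X relaxed", dict(factor=2, relaxed=True)),
    ]:
        print(label, block_ram_count(4, 8, **kwargs), "block RAMs")
    print("LVT width for 4 banks:", lvt_width_bits(4), "bits")
    print("MP 3X external Fmax:", round(external_fmax(523, 3)), "MHz")
```

Under these assumptions the 4W/8R baseline needs 32 block RAMs, matching the 32 M9Ks reported in the conclusions, and 2X multipumping with relaxed ordering needs 8, which is the 75% reduction quoted in Section 8.2. The idealised 523 MHz / 3 figure also agrees with the reported 174 MHz for the MP 3X design.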
