vdveldtuvanl
Rob van Nieuwpoort, Vrije Universiteit Amsterdam, The Netherlands, rvvannieuwpoortvunl
Ana Lucia Varbanescu, Technische Universiteit Delft, The Netherlands, alvarbanescutudelftnl
Chris Jesshope, Universiteit van Amsterdam, The Netherlands, crjess
MicroGrid architecture, exposing the programmability and performance abilities of this research architecture. Finally, both the optimizations and the results presented can be used to implement the entire pipeline (or other signal processing kernels) on many-core platforms.

2. RELATED WORK

In this section we discuss other work related to FIR filter and polyphase filter implementations.

In their paper [15], Rob V. van Nieuwpoort and John W. Romein describe their optimized implementation of the LOFAR correlator on various multicore platforms. The best performance is achieved on the IBM Cell/B.E. (full blade), reaching 91% peak performance, compared to 96% on the BlueGene/P. The Cell/B.E. is also 3.9x more energy efficient than the BG/P.

In 2005, Smirnov and Chiueh describe a GPGPU implementation of a FIR filter using OpenGL [14]. At the time, CUDA and OpenCL did not exist yet.

An implementation of a polyphase filter on the Cell Broadband Engine that is similar to ours was presented by Hamilton in his master's thesis [6]. His results show that the implementation is over 6x more efficient than on a normal processor, depending on the amount of input.

The master's thesis by Pettersson and Wainwright [11] discusses the implementation and performance of FIR filters on CUDA and OpenCL. They achieve good performance on CUDA, but they do not provide much detail on the actual implementation. Their FIR filter parameters also differ from ours.

The SPIRAL Project [13] researches automatic code generation for the development and optimization of DSP algorithms and other numerical kernels, including FIR filters and FFTs. The generated code outperforms existing, handwritten libraries, but is not very flexible and there is no GPU code generation.

Overall, we believe that although signal processing in general and FIR filters in particular are of interest to the many-core community, this is the first thorough study of FIR filters using so many platforms, programming models, and performance metrics.

3. SIGNAL PROCESSING BACKGROUND

In this section we give a short description of the signal processing concepts required to understand polyphase filters.

3.1 Signals

A signal is defined as any physical quantity that varies with time, space, or other independent variable(s) [12]. A signal can be mathematically described as a function of one or more independent variables. In this work, we are only interested in discrete signals. Discrete signals can be obtained by sampling at (usually) equally spaced intervals from an analog signal source. In our case, LOFAR antennas sample discrete, complex-valued samples, using sampling frequencies of 160 or 200 MHz.

3.2 FIR filter

A Finite Impulse Response (FIR) filter multiplies a finite number of recent input signals (impulses) relative to a given discrete time by coefficients (impulse responses) and accumulates the results. It can be described mathematically as

$$y(n) = \sum_{i=0}^{N} c_i \, x(n-i),$$

where:

- $y(n)$ is the output signal at discrete time $n$.
- $x(n)$ is the input signal at discrete time $n$.
- $c_i$ are the coefficients, also called weights.
- $N$ is the number of recent signals to consider, called the filter order.

The terms on the right-hand side of the equation are called taps. An $N$th order FIR filter has $N+1$ taps. A FIR filter must remember its last $N$ input samples, which are stored in what is called the delay line. One can design a FIR filter by carefully choosing the filter order and coefficients such that the system has specific characteristics. For the purpose of our work, the values of the coefficients are irrelevant as they do not affect the implementation. While generally it is possible to reduce the complexity of FIR filters by strength reduction [10], this is not feasible for us as it involves designing a specific FIR filter for a specific set of coefficients. In LOFAR there are hundreds of different FIR configurations, all of which can be changed at any time.

3.3 Discrete Fourier Transform

A Fourier transform splits a sequence of input signals into a sequence of frequencies. In doing so it transforms the input from the time domain to the frequency domain. It can be compared to how a prism splits white light into separate light beams of a single frequency. A DFT operates on discrete signals and can be described mathematically as

$$f_k = \sum_{n=0}^{N-1} x(n) \, e^{-i \frac{2\pi}{N} n k},$$

where:

- $x(n)$ is an input signal; there are $N$ input signals.
- $f_k$ is the $k$th frequency and is a complex number, $k = 0, 1, 2, \ldots, N-1$.

The complexity of this algorithm is $O(N^2)$, since computing any of the $N$ frequencies requires iterating over $N$ inputs. DFTs are not used directly in practice, because there are better algorithms known as Fast Fourier Transforms (FFT) which have a complexity of only $O(N \log_2(N))$ [4].

3.4 Polyphase filter

Polyphase filters are used by LOFAR to channelize input streams and reduce interference. They split an input sequence into $N$ subsequences of $M$ samples, where each subsequent input signal is the input to one of $M$ FIR filters (or channels). This can be described mathematically as

$$y_m(n) = \sum_{i=0}^{N} c_i \, x((n-i)M + m),$$

where:

- $N$ is the number of recent samples to consider (the filter order).
- $M$ is the number of FIR filters (channels).
- $y_m(n)$ is the $n$th output signal of the $m$th FIR filter, $m = 0, 1, 2, \ldots, M-1$.

The $M$ outputs $y_m(n)$ are used as inputs to a DFT as described in the previous subsection. The output of the DFT is the output of the polyphase filter.
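The three definitions of Sec. 3.2 to 3.4 can be turned into a small reference sketch. The Python code below is only an illustration of the formulas, not the implementation evaluated in this paper: the function names are hypothetical, samples before the start of the input are assumed to be zero, and the DFT is the direct $O(N^2)$ form rather than an FFT.

```python
import cmath

def fir_output(x, c, n):
    """FIR filter: y(n) = sum_{i=0}^{N} c_i * x(n - i).

    Assumption: x(j) = 0 for j < 0 (empty delay line at start-up).
    """
    return sum(c_i * x[n - i] for i, c_i in enumerate(c) if n - i >= 0)

def dft(x):
    """Direct O(N^2) DFT: f_k = sum_{n=0}^{N-1} x(n) * exp(-i*2*pi*n*k/N)."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / size) for n in range(size))
        for k in range(size)
    ]

def polyphase_output(x, c, num_channels, n):
    """Polyphase filter output for run n.

    Step 1: M FIR filters, y_m(n) = sum_i c_i * x((n - i)*M + m).
    Step 2: a DFT over the M FIR outputs.
    The caller must supply at least (n + 1) * M input samples.
    """
    y = [
        sum(c_i * x[(n - i) * num_channels + m]
            for i, c_i in enumerate(c) if n - i >= 0)
        for m in range(num_channels)
    ]
    return dft(y)

if __name__ == "__main__":
    samples = [1, 0, 0, 0, 1, 1, 1, 1]  # two blocks of M = 4 samples
    coefficients = [0.5, 0.5]           # filter order N = 1, i.e. two taps
    print(fir_output([2, 4, 6], coefficients, 2))
    print(polyphase_output(samples, coefficients, 4, 1))
```

The sketch uses a single coefficient vector for all channels, exactly as the formula in Sec. 3.4 is written. Complex-valued samples work unchanged, because Python's arithmetic operators accept `complex` values.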
4. THE LOFAR POLYPHASE FILTER

In this section we present the implementation details common to all architectures we implemented the polyphase filter on, and how we measure performance. We focus on the implementation of the FIR filter, as we use third-party FFT libraries when possible.

4.1 Polyphase filter

In the LOFAR system, receivers are grouped into stations. As all stations are completely independent, we explain how the polyphase filter works for a single station.

A station has $N_{channels}$ channels, which each have two polarizations (X and Y). Polarizations are separate interleaved data streams that share the same FIR coefficients. There are a total of $2 N_{stations} N_{channels}$ polyphase filters. Each station combines the samples of its receivers and streams it to the LOFAR pipeline.

Samples from the stream are 4, 8, or 16-bit interleaved complex integers, which the polyphase filter first converts to 32-bit floating point. The FIR coefficients are 32-bit floating point real numbers. There is a coefficient for every channel and tap combination, but all stations and polarizations share the same coefficients.

The FIR delay line can be seen as a bounded FIFO buffer. When a new sample is processed it is stored in the front of the buffer, all other samples shift to the next tap, and the last sample is discarded.

After all FIRs of a given polarization have processed a sample, the FFT is computed. There are $2 N_{stations}$ FFTs of $N_{channels}$ length.

In our implementation the input samples are read from an input array, and the result is stored in an output array, which are large enough to store a number of samples described above for the $N_{stations}$ we want to process. We also use a delay line array and a coefficients array.

4.2 Measuring performance

In this section we explain how we measure the performance of our kernels.

4.2.1 Floating point operations

Computing the output of a FIR filter requires a number of multiply-add operations. There are $N_{taps}$ complex samples in the delay line. Each sample is multiplied by a real coefficient and these results are summed. This requires $2 N_{taps}$ floating point multiplications and $2(N_{taps} - 1)$ floating point additions. The total amount of FLOPs per FIR filter is thus $2 + 4(N_{taps} - 1)$.

Since we use third-party FFT libraries we do not know the exact number of FLOPs for the FFT, but it can be approximated as $5 N_{channels} \log_2(N_{channels})$ [9]. LOFAR only uses power of two FFTs, because those can be computed most efficiently.

4.2.2 Memory traffic

Computing the output of a FIR filter requires the following memory loads and stores:

- Read one (2x4 bit), (2x8 bit) or (2x16 bit) input sample, which is converted to a (2x32 bit) floating point sample. Note that for simplicity of the calculations we need to make we assume (2x16 bit) samples.
- Read $(N_{taps} - 1)$ (2x32 bit) samples from the delay line.
- Read $N_{taps}$ 32 bit coefficients.
- Write one (2x32 bit) output.
- Write one (2x32 bit) sample to the delay line.

So, the total amount of memory traffic for one FIR filter is $4 + 8(N_{taps} - 1) + 4 N_{taps} + 8 + 8 = (12 N_{taps} - 4) + 16 = 12 N_{taps} + 12$ bytes.

One FFT has in total $4 N_{channels}$ [9] complex floating point inputs and outputs, so the amount of memory traffic is $8 \cdot 4 N_{channels} = 32 N_{channels}$ bytes.

4.2.3 Peak performance

We use the Roofline model [16] to determine the maximum attainable performance of our implementation on a given architecture:

$$perf_{max} = \min(perf_{peak},\; MemoryBandwidth \times AI),$$

where:

- $perf_{max}$ is the maximum attainable floating point performance of our implementation on the given architecture (GFLOP/s).
- $perf_{peak}$ is the theoretical peak floating point performance of the architecture (GFLOP/s).
- $MemoryBandwidth$ is the peak memory bandwidth of the architecture (GB/s).
- $AI$ is the arithmetic intensity of the implementation, which is defined as the number of FLOPs per byte of memory traffic.

The AI of the polyphase filter is given in the following subsection.

Using the Roofline model we can determine whether our kernels are bounded by computational power of the processor or by the memory bandwidth. If the measured performance of a kernel is lower than $perf_{max}$, it is memory bound. Otherwise, it is compute bound. Note that because the Roofline model does not take all possible optimizations (such as caching) into account, there are cases when the measured performance is higher than $perf_{max}$.

4.2.4 Arithmetic intensity

To use the Roofline model, we must determine the arithmetic intensity of our kernel. Arithmetic intensity is defined as the number of FLOPs per byte of memory traffic, so we need to calculate both. We calculate the AI of the FIR filter and FFT separately.

$$
\begin{aligned}
FLOP_{fir} &= 2 + 4(N_{taps} - 1) \\
BytesAccessed_{fir} &= 12 N_{taps} + 12 \\
AI_{fir} &= FLOP_{fir} / BytesAccessed_{fir} \\
FLOP_{fft} &= 5 N_{channels} \log_2(N_{channels}) \\
BytesAccessed_{fft} &= 32 N_{channels} \\
AI_{fft} &= FLOP_{fft} / BytesAccessed_{fft}
\end{aligned}
\tag{1}
$$

Note that for some of our implementations there are certain optimizations which improve the AI, as explained in Sec. 5.

4.3 Parameters and metrics

We made test programs to measure the performance of our kernels based on general and implementation-specific parameters. The general parameters are: sample size, $N_{stations}$, $N_{channels}$, $N_{taps}$, and the number of input samples per channel $N_{runs}$ (in other words the number of times to run the polyphase filter). We call the act of starting the kernel to process a sample a run, and every run is performed in lock-step by all polyphase filters. Implementation-specific parameters include enabled optimizations (determined at compilation time) and additional command line parameters, for

Figure 1: Performance graph showing the impact of the number of taps and batches of the optimized FIR filter without I/O on the GTX 580 using CUDA.

required for the same amount of computation, the arithmetic intensity increases as $N_{batches}$ increases. We measured with $N_{batches}$ = 1, 2, 4, 8, 16, and 32, the latter giving the best performance. From the equation above we also know that a larger $N_{batches}$ does not give further performance increase. Table 2 shows the best case arithmetic intensity when $N_{batches} = 32$, and the maximum performance as determined by Roofline. The actual performance is much higher, because of caching [15] and our use of the constant memory which has a higher bandwidth than the global memory.

| $N_{taps}$ (x 32 batches) | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| $BytesAccessed_{fir,ref}$ | 60 | 108 | 204 | 396 | 780 |
| $BytesAccessed_{fir,opt}$ | 28 | 44 | 76 | 140 | 268 |
| $AI_{fir}$ | 0.49 | 0.67 | 0.81 | 0.90 | 0.95 |
| $perf_{max,fir,ref}$ | 44.9 | 53.4 | 58.5 | 61.2 | 62.7 |
| $perf_{max,fir,opt}$ | 96.2 | 131.2 | 157.0 | 173.2 | 182.3 |

| $N_{channels}$ | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|
| $AI_{fft}$ | 0.94 | 1.1 | 1.25 | 1.4 | 1.6 |
| $perf_{max,fft}$ | 180.9 | 211.6 | 240.5 | 269.36 | 307.8 |
Table 2: The maximum performance of the polyphase filter on the NVIDIA GTX 580, excluding host-to-device memory transfers. $perf_{peak}$ = 1581.1 GFLOP/s and $MemoryBandwidth$ = 192.4 GB/s.

5.2.2 Occupancy

Occupancy is a measure of how well the multiprocessor is utilized by a kernel which is based on the number of registers per thread, amount of shared memory per thread (although we do not use shared memory), and the number of threads per block. Best practice guidelines state that it should be as close to 100% as possible. Table 3 shows the occupancy for FIR filters of different lengths, which we computed using the CUDA Occupancy Calculator.

| $N_{taps}$ | Registers per thread | Max. threads per block | Total nr. of registers | Occupancy |
|---|---|---|---|---|
| 4 | 18 | 512 | 27648 | 100% |
| 8 | 26 | 512 | 26624 | 67% |
| 16 | 42 | 256 | 32256 | 50% |
| 32 | 74 | 128 | 28416 | 25% |
| 64 | 138 | 32 | 30912 | 15% |

Table 3: CUDA occupancy on compute capability 2.0. Registers per thread = $2 N_{taps} + 10$.

On the GTX 580, threads can use a maximum of 63 registers without spilling registers to device memory, and each multiprocessor has 32768 registers to allocate between threads in a warp [1]. Keeping that in mind, the table shows that the 16 taps FIR filter makes near optimal use of the available registers (32256 out of 32768 registers are used) without exceeding the max. registers/thread. This is reflected in the performance measurements shown in Figure 1, as this FIR filter is by far the best performing one. FIR filters with more taps exceed the max. registers/thread and therefore must spill registers, impacting their performance. Moreover, smaller FIR filters have higher occupancy but less performance than the 16 taps FIR filter, because the hardware is sub-optimally utilized.

This shows that higher occupancy does not imply better performance, and to get the best performance one should use as many registers as possible without exceeding the max. registers per thread. It also means our FIR filter implementation scales with the max. registers/thread, which is unfortunate as it is a hardware limit we cannot do anything about. As also implied by the table, we need a separate kernel for each $N_{taps}$, because the number of registers must be hard coded.

As shown in Table 3, the maximum size of a thread block depends on the number of taps. There is one thread for each channel and polarization in a station, so if $2 N_{channels} > MaxThreadsPerBlock$, we must use multiple thread blocks per station. $MaxThreadsPerBlock$ is given in Table 3. However, all thread blocks must have the same size, so we choose $ThreadsPerBlock$ and $BlocksPerStation$ such that:

$$2 N_{channels} = ThreadsPerBlock \times BlocksPerStation, \quad \text{where } ThreadsPerBlock \le MaxThreadsPerBlock$$

Our implementation computes $ThreadsPerBlock$ and $BlocksPerStation$ automatically, based on the number of channels and taps. The consequence of this dynamic sizing is that depending on the number of channels, thread blocks may be smaller than optimal, affecting performance (since the occupancy will be lower than shown in Table 3). We strongly recommend choosing $N_{channels}$ such that $ThreadsPerBlock = MaxThreadsPerBlock$.

5.2.3 I/O transfers

The input array is page locked (or pinned), write-combined, and mapped into device memory. This minimizes transfer overhead and the GPU can automatically overlap I/O transfers with computations. We did not apply this to the output array as it is supposed to be reused as input for the following pipeline stage kernel, while the mentioned optimizations only apply to device read-only or write-only data. These optimizations give a substantial I/O performance boost.

Figure 2: Performance of LOFAR scenarios: (a) GPUs excl. I/O, (b) FIR incl. I/O, (c) PPF incl. I/O.
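The thread block sizing rule of Sec. 5.2.2 can be sketched as follows. The text does not describe how ThreadsPerBlock and BlocksPerStation are actually computed, so the search for the largest divisor below is an assumption made for illustration, and the function name `block_sizing` is hypothetical. Only the per-tap limits are taken from Table 3.

```python
# Max. threads per block for each N_taps, taken from Table 3.
MAX_THREADS_PER_BLOCK = {4: 512, 8: 512, 16: 256, 32: 128, 64: 32}

def block_sizing(n_channels, n_taps):
    """Return (ThreadsPerBlock, BlocksPerStation) for one station.

    Constraints from Sec. 5.2.2:
      2 * N_channels == ThreadsPerBlock * BlocksPerStation
      ThreadsPerBlock <= MaxThreadsPerBlock
    Assumption: pick the largest block size that satisfies both, so that
    all thread blocks have the same size.
    """
    threads_per_station = 2 * n_channels  # one thread per channel and polarization
    limit = min(threads_per_station, MAX_THREADS_PER_BLOCK[n_taps])
    for threads_per_block in range(limit, 0, -1):
        if threads_per_station % threads_per_block == 0:
            return threads_per_block, threads_per_station // threads_per_block

if __name__ == "__main__":
    for n_channels, n_taps in [(256, 16), (64, 64), (96, 32)]:
        print(n_channels, n_taps, block_sizing(n_channels, n_taps))
```

With 96 channels and 32 taps the sketch falls back to blocks of 96 threads instead of the maximum of 128. This is the situation the text warns about, where thread blocks end up smaller than optimal.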
| Platform | Loop unrolling | Vectorization | Batching | I/O page-locking |
|---|---|---|---|---|
| Core i7 | ++ | +++ | n.a. | n.a. |
| GTX 580 | +++ | n.a. | +++ | +++ |
| HD 5870 | +++ | + | ++ | +++ |
| MicroGrid | ++ | n.a. | n.a. | n.a. |

Table 6: Summary of impact of optimizations.

the configuration we have chosen (64 GFLOP/s). The full polyphase filter achieves 39% of the peak performance. Both are significantly higher than the other platforms we have investigated.

6. EXPERIMENTS AND RESULTS

In this section we compare the optimized implementations of the FIR filter and the polyphase filter on the different platforms, using two criteria: performance of LOFAR scenarios and energy consumption.

LOFAR scenarios are the configurations of channels and taps used in practice by LOFAR. In these scenarios, when the number of channels doubles, the number of taps halves, and vice versa. This keeps the total FLOPs constant. The performance results are shown in Figure 2. Table 6 summarizes the impact of the optimizations we have applied.

To evaluate the energy consumption, we measured the energy consumption of the whole (desktop) computer using a Voltcraft Energy Check 3000. The results are presented in Table 5. We measured the minimum and maximum energy consumption of all LOFAR scenarios, but for readability we only show the average energy consumption of the 256x16 scenario. All measurements were taken with 16-bit samples. Finally, we show the amount of GFLOPs per Watt (GFLOPs/W) to gain insight into the actual energy efficiency. We have no measurements of the MicroGrid architecture, as there is no hardware for it yet.

We observe that the CUDA implementation on the GTX 580 gives the best performance in almost all cases. Note that the LOFAR scenarios do not achieve the highest possible performance. The highest performance we measured is 619 (FIR) or 576 (PPF) GFLOP/s with 64 stations x 1024 channels x 16 taps x 16-bit samples, excluding I/O transfers. Overall I/O has a huge impact on performance, reducing it by as much as 90%. The energy measurements show that the GTX 580 is both the most energy efficient and power hungry device. Compared to the GTX 480 it is not as energy efficient, but does achieve approximately 20% higher performance. Interestingly, in LOFAR scenarios where the occupancy is low (see Table 3), the power consumption is also low, because the device is underutilized.

The HD 5870 does not achieve the performance expected from its hardware specifications. We expected the vectorized implementation to perform better, because it makes better use of the vector registers, but there is little difference. We believe this is because the ATI OpenCL compiler does not yet generate good enough code. Another reason might be that register spilling is more costly as the registers are 128 bits wide, compared to 32 bits on the GTX 480/580. It consumes less power than the GTX 480, but is only one third as energy efficient.

The Intel Core i7 is in a lower performance class than the GPUs, but can be used more flexibly because, unlike the GPU implementations, performance scales linearly with the number of taps, and there are fewer hardware limitations in general. It is the second most energy efficient platform.

The MicroGrid implementation excels in the specific case of 64 channels x 64 taps, which is precisely a scenario where GPUs are not efficient. In other cases it is not so efficient, but one should keep in mind that the MicroGrid architecture is still in research so the performance is expected to improve in later versions of the simulator, and eventually hardware.

Concluding, the CUDA platform for NVIDIA GPUs is at the moment the most promising many-core platform for the LOFAR polyphase filter. However, we have observed that the implementation is highly I/O bound. This is due to the low bandwidth (8 GB/s) of the PCI Express 2.0 bus. To make GPUs worthwhile to use, the I/O transfer latencies must be hidden by performing many operations per byte of input/output. This can be achieved by computing the entire LOFAR pipeline on the GPU, keeping the data inside the GPU in between pipeline stages.

7. CONCLUSIONS

We have discussed and compared the implementation of an efficient polyphase filter on the Core i7, GTX 480/580, HD 5870, and MicroGrid architectures. We have shown that