A Polyphase Filter for GPUs and Multi-Core Processors
Karel van der Veldt, Universiteit


Rob van Nieuwpoort, Vrije Universiteit Amsterdam, The Netherlands
Ana Lucia Varbanescu, Technische Universiteit Delft, The Netherlands
Chris Jesshope, Universiteit van Amsterdam, The Netherlands


MicroGrid architecture, exposing the programmability and performance abilities of this research architecture. Finally, both the optimizations and the results presented can be used to implement the entire pipeline (or other signal processing kernels) on many-core platforms.

2. RELATED WORK

In this section we discuss other work related to FIR filter and polyphase filter implementations.

In their paper [15], Rob V. van Nieuwpoort and John W. Romein describe their optimized implementation of the LOFAR correlator on various multi-core platforms. The best performance is achieved on the IBM Cell/B.E. (full blade), reaching 91% of peak performance, compared to 96% on the BlueGene/P. The Cell/B.E. is also 3.9x more energy efficient than the BG/P.

In 2005, Smirnov and Chiueh described a GPGPU implementation of a FIR filter using OpenGL [14]. At the time, CUDA and OpenCL did not exist yet.

An implementation of a polyphase filter on the Cell Broadband Engine that is similar to ours was presented by Hamilton in his master's thesis [6]. His results show that the implementation is over 6x more efficient than on a normal processor, depending on the amount of input.

The master's thesis by Pettersson and Wainwright [11] discusses the implementation and performance of FIR filters on CUDA and OpenCL. They achieve good performance on CUDA, but they do not provide much detail on the actual implementation. Their FIR filter parameters also differ from ours.

The SPIRAL Project [13] researches automatic code generation for the development and optimization of DSP algorithms and other numerical kernels, including FIR filters and FFTs. The generated code outperforms existing, handwritten libraries, but is not very flexible and there is no GPU code generation.

Overall, we believe that although signal processing in general and FIR filters in particular are of interest to the many-core community, this is the first thorough study of FIR filters using so many platforms, programming models, and performance metrics.

3. SIGNAL PROCESSING BACKGROUND

In this section we give a short description of the signal processing concepts required to understand polyphase filters.

3.1 Signals

A signal is defined as any physical quantity that varies with time, space, or other independent variable(s) [12]. A signal can be mathematically described as a function of one or more independent variables. In this work, we are only interested in discrete signals. Discrete signals can be obtained by sampling at (usually) equally spaced intervals from an analog signal source. In our case, LOFAR antennas sample discrete, complex-valued samples, using sampling frequencies of 160 or 200 MHz.

3.2 FIR filter

A Finite Impulse Response (FIR) filter multiplies a finite number of recent input signals (impulses) relative to a given discrete time by coefficients (impulse responses) and accumulates the results. It can be described mathematically as

  $y(n) = \sum_{i=0}^{N} c_i \, x(n-i)$

where:

  $y(n)$ is the output signal at discrete time $n$.
  $x(n)$ is the input signal at discrete time $n$.
  $c_i$ are the coefficients, also called weights.
  $N$ is the number of recent signals to consider, called the filter order.

The terms on the right-hand side of the equation are called taps. An $N$th order FIR filter has $N+1$ taps. A FIR filter must remember its last $N$ input samples, which are stored in what is called the delay line. One can design a FIR filter by carefully choosing the filter order and coefficients such that the system has specific characteristics. For the purpose of our work, the values of the coefficients are irrelevant as they do not affect the implementation. While generally it is possible to reduce the complexity of FIR filters by strength reduction [10], this is not feasible for us as it involves designing a specific FIR filter for a specific set of coefficients. In LOFAR there are hundreds of different FIR configurations, all of which can be changed at any time.

3.3 Discrete Fourier Transform

A Fourier transform splits a sequence of input signals into a sequence of frequencies. In doing so it transforms the input from the time domain to the frequency domain. It can be compared to how a prism splits white light into separate light beams of a single frequency. A DFT operates on discrete signals and can be described mathematically as

  $f_k = \sum_{n=0}^{N-1} x(n) \, e^{-i \frac{2\pi}{N} n k}$

where:

  $x(n)$ is an input signal; there are $N$ input signals.
  $f_k$ is the $k$th frequency and is a complex number, $k = 0, 1, 2, \ldots, N-1$.

The complexity of this algorithm is $O(N^2)$, since computing any of the $N$ frequencies requires iterating over $N$ inputs. DFTs are not used directly in practice, because there are better algorithms known as Fast Fourier Transforms (FFT) which have a complexity of only $O(N \log_2(N))$ [4].

3.4 Polyphase filter

Polyphase filters are used by LOFAR to channelize input streams and reduce interference. They split an input sequence into $N$ subsequences of $M$ samples, where each subsequent input signal is the input to one of $M$ FIR filters (or channels). This can be described mathematically as

  $y_m(n) = \sum_{i=0}^{N} c_i \, x((n-i)M + m)$

where:

  $N$ is the number of recent samples to consider (the filter order).
  $M$ is the number of FIR filters (channels).
  $y_m(n)$ is the $n$th output signal of the $m$th FIR filter, $m = 0, 1, 2, \ldots, M-1$.

The $M$ outputs $y_m(n)$ are used as inputs to a DFT as described in the previous subsection. The output of the DFT is the output of the polyphase filter.

4. THE LOFAR POLYPHASE FILTER

In this section we present the implementation details common to all architectures on which we implemented the polyphase filter, and how we measure performance. We focus on the implementation of the FIR filter, as we use third-party FFT libraries when possible.

4.1 Polyphase filter

In the LOFAR system, receivers are grouped into stations. As all stations are completely independent, we explain how the polyphase filter works for a single station.

A station has $N_{channels}$ channels, which each have two polarizations (X and Y). Polarizations are separate interleaved data streams that share the same FIR coefficients. There are a total of $2 \, N_{stations} \, N_{channels}$ polyphase filters. Each station combines the samples of its receivers and streams them to the LOFAR pipeline.

Samples from the stream are 4, 8, or 16-bit interleaved complex integers, which the polyphase filter first converts to 32-bit floating point. The FIR coefficients are 32-bit floating point real numbers. There is a coefficient for every channel and tap combination, but all stations and polarizations share the same coefficients.

The FIR delay line can be seen as a bounded FIFO buffer. When a new sample is processed it is stored in the front of the buffer, all other samples shift to the next tap, and the last sample is discarded. After all FIRs of a given polarization have processed a sample, the FFT is computed. There are $2 \, N_{stations}$ FFTs of length $N_{channels}$.

In our implementation the input samples are read from an input array, and the result is stored in an output array. Both are large enough to store the number of samples described above for the $N_{stations}$ we want to process. We also use a delay line array and a coefficients array.

4.2 Measuring performance

In this section we explain how we measure the performance of our kernels.

4.2.1 Floating point operations

Computing the output of a FIR filter requires a number of multiply-add operations. There are $N_{taps}$ complex samples in the delay line. Each sample is multiplied by a real coefficient and these results are summed. This requires $2 N_{taps}$ floating point multiplications and $2(N_{taps} - 1)$ floating point additions. The total amount of FLOPs per FIR filter is thus $2 + 4(N_{taps} - 1)$.

Since we use third-party FFT libraries we do not know the exact number of FLOPs for the FFT, but it can be approximated as $5 N_{channels} \log_2(N_{channels})$ [9]. LOFAR only uses power-of-two FFTs, because those can be computed most efficiently.

4.2.2 Memory traffic

Computing the output of a FIR filter requires the following memory loads and stores:

  Read one (2x4 bit), (2x8 bit) or (2x16 bit) input sample, which is converted to a (2x32 bit) floating point sample. Note that to simplify the calculations we assume (2x16 bit) samples.
  Read $(N_{taps} - 1)$ (2x32 bit) samples from the delay line.
  Read $N_{taps}$ 32 bit coefficients.
  Write one (2x32 bit) output.
  Write one (2x32 bit) sample to the delay line.

So, the total amount of memory traffic for one FIR filter is

  $4 + 8(N_{taps} - 1) + 4 N_{taps} + 8 + 8 = (12 N_{taps} - 4) + 16 = 12 N_{taps} + 12$ bytes.

One FFT has in total $4 N_{channels}$ [9] complex floating point inputs and outputs, so the amount of memory traffic is $8 \cdot 4 N_{channels} = 32 N_{channels}$ bytes.

4.2.3 Peak performance

We use the Roofline model [16] to determine the maximum attainable performance of our implementation on a given architecture:

  $perf_{max} = \min(perf_{peak}, \; MemoryBandwidth \times AI)$

where:

  $perf_{max}$ is the maximum attainable floating point performance of our implementation on the given architecture (GFLOP/s).
  $perf_{peak}$ is the theoretical peak floating point performance of the architecture (GFLOP/s).
  $MemoryBandwidth$ is the peak memory bandwidth of the architecture (GB/s).
  $AI$ is the arithmetic intensity of the implementation, which is defined as the number of FLOPs per byte of memory traffic. The AI of the polyphase filter is given in the following subsection.

Using the Roofline model we can determine whether our kernels are bounded by the computational power of the processor or by the memory bandwidth. If the measured performance of a kernel is lower than $perf_{max}$, it is memory bound. Otherwise, it is compute bound. Note that because the Roofline model does not take all possible optimizations (such as caching) into account, there are cases when the measured performance is higher than $perf_{max}$.

4.2.4 Arithmetic intensity

To use the Roofline model, we must determine the arithmetic intensity of our kernel. Arithmetic intensity is defined as the number of FLOPs per byte of memory traffic, so we need to calculate both. We calculate the AI of the FIR filter and FFT separately.

  $FLOP_{fir} = 2 + 4(N_{taps} - 1)$
  $BytesAccessed_{fir} = 12 N_{taps} + 12$
  $AI_{fir} = FLOP_{fir} / BytesAccessed_{fir}$
  $FLOP_{fft} = 5 N_{channels} \log_2(N_{channels})$
  $BytesAccessed_{fft} = 32 N_{channels}$
  $AI_{fft} = FLOP_{fft} / BytesAccessed_{fft}$          (1)

Note that for some of our implementations there are certain optimizations which improve the AI, as explained in Sec. 5.

4.3 Parameters and metrics

We made test programs to measure the performance of our kernels based on general and implementation-specific parameters. The general parameters are: sample size, $N_{stations}$, $N_{channels}$, $N_{taps}$, and the number of input samples per channel $N_{runs}$ (in other words, the number of times to run the polyphase filter). We call the act of starting the kernel to process a sample a run, and every run is performed in lock-step by all polyphase filters. Implementation-specific parameters include enabled optimizations (determined at compilation time) and additional command line parameters, for [...]

Figure 1: Performance graph showing the impact of the number of taps and batches of the optimized FIR filter without I/O on the GTX 580 using CUDA.

[...] required for the same amount of computation, the arithmetic intensity increases as $N_{batches}$ increases. We measured with $N_{batches}$ = 1, 2, 4, 8, 16, and 32, the latter giving the best performance. From the equation above we also know that a larger $N_{batches}$ does not give further performance increase. Table 2 shows the best case arithmetic intensity when $N_{batches} = 32$, and the maximum performance as determined by Roofline. The actual performance is much higher, because of caching [15] and our use of the constant memory, which has a higher bandwidth than the global memory.

  $N_{taps}$ x 32 batches           4       8      16      32      64
  $BytesAccessed_{fir,ref}$        60     108     204     396     780
  $BytesAccessed_{fir,opt}$        28      44      76     140     268
  $AI_{fir}$                     0.49    0.67    0.81    0.90    0.95
  $perf_{max,fir,ref}$           44.9    53.4    58.5    61.2    62.7
  $perf_{max,fir,opt}$           96.2   131.2   157.0   173.2   182.3

  $N_{channels}$                   64     128     256     512    1024
  $AI_{fft}$                     0.94     1.1    1.25     1.4     1.6
  $perf_{max,fft}$              180.9   211.6   240.5  269.36   307.8

Table 2: The maximum performance (GFLOP/s) of the polyphase filter on the NVIDIA GTX 580, excluding host-to-device memory transfers. $perf_{peak}$ = 1581.1 GFLOP/s and $MemoryBandwidth$ = 192.4 GB/s.

5.2.2 Occupancy

Occupancy is a measure of how well the multiprocessor is utilized by a kernel. It is based on the number of registers per thread, the amount of shared memory per thread (although we do not use shared memory), and the number of threads per block. Best practice guidelines state that it should be as close to 100% as possible. Table 3 shows the occupancy for FIR filters of different lengths, which we computed using the CUDA Occupancy Calculator.

  $N_{taps}$   Registers    Max. threads   Total nr.       Occupancy
               per thread   per block      of registers
   4            18          512            27648           100%
   8            26          512            26624            67%
  16            42          256            32256            50%
  32            74          128            28416            25%
  64           138           32            30912            15%

Table 3: CUDA occupancy on compute capability 2.0. Registers per thread = $2 N_{taps} + 10$.

On the GTX 580, threads can use a maximum of 63 registers without spilling registers to device memory, and each multiprocessor has 32768 registers to allocate between threads in a warp [1]. Keeping that in mind, the table shows that the 16-tap FIR filter makes near optimal use of the available registers (32256 out of 32768 registers are used) without exceeding the maximum number of registers per thread. This is reflected in the performance measurements shown in Figure 1, as this FIR filter is by far the best performing one. FIR filters with more taps exceed the maximum number of registers per thread and therefore must spill registers, impacting their performance. Moreover, smaller FIR filters have higher occupancy but lower performance than the 16-tap FIR filter, because the hardware is sub-optimally utilized.

This shows that higher occupancy does not imply better performance, and to get the best performance one should use as many registers as possible without exceeding the maximum number of registers per thread. It also means our FIR filter implementation scales with the maximum number of registers per thread, which is unfortunate as it is a hardware limit we cannot do anything about. As also implied by the table, we need a separate kernel for each $N_{taps}$, because the number of registers must be hardcoded.

As shown in Table 3, the maximum size of a thread block depends on the number of taps. There is one thread for each channel and polarization in a station, so if $2 N_{channels} > MaxThreadsPerBlock$, we must use multiple thread blocks per station. $MaxThreadsPerBlock$ is given in Table 3. However, all thread blocks must have the same size, so we choose $ThreadsPerBlock$ and $BlocksPerStation$ such that:

  $2 N_{channels} = ThreadsPerBlock \times BlocksPerStation$, where $ThreadsPerBlock \le MaxThreadsPerBlock$

Our implementation computes $ThreadsPerBlock$ and $BlocksPerStation$ automatically, based on the number of channels and taps. The consequence of this dynamic sizing is that, depending on the number of channels, thread blocks may be smaller than optimal, affecting performance (since the occupancy will be lower than shown in Table 3). We strongly recommend choosing $N_{channels}$ such that $ThreadsPerBlock = MaxThreadsPerBlock$.

5.2.3 I/O transfers

The input array is page-locked (or pinned), write-combined, and mapped into device memory. This minimizes transfer overhead, and the GPU can automatically overlap I/O transfers with computations. We did not apply this to the output array as it is supposed to be reused as input for the following pipeline stage kernel, while the mentioned optimizations only apply to device read-only or write-only data. These optimizations give a substantial I/O performance boost.

Figure 2: Performance of LOFAR scenarios: (a) GPUs excl. I/O, (b) FIR incl. I/O, (c) PPF incl. I/O.

  Platform    Loop        Vectorization   Batching   I/O page-locking
              unrolling
  Core i7     ++          +++             n.a.       n.a.
  GTX 580     +++         n.a.            +++        +++
  HD 5870     +++         +               ++         +++
  MicroGrid   ++          n.a.            n.a.       n.a.

Table 6: Summary of impact of optimizations.

[...] the configuration we have chosen (64 GFLOP/s). The full polyphase filter achieves 39% of the peak performance. Both are significantly higher than the other platforms we have investigated.

6. EXPERIMENTS AND RESULTS

In this section we compare the optimized implementations of the FIR filter and the polyphase filter on the different platforms, using two criteria: performance of LOFAR scenarios and energy consumption.

LOFAR scenarios are the configurations of channels and taps used in practice by LOFAR. In these scenarios, when the number of channels doubles, the number of taps halves, and vice versa. This keeps the total FLOPs constant. The performance results are shown in Figure 2. Table 6 summarizes the impact of the optimizations we have applied.

To evaluate the energy consumption, we measured the energy consumption of the whole (desktop) computer using a Voltcraft Energy Check 3000. The results are presented in Table 5. We measured the minimum and maximum energy consumption of all LOFAR scenarios, but for readability we only show the average energy consumption of the 256x16 scenario. All measurements were taken with 16-bit samples. Finally, we show the amount of GFLOPs per Watt (GFLOPs/W) to gain insight into the actual energy efficiency. We have no measurements of the MicroGrid architecture, as there is no hardware for it yet.

We observe that the CUDA implementation on the GTX 580 gives the best performance in almost all cases. Note that the LOFAR scenarios do not achieve the highest possible performance. The highest performance we measured is 619 (FIR) or 576 (PPF) GFLOP/s with 64 stations x 1024 channels x 16 taps x 16-bit samples, excluding I/O transfers. Overall, I/O has a huge impact on performance, reducing it by as much as 90%. The energy measurements show that the GTX 580 is both the most energy efficient and power hungry device. Compared to the GTX 480 it is not as energy efficient, but does achieve approximately 20% higher performance. Interestingly, in LOFAR scenarios where the occupancy is low (see Table 3), the power consumption is also low, because the device is underutilized.

The HD 5870 does not achieve the performance expected from its hardware specifications. We expected the vectorized implementation to perform better, because it makes better use of the vector registers, but there is little difference. We believe this is because the ATI OpenCL compiler does not yet generate good enough code. Another reason might be that register spilling is more costly, as the registers are 128 bits wide, compared to 32 bits on the GTX 480/580. It consumes less power than the GTX 480, but is only one third as energy efficient.

The Intel Core i7 is in a lower performance class than the GPUs, but can be used more flexibly because, unlike the GPU implementations, performance scales linearly with the number of taps, and there are fewer hardware limitations in general. It is the second most energy efficient platform.

The MicroGrid implementation excels in the specific case of 64 channels x 64 taps, which is precisely a scenario where GPUs are not efficient. In other cases it is not so efficient, but one should keep in mind that the MicroGrid architecture is still in research, so the performance is expected to improve in later versions of the simulator, and eventually hardware.

Concluding, the CUDA platform for NVIDIA GPUs is at the moment the most promising many-core platform for the LOFAR polyphase filter. However, we have observed that the implementation is highly I/O bound. This is due to the low bandwidth (8 GB/s) of the PCI Express 2.0 bus. To make GPUs worthwhile to use, the I/O transfer latencies must be hidden by performing many operations per byte of input/output. This can be achieved by computing the entire LOFAR pipeline on the GPU, keeping the data inside the GPU in between pipeline stages.

7. CONCLUSIONS

We have discussed and compared the implementation of an efficient polyphase filter on the Core i7, GTX 480/580, HD 5870, and MicroGrid architectures. We have shown that