/
FPGAbased Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy FPGAbased Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy

FPGAbased Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy - PDF document

mitsue-stanley
mitsue-stanley . @mitsue-stanley
Follow
493 views
Uploaded On 2014-12-24

FPGAbased Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy - PPT Presentation

Rogers Sr Department of Electrical and Computer Engineering Department of Medical Biophysics Ontario Cancer Institute University of Toronto Toronto Ontario Canada Abstract Photodynamic therapy PDT is a method of treating cancer that combines light a ID: 28946

Rogers Department

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "FPGAbased Monte Carlo Computation of Lig..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

FPGA-basedMonteCarloComputationofLightAbsorptionforPhotodynamicCancerTherapyJasonLuu,KeithRedmond,WilliamChunYipLo,PaulChow,LotharLilge,JonathanRoseTheEdwardS.RogersSr.DepartmentofElectricalandComputerEngineering Fig.1.Photonpropagationinarealistic,Þve-layerskinmodelusinganinÞnitelynarrowbeamat633nmphotonpacketswithinthetissuelayersfromaninÞnitelynarrowbeamirradiatingthetissuevolumeattheskinsurface.ThesecondgoalofthisworkistoassistotherFPGAdevelopersinacceleratingsimilarcomputationallyintensiveapplications.Therefore,thedevelopmentmethodology,timerequiredforeachstage,techniquestoleveragethefeaturesofFPGAs,andpitfallsencounteredwillbepresented.TheremainderofthispaperdiscussestheFPGA-basedhardwaredesign,namedhereFPGA-basedMCML(FBM). [8].Aconvolution-basedalgorithmusedinradiationdosecalculationsonanFPGAachieveda20.7-foldspeedup[9].However,theyadoptedaverydifferentdesignßowwiththeuseofHandel-C[10].Also,theirspeedupvalueswereprojectedusingresultsfromModelSimbecausetheirdesignwastoolargetoÞtontheirFPGA.Similarly,[11]presentedonlyapartialimplementationofanMC-basedcomputationforradiotherapywithoutanyspeedupÞgures.AworkingFPGAimplementationofMC-basedelectrontransportwasshownin[12],reportingspeedupbetween300and500-foldcomparedtotheircustomsoftwarerunningona64-bitAMDOpteron2.4GHzmachine.Theirworkfocusedonradiationtransportcomputations,whichhavesomesimilaritiestoourwork,butinvolvefundamentallydifferentphysicalinterac-tions,suchaselectronimpactionizationeventsduetohigh-energybeams.ThisworkpresentsaworkingimplementationofthegoldstandardMCphotonmigrationsimulationcalledMCML(describedinthenextsection)onFPGAhardware.OurworkfocusesonphotontransportforPDT.Anearlierpaperdescribingpartofthisworkwaspublishedinabiomedicaljournal,whichfocusesmoreonthebiomedicalaspectsoftheprojectandalsoreportsresultsbasedonolderFPGAsandprocessorsforperformancecomparison[13].B.MCML1)Overview:TheMCMLcode[4]providesanMCmodelofsteady-statelighttransportinmulti-layeredturbidmedia.ItassumesinÞnitelywide,planarlayersandmodelsanincidentpencilbeamperpendiculartothesurfaceofthemedia.Theprogramtakesthemediapropertiesasinputsandgeneratesaphotondistributioninthemediaasoutput.ThreephysicalquantitiesarescoredspatiallyinMCMLÐabsorption,reßectance,andtransmittance.ForthepurposesofPDTtreatmentplanning,onlyabsorptionisconsidered.MCMLexploitssymmetryofthelightsourceandhomo-geneityoftheindividuallayerstoreducecomputationtime.InMCML,thismeansscoringtheabsorptionin2D(insteadof3D().Absorptioninthetissueisstoredinthether][z]array,whichrepresentsthephotonabsorptionprobabilitydensity[]normalizedtothetotalnumberofphotonpacketslaunched.IftheabsorptionprobabilitydensityisdividedbythelocalabsorptioncoefÞcient,photonphotoncmŠ2]canbecalculatedtogeneratetheisoßuencelines,whichrepresentregionswithequallightdose.Thesecontourlinescanbesuperimposedontopofthecorrespond-ingtissuestructurestoshowhowclosethesimulatedlightdoseistothedesireddoseintreatmentplanning.DespitetheuseofphotonpacketsinMCML,millionsofphotonpacketsarerequiredtogeneratelow-noiseisoßuencemaps.Eachphotonpacketundergoesthreekeystepsthatarerepeatedcontinuouslyuntilitisterminatedbyasurvivalrouletteorbyexitingthetissue:positionupdate,directionupdate,andßuenceupdate.2)PositionUpdate:ThepositionupdatestepmovesthephotonpackettoitsnextinteractionsitebyastepsizeobtainedfromsamplingaprobabilitydistributionbasedonthephotonpacketÕsmeanfreepathbetweeninteractionevents.ThestepsizeiscalculatedusingEq.(1) isauniformrandomvariable,scatteringandabsorptioncoefÞcients.Thisstepsizemayresultinthephotonpackettraversingaboundary.Thisconditioniscalledaandisdeterminedifs-( ifs-( b/µz)�0(2)wheredl isthedistancetotheclosestboundaryinthedirectionofphotonpropagationandisthedirectioncosine(zdirection).Ifthephotonpacketcrossesaboundary,thestepsizeisreducedsothatthephotonpacketarrivesattheboundary.ThedifferencebetweentheoriginalstepsizeandthereducedstepsizeiscalledandiscalculatedusingEq.(3).isusedasthestepsizeforthephotonpacketÕsnextiteration. ThenewpositionforthephotonisdeterminedbyÞrstmultiplyingthestepsizebythedirectioncosinesinthex,y,andzdirections(respectively)thenaddingthesevaluestotheoldposition.3)DirectionUpdate:Thedirectionupdatestepperformstwomutuallyexclusiveoperationsdependingonwhetherornotthephotonencountersaboundaryduringthepositionupdatestep.Ifthephotondoesnotencounteraboundary,thescatteringangleiscomputedasfollows: 1Šµ2z+µx)(4)µy=yµzx 1Šµ2z+µy)(5)µz=Š)) where()representsthenewdirection.Ifthephotonencountersaboundary,thedirectionupdatestepdetermineswhetheritreßectsortraversesthrough,andupdatesthedirectionaccordingly.WhetherornotaphotonreßectsisdeterminedbyaseriesofcomplexFresnelcalculationsexplainedindetailin[4].4)FluenceUpdate:TheßuenceupdatestepadjuststhephotonpacketÕsweighttosimulateabsorptionatthesiteofinteraction.Thedifferentialweighttobeabsorbediscomputedbasedonthecurrentweightasfollowsandisaccumulatedintheabsorptionarrayarrayr][z]atthelocationofabsorption: µa+µs(7) aretheopticalcoefÞcientsofthecurrentlayer.5)PhotonTermination:MCMLterminatesaphotonwhenitexitsthetissueorthroughasurvivalroulettewhenthephotonweighthasreachedapredeÞnedthresholdvalue.Whentheweightreachesthethreshold,thesurvivalroulettegeneratesarandomnumberbetween0and1.Iftherandomnumberisabove1/10,thephotonpacketisterminated;otherwise,theweightofthephotonpacketisincreasedbyafactorof10tomaintaintheconservationofenergyintheIII.DEVELOPMENTOFBASEDThedevelopmentprocessofthisworkispresentedheretoassistotherdevelopersinterestedinacceleratingsimilarcomputationallyintensiveapplicationsonanFPGA.ThedevelopmentofFBMstartedwiththeMCMLsoft-wareandendedwithaworkingdesignontheDE3FPGAdevelopmentboard[1].ThismappingfromsoftwaretohardwarerequiredtwoadditionaltypesofspeciÞcations:structuralandcycle-accuratespeciÞcations.TheÞrststageofFBMdevelopmentwasanearlyarchi-tectureexplorationofvariousdatapathstoselectonethatmatchedwellwiththeresourcesavailableontheFPGAchip.Thesecondstagewasthemodellingofacycle-accuratepipelineofthearchitectureselectedfromtheÞrststageusingSystemC[2].ThethirdstageconsistedofaVerilogconversionwithprecisestructuralspeciÞcationtoproduceaworkinghardwaredesign.TheÞnalstageincludedoptimizationstoproduceafasterandmoreaccuratedesign.A.EarlyArchitectureModeling(2person-months)Thisstagefocusedonunderstandingthearchitecturaltrade-offsavailableandselectinganappropriatearchitecturealongwiththenumberandtypeofeachcorecomputationalunitforthatarchitecture.Tofacilitatethistask,thenumberofLUTs,registers,multipliers,andmemoriesrequiredforvariousimplementationsofeachtypeofcomputationweremeasuredusingempiricalexperiments.Itbecameclearinthisstagethataconversionfromßoating-pointtoÞxed-pointdatarepresentationwasnecessaryinordertoobtainbothgoodspeedupandtoaccommodatethedesignonthetargetFPGAplatform.Furthermore,trigonometryandlogarithmoperationswereconvertedintolookuptablesduringthisstage.ThemajorityoftimeinearlyarchitecturemodelingwasspentmodelingandtuningthisÞxed-pointconversionandlookuptablesusinganempiricalapproachtoensureacceptableaccuracywasmaintained.B.SystemC(5person-months)SystemCisadesignandveriÞcationlanguagethatcon-sistsofasetofClibraryextensionstomodelhardwaresystemsatahighlevel[2].SystemCwasusedtoaddcycle-accuratetimingtotheuntimedearlyarchitecturemodelofthepreviousstage.Thisstageconsistedoftwosteps.TheÞrststepwastodeÞnetheinputsandoutputsofeachcore,usingalatencyofonecycleforsimplicity.ThesecondstepinvolvedpreciselymodelingthelatencybydeÞningthepipelinestages.AsimplecontrollerwasimplementedinSystemCtocontrolandtestthesimulation.C.Hardware(1person-month)AfterthecompletionoftheSystemCmodelofthedesign,convertingthemodelsintoVerilog,completingthesystemcontroller,andincludingvendor-speciÞcinformationwerestraightforward.TheshorttimerequiredtoproceedfromthesoftwaremodelsinSystemCtoaworkinghardwareimplementationwassurprising.Onefactormayhavebeenthattheprecisecomputationalongwithvaluesfortheinputs,outputs,andkeyinternalsignalsofeachcorewerealreadyavailablefromtheSystemCstagethusspeedingupthedevelopment,veriÞcation,andtestingofeachcore.TheendofthisdevelopmentstageresultedinaworkingimplementationofMCMLontheTransmogriÞer-4(TM-4)(TM-4)D.OptimizationandPorting(4person-months)Theinitial,unoptimizedhardwareshowedanincreaseinerrorof6%inthesimulationoutputwhenthenumberofphotonpacketsexceededthevalidationtestrangeoftheearlierstagesofdevelopment.Todeterminethesourceoferror,empiricaltestsonaccuracyweredoneonthevariousÞxed-pointconversionsandtherandomnumbergenerator.ThemostsigniÞcantsourceofinaccuracywasthecorrelationofvaluesproducedbytherandomnumbergeneratorfrom[15].Amorerigorousimplementation[16]increasedtheaccuracyofthesystem(1person-month).Variousoptimiza-tiontechniques,suchasregister-retiming,wereappliedtoimprovetheclockspeedofthedesign(2person-months).Oncetheoptimizedhardwaresystemwasfullyvalidated,thedesignwasmigratedfromtheTM-4[14]totheDE3boardtoprovideamoremoderncomparisonthantheolderStratix[17]chipsontheTM-4(1person-month).IV.HARDWAREA.ModiÞcationsfromSoftwareSimpliÞcationsweremadetoMCMLtomeethardwaredesignrequirementsandtotailorthesolutionforPDTtreatmentplanning.First,sinceßuenceisthequantityofinterestinPDTtreatmentplanning,onlyabsorptionwasrecorded;thereßectanceandtransmittancewereignoredtoreducethememoryresourcerequirementsinhardware.ThesizeoftheabsorptionarraywasÞxedat256by256.Second,thenumberofallowablelayersinthemediumwasrestrictedtoÞvetoreducememoryrequirements.Toreducetheareaofthedesign,64-bitßoating-pointoperationswereconvertedinto32-bitÞxed-pointoperationsandlookuptablesfortrigonometricandlogarithmicfunc-tionswereused.ThisconversiontoÞxed-pointrequiredthecreationofagridofvalidphotonpositions,asa32-bitÞxed-pointnumbermayonlytakeonvalues.Thisgridmayhaveadifferentscaleinther-andz-directions,whichcanbeconceptualizedashavingdifferentunitsofmeasure.Therefore,itwas Stratix III EP3SL150 Master Controller Photon Simulator Memories JTAG I/O (lookup tables, absorption array) Fig.2.Top-levelsystemdesignnecessarytogeneratethestepsizeinbothther-andz-directions(respectively),andconversionbetweenthesetwoisperformedusingascalingfactor(slefthandledinasimilarfashion).Thisimprovestheprecisionofthesimulationspacewithvaryinginputgeometries.B.Implementation1)Architecture:Thehardwaredesignconsistsoftwomaincomponents,namelyaMasterControllerandaPhotonSimulator.TheMasterControllercommunicateswiththehostcomputer,coordinatesaccesstothememories,andstartsandterminatesthesimulations.ThePhotonSimulatorcomputesthepositionanddirectionofthephotonpacket,theweightabsorbedwiththecorrespondinglocationofabsorption,anddetermineswhentoterminateeachphotonpacket.Fig.2showsthetop-levelsystemdesignofFBM.Theperformance-criticalportionofthedesignisthePho-tonSimulator,duetothelowoverhead(parsingofinputÞle,systeminitialization,anddatatransfer)inthissimulation.Twomethodsforacceleratingtheproblemwereconsidered:maximizingthethroughputofasinglePhotonSimulatorthroughtheuseofadeeppipeline,orminimizingthesizeofaPhotonSimulatorandreplicatingittoachieveparallelism.Itwasdeterminedthatduetothecomputationallycomplexnatureoftheproblem,replicatingcoreswouldnotefÞcientlyproducespeedup.Thus,onehigh-performance,pipelinedPhotonSimulatorwascreated.AblockdiagramofthepipelinedPhotonSimulatorispresentedinFig.3.ThePhotonSimulatorisaggressivelypipelined,leadingtoapipelinethatis100stagesdeepintotal.Thepipelineisfullatalltimesexceptatthebeginningandtheendofasimulation;however,thetimeduringwhichthepipelineisnotfullislessthan0.1%foranysimulationwithatleast100,000photonpackets.Torepresentaphotonpacket,theparametersshowninTableIneedtobemaintainedineverystageofthepipeline.Inordertokeeptrackofeachphotonpacketwithinthepipeline,357registersarerequired,yieldingatotalof35,700registerstomaintainphotonpacketinformationforallphotonpacketsinthepipeline.ThiseasilyÞtsinthetargetEP3SL150FPGAdevice[18],whichcontains113,600registers,leaving77,900fortheremainderofthedesign.2)PositionUpdateEngine:ThePositionUpdateEngineupdatesthex,y,andzcoordinatesofaphotonpacket,andconsistsofthreecores:theStepSizeCore,Boundary 1. Step Size 2.Boundary Checker Transmit 5. RouletteCore New photon Position Update Engine 4d.Fluence Previous photon 4c. Shared Arithmetic Fig.3.PipelinedarchitectureofPhotonSimulatorTABLEIEGISTEREDDATADESCRIBINGAPHOTONPACKETINAPIPELINESTAGE Symbol BitWidth xcoordinate x-pos 32 ycoordinate y-pos 32 zcoordinate z-pos 32 directioncosine(x) µx 32 directioncosine(y) µy 32 directioncosine(z) µz 32 layer layer 3 hitboundary? hit 1 dead? dead 1 zstepsize sz 32 rstepsize sr 32 zstepsizeremainder sleftz 32 rstepsizeremainder sleftr 32 weight W 32 CheckerCore,andtheMovementCore.Fig.4showsthedataßowdiagramforthePositionUpdateEngine.TheStepSizeCorecomputesthemeanfreepath(stepsize)ofthephotonpacketasshowninEq.(1).Thevaluesareassignedthroughappropriatescalingofthisstepsize.Logarithmtendstobeverysensitivetosmallnumbersandveryinsensitivetolargenumbers.ThefollowingcommonidentitywasusedtoefÞcientlyapproximatethelogarithmusedinEq.(1)ontheFPGAwhileretainingitssensitivity)=log=log+logisthebase-2exponentofisthebase-2signiÞcandofisauniformlydistributedrandomvariablewitharangefrom0to1andisimplementedusing[16].isdeterminedusingapriorityencoderonbecauseitsvalueisequivalenttothepositionofthemostsigniÞcantbitinisimplementedusinga10-bitlookuptableonthe10trailingbitsfromthemostsigniÞcantbitin.Divisionbytheconstantisimplementedbyamultiplicationofitsreciprocal.Oncethestepsizeisknown,theBoundaryCheckerCoredetermineswhetherornotthephotonpacketwillhit Step Size Core (1 stage) Logarithm Random Number Generator sr, sz sleftr, sleftz sleftzR, Z-direction constants 5:1 5:1 X 5:1 2:1 z Š Š z-posz-posNext layer z-posPrevious layer z-pos 2:1 z sz sr sz + + X Boundary Checker Core (60 stages) / sz� 5:1 2:1 Movement Core (1 stage) X X x,yzsr, sz + x-pos, y-pos, z-pos x-pos, y-pos, z-pos X constant Fig.4.Detaileddata-ßowdiagramforthePositionUpdateEngine,where representsthedistancetotheclosestboundaryinthedirectionofphotonpropagation.Overlappedshapesrepresentidenticaldatapathsforagroupofsignals.aboundaryusingEq.(2)anddeterminethestepsizeremainderusingEq.(3).Similartothestepsize,sleftzsleftrareassignedanappropriatelyscaledvalueofsleftTheBoundaryCheckerCorecontainsatotalof60pipelinestages.Thirtytwoofthesestagescomefromthepipelinedcomputationofadivision(30stages),followedbythecomputationoftwomultiplications(1stageeach).AsthecriticalpathwasfoundtobewithintheBoundaryCheckerCore,anadditional28pipelinestageswereaddedtotheendoftheBoundaryCheckerCoretoallowQuartusIItoautomaticallyperformregisterretiming.Finally,theMovementCorecomputesthenewvaluesofthex,y,andzpositionsbasedontheresultsoftheprevioustwocores.Thisphaseinvolvesamultiplicationoftheappropriatestepsizebythedirectioncosinesinthex,y,andzdirectionsrespectively().Thesevaluesarethenaddedtothepreviousvaluesforthepositionvector.3)DirectionUpdateEngine:TheDirectionUpdateEn-ginechangesthedirectioncosines(,and),andconsistsofthreecores:theReßect-TransmitCore,theRo-tationCore,andtheSharedArithmeticCore.Eachtimeaphotonpacketproceedsthroughitsmainloop,itwilleitherencounteraboundary,oritwillinteractwiththemedium,butitwillneverdobothinasingleiteration.Thisbehaviourallowsthesharingofresources.AnyaccesstoasharedarithmeticunitispassedthroughaMUX,whichsigniÞcantlyreducesthenumberofarithmeticunitsrequiredbythedesign.TheRotationCoreusesequations(4),(5),and(6)tocalculatethenewdirectionofaphotonpacketwhenitdoesnothitaboundary.ThehardwaredesignusedtocalculatethesevaluesisshowninFig.5a.Toavoidtrigonometryinhardware,lookuptablesareusedtogenerate,andThepipelinefortheRotationCorewasoptimizedtoreducethenumberofregistersandarithmeticunits(multipli-ers,dividers,andsquarerootunits)used.Oneoptimizationtechniqueusedwascommonsubexpressionelimination:cal-culatingaresultonceandreusingitinmultiplecalculations.AsecondtechniqueemployedwastosharearithmeticunitswiththeTransmit-ReßectCore.AsshownbytheshadedblocksinFig.5,eachofthesharedarithmeticunitsintheReßect-TransmitCore(Fig.5b)ismatchedinthesamestagebythesameunitintheRotationCore(Fig.5a).Finally,bycalculatingresultsaslateaspossible(withoutincreasingtotallatency)thenumberofregistersrequiredtopropagateintermediateresultswasminimized.TheReßect-TransmitCoredeterminesthedirectionaphotonpacketwilltraveluponreachingaboundary.Itfurtherdetermineswhetherornotthephotonpacketwilltransmitintothenextlayer(Fig.6).ThecalculationoftheFresnelformulasareresourceintensive,butnotcriticalforprecision,andthusisimplementedaslookuptables.Ifthephotonpacketreßects,andhenceremainsinthecurrentlayer,isnegatedandallotherphotonpacketparametersremainthesame.Shouldthephotonpackettraversethrough,thecalculationshowninFig.5b)isusedtodeterminethenewdirectionvector.Inthiscase,thelayervalueisalsoupdated,andthedeadßagissetifthephotonpacketexitsthetissueintoair.4)FluenceUpdateCore:TheFluenceUpdateCoreup-datestheweightofeachphotonpacketandadjuststhevaluesstoredintheabsorptionarray.ThemaximumsizeofthesimulationregionwasÞxedtobeelementsintherandzdimension(radialsymmetryisassumedinthismodel)duetotheuseofon-chipmemoryforstoringtheabsorptionarray.Theresolution()canbeadjustedinthesimulationinputÞle.However,byÞxingthegriddimensionsatelements,twodivisionswereturnedintobit-shiftoperations,decreasingthedepthofthepipeline,andsavingarea.Thewidthofthedatastoredintheabsorptionarraywassetto62bitstobothavoidoverßowaswellasÞtontheStratixIIIchip.NotethatthisissufÞcienttoacceptthetotalweightof100millionphotonpacketsdepositedonasinglearrayelement.OnepointofcomplexityintheFluenceUpdateCoreinvolvedapotentialReadAfterWrite(RAW)hazard.Thereisatwoclockcycledelaybetweenwritingavalueintotheon-chipmemoryandthatvaluebeingavailableforreading.Thus,toensurethatthelatestvalueisalwaysread,eachphotonpacketcheckstoseeifthenexttwophotonpacketsmaptothesamelocationintheabsorptionarray.Ifamatchisfound,thephotonpacketforwardsthevaluethatwillbe  Stage 1 Stage 2Stage 3Stage 5Stage 15Stage 30Stage 31Stage 32Stage33Stage 34Stage 35 X zz  2 xz  / 2 sinpLUT cospLUT sintLUT cost LUT RNG RNG X X yz y x   X x + Stage 36 +  X X X X X X Xzy X zz 2 layer constants 5:1 Stage 4 2  X layer constants 5:1 new new new new new new X X a) Rotation Coreb) Reflect-Transmit Core layer constants 5:1 y Fig.5.Dataßowdiagramfora)theRotationCore,andb)theReßect-TransmitCore.ResultscomputedfromtheReßect-Transmitcoreareusedwhenaphotonpackethitsaboundary.Otherwise,theresultsfromtheRotationCoreareused.Shadedblocksindicatesharedarithmeticunits. Up FresnelLUT Down FresnelLUT zz 2:1 z� 0 rnd rfres 2:1 rndrfres reflectedtransmitted Fig.6.Reßect-TransmitCorelogictodeterminewhetheraphotonpackettransmitsorreßectswrittenbacktothememory,asopposedtousingthevaluereadfrommemory.5)RouletteCore:TheRouletteCoredetermineswhentodiscontinuesimulatingaphotonpacketwhenthephotonpackethasnotyetexitedthetissue,buthasdroppedbelowthethresholdweight.WhenaphotonpacketÕsweightdropsbelowthethresholdweight,arandomnumber(between0and1)isgeneratedintheRouletteCore.Iftherandomnumberisabove1/8,thephotonpacketisterminated.Ifitisbelow1/8,theweightofthephotonpacketisincreasedbyafactorof8tomaintainconservationofenergyinthesystem.Notethatapowerof2wasusedinsteadofthenumber10usedinthesoftwareMCMLforhardwareefÞciency,whilemaintainingsimulationaccuracy.V.EXPERIMENTSANDESULTSThedesignpresentedinthispaperwasÞrstimplementedonamulti-FPGAplatformcalledtheTransmogriÞer-4(TM-4)[14]andthenmigratedtotheDE3board,whichhasoneStratixIII(EP3SL150F1152C3)FPGA[18].ThissectiononlypresentsresultsfromtheDE3platform.Forthepurposeofvalidationandperformancecompari-son,arealisticskinmodel[5]wasselectedasthesimulationinputtotheMCMLprogram.A.ValidationSystemvalidationconsistedoftwophases.TheÞrstphaseinvolvedverifyingtheFBMsimulationoutputsagainstthe 1.0E+051.0E+061.0E+071.0E+08 Relative Error Number of Photons MCML run 1 vs MCML run 2 FBM vs. MCML Fig.7.Relativeerrorasafunctionofthenumberofphotonpacketssimulated.Thehorizontalaxisisinlogarithmicscale.goldstandardMCMLexecutedonanIntelXeonprocessor.Duetothepseudo-randomnatureofMCalgorithms,itisimportanttoseparatetheerrorintroducedbythehardwareimplementation(includinglookuptablesandÞxed-pointconversion)fromthestatisticaluncertaintyinherentinanMCsimulation.Inotherwords,afaircomparisonbetweenMCMLandFBMcanonlybeobtainedbyconsideringthevarianceintheoutputoftheMCMLsimulation,whichisa2DabsorptionarrayscoringthephotonprobabilitydensitydensitycmŠ3]asafunctionofradiusanddepth.Toquantifydifferencesbetweenthesearrays,therelativeerrorerrorr][z]betweencorrespondingelementswascomputedusingEq. r][z](9)whereAsisthegoldstandardabsorptionarrayproducedbyMCMLafterlaunching100millionphotonpacketsandcontainsthecorrespondingelementsintheabsorptionarrayproducedbyFBM.Tosummarizetheeffectofvaryingthenumberofphotonpackets,themeanrelativeerrorisshowninFig.7,averagingtherelativeerroroverallabsorptionarrayelements()withvaluesaboveathresholdof.ThethresholdisnecessarysincetherelativeerrorisundeÞnedundeÞnedr][z]iszero.Photonpacketnumbersrangingfromweresimulated.Thesecondphaseforsystem-levelvalidationoftheFPGA-basedhardwaredesigninvolvedanalyzingtheerrorinthecontextofPDTtreatmentplanning.IsoßuencemapsweregeneratedfromtheFBMoutputbasedonpackets,asillustratedinFig.8.Therelativeshiftinthepositionoftheisoßuencelineswasinthehardware-generatedisoßuencelinescomparedtothoseproducedbyMCMLforßuencelevelsaslowas.ThisshiftisnegligiblewithinthecontextofPDTtreatmentplanningconsideringthetypicallymuchlargermarginofsafetyforsurgicalresectionortreatmentplanninginradiationtherapy.TABLEIIOWERELAYRODUCTOFTRATIX3GHEONFORSIMULATIONWITHMILLIONPHOTONPACKETS MachinePowerPDPNormalizedPDP Single-coreXeon40W244kJ716StratixIII1.55W0.341kJ1 B.AreaandSpeedupFBMontheStratixIIIused31,185outof56,800ALMs(StratixIIIsoftlogicblocks),92outof384DSPblocks,and4,751,360outof5,630,976blockmemorybits.FBMontheStratixIIIclockedat80MHzachieveda28-foldspeedupcomparedtoasingle-core3GHzXeon5160processorwith8GBofRAMand4MBofcache.Theon-chipsimulationtimewas6102secondsfortheprocessorand220secondsfortheFPGA.Athird-partyhost-to-FPGAcommunicationcorethatisstillunderdevelopmentwasusedtoporttheTM-4designovertotheDE3.Thiscorehasaperformanceproblemandrequiresoverthreehourstotransferonemegabyteofdata.Thetotalamountofdatathatneedstobetransferredtoandfromthedeviceislessthanamegabyteandshouldtakelessthanonesecondgivenawell-designedhostinterface.Forthisreason,thedata-transferoverheadwasnotincludedinthereportedsimulationtime.C.PowerComparisonPowerconsumptionisanothermetrictocomparecom-putationalefÞciencybetweentheFPGAandtheprocessor.OnlythepowerconsumedbytheprocessororFPGAwasconsidered;off-chipmemory,network,anddiskpowerwereignored.ThethermaldesignpowerofthedualcoreXeon5160processor[19]isspeciÞedbyInteltobe80W,andweusehalfofthisvalueasthepowerconsumptionforasinglecore.Although40Wmaybepessimisticforthesinglecore,thisvaluewaschosenbecausetheprocessorwillbeheavilyutilizedduringthecomputationinMCML.TheFPGApowerconsumptionwasdeterminedusingthePowerPlayPowerAnalyzertoolinQuartusIIversion8.1[20].ThedefaultsettingsforPowerPlaywereused.Inputpintoggleratewasleftatthedefault12.5%toapproximatethepowerconsumptionoftheFPGA.Notethatalthoughmostoftheinputsremainalmostconstantduringthesimulationofanylargenumberofphotonpackets,thedefaultsettingswerechosentogiveamoreconservativeestimate.Also,sincethehostprocessorcanperformothertaskswhilewaitingforthesimulationontheFPGAtocomplete,itspowerwasnotincludedinthepowercalculationfortheFPGA.TableIIshowsthatFBMontheStratixIIIhasa716-foldbetterpower-delayproductthanasingle-core3GHzXeon5160processor.Also,undertheseassumptions,ahypotheticalclusterofprocessorsthatmatchestheFPGAspeedupwillstillhavethesame716-foldworsepower-delayproductcomparedtotheFPGA.VI.CONCLUSIONSANDUsingtheMCMLprogramasthegoldstandard,custompipelinedhardwarewasdesignedontheDE3boardto 0.000.020.040.060.080.100.120.140.160.180.200.000.200.400.600.80 Depth (cm) Radius (cm) (b) 0.000.020.040.060.080.100.120.140.160.000.020.040.060.080.100.12 Depth (cm) Radius (cm) (a)1 0.01 1e-5 Fig.8.ComparisonoftheisoßuencelinesgeneratedbyFBMandMCMLusing100millionphotonpackets.Fluencelevelsrangefromasindicatedinparts(a)and(b)oftheÞgure.,and:MCML;,and:FBMachievea28-foldspeedupanda716-foldlowerpower-delayproductcomparedtoa3GHzIntelXeonprocessor.Thedevelopmenttimewasapproximately1person-year.IsoßuencedistributionmapsgeneratedbyFBMandMCMLwerecomparedat100millionphotonpackets,showingonlyshiftinthehardware-generatedisoßuencelinesfromthoseproducedbyMCMLforßuencelevelsaslowas.ThisshiftisnegligiblewithinthecontextofPDTtreatmentplanning.Theimplicationsofthecurrentstudyaretwofold.First,thepipelineddesigncouldformthebasisonwhichmorecomplexMCsimulationscanbebuilt.Secondly,thedra-maticreductionintreatmentplanningtimeenablesreal-timetreatmentplanningbasedonthemostrecentimagesoftheclinicaltargetvolume,takingintoaccountthechang-ingtissueopticalpropertiesasthetreatmentprogresses.Currently,pre-treatmentmodelsassumeconstantvaluesfortissueopticalpropertiesandignorethedynamicnatureoftissues,whichcoulddirectlyaffecttreatmentoutcomesininterstitialPDT[21].ThesigniÞcantperformancegainprovidedbythehardwareapproachallowsPDTtreatmentplanninginheterogeneous,spatiallycomplextissuesusingmoresophisticatedMC-basedmodels.CKNOWLEDGEMENTSTheauthorsacknowledgeDavidGallowayÕssupportwithJTAGportspackagefortheDE3[1],theÞnancialsupportfromCIHR(GrantNo.68951),NSERC(DiscoveryGrantNo.171074)andtheNSERCandOGSscholarships.[1]TerasicTechnologies,ÒAlteraDE3DevelopmentandEducationBoarduserhandbook,ÓNov2008,dE3http://www.terasic.com.tw/attachment/archive/260/DE3 User manual [2]T.Grtker,S.Liao,G.Martin,andS.Swan,SystemDesignwith.Springer,2002.[3]T.Dougherty,ÒPhotodynamicTherapy,ÓPhotochemistryandPhoto-biology,vol.58,no.6,pp.895Ð900,1993.[4]L.Wang,S.Jacques,andL.Zheng,ÒMCML-MonteCarlomodelingoflighttransportinmulti-layeredtissues,ÓComputerMethodsandProgramsinBiomedicine,vol.47,no.2,pp.131Ð146,1995.[5]V.Tuchin,ÒLightscatteringstudyoftissues,Óvol.40,no.5,pp.495Ð515,1997.[6]A.Colasanti,G.Guida,A.Kisslinger,R.Liuzzi,M.Quarto,P.Riccio,R.G.,andF.Villani,ÒMultipleProcessorVersionofaMonteCarloCodeforPhotonTransportinTurbidMedia,ÓComputerPhysics,vol.132,pp.84Ð93,2000.[7]S.Coyle,K.Thomas,T.Naughton,M.Charles,andW.Toms,ÒDis-tributedMonteCarloSimulationofLightTransportationinTissue,ÓinParallelandDistributedProcessingSymposium,2006.Proceedings.20thIEEEInternationalSymposiumon,2006,p.4.[8]M.Gokhale,J.Frigo,C.Ahrens,J.Tripp,andR.Minnich,ÒMonteCarloRadiativeHeatTransferSimulationonaReconÞgurableCom-puter,ÓLecturenotesincomputerscience,pp.95Ð104,2004.[9]K.Whitton,X.Hu,Y.Cedric,andD.Chen,ÒAnFPGASolutionforRadiationDoseCalculation,ÓinField-ProgrammableCustomCom-putingMachines,2006.Proceedings.14thAnnualIEEESymposium,2006,pp.227Ð236.[10]Agility,ÒHandel-clanguagereferencemanual,Ó2007,http://www. Language Reference [11]V.Fanti,R.Marzeddu,C.Pili,P.Randaccio,andJ.Spiga,ÒMonteCarloComputationsforRadiotherapywiththeuseofDedicatedProcessors,ÓinNuclearScienceSymposiumConferenceRecord,2006.,vol.4,2006.[12]A.S.PasciakandJ.R.Ford,ÒHigh-speedEvaluationofTrack-structureMonteCarloElectronTransportSimulations,ÓPhysicsinMedicineandBiology,vol.19,no.53,pp.5539Ð5553,2008.[13]W.Lo,K.Redmond,J.Luu,P.Chow,J.Rose,andL.Lilge,ÒHard-wareAccelerationofaMonteCarloSimulationforPhotodynamicTherapyTreatmentPlanning,ÓJournalofBiomedicalOptics,vol.14,p.014019,2009.[14]J.Fender,J.Rose,andD.Galloway,ÒThetransmogriÞer-4:anFPGA-basedhardwaredevelopmentsystemwithmulti-gigabytememoryca-pacityandhighhostandmemorybandwidth,ÓinField-ProgrammableTechnology,2005.Proceedings.2005IEEEInternationalConference,2005,pp.301Ð302.[15]ThomasTkacik,ÒAhardwarerandomnumbergenerator,ÓAug2002,openCoreDesignfromMotorolaIncorporatedhttp://www.opencores.org/projects.cgi/web/systemc rng/overview.[16]P.LÕEcuyer,ÒMaximallyEquidistributedCombinedTauswortheGen-erators,ÓMathematicsOfComputation,vol.65,no.213,pp.203Ð213,[17]AlteraCorporation,ÒStratixdevicehandbook,ÓJan2006,stratixhttp://www.altera.com/literature/hb/stx/stratix [18]AlteraCorporation,ÒStratixIIIdevicehandbook,ÓNov2007,stratix3http://www.altera.com/literature/hb/stx3/stratix3 [19]IntelCorporation,ÒDual-coreintelxeonprocessor5160(4mcache,3.00ghz,1333mhzfsb),ÓDec2008,xeon5160DataSheethttp:[20]AlteraCorporation,ÒQuartusii8.1:Powerplaypoweranalysis,ÓNov2008,quartusIIIntroductionhttp://www.altera.com/literature/hb/qts/ [21]A.Johansson,N.Bendsoe,K.Svanberg,S.Svanberg,andS.Andersson-Engels,ÒInßuenceoftreatment-inducedchangesintissueabsorptionontreatmentvolumeduringinterstitialphotodynamictherapy,ÓMedicalLaserApplication,vol.21,no.4,pp.261Ð270,