Using a Swap Instruction to Coalesce Loads and Stores

Apan Qasem, David Whalley, Xin Yuan, and Robert van Engelen
Department of Computer Science, Florida State University
Tallahassee, FL 32306-4530, U.S.A.
e-mail: {qasem,whalley,xyuan,engelen}@cs.fsu.edu, phone: (850) 644-3506

Abstract. A swap instruction, which exchanges a value in memory with a value of a register, is available on many architectures. The primary application of a swap instruction has been for process synchronization. In this paper we show that a swap instruction can often be used to coalesce loads and stores in a variety of applications. We describe the analysis necessary to detect opportunities to exploit a swap and the transformation required to coalesce a load and a store into a swap instruction. The results show that both the number of accesses to the memory system (data cache) and the number of executed instructions are reduced. In addition, the transformation reduces the register pressure by one register at the point the swap instruction is used, which sometimes enables other code-improving transformations to be performed.

1 INTRODUCTION

An instruction that exchanges a value in memory with a value in a register has been used on a variety of machines. The primary purpose for these swap instructions is to provide an atomic operation for reading from and writing to memory, which has been used to construct mutual-exclusion mechanisms in software for process synchronization. In fact, there are other forms of hardware instructions that have been used to support mutual exclusion, which include the classic test-and-set instruction. In this paper we show that a swap instruction can also be used by a low-level code-improving transformation to coalesce loads and stores into a single instruction.

A swap instruction exchanges a value in memory with a value in a register. This is illustrated in Figure 1, which depicts a load instruction, a store instruction, and a swap instruction using an RTL (register transfer list) notation. Each assignment in an RTL represents an effect on the machine. The effects listed within a single RTL are accomplished in parallel. Thus, the swap instruction is essentially a load and a store accomplished in parallel.

A swap instruction can be efficiently integrated into a conventional RISC architecture. First, it can be encoded using the same format as a load or a store since all three instructions reference a register and a memory address. Only additional opcodes are required to support the encoding of a swap instruction.

    (a) Load Instruction     (b) Store Instruction     (c) Swap Instruction
    r[2] = M[x];             M[x] = r[2];              r[2] = M[x]; M[x] = r[2];

Fig. 1. Contrasting the Effects of Load, Store, and Swap Instructions

Second, access to a data cache can be performed efficiently for a swap instruction on most RISC machines. A direct-mapped data cache can send the value to be loaded from memory to the processor for a load or a swap instruction in parallel with the tag check. This value will not be used by the processor if a tag mismatch is later discovered [1]. A data cache is not updated with the value to be stored by a store or a swap instruction until after the tag check [1]. Thus, a swap instruction could be performed as efficiently as a store instruction on a machine with a direct-mapped data cache. In fact, a swap instruction requires the same number of cycles in the pipeline as a store instruction on the MicroSPARC I [2]. One should note that a swap instruction will likely perform less efficiently when it is used for process synchronization on a multiprocessor machine since it requires a signal over the bus to prevent other accesses to memory.

Finally, it is possible that a main memory access could also be performed efficiently for a swap instruction. Reads to DRAM are destructive, meaning that the value read must be written back afterwards. A DRAM organization could be constructed where the value that is written back could differ from the value that was read and sent to the processor. Thus, a load and a store to a single word of main memory could occur in one main memory access cycle.

The remainder of this paper has the following organization. First, we introduce related work that allows multiple accesses to memory to occur simultaneously. Second, we describe a variety of different opportunities for exploiting a swap instruction that commonly appear in applications. Third, we present an algorithm to detect and coalesce a load and store pair into a swap instruction and discuss issues related to implementing this code-improving transformation. Fourth, we present the results of applying the code-improving transformation on a variety of applications. Finally, we present the conclusions of the paper.

2 RELATED WORK

There has been some related work that allows multiple accesses to the memory system to occur in a single cycle. Superscalar and VLIW machines have been developed where a wider datapath between the data cache and the processor has been used to allow multiple simultaneous accesses to the data cache. Likewise, a wider datapath has been implemented between the data cache and main memory to allow multiple simultaneous accesses to main memory through the use of memory banks. A significant amount of compiler research has been spent on trying to schedule instructions so that multiple independent memory accesses to different banks can be performed simultaneously [3].

Memory access coalescing is a code-improving transformation that groups multiple memory references to consecutive memory locations into a single larger memory reference. This transformation was accomplished by recognizing a contiguous access pattern for a memory reference across iterations of a loop, unrolling the loop, and rescheduling instructions so that multiple loads or stores could be coalesced [4].

Direct hardware support of multiple simultaneous memory accesses in the form of superscalar or VLIW architectures requires that these simultaneous memory accesses be independent in that they access different memory locations. Likewise, memory access coalescing requires that the coalesced loads or stores access contiguous (and different) memory locations. In contrast, the use of a swap instruction allows a store and a load to the same memory location to be coalesced together and performed simultaneously. In a manner similar to the memory access coalescing transformation, a load and a store are coalesced together and explicitly represented in a single instruction.

3 OPPORTUNITIES FOR EXPLOITING THE SWAP INSTRUCTION

A swap instruction can potentially be exploited when a load is followed by a store to the same memory address and the value stored is not computed using the value that was loaded. We investigated how often this situation occurs and found many direct opportunities in a number of applications. Consider the code segment in Figure 2 from an application that uses polynomial approximation from Chebyshev coefficients [5]. There is first a load of the d[k] array element followed by a store to the same element, where the store does not use the value that was loaded. We have found comparable code segments containing such a load followed by a store in other diverse applications, such as Gauss-Jordan elimination [5] and tree traversals [6].

    ...
    ... = d[k];
    d[k] = 2.0*d[k-1] - dd[k];
    ...

Fig. 2. Code Segment in Polynomial Approximation from Chebyshev Coefficients

A more common operation where a swap instruction can be exploited is when the values of two variables are exchanged. Consider Figure 3(a), which depicts the exchange of the values in x and y at the source code level. Figure 3(b) indicates that the load and store of x can be coalesced together. Likewise, Figure 3(c) indicates that the load and store of y can also be coalesced together. However, we will discover in the next section that only a single pair of load and store instructions in an exchange of values between variables can be coalesced together.

    (a) Exchange of Values in x and y at the Source Code Level
        t = x;
        x = y;
        y = t;
    (b) The Load and Store of x Can Be Coalesced Together
        (same code as (a), with the load of x in "t = x;" and the store to x in "x = y;" marked)
    (c) The Load and Store of y Can Be Coalesced Together
        (same code as (a), with the load of y in "x = y;" and the store to y in "y = t;" marked)

Fig. 3. Example of Exchanging the Values of Two Variables

There are numerous applications where the values of two variables are exchanged. Various sorts of an array or list of values are obvious applications in which a swap instruction could be exploited. Some other applications requiring an explicit exchange of values between two variables include transposing a matrix, the traveling salesperson problem, solving linear algebraic equations, fast Fourier transforms, and the integration of differential equations. The above list is only a small subset of the applications that require this basic operation.

There are also opportunities for exploiting a swap instruction after other code-improving transformations have been performed. Consider the code segment in Figure 4(a) from an application that uses polynomial approximation from Chebyshev coefficients [5]. It would appear in this code segment that there is no opportunity for exploiting a swap instruction. However, consider the body of the loop executed across two iterations, which is shown in Figure 4(b) after unrolling the loop by a factor of two. For simplicity, we are assuming in this example that the original loop iterated an even number of times. Now the value loaded from d[j-1] in the first assignment statement in the loop is updated in the second assignment statement, and the value computed in the first assignment is not used to compute the value stored in the second assignment. We have found opportunities for exploiting a swap instruction across loop iterations by loop unrolling in a number of applications, which include linear prediction, interpolation and extrapolation, and solution of linear algebraic equations.

    (a) Original Loop
        for (j = n-1; j >= 1; j--)
            d[j] = d[j-1] - dd[j];

    (b) Loop after Unrolling
        for (j = n-1; j >= 1; j -= 2) {
            d[j] = d[j-1] - dd[j];
            d[j-1] = d[j-2] - dd[j-1];
        }

Fig. 4. Example of Unrolling a Loop to Provide an Opportunity to Exploit a Swap Instruction

Equation 1 indicates the number of memory references saved for each load and store pair that can be coalesced across loop iterations. In other words, one memory reference is saved each time the loop is unrolled twice. Of course, loop unrolling has additional benefits, such as reduced loop overhead and better opportunities for scheduling instructions.

$$\text{Memory References Saved} = \left\lfloor \frac{\text{loop unroll factor}}{2} \right\rfloor \tag{1}$$

Finally, we have also discovered opportunities for speculatively exploiting a swap instruction across basic blocks. Consider the code segment in Figure 5, which assigns values to an image according to a specified threshold [7]. p[i][j] is loaded in one block and a value is assigned to p[i][j] in both of its successor blocks. The load of p[i][j] and the store from the assignment to p[i][j] in the then or else portions of the if statement can be coalesced into a swap instruction since the value loaded is not used to compute the value stored. The store operation can be speculatively performed as part of a swap instruction in the block containing the load. We have found that stores can be performed speculatively in a number of other image processing applications, which include clipping and arithmetic operations.
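As a rough illustration of the exchange-of-values opportunity and of Equation 1, the following C sketch emulates a swap instruction in software. The names `swap_word`, `exchange_plain`, `exchange_with_swap`, and `memory_references_saved` are hypothetical helpers introduced here, not part of the paper or of any compiler. The emulation is not atomic, and the returned counts simply tally the memory instructions each sequence would issue under the assumption that every load, store, or swap counts as one memory reference.

```c
/* Hypothetical stand-in for a hardware swap instruction: writes new_val to
 * *addr and returns the value previously held there. This emulation is not
 * atomic; it only models the combined load-plus-store effect. */
static int swap_word(int *addr, int new_val)
{
    int old = *addr;   /* load half:  r = M[addr] */
    *addr = new_val;   /* store half: M[addr] = r */
    return old;
}

/* Exchange *x and *y with two loads and two stores.
 * Returns the number of memory instructions issued (assumed count). */
static int exchange_plain(int *x, int *y)
{
    int r1 = *x;       /* r[1] = M[x]; */
    int r2 = *y;       /* r[2] = M[y]; */
    *x = r2;           /* M[x] = r[2]; */
    *y = r1;           /* M[y] = r[1]; */
    return 4;
}

/* Exchange *x and *y after coalescing the load and store of x into a swap.
 * Only one register is needed, and three memory instructions are issued. */
static int exchange_with_swap(int *x, int *y)
{
    int r2 = *y;             /* r[2] = M[y];               */
    r2 = swap_word(x, r2);   /* r[2] = M[x]; M[x] = r[2];  */
    *y = r2;                 /* M[y] = r[2];               */
    return 3;
}

/* Equation 1: floor(loop unroll factor / 2), for a non-negative factor. */
static int memory_references_saved(int unroll_factor)
{
    return unroll_factor / 2;
}
```

In this sketch the swap-based sequence reaches the same final state as the plain sequence while touching memory one time fewer and using one register fewer, which mirrors the savings the text describes.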
[Figure 5: code listing with nested for loops and an if statement that tests and assigns p[i][j]]

Fig. 5. Speculative Use of a Swap Instruction

4 A CODE-IMPROVING TRANSFORMATION TO EXPLOIT THE SWAP INSTRUCTION

Figure 6(a) illustrates the general form of a load followed by a store that can be coalesced. The memory reference is to the same variable or location and the register loaded (r[a]) and register stored (r[b]) differ. Figure 6(b) depicts the swap instruction that represents the coalesced load and store. Note that the register loaded has been renamed from r[a] to r[b]. This renaming is required since the swap instruction has to store from and load into the same register.

    (a) Load Followed by a Store     (b) Coalesced Load and Store
        r[a] = M[v];                     r[b] = M[v]; M[v] = r[b];
        ...
        M[v] = r[b];

Fig. 6. Simple Example of Coalescing a Load and Store into a Swap Instruction

Figure 7(a), like Figure 3(a), shows an exchange of the values of two variables, x and y, at the source code level. Figure 7(b) shows similar code at the SPARC machine code level, which is represented in RTLs. The variable t has been allocated to register r[1]. Register r[2] is used to hold the temporary value loaded from y and stored in x. At this point a swap could be used to coalesce the load and store of x or the load and store of y. Figure 7(c) shows the RTLs after coalescing the load and store of x. One should note that r[1] is no longer used since its live range has been renamed to r[2]. Due to the renaming of the register, the register pressure at this point in the program flow graph has been reduced by one. Reducing the register pressure can sometimes enable other code-improving transformations that require an available register to be applied. Note that the decision to coalesce the load and store of x prevents the coalescing of the load and store of y.

    (a) Exchange of Values in x      (b) Exchange of Values in x      (c) After Coalescing the
        and y at the Source              and y at the Machine             Load and Store of x
        Code Level                       Code Level
        t = x;                           r[1] = M[x];                     r[2] = M[y];
        x = y;                           r[2] = M[y];                     M[x] = r[2]; r[2] = M[x];
        y = t;                           M[x] = r[2];                     M[y] = r[2];
                                         M[y] = r[1];

Fig. 7. Example of Exchanging the Values of Two Variables

The code-improving transformation to coalesce a load and a store into a swap instruction was accomplished using the algorithm in Figure 8. The algorithm finds a load followed by a store to the same address and coalesces the two memory references together into a single swap instruction if a variety of conditions are met. These conditions include:

(1) the load and store must be within the same block or consecutively executed blocks,
(2) the addresses of the memory references in the load and store instructions have to be the same,
(3) the value in r[b] that will be stored cannot depend on the value loaded into r[a],
(4) the value in r[b] cannot be used after the store instruction, and
(5) r[a] has to be able to be renamed to r[b].

The following subsections describe issues relating to this code-improving transformation.

    FOR B = each block in function DO
       FOR LD = each instruction in B DO
          IF LD is a load AND
             FindMatchingStore(LD, B, LD->next, ST, P) AND
             MeetSwapConds(LD, ST) THEN
             SW = Create("%s=M[%s];M[%s]=%s;",
                         ST->r[b], LD->loadaddr, LD->loadaddr, ST->r[b]);
             Insert SW before P;
             Replace uses of LD->r[a] with ST->r[b] until LD->r[a] dies;
             Delete LD and ST;

    BOOL FindMatchingStore(LD, B, FIRST, ST, P)
    {
       FOR ST = FIRST to B->last DO
          IF ST is a store THEN
             IF ST->storeaddr == LD->loadaddr THEN
                IF FIRST == B->first THEN
                   RETURN TRUE;
                ELSE
                   RETURN FindPlaceToInsertSwap(LD, ST, P);
             IF ST->storeaddr != LD->loadaddr THEN
                CONTINUE;
             IF cannot determine if the two addresses are same or different THEN
                RETURN FALSE;
       IF FIRST == B->first THEN
          RETURN FALSE;
       FOR S = each successor of B DO
          IF !FindMatchingStore(LD, S, S->first, ST, P) THEN
             RETURN FALSE;
       FOR S = each successor of B DO
          IF FindPlaceToInsertSwap(LD, ST, P) THEN
             RETURN TRUE;
       RETURN FALSE;
    }

    MeetSwapConds(LD, ST)
    {
       RETURN (value in ST->r[b] is guaranteed to not depend on the value in LD->r[a]) AND
              (ST->r[b] dies at the store) AND
              ((ST->r[b] is not reset before LD->r[a] dies) OR
               (other live range of ST->r[b] can be renamed to use another register));
    }

    FindPlaceToInsertSwap(LD, ST, P)
    {
       IF LD->r[a] is not used between LD and ST THEN
          P = ST;
          RETURN TRUE;
       IF ST->r[b] is not referenced between LD and ST THEN
          P = LD->next;
          RETURN TRUE;
       IF first use of LD->r[a] after LD comes after the last reference to ST->r[b]
          before the store THEN
          P = instruction containing first use of LD->r[a] after LD;
          RETURN TRUE;
       IF first use of LD->r[a] after LD can be moved after the last reference to ST->r[b]
          before the store THEN
          Move instructions as needed;
          P = instruction containing first use of LD->r[a] after LD;
          RETURN TRUE;
       ELSE
          RETURN FALSE;
    }

Fig. 8. Algorithm for Coalescing a Load and a Store into a Swap Instruction

4.1 Performing the Code-Improving Transformation Late in the Compilation Process

Sometimes apparent opportunities at the source code level for exploiting a swap instruction are not available after other code-improving transformations have been applied. Many code-improving transformations either eliminate memory references (e.g. register allocation) or move memory references (e.g. loop-invariant code motion). Coalescing loads and stores into swap instructions should only be performed after all other code-improving transformations that can affect the memory references have been applied. Figure 9(a) shows an exchange of values after the two values are compared in an if statement. Figure 9(b) shows a possible translation of this code segment to machine instructions. Due to common subexpression elimination, the loads of x and y in the block following the branch have been deleted in Figure 9(c). Thus, the swap instruction cannot be exploited within that block. This example illustrates why the coalescing of loads and stores into swap instructions should be performed late in the compilation process, when the actual loads and stores that will remain in the generated code are known.

[Figure 9: three panels. (a) Exchange of Values in x and y at the Source Code Level; (b) Loads Are Initially Performed in the Exchange of Values of x and y; (c) Loads Are Deleted in the Exchange of Values Due to Common Subexpression Elimination]

Fig. 9. Example Depicting Why the Swap Instruction Should Be Exploited as a Low-Level Code-Improving Transformation

4.2 Ensuring Memory Addresses Are Equivalent Or Are Different

One of the requirements for a load and store to be coalesced is that the load and store must refer to the same address. Figure 10(a) shows a load using the address in register r[2] and a store using the address in r[4]. The compiler must ensure that the value in r[2] is the same as that in r[4]. This process of checking that two addresses are equivalent is complicated due to the code-improving transformation being performed late in the compilation process. Common subexpression elimination and loop-invariant code motion may move the assignments of addresses to registers far from where they are actually dereferenced.

    (a) Same Addresses after Expanding      (b) Load and Store to the Same Variable
        the Expressions                         with an Intervening Store
        [code listing]                          r[a] = M[v];
                                                ...
                                                M[r[c]] = r[d];
                                                ...
                                                M[v] = r[b];

Fig. 10. Examples of Detecting If Memory Addresses Are the Same or Differ

We implemented some techniques to determine if the addresses of two memory references were the same or if they differ. Addresses to memory were expanded by searching backwards for assignments to registers in the address until all registers are replaced or the beginning of a block with multiple predecessors is encountered. For instance, the address in the memory reference being loaded in Figure 10(a) is expanded as follows:

    r[2]  =>  r[2] + a  =>  (r[3] << 2) + a

The address in the memory reference being stored would be expanded in a similar manner. Once the addresses of two memory references have been expanded, they are compared to determine if they differ. If the expanded addresses are syntactically equivalent, then the compiler has ensured that they refer to the same address in memory.

We also associated the expanded addresses with memory references before code-improving transformations involving code motion were applied. The compiler tracked these expanded addresses with the memory references through a variety of code-improving transformations that would move the location of the memory references. Determining the expanded addresses early simplified the process of calculating addresses associated with memory references.

Another requirement for a load and a store to be coalesced is that there are no other possible intervening stores to the same address. Figure 10(b) shows a load of a variable v followed by a store to the same variable with an intervening store. The compiler must ensure that the value in r[c] is not the address of the variable v. However, simply checking that two expanded addresses are not identical does not suffice to determine if they refer to different locations in memory.
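The comparison of expanded addresses can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the compiler's actual representation: it assumes every expanded address has already been normalized to the hypothetical form (r[base_reg] << shift) + symbol + offset, and the names `expanded_addr`, `addr_relation`, and `compare_expanded` are invented for this example. Syntactic equality is treated as proof that two addresses are the same, while any mismatch is reported only as unknown, since a mismatch alone does not show that the locations differ.

```c
#include <string.h>

/* Hypothetical normalized form of an expanded address:
 *     (r[base_reg] << shift) + symbol + offset
 * This layout is an assumption made for illustration only. */
struct expanded_addr {
    int base_reg;        /* register number, or -1 if no register remains */
    int shift;           /* constant shift applied to the register        */
    const char *symbol;  /* variable name, or "" if none                  */
    long offset;         /* constant byte offset                          */
};

enum addr_relation {
    ADDR_SAME,     /* syntactically identical, so the same location       */
    ADDR_UNKNOWN   /* not identical; further rules are needed to decide   */
};

/* Compare two expanded addresses field by field. */
static enum addr_relation compare_expanded(const struct expanded_addr *p,
                                           const struct expanded_addr *q)
{
    if (p->base_reg == q->base_reg &&
        p->shift == q->shift &&
        strcmp(p->symbol, q->symbol) == 0 &&
        p->offset == q->offset)
        return ADDR_SAME;
    return ADDR_UNKNOWN;
}
```

A load and a store whose addresses both expand to a register shifted by two plus the symbol a would compare as ADDR_SAME here, whereas a store through an unrelated register such as r[c] would come back as ADDR_UNKNOWN and would need additional disambiguation before the pair could be coalesced.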
Various rules were used to determine if two addresses differed. Table 1 depicts some of the rules that were used.

    Num  Rule                                                      Example
                                                                   First Address       Second Address
    1    The addresses are to different classes (local             M[a]                M[r[30]+x]
         variables, arguments, static variables, and
         global variables).
    2    Both addresses are to the same class and their            M[a]                M[b]
         names differ.
    3    One address is to a variable that has never had its       M[r[14]+v]          M[r[7]]
         address taken and the second address is not to the
         same variable.
    4    The addresses are the same, except for different          M[(r[3] << 2)+a]    M[(r[3] << 2)+a+4]
         constant offsets.

Table 1. A Subset of the Rules Used for Memory Disambiguation

4.3 Finding a Location to Place the Swap Instruction

Another condition that has to be met for a load and a store to be coalesced into a swap instruction is that the instruction containing the first use of the register assigned by the load has to occur after the last reference to the register to be stored. For example, consider Figure 11(a). A use of r[a] appears before the last reference to r[b] preceding the store instruction, which prevents the load and store from being coalesced. Figure 11(b) shows that the compiler is sometimes able to reschedule the instructions between the load and the store to meet this condition. Now the load and the store can be moved so that the load appears immediately before the store, as shown in Figure 11(c). Once the load and store are contiguous, the two instructions can be coalesced. Figure 11(d) shows the code sequence after the load and store have been deleted, the swap instruction has been inserted, and r[a] has been renamed to r[b].

[Figure 11: four panels. (a) Use of r[a] Appears before a Reference to r[b]; (b) First Use of r[a] Appears after the Last Reference to r[b]; (c) Load and Store Can Now Be Made Contiguous; (d) After Coalescing the Load and Store and Renaming r[a] to Be r[b]]

Fig. 11. Examples of Finding a Location to Place the Swap Instruction

4.4 Renaming Registers to Allow the Swap Instruction to Be Exploited

We encountered another complication due to coalescing loads and stores into swap instructions late in the compilation process. Pseudo registers, which contain temporary values, have already been assigned to hardware registers when the coalescing transformation is attempted. The compiler reuses hardware registers when assigning pseudo registers to hardware registers in an attempt to minimize the number of hardware registers used. Our implementation of the code-improving transformation sometimes renamed live ranges of registers to permit the use of a swap instruction. Consider the example in Figure 12(a), which contains a set of r[b] after the store and before the last use of the value assigned to r[a]. In this situation, we attempt to rename the second live range of r[b] to a different available register. Figure 12(b) shows this live range being renamed to r[c]. Figure 12(c) depicts that the load and store can now be coalesced since r[a] can be renamed to r[b].

[Figure 12: three panels. (a) r[b] Is Set in the Live Range of r[a]; (b) Live Range of r[b] after Store Has Been Renamed to r[c]; (c) After Coalescing the Load and Store and Renaming the Live Range of r[a] to r[b]]

Fig. 12. Example of Applying Register Renaming to Permit the Use of a Swap Instruction

Sometimes we had to move sequences of instructions past other instructions in order for the load and store to be coalesced. Consider the unrolled loop in Figure 4. Figure 13(a) shows the same loop, but in a load/store fashion, where the temporaries are registers. The load and store cannot be made contiguous due to reuse of the same registers. Figure 13(b) shows the same code after renaming the registers on which the value to be stored depends. Now the instructions can be scheduled so that the load and store can be made contiguous, as shown in Figure 13(c). Figure 13(d) shows the load and store coalesced and the loaded register renamed.

[Figure 13: four panels showing the unrolled loop of Figure 4 in load/store form. (a) After Loop Unrolling; (b) After Register Renaming; (c) After Scheduling the Instructions; (d) After Coalescing the Load and Store]

Fig. 13. Another Example of Applying Register Renaming to Permit the Use of a Swap Instruction

5 RESULTS

Table 2 describes the benchmarks and applications that we used to evaluate the impact of applying the code-improving transformation to coalesce loads and stores into a swap instruction. The programs depicted in boldface were directly obtained from the Numerical Recipes in C text [5]. The code in many of these benchmarks is used as utilities in a variety of programs. Thus, coalescing loads and stores into swaps can be performed on a diverse set of applications.

    Program      Description
    bandec       constructs an LU decomposition of a sparse representation of a band diagonal matrix
    bubblesort   sorts an integer array in ascending order using a bubble sort
    chebpc       polynomial approximation from Chebyshev coefficients
    elmhes       reduces an N x N matrix to Hessenberg form
    fft          fast Fourier transform
    gaussj       solves linear equations using Gauss-Jordan elimination
    indexx       calculates indices for the array such that the indices are in ascending order
    ludcmp       performs LU decomposition of an N x N matrix
    mmid         modified midpoint method
    predic       performs linear prediction of a set of data points
    rtflsp       finds the root of a function using the false position method
    select       returns the kth smallest value in an array
    thresh       adjusts an image according to a threshold value
    transpose    transposes a matrix
    traverse     binary tree traversal without a stack
    tsp          traveling salesman problem

Table 2. Test Programs

Measurements were collected using the ease system that is available with the vpo compiler. In some cases, we emulated a swap instruction when it did not exist. For instance, the SPARC does not have swap instructions that swap bytes, halfwords, floats, or doublewords. The ease system provides the ability to gather measurements on proposed architectural features that do not exist on a host machine [8, 9]. Note that it is sometimes possible to use the SPARC swap instruction, which exchanges a word in an integer register with a word in memory, for exchanging a floating-point value with a value in memory. When the floating-point values that are loaded and stored are not used in any operations, these values could be loaded and stored using integer registers instead of floating-point registers, and the swap instruction could be exploited.

Table 3 depicts the results that were obtained on the test programs for coalescing loads and stores into swap instructions. We unrolled several loops in these programs by an unroll factor of two to provide opportunities for coalescing a load and a store across the original iterations of the loop. In these cases, the Not Coalesced column includes the unrolling of these loops to provide a fair comparison. The results show decreases in the number of instructions executed and memory references performed for a wide variety of applications. The amount of the decrease varied depending on the execution frequency of the load and store instructions that were coalesced.

6 CONCLUSIONS

In this paper we have shown how to exploit a swap instruction, which exchanges the values between a register and a location in memory. We have discussed how a swap instruction could be efficiently integrated into a conventional load/store architecture. A number of different types of opportunities for exploiting the swap instruction were shown to be available. An algorithm for coalescing a load and a store into a swap instruction was given and a number of issues related to implementing the coalescing transformation were described. The results show that this code-improving transformation could be applied on a variety of applications and benchmarks, and reductions in the number of instructions executed and memory references performed were observed.
                 Instructions Executed                     Memory References Performed
    Program      Not Coalesced   Coalesced    Decrease     Not Coalesced   Coalesced    Decrease
    bandec              69,189      68,459      1.06%            18,054      17,324      4.04%
    bubblesort       2,439,005   2,376,705      2.55%           498,734     436,434     12.49%
    chebpc           7,531,984   7,029,990      6.66%         3,008,052   2,507,056     16.66%
    elmhes              18,527      18,044      2.61%             3,010       2,891      3.95%
    fft              4,176,112   4,148,112      0.67%           672,132     660,932      1.67%
    gaussj              27,143      26,756      1.43%             7,884       7,587      3.77%
    indexx              70,322      68,676      2.34%            17,132      15,981      6.72%
    ludcmp          10,521,952  10,439,152      0.79%           854,915     845,715      1.08%
    mmid               267,563     258,554      3.37%            88,622      79,613     10.17%
    predic              40,827      38,927      4.65%            13,894      11,994     13.67%
    rtflsp              81,117      80,116      1.23%            66,184      65,183      1.51%
    select              19,939      19,434      2.53%             3,618       3,121     13.74%
    thresh           7,958,909   7,661,796      3.73%         1,523,554   1,226,594     19.49%
    transpose           42,883      37,933     11.54%            19,832      14,882     24.96%
    traverse            94,159      91,090      3.26%            98,311      96,265      2.08%
    tsp             64,294,814  63,950,122      0.54%        52,144,375  51,969,529      0.34%
    average          6,103,402   6,019,616      3.06%         3,689,893   3,622,568      8.52%

Table 3. Results

References

1. M. D. Hill, "A Case for Direct-Mapped Caches," IEEE Computer, 21(11), pages 25-40, December 1988.
2. Texas Instruments, Inc., Product Preview of the TMS390S10 Integrated SPARC Processor, 1993.
3. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann, San Francisco, CA, 1996.
4. J. W. Davidson and S. Jinturkar, "Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses," Proceedings of the SIGPLAN '94 Symposium on Programming Language Design and Implementation, pages 186-195, June 1994.
5. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, New York, NY, 1996.
6. B. Dwyer, "Simple Algorithms for Traversing a Tree without a Stack," Information Processing Letters, 2(5), pages 143-145, 1973.
7. I. Pitas, Digital Image Processing Algorithms and Applications, John Wiley & Sons, Inc., New York, NY, 2000.
8. J. W. Davidson and D. B. Whalley, "Ease: An Environment for Architecture Study and Experimentation," Proceedings SIGMETRICS '90 Conference on Measurement and Modeling of Computer Systems, pages 259-260, May 1990.
9. J. W. Davidson and D. B. Whalley, "A Design Environment for Addressing Architecture and Compiler Interactions," Microprocessors and Microsystems, 15(9), pages 459-472, November 1991.