Using a Swap Instruction to Coalesce Loads and Stores

Apan Qasem, David Whalley, Xin Yuan, and Robert van Engelen

Department of Computer Science, Florida State University
Tallahassee, FL 32306-4530, U.S.A.
e-mail: {qasem,whalley,xyuan,engelen}@cs.fsu.edu, phone: (850) 644-3506

Abstract. A swap instruction, which exchanges a value in memory with a value of a register, is available on many architectures. The primary application of a swap instruction has been for process synchronization. In this paper we show that a swap instruction can often be used to coalesce loads and stores in a variety of applications. We describe the analysis necessary to detect opportunities to exploit a swap and the transformation required to coalesce a load and a store into a swap instruction. The results show that both the number of accesses to the memory system (data cache) and the number of executed instructions are reduced. In addition, the transformation reduces the register pressure by one register at the point the swap instruction is used, which sometimes enables other code-improving transformations to be performed.

1 INTRODUCTION

An instruction that exchanges a value in memory with a value in a register has been used on a variety of machines. The primary purpose for these swap instructions is to provide an atomic operation for reading from and writing to memory, which has been used to construct mutual-exclusion mechanisms in software for process synchronization. In fact, there are other forms of hardware instructions that have been used to support mutual exclusion, which include the classic test-and-set instruction. In this paper we show that a swap instruction can also be used by a low-level code-improving transformation to coalesce loads and stores into a single instruction.

A swap instruction exchanges a value in memory with a value in a register. This is illustrated in Figure 1, which depicts a load instruction, a store instruction, and a swap instruction using an RTL (register transfer list) notation. Each assignment in an RTL represents an effect on the machine. The effects listed within a single RTL are accomplished in parallel. Thus, the swap instruction is essentially a load and store accomplished in parallel.

    (a) Load Instruction      r[2] = M[x];
    (b) Store Instruction     M[x] = r[2];
    (c) Swap Instruction      r[2] = M[x]; M[x] = r[2];

Fig. 1. Contrasting the Effects of Load, Store, and Swap Instructions

A swap instruction can be efficiently integrated into a conventional RISC architecture. First, it can be encoded using the same format as a load or a store since all three instructions reference a register and a memory address. Only additional opcodes are required to support the encoding of a swap instruction.

Second, access to a data cache can be performed efficiently for a swap instruction on most RISC machines. A direct-mapped data cache can send the value to be loaded from memory to the processor for a load or a swap instruction in parallel with the tag check. This value will not be used by the processor if a tag mismatch is later discovered [1]. A data cache is not updated with the value to be stored by a store or a swap instruction until after the tag check [1]. Thus, a swap instruction could be performed as efficiently as a store instruction on a machine with a direct-mapped data cache. In fact, a swap instruction requires the same number of cycles in the pipeline as a store instruction on the MicroSPARC I [2]. One should note that a swap instruction will likely perform less efficiently when it is used for process synchronization on a multiprocessor machine since it requires a signal over the bus to prevent other accesses to memory.

Finally, it is possible that a main memory access could also be performed efficiently for a swap instruction. Reads to DRAM are destructive, meaning that the value read must be written back afterwards. A DRAM organization could be constructed where the value that is written back could differ from the value that was read and sent to the processor. Thus, a load and a store to a single word of main memory could occur in one main memory access cycle.

The remainder of this paper has the following organization. First, we introduce related work that allows multiple accesses to memory to occur simultaneously. Second, we describe a variety of different opportunities for exploiting a swap instruction that commonly appear in applications. Third, we present an algorithm to detect and coalesce a load and store pair into a swap instruction and discuss issues related to implementing this code-improving transformation. Fourth, we present the results of applying the code-improving transformation on a variety of applications. Finally, we present the conclusions of the paper.

2 RELATED WORK

There has been some related work that allows multiple accesses to the memory system to occur in a single cycle. Superscalar and VLIW machines have been developed where a wider datapath between the data cache and the processor has been used to allow multiple simultaneous accesses to the data cache. Likewise, a wider datapath has been implemented between the data cache and main memory to allow multiple simultaneous accesses to main memory through the use of memory banks. A significant amount of compiler research has been spent on trying to schedule instructions so that multiple independent memory accesses to different banks can be performed simultaneously [3].

Memory access coalescing is a code-improving transformation that groups multiple memory references to consecutive memory locations into a single larger memory reference. This transformation was accomplished by recognizing a contiguous access pattern for a memory reference across iterations of a loop, unrolling the loop, and rescheduling instructions so that multiple loads or stores could be coalesced [4].

Direct hardware support of multiple simultaneous memory accesses in the form of superscalar or VLIW architectures requires that these simultaneous memory accesses be independent in that they access different memory locations. Likewise, memory access coalescing requires that the coalesced loads or stores access contiguous (and different) memory locations. In contrast, the use of a swap instruction allows a store and a load to the same memory location to be coalesced together and performed simultaneously. In a manner similar to the memory access coalescing transformation, a load and a store are coalesced together and explicitly represented in a single instruction.

3 OPPORTUNITIES FOR EXPLOITING THE SWAP INSTRUCTION

A swap instruction can potentially be exploited when a load is followed by a store to the same memory address and the value stored is not computed using the value that was loaded. We investigated how often this situation occurs and we have found many direct opportunities in a number of applications. Consider the code segment in Figure 2 from an application that uses polynomial approximation from Chebyshev coefficients [5]. There is first a load of the d[k] array element followed by a store to the same element, where the store does not use the value that was loaded. We have found comparable code segments containing such a load followed by a store in other diverse applications, such as Gauss-Jordan elimination [5] and tree traversals [6].

    ...
    ... = d[k];
    d[k] = 2.0*d[k-1] - dd[k];
    ...

Fig. 2. Code Segment in Polynomial Approximation from Chebyshev Coefficients

A more common operation where a swap instruction can be exploited is when the values of two variables are exchanged. Consider Figure 3(a), which depicts the exchange of the values in x and y at the source code level. Figure 3(b) indicates that the load and store of x can be coalesced together. Likewise, Figure 3(c) indicates that the load and store of y can also be coalesced together. However, we will discover in the next section that only a single pair of load and store instructions in an exchange of values between variables can be coalesced together.

    (a) Exchange of Values in x and y at the Source Code Level
            t = x;
            x = y;
            y = t;
    (b) The Load and Store of x Can Be Coalesced Together
            (the references to x in "t = x;" and "x = y;")
    (c) The Load and Store of y Can Be Coalesced Together
            (the references to y in "x = y;" and "y = t;")

Fig. 3. Example of Exchanging the Values of Two Variables

There are numerous applications where the values of two variables are exchanged. Various sorts of an array or list of values are obvious applications in which a swap instruction could be exploited. Some other applications requiring an explicit exchange of values between two variables include transposing a matrix, the traveling salesperson problem, solving linear algebraic equations, fast Fourier transforms, and the integration of differential equations. The above list is only a small subset of the applications that require this basic operation.

There are also opportunities for exploiting a swap instruction after other code-improving transformations have been performed. Consider the code segment in Figure 4(a) from an application that uses polynomial approximation from Chebyshev coefficients [5]. It would appear in this code segment that there is no opportunity for exploiting a swap instruction. However, consider the body of the loop executed across two iterations, which is shown in Figure 4(b) after unrolling the loop by a factor of two. For simplicity, we are assuming in this example that the original loop iterated an even number of times. Now the value loaded from d[j-1] in the first assignment statement in the loop is updated in the second assignment statement, and the value computed in the first assignment is not used to compute the value stored in the second assignment. We have found opportunities for exploiting a swap instruction across loop iterations by loop unrolling in a number of applications, which include linear prediction, interpolation and extrapolation, and solution of linear algebraic equations.
    (a) Original Loop
            for (j = n-1; j >= 1; j--)
                d[j] = d[j-1] - dd[j];

    (b) Loop after Unrolling
            for (j = n-1; j >= 1; j -= 2) {
                d[j] = d[j-1] - dd[j];
                d[j-1] = d[j-2] - dd[j-1];
            }

Fig. 4. Example of Unrolling a Loop to Provide an Opportunity to Exploit a Swap Instruction

Equation 1 indicates the number of memory references saved for each load and store pair that can be coalesced across loop iterations. In other words, one memory reference is saved each time the loop is unrolled twice. Of course, loop unrolling has additional benefits, such as reduced loop overhead and better opportunities for scheduling instructions.

\[ \text{Memory References Saved} = \left\lfloor \frac{\text{loop unroll factor}}{2} \right\rfloor \tag{1} \]

Finally, we have also discovered opportunities for speculatively exploiting a swap instruction across basic blocks. Consider the code segment in Figure 5, which assigns values to an image according to a specified threshold [7]. p[i][j] is loaded in one block and a value is assigned to p[i][j] in both of its successor blocks. The load of p[i][j] and the store from the assignment to p[i][j] in the then or else portions of the if statement can be coalesced into a swap instruction since the value loaded is not used to compute the value stored. The store operation can be speculatively performed as part of a swap instruction in the block containing the load. We have found that stores can be performed speculatively in a number of other image processing applications, which include clipping and arithmetic operations.
[Figure 5: code segment not recoverable from the source. It shows nested loops over an image in which p[i][j] is tested by an if statement and assigned in both the then and else portions.]

Fig. 5. Speculative Use of a Swap Instruction

4 A CODE-IMPROVING TRANSFORMATION TO EXPLOIT THE SWAP INSTRUCTION

Figure 6(a) illustrates the general form of a load followed by a store that can be coalesced. The memory reference is to the same variable or location and the register loaded (r[a]) and register stored (r[b]) differ. Figure 6(b) depicts the swap instruction that represents the coalesced load and store. Note that the register loaded has been renamed from r[a] to r[b]. This renaming is required since the swap instruction has to store from and load into the same register.

    (a) Load Followed by a Store
            r[a] = M[v];
            ...
            M[v] = r[b];
    (b) Coalesced Load and Store
            r[b] = M[v]; M[v] = r[b];

Fig. 6. Simple Example of Coalescing a Load and Store into a Swap Instruction

Figure 7(a), like Figure 3(a), shows an exchange of the values of two variables, x and y, at the source code level. Figure 7(b) shows similar code at the SPARC machine code level, which is represented in RTLs. The variable t has been allocated to register r[1]. Register r[2] is used to hold the temporary value loaded from y and stored in x. At this point a swap could be used to coalesce the load and store of x or the load and store of y. Figure 7(c) shows the RTLs after coalescing the load and store of x. One should note that r[1] is no longer used since its live range has been renamed to r[2]. Due to the renaming of the register, the register pressure at this point in the program flow graph has been reduced by one. Reducing the register pressure can sometimes enable other code-improving transformations that require an available register to be applied. Note that the decision to coalesce the load and store of x prevents the coalescing of the load and store of y.

    (a) Exchange of Values in x and y at the Source Code Level
            t = x;
            x = y;
            y = t;
    (b) Exchange of Values in x and y at the Machine Code Level
            r[1] = M[x];
            r[2] = M[y];
            M[x] = r[2];
            M[y] = r[1];
    (c) After Coalescing the Load and Store of x
            r[2] = M[y];
            M[x] = r[2]; r[2] = M[x];
            M[y] = r[2];

Fig. 7. Example of Exchanging the Values of Two Variables

The code-improving transformation to coalesce a load and a store into a swap instruction was accomplished using the algorithm in Figure 8. The algorithm finds a load followed by a store to the same address and coalesces the two memory references together into a single swap instruction if a variety of conditions are met. These conditions include: (1) the load and store must be within the same block or consecutively executed blocks, (2) the addresses of the memory references in the load and store instructions have to be the same, (3) the value in r[b] that will be stored cannot depend on the value loaded into r[a], (4) the value in r[b] cannot be used after the store instruction, and (5) r[a] has to be able to be renamed to r[b]. The following subsections describe issues relating to this code-improving transformation.

    FOR B = each block in function DO
      FOR LD = each instruction in B DO
        IF LD is a load AND
           FindMatchingStore(LD, B, LD->next, ST, P) AND
           MeetSwapConds(LD, ST) THEN
          SW = Create("%s=M[%s]; M[%s]=%s;",
                      ST->r[b], LD->loadaddr, LD->loadaddr, ST->r[b]);
          Insert SW before P;
          Replace uses of LD->r[a] with ST->r[b] until LD->r[a] dies;
          Delete LD and ST;

    BOOL FindMatchingStore(LD, B, FIRST, ST, P) {
      FOR ST = FIRST to B->last DO
        IF ST is a store THEN
          IF ST->storeaddr == LD->loadaddr THEN
            IF FIRST == B->first THEN
              RETURN TRUE;
            ELSE
              RETURN FindPlaceToInsertSwap(LD, ST, P);
          IF ST->storeaddr != LD->loadaddr THEN
            CONTINUE;
          IF cannot determine if the two addresses are same or different THEN
            RETURN FALSE;
      IF FIRST == B->first THEN
        RETURN FALSE;
      FOR S = each successor of B DO
        IF !FindMatchingStore(LD, S, S->first, ST, P) THEN
          RETURN FALSE;
      FOR S = each successor of B DO
        IF FindPlaceToInsertSwap(LD, ST, P) THEN
          RETURN TRUE;
      RETURN FALSE;
    }

    MeetSwapConds(LD, ST) {
      RETURN (value in ST->r[b] is guaranteed to not depend on the value in LD->r[a]) AND
             (ST->r[b] dies at the store) AND
             ((ST->r[b] is not reset before LD->r[a] dies) OR
              (other live range of ST->r[b] can be renamed to use another register));
    }

    FindPlaceToInsertSwap(LD, ST, P) {
      IF LD->r[a] is not used between LD and ST THEN
        P = ST;
        RETURN TRUE;
      IF ST->r[b] is not referenced between LD and ST THEN
        P = LD->next;
        RETURN TRUE;
      IF first use of LD->r[a] after LD comes after the last reference
         to ST->r[b] before the store THEN
        P = instruction containing first use of LD->r[a] after LD;
        RETURN TRUE;
      IF first use of LD->r[a] after LD can be moved after the last reference
         to ST->r[b] before the store THEN
        Move instructions as needed;
        P = instruction containing first use of LD->r[a] after LD;
        RETURN TRUE;
      ELSE
        RETURN FALSE;
    }

Fig. 8. Algorithm for Coalescing a Load and a Store into a Swap Instruction

4.1 Performing the Code-Improving Transformation Late in the Compilation Process

Sometimes apparent opportunities at the source code level for exploiting a swap instruction are not available after other code-improving transformations have been applied. Many code-improving transformations either eliminate memory references (e.g. register allocation) or move memory references (e.g. loop-invariant code motion). Coalescing loads and stores into swap instructions should only be performed after all other code-improving transformations that can affect the memory references have been applied. Figure 9(a) shows an exchange of values after the two values are compared in an if statement. Figure 9(b) shows a possible translation of this code segment to machine instructions. Due to common subexpression elimination, the loads of x and y in the block following the branch have been deleted in Figure 9(c). Thus, the swap instruction cannot be exploited within that block. This example illustrates why coalescing into a swap instruction should be performed late in the compilation process, when the actual loads and stores that will remain in the generated code are known.

[Figure 9: code not recoverable from the source. Panels: (a) Exchange of Values in x and y at the Source Code Level; (b) Loads Are Initially Performed in the Exchange of Values of x and y; (c) Loads Are Deleted in the Exchange of Values Due to Common Subexpression Elimination.]

Fig. 9. Example Depicting Why the Swap Instruction Should Be Exploited as a Low-Level Code-Improving Transformation

4.2 Ensuring Memory Addresses Are Equivalent Or Are Different

One of the requirements for a load and store to be coalesced is that the load and store must refer to the same address. Figure 10(a) shows a load using the address in register r[2] and a store using the address in r[4]. The compiler must ensure that the value in r[2] is the same as that in r[4]. This process of checking that two addresses are equivalent is complicated due to the code-improving transformation being performed late in the compilation process. Common subexpression elimination and loop-invariant code motion may move the assignments of addresses to registers far from where they are actually dereferenced.

    (a) Same Addresses after Expanding the Expressions
            [code not recoverable from the source]
    (b) Load and Store to the Same Variable with an Intervening Store
            r[a] = M[v];
            ...
            M[r[c]] = r[d];
            ...
            M[v] = r[b];

Fig. 10. Examples of Detecting If Memory Addresses Are the Same or Differ

We implemented some techniques to determine if the addresses of two memory references were the same or if they differ. Addresses to memory were expanded by searching backwards for assignments to registers in the address until all registers are replaced or the beginning of a block with multiple predecessors is encountered. For instance, the address in the memory reference being loaded in Figure 10(a) is expanded as follows:

    r[2]  =>  r[2]+a  =>  (r[3] << 2)+a

The address in the memory reference being stored would be expanded in a similar manner. Once the addresses of two memory references have been expanded, they are compared to determine if they differ. If the expanded addresses are syntactically equivalent, then the compiler has ensured that they refer to the same address in memory.

We also associated the expanded addresses with memory references before code-improving transformations involving code motion were applied. The compiler tracked these expanded addresses with the memory references through a variety of code-improving transformations that would move the location of the memory references. Determining the expanded addresses early simplified the process of calculating addresses associated with memory references.

Another requirement for a load and a store to be coalesced is that there are no other possible intervening stores to the same address. Figure 10(b) shows a load of a variable v followed by a store to the same variable with an intervening store. The compiler must ensure that the value in r[c] is not the address of the variable v. However, simply checking that two expanded addresses are not identical does not suffice to determine if they refer to different locations in memory. Various rules were used to determine if two addresses differed. Table 1 depicts some of these rules.

    Num  Rule                                          Example
                                                       First Address     Second Address
    ---  --------------------------------------------  ----------------  ------------------
    1    The addresses are to different classes        M[a]              M[r[30]+x]
         (local variables, arguments, static
         variables, and global variables).
    2    Both addresses are to the same class and      M[a]              M[b]
         their names differ.
    3    One address is to a variable that has never   M[r[14]+v]        M[r[7]]
         had its address taken and the second
         address is not to the same variable.
    4    The addresses are the same, except for        M[(r[3]<<2)+a]    M[(r[3]<<2)+a+4]
         different constant offsets.

Table 1. A Subset of the Rules Used for Memory Disambiguation

4.3 Finding a Location to Place the Swap Instruction

Another condition that has to be met for a load and a store to be coalesced into a swap instruction is that the instruction containing the first use of the register assigned by the load has to occur after the last reference to the register to be stored. For example, consider Figure 11(a). A use of r[a] appears before the last reference to r[b] that precedes the store instruction, which prevents the load and store from being coalesced. Figure 11(b) shows that the compiler is sometimes able to reschedule the instructions between the load and the store to meet this condition. Now the load and the store can be moved so that the load appears immediately before the store, as shown in Figure 11(c). Once the load and store are contiguous, the two instructions can be coalesced. Figure 11(d) shows the code sequence after the load and store have been deleted, the swap instruction has been inserted, and r[a] has been renamed to r[b].

[Figure 11: code not recoverable from the source. Panels: (a) Use of r[a] Appears before a Reference to r[b]; (b) First Use of r[a] Appears after the Last Reference to r[b]; (c) Load and Store Can Now Be Made Contiguous; (d) After Coalescing the Load and Store and Renaming r[a] to Be r[b].]

Fig. 11. Examples of Finding a Location to Place the Swap Instruction

4.4 Renaming Registers to Allow the Swap Instruction to Be Exploited

We encountered another complication due to coalescing loads and stores into swap instructions late in the compilation process. Pseudo registers, which contain temporary values, have already been assigned to hardware registers when the coalescing transformation is attempted. The compiler reuses hardware registers when assigning pseudo registers to hardware registers in an attempt to minimize the number of hardware registers used. Our implementation of the code-improving transformation sometimes renamed live ranges of registers to permit the use of a swap instruction. Consider the example in Figure 12(a), in which r[b] is set after the store and before the last use of the value assigned to r[a]. In this situation, we attempt to rename the second live range of r[b] to a different available register. Figure 12(b) shows this live range being renamed to r[c]. Figure 12(c) depicts that the load and store can now be coalesced since r[a] can be renamed to r[b].

[Figure 12: code not recoverable from the source. Panels: (a) r[b] Is Set in the Live Range of r[a]; (b) Live Range of r[b] after Store Has Been Renamed to r[c]; (c) After Coalescing the Load and Store and Renaming the Live Range of r[a] to r[b].]

Fig. 12. Example of Applying Register Renaming to Permit the Use of a Swap Instruction

Sometimes we had to move sequences of instructions past other instructions in order for the load and store to be coalesced. Consider the unrolled loop in Figure 4. Figure 13(a) shows the same loop, but in a load/store fashion, where the temporaries are registers. The load and store cannot be made contiguous due to reuse of the same registers. Figure 13(b) shows the same code after renaming the registers on which the value to be stored depends. Now the instructions can be scheduled so that the load and store can be made contiguous, as shown in Figure 13(c). Figure 13(d) shows the load and store coalesced and the loaded register renamed.

[Figure 13: code not recoverable from the source. Each panel shows the unrolled loop of Figure 4 in load/store form. Panels: (a) After Loop Unrolling; (b) After Register Renaming; (c) After Scheduling the Instructions; (d) After Coalescing the Load and Store.]

Fig. 13. Another Example of Applying Register Renaming to Permit the Use of a Swap Instruction

5 RESULTS

Table 2 describes the numerous benchmarks and applications that we used to evaluate the impact of applying the code-improving transformation to coalesce loads and stores into a swap instruction. The programs depicted in boldface were directly obtained from the Numerical Recipes in C text [5]. The code in many of these benchmarks is used as utilities in a variety of programs. Thus, coalescing loads and stores into swaps can be performed on a diverse set of applications.

    Program     Description
    ----------  ------------------------------------------------------------
    bandec      constructs an LU decomposition of a sparse representation
                of a band diagonal matrix
    bubblesort  sorts an integer array in ascending order using a bubble sort
    chebpc      polynomial approximation from Chebyshev coefficients
    elmhes      reduces an N×N matrix to Hessenberg form
    fft         fast Fourier transform
    gaussj      solves linear equations using Gauss-Jordan elimination
    indexx      calculates indices for the array such that the indices are
                in ascending order
    ludcmp      performs LU decomposition of an N×N matrix
    mmid        modified midpoint method
    predic      performs linear prediction of a set of data points
    rtflsp      finds the root of a function using the false position method
    select      returns the k-th smallest value in an array
    thresh      adjusts an image according to a threshold value
    transpose   transposes a matrix
    traverse    binary tree traversal without a stack
    tsp         traveling salesman problem

Table 2. Test Programs

Measurements were collected using the ease system that is available with the vpo compiler. In some cases, we emulated a swap instruction when it did not exist. For instance, the SPARC does not have swap instructions that swap bytes, halfwords, floats, or doublewords. The ease system provides the ability to gather measurements on proposed architectural features that do not exist on a host machine [8, 9]. Note that it is sometimes possible to use the SPARC swap instruction, which exchanges a word in an integer register with a word in memory, for exchanging a floating-point value with a value in memory. When the floating-point values that are loaded and stored are not used in any operations, these values could be loaded and stored using integer registers instead of floating-point registers and the swap instruction could be exploited.

Table 3 depicts the results that were obtained on the test programs for coalescing loads and stores into swap instructions. We unrolled several loops in these programs by an unroll factor of two to provide opportunities for coalescing a load and a store across the original iterations of the loop. In these cases, the Not Coalesced column includes the unrolling of these loops to provide a fair comparison. The results show decreases in the number of instructions executed and memory references performed for a wide variety of applications. The amount of the decrease varied depending on the execution frequency of the load and store instructions that were coalesced.

6 CONCLUSIONS

In this paper we have shown how to exploit a swap instruction, which exchanges the values between a register and a location in memory. We have discussed how a swap instruction could be efficiently integrated into a conventional load/store architecture. A number of different types of opportunities for exploiting the swap instruction were shown to be available. An algorithm for coalescing a load and a store into a swap instruction was given and a number of issues related to implementing the coalescing transformation were described. The results show that this code-improving transformation could be applied on a variety of applications and benchmarks, and reductions in the number of instructions executed and memory references performed were observed.
                Instructions Executed                   Memory References Performed
    Program     Not Coalesced  Coalesced   Decrease     Not Coalesced  Coalesced   Decrease
    ----------  -------------  ----------  --------     -------------  ----------  --------
    bandec             69,189      68,459     1.06%            18,054      17,324     4.04%
    bubblesort      2,439,005   2,376,705     2.55%           498,734     436,434    12.49%
    chebpc          7,531,984   7,029,990     6.66%         3,008,052   2,507,056    16.66%
    elmhes             18,527      18,044     2.61%             3,010       2,891     3.95%
    fft             4,176,112   4,148,112     0.67%           672,132     660,932     1.67%
    gaussj             27,143      26,756     1.43%             7,884       7,587     3.77%
    indexx             70,322      68,676     2.34%            17,132      15,981     6.72%
    ludcmp         10,521,952  10,439,152     0.79%           854,915     845,715     1.08%
    mmid              267,563     258,554     3.37%            88,622      79,613    10.17%
    predic             40,827      38,927     4.65%            13,894      11,994    13.67%
    rtflsp             81,117      80,116     1.23%            66,184      65,183     1.51%
    select             19,939      19,434     2.53%             3,618       3,121    13.74%
    thresh          7,958,909   7,661,796     3.73%         1,523,554   1,226,594    19.49%
    transpose          42,883      37,933    11.54%            19,832      14,882    24.96%
    traverse           94,159      91,090     3.26%            98,311      96,265     2.08%
    tsp            64,294,814  63,950,122     0.54%        52,144,375  51,969,529     0.34%
    average         6,103,402   6,019,616     3.06%         3,689,893   3,622,568     8.52%

Table 3. Results

References

1. M. D. Hill, "A Case for Direct-Mapped Caches," IEEE Computer, 21(11), pages 25-40, December 1988.
2. Texas Instruments, Inc., Product Preview of the TMS390S10 Integrated SPARC Processor, 1993.
3. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann, San Francisco, CA, 1996.
4. J. W. Davidson and S. Jinturkar, "Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses," Proceedings of the SIGPLAN '94 Symposium on Programming Language Design and Implementation, pages 186-195, June 1994.
5. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, New York, NY, 1996.
6. B. Dwyer, "Simple Algorithms for Traversing a Tree without a Stack," Information Processing Letters, 2(5), pages 143-145, 1973.
7. I. Pitas, Digital Image Processing Algorithms and Applications, John Wiley & Sons, Inc., New York, NY, 2000.
8. J. W. Davidson and D. B. Whalley, "Ease: An Environment for Architecture Study and Experimentation," Proceedings SIGMETRICS '90 Conference on Measurement and Modeling of Computer Systems, pages 259-260, May 1990.
9. J. W. Davidson and D. B. Whalley, "A Design Environment for Addressing Architecture and Compiler Interactions," Microprocessors and Microsystems, 15(9), pages 459-472, November 1991.