Program Element Matching for Multi-Version Program Analyses

Miryung Kim, David Notkin
Computer Science & Engineering
University of Washington
Seattle, WA
{miryung,notkin}@cs.washington.edu

ABSTRACT
Multi-version program analyses require that elements of one version of a program be mapped to the elements of other versions of that program. Matching program elements between two versions of a program is a fundamental building block for multi-version program analyses and other software evolution research such as profile propagation, regression testing, and software version merging. In this paper, we survey matching techniques that can be used for multi-version program analyses and evaluate them based on hypothetical change scenarios. This paper also lists challenges of the matching problem, identifies open problems, and proposes future directions.

Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement - restructuring, reverse engineering, and reengineering
General Terms: Documentation, Algorithms
Keywords: matching, software evolution, multi-version analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MSR'06, May 22-23, 2006, Shanghai, China. Copyright 2006 ACM 1-59593-085-X/06/0005 ...$5.00.

1. INTRODUCTION
In the last several years, researchers in software engineering have begun to analyze programs together with their change history. In contrast to traditional program analyses that examine a single version, multi-version program analyses use multiple versions of a program as input and mine change patterns. There are roughly two different types of multi-version analyses: (1) coarse-grained analyses and (2) fine-grained analyses.

Coarse-grained analyses compute changes between two consecutive versions of a program, aggregate the change information across multiple versions or across multiple files, and infer coarse-grained patterns [37, 15, 20, 17]. For example, Nagappan and Ball's analysis [37] finds line-level changes between two consecutive versions, counts the total number of changes per binary module, and infers the characteristics of frequently changed modules. On the other hand, fine-grained analyses track how individual code fragments changed during program evolution to infer fine-grained change patterns [29, 31, 52, 38, 43, 48, 51, 11]. For example, a clone genealogy extractor tracks individual code snippets over multiple versions to infer clone evolution patterns [29]. As another example, a signature change pattern analysis [31, 30] traces how the name and the signature of functions change. Matching elements between two versions of a program is a fundamental building block for fine-grained multi-version analyses as well as other software evolution research such as software version merging, regression testing, and profile propagation [36, 42, 21, 46].

We first define the problem of matching code elements between two versions:

    Suppose that a program $P'$ is created by modifying $P$. Determine the difference $\Delta$ between $P$ and $P'$. For a code fragment $c' \in P'$, determine whether $c' \in \Delta$. If not, find $c'$'s corresponding origin $c$ in $P$.

The problem definition states that we must compute the difference between two programs. Computing semantic differences requires solving the problem of semantic program equivalence, which is undecidable. Thus, once the problem is approximated by matching a code element by its syntactic and textual similarity, solutions depend on the choices of (1) an underlying program representation, (2) matching granularity, (3) matching multiplicity, and (4) matching heuristics. In this paper, we explain how these choices impact the applicability of each matching method and how they affect the effectiveness and accuracy of matching by creating an evaluation framework for existing matching techniques.

The rest of this paper is organized as follows. The next section discusses several multi-version analyses, which demonstrate the need for program element matching. Then, we discuss challenges of program element matching in Section 3. Section 4 presents a survey of state-of-the-art matching techniques from various research areas such as multi-version program analyses, profile propagation, software version merging, and regression testing. Section 5 compares the surveyed techniques and Section 6 evaluates each technique using hypothetical program change scenarios. Section 7 presents open problems and future directions.

2. MOTIVATING PROBLEMS
In this section, we describe several multi-version analysis problems that demonstrate the importance of program element matching.

The first problem is maintaining two similar programs that originated from the same source but evolved differently in parallel (see footnote 1). In many organizations, it is a common practice to clone a product or a module and maintain the clones in parallel [13]. Maintenance difficulties arise when programmers discover a critical bug in cloned parts. If programmers do not know whether the discovered bug is relevant to other cloned counterparts, they must inspect the source code of the counterparts. We believe that programmers can better locate relevant counterparts by understanding how clones change over time. Monitoring clones requires tracking each clone by its physical location such as a file name, a procedure name, or calibrated line numbers [29].

Our second motivating problem is understanding the evolution of information hiding interfaces. The information hiding principle [41] states that programmers should anticipate what kinds of decisions are likely to change in the future and hide them using an interface. In general, it is difficult for programmers to predict which design decisions are likely to change; thus, unanticipated design decisions result in degradation of the original software design. We believe that understanding interface evolution can shed light on (1) which decisions were originally hidden but later exposed and (2) how unanticipated decisions impact the original interface design.

In addition to the problems above, type change analysis [38], signature change analysis [31], and instability analysis [11] also require matching program elements in order to track code elements over time.

3. MATCHING CHALLENGES
This section lists challenges of program element matching.

3.1 Absence of Benchmarks
It is difficult to evaluate matching techniques because there is no reference data set or archive of editing logs. Previous studies [30] also indicate that programmers often disagree about the origin of code snippets; low inter-rater agreement suggests that there may be no ground truth in program element matching.

3.2 Various Granularity Support
With respect to our motivating
problems in Section 2, we cannot assume that programmers intend to track program elements at a fixed granularity. There are two reasons why matching techniques must support various granularities. First, a programmer may want to track design decisions that cannot be mapped to program units precisely. Second, a programmer may want to track program elements at a different granularity depending on the nature of the task. For example, if matching techniques are to be used for profile propagation or precise regression test selection, mappings should be found at a fine granularity such as the level of control flow graph edges [42] or the level of code blocks [46, 44]. On the other hand, if a programmer wants to use matching results for program understanding tasks, it would be more appropriate to find associations at a higher level such as a file.

3.3 Types of Code Changes
Certain types of code changes make the matching problem non-trivial. For example, tracking code by its enclosing procedure name would fail if programmers merged, split, or renamed procedures. When a programmer copies code, matching techniques cannot assume one-to-one mappings between old elements and new elements because an old code element can have more than one matching descendant in a new version.

Footnote 1: In an open source project community, this practice is often called forking.

4. MATCHING TECHNIQUES
This section describes matching techniques used for software version merging, program differencing, profile propagation, regression testing, and multi-version program analyses. For easy comparison, we group techniques by program representation. We describe clone detectors and tools that infer refactoring events in Sections 4.7 and 4.8 because these tools can be leveraged for finding correspondences between two versions.

4.1 Entity Name Matching
The simplest matching method treats program elements as immutable entities with a fixed name and matches the elements by name. For example, Zimmermann et al. modeled a function as a tuple, (file name, FUNCTION, function name), and a field as a tuple, (function name, FIELD, field name) [51]. Similarly, Ying et al. [48] modeled a file with its full path name. In fact, matching by name would be sufficient for many multi-version analyses that intend to identify coarse-grained patterns such as the characteristics of fault-prone modules [15, 20, 37].

4.2 String Matching
When a program is represented as a string, the best match between two strings is computed by finding the longest common subsequence (LCS) [7]. The LCS problem is built on the assumptions that (1) the available operations are addition and deletion, and (2) matched pairs cannot cross one another. Thus, the longest common subsequence does not necessarily include all possible matches when the available edit operations include copy, paste, and move. Tichy's bdiff [45] extended the LCS problem by relaxing the two assumptions above: it permits crossing block moves and does not require one-to-one correspondence.

The line-level LCS implementation, diff [25], has served as a basis for many multi-version analyses, because (1) diff is fast and reliable, and (2) popular version control systems such as CVS [2] or Subversion [1] already store changes as line-level differences. For example, a clone genealogy extractor tracks code snippets by their file name and line number [29]. As another example, fix-inducing code snippets [43] are inferred by tracking backward a tuple of (file name :: function name :: line number) from the moment that a bug is fixed.

4.3 Syntax Tree Matching
For software version merging, Yang [47] developed an AST differencing algorithm. Given a pair of functions $(f_T, f_R)$, the algorithm creates two abstract syntax trees $T$ and $R$ and attempts to match the two tree roots. If the two roots match, the algorithm aligns $T$'s subtrees $t_1, t_2, \ldots, t_i$ and $R$'s subtrees $r_1, r_2, \ldots, r_j$ using the LCS algorithm and matches subtrees recursively. This type of tree matching respects the parent-child relationship as well as the order between sibling nodes, but it is very sensitive to changes in nested blocks and control structures because tree roots must be matched at every level.
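To make the root-then-subtrees recursion concrete, the following Python sketch mimics the strategy described above for Yang's algorithm. It is a simplified illustration under stated assumptions, not the published algorithm: the Node class, the use of plain label equality to decide whether two nodes match, and the function names lcs_pairs and match_trees are all hypothetical choices made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """A toy AST node: a label plus ordered children (an assumption of this sketch)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def lcs_pairs(xs: list, ys: list, same: Callable) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) forming one longest common subsequence of xs and ys."""
    n, m = len(xs), len(ys)
    # table[i][j] holds the LCS length of the suffixes xs[i:] and ys[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if same(xs[i], ys[j]):
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the front to recover one optimal, non-crossing alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if same(xs[i], ys[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def match_trees(t: Node, r: Node) -> List[Tuple[Node, Node]]:
    """Match roots first; only if they match, align children with LCS and recurse."""
    if t.label != r.label:
        return []  # a root mismatch stops matching for the whole subtree
    matched = [(t, r)]
    same_label = lambda a, b: a.label == b.label
    for i, j in lcs_pairs(t.children, r.children, same_label):
        matched.extend(match_trees(t.children[i], r.children[j]))
    return matched


def labels(pairs: List[Tuple[Node, Node]]) -> List[Tuple[str, str]]:
    """Helper for inspecting results as label pairs."""
    return [(a.label, b.label) for a, b in pairs]


# Example 1: an extra enclosing `if` hides the unchanged inner statements.
before_a = Node("mA", [Node("if pred_a", [Node("foo()")])])
after_a = Node("mA", [Node("if pred_a0", [Node("if pred_a", [Node("foo()")])])])

# Example 2: two reordered statements; LCS cannot produce crossing matches.
before_b = Node("mB", [Node("a := 1"), Node("b := b + 1"), Node("fun(a, b)")])
after_b = Node("mB", [Node("b := b + 1"), Node("a := 1"), Node("fun(a, b)")])

if __name__ == "__main__":
    print(labels(match_trees(before_a, after_a)))
    print(labels(match_trees(before_b, after_b)))
```

In the first example only the method roots are paired, because the children "if pred_a" and "if pred_a0" differ and the recursion never reaches the unchanged inner block. This is the sensitivity to nesting changes noted above. In the second example one of the two swapped statements is left unmatched, which reflects the LCS assumption that matched pairs cannot cross.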
Hunt and Tichy do not directly compare ASTs but use syntactic information to guide string-level differencing. Their 3-way merging tool [24] parses a program into a language-neutral form, compares token strings using the LCS algorithm, and finds syntactic changes using structural information from the parse.

For dynamic software updating, Neamtiu et al. [38] built an algorithm that tracks simple changes to variables, types, and functions based on an AST representation. Neamtiu's algorithm assumes that function names are relatively stable over time and matches the ASTs of functions with the same name; the algorithm traverses two ASTs in parallel and incrementally adds one-to-one mappings as long as the ASTs have the same shape. In contrast to Yang's algorithm, Neamtiu's algorithm cannot compare structurally different ASTs.

4.4 Control Flow Graph Matching
Laski and Szermer [33] first developed an algorithm that computes one-to-one correspondences between CFG nodes in two programs $P_1$ and $P_2$. This algorithm first reduces a CFG to a series of single-entry, single-exit subgraphs called hammocks and matches a sequence of hammock nodes using a depth-first search (DFS). Once a pair of corresponding hammock nodes is found, the hammock nodes are recursively expanded in order to find correspondences within the matched hammocks.

JDiff [5] extends Laski and Szermer's (LS) algorithm to compare Java programs based on an enhanced control flow graph (ECFG). JDiff is similar to the LS algorithm in the sense that hammocks are recursively expanded and compared, but it differs in three ways. First, while the LS algorithm compares hammock nodes by the name of a start node in the hammock, JDiff checks whether the ratio of unchanged-matched pairs in the hammock is greater than a chosen threshold in order to allow for flexible matches. Second, while the LS algorithm uses DFS to match hammock nodes, JDiff only uses DFS up to a certain look-ahead depth to improve its performance. Third, while the LS algorithm requires hammock node matches at the same nested level, JDiff can match hammock nodes at a different nested level; thus, JDiff is more robust to the addition of while loops or if-statements at the beginning of a code segment. JDiff has been used for regression test selection [40] and dynamic impact analysis [6].

4.5 Program Dependence Graph Matching
There are several program differencing algorithms based on a program dependence graph [23, 12, 26]. These algorithms are not applicable to popular modern programming languages because they can run only on a limited subset of C-like languages without global variables, pointers, arrays, or procedures.

4.6 Binary Code Matching
BMAT [46] is a fast and effective tool that matches two versions of a binary program without knowledge of source code changes. BMAT was used for profile propagation and regression test prioritization [44]. BMAT's algorithm matches blocks in three steps. The first step is to find one-to-one mappings between the procedures in two versions based on their names, type information, and code contents. To match procedures with different names, block trial matching is done on procedure pairs with a small number of character differences in their hierarchical names. In this step, the thresholds for procedure name difference and block matching percentage are both set heuristically. In the second step, BMAT first matches data blocks within each pair of matched procedures using a hash function and then matches the remaining unmatched data blocks if they are sandwiched by already matched pairs. In the third step, BMAT matches code blocks in multiple hashing passes. During hash-based matching, if hash values collide, two heuristics are used to break ties: (1) crossing matches are forbidden at certain hashing passes, and (2) a pair of blocks is preferred to other tied pairs if either its predecessors or successors are also matched. For the remaining unmatched blocks, BMAT matches blocks based on control flow equivalence, allowing one-to-many mappings between old code blocks and new code blocks.

4.7 Clone Detection
A clone detector is simply an implementation of an arbitrary equivalence function. The equivalence function defined by each clone detector depends on a program representation and a comparison algorithm. Most clone detectors [8, 28, 9, 32, 27] are heavily dependent on (1) hash functions to improve performance, (2) parameterization to allow flexible matches, and (3) thresholds to remove spurious matches. A clone detector can be considered a many-to-many matcher based solely on content similarity heuristics.

4.8 Origin Analysis Tools
Origin analysis infers refactoring events such as splitting, merging, renaming, and moving by comparing two versions of a program [14, 52, 31, 4, 18, 35, 19]. Origin analysis tackles the program element matching problem directly but produces matching results only at a predefined granularity such as a procedure, class, or file.

Demeyer et al. [14] first proposed the idea of inferring refactoring events by comparing two programs. Demeyer et al. used a set of ten characteristic metrics for each class, such as LOC and the number of method calls within a method (i.e., fan-out), and inferred where refactoring events occurred based on the metric values and the class inheritance hierarchy.

Zou and Godfrey's origin analysis [52] matches procedures using multiple criteria (names, signatures, metric values, and a set of callers and callees) and infers merging, splitting, and renaming events. Both Demeyer et al.'s and Zou and Godfrey's analyses are semi-automatic in the sense that a programmer must manually tune the matching criteria and select a match among candidate matches.

Kim et al. [30] automated Zou and Godfrey's procedure renaming analysis. In addition to the matching criteria used by Zou and Godfrey, Kim et al. used clone detectors such as CCFinder [28] and Moss [3] to calculate content similarity between procedures. An overall similarity is computed as a weighted sum of each similarity metric, and a match is selected if the overall similarity is greater than a certain threshold. To create an evaluation data set, ten human judges identified renaming events in the Subversion and the Apache projects, and if seven out of ten judges agreed on the origin of a renamed procedure, a match was added to a reference data set. Using the reference data set, Kim et al. optimized each factor's weight and tuned the threshold
value. The accuracy of Kim's tool was better than the average accuracy of human judges, indicating that human judges significantly disagreed about the origin of procedures.

5. COMPARISON
Table 1 compares the state-of-the-art matching techniques in Section 4.

Table 1: Comparison of Matching Techniques
(N: Name-based heuristics, P: Position-based heuristics, S: Similarity-based heuristics)

| Program Representation | Citation | Granularity | Assumed Correspondence | Multiplicity | Heuristics (N, P, S) | Application |
|---|---|---|---|---|---|---|
| A set of entities | [20, 15, 37] | Module | | 1:1 | √ | Fault proneness |
| | Bevan et al. [11] | File | | 1:1 | √ | Instability |
| | Ying et al. [48] | File | | 1:1 | √ | Co-change |
| | Zimmermann et al. [51] | File, Procedure, Field | | 1:1 | √ | |
| String | diff [25] | Line | File | 1:1 | √ | Merging, Clone genealogy [29], Fix-inducing code [43] |
| | bdiff [45] | Line | File | 1:n | √ | Merging |
| AST | cdiff [47] | AST Node | Procedure | 1:1 | √ | |
| | Neamtiu et al. [38] | Type, Var | | 1:1 | √√ | Type change |
| | Hunt, Tichy [24, 35] | Token | File | 1:1 | √√ | Merging |
| CFG | JDiff [5] | CFG node | | 1:1 | √√ | Regression testing, Impact analysis |
| Binary | BMAT [46] | Code block | | 1:1 (procedure), n:1 (block) | √√√ | Profile propagation, Regression testing |
| Hybrid | Zou, Godfrey [52] | Procedure | | 1:1 or 1:n or n:1 | √√ | Origin analysis |
| | Kim et al. [30] | Procedure | | 1:1 | √√ | Signature change [31], Renaming analysis |

As shown in the fourth column of Table 1, many matching techniques assume correspondence at a certain granularity whether or not this assumption is explicitly stated. For example, using diff to match code snippets assumes that input files are already matched. As another example, using cdiff to match AST nodes assumes that enclosing functions are matched by the same name.

All matching techniques rely heavily on heuristics to reduce the matching scope and to improve precision and recall. The heuristics fall into three categories (see footnote 2):

1. Name-based heuristics match entities with similar names. For example, BMAT and JDiff match procedures in multiple phases by the same globally qualified name (e.g., System.out.println), by the same hierarchical name, by the same signature, and by the same name.

2. Position-based heuristics match entities with similar positions. If entities are placed in the same syntactic position or surrounded by already matched pairs (i.e., a sandwich heuristic), they become a matched pair. For example, BMAT uses a sandwich heuristic aggressively to remove unmatched pairs. As another example, Neamtiu's algorithm traverses two ASTs in parallel and matches variables placed in the same syntactic position regardless of their labels.

3. Similarity-based heuristics match entities that are nearly identical; they often rely on parameterization and a hash function to find nearly identical entities. All clone detectors can be viewed as similarity-based matchers.

Footnote 2: The three categories are not comprehensive or mutually exclusive.

The three different types of heuristics complement one another. For example, when hash values collide or parameterization results in spurious matches, position-based heuristics will select a matched pair that preserves linear ordering or structural ordering by checking neighboring matches. Table 1 (columns 6 to 8) summarizes which kinds of heuristics each matching technique uses.

6. EVALUATION
Matching techniques are often inadequately evaluated; only Kim et al. conducted a comparative study using human subjects [30]. This lack of evaluation is exacerbated by the fact that there are no agreed evaluation criteria or representative benchmarks. Finding such universal criteria would be difficult since each technique is built for a different goal. For example, matching techniques for regression testing or profile propagation [5, 46, 49] can be evaluated by the accuracy of static branch prediction and code coverage, but even this evaluation method is not applicable to programs without test suites.

To evaluate matching techniques uniformly, we take a scenario-based evaluation approach; we design a small set of hypothetical program change scenarios and describe how well various matching techniques will perform on them (see footnote 3).

Footnote 3: PDG-based matching techniques are excluded due to lack of modern programming language support.

Scenario 1: (1) a programmer inserts if-else statements at the beginning of the method mA, and (2) the programmer reorders several statements in the method mB without changing semantics, as shown in Table 2. The ideal matching technique should produce (s1-s1'), (s2-s2'), (s3-s4'), (s4-s3'), and (s5-s5') and identify that s0' is added.

Table 2: Scenario 1 Code Change

before:
    mA() {
      if (pred_a) {      // s1
        foo()            // s2
      }
    }
    mB(b) {
      a := 1             // s3
      b := b + 1         // s4
      fun(a, b)          // s5
    }

after:
    mA() {
      if (pred_a0) {     // s0'
        if (pred_a) {    // s1'
          foo()          // s2'
        }
      }
    }
    mB(b) {
      b := b + 1         // s3'
      a := 1             // s4'
      fun(a, b)          // s5'
    }

The third column of Table 3 summarizes how well each technique will work in this scenario. Diff can match lines of mA but cannot match reordered lines in mB because the LCS algorithm does not allow crossing block moves. On the other hand, bdiff can match reordered lines in mB because crossing block moves are allowed. Neamtiu's algorithm will perform poorly in both mA and mB because it does not perform a deep structural match. Cdiff cannot match unchanged parts in mA correctly because cdiff stops early if roots do not match at each level. JDiff will be able to skip the changed control structure, map unchanged parts in mA, and match reordered statements in mB if the look-ahead threshold is greater than the depth of nested controls. BMAT cannot track code blocks in mB because BMAT's hashing algorithms are sensitive to instruction order. In conclusion, JDiff will work best for changes within procedures at a statement or predicate level.

Scenario 2: A file PElmtMatch changed its name to PMatching. A procedure matchBlck is split into two procedures, matchDBlck and matchCBlck. A procedure matchAST changed its name to matchAbstractSyntaxTree. The ideal matching technique should produce (PElmtMatch, PMatching), (matchBlck, matchDBlck), (matchBlck, matchCBlck), and (matchAST, matchAbstractSyntaxTree).

The fourth column of Table 3 summarizes how each technique will work in this scenario. Most name-based matching techniques will do poorly due to renaming events. Diff and bdiff will be able to track each line only if file names did not change. Both cdiff and Neamtiu's algorithm will perform poorly if procedure names changed. Both BMAT and origin analysis tools will perform well because they rely on multiple passes of hash functions and multiple phases of name matching.

The remaining columns of Table 3 describe how well each matching technique will work in case of restructuring tasks at a procedure level or at a file level.

Table 3: Evaluation of the Surveyed Matching Techniques
(Rating columns: Scenario 1, Scenario 2, and Transformations, i.e., Split/Merge and Rename at the Proc and File levels; rating scale: good, mediocre, poor. The individual ratings are not reproduced here.)

| Program Representation | Citation | Strength (+) and Weakness (-) |
|---|---|---|
| String | diff [25] | - sensitive to file name changes |
| | bdiff [45] | + can trace copied blocks |
| AST | cdiff [47] | - sensitive to nested level change; - requires procedure level mappings |
| | Neamtiu et al. [38] | - partial AST matching |
| | Hunt, Tichy [24, 35] | - requires file level mappings; + can identify procedure renaming |
| CFG | JDiff [5] | + robust to signature changes; - sensitive to control structure changes |
| Binary | BMAT [46] | + robust to procedure renaming; - assumes 1:1 procedure correspondence; - only applicable to binary programs |
| Hybrid | Zou, Godfrey [52] | - semi-automatic, manual analysis |
| | Kim et al. [30] | - assumes 1:1 procedure correspondence |

Based on Tables 1 and 3, we conclude the following:

• Matching techniques based on AST or CFG produce matches at fine-grained levels but are only applicable to a complete and parsable program. Researchers must consider the trade-off between matching granularity, matching requirements, and matching cost.

• Many techniques employ the LCS algorithm even when matching ASTs or CFGs, thus inheriting the assumptions of LCS: one-to-one correspondences between matched entities and linear ordering among matched pairs. Such implicit assumptions must be carefully examined before implementing a matcher.

• Most techniques support only one-to-one mappings at a fixed granularity. Therefore, they will perform poorly when merging or splitting occurs.

• The more heuristics are used, the more matches can be found because the heuristics complement one another. For example, name-based matching is easy to implement and can reduce the matching scope quickly, but it is not robust to renaming events. In this case, similarity-based matching can produce matches between renamed entities and position-based matching can leverage already matched pairs to infer more matches.

7. FUTURE DIRECTIONS
This section lists remaining open problems and future directions.

Hybrid Matcher. Although no single technique performs perfectly in all change scenarios, the combination of all techniques does. Thus, combining multiple techniques may improve the accuracy of matches because the techniques complement one another. The simplest way is to run all matching techniques separately and find consensus among the results. Another way of building a hybrid matcher is to leverage a feedback loop between matching tools and tools that infer refactoring events. Determining which refactoring occurred and determining correspondences is a chicken-and-egg problem; inferring refactoring events requires knowledge of correspondences, and finding good correspondences is achieved by knowing which refactoring occurred. This feedback cycle provides an opportunity to find more matches. The results of inferred transformations are fed to matching tools, and the matching results are fed back to a refactoring reconstruction tool iteratively until optimal correspondences are found.

We must note that combining results from multiple matchers will require tremendous effort because (1) not every matching tool is available for public use or applicable to popular programming languages and (2) different matchers use different program representations.

Capturing Editing Operations. Having a complete history of logical editing operations would nullify the matching problem. However, most software repositories employ state-based merging, not operation-based merging [34], thus making it impossible to restore logical editing operations completely. Even when an edit log is available, if editing operations are captured at a keystroke level, it is not trivial to reconstruct logical editing operations (such as procedure renaming, splitting, and merging) and produce matches between program elements. Recently, capturing and replaying refactoring operations has been shown to be possible in the Eclipse IDE [22], so we can leverage this type of refactoring history to initiate the feedback loop discussed above.

Interval Manipulation vs. Matching Tool Selection. In this paper, we simplified the multi-version program matching problem to a two-version matching problem. To use a matching technique in the context of multi-version analyses, the interval between each pair of versions must be determined. In the past, the granularity of available historical data limited the sampling interval for multi-version analyses. Recently, several infrastructures [10, 51, 50] were built to facilitate multi-version analyses by restoring commit transactions from a source code repository and automatically extracting multiple versions separated by an arbitrary time interval. These infrastructures enable multi-version analyses to manipulate the sampling interval. Therefore, the remaining problem is to determine an optimal sampling interval for each matching technique (or to select an appropriate matching tool depending on the logical gap between two versions of a program). Another interesting open question is, "Can we design a matching technique that works as well as aggregating results from a set of program snapshots that are separated by only small changes?"

Matching Result Aggregation. As matching complexity increases by supporting multiple granularities and many-to-many mappings, representing match results becomes a non-trivial problem. In addition, when a two-version matching tool is used for multi-version program analyses, aggregating individual matching results and representing the final results in a compact form remains an open problem.

Leveraging Dynamic Information. Most matching techniques are based on syntactic similarities at a source code level. In comparison checking research [49, 39], dynamic information has been used to match an optimized version and an unoptimized version of the same program when the two versions were executed on the same input. Abstraction of multiple execution traces may guide matching of a static program representation. For example, comparing dynamic invariants [16] may be useful for identifying variable-level matches at the entry (or exit) of a function.

8. CONCLUSION
In this paper, we defined the program element matching problem and argued its importance for fine-grained multi-version analyses. We presented a survey of matching techniques from various research areas and evaluated them by hand based on hypothetical program change scenarios. We believe that our assessment of existing techniques will guide researchers in choosing an appropriate matching technique for their analysis.

In conclusion, every matching technique is an implementation of some pseudo-equivalence function, and the more heuristics are used, the better the matching technique will work. One direction of future work involves building a hybrid matcher that leverages a feedback loop between matching tools and tools that infer refactoring events. Another future direction is to customize existing matchers in the context of a specific type of multi-version analysis and build an evaluation data set for that analysis. In addition, determining an optimal sampling interval for each matching technique remains an open problem.

Our longer-term objectives are to (1) define the problem more precisely, allowing for better assessment and sharing of the approaches, and (2) lay a foundation for more effective solutions applicable to specific kinds of multi-version analyses.

9. ACKNOWLEDGMENTS
We thank Dagstuhl 05261 seminar participants for fruitful discussions. We also thank Michael Toomim for reading our draft and Vibha Sazawal, Dan Grossman, and Rob DeLine for discussions that helped us refine our ideas.

10. REFERENCES
[1] subversion.tigris.org.
[2] www.cvshome.org.
[3] A. Aiken. A system for detecting software plagiarism.
[4] G. Antoniol, M. D. Penta, and E. Merlo. An automatic approach to identify class evolution discontinuities. In IWPSE, pages 31-40, 2004.
[5] T. Apiwattanapong, A. Orso, and M. J. Harrold. A differencing algorithm for object-oriented programs. In ASE, pages 2-13. IEEE Computer Society, 2004.
[6] T. Apiwattanapong, A. Orso, and M. J. Harrold. Efficient and precise dynamic impact analysis using execute-after sequences. In ICSE, pages 432-441, 2005.
[7] A. Apostolico and Z. Galil, editors. Pattern matching algorithms. Oxford University Press, UK, 1997.
[8] B. S. Baker. A program for identifying duplicated code. Computing Science and Statistics, 24:49-57, 1992.
[9] I. D. Baxter, A. Yahin, L. M. de Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In ICSM, pages 368-377, 1998.
[10] J. Bevan, J. E. James Whitehead, S. Kim, and M. Godfrey. Facilitating software evolution research with Kenyon. In ESEC/FSE, pages 177-186, 2005.
[11] J. Bevan and E. J. W. Jr. Identification of software instabilities. In WCRE, pages 134-145, 2003.
[12] D. Binkley, S. Horwitz, and T. Reps. Program integration for languages with procedure calls. ACM TOSEM, 4(1):3-35, 1995.
[13] J. R. Cordy. Comprehending reality: Practical barriers to industrial adoption of software maintenance automation. In IWPC '03, page 196, 2003.
[14] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change metrics. In OOPSLA '00, pages 166-177, 2000.
[15] S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus. Does code decay? Assessing the evidence from change management data. IEEE Trans. Softw. Eng., 27(1):1-12, 2001.
[16] M. D. Ernst. Dynamically Discovering Likely Program Invariants. Ph.D. Dissertation, University of Washington, Seattle, Washington, Aug. 2000.
[17] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In ICSM, pages 190-197, 1998.
[18] C. Görg and P. Weißgerber. Error detection by refactoring reconstruction. In MSR '05, pages 29-35.
[19] C. Görg and P. Weißgerber. Detecting and visualizing refactorings from software archives. In IWPC, pages 205-214, 2005.
[20] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng., 26(7):653-661, 2000.
[21] M. J. Harrold, J. A. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. A. Spoon, and A. Gujarathi. Regression test selection for Java software. In OOPSLA '01, pages 312-326, 2001.
[22] J. Henkel and A. Diwan. CatchUp!: capturing and replaying refactorings to support API evolution. In ICSE '05, pages 274-283, 2005.
[23] S. Horwitz. Identifying the semantic and textual differences between two versions of a program. In PLDI '90, volume 25, pages 234-245, June 1990.
[24] J. J. Hunt and W. F. Tichy. Extensible language-aware merging. In ICSM, pages 511-520, 2002.
[25] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20(5):350-353, 1977.
[26] D. Jackson and D. A. Ladd. Semantic Diff: A tool for summarizing the effects of modifications. In ICSM '94, pages 243-252, 1994.
[27] J. H. Johnson. Identifying redundancy in source code using fingerprints. In CASCON, pages 171-183. IBM Press, 1993.
[28] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7):654-670, 2002.
[29] M. Kim, V. Sazawal, D. Notkin, and G. C. Murphy. An empirical study of code clone genealogies. In ESEC/SIGSOFT FSE, pages 187-196, 2005.
[30] S. Kim, K. Pan, and J. E. James Whitehead. When functions change their names: Automatic detection of origin relationships. In WCRE, 2005.
[31] S. Kim, E. J. Whitehead, and J. Bevan. Analysis of signature change patterns. In MSR '05, pages 64-68.
[32] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In SAS, pages 40-56, 2001.
[33] J. Laski and W. Szermer. Identification of program modifications and its applications in software maintenance. In ICSM, 1992.
[34] E. Lippe and N. van Oosterom. Operation-based merging. In SDE '92, pages 78-87, 1992.
[35] G. Malpohl, J. J. Hunt, and W. F. Tichy. Renaming detection. Autom. Softw. Eng., 10(2):183-202, 2000.
[36] T. Mens. A state-of-the-art survey on software merging. IEEE Trans. Softw. Eng., 28(5):449-462, 2002.
[37] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In ICSE, pages 284-292, 2005.
[38] I. Neamtiu, J. S. Foster, and M. Hicks. Understanding source code evolution using abstract syntax tree matching. In MSR '05, pages 2-6.
[39] G. C. Necula. Translation validation for an optimizing compiler. In PLDI '00, pages 83-94, 2000.
[40] A. Orso, N. Shi, and M. J. Harrold. Scaling regression testing to large software systems. In SIGSOFT '04/FSE-12, pages 241-251, 2004.
[41] D. L. Parnas. On the criteria to be used in decomposing systems into modules. Commun. ACM, 15(12):1053-1058, 1972.
[42] G. Rothermel and M. J. Harrold. A safe, efficient regression test selection technique. ACM TOSEM, 6(2):173-210, 1997.
[43] J. Śliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In MSR '05, pages 24-28, 2005.
[44] A. Srivastava and J. Thiagarajan. Effectively prioritizing tests in development environment. In ISSTA '02, pages 97-106, 2002.
[45] W. F. Tichy. The string-to-string correction problem with block moves. ACM Trans. Comput. Syst., 2(4):309-321, 1984.
[46] Z. Wang, K. Pierce, and S. McFarling. BMAT - a binary matching tool for stale profile propagation. J. Instruction-Level Parallelism, 2, 2000.
[47] W. Yang. Identifying syntactic differences between two programs. Software - Practice and Experience, 21(7):739-755, 1991.
[48] A. T. T. Ying, G. C. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining change history. IEEE Trans. Softw. Eng., 30(9):574-586, 2004.
[49] X. Zhang and R. Gupta. Matching execution histories of program versions. In ESEC/SIGSOFT FSE, pages 197-206, 2005.
[50] T. Zimmermann and P. Weißgerber. Preprocessing CVS data for fine-grained analysis. In MSR '04, pages 2-6.
[51] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. IEEE Trans. Softw. Eng., 31(6):429-445, 2005.
[52] L. Zou and M. W. Godfrey. Using origin analysis to detect merging and splitting of source code entities. IEEE Trans. Softw. Eng., 31(2):166-181, 2005.