Program Element Matching for Multi-Version Program Analyses

Miryung Kim, David Notkin
Computer Science & Engineering
University of Washington
Seattle, WA
{miryung,notkin}@cs.washington.edu

ABSTRACT
Multi-version program analyses require that elements of one version of a program be mapped to the elements of other versions of that program. Matching program elements between two versions of a program is a fundamental building block for multi-version program analyses and other software evolution research such as profile propagation, regression testing, and software version merging.

In this paper, we survey matching techniques that can be used for multi-version program analyses and evaluate them based on hypothetical change scenarios. This paper also lists challenges of the matching problem, identifies open problems, and proposes future directions.

Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement - restructuring, reverse engineering, and reengineering
General Terms: Documentation, Algorithms
Keywords: matching, software evolution, multi-version analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MSR'06, May 22-23, 2006, Shanghai, China.
Copyright 2006 ACM 1-59593-085-X/06/0005 ...$5.00.

1. INTRODUCTION
In the last several years, researchers in software engineering have begun to analyze programs together with their change history. In contrast to traditional program analyses that examine a single version, multi-version program analyses use multiple versions of a program as input and mine change patterns. There are roughly two different types of multi-version analyses: (1) coarse-grained analyses and (2) fine-grained analyses.

Coarse-grained analyses compute changes between two consecutive versions of a program, aggregate the change information across multiple versions or across multiple files, and infer coarse-grained patterns [37, 15, 20, 17]. For example, Nagappan and Ball's analysis [37] finds line-level changes between two consecutive versions, counts the total number of changes per binary module, and infers the characteristics of frequently changed modules. On the other hand, fine-grained analyses track how individual code fragments changed during program evolution to infer fine-grained change patterns [29, 31, 52, 38, 43, 48, 51, 11]. For example, a clone genealogy extractor tracks individual code snippets over multiple versions to infer clone evolution patterns [29]. As another example, a signature change pattern analysis [31, 30] traces how the name and the signature of functions change. Matching elements between two versions of a program is a fundamental building block for fine-grained multi-version analyses as well as other software evolution research such as software version merging, regression testing, profile propagation, etc. [36, 42, 21, 46].

We first define the problem of matching code elements between two versions:

    Suppose that a program $P'$ is created by modifying $P$. Determine the difference $\Delta$ between $P$ and $P'$. For a code fragment $c' \in P'$, determine whether $c' \in \Delta$. If not, find $c'$'s corresponding origin $c$ in $P$.

The problem definition states that we must compute the difference between two programs. Computing semantic differences requires solving the problem of semantic program equivalence, which is an undecidable problem. Thus, once the problem is approximated by matching a code element by its syntactic and textual similarity, solutions depend on the choices of (1) an underlying program representation, (2) matching granularity, (3) matching multiplicity, and (4) matching heuristics. In this paper, we explain how the choices impact applicability of each matching method and how the choices affect effectiveness and accuracy of matching by creating an evaluation framework for existing matching techniques.

The rest of this paper is organized as follows. The next section discusses several multi-version analyses, which demonstrate the needs of program element matching. Then, we discuss challenges of program element matching in Section 3. Section 4 presents a survey of state-of-the-art matching techniques from various research areas such as multi-version program analyses, profile propagation, software version merging, and regression testing. Section 5 compares the surveyed techniques and Section 6 evaluates each technique using hypothetical program change scenarios. Section 7 presents open problems and future directions.

2. MOTIVATING PROBLEMS
In this section, we describe several multi-version analysis problems that demonstrate importance of program element matching.

The first problem is maintaining two similar programs that originated from the same source but evolved differently in parallel.¹ In many organizations, it is a common practice to clone a product or a module and maintain the clones in parallel [13]. Maintenance difficulties arise when programmers discover a critical bug in cloned parts. If programmers do not know whether the discovered bug is relevant to other cloned counterparts, they must inspect source code of the counterparts. We believe that programmers can better locate relevant counterparts by understanding how clones change over time. Monitoring clones requires tracking each clone by its physical location such as a
file name, a procedure name, or calibrated line numbers [29].

Our second motivating problem is understanding the evolution of information hiding interfaces. The information hiding principle [41] states that programmers should anticipate what kinds of decisions are likely to change in the future and hide them using an interface. In general, it is difficult for programmers to predict which design decisions are likely to change; thus, unanticipated design decisions result in degradation of original software design. We believe that understanding interface evolution can shed light on (1) which decisions were originally hidden but later exposed and (2) how unanticipated decisions impact original interface design.

In addition to the problems above, type change analysis [38], signature change analysis [31], and instability analysis [11] also require matching program elements in order to track code elements over time.

3. MATCHING CHALLENGES
This section lists challenges of program element matching.

3.1 Absence of Benchmarks
It is difficult to evaluate matching techniques because there is no reference data set or archive of editing logs. Previous studies [30] also indicate that programmers often disagree about the origin of code snippets; low inter-rater agreement suggests that there may be no ground truth in program element matching.

3.2 Various Granularity Support
With respect to our motivating problems in Section 2, we cannot assume that programmers intend to track program elements at a fixed granularity. There are two reasons why matching techniques must support various granularity. First, a programmer may want to track design decisions that cannot be mapped to program units precisely. Second, a programmer may want to track program elements at a different granularity depending on the nature of tasks. For example, if matching techniques are to be used for profile propagation or precise regression test selection, mappings should be found at a fine granularity such as at the level of control flow graph edges [42] or at the level of code blocks [46, 44]. On the other hand, if a programmer wants to use matching results for program understanding tasks, it would be more appropriate to find associations at a higher level such as a file.

3.3 Types of Code Changes
Certain types of code changes make the matching problem non-trivial. For example, tracking code by its enclosing procedure name would fail if programmers merged, split, or renamed procedures. When a programmer copies code, matching techniques cannot assume one-to-one mappings between old elements and new elements because an old code element can have more than one matching descendant in a new version.

¹ In an open source project community, this practice is often called forking.

4. MATCHING TECHNIQUES
This section describes matching techniques used for software version merging, program differencing, profile propagation, regression testing, and multi-version program analyses. For easy comparison, we group techniques by program representation. We describe clone detectors and tools that infer refactoring events in Sections 4.7 and 4.8 because these tools can be leveraged for finding correspondences between two versions.

4.1 Entity Name Matching
The simplest matching method treats program elements as immutable entities with a fixed name and matches the elements by name. For example, Zimmermann et al. modeled a function as a tuple, (file name, FUNCTION, function name), and a field as a tuple, (function name, FIELD, field name) [51]. Similarly, Ying et al. [48] modeled a file with its full path name. In fact, matching by name would be sufficient for many multi-version analyses that intend to identify coarse-grained patterns such as the characteristics of fault prone modules [15, 20, 37].

4.2 String Matching
When a program is represented as a string, the best match between two strings is computed by finding the longest common subsequence (LCS) [7]. The LCS problem is built on the assumption that (1) available operations are addition and deletion, and (2) matched pairs cannot cross one another. Thus, the longest common subsequence does not necessarily include all possible matches when available edit operations include copy, paste, and move. Tichy [45] (bdiff) extended the LCS problem by relaxing the two assumptions above: permitting crossing block moves and not requiring one-to-one correspondence.

The line-level LCS implementation, diff [25], has served as a basis for many multi-version analyses, because (1) diff
is fast and reliable, and (2) popular version control systems such as CVS [2] or Subversion [1] already store changes as line-level differences. For example, a clone genealogy extractor tracks code snippets by their file name and line number [29]. As another example, fix-inducing code snippets [43] are inferred by tracking backward a tuple of (file name :: function name :: line number) from the moment that a bug is fixed.

4.3 Syntax Tree Matching
For software version merging, Yang [47] developed an AST differencing algorithm. Given a pair of two functions $(f_T, f_R)$, the algorithm creates two abstract syntax trees $T$ and $R$ and attempts to match the two tree roots. If the two roots match, the algorithm aligns $T$'s subtrees $t_1, t_2, \ldots, t_i$ and $R$'s subtrees $r_1, r_2, \ldots, r_j$ using the LCS algorithm and matches subtrees recursively. This type of tree matching respects the parent-child relationship as well as the order between sibling nodes, but is very sensitive to changes in nested block and control structures because tree roots must be matched for every level.

Hunt and Tichy do not directly compare ASTs but use syntactic information to guide string level differencing. Their 3-way merging tool [24] parses a program into a language neutral form, compares token strings using the LCS algorithm, and finds syntactic changes using structural information from the parse.

For dynamic software updating, Neamtiu et al. [38] built an algorithm that tracks simple changes to variables, types, and functions based on an AST representation. Neamtiu's algorithm assumes that function names are relatively stable over time and matches the ASTs of functions with the same name; the algorithm traverses two ASTs in parallel and incrementally adds one-to-one mappings as long as the ASTs have the same shape. In contrast to Yang's algorithm, Neamtiu's algorithm cannot compare structurally different ASTs.

4.4 Control Flow Graph Matching
Laski and Szermer [33] first developed an algorithm that computes one-to-one correspondences between CFG nodes in two programs $P_1$ and $P_2$. This algorithm first reduces a CFG to a series of single-entry, single-exit subgraphs called hammocks and matches a sequence of hammock nodes using a depth first search (DFS). Once a pair of corresponding hammock nodes is found, the hammock nodes are recursively expanded in order to find correspondences within the matched hammocks.

Jdiff [5] extends Laski and Szermer's (LS) algorithm to compare Java programs based on an enhanced control flow graph (ECFG). Jdiff is similar to the LS algorithm in the sense that hammocks are recursively expanded and compared, but is different in three ways. First, while the LS algorithm compares hammock nodes by the name of a start node in the hammock, Jdiff checks whether the ratio of unchanged-matched pairs in the hammock is greater than a chosen threshold in order to allow for flexible matches. Second, while the LS algorithm uses DFS to match hammock nodes, Jdiff only uses DFS up to a certain look-ahead depth to improve its performance. Third, while the LS algorithm requires hammock node matches at the same nested level, Jdiff can match hammock nodes at a different nested level; thus, Jdiff is more robust to addition of while loops or if-statements at the beginning of a code segment. Jdiff has been used for regression test selection [40] and dynamic impact analysis [6].

4.5 Program Dependence Graph Matching
There are several program differencing algorithms based on a program dependence graph [23, 12, 26]. These algorithms are not applicable to popular modern programming languages because they can run only on a limited subset of C languages without global variables, pointers, arrays, or procedures.

4.6 Binary Code Matching
BMAT [46] is a fast and effective tool that matches two versions of a binary program without knowledge of source code changes. BMAT was used for profile propagation and regression test prioritization [44]. BMAT's algorithm matches blocks in three steps. The first step of BMAT's matching algorithm is to find one-to-one mappings between the procedures in two versions based on their names, type information, and code contents. To match procedures with different names, block trial matching is done on procedure pairs with a small number of character differences in their hierarchical names. In this step, the thresholds for procedure name difference and block matching percentage are both set heuristically. In the second step, BMAT
first matches data blocks within each pair of matched procedures using a hash function and matches remaining unmatched data blocks if the unmatched blocks are sandwiched by already matched pairs. In the third step, BMAT matches code blocks in multiple hashing passes. During hash-based matching, if hash values collide, two heuristics are used to break ties: (1) crossing matches are forbidden at certain hashing passes, and (2) a pair of blocks is preferred to other tied pairs if either its predecessors or successors are also matched. For remaining unmatched blocks, BMAT matches blocks based on control flow equivalence, allowing one-to-many mappings between old code blocks and new code blocks.

4.7 Clone Detection
A clone detector is simply an implementation of an arbitrary equivalence function. The equivalence function defined by each clone detector depends on a program representation and a comparison algorithm. Most clone detectors [8, 28, 9, 32, 27] are heavily dependent on (1) hash functions to improve performance, (2) parameterization to allow flexible matches, and (3) thresholds to remove spurious matches. A clone detector can be considered as a many-to-many matcher based solely on content similarity heuristics.

4.8 Origin Analysis Tools
Origin analysis infers refactoring events such as splitting, merging, renaming and moving by comparing two versions of a program [14, 52, 31, 4, 18, 35, 19]. Origin analysis tackles the program element matching problem directly but produces matching results only at a predefined granularity such as a procedure, class or file.

Demeyer et al. [14]
first proposed the idea of inferring refactoring events by comparing the two programs. Demeyer et al. used a set of ten characteristics metrics for each class, such as LOC and the number of method calls within a method (i.e., fan-out) and inferred where refactoring events occurred based on the metric values and a class inheritance hierarchy.

Zou and Godfrey's origin analysis [52] matches procedures using multiple criteria (names, signatures, metric values, and a set of callers and callees) and infers merging, splitting, and renaming events. Both Demeyer et al. and Zou and Godfrey's analyses are semi-automatic in the sense that a programmer must manually tune matching criteria and select a match among candidate matches.

Kim et al. [30] automated Zou and Godfrey's procedure renaming analysis. In addition to matching criteria used by Zou and Godfrey, Kim et al. used clone detectors such as CCFinder [28] and Moss [3] to calculate content similarity between procedures. An overall similarity is computed as a weighted sum of each similarity metric, and a match is selected if the overall similarity is greater than a certain threshold. To create an evaluation data set, ten human judges identified renaming events in the Subversion and the Apache projects, and if seven out of ten judges agreed on the origin of a renamed procedure, a match was added to a reference data set. Using the reference data set, Kim et al. optimized each factor's weight and tuned the threshold value. The accuracy of Kim's tool was better than the average accuracy of human judges, indicating that human judges significantly disagreed about the origin of procedures.

5. COMPARISON
Table 1 shows comparison of the state-of-the-art matching techniques in Section 4. As shown in the fourth column of Table 1, many matching techniques assume correspondence at a certain granularity no matter whether this assumption is explicitly stated or not. For example, using diff to match code snippets assumes that input files already are matched. As another example, using cdiff to match AST nodes assumes that enclosing functions are matched by the same name.

Table 1: Comparison of Matching Techniques

Program Representation | Citation | Granularity | Assumed Correspondence | Multiplicity | Heuristics (N/P/S) | Application
A set of entities | [20, 15, 37] | Module | | 1:1 | ✓ | Fault proneness
A set of entities | Bevan et al. [11] | File | | 1:1 | ✓ | Instability
A set of entities | Ying et al. [48] | File | | 1:1 | ✓ | Co-change
A set of entities | Zimmermann et al. [51] | File, Procedure, Field | | 1:1 | ✓ |
String | diff [25] | Line | File | 1:1 | ✓ | Merging, Clone genealogy [29], Fix inducing code [43]
String | bdiff [45] | Line | File | 1:n | ✓ | Merging
AST | cdiff [47] | AST Node | Procedure | 1:1 | ✓ |
AST | Neamtiu et al. [38] | Type, Var | | 1:1 | ✓✓ | Type change
AST | Hunt, Tichy [24, 35] | Token | File | 1:1 | ✓✓ | Merging
CFG | Jdiff [5] | CFG node | | 1:1 | ✓✓ | Regression testing, Impact analysis
Binary | BMAT [46] | Code block | | 1:1 (procedure), n:1 (block) | ✓✓✓ | Profile propagation, Regression testing
Hybrid | Zou, Godfrey [52] | Procedure | | 1:1 or 1:n or n:1 | ✓✓ | Origin analysis
Hybrid | Kim et al. [30] | Procedure | | 1:1 | ✓✓ | Signature change [31], Renaming analysis

N: Name-based heuristics, P: Position-based heuristics, S: Similarity-based heuristics

All matching techniques heavily rely on heuristics to reduce a matching scope and to improve precision and recall. The heuristics are categorized into three categories:²

1. Name-based heuristics match entities with similar names. For example, BMAT and Jdiff match procedures in multiple phases by the same globally qualified name (e.g., System.out.println), by the same hierarchical name, by the same signature, and by the same name.
2. Position-based heuristics match entities with similar positions. If entities are placed in the same syntactic position or surrounded by already matched pairs (i.e., a sandwich heuristic), they become a matched pair. For example, BMAT uses a sandwich heuristic aggressively to remove unmatched pairs. As another example, Neamtiu's algorithm traverses two ASTs in parallel and matches variables placed in the same syntactic position regardless of their labels.
3. Similarity-based heuristics match entities that are nearly identical; they often rely on parameterization and a hash function to find near identical entities. All clone detectors can be viewed as similarity-based matchers.

² The three categories are not comprehensive or mutually exclusive.

The three different types of heuristics complement one another. For example, when hash values collide or parameterization results in spurious matches, position-based heuristics will select a matched pair that preserves linear ordering or structural ordering by checking neighboring matches. Table 1 (columns 6 to 8) summarizes which kinds of heuristics each matching technique uses.

6. EVALUATION
Matching techniques are often inadequately evaluated; only Kim et al. conducted a comparative study using human subjects [30]. This lack of evaluation is exacerbated by the fact that there are no agreed evaluation criteria or representative benchmarks. Finding such universal criteria would be difficult since each technique is built for a different goal. For example, matching techniques for regression testing or profile propagation [5, 46, 49] can be evaluated by the accuracy of static branch prediction and code coverage; but even this evaluation method is not applicable to programs without test suites.

To evaluate matching techniques uniformly, we take a scenario-based evaluation approach; we design a small set of hypothetical program change scenarios, on which we describe how well various matching techniques will perform.³

³ PDG-based matching techniques are excluded due to lack of modern programming language support.

Scenario 1: (1) a programmer inserts if-else statements in the beginning of the method mA, and (2) the programmer reorders several statements in the method mB without changing semantics as shown in Table 2. The ideal matching technique should produce (s1-s1'), (s2-s2'), (s3-s4'), (s4-s3'), and (s5-s5') and identify that s0' is added.

Table 2: Scenario 1 Code Change

before:
    mA() {
      if (pred_a) {        // s1
        foo()              // s2
      }
    }
    mB(b) {
      a := 1               // s3
      b := b + 1           // s4
      fun(a, b)            // s5
    }

after:
    mA() {
      if (pred_a0) {       // s0'
        if (pred_a) {      // s1'
          foo()            // s2'
        }
      }
    }
    mB(b) {
      b := b + 1           // s3'
      a := 1               // s4'
      fun(a, b)            // s5'
    }

The third column of Table 3 summarizes how well each technique will work in this scenario. Diff can match lines of mA but cannot match reordered lines in mB because the LCS algorithm does not allow crossing block moves. On the other hand, bdiff can match reordered lines in mB because crossing block moves are allowed. Neamtiu's algorithm will perform poorly in both mA and mB because it does not perform a deep structural match. Cdiff cannot match unchanged parts in mA correctly because cdiff stops early if roots do not match for each level. Jdiff will be able to skip the changed control structure, map unchanged parts in mA, and match reordered statements in mB if the look-ahead threshold is greater than the depth of nested controls. BMAT cannot track code blocks in mB because BMAT's hashing algorithms are instruction order sensitive. In conclusion, Jdiff will work the best for changes within procedures at a statement or predicate level.

Scenario 2: A file PElmtMatch changed its name to PMatching. A procedure matchBlck is split into two procedures matchDBlck and matchCBlck. A procedure matchAST changed its name to matchAbstractSyntaxTree. The ideal matching technique should produce (PElmtMatch, PMatching), (matchBlck, matchDBlck), (matchBlck, matchCBlck), and (matchAST, matchAbstractSyntaxTree).

The fourth column of Table 3 summarizes how each technique will work in this scenario. Most name-based matching techniques will do poorly due to renaming events. Diff and bdiff will be able to track each line only if file names did not change. Both cdiff and Neamtiu's algorithm will perform poorly if procedure names changed. Both BMAT and origin analysis tools will perform well because they rely on multiple passes of hash functions and multiple phases of name matching.

The remaining columns of Table 3 describe how well each matching technique will work in case of restructuring tasks at a procedure level or at a file level.

Based on Tables 1 and 3, we conclude the following:

- Matching techniques based on AST or CFG produce matches at fine-grained levels but are only applicable to a complete and parsable program. Researchers must consider the trade-off
  between matching granularity, matching requirements, and matching cost.
- Many techniques employ the LCS algorithm even when matching AST or CFG, thus inheriting the assumptions of LCS: one-to-one correspondences between matched entities and linear ordering among matched pairs. Implicit assumptions of this sort must be carefully examined before implementing a matcher.
- Most techniques support only one-to-one mappings at a fixed granularity. Therefore, they will perform poorly when merging or splitting occurs.
- The more heuristics are used, the more matches can be found by complementing one another. For example, name-based matching is easy to implement and can reduce matching scope quickly, but it is not robust to renaming events. In this case, similarity-based matching can produce matches between renamed entities and position-based matching can leverage already matched pairs to infer more matches.

7. FUTURE DIRECTIONS
This section lists remaining open problems and future directions.

Hybrid Matcher. Although no single technique performs perfectly in all change scenarios, the combination of all techniques does. Thus combining multiple techniques may improve the accuracy of matches by complementing one another. The simplest way is to run all matching techniques separately and find consensus among the results. Another way of building a hybrid matcher is to leverage a feedback loop between matching tools and tools that infer refactoring events. Determining which refactoring occurred and determining correspondences is a chicken and egg problem; inferring refactoring events requires knowledge of correspondences, and finding good correspondences is achieved by knowing which refactoring occurred. This feedback cycle provides an opportunity to find more matches. The results of inferred transformations are fed to matching tools, and the matching results are fed back to a refactoring reconstruction tool iteratively until optimal correspondences are found.

We must note that combining results from multiple matchers will require tremendous effort because (1) not every matching tool is available for public use or applicable to popular programming languages and (2) different matchers use different program representations.

Capturing Editing Operations. Having a complete history of logical editing operations would nullify the matching problem. However, most software repositories employ state-based merging, not operation-based merging [34], thus making it impossible to restore logical editing operations completely. Even when an edit log is available, if editing operations are captured at a keystroke level, it is not trivial to reconstruct logical editing operations (such as procedure renaming, splitting, and merging) and produce matches between program elements. Recently, capturing and replaying refactoring operations has been shown to be possible in an Eclipse IDE [22], so we can leverage this type of refactoring history to initiate the feedback loop discussed above.

Interval Manipulation vs. Matching Tool Selection. In this paper, we simplified a multi-version program matching problem as a two version matching problem. To use a matching technique in the context of multi-version analyses, the interval between each pair of versions must be determined. In the past, the granularity of available historical data limited a sampling interval for multi-version analyses. Recently, several infrastructures [10, 51, 50] were built to facilitate multi-version analyses by restoring commit transactions from a source code repository and automatically extracting multiple versions separated by an arbitrary time interval. These infrastructures enable multi-version analyses to manipulate a sampling interval. Therefore, the remaining problem is to determine an optimal sampling interval for each matching technique (or select an appropriate matching tool depending on the logical gap between two versions of a program). Another interesting open question is, "can we design a matching technique that works as well as aggregating results from a set of program snapshots that are separated by only small changes?"

Matching Result Aggregation. As matching complexity increases by supporting multiple granularities and many-to-many mappings, representing match results becomes a non-trivial problem. In addition, when a two-version matching tool is used for multi-version program analyses, aggregating individual matching results and representing the final results in a compact form remains an open problem.

Leveraging Dynamic Information. Most matching techniques are based on syntactic similarities at a source code level. In comparison checking research [49, 39], dynamic information has been used to match an optimized version and an unoptimized version of the same program when the two versions were executed on the same input. Abstraction of multiple execution traces may guide matching of a static program representation. For example, comparing dynamic invariants [16] may be useful for identifying variable level matches at the entry (or exit) of a function.

Table 3: Evaluation of the Surveyed Matching Techniques
Columns: Program Representation; Citation; Scenario (1, 2); Transformations (Split/Merge: Proc, File; Rename: Proc, File); Strength and Weakness. Rating scale: good, mediocre, poor.

Program Representation | Citation | Strength and Weakness
String | diff [25] | sensitive to file name changes
String | bdiff [45] | + can trace copied blocks
AST | cdiff [47] | sensitive to nested level change; require procedure level mappings
AST | Neamtiu et al. [38] | partial AST matching
AST | Hunt, Tichy [24, 35] | require file level mappings; + can identify procedure renaming
CFG | Jdiff [5] | + robust to signature changes; sensitive to control structure changes
Binary | BMAT [46] | + robust to procedure renaming; assume 1:1 procedure correspondence; only applicable to binary programs
Hybrid | Zou, Godfrey [52] | semi-automatic, manual analysis
Hybrid | Kim et al. [30] | assume 1:1 procedure correspondence

8. CONCLUSION
In this paper, we defined the program element matching problem and argued its importance for
fine-grained multi-version analyses. We presented a survey of matching techniques from various research areas and evaluated them based on hypothetical program change scenarios by hand. We believe that our assessment of existing techniques will guide researchers to choose an appropriate matching technique for their analysis.

In conclusion, every matching technique is an implementation of some pseudo equivalence function, and the more heuristics are used, the better the matching technique will work. One direction of future work involves building a hybrid matcher that leverages a feedback loop between matching tools and tools that infer refactoring events. Another future direction is to customize existing matchers in the context of a specific type of multi-version analysis and build an evaluation data set for that analysis. In addition, determining an optimal sampling interval for each matching technique remains an open problem.

Our longer-term objectives are to (1) define the problem more precisely, allowing for better assessment and sharing of the approaches and (2) lay a foundation for more effective solutions applicable to specific kinds of multi-version analyses.

9. ACKNOWLEDGMENTS
We thank Dagstuhl 05261 seminar participants for fruitful discussions. We also thank Michael Toomim for reading our draft and Vibha Sazawal, Dan Grossman, and Rob DeLine for discussions that helped us refine our ideas.

10. REFERENCES
[1] subversion.tigris.org.
[2] www.cvshome.org.
[3] A. Aiken. A system for detecting software plagiarism.
[4] G. Antoniol, M. D. Penta, and E. Merlo. An automatic approach to identify class evolution discontinuities. In IWPSE, pages 31-40, 2004.
[5] T. Apiwattanapong, A. Orso, and M. J. Harrold. A differencing algorithm for object-oriented programs. In ASE, pages 2-13. IEEE Computer Society, 2004.
[6] T. Apiwattanapong, A. Orso, and M. J. Harrold. Efficient and precise dynamic impact analysis using execute-after sequences. In ICSE, pages 432-441, 2005.
[7] A. Apostolico and Z. Galil, editors. Pattern matching algorithms. Oxford University Press, UK, 1997.
[8] B. S. Baker. A program for identifying duplicated code. Computing Science and Statistics, 24:49-57, 1992.
[9] I. D. Baxter, A. Yahin, L. M. de Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In ICSM, pages 368-377, 1998.
[10] J. Bevan, J. E. James Whitehead, S. Kim, and M. Godfrey. Facilitating software evolution research with Kenyon. In ESEC/FSE, pages 177-186, 2005.
[11] J. Bevan and E. J. W. Jr. Identification of software instabilities. In WCRE, pages 134-145, 2003.
[12] D. Binkley, S. Horwitz, and T. Reps. Program integration for languages with procedure calls. ACM TOSEM, 4(1):3-35, 1995.
[13] J. R. Cordy. Comprehending reality: Practical barriers to industrial adoption of software maintenance automation. In IWPC '03, page 196, 2003.
[14] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change metrics. In OOPSLA '00, pages 166-177, 2000.
[15] S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus. Does code decay? Assessing the evidence from change management data. IEEE Trans. Softw. Eng., 27(1):1-12, 2001.
[16] M. D. Ernst. Dynamically Discovering Likely Program Invariants. Ph.D. Dissertation, University of Washington, Seattle, Washington, Aug. 2000.
[17] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In ICSM, pages 190-197, 1998.
[18] C. Görg and P. Weißgerber. Error detection by refactoring reconstruction. In MSR '05, pages 29-35.
[19] C. Görg and P. Weißgerber. Detecting and visualizing refactorings from software archives. In IWPC, pages 205-214, 2005.
[20] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng., 26(7):653-661, 2000.
[21] M. J. Harrold, J. A. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. A. Spoon, and A. Gujarathi. Regression test selection for Java software. In OOPSLA '01, pages 312-326, 2001.
[22] J. Henkel and A. Diwan. Catchup!: capturing and replaying refactorings to support API evolution. In ICSE '05, pages 274-283, 2005.
[23] S. Horwitz. Identifying the semantic and textual differences between two versions of a program. In PLDI '90, volume 25, pages 234-245, June 1990.
[24] J. J. Hunt and W. F. Tichy. Extensible language-aware merging. In ICSM, pages 511-520, 2002.
[25] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20(5):350-353, 1977.
[26] D. Jackson and D. A. Ladd. Semantic Diff: A tool for summarizing the effects of modifications. In ICSM '94, pages 243-252, 1994.
[27] J. H. Johnson. Identifying redundancy in source code using fingerprints. In CASCON, pages 171-183. IBM Press, 1993.
[28] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7):654-670, 2002.
[29] M. Kim, V. Sazawal, D. Notkin, and G. C. Murphy. An empirical study of code clone genealogies. In ESEC/SIGSOFT FSE, pages 187-196, 2005.
[30] S. Kim, K. Pan, and J. E. James Whitehead. When functions change their names: Automatic detection of origin relationships. In WCRE, 2005.
[31] S. Kim, E. J. Whitehead, and J. Bevan. Analysis of signature change patterns. In MSR '05, pages 64-68.
[32] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In SAS, pages 40-56, 2001.
[33] J. Laski and W. Szermer. Identification of program modifications and its applications in software maintenance. In ICSM, 1992.
[34] E. Lippe and N. van Oosterom. Operation-based merging. In SDE '92, pages 78-87, 1992.
[35] G. Malpohl, J. J. Hunt, and W. F. Tichy. Renaming detection. Autom. Softw. Eng., 10(2):183-202, 2000.
[36] T. Mens. A state-of-the-art survey on software merging. IEEE Trans. Softw. Eng., 28(5):449-462, 2002.
[37] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In ICSE, pages 284-292, 2005.
[38] I. Neamtiu, J. S. Foster, and M. Hicks. Understanding source code evolution using abstract syntax tree matching. In MSR '05, pages 2-6.
[39] G. C. Necula. Translation validation for an optimizing compiler. In PLDI '00, pages 83-94, 2000.
[40] A. Orso, N. Shi, and M. J. Harrold. Scaling regression testing to large software systems. In SIGSOFT '04/FSE-12, pages 241-251, 2004.
[41] D. L. Parnas. On the criteria to be used in decomposing systems into modules. Commun. ACM, 15(12):1053-1058, 1972.
[42] G. Rothermel and M. J. Harrold. A safe, efficient regression test selection technique. ACM TOSEM, 6(2):173-210, 1997.
[43] J. Sliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In MSR '05, pages 24-28, 2005.
[44] A. Srivastava and J. Thiagarajan. Effectively prioritizing tests in development environment. In ISSTA '02, pages 97-106, 2002.
[45] W. F. Tichy. The string-to-string correction problem with block moves. ACM Trans. Comput. Syst., 2(4):309-321, 1984.
[46] Z. Wang, K. Pierce, and S. McFarling. BMAT - a binary matching tool for stale profile propagation. J. Instruction-Level Parallelism, 2, 2000.
[47] W. Yang. Identifying syntactic differences between two programs. Software - Practice and Experience, 21(7):739-755, 1991.
[48] A. T. T. Ying, G. C. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining change history. IEEE Trans. Softw. Eng., 30(9):574-586, 2004.
[49] X. Zhang and R. Gupta. Matching execution histories of program versions. In ESEC/SIGSOFT FSE, pages 197-206, 2005.
[50] T. Zimmermann and P. Weißgerber. Preprocessing CVS data for fine-grained analysis. In MSR '04, pages 2-6.
[51] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. IEEE Trans. Softw. Eng., 31(6):429-445, 2005.
[52] L. Zou and M. W. Godfrey. Using origin analysis to detect merging and splitting of source code entities. IEEE Trans. Softw. Eng., 31(2):166-181, 2005.
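
The line-level LCS matching described in Section 4.2, and its behavior on the reordered statements of mB in Scenario 1, can be illustrated with a minimal Python sketch. This is not the diff implementation [25] or any other tool surveyed in the paper; the function name lcs_match and the encoding of the statements as a list of strings are assumptions made only for this illustration.

```python
def lcs_match(old, new):
    """Return index pairs (i, j) such that old[i] == new[j], chosen along
    one longest common subsequence. Pairs never cross one another."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one set of matches.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Body of mB before and after the change in Table 2 (hypothetical encoding).
old_mB = ["a := 1", "b := b + 1", "fun(a, b)"]
new_mB = ["b := b + 1", "a := 1", "fun(a, b)"]
matches = lcs_match(old_mB, new_mB)
print(matches)
```

On this input the sketch matches only two of the three statements: because matched pairs may not cross, one of the two reordered assignments is left unmatched. This is the limitation that the block moves of bdiff [45] are meant to relax.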