Detecting and Measuring Similarity in Code Clones

Randy Smith and Susan Horwitz
Department of Computer Sciences, University of Wisconsin–Madison
{smithr,horwitz}@cs.wisc.edu

Abstract

Most previous work on code-clone detection has focused on finding identical clones, or clones that are identical up to identifiers and literal values. However, it is often important to find similar clones, too. One challenge is that the definition of similarity depends on the context in which clones are being found. Therefore, we propose new techniques for finding similar code blocks and for quantifying their similarity. Our techniques can be used to find clone clusters, sets of code blocks all within a user-supplied similarity threshold of each other. Also, given one code block, we can find all similar blocks and present them rank-ordered by similarity. Our techniques have been used in a clone-detection tool for C programs. The ideas could also be incorporated in many existing clone-detection tools to provide more flexibility in their definitions of similar clones.

1 Introduction

Identifying code clones serves many purposes, including studying code evolution, performing plagiarism detection, enabling refactoring such as procedure extraction, and performing defect tracking and repair.

Most previous work on code-clone detection has focused on finding identical clones, or clones that could be made identical via consistent transformations of identifiers and literals. However, code segments that are similar but not identical occur often in practice, and finding such non-identical clones can be as important as finding identical code segments. For example, while automated code compaction may require finding identical clones, studies of the evolution of a code base over time require finding clones that vary in their similarity.

One of the central issues with finding non-identical clones is assessing when two pieces of code are close enough to be considered "similar" [10, 24]. Because this is likely to depend on the context in which the clone-detection tool is used, we believe that such tools should provide a quantitative measure of clone similarity, leaving the ultimate decision of classification to the user of the tool.

In this paper, we advocate the use of explicit similarity-based clone detection, and we present techniques that can be tuned to find clones of varying degrees of similarity. Our techniques identify clones at the block level, based on fingerprints computed at the statement level. At the statement level, different fingerprinting algorithms can be used so that, at one extreme, only identical statements have the same fingerprint, while at the other extreme, the same fingerprint may be given to many minimally similar statements. At the block level, sequences of statement fingerprints are grouped into syntactically-valid hierarchical blocks that reflect the structure of the source code. We can then compute the similarity score and similarity distance (defined in Section 3) for a pair of blocks as a function of the number of statements in one block whose fingerprints match those in another block.

Our techniques are largely language-independent, requiring only a language lexer. We have implemented a clone-detection tool for C programs based on these ideas that can be tuned by the user to find clones with varying degrees of similarity.

This explicit notion of similarity enables interesting queries to be performed. First, our tool can be used to find clone clusters: sets of clones such that the similarity distance between any pair of clones in a cluster is less than a user-specified maximum. Second, for a code segment S, the tool can be used to find all clones within a given similarity distance of S and will present them rank-ordered by decreasing similarity.

To illustrate, Figure 1 provides an example of four similar code segments taken from the GNU DAP program. All four code fragments involve finding the next token in the input and include code to print an error message for tokens that are too long. However, the kinds of tokens found are different, there are two different error messages, and just two of the four segments end with a call to unget1c. Our tool can find these four fragments as a clone cluster, or, given one fragment, can identify the other three as being similar.

Segment 1:

    for ( ; alphanum(c); c = dgetc(dotc, dapc, out)) {
        if (t < TOKLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "dappp:%s:%d: token too long: %s\n", dotname, lineno, token);
            exit(1);
        }
    }
    unget1c(c, dotc, (out ? dapc : NULL));

Segment 2:

    for ( ; num(c); c = dgetc(dotc, dapc, out)) {
        if (t < TOKLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "dappp:%s:%d: token too long: %s\n", dotname, lineno, token);
            exit(1);
        }
    }
    unget1c(c, dotc, (out ? dapc : NULL));

Segment 3:

    for (t = 0; c == '.' || ('0' <= c && c <= '9'); c = sbsgetc(sbsfile)) {
        if (t < TOKENLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "sbstrans: before %d: token too long: %s\n", sbslineno, token);
            exit(1);
        }
    }

Segment 4:

    for (t = 0; ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9') || c == '_' || c == '.'; c = sbsgetc(sbsfile)) {
        if (t < TOKENLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "sbstrans: before %d: token too long: %s\n", sbslineno, token);
            exit(1);
        }
    }

Figure 1. A clone cluster from the GNU DAP project.

Further details about fingerprinting and how fingerprints are used by our clone-detection tool to compute similarity scores and similarity distances are given in Sections 2 and 3. Section 4 describes how to use these similarity measures to find clone clusters, and to rank-order clones according to their similarity to a given block of code. Section 5 discusses related work, and Section 6 concludes.

In summary, this paper makes the following contributions:

1. we advocate the use of explicit similarity-based mechanisms for clone detection;
2. we present a clone-detection technique that can be tuned to find blocks of code with varying degrees of similarity;
3. we introduce the similarity score and distance as a means for gauging similarity, and present clustering and rank-ordering applications built from them;
4. we propose a fingerprinting scheme for finding identical or similar statements, providing another source of tunability in our approach to clone detection.

2 Similarity-Preserving Fingerprints

In this section we discuss how to use fingerprinting to identify statements that are similar but not necessarily identical. One approach would be to use a measure like edit distance: two statements would be considered similar if they were within some threshold edit distance of each other. This approach would have the advantage of being tunable, but the strong disadvantage of being prohibitively expensive when a large number of statements must be compared pairwise.

Furthermore, edit distance alone may not best capture statement similarity. For example, consider the two pairs of if-conditions shown below (as sequences of tokens):

Pair 1: A "function-call" condition and a "less-than" condition

    if (id(id))
    if (id < id)

Pair 2: Two "bitwise-and" conditions

    if ((id + id) & (id * id))
    if (id & id)

The first pair has a closer edit distance than the second pair: to transform the "function-call" condition to the "less-than" condition requires just two deletions (of the two parentheses) and one insertion (of the <), while for the second pair the transformation requires eight deletions. However, it might make more sense to consider the two "bitwise-and" conditions to be more similar than the first pair, based on the fact that they share the same relatively rare operator. This observation is supported by information theory: for a probability distribution Pr(T) over a domain T, the information content of an element t ∈ T increases as its probability of occurrence decreases; i.e., the more infrequent a domain element is, the more important or characteristic it becomes.

With these observations in mind, we use a fingerprinting technique that is both more efficient than edit distance and also respects the idea that a sequence of tokens may be best represented by its characteristic components: those that capture its essence and distinguish it from other, dissimilar sequences. Given this representation, two sequences are deemed similar if they share the same characteristic components.

The fingerprinting algorithm that we use is an adaptation of one that has been used successfully to identify similar English sentences in text documents [5, 22]. For each statement S, the algorithm computes all n-grams, the sequences of tokens of length n that occur in S, then produces the fingerprint for S by concatenating the k least-frequent n-grams (see footnote 1). By using the least frequent n-grams, we take into account the importance of the presence of a rare token; because each n-gram represents n consecutive tokens, we take into account the fact that similar token ordering should affect the perceived similarity of two statements; and finally, by selecting only k n-grams to represent a statement, we allow similar but non-identical statements to have the same fingerprint.

To illustrate our fingerprinting algorithm, Figure 2(a) shows the 4-grams of the statement

    if (signum < 0 || sigstr(signum, signame) != 0) break;

represented as the sequence of tokens

    if ( id < intlit || id ( id , id ) != intlit ) break ;

with their frequencies. The frequencies are taken from the frequency distribution of all 4-grams in the GNU bash shell source code (see footnote 2). Figure 2(b) shows the three least-frequent 4-grams for the example statement. Each of those 4-grams is shown with its frequency and its representation as a sequence of four token numbers (e.g., 31 is the token number for the id token). Figure 2(c) gives the statement's fingerprint: the concatenation of the representations of the three 4-grams.

Choices for k and n can be used to tune the algorithm according to the desired level of similarity: the larger the values, the more likely that only identical statements will have the same fingerprint (because the fingerprint will encode the exact sequence of tokens). However, larger values for k and n also increase the cost of computing a fingerprint. Therefore, if the goal is to have only identical statements map to the same fingerprint, we suggest more traditional techniques such as Rabin fingerprints [11] or MD5 hashing [19].

To evaluate the effectiveness of statement fingerprinting, we tokenized and fingerprinted the entire bash shell code base using the following three-step process to compute a fingerprint for each statement S in the code base:

1. For each n-gram in S, look up its frequency in a pre-computed database of n-gram frequencies.
2. Select the k n-grams in S that have the lowest frequencies in that database.
3. Concatenate the k n-grams in their order of occurrence in S.

We found k = 3 and n = 4 to be effective for finding similar statements: not too many statements have the same fingerprint, and those that do are likely to be considered to be similar by a human programmer. For example, we show below two sets of non-identical statements (expressed as sequences of tokens) where for each set all statements have the same fingerprint.

1. Two for-loops that differ only in the initialization.

    for (id = intlit; id[id] && id(id[id]); id++);
    for (id = id; id[id] && id(id[id]); id++);

2. Three assignment statements with function calls that differ only in the number of parameters.

    id = id(id-id-id, id);
    id = id(id-id-id, id, id);
    id = id(id-id-id, id, id, id);

Footnote 1: We mean the elements that have the lowest probability of occurrence over all sequences, not the elements that occur least frequently in S itself.

Footnote 2: For moderate to large code bases, we find that n-gram frequency distributions are consistent: n-grams that are infrequent in one code base are typically infrequent in other code bases.
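The three-step fingerprinting process above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names `ngrams` and `fingerprint` are hypothetical, ties between equally frequent n-grams are broken by position (an assumption; the paper does not say), and the fingerprint is returned as a tuple of tokens rather than token numbers. The example reuses the frequencies listed in Figure 2(a), assuming the comparison operator in the example statement is `<`.

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fingerprint(tokens, freq, n=4, k=3):
    """Concatenate the k least frequent n-grams of a statement.

    freq maps an n-gram (tuple of tokens) to its frequency in a reference
    code base; n-grams missing from the table are treated as frequency 0
    (an assumption made for this sketch).
    """
    grams = ngrams(tokens, n)
    # Step 1 and 2: rank positions by database frequency, keep the k rarest.
    rarest = sorted(range(len(grams)),
                    key=lambda i: (freq.get(grams[i], 0), i))[:k]
    # Step 3: concatenate the chosen n-grams in occurrence order.
    return tuple(tok for i in sorted(rarest) for tok in grams[i])


# Tokenized example statement and the 4-gram frequencies from Figure 2(a).
statement = "if ( id < intlit || id ( id , id ) != intlit ) break ;".split()
figure2_freqs = [8614, 6988, 984, 1008, 420, 3050, 117967,
                 77927, 64951, 532, 1111, 1722, 102, 734]
freq = dict(zip(ngrams(statement, 4), figure2_freqs))

fp = fingerprint(statement, freq)
print(" ".join(fp))
```

With these inputs the selected 4-grams are the ones with frequencies 420, 532, and 102, emitted in occurrence order, which matches the selection shown in Figure 2(b). Larger `n` or `k` make the fingerprint encode more of the exact token sequence, which is the tuning knob the text describes.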
    4-gram                   Freq
    if ( id <                8614
    ( id < intlit            6988
    id < intlit ||            984
    < intlit || id           1008
    intlit || id (            420
    || id ( id               3050
    id ( id ,              117967
    ( id , id               77927
    id , id )               64951
    , id ) !=                 532
    id ) != intlit           1111
    ) != intlit )            1722
    != intlit ) break         102
    intlit ) break ;          734

(a) all 4-grams and their frequencies

    4-gram                   Freq    Seq of Token Numbers
    intlit || id (            420    45:10:31:08
    , id ) !=                 532    02:31:09:27
    != intlit ) break         102    27:45:09:16

(b) the 3 least frequent 4-grams in order of occurrence

    45:10:31:08:02:31:09:27:27:45:09:16

(c) the final fingerprint

Figure 2. Fingerprint construction for the tokenized statement if ( id < intlit || id ( id , id ) != intlit ) break ;

3 Block-level Similarity

Our tool uses the fingerprints computed for each statement to compute similarity scores and distances for pairs of blocks of code. This means that we identify code clones at the block level. One issue with this approach is that duplicated code that constitutes a small part of otherwise disparate blocks may not be readily identified. There are several advantages, though. First, blocks provide natural and reasonable boundaries intrinsic to the source language for comparing code. Other such boundaries tend to be either too coarse (e.g., function bodies) or too fine (e.g., statements). Second, blocks reflect code structure without the need to invoke full parsing. Finally, blocks provide a well-defined structure over which we can quantify similarity and provide tunable measures.

For C programs, a block is a sequence of statements delimited by a matching pair of curly braces. If one block contains another, we consider both as distinct objects. To avoid the fruitless task of comparing a block with its own components, we keep track of each block's starting and ending position in the source code.

The similarity score for a pair of blocks is an ordered pair containing the number of fingerprints common to both blocks divided by the size of each block, respectively; i.e., for two blocks of code S1 and S2 it is defined as follows:

\[
\mathit{sim}(S_1, S_2) = \left\langle \frac{|S_1 \cap S_2|}{|S_1|},\; \frac{|S_1 \cap S_2|}{|S_2|} \right\rangle
\]
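The definition above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the name `similarity_score` is hypothetical, a block is represented as a list of statement fingerprints, and the intersection is taken over multisets so that repeated statements count once per occurrence (the paper does not specify set versus multiset semantics).

```python
from collections import Counter


def similarity_score(block1, block2):
    """Return (|S1 n S2| / |S1|, |S1 n S2| / |S2|) for two fingerprint lists."""
    if not block1 or not block2:
        raise ValueError("each block needs at least one statement fingerprint")
    # Multiset intersection: a fingerprint shared m times counts m times.
    common = sum((Counter(block1) & Counter(block2)).values())
    return (common / len(block1), common / len(block2))


# Hypothetical fingerprints: a 3-statement block fully contained in a
# 9-statement block.
small = ["fp_a", "fp_b", "fp_c"]
large = ["fp_a", "fp_b", "fp_c", "fp_d", "fp_e", "fp_f", "fp_g", "fp_h", "fp_i"]
score = similarity_score(small, large)
print(score)
```

Here the first component is 1.0 and the second is one third, because every fingerprint of the small block occurs in the large one while only three of the large block's nine fingerprints occur in the small one.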
The first term in the ordered pair is the fraction of fingerprints in the first block that is common to both blocks, and the second term is the fraction of fingerprints in the second block common to both blocks.

We argue that similarity as defined here is naturally binary-valued and presents a better picture of the relationship between code blocks than a single number does. Specifically, the binary-valued structure of the similarity score can be interpreted as assessing the relative containment of one block inside another. For example, a score of ⟨1.0, .333⟩ indicates that the first block is wholly contained within the second, and that roughly a third of the second is contained in the first.

Nevertheless, binary-valued similarity scores are not appropriate when a total ordering of scores is needed. This is the case, for example, when the goal is to find all blocks that are similar to a given block, and to present them rank-ordered by similarity. Simply retaining one part of the similarity score is not a good solution when one block is much larger than the other. For example, retaining only the first component of the score ⟨1.0, .333⟩ fails to distinguish it from the score ⟨1.0, 1.0⟩. Therefore, we transform a similarity score to a single-valued similarity distance when necessary. The similarity distance for a similarity score ⟨s1, s2⟩ is defined as follows:

\[
\mathit{sim\_dist}(\langle s_1, s_2 \rangle) = 1 - \sqrt{\frac{(s_1)^2 + (s_2)^2}{2}}
\]

Geometrically, s1 and s2 are treated as the two sides of a right triangle with sim_dist related to the length of the hypotenuse, normalized to a value between 0 and 1. Other measures such as a simple arithmetic mean may be used, but in practice we prefer the proposed definition, since it results in a lower (better) similarity distance for similarity scores with at least one large component than the arithmetic mean does.

One of the consequences of our definitions of similarity score and distance is that statement order is irrelevant to clone identification. This has both positive and negative consequences: clones whose statements have been reordered are readily identified at the possible cost of introducing false matches due to blocks whose statements match but logically are not clones. One could use order-preserving techniques [20], but in practice, we have found our lack of ordering to have negligible effect in clone identification.

4 Code Clustering and Rank-Ordering

We briefly describe how our tool uses similarity scores and distances to perform code clustering and rank ordering.

4.0.1 Clone Clusters

It is often useful to find clusters of clones that are mutually similar. Formally, a set of blocks forms a cluster if and only if every pair of blocks in the set is within some user-supplied similarity threshold. Thus clustering requires an O(n^2) step to compute the similarity score for each pair of blocks. Fortunately, we can significantly reduce the run time of this step if we avoid computing similarity scores for blocks with no fingerprints in common. This is done by maintaining an inverted index keyed on fingerprint values that maps them to the blocks in which they are found. For a block B with fingerprints f_B, we compute similarity scores only for those blocks that also contain some fingerprint in f_B.

Once the set of similarity scores is computed, we convert the results to a graph and find all maximal cliques. Blocks are graph nodes, and an edge is placed between two nodes if the similarity distance for the corresponding two blocks is less than the threshold. We then find the maximal cliques using the algorithm in [23], which produces sets of blocks in which each block in a set is within the threshold of every other block in the set. Figure 1 shows a cluster of four similar clones found with a threshold of 0.5.

4.0.2 Rank Ordering

One common use of clone-detection tools is to find all clones of a given code segment S. With similarity scores we can generalize this operation to produce a rank-ordered sequence of similar clones ordered by decreasing similarity. Operationally, the procedure is similar to that for clone clusters, except that an all-pairs comparison is replaced with an O(n) comparison of S to all other blocks. As before, we use an inverted index to significantly reduce the number of comparisons. After the similarity scores are computed, we rank-order the blocks according to decreasing similarity (i.e., increasing similarity distance). Figure 3 gives a partial example: Figure 3(a) contains a code segment S, and Figures 3(b), (c), and (d) show successively less-similar clones with their similarity scores and similarity distance values computed with regard to segment S in Figure 3(a).

5 Related Work

Many techniques have been proposed for performing code clone detection, ranging from lightweight line- and token-based syntactic techniques [2, 7–9, 17] to heavier-weight approaches that emphasize semantics [3, 12, 16], to metric-based techniques that indirectly measure similarity [13, 18]. Due to limited space, we refer to existing summaries [4, 14] and focus on related work as it pertains explicitly to similarity.
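The similarity distances reported in Figure 3 can be checked against the Section 3 definition, and the rank ordering of Section 4 reproduced, with a short Python sketch. The names `sim_dist` and `rank_by_similarity` are hypothetical, and the inputs are the two-decimal scores printed in the figure, so the computed distances agree with the printed ones only to within rounding.

```python
import math


def sim_dist(score):
    """Similarity distance: 1 minus the quadratic mean of the two components."""
    s1, s2 = score
    return 1.0 - math.sqrt((s1 * s1 + s2 * s2) / 2.0)


def rank_by_similarity(scores):
    """Order clone labels by increasing distance (decreasing similarity).

    scores maps a label to the similarity score of that clone with respect
    to the query segment.
    """
    return sorted(scores, key=lambda label: sim_dist(scores[label]))


# Scores of clones (b), (c), (d) relative to segment (a), as printed in Figure 3.
figure3_scores = {"(d)": (0.50, 0.33), "(b)": (0.83, 0.71), "(c)": (0.67, 0.57)}
ranking = rank_by_similarity(figure3_scores)
for label in ranking:
    print(label, round(sim_dist(figure3_scores[label]), 2))
```

The same function also illustrates why a single component is not enough for ranking: `sim_dist((1.0, 1.0))` is 0, while `sim_dist((1.0, 0.333))` is roughly 0.25, so the two scores that share a first component of 1.0 are kept apart.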
(a)

    if (!recursive(_NL_CURRENT(LC_TIME, D_FMT))) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (decided == not && strcmp(_NL_CURRENT(LC_TIME, D_FMT), HERE_D_FMT))
            decided = loc;
        want_xday = 1;
        break;
    }
    decided = raw;

(b) sim((a), (b)) = ⟨0.83, 0.71⟩, sim_dist = 0.22

    if (!recursive(_NL_CURRENT(LC_TIME, T_FMT_AMPM))) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (decided == not && strcmp(_NL_CURRENT(LC_TIME, T_FMT_AMPM), HERE_T_FMT_AMPM))
            decided = loc;
        break;
    }
    decided = raw;

(c) sim((a), (c)) = ⟨0.67, 0.57⟩, sim_dist = 0.38

    if (!recursive(_NL_CURRENT(LC_TIME, T_FMT))) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (strcmp(_NL_CURRENT(LC_TIME, T_FMT), HERE_T_FMT))
            decided = loc;
        break;
    }
    decided = raw;

(d) sim((a), (d)) = ⟨0.50, 0.33⟩, sim_dist = 0.58

    const char *fmt = _NL_CURRENT(LC_TIME, ERA_T_FMT);
    if (*fmt == '\0')
        fmt = _NL_CURRENT(LC_TIME, T_FMT);
    if (!recursive(fmt)) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (strcmp(fmt, HERE_T_FMT))
            decided = loc;
        break;
    }
    decided = raw;

Figure 3. (a) shows a user-supplied code segment. (b)-(d) show successively less-similar clones.

To the best of our knowledge, similarity-preserving fingerprints and similarity scores were first used for detecting and quantifying similarity in text documents [5, 22]. That work applied similarity fingerprints to individual sentences and calculated similarity scores between documents represented as sequences of fingerprints. At a high level, our work can be thought of as an application of those ideas to source code.

Early syntactic techniques such as Dup [2] defined parameterized clones, allowing for variation between code sequences as long as a consistent substitution of identifiers existed. Token-based techniques operate on lexemes and are thus resilient to differences in whitespace. However, these techniques do not quantify similarity as we do.

Semantics-based techniques use representations such as abstract syntax trees [3] and program dependence graphs (PDGs) [12, 16] to find semantically identical clones whose original source code may contain extraneous or reordered statements. These techniques cannot quantify similarity as we can.

Cordy et al. [6] and Roy and Cordy [20] propose a technique for finding "near-miss" clones, with some aspects that are similar to our own. As with us, they employ a token-based system and use lightweight mechanisms for ensuring syntactic validity of potential clones. In addition, their methods for measuring similarity between clones resemble our own similarity score, although they do not seem to utilize it to the extent that we do or to make it a user-controlled parameter. Unlike them, we do not enforce statement order when measuring similarity between blocks of code. Finally, our use of similarity fingerprints for statements is distinct.

Li et al. [17] have proposed CP-Miner for identifying copy-paste bugs in large systems. In their approach, code sequences are transformed so that common subsequences can be identified using data mining techniques. Like us, they fingerprint statements, although their fingerprints do not preserve similarity. As with other techniques, they do not quantify the degree of similarity that exists between potential clones.

Finally, techniques for detecting plagiarism deal with code sequences that are by their nature similar but not typically identical. The Moss system [1, 21], for example, uses a winnowing algorithm to select fragments of source code to be fingerprinted and then calculates a similarity percentage based on the set of common fingerprints. Because its goals are different from ours, Moss operates at a coarser level of granularity than we do and cannot produce the same level of detail that we can. On the other hand, its storage requirements are smaller than ours.

In summary, even for those techniques that can identify similar clones, there is typically no mechanism for quantifying the degree of similarity: code segments are similar or they aren't. In contrast, one of our main goals is to identify non-identical clones and quantify the amount of similarity present.
6 Conclusion

We posit that identifying and quantifying similarity is important for many applications of clone detection and that similarity should be intrinsic to the definition of clones. We have presented a framework and implementation for similarity detection that operates at two levels. At the lower level, we propose a tunable statement fingerprinting scheme that tends to preserve similarity across the fingerprinting function so that similar statements have the same fingerprint. At the higher level, we present a lightweight mechanism operating on language blocks over which we can quantify the amount of similarity. We illustrate the utility of this approach by describing two applications, code clustering and rank-ordering, that our approach enables. There is considerable future work to be done, but we are optimistic that similarity-based approaches such as ours can provide additional power to clone detection.

Acknowledgements

This work was supported by NSF grant number CCF-0701957.

References

[1] A. Aiken. Moss: A system for detecting software plagiarism. http://www.cs.stanford.edu/aiken/moss/.
[2] B. Baker. On finding duplication and near-duplication in large software systems. In Working Conf. on Reverse Engineering, 1995.
[3] I. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In ICSM, 1998.
[4] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. Software Eng., 33(9):577–591, 2007.
[5] D. M. Campbell, W. R. Chen, and R. D. Smith. Copy detection systems for digital documents. In Advances in Digital Lib., 2000.
[6] J. R. Cordy, T. R. Dean, and N. Synytskyy. Practical language-independent detection of near-miss clones. In CASCON '04: Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, pages 1–12. IBM Press, 2004.
[7] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In ICSM '99.
[8] J. Johnson. Identifying redundancy in source code using fingerprints. In Proc. of the IBM Centre for Advanced Studies Conferences, 1993.
[9] T. Kamiya, S. Kusumoto, and K. Inoue. Ccfinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. on Software Engineering, 28(7):654–670, July 2002.
[10] C. Kapser, P. Anderson, M. W. Godfrey, R. Koschke, M. Rieger, F. V. Rysselberghe, and P. Weißgerber. Subjectivity in clone judgment: Can we ever agree? In Koschke et al. [15].
[11] R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Jnl of Research and Development, 31(2):249–260, 1987.
[12] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In Symp. on Static Analysis, pages 40–56, July 2001.
[13] K. Kontogiannis, R. Demori, E. Merlo, M. Galler, and M. Bernstein. Pattern matching for clone and concept detection. Automated Software Engineering, 3(1–2):77–108, 1996.
[14] R. Koschke. Survey of research on software clones. In Koschke et al. [15].
[15] R. Koschke, E. Merlo, and A. Walenstein, editors. Duplication, Redundancy, and Similarity in Software, volume 06301 of Dagstuhl Seminar Proc. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
[16] J. Krinke. Identifying similar code with program dependence graphs. In Working Conf. on Reverse Engineering, pages 301–309, Oct 2001.
[17] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. Cp-miner: Finding copy-paste and related bugs in large-scale software code. IEEE Trans. Software Eng., 32(3):176–192, 2006.
[18] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In ICSM, 1996.
[19] R. Rivest. The md5 message digest algorithm. RFC 1321, April 1992. Network Working Group.
[20] C. K. Roy and J. R. Cordy. Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In ICPC, pages 172–181, 2008.
[21] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03, 2003.
[22] R. Smith. Copy detection systems for digital documents. Master's thesis, Department of Computer Science, Brigham Young University, 1999.
[23] S. Tsukiyama, M. Ide, H. Ariyoshi, and I. Shirakawa. A new algorithm for generating all the maximal independent sets. SIAM Journal of Computing, 6(3):505–517, September 1977.
[24] A. Walenstein, N. Jyoti, J. Li, Y. Yang, and A. Lakhotia. Problems creating task-relevant clone detection reference data. In WCRE, 2003.