Detecting and Measuring Similarity in Code Clones

Randy Smith and Susan Horwitz
Department of Computer Sciences, University of Wisconsin–Madison
{smithr,horwitz}@cs.wisc.edu

Abstract

Most previous work on code-clone detection has focused on finding identical clones, or clones that are identical up to identifiers and literal values. However, it is often important to find similar clones, too. One challenge is that the definition of similarity depends on the context in which clones are being found. Therefore, we propose new techniques for finding similar code blocks and for quantifying their similarity. Our techniques can be used to find clone clusters, sets of code blocks all within a user-supplied similarity threshold of each other. Also, given one code block, we can find all similar blocks and present them rank-ordered by similarity. Our techniques have been used in a clone-detection tool for C programs. The ideas could also be incorporated in many existing clone-detection tools to provide more flexibility in their definitions of similar clones.

1 Introduction

Identifying code clones serves many purposes, including studying code evolution, performing plagiarism detection, enabling refactoring such as procedure extraction, and performing defect tracking and repair. Most previous work on code-clone detection has focused on finding identical clones, or clones that could be made identical via consistent transformations of identifiers and literals. However, code segments that are similar but not identical occur often in practice, and finding such non-identical clones can be as important as finding identical code segments. For example, while automated code compaction may require finding identical clones, studies of the evolution of a code base over time require finding clones that vary in their similarity.

One of the central issues with finding non-identical clones is assessing when two pieces of code are close enough to be considered "similar" [10, 24]. Because this is likely to depend on the context in which the clone-detection tool is used, we believe that such tools should provide a quantitative measure of clone similarity, leaving the ultimate decision of classification to the user of the tool.

In this paper, we advocate the use of explicit similarity-based clone detection, and we present techniques that can be tuned to find clones of varying degrees of similarity. Our techniques identify clones at the block level, based on fingerprints computed at the statement level. At the statement level, different fingerprinting algorithms can be used so that, at one extreme, only identical statements have the same fingerprint, while at the other extreme, the same fingerprint may be given to many minimally similar statements. At the block level, sequences of statement fingerprints are grouped into syntactically-valid hierarchical blocks that reflect the structure of the source code. We can then compute the similarity score and similarity distance (defined in Section 3) for a pair of blocks as a function of the number of statements in one block whose fingerprints match those in another block.

Our techniques are largely language-independent, requiring only a language lexer. We have implemented a clone-detection tool for C programs based on these ideas that can be tuned by the user to find clones with varying degrees of similarity. This explicit notion of similarity enables interesting queries to be performed. First, our tool can be used to find clone clusters: sets of clones such that the similarity distance between any pair of clones in a cluster is less than a user-specified maximum. Second, for a code segment S, the tool can be used to find all clones within a given similarity distance of S and will present them rank-ordered by decreasing similarity.

To illustrate, Figure 1 provides an example of four similar code segments taken from the GNU DAP program. All four code fragments involve finding the next token in the input and include code to print an error message for tokens that are too long. However, the kinds of tokens found are different, there are two different error messages, and just two of the four segments end with a call to unget1c. Our tool can find these four fragments as a clone cluster, or, given one fragment, can identify the other three as being similar.

    for (; alphanum(c); c = dgetc(dotc, dapc, out)) {
        if (t < TOKLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "dappp:%s:%d: token too long: %s\n",
                    dotname, lineno, token);
            exit(1);
        }
    }
    unget1c(c, dotc, (out ? dapc : NULL));

    Segment 1

    for (; num(c); c = dgetc(dotc, dapc, out)) {
        if (t < TOKLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "dappp:%s:%d: token too long: %s\n",
                    dotname, lineno, token);
            exit(1);
        }
    }
    unget1c(c, dotc, (out ? dapc : NULL));

    Segment 2

    for (t = 0; c == '.' || ('0' <= c && c <= '9'); c = sbsgetc(sbsfile)) {
        if (t < TOKENLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "sbstrans: before %d: token too long: %s\n",
                    sbslineno, token);
            exit(1);
        }
    }

    Segment 3

    for (t = 0; ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') ||
                ('0' <= c && c <= '9') || c == '_' || c == '.';
         c = sbsgetc(sbsfile)) {
        if (t < TOKENLEN)
            token[t++] = c;
        else {
            token[t] = '\0';
            fprintf(stderr, "sbstrans: before %d: token too long: %s\n",
                    sbslineno, token);
            exit(1);
        }
    }

    Segment 4

Figure 1. A clone cluster from the GNU DAP project.

Further details about fingerprinting and how fingerprints are used by our clone-detection tool to compute similarity scores and similarity distances are given in Sections 2 and 3. Section 4 describes how to use these similarity measures to find clone clusters, and to rank-order clones according to their similarity to a given block of code. Section 5 discusses related work, and Section 6 concludes.

In summary, this paper makes the following contributions:

1. we advocate the use of explicit similarity-based mechanisms for clone detection;
2. we present a clone-detection technique that can be tuned to find blocks of code with varying degrees of similarity;
3. we introduce the similarity score and distance as a means for gauging similarity, and present clustering and rank-ordering applications built from them;
4. we propose a fingerprinting scheme for finding identical or similar statements, providing another source of tunability in our approach to clone detection.

2 Similarity-Preserving Fingerprints

In this section we discuss how to use fingerprinting to identify statements that are similar but not necessarily identical. One approach would be to use a measure like edit distance: two statements would be considered similar if they were within some threshold edit distance of each other. This approach would have the advantage of being tunable, but the strong disadvantage of being prohibitively expensive when a large number of statements must be compared pairwise.

Furthermore, edit distance alone may not best capture statement similarity. For example, consider the two pairs of if-conditions shown below (as sequences of tokens):

Pair 1: A "function-call" condition and a "less-than" condition

    if ( id ( id ) )
    if ( id < id )

Pair 2: Two "bitwise-and" conditions

    if ( ( id + id ) & ( id * id ) )
    if ( id & id )

The first pair has a closer edit distance than the second pair: to transform the "function-call" condition to the "less-than" condition requires just two deletions (of the two parentheses) and one insertion (of the <), while for the second pair the transformation requires eight deletions. However, it might make more sense to consider the two "bitwise-and" conditions to be more similar than the first pair, based on the fact that they share the same relatively rare operator. This observation is supported by information theory: for a probability distribution Pr(T) over a domain T, the information content of an element t ∈ T increases as its probability of occurrence decreases; i.e., the more infrequent a domain element is, the more important or characteristic it becomes.

With these observations in mind, we use a fingerprinting technique that is both more efficient than edit distance and also respects the idea that a sequence of tokens may be best represented by its characteristic components: those that capture its essence and distinguish it from other, dissimilar sequences. Given this representation, two sequences are deemed similar if they share the same characteristic components.

The fingerprinting algorithm that we use is an adaptation of one that has been used successfully to identify similar English sentences in text documents [5, 22]. For each statement S, the algorithm computes all n-grams, the sequences of tokens of length n that occur in S, then produces the fingerprint for S by concatenating the k least-frequent n-grams.¹ By using the least frequent n-grams, we take into account the importance of the presence of a rare token; because each n-gram represents n consecutive tokens, we take into account the fact that similar token ordering should affect the perceived similarity of two statements; and finally, by selecting only k n-grams to represent a statement, we allow similar but non-identical statements to have the same fingerprint.

To illustrate our fingerprinting algorithm, Figure 2(a) shows the 4-grams of the statement

    if (signum < 0 || sigstr(signum, signame) != 0) break;

represented as the sequence of tokens

    if ( id < intlit || id ( id , id ) != intlit ) break ;

with their frequency rankings. The rankings are taken from the frequency distribution of all 4-grams in the GNU bash shell source code.² Figure 2(b) shows the three least-frequent 4-grams for the example statement. Each of those 4-grams is shown with its frequency and its representation as a sequence of four token numbers (e.g., 31 is the token number for the id token). Figure 2(c) gives the statement's fingerprint: the concatenation of the representations of the three 4-grams.

¹ We mean the elements that have the lowest probability of occurrence over all sequences, not the elements that occur least frequently in S itself.

² For moderate to large code bases, we find that n-gram frequency distributions are consistent: n-grams that are infrequent in one code base are typically infrequent in other code bases.

    4-gram                  Freq
    if ( id <               8614
    ( id < intlit           6988
    id < intlit ||           984
    < intlit || id          1008
    intlit || id (           420
    || id ( id              3050
    id ( id ,             117967
    ( id , id              77927
    id , id )              64951
    , id ) !=                532
    id ) != intlit          1111
    ) != intlit )           1722
    != intlit ) break        102
    intlit ) break ;         734

    (a) all 4-grams and their frequencies

    4-gram                  Freq    Seq of Token Numbers
    intlit || id (           420    45:10:31:08
    , id ) !=                532    02:31:09:27
    != intlit ) break        102    27:45:09:16

    (b) the 3 least frequent 4-grams in order of occurrence

    45:10:31:08:02:31:09:27:27:45:09:16

    (c) the final fingerprint

Figure 2. Fingerprint construction for the tokenized statement if ( id < intlit || id ( id , id ) != intlit ) break ;

Choices for k and n can be used to tune the algorithm according to the desired level of similarity: the larger the values, the more likely that only identical statements will have the same fingerprint (because the fingerprint will encode the exact sequence of tokens). However, larger values for k and n also increase the cost of computing a fingerprint. Therefore, if the goal is to have only identical statements map to the same fingerprint, we suggest more traditional techniques such as Rabin fingerprints [11] or MD5 hashing [19].

To evaluate the effectiveness of statement fingerprinting, we tokenized and fingerprinted the entire bash shell code base using the following three-step process to compute a fingerprint for each statement S in the code base:

1. For each n-gram in S, look up its frequency in a pre-computed database of n-gram frequencies.
2. Select the k least frequently occurring n-grams in S.
3. Concatenate the k n-grams in occurrence order in S.

We found k = 3 and n = 4 to be effective for finding similar statements: not too many statements have the same fingerprint, and those that do are likely to be considered to be similar by a human programmer. For example, we show below two sets of non-identical statements (expressed as sequences of tokens) where for each set all statements have the same fingerprint.

1. Two for-loops that differ only in the initialization.

    for ( id = intlit ; id [ id ] && id ( id [ id ] ) ; id ++ ) ;
    for ( id = id ; id [ id ] && id ( id [ id ] ) ; id ++ ) ;

2. Three assignment statements with function calls that differ only in the number of parameters.

    id = id ( id -> id -> id , id ) ;
    id = id ( id -> id -> id , id , id ) ;
    id = id ( id -> id -> id , id , id , id ) ;

3 Block-level Similarity

Our tool uses the fingerprints computed for each statement to compute similarity scores and distances for pairs of blocks of code. This means that we identify code clones at the block level. One issue with this approach is that duplicated code that constitutes a small part of otherwise disparate blocks may not be readily identified. There are several advantages, though. First, blocks provide natural and reasonable boundaries intrinsic to the source language for comparing code. Other such boundaries tend to be either too coarse (e.g., function bodies) or too fine (e.g., statements). Second, blocks reflect code structure without the need to invoke full parsing. Finally, blocks provide a well-defined structure over which we can quantify similarity and provide tunable measures.

For C programs, a block is a sequence of statements delimited by a matching pair of curly braces. If one block contains another, we consider both as distinct objects. To avoid the fruitless task of comparing a block with its own components, we keep track of each block's starting and ending position in the source code.

The similarity score for a pair of blocks is an ordered pair containing the number of fingerprints common to both blocks divided by the size of each block, respectively; i.e., for two blocks of code S1 and S2 it is defined as follows:

\[ \mathit{sim}(S_1, S_2) = \left\langle \frac{|S_1 \cap S_2|}{|S_1|},\ \frac{|S_1 \cap S_2|}{|S_2|} \right\rangle \]

The first term in the ordered pair is the fraction of fingerprints in the first block that is common to both blocks, and the second term is the fraction of fingerprints in the second block common to both blocks.

We argue that similarity as defined here is naturally binary-valued and presents a better picture of the relationship between code blocks than a single number does. Specifically, the binary-valued structure of the similarity score can be interpreted as assessing the relative containment of one block inside another. For example, a score of ⟨1.0, .333⟩ indicates that the first block is wholly contained within the second, and that roughly a third of the second is contained in the first.

Nevertheless, binary-valued similarity scores are not appropriate when a total ordering of scores is needed. This is the case, for example, when the goal is to find all blocks that are similar to a given block, and to present them rank-ordered by similarity. Simply retaining one part of the similarity score is not a good solution when one block is much larger than the other. For example, retaining only the first component of the score ⟨1.0, .333⟩ fails to distinguish it from the score ⟨1.0, 1.0⟩. Therefore, we transform a similarity score to a single-valued similarity distance when necessary.

The similarity distance for a similarity score ⟨s1, s2⟩ is defined as follows:

\[ \mathit{sim\_dist}(\langle s_1, s_2 \rangle) = 1 - \sqrt{\frac{(s_1)^2 + (s_2)^2}{2}} \]

Geometrically, s1 and s2 are treated as the two sides of a right triangle with sim_dist related to the length of the hypotenuse, normalized to a value between 0 and 1. Other measures such as a simple arithmetic mean may be used, but in practice we prefer the proposed definition, since it results in a lower (better) similarity distance for similarity scores with at least one large component than the arithmetic mean does.

One of the consequences of our definitions of similarity score and distance is that statement order is irrelevant to clone identification. This has both positive and negative consequences: clones whose statements have been reordered are readily identified at the possible cost of introducing false matches due to blocks whose statements match but logically are not clones. One could use order-preserving techniques [20], but in practice, we have found our lack of ordering to have negligible effect in clone identification.

4 Code Clustering and Rank-Ordering

We briefly describe how our tool uses similarity scores and distances to perform code clustering and rank ordering.

4.0.1 Clone Clusters

It is often useful to find clusters of clones that are mutually similar. Formally, a set of blocks forms a cluster if and only if every pair of blocks in the set is within some user-supplied similarity threshold. Thus clustering requires an $O(n^2)$ step to compute the similarity score for each pair of blocks. Fortunately, we can significantly reduce the runtime of this step if we avoid computing similarity scores for blocks with no fingerprints in common. This is done by maintaining an inverted index keyed on fingerprint values that maps them to the blocks in which they are found. For a block B with fingerprints fB, we compute similarity scores only for those blocks that also contain some fingerprint in fB.

Once the set of similarity scores is computed, we convert the results to a graph and find all maximal cliques. Blocks are graph nodes, and an edge is placed between two nodes if the similarity distance for the corresponding two blocks is less than the threshold. We then find the maximal cliques using the algorithm in [23], which produces sets of blocks in which each block in a set is within the threshold of every other block in the set. Figure 1 shows a cluster of four similar clones found with a threshold of 0.5.

4.0.2 Rank Ordering

One common use of clone-detection tools is to find all clones of a given code segment S. With similarity scores we can generalize this operation to produce a rank-ordered sequence of similar clones ordered by decreasing similarity. Operationally, the procedure is similar to that for clone clusters, except that an all-pairs comparison is replaced with an $O(n)$ comparison of S to all other blocks. As before, we use an inverted index to significantly reduce the number of comparisons. After the similarity scores are computed, we rank-order the blocks according to decreasing similarity (i.e., increasing similarity distance).

Figure 3 gives a partial example: Figure 3(a) contains a code segment S, and Figures 3(b), (c), and (d) show successively less-similar clones with their similarity scores and similarity distance values computed with regard to segment S in Figure 3(a).

5 Related Work

Many techniques have been proposed for performing code clone detection, ranging from lightweight line- and token-based syntactic techniques [2, 7–9, 17] to heavier-weight approaches that emphasize semantics [3, 12, 16], to metric-based techniques that indirectly measure similarity [13, 18]. Due to limited space, we refer to existing summaries [4, 14] and focus on related work as it pertains explicitly to similarity.
    if (!recursive(_NL_CURRENT(LC_TIME, D_FMT))) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (decided == not &&
            strcmp(_NL_CURRENT(LC_TIME, D_FMT), HERE_D_FMT))
            decided = loc;
        want_xday = 1;
        break;
    }
    decided = raw;

    (a)

    if (!recursive(_NL_CURRENT(LC_TIME, T_FMT_AMPM))) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (decided == not &&
            strcmp(_NL_CURRENT(LC_TIME, T_FMT_AMPM), HERE_T_FMT_AMPM))
            decided = loc;
        break;
    }
    decided = raw;

    (b) sim((a), (b)) = ⟨0.83, 0.71⟩, sim_dist = 0.22

    if (!recursive(_NL_CURRENT(LC_TIME, T_FMT))) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (strcmp(_NL_CURRENT(LC_TIME, T_FMT), HERE_T_FMT))
            decided = loc;
        break;
    }
    decided = raw;

    (c) sim((a), (c)) = ⟨0.67, 0.57⟩, sim_dist = 0.38

    const char *fmt = _NL_CURRENT(LC_TIME, ERA_T_FMT);
    if (*fmt == '\0')
        fmt = _NL_CURRENT(LC_TIME, T_FMT);
    if (!recursive(fmt)) {
        if (decided == loc)
            return NULL;
        else
            rp = rp_backup;
    } else {
        if (strcmp(fmt, HERE_T_FMT))
            decided = loc;
        break;
    }
    decided = raw;

    (d) sim((a), (d)) = ⟨0.50, 0.33⟩, sim_dist = 0.58

Figure 3. (a) shows a user-supplied code segment. (b)-(d) show successively less-similar clones.

To the best of our knowledge, similarity-preserving fingerprints and similarity scores were first used for detecting and quantifying similarity in text documents [5, 22]. That work applied similarity fingerprints to individual sentences and calculated similarity scores between documents represented as sequences of fingerprints. At a high level, our work can be thought of as an application of those ideas to source code.

Early syntactic techniques such as Dup [2] defined parameterized clones, allowing for variation between code sequences as long as a consistent substitution of identifiers existed. Token-based techniques operate on lexemes and are thus resilient to differences in whitespace. However, these techniques do not quantify similarity as we do.

Semantics-based techniques use representations such as abstract syntax trees [3] and program dependence graphs (PDGs) [12, 16] to find semantically identical clones whose original source code may contain extraneous or reordered statements. These techniques cannot quantify similarity as we can.

Cordy et al. [6] and Roy and Cordy [20] propose a technique for finding "near-miss" clones, with some aspects that are similar to our own. As with us, they employ a token-based system and use lightweight mechanisms for ensuring syntactic validity of potential clones. In addition, their methods for measuring similarity between clones resemble our own similarity score, although they do not seem to utilize it to the extent that we do or to make it a user-controlled parameter. Unlike them, we do not enforce statement order when measuring similarity between blocks of code. Finally, our use of similarity fingerprints for statements is distinct.

Li et al. [17] have proposed CP-Miner for identifying copy-paste bugs in large systems. In their approach, code sequences are transformed so that common subsequences can be identified using data mining techniques. Like us they fingerprint statements, although their fingerprints do not preserve similarity. As with other techniques, they do not quantify the degree of similarity that exists between potential clones.

Finally, techniques for detecting plagiarism deal with code sequences that are by their nature similar but not typically identical. The Moss system [1, 21], for example, uses a winnowing algorithm to select fragments of source code to be fingerprinted and then calculates a similarity percentage based on the set of common fingerprints. Because its goals are different than ours, Moss operates at a coarser level of granularity than we do and cannot produce the same level of detail that we can. On the other hand, their storage requirements are smaller than ours.

In summary, even for those techniques that can identify similar clones, there is typically no mechanism for quantifying the degree of similarity: code segments are similar or they aren't. In contrast, one of our main goals is to identify non-identical clones and quantify the amount of similarity present.
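To make that quantification concrete, the similarity score and similarity distance of Section 3, together with the inverted-index rank ordering of Section 4, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool's implementation: blocks are modelled as lists of opaque statement fingerprints, common fingerprints are counted as a multiset intersection (the text does not say how duplicates within a block are treated), and the function names and the fingerprint labels `f1`, `x1`, and so on are hypothetical. The block sizes in the example are chosen only so that the rounded distances match those reported in Figure 3.

```python
import math
from collections import Counter, defaultdict


def sim_score(block1, block2):
    """Ordered pair <|S1 n S2| / |S1|, |S1 n S2| / |S2|> over fingerprints."""
    if not block1 or not block2:
        return (0.0, 0.0)
    # Assumption: duplicates are matched one-for-one (multiset intersection).
    common = sum((Counter(block1) & Counter(block2)).values())
    return (common / len(block1), common / len(block2))


def sim_dist(score):
    """Similarity distance: 1 - sqrt((s1^2 + s2^2) / 2)."""
    s1, s2 = score
    return 1.0 - math.sqrt((s1 * s1 + s2 * s2) / 2.0)


def build_index(blocks):
    """Inverted index: fingerprint -> names of the blocks that contain it."""
    index = defaultdict(set)
    for name, fingerprints in blocks.items():
        for fp in fingerprints:
            index[fp].add(name)
    return index


def rank_clones(query, blocks, index, max_dist):
    """Blocks within max_dist of `query`, ordered by increasing distance."""
    # Only blocks sharing at least one fingerprint with the query are scored.
    candidates = set()
    for fp in blocks[query]:
        candidates |= index.get(fp, set())
    candidates.discard(query)
    ranked = []
    for name in candidates:
        dist = sim_dist(sim_score(blocks[query], blocks[name]))
        if dist <= max_dist:
            ranked.append((dist, name))
    ranked.sort()
    return [(name, round(dist, 2)) for dist, name in ranked]


# Hypothetical fingerprints; sizes 6, 7, 7 and 9 reproduce Figure 3's values.
blocks = {
    "a": ["f1", "f2", "f3", "f4", "f5", "f6"],
    "b": ["f1", "f2", "f3", "f4", "f5", "x1", "x2"],
    "c": ["f1", "f2", "f3", "f4", "y1", "y2", "y3"],
    "d": ["f1", "f2", "f3", "z1", "z2", "z3", "z4", "z5", "z6"],
    "e": ["w1", "w2"],  # shares nothing with "a", so it is never scored
}
index = build_index(blocks)
print(rank_clones("a", blocks, index, max_dist=1.0))
```

Because the score ignores statement order, shuffling the fingerprints inside any block leaves the ranking unchanged, which mirrors the order-insensitivity discussed in Section 3.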
6 Conclusion

We posit that identifying and quantifying similarity is important for many applications of clone detection and that similarity should be intrinsic to the definition of clones. We have presented a framework and implementation for similarity detection that operates at two levels. At the lower level, we propose a tunable statement fingerprinting scheme that tends to preserve similarity across the fingerprinting function so that similar statements have the same fingerprint. At the higher level, we present a lightweight mechanism operating on language blocks over which we can quantify the amount of similarity. We illustrate the utility of this approach by describing two applications, code clustering and rank-ordering, that our approach enables. There is considerable future work to be done, but we are optimistic that similarity-based approaches such as ours can provide additional power to clone detection.

Acknowledgements

This work was supported by NSF grant number CCF-0701957.

References

[1] A. Aiken. Moss: A system for detecting software plagiarism. http://www.cs.stanford.edu/aiken/moss/.
[2] B. Baker. On finding duplication and near-duplication in large software systems. In Working Conf. on Reverse Engineering, 1995.
[3] I. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In ICSM, 1998.
[4] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. Software Eng., 33(9):577–591, 2007.
[5] D. M. Campbell, W. R. Chen, and R. D. Smith. Copy detection systems for digital documents. In Advances in Digital Lib., 2000.
[6] J. R. Cordy, T. R. Dean, and N. Synytskyy. Practical language-independent detection of near-miss clones. In CASCON '04: Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, pages 1–12. IBM Press, 2004.
[7] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In ICSM '99.
[8] J. Johnson. Identifying redundancy in source code using fingerprints. In Proc. of the IBM Centre for Advanced Studies Conferences, 1993.
[9] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. on Software Engineering, 28(7):654–670, July 2002.
[10] C. Kapser, P. Anderson, M. W. Godfrey, R. Koschke, M. Rieger, F. V. Rysselberghe, and P. Weißgerber. Subjectivity in clone judgment: Can we ever agree? In Koschke et al. [15].
[11] R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Jnl of Research and Development, 31(2):249–260, 1987.
[12] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In Symp. on Static Analysis, pages 40–56, July 2001.
[13] K. Kontogiannis, R. Demori, E. Merlo, M. Galler, and M. Bernstein. Pattern matching for clone and concept detection. Automated Software Engineering, 3(1–2):77–108, 1996.
[14] R. Koschke. Survey of research on software clones. In Koschke et al. [15].
[15] R. Koschke, E. Merlo, and A. Walenstein, editors. Duplication, Redundancy, and Similarity in Software, volume 06301 of Dagstuhl Seminar Proc. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
[16] J. Krinke. Identifying similar code with program dependence graphs. In Working Conf. on Reverse Engineering, pages 301–309, Oct 2001.
[17] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Trans. Software Eng., 32(3):176–192, 2006.
[18] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In ICSM, 1996.
[19] R. Rivest. The MD5 message digest algorithm. RFC 1321, April 1992. Network Working Group.
[20] C. K. Roy and J. R. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In ICPC, pages 172–181, 2008.
[21] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03, 2003.
[22] R. Smith. Copy detection systems for digital documents. Master's thesis, Department of Computer Science, Brigham Young University, 1999.
[23] S. Tsukiyama, M. Ide, H. Ariyoshi, and I. Shirakawa. A new algorithm for generating all the maximal independent sets. SIAM Journal of Computing, 6(3):505–517, September 1977.
[24] A. Walenstein, N. Jyoti, J. Li, Y. Yang, and A. Lakhotia. Problems creating task-relevant clone detection reference data. In WCRE, 2003.
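As a closing illustration, the three-step fingerprinting process of Section 2 (look up n-gram frequencies, select the k least frequent, concatenate them in occurrence order) can be sketched as follows. This is a hedged sketch rather than the authors' code: it uses token strings where Figure 2 uses token numbers, it breaks frequency ties by position and treats n-grams missing from the table as having frequency 0 (both are assumptions the paper does not address), and the names `ngrams`, `fingerprint`, `TOKENS`, and `FIG2_FREQ` are hypothetical. The frequency table holds only the fourteen counts listed in Figure 2(a).

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fingerprint(tokens, freq, n=4, k=3):
    """Concatenate the k least frequent n-grams of a tokenized statement.

    `freq` maps an n-gram (a tuple of tokens) to its corpus-wide count.
    Assumptions: unseen n-grams count as frequency 0, ties are broken by
    position, and a statement shorter than n tokens gets an empty fingerprint.
    """
    grams = ngrams(tokens, n)
    # Step 1 and 2: rank positions by looked-up frequency, keep the k rarest.
    by_rarity = sorted(range(len(grams)),
                       key=lambda i: (freq.get(grams[i], 0), i))
    # Step 3: restore occurrence order before concatenating.
    chosen = sorted(by_rarity[:k])
    return tuple(tok for i in chosen for tok in grams[i])


# Tokenized statement and 4-gram counts taken from Figure 2(a).
TOKENS = ["if", "(", "id", "<", "intlit", "||", "id", "(", "id", ",",
          "id", ")", "!=", "intlit", ")", "break", ";"]
COUNTS = [8614, 6988, 984, 1008, 420, 3050, 117967,
          77927, 64951, 532, 1111, 1722, 102, 734]
FIG2_FREQ = dict(zip(ngrams(TOKENS, 4), COUNTS))

print(" ".join(fingerprint(TOKENS, FIG2_FREQ)))
```

With these inputs the selected 4-grams are the three shown in Figure 2(b). Raising `n` or `k` makes the fingerprint encode more of the exact token sequence, which is the tuning effect described in Section 2.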