CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code

Zhenmin Li, Shan Lu, Suvda Myagmar and Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign, Urbana, IL 61801

ABSTRACT

Copy-pasted code is very common in large software because programmers prefer reusing code via copy-paste in order to reduce programming effort. Recent studies show that copy-paste is prone to introducing bugs and that a significant portion of operating system bugs concentrate in copy-pasted code. Unfortunately, it is challenging to efficiently identify copy-pasted code in large software. Existing copy-paste detection tools are either not scalable to large software, or cannot handle small modifications in copy-pasted code. Furthermore, few tools are available to detect copy-paste related bugs.

In this paper we propose a tool, CP-Miner, that uses data mining techniques to efficiently identify copy-pasted code in large software including operating systems, and detects copy-paste related bugs. Specifically, it takes less than 20 minutes for CP-Miner to identify 190,000 copy-pasted segments in Linux and 150,000 in FreeBSD. Moreover, CP-Miner has detected 28 copy-paste related bugs in the latest version of Linux and 23 in FreeBSD. In addition, we analyze some interesting characteristics of copy-paste in Linux and FreeBSD, including the distribution of copy-pasted code across different lengths, granularities, modules, degrees of modification, and various software versions.

1 Introduction

1.1 Motivation

Copying and pasting code is a common practice in software development. In order to reduce programming effort and shorten programming time, programmers prefer reusing a piece of code via copy-paste rather than rewriting similar code from scratch. Recent studies [6, 13, 25] have shown that a large portion of code is duplicated in software. For example, Kapser and Godfrey [25], using a copy-paste detection tool called CCFinder [24], found that 12% of the Linux file system code (279K lines) was involved in code cloning activity. Baker [6] found that in the complete source of the X Window system (714K lines), 19% of the code was identified as duplicates.

Using abstractions such as functions and macros to remove this duplication might improve software maintenance; however, much duplication will likely remain, for two possible reasons. First, some changes are usually necessary, and copy-paste is much easier and faster than abstraction. Second, functions may impose higher overhead. However, the psychological reasons for the large percentage of existing copy-pasted code are beyond the scope of this paper.

Copy-pasted code is prone to introducing errors. For example, Chou et al. [10] found that in a single source file under the Linux drivers/i2o directory, 34 out of 35 errors were caused by copy-paste. One of the errors was copied in 10 places and another in 24. They also showed that many operating system errors are not independent because programmers are ignorant of system restrictions in copy-pasted code. In our study, we have detected 28 copy-paste related bugs in the latest version of Linux and 23 in FreeBSD. Most of these bugs were previously unreported.

A major reason why copy-paste introduces bugs is that programmers forget to modify identifiers (variables, functions, types, etc.) consistently throughout the pasted code. This mistake will be detected by a compiler if the identifier is undefined or has the wrong type. However, these errors often slip through compile-time checks and become hidden bugs that are very hard to detect.

Figure 1 shows an example of a bug detected by CP-Miner in the latest version of Linux (2.6.6). We reported this bug to the Linux kernel community and it has been confirmed by kernel developers [1]. In this example, the loop in lines 111–118 was copied from lines 92–99. In the new copy-pasted segment (lines 111–118), the variable prom_phys_total is replaced with prom_prom_taken in most of the cases except the one in line 117 (marked with a "// bug" comment). As a result, the pointer prom_prom_taken[iter].theres_more incorrectly points to the element of prom_phys_total instead of prom_prom_taken. This bug is a semantic error, and therefore it cannot be easily detected by memory-related bug detection tools including static checkers [9, 14, 17, 32] or dynamic tools such as Purify [19], Valgrind [36], and CCured [12]. Besides this bug, CP-Miner has also detected many other similar bugs caused by copy-paste in Linux, FreeBSD, PostgreSQL and the Apache web server.

    ( linux-2.6.6/arch/sparc64/prom/memory.c )
     68 void __init prom_meminit(void)
     69 {
        ......
     92   for(iter=0; iter<num_regs; iter++) {
     93     prom_phys_total[iter].start_adr =
     94       prom_reg_memlist[iter].phys_addr;
     95     prom_phys_total[iter].num_bytes =
     96       prom_reg_memlist[iter].reg_size;
     97     prom_phys_total[iter].theres_more =
     98       &prom_phys_total[iter+1];
     99   }
        ......
    111   for(iter=0; iter<num_regs; iter++) {
    112     prom_prom_taken[iter].start_adr =
    113       prom_reg_memlist[iter].phys_addr;
    114     prom_prom_taken[iter].num_bytes =
    115       prom_reg_memlist[iter].reg_size;
    116     prom_prom_taken[iter].theres_more =
    117       &prom_phys_total[iter+1];   // bug
    118   }
        ......
    143 }

Figure 1: An example of a copy-paste related error detected by CP-Miner. This bug appears in linux-2.6.6/arch/sparc64/prom/memory.c. A similar bug is also detected in file /arch/sparc/prom/memory.c.

While one can imagine augmenting the software development tools and editors with copy-paste tracking, this support does not currently exist. Therefore, we are focusing on detecting likely copied and pasted code in an existing code base. Not all code segments identified by previous detection tools and our tool are really the results of copy-paste (even though we prune many of the false copy-pasted segments as described in Section 3.1.4), but for simplicity we refer to likely copy-pasted segments as copy-pasted segments.

It is a challenging task to efficiently extract copy-pasted code in large software such as an operating system. Even though some previous studies [16, 20] have addressed the related problem of plagiarism detection, they are not suitable for detecting copy-pasted code. Those tools, such as the commonly used JPlag [33], were designed to measure the degree of similarity between a pair of programs in order to detect cheating. If these tools were to be used to detect copy-pasted code in a single program without any modification, they would need to compare all possible pairs of code fragments. For a program with $n$ statements, a total of $O(n^3)$ pairwise comparisons (see footnote 1) would need to be performed. This complexity is certainly impractical for software with millions of lines of code such as Linux and FreeBSD. Of course, it is possible to modify these tools to identify copy-pasted code in a single piece of software, but the modification is neither trivial nor straightforward.

Footnote 1: Consider the comparison between pairs of code fragments with $k$ statements. There are $(n-k+1)$ different fragments, so there are $\binom{n-k+1}{2} = O(n^2)$ possible pair comparisons. Since $k$ can be $1, 2, \ldots, n/2$, the total number of pairwise comparisons is $O(n^3)$.

So far, only a few tools have been proposed to identify copy-pasted code in a single program. Examples of such tools include Moss [4, 35], Dup [6], CCFinder [24] and others [5, 7]. Most of these tools suffer from some or all of the following limitations:

(1) Efficiency: Most existing tools are not scalable to large software such as operating system code because they consume a large amount of memory and take a long time to analyze millions of lines of code.

(2) Tolerance to modifications: Most tools cannot deal with modifications in copy-pasted code. Some tools [13, 22] can only detect copy-pasted segments that are exactly identical. Moreover, most of the existing tools do not allow statement insertions or modifications in a copy-pasted segment. Such modifications are very common in standard practice. Our experiments with CP-Miner show that about one third of copy-pasted segments contain insertion or modification of 1–2 statements.

(3) Bug detection: The existing tools cannot detect copy-paste related bugs. They only aim at detecting copy-pasted code and do not consider bugs associated with copy-paste.

1.2 Our Contributions

In this paper we present CP-Miner, a tool that uses data mining techniques to efficiently identify copy-pasted code in large software including operating system code, and also detects copy-paste related bugs. It requires no modification or annotation to the source code of the software being analyzed. Our paper makes three main contributions:

(1) A scalable copy-paste detection tool for large software: CP-Miner can efficiently find copy-pasted code in large software including operating system code. Our experimental results show that it takes less than 20 minutes for CP-Miner to detect 150,000–190,000 different copy-pasted segments that account for about 20–22% of the source code in Linux and FreeBSD (each with more than 3 million lines of code). Additionally, it takes less than one minute to detect copy-pasted segments in the Apache web server and PostgreSQL, accounting for about 17–22% of total source code. Compared to CCFinder [24], CP-Miner is able to find 17–52% more copy-pasted segments because CP-Miner can tolerate statement insertions and modifications.

(2) Detection of bugs associated with copy-paste: CP-Miner can detect copy-paste related bugs such as the one shown in Figure 1, most of which are hard to detect with existing static or dynamic bug detection tools. More specifically, CP-Miner has detected 28 potential bugs in the latest version of Linux, 23 in FreeBSD, 5 in the Apache web server, and 2 in PostgreSQL. Most of these bugs had never been reported. We have reported these bugs to the corresponding developers. So far five bugs have been confirmed and fixed by Linux developers, and one bug has been confirmed and fixed by Apache developers.

(3) Statistical study of copy-pasted code distribution in operating system code: Few previous studies have been conducted on the characteristics of copy-paste in large software. Our work analyzed some interesting statistics of copy-pasted code in Linux and FreeBSD. Our results indicate that (1) copy-pasted segments are usually not too large, most with 5–16 statements; (2) although more than 50% of copy-pasted segments have only two copies, a few (6.3–6.7%) copy-pasted segments are copied more than 8 times; (3) there is a significant number (11.3–13.5%) of copy-pasted segments at function granularity (copy-paste of an entire function); (4) most (65–67%) copy-pasted segments require renaming at least one identifier, and 23–27% of copy-pasted segments have inserted, modified, or deleted one statement; (5) different OS modules have very different copy-paste coverage: drivers, arch, and crypt have a higher percentage of copy-paste than other modules in Linux; (6) as the operating system code evolves, the amount of copy-paste also increases, but the coverage percentage of copy-pasted code remains relatively stable over the recent versions of Linux and FreeBSD.

2 Background

2.1 Detection of Copy-pasted Code

Since copy-pasted code segments are usually similar to the original ones, detection of copy-pasted code involves detecting code segments that are identical or similar.

Previous techniques for copy-paste detection can be roughly classified into three categories: (1) string-based, in which the program is divided into strings (typically lines), and these strings are compared against each other to find sequences of duplicated strings [6]; (2) parse-tree-based, in which pattern matching is performed on the parse-tree of the code
to search for similar subtrees [7, 27]; (3) token-based, in which the program is divided into a stream of tokens and duplicate token sequences are identified [24, 33].

Our tool, CP-Miner, is token-based. This approach has advantages over the other two. First, a string-based approach does not exploit any lexical information, so it cannot deal with simple modifications such as identifier renaming. Second, using parse trees can introduce false positives because two segments with identical syntax trees are not necessarily copy-pasted. This is because copy-paste is code-based rather than syntax-based, i.e., it reuses a piece of code rather than an abstract syntax structure.

Most previous copy-paste detection tools do not sufficiently address the limitations described in Section 1. Most of them consume too much time or memory to be scalable to large software, or do not tolerate modifications made in copy-pasted code. In contrast, CP-Miner can address both challenges.

2.2 Frequent Subsequence Mining

CP-Miner is based on frequent subsequence mining (also called frequent sequence mining), an association analysis technique that discovers frequent subsequences in a sequence database [2]. Frequent subsequence mining is an active research topic in data mining [38, 39]. It has broad applications, including mining motifs in DNA sequences, analysis of customer shopping behavior, etc.

A subsequence is considered frequent when it occurs in at least a specified number of sequences (called minsupport) in the sequence database. A subsequence is not necessarily contiguous in an original sequence. We denote the number of occurrences of a subsequence as its support. A sequence that contains a given subsequence is called a supporting sequence of this subsequence.

For example, a sequence database $D$ has five sequences: $D = \{abced, abecf, agbch, abijc, aklc\}$. The number of occurrences of subsequence abc is 4, and sequence agbch is one of abc's supporting sequences. If minsupport is specified as 4, the frequent subsequences are $\{a:5, b:4, c:5, ab:4, ac:5, bc:4, abc:4\}$, where the numbers are the supports of the subsequences.

CP-Miner uses a recently proposed frequent subsequence mining algorithm called CloSpan (Closed Sequential Pattern Mining) [38], which outperforms most previous algorithms. Instead of mining the complete set of frequent subsequences, CloSpan mines only the closed subsequences. A closed subsequence is a subsequence whose support is different from that of its super-sequences. CloSpan mainly consists of two stages: (1) using a depth-first search procedure to generate a candidate set of frequent subsequences that includes all the closed frequent subsequences; and (2) pruning the non-closed subsequences from the candidate set. The computational complexity of CloSpan is $O(n^2)$ if the maximum length of frequent sequences is constrained by a constant.

Mining efficiency in CloSpan is improved by two main ideas. The first is based on the observation that if a sequence is frequent, all of its subsequences are frequent. For example, if abc is frequent, all of its subsequences $\{a, b, c, ab, ac, bc\}$ are also frequent. CloSpan recursively produces a longer frequent subsequence by concatenating every frequent item to a shorter frequent subsequence that has already been obtained in the previous iterations.

Let us consider an example. Let $L_n$ denote the set of frequent subsequences with length $n$. In order to get $L_n$, we can join the sets $L_{n-1}$ and $L_1$. For example, suppose we have already computed $L_1$ and $L_2$ as shown below. In order to compute $L_3$, we can first compute $L'_3$ by concatenating a subsequence from $L_2$ and an item from $L_1$:

    $L_1 = \{a, b, c\}$;
    $L_2 = \{ab, ac, bc\}$;
    $L'_3 = L_2 \times L_1 = \{aba, abb, abc, aca, acb, acc, bca, bcb, bcc\}$

For greater efficiency, CloSpan does not join the sequences in set $L_2$ with all the items in $L_1$. Instead, each sequence in $L_2$ is concatenated with only the frequent items in its suffix database. A suffix database of a subsequence $s$ is the database of all the maximum suffixes of the sequences that contain $s$. In our example,
for the frequent sequence ab in $L_2$, its suffix database is $D_{ab} = \{ced, cef, ch, ijc\}$, and only c is a frequent item, so ab is only concatenated with c and we get a longer sequence abc that belongs to $L'_3$.

The second idea for improving mining performance is to efficiently evaluate whether a concatenated subsequence is frequent. Rather than searching the whole database, CloSpan only checks certain suffixes. In our example, for each sequence $s$ in $L'_3$, CloSpan checks whether it is frequent or not by searching the suffix database $D_s$. If the number of its occurrences is at least minsup, $s$ is added into $L_3$, which is the set of frequent subsequences of length 3. CloSpan continues computing $L_4$ from $L_3$, $L_5$ from $L_4$, and so on until no more subsequences can be added into the set of frequent subsequences.

Due to space limitations, we refer the reader to [29, 38] for a detailed discussion of the CloSpan algorithm.

3 CP-Miner

CP-Miner has two major functionalities: detecting copy-pasted code segments, and finding copy-paste related bugs. It requires no modification to the source code of the software being analyzed. The following two subsections describe the design for each functionality.

3.1 Identifying Copy-pasted Code

To detect copy-pasted code, CP-Miner first converts the problem into a frequent subsequence mining problem. It then uses an enhanced version of the CloSpan algorithm to find basic copy-pasted segments. Finally, it prunes false positives and composes larger copy-pasted segments. For convenience of description, we refer to a group of code segments that are similar to each other as a copy-paste group.

CP-Miner can detect copy-pasted segments efficiently because it uses frequent subsequence mining techniques that can avoid many unnecessary or redundant comparisons. To map our problem to a frequent subsequence mining problem, CP-Miner first maps a statement to a number, with similar statements being mapped to the same number. Then, a basic block (i.e., a straight-line piece of code without any jumps or jump targets in the middle) becomes a sequence of numbers. As a result, a program is mapped into a database of many sequences. By mining the database using CloSpan, we can find frequent subsequences that occur at least twice in the sequence database. These frequent subsequences are exactly the copy-pasted segments in the original program. By applying some pruning techniques such as identifier mapping, we can find basic copy-pasted segments, which can then be combined with neighboring ones to compose larger copy-pasted segments.

CP-Miner is capable of handling modifications in copy-pasted segments for two reasons. First, similar statements are mapped into the same value. This is achieved by mapping all identifiers (variables, functions and types) of the same type into the same value, regardless of their actual names. This relaxation tolerates identifier renaming in copy-pasted segments. Even though false positives may be introduced during this process, they are addressed later through various pruning techniques such as identifier mapping (described in Section 3.1.4). Second, we have enhanced the basic frequent subsequence mining algorithm, CloSpan, to support gap constraints in frequent subsequences. This enhancement allows CP-Miner to tolerate 1–2 statement insertions, deletions, or modifications in copy-pasted code. Insertions and deletions are symmetric because a statement deletion in one copy can also be seen as an insertion in the other copy. Modification is a special case of insertion: the modified statement can be treated as if both segments have a statement inserted.

The main steps of the process to identify copy-pasted segments include:

(1) Parsing source code: Parse the given source code and build a sequence database (a collection of sequences). In addition, information regarding basic blocks and block nesting levels is also passed to the mining algorithm.

(2) Mining for basic copy-pasted segments: The enhanced frequent subsequence mining algorithm is applied to the sequence database to find basic copy-pasted segments.

(3) Pruning false positives: Various techniques including identifier mapping are used to prune false positives.

(4) Composing larger copy-pasted segments: Larger copy-pasted segments are identified by combining consecutive smaller ones. The combined copy-pasted segments are fed back to step (3) to prune false positives. This is necessary because the combined one may not be copy-pasted, even though each smaller one is.

Like other copy-paste detection tools, CP-Miner can only detect copy-pasted segments, but cannot tell which segment is the original and which is copy-pasted from the original. Fortunately, this limitation is not a big problem because in most cases it is enough for
programmers to know what segments are similar to each other. Moreover, our bug detection method described in Section 3.2 does not rely on such differentiation. Additionally, if programmers really need the differentiation, navigating through RCS versions could help in figuring out which segment is the original copy.

3.1.1 Parsing Source Code

The main purpose of parsing source code is to build a sequence database (a collection of sequences) in order to convert the copy-paste detection problem to a frequent subsequence mining problem. Comments are not considered normal statements in CP-Miner, and are therefore filtered out by our parser. The current prototype of the CP-Miner parser only works for programs written in C or C++, but it is easy to modify it for other programming languages.

A statement is mapped to a number by first tokenizing its components such as variables, operators, constants, functions, keywords, etc. To tolerate identifier renaming in copy-pasted segments, identifiers of the same type are mapped into the same token. Constants are handled in the same way as identifiers: constants of the same type are mapped into the same token. However, operators and keywords are handled differently, with each one mapped to a unique token. After all the components of a statement are tokenized, a hash value digest is computed using the "hashpjw" [3] hash function, chosen for its low collision rate. Figure 2 shows the hash value for each statement in the example shown in Figure 1 of Section 1. As shown in this figure, the statement in lines 93–94 and the statement in lines 112–113 have the same hash values.

    STATEMENT                                              HASH
     68 void __init prom_meminit(void)                     35487793
     69 {
        ......                                             ......
     92   for(iter=0; iter<num_regs; iter++) {             67641265
     93     prom_phys_total[iter].start_adr =              133872016
     94       prom_reg_memlist[iter].phys_addr;
     95     prom_phys_total[iter].num_bytes =              133872016
     96       prom_reg_memlist[iter].reg_size;
     97     prom_phys_total[iter].theres_more =            82589171
     98       &prom_phys_total[iter+1];
     99   }
        ......                                             ......
    111   for(iter=0; iter<num_regs; iter++) {             67641265
    112     prom_prom_taken[iter].start_adr =              133872016
    113       prom_reg_memlist[iter].phys_addr;
    114     prom_prom_taken[iter].num_bytes =              133872016
    115       prom_reg_memlist[iter].reg_size;
    116     prom_prom_taken[iter].theres_more =            82589171
    117       &prom_phys_total[iter+1];
    118   }
        ......                                             ......
    143 }

Figure 2: An example of hashing statements

After each statement is mapped, the program becomes a long number sequence. Unfortunately, frequent subsequence mining algorithms need a collection of sequences (a sequence database) as described in Section 2.2, so we need a way to cut this long sequence into many short ones. One simple method is to use a fixed cutting window size (e.g., every 20 statements) to break the long sequence into many short ones. This method has two disadvantages. First, some frequent subsequences across two or more windows may be lost. Second, it is not easy to decide the window size: if it is too long, the mining algorithm would be very slow; if too short, too much information may be lost on the boundary of two consecutive windows.

Instead, CP-Miner uses a more elegant method to perform the cutting. It takes advantage of some simple syntax information and uses a basic programming block as the unit to break the long sequence into short ones. The idea behind this cutting method is that a copy-pasted segment is usually either a part of a basic block or consists of multiple basic blocks. In addition, basic blocks are usually not long enough to cause performance problems in CloSpan. By using a basic block as the cutting unit, CP-Miner can first find basic copy-pasted segments and then compose larger ones from smaller ones. Since different basic blocks have different numbers of statements, their corresponding sequences also have different lengths. But this is not a problem for CloSpan because it can deal with sequences of different sizes. The example shown in Figures 1 and 2 is converted into the following collection of sequences:

    (35487793)
    ......
    (67641265)
    (133872016, 133872016, 82589171)
    ......
    (67641265)
    (133872016, 133872016, 82589171)
    ......

Besides a collection of sequences, the parser also passes to the mining algorithm the source code information of each sequence. Such information includes (1) the nesting level of each basic block, which is later used to guide the composition of larger copy-pasted segments from smaller ones; (2) the file name and line number, which are used to locate the copy-pasted code corresponding to a frequent subsequence identified by the mining algorithm.

3.1.2 Mining for Basic Copy-pasted Segments

After CP-Miner parses the source code of a given program, it generates a sequence database with each sequence representing a basic block. At the next step, it applies the frequent subsequence mining algorithm, CloSpan, on this database to find frequent subsequences with a support value of at least 2, which correspond to code segments that have appeared in the program at least twice. In the example shown in Figure 2, CP-Miner would find (133872016, 133872016, 82589171) as a frequent subsequence because it occurs twice in the sequence database. Therefore, the corresponding code segments in lines 111–118 and lines 92–99 are basic copy-pasted segments.

Unfortunately, the mining process is not as straightforward as expected. The main reason is that the original CloSpan algorithm was not designed exactly for our purpose, and neither were other frequent subsequence mining algorithms. Most existing algorithms, including CloSpan, have the following two limitations, which we had to address by enhancing CloSpan to make it applicable for copy-paste detection:

(1) Adding gap constraints in frequent subsequences: In most existing frequent subsequence mining algorithms, frequent subsequences are not necessarily contiguous in their supporting sequences. For example, sequence abdec provides 1 support for subsequence abc, even though abc does not appear contiguously in abdec. It is possible to have a large gap in the occurrence of a frequent subsequence in one of its supporting sequences. Hence, its corresponding code segment would have several statements inserted. Such a segment is unlikely to be copy-pasted.

To address this problem, we modified CloSpan to add a gap constraint in frequent subsequences. CP-Miner only mines for frequent subsequences with a maximum gap not larger than a given threshold called maxgap. If the maximum gap of a subsequence in a sequence is larger than maxgap, this sequence is not "supporting" this subsequence. For example, for the sequence database $D = \{abced, abecf, agbch, abijc, aklc\}$, the support of subsequence abc is 1 if maxgap equals 0, and the support is 3 if maxgap equals 1. The gap constraint with maxgap = 0 means that no statement insertions or deletions are allowed in copy-paste, whereas the gap constraint with maxgap = 1 or maxgap = 2 means that 1 or 2 statement insertions/deletions are tolerated in copy-paste.

(2) Matching frequent subsequences to copy-pasted segments: The original CloSpan algorithm outputs only frequent subsequences and their corresponding support values, but not their corresponding supporting sequences. To find copy-pasted code, we need to find the supporting sequences for each frequent subsequence.

We enhanced CloSpan to address this problem. When CP-Miner generates a frequent subsequence, it maintains a list of IDs of its supporting sequences. In the above example, CP-Miner outputs two frequent subsequences: (67641265) and (133872016, 133872016, 82589171), each with their supporting sequence IDs, based on which the locations of the corresponding basic copy-pasted segments (file name and line numbers) can be identified.

3.1.3 Composing Larger Copy-pasted Segments

Since every sequence fed to the mining algorithm represents a basic block, a basic copy-pasted segment may only be a part of a larger copy-pasted segment. Therefore, it is necessary to combine a basic copy-pasted segment with its neighbors to construct a larger one, if possible.

The composition procedure is very straightforward. CP-Miner maintains a candidate set of copy-paste groups, which initially includes all of the basic copy-pasted segments that survive the pruning procedure described in Section 3.1.4. For each copy-paste group, CP-Miner checks their neighboring code segments to see if they also form a copy-paste group. If so, the two groups are combined together to form a larger one. This larger copy-paste group is checked against the pruning procedure. If it survives the pruning process, it is added to the candidate set and the two smaller ones are removed. Otherwise, the two smaller ones remain in the set and are marked as "non-expandable". CP-Miner repeats this process until all groups in the candidate set are non-expandable.

3.1.4 Pruning False Positives

It is possible that copy-pasted segments discovered by the mining algorithm or the composition process contain false positives. The main cause of false positives is the tokenization of identifiers (variable/function/type) in order to tolerate identifier renaming in copy-paste. Since identifiers of the same type are mapped into the same token, it is possible to identify false copy-pasted segments. For example, all statements similar to x = y + z would have the same hash value, which can introduce many false positives.

To prune false positives, CP-Miner applies several techniques to both basic and composed copy-pasted segments. The pruning techniques include:

(1) Pruning unmappable segments: This technique is used to prune false positives introduced by the tokenization of identifiers. It is based on the observation that if a programmer copy-pastes a code segment and then renames an identifier, he/she would most likely rename this identifier in all its occurrences in the new copy-pasted segment. Therefore, we can build an identifier mapping that maps old names in one segment to their corresponding new ones in the other segment that belongs to the same copy-paste group. In the example shown in Figure 2, variable prom_phys_total is changed into prom_prom_taken (except for the bug on line 117).

A mapping scheme is consistent if there are very few conflicts that map one identifier name to two or more different new names. If no consistent identifier mapping can be established between a pair of copy-pasted segments, they are likely to be false positives.

To measure the amount of conflict, CP-Miner uses a metric called ConflictRatio, which records the conflict ratio for an identifier mapping between two candidate copy-pasted segments. For example, if a variable A from segment 1 is changed into a in 75% of its occurrences in segment 2 but is changed into other variables in 25% of its occurrences, the ConflictRatio of mapping A → a is 25%. The ConflictRatio for the whole mapping scheme between these two segments is the weighted sum of the ConflictRatio of the mapping for each unique identifier. The weight for an identifier A in a given code segment is the fraction of total identifier occurrences that are occurrences of A. If the ConflictRatio for two candidate copy-pasted segments is higher than a predefined threshold, these two code segments are filtered out as false positives. In our experiments, we set the threshold to be 60%.

(2) Pruning tiny segments: Our mining algorithm may find tiny copy-pasted segments that consist of only 1–2 simple statements. If such a tiny segment cannot be combined with neighbors to compose a larger segment, it is removed from the copy-paste list. This is based on the observation that copy-pasted segments are usually not very small because programmers cannot save much effort in copy-pasting a simple tiny code segment.

CP-Miner uses the number of tokens to measure the size of a segment. This metric is more appropriate than the number of statements, because the length of statements is highly variable. If a single statement is very complicated with many tokens, it is still possible for programmers to copy-paste it.

To prune tiny segments, CP-Miner uses a tunable parameter called minsize. If the number of tokens in a pair of copy-pasted segments is fewer than minsize, this pair is removed.

(3) Pruning overlapped segments: If a pair of candidate copy-pasted segments overlap with each other, they are also considered false positives. CP-Miner stops extending the pair of copy-pasted segments once they overlap. For some program structures, such as switch statements, that contain many pairs of self-similar segments, pruning overlapped segments can avoid most of the false positives.

(4) Pruning segments with large gaps: Besides the mining procedure for basic copy-pasted segments, the gap constraint is also applied to composed ones. When two neighboring segments are combined, the maximum gap of the newly composed large segment may become larger than a predefined threshold, maxtotalgap. If this is true, the composition is invalid. So the newly composed one is not added into the candidate set and the two smaller ones are marked as non-expandable in the set.

Of course, even after such rigorous pruning, false positives may still exist. However, we have manually examined 100 random copy-pasted segments reported by CP-Miner for Linux, and only a few false positives (8) were found. We can only manually examine each identified copy-pasted segment because there are no traces that record programmers' copy-paste operations during the development of the software.

3.1.5 Computational Complexity of CP-Miner

CP-Miner can extract copy-pasted
code directly from a single piece of software with a total complexity of $O(n^2)$ in the worst case (where $n$ is the number of lines of code), and the optimizations further improve its efficiency in practice. For example, CP-Miner can identify more than 150,000 copy-pasted segments from 3–4 million lines of code in less than 20 minutes, as shown in our results in Section 5.3.

In CP-Miner, we break all of the large basic blocks into small blocks with at most 30 statements before feeding them to the mining algorithm. Therefore, the search tree has a depth of at most 30. With this constraint on the search tree, the mining complexity of CP-Miner is $O(n^2)$ in the worst case. Furthermore, the optimizations described in Section 2.2 make it more efficient in both time and space overheads than the worst case.

3.2 Detecting Copy-paste Related Bugs

As we have mentioned in Section 1, the main cause of copy-paste related bugs is that programmers forget to modify identifiers consistently after copy-pasting. Once we get the mapping relationship between identifiers in a pair of copy-pasted segments (see Section 3.1.4), we can find the inconsistency and report these copy-paste related bugs. Table 1 shows the identifier mapping for the example described in Section 1.

An identifier that appears more than once in a copy-pasted segment is consistent when it always maps to the same identifier in the other segment. Similarly, it is inconsistent when it maps to multiple identifiers. In Table 1, we can see that prom_phys_total is mapped inconsistently, because it maps to prom_prom_taken three times and to prom_phys_total once. All the other variable mappings are consistent.

    Identifiers in segment I     Identifiers in segment II
    (lines 92–99)                (lines 111–118)
    -------------------------    ----------------------------------------
    iter (9)                     iter (9)
    num_regs (1)                 num_regs (1)
    prom_phys_total (4)          prom_prom_taken (3); prom_phys_total (1)
    prom_reg_memlist (2)         prom_reg_memlist (2)

Table 1: Identifier mapping in the example in Figure 1 (the number after each identifier indicates the number of occurrences).

Unfortunately, inconsistency does not necessarily indicate a bug. If the amount of inconsistency is high, it may indicate that the code segments are not copy-pasted. Section 3.1.4 describes how we prune unmappable copy-pasted segments based on this observation.

Therefore, the challenge is to decide when an inconsistency is likely to be a bug instead of a false positive of copy-paste. To address this challenge, we need to consider the programmers' intention. Our bug detection method is based on the following observation: if a programmer makes a change in a copy-pasted segment, the changed identifier is unlikely to be a bug. But if he/she changes an identifier in most places but forgets to change it in a few places, the unchanged identifier is likely to be a bug. In other words, "forget-to-change" is more likely to be a bug than an intentional "change". For example, if in some cases an identifier A is mapped into a and in other cases it is mapped into a' (both a and a' are different from A), it is unlikely to be a bug because programmers intentionally change A to other names. On the other hand, if A is changed into a in most cases but remains unchanged only in a few cases, the unchanged places are likely to be bugs.

Based on the above observation, CP-Miner re-examines each non-expandable copy-paste group after running through the pruning and composing procedures. For each pair of copy-pasted segments, it uses a metric called UnchangedRatio to detect bugs in an identifier mapping. We define

    $$\mathit{UnchangedRatio} = \frac{\mathit{NumUnchanged}}{\mathit{NumTotal}}$$

where NumUnchanged means the number of occurrences in which a given identifier is unchanged, and NumTotal means the number of total occurrences of this identifier in a given copy-pasted segment. Therefore, the lower the UnchangedRatio, the more likely it is a bug, unless UnchangedRatio = 0, which means that all of its occurrences have been changed. Note that UnchangedRatio is different from ConflictRatio. The former only measures the ratio of unchanged occurrences, whereas the latter measures the ratio of conflicts. In the example shown in Table 1, UnchangedRatio for prom_phys_total is 0.25, whereas all other identifiers have UnchangedRatio = 1.

CP-Miner uses a threshold for UnchangedRatio to detect bugs. If UnchangedRatio for an identifier is not zero
and not larger than the threshold, the unchanged places are reported as bugs. When CP-Miner reports a bug, the corresponding identifier mapping information is also provided to programmers to help in debugging. In the example shown on Table 1, identifier prom_phys_total on line 117 is reported as a bug.

It is possible to further extend CP-Miner's bug detection engine. For example, it might be useful to exploit variable correlations. Assume variable A always appears in close range to another variable B, and a always appears very close to b. Then if, in a pair of copy-pasted segments, A is renamed to a, B should be renamed to b with high confidence. Any violation of this rule may indicate a bug. But the current version of CP-Miner has not exploited this possibility. It remains as our future work.

4 Methodology

We have evaluated the effectiveness of CP-Miner with large software including Linux, FreeBSD, Apache web server and PostgreSQL. The number of files (only C files) and the number of lines of code (LOC) for the software are shown in Table 2.

Software      Version    #Files    #LOC
Linux         2.6.6      6,497     4,365,124
FreeBSD       5.2.1      7,114     3,299,622
Apache        2.0.49     479       223,886
PostgreSQL    7.4.2      553       458,058

Table 2: Software evaluated in our experiments.

We set the thresholds used in CP-Miner as follows. The minimum copy-pasted segment size min_size is 30 tokens. We also vary the gap constraints: (1) when max_gap = 0, CP-Miner only identifies copy-pasted code with identifier-renaming; (2) when max_gap = 1 and max_total_gap = 2, CP-Miner allows copy-pasted segments with insertion and deletion of one statement between any two consecutive statements, and a total of two statement insertions and deletions in the whole segment. Unless otherwise specified, we use setting (2) by default.

We define CP_Coverage to measure the percentage of copy-paste in given software (or a given module):

$$\mathit{CP\_Coverage} = \frac{\#\text{LOC in copy-pasted segments}}{\#\text{LOC in the software or the module}} \times 100\%$$

In our experiments, we also compare CP-Miner with a recently proposed tool called CCFinder [24]. Similar to our tool, CCFinder also tokenizes identifiers, keywords, constants, operators, etc. But different from our tool, it uses a suffix tree algorithm instead of a data mining algorithm. Therefore, it cannot tolerate statement insertions and deletions in copy-pasted code. Our results show that CP-Miner detects 17–52% more copy-pasted code than CCFinder. In addition, CCFinder does not filter incomplete, tiny copy-pasted segments, which are very likely to be false positives. CCFinder does not detect copy-paste related bugs, so we cannot compare this functionality between them.

In our experiments, we run CP-Miner and CCFinder on an Intel Xeon 2.4GHz machine with 2GB memory.

5 Evaluation Results of CP-Miner

We first present the evaluation results of CP-Miner in this section, including the number of copy-pasted segments, the number of detected copy-paste related bugs, CP-Miner overhead, comparison with CCFinder, and effects of threshold setting. The statistical results of copy-paste characteristics in Linux and FreeBSD will be presented in Section 6.

5.1 Overall Results

Detecting Copy-pasted Code. CP-Miner has found a significant number of copy-pasted segments in the evaluated software. In this software, copy-pasted code makes up 17.7–22.3% of the code base. Table 3 shows the numbers of copy-pasted segments and CP_Coverage. As shown in this table, in Linux and FreeBSD, there are more than 120,000 and 100,000 copy-pasted segments, respectively, without any statement insertion (max_gap = 0), which accounts for about 15% of the source code. We have manually examined 100 random pairs of copy-pasted segments from all potential copy-pasted segments in Linux (with max_gap = 1), and found only 8 false positives. The large number of copy-pasted segments motivates support in software development environments such as Microsoft Visual Studio to maintain copy-pasted code.

              max_gap = 0                 max_gap = 1
Software      #Segments    CP_Coverage    #Segments    CP_Coverage
Linux         122,282      15.3%          198,605      22.3%
FreeBSD       101,699      14.9%          153,230      20.4%
Apache        4,155        13.1%          6,196        17.7%
PostgreSQL    12,105       16.5%          16,662       22.2%

Table 3: The number of copy-pasted segments and CP_Coverage.

Our results also show that a large percentage (30–50%) of copy-pasted segments have statement insertions and modifications. For example, when max_gap is 1, CP-Miner finds 62.4% more copy-pasted segments in Linux. In FreeBSD, the CP_Coverage increases from 14.9% to 20.4% when max_gap is relaxed from 0 to 1. These results show that previous tools, including CCFinder, that cannot tolerate statement insertions and modifications would miss a lot of copy-paste.

By increasing max_gap from 1 to 2 or higher, we can further relax the gap constraint. Due to space
limitation, we do not show those results here. Also, the number of false positives will increase with max_gap. Our manual examination results with the Linux file system module indicate that false positives are low with max_gap = 1, and relatively low with max_gap = 2.

Detecting Copy-paste Related Bugs. CP-Miner has also reported many copy-paste related errors in the evaluated software. Since the errors reported by CP-Miner may not be bugs, we verify each reported error manually and then report to the corresponding developer community those errors that we suspect to be bugs with high confidence. The numbers of errors found by CP-Miner and verified bugs are shown on Table 4. The results are achieved by setting the UnchangedRatio threshold to be 0.4.

              Errors      Bugs        Careless       False alarms
Software      reported    verified    programming    (1)    (2)    (3)
Linux         421         28          21             151    41     57
FreeBSD       443         23          8              3074130
Apache        17          5           0              3      1      6
PostgreSQL    74          2           0              13     10     43

Table 4: Errors reported by CP-Miner (UnchangedRatio threshold = 0.4) and bugs verified by us with high confidence, some of which are confirmed and fixed by corresponding developers after we reported. The false alarms include three categories: (1) incorrectly matched segments, (2) exchangeable orders, and (3) others. The first two categories can be pruned, which remains as our immediate future work.

Both Linux and FreeBSD have many copy-paste related bugs. So far, we have verified 28 and 23 bugs in the latest versions of Linux and FreeBSD. Most of these bugs had never been reported before. We have reported these bugs to the kernel developer communities. Recently, five Linux bugs have been confirmed and fixed by kernel developers, and the others are still in the process of being confirmed.

Since Apache and PostgreSQL are much smaller compared to Linux and FreeBSD, CP-Miner found much fewer copy-paste related bugs. We have verified 5 bugs for Apache and 2 bugs for PostgreSQL with high confidence. One bug in Apache was immediately fixed by the Apache developers after we reported it to them.

In addition to those bugs verified, we also find many "potential bugs" (21 in Linux and 8 in FreeBSD) that are not bugs by coincidence but might become bugs in the future. We call this type of errors "careless programming". Similar to the bugs verified, these errors also forget to change some identifiers consistently at a few places. Fortunately, by coincidence, the new identifiers and the old ones happen to have the same values. However, if such implicit assumptions are violated in future versions of the software, it would lead to bugs that are hard to detect.

5.2 False Alarms

Table 4 also shows the number of false alarms reported by CP-Miner. These false alarms are mostly caused by the following two major reasons and can be further pruned in our immediate future work:

(1) Incorrectly matched copy-pasted segments: In some copy-pasted segments that contain multiple "case" or "if" blocks, there are many possible combinations for these contiguous copy-pasted blocks to compose larger ones. Since CP-Miner simply follows the program order to compose larger copy-pastes, it is likely that a wrong composition might be chosen. As a result, identifiers are compared between two incorrectly matched copy-pasted segments, which results in false alarms.

These false alarms can be pruned if we use more semantic information of the identifiers in these segments. The segments with a number of "case/if" blocks usually contain a lot of constant identifiers, but our current CP-Miner treats them as normal variable names. If we use the information of these constants to match "case/if" blocks when composing larger copy-pasted segments, it can reduce the number of incorrectly matched segments and most of such false alarms can be pruned.

(2) Exchangeable orders: In a copy-paste pair, the orders of some statements or expressions can be switched. For example, a segment with several similar statements such as "a1=b1; a2=b2;" is the same as "a2=b2; a1=b1;". The current version of CP-Miner simply compares the identifiers in a pair of copy-pasted segments in strict order and therefore a false alarm might be reported. In Linux, 41 false alarms are caused by such exchangeable orders.

These false alarms can be pruned if we relax the strict order comparison by further checking whether the corresponding "changed" identifiers are in the neighboring statements/expressions.

5.3 Time and Space Overheads

CP-Miner can identify copy-pasted code in large software very efficiently. The execution time of CP-Miner is shown in Table 5. CP-Miner takes 11–20 minutes to identify 101,699–198,605 copy-pasted segments in Linux and FreeBSD, each with 3–4 million lines of code. It takes less than 1 minute to detect copy-pasted segments in Apache and PostgreSQL with more than 200,000 lines of
code.

CP-Miner is also space-efficient. For example, it takes less than 530MB to find copy-pasted code in Linux and FreeBSD. For Apache and PostgreSQL, CP-Miner consumes 27–57MB of memory.

              max_gap = 0               max_gap = 1
Software      Time (s)    Space (MB)    Time (s)    Space (MB)
Linux         770         438           1164        527
FreeBSD       615         334           1155        459
Apache        14          27            15          30
PostgreSQL    32          44            38          57

Table 5: Execution time and memory space of CP-Miner.

5.4 Comparison with CCFinder

We have compared CP-Miner with CCFinder [24]. CCFinder has execution time similar to that of CP-Miner, but CP-Miner discovers many more copy-pasted segments. In addition, CCFinder does not detect copy-paste related bugs.

As we explained in Section 4, CCFinder allows identifier-renaming but not statement insertions. In addition, pruning in CCFinder is not so rigorous as in CP-Miner. For example, CCFinder reports incomplete statements in copy-pasted segments, which is unlikely in practice. After pruning the incomplete statements, many small copy-pasted segments consist of less than 30 tokens, which are too simple to be worth copying.

Software      CCFinder         CP-Miner
Linux         14.7% (19.8%)    22.3%
FreeBSD       14.5% (19.6%)    20.4%
Apache        11.8% (15.3%)    17.7%
PostgreSQL    18.5% (23.8%)    22.2%

Table 6: CP_Coverage comparison between CP-Miner and CCFinder. For CCFinder, the first number is the result after pruning those incomplete, small segments, and the second number in parentheses is the result before pruning.

CP-Miner can identify 17–52% more copy-pasted code than CCFinder because CP-Miner can tolerate statement insertions and modifications. Table 6 compares the CP_Coverage identified by CP-Miner and CCFinder. The results with CP-Miner are achieved using the default threshold setting (min_size = 30 and max_gap = 1). For fair comparison, we also filter those incomplete, small segments from CCFinder's output.

5.5 Effects of Threshold Settings

Segment Size Threshold. Figure 3 shows the effect of segment size threshold min_size on CP_Coverage. As expected, CP_Coverage decreases when min_size increases because more copy-pasted segments are pruned. The results also show that the decrement slows down when min_size is in the range of 30–100 tokens, which indicates that not too many copy-pasted segments' sizes fall in this range. This implies that segments with fewer than 30 tokens are very likely to be false positives, whereas those with more than 40 tokens are very likely to be copy-paste.

[Figure 3: Effects of min_size on CP_Coverage. x-axis: min_size; y-axis: CP_Coverage (%); series: Linux, FreeBSD, Apache, PostgreSQL.]

UnchangedRatio Threshold. Figure 4 shows the effect of the UnchangedRatio threshold on the number of bugs reported. Since UnchangedRatio > 0.5 means that most of the identifiers are not changed after copy-pasting, these unchanged identifiers are unlikely to be "forget-to-change" and so cannot indicate a copy-paste related error. Therefore, we only show the errors with UnchangedRatio threshold less than 0.5.

[Figure 4: Effects of UnchangedRatio threshold on errors reported. x-axis: UnchangedRatio threshold; y-axis: number of errors; series: Linux, FreeBSD, Apache, PostgreSQL.]

As expected, more errors are reported by CP-Miner when the UnchangedRatio threshold increases. Specifically, the number of errors reported increases gradually when the threshold is less than 0.25, and then increases sharply when the threshold ∈ (0.25, 0.35).

We found that most of the errors with high UnchangedRatio turn out to be false alarms during our verification. For example, CP-Miner reports many errors where only 1 out of 3 identifiers is unchanged (UnchangedRatio = 0.33). However, it cannot strongly support that it is a copy-paste related bug. In order to prune such false alarms, we can further analyze the identifiers in the context of the copy-pasted segments (e.g., the whole function). We leave this improvement as our future work.

6 Statistics of Copy-paste in OS code

This section presents the statistical results on copy-paste characteristics in large software. Our results include the distribution of copy-pasted segments across different group sizes, segment sizes, granularity, amount of changes, modules, and versions.

6.1 Copy-paste Size and Granularity

Figure 5 illustrates the distribution of copy-pasted segments with different sizes (in terms of the number of statements). The results show that most (60–64%) copy-pasted segments are not very large, with only 5–16 statements. Only a few (0.2–5.0%) copy-pasted segments have more than 64 statements. In particular, Figure 5(a) shows that most (35–40%) copy-paste groups contain 5–8 statements in each segment. Figure 5(b) shows similar characteristics: copy-pasted segments with 5–8 statements cover about 7–10% of the source code.

[Figure 5: Size distribution of copy-pasted segments, for Linux and FreeBSD. (a) # of copy-paste groups with various segment sizes (# of statements). (b) The CP_Coverage with various segment sizes (# of statements). Due to the overlap of copy-pasted segments that have different segment sizes and also belong to different copy-paste groups, the sum of all CP_Coverage does not equal the overall CP_Coverage.]

Figure 6 shows the distribution of copy-paste group size. About 60% of copy-paste groups contain only two segments, which indicates that there are only two copies (original and replicated) for most copy-pasted code. But still, a lot of code is replicated more than once.
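The caption of Figure 5 notes that the per-size CP_Coverage values do not add up to the overall CP_Coverage, because segments of different sizes can overlap. The following is a minimal sketch of that effect, not CP-Miner's implementation. It assumes, hypothetically, that each copy-pasted segment is given as an inclusive (start_line, end_line) range inside a single file; the function names and the sample ranges are illustrative only.

```python
def merged_line_count(segments):
    """Count distinct lines covered by inclusive (start, end) line ranges.

    Assumption: line numbers are 1-based and all ranges refer to one file.
    """
    covered = 0
    last_end = 0  # highest line number counted so far
    for start, end in sorted(segments):
        if end <= last_end:
            continue  # range lies entirely inside lines already counted
        covered += end - max(start, last_end + 1) + 1
        last_end = end
    return covered


def cp_coverage(segments, total_loc):
    """CP_Coverage in percent: LOC in copy-pasted segments / total LOC * 100."""
    return 100.0 * merged_line_count(segments) / total_loc


# Hypothetical data: two size classes of segments in a 100-line file.
small_segments = [(1, 8), (20, 27)]   # two 8-line segments
large_segments = [(5, 30)]            # one 26-line segment overlapping both
TOTAL_LOC = 100

per_class = [cp_coverage(small_segments, TOTAL_LOC),
             cp_coverage(large_segments, TOTAL_LOC)]
overall = cp_coverage(small_segments + large_segments, TOTAL_LOC)

print(per_class, sum(per_class), overall)
```

With these sample ranges, the two classes cover 16% and 26% of the file, while the overall coverage is 30% rather than 42%, since the overlapping lines are counted only once.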
In total, 6.3–6.7% of copy-pasted segments are copy-pasted more than 8 times. If a bug is detected in one of the copies, it is difficult for programmers to remember to fix the bug in the other 8 or more copies. This motivates a tool that can automatically fix other copy-pasted segments once a programmer fixes one segment.

[Figure 6: Copy-paste group size distribution in terms of the number of segments in each group, for Linux and FreeBSD. Each bar represents the percentage of copy-paste groups that contain the corresponding number of segments.]

Software    Basic block       Function
Linux       17,818 (9.0%)     26,744 (13.5%)
FreeBSD     13,999 (9.1%)     17,254 (11.3%)

Table 7: Distribution of copy-paste granularity: numbers and percentages of copy-pasted segments at different granularity. Note here the percentage is not CP_Coverage. It is calculated by comparing to the total number of copy-pasted segments.

Table 7 shows the number of copy-pasted segments at basic-block and function granularity. Our results show that 9% of copy-pasted segments are basic blocks, which indicates that programmers seldom copy-paste basic blocks because most of them are too simple to be worth it.

More interestingly, 13.5% of copy-pasted segments are whole functions in Linux and 11.3% in FreeBSD. The reason is that many functions provide similar functionalities, such as reading data from different types of devices. Those functions can be copy-pasted with modifications such as replacing data types of parameters. This motivates some refactoring tools [23] to better maintain these copy-pasted functions.

6.2 Modifications in Copy-pasted Segments

Figure 7 shows how many identifiers are changed in copy-pasted segments. Since in some cases there are more than two segments in each copy-paste group, we only present the distribution in the best case: comparing the most similar pair of segments from each copy-paste group. Each bar includes two parts: one with no statement insertion and the other with one statement insertion.

The results indicate that 65–67% of copy-pasted segments require identifier renaming. For example, in Linux, 27% of copy-pasted segments are identical, and 8% of segments are almost identical with only one statement inserted. The remaining 65% of the copy-pasted segments in
Linux rename at least one identifier. Such results motivate a tool to support consistently renaming identifiers in copy-pasted code.

The results in Figure 7 also show that about 23–27% of copy-pasted segments contain at least one statement insertion, deletion, or modification (Gap=1). It indicates that it is important for copy-paste detection tools to tolerate such statement modifications.

[Figure 7: Distribution of identifiers changed in copy-pasted segments, for Linux and FreeBSD. Each bar represents the percentage of segments that have the corresponding number of renamed identifiers. Each bar has two parts: "Gap=0" and "Gap=1" represent the copy-pasted segments with no and one statement modifications, respectively.]

[Figure 8: Copy-pasted code in different modules, for Linux and FreeBSD. (a) The number of copy-pasted lines in different modules. (b) CP_Coverage in different modules.]

6.3 Copy-pasted Code across Modules

Different modules have different copy-paste characteristics. In this subsection, we analyze copy-pasted code across different modules in operating system code. We split Linux into 9 categories: arch (platform specific), fs (file system), kernel (main kernel), mm (memory management), net (networking), sound (sound device drivers), drivers (device drivers other than networking and sound device), crypto (cryptography), and others (all other code). For FreeBSD, modules are also split into 9 categories: sys (kernel sources), lib (system libraries), crypto (cryptography), usr.sbin (system administration commands), usr.bin (user commands), sbin (system commands), bin (system/user commands), gnu, and others.

Figure 8 shows the number and CP_Coverage of copy-pasted segments in different modules. The CP_Coverage is computed based on the size of each corresponding module, instead of the entire software.

Figure 8(a) shows that most copy-pasted code in Linux and FreeBSD is located in one or two main modules. For example, modules "drivers" and "arch" account for 71% of all copy-pasted code in Linux, and module "sys" accounts for 60% in FreeBSD. This is because many drivers are similar, and it is much easier to modify a copy-paste of another driver than writing one from scratch.

Figure 8(b) shows that a large percentage (20–28%) of
the code in Linux is copy-pasted in the "arch" module, the "crypto" module, and the device driver modules including "net", "sound", and "drivers". The "arch" module has a lot of copy-pasted code because it has many similar sub-modules for different platforms. The device driver modules contain a significant portion of copy-pasted code because many devices share similar functionalities. Additionally, "crypto" is a very small module (less than 10,000 LOC), but the main cryptography algorithms consist of a number of similar computing steps, so it contains a lot of copy-pasted code. Our results indicate that more attention should be paid to these modules because they are more likely to contain copy-paste related bugs.

In contrast, the modules "mm" and "kernel" contain much less copy-pasted code than others, which indicates that it is rare to reuse code in kernels and memory management modules.

6.4 Evolution of Copy-paste

Figure 9 shows that the copy-pasted code increases as the operating system code evolves. For example, Figure 9(a) shows that as Linux's code size increases from 141,000 to 4.4 million lines, copy-pasted code also keeps increasing from 23,000 to 975,000 lines through version 1.0 to 2.6.6.

In terms of CP_Coverage, the percentage of copy-pasted code also steadily increases along software evolution. For example, Figure 9(a) shows that CP_Coverage in Linux increases from 16.2% to 22.3% from version 1.0 to 2.6.6, and Figure 9(b) shows that CP_Coverage in FreeBSD increases from 17.5% to 21.7% from version 2.0 to 4.10.

However, the CP_Coverage remains relatively stable over the recent several versions for both Linux and FreeBSD. For example, the CP_Coverage for FreeBSD has been staying around 21–22% since version 4.0.

[Figure 9: Copy-pasted code in Linux and FreeBSD through various versions. (a) Linux 1.0–2.6.6. (b) FreeBSD 2.0–4.10. Each panel plots total LOC and total copy-pasted LOC (million LOC) together with CP_Coverage (%). The x-axis (version number) is drawn in time scale with the corresponding release time. The versions of Linux we analyze are through 1.0 to the current version 2.6.6. The versions of FreeBSD include the main branch through 2.0 to 4.10.]

7 Related Work

In this section, we briefly discuss closely related work that has not been described in earlier sections.

7.1 Detecting Copy-Pasted Code

Several studies have been conducted on detection of copy-pasted code. The techniques used include: line-by-line [6], token-by-token [24, 33], fingerprinting [21], visualization [11, 13], abstract syntax tree [7, 27], and dependence graph [26, 28].

Dup [6] finds all pairs of matching parameterized code fragments. A code fragment matches another if both fragments are contiguous sequences of source lines with some consistent identifier mapping scheme. Because this approach is line-based, it is sensitive to lexical aspects like the presence or absence of new lines. In addition, it does not find non-contiguous copy-pastes. CP-Miner does not have these shortcomings.

Johnson [21] proposed using a fingerprinting algorithm on a substring of the source code. In this algorithm, calculated signatures per line are compared in order to identify matched substrings. As with line-based techniques, this approach is sensitive to minor modifications made in copy-pasted code.

Some graphical tools were proposed to understand code similarities in different programs (or in the same program) visually. Dotplots [11] of source code can be constructed by tokenizing the code into lines and placing a dot in coordinates (i, j) on a 2-D graph, if the ith input token matches the jth input token. Similarly, Duploc [13] provides a scatter plot visualization of copy-pastes (detected by string matching of lines) and also textual reports that summarize all discovered sequences. Both Dotplots and Duploc only support line granularity. In addition, they can only detect identical duplicates and do not tolerate renaming, insertions, and deletions.

Baxter et al. [7] proposed a tool that transforms source code into abstract-syntax trees (AST), and detects copy-paste by finding identical subtrees. Similar to other tools, it is not tolerant to modifications in copy-pasted segments. In addition, it may introduce many false positives because two code segments with the same syntax subtrees are not necessarily copy-pastes.

Komondoor et al. [26] proposed using program dependence graph (PDG) and program slicing to find isomorphic subgraphs and code duplication. Although this approach is successful at identifying copies with reordered statements, its running time is very long. For example, it takes 1.5 hours to analyze only 11,540 lines of source code from bison, much slower than CP-Miner. Another slow PDG-based approach is found in [28].

Mayrand et al. [31] used an Intermediate Representation Language to characterize each function in the source code and detect copy-pasted function bodies that have similar metric values. This tool does not detect copy-paste at other granularity such as segment-based copy-paste, which occurs more frequently than function-based copy-paste as shown in our results.

Some copy-paste detection techniques are too coarse-grained to be useful for our purpose. JPlag [33], Moss [35], and sif [30] are tools to find similar programs among a given set. They have been commonly used to detect plagiarism. Most of them are not suitable for detecting copy-pasted code in a single large program.

Kontogiannis et al. [27] built an abstract pattern matching tool to identify probable matches using Markov models. This approach does not find copy-pasted code. Instead, it only measures similarity between two programs.

7.2 Detecting Software Bugs

Many tools have been proposed for detecting software bugs. One approach is dynamic checking that detects bugs during execution. Examples of dynamic tools include Purify [19], Valgrind [36], DIDUCE [18], Eraser [34], and CCured [12]. Dynamic tools have more accurate information but may introduce overheads during execution. Moreover, they can only find bugs on the execution paths. Most dynamic tools cannot detect bugs in operating systems.

Another approach is to perform checks statically. Examples of this approach include explicit model checking [15, 32, 37] and program analysis [8, 14, 17]. Most static tools require significant involvement of programmers to write specifications or annotate programs. But the advantage of static tools is that they add no overhead during execution, and they can find bugs that may not occur in the common execution paths. A few tools do not require annotations, but they focus on detecting different types of bugs, instead of copy-paste related bugs.

Our tool, CP-Miner, is a static tool that can detect copy-paste related bugs, without any annotation requirement from programmers. CP-Miner complements other bug detection tools because it is based on a different observation: finding bugs caused by copy-paste. Some copy-paste related bugs can be found by previous tools if they lead to buffer overflow or some obvious memory corruption, but many of them, especially those semantic ones, cannot be found by previous tools.

Our work is motivated by and related to Engler et al.'s empirical analysis of operating systems errors [10]. Their study gave an overall error distribution and evolution analysis in operating systems, and found that copy-paste is one of the major causes for bugs. Our work presents a tool to detect copy-pasted code and related bugs in large software including operating system code. Many of these bugs, such as the one in Figure 1, cannot be detected by their tools.

8 Conclusions

This paper presents a tool called CP-Miner² that uses data mining techniques to efficiently identify copy-pasted code in large software including operating systems, and also detects copy-paste related bugs. Specifically, it takes less than 20 minutes for CP-Miner to identify 190,000 and 150,000 copy-pasted segments that account for 20–22% of the source code in Linux and FreeBSD. Moreover, CP-Miner has detected 28 and 23 copy-paste related bugs in the latest versions of Linux and FreeBSD, respectively. Compared to CCFinder [24], CP-Miner finds 17–52% more copy-pasted segments because it can tolerate statement insertions and modifications in copy-paste. In addition, we have shown some interesting characteristics of copy-pasted code in Linux and FreeBSD, including distribution of copy-paste across different segment sizes, group sizes, granularity, modules, amount of modifications, and software evolution.

² CP-Miner will be released to the research community.

Our results indicate that maintaining copy-pasted code would be very useful for programmers because it is commonly used in large software such as operating system code, and it can easily introduce hard-to-detect bugs. We hope our study motivates software development environments such as Microsoft Visual Studio to provide functionality to maintain copy-pasted code and automatically detect copy-paste related bugs.

Even though CP-Miner focuses only on "forget-to-change" bugs caused by copy-paste, copy-paste can introduce many other types of bugs. For example, after a copy-paste operation, the programmer forgets to add some statements that are specific to the new copy-pasted segment. However, such bugs are hard to detect because detection relies on semantic information. It is impossible to guess what the programmer would want to insert or modify. Another type of copy-paste related bugs is caused by programmers forgetting to fix a known bug in all copy-pasted segments. They only fix one or two segments but forget to change it in the others. Our tool CP-Miner can detect simple cases of this type of errors. But if the fix is too complicated, CP-Miner would miss the bug because the modified code segment becomes too different from the others to be identified as copy-paste. To solve this problem more thoroughly, it would require support from software development environments such as Microsoft Visual Studio.

9 Acknowledgements

The authors would like to thank the shepherd, Andrew Myers, the anonymous reviewers, and James Larus (Microsoft Research Lab) for their invaluable feedback. We appreciate Professor Jiawei Han and his group for their CloSpan mining algorithm. We would also like to thank Cigdem Sengul for her help with the initial investigation of our project. This research is supported by IBM Faculty
Award, NSF CNS-0347854 (career award), NSF CCR-0305854 grant and NSF CCR-0325603 grant. Our experiments were conducted on equipment provided through the IBM SUR grant.

REFERENCES

[1] Linux kernel mailing list. http://lkml.org.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Eleventh International Conference on Data Engineering, 1995.
[3] A. V. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
[4] A. Aiken. Moss: A system for detecting software plagiarism. http://www.cs.berkeley.edu/~aiken/moss.html.
[5] B. S. Baker. A program for identifying duplicated code. Computing Science and Statistics, 24:49–57, 1992.
[6] B. S. Baker. On finding duplication and near-duplication in large software systems. In Proceedings of the Second Working Conference on Reverse Engineering, page 86. IEEE Computer Society, 1995.
[7] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proceedings of the International Conference on Software Maintenance, page 368. IEEE Computer Society, 1998.
[8] J.-D. Choi, K. Lee, A. Loginov, R. O'Callahan, V. Sarkar, and M. Sridharan. Efficient and precise datarace detection for multithreaded object-oriented programs. In Proceeding of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pages 258–269. ACM Press, 2002.
[9] A. Chou, B. Chelf, D. R. Engler, and M. Heinrich. Using meta-level compilation to check FLASH protocol code. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating System, pages 59–70. ACM Press, 2000.
[10] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. R. Engler. An empirical study of operating system errors. In Symposium on Operating Systems Principles, pages 73–88, 2001.
[11] K. W. Church and J. I. Helfman. Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 1993.
[12] J. Condit, M. Harren, S. McPeak, G. C. Necula, and W. Weimer. CCured in the real world. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pages 232–244. ACM Press, 2003.
[13] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In Proceedings of International Conference on Software Maintenance, pages 109–118. IEEE, 1999.
[14] D. Engler and K. Ashcraft. RacerX: effective, static detection of race conditions and deadlocks. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 237–252. ACM Press, 2003.
[15] D. Engler, D. Y. Chen, and A. Chou. Bugs as inconsistent behavior: A general approach to inferring errors in systems code. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 57–72. ACM Press, 2001.
[16] S. Grier. A tool that detects plagiarism in Pascal programs. In Proceedings of the 12th SIGCSE Technical Symposium on Computer Science Education, pages 15–20. ACM Press, 1981.
[17] S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system and language for building system-specific, static analyses. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pages 69–82. ACM Press, 2002.
[18] S. Hangal and M. S. Lam. Tracking down software bugs using automatic anomaly detection. In Proceedings of the International Conference on Software Engineering, May 2002.
[19] R. Hastings and B. Joyce. Purify: Fast detection of memory leaks and access errors. In Proceedings of the Winter USENIX Conference, pages 158–185, Dec 1992.
[20] H. T. Jankowitz. Detecting plagiarism in student Pascal programs. Computer Journal, 31(1):1–8, 1988.
[21] J. H. Johnson. Identifying redundancy in source code using fingerprints. In Proceedings of the conference of the Centre for Advanced Studies on Collaborative research, Toronto, Ontario, Canada, October 1993.
[22] J. H. Johnson. Substring matching for clone detection and change tracking. In Proceedings of the International Conference on Software Maintenance, pages 120–126. IEEE Computer Society, 1994.
[23] R. E. Johnson and W. F. Opdyke. Refactoring and aggregation. In Object Technologies for Advanced Software, First JSSST International Symposium, volume 742, pages 264–278. Springer-Verlag, 1993.
[24] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7):654–670, 2002.
[25] C. Kapser and M. W. Godfrey. Toward a taxonomy of clones in source code: A case study. Evolution of Large-scale Industrial Software Applications (ELISA), Sept 2003.
[26] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In 8th International Symposium on Static Analysis (SAS), 2001.
[27] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. Working Notes of the Third Workshop on AI and Software Engineering: Breaking the Toy Mold (AISE), 1995.
[28] J. Krinke. Identifying similar code with program dependence graphs. In Eighth Working Conference on Reverse Engineering (WCRE), 2001.
[29] Z. Li, Z. Chen, S. Srinivasan, and Y. Zhou. C-Miner: Mining Block Correlations in Storage Systems. In Proceedings of the 3rd USENIX Conference on File and Storage Technology, 2004.
[30] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1–10, San Francisco, CA, USA, 17–21 1994.
[31] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the 1996 International Conference on Software Maintenance, page 244. IEEE Computer Society, 1996.
[32] M. Musuvathi, D. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A pragmatic approach to model checking real code. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, Dec. 2002.
[33] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science, 8(11):1016–1038, Nov 2002.
[34] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391–411, 1997.
[35] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM Press, 2003.
[36] J. Seward. Valgrind, an open-source memory debugger for x86-GNU/Linux. Available at URL http://developer.kde.org/sewardj/.
[37] U. Stern and D. L. Dill. Automatic verification of the SCI cache coherence protocol. In Conference on Correct Hardware Design and Verification Methods, pages 21–34, 1995.
[38] X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large datasets. In Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), San Francisco, CA, May 2003.
[39] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 40:31–60, 2001.