LI ET AL.: CP-MINER: FINDING COPY-PASTE AND RELATED BUGS IN LARGE-SCALE SOFTWARE CODE
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 3, MARCH 2006

1.1 Motivation

Recent studies [2], [3], [4] have shown that a large portion of code appears to be duplicated in software. For example, Kapser and Godfrey [4], using a code clone detection tool called CCFinder [5], found that 12 percent of the Linux file system code (279K lines) was involved in code cloning activity. Baker [2] found that, in the complete source of the X Window system (714K lines), 19 percent of the code was identified as duplicates. Duplicated code is likely to result from copy-paste activity because it can significantly reduce programming effort and time by reusing a piece of ...

... similar bugs caused by copy-paste in Linux, FreeBSD, PostgreSQL, and Web Apache.

Another copy-paste related bug detected by CP-Miner is shown in Fig. 1b. In this example, the segment in lines 258-269 was copied from lines 246-257. Each segment initializes different IOPs (I/O processors) specified by two constants (equal to 0 and 1, respectively). However, the identifier in line 264 is not changed accordingly, which results in a wrong initial state of the IOPs. This bug would incorrectly overwrite the value (0x87) of status_ctrl with 0. This cannot be detected by existing bug detection tools because it is not a simple buffer overflow bug (since IOP_NUM_SCC equals 0), incorrect pointer manipulation, or free memory access. If known by malicious users who plan a security attack, this bug may cause the server to crash.

It is a challenging task to efficiently extract copy-pasted code in large software suites such as an operating system. Even though several previous studies [16], [17] have addressed the related problem of detecting plagiarism, they are not suitable for detecting copy-pasted code. Those tools, such as the commonly used JPlag [18], were designed to measure the degree of similarity between a pair of programs in order to detect plagiarism. If these tools were to be used to detect copy-pasted code in a single program without any modification, they would need to compare all possible pairs of code fragments. For a program with $n$ statements, a total of $O(n^3)$ pairwise comparisons need to be performed. (Considering comparison between pairs of code fragments with $l$ statements, there are $n - l + 1$ different fragments. So, there are $\binom{n-l+1}{2}$ possible pair comparisons. Since $l$ can be $1, 2, \ldots, n$, the total number of pairwise comparisons is $\sum_{l=1}^{n} \binom{n-l+1}{2} = O(n^3)$.) This complexity is certainly impractical for software with millions of lines of code, such as Linux and FreeBSD. Of course, it is possible to modify these tools to identify copy-pasted code in a single software system, but the modification is neither trivial nor straightforward. For example, a new dynamic programming algorithm may be integrated into the original detection algorithm, which would require significant effort.

So far, only a few tools have been proposed to identify copy-pasted code in a single program. Examples of such tools include Moss [19], [20], Dup [2], CCFinder [5], and others [6], [21]. Most of these tools suffer from some or all of the following limitations:

- Scalability. Most existing tools are not scalable to large software suites such as operating system code because they consume a large amount of memory and take a long time to analyze millions of lines of code.
- Tolerance to modifications. Most tools cannot deal with modifications in copy-pasted code. Some tools [3], [22] can only detect copy-pasted segments that are exactly identical. Such modifications are very common in standard practice. Our experiments with CP-Miner show that about one third of copy-pasted segments contain insertion or modification of one to two statements.
- Bug detection. Although some existing tools report copy-pasted code, they cannot detect copy-paste related bugs.

1.2 Our Contributions

In this paper, we present CP-Miner, a tool that uses data mining techniques to identify copy-pasted code in large software suites including operating system code and also detects copy-paste related bugs. It requires no modification or annotation to the source code of the software being analyzed.

Fig. 1. Copy-paste related bugs in Linux 2.6.6 detected by CP-Miner. These bugs have been confirmed and fixed by Linux kernel developers. (a) Detected in file /arch/sparc64/prom/memory.c. A similar bug is also detected in file /arch/sparc/prom/memory.c. (b) Detected in file /arch/m6...

Our work makes three main contributions:

- A scalable copy-paste detection tool for large software suites. CP-Miner can efficiently find copy-pasted code in large software suites including operating system code. Our experimental results show that it takes less than 20 minutes for CP-Miner to detect 200,000 and 150,000 unique copy-pasted segments that account for about 22 percent and 20 percent of the source code in Linux and FreeBSD (each with more than 3 million lines of code), respectively. Additionally, it takes less than one minute to detect copy-pasted segments in Apache web server and PostgreSQL, accounting for about 18 percent and 22 percent of the total source code, respectively. Compared to CCFinder [5], CP-Miner is able to find 17-52 percent more copy-pasted segments in the four test applications because CP-Miner can tolerate statement insertions and modifications.
- Detection of bugs associated with copy-paste. CP-Miner can detect copy-paste related bugs such as the bugs shown in Fig. 1, most of which are hard to detect with existing static or dynamic bug detection tools. Specifically, CP-Miner has detected 49 new bugs in the latest version of Linux, 31 in FreeBSD, 5 in Web Apache, and 2 in PostgreSQL. These bugs had not been reported before. We have reported these bugs to the corresponding developers. So far, most of these bugs have been confirmed and fixed by Linux and FreeBSD developers and have been rectified in the following releases.
- Statistical study of copy-pasted code distribution in operating system code. Few earlier studies have been conducted on the characteristics of copy-paste in large software suites. Our work found some interesting characteristics of copy-pasted code in Linux and FreeBSD. Our results indicate that: 1) copy-pasted segments are usually not too large, most with 5-16 statements; 2) although more than 50 percent of copy-pasted segments have only two copies, a few (6.3-6.7 percent) copy-pasted segments are copied more than eight times; 3) there are a significant number (11.3-13.5 percent) of copy-pasted segments at function granularity (copy-paste of an entire function); 4) most (65-67 percent) copy-pasted segments require renaming at least one identifier and 23-27 percent of copy-pasted segments have inserted, modified (excluding renaming), or deleted one statement; 5) different OS modules have very different percentages of copy-pasted code, with some modules having a higher percentage of copy-paste than other modules in Linux; and 6) as the operating system code evolves, the amount of copy-paste also increases, but the coverage percentage of copy-pasted code remains relatively stable over the recent versions of Linux and FreeBSD.

2 RELATED WORK AND BACKGROUND

2.1 Detection of Copy-Pasted Code

Since copy-pasted code segments are usually similar to the original ones, detection of copy-pasted code involves detecting code segments that are identical or similar. Previous techniques for copy-paste detection can be roughly classified into three categories: 1) string-based, in which the program is divided into strings (typically lines) and these strings are compared against each other to find sequences of duplicated strings [2], [23]; 2) parse-tree-based, in which pattern matching is performed on the parse-tree of the code to search for similar subtrees [6], [24], [25]; and 3) token-based, in which the program is divided into a stream of tokens and duplicate token sequences are identified [5], [18].

Dup, proposed by Baker [2], finds all pairs of matching code fragments. A code fragment matches another if both fragments are contiguous sequences of source lines with some consistent identifier mapping scheme. Because this approach is line-based, it is sensitive to lexical aspects such as the presence or absence of new lines. In addition, it does not find noncontiguous copy-pastes. CP-Miner does not have these shortcomings.

Johnson [23] proposed using a fingerprinting algorithm on a substring of the source code. In this algorithm, signatures calculated per line are compared in order to identify matched substrings. As with line-based techniques, this approach is sensitive to minor modifications made in copy-pasted code.

Baxter et al. [6] proposed a tool that transforms source code into abstract-syntax trees (AST) and detects copy-paste by finding identical subtrees. Similarly to other tools, it cannot perform robustly when modifications are made in copy-pasted segments.

Komondoor and Horwitz [24] proposed using a program dependence graph (PDG) and program slicing to find isomorphic subgraphs and code duplication. Although this approach is successful at identifying copies with reordered statements, its running time is very long. For example, it takes 1.5 hours to analyze only 11,540 lines of source code, much slower than CP-Miner. Another slow PDG-based approach is found in [26].

Kontogiannis et al. [25] built an abstract pattern matching tool to identify probable matches using Markov models. This approach does not find copy-pasted code. Instead, it only measures similarity between two programs.

Mayrand et al. [27] used an Intermediate Representation Language to characterize each function in the source code and detect copy-pasted function bodies that have similar metric values. This tool does not detect copy-paste at other granularity such as segment-based copy-paste, which occurs more frequently than function-based copy-paste, as shown in our results.

Some copy-paste detection techniques are too coarse-grained to be useful for our purpose. ...

... CloSpan produces only closed frequent subsequences rather than all frequent subsequences since any nonclosed subsequence can be inferred from its supersequence. In the example above, the frequent subsequences include $ab$, $ac$, $bc$, and $abc$ if the minimum support is 4, but we only need to produce the closed subsequences, such as $abc$. This feature significantly reduces the number of frequent subsequences generated, especially for long frequent subsequences.

There are two main ideas to improve the mining efficiency in CloSpan. The first idea is based on an obvious observation that if a sequence is frequent, then all of its subsequences are frequent. For example, if a sequence $abc$ is frequent, all of its subsequences $a, b, c, ab, ac, bc$ are frequent. CloSpan recursively produces a longer frequent subsequence by concatenating every frequent item with a shorter frequent subsequence that has already been obtained in the previous iterations.

To better explain this idea, let us consider an example. Let $L_i$ denote the set of frequent subsequences with length $i$. In order to get $L_i$, we can join the sets $L_{i-1}$ and $L_1$. For example, suppose we have already computed $L_1$ and $L_2$ shown below. In order to compute $L_3$, we can first compute the candidate set $L'_3$ by concatenating a subsequence from $L_2$ and an item from $L_1$:

$L_1 = \{a, b, c\}$
$L_2 = \{ab, ac, bc\}$
$L'_3 = \{aba, abb, abc, aca, acb, acc, bca, bcb, bcc\}$

For greater efficiency, CloSpan does not join the sequences in set $L_2$ with all the items in $L_1$. Instead, each sequence in $L_2$ is concatenated only with the frequent items in its suffix database. The suffix database for a subsequence $s$ is composed of all suffixes of the sequences containing $s$ in the original database. In our example, for the frequent sequence $ab$, its suffix database is $\{ced, ecf, ch, ijc\}$, and only $c$ is a frequent item in it (its support is equal to the minimum support), so $ab$ is only concatenated with $c$. We then get a longer sequence $abc$ that belongs to $L_3$.

The second idea is used to efficiently evaluate whether a concatenated subsequence is frequent or not. It tries to avoid searching through the whole database. Instead, it only checks suffix databases that can be created by scanning the whole database at the beginning and then updated when frequent subsequences are generated. In the above example, for each candidate sequence, CloSpan checks whether it is frequent or not by searching the suffix database. If the number of its occurrences is at least the minimum support, it is added into $L_3$, which is the set of frequent subsequences of length 3. CloSpan continues computing $L_4$, $L_5$, and so on until no more subsequences can be added into the set of frequent subsequences. In the example above, the algorithm stops at $L_3$ since there are no frequent items in the suffix database $\{ed, f, h\}$ of $abc$, which can be easily obtained from the suffix database $\{ced, ecf, ch, ijc\}$ without scanning the whole database.

Recently, we have used CloSpan to detect block correlations in storage systems [37]. We do not discuss the CloSpan algorithm in more detail as it can be found in [36].

3 CP-MINER

CP-Miner has two major functions: detecting copy-pasted code segments and finding copy-paste related bugs. It requires no modification in the source code of the software being analyzed. The following two subsections describe the design of each function.

3.1 Identifying Copy-Pasted Code

To detect copy-pasted code, CP-Miner first converts the problem into a frequent subsequence mining problem. It then uses an enhanced algorithm of CloSpan to find copy-pasted segments. Finally, it prunes false positives that are unlikely to be real copy-pasted code and then composes larger copy-pasted segments. For convenience, we refer to a group of code segments that are similar to each other as a copy-paste group.

CP-Miner can detect copy-pasted segments efficiently because it uses frequent subsequence mining techniques that can avoid many unnecessary or redundant comparisons. To map our problem to a frequent subsequence mining problem, CP-Miner first maps a statement to a number, with similar statements being mapped to the same number (described in Section 3.1.1). Then, a basic block (i.e., a straight line piece of code without any jumps or jump targets in the middle) becomes a sequence of numbers. As a result, a program is mapped into a database of many sequences. By mining the database using CloSpan, we can find frequent subsequences that occur at least twice in the sequence database. These frequent subsequences are exactly copy-pasted segments in the original program. By applying some pruning techniques, we can find basic copy-pasted segments, which then can be combined with neighboring ones to compose larger copy-pasted segments.

CP-Miner is capable of handling modifications in copy-pasted segments for two reasons. First, similar statements are mapped into the same value. This is achieved by mapping all identifiers (variables, functions, types, etc.) of the same type into the same value, regardless of their actual names. This relaxation tolerates identifier renaming in copy-pasted segments. Even though false positives may be introduced during this process, they are addressed later through various pruning techniques, such as identifier mapping (described in Section 3.1.4). Second, a frequent subsequence can be interleaved in its supporting sequences. Since the mining algorithm allows arbitrary interleaving gaps in the sequences, we enhance the basic mining algorithm, CloSpan, to support gap constraints in frequent subsequences. This enhancement allows CP-Miner to tolerate one to two statement insertions, deletions, or modifications in copy-pasted code, ignoring an arbitrarily long different code segment that is unlikely to be copy-pasted code. Insertions and deletions are symmetric because a statement deletion in one copy can also be seen as an insertion in the other copy. Modification is a special case of insertion. Basically, the modified statement can be treated as if both segments have an inserted statement.

The main steps in the process of identifying copy-pasted segments include: ...
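As a rough illustration of the two ideas in Section 3.1 (mapping similar statements to the same number, and matching under a gap constraint), the following Python sketch may help. It is not CP-Miner's implementation: the regular-expression tokenizer, the tiny `KEYWORDS` set, the use of CRC32 as the hash, and the names `normalize`, `statement_number`, `block_to_sequence`, and `occurs_with_gap` are all assumptions made for this example. It also collapses every identifier into one placeholder, whereas the text maps identifiers of the same type to the same value, and it only bounds the gap between consecutive matched statements.

```python
import re
import zlib

# Hypothetical, deliberately tiny keyword list; a real C front end knows many more.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|0[xX][0-9a-fA-F]+|\d+|\S")


def normalize(statement):
    """Replace every identifier with ID and every numeric literal with NUM."""
    out = []
    for tok in TOKEN_RE.findall(statement):
        if tok in KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)


def statement_number(statement):
    """Map a statement to a number; renamed copies share the same number."""
    return zlib.crc32(normalize(statement).encode("utf-8"))


def block_to_sequence(statements):
    """Turn a basic block (a list of statements) into a sequence of numbers."""
    return [statement_number(s) for s in statements]


def occurs_with_gap(pattern, sequence, max_gap):
    """Return True if `pattern` is a subsequence of `sequence` with at most
    `max_gap` skipped items between consecutive matched items."""
    if not pattern:
        return True
    # `ends` holds every position where the prefix matched so far can end.
    ends = [i for i, x in enumerate(sequence) if x == pattern[0]]
    for item in pattern[1:]:
        ends = [j for j, x in enumerate(sequence)
                if x == item and any(0 < j - i <= max_gap + 1 for i in ends)]
    return bool(ends)


original = ["total = total + a[i];",
            "if (total > limit) return total;",
            "i = i + 1;"]
# A renamed copy with one inserted statement.
pasted = ["sum = sum + b[j];",
          "log(sum);",
          "if (sum > cap) return sum;",
          "j = j + 1;"]

print(occurs_with_gap(block_to_sequence(original), block_to_sequence(pasted), 1))
```

Keeping all reachable end positions, instead of greedily taking the first match, matters once a gap limit is imposed: an early match of the first statement may be too far from the rest, while a later one fits.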
... positives. To prune false positives, CP-Miner applies several techniques to both basic and composed copy-pasted segments. The pruning techniques are described below.

- Pruning unmappable segments. This technique is used to prune false positives introduced by the tokenization of identifiers. It is based on the observation that if a programmer copy-pastes a code segment and then renames an identifier, he/she would most likely rename this identifier in all its occurrences in the new copy-pasted segment. Therefore, we can build an identifier mapping that maps old names in one segment to their corresponding new ones in the other segment that belongs to the same copy-paste group. In the example shown in Fig. 2, variable phys is changed into prom (except the bug on line 117). The original source identifiers, instead of tokens, are used here for the mapping from one code segment to the other. This source code-level information obtained from the parsing phase is maintained together with each hashed sentence stored in our sequence database.

  A mapping scheme is consistent if there are very few conflicts that map one identifier name to two or more different new names. If no consistent identifier mapping can be established between a pair of copy-pasted segments, they are likely to be false positives. To measure the degree of conflict, CP-Miner uses a metric called ConflictRatio, which records the conflict ratio for an identifier mapping between two candidate copy-pasted segments. For example, suppose a variable from segment 1 is changed to multiple identifiers (usually just one or two identifiers) in segment 2. From these multiple identifiers, we first pick out the one that has the largest number of occurrences in the variable's corresponding positions. Then, we calculate what percentage of the variable's occurrences in segment 1 is NOT mapped to this chosen identifier in segment 2, which is where conflict happens. If the variable is mapped to the chosen identifier in 75 percent of its occurrences in segment 2, but 25 percent of its occurrences are changed into other variables, the ConflictRatio of mapping this variable from segment 1 to segment 2 is 25 percent. Similarly, if only 25 percent of the variable's occurrences in segment 1 are mapped to the chosen identifier in segment 2, the ConflictRatio is 75 percent. Here, the ConflictRatio is asymmetric among the code segment pair, so CP-Miner calculates the values in both directions of mapping.

  The ConflictRatio for the whole mapping scheme between these two segments is the weighted sum of the ConflictRatio of the mapping for each unique identifier. The weight for an identifier in a given code segment is the fraction of its occurrences over the total occurrences of all identifiers. If the ConflictRatio for two candidate copy-pasted segments (in either one of the mapping directions) is higher than a predefined threshold, these two code segments are filtered as false positives. In our experiments, we set the threshold to be 60 percent.

- Pruning tiny segments. Our mining algorithm may find tiny copy-pasted segments that consist of only one to two simple statements. If such a tiny segment cannot be combined with neighboring segments to compose a larger segment, it is removed from the candidate set. This is based on the observation that copy-pasted segments are usually not very small because programmers cannot save much programming effort by copy-pasting tiny code segments. CP-Miner measures the size of a segment by the number of tokens in it. This metric is more appropriate than the number of statements because the length of statements is highly variable. If a single statement is very complicated with many tokens, it is still possible for programmers to copy-paste it. To prune tiny segments, CP-Miner uses a tunable size threshold. If the number of tokens in a pair of copy-pasted segments is fewer than this threshold, the pair is removed.

- Pruning overlapped segments. The concatenation approach for constructing larger segments will inevitably lead to many segments which overlap. If a pair of candidate copy-pasted segments overlap with each other, they are considered false positives. To avoid such false positives, CP-Miner stops extending the pair of copy-pasted segments once they overlap. For some program structures, such as switch statements, which contain many pairs of self-similar segments, pruning overlapped segments can avoid most of the false positives.

- Pruning segments with large gaps. Besides the mining procedure for basic copy-pasted segments, the gap constraint is also applied to composed ones. When two neighboring segments are combined, the maximum gap of the newly composed large segment may become larger than a predefined threshold on the total gap. If this is true, the composition is invalid. So, the newly composed one is not added into the candidate set and the two smaller ones are marked as nonexpandable in the set.

Of course, even after such rigorous pruning, false positives may still exist. However, we have manually examined 100 random copy-pasted segments reported by CP-Miner for Linux, and only a few false positives (eight) were found. Therefore, around 92 percent of the segments reported by our algorithm are real copy-paste. We can only manually examine each identified copy-pasted segment because there are no traces that record the programmers' copy-paste activity during the development of the software.

3.1.5 Computational Complexity of CP-Miner

CP-Miner can extract copy-pasted code directly from a single software system with total complexity of ... in the worst case (as a function of the number of lines of code), and the optimizations further improve its efficiency in practice. For example, CP-Miner can identify more than 150,000 copy-pasted segments from 3-4 million lines of code in less than 20 minutes, as shown in our results in Section 5.3.

In CP-Miner, we break very large basic blocks into small ...
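The ConflictRatio described under "Pruning unmappable segments" can be sketched in a few lines of Python. This is only one reading of the prose, not CP-Miner's code: the function names `conflict_ratio` and `is_false_positive` are hypothetical, and the sketch assumes the identifiers of the two segments have already been aligned position by position into two lists of equal length. The 60 percent threshold is the value reported in the text.

```python
from collections import Counter, defaultdict


def conflict_ratio(src_ids, dst_ids):
    """Weighted ConflictRatio of mapping identifiers in src_ids to dst_ids.

    src_ids[k] and dst_ids[k] are assumed to be the identifiers found at the
    same relative position in the two candidate segments.
    """
    if len(src_ids) != len(dst_ids):
        raise ValueError("identifier lists must be aligned position by position")
    if not src_ids:
        return 0.0
    # For each source identifier, count what it was renamed to.
    targets = defaultdict(Counter)
    for src, dst in zip(src_ids, dst_ids):
        targets[src][dst] += 1
    total = len(src_ids)
    ratio = 0.0
    for counts in targets.values():
        occurrences = sum(counts.values())
        dominant = counts.most_common(1)[0][1]      # most frequent new name
        per_identifier = 1.0 - dominant / occurrences
        weight = occurrences / total                # share of all identifiers
        ratio += weight * per_identifier
    return ratio


def is_false_positive(seg1_ids, seg2_ids, threshold=0.60):
    """Prune the pair if the ConflictRatio in either direction is too high."""
    forward = conflict_ratio(seg1_ids, seg2_ids)
    backward = conflict_ratio(seg2_ids, seg1_ids)
    return max(forward, backward) > threshold


# "a" is renamed to "b" three times out of four; "i" is always renamed to "j".
seg1 = ["a", "a", "a", "a", "i"]
seg2 = ["b", "b", "b", "c", "j"]
print(conflict_ratio(seg1, seg2), conflict_ratio(seg2, seg1))
print(is_false_positive(seg1, seg2))
```

Computing the ratio in both directions reflects the asymmetry noted in the text: here the mapping from segment 1 to segment 2 has a conflict for "a", while the reverse mapping has none.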
... report to the corresponding developer community those bugs that we suspect to be real bugs with high confidence. The numbers of bugs detected by CP-Miner and verified bugs are shown in Table 4. The results are achieved by setting the UnchangedRatio Threshold to be 0.4.

Both Linux and FreeBSD have many copy-paste related bugs. So far, we have verified 49 and 32 bugs in the latest versions of Linux and FreeBSD, respectively. Most of these bugs had never been reported before. We have reported these bugs to the kernel developer communities. Most of the Linux bugs have been confirmed and fixed in the following releases by Linux kernel developers and the others are still being processed.

Since Apache and PostgreSQL are much smaller compared to Linux and FreeBSD, CP-Miner found many fewer copy-paste related bugs. We have verified six bugs for Apache and two bugs for PostgreSQL with high confidence. One bug in Apache was immediately fixed by the Apache developers after we reported it to them.

5.2 False Positives in Bug Detection

Table 4 also shows the number of false positives reported by CP-Miner in bug detection. These false positives are mostly caused by the following two major reasons and can be further pruned in our future work:

- Incorrectly matched copy-pasted segments. In some copy-pasted segments that contain multiple “...” blocks, there are many possible combinations for these contiguous copy-pasted blocks to compose larger ones. Since CP-Miner simply follows the program order to compose larger copy-pastes, it is likely that a wrong combination might be chosen. As a result, identifiers are compared between two incorrectly matched copy-pasted segments, which results in false positives. These false positives can be pruned if we use more semantic information of the identifiers in these segments. The segments with a number of “...” blocks usually contain a lot of constant identifiers, but our current CP-Miner treats them as normal variable names. If we use the information of these constants to match “...” blocks when composing larger copy-pasted segments, it can reduce the number of incorrectly matched segments and most false positives can be pruned.
- Exchangeable orders. In a copy-paste pair, the orders of some statements or expressions can be switched. For example, a segment with several similar statements such as “a1 = b1; a2 = b2;” is the same as “a2 = b2; a1 = b1;”. The current version of CP-Miner simply compares the identifiers in a pair of copy-pasted segments in strict order and, therefore, a false alarm might be reported. In Linux, 41 false positives are caused by such exchangeable orders. These false positives can be pruned if we relax the strict order comparison by further checking whether the corresponding “changed” identifiers are in the neighboring statements/expressions.

5.3 Time and Space Overheads

CP-Miner can identify copy-pasted code in large-scale software code very efficiently. The execution time of CP-Miner is shown in Table 5. CP-Miner takes 11-20 minutes to identify 101,699-198,605 copy-pasted segments in Linux and FreeBSD, each with 3-4 million lines of code. It takes less than 1 minute to detect copy-pasted segments in Apache and PostgreSQL with more than 200,000 lines of code.

CP-Miner is also space-efficient. For example, it takes less than 530 MB to find copy-pasted code in Linux. For Apache and PostgreSQL, CP-Miner only consumes 27-57 MB of memory.

5.4 Comparison with CCFinder

We have compared CP-Miner with CCFinder [5]. CCFinder has an execution time similar to CP-Miner's, but CP-Miner discovers many more copy-pasted segments. In addition, CCFinder cannot detect copy-paste related bugs.

As we explained in Section 4, CCFinder allows identifier renaming, but not statement insertions. In addition, CCFinder does not have pruning operations as rigorous as CP-Miner's. For example, CCFinder reports many tiny copy-pasted segments of fewer than 30 tokens, which are too simple to be worth copy-pasting. In addition, it also includes incomplete statements in copy-pasted segments, which is very unlikely to be the case in practice.

CP-Miner can identify 17-52 percent more copy-pasted code than CCFinder because CP-Miner can tolerate statement insertions and modifications. Table 6 compares the copy-pasted code identified by CP-Miner and CCFinder. The results with CP-Miner are achieved using the default threshold setting (minimum size = 30 tokens, with the default maximum-gap setting). For fair comparison, we also filter those tiny, incomplete segments from CCFinder's output. The results show that around 25 percent of copy-paste is pruned after filtering.
TABLE 4. Copy-Paste Related Bugs Reported by CP-Miner (UnchangedRatio Threshold = 0.4) and Bugs Verified by Us with High Confidence, Most of Which Were Confirmed and Fixed by Corresponding Developers after We Reported
The false alarms include three categories: 1) incorrectly matched segments, 2) exchangeable orders, and 3) others. The first two categories can be pruned, which remains as our future work.

TABLE 5. Execution Time and Memory Space of CP-Miner

TABLE 8. Copy-Paste Code within a Module and across Modules: (a) Linux 2.6.6 and (b) FreeBSD 5.2.1
Each number in the table represents the percentage of code copy-pasted from another module. For example, in (a), the number at row “arch” and column “arch” represents that 25.1 percent of the code in module “arch” is copy-pasted within the module itself; the number at row “arch” and column “drivers” represents that 3.2 percent of the code in module “arch” is copy-pasted from/to another module “drivers.” Note that these tables are asymmetric because the percentage is related to the size of the row element module.

Fig. 9. Copy-pasted code in Linux and FreeBSD through various versions. The x-axis (version number) is drawn in time scale with the corresponding release time. The versions of Linux we analyze are from 1.0 to the current version 2.6.6. The versions of FreeBSD include the main branch from 2.0 to 4.10. (a) Linux 1.0-2.6.6. (b) “Drivers” in Linux. (c) FreeBSD 2.0-4.10. (d) “Sys” in FreeBSD.
[28] U. Manber, “Finding Similar Files in a Large File System,” USENIX Winter 1994 Technical Conf., pp. 1-10, 1994.
[29] K.W. Church and J.I. Helfman, “Dotplot: A Program for Exploring Self-Similarity in Millions of Lines of Text and Code,” Computational and Graphical Statistics.
[30] S. Hangal and M.S. Lam, “Tracking Down Software Bugs Using Automatic Anomaly Detection,” Proc. Int’l Conf. Software Eng., May 2002.
[31] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson, “Eraser: A Dynamic Data Race Detector for Multithreaded Programs,” ACM Trans. Computer Systems, vol. 15, no. 4, pp. 391-411, 1997.
[32] D. Engler, D.Y. Chen, and A. Chou, “Bugs as Inconsistent Behavior: A General Approach to Inferring Errors in Systems Code,” Proc. ACM Symp. Operating Systems Principles, pp. 57-72.
[33] U. Stern and D.L. Dill, “Automatic Verification of the SCI Cache Coherence Protocol,” Proc. Conf. Correct Hardware Design and Verification Methods, pp. 21-34, 1995.
[34] J.-D. Choi, K. Lee, A. Loginov, R. O’Callahan, V. Sarkar, and M. Sridharan, “Efficient and Precise Datarace Detection for Multithreaded Object-Oriented Programs,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 258-269.
[35] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” 11th Int’l Conf. Data Eng.
[36] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets,” Proc. SIAM Int’l Conf. Data Mining, May 2003.
[37] Z. Li, Z. Chen, S.M. Srinivasan, and Y. Zhou, “C-Miner: Mining Block Correlations in Storage Systems,” Proc. Third USENIX Conf. File and Storage Technologies, pp. 173-186, 2004.
[38] A.V. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
[39] R.E. Johnson and W.F. Opdyke, “Refactoring and Aggregation,” Proc. Int’l Symp. Object Technologies for Advanced Software, pp. 264-278, 1993.

Zhenmin Li received the BE and ME degrees in computer science from Tsinghua University, China. He is a PhD student in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research interests include computer systems, software reliability, data mining, storage systems, and energy management.

Shan Lu is a PhD student in the Department of Computer Science, University of Illinois at Urbana-Champaign. Her research interests include architectural and system support for software debugging and system configuration management.

Suvda Myagmar received the BS degree in computer science from Concord College and the MS degree in computer science from University of Illinois at Urbana-Champaign. She is a PhD student in the Department of Computer Science, University of Illinois at Urbana-Champaign. She is working on security issues of software defined radios. Her research interests include computer and network security, wireless communications, and reconfigurable platforms.

Yuanyuan Zhou received the MA and PhD degrees from Princeton University. She is currently an assistant professor in the Department of Computer Science, University of Illinois at Urbana-Champaign (UIUC). Prior to UIUC, she worked at NEC Research Institute as a scientist from 2000 to 2002. Her main research interests include database storage, architecture and OS support for software debugging, power management, and memory management. She is a member of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
