
Are Automated Debugging Techniques Actually Helping Programmers?

Chris Parnin and

{parnin, orso}@gatech.edu

ABSTRACT

Debugging is notoriously difficult and extremely time consuming. Researchers have therefore invested a considerable amount of effort in developing automated techniques and tools for supporting various debugging tasks.


Figure 2: NanoXML Task: Identify the cause of the failure in NanoXML and fix the problem.

[...] Jones and colleagues [11], for several reasons. First, Tarantula is, like most state-of-the-art debugging techniques, based on some form of statistical ranking of potentially faulty statements. Second, a thorough empirical evaluation of Tarantula has shown that it can outperform other techniques [12]. (More recently, more techniques have been proposed, but their improvements, when present, are for the most part marginal and dependent on the context.) Finally, Tarantula is easy to explain and teach to developers.

4.4.2 Plugin

To make it easy for the participants to use the selected statistical debugging technique, we created an Eclipse plugin that provides the users with the ranked list of statements that would be produced by Tarantula. We decided to keep the plugin's interface as simple as possible: a list of statements, ordered by suspiciousness, where clicking on a statement in the list opens the corresponding source file in Eclipse and navigates to that line of code. We believe that this approach has the twofold advantage of (1) letting us investigate our research questions directly, by having the participants operate on a ranked list of statements, and (2) clearly separating the benefits provided by the ranking-based approach from those provided by the use of a more sophisticated interface, such as Tarantula's visualization [11].

The plugin, shown in Figure 3, works as follows. First, the user inputs a configuration file for a task by pressing the load file icon. Once the file is loaded, the plugin displays a table with several rows, where each row shows a statement and the corresponding file name, line number, and suspiciousness score. Besides clicking on a statement to jump to it, as discussed above, users can also use the previous and next buttons to navigate through the statements.

To compute the ranked list of statements for the plugin, we used the Tarantula formulas provided in Reference [11], which require coverage data and pass/fail information for a set of test inputs. For both Tetris and NanoXML, we collected coverage data using Emma (http://www.eclemma.org/). For NanoXML, we used the test cases, and the pass/fail status of those test cases, available from the SIR repository. For Tetris, for which no test cases were available, we wrote a capture-replay system that could record the keys pressed when playing Tetris and replay them as test cases. Overall, we collected 10 game sessions, 2 of which executed the faulty statement (i.e., rotated a square block).

Figure 4: Participants are split into different groups having different conditions. Each box represents a task: the label in the box indicates the software subject for the task; the presence of a wrench indicates the use of the automated debugging tool for that task; the icons representing an arrow indicate tasks for which the rank of the faulty statement has been increased (up) or decreased (down).

4.4.3 Data Availability

Our Eclipse plugin, program subjects, and instruction sheets are available for researchers wishing to replicate our study at http://www.cc.gatech.edu/~vector/study/.

4.5 Method

To evaluate our Hypothesis 1, and assess whether participants could complete tasks faster when using an automated debugging tool (tool, hereafter), we created two experimental groups: A and B. Participants in group A were instructed to use the tool to solve the Tetris task. Conversely, participants in group B had to complete the Tetris task using only the traditional debugging capabilities available within Eclipse. If the tool were effective, there should be a significant difference between the two groups' task completion times.

We investigated our Hypothesis 2, and assessed whether participants benefited more from using the tool on harder tasks, by giving the experimental groups a second task: fix a fault in NanoXML. In group A, participants were limited to using only the traditional debugging techniques available within Eclipse, whereas in group B, participants could also use the tool to solve the task. In this case, we compared the difference in performance for the groups using the tool for the Tetris and the NanoXML tasks. If the tool were more effective for harder tasks, the performance gain of participants using the tool for the NanoXML task should be better than that of participants using the tool on the Tetris task.

Our Hypothesis 3 aims to understand the effects of the rank of the faulty statement on task performance. To investigate this hypothesis, we created two new experimental groups: C and D. Both groups were given both the Tetris and the NanoXML tasks and were instructed to use the tool to complete the tasks. The difference between the two groups was that, for group D, we lowered the rank of the faulty statement for Tetris (i.e., moved it down the list) and increased the rank of the faulty statement for NanoXML (i.e., moved it up the list). If rank were an important factor, there should be a decrease in performance for the Tetris task and an increase in performance for the NanoXML task for group D.

A summary of the method we used for investigating our hypotheses can be seen in Figure 4.
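The ranked list displayed by the plugin (Section 4.4.2) is derived from coverage data and pass/fail outcomes. As a minimal sketch of how such a list could be computed, the code below uses the commonly cited Tarantula suspiciousness formula: the fraction of failing tests that execute a statement, divided by the sum of that fraction and the corresponding fraction of passing tests. It is not the plugin's actual code. The names `CoverageRun`, `tarantula_scores`, and `ranked_statements`, the string statement identifiers, and the tie-breaking rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageRun:
    """One test execution (hypothetical type): covered statements and outcome."""
    covered: frozenset  # identifiers of the statements executed by this test
    passed: bool


def tarantula_scores(runs):
    """Map each covered statement to a suspiciousness score in [0, 1]."""
    total_passed = sum(1 for r in runs if r.passed)
    total_failed = len(runs) - total_passed
    statements = set().union(*(r.covered for r in runs))
    scores = {}
    for s in statements:
        passed_s = sum(1 for r in runs if r.passed and s in r.covered)
        failed_s = sum(1 for r in runs if not r.passed and s in r.covered)
        # Guard against test suites with no passing or no failing runs.
        pass_ratio = passed_s / total_passed if total_passed else 0.0
        fail_ratio = failed_s / total_failed if total_failed else 0.0
        denom = pass_ratio + fail_ratio
        scores[s] = fail_ratio / denom if denom else 0.0
    return scores


def ranked_statements(runs):
    """Statements ordered from most to least suspicious.

    Ties are broken by statement identifier; this is an assumption of the
    sketch, not something the paper specifies.
    """
    scores = tarantula_scores(runs)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: two passing runs and one failing run over statements a, b, c.
example_runs = [
    CoverageRun(frozenset({"a", "b"}), passed=True),
    CoverageRun(frozenset({"a"}), passed=True),
    CoverageRun(frozenset({"a", "c"}), passed=False),
]
example_ranking = ranked_statements(example_runs)
```

In the toy data, statement `c` is executed only by the failing run and therefore ranks first, while `b`, executed only by a passing run, ranks last.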
[...] group A performed the Tetris task 2.5 times faster than the NanoXML task. Subjects in group B performed the Tetris task 1.3 times faster than the NanoXML task. These values are significantly different (p < 0.02) by a two-tailed t-test. According to these results, statistical debugging with the tool was no more effective than traditional debugging for solving a harder task. Therefore, we found no support for Hypothesis 2.

Overall, the results suggest that there might be several factors that can explain why the automated debugging tool did not help with the NanoXML task. In the discussion ahead, we speculate on what these factors may be.

5.3 Changes in Rank Have no Significant Effects

For Hypothesis 3, we wanted to explore the effect of rank on the effectiveness of the tool. To assess this hypothesis, we measured the effect of artificially decreasing and increasing the rank of the faulty statements. If this hypothesis were true, we would expect the effectiveness to decrease when dropping the rank and increase when raising the rank.

As discussed in Section 4.5, we tested this hypothesis by conducting an experiment with 10 new participants split into groups C and D. For group C, the rank of the faulty statements was kept intact. For group D, the rank of the faulty statement in Tetris was lowered from position 7 to position 35. Similarly, the rank of the faulty statement in NanoXML was raised from position 83 to position 16. (The new ranks were selected in a random fashion.) This modification of the ranks should have improved the effectiveness of the tool for the NanoXML task, and hurt the effectiveness of the tool for the Tetris task, for group D in comparison to group C.

Comparing the average completion time of the Tetris task for groups C and D, we did observe that group D (12:36) was a little slower than group C (10:12). Surprisingly, for the NanoXML task group D was not any faster than group C despite the much better rank of the faulty statement (position 16 versus 83). In fact, group D actually performed the NanoXML task slower than group C: 15:12 for group C versus 18:30 for group D.

However, the differences in performance between the groups were not statistically significant. In fact, a comparison of the completion time ratio of Tetris to NanoXML yields the same exact average fraction (.79) for both groups. This suggests that both groups were very similar in performance. Lowering the rank did not hurt the performance of group D on the Tetris task, nor did raising the rank for NanoXML help improve group D's performance.

Therefore, overall, the results provide no support for Hypothesis 3. This suggests that the rank of the faulty statement(s) may not be as important as other factors or strategies. The participants may be using the tool to find other statements that are near the fault, but ranked higher than the fault. Or they may be searching through the statements based on some intuition, thus canceling the effect of changing the relative position of the faulty statement. For example, four participants in group D, who had the rank of the faulty Tetris statement lowered, were able to overcome this handicap by visiting another statement in position 3 (previously 8) that was in the same function as the bug. This suggests that programmers may use the tool to identify starting points for their investigation, some of which may be near the fault. This would lessen the importance of correctly ranking the exact line of code with the fault.

5.4 Programmers Search Statements

To answer our Research Question 1 on the patterns used by developers when inspecting statements, we analyzed the logs produced during the usage of the tool. Specifically, we wanted to assess whether developers inspected statements in order of ranking, one by one, or followed some other strategy. We used the navigation data collected from the 24 participants in groups A and B, of which 22 had usable navigation data. We also examined the questionnaires of all 34 participants to gain insight into their strategies for using the tool.

Based on this data, we have determined that programmers do not visit each statement in a linear fashion. There are several sources of support for this observation.

First, for each visit, we measured the delta between the positions of two statements visited in sequence. All participants exhibited some form of jumping between positions. Specifically, 37% of the visits jumped more than one position and, on average, each jump skipped 10 positions. The only exceptions were low performers (those who did not complete any task), whose majority (95%) cycled through the statements and very rarely skipped positions. Observing the number of positions skipped during all the visits, we hypothesize that smaller jumps may correspond to the skipping of blocks of statements; conversely, larger jumps seem to correspond to some form of searching or filtering of statements, a hypothesis also supported by the responses in the participants' questionnaires.

Second, the navigation pattern was not linear. Participants consistently changed directions (i.e., they started descending the list, flipped around, and started ascending the list). We counted a "zigzag" in a participant's navigation pattern any time there was a change in direction. On average, each participant had 10.3 zigzags, with an overall range between 1 and 36 zigzags.

Finally, on the questionnaire given to all participants, many participants indicated that sometimes they would scan the ranked list to find a statement that might confirm their hypothesis about the cause of the failure, whereas other times they skipped statements that did not appear relevant.

5.5 No Perfect Bug Understanding

To investigate our Research Question 2 on the assumption of perfect bug understanding, we measured the tool usage patterns. We looked at the first time a participant clicked on the faulty statement in the tool, and then examined the participant's subsequent activity. If the faulty nature of a statement were apparent to the developers by just looking at it, tool usage should stop as soon as they get to that statement in the list.

We used the log data from the 24 participants in groups A and B and excluded data for participants who never clicked on the faulty statement, which left us with data for 10 participants. Only 1 participant out of 10 stopped using the tool after clicking on the fault. The remaining participants, on average, spent another ten minutes using the tool after they first examined the faulty statement. That is, participants spent (or wasted) on average 61% of their time continuing to inspect statements with the tool after they had already encountered the fault. This suggests that simply giving the statement was not enough for the participants to understand the problem and that more context was needed, which made us conclude that perfect bug understanding is generally not a realistic assumption.

[...] model to which the developer can relate. When using these tools, instead of working with the familiar and reliable step-by-step approach of a traditional debugger, developers are currently presented with a set of apparently disconnected statements and no additional support.

Observation 2 - Providing overviews that cluster results and explanations that include data values, test case information, and information about slices could make faults easier to identify and tools ultimately more effective.

6.2 Research Implications

6.2.1 Percentages Will Not Cut It

A standard evaluation metric for automated debugging techniques is to normalize the rank of faulty statements with respect to the size of the program. For example, assigning the faulty statement in NanoXML (4,408 LOC) a rank of 83, when expressed as a percentage, suggests that only 1.8% of statements would need to be inspected. Although this result, at first glance, may appear quite positive, in practice we observed that developers were not able to translate it into a successful debugging activity.

Based on our data, we recommend that techniques focus on improving absolute rank rather than percentage rank, for two reasons. First, the collected data suggests that programmers will stop inspecting statements, and transition to traditional debugging, if they do not get promising results within the first few statements they inspect. For example, even when we changed the rank of the faulty statement in NanoXML from 83 to 16, there was no observed benefit. This is consistent with other research in search tasks, where it is clearly shown that most users do not inspect results beyond the first page and reformulate their search query instead [7]. Second, the use of percentages understates how difficult the problem becomes when moving to larger programs. Percentages would suggest that we would not have to change our techniques, no matter whether we are dealing with a 400 LOC or a 4 million LOC program. From direct experience with scaling program analyses from toy programs to industrial-sized programs, we know that this is unlikely to be true.

Better measures can make sure we are using the appropriate metric for evaluating what, and to what extent, will help developers in practice. Otherwise, what may appear as a successful new debugging technique in the laboratory could in reality be no more effective than traditional debugging approaches.

Implication 1 - Techniques should focus on improving absolute rank rather than percentage rank.

6.2.2 Focus More On Search

If current research is unable to achieve good values for absolute statement ranks, an alternative direction may be to enrich the debugging techniques by leveraging some of the successful strategies developers were observed to use. In particular, it may be promising to focus research efforts on how the search of statements can be improved.

We observed that a common cause of frustration during debugging is the inability to distinguish irrelevant statements from relevant ones. According to reports from the participants, some developers successfully overcame this problem by filtering the results based on keywords in the statements. We found this to be key, as there may be some fundamental information in the developer's mind that, when combined with the automated debugging algorithm, may yield excellent results.

For example, in the NanoXML task, developers noted that using terms such as "index" or "colon" to filter through the results could help them find a result that matched their suspicion. In fact, had the developers searched for any of the terms "index", "colon", "prefix", or "substring", they could have filtered the statements so that the faulty one was within the first five ranked results. Unfortunately, performing this search manually among many results can be difficult in practice, whereas the combination of ranking and search could be a promising direction.

Besides combining search and ranking, future research could also investigate ways to automatically suggest or highlight terms that may be related to a failure. This would help in cases where a developer does not know the right terms to search for and could be done, for instance, based on the type of the exception raised or other contextual clues. It may even be that, given good search tools, developers could discover that the rank of a faulty statement does not matter as much as the search rank.

Implication 2 - Debugging tools may be more successful if they focus on searching through or automatically highlighting certain suspicious statements.

6.2.3 Grow the Ecosystem

The way performance (with respect to time) is computed in many studies makes assumptions that do not hold in practice. As with test suite prioritization, with automated fault localization the total time spent configuring and using the tool should be less than the time spent using traditional debugging alone. In practice, at least in some scenarios, the time to collect coverage information, manually label the test cases as failing or passing, and run the calculations may exceed the actual time saved by using the tool.

In general, for a tool to be effective, it should seamlessly integrate the different parts of the debugging technique considered and provide end-to-end support for it. Although some of these issues can be addressed with careful engineering, it may be useful to focus research efforts on ways to streamline and integrate activities such as coverage collection, test-case classification and rerunning, code inspection, and so on.

Implication 3 - Research should focus on providing an ecosystem that supports the entire tool chain for fault localization, including managing and orchestrating test cases.

6.3 Threats to Validity

We chose a time limit for tasks that made it possible to conduct our experiment within a two-hour time frame, so as to avoid exhausting participants. However, this time limit might have excluded less experienced participants who may need more time to complete the tasks. Our study has focused on more experienced developers, many of whom could complete the tasks within the time limit, and may not generalize to novice users.