/
Are Automated Debugging Techniques Actually Helping Programmers Chris Parnin and Are Automated Debugging Techniques Actually Helping Programmers Chris Parnin and

Are Automated Debugging Techniques Actually Helping Programmers Chris Parnin and - PDF document

debby-jeon
debby-jeon . @debby-jeon
Follow
423 views
Uploaded On 2014-10-07

Are Automated Debugging Techniques Actually Helping Programmers Chris Parnin and - PPT Presentation

parninorsogatechedu ABSTRACT Debugging is notoriously di64259cult and extremely time con suming Researchers have therefore invested a considerable amount of e64256ort in developing automated techniques and tools for supporting various debugging tasks ID: 3303

parninorsogatechedu ABSTRACT Debugging notoriously

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "Are Automated Debugging Techniques Actua..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Figure2:NanoXMLTask:IdentifythecauseofthefailureinNanoXMLand xtheproblem.Jonesandcolleagues[11],forseveralreasons.First,Taran-tulais,likemoststate-of-the-artdebuggingtechniques,basedonsomeformofstatisticalrankingofpotentiallyfaultystatements.Second,athoroughempiricalevaluationofTaran-tulahasshownthatitcanoutperformothertechniques[12].(Morerecently,moretechniqueshavebeenproposed,buttheirimprovements,whenpresent,areforthemostpartmarginalanddependentonthecontext.)Finally,Tarantulaiseasytoexplainandteachtodevelopers.4.4.2PluginTomakeiteasyfortheparticipantstousetheselectedstatisticaldebuggingtechnique,wecreatedanEclipseplu-ginthatprovidestheuserswiththerankedlinkedofstate-mentsthatwouldbeproducedbyTarantula.Wedecidedtokeeptheplugin'sinterfaceassimpleaspossible:alistofstatements,orderedbysuspiciousness,whereclickingonastatementinthelistopensthecorrespondingsource leinEclipseandnavigatestothatlineofcode.Webelievethatthisapproachhasthetwofoldadvantageof(1)lettingusinvestigateourresearchquestionsdirectly,byhavingtheparticipantsoperateonarankedlistofstatements,and(2)clearlyseparatingthebene tsprovidedbytherankingbasedapproachfromthoseprovidedbytheuseofamoresophis-ticatedinterface,suchasTarantula'svisualization[11].Theplugin,showninFigure3,worksasfollows.First,theuserinputsacon guration leforataskbypressingtheload leicon.Oncethe leisloaded,theplugindisplaysatablewithseveralrows,whereeachrowsshowsastatementandthecorresponding lename,linenumber,andsuspi-ciousnessscore.Besidesclickingonastatementtojumptoit,asdiscussedabove,userscanalsouseapreviousandnextbuttontonavigatethroughthestatements.Tocomputetherankedlistofstatementsfortheplugin,weusedtheTarantulaformulasprovidedinReference[11],whichrequirecoveragedataandpass/failinformationforasetoftestinputs.ForbothTetrisandNanoXML,wecollectedcoveragedatausingEmma(http://www.eclemma.org/).ForNanoXML,weusedthetestcasesandpass/failstatusforsuchtestcasesavailablefromtheSIRrepository.ForTetris,forwhichnotestcaseswereavailable,wewroteacapture-replaysystemthatcouldrecordthekeyspressedwhenplayingTetrisandreplaythemastestcases.Overall,wecollected10gamesessions,2ofwhichexecutedthefaultystatement(i.e.,rotatedasquareblock). Figure4:Participantsaresplitintodi erentgroupshavingdi erentconditions.Eachboxrepresentsatask:thelabelintheboxindicatesthesoftwaresub-jectforthetask;thepresenceofawrenchindicatestheuseoftheautomateddebuggingtoolforthattask;theiconsrepresentinganarrowindicatetasksforwhichtherankofthefaultystatementhasbeenincreased(up)ordecreased(down).4.4.3DataAvailabilityOurEclipseplugin,programsubjects,andinstructionssheetsareavailableforresearcherswishingtoreplicateourstudyathttp://www.cc.gatech.edu/~vector/study/.4.5MethodToevaluateourHypothesis1,andassesswhetherpartici-pantscouldcompletetasksfasterwhenusinganautomateddebuggingtool(tool,hereafter),wecreatedtwoexperimen-talgroups:AandB.ParticipantsingroupAwereinstructedtousethetooltosolvetheTetristask.Conversely,partici-pantsingroupBhadtocompletetheTetristaskusingonlytraditionaldebuggingcapabilitiesavailablewithinEclipse.Ifthetoolweree ective,thereshouldbeasigni cantdif-ferencebetweenthetwogroup'staskcompletiontime.WeinvestigatedourHypothesis2,andassessedwhetherparticipantsbene tedmorefromusingthetoolonhardertasks,bygivingtheexperimentalgroupsasecondtask: xafaultinNanoXML.IngroupA,participantswerelimitedtouseonlytraditionaldebuggingtechniquesavailablewithinEclipse,whereasingroupB,participantscouldalsousethetooltosolvethetask.Inthiscase,wecomparedthedif-ferenceinperformanceforthegroupsusingthetoolfortheTetrisandtheNanoXMLtasks.Ifthetoolweremoree ec-tiveforhardertasks,theperformancegainofparticipantsusingthetoolfortheNanoXMLtaskshouldbebetterthanthatofparticipantsusingthetoolontheTetristask.OurHypothesis3aimstounderstandthee ectsoftherankofthefaultystatementontaskperformance.Toin-vestigatethishypothesis,wecreatedtwonewexperimentalgroups:CandD.BothgroupsweregivenboththeTetrisandtheNanoXMLtasksandwereinstructedtousethetooltocompletethetasks.Thedi erencebetweenthetwogroupswasthat,forgroupD,weloweredtherankofthefaultystatementforTetris(i.e.,moveditdownthelist)andincreasedtherankofthefaultystatementforNanoXML(i.e.,moveditupthelist).Ifrankwereanimportantfac-tor,thereshouldbeadecreaseinperformancefortheTetristaskandanincreaseinperformancefortheNanoXMLtaskforgroupD.AsummaryofthemethodweusedforinvestigatingourhypothesescanbeseeninFigure4. groupAperformedtheTetristask2.5timesfasterthantheNanoXMLtask.SubjectsingroupBperformedtheTetristask1.3timesfasterthantheNanoXMLtask.Thesevaluesaresigni cantlydi erent(p0:02)byatwo-tailedt-test.Accordingtotheseresults,statisticaldebuggingwiththetoolwasnomoree ectivethantraditionaldebuggingforsolvingahardertask.Therefore,wefoundnosupportforHypothesis2.Overall,theresultssuggestthattheremightbeseveralfactorsthatcanexplainwhytheautomateddebuggingtooldidnothelptheNanoXMLtask.Inthediscussionahead,wespeculatewhatthesefactorsmaybe.5.3ChangesinRankHavenoSignicantEf-fectsForHypothesis3,wewantedtoexplorethee ectofrankonthee ectivenessofthetool.Toassessthishypothesis,wemeasuredthee ectofarti ciallydecreasingandincreasingtherankofthefaultystatements.Ifthishypothesisweretrue,wewouldexpectthee ectivenesstodecreasewhendroppingtherankandincreasewhenraisingtherank.AsdiscussedinSection4.5,wetestedthishypothesisbyconductinganexperimentwith10newparticipantssplitintogroupsCandD.ForgroupC,therankoffaultystatementswaskeptintact.ForgroupD,therankforthefaultystate-mentinTetriswasloweredfromposition7toposition35.Similarly,therankforthefaultystatementinNanoXMLwasraisedfromposition83toposition16.(Thenewrankswereselectedinarandomfashion.)Thismodi cationoftheranksshouldhaveimprovedthee ectivenessofthetoolfortheNanoXMLtask,andhurtthee ectivenessofthetoolfortheTetristask,forgroupDincomparisontogroupC.ComparingtheaveragecompletiontimeoftheTetristaskforgroupsCandD,wedidobservethatgroupD(12:36)wasalittleslowerthangroupC(10:12).Surprisingly,fortheNanoXMLtaskgroupDwasnotanyfasterthangroupCdespitethemuchlowerrankofthefaultystatement(16ver-sus83).Infact,groupDactuallyperformedtheNanoXMLtaskslowerthangroupC|15:12forgroupCversus18:30forgroupD.However,thedi erencesinperformancebetweenthegroupswerenotstatisticallysigni cant.Infact,acomparisonofthecompletiontimeratioofTetristoNanoXMLyieldsthesameexactaveragefraction(.79)forbothgroups.Thissuggeststhatbothgroupswereverysimilarinperformance.Lower-ingrankdidnothurttheperformanceofgroupDontheTetristask,nordidraisingtherankforNanoXMLhelpim-provegroupD'sperformance.Therefore,overall,theresultsprovidenosupportforHy-pothesis3.Thissuggeststhattherankofthefaultystate-ment(s)maynotbeasimportantasotherfactorsorstrate-gies.Theparticipantsmaybeusingthetoolto ndotherstatementsthatarenearthefault,butrankedhigherthanthefault.Ortheymaybesearchingthroughthestatementsbasedonsomeintuition,thuscancelingthee ectofchang-ingtherelativepositionofthefaultystatement.Forex-ample,fourparticipantsingroupD,whohadtherankofthefaultyTetrisstatementlowered,wereabletoovercomethishandicapbyvisitinganotherstatementinposition3(previously8)thatwasinthesamefunctionasthebug.Thissuggeststhatprogrammersmayusethetooltoiden-tifystartingpointsfortheirinvestigation,someofwhichmaybenearthefault.Thiswouldlessentheimportanceofcorrectlyrankingtheexactlineofcodewiththefault.5.4ProgrammersSearchStatementsToanswerourResearchQuestion1onpatternsusedbydeveloperswheninspectingstatements,weanalyzedthelogsproducedduringtheusageofthetool.Speci cally,wewantedtoassesswhetherdevelopersinspectedstatementsinorderofranking,onebyone,orfollowedsomeotherstrategy.Weusedthenavigationdatacollectedfromthe24participantsingroupAandB,ofwhich22hadusablenavigationdata.Wealsoexaminedthequestionnaireofall34participantstogaininsightintotheirstrategiesforusingthetool.Basedonthisdata,wehavedeterminedthatprogrammersdonotvisiteachstatementinalinearfashion.Thereareseveralsourcesofsupportforthisobservation.First,foreachvisit,wemeasuredthedeltabetweenthepositionsoftwostatementsvisitedinsequence.Allpartic-ipantsexhibitedsomeformofjumpingbetweenpositions.Speci cally,37%ofthevisitsjumpedmorethanoneposi-tionand,onaverage,eachjumpskipped10positions.Theonlyexceptionwerelowperformers(thosewhodidnotcom-pleteanytask),whosemajority(95%)cycledthroughthestatementsandveryrarelyskippedpositions.Observingthenumberofpositionsskippedduringallthevisits,wehypoth-esizethatsmallerjumpsmaycorrespondtotheskippingofblocksofstatements;conversely,largerjumpsseemtocorre-spondtosomeformofsearchingor lteringofstatements|ahypothesisalsosupportedbytheresponsesinthepartici-pants'questionnaires.Second,thenavigationpatternwasnotlinear.Partici-pantsconsistentlychangeddirections(i.e.,theystartedde-scendingthelist, ippedaround,andstartedascendingthelist).Wemeasuredthenumberof\zigzags"inapartici-pant'snavigationpatternanytimetherewasachangeindirection.Onaverage,eachparticipanthad10.3zigzags,withanoverallrangebetween1and36zigzags.Finally,onourquestionnairegiventoallparticipants,manyparticipantsindicatedthatsometimestheywouldscantherankedlistto ndastatementthatmightcon rmtheirhypothesisaboutthecauseofthefailure,whereasothertimestheyskippedstatementsthatdidnotappearrelevant.5.5NoPerfectBugUnderstandingToinvestigateourResearchQuestion2ontheassumptionofperfectbugunderstanding,wemeasuredthetoolusagepatterns.Welookedatthe rsttimeaparticipantclickedonthefaultystatementinthetool,andthenexaminedtheparticipant'ssubsequentactivity.Ifthefaultynatureofastatementwereapparenttothedevelopersbyjustlookingatit,toolusageshouldstopassoonastheygettothatstatementinthelist.Weusedthelogdatafromthe24participantsingroupsAandBandexcludeddataforparticipantsthatneverclickedonthefaultystatement,whichleftuswithdatafor10par-ticipants.Only1participantoutof10stoppedusingthetoolafterclickingonthefault.Theremainingparticipants,onaverage,spentanothertenminutesusingthetoolafterthey rstexaminedthefaultystatement.Thatis,partici-pantsspent(orwasted)onaverage61%oftheirtimecon-tinuingtoinspectstatementswiththetoolaftertheyhadalreadyencounteredthefault.Thissuggeststhatsimplygivingthestatementwasnotenoughfortheparticipantstounderstandtheproblemandthatmorecontextwasneeded,whichmadeusconcludethatperfectbugunder-standingisgenerallynotarealisticassumption. modeltowhichthedevelopercanrelate.Whenusingthesetools,insteadofworkingwiththefamiliarandreliablestep-by-stepapproachofatraditionaldebugger,developersarecurrentlypresentedwithasetofapparentlydisconnectedstatementsandnoadditionalsupport.Observation2-Providingoverviewsthatclusterresultsandexplanationsthatincludedatavalues,testcaseinforma-tion,andinformationaboutslicescouldmakefaultseasiertoidentifyandtoolsultimatelymoree ective.6.2ResearchImplications6.2.1PercentagesWillNotCutItAstandardevaluationmetricforautomateddebuggingtechniquesistonormalizetherankoffaultystatementswithrespecttothesizeoftheprogram.Forexample,assigningthefaultystatementinNanoXML(4,408LOC)witharankof83,whenexpressedasapercentage,suggeststhatonly1.8%ofstatementswouldneedtobeinspected.Althoughthisresult,at rstglance,mayappearquitepositive,inprac-ticeweobservedthatdeveloperswerenotabletotranslatethisintoasuccessfuldebuggingactivity.Basedonourdata,werecommendthattechniquesfocusonimprovingabsoluterankratherthanpercentagerank,fortworeasons.First,thecollecteddatasuggeststhatpro-grammerswillstopinspectingstatements,andtransitiontotraditionaldebugging,iftheydonotgetpromisingresultswithinthe rstfewstatementstheyinspect.Forexample,evenwhenwechangedtherankofthefaultystatementinNanoXMLfrom83to16,therewasnoobservedbene t.Thisisconsistentwithotherresearchinsearchtasks,whereitisclearlyshownthatmostusersdonotinspectresultsbeyondthe rstpageandreformulatetheirsearchqueryin-stead[7].Second,theuseofpercentagesunderscoreshowdiculttheproblembecomeswhenmovingtolargerpro-grams.Percentageswouldsuggestthatwewouldnothavetochangeourtechniques,nomatterwhetherwearedealingwitha400LOCora4millionLOCprogram.Fromdirectexperiencewithscalingprogramanalysesfromtoyprogramstoindustrial-sizedprograms,weknowthatthisistypicallyunlikelytobetrue.Bettermeasurescanmakesureweareusingtheappro-priatemetricforevaluatingwhat,andtowhatextent,willhelpdevelopersinpractice.Otherwise,whatmayappearasasuccessfulnewdebuggingtechniqueinthelaboratory,couldinrealitybenomoree ectivethantraditionaldebug-gingapproaches.Implication1-Techniquesshouldfocusonimprovingab-soluterankratherthanpercentagerank.6.2.2FocusMoreOnSearchIfcurrentresearchisunabletoachievegoodvaluesforabsolutestatementranks,analternativedirectionmaybetoenrichthedebuggingtechniquesbyleveragingsomeofthesuccessfulstrategiesdeveloperswereobservedtouse.Inparticular,itmaybepromisingtofocusresearche ortsonhowsearchofstatementscanbeimproved.Weobservedthatacommoncauseoffrustrationduringdebuggingistheinabilitytodistinguishirrelevantstate-mentsfromrelevantones.Accordingtoreportsfromtheparticipants,somedeveloperssuccessfullyovercamethisprob-lemby lteringtheresultsbasedonkeywordsinthestate-ments.Wefoundthistobekey,astheremaybesomefunda-mentalinformationinthedeveloper'smindthat,whencom-binedwiththeautomateddebuggingalgorithm,mayyieldexcellentresults.Forexample,intheNanoXMLtask,developersnotedthatusingtermssuchas\index"or\colon"to lterthroughtheresultscouldhelpthem ndaresultthatmatchedtheirsus-picion.Infact,hadthedeveloperssearchedforanyoftheterms\index",\colon",\pre x",or\substring",theycouldhave lteredthestatementssothatthefaultyonewaswithinthe rst verankedresults.Unfortunately,performingthissearchmanuallyamongmanyresultscanbedicultinprac-tice,whereasthecombinationofrankingandsearchcouldbeapromisingdirection.Besidescombiningsearchandranking,futureresearchcouldalsoinvestigatewaystoautomaticallysuggestorhigh-lighttermsthatmayberelatedtoafailure.Thiswouldhelpincaseswhereadeveloperdoesnotknowtherighttermstosearchforandcouldbedone,forinstance,basedonthetypeoftheexceptionraisedorothercontextualclues.Itmayevenbethat,givengoodsearchtools,developerscoulddiscoverthattherankofafaultystatementdoesnotmatterasmuchasthesearchrank.Implication2-Debuggingtoolsmaybemoresuccessfuliftheyfocusedonsearchingthroughorautomaticallyhigh-lightingcertainsuspiciousstatements.6.2.3GrowtheEcosystemThewayperformance(withrespecttotime)iscomputedinmanystudiesmakesassumptionsthatdonotholdinprac-tice.Liketestsuiteprioritization,withautomatedfaultlo-calizationthetotaltimesavedbycon guringandusingthetoolshouldbelessthanthetimespentusingtraditionalde-buggingalone.Inpractice,atleastinsomescenarios,thetimetocollectcoverageinformation,manuallylabelthetestcasesasfailingorpassing,andrunthecalculationsmayex-ceedtheactualtimesavedusingthetool.Ingeneral,foratooltobee ective,itshouldseamlesslyintegratethedi erentpartsofthedebuggingtechniquecon-sideredandprovideend-to-endsupportforit.Althoughsomeoftheseissuescanbeaddressedwithcarefulengineer-ing,itmaybeusefultofocusresearche ortsonwaystostreamlineandintegrateactivitiessuchascoveragecollec-tion,test-caseclassi cationandrerunning,codeinspection,andsoon.Implication3-Researchshouldfocusonprovidinganecosystemthatsupportstheentiretoolchainforfaultlo-calization,includingmanagingandorchestratingtestcases.6.3ThreatstoValidityWechooseatimelimitfortasksthatmadeitpossibletoconductourexperimentwithinatwo-hourtimeframe,soastoavoidexhaustingparticipants.However,thistimelimitmighthaveexcludedlessexperiencedparticipantswhomayneedmoretimetocompletethetasks.Ourstudyhasfo-cusedonmoreexperienceddevelopers,manyofwhichcouldcompletethetaskswithinthetimelimit,andmaynotgen-eralizetonoviceusers.