Are Automated Debugging Techniques Actually Helping Programmers Chris Parnin and - PDF document


ABSTRACT

Debugging is notoriously difficult and extremely time consuming. Researchers have therefore invested a considerable amount of effort in developing automated techniques and tools for supporting various debugging tasks.



Figure 2: NanoXML Task: Identify the cause of the failure in NanoXML and fix the problem.

Jones and colleagues [11], for several reasons. First, Tarantula is, like most state-of-the-art debugging techniques, based on some form of statistical ranking of potentially faulty statements. Second, a thorough empirical evaluation of Tarantula has shown that it can outperform other techniques [12]. (More recently, more techniques have been proposed, but their improvements, when present, are for the most part marginal and dependent on the context.) Finally, Tarantula is easy to explain and teach to developers.

4.4.2 Plugin

To make it easy for the participants to use the selected statistical debugging technique, we created an Eclipse plugin that provides the users with the ranked list of statements that would be produced by Tarantula. We decided to keep the plugin's interface as simple as possible: a list of statements, ordered by suspiciousness, where clicking on a statement in the list opens the corresponding source file in Eclipse and navigates to that line of code. We believe that this approach has the twofold advantage of (1) letting us investigate our research questions directly, by having the participants operate on a ranked list of statements, and (2) clearly separating the benefits provided by the ranking-based approach from those provided by the use of a more sophisticated interface, such as Tarantula's visualization [11].

The plugin, shown in Figure 3, works as follows. First, the user inputs a configuration file for a task by pressing the load file icon. Once the file is loaded, the plugin displays a table with several rows, where each row shows a statement and the corresponding file name, line number, and suspiciousness score. Besides clicking on a statement to jump to it, as discussed above, users can also use a previous and next button to navigate through the statements.

To compute the ranked list of statements for the plugin, we used the Tarantula formulas provided in Reference [11], which require coverage data and pass/fail information for a set of test inputs. For both Tetris and NanoXML, we collected coverage data using Emma (http://www.eclemma.org/). For NanoXML, we used the test cases and pass/fail status for such test cases available from the SIR repository. For Tetris, for which no test cases were available, we wrote a capture-replay system that could record the keys pressed when playing Tetris and replay them as test cases. Overall, we collected 10 game sessions, 2 of which executed the faulty statement (i.e., rotated a square block).

Figure 4: Participants are split into different groups having different conditions. Each box represents a task: the label in the box indicates the software subject for the task; the presence of a wrench indicates the use of the automated debugging tool for that task; the icons representing an arrow indicate tasks for which the rank of the faulty statement has been increased (up) or decreased (down).

4.4.3 Data Availability

Our Eclipse plugin, program subjects, and instructions sheets are available for researchers wishing to replicate our study at http://www.cc.gatech.edu/~vector/study/.

4.5 Method

To evaluate our Hypothesis 1, and assess whether participants could complete tasks faster when using an automated debugging tool (tool, hereafter), we created two experimental groups: A and B. Participants in group A were instructed to use the tool to solve the Tetris task. Conversely, participants in group B had to complete the Tetris task using only traditional debugging capabilities available within Eclipse. If the tool were effective, there should be a significant difference between the two groups' task completion times.

We investigated our Hypothesis 2, and assessed whether participants benefited more from using the tool on harder tasks, by giving the experimental groups a second task: fix a fault in NanoXML. In group A, participants were limited to using only traditional debugging techniques available within Eclipse, whereas in group B, participants could also use the tool to solve the task. In this case, we compared the difference in performance for the groups using the tool for the Tetris and the NanoXML tasks. If the tool were more effective for harder tasks, the performance gain of participants using the tool for the NanoXML task should be better than that of participants using the tool on the Tetris task.

Our Hypothesis 3 aims to understand the effects of the rank of the faulty statement on task performance. To investigate this hypothesis, we created two new experimental groups: C and D. Both groups were given both the Tetris and the NanoXML tasks and were instructed to use the tool to complete the tasks. The difference between the two groups was that, for group D, we lowered the rank of the faulty statement for Tetris (i.e., moved it down the list) and increased the rank of the faulty statement for NanoXML (i.e., moved it up the list). If rank were an important factor, there should be a decrease in performance for the Tetris task and an increase in performance for the NanoXML task for group D.

A summary of the method we used for investigating our hypotheses can be seen in Figure 4.
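
As an illustration of how a ranked list like the one described in Section 4.4.2 can be derived from coverage data and pass/fail information, the following is a minimal sketch of the commonly cited Tarantula suspiciousness score: the fraction of failing tests that execute a statement, divided by the sum of that fraction and the fraction of passing tests that execute it. The function name, the data layout, and the tie-breaking rule are assumptions made for this example; this is not the code of the study's plugin.

```python
from typing import Dict, List, Set, Tuple


def tarantula_scores(
    coverage: Dict[str, Set[str]],
    outcomes: Dict[str, bool],
) -> List[Tuple[str, float]]:
    """Rank statements by Tarantula suspiciousness, most suspicious first.

    coverage maps a test name to the set of statement ids it executed.
    outcomes maps a test name to True (passed) or False (failed).
    Both layouts are hypothetical and chosen only for this sketch.
    """
    total_passed = sum(1 for ok in outcomes.values() if ok)
    total_failed = len(outcomes) - total_passed
    statements: Set[str] = set()
    for covered in coverage.values():
        statements |= covered

    scores: Dict[str, float] = {}
    for stmt in statements:
        passed = sum(1 for t, cov in coverage.items() if stmt in cov and outcomes[t])
        failed = sum(1 for t, cov in coverage.items() if stmt in cov and not outcomes[t])
        pass_ratio = passed / total_passed if total_passed else 0.0
        fail_ratio = failed / total_failed if total_failed else 0.0
        denom = pass_ratio + fail_ratio
        # Statements executed by no test at all get a score of 0.
        scores[stmt] = fail_ratio / denom if denom else 0.0

    # Highest score first; ties broken by statement id (an assumption).
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: three tests, one of which fails.
toy_coverage = {"t1": {"s1", "s2"}, "t2": {"s1", "s3"}, "t3": {"s1", "s2", "s3"}}
toy_outcomes = {"t1": True, "t2": False, "t3": True}
toy_ranking = tarantula_scores(toy_coverage, toy_outcomes)
```

In the toy data, statement s3 is executed by the only failing test and by one of the two passing tests, so it ranks above s1, which every test executes.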
group A performed the Tetris task 2.5 times faster than the NanoXML task. Subjects in group B performed the Tetris task 1.3 times faster than the NanoXML task. These values are significantly different (p < 0.02) by a two-tailed t-test.

According to these results, statistical debugging with the tool was no more effective than traditional debugging for solving a harder task. Therefore, we found no support for Hypothesis 2. Overall, the results suggest that there might be several factors that can explain why the automated debugging tool did not help the NanoXML task. In the discussion ahead, we speculate what these factors may be.

5.3 Changes in Rank Have no Significant Effects

For Hypothesis 3, we wanted to explore the effect of rank on the effectiveness of the tool. To assess this hypothesis, we measured the effect of artificially decreasing and increasing the rank of the faulty statements. If this hypothesis were true, we would expect the effectiveness to decrease when dropping the rank and increase when raising the rank.

As discussed in Section 4.5, we tested this hypothesis by conducting an experiment with 10 new participants split into groups C and D. For group C, the rank of faulty statements was kept intact. For group D, the rank for the faulty statement in Tetris was lowered from position 7 to position 35. Similarly, the rank for the faulty statement in NanoXML was raised from position 83 to position 16. (The new ranks were selected in a random fashion.) This modification of the ranks should have improved the effectiveness of the tool for the NanoXML task, and hurt the effectiveness of the tool for the Tetris task, for group D in comparison to group C.

Comparing the average completion time of the Tetris task for groups C and D, we did observe that group D (12:36) was a little slower than group C (10:12). Surprisingly, for the NanoXML task group D was not any faster than group C despite the much lower rank of the faulty statement (16 versus 83). In fact, group D actually performed the NanoXML task slower than group C (15:12 for group C versus 18:30 for group D).

However, the differences in performance between the groups were not statistically significant. In fact, a comparison of the completion time ratio of Tetris to NanoXML yields the same exact average fraction (.79) for both groups. This suggests that both groups were very similar in performance. Lowering rank did not hurt the performance of group D on the Tetris task, nor did raising the rank for NanoXML help improve group D's performance.

Therefore, overall, the results provide no support for Hypothesis 3. This suggests that the rank of the faulty statement(s) may not be as important as other factors or strategies. The participants may be using the tool to find other statements that are near the fault, but ranked higher than the fault. Or they may be searching through the statements based on some intuition, thus canceling the effect of changing the relative position of the faulty statement. For example, four participants in group D, who had the rank of the faulty Tetris statement lowered, were able to overcome this handicap by visiting another statement in position 3 (previously 8) that was in the same function as the bug. This suggests that programmers may use the tool to identify starting points for their investigation, some of which may be near the fault. This would lessen the importance of correctly ranking the exact line of code with the fault.

5.4 Programmers Search Statements

To answer our Research Question 1 on patterns used by developers when inspecting statements, we analyzed the logs produced during the usage of the tool. Specifically, we wanted to assess whether developers inspected statements in order of ranking, one by one, or followed some other strategy. We used the navigation data collected from the 24 participants in groups A and B, of which 22 had usable navigation data. We also examined the questionnaires of all 34 participants to gain insight into their strategies for using the tool. Based on this data, we have determined that programmers do not visit each statement in a linear fashion. There are several sources of support for this observation.

First, for each visit, we measured the delta between the positions of two statements visited in sequence. All participants exhibited some form of jumping between positions. Specifically, 37% of the visits jumped more than one position and, on average, each jump skipped 10 positions. The only exception was low performers (those who did not complete any task), whose majority (95%) cycled through the statements and very rarely skipped positions. Observing the number of positions skipped during all the visits, we hypothesize that smaller jumps may correspond to the skipping of blocks of statements; conversely, larger jumps seem to correspond to some form of searching or filtering of statements, a hypothesis also supported by the responses in the participants' questionnaires.

Second, the navigation pattern was not linear. Participants consistently changed directions (i.e., they started descending the list, flipped around, and started ascending the list). We measured the number of "zigzags" in a participant's navigation pattern anytime there was a change in direction. On average, each participant had 10.3 zigzags, with an overall range between 1 and 36 zigzags.

Finally, on our questionnaire given to all participants, many participants indicated that sometimes they would scan the ranked list to find a statement that might confirm their hypothesis about the cause of the failure, whereas other times they skipped statements that did not appear relevant.

5.5 No Perfect Bug Understanding

To investigate our Research Question 2 on the assumption of perfect bug understanding, we measured the tool usage patterns. We looked at the first time a participant clicked on the faulty statement in the tool, and then examined the participant's subsequent activity. If the faulty nature of a statement were apparent to the developers by just looking at it, tool usage should stop as soon as they get to that statement in the list.

We used the log data from the 24 participants in groups A and B and excluded data for participants that never clicked on the faulty statement, which left us with data for 10 participants. Only 1 participant out of 10 stopped using the tool after clicking on the fault. The remaining participants, on average, spent another ten minutes using the tool after they first examined the faulty statement. That is, participants spent (or wasted) on average 61% of their time continuing to inspect statements with the tool after they had already encountered the fault. This suggests that simply giving the statement was not enough for the participants to understand the problem and that more context was needed, which made us conclude that perfect bug understanding is generally not a realistic assumption.

model to which the developer can relate. When using these tools, instead of working with the familiar and reliable step-by-step approach of a traditional debugger, developers are currently presented with a set of apparently disconnected statements and no additional support.

Observation 2 - Providing overviews that cluster results and explanations that include data values, test case information, and information about slices could make faults easier to identify and tools ultimately more effective.

6.2 Research Implications

6.2.1 Percentages Will Not Cut It

A standard evaluation metric for automated debugging techniques is to normalize the rank of faulty statements with respect to the size of the program. For example, assigning the faulty statement in NanoXML (4,408 LOC) a rank of 83, when expressed as a percentage, suggests that only 1.8% of statements would need to be inspected. Although this result, at first glance, may appear quite positive, in practice we observed that developers were not able to translate this into a successful debugging activity.

Based on our data, we recommend that techniques focus on improving absolute rank rather than percentage rank, for two reasons. First, the collected data suggests that programmers will stop inspecting statements, and transition to traditional debugging, if they do not get promising results within the first few statements they inspect. For example, even when we changed the rank of the faulty statement in NanoXML from 83 to 16, there was no observed benefit. This is consistent with other research in search tasks, where it is clearly shown that most users do not inspect results beyond the first page and reformulate their search query instead [7]. Second, the use of percentages underscores how difficult the problem becomes when moving to larger programs. Percentages would suggest that we would not have to change our techniques, no matter whether we are dealing with a 400 LOC or a 4 million LOC program. From direct experience with scaling program analyses from toy programs to industrial-sized programs, we know that this is typically unlikely to be true.

Better measures can make sure we are using the appropriate metric for evaluating what, and to what extent, will help developers in practice. Otherwise, what may appear as a successful new debugging technique in the laboratory could in reality be no more effective than traditional debugging approaches.

Implication 1 - Techniques should focus on improving absolute rank rather than percentage rank.

6.2.2 Focus More On Search

If current research is unable to achieve good values for absolute statement ranks, an alternative direction may be to enrich the debugging techniques by leveraging some of the successful strategies developers were observed to use. In particular, it may be promising to focus research efforts on how search of statements can be improved.

We observed that a common cause of frustration during debugging is the inability to distinguish irrelevant statements from relevant ones. According to reports from the participants, some developers successfully overcame this problem by filtering the results based on keywords in the statements. We found this to be key, as there may be some fundamental information in the developer's mind that, when combined with the automated debugging algorithm, may yield excellent results.

For example, in the NanoXML task, developers noted that using terms such as "index" or "colon" to filter through the results could help them find a result that matched their suspicion. In fact, had the developers searched for any of the terms "index", "colon", "prefix", or "substring", they could have filtered the statements so that the faulty one was within the first five ranked results. Unfortunately, performing this search manually among many results can be difficult in practice, whereas the combination of ranking and search could be a promising direction.

Besides combining search and ranking, future research could also investigate ways to automatically suggest or highlight terms that may be related to a failure. This would help in cases where a developer does not know the right terms to search for and could be done, for instance, based on the type of the exception raised or other contextual clues. It may even be that, given good search tools, developers could discover that the rank of a faulty statement does not matter as much as the search rank.

Implication 2 - Debugging tools may be more successful if they focused on searching through or automatically highlighting certain suspicious statements.

6.2.3 Grow the Ecosystem

The way performance (with respect to time) is computed in many studies makes assumptions that do not hold in practice. As with test suite prioritization, with automated fault localization the total time spent configuring and using the tool should be less than the time spent using traditional debugging alone. In practice, at least in some scenarios, the time to collect coverage information, manually label the test cases as failing or passing, and run the calculations may exceed the actual time saved using the tool.

In general, for a tool to be effective, it should seamlessly integrate the different parts of the debugging technique considered and provide end-to-end support for it. Although some of these issues can be addressed with careful engineering, it may be useful to focus research efforts on ways to streamline and integrate activities such as coverage collection, test-case classification and rerunning, code inspection, and so on.

Implication 3 - Research should focus on providing an ecosystem that supports the entire tool chain for fault localization, including managing and orchestrating test cases.

6.3 Threats to Validity

We chose a time limit for tasks that made it possible to conduct our experiment within a two-hour time frame, so as to avoid exhausting participants. However, this time limit might have excluded less experienced participants who may need more time to complete the tasks. Our study has focused on more experienced developers, many of whom could complete the tasks within the time limit, and may not generalize to novice users.
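
The navigation measures reported in Section 5.4 (position deltas between consecutive visits, the share of visits that jump more than one position, and the number of "zigzags") can be illustrated with a short sketch. The paper does not give exact definitions, so the choices below are assumptions: a jump is a move whose absolute delta exceeds 1, a zigzag is any reversal between descending and ascending the list, and revisiting the same position is ignored when detecting reversals. The visit sequence is invented for illustration and is not data from the study.

```python
from typing import List


def jump_deltas(positions: List[int]) -> List[int]:
    """Signed difference between consecutively visited rank positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def jump_fraction(positions: List[int]) -> float:
    """Fraction of moves that jump more than one position (|delta| > 1)."""
    deltas = jump_deltas(positions)
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if abs(d) > 1) / len(deltas)


def count_zigzags(positions: List[int]) -> int:
    """Count reversals of direction while navigating the ranked list."""
    zigzags = 0
    previous_sign = 0
    for d in jump_deltas(positions):
        if d == 0:
            continue  # staying on the same statement is not a reversal
        sign = 1 if d > 0 else -1
        if previous_sign and sign != previous_sign:
            zigzags += 1
        previous_sign = sign
    return zigzags


# Hypothetical sequence of rank positions clicked by one participant.
example_visits = [1, 2, 3, 10, 9, 4, 5, 30, 12]
example_deltas = jump_deltas(example_visits)
example_jump_fraction = jump_fraction(example_visits)
example_zigzags = count_zigzags(example_visits)
```

For the invented sequence, half of the eight moves skip over at least one position, and the direction reverses three times.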
