parninorsogatechedu

ABSTRACT

Debugging is notoriously difficult and extremely time consuming. Researchers have therefore invested a considerable amount of effort in developing automated techniques and tools for supporting various debugging tasks.
Figure 2: NanoXML Task: Identify the cause of the failure in NanoXML and fix the problem.

Jones and colleagues [11], for several reasons. First, Tarantula is, like most state-of-the-art debugging techniques, based on some form of statistical ranking of potentially faulty statements. Second, a thorough empirical evaluation of Tarantula has shown that it can outperform other techniques [12]. (More recently, more techniques have been proposed, but their improvements, when present, are for the most part marginal and dependent on the context.) Finally, Tarantula is easy to explain and teach to developers.

4.4.2 Plugin

To make it easy for the participants to use the selected statistical debugging technique, we created an Eclipse plugin that provides the users with the ranked list of statements that would be produced by Tarantula. We decided to keep the plugin's interface as simple as possible: a list of statements, ordered by suspiciousness, where clicking on a statement in the list opens the corresponding source file in Eclipse and navigates to that line of code. We believe that this approach has the twofold advantage of (1) letting us investigate our research questions directly, by having the participants operate on a ranked list of statements, and (2) clearly separating the benefits provided by the ranking-based approach from those provided by the use of a more sophisticated interface, such as Tarantula's visualization [11].

The plugin, shown in Figure 3, works as follows. First, the user inputs a configuration file for a task by pressing the load file icon. Once the file is loaded, the plugin displays a table with several rows, where each row shows a statement and the corresponding file name, line number, and suspiciousness score. Besides clicking on a statement to jump to it, as discussed above, users can also use previous and next buttons to navigate through the statements.

To compute the ranked list of statements for the plugin, we used the Tarantula formulas provided in Reference [11], which require coverage data and pass/fail information for a set of test inputs. For both Tetris and NanoXML, we collected coverage data using Emma (http://www.eclemma.org/). For NanoXML, we used the test cases and pass/fail status for such test cases available from the SIR repository. For Tetris, for which no test cases were available, we wrote a capture-replay
system that could record the keys pressed when playing Tetris and replay them as test cases. Overall, we collected 10 game sessions, 2 of which executed the faulty statement (i.e., rotated a square block).

Figure 4: Participants are split into different groups having different conditions. Each box represents a task: the label in the box indicates the software subject for the task; the presence of a wrench indicates the use of the automated debugging tool for that task; the icons representing an arrow indicate tasks for which the rank of the faulty statement has been increased (up) or decreased (down).

4.4.3 Data Availability

Our Eclipse plugin, program subjects, and instruction sheets are available for researchers wishing to replicate our study at http://www.cc.gatech.edu/~vector/study/.

4.5 Method

To evaluate our Hypothesis 1, and assess whether participants could complete tasks faster when using an automated debugging tool (tool, hereafter), we created two experimental groups: A and B. Participants in group A were instructed to use the tool to solve the Tetris task. Conversely, participants in group B had to complete the Tetris task using only traditional debugging capabilities available within Eclipse. If the tool were effective, there should be a significant difference between the two groups' task completion times.

We investigated our Hypothesis 2, and assessed whether participants benefited more from using the tool on harder tasks, by giving the experimental groups a second task: fix a fault in NanoXML. In group A, participants were limited to using only traditional debugging techniques available within Eclipse, whereas in group B, participants could also use the tool to solve the task. In this case, we compared the difference in performance for the groups using the tool for the Tetris and the NanoXML tasks. If the tool were more effective for harder tasks, the performance gain of participants using the tool for the NanoXML task should be better than that of participants using the tool on the Tetris task.

Our Hypothesis 3 aims to understand the effects of the rank of the faulty statement on task performance. To investigate this hypothesis, we created two new experimental groups: C and D. Both groups were given both the Tetris and the NanoXML tasks and were instructed to use the tool to complete the tasks. The difference between the two groups was that, for group D, we lowered the rank of the faulty statement for Tetris (i.e., moved it down the list) and increased the rank of the faulty statement for NanoXML (i.e., moved it up the list). If rank were an important factor, there should be a decrease in performance for the Tetris task and an increase in performance for the NanoXML task for group D. A summary of the method we used for investigating our hypotheses can be seen in Figure 4.

group A performed the Tetris task 2.5 times faster than the NanoXML task. Subjects in group B performed the Tetris task 1.3 times faster than the NanoXML task. These values are significantly different (p < 0.02) by a two-tailed t-test. According to these results, statistical debugging with the tool was no more effective than traditional debugging for solving a harder task. Therefore, we found no support for Hypothesis 2.

Overall, the results suggest that there might be several factors that can explain why the automated debugging tool did not help the NanoXML task. In the discussion ahead, we speculate what these factors may be.

5.3 Changes in Rank Have no Significant Effects

For Hypothesis 3, we wanted to explore the effect of rank on the effectiveness of the tool. To assess this hypothesis, we measured the effect of artificially decreasing and increasing the rank of the faulty statements. If this hypothesis were true, we would expect the effectiveness to decrease when dropping the rank and increase when raising the rank.

As discussed in Section 4.5, we tested this hypothesis by conducting an experiment with 10 new participants split into groups C and D. For group C, the rank of faulty statements was kept intact. For group D, the rank for the faulty statement in Tetris was lowered from position 7 to position 35. Similarly, the rank for the faulty statement in NanoXML was raised from position 83 to position 16. (The new ranks were selected in a random fashion.) This modification of the ranks should have improved the effectiveness of the tool for the NanoXML task, and hurt the effectiveness of the tool for the Tetris task, for group D in comparison to group C.

Comparing the average completion time of the Tetris task for groups C and D, we did observe that group D (12:36) was a little slower than group C (10:12). Surprisingly, for the NanoXML task group D was not any faster than group C despite the much lower rank of the faulty statement (16 versus 83). In fact, group D actually performed the NanoXML task slower than group C: 15:12 for group C versus 18:30 for group D. However, the differences in performance between the groups were not statistically significant. In fact, a comparison of the completion time ratio of Tetris to NanoXML yields the same exact average fraction (.79) for both groups. This suggests that both groups were very similar in performance. Lowering rank did not hurt the performance of group D on the Tetris task, nor did raising the rank for NanoXML help improve group D's performance.

Therefore, overall, the results provide no support for Hypothesis 3. This suggests that the rank of the faulty statement(s) may not be as important as other factors or strategies. The participants may be using the tool to find other statements that are near the fault, but ranked higher than the fault. Or they may be searching through the statements based on some intuition, thus canceling the effect of changing the relative position of the faulty statement. For example, four participants in group D, who had the rank of the faulty Tetris statement lowered, were able to overcome this handicap by visiting another statement in position 3 (previously 8) that was in the same function as the bug. This suggests that programmers may use the tool to identify starting points for their investigation, some of which may be near the fault. This would lessen the importance of correctly ranking the exact line of code with the fault.

5.4 Programmers Search Statements

To answer our Research Question 1 on patterns used by developers when inspecting statements, we analyzed the logs produced during the usage of the tool. Specifically, we wanted to assess whether developers inspected statements in order of ranking, one by one, or followed some other strategy. We used the navigation data collected from the 24 participants in groups A and B, of which 22 had usable navigation data. We also examined the questionnaires of all 34 participants to gain insight into their strategies for using the tool. Based on this data, we have determined that programmers do not visit each statement in a linear fashion. There are several sources of support for this observation.

First, for each visit, we measured the delta between the positions of two statements visited in sequence. All participants exhibited some form of jumping between positions. Specifically, 37% of
the visits jumped more than one position and, on average, each jump skipped 10 positions. The only exceptions were low performers (those who did not complete any task), whose majority (95%) cycled through the statements and very rarely skipped positions. Observing the number of positions skipped during all the visits, we hypothesize that smaller jumps may correspond to the skipping of blocks of statements; conversely, larger jumps seem to correspond to some form of searching or filtering of statements, a hypothesis also supported by the responses in the participants' questionnaires.

Second, the navigation pattern was not linear. Participants consistently changed directions (i.e., they started descending the list, flipped around, and started ascending the list). We measured the number of "zigzags" in a participant's navigation pattern, counting one any time there was a change in direction. On average, each participant had 10.3 zigzags, with an overall range between 1 and 36 zigzags.

Finally, on our questionnaire given to all participants, many participants indicated that sometimes they would scan the ranked list to find a statement that might confirm their hypothesis about the cause of the failure, whereas other times they skipped statements that did not appear relevant.

5.5 No Perfect Bug Understanding

To investigate our Research Question 2 on the assumption of perfect bug understanding, we measured the tool usage patterns. We looked at the first time a participant clicked on the faulty statement in the tool, and then examined the participant's subsequent activity. If the faulty nature of a statement were apparent to the developers by just looking at it, tool usage should stop as soon as they get to that statement in the list.

We used the log data from the 24 participants in groups A and B and excluded data for participants that never clicked on the faulty statement, which left us with data for 10 participants. Only 1 participant out of 10 stopped using the tool after clicking on the fault. The remaining participants, on average, spent another ten minutes using the tool after they first examined the faulty statement. That is, participants spent (or wasted) on average 61% of their time continuing to inspect statements with the tool after they had already encountered the fault. This suggests that simply giving the statement was not enough for the participants to understand the problem and that more context was needed, which made us conclude that perfect bug understanding is generally not a realistic assumption.

model to which the developer can relate. When using these tools, instead of working with the familiar and reliable step-by-step approach of a traditional debugger, developers are currently presented with a set of apparently disconnected statements and no additional support.

Observation 2 - Providing overviews that cluster results and explanations that include data values, test case information, and information about slices could make faults easier to identify and tools ultimately more effective.

6.2 Research Implications

6.2.1 Percentages Will Not Cut It

A standard evaluation metric for automated debugging techniques is to normalize the rank of faulty statements with respect to the size of the program. For example, assigning the faulty statement in NanoXML (4,408 LOC) a rank of 83, when expressed as a percentage, suggests that only 1.8% of statements would need to be inspected. Although this result, at first glance, may appear quite positive, in practice we observed that developers were not able to translate this into a successful debugging activity.

Based on our data, we recommend that techniques focus on improving absolute rank rather than percentage rank, for two reasons. First, the collected data suggests that programmers will stop inspecting statements, and transition to traditional debugging, if they do not get promising results within the first few statements they inspect. For example, even when we changed the rank of the faulty statement in NanoXML from 83 to 16, there was no observed benefit. This is consistent with other research in search tasks, where it is clearly shown that most users do not inspect results beyond the first page and reformulate their search query instead [7]. Second, the use of percentages understates how difficult the problem becomes when moving to larger programs. Percentages would suggest that we would not have to change our techniques, no matter whether we are dealing with a 400 LOC or a 4 million LOC program. From direct experience with scaling program analyses from toy programs to industrial-sized programs, we know that this is typically unlikely to be true.

Better measures can make sure we are using the appropriate metric for evaluating
what, and to what extent, will help developers in practice. Otherwise, what may appear as a successful new debugging technique in the laboratory could in reality be no more effective than traditional debugging approaches.

Implication 1 - Techniques should focus on improving absolute rank rather than percentage rank.

6.2.2 Focus More On Search

If current research is unable to achieve good values for absolute statement ranks, an alternative direction may be to enrich the debugging techniques by leveraging some of the successful strategies developers were observed to use. In particular, it may be promising to focus research efforts on how search of statements can be improved.

We observed that a common cause of frustration during debugging is the inability to distinguish irrelevant statements from relevant ones. According to reports from the participants, some developers successfully overcame this problem by filtering the results based on keywords in the statements. We found this to be key, as there may be some fundamental information in the developer's mind that, when combined with the automated debugging algorithm, may yield excellent results.

For example, in the NanoXML task, developers noted that using terms such as "index" or "colon" to filter through the results could help them find a result that matched their suspicion. In fact, had the developers searched for any of the terms "index", "colon", "prefix", or "substring", they could have filtered the statements so that the faulty one was within the first five ranked results. Unfortunately, performing this search manually among many results can be difficult in practice, whereas the combination of ranking and search could be a promising direction.

Besides combining search and ranking, future research could also investigate ways to automatically suggest or highlight terms that may be related to a failure. This would help in cases where a developer does not know the right terms to search for and could be done, for instance, based on the type of the exception raised or other contextual clues. It may even be that, given good search tools, developers could discover that the rank of a faulty statement does not matter as much as the search rank.

Implication 2 - Debugging tools may be more successful if they focus on searching through or automatically highlighting certain suspicious statements.

6.2.3 Grow the Ecosystem

The way performance (with respect to time) is computed in many studies makes assumptions that do not hold in practice. As with test suite prioritization, with automated fault localization the total time spent configuring and using the tool should be less than the time spent using traditional debugging alone. In practice, at least in some scenarios, the time to collect coverage information, manually label the test cases as failing or passing, and run the calculations may exceed the actual time saved using the tool.

In general, for a tool to be effective, it should seamlessly integrate the different parts of the debugging technique considered and provide end-to-end support for it. Although some of these issues can be addressed with careful engineering, it may be useful to focus research efforts on ways to streamline and integrate activities such as coverage collection, test-case classification and rerunning, code inspection, and so on.

Implication 3 - Research should focus on providing an ecosystem that supports the entire tool chain for fault localization, including managing and orchestrating test cases.

6.3 Threats to Validity

We chose a time limit for tasks that made it possible to conduct our experiment within a two-hour time frame, so as to avoid exhausting participants. However, this time limit might have excluded less experienced participants who may need more time to complete the tasks. Our study has focused on more experienced developers, many of whom could complete the tasks within the time limit, and may not generalize to novice users.
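As an illustration of the combination of ranking and search discussed in Section 6.2.2, the following is a minimal sketch, not the plugin used in the study. It assumes that per-test coverage sets and pass/fail outcomes are already available, and it uses the commonly cited form of the Tarantula suspiciousness score; Reference [11] remains the authoritative source for the formulas. The names `Statement`, `suspiciousness`, `rank_statements`, and `filter_by_keywords` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One entry of the ranked list (hypothetical type for this sketch)."""
    file: str
    line: int
    text: str


def suspiciousness(passed_cov, failed_cov, total_passed, total_failed):
    """Commonly cited Tarantula score: the failing-coverage ratio divided by
    the sum of the passing-coverage and failing-coverage ratios."""
    pass_ratio = passed_cov / total_passed if total_passed else 0.0
    fail_ratio = failed_cov / total_failed if total_failed else 0.0
    if pass_ratio + fail_ratio == 0.0:
        return 0.0  # statement not executed by any test
    return fail_ratio / (pass_ratio + fail_ratio)


def rank_statements(statements, coverage, passed):
    """Return (score, statement) pairs, most suspicious first.

    coverage: dict mapping test name -> set of Statements the test executed
    passed:   dict mapping test name -> True if the test passed
    """
    total_passed = sum(1 for ok in passed.values() if ok)
    total_failed = len(passed) - total_passed
    scored = []
    for stmt in statements:
        p = sum(1 for t, ok in passed.items() if ok and stmt in coverage[t])
        f = sum(1 for t, ok in passed.items() if not ok and stmt in coverage[t])
        scored.append((suspiciousness(p, f, total_passed, total_failed), stmt))
    # sorted() is stable, so ties keep the order of the input list.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def filter_by_keywords(ranked, keywords):
    """Keep only ranked entries whose source text contains any keyword
    (case-insensitive), preserving the suspiciousness order."""
    lowered = [k.lower() for k in keywords]
    return [(score, stmt) for score, stmt in ranked
            if any(k in stmt.text.lower() for k in lowered)]
```

Under these assumptions, a developer who suspects a problem with a prefix or an index could call `filter_by_keywords(ranked, ["prefix", "index"])` on the output of `rank_statements` and inspect a much shorter list, which is the kind of narrowing the participants reported doing by hand.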