Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically

Mrinal Kanti Das† (mrinal@csa.iisc.ernet.in), Suparna Bhattacharya‡ (suparna@in.ibm.com), Chiranjib Bhattacharyya† (chiru@csa.iisc.ernet.in), K. Gopinath† (gopi@csa.iisc.ernet.in)
†Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
‡IBM Research-India

Abstract

In a recent pioneering approach, LDA was used to discover crosscutting concerns (CCC) automatically from software codebases. LDA, though successful in detecting prominent concerns, fails to detect many useful CCCs, including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we will refer to as subtle topics. Recently an interesting approach, namely focused topic models (FTM), was proposed in (Williamson et al., 2010) for detecting rare topics. FTM, though successful in detecting topics which occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue we propose subtle topic models (STM). STM uses a generalized stick breaking process (GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between GSBP and the generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCCs in two benchmark code bases, a feat which is beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in software engineering. Furthermore, it is observed that even in general text corpora STM outperforms the state of the art in discovering subtle topics.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

The hierarchical Dirichlet process (HDP) (Teh et al., 2007) is one of the most widely used topic models. Recall that HDP places a Dirichlet process (DP) prior over a potentially infinite number of topics at the corpus level. Subsequently, it uses a DP prior over the topics for each document, and each document-level DP is distributed as the corpus-level DP. Though HDP is extremely successful in discovering topics in general, it fails to discover topics which occur in very few documents, often referred to as rare topics. This inability stems from the fact that HDP inherently assumes that a frequent topic will on average occur frequently within each document, leading to a positive correlation between the proportion of a topic in an article and the prevalence of that topic in the entire corpus.

This important problem has partially been addressed in (Williamson et al., 2010). By using the Indian Buffet process (IBP), (Williamson et al., 2010) defined a compound DP, namely ICD, to decorrelate document-wise prevalence and corpus-wide proportion. ICD was applied in focused topic models (FTM) to detect rare topics which are prominently placed in very few documents.

Consider the corpus of proceedings of NIPS, 2005 [1]. Because of the limited number of papers on supervised classification, an HDP-based approach fails to identify topics related to supervised classification, but FTM detects this easily (see the supplementary material for more details). However, there are some topics which are not only rare across documents but also rarely appear within a document. Under these circumstances FTM will fail to discover them. A case in point is a topic related to neuromorphic engineering about cochlear modelling. The topic has been discussed in only one paper (Wen & Boahen, 2005). In addition, the main theme (cochlear) is rarely explicitly mentioned (5% of sentences) in that paper. Therefore, the topic is assigned an extremely low probability, making it extremely difficult for even FTM to detect. This phenomenon is not specific to scientific corpora; it is observed in other text corpora too. We studied the speeches of Barack Obama from July 27, 2004 till October 30, 2012 [2], a span of eight years. We observe that in this corpus there are two speeches on Carnegie Mellon University (CMU), given when he visited CMU on 2 June, 2010 and 24 June, 2011. It is a difficult task for FTM to detect this topic, as "Carnegie Mellon" is contained not only in two documents (which is rare), but is also present in less than 10% of sentences in those two documents. These examples show that discovering topics which rarely occur in individual documents still remains an unsolved problem. To this end, we propose to study the discovery of subtle [3] topics, which rarely occur in the corpus as well as in individual documents.

An immediate motivation for studying subtle topics is the automatic discovery of crosscutting concerns in software code. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has been applied in software analysis to automatically discover topics that represent software concerns (Baldi et al., 2008). The use of topic models for this problem is attractive because, unlike most other state-of-the-art concern identification techniques, it is neither limited by a priori assumptions about the nature of the concerns to look for, nor by the need for human input and other sources of information beyond the source code. However, in framework-based software, important program concerns can have such a subtle presence in the code that existing topic models fail to detect them.
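The sentence-level rarity that defines a subtle topic can be made concrete with a short sketch. The toy corpus, keyword set and helper function below are illustrative assumptions, not from the paper; the sketch simply measures, per document, the fraction of sentences containing any topic keyword, the quantity behind statements like "less than 10% of sentences":

```python
# Sketch: quantify how "subtly" a set of topic keywords manifests in a corpus.
# Documents are lists of sentences; corpus and keywords here are hypothetical.
def sentence_prevalence(doc_sentences, keywords):
    """Fraction of sentences that contain at least one keyword."""
    hits = sum(1 for s in doc_sentences
               if any(k in s.lower() for k in keywords))
    return hits / len(doc_sentences)

corpus = {
    "paper_a": ["the cochlear model uses active coupling",
                "we train the filter bank", "results are shown",
                "gradient descent converges", "discussion follows"],
    "paper_b": ["svm margins are maximized",
                "kernels are positive definite"],
}
keywords = {"cochlear", "cochlea"}

per_doc = {d: sentence_prevalence(s, keywords) for d, s in corpus.items()}
docs_with_topic = sum(p > 0 for p in per_doc.values())
# The topic appears in only 1 of 2 documents, and in only 20% of that
# document's sentences: rare across the corpus AND within documents,
# which is exactly the regime where FTM-style rare-topic models struggle.
```

A topic whose `per_doc` values are non-zero for few documents and small even where non-zero is "subtle" in the sense studied here.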
[1] nips.cc/Conferences/2005
[2] www.americanrhetoric.com/barackobamaspeeches.htm
[3] The dictionary meaning of subtle is "difficult to detect", which motivates the name.

In Berkeley-DB, a widely used software codebase, the cross-cutting concern involved in the updating of various book-keeping counts is not easy to detect, as the counters are named nWaits, nRequests, nINsCleanedThisRun, etc., which do not contain the word "count". Such concerns that are expressed subtly in the source code cannot be ignored, as they may be sources of high resource usage or support critical program functionality. For example, Verify is a useful cross-cutting concern in Berkeley-DB, yet it is too subtle in the code for any of the existing models, including FTM, to recognize it.

Contributions: In this paper we propose subtle topic models (STM), which have the ability to detect topics that occur very rarely in individual documents. HDP and FTM use a single distribution over topics for a document, and we have observed that this makes it difficult for them to detect subtle topics. In order to give importance to subtle topics within a document, we propose to split the co-occurrence domain inside a document by using multiple distributions over topics for each document. It is nontrivial to select a proper prior over these topic vectors. We use the generalized stick breaking process (GSBP) (Ishwaran & James, 2001) to address this issue. Using GSBP, STM allows the topic vectors to be shared across the document and the proportions over the topic vectors to be independent of each other, which is essential in modeling subtle topics, as explained in detail later. The inference problem due to GSBP is not standard. We propose to solve it by utilizing the relationship between GSBP and the generalized Dirichlet distribution (GD), and subsequently the conjugacy between GD and the multinomial distribution. We believe that this process and the associated inference procedure are novel and of independent interest to the Bayesian non-parametric community.

The most significant contribution in terms of the utility of STM lies in its ability to detect subtly manifested concerns in software programs, a known hard problem in software engineering. The results obtained here thus mark a breakthrough in this area. In addition, STM finds subtle topics in the proceedings of NIPS, 2005 and in speeches of Barack Obama since 2004, which shows its ability to do well on general text corpora.

Structure of the paper: The paper is organized as follows. Section 2 discusses the application of topic models in discovering concerns in software codebases. In section 3 we present the proposed model, while section 4 describes the inference procedure. The experimental study is covered in section 5.

2. Topic models for detecting software concerns

Software concerns are features, design idioms or other conceptual considerations that impact the implementation of a program. A concern can be characterized in terms of its intent and extent (Marin et al., 2007). A concern's intent is defined as its conceptual objective (or topic). A concern's extent is its concrete representation in software code, i.e. the source code modules and statements where the concern is implemented. Program concerns may be modular, i.e. implemented by a single source file or module, or cross-cutting, i.e. dispersed across several code modules and interspersed with other concerns.

Identifying and locating concerns in existing programs is an important and heavily researched problem in software (re)engineering (Robillard, 2008; Savage et al., 2010a; Marin et al., 2007; Eaddy et al., 2008; Revelle et al., 2010). Yet it remains hard to automate completely in a satisfactory fashion. Typical concern location and aspect mining techniques are semi-automatic: some use manual query patterns and some generate seeds automatically based on structural information (Marin et al., 2007). Both approaches have the restriction that they tend to be driven by some prior expectation or search clues about the concern(s) of interest, either in terms of the concerns' intent (e.g. seed word patterns, test cases) or their extent (e.g. fan-in analysis). Recently it was shown that LDA can automatically detect prominent cross-cutting concerns (Baldi et al., 2008; Savage et al.
, 2010b) quite successfully without these restrictions.

Although the LDA approach works well for surfacing concerns (including CCCs) that have a statistically significant manifestation in the source code (a large extent), it can miss interesting CCCs (e.g. concerns that are executed heavily and thus impact runtime resource usage) just because they may not have a prominent presence in source code. This is especially likely in framework-based code, where all underlying module sources may not be available and a concern's extent may include only a small percentage of statements in the source code files to be analyzed. Even when a concern's extent is not all that small, it can elude detection because of the subtle presence of the representative words that reflect its intent.

Consider the example of Verify, an important CCC in Berkeley-DB. According to a published manual analysis that includes a fine-grained mapping of Berkeley-DB code concerns (available from (Apel et al., 2009; Kästner et al., 2007)), this CCC occurs individually as a main concern and also has 7 derivative concerns (combinations of multiple concerns); see the supplementary material for more details. However, this concern is surprisingly hard to detect, not just by LDA/HDP but even by FTM and MG-LDA (Titov & McDonald, 2008). The concern's extent is not particularly small, but its statements are spread across files and contain the internals of operations performed to verify different structures. Thus the word "verify" occurs in only a small fraction of these statements. It is very challenging to surface such subtle traces of the concern's intent automatically without relying on any a priori information. Despite FTM's strength in detecting rare topics, FTM fails at this task as well, because even in the file with the strongest presence of the verify concern, the word is reflected in less than 10% of the statements in that file. Thus, detecting subtly manifested concerns remains a challenging open task.

3. Subtle Topic Models

In this section we present subtle topic models (STM), designed to detect subtle topics which are rarely present across the corpus as well as within documents. We will briefly discuss the generalized stick breaking process before describing STM.

3.1. Generalized stick breaking process (GSBP)

Under the generalized stick breaking process framework, any $P$ is a stick-breaking random measure if it is of the following form:

$$P = \sum_{j=1}^{J} \pi_j\, \delta_{\phi_j}(\cdot), \qquad \pi_1 = v_1, \quad \pi_j = v_j \prod_{l<j} (1 - v_l) \qquad (1)$$

where $v_j \sim \mathrm{Beta}(a_j, b_j)$ [4] and the $\phi_j$ are independently chosen from a distribution $H$; $\delta_{\phi_j}$ denotes a discrete measure concentrated at $\phi_j$. By construction, $0 \le \pi_j \le 1$ and $\sum_{j=1}^{J} \pi_j = 1$ almost surely, where $J$ can be finite or infinite. When $a_j = 1$ and $b_j = \alpha$ for all $j$, and $J \to \infty$, it reduces to $\mathrm{DP}(\alpha H)$. The two-parameter Poisson-DP (Pitman-Yor process) corresponds to the case $J \to \infty$, $a_j = 1 - \sigma$ and $b_j = \beta + j\sigma$, with $0 \le \sigma < 1$ and $\beta > -\sigma$. For more discussion see (Ishwaran & James, 2001).

Here we are interested in the situation where $J > 1$ is finite, for which, to ensure $\sum_{j=1}^{J} \pi_j = 1$, one needs to set $v_J = 1$. We will utilize one interesting property of this finite-dimensional stick-breaking process: random weights $\pi_j$ defined in this manner are also generalized Dirichlet (GD) distributed.

[4] $\sim$ denotes "distributed as".

3.2. Subtle topic models

We consider a dataset $\{\{\{w_{din}\}_{n=1}^{N_{di}}\}_{i=1}^{S_d}\}_{d=1}^{D}$, where $D$ is the number of documents in the corpus, $S_d$ is the number of sentences in document $d$, and $N_{di}$ is the number of words in sentence $i$ of document $d$. In addition, let us denote the number of words in a document by $N_d$.

In STM, for each document $d$, we propose to have $J_d \ge 1$ distributions over topics. Topics, denoted by $\phi_k$, are shared across the corpus. We assume a distribution over these $J_d$ topic vectors at the sentence level, denoted by $\pi_{di}$ for sentence $i$ in document $d$. Note that a distribution over the topic vectors at the document level would lead to the problem of assigning high probability to those topic vectors which are popular in the document.

3.2.1. Selecting a prior over topic vectors

There are various options in choosing $J_d$, $\pi_{di}$ and a prior distribution over $\pi_{di}$. The simplest possibility is $J_d = J$ for all $d$ and $\pi_{di} \sim \mathrm{Dirichlet}(\rho)$, $\rho$ being a $J$-dimensional vector. The problem with a fixed $J$ is that it cannot model the fact that documents with higher $S_d$ (or $N_d$) in general have a higher probability of being more incoherent than those with smaller $S_d$. To avoid this issue one can use $J_d \sim \mathrm{Poisson}(S_d)$. However, this makes the expected value of $J_d$ large for documents with large $S_d$, which in turn increases the chance of such documents being incoherent in most cases, which is undesirable. Due to its rich-get-richer property, HDP is not suitable in this case either. On the other hand, using ICD here would make learning difficult, as the content of a single sentence is too small.

Noting that $J_d$ can be at most the number of sentences $S_d$ in document $d$ (when $\pi_{di}$ has 1 in one component and zero elsewhere, and each $\pi_{di}$ is different for different $i$), we set $J_d = S_d$. Then we use GSBP as described in section 3.1 to construct the $\pi_{dij}$ as follows. For document $d$, $i = 1,\ldots,S_d$ and $j = 1,\ldots,S_d-1$:

$$v_{dij} \sim \mathrm{Beta}(a_j, b_j), \qquad \pi_{di1} = v_{di1}, \quad \pi_{dij} = v_{dij} \prod_{l<j} (1 - v_{dil}) \qquad (2)$$

with $v_{diS_d} = 1$. Let us denote the above process as $\mathrm{GSBP}_{S_d}(a, b)$, where $a$ and $b$ are $(S_d-1)$-dimensional vectors of parameters. Note that $\sum_{j=1}^{S_d} \pi_{dij} = 1$, since $1 - \sum_{j=1}^{S_d-1} \pi_{dij} = \prod_{l=1}^{S_d-1} (1 - v_{dil})$. Due to this construction, $S_d$ is an upper limit and not the exact number of distributions over topics per document. Therefore, although a larger document may admit a higher value of $J$, selecting higher-indexed topic vectors is discouraged. As discussed earlier, with proper parameter settings the finite GSBP can be treated as a truncated DP or a truncated PYP (Pitman-Yor process). A DP or PYP could be used instead without affecting the rest of the model; however, GSBP is a more flexible distribution and is better suited to the small sentences encountered in software datasets.

3.2.2. Construction of topic vectors

We denote the distributions over topics in document $d$ by $\{\theta_{dj}\}$ for $j = 1,2,\ldots,S_d$. HDP is not a suitable prior for $\theta_{dj}$, as we need the distributions over topics to be uncorrelated with the document-level topic proportions and with each other as much as possible. Therefore, ICD seems to be a more appropriate choice here. We use the two-parameter IBP (Griffiths et al.
, 2007) to sample binary random vectors $\gamma_{dj}$, and then we sample the $\theta_{dj}$ from ICD as follows. For $j = 1,\ldots,S_d$ and $k = 1,\ldots,K$:

$$\gamma_{djk} \sim \mathrm{Bernoulli}(\omega_k), \qquad \theta_{dj} \sim \mathrm{Dirichlet}(\mu\, \mathbf{1}_K \cdot \gamma_{dj})$$

where $\omega_k \sim \mathrm{Beta}(\alpha\lambda/K, \lambda)$. Here $\mathbf{1}_K$ denotes a $K$-dimensional vector of all ones, and "$\cdot$" denotes the componentwise (Hadamard) product. $\lambda$ is a repulsion parameter: with the same expected number of topics, the variability among the $\gamma_{dj}$ across $j$ increases as $\lambda$ increases, and when $\lambda = 1$ it reduces to the standard IBP. $K$ is the truncation level, and (Doshi-Velez et al., 2009) shows that the probability of $\gamma_{djk}$ being 1 for any $j$ is very low if $K$ is sufficiently high.

Using the above two constructions we get the base distribution corresponding to a sentence as follows:

$$G_{di} = \int_{\pi_{di}} \int_{\theta_{d\cdot}} \sum_{j=1}^{S_d} \sum_{k} \pi_{dij}\, \theta_{djk}\, p(\pi_{di})\, p(\theta_{dj})\, \delta_{\phi_k}$$

This forms a dependent Dirichlet process where the $\phi_k$ are shared across all the sentences in all the documents, and the $\theta_{dj}$ are shared across all the sentences within a document. STM assumes the generative process described in Algorithm 1.

Algorithm 1: Generative process of STM

  for k = 1, 2, ... do
    draw topic φ_k ~ Dirichlet(η 1_V)
    draw topic selection probability ω_k ~ Beta(αλ/K, λ)
  end for
  for documents d = 1 to D do
    sample the number of sentences S_d ~ Poisson(ϱ)
    for distributions over topics j = 1, ..., S_d do
      sample γ_djk ~ Bernoulli(ω_k), k = 1, ...
      sample θ_dj ~ Dirichlet(μ 1_K · γ_dj)
    end for
    for sentences i = 1, ..., S_d do
      sample π_di ~ GSBP_{S_d}(a, b)
      for words n = 1, ..., N_di do
        select b_din ~ mult(π_di)
        sample topic z_din ~ mult(θ_{d b_din})
        sample word w_din ~ mult(φ_{z_din})
      end for
    end for
  end for

4. Posterior Inference

We use the Gibbs sampling approach to sample the latent variables using the posterior conditional distributions. The main challenges in the inference procedure are due to the binary random vectors $\gamma$ and the generalized stick breaking process (GSBP) variables $v$. We sample the binary random vectors considering a truncated IBP. A discussion on the effect of truncation can be found in (Doshi-Velez et al.
, 2009). The inference procedure due to GSBP is a relatively unexplored area and is not straightforward. We utilize the fact that GSBP is equivalent to the generalized Dirichlet distribution (GD) (Wong, 1998). The benefit we draw from this relationship is that, like the Dirichlet distribution, GD is also conjugate to the multinomial distribution. This makes the inference very simple, as we describe next.

We collapse the conditional distributions by integrating out the topic distributions ($\phi$), the distributions over topics ($\theta$), the Bernoulli parameters ($\omega_k$) and the distributions over topic vectors for each sentence ($\pi_{da}$). We do, however, sample the topic assignment variables $z$ and $b$, along with the binary vectors $\gamma$.

We will use the following notation for counts. $d$ is the document index, $a$ is the sentence index and $i$ is the word position index. $n$ represents the counting variable; indices are put in the subscript, where "$\cdot$" represents marginalization. The superscript $-dai$ denotes that the current word is excluded from all counts (we do not repeat this in the text that follows). Thus, $n^{-dai}_{\cdots k w_{dai}}$ is the number of times word type $w_{dai}$ is associated with topic $k$; $n^{-dai}_{\cdots k \cdot}$ is the number of times topic $k$ is used in the whole corpus; $n^{-dai}_{d \cdot b_{dai} k \cdot}$ is the number of times topic $k$ and $b_{dai}$ are used together; $n^{-dai}_{d \cdot b_{dai} \cdot\cdot}$ is the number of times $b_{dai}$ is used; $n^{-dai}_{da j}$ is the number of times the topic vector indexed by $j$ is used; and $n^{-dai}_{da \cdot}$ is the number of words in the sentence. $K$ is the truncation level for topics. For the sake of brevity, in the following text we do not put all the variables in the conditionals, hoping they are easy to track by following the generative process (Algorithm 1).

Sampling z and γ: The conditional probability of the topic assignment of word $i$ in sentence $a$ of document $d$ can be expressed as:

$$p(z_{dai}=k \mid w, z^{-dai}) \propto p(w_{dai} \mid z_{dai}=k, z^{-dai})\, p(z_{dai}=k \mid z^{-dai}) = \frac{\eta + n^{-dai}_{\cdots k w_{dai}}}{V\eta + n^{-dai}_{\cdots k \cdot}} \cdot \frac{\gamma_{d b_{dai} k}\,\mu + n^{-dai}_{d \cdot b_{dai} k \cdot}}{\sum_{k'} \gamma_{d b_{dai} k'}\,\mu + n^{-dai}_{d \cdot b_{dai} \cdot\cdot}} \qquad (3)$$

Notice that we need to infer only $\gamma$ and $b$ in order to assign topics. Note that $\gamma$ contains binary selection values; therefore, if $n^{-dai}_{d \cdot j k \cdot} > 0$, the posterior probability of $\gamma_{djk}$ being one is 1 almost surely. Otherwise:

$$p(\gamma_{djk}=1 \mid z, \gamma^{-djk}) \propto p(z_{dj} \mid \gamma_{djk}=1, \gamma^{-djk})\, p(\gamma_{djk}=1 \mid \gamma^{-djk}) \propto \frac{\Gamma\!\big(\mu \sum_{s \ne k} \gamma_{djs} + \mu\big)}{\Gamma\!\big(\mu \sum_{s \ne k} \gamma_{djs} + \mu + n^{-dai}_{d \cdot j \cdot\cdot}\big)} \cdot \frac{\sum_{r}\sum_{l} \gamma_{rlk} + \alpha\lambda/K}{\sum_{r}\sum_{l} 1 + \alpha\lambda/K + \lambda} \qquad (4)$$

Sampling b: From the relation between GSBP and GD we get that, if the $\pi_{di}$ are constructed as in Eq.
(2), then they are equivalently distributed as GD, and the density of $\pi_{di}$ is:

$$f(\pi_{di}) = \prod_{j=1}^{S_d-1} \frac{\pi_{dij}^{a_j - 1}\, \big(1 - \sum_{l=1}^{j} \pi_{dil}\big)^{c_j}}{B(a_j, b_j)} \qquad (5)$$

where $B(a_j, b_j) = \frac{\Gamma(a_j)\Gamma(b_j)}{\Gamma(a_j + b_j)}$, $c_j = b_j - a_{j+1} - b_{j+1}$ for $j = 1,2,\ldots,S_d-2$, and $c_{S_d-1} = b_{S_d-1} - 1$. Note that $\pi_{diS_d} = 1 - \sum_{l=1}^{S_d-1} \pi_{dil}$. Note also that by setting $b_{j-1} = a_j + b_j$ for $2 \le j \le S_d-1$, GD reduces to the standard Dirichlet distribution.

Now, using the conjugacy between GD and the multinomial, we integrate out the $\pi$ and $v$. If $\pi_{da} \sim \mathrm{GD}_{S_d-1}(a_1,\ldots,a_{S_d-1};\, b_1,\ldots,b_{S_d-1})$ and the $b_{daj}$ are sampled from $\mathrm{mult}(\pi_{da})$, then the posterior distribution of $\pi_{da}$ given the $b_{daj}$ is again a GD with density $\mathrm{GD}_{S_d-1}(a'_1,\ldots,a'_{S_d-1};\, b'_1,\ldots,b'_{S_d-1})$, where $a'_j = a_j + n^{-dai}_{daj}$ and $b'_j = b_j + \sum_{l=j+1}^{S_d} n^{-dai}_{dal}$. Thus we compute the conditional $p(b_{dai}=j \mid b^{-dai}, a, b)$ for $j < S_d$ as

$$\frac{a_j + n^{-dai}_{daj}}{a_j + b_j + \sum_{r=j}^{S_d} n^{-dai}_{dar}} \prod_{l<j} \frac{b_l + \sum_{s=l+1}^{S_d} n^{-dai}_{das}}{a_l + b_l + \sum_{s=l}^{S_d} n^{-dai}_{das}}$$

and $p(b_{dai}=S_d \mid b^{-dai}, a, b) = 1 - \sum_{l=1}^{S_d-1} p(b_{dai}=l \mid b^{-dai}, a, b)$. Notice that the stick-breaking property of GSBP is clearly visible here. The posterior probability of selecting a topic vector for a word can then be found to be:

$$p(b_{dai}=j \mid b^{-dai}, z, \gamma_{dj}) \propto p(z_{dai} \mid b_{dai}=j, z^{-dai}, \gamma_{dj})\, p(b_{dai}=j \mid b^{-dai}) = \frac{\gamma_{dj z_{dai}}\,\mu + n^{-dai}_{d \cdot j z_{dai} \cdot}}{\sum_{k} \gamma_{djk}\,\mu + n^{-dai}_{d \cdot j \cdot\cdot}}\; p(b_{dai}=j \mid b^{-dai}) \qquad (6)$$

Equations (3), (4) and (6) together form the inference procedure of STM.

Discussion: Note that when $J = 1$ we get a model equivalent to FTM, and in that way we obtain an alternative inference procedure for FTM. Recall that in (Williamson et al.
, 2010) the binary vectors are integrated out, using an approximation, when computing the conditional for the topic assignment variables $z$, and the binary vectors are sampled to compute the $\omega_k$. We have observed that both these alternatives work equally well; however, sampling the binary vectors makes the inference simpler, at the cost of a marginally slower convergence rate (it matches up in likelihood in about 100 iterations) in the case of the truncated IBP.

5. Empirical Study

In this section we empirically study the proposed model STM on the special task of finding subtle cross-cutting concerns in software repositories. In addition, we apply STM to two text datasets which are apparently rich in subtle topics. This section is organized as follows. First we explain the challenges related to empirical evaluation and our approach under the limited scope. Then we describe the baselines, followed by the datasets used in the evaluation. Next, we discuss our results in two subsections, followed by a short discussion on the empirical findings [5].

[5] For relevant resources see mllab.csa.iisc.ernet.in/stm.

5.1. Evaluation approach

We evaluate STM on two aspects: (1) modeling ability, using perplexity and topic coherence, and (2) the ability to discover subtle topics. For the first, we use standard metrics. The second aspect, however, is not easy to evaluate. Unlike semantic coherence, it is difficult for a human to judge the subtlety of a topic by looking at a few top words. Moreover, subtlety is relative to the dataset: a topic may be subtle with respect to one dataset but prominent in another. As an alternative to human judgment, we can check how well a model finds some known or pre-defined subtle topics. But it is difficult to find a dataset with a set of pre-defined subtle topics as a gold standard against which to compare. Given this unavailability, we manually create our own gold standard, as explained later, and provide the complete list in the supplementary material.

5.2. Baselines

For evaluation, we compare with HDP, FTM [6] and MG-LDA (Titov & McDonald, 2008
). MG-LDA has the ability to discover local topics which might be missed by HDP or FTM. Although subtle topics may not localize properly inside a document, it is useful to benchmark STM against MG-LDA.

5.3. Dataset & pre-processing

NIPS-05: A corpus of 207 accepted papers from the proceedings of Neural Information Processing Systems, 2005 [7] (Globerson et al., 2007).

Obama-speech: A collection of public speeches by Barack Obama from July 27, 2004 till October 30, 2012 [8], comprising 142 articles which are transcribed directly from audio.

BerkeleyDB: We selected Berkeley DB Java Edition as our software dataset (Apel et al., 2009). As of 2012, Berkeley DB is the most widely used database toolkit in the world [9], and it is known to have a wide range of cross-cutting concerns.

JHotDraw: JHotDraw is a well known open source GUI framework for drawing technical and structured graphics [10]. We selected JHotDraw because LDA is observed to find a good set of concerns in it.

[6] using inference as in (Williamson et al., 2010)
[7] nips.cc/Conferences/2005
[8] www.americanrhetoric.com/barackobamaspeeches.htm
[9] en.wikipedia.org/wiki/Berkeley_DB
[10] www.jhotdraw.org/

For the software datasets, only the textual content (without programming syntax) of the '.java' files (no documentation, etc.) is used as input. Each statement is treated as a sentence, and Java keywords [11] are removed, but common Java library names are retained. Tokens like StringCopy are split into the two words String and Copy based on the position of a capital letter inside the token. For all datasets, we removed standard English stop words, digits, sentences smaller than 20 characters and words smaller than 3 characters, and converted all text to lowercase. For the NIPS dataset we used the most frequent 5000 words; in the other cases we used the full vocabulary. We used the parameters $\eta, \mu, \alpha, a_j, b_j$ as 1 and $\lambda$ as 100. We ran all the models for 2000 iterations (we found this sufficient for all models to converge in terms of log-likelihood), and used the truncation parameter $K$ as 100 (adequate considering the size of our datasets).

[11] en.wikipedia.org/wiki/List_of_Java_keywords

5.4. Evaluation on perplexity & coherence

We randomly picked one-third of each dataset as held-out data and used the standard definition of perplexity, as can be found in (Blei et al., 2003). A lower value of perplexity means that the model fits the dataset better. By approximating the user experience of topic quality on the $W$ top words of a topic, topic coherence (TC) can be measured as

$$TC(W) = \sum_{i} \sum_{j<i} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}$$

where $D(w)$ is the document frequency of any word $w$, and $D(w_i, w_j)$ is the document frequency of $w_i$ and $w_j$ together (Mimno et al., 2011). $\epsilon$ is a small constant to avoid log of zero. Values closer to zero indicate better coherence. We used the top 5 words to compute the coherence of a topic.

Table 1. Comparison on perplexity (top) and topic coherence (bottom). STM achieves the lowest perplexity with good coherence.

Held-out data perplexity
Dataset        HDP    MG-LDA   FTM    STM
BerkeleyDB     182    127      80     60
JHotDraw       131    156      93     81
NIPS-05        941    2107     413    402
Obama-speech   3591   4721     901    582
Average        1211   1778     372    281

Average topic coherence
Dataset        HDP    MG-LDA   FTM    STM
BerkeleyDB     -58.6  -49.6    -20.3  -27.9
JHotDraw       -80.9  -94.2    -37.9  -28.2
NIPS-05        -78.1  -43.7    -45.1  -37.7
Obama-speech   -72.4  -53.2    -67.2  -52.5
Average        -72.5  -59.9    -42.6  -36.6

Table 1 contains the results on perplexity and average topic coherence for all the datasets. We observe that STM is a better model than all the others in terms of held-out data perplexity and coherence (in most of the cases). Note that the ability of STM to detect subtle topics lies in splitting the co-occurrence domain; this brings in a mild difficulty in learning normal topics. Hence coherence may suffer a little, which is observed in Table 1 too.

5.5. Evaluation in detecting subtle topics

Measure of subtlety: We define the degree of subtlety of topic $k$ as $DoS(k) = \prod_{d=1}^{D} (1 - p_{dk})$, where

$$p_{dk} = \frac{\sum_{w \in K_k} \sum_{i=1}^{S_d} I[w \in s_{di}]}{|K_k|\, S_d}$$

$K_k$ is the set of keywords describing topic $k$, and $I[w \in s_{di}]$ is 1 if word $w$ is present in sentence $i$ of document $d$. Note that $0 \le DoS \le 1$. The value of DoS increases if a rare word is included in $K_k$, and decreases if a frequent word is inserted.

Gold standard: In order to compare performance on subtle topics, we hand-picked some topics from each corpus such that their DoS is greater than 0.2, which we consider a reasonably challenging degree of subtlety. For comparison, however, we considered only those hand-picked topics for which at least 75% of the keywords are retrieved among the top 5 words by at least one method in the comparison. We compute the recall of each topic by comparing its top five words with $K_k$ of each gold standard topic. A topic is said to be a match for the gold-standard topic for which it has the highest recall; if the recall is less than 0.75, we say the topic is not detected by the model. The reason for keeping the recall threshold high is that some keywords may be popular and part of normal topics; detecting such a keyword alone does not signify detecting the subtle topic in question.

Following the above approach, we hand-picked 11, 10, 21 and 16 topics respectively from the BerkeleyDB, JHotDraw, NIPS and Obama-speech datasets. For each gold standard topic we consider the recall and coherence of the best-matched topic for each model, then average over all the gold standard topics corresponding to each dataset, and report the results in Table 2. Table 3 contains the results on the fraction of subtle topics detected by all the models. A complete list can be found in the supplementary material. In Table 4 we provide five example subtle topics from each dataset which are detected by STM but which the other models hardly detect.

Table 2. Comparison on average recall and average topic coherence considering only the gold standard topics with DoS greater than 0.2.

                           HDP     MG-LDA   FTM     STM
BerkeleyDB    Coherence    -48.48  -42.89   -40.11  -21.99
              Recall       0.42    0.59     0.68    0.94
JHotDraw      Coherence    -36.64  -52.13   -43.26  -46.17
              Recall       0.31    0.16     0.42    0.97
NIPS-05       Coherence    -40.23  -38.55   -37.38  -34.84
              Recall       0.41    0.49     0.56    0.79
Obama-speech  Coherence    -32.65  -22.03   -53.15  -40.6
              Recall       0.26    0.27     0.58    0.95

Table 3. Fraction of subtle topics (DoS >= 0.5) detected (recall >= 0.75) by all the models.

               HDP    MG-LDA   FTM    STM
BerkeleyDB     0      0.25     0.38   1.0
JHotDraw       0      0        0.29   0.86
NIPS-05        0      0        0.07   0.69
Obama-speech   0.14   0.07     0.28   0.93

Table 4. Five example subtle topics (DoS >= 0.5) detected (recall >= 0.75) by STM. Among them, topics marked '*' are also detected by MG-LDA and topics marked '!' are also detected by FTM. HDP could not detect any of them.

BerkeleyDB:    transaction, checkpoint, recovery | stats, count | *checksum, validate, errors | !trace, level, info, config | verify, config, keys
JHotDraw:      nano, xmldom | roundrect | !rendering | zoom, factor | collection, family, families
NIPS:          chip, processing, architecture, circuit | cochlear, cochlea | topic, model, topics, dirichlet | walk, walks, steering, robot | video, texture, resolution, image
Obama-speech:  cyber, security, internet | school, students, college | carnegie, mellon, technology | deficit, cuts, budget | regulations, infrastructure, employees

5.6. Discussion

Subtle topics in many cases consist of rare words. For example, the "cochlea" topic is subtle due to the rareness of its keywords and is detected only by STM. In certain cases some keywords may not be rare, but it is the combination of the words that makes the topic subtle to detect. For example, in Berkeley-DB the topic "trace, level, info, config" consists of four words, of which trace is not a rare word. The topic as a whole signifies the ability to configure trace levels, and manifests in a very localized fashion in the Tracer class; hence it can be detected by FTM, but not by HDP or MG-LDA. On the other hand, the cross-cutting concern "checksum, errors, validate" appears subtly in individual files but is diffused widely across the corpus; hence it can be detected by MG-LDA, while HDP and FTM fail. Not only does STM detect all of these topics, it also succeeds in detecting the more interesting cross-cutting concern "verify", which manifests subtly in every file and therefore eludes HDP, FTM and MG-LDA. Another example of an important yet subtle topic detected only by STM is "cyber security" on the internet, a topic that indicates policies or priorities of president Obama.

6. Conclusion

The utility of a topic is not necessarily linked to its prevalence in a corpus. When topic models are used for automatically discovering concerns, the inability to discover important or interesting concerns just because they are subtly manifested in the source code can be a critical drawback. In this paper we propose a novel model, namely STM, to address this problem for the first time in the literature. STM, by using multiple distributions over topics per document, is observed to effectively discover subtle topics where state-of-the-art models fail. This is a promising result for advancing the state of the art in the difficult problem of automatic concern discovery in software engineering. On empirical evaluation, we find that STM outperforms state-of-the-art models not only on subtle topics but also in general, with low perplexity on unseen data and good topic coherence.

Acknowledgments

We are thankful to all the reviewers for their valuable comments. The authors MKD and CB were partially supported by DST grant (DST/ECA/CB/1101).
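As a numerical sanity check of the finite stick-breaking construction of section 3.1, the following sketch draws weights via $v_j \sim \mathrm{Beta}(a_j, b_j)$ with $v_J$ fixed to 1 and verifies that they sum to one. The function name and the hyperparameter choices are illustrative assumptions, not taken from the paper:

```python
import random

def gsbp_weights(J, a, b, rng=random.Random(0)):
    """Draw pi_1..pi_J by generalized stick breaking:
    v_j ~ Beta(a_j, b_j) for j < J, v_J = 1, and
    pi_j = v_j * prod_{l<j} (1 - v_l)."""
    v = [rng.betavariate(a[j], b[j]) for j in range(J - 1)] + [1.0]
    pi, remaining = [], 1.0
    for vj in v:
        pi.append(vj * remaining)   # pi_j takes a v_j fraction of the stick
        remaining *= (1.0 - vj)     # the rest of the stick carries forward
    return pi

J = 5
pi = gsbp_weights(J, a=[1.0] * (J - 1), b=[1.0] * (J - 1))
# Setting v_J = 1 exhausts the stick, so the weights sum to one.
assert abs(sum(pi) - 1.0) < 1e-9
```

With $a_j = 1$ and a common $b_j$, this is the truncated-DP special case mentioned in section 3.1; other choices of $a$, $b$ give the more flexible finite GSBP used in the model.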
References

Apel, S., Kästner, C., and Lengauer, C. FEATUREHOUSE: Language-Independent, Automated Software Composition. In Proceedings of the 31st International Conference on Software Engineering (ICSE), pp. 221-231, 2009.

Baldi, P. F., Lopes, C. V., Linstead, E. J., and Bajracharya, S. K. A theory of aspects as latent topics. In Proceedings of the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), pp. 543-562, 2008.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. The Journal of Machine Learning Research (JMLR), 3:993-1022, 2003.

Doshi-Velez, F., Miller, K. T., Van Gael, J., and Teh, Y. W. Variational Inference for the Indian Buffet Process. In Proceedings of the Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), pp. 137-144, 2009.

Eaddy, M., Aho, A. V., Antoniol, G., and Guéhéneuc, Y. G. CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis. In International Conference on Program Comprehension (ICPC), pp. 53-62, 2008.

Globerson, A., Chechik, G., Pereira, F., and Tishby, N. Euclidean Embedding of Co-occurrence Data. The Journal of Machine Learning Research (JMLR), 8:2265-2295, 2007.

Griffiths, T. L., Ghahramani, Z., and Sollich, P. Bayesian Nonparametric Latent Feature Models. In Bayesian Statistics, pp. 201-225, 2007.

Ishwaran, H. and James, L. F. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161-173, 2001.

Kästner, C., Apel, S., and Batory, D. A Case Study Implementing Features Using AspectJ. In Proceedings of the 11th International Software Product Line Conference (SPLC), pp. 223-232, 2007.

Marin, M., van Deursen, A., and Moonen, L. Identifying crosscutting concerns using fan-in analysis. ACM Transactions on Software Engineering and Methodology (TOSEM), 17, 2007.

Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 262-272, 2011.

Revelle, M., Dit, B., and Poshyvanyk, D. Using data fusion and web mining to support feature location in software. In Proceedings of the 2010 IEEE 18th International Conference on Program Comprehension (ICPC), pp. 14-23, 2010.

Robillard, M. P. Topology Analysis of Software Dependencies. ACM Transactions on Software Engineering and Methodology (TOSEM), (4), 2008.

Savage, T., Revelle, M., and Poshyvanyk, D. FLAT3: Feature Location and Textual Tracing Tool. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), Volume 2, pp. 255-258, 2010a.

Savage, T., Dit, B., Gethers, M., and Poshyvanyk, D. TopicXP: Exploring topics in source code using Latent Dirichlet Allocation. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM), pp. 1-6, 2010b.

Teh, Y., Jordan, M. I., and Beal, M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, pp. 1566-1581, 2007.

Titov, I. and McDonald, R. Modeling Online Reviews with Multi-grain Topic Models. In Proceedings of the 17th International Conference on World Wide Web (WWW), pp. 111-120, 2008.

Wen, B. and Boahen, K. Active Bidirectional Coupling in a Cochlear Chip. In Advances in Neural Information Processing Systems (NIPS), 2005.

Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Wong, T. T. Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97:165-181, 1998.
