Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically

Mrinal Kanti Das†        mrinal@csa.iisc.ernet.in
Suparna Bhattacharya♭    suparna@in.ibm.com
Chiranjib Bhattacharyya† chiru@csa.iisc.ernet.in
K. Gopinath†             gopi@csa.iisc.ernet.in

†Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
♭IBM Research - India

Abstract

In a recent pioneering approach, LDA was used to discover crosscutting concerns (CCC) automatically from software code bases. LDA, though successful in detecting prominent concerns, fails to detect many useful CCCs, including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we will refer to as subtle topics. Recently an interesting approach, namely focused topic models (FTM), was proposed in (Williamson et al., 2010) for detecting rare topics. FTM, though successful in detecting topics which occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue we propose subtle topic models (STM). STM uses a generalized stick breaking process (GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between GSBP and the generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCCs in two benchmark code bases, a feat which is beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in software engineering. Furthermore, it is observed that even in general text corpora STM outperforms the state of the art in discovering subtle topics.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

The hierarchical Dirichlet process (HDP) (Teh et al., 2007) is one of the most widely used topic models. Recall that HDP places a Dirichlet process (DP) prior over a potentially infinite number of topics at the corpus level. Subsequently, it uses a DP prior over the topics for each document, and each document-level DP is distributed as the corpus-level DP. Though HDP is extremely successful in discovering topics in general, it fails to discover topics which occur in very few documents, often referred to as rare topics. This inability stems from the fact that HDP inherently assumes that a frequent topic will on average occur frequently within each document, leading to a positive correlation between the proportion of a topic in an article and the prevalence of that topic in the entire corpus.

This important problem has partially been addressed in (Williamson et al., 2010). Using the Indian Buffet process (IBP), (Williamson et al., 2010) defined a compound DP, namely ICD, to decorrelate document-wise prevalence and corpus-wide proportion. ICD was applied in focused topic models (FTM) to detect rare topics which are prominently placed in very few documents.

Consider the corpus of proceedings of NIPS, 2005 [1]. Because of the limited number of papers on supervised classification, an HDP based approach fails to identify topics related to supervised classification, but FTM detects this easily (see supplementary material for more details). However, there are some topics which are not only rare across the documents but also rarely appear within a document. Under these situations FTM will fail to discover them. A case in point is a topic related to neuromorphic engineering about cochlear modelling. The topic has been discussed in only one paper (Wen & Boahen, 2005). In addition, the main theme (cochlear) is rarely explicitly mentioned (5% of sentences) in that paper. Therefore, the topic is assigned an extremely low probability, making it extremely difficult for even FTM to detect. This phenomenon is not specific to scientific corpora, but is observed in other text corpora too. We studied the speeches of Barack Obama from July 27, 2004 till October 30, 2012 [2], a span of eight years. We observe that in this corpus there are two speeches on Carnegie Mellon University (CMU). Those speeches were given when he visited CMU on 2 June, 2010 and 24 June, 2011. It is a difficult task for FTM to detect this topic, as "Carnegie Mellon" is contained not only in just two documents (which is rare), but is also present in less than 10% of sentences in those two documents. These examples show that discovering topics which rarely occur in individual documents still remains an unsolved problem. To this end, we propose to study the discovery of subtle [3] topics, which rarely occur in the corpus as well as in individual documents.

An immediate motivation for studying subtle topics is the automatic discovery of crosscutting concerns in software code. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) has been applied in software analysis to automatically discover topics that represent software concerns (Baldi et al., 2008). The use of topic models for this problem is attractive because, unlike most other state-of-the-art concern identification techniques, it is neither limited by a priori assumptions about the nature of concerns to look for, nor by the need for human input and other sources of information beyond the source code. However, in framework-based software, important program concerns can have such a subtle presence in the code that existing topic models fail to detect them.
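The positive correlation that hampers HDP can be illustrated with a small simulation (a hypothetical sketch, not from the paper): if per-document topic proportions are drawn from a Dirichlet centred on the corpus-level weights, a corpus-rare topic receives a tiny share of every document and never dominates any single one, which is exactly why its words are hard to surface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Corpus-level topic weights: topic 4 is rare (1% of corpus mass).
corpus_weights = np.array([0.40, 0.30, 0.19, 0.10, 0.01])
concentration = 50.0  # document-level concentration (illustrative value)

# HDP-style coupling: each document's topic proportions are drawn
# from a Dirichlet centred on the corpus-level weights.
doc_props = rng.dirichlet(concentration * corpus_weights, size=1000)

# The rare topic's share *within* documents tracks its corpus share:
print(doc_props[:, 4].mean())          # close to 0.01
print((doc_props[:, 4] > 0.5).mean())  # fraction of docs it dominates
```

A model that decorrelates the two quantities (as ICD does) can let a topic be rare in the corpus yet prominent inside a few documents; subtle topics, which are rare on both axes, are the remaining hard case.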
[1] nips.cc/Conferences/2005
[2] www.americanrhetoric.com/barackobamaspeeches.htm
[3] The dictionary meaning of subtle is "difficult to detect", which motivates the name.

In Berkeley-DB, a widely used software code base, the crosscutting concern involved in updating various book-keeping counts is not easy to detect, as the counters are named nWaits, nRequests, nINsCleanedThisRun, etc., which do not contain the word "count". Such concerns that are expressed subtly in the source code cannot be ignored, as they may be sources of high resource usage or support a critical program functionality. For example, Verify is a useful crosscutting concern in Berkeley-DB, yet it is too subtle in the code for any of the existing models, including FTM, to recognize it.

Contributions: In this paper we propose subtle topic models (STM), which have the ability to detect topics that occur very rarely in individual documents. HDP and FTM use a single distribution over topics for a document, and we have observed that this makes it difficult for them to detect subtle topics. In order to give importance to subtle topics within a document, we propose to split the co-occurrence domain inside a document by using multiple distributions over topics for each document. It is non-trivial to select a proper prior over these topic vectors. We use the generalized stick breaking process (GSBP) (Ishwaran & James, 2001) to address this issue. Using GSBP, STM allows the topic vectors to be shared across the document and the proportions over the topic vectors to be independent of each other, which is essential in modeling subtle topics, as explained in detail later. The inference problem due to GSBP is not standard. We propose to solve it by utilizing the relationship between GSBP and the generalized Dirichlet distribution (GD), and subsequently the conjugacy between GD and the multinomial distribution. We believe that this process and the associated inference procedure are novel and of independent interest to the Bayesian non-parametric community.

The most significant contribution in terms of the utility of STM lies in its ability to detect subtly manifested concerns in software programs, a known hard problem in software engineering. The results obtained here thus mark a breakthrough in this area. In addition, STM finds subtle topics in the proceedings of NIPS, 2005 and the speeches of Barack Obama since 2004, which shows its ability to do well on general text corpora.

Structure of the paper: The paper is organized as follows. Section 2 discusses the application of topic models in discovering concerns in software code bases. In Section 3 we present the proposed model, while Section 4 describes the inference procedure. The experimental study is covered in Section 5.

2. Topic models for detecting software concerns

Software concerns are features, design idioms or other conceptual considerations that impact the implementation of a program. A concern can be characterized in terms of its intent and extent (Marin et al., 2007). A concern's intent is defined as its conceptual objective (or topic). A concern's extent is its concrete representation in software code, i.e. the source code modules and statements where the concern is implemented. Program concerns may be modular, i.e. implemented by a single source file or module, or cross-cutting, i.e. dispersed across several code modules and interspersed with other concerns.

Identifying and locating concerns in existing programs is an important and heavily researched problem in software (re)engineering (Robillard, 2008; Savage et al., 2010a; Marin et al., 2007; Eaddy et al., 2008; Revelle et al., 2010). Yet, it remains hard to automate completely in a satisfactory fashion. Typical concern location and aspect mining techniques are semi-automatic: some use manual query patterns and some generate seeds automatically based on structural information (Marin et al., 2007). Both approaches have the restriction that they tend to be driven by some prior expectation or search clues about the concern(s) of interest, either in terms of the concerns' intent (e.g. seed word patterns, test cases) or their extent (e.g. fan-in analysis). Recently it was shown that LDA can automatically detect prominent cross-cutting concerns (Baldi et al., 2008; Savage et al., 2010b) quite successfully without these restrictions.

Although the LDA approach works well for surfacing concerns (including CCCs) that have a statistically significant manifestation in the source code (a large extent), it can miss interesting CCCs (e.g. concerns that are executed heavily and thus impact runtime resource usage) just because they may not have a prominent presence in source code. This is especially likely in framework-based code, where all underlying module sources may not be available, and a concern's extent may include only a small percentage of statements in the source code files to be analyzed. Even when a concern's extent is not all that small, it can elude detection because of the subtle presence of the representative words that reflect its intent.

Consider the example of Verify, an important CCC in Berkeley-DB. According to a published manual analysis that includes a fine-grained mapping of Berkeley-DB code concerns (available at (Apel et al., 2009; Kastner et al., 2007)), this CCC occurs individually as a main concern and also has 7 derivative concerns (combinations of multiple concerns). See the supplementary material for more details. However, this concern is surprisingly hard to detect, not just by LDA/HDP but even by FTM and MG-LDA (Titov & McDonald, 2008). The concern's extent is not particularly small, but its statements are spread across files and contain the internals of operations performed to verify different structures. Thus the word "verify" occurs in only a small fraction of these statements. It is very challenging to surface such subtle traces of a concern's intent automatically without relying on any a priori information. Despite FTM's strength in detecting rare topics, FTM fails at this task as well, because even in the file with the strongest presence of the verify concern, the word is reflected in less than 10% of the statements in that file. Thus, detecting subtly manifested concerns remains a challenging open task.

3. Subtle topic models

In this section we present subtle topic models (STM), designed to detect subtle topics which are rarely present across the corpus as well as within documents. We will briefly discuss the generalized stick breaking process before describing STM.

3.1. Generalized stick breaking process (GSBP)

Under the generalized stick breaking process framework, any P is a stick breaking random measure if it is of the following form:

    P(·) = Σ_{j=1}^{J} π_j δ_{Z_j}(·),    π_1 = v_1,  π_j = v_j Π_{l<j} (1 − v_l)    (1)

where v_j ~ Beta(a_j, b_j) [4], and the Z_j are independently chosen from a distribution H. δ_{Z_j} denotes a discrete measure concentrated at Z_j. By construction, 0 ≤ π_j ≤ 1 and Σ_{j=1}^{J} π_j = 1 almost surely, where J can be finite or infinite. When a_j = 1 and b_j = α for all j, and J → ∞, it reduces to DP(α, H). The two-parameter Poisson-DP (Pitman-Yor process) corresponds to the case J → ∞, a_j = 1 − α and b_j = β + jα, with 0 ≤ α < 1 and β > −α. For more discussion see (Ishwaran & James, 2001). Here we are interested in the situation where J < ∞, for which, to ensure Σ_{j=1}^{J} π_j = 1, one needs to set v_J = 1. We will utilize one interesting property of this finite dimensional stick-breaking process: random weights π_j defined in this manner are also generalized Dirichlet (GD) distributed.

[4] "~" denotes "distributed as".

3.2. Subtle topic models

We consider a dataset {{{w_din}_{n=1}^{N_di}}_{i=1}^{S_d}}_{d=1}^{D}, where D is the number of documents in the corpus, S_d is the number of sentences in document d, and N_di is the number of words in sentence i of document d. In addition, let us denote the number of words in document d by N_d.

In STM, for each document d, we propose to have J_d ≥ 1 distributions over topics. Topics, denoted by β_k, are shared across the corpus. We assume a distribution over these J_d topic vectors at the sentence level, denoted by π_di for sentence i in document d. Note that a distribution over the topic vectors at the document level would lead to the problem of assigning high probability to those topic vectors which are popular in the document.

3.2.1. Selecting a prior over topic vectors

There are various options in choosing J_d, π_di and a prior distribution over π_di. The simplest possibility is: J_d = J for all d, and π_di ~ Dirichlet(λ), λ being a J-dimensional vector. The problem with a fixed J is that it cannot model the fact that documents with higher S_d (or N_d) in general have a higher probability of being more incoherent than those with smaller S_d. To avoid this issue, one can use J_d ~ Poisson(S_d). However, this will make the expected value of J_d large for documents with large S_d, which in turn increases the chance of documents with large S_d being incoherent in most cases, which is undesirable. Further, due to the rich-getting-richer property, HDP is not suitable here. On the other hand, using ICD in this case will make learning difficult, as the content of a sentence is too small.

Noting that J_d can be at most the number of sentences S_d in document d (when each π_di has 1 in one component and zero elsewhere, and each π_di is different for different i), we set J_d = S_d. Then we use GSBP as described in Section 3.1 to construct the π_dij as follows. For document d, i = 1,...,S_d and j = 1,...,S_d − 1:

    v_dij ~ Beta(a_j, b_j)    (2)
    π_di1 = v_di1,   π_dij = v_dij Π_{l<j} (1 − v_dil)

with v_diS_d = 1. Let us denote the above process as GSBP_{S_d}(a, b), where a and b are (S_d − 1)-dimensional vectors of parameters. Note that Σ_{j=1}^{S_d} π_dij = 1, as 1 − Σ_{j=1}^{S_d−1} π_dij = Π_{l=1}^{S_d−1} (1 − v_dil). Due to this construction, S_d is an upper limit and not the exact number of distributions over topics per document. Therefore, although a larger document may admit a higher value of J, selecting higher indexed topic vectors is discouraged. As discussed earlier, with proper parameter settings a finite GSBP can be treated as a truncated DP or truncated PYP (Pitman-Yor process); DP or PYP can be used alternatively without affecting the rest of the model. However, GSBP is a more flexible distribution and is better suited for the small sentences encountered in software datasets.

3.2.2. Construction of topic vectors

We denote the distributions over topics in document d as {θ_dj} for j = 1, 2, ..., S_d. HDP is not a suitable prior for θ_dj, as we need the distributions over topics to be uncorrelated with the document-level topic proportions and with each other as much as possible. Therefore, ICD seems a more appropriate choice here. We use the two-parameter IBP (Griffiths et al., 2007) to sample binary random vectors γ_dj, and then we sample the θ_dj from ICD as follows. For j = 1,...,S_d and k = 1,...,K:

    γ_djk ~ Bernoulli(ω_k),   θ_dj ~ Dirichlet(1_K ∘ γ_dj)

where ω_k ~ Beta(αρ/K, ρ). 1_K denotes a K-dimensional vector of all ones and "∘" denotes the componentwise (Hadamard) product. ρ is a repulsion parameter: with the same expected number of topics, the variability among the γ_dj across j increases as ρ increases, and when ρ = 1 it reduces to the standard IBP. K is the truncation level, and (Doshi-Velez et al., 2009) shows that the probability of γ_djk being 1 for any j is very low if K is sufficiently high.

Using the above two constructions we get the base distribution corresponding to a sentence as follows:

    G_di = ∫_{π_di} ∫_{θ_d} Σ_{j=1}^{S_d} Σ_k π_dij θ_djk p(π_di) p(θ_dj) δ_{β_k}

This forms a dependent Dirichlet process where the β_k are shared across all the sentences in all the documents, and the θ_dj are shared across all the sentences within a document. STM assumes the generative process described in Algorithm 1.
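The finite stick-breaking construction of Eq. (2) is straightforward to sample. The following sketch (an illustration, not the authors' code) generates the sentence-level weights π_di and makes visible why higher-indexed topic vectors receive geometrically less mass:

```python
import numpy as np

def sample_gsbp_weights(S_d, a, b, rng):
    """Sample weights over S_d topic vectors via the finite generalized
    stick-breaking process of Eq. (2).

    a, b: (S_d - 1)-dim Beta parameter vectors (a_j, b_j).
    Returns pi with pi.sum() == 1 (the final stick v_{S_d} is set to 1).
    """
    v = rng.beta(a, b)               # v_j ~ Beta(a_j, b_j), j < S_d
    v = np.append(v, 1.0)            # v_{S_d} = 1 makes the weights sum to 1
    # prod_{l<j} (1 - v_l): the stick length remaining before break j
    remaining = np.cumprod(np.append(1.0, 1.0 - v[:-1]))
    return v * remaining             # pi_j = v_j * prod_{l<j} (1 - v_l)

rng = np.random.default_rng(0)
S_d = 6
pi = sample_gsbp_weights(S_d, np.ones(S_d - 1), np.ones(S_d - 1), rng)
assert np.isclose(pi.sum(), 1.0)
```

With a_j = b_j = 1 each stick fraction is uniform; choosing a_j = 1, b_j = α instead recovers a truncated DP, matching the special cases noted in Section 3.1.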
Algorithm 1: Generative process of STM

    for k = 1, 2, ... do
        draw topic β_k ~ Dirichlet(η 1_V)
        draw topic selection probability ω_k ~ Beta(αρ/K, ρ)
    end for
    for documents d = 1 to D do
        sample number of sentences S_d ~ Poisson(ϱ)
        for distributions over topics j = 1,...,S_d do
            sample γ_djk ~ Bernoulli(ω_k), k = 1,...,K
            sample θ_dj ~ Dirichlet(1_K ∘ γ_dj)
        end for
        for sentences i = 1,...,S_d do
            sample π_di ~ GSBP_{S_d}(a, b)
            for words n = 1,...,N_di do
                select b_din ~ Mult(π_di)
                sample topic z_din ~ Mult(θ_{d b_din})
                sample word w_din ~ Mult(β_{z_din})
            end for
        end for
    end for

4. Posterior Inference

We use the Gibbs sampling approach to sample the latent variables from their posterior conditional distributions. The main challenges in the inference procedure are due to the binary random vectors γ and the generalized stick breaking process (GSBP) variables v. We sample the binary random vectors considering a truncated IBP; a discussion on the effect of truncation can be found in (Doshi-Velez et al., 2009). Inference under GSBP is a relatively unexplored area and is not straightforward. We utilize the fact that GSBP is equivalent to the generalized Dirichlet distribution (GD) (Wong, 1998). The benefit we draw from this relationship is that, like the Dirichlet distribution, GD is conjugate to the multinomial distribution. This makes the inference very simple, as we describe next.

We collapse the conditional distributions by integrating out the topic distributions (β), the distributions over topics (θ), the Bernoulli parameters (ω_k) and the distribution over topic vectors for each sentence (π_da). We, however, sample the topic assignment variables z and the b, along with the binary vectors γ.

We use the following notation for counts: d is the document index, a is the sentence index and i is the word position index. n represents the counting variable, with indices in the subscript, where "·" represents marginalization. The superscript ¬dai denotes that the current word is excluded from all counts (we do not repeat this in the text that follows). Thus, n^{¬dai}_{··k w_dai} is the number of times word type w_dai is associated with topic k; n^{¬dai}_{··k·} is the number of times topic k is used in the whole corpus; n^{¬dai}_{d·b_dai k·} is the number of times topic k and topic-vector index b_dai are used together in document d; n^{¬dai}_{d·b_dai··} is the number of times b_dai is used in document d; n^{¬dai}_{daj··} is the number of times the topic vector indexed by j is used in sentence a of document d; and n^{¬dai}_{da···} is the number of words in the sentence. K is the truncation level for topics. For the sake of brevity, in the following we do not list all the variables in the conditionals, trusting that they are easy to track by following the generative process (Algorithm 1).

Sampling z and γ: The conditional probability of the topic assignment of word i in sentence a of document d can be expressed as:

    p(z_dai = k | w, z^{¬dai})
        ∝ p(w_dai | z_dai = k, z^{¬dai}) p(z_dai = k | z^{¬dai})
        = [(η + n^{¬dai}_{··k w_dai}) / (Vη + n^{¬dai}_{··k·})] · [(γ_{d b_dai k} + n^{¬dai}_{d·b_dai k·}) / (Σ_{k'} γ_{d b_dai k'} + n^{¬dai}_{d·b_dai··})]    (3)

Notice that we need to infer only γ and b in order to assign topics. Note also that γ contains binary selection values; therefore, if n^{¬dai}_{d·jk·} > 0, then the posterior probability of γ_djk being one is 1 almost surely. Otherwise:

    p(γ_djk = 1 | z, γ^{¬djk})
        ∝ p(z_dj | γ_djk = 1, γ^{¬djk}) p(γ_djk = 1 | γ^{¬djk})
        ∝ [Γ(Σ_{s≠k} γ_djs + 1) / Γ(Σ_{s≠k} γ_djs + 1 + n^{¬dai}_{d·j··})] · [(Σ_r Σ_l γ_rlk + αρ/K) / (Σ_r Σ_l 1 + αρ/K + ρ)]    (4)

Sampling b: From the relation between GSBP and GD we get that if the π_di are constructed as in Eq. (2), then they are equivalently distributed as GD, and the density of π_di is:

    f(π_di) = Π_{j=1}^{S_d−1} [ π_dij^{a_j − 1} (1 − Σ_{l=1}^{j} π_dil)^{c_j} / B(a_j, b_j) ]    (5)

where B(a_j, b_j) = Γ(a_j)Γ(b_j)/Γ(a_j + b_j), c_j = b_j − a_{j+1} − b_{j+1} for j = 1, 2, ..., S_d − 2, and c_{S_d−1} = b_{S_d−1} − 1. Note that π_diS_d = 1 − Σ_{l=1}^{S_d−1} π_dil. Note also that by setting b_{j−1} = a_j + b_j for 2 ≤ j ≤ S_d, GD reduces to the standard Dirichlet distribution.

Now, using the conjugacy between GD and the multinomial, we integrate out the π's and v's. If π_da ~ GD_{S_d−1}(a_1,...,a_{S_d−1}; b_1,...,b_{S_d−1}) and the b_daj are sampled from Mult(π_da), then the posterior distribution of π_da given the b_daj is again a GD with density GD_{S_d−1}(a'_1,...,a'_{S_d−1}; b'_1,...,b'_{S_d−1}), where a'_j = a_j + n^{¬dai}_{daj··} and b'_j = b_j + Σ_{l=j+1}^{S_d} n^{¬dai}_{dal··}. Thus we compute the conditional p(b_dai = j | b^{¬dai}, a, b), for j < S_d, as

    [(a_j + n^{¬dai}_{daj··}) / (a_j + b_j + Σ_{r=j}^{S_d} n^{¬dai}_{dar··})] · Π_{l<j} [(b_l + Σ_{s=l+1}^{S_d} n^{¬dai}_{das··}) / (a_l + b_l + Σ_{s=l}^{S_d} n^{¬dai}_{das··})]

and p(b_dai = S_d | b^{¬dai}, a, b) = 1 − Σ_{l=1}^{S_d−1} p(b_dai = l | b^{¬dai}, a, b). Notice that the stick breaking property of GSBP is clearly visible here. The posterior probability of selecting a topic vector for a word is then:

    p(b_dai = j | b^{¬dai}, z, γ_dj)
        ∝ p(z_dai | b_dai = j, z^{¬dai}, γ_dj) p(b_dai = j | b^{¬dai})
        = [(γ_{dj z_dai} + n^{¬dai}_{d·j z_dai·}) / (Σ_k γ_djk + n^{¬dai}_{d·j··})] · p(b_dai = j | b^{¬dai})    (6)

Equations 3, 4 and 6 together form the inference procedure of STM.

Discussion: Note that when J = 1 we get a model equivalent to FTM, and in that way we obtain an alternative inference procedure for FTM. Recall that in (Williamson et al., 2010) the binary vectors are integrated out using an approximation when computing the conditional for the topic assignment variables z, and the binary vectors are sampled to compute the ω_k. We have observed that both alternatives work equally well; however, sampling the binary vectors makes the inference simpler, at the cost of a marginally slower convergence rate (it matches up in likelihood in about 100 iterations) in the case of the truncated IBP.

5. Empirical Study

In this section we empirically study the proposed model STM on the special task of finding subtle cross-cutting concerns in software repositories. In addition, we apply STM to two text datasets which are apparently rich in subtle topics. This section is organized as follows. First we explain the challenges related to empirical evaluation and our approach under the limited scope. Then we describe the baselines, followed by the datasets used in the evaluation. Next, we discuss our results in two subsections, followed by a short discussion on the empirical findings [5].
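The GD-multinomial conjugacy that drives the sampling of b can be checked numerically. The sketch below (an illustration under the notation above, not the authors' code) evaluates the stick-breaking product form of p(b_dai = j | b^{¬dai}, a, b) from per-sentence counts and verifies it is a proper distribution:

```python
import numpy as np

def topic_vector_posterior(counts, a, b):
    """Posterior p(b = j | counts, a, b) for the collapsed GD-multinomial.

    counts[j]: how often topic vector j is currently used in the sentence
               (current word excluded); len(counts) == S_d.
    a, b:      (S_d - 1)-dim GD parameters.
    The j-th factor is the posterior mean of stick v_j, multiplied by the
    survival terms for l < j, matching the expression in the text.
    """
    S_d = len(counts)
    tail = np.cumsum(counts[::-1])[::-1]  # tail[j] = sum_{s >= j} counts[s]
    p = np.empty(S_d)
    survive = 1.0
    for j in range(S_d - 1):
        stick = (a[j] + counts[j]) / (a[j] + b[j] + tail[j])
        p[j] = survive * stick
        survive *= (b[j] + tail[j + 1]) / (a[j] + b[j] + tail[j])
    p[S_d - 1] = survive  # remaining mass: 1 - sum of the others
    return p

counts = np.array([3, 0, 1, 0])  # hypothetical usage counts, S_d = 4
p = topic_vector_posterior(counts, np.ones(3), np.ones(3))
assert np.isclose(p.sum(), 1.0)
```

Because each stick term and its survival term sum to one, the probabilities telescope to 1 exactly, which is the property that lets b be sampled without instantiating π_da or the v's.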
[5] For relevant resources see mllab.csa.iisc.ernet.in/stm.

5.1. Evaluation approach

We evaluate STM on two aspects: (1) modeling ability, using perplexity and topic coherence; and (2) the ability to discover subtle topics. For the first, we use standard metrics. The second aspect, however, is not easy to evaluate. Unlike semantic coherence, it is difficult for a human to judge the subtlety of a topic by looking at a few top words. Moreover, subtlety is relative to the dataset, i.e. a topic may be subtle with respect to one dataset but prominent in another. As an alternative to human judgment, we can check how well a model finds known or pre-defined subtle topics. But it is difficult to find a dataset with a set of pre-defined subtle topics as a gold standard to compare against. Given this unavailability, we manually create our gold standard, as explained later, and provide the complete list in the supplementary material.

5.2. Baselines

For evaluation, we compare with HDP, FTM [6] and MG-LDA (Titov & McDonald, 2008). MG-LDA has the ability to discover local topics which might be missed by HDP or FTM. Although subtle topics may not localize properly inside a document, it is useful to benchmark STM against MG-LDA.

5.3. Datasets & pre-processing

NIPS-05: A corpus of 207 accepted papers from the proceedings of Neural Information Processing Systems, 2005 [7] (Globerson et al., 2007).

Obama-speech: A collection of public speeches by Barack Obama from July 27, 2004 till October 30, 2012 [8], comprising 142 articles transcribed directly from audio.

BerkeleyDB: We selected Berkeley DB Java Edition as our software dataset (Apel et al., 2009). As of 2012, Berkeley DB is the most widely used database toolkit in the world [9], and it is known to have a wide range of cross-cutting concerns.

JHotDraw: JHotDraw is a well known open source GUI framework for drawing technical and structured graphics [10]. We selected JHotDraw as LDA is observed to find a good set of concerns in it.

[6] using inference as in (Williamson et al., 2010)
[7] nips.cc/Conferences/2005
[8] www.americanrhetoric.com/barackobamaspeeches.htm
[9] en.wikipedia.org/wiki/Berkeley DB
[10] www.jhotdraw.org/

For the software datasets, only the textual content (without programming syntax) of the '.java' files (no documentation etc.) is used as input. Each statement is treated as a sentence, and Java keywords [11] are removed, but common Java library names are retained. Tokens like StringCopy are split into the two words String and Copy based on the position of a capital letter inside the token. For all datasets, we removed standard English stop words, digits, sentences shorter than 20 characters and words shorter than 3 characters, and converted upper case to lower case. For the NIPS dataset we used the most frequent 5000 words; in the other cases we used the full vocabulary. We used the value 1 for the hyperparameters η, α, a_j and b_j, and 100 for ρ. We ran all the models for 2000 iterations (sufficient for all models to converge in terms of log-likelihood), and used the truncation parameter K as 100 (adequate considering the size of our datasets).

[11] en.wikipedia.org/wiki/List of Java keywords

5.4. Evaluation on perplexity & coherence

We randomly picked one-third of each dataset as a held-out set and used the standard definition of perplexity as found in (Blei et al., 2003). A lower perplexity means the model fits the dataset better. By approximating the user experience of topic quality on the W top words of a topic, topic coherence (TC) can be measured as:

    TC(W) = Σ_i Σ_{j<i} log [(D(w_i, w_j) + ε) / D(w_j)]

where D(w) is the document frequency of word w, and D(w_i, w_j) is the document frequency of w_i and w_j together (Mimno et al., 2011). ε is a small constant to avoid log zero. Values closer to zero indicate better coherence. We used the top 5 words to compute the coherence of a topic.

Table 1. Comparison on perplexity (top) and topic coherence (bottom). STM achieves the lowest perplexity with good coherence.

    Held-out data perplexity
    Dataset         HDP   MG-LDA    FTM    STM
    BerkeleyDB      182      127     80     60
    JHotDraw        131      156     93     81
    NIPS-05         941     2107    413    402
    Obama-speech   3591     4721    901    582
    Average        1211     1778    372    281

    Average topic coherence
    Dataset         HDP   MG-LDA    FTM    STM
    BerkeleyDB    -58.6    -49.6  -20.3  -27.9
    JHotDraw      -80.9    -94.2  -37.9  -28.2
    NIPS-05       -78.1    -43.7  -45.1  -37.7
    Obama-speech  -72.4    -53.2  -67.2  -52.5
    Average       -72.5    -59.9  -42.6  -36.6

Table 1 contains the results on perplexity and average topic coherence for all the datasets. We observe that STM is a better model than all the others in terms of held-out data perplexity, and in coherence in most of the cases. Note that the ability of STM to detect subtle topics lies in splitting the co-occurrence domain; however, this introduces a mild difficulty in learning normal topics. Hence coherence may suffer a little, as is also observed in Table 1.

5.5. Evaluation in detecting subtle topics

Measure of subtlety: We define the degree of subtlety of topic k as DoS(k) = Π_{d=1}^{D} (1 − p_dk), where p_dk = Σ_{w∈K_k} Σ_{i=1}^{S_d} I[w ∈ S_di] / (|K_k| S_d). K_k is the set of keywords describing topic k, and I[w ∈ S_di] is 1 if word w is present in sentence i of document d. Note that 0 ≤ DoS ≤ 1. The value of DoS increases if a rare word is included in K_k, and decreases if a frequent word is inserted.

Gold standard: To compare performance on subtle topics, we hand-picked topics from each corpus whose DoS is greater than 0.2, which we consider a reasonably challenging degree of subtlety. For comparison, however, we considered only those hand-picked topics for which at least 75% of the keywords are retrieved among the top 5 words by at least one method in the comparison. We compute the recall of each topic by comparing its top five words with K_k of each gold standard topic. A topic is said to match the gold-standard topic with which it has the highest recall; if that recall is less than 0.75, we say the topic is not detected by the model. The reason for keeping the recall threshold high is that some keywords may be popular and part of normal topics; detecting such a keyword alone does not signify detecting the subtle topic in question.

Following the above approach, we hand-picked 11, 10, 21 and 16 topics respectively from the BerkeleyDB, JHotDraw, NIPS and Obama-speech datasets. For each gold standard topic we consider the recall and coherence of the best matched topic for each model, then average over all the gold standard topics for each dataset and report the results in Table 2. Table 3 contains the results on the fraction of subtle topics detected by each model. A complete list can be found in the supplementary material.

Table 2. Comparison on average recall and average topic coherence considering only the gold standard topics with DoS greater than 0.2.

                                HDP   MG-LDA     FTM     STM
    BerkeleyDB    Coherence  -48.48   -42.89  -40.11  -21.99
                  Recall       0.42     0.59    0.68    0.94
    JHotDraw      Coherence  -36.64   -52.13  -43.26  -46.17
                  Recall       0.31     0.16    0.42    0.97
    NIPS-05       Coherence  -40.23   -38.55  -37.38  -34.84
                  Recall       0.41     0.49    0.56    0.79
    Obama-speech  Coherence  -32.65   -22.03  -53.15  -40.6
                  Recall       0.26     0.27    0.58    0.95

Table 3. Fraction of subtle topics (DoS ≥ 0.5) detected (recall ≥ 0.75) by all the models.

                   HDP   MG-LDA    FTM    STM
    BerkeleyDB       0     0.25   0.38    1.0
    JHotDraw         0        0   0.29   0.86
    NIPS-05          0        0   0.07   0.69
    Obama-speech  0.14     0.07   0.28   0.93

Table 4. Five example subtle topics (DoS ≥ 0.5) detected (recall ≥ 0.75) by STM. Among them, '*' marked topics are also detected by MG-LDA and '!' marked topics are also detected by FTM. HDP could not detect any of them.

    BerkeleyDB                    JHotDraw               NIPS                           Obama-speech
    transaction, checkpoint,      nano, xmldom           chip, processing,              cyber, security,
      recovery                                             architecture, circuit          internet
    stats, count                  roundrect              cochlear, cochlea              school, students, college
    *checksum, validate, errors   !rendering             topic, model, topics,          carnegie, mellon,
                                                           dirichlet                      technology
    !trace, level, info, config   zoom, factor           walk, walks, steering, robot   deficit, cuts, budget
    verify, config, keys          collection, family,    video, texture,                regulations, infrastructure,
                                    families               resolution, image              employees

In Table 4 we provide five example subtle topics from each dataset which are detected by STM but hardly detected by the other models.

5.6. Discussion

Subtle topics in many cases consist of rare words. For example, the "cochlea" topic is subtle due to the rareness of its keywords and is detected only by STM. In certain cases some keywords may not be rare, but it is the combination of words that makes the topic subtle to detect. For example, in Berkeley-DB the topic "trace, level, info, config" consists of four words of which trace is not a rare word. The topic as a whole signifies the ability to configure trace levels, and it manifests in a very localized fashion in the Tracer class; hence it can be detected by FTM, but not by HDP or MG-LDA. On the other hand, the cross-cutting concern "checksum, errors, validate" appears subtly in individual files but is diffused widely across the corpus, and hence it can be detected by MG-LDA, while HDP and FTM fail. Not only does STM detect all of these topics, it also succeeds in detecting the more interesting cross-cutting concern "verify", which manifests subtly in every file and therefore eludes HDP, FTM and MG-LDA. Another example of an important yet subtle topic detected only by STM is "cyber security" on the internet, a topic that indicates policies or priorities of President Obama.

6. Conclusion

The utility of a topic is not necessarily linked to its prevalence in a corpus. When topic models are used for automatically discovering concerns, the inability to discover important or interesting concerns just because they are subtly manifested in the source code can be a critical drawback. In this paper we propose a novel model, namely STM, to address this problem for the first time in the literature. STM, by using multiple distributions over topics per document, is observed to effectively discover subtle topics where state-of-the-art models fail. This is a promising result for advancing the state of the art in the difficult problem of automatic concern discovery in software engineering. On empirical evaluation, we find STM outperforms state-of-the-art models not only on subtle topics but also in general, with low perplexity on unseen data and good topic coherence.

Acknowledgments

We are thankful to all the reviewers for their valuable comments. The authors MKD and CB were partially supported by DST grant (DST/ECA/CB/1101).
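The DoS measure of Section 5.5 is simple to compute from sentence-segmented text. The sketch below (an illustration with toy data, not the authors' evaluation code) scores a keyword set against a small corpus; rare keywords push DoS toward 1, frequent ones toward 0:

```python
def degree_of_subtlety(keywords, corpus):
    """DoS(k) = prod_d (1 - p_dk), where p_dk averages, over the keywords,
    the fraction of sentences of document d that contain them.

    corpus: list of documents; each document is a list of sentences;
            each sentence is a set of words.
    """
    dos = 1.0
    for doc in corpus:
        # sum over keywords of the number of sentences containing them
        hits = sum(1 for w in keywords for sent in doc if w in sent)
        p_dk = hits / (len(keywords) * len(doc))
        dos *= 1.0 - p_dk
    return dos

# Toy corpus contrasting a frequent keyword with a rare one.
corpus = [
    [{"count", "stats"}, {"update", "count"}, {"verify", "keys"}],
    [{"count", "waits"}, {"requests", "count"}, {"trace", "level"}],
]
print(degree_of_subtlety({"count"}, corpus))   # low: "count" is frequent
print(degree_of_subtlety({"verify"}, corpus))  # high: "verify" is rare
```

On this toy corpus "count" appears in two of three sentences of both documents, giving a small DoS, while "verify" appears in a single sentence of one document, giving a large DoS, matching the intended behavior of the measure.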
References

Apel, Sven, Kastner, Christian, and Lengauer, Christian. FEATUREHOUSE: Language-Independent, Automated Software Composition. In Proceedings of the 31st International Conference on Software Engineering (ICSE), pp. 221-231, 2009.

Baldi, Pierre F., Lopes, Cristina V., Linstead, Erik J., and Bajracharya, Sushil K. A theory of aspects as latent topics. In Proceedings of the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), pp. 543-562, 2008.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet Allocation. The Journal of Machine Learning Research (JMLR), 3:993-1022, 2003.

Doshi-Velez, Finale, Miller, Kurt T., Gael, Jurgen Van, and Teh, Yee Whye. Variational Inference for the Indian Buffet Process. In Proceedings of the Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), pp. 137-144, 2009.

Eaddy, M., Aho, A. V., Antoniol, G., and Gueheneuc, Y. G. CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis. In International Conference on Program Comprehension (ICPC), pp. 53-62, 2008.

Globerson, A., Chechik, G., Pereira, F., and Tishby, N. Euclidean Embedding of Co-occurrence Data. The Journal of Machine Learning Research (JMLR), 8:2265-2295, 2007.

Griffiths, T. L., Ghahramani, Z., and Sollich, Peter. Bayesian Nonparametric Latent Feature Models. In Bayesian Statistics, pp. 201-225, 2007.

Ishwaran, H. and James, L. F. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161-173, 2001.

Kastner, Christian, Apel, Sven, and Batory, Don. A Case Study Implementing Features Using AspectJ. In Proceedings of the 11th International Software Product Line Conference (SPLC), pp. 223-232, 2007.

Marin, Marius, Deursen, Arie Van, and Moonen, Leon. Identifying crosscutting concerns using fan-in analysis. ACM Transactions on Software Engineering and Methodology (TOSEM), 17, 2007.

Mimno, David, Wallach, Hanna, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 262-272, 2011.

Revelle, Meghan, Dit, Bogdan, and Poshyvanyk, Denys. Using data fusion and web mining to support feature location in software. In Proceedings of the 2010 IEEE 18th International Conference on Program Comprehension (ICPC), pp. 14-23, 2010.

Robillard, M. P. Topology Analysis of Software Dependencies. ACM Transactions on Software Engineering and Methodology (TOSEM), (4), 2008.

Savage, T., Revelle, M., and Poshyvanyk, D. FLAT3: Feature Location and Textual Tracing Tool. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE) - Volume 2, pp. 255-258, 2010a.

Savage, Trevor, Dit, Bogdan, Gethers, Malcom, and Poshyvanyk, Denys. TopicXP: Exploring topics in source code using Latent Dirichlet Allocation. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM), pp. 1-6, 2010b.

Teh, Y., Jordan, M. I., and Beal, M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, pp. 1566-1581, 2007.

Titov, Ivan and McDonald, Ryan. Modeling Online Reviews with Multi-grain Topic Models. In Proceedings of the 17th International Conference on World Wide Web (WWW), pp. 111-120, 2008.

Wen, Bo and Boahen, Kwabena. Active Bidirectional Coupling in a Cochlear Chip. In Advances in Neural Information Processing Systems (NIPS), 2005.

Williamson, Sinead, Wang, Chong, Heller, Katherine A., and Blei, David M. The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Wong, T. T. Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97:165-181, 1998.