Bin Wu wubin@bupt.edu.cn
Bai Wang wangbai@bupt.edu.cn
Chuan Shi shichuan@bupt.edu.cn
Le Yu yulebupt@gmail.com
Beijing Key Lab of Intelligent Telecommunication Software and Multimedia
Beijing University of Posts and Telecommunications, Beijing 100876, China
With the advent of the era of big data, the amount of data in our world has been exploding, and analyzing large datasets has become a key task in many areas. The LDA model, with the collapsed Gibbs sampling algorithm in common use, has been broadly applied in machine learning and data mining, particularly in classification, recommendation and web search, where both high accuracy and high speed are required. Some improvements based on variational Bayesian inference have been explored for adapting to big data, such as Teh et al. (2006), Nallapati et al. (2007) and Wolfe et al. (2008). The Collapsed Variational Bayes (CVB) algorithm has been implemented in Mahout, and Wen et al. (2013) improved it by combining GPU and Hadoop. CVB converges faster than collapsed Gibbs sampling, but the latter attains a better solution in the end given enough samples. So there is a significant motivation to speed up collapsed Gibbs sampling for the LDA model.

In this general context we introduce our parallel collapsed Gibbs sampling algorithm and demonstrate how to implement it on Spark, a new in-memory cluster computing framework proposed by Zaharia et al. (2010, 2012). The key idea of our algorithm is to reduce the time taken for communication and synchronization: only a part of the global parameters is transferred in parallel during communication, and no complex calculation is needed during synchronization. A disadvantage of our algorithm, however, is that the number of iterations increases significantly. To overcome this problem, we adopt Spark, which is very well suited to iterative and interactive algorithms, to implement our method.

We first briefly review the related work on collapsed Gibbs sampling in Section 2; some prior knowledge, including the LDA model, collapsed Gibbs sampling for LDA and Spark, is reviewed in Section 3; in Section 4 we describe the details of our distributed approach on Spark via a simple example; in Section 5 we present perplexity and speedup results; finally, we discuss future research plans in Section 6.

2. Related Work

Various implementations and improvements have been explored for speeding up the LDA model. Relevant collapsed Gibbs sampling methods are as follows.

Newman et al. (2007, 2009) proposed two versions of LDA where the data and the parameters are distributed over distinct processors: the Approximate Distributed LDA model (AD-LDA) and the Hierarchical Distributed LDA model (HD-LDA). In AD-LDA, they simply run LDA on each processor and update the global parameters after each local Gibbs sampling iteration. HD-LDA can be viewed as a mixture model with P LDA models; it optimizes the correct posterior quantity but is more complex to implement and slower to run.

Porteous et al. (2008) presented a new sampling scheme, which produces exactly the same results as the standard sampling scheme but faster.

An asynchronous distributed version of LDA (Async-LDA) was introduced by Asuncion et al. (2008). In Async-LDA, each processor performs a local collapsed Gibbs sampling step followed by a step of communicating with another random processor to gain the benefits of asynchronous computing. Async-LDA has been improved in GraphLab, a graph-based parallel framework for machine learning proposed by Low et al. (2010).
… drawn with topic $k$ chosen with probability $\theta_{k|j}$. Then word $x_{ij}$ is drawn from the $z_{ij}$-th topic, with $x_{ij}$ taking on value $w$ with probability $\phi_{w|k}$, where $\phi_{w|k}$ is drawn from a Dirichlet prior with parameter $\beta$. Finally, the generative process is as follows:

$$\theta_{k|j} \sim \mathrm{Dir}(\alpha), \qquad \phi_{w|k} \sim \mathrm{Dir}(\beta), \qquad z_{ij} \sim \theta_{k|j}, \qquad x_{ij} \sim \phi_{w|z_{ij}} \tag{1}$$

where $\mathrm{Dir}(\cdot)$ represents the Dirichlet distribution. Figure 1 shows the graphical model representation of the LDA model.

Given the training data with $N$ words $\mathbf{x} = \{x_{ij}\}$, it is possible to infer the posterior distribution of the latent variables. An efficient procedure is to use collapsed Gibbs sampling, which samples the latent variables $\mathbf{z} = \{z_{ij}\}$ by integrating out $\theta_{k|j}$ and $\phi_{w|k}$. The conditional probability of $z_{ij}$ is computed as follows:

$$p(z_{ij} = k \mid \mathbf{z}^{\neg ij}, \mathbf{x}, \alpha, \beta) \propto \left(\alpha + n^{\neg ij}_{k|j}\right)\left(\beta + n^{\neg ij}_{x_{ij}|k}\right)\left(W\beta + n^{\neg ij}_{k}\right)^{-1} \tag{2}$$

where the superscript $\neg ij$ means that the corresponding data item is excluded from the count values, $n_{k|j}$ denotes the number of tokens in document $j$ assigned to topic $k$, $n_{x_{ij}|k}$ denotes the number of tokens of word $w$ assigned to topic $k$, and $n_k = \sum_w n_{w|k}$.
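To make the update rule in Eq. (2) concrete, here is a minimal single-machine sketch of one collapsed Gibbs step for a single token: the counts are decremented to obtain the $\neg ij$ values, a new topic is drawn from the unnormalised conditional, and the counts are restored. The count arrays `nDoc`, `nWord` and `nTopic` mirror $n_{k|j}$, $n_{w|k}$ and $n_k$; the helper itself is our own illustration under the paper's notation, not the authors' implementation.

```scala
import scala.util.Random

// One collapsed Gibbs update for a token of word w in document j,
// currently assigned to topic oldK, following Eq. (2).
def sampleTopic(j: Int, w: Int, oldK: Int,
                nDoc: Array[Array[Int]],   // nDoc(j)(k): tokens in document j assigned to topic k
                nWord: Array[Array[Int]],  // nWord(w)(k): tokens of word w assigned to topic k
                nTopic: Array[Int],        // nTopic(k): total tokens assigned to topic k (n_k)
                alpha: Double, beta: Double, W: Int, rng: Random): Int = {
  val K = nTopic.length
  // Exclude the current token from all counts (the ¬ij superscript in Eq. (2)).
  nDoc(j)(oldK) -= 1; nWord(w)(oldK) -= 1; nTopic(oldK) -= 1
  // Unnormalised conditional probability of each topic k.
  val p = Array.tabulate(K) { k =>
    (alpha + nDoc(j)(k)) * (beta + nWord(w)(k)) / (W * beta + nTopic(k))
  }
  // Draw a topic proportionally to p via the cumulative sum.
  val u = rng.nextDouble() * p.sum
  var k = 0
  var cum = p(0)
  while (cum < u && k < K - 1) { k += 1; cum += p(k) }
  // Re-include the token under its new assignment.
  nDoc(j)(k) += 1; nWord(w)(k) += 1; nTopic(k) += 1
  k
}
```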
3.2. Spark

Spark is a fast and general engine for large-scale data processing that can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. The project started as a research project at the UC Berkeley AMPLab in 2009 and was open sourced in early 2010. After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies).

Relative to Hadoop, Spark was designed to run more complex, multi-pass algorithms, such as the iterative algorithms that are common in machine learning and graph processing, or to execute more interactive ad hoc queries to explore the data. The core problem is that both multi-pass and interactive applications need to share data across multiple MapReduce steps. Unfortunately, the only way to share data between parallel operations in MapReduce is to write it to a distributed file system, which adds substantial overhead due to data replication and disk I/O. Indeed, it has been found that this overhead can take up more than 90% of the running time of common machine learning algorithms implemented on Hadoop. Spark overcomes this problem by providing a new storage primitive called the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines and provides fault tolerance without requiring replication, by tracking how to compute lost data from previous RDDs. The RDD is the key abstraction in Spark. Users can explicitly cache an RDD in memory or on disk across machines and reuse it in multiple parallel operations.

Spark includes MLlib, a library of machine learning algorithms for large data, covering classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as the underlying optimization primitives. However, no topic model algorithm has been added so far.

Figure 2: $(x, y)$ represent partitions. In this case, $X$ is split into 9 partitions, which are shuffled and recombined into $X_0$, $X_1$ and $X_2$.

We first put the dataset $X$ into $P$ partitions by dividing the columns into $P$ equal parts, then swap the columns to make the number of words in each partition closer. After that, we divide the columns of $N_w$ correspondingly to $X$. We then divide the rows into $P$ parts, so that the dataset $X$ has been split into $P \times P$ partitions, and make the number of words in the partitions that belong to the same sub-dataset after recombination closer. Though the number of rows is not necessarily equal, it is still very difficult to find an efficient way to solve this problem. Here, we use a simple randomized method: we swap the rows and calculate the difference between the maximum and minimum number of words over the partitions belonging to the same sub-dataset; a smaller difference means a better partitioning. We can run this method several times and select the best result, calling an RDD's cache method to tell Spark to try to keep the dataset in memory. It is fast, since the dataset is in memory and can be processed in parallel, and in our experiments it works well (a local sketch of this heuristic is given at the end of this section). Finally, we select $P$ non-conflicting partitions and recombine them into a sub-dataset. Repeating the process, we obtain $P$ sub-datasets.

Consider the sampling process for one sub-dataset. Each sub-dataset contains $P$ non-conflicting partitions, which can be distributed to $P$ processors; recall that $N_d$ and $N_w$ have been distributed, too. So we can shuffle $N_d$ and $N_w$ corresponding to the partitions, placing the parts that contain the same documents and words on one processor. Then the standard collapsed Gibbs sampling algorithm is executed on each processor, and $N_d$ and $N_w$ are updated at the same time. After completing the sampling of all partitions in a sub-dataset, we calculate $N_k$ from $\tilde{N}_{pk}$, which tracks the changes made to $N_{pk}$; $N_k$ is then broadcast to all processors and stored as $N_{pk}$ (a sketch of this synchronization step is also given at the end of this section). This process is illustrated via a simple example in Figure 3. Thus we have completed the sampling process for one sub-dataset, and we then repeat it for the remaining sub-datasets. After sampling all the sub-datasets, we have completed one global iteration.

To implement Algorithm 1 on Spark, there are two issues that must be resolved. One is that we are required to operate on three RDDs, $X_p$, $N_d$ and $N_w$, during the sampling process.
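As a rough illustration of the randomized balancing method described above, the sketch below tries several random row orders, splits each order into $P$ contiguous groups, and keeps the order with the smallest gap between the largest and smallest group token counts. In the paper this scoring is done in parallel on a dataset cached in memory with Spark; here it is written as a local helper for clarity, and the name `balanceRows` and its exact scoring rule are our own assumptions.

```scala
import scala.util.Random

// rowTokens(i): number of word tokens in row (document) i.
// Returns the row order whose P contiguous groups are most balanced.
def balanceRows(rowTokens: Array[Long], P: Int, trials: Int, rng: Random): Array[Int] = {
  def imbalance(order: Array[Int]): Long = {
    val groupSize = math.ceil(order.length.toDouble / P).toInt
    val totals = order.grouped(groupSize).map(_.map(i => rowTokens(i)).sum).toSeq
    totals.max - totals.min // smaller gap means better-balanced partitions
  }
  // Try several random row permutations and keep the most balanced one.
  val candidates = Seq.fill(trials)(rng.shuffle(rowTokens.indices.toList).toArray)
  candidates.minBy(imbalance)
}
```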
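The synchronization step described in the sampling process above (summing the per-processor deltas $\tilde{N}_{pk}$ into a new global $N_k$ and broadcasting it back) might look roughly as follows with Spark's RDD API. The names `syncTopicCounts`, `globalNk` and `deltas` are hypothetical; only `reduce` and `broadcast` are standard Spark operations.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Each processor p records in a delta vector the changes it made to the
// topic counts N_k while sampling its partition. After all P partitions
// of a sub-dataset finish, sum the deltas and broadcast the updated N_k.
def syncTopicCounts(sc: SparkContext, globalNk: Array[Long],
                    deltas: RDD[Array[Long]]): Broadcast[Array[Long]] = {
  // Element-wise sum of the per-processor delta vectors.
  val totalDelta = deltas.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  val newNk = globalNk.zip(totalDelta).map { case (n, d) => n + d }
  sc.broadcast(newNk) // each worker stores this read-only copy as its N_pk
}
```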
Figure 4: Test set perplexity versus number of processors $P$ for KOS (left) and NIPS (right).

Figure 5: Test set perplexity versus number of topics $K$ for KOS (left) and NIPS (right).

Figure 4 shows the test set perplexity for KOS and NIPS. We used standard LDA to compute perplexity for $P = 1$ and Spark-LDA for $P = 8, 16, 24$. The figure clearly shows that there was no significant difference in perplexity between LDA and Spark-LDA for varying numbers of topics, which suggests that our algorithm converged to models with the same predictive power as standard LDA. We also used Spark-LDA to compute perplexities for KOS with $K = 128, 256$ and NIPS with $K = 80, 160$, and the results supported the same conclusion. Because the perplexities increased as the number of topics increased under the current parameters, which made the figure confusing, we only show part of these results in Figure 5. Limited by our hardware, we did not perform experiments on more than 24 processors.

The rate of convergence is shown in Figure 6. As the number of processors increased, the rate of convergence slowed down in our experiments, since the information for sampling on each processor was not accurate enough. However, all of these runs eventually converged to the same range after the burn-in period of 1000 iterations.

Figure 6: Test set perplexity versus iteration on KOS, $K = 8$.

We performed speedup experiments on two small datasets (KOS, NIPS) and one large dataset (NYT). The speedup was computed as

$$\mathrm{Speedup} = \frac{1}{(1 - S) + S/P}$$

where $S$ is the fraction of time spent in sampling and $P$ is the number of processors; for example, $S = 0.9$ and $P = 8$ give a speedup of $1/(0.1 + 0.9/8) \approx 4.7$. The results are shown in Figure 7 (left). The figure clearly shows that speedup increased with dataset size. This reflects that our algorithm performs well on large datasets, while the effect is not significant on small datasets.

Figure 7: Speedup of Spark-LDA (left) and the proportion of the sampling time per iteration (right) on KOS, NIPS and NYT.

In order to analyze the causes of this phenomenon, we then measured the proportion of the sampling time per iteration and show the results in Figure 7 (right).

References

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.

Ramesh Nallapati, William Cohen, and John Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 349-354. IEEE, 2007.

David Newman, Arthur U. Asuncion, Padhraic Smyth, and Max Welling. Distributed inference for latent Dirichlet allocation. In NIPS, volume 20, pages 1081-1088, 2007.

David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models. The Journal of Machine Learning Research, 10:1801-1828, 2009.

Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569-577. ACM, 2008.

Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, volume 6, pages 1378-1385, 2006.

Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. Algorithmic Aspects in Information and Management, pages 301-314, 2009.

La Wen, Jianwu Rui, Tingting He, and Liang Guo. Accelerating hierarchical distributed latent Dirichlet allocation algorithm by parallel GPU. Journal of Computer Applications, 33(12):3313-3316, 2013.

Jason Wolfe, Aria Haghighi, and Dan Klein. Fully distributed EM for very large datasets. In Proceedings of the 25th International Conference on Machine Learning, pages 1184-1191. ACM, 2008.

Han Xiao and Thomas Stibor. Efficient collapsed Gibbs sampling for latent Dirichlet allocation. Journal of Machine Learning Research - Proceedings Track, 13:63-78, 2010.

Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134-2142, 2009.

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10-10, 2010.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2-2. USENIX Association, 2012.