JMLR Workshop and Conference Proceedings (BIGMINE)

Bin Wu (wubin@bupt.edu.cn), Bai Wang (wangbai@bupt.edu.cn), Chuan Shi (shichuan@bupt.edu.cn), Le Yu (yulebupt@gmail.com). Beijing Key Lab of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China.

With the advent of the era of big data, the amount of data in our world has been exploding, and analyzing large datasets has become a key task in many areas. The LDA model, with the collapsed Gibbs sampling algorithm in common use, has been broadly applied in machine learning and data mining, particularly in classification, recommendation and web search, where both high accuracy and high speed are required. Some improvements based on variational Bayesian inference have been explored for adapting to big data, such as Teh et al. (2006), Nallapati et al. (2007) and Wolfe et al. (2008). The Collapsed Variational Bayes (CVB) algorithm has been implemented in Mahout, and Wen et al. (2013) improved it by combining GPU and Hadoop. CVB converges faster than collapsed Gibbs sampling, but the latter attains a better solution in the end given enough samples. So there is a significant motivation to speed up collapsed Gibbs sampling for the LDA model.

In this general context we introduce our parallel collapsed Gibbs sampling algorithm and demonstrate how to implement it on Spark, a new in-memory cluster computing framework proposed by Zaharia et al. (2010, 2012). The key idea of our algorithm is to reduce the time taken for communication and synchronization: only a part of the global parameters is transferred in parallel during communication, and no complex calculation is needed during synchronization. However, a disadvantage of our algorithm is that the number of iterations increases significantly. To overcome this problem, we adopt Spark, which is very well suited to iterative and interactive algorithms, to implement our method.

We first briefly review the related work on collapsed Gibbs sampling in Section 2; some prior knowledge, including the LDA model, collapsed Gibbs sampling for LDA, and Spark, is reviewed in Section 3; in Section 4 we describe the details of our distributed approach on Spark via a simple example; in Section 5 we present perplexity and speedup results; finally, we discuss future research plans in Section 6.

2. Related Work

Various implementations and improvements have been explored for speeding up the LDA model. Relevant collapsed Gibbs sampling methods are as follows. Newman et al. (2007, 2009) proposed two versions of LDA where the data and the parameters are distributed over distinct processors: the Approximate Distributed LDA model (AD-LDA) and the Hierarchical Distributed LDA model (HD-LDA). In AD-LDA, they simply run LDA on each processor and update the global parameters after each local Gibbs sampling iteration. HD-LDA can be viewed as a mixture model with P LDA models; it optimizes the correct posterior quantity but is more complex to implement and slower to run. Porteous et al. (2008) presented a new sampling scheme which produces exactly the same results as the standard sampling scheme but faster. An asynchronous distributed version of LDA (Async-LDA) was introduced by Asuncion et al. (2008). In Async-LDA, each processor performs a local collapsed Gibbs sampling step followed by a step of communicating with another random processor, to gain the benefits of asynchronous computing. Async-LDA has been improved in GraphLab, a graph-based parallel framework for machine learning proposed by Low et al. (2010).

[...] the topic $z_{ij}$ is drawn with topic k chosen with probability $\theta_{k|j}$. Then word $x_{ij}$ is drawn from the $z_{ij}$-th topic, with $x_{ij}$ taking on value w with probability $\phi_{w|k}$, where $\phi_{w|k}$ is drawn from a Dirichlet prior with parameter $\beta$. Finally, the generative process is:

$$\theta_{k|j} \sim \mathrm{Dir}(\alpha), \quad \phi_{w|k} \sim \mathrm{Dir}(\beta), \quad z_{ij} \sim \theta_{k|j}, \quad x_{ij} \sim \phi_{w|z_{ij}} \qquad (1)$$

where $\mathrm{Dir}(\cdot)$ represents the Dirichlet distribution. Figure 1 shows the graphical model representation of the LDA model.

Given the training data with N words $x = \{x_{ij}\}$, it is possible to infer the posterior distribution of the latent variables. An efficient procedure is to use collapsed Gibbs sampling, which samples the latent variables $z = \{z_{ij}\}$ by integrating out $\theta_{k|j}$ and $\phi_{w|k}$. The conditional probability of $z_{ij}$ is computed as follows:

$$p(z_{ij}=k \mid z^{\neg ij}, x, \alpha, \beta) \propto \frac{(\alpha + n^{\neg ij}_{k|j})\,(\beta + n^{\neg ij}_{x_{ij}|k})}{W\beta + n^{\neg ij}_{k}} \qquad (2)$$

where the superscript $\neg ij$ means that the corresponding data item is excluded from the count values, $n_{k|j}$ denotes the number of tokens in document j assigned to topic k, $n_{x_{ij}|k}$ denotes the number of tokens with word w assigned to topic k, $n_k = \sum_w n_{w|k}$, and $W$ is the vocabulary size.
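To make equation (2) concrete, here is a minimal sketch of a single collapsed Gibbs update for one token. It is an illustrative reconstruction rather than the authors' implementation; `nDocTopic`, `nWordTopic`, `nTopic`, `alpha`, `beta` and `vocabSize` are hypothetical names for the counts and hyperparameters $n_{k|j}$, $n_{x_{ij}|k}$, $n_k$, $\alpha$, $\beta$ and $W$ defined above.

```scala
import scala.util.Random

// Sketch of one collapsed Gibbs update for a single token, following equation (2).
// nDocTopic(k)  ~ n_{k|j}    : tokens in document j currently assigned to topic k
// nWordTopic(k) ~ n_{x_ij|k} : tokens of the current word assigned to topic k
// nTopic(k)     ~ n_k        : total tokens assigned to topic k
// All three counts are assumed to already exclude the token being resampled (the "neg ij" counts).
def sampleTopic(nDocTopic: Array[Int], nWordTopic: Array[Int], nTopic: Array[Int],
                alpha: Double, beta: Double, vocabSize: Int, rng: Random): Int = {
  val numTopics = nTopic.length
  val weights   = new Array[Double](numTopics)
  var total     = 0.0
  var k = 0
  while (k < numTopics) {
    // Unnormalized conditional p(z_ij = k | ...) from equation (2)
    weights(k) = (alpha + nDocTopic(k)) * (beta + nWordTopic(k)) /
                 (vocabSize * beta + nTopic(k))
    total += weights(k)
    k += 1
  }
  // Roulette-wheel draw proportional to the weights
  var u = rng.nextDouble() * total
  k = 0
  while (k < numTopics - 1 && u > weights(k)) { u -= weights(k); k += 1 }
  k
}
```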
3.2. Spark

Spark is a fast and general engine for large-scale data processing that can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. The project started as a research project at the UC Berkeley AMPLab in 2009 and was open sourced in early 2010. After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies).

Relative to Hadoop, Spark was designed to run more complex, multi-pass algorithms, such as the iterative algorithms that are common in machine learning and graph processing, or to execute more interactive ad hoc queries to explore the data. The core problem is that both multi-pass and interactive applications need to share data across multiple MapReduce steps. Unfortunately, the only way to share data between parallel operations in MapReduce is to write it to a distributed file system, which adds substantial overhead due to data replication and disk I/O. Indeed, it has been found that this overhead can take up more than 90% of the running time of common machine learning algorithms implemented on Hadoop. Spark overcomes this problem by providing a new storage primitive called the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines and provides fault tolerance without requiring replication, by tracking how to recompute lost data from previous RDDs. The RDD is the key abstraction in Spark. Users can explicitly cache an RDD in memory or on disk across machines and reuse it in multiple parallel operations.

Spark includes MLlib, a library of machine learning algorithms for large data, covering classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives. However, no topic model algorithm has been added so far.

[...]

Figure 2: (x, y) represent partitions. In this case, X is split into 9 partitions, which are shuffled and recombined into X0, X1 and X2.

We first put the dataset X into P partitions by dividing the columns into P equal parts, and then swap the columns to make the number of words in each partition closer; after that, we divide the columns of $N_w$ correspondingly to X. We then divide the rows into P parts, so that the dataset X has been split into P x P partitions, and again make the number of words in each partition that belongs to the same sub-dataset after recombination closer. Though the number of rows in each part is not necessarily equal, it is still very difficult to find an efficient way to solve this problem. Here, we use a simple randomized method: we swap the rows and calculate the difference between the maximum and minimum number of words over the partitions that belong to the same sub-dataset; a smaller difference means a better partitioning. We can run this method several times and select the best result. By calling an RDD's cache method to tell Spark to try to keep the dataset in memory, this step is fast, since the dataset is in memory and can be computed in parallel. In our experiments it works well. Finally, we select P non-conflicting partitions and recombine them into a sub-dataset. Repeating the process, we obtain P sub-datasets.
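The following is a rough, purely local illustration of the randomized balancing step just described. It is a sketch under assumed data structures (a hypothetical `rowWordCounts` array giving the token count of each row), not the paper's Spark code, and it shows only the selection criterion rather than the parallel search over a cached RDD.

```scala
import scala.util.Random

// Sketch: among several random row permutations, keep the one that best balances
// the number of word tokens across the P row blocks of the document-word matrix.
// rowWordCounts(r) is assumed to hold the total token count of row r.
def balanceRows(rowWordCounts: Array[Long], p: Int, tries: Int, seed: Long): Vector[Int] = {
  val rng  = new Random(seed)
  val rows = rowWordCounts.indices.toVector

  // Difference between the heaviest and lightest block; smaller means better balance.
  def spread(perm: Vector[Int]): Long = {
    val blockSize = math.ceil(perm.size.toDouble / p).toInt
    val totals = perm.grouped(blockSize).map(_.map(r => rowWordCounts(r)).sum).toVector
    totals.max - totals.min
  }

  // Try several random permutations and keep the most balanced one.
  Iterator.fill(tries)(rng.shuffle(rows)).minBy(spread)
}
```

In the paper this selection runs over the dataset kept in memory via the RDD cache; the sketch above only spells out the balance criterion.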
Consider the sampling process for one sub-dataset. Each sub-dataset contains P non-conflicting partitions, which can be distributed to P processors; recall that $N_d$ and $N_w$ have been distributed as well. So we can shuffle $N_d$ and $N_w$ correspondingly to the partitions, and place the pieces that contain the same documents and words on the same processor. Then the standard collapsed Gibbs sampling algorithm is executed on each processor, and $N_d$ and $N_w$ are updated at the same time. After completing the sampling of all partitions in a sub-dataset, we compute $N_k$ from $\tilde{N}_k^p$, which tracks the changes made to the local copies $N_k^p$; $N_k$ is then broadcast to all processors and stored as $N_k^p$. This process is illustrated via a simple example in Figure 3. We have thus completed the sampling process for one sub-dataset, and we then repeat this process for the remaining sub-datasets. After sampling all the sub-datasets, one global iteration is complete.

To implement Algorithm 1 on Spark, there are two issues that must be resolved. One is that three RDDs, $X_p$, $N_d$ and $N_w$, have to be operated on during the sampling process. [...]

Figure 4: Test set perplexity versus number of processors P for KOS (left) and NIPS (right).

Figure 5: Test set perplexity versus number of topics K for KOS (left) and NIPS (right).

Figure 4 shows the test set perplexity for KOS and NIPS. We used standard LDA to compute perplexity for P = 1 and Spark-LDA for P = 8, 16, 24. The figure clearly shows that the perplexity results show no significant difference between LDA and Spark-LDA for varying numbers of topics. This suggests that our algorithm converged to models having the same predictive power as standard LDA. We also used Spark-LDA to compute perplexities for KOS with K = 128, 256 and NIPS with K = 80, 160, and the results supported the same conclusion. Because the perplexities increased as the number of topics increased under the current parameters, which made the figure confusing, we only show part of these results in Figure 5. Limited by the hardware conditions, we did not perform experiments on more than 24 processors.

The rate of convergence is shown in Figure 6. As the number of processors increased, the rate of convergence slowed down in our experiments, since the information available for sampling on each processor was not accurate enough. However, all of these runs eventually converged to the same range after a burn-in period of 1000 iterations.

Figure 6: Test set perplexity versus iteration on KOS, K = 8.

We performed speedup experiments on two small datasets (KOS, NIPS) and one large dataset (NYT). The speedup was computed as $\mathrm{Speedup} = \frac{1}{(1-S) + S/P}$, where S is the proportion of sampling time and P is the number of processors. The results are shown in Figure 7 (left). The figure clearly shows that the speedup increased with the dataset size. This reflects that our algorithm performs well on a large dataset, while the effect is not significant on small datasets.
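As a quick numerical illustration of the speedup formula (the value of S below is made up for illustration, not a measurement from the paper):

```scala
// Speedup formula from above: only the sampling fraction S of each iteration parallelizes.
def speedup(s: Double, p: Int): Double = 1.0 / ((1.0 - s) + s / p)

// Illustrative values only: if sampling takes 95% of an iteration, then even with
// P = 24 processors the overall speedup is bounded at roughly 11x.
println(f"${speedup(0.95, 24)}%.1f")  // 11.2
```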
Figure 7: Speedup of Spark-LDA (left) and the proportion of the sampling time per iteration (right) on KOS, NIPS and NYT.

In order to analyze the causes of this phenomenon, we then measured the proportion of the sampling time per iteration and show the results in Figure 7 (right). With the [...]

References

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.

Ramesh Nallapati, William Cohen, and John Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. In Seventh IEEE International Conference on Data Mining Workshops (ICDM Workshops 2007), pages 349-354. IEEE, 2007.

David Newman, Arthur U. Asuncion, Padhraic Smyth, and Max Welling. Distributed inference for latent Dirichlet allocation. In NIPS, volume 20, pages 1081-1088, 2007.

David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models. The Journal of Machine Learning Research, 10:1801-1828, 2009.

Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569-577. ACM, 2008.

Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, volume 6, pages 1378-1385, 2006.

Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Algorithmic Aspects in Information and Management, pages 301-314, 2009.

La Wen, Jianwu Rui, Tingting He, and Liang Guo. Accelerating hierarchical distributed latent Dirichlet allocation algorithm by parallel GPU. Journal of Computer Applications, 33(12):3313-3316, 2013.

Jason Wolfe, Aria Haghighi, and Dan Klein. Fully distributed EM for very large datasets. In Proceedings of the 25th International Conference on Machine Learning, pages 1184-1191. ACM, 2008.

Han Xiao and Thomas Stibor. Efficient collapsed Gibbs sampling for latent Dirichlet allocation. Journal of Machine Learning Research - Proceedings Track, 13:63-78, 2010.

Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent Dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134-2142, 2009.

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.