Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, U.K.
http://www.gatsby.ucl.ac.uk

Abstract: It is possible to combine multiple probabilistic models of the same data by multiplying their probabilities together and renormalizing.


1 Introduction

A very different way of combining distributions is to multiply them together and renormalize. A product of this kind can be very sharp even if the individual models are quite smooth. The individual models are a bit more complicated than the components of a mixture: each contains one or more latent (hidden) variables, and models of this kind will be called "experts". Products of Experts (PoE) can produce much sharper distributions than the individual expert models because each expert can constrain a different subset of the dimensions and the product then constrains all of the dimensions. For modeling handwritten digits, one low-resolution model can generate images that have the approximate overall shape of the digit, and other, more local, models can ensure that small image patches contain segments of stroke with the correct fine structure. For modeling sentences, each expert can enforce a nugget of linguistic knowledge. For example, one expert could ensure that the tenses agree, one could ensure that there is number agreement between the subject and verb, and one could ensure that strings in which colour adjectives follow size adjectives are more probable than the reverse.

Fitting a PoE to data appears difficult because it appears to be necessary to compute the derivatives, with respect to the parameters, of the partition function that is used in the renormalization. As we shall see, however, these derivatives can be finessed by optimizing a less obvious objective function than the log likelihood of the data.

2 Learning products of experts by maximizing likelihood

We consider individual expert models for which it is tractable to compute the derivative of the log probability of a data vector with respect to the parameters of the expert. We combine n individual expert models as follows:

    p(d | theta_1 ... theta_n) = prod_m p_m(d | theta_m) / sum_c prod_m p_m(c | theta_m)    (1)

where d is a data vector in a discrete space, theta_m is all the parameters of individual model m, p_m(d | theta_m) is the probability of d under model m, and c indexes all possible vectors in the data space. For continuous data spaces the sum is replaced by the appropriate integral. For an expert to fit the data well it must give high probability to the data and waste as little probability as possible on the rest of the data space. A PoE, however, can fit the data well even if each expert wastes a lot of its probability on inappropriate regions of the data space, provided different experts waste probability in different regions.

The obvious way to fit a PoE to a set of observed data vectors is to follow the derivative of the log likelihood of each observed vector, d, under the PoE. This is given by:

    ∂ log p(d | theta_1 ... theta_n) / ∂ theta_m
        = ∂ log p_m(d | theta_m) / ∂ theta_m
          - sum_c p(c | theta_1 ... theta_n) ∂ log p_m(c | theta_m) / ∂ theta_m    (2)

The second term on the RHS of Eq. 2 is just the expected derivative of the log probability of an expert on fantasy data, c, that is generated from the PoE. So, assuming that each of the
[Footnote: The symbol p_m has no simple relationship to the symbol p used on the LHS of Eq. 1. Indeed, so long as p_m(d | theta_m) is positive it does not need to be a probability at all, though it will generally be a probability in this paper. For time series models, d is a whole sequence.]

individual experts has a tractable derivative, the obvious difficulty in estimating the derivative of the log probability of the data under the PoE is generating correctly distributed fantasy data. This can be done in various ways. For discrete data it is possible to use rejection sampling: each expert generates a data vector independently and this process is repeated until all the experts happen to agree. Rejection sampling is a good way of understanding how a PoE specifies an overall probability distribution and how different it is from a causal model, but it is typically very inefficient. A Markov chain Monte Carlo method that uses Gibbs sampling is typically much more efficient. In Gibbs sampling, each variable draws a sample from its posterior distribution given the current states of the other variables. Given the data, the hidden states of all the experts can always be updated in parallel because they are conditionally independent. This is a very important consequence of the product formulation. If the individual experts also have the property that the components of the data vector are conditionally independent given the hidden state of the expert, the hidden and visible variables form a bipartite graph and it is possible to update all of the components of the data vector in parallel given the hidden states of all the experts.

To get an unbiased estimate of the gradient for the PoE it is necessary for the Markov chain to converge to the equilibrium distribution. Unfortunately, even if it is computationally feasible to approach the equilibrium distribution before taking samples, there is a second, serious difficulty. Samples from the equilibrium distribution generally have very high variance since they come from all over the model's distribution. This high variance swamps the derivative. Worse still, the variance in the samples depends on the parameters of the model. This variation in the variance causes the parameters to be repelled from regions of high variance even if the gradient is zero. To understand this subtle effect, consider a horizontal sheet of tin which is resonating in such a way that some parts have strong vertical oscillations and other parts are motionless. Sand scattered on the tin will accumulate in the motionless areas even though the time-averaged gradient is zero everywhere.

3 Learning by minimizing contrastive divergence

Maximizing the log likelihood of the data (averaged over the data distribution) is equivalent to minimizing the Kullback-Leibler divergence between the data distribution, Q^0, and the equilibrium distribution over the visible variables, Q^∞, that is produced by prolonged Gibbs sampling from the generative model:

    Q^0 || Q^∞ = sum_d Q^0_d log Q^0_d - sum_d Q^0_d log Q^∞_d
               = -H(Q^0) - <log Q^∞_d>_{Q^0}    (3)

where || denotes a Kullback-Leibler divergence, the angle brackets denote expectations over the distribution specified as a subscript, and H(Q^0) is the entropy of the data distribution. Q^0 does not depend on the parameters of the model, so H(Q^0) can be ignored during the optimization. The gradient of the divergence can be rewritten, using Eq. 2, as:

    ∂(Q^0 || Q^∞) / ∂ theta_m
        = - < ∂ log p_m(d | theta_m) / ∂ theta_m >_{Q^0}
          + < ∂ log p_m(c | theta_m) / ∂ theta_m >_{Q^∞}    (4)

There is a simple and effective alternative to maximum likelihood learning which eliminates almost all of the computation required to get samples from the equilibrium distribution and also eliminates much of the variance that masks the gradient signal. The approach involves optimizing a different objective function. Instead of just minimizing Q^0 || Q^∞ we minimize the contrastive divergence

    Q^0 || Q^∞ - Q^1 || Q^∞

where Q^1 is the distribution over the one-step reconstructions of the data vectors that are generated by one full step of Gibbs sampling. Instead of running the chain to equilibrium and comparing the initial and final derivatives, we can simply run the chain for one full step and then update the parameters to reduce the tendency of the chain to wander away from the initial distribution. Because Q^1 is one step closer to the equilibrium distribution than Q^0 (the Markov chain implemented by Gibbs sampling leaves the initial distribution over the visible variables unaltered only at equilibrium), it is guaranteed that Q^0 || Q^∞ >= Q^1 || Q^∞, so the contrastive divergence can never be negative. For Markov chains in which all transitions have non-zero probability, Q^1 = Q^0 implies that Q^0 is the equilibrium distribution, so the contrastive divergence can only be zero if the model is perfect.

The mathematical motivation for the contrastive divergence is that the intractable expectation over Q^∞ on the RHS of Eq. 4 cancels out:

    - ∂/∂ theta_m (Q^0 || Q^∞ - Q^1 || Q^∞)
        = < ∂ log p_m(d | theta_m) / ∂ theta_m >_{Q^0}
          - < ∂ log p_m(d_hat | theta_m) / ∂ theta_m >_{Q^1}
          + (∂ Q^1 / ∂ theta_m) (∂(Q^1 || Q^∞) / ∂ Q^1)    (5)

Because each expert is individually tractable, it is possible to compute the derivatives of log p_m(d) and log p_m(d_hat). It is also straightforward to sample from Q^1. The following procedure produces an unbiased sample:

1. Pick a data vector, d, from the distribution of the data, Q^0.
2. Compute, for each expert separately, the posterior probability distribution over its latent (hidden) variables given the data vector, d.
3. Pick a value for each latent variable from its posterior distribution.
4. Given the values of all the latent variables, generate a reconstruction, d_hat, by sampling from the distribution over the visible variables obtained by multiplying together the conditional distributions specified by each expert. The reconstructions constitute a sample from Q^1.

The third term on the RHS of Eq. 5 is problematic to compute, but extensive simulations (see section 10) show that it can safely be ignored because it is small and it seldom opposes the resultant of the other two terms. So we update the parameters of the experts in proportion to the approximate derivative of the contrastive divergence:

    Delta theta_m  ∝  < ∂ log p_m(d | theta_m) / ∂ theta_m >_{Q^0}
                    - < ∂ log p_m(d_hat | theta_m) / ∂ theta_m >_{Q^1}    (6)
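The two-term gradient in Eq. 2, whose intractable second term motivates the contrastive divergence approximation, can be checked numerically in a data space small enough to enumerate. A minimal sketch with two softmax-parameterized experts over eight discrete values (the experts and all variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                         # size of the (enumerable) discrete data space
theta1 = rng.normal(size=K)   # unnormalized log-probabilities for expert 1
theta2 = rng.normal(size=K)   # ... and for expert 2

def expert(theta):
    # Each expert is a softmax distribution over the K possible vectors.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def poe(t1, t2):
    # Eq. 1: multiply the experts together and renormalize over the space.
    prod = expert(t1) * expert(t2)
    return prod / prod.sum()

d = 3  # index of the observed "data vector"

# Eq. 2, analytically: for a softmax expert the derivative of log p_1(c)
# with respect to theta1[i] is delta_{ci} - p_1(i).
p1 = expert(theta1)
p = poe(theta1, theta2)
dlogp1 = np.eye(K) - p1            # row c holds d log p_1(c) / d theta1
analytic = dlogp1[d] - p @ dlogp1  # first term minus the PoE-expected term

# Finite-difference check of the same derivative.
eps = 1e-6
numeric = np.zeros(K)
for i in range(K):
    t = theta1.copy()
    t[i] += eps
    numeric[i] = (np.log(poe(t, theta2)[d]) - np.log(p[d])) / eps

assert np.allclose(analytic, numeric, atol=1e-4)
```

The second term is an exact sum here only because the space has eight elements; in any realistic model it is the expectation that must be approximated by sampling.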
This works very well in practice even when a single reconstruction of each data vector is used in place of the full probability distribution over reconstructions. The difference in the derivatives has low variance because the reconstruction procedure produces a moderately close match between a data vector and its reconstruction, and this reduces sampling variance in much the same way as the use of matched pairs for experimental and control conditions in a clinical trial. The low variance makes it feasible to perform online learning after each data vector is presented, though the simulations described in this paper use batch learning in which the parameter updates are based on all of the training data.

There is an alternative justification for the learning algorithm in Eq. 6. In high-dimensional datasets, the data nearly always lies on, or close to, a much lower dimensional, smoothly curved manifold. The PoE needs to find parameters that make a sharp ridge of log probability along this low-dimensional manifold. By starting with a point on the manifold and ensuring that this point has higher log probability than typical reconstructions from the latent variables of all the experts, the PoE ensures that the probability distribution has the right local curvature (provided the reconstructions are close to the data). It is possible that the PoE will accidentally assign high probability to other, distant and unvisited parts of the data space, but this is unlikely if the log probability surface is smooth and if both its height and its local curvature are constrained at the data points. It is also possible to find and eliminate such points by performing prolonged Gibbs sampling without any data, but this is just a way of improving the learning and not, as in Boltzmann machine learning, an essential part of it.

4 A simple example

PoE's should work very well on data distributions that can be factorized into a product of lower dimensional distributions. This is demonstrated in figure 1. There are 15 "unigauss" experts, each of which is a mixture of a uniform and a single, axis-aligned Gaussian. In the fitted model, each tight data cluster is represented by the intersection of two Gaussians which are elongated along different axes. Eq. 6 is used to update the parameters. For each update of the parameters, the following computation is performed on every observed data vector:

1. Given the data, d, calculate the posterior probability of selecting the Gaussian rather than the uniform in each expert and compute the
first term on the RHS of Eq. 6.
2. For each expert, stochastically select the Gaussian or the uniform according to the posterior.
3. Compute the normalized product of the selected Gaussians, which is itself a Gaussian, and sample from it. This is used to get a "reconstructed" vector in the data space.
4. Compute the negative term in Eq. 6 using the reconstructed vector as d_hat.

5 Learning population codes

In a population code, each expert only loosely constrains a dimension of the data, and high precision is obtained by the intersection of a large number of experts. Figure 3 shows what happens when experts of the type used in the previous example are fitted to 100-dimensional synthetic images that each contain one edge. The edges varied in their orientation, position, and the intensities on each side of the edge. The intensity profile across the edge was a sigmoid. Each expert also learned a variance for each pixel, and although these variances varied, individual experts did not specialize in a small subset of the dimensions. Given an image, about half of the experts have a high probability of picking their Gaussian rather than their uniform. The products of the chosen Gaussians are excellent reconstructions of the image. The experts at the top of figure 3 look like edge detectors in various orientations, positions and polarities. Other experts are less easy to interpret: they each work for two different sets of edges that have opposite polarities and different positions.

Figure 1: Each dot is a data point. The data has been fitted with a product of 15 experts. The ellipses show the one standard deviation contours of the Gaussians in each expert. The experts are initialized with randomly located, circular Gaussians that have about the same variance as the data. The five unneeded experts remain vague, but the mixing proportions of their Gaussians remain high.
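The reconstruction computation described above has a simple closed form: the normalized product of the selected axis-aligned Gaussians is itself a Gaussian whose precision is the sum of the experts' precisions and whose mean is the precision-weighted average of their means. A small sketch (the means and variances are invented for illustration, not taken from the figure):

```python
import numpy as np

rng = np.random.default_rng(1)

# One selected axis-aligned Gaussian per expert, in a 2-D data space.
means = np.array([[0.0, 2.0], [1.0, 1.0], [0.5, 3.0]])
variances = np.array([[4.0, 0.25], [0.25, 4.0], [1.0, 1.0]])

# Product of Gaussians: precisions add, and the mean is the
# precision-weighted average of the experts' means.
precisions = 1.0 / variances
prod_var = 1.0 / precisions.sum(axis=0)
prod_mean = prod_var * (precisions * means).sum(axis=0)

# Sample a "reconstructed" vector in the data space, as used for the
# negative term of Eq. 6.
reconstruction = rng.normal(prod_mean, np.sqrt(prod_var))
```

Because precisions add, the product's variance along each axis is never larger than the smallest variance among the selected Gaussians, which is exactly why a product of vague experts can still be sharp.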
Figure 2: 300 data points generated by prolonged Gibbs sampling from the 15 experts fitted in figure 1. The Gibbs sampling started from a random point in the range of the data and used 25 parallel iterations with annealing. Notice that the fitted model generates data at the grid point that is missing in the real data.

6 Initializing the experts

Experts can be encouraged to differentiate by giving them different or differently weighted training cases, by training them on different subsets of the data dimensions, or by using different model classes for the different experts. If each expert has been initialized separately, the individual probability distributions need to be raised to a fractional power to create the initial PoE. Separate initialization of the experts seems like a sensible idea, but simulations indicate that the PoE is far more likely to become trapped in poor local optima if the experts are allowed to specialize separately. Better solutions are obtained by simply initializing the experts randomly with very vague distributions and using the learning rule in Eq. 6.

Figure 3: The means of all the 100-dimensional Gaussians in a product of 40 experts, each of which is a mixture of a Gaussian and a uniform. The PoE was fitted to 10x10 images that each contained a single intensity edge. The experts have been ordered by hand so that qualitatively similar experts are adjacent.

7 Restricted Boltzmann machines

The Boltzmann machine learning algorithm (Hinton and Sejnowski, 1986) is theoretically elegant and easy to implement in hardware, but it is very slow in networks with interconnected hidden units because of the variance problems described in section 2. Smolensky (1986) introduced a restricted type of Boltzmann machine with one visible layer, one hidden layer, and no intralayer connections. In a restricted Boltzmann machine (RBM), the probability of generating a visible vector is proportional to the product of the probabilities that the visible vector would be generated by each of the hidden units acting alone, so an RBM is a product of experts with one expert per hidden unit. [Footnote: Boltzmann machines and Products of Experts are very different classes of probabilistic generative model, and the intersection of the two classes is RBM's.] When a hidden unit is off it specifies a factorial probability distribution in which each visible unit is equally likely to be on or off. When the hidden unit is on, it specifies a different factorial distribution by using the weight on its connection to each visible unit to specify the log odds that the visible unit is on. Multiplying together the distributions over the visible states specified by different experts is achieved by simply adding the log odds. Exact inference is tractable in an RBM because the states of the hidden units are conditionally independent given the data.

Consider the derivative of the log probability of the data with respect to the weight w_ij between a visible unit i and a hidden unit j. The first term on the RHS of Eq. 2 is:

    ∂ log p_j(d | w_j) / ∂ w_ij = <s_i s_j>_d - <s_i s_j>_{Q^∞_j}    (7)

where w_j is the vector of weights of hidden unit j, <s_i s_j>_d is the expected value of s_i s_j in the clamped posterior distribution given d, and <s_i s_j>_{Q^∞_j} is the expected value of s_i s_j when alternating Gibbs sampling of the hidden and visible units is iterated to get samples from the equilibrium distribution in a network whose only hidden unit is j.

The second term on the RHS of Eq. 2 is:

    sum_c p(c) ∂ log p_j(c | w_j) / ∂ w_ij = <s_i s_j>_{Q^∞} - <s_i s_j>_{Q^∞_j}    (8)

where <s_i s_j>_{Q^∞} is the expected value of s_i s_j when alternating Gibbs sampling of all the hidden and all the visible units is iterated to get samples from the equilibrium distribution of the RBM. Subtracting Eq. 8 from Eq. 7 and taking expectations over the distribution of the data gives:

    ∂ <log p(d)>_{Q^0} / ∂ w_ij = <s_i s_j>_{Q^0} - <s_i s_j>_{Q^∞}    (9)

The time required to approach equilibrium and the high sampling variance in <s_i s_j>_{Q^∞} make learning difficult. It is much more effective to use the approximate gradient of the contrastive divergence. For an RBM this approximate gradient is particularly easy to compute:

    Delta w_ij  ∝  <s_i s_j>_{Q^0} - <s_i s_j>_{Q^1}    (10)

where <s_i s_j>_{Q^1} is the expected value of s_i s_j when one-step reconstructions are clamped on the visible units and s_j is sampled from its posterior distribution given the reconstruction.

8 Learning the features of handwritten digits

When trained on high-dimensional real data, an RBM should learn features that model the data well. To test this conjecture, an RBM with 500 hidden units and 256 visible units was trained on 8000 16x16 real-valued images of handwritten digits from all 10 classes. The images, from the training set on the USPS Cedar ROM, were normalized, and the pixel intensities lay between 0 and 1 so that they could be treated as probabilities. Eq. 10 was modified to use probabilities in place of stochastic binary values for both the data and the one-step reconstructions:

    Delta w_ij  ∝  <p_i p_j>_{Q^0} - <p_i p_j>_{Q^1}    (11)

Stochastically chosen binary states of the hidden units were still used for computing the probabilities of the reconstructed pixels, but instead of picking binary states for the pixels from those probabilities, the probabilities themselves were used as the reconstructed data vector. It took two days in matlab on a 500MHz workstation to perform 658 epochs of learning. In each epoch, the weights were updated 80 times using the approximate gradient of the contrastive divergence computed on mini-batches of size 100 that contained 10 exemplars of each digit class. The learning rate was set empirically to be about one quarter of the rate that caused divergent oscillations in the parameters. To further improve the learning speed a momentum method was used: after the first 10 epochs, the parameter updates specified by Eq. 11 were supplemented by adding 0.9 times the previous update.

The PoE learned localised features whose binary states yielded almost perfect reconstructions. For each image, about one third of the features were turned on. Some of the learned features had on-center off-surround receptive fields or vice versa, some looked like pieces of stroke, and some looked like Gabor filters or wavelets. The weights of 100 of the hidden units, selected at random, are shown in figure 4.

Figure 4: The receptive fields of a randomly selected subset of the 500 hidden units in a PoE that was trained on 8000 images of digits with equal numbers from each class. Each block shows the 256 learned weights connecting a hidden unit to the pixels. The scale goes from +2 (white) to -2 (black).

9 Discrimination with learned models

An attractive aspect of PoE's is that it is easy to compute the numerator in Eq. 1, so it is easy to compute the log probability of a data vector up to an additive constant, log Z, which is the log of the denominator in Eq. 1. Unfortunately, it is very hard to compute this additive constant. This does not matter if we only want to compare the probabilities of two different data vectors under the PoE, but it makes it difficult to evaluate the model learned by a PoE. The obvious way to measure the success of learning is to sum the log probabilities that the PoE assigns to test data vectors that are drawn from the same distribution as the training data but not used during training.

The unnormalized scores of a test image under PoE's trained on two different digit classes differ from the true log probabilities by the class-specific constants log Z_2 and log Z_3, respectively. If the difference between log Z_2 and log Z_3 is known, it is easy to pick the most likely class of the test image, and since this difference is only a single number it is quite easy to estimate it discriminatively using a set of validation images whose labels are known. Figure 5 shows the features learned by a PoE that contains a layer of 100 hidden units and is trained on 800 images of the digit 2. Figure 6 shows some previously unseen test images of 2's and their one-step reconstructions from the binary activities of the PoE trained on 2's and from an identical PoE trained on 3's.
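The training recipe described above (the approximate contrastive divergence gradient of Eq. 10, with probabilities in place of stochastic binary pixel values as in Eq. 11, plus momentum) can be sketched for a toy RBM. The data, network sizes, and learning constants below are illustrative stand-ins, far smaller than the 500-hidden-unit digit model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary "images": noisy copies of two 16-pixel templates.
templates = np.array([[1]*8 + [0]*8, [0]*8 + [1]*8], dtype=float)
data = templates[rng.integers(0, 2, size=200)]
data = np.abs(data - (rng.random(data.shape) < 0.05))  # 5% pixel flips

n_visible, n_hidden = 16, 8
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
lr, momentum = 0.05, 0.9
update = np.zeros_like(W)

def recon_error(W):
    # Deterministic mean-field reconstruction error, for monitoring only.
    v = sigmoid(sigmoid(data @ W) @ W.T)
    return np.mean((data - v) ** 2)

err_before = recon_error(W)

for epoch in range(100):
    # Positive phase: exact hidden posteriors given the data.
    h_prob = sigmoid(data @ W)
    h_state = (rng.random(h_prob.shape) < h_prob).astype(float)
    positive = data.T @ h_prob

    # One-step reconstruction: binary hidden states drive the visible
    # probabilities, and the probabilities themselves are used as the
    # reconstructed data vector, as in Eq. 11.
    v_recon = sigmoid(h_state @ W.T)
    h_recon = sigmoid(v_recon @ W)
    negative = v_recon.T @ h_recon

    # Approximate contrastive divergence update (Eq. 10/11) with momentum.
    update = momentum * update + lr * (positive - negative) / len(data)
    W += update

err_after = recon_error(W)
```

Unlike the digit experiments, this sketch omits bias terms and applies momentum from the first epoch; it is meant only to show the shape of the positive-phase/reconstruction-phase update loop.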
Figure 5: The weights learned by 100 hidden units trained on 16x16 images of the digit 2. The scale goes from +3 (white) to -3 (black). Note that the fields are mostly quite local. A local feature like the one in column 1 row 7 looks like an edge detector, but it is best understood as a local deformation of a template. Suppose all the other active features create an image of a 2 that differs from the data in having a large loop whose top falls on the black part of the receptive field. By turning on this feature, the top of the loop can be removed and replaced by a line segment that is a little lower in the image.

Figure 7 shows the unnormalized log probability scores of images under a model trained on 825 images of the digit 4 and a model trained on 825 images of the digit 6. Unfortunately, the official test set for the USPS digits violates the standard assumption that test data should be drawn from the same distribution as the training data, so the test images were drawn from the unused portion of the official training set. Even for previously unseen test images, the scores under the two models give almost perfect separation. To achieve this excellent separation, it was necessary to use models with two hidden layers and to average the scores from two separately trained models of each digit class. For each digit class, one model had 100 units in the first hidden layer and 50 in the second.

Figure 6: The top row shows the pixel probabilities when the image is reconstructed from the binary activities of 100 feature detectors that have been trained on 2's. The bottom row shows the reconstruction probabilities using 100 feature detectors trained on 3's.
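The discriminative estimation mentioned above, where the unknown difference in log partition functions between two class models is a single scalar fitted on labeled validation scores, can be sketched with simulated score differences. All of the numbers here are invented for the sketch, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical unnormalized score differences, score_2(t) - score_3(t),
# on a labeled validation set.
diff_for_2s = rng.normal(loc=5.0, scale=2.0, size=200)   # images that are 2's
diff_for_3s = rng.normal(loc=-4.0, scale=2.0, size=200)  # images that are 3's

# The unknown log Z_2 - log Z_3 shifts every difference by the same single
# scalar, so one number estimated from validation data fixes the decision
# threshold; the midpoint of the class means is a crude estimator.
threshold = 0.5 * (diff_for_2s.mean() + diff_for_3s.mean())

def classify(score_diff):
    # Predict class 2 when the score difference exceeds the threshold.
    return np.where(score_diff > threshold, 2, 3)

accuracy = np.concatenate([
    classify(diff_for_2s) == 2,
    classify(diff_for_3s) == 3,
]).mean()
```

A logistic regression on the raw scores, as used later for the full 10-class task, generalizes this one-scalar estimate to a learned weighting of many unnormalized scores.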
Figure 7: a) The unnormalised log probability scores of the training images of the digits 4 and 6 under the learned PoE's for 4 and 6. b) The log probability scores for previously unseen test images of 4's and 6's. Note the good separation of the two classes.

Figure 8 shows the unnormalized log probability scores for images of 7's and 9's, which are the most difficult classes to discriminate. Discrimination is not perfect on the test images, but it is encouraging that all of the errors are close to the decision boundary, so there are no confident errors. If there are 10 different PoE's for the 10 digit classes, it is slightly less obvious how to use the 10 unnormalized scores of a test image for discrimination. One possibility is to use a validation set to train a logistic regression network that takes the unnormalized log probabilities given by

Figure 8: a) The unnormalised log probability scores of the training images of the digits 7 and 9 under the learned PoE's for 7 and 9. b) The log probability scores for previously unseen test images of 7's and 9's. Although the classes are not linearly separable, all the errors are close to the best separating line, so there are no very confident errors.

the PoE's and converts them into a probability distribution across the 10 labels. Figure 9 shows the weights in a logistic regression network that is trained after fitting 10 PoE models to the 10 separate digit classes. In order to see whether the second hidden layers were providing useful discriminative information, each PoE provided two scores. The first score was the unnormalized log probability of the pixels under the model learned by the first hidden layer. The second score was the unnormalized log probability of the probabilities of activation of the
first layer of hidden units under a PoE model that consisted of the units in the second hidden layer. The weights in figure 9 show that the second layer of hidden units provides useful additional information. Presumably this is because it captures the way in which features represented in the first hidden layer are correlated. The error rate is 1.1%, which compares very favorably with the 5.1% error rate of a simple nearest neighbor classifier on the same training and test sets and is about the same as the very best classifier based on elastic models of the digits (Revow, Williams and Hinton, 1996). If 7% rejects are allowed (by choosing an appropriate threshold for the probability level of the most probable class), there are no errors on the 2750 test images.

Several different network architectures were tried for the digit-specific PoE's, and the results reported are for the best architecture; all of the model selection was done using subsets of the training data, so that the official test data was used only to measure the final error rate.

10 How good is the approximation?

The fact that the learning procedure in Eq. 6 gives good results in the simulations described in sections 4 and 9 suggests that it is safe to ignore the final term on the RHS of Eq. 5 that comes from the change in the distribution Q^1.

Figure 9: The weights learned by doing multinomial logistic regression on the training data with the labels as outputs and the unnormalised log probability scores from the trained, digit-specific, PoE's as inputs. Each column corresponds to a digit class, starting with digit 1. The top row is the biases for the classes. The next ten rows are the weights assigned to the scores that represent the log probability of the pixels under the model learned by the first hidden layer of each PoE. The last ten rows are the weights assigned to the scores that represent the log probabilities of the probabilities on the
first hidden layer under the model learned by the second hidden layer. Note that although the weights in the last ten rows are smaller, they are still quite large, which shows that the scores from the second hidden layers provide useful, additional information.

To get an idea of the relative magnitude of the term that is being ignored, extensive simulations were performed using restricted Boltzmann machines with small numbers of visible and hidden units. By performing computations that are exponential in the number of hidden units and exponential in the number of visible units, it is possible to compute both sides of Eq. 10 exactly. It is also possible to measure what happens to the contrastive divergence when the approximation in Eq. 10 is used to perform a parallel weight update that is large compared with the numerical precision of the machine but small compared with the curvature of the contrastive divergence. The RBM's used for these simulations had random training data and random weights and did not have biases on the visible or hidden units. The main result can be summarized as follows: For an individual weight, the RHS of Eq. 10, summed over all training cases, occasionally differs in sign from the LHS. But for networks containing more than 2 units in each layer it is almost certain that a parallel update of all the weights based on the RHS of Eq. 10 will improve the contrastive divergence. In other words, when averaged over the training data, the vector of parameter updates given by the RHS is almost certain to have a positive cosine with the true gradient defined by the LHS. Figure 10a is a histogram of the improvements in the contrastive divergence when Eq. 10 was used to perform one parallel weight update in each of a hundred thousand networks. The networks contained 8 visible and 4 hidden units and their weights were chosen from a Gaussian distribution with mean zero and standard deviation 20. For smaller weights or larger networks the approximation in Eq. 10 is even better. Figure 10b shows that the learning procedure does not always improve the log likelihood of the training data, though it has a strong tendency to do so. Note that only 1000 networks were used for the histogram in figure 10b.

[Figure 10: two histograms, of the improvements in contrastive divergence (a) and of the improvements in data log likelihood (b).]
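The enumeration experiment just described can be reproduced in miniature: for a bias-free RBM with a handful of units, both the exact log-likelihood gradient and the exact expected CD-1 update of Eq. 10 can be computed by brute force and their cosine checked. A sketch using smaller networks and a smaller weight scale than the paper's simulations:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def all_states(n):
    return np.array(list(product([0, 1], repeat=n)), dtype=float)

def exact_gradients(W, data):
    """Exact log-likelihood gradient and exact expected CD-1 update
    for an RBM with no biases, by exhaustive enumeration."""
    nv, nh = W.shape
    V, H = all_states(nv), all_states(nh)

    # Equilibrium term <s_i s_j> by enumerating every (v, h) pair.
    joint = np.exp(V @ W @ H.T)          # unnormalized p(v, h)
    joint /= joint.sum()
    eq_term = np.einsum('vh,vi,hj->ij', joint, V, H)

    # Data term <s_i s_j>, hidden units at their exact posteriors.
    data_term = data.T @ sigmoid(data @ W) / len(data)

    # One-step term <s_i s_j>: enumerate hidden states given each data
    # vector, then all possible reconstructions given each hidden state.
    recon_h = sigmoid(V @ W)             # hidden posteriors for every v'
    q1_term = np.zeros_like(W)
    for v in data:
        ph = sigmoid(v @ W)
        p_hidden = np.prod(np.where(H == 1, ph, 1 - ph), axis=1)
        pv = sigmoid(H @ W.T)            # p(v'_i = 1 | h) for every h
        for k in range(len(H)):
            p_recon = np.prod(np.where(V == 1, pv[k], 1 - pv[k]), axis=1)
            q1_term += p_hidden[k] * np.einsum('v,vi,vj->ij',
                                               p_recon, V, recon_h)
    q1_term /= len(data)

    return data_term - eq_term, data_term - q1_term

positives = 0
for _ in range(20):
    W = rng.normal(scale=1.0, size=(4, 3))
    data = (rng.random((6, 4)) < 0.5).astype(float)
    true_grad, cd_update = exact_gradients(W, data)
    cos = (true_grad * cd_update).sum() / (
        np.linalg.norm(true_grad) * np.linalg.norm(cd_update))
    positives += cos > 0
```

With moderate weights the cosine between the CD-1 update and the true gradient is positive in essentially every trial; the paper's much larger weight scale (standard deviation 20) is closer to the regime where individual weights occasionally disagree in sign.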
Figure 10: a) A histogram of the improvements in the contrastive divergence as a result of using Eq. 10 to perform one update of the weights in each of a hundred thousand networks. The expected values on the RHS of Eq. 10 were computed exactly. The networks had 8 visible and 4 hidden units. The initial weights were randomly chosen from a Gaussian with mean 0 and standard deviation 20. The training data was chosen at random. b) The improvements in the log likelihood of the data for 1000 networks chosen in exactly the same way as in figure 10a. Note that the log likelihood decreased in two cases. The changes in the log likelihood are the same as the changes in Q^0 || Q^∞ but with a sign reversal; only 1000 networks were used for this histogram.

Figure 11 compares the contributions to the gradient of the contrastive divergence made by the modeled and the unmodeled effects of a weight update. If the ignored term caused the updates given by Eq. 10 to make the contrastive divergence worse, the dots in figure 11 would have to lie above the diagonal line, so for networks of this size the approximation in Eq. 10 is quite safe. Intuitively, we expect Q^1 to lie between Q^0 and Q^∞, so when the parameters are changed to move Q^∞ closer to Q^0, the changes should also move Q^1 towards Q^0 and away from the previous position of Q^∞. So the ignored changes in Q^1 should cause an increase in Q^1 || Q^∞ and thus an improvement in the contrastive divergence.

In an earlier version of this paper, the learning rule in Eq. 6 was interpreted as approximate optimization of the contrastive log likelihood: <log p(d)>_{Q^0} - <log p(d_hat)>_{Q^1}. Unfortunately, the contrastive log likelihood can achieve its maximum value of 0 by simply making all possible vectors in the data space equally probable. The contrastive divergence differs from the contrastive log likelihood by including the entropies of the distributions Q^0 and Q^1, and so the high entropy of Q^1 rules out the solution in which all possible data vectors are equiprobable.

11 Other types of expert

Binary stochastic pixels are not unreasonable for modeling preprocessed images of handwritten digits in which ink and background are represented as 1 and 0. In real images, however, there is typically high mutual information between the real-valued intensity of a pixel and the real-valued intensities of its neighbors. This cannot be captured by models that use binary stochastic pixels because a binary pixel can never have more than 1 bit of mutual information with anything. It is possible to use "multinomial" pixels that have n discrete values. This is a clumsy solution for images because it fails to capture the continuity and one-dimensionality of pixel intensities, though it may be useful for other types of data.

Figure 11: The networks in this figure have 10 visible and 10 hidden units and their weights are drawn from a zero-mean Gaussian with a standard deviation of 10. The horizontal axis shows the improvement in the contrastive divergence measured between the distributions before and after the weight update; this differs from the true improvement because it ignores the effect of the update on Q^1. The ignored term is plotted on the vertical axis. Points above the diagonal line would correspond to cases in which the weight updates caused a decrease in the contrastive divergence because the unmodeled effects were in conflict with the modeled effects.

A better approach is to imagine replicating each visible unit so that a pixel corresponds to a whole set of binary visible units that all have identical weights to the hidden units. The number of active units in the set can then approximate a real-valued intensity. During reconstruction, the number of active units will be binomially distributed, and because all the replicas have the same weights, the single probability that controls this binomial distribution only needs to be computed once. The same trick can be used to allow replicated hidden units to approximate real values using binomially distributed integer states. A set of replicated units can be viewed as a computationally cheap approximation to a set of units whose weights actually differ, or it can be viewed as a stationary approximation to the behaviour of a single unit over time, in which case the number of active replicas is a firing rate.

An alternative to replicating hidden units is to use "unifac" experts that each consist of a mixture of a uniform distribution and a factor analyser with just one factor. Each expert has a binary latent variable that specifies whether to use the uniform or the factor analyser and a real-valued latent variable that specifies the value of the factor (if it is being used). The factor loadings specify a direction in image space. Experts of this type have been explored in the context of directed acyclic graphs (Hinton, Sallans and Ghahramani, 1998), but they should work better in a product of experts.

An alternative to using a large number of relatively simple experts is to make each expert [Footnote: The last two sets of parameters are exactly equivalent to the parameters of a "unigauss" expert introduced in section 4, so a "unigauss" expert can be considered to be a mixture of a uniform with a factor analyser that has no factors.]
as complicated as possible, whilst retaining the ability to compute the exact derivatives of the log likelihood of the data under each expert. For modeling images, for example, each expert could be a mixture of many axis-aligned Gaussians. Some experts might focus on one region of an image by using very high variances for pixels outside that region. But, so long as the regions modeled by different experts overlap, it should be possible to avoid block boundary artifacts.

12 Products of Hidden Markov Models

Hidden Markov Models (HMM's) are of great practical value in modeling sequences of discrete symbols or sequences of real-valued vectors because there is an efficient algorithm for updating the parameters of the HMM to improve the log likelihood of a set of observed sequences. HMM's are, however, quite limited in their generative power because the only way that the portion of a string generated up to time t can constrain the portion of the string generated after time t is via the discrete hidden state of the generator at time t. So if the first part of a string has, on average, n bits of mutual information with the rest of the string, the HMM must have 2^n hidden states to convey this mutual information by its choice of hidden state. This exponential inefficiency can be overcome by using a product of HMM's as a generator. During generation, each HMM gets to pick a hidden state at each time, so the mutual information between the past and the future can be linear in the number of HMM's. It is therefore exponentially more efficient to have many small HMM's than one big one. However, to apply the standard forward-backward algorithm to a product of HMM's it is necessary to take the cross-product of their state spaces, which throws away the exponential win. For products of HMM's to be of practical significance it is necessary to find an efficient way to train them.

Andrew Brown (Brown and Hinton, in preparation) has shown that for a toy example involving a product of four HMM's, the learning algorithm in Eq. 6 works well. The forward-backward algorithm is used to get the gradient of the log likelihood of an observed or reconstructed sequence with respect to the parameters of an individual expert. The one-step reconstruction of a sequence is generated as follows:

1. Given an observed sequence, use the forward-backward algorithm in each expert separately to calculate the posterior probability distribution over paths through the hidden states.
2. For each expert, stochastically select a hidden path from the posterior given the observed sequence.
3. At each time step, select an output symbol or output vector from the product of the output distributions specified by the selected hidden state of each HMM.

If more realistic products of HMM's can be trained successfully by minimizing the contrastive divergence, they should be far better than single HMM's for many different kinds of sequential data. Consider, for example, the expert shown in figure 12, which concentrates on capturing a single non-local regularity. A single HMM which must also model all the other regularities in strings of English words could not capture this regularity efficiently because it could not afford to devote its entire memory capacity to remembering whether the word "shut" had already occurred in the string.

There have been previous attempts to learn representations by adjusting parameters to cancel out the effects of brief iteration in a recurrent network (Hinton and McClelland, 1988; O'Reilly, 1996; Seung, 1998), but these were not formulated using a stochastic generative model and an appropriate objective function.

Figure 12: A Hidden Markov Model. The first and third nodes have output distributions that are uniform across all words. Because the first node has a high transition probability to itself, most strings of English words are given the same low probability by this expert. Strings that contain the word "shut" followed directly or indirectly by the word "up" have higher probability under this expert.

13 Discussion

Learning from near misses was proposed by Winston (1975). Winston's program compared arches made of blocks with "near misses" supplied by a teacher, and it used the differences in its representations of the correct and incorrect arches to decide which aspects of its representation were relevant. By using a stochastic generative model we can dispense with the teacher: it is the differences between the real data and the near misses generated by the model that drive the learning of the significant features.

The idea of combining the opinions of multiple expert models by averaging in the log probability domain is far from new (Genest and Zidek, 1986; Heskes, 1998), but research has focussed on how to find the best weights for combining experts that have already been learned or programmed separately (Berger, Della Pietra and Della Pietra, 1996) rather than on training the experts cooperatively. The geometric mean of a set of probability distributions has the property that its Kullback-Leibler divergence from the true data distribution, P, is smaller than the average of the Kullback-Leibler divergences of the individual distributions:

    KL( P || prod_m Q_m^{w_m} / Z )  <=  sum_m w_m KL( P || Q_m )    (12)

where the weights w_m are non-negative and sum to 1, and Z = sum_c prod_m (Q_m(c))^{w_m} is the normalization constant of the geometric mean. Unless all the individual models are identical, Z is less than 1 and the difference between the two sides of Eq. 12 is log(1/Z). This makes it clear that the benefit of combining experts comes from the fact that they make Z small by disagreeing on unobserved data.

It is tempting to augment PoE's by giving each expert, m, an additional adaptive parameter, w_m, that scales its log probabilities. Unfortunately, this makes inference much more difficult (personal communication). Consider, for example, an expert with w_m = 100. This is equivalent to having 100 copies of the expert but with their latent states all tied together, and the tying makes inference difficult. It is easier just to fix w_m = 1 and allow the PoE learning algorithm to determine the appropriate sharpness of the expert.

Inference in a PoE is trivial because the experts are individually tractable and the product formulation ensures that their hidden states are conditionally independent given the data. By contrast, densely connected directed acyclic graphical models suffer from the "explaining away" phenomenon, which makes exact inference intractable. It is then necessary to use clever iterative techniques to approximate inference (Saul and Jordan, 1998) or to use crude approximations that ignore explaining away during inference and rely on the learning algorithm to find representations for which the shoddy inference technique is not too damaging (Hinton, Dayan, Frey and Neal, 1995). Generation from a PoE is harder than from a directed acyclic graphical model because it requires an iterative procedure such as Gibbs sampling. If, however, Eq. 6 is used for learning, the difficulty of generating samples from the model is not a major problem.

In addition to the conditional independence of the experts given the data, PoE's have a more subtle advantage over generative models that work by first choosing values for the latent variables and then generating a data vector from those latent values. If such a model has a single hidden layer and the latent variables have independent prior distributions, there will be a strong tendency for the posterior values of the latent variables to be approximately marginally independent after the model has been fitted to data. For this reason, there has been little success with attempts to learn such generative models one hidden layer at a time in a greedy, bottom-up way. With PoE's, however, even though the experts have independent priors, the latent variables in different experts will be marginally dependent: they can have high mutual information even for fantasy data generated by the PoE itself. So after the
first hidden layer has been learned greedily there may still be lots of statistical structure in the latent variables for the second hidden layer to capture.

The most attractive property of a set of orthogonal basis functions is that it is possible to compute the coefficient on each basis function separately without worrying about the coefficients on the other basis functions. A PoE retains this attractive property whilst allowing non-orthogonal experts and a non-linear generative model.

PoE's provide an efficient instantiation of the old psychological idea of analysis-by-synthesis. This idea never worked properly because the generative models were not selected to make the analysis easy. In a PoE, it is difficult to generate data from the generative model, but, given the model, it is easy to compute how any given data vector might have been generated and, as we have seen, it is relatively easy to learn the parameters of the generative model.

Acknowledgements

This research was funded by the Gatsby Charitable Foundation. Thanks to Zoubin Ghahramani, David MacKay and other members of the Gatsby Unit for helpful discussions, and to Peter Dayan for improving the manuscript and disproving some expensive speculations.

[Footnote: This is easy to understand from a coding perspective in which the data is communicated by first specifying the states of the latent variables under an independent prior and then specifying the data given the latent states. If the latent states are not marginally independent this coding scheme is inefficient, so pressure towards coding efficiency creates pressure towards independence.]
References

Berger, A., Della Pietra, S. and Della Pietra, V. (1996) A maximum entropy approach to natural language processing. Computational Linguistics.

Freund, Y. and Haussler, D. (1992) Unsupervised learning of distributions on binary vectors using two-layer networks. In J.E. Moody, S.J. Hanson and R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann: San Mateo, CA.

Genest, C. and Zidek, J.V. (1986) Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 114-148.

Heskes, T. (1998) Bias/variance decompositions for likelihood-based estimators. Neural Computation, 1425-1433.

Hinton, G., Dayan, P., Frey, B. and Neal, R. (1995) The wake-sleep algorithm for self-organizing neural networks. Science, 268, 1158-1161.

Hinton, G.E., Sallans, B. and Ghahramani, Z. (1998) Hierarchical communities of experts. In M.I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic Press.

Hinton, G.E. and McClelland, J.L. (1988) Learning representations by recirculation. In D.Z. Anderson (Ed.), Neural Information Processing Systems, 358-366, American Institute of Physics: New York.

Hinton, G.E. and Sejnowski, T.J. (1986) Learning and relearning in Boltzmann machines. In D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press.

O'Reilly, R.C. (1996) Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8, 895-938.

Revow, M., Williams, C.K.I. and Hinton, G.E. (1996) Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 592-606.

Saul, L.K., Jaakkola, T. and Jordan, M.I. (1996) Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research.

Seung, H.S. (1998) Learning continuous attractors in recurrent networks. In M.J. Kearns and S.A. Solla (Eds.), Advances in Neural Information Processing Systems 10, MIT Press: Cambridge, MA.

Smolensky, P. (1986) Information processing in dynamical systems: Foundations of harmony theory. In D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press.

Winston, P.H. (1975) Learning structural descriptions from examples. In P.H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill: New York.