Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit
University College London
17 Queen Square, London WC1N 3AR, U.K.
http://www.gatsby.ucl.ac.uk

Abstract

It is possible to combine multiple probabilistic models of the same data by multiplying their probabilities together and then renormalizing.
A very different way of combining distributions is to multiply them together and renormalize. In a product of this kind the individual models are usually a bit more complicated and each contain one or more latent (hidden) variables; models of this kind will be called "experts". Products of Experts (PoE's) can produce much sharper distributions than the individual expert models. Each expert can constrain a different subset of the dimensions of a high-dimensional space and their product then constrains all of the dimensions. For modeling handwritten digits, one low-resolution model can generate images that have the approximate overall shape of the digit and other, more local models can ensure that small image patches contain segments of stroke with the correct fine structure. For modeling sentences, each expert can enforce a nugget of linguistic knowledge. For example, one expert could ensure that the tenses agree, one could ensure that there is number agreement between the subject and verb, and one could ensure that strings in which colour adjectives follow size adjectives are more probable than the reverse.

Fitting a PoE to data appears difficult because it appears to be necessary to compute the derivatives, with respect to the parameters, of the partition function that is used in the renormalization. As we shall see, however, these derivatives can be finessed by optimizing a less obvious objective function than the log likelihood of the data.

Learning products of experts by maximizing likelihood

We consider individual expert models for which it is tractable to compute the derivative of the log probability of a data vector with respect to the parameters of the expert. We combine n individual expert models as follows:

    p(d | θ_1, ..., θ_n) = Π_m p_m(d | θ_m) / Σ_c Π_m p_m(c | θ_m)    (1)

where d is a data vector in a discrete space, θ_m is all the parameters of individual model m, p_m(d | θ_m) is the probability of d under model m, and c indexes all possible vectors in the data space. For continuous data spaces the sum is replaced by the appropriate integral.

For an individual expert to fit the data well it must give high probability to the observed data and it must waste as little probability as possible on the rest of the data space. A PoE, however, can fit the data well even if each expert wastes a lot of its probability on inappropriate regions of the data space, provided different experts waste probability in different regions.

The obvious way to fit a PoE to a set of observed data vectors is to follow the derivative of the log likelihood of each observed vector, d, under the PoE. This is given by:

    ∂ log p(d | θ_1, ..., θ_n) / ∂θ_m = ∂ log p_m(d | θ_m) / ∂θ_m − Σ_c p(c | θ_1, ..., θ_n) ∂ log p_m(c | θ_m) / ∂θ_m    (2)

The second term on the RHS of Eq. 2 is just the expected derivative of the log probability of an expert on fantasy data, c, that is generated from the PoE.
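Equation 1 is easy to check numerically on a small discrete space, where the normalizing sum over all data vectors c is tractable. A minimal sketch (the two expert distributions below are invented for illustration, not taken from the paper):

```python
import numpy as np

def product_of_experts(expert_probs):
    """Multiply expert distributions elementwise and renormalize (Eq. 1).

    expert_probs: list of 1-D arrays, each defined over the same discrete
    data space. They need not be normalized, only positive.
    """
    joint = np.ones_like(expert_probs[0], dtype=float)
    for p in expert_probs:
        joint *= p                    # numerator of Eq. 1
    return joint / joint.sum()        # denominator of Eq. 1

# Two hypothetical experts over a 4-point data space: each is vague on
# its own, but their product is sharper than either factor.
p1 = np.array([0.4, 0.4, 0.1, 0.1])
p2 = np.array([0.1, 0.4, 0.4, 0.1])
poe = product_of_experts([p1, p2])    # -> [0.16, 0.64, 0.16, 0.04]
```

Each expert "wastes" probability in a different region, yet the product concentrates on the one vector both experts accept, illustrating the sharpening effect described above.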
[Footnote: The symbol c has no simple relationship to the symbol d used on the LHS of Eq. 2. Indeed, so long as p_m(c | θ_m) is positive it does not need to be a probability at all, though it will generally be a probability in this paper. For time series models, c is a whole sequence.]

So, assuming that each of the individual experts has a tractable derivative, the obvious difficulty in estimating the derivative of the log probability of the data under the PoE is generating correctly distributed fantasy data. This can be done in various ways. For discrete data it is possible to use rejection sampling: each expert generates a data vector independently and this process is repeated until all the experts happen to agree. Rejection sampling is a good way of understanding how a PoE specifies an overall probability distribution and how different it is from a causal model, but it is typically very inefficient. A Markov chain Monte Carlo method that uses Gibbs sampling is typically much more efficient. In Gibbs sampling, each variable draws a sample from its posterior distribution given the current states of the other variables. Given the data, the hidden states of all the experts can always be updated in parallel because they are conditionally independent. This is a very important consequence of the product formulation. If the individual experts also have the property that the components of the data vector are conditionally independent given the hidden state of the expert, the hidden and visible variables form a bipartite graph and all of the components of the data vector can be updated in parallel given the hidden states of all the experts. To get an unbiased estimate of the gradient for the PoE it is necessary for the Markov chain to converge to the equilibrium distribution.

Unfortunately, even if it is computationally feasible to approach the equilibrium distribution before taking samples, there is a second, serious difficulty. Samples from the equilibrium distribution generally have very high variance since they come from all over the model's distribution. This high variance swamps the derivative. Worse still, the variance in the samples depends on the parameters of the model. This variation in the variance causes the parameters to be repelled from regions of high variance even when the gradient is zero. To understand this subtle effect, consider a horizontal sheet of tin which is resonating in such a way that some parts have strong vertical oscillations and other parts are motionless. Sand scattered on the tin will accumulate in the motionless areas even though the time-averaged gradient is zero everywhere.

Maximizing the log likelihood of the data (averaged over the data distribution) is equivalent to minimizing the Kullback-Leibler divergence between the data distribution, Q^0, and the equilibrium distribution over the visible variables, Q^∞, that is produced by prolonged Gibbs sampling from the generative model:

    Q^0 ‖ Q^∞ = Σ_d Q^0_d log Q^0_d − Σ_d Q^0_d log Q^∞_d = −H(Q^0) − <log Q^∞_d>_{Q^0}    (3)

where ‖ denotes a Kullback-Leibler divergence, the angle brackets denote expectations over the distribution specified as a subscript, and H(Q^0) is the entropy of the data distribution. Q^0 does not depend on the parameters of the model, so H(Q^0) can be ignored during the optimization. The derivative of the divergence with respect to the parameters of expert m can be rewritten as:

    ∂(Q^0 ‖ Q^∞) / ∂θ_m = −<∂ log p_m(d | θ_m) / ∂θ_m>_{Q^0} + <∂ log p_m(c | θ_m) / ∂θ_m>_{Q^∞}    (4)

There is a simple and effective alternative to maximum likelihood learning which eliminates almost all of the computation required to get samples from the equilibrium distribution and also eliminates much of the variance that masks the gradient signal. This approach involves optimizing a different objective function. Instead of just minimizing Q^0 ‖ Q^∞, we minimize the "contrastive divergence" Q^0 ‖ Q^∞ − Q^1 ‖ Q^∞, where Q^1 is the distribution over the one-step reconstructions of the data vectors that are generated by one full step of Gibbs sampling, imagining the data distribution to be the starting distribution at time 0.

Minimizing the contrastive divergence amounts to encouraging the Markov chain that is implemented by Gibbs sampling to leave the initial distribution over the visible variables unaltered. Instead of running the chain to equilibrium and comparing the initial and final derivatives, we can simply run the chain for one full step and then update the parameters to reduce the tendency of the chain to wander away from the initial distribution. Because Q^1 is one step closer to the equilibrium distribution than Q^0, it is guaranteed that Q^0 ‖ Q^∞ is at least as big as Q^1 ‖ Q^∞, so the contrastive divergence can never be negative. For Markov chains in which all transitions have non-zero probability, the contrastive divergence can only be zero if the model is perfect.

The mathematical motivation for the contrastive divergence is that the intractable expectation over Q^∞ on the RHS of Eq. 4 cancels out:

    −∂(Q^0 ‖ Q^∞ − Q^1 ‖ Q^∞) / ∂θ_m = <∂ log p_m(d | θ_m) / ∂θ_m>_{Q^0} − <∂ log p_m(d̂ | θ_m) / ∂θ_m>_{Q^1} + (∂Q^1/∂θ_m) ∂(Q^1 ‖ Q^∞)/∂Q^1    (5)

For each expert it must be possible to compute the derivatives of log p_m(d | θ_m) and log p_m(d̂ | θ_m). It is also straightforward to sample from Q^1; the following procedure produces an unbiased sample:

1. Pick a data vector, d, from the distribution of the data, Q^0.
2. Compute, for each expert separately, the posterior probability distribution over its latent (hidden) variables given the data vector d, and pick a value for each latent variable from its posterior distribution.
3. Given the values of all the latent variables, generate a reconstructed data vector, d̂, by multiplying together the conditional distributions over the visible variables specified by each expert.

The third term on the RHS of Eq. 5 is problematic to compute, but extensive simulations (see section 10) show that it can safely be ignored because it is small and it seldom opposes the resultant of the other two terms. The parameters of the experts can therefore be adjusted in proportion to the approximate derivative of the contrastive divergence:

    Δθ_m ∝ <∂ log p_m(d | θ_m) / ∂θ_m>_{Q^0} − <∂ log p_m(d̂ | θ_m) / ∂θ_m>_{Q^1}    (6)
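As a toy illustration of the update in Eq. 6, consider a product of two fixed-variance Gaussian experts over scalar data. The experts have no latent variables, so the one-step "reconstruction" is simply a sample from the (Gaussian) product distribution itself, which makes this toy coincide with maximum likelihood; the point is only to show the shape of the Eq. 6 update: the expected derivative of each expert's log probability on data minus the same expectation on reconstructions. All numbers below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian experts N(mu_m, var_m); their product is a Gaussian whose
# precision is the sum of the expert precisions.
mu = np.array([-1.0, 2.0])
var = np.array([4.0, 4.0])

data = rng.normal(-0.5, 0.3, size=500)       # training data, Q^0

def reconstruct(d):
    """One full Gibbs step: with no latent variables this is just a
    sample from the product distribution."""
    prec = (1.0 / var).sum()
    mean = (mu / var).sum() / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=d.shape)

lr = 0.2
for _ in range(100):
    recon = reconstruct(data)                # samples from Q^1
    # Eq. 6 for mu_m, using dlog p_m/dmu_m = (x - mu_m) / var_m:
    grad = (data[:, None] - mu).mean(0) / var \
         - (recon[:, None] - mu).mean(0) / var
    mu = mu + lr * grad

# After training, the mean of the product should sit near the data mean.
prod_mean = (mu / var).sum() / (1.0 / var).sum()
```

Note how each expert keeps its own parameters and only ever needs its own log-probability derivative; the coupling between experts enters solely through the shared reconstructions.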
This works very well in practice even when a single reconstruction of each data vector is used in place of the full probability distribution over reconstructions. The difference in the derivatives of the data vectors and their reconstructions has low variance because the moderately close match between a data vector and its reconstruction reduces sampling variance in much the same way as the use of matched pairs for experimental and control conditions in a clinical trial. The low variance makes it feasible to perform online learning after each data vector is presented, though the simulations described in this paper use batch learning in which the parameter updates are averaged over many training cases.

There is an alternative justification for the learning algorithm in Eq. 6. In high-dimensional datasets, the data nearly always lies on, or close to, a much lower dimensional, smoothly curved manifold. The PoE needs to find parameters that make a sharp ridge of log probability along this low dimensional manifold. By starting with a point on the manifold and ensuring that this point has higher log probability than the typical reconstructions from the latent variables of all the experts, the PoE ensures that the probability distribution has the right local curvature (provided the reconstructions are close to the data). It is possible that the PoE will accidentally assign high probability to other, distant and unvisited parts of the data space, but this is unlikely if the log probability surface is smooth and if both its height and its local curvature are constrained at the data points. It is also possible to find and eliminate such points by performing prolonged Gibbs sampling without any data, but this is just a way of improving the learning and not, as in Boltzmann machine learning, an essential part of it.

PoE's should work very well on data distributions that can be factorized into a product of lower dimensional distributions. This is demonstrated in figure 1, where the data has been fitted with 15 "unigauss" experts, each of which is a mixture of a uniform and a single, axis-aligned Gaussian. In the fitted model, each tight data cluster is represented by the intersection of two Gaussians which are elongated along different axes. For each update of the parameters, the following computation is performed on every observed data vector:

1. Given the data, d, calculate the posterior probability of selecting the Gaussian rather than the uniform in each expert, and compute the first term on the RHS of Eq. 6.
2. For each expert, stochastically select the Gaussian or the uniform according to the posterior.
3. Compute the normalized product of the selected Gaussians, which is itself a Gaussian, and sample from it. The sample is used to get a "reconstructed" vector in the data space.
4. Compute the negative term in Eq. 6 using the reconstructed vector as d̂.

This example also illustrates a population-code style of model in which each expert is broadly tuned along every dimension and precision is obtained by the intersection of a large number of experts. Figure 3 shows what happens when experts of the same type are fitted to 100-dimensional synthetic images that each contain one edge. The edges varied in their orientation, position, and the intensities on each side of the edge. Each expert also learned a variance for each pixel, and although these variances varied, individual experts did not specialize in a small subset of the dimensions. Given an image, about half of the experts have a high probability of picking their Gaussian rather than their uniform, and the products of the chosen Gaussians are excellent reconstructions of the image. The experts at the top of figure 3 look like edge detectors in various orientations, positions and polarities; many of the others each work for two different sets of edges that have opposite polarities and different positions.

Figure 1: Each dot is a datapoint. The data has been fitted with a product of 15 experts. The ellipses show the one standard deviation contours of the Gaussians in each expert. The experts are initialized with randomly located, circular Gaussians that have about the same variance as the data. The five unneeded experts remain vague, but the mixing proportions of their Gaussians remain high.
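The four-step computation above can be sketched for "unigauss" experts, each a mixture of a uniform and an axis-aligned Gaussian. The mixing proportions, uniform density, and expert count below are invented placeholders, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 2, 15                       # data dimension, number of experts
mix = np.full(M, 0.5)              # P(choose Gaussian) in each expert
mu = rng.normal(0, 1, (M, D))      # Gaussian means
var = np.full((M, D), 1.0)         # axis-aligned variances
u_density = 0.01                   # density of each uniform component

def gaussian_pdf(d, m):
    return np.exp(-0.5 * ((d - mu[m]) ** 2 / var[m]).sum()) / \
           np.sqrt((2 * np.pi * var[m]).prod())

def reconstruct(d):
    """One-step reconstruction of a data vector d (steps 1-3 above)."""
    chosen = []
    for m in range(M):
        # 1. posterior probability of the Gaussian in expert m
        g = mix[m] * gaussian_pdf(d, m)
        post = g / (g + (1 - mix[m]) * u_density)
        # 2. stochastically pick the Gaussian or the uniform
        if rng.random() < post:
            chosen.append(m)
    if not chosen:                 # every expert picked its uniform
        return d
    # 3. the product of the chosen Gaussians is itself a Gaussian:
    #    precisions add, and the mean is the precision-weighted mean
    prec = sum(1.0 / var[m] for m in chosen)
    mean = sum(mu[m] / var[m] for m in chosen) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

d_hat = reconstruct(np.array([0.3, -0.7]))
```

Step 4 would then evaluate each expert's log-probability derivative at `d_hat` to form the negative term of Eq. 6.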
Figure 2: 300 datapoints generated by prolonged Gibbs sampling from the 15 experts fitted in figure 1. The Gibbs sampling started from a random point in the range of the data and used 25 parallel iterations with annealing. Notice that the fitted model generates data at the grid point that is missing in the real data.

The experts can be initialized by giving them different or differently weighted training cases, or by training them on different subsets of the data dimensions, or by using different model classes for the different experts. If each expert has been initialized separately, the individual probability distributions need to be raised to a fractional power to create the initial PoE. Separate initialization of the experts seems like a sensible idea, but simulations indicate that the PoE is far more likely to become trapped in poor local optima if the experts are allowed to specialize separately. Better solutions are obtained by simply initializing the experts randomly with very vague distributions and using the learning rule in Eq. 6.

Figure 3: The means of all the 100-dimensional Gaussians in a product of 40 experts, each of which is a mixture of a Gaussian and a uniform. The PoE was fitted to 10 x 10 images that each contained a single intensity edge. The experts have been ordered by hand so that qualitatively similar experts are adjacent.

The Boltzmann machine learning algorithm (Hinton and Sejnowski, 1986) is theoretically elegant and easy to implement in hardware, but it is very slow in networks with interconnected hidden units because of the variance problems described in section 2. Smolensky (1986) introduced a restricted type of Boltzmann machine with one visible layer, one hidden layer, and no intralayer connections. In a restricted Boltzmann machine (RBM), the probability of generating a visible vector is proportional to the product of the probabilities that the visible vector would be generated by each of the hidden units acting alone, so an RBM is a product of experts with one expert per hidden unit. When a hidden unit is off, it specifies a factorial probability distribution in which each visible unit is equally likely to be on or off; when the hidden unit is on, it specifies a different factorial distribution by using the weight on its connection to each visible unit to specify the log odds that the visible unit is on. Multiplying together the distributions over the visible states specified by different experts is achieved by simply adding the log odds. Exact inference is tractable in an RBM because the states of the hidden units are conditionally independent given the data.

[Footnote: Boltzmann machines and Products of Experts are very different classes of probabilistic generative model and the intersection of the two classes is RBM's.]

Consider the derivative of the log probability of the data with respect to the weight w_ij between a visible unit i and a hidden unit j. The first term on the RHS of Eq. 2 is:

    ∂ log p_j(d | w_j) / ∂w_ij = <s_i s_j>_d − <s_i s_j>_{Q^∞_j}    (7)

where w_j is the vector of weights of hidden unit j, <s_i s_j>_d is the expected value of s_i s_j when d is clamped on the visible units and s_j is sampled from its posterior distribution given d, and <s_i s_j>_{Q^∞_j} is the expected value of s_i s_j when alternating Gibbs sampling of the hidden and visible units is iterated to get samples from the equilibrium distribution in a network whose only hidden unit is j.

The second term on the RHS of Eq. 2 is:

    Σ_c p(c | w) ∂ log p_j(c | w_j) / ∂w_ij = <s_i s_j>_{Q^∞} − <s_i s_j>_{Q^∞_j}    (8)

where <s_i s_j>_{Q^∞} is the expected value of s_i s_j when alternating Gibbs sampling of all the hidden and all the visible units is iterated to get samples from the equilibrium distribution of the RBM. Subtracting Eq. 8 from Eq. 7 and taking expectations over the distribution of the data gives:

    Δw_ij ∝ <s_i s_j>_{Q^0} − <s_i s_j>_{Q^∞}    (9)

The time required to approach equilibrium and the high sampling variance in <s_i s_j>_{Q^∞} make learning difficult. It is much more effective to use the approximate gradient of the contrastive divergence. For an RBM this approximate gradient is particularly easy to compute:

    Δw_ij ∝ <s_i s_j>_{Q^0} − <s_i s_j>_{Q^1}    (10)

where <s_i s_j>_{Q^1} is the expected value of s_i s_j when one-step reconstructions are clamped on the visible units and s_j is sampled from its posterior distribution given the reconstruction.

To test whether features learned in this way model high-dimensional data well, an RBM with 500 hidden units and 256 visible units was trained on 8000 16 x 16 real-valued images of handwritten digits from all 10 classes. The images, from the training set on the USPS Cedar ROM, were normalized but highly variable in style. The pixel intensities were scaled to lie between 0 and 1 so that they could be treated as probabilities, and Eq. 10 was modified to use probabilities in place of stochastic binary values for both the data and the one-step reconstructions:

    Δw_ij ∝ <p_i p_j>_{Q^0} − <p_i p_j>_{Q^1}    (11)

Stochastically chosen binary states of the hidden units were still used for computing the probabilities of the reconstructed pixels, but instead of picking binary states for the pixels from those probabilities, the probabilities themselves were used as the reconstructed data vector.

It took two days in matlab on a 500MHz workstation to perform 658 epochs of learning. In each epoch, the weights were updated 80 times using the approximate gradient of the contrastive divergence computed on mini-batches of size 100 that contained 10 exemplars of each digit class. The learning rate was set empirically to be about one quarter of the rate that caused divergent oscillations in the parameters. To further improve the learning speed a momentum method was used: after the first 10 epochs, the parameter updates specified by Eq. 11 were supplemented by adding 0.9 times the previous update.

The PoE learned localised features whose binary states yielded almost perfect reconstructions. For each image about one third of the features were turned on. Some of the learned features had on-center off-surround receptive fields or vice versa, some looked like pieces of stroke, and some looked like Gabor filters or wavelets. The weights of 100 of the hidden units, selected at random, are shown in figure 4.

Figure 4: The receptive fields of a randomly selected subset of the 500 hidden units in a PoE that was trained on 8000 images of digits with equal numbers from each class. Each block shows the 256 learned weights connecting a hidden unit to the pixels. The scale goes from +2 (white) to -2 (black).

An attractive aspect of PoE's is that it is easy to compute the numerator in Eq. 1, so it is easy to compute the log probability of a data vector up to an additive constant, log Z, which is the log of the denominator in Eq. 1. Unfortunately, it is very hard to compute this additive constant. This does not matter if we only want to compare the probabilities of two different data vectors under the PoE, but it makes it difficult to evaluate the model learned by a PoE. The obvious way to measure the success of learning is to sum the log probabilities that the PoE assigns to test data vectors that are drawn from the same distribution as the training data but are not used during training.

An alternative is to train a separate PoE on each class of data and to compare the unnormalized scores the models assign to a test image; each score differs from a true log probability by that model's unknown log Z. If the difference between the two log Z's is known it is easy to pick the most likely class of the test image, and since this difference is only a single number it is quite easy to estimate it discriminatively using a set of validation images whose labels are known.

Figure 5 shows features learned by a PoE that contains a layer of 100 hidden units and is trained on 800 images of the digit 2. Figure 6 shows some previously unseen test images of 2's and their one-step reconstructions from the binary activities of the PoE trained on 2's and from an identical PoE trained on 3's.
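For an RBM the approximate gradient is simple enough to state in a few lines of code. The sketch below follows the probability form of Eq. 11 (probabilities replace binary values everywhere except the hidden states driven by the data); the layer sizes, learning rate, and random mini-batch are placeholders, not the paper's settings, and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
n_vis, n_hid = 16, 8                         # toy sizes, not 256/500
W = rng.normal(0, 0.01, (n_vis, n_hid))      # weights, no biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, lr=0.1):
    """One CD-1 weight update on a mini-batch v0 (rows are data vectors)."""
    p_h0 = sigmoid(v0 @ W)                       # hidden probs given data
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # stochastic binary hiddens
    p_v1 = sigmoid(h0 @ W.T)                     # reconstruction probabilities
    p_h1 = sigmoid(p_v1 @ W)                     # hidden probs given recon
    # <p_i p_j>_{Q^0} - <p_i p_j>_{Q^1}, averaged over the mini-batch
    grad = (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    return W + lr * grad

batch = (rng.random((100, n_vis)) < 0.3) * 1.0   # stand-in mini-batch
W = cd1_update(batch, W)
```

In batch training, this update would be applied to successive mini-batches, optionally supplemented with a momentum term as described above.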
Figure 5: The weights learned by 100 hidden units trained on 16 x 16 images of the digit 2. The scale goes from +3 (white) to -3 (black). Note that the fields are mostly quite local. A local feature like the one in column 1 row 7 looks like an edge detector, but it is best understood as a local deformation of a template. Suppose all the other active features create an image of a 2 that differs from the data in having a large loop whose top falls on the black part of the receptive field. By turning on this feature, the top of the loop can be removed and replaced by a line segment that is a little lower in the image.

Figure 7 shows the unnormalized log probability scores of training and test images under a model trained on 825 images of the digit 4 and a model trained on 825 images of the digit 6. Unfortunately, the official test set for the USPS digits violates the standard assumption that test data should be drawn from the same distribution as the training data, so the test images were drawn from the unused portion of the official training set. Even for the previously unseen test images, the scores under the two models allow perfect discrimination. To achieve this excellent separation, it was necessary to use models with two hidden layers and to average the scores from two separately trained models of each digit class. For each digit class, one of the models had 100 units in its first hidden layer and 50 in its second. The units in the first hidden layer were trained first, and the second hidden layer was then trained using the activation probabilities of the first hidden layer as the data.

Figure 6: The top row shows previously unseen test images of 2's. The middle row shows the pixel probabilities when the image is reconstructed from the binary activities of 100 feature detectors that have been trained on 2's. The bottom row shows the reconstruction probabilities using 100 feature detectors trained on 3's.
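For an RBM-structured PoE with binary hidden units (and, in this sketch, no biases), the unnormalized log probability used as a score can be computed in closed form: summing out hidden unit j contributes log(1 + exp(x_j)), where x_j is that unit's total input. The weight matrices below are random stand-ins for two class-specific models:

```python
import numpy as np

rng = np.random.default_rng(6)

def unnormalized_log_prob(v, W):
    """log p(v) + log Z for an RBM with weight matrix W (visible x hidden),
    obtained by summing out the binary hidden units analytically.
    Biases are omitted in this sketch."""
    return np.log1p(np.exp(v @ W)).sum(axis=-1)

# Two hypothetical class-specific models and a batch of test vectors:
W2 = rng.normal(0, 0.1, (256, 100))   # stand-in for a model of 2's
W3 = rng.normal(0, 0.1, (256, 100))   # stand-in for a model of 3's
v = (rng.random((5, 256)) < 0.2) * 1.0

score_diff = unnormalized_log_prob(v, W2) - unnormalized_log_prob(v, W3)
# score_diff differs from the true log odds by the constant log Z3 - log Z2,
# which can be estimated discriminatively on labeled validation images.
```

This is exactly the "easy numerator, hard denominator" situation described above: the per-vector score is cheap, but the additive constant requires labeled data or other machinery to pin down.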
Figure 7: a) The unnormalised log probability scores of the training images of the digits 4 and 6 under the learned PoE's for 4 and 6. b) The log probability scores for previously unseen test images of 4's and 6's. Note the good separation of the two classes.

Figure 8 shows the unnormalized log probability scores for images of 7's and 9's, which are the most difficult classes to discriminate. Discrimination is not perfect on the test images, but it is encouraging that all of the errors are close to the decision boundary, so there are no confident misclassifications. If there are 10 different PoE's for the 10 digit classes it is slightly less obvious how to use the 10 unnormalized scores of a test image for discrimination. One possibility is to use a validation set to train a logistic regression network that takes the unnormalized log probabilities given by
Figure 8: a) The unnormalised log probability scores of the training images of the digits 7 and 9 under the learned PoE's for 7 and 9. b) The log probability scores for previously unseen test images of 7's and 9's. Although the classes are not linearly separable, all the errors are close to the best separating line, so there are no very confident errors.

the PoE's and converts them into a probability distribution across the 10 labels. Figure 9 shows the weights in a logistic regression network that is trained after fitting 10 PoE models to the 10 separate digit classes. In order to see whether the second hidden layers were providing useful discriminative information, each PoE provided two scores. The first score was the unnormalized log probability of the pixels under the model learned by the first hidden layer. The second score was the unnormalized log probability of the activation probabilities of the first layer of hidden units under a PoE model that consisted of the units in the second hidden layer. The weights in figure 9 show that the second layer of hidden units provides useful additional information. Presumably this is because it captures the way in which features extracted by the first hidden layer are correlated. The error rate is 1.1%, which compares very favorably with the 5.1% error rate of a simple nearest neighbor classifier on the same training and test sets and is about the same as the very best classifier based on elastic models of the digits (Revow, Williams and Hinton, 1996). If 7% rejects are allowed (by choosing an appropriate threshold for the probability level of the most probable class), there are no errors on the 2750 test images.

Several different network architectures were tried for the digit-specific PoE's and the results reported are for the best architecture. Good comparative performance has also been reported on the larger MNIST database, where all the model selection was done using subsets of the training data so that the official test data was used only to measure the final error rate.

The fact that the learning procedure in Eq. 6 gives good results in the simulations described in sections 4 and 9 suggests that it is safe to ignore the final term in the RHS of Eq. 5 that comes from the change in the distribution Q^1.
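The discriminative combination described above, a logistic regression network that maps the unnormalized log-probability scores to a distribution over the 10 labels, can be sketched with plain gradient ascent on the multinomial log likelihood. The score matrix below is random stand-in data, not real PoE scores:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_scores, n_classes = 200, 20, 10        # two scores per class-specific PoE
scores = rng.normal(0, 1, (n, n_scores))    # stand-in unnormalized log probs
labels = rng.integers(0, n_classes, n)

W = np.zeros((n_scores, n_classes))         # one weight per (score, class)
b = np.zeros(n_classes)                     # per-class biases

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

onehot = np.eye(n_classes)[labels]
for _ in range(200):                        # batch gradient ascent
    p = softmax(scores @ W + b)
    W += 0.1 * scores.T @ (onehot - p) / n
    b += 0.1 * (onehot - p).mean(0)

pred = (scores @ W + b).argmax(1)
```

Because the unknown per-model constants log Z enter the scores additively, they are absorbed by the learned weights and biases, which is why this validation-set regression sidesteps the normalization problem.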
Figure 9: The weights learned by doing multinomial logistic regression on the training data with the labels as outputs and the unnormalised log probability scores from the trained, digit-specific PoE's as inputs. Each column corresponds to a digit class, starting with digit 1. The top row is the biases for the classes. The next ten rows are the weights assigned to the scores that represent the log probability of the pixels under the model learned by the first hidden layer of each PoE. The last ten rows are the weights assigned to the scores that represent the log probabilities of the probabilities on the first hidden layer under the model learned by the second hidden layer. Note that although the weights in the last ten rows are smaller, they are still quite large, which shows that the scores from the second hidden layers provide useful, additional, discriminative information.

To get an idea of the relative magnitude of the term that is being ignored, extensive simulations were performed using restricted Boltzmann machines with small numbers of visible and hidden units. By performing computations that are exponential in the number of hidden units and exponential in the number of visible units, it is possible to compute the exact values of both sides of Eq. 10. It is also possible to measure what happens when the approximation in Eq. 10 is used to update the weights by an amount that is large compared with the numerical precision of the machine but small compared with the curvature of the contrastive divergence.

The RBM's used for these simulations had random training data and random weights, and did not have biases on the visible or hidden units. The main result can be summarized as follows: For an individual weight, the RHS of Eq. 10, summed over all training cases, occasionally differs in sign from the LHS. But for networks containing more than 2 units in each layer it is almost certain that a parallel update of all the weights based on the RHS of Eq. 10 will improve the contrastive divergence. In other words, when averaged over the training data, the vector of parameter updates given by the RHS is almost certain to have a positive cosine with the true gradient defined by the LHS. Figure 10a is a histogram of the improvements in the contrastive divergence when Eq. 10 was used to perform one parallel weight update in each of a hundred thousand networks. The networks contained 8 visible and 4 hidden units and their weights were chosen from a Gaussian distribution with mean zero and standard deviation 20. For smaller weights or larger networks the approximation in Eq. 10 is even better. Figure 10b shows that the learning procedure does not always improve the log likelihood of the training data, though it has a strong tendency to do so. Note that only 1000 networks were used for this histogram.
Figure 10: a) A histogram of the improvements in the contrastive divergence as a result of using Eq. 10 to perform one update of the weights in each of 10^5 networks. The expected values on the RHS of Eq. 10 were computed exactly. The networks had 8 visible and 4 hidden units. The initial weights were randomly chosen from a Gaussian with mean 0 and standard deviation 20. The training data was chosen at random. b) The improvements in the log likelihood of the data for 1000 networks chosen in exactly the same way as in figure 10a. Note that the log likelihood decreased in two cases. The changes in the log likelihood are the same as the changes in Q^0 ‖ Q^∞ but with a sign reversal.

Figure 11 compares the contributions to the gradient of the contrastive divergence made by the modeled and unmodeled effects. For the updates given by Eq. 10 to make the contrastive divergence worse, the dots in figure 11 would have to lie above the diagonal line, so for networks of this size the approximation in Eq. 10 is quite safe. Intuitively, we expect Q^1 to lie between Q^0 and Q^∞, so when the parameters are changed to move Q^∞ closer to Q^0, the changes should also move Q^1 towards Q^0 and away from the previous position of Q^∞. So the ignored changes in Q^1 should cause an increase in Q^1 ‖ Q^∞ and thus an improvement in the contrastive divergence.

In an earlier version of this paper, the learning rule in Eq. 6 was interpreted as approximate optimization of the contrastive log likelihood: <log Q^∞_d>_{Q^0} − <log Q^∞_{d̂}>_{Q^1}. Unfortunately, the contrastive log likelihood can achieve its maximum value of 0 by simply making all possible vectors in the data space equally probable. The contrastive divergence differs from the contrastive log likelihood by including the entropies of the distributions Q^0 and Q^1, and the high entropy of Q^1 rules out the solution in which all possible data vectors are equiprobable.

Other types of expert

Binary stochastic pixels are not unreasonable for modeling preprocessed images of handwritten digits in which ink and background are represented as 1 and 0. In real images, however, there is typically high mutual information between the real-valued intensity of a pixel and the real-valued intensities of its neighbors. This cannot be captured by models that use binary stochastic pixels because a binary pixel can never have more than 1 bit of mutual information with anything. It is possible to use "multinomial" pixels that have n discrete values.
Figure 11: The modeled effects of a parallel weight update on the contrastive divergence (horizontal axis) plotted against the unmodeled effects, i.e. the contribution of the term in Eq. 5 that the update ignores (vertical axis). The networks used for this figure have 10 visible and 10 hidden units and their weights are drawn from a zero-mean Gaussian with a standard deviation of 10. Points above the diagonal line would correspond to cases in which the weight updates caused a decrease in the contrastive divergence because of the unmodeled effects being in conflict with the modeled effects.

This is a clumsy solution for images because it fails to capture the continuity and one-dimensionality of pixel intensities, though it may be useful for other types of data. A better approach is to imagine replicating each visible unit so that a pixel corresponds to a whole set of binary visible units that all have identical weights to the hidden units. The number of active units in the set can then approximate a real-valued intensity. During reconstruction, the number of active units will be binomially distributed, and because all the replicas have the same weights, the single probability that controls this binomial distribution only needs to be computed once. The same trick can be used to allow replicated hidden units to approximate real values using binomially distributed integer states. A set of replicated units can be viewed as a computationally cheap approximation to a set of units whose weights actually differ, or it can be viewed as a stationary approximation to the behaviour of a single unit over time, in which case the number of active replicas is a firing rate.

An alternative to replicating hidden units is to use "unifac" experts that each consist of a mixture of a uniform distribution and a factor analyser with just one factor. Each expert has a binary latent variable that specifies whether to use the uniform or the factor analyser, and a real-valued latent variable that specifies the value of the factor (if it is being used). The factor specifies a direction in image space. Experts of this type have been explored in the context of directed acyclic graphs (Hinton, Sallans and Ghahramani, 1998), but they should work better in a product of experts.

An alternative to using a large number of relatively simple experts is to make each expert

[Footnote: The last two sets of parameters are exactly equivalent to the parameters of a "unigauss" expert introduced in section 4, so a "unigauss" expert can be considered to be a mixture of a uniform with a factor analyser that has no factors.]
as complicated as possible, retaining the ability to compute the exact derivative of the log likelihood of the data under each expert. For modeling images of handwritten digits, for example, each expert could be a mixture of many axis-aligned Gaussians. Some experts might focus on one region of an image by using very high variances for the pixels outside that region, but so long as the regions modeled by different experts overlap, it should be possible to avoid block boundary artifacts.

Products of Hidden Markov Models

Hidden Markov Models (HMM's) are of great practical value in modeling sequences of discrete symbols or sequences of real-valued vectors because there is an efficient algorithm for updating the parameters of the HMM to improve the log likelihood of a set of observed sequences. HMM's are, however, quite limited in their generative power because the only way that the portion of a string generated up to time t can constrain the portion of the string generated after time t is via the discrete hidden state of the generator at time t. So if the first part of a string has, on average, n bits of mutual information with the rest of the string, the HMM must have 2^n hidden states to convey this mutual information by its choice of hidden state. This exponential inefficiency can be overcome by using a product of HMM's as a generator. During generation, each HMM gets to pick a hidden state at each time, so the mutual information between the past and the future can be linear in the number of HMM's. It is therefore exponentially more efficient to have many small HMM's than one big one. However, to apply the standard forward-backward algorithm to a product of HMM's it is necessary to take the cross-product of their state spaces, which throws away the exponential win. For products of HMM's to be of practical significance it is necessary to find an efficient way to train them.

Andrew Brown (Brown and Hinton, in preparation) has shown that, for a toy example involving a product of four HMM's, the learning algorithm in Eq. 6 works well. The forward-backward algorithm is used to get the gradient of the log likelihood of an observed or reconstructed sequence with respect to the parameters of an individual expert. The one-step reconstruction of a sequence is generated as follows:

1. Given an observed sequence, use the forward-backward algorithm in each expert separately to calculate the posterior probability distribution over paths through the hidden states.
2. For each expert, stochastically select a hidden path from the posterior given the observed sequence.
3. At each time step, select an output symbol or output vector from the product of the output distributions specified by the selected hidden state of each HMM.

If more realistic products of HMM's can be trained successfully by minimizing the contrastive divergence, they should be far better than single HMM's for many different kinds of sequential data. Consider, for example, the HMM shown in figure 12. A single expert of this type can capture a non-local regularity. A single HMM which must also model all the other regularities in strings of English words could not capture this regularity efficiently because it could not afford to devote its entire memory capacity to remembering whether the word "shut" had already occurred in the string.

There have been previous attempts to learn representations by adjusting parameters to cancel out the effects of brief iteration in a recurrent network (Hinton and McClelland, 1988; O'Reilly, 1996; Seung, 1998), but these were not formulated using a stochastic generative model and an appropriate objective function.

Figure 12: A Hidden Markov Model. The first and third nodes have output distributions that are uniform across all words. If the first node has a high transition probability to itself, most strings of English words are given the same low probability by this expert. Strings that contain the word "shut" followed directly or indirectly by the word "up" have higher probability under this expert.

There is an unexpected relationship to the technique of learning from "near misses" proposed by Winston (1975). Winston's program compared arches made of blocks with near misses supplied by a teacher, and it used the differences in its representations of the correct and incorrect arches to decide which aspects of its representation were relevant. By using a stochastic generative model we can dispense with the teacher, because it is the differences between the real data and the near misses generated by the model that drive the learning of the significant features.

The idea of combining expert models by averaging in the log probability domain is far from new (Genest and Zidek, 1986; Heskes, 1998), but research has focussed on how to find the best weights for combining experts that have already been learned or programmed separately (Berger, Della Pietra and Della Pietra, 1996) rather than training the experts cooperatively.
The geometric mean of a set of probability distributions has the property that its Kullback-Leibler divergence from the true data distribution, P, is smaller than the average of the Kullback-Leibler divergences of the individual distributions:

    KL(P ‖ (Π_m Q_m^{w_m}) / Z) ≤ Σ_m w_m KL(P ‖ Q_m)    (12)

where the non-negative weights w_m sum to 1, Z = Σ_c Π_m Q_m(c)^{w_m} is the normalization constant of the geometric mean, and equality holds only if the individual models are identical. Because a geometric mean can never exceed the corresponding arithmetic mean, Z is less than one unless the models are identical, and the difference between the two sides of Eq. 12 is log(1/Z). This makes it clear that the benefit of combining experts comes from the fact that they make log Z small by disagreeing on unobserved data.

It is tempting to augment PoE's by giving each expert, m, an additional adaptive parameter, w_m, that scales its log probabilities. Unfortunately, this makes inference much more difficult. Consider, for example, an expert with w_m = 100. This is equivalent to having 100 copies of the expert but with their latent states all tied together, and the tying affects inference. It is easier to fix w_m = 1 and allow the PoE learning algorithm to determine the appropriate sharpness of the expert.

Inference is easy in a PoE because the experts are individually tractable and the product formulation ensures that their hidden states are conditionally independent given the data. In contrast, densely connected directed acyclic graphical models suffer from the "explaining away" phenomenon, which makes exact inference intractable; it is then necessary to use iterative techniques for approximate inference (Saul, Jaakkola and Jordan, 1996) or to use crude approximations that ignore explaining away during inference and rely on the learning algorithm to find representations for which the shoddy inference technique is not too damaging (Hinton, Dayan, Frey and Neal, 1995). Generation, on the other hand, is trivial in a directed acyclic graphical model but requires an iterative procedure such as Gibbs sampling in a PoE. If, however, Eq. 6 is used for learning, the difficulty of generating samples from the model is not a major problem.

In addition to the ease of inference, PoE's have a more subtle advantage over generative models that work by first choosing values for the latent variables and then generating a data vector from those latent values. If such a model has a single hidden layer and the latent variables have independent prior distributions, there will be a strong tendency for the posterior values of the latent variables to be approximately marginally independent after the model has been fitted to data. [Footnote: This is easy to understand from a coding perspective in which the data is communicated by first specifying the states of the latent variables under an independent prior and then specifying the data given the latent states. If the latent states are not marginally independent this coding scheme is inefficient, so pressure towards coding efficiency creates pressure towards independence.] For this reason, there has been little success with attempts to learn such generative models one hidden layer at a time in a greedy, bottom-up way. With PoE's, however, even though the experts have independent priors, the latent variables in different experts will be marginally dependent: they can have high mutual information even for fantasy data generated by the PoE itself. So after the first hidden layer has been learned greedily there may still be lots of statistical structure in the latent variables for the second hidden layer to capture.

The most attractive property of a set of orthogonal basis functions is that it is possible to compute the coefficient on each basis function separately without worrying about the coefficients on the other basis functions. A PoE retains this attractive property whilst allowing non-orthogonal experts and a non-linear generative model.

PoE's provide an efficient instantiation of the old psychological idea of analysis-by-synthesis. This idea never worked properly because the generative models were not selected to make the analysis easy. In a PoE, it is difficult to generate data from the generative model, but, given the model, it is easy to compute how any given data vector might have been generated and, as we have seen, it is relatively easy to learn the parameters of the generative model.

Acknowledgements

This research was funded by the Gatsby Charitable Foundation. Thanks to Zoubin Ghahramani, David MacKay and other members of the unit for helpful discussions, and to Peter Dayan for improving the manuscript and disproving some expensive speculations.

References

Berger, A., Della Pietra, S. and Della Pietra, V. (1996) A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39-71.

Freund, Y. and Haussler, D. (1992) Unsupervised learning of distributions on binary vectors using two layer networks. In J. E. Moody, S. J. Hanson and R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann: San Mateo, CA.

Genest, C. and Zidek, J. V. (1986) Combining probability distributions: a critique and an annotated bibliography. Statistical Science, 1, 114-148.

Heskes, T. (1998) Bias/variance decompositions for likelihood-based estimators. Neural Computation, 10, 1425-1433.

Hinton, G., Dayan, P., Frey, B. and Neal, R. (1995) The wake-sleep algorithm for self-organizing neural networks. Science, 268, 1158-1161.

Hinton, G. E., Sallans, B. and Ghahramani, Z. (1998) Hierarchical communities of experts. In M. I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic Press.

Hinton, G. E. and McClelland, J. L. (1988) Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural Information Processing Systems, 358-366, American Institute of Physics: New York.

Hinton, G. E. and Sejnowski, T. J. (1986) Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press.

O'Reilly, R. C. (1996) Biologically plausible error-driven learning using local activation differences: the generalized recirculation algorithm. Neural Computation, 8, 895-938.

Revow, M., Williams, C. K. I. and Hinton, G. E. (1996) Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 592-606.

Saul, L. K., Jaakkola, T. and Jordan, M. I. (1996) Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61-76.

Seung, H. S. (1998) Learning continuous attractors in recurrent networks. In M. J. Kearns and S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, MIT Press: Cambridge, Mass.

Smolensky, P. (1986) Information processing in dynamical systems: foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, MIT Press.

Winston, P. H. (1975) Learning structural descriptions from examples. In P. H. Winston (Ed.), The Psychology of Computer Vision, McGraw-Hill: New York.