Journal of Machine Learning Research 13 (2012) 281-305. Submitted 3/11; Revised 9/11; Published 2/12.

Random Search for Hyper-Parameter Optimization

James Bergstra (JAMES.BERGSTRA@UMONTREAL.CA)
Yoshua Bengio (YOSHUA.BENGIO@UMONTREAL.CA)
Département d'Informatique et de recherche opérationnelle
Université de Montréal, Montréal, QC, H3C 3J7, Canada

Editor: Leon Bottou

©2012 James Bergstra and Yoshua Bengio.

Abstract

Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success: they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.

Keywords: global optimization, model selection, neural networks, deep learning, response surface modeling

1. Introduction

The ultimate objective of a typical learning algorithm A is to find a function f that minimizes some expected loss L(x; f) over i.i.d. samples x from a natural (ground truth) distribution G_x. A learning algorithm A is a functional that maps a data set X^(train) (a finite set of samples from G_x) to a function f. Very often a learning algorithm produces f through the optimization of a training criterion with respect to a set of parameters θ. However, the learning algorithm itself often has bells and whistles called hyper-parameters λ, and the actual learning algorithm is the one obtained after choosing λ, which can be denoted A_λ, with f = A_λ(X^(train)) for a training set X^(train).
For example, with a Gaussian kernel SVM, one has to select a regularization penalty C for the training criterion (which controls the margin) and the bandwidth σ of the Gaussian kernel, that is, λ = (C, σ).

What we really need in practice is a way to choose λ so as to minimize generalization error E_{x∼G_x}[L(x; A_λ(X^(train)))]. Note that the computation performed by A itself often involves an inner optimization problem, which is usually iterative and approximate. The problem of identifying a good value for hyper-parameters λ is called the problem of hyper-parameter optimization. This paper takes a look at algorithms for this difficult outer-loop optimization problem, which is of great practical importance in empirical machine learning work:

$$\lambda^{(*)} = \operatorname*{argmin}_{\lambda \in \Lambda} \; \mathbb{E}_{x \sim G_x}\big[L\big(x;\, A_\lambda(X^{(\mathrm{train})})\big)\big]. \qquad (1)$$

In general, we do not have efficient algorithms for performing the optimization implied by Equation 1. Furthermore, we cannot even evaluate the expectation over the unknown natural distribution G_x, the value we wish to optimize. Nevertheless, we must carry out this optimization as best we can. With regards to the expectation over G_x, we will employ the widely used technique of cross-validation to estimate it. Cross-validation is the technique of replacing the expectation with a mean over a validation set X^(valid) whose elements are drawn i.i.d. x ∼ G_x. Cross-validation is unbiased as long as X^(valid) is independent of any data used by A_λ (see Bishop, 1995, pp. 32-33). We see in Equations 2-4 the hyper-parameter optimization problem as it is addressed in practice:

$$\lambda^{(*)} \approx \operatorname*{argmin}_{\lambda \in \Lambda} \; \underset{x \in X^{(\mathrm{valid})}}{\mathrm{mean}}\, L\big(x;\, A_\lambda(X^{(\mathrm{train})})\big) \qquad (2)$$
$$\equiv \operatorname*{argmin}_{\lambda \in \Lambda} \; \Psi(\lambda) \qquad (3)$$
$$\approx \operatorname*{argmin}_{\lambda \in \{\lambda^{(1)},\dots,\lambda^{(S)}\}} \; \Psi(\lambda) \;\equiv\; \hat{\lambda}. \qquad (4)$$

Equation 3 expresses the hyper-parameter optimization problem in terms of a hyper-parameter response function Ψ. Hyper-parameter optimization is the minimization of Ψ(λ) over λ ∈ Λ. This function is sometimes called the response surface in the experiment design literature. Different data sets, tasks, and learning algorithm families give rise to different sets Λ and functions Ψ. Knowing in general very little about the response surface Ψ or the search space Λ, the dominant strategy for finding a good λ is to choose some number (S) of trial points {λ^(1), ..., λ^(S)}, to evaluate Ψ(λ) for each one, and return the λ^(i) that worked best as λ̂. This strategy is made explicit by Equation 4.

The critical step in hyper-parameter optimization is to choose the set of trials {λ^(1), ..., λ^(S)}. The most widely used strategy is a combination of grid search and manual search (e.g., LeCun et al., 1998b; Larochelle et al., 2007; Hinton, 2010), as well as machine learning software packages such as libsvm (Chang and Lin, 2001) and scikits.learn.¹ If Λ is a set indexed by K configuration variables (e.g., for neural networks it would be the learning rate, the number of hidden units, the strength of weight regularization, etc.), then grid search requires that we choose a set of values for each variable (L^(1), ..., L^(K)). In grid search the set of trials is formed by assembling every possible combination of values, so the number of trials in a grid search is S = ∏_{k=1}^{K} |L^(k)| elements. This product over K sets makes grid search suffer from the curse of dimensionality because the number of joint values grows exponentially with the number of hyper-parameters (Bellman, 1961).

¹ scikits.learn: Machine Learning in Python can be found at http://scikit-learn.sourceforge.net.
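To make Equations 2-4 concrete, here is a small, runnable sketch (ours, not the authors' code) of hyper-parameter optimization as "assemble a trial set, evaluate Ψ on each trial, keep the best". A synthetic quadratic response surface stands in for the validation error, and the hyper-parameter names (log_lr, n_hidden) are illustrative only. It also shows why the size of a grid is the product of the per-axis value counts.

    # Sketch of Equations 2-4: choose a trial set, evaluate Psi, return the best trial.
    import itertools
    import math
    import random

    def psi(lam):
        """Toy stand-in for Psi(lambda), the mean validation loss of A_lambda."""
        return (lam["log_lr"] + 2.0) ** 2 + 0.1 * (math.log2(lam["n_hidden"]) - 7.0) ** 2

    def grid_trials(axes):
        """Grid search: the Cartesian product of per-axis value lists, so the number
        of trials S = prod_k |L^(k)| grows exponentially with the number of axes."""
        keys = sorted(axes)
        return [dict(zip(keys, vals)) for vals in itertools.product(*(axes[k] for k in keys))]

    def random_trials(space, S, rng):
        """Random search: S i.i.d. draws from a prior over the same configuration space."""
        return [{k: draw(rng) for k, draw in space.items()} for _ in range(S)]

    rng = random.Random(0)
    grid = grid_trials({"log_lr": [-4, -2, 0], "n_hidden": [32, 128, 512]})      # S = 3 * 3 = 9
    rand = random_trials({"log_lr": lambda r: r.uniform(-4, 0),
                          "n_hidden": lambda r: int(2 ** r.uniform(5, 10))}, 9, rng)
    for name, trials in [("grid", grid), ("random", rand)]:
        best = min(trials, key=psi)        # Equation 4: keep the best trial of the set
        print(name, best, round(psi(best), 3))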
Manual search is used to identify regions in Λ that are promising and to develop the intuition necessary to choose the sets L^(k). A major drawback of manual search is the difficulty in reproducing results. This is important both for the progress of scientific research in machine learning as well as for ease of application of learning algorithms by non-expert users. On the other hand, grid search alone does very poorly in practice (as discussed here). We propose random search as a substitute and baseline that is both reasonably efficient (roughly equivalent to or better than combining manual search and grid search, in our experiments) and keeps the advantages of implementation simplicity and reproducibility of pure grid search. Random search is actually more practical than grid search because it can be applied even when using a cluster of computers that can fail, and allows the experimenter to change the "resolution" on the fly: adding new trials to the set or ignoring failed trials are both feasible because the trials are i.i.d., which is not the case for a grid search. Of course, random search can probably be improved by automating what manual search does, i.e., a sequential optimization, but this is left to future work.

There are several reasons why manual search and grid search prevail as the state of the art despite decades of research into global optimization (e.g., Nelder and Mead, 1965; Kirkpatrick et al., 1983; Powell, 1994; Weise, 2009) and the publishing of several hyper-parameter optimization algorithms (e.g., Nareyek, 2003; Czogiel et al., 2005; Hutter, 2009):

- Manual optimization gives researchers some degree of insight into Ψ;
- There is no technical overhead or barrier to manual optimization;
- Grid search is simple to implement and parallelization is trivial;
- Grid search (with access to a compute cluster) typically finds a better λ̂ than purely manual sequential optimization (in the same amount of time);
- Grid search is reliable in low dimensional spaces (e.g., 1-d, 2-d).

We will come back to the use of global optimization algorithms for hyper-parameter selection in our discussion of future work (Section 6). In this paper, we focus on random search, that is, independent draws from a uniform density from the same configuration space as would be spanned by a regular grid, as an alternative strategy for producing a trial set {λ^(1), ..., λ^(S)}. We show that random search has all the practical advantages of grid search (conceptual simplicity, ease of implementation, trivial parallelism) and trades a small reduction in efficiency in low-dimensional spaces for a large improvement in efficiency in high-dimensional search spaces.

In this work we show that random search is more efficient than grid search in high-dimensional spaces because functions Ψ of interest have a low effective dimensionality; essentially, Ψ of interest are more sensitive to changes in some dimensions than others (Caflisch et al., 1997). In particular, if a function f of two variables could be approximated by another function of one variable (f(x1, x2) ≈ g(x1)), we could say that f has a low effective dimension. Figure 1 illustrates how point grids and uniformly random point sets differ in how they cope with low effective dimensionality, as in the above example with f. A grid of points gives even coverage in the original 2-d space, but projections onto either the x1 or x2 subspace produce an inefficient coverage of the subspace. In contrast, random points are slightly less evenly distributed in the original space, but far more evenly distributed in the subspaces.

If the researcher could know ahead of time which subspaces would be important, then he or she could design an appropriate grid. However, we show the failings of this strategy in Section 2.
Figure 1: Grid and random search of nine trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x) with low effective dimensionality. Above each square g(x) is shown in green, and left of each square h(y) is shown in yellow. With grid search, nine trials only test g(x) in three distinct places. With random search, all nine trials explore distinct values of g. This failure of grid search is the rule rather than the exception in high dimensional hyper-parameter optimization.

For a given learning algorithm, looking at several relatively similar data sets (from different distributions) reveals that on different data sets, different subspaces are important, and to different degrees. A grid with sufficient granularity to optimize hyper-parameters for all data sets must consequently be inefficient for each individual data set because of the curse of dimensionality: the number of wasted grid search trials is exponential in the number of search dimensions that turn out to be irrelevant for a particular data set. In contrast, random search thrives on low effective dimensionality. Random search has the same efficiency in the relevant subspace as if it had been used to search only the relevant dimensions.

This paper is organized as follows. Section 2 looks at the efficiency of random search in practice vs. grid search as a method for optimizing neural network hyper-parameters. We take the grid search experiments of Larochelle et al. (2007) as a point of comparison, and repeat similar experiments using random search. Section 3 uses Gaussian process regression (GPR) to analyze the results of the neural network trials. The GPR lets us characterize what Ψ looks like for various data sets, and establish an empirical link between the low effective dimensionality of Ψ and the efficiency of random search. Section 4 compares random search and grid search with more sophisticated point sets developed for Quasi Monte-Carlo numerical integration, and argues that in the regime of interest for hyper-parameter selection grid search is inappropriate and more sophisticated methods bring little advantage over random search. Section 5 compares random search with the expert-guided manual sequential optimization employed in Larochelle et al. (2007) to optimize Deep Belief Networks. Section 6 comments on the role of global optimization algorithms in future work. We conclude in Section 7 that random search is generally superior to grid search for optimizing hyper-parameters.
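The coverage argument of Figure 1 can be checked numerically. The following sketch (illustrative only; not from the paper) counts how many distinct values of the important dimension x1 are tested by a 3x3 grid versus nine i.i.d. uniform points.

    # When only x1 matters, a 3x3 grid tests it at 3 distinct values; 9 random points test 9.
    import itertools
    import random

    rng = random.Random(42)
    grid = list(itertools.product([0.1667, 0.5, 0.8333], repeat=2))   # 9 grid points in [0,1]^2
    rand = [(rng.random(), rng.random()) for _ in range(9)]           # 9 i.i.d. uniform points

    def distinct_x1(points, tol=1e-6):
        """Number of clearly distinct values tested along the important dimension x1."""
        xs = sorted(p[0] for p in points)
        return 1 + sum(1 for a, b in zip(xs, xs[1:]) if b - a > tol)

    print("grid   tests x1 at", distinct_x1(grid), "distinct values")   # -> 3
    print("random tests x1 at", distinct_x1(rand), "distinct values")   # -> 9 (almost surely)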
2. Random vs. Grid for Optimizing Neural Networks

In this section we take a second look at several of the experiments of Larochelle et al. (2007) using random search, to compare with the grid searches done in that work. We begin with a look at hyper-parameter optimization in neural networks, and then move on to hyper-parameter optimization in Deep Belief Networks (DBNs). To characterize the efficiency of random search, we present two techniques in preliminary sections: Section 2.1 explains how we estimate the generalization performance of the best model from a set of candidates, taking into account our uncertainty in which model is actually best; Section 2.2 explains the random experiment efficiency curve that we use to characterize the performance of random search experiments. With these preliminaries out of the way, Section 2.3 describes the data sets from Larochelle et al. (2007) that we use in our work. Section 2.4 presents our results optimizing neural networks, and Section 5 presents our results optimizing DBNs.

2.1 Estimating Generalization

Because of finite data sets, test error is not monotone in validation error, and depending on the set of particular hyper-parameter values λ evaluated, the test error of the best-validation-error configuration may vary. When reporting performance of learning algorithms, it can be useful to take into account the uncertainty due to the choice of hyper-parameter values. This section describes our procedure for estimating test set accuracy, which takes into account any uncertainty in the choice of which trial is actually the best-performing one. To explain this procedure, we must distinguish between estimates of performance Ψ^(valid) = Ψ and Ψ^(test) based on the validation and test sets respectively:

$$\Psi^{(\mathrm{valid})}(\lambda) = \underset{x \in X^{(\mathrm{valid})}}{\mathrm{mean}}\, L\big(x;\, A_\lambda(X^{(\mathrm{train})})\big), \qquad \Psi^{(\mathrm{test})}(\lambda) = \underset{x \in X^{(\mathrm{test})}}{\mathrm{mean}}\, L\big(x;\, A_\lambda(X^{(\mathrm{train})})\big).$$

Likewise, we must define the estimated variance V about these means on the validation and test sets, for example, for the zero-one loss (Bernoulli variance):

$$V^{(\mathrm{valid})}(\lambda) = \frac{\Psi^{(\mathrm{valid})}(\lambda)\,\big(1 - \Psi^{(\mathrm{valid})}(\lambda)\big)}{|X^{(\mathrm{valid})}| - 1}, \qquad V^{(\mathrm{test})}(\lambda) = \frac{\Psi^{(\mathrm{test})}(\lambda)\,\big(1 - \Psi^{(\mathrm{test})}(\lambda)\big)}{|X^{(\mathrm{test})}| - 1}.$$

With other loss functions the estimator of variance will generally be different.

The standard practice for evaluating a model found by cross-validation is to report Ψ^(test)(λ^(s)) for the λ^(s) that minimizes Ψ^(valid)(λ^(s)). However, when different trials have nearly optimal validation means, then it is not clear which test score to report, and a slightly different choice of λ could have yielded a different test error. To resolve the difficulty of choosing a winner, we report a weighted average of all the test set scores, in which each one is weighted by the probability that its particular λ^(s) is in fact the best. In this view, the uncertainty arising from X^(valid) being a finite sample of G_x makes the test-set score of the best model among λ^(1), ..., λ^(S) a random variable, z. This score z is modeled by a Gaussian mixture model whose S components have means μ_s = Ψ^(test)(λ^(s)), variances σ_s² = V^(test)(λ^(s)), and weights w_s defined by

$$w_s = P\big(Z^{(s)} < Z^{(s')},\ \forall s' \neq s\big), \qquad \text{where } Z^{(i)} \sim \mathcal{N}\big(\Psi^{(\mathrm{valid})}(\lambda^{(i)}),\, V^{(\mathrm{valid})}(\lambda^{(i)})\big).$$

To summarize, the performance z of the best model in an experiment of S trials has mean μ_z and standard error σ_z, with

$$\mu_z = \sum_{s=1}^{S} w_s \mu_s, \qquad (5)$$
$$\sigma_z^2 = \sum_{s=1}^{S} w_s \big(\mu_s^2 + \sigma_s^2\big) - \mu_z^2. \qquad (6)$$

It is simple and practical to estimate the weights w_s by simulation. The procedure for doing so is to repeatedly draw hypothetical validation scores Z^(s) from Normal distributions whose means are the Ψ^(valid)(λ^(s)) and whose variances are the squared standard errors V^(valid)(λ^(s)), and to count how often each trial generates a winning score. Since the test scores of the best validation scores are typically relatively close, w_s need not be estimated very precisely and a few tens of hypothetical draws suffice.

In expectation, this technique for estimating generalization gives a higher estimate than the traditional technique of reporting the test set error of the best model in validation. The difference is related to the variance of Ψ^(valid) and the density of validation set scores Ψ(λ^(i)) near the best value. To the extent that Ψ^(valid) casts doubt on which model was best, this technique averages the performance of the best model together with the performance of models which were not the best. The next section (Random Experiment Efficiency Curve) illustrates this phenomenon and discusses it in more detail.
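Under our reading of Section 2.1, the procedure can be sketched as follows (this is not the authors' code, and the validation/test means and variances below are made-up numbers): draw hypothetical validation scores, count wins to estimate w_s, then combine test scores with Equations 5 and 6.

    # Estimate the weights w_s by simulation and combine test scores as in Eqs. 5-6.
    import random

    def best_model_score(valid_mean, valid_var, test_mean, test_var, n_draws=100, seed=0):
        """Return (mu_z, sigma_z) for the test score of the best-validation model,
        weighting each trial by the probability that it is truly the best."""
        rng = random.Random(seed)
        S = len(valid_mean)
        wins = [0] * S
        for _ in range(n_draws):
            # Draw hypothetical validation scores Z(s) ~ N(valid_mean[s], valid_var[s]).
            z = [rng.gauss(valid_mean[s], valid_var[s] ** 0.5) for s in range(S)]
            wins[z.index(min(z))] += 1        # lowest validation error "wins" this draw
        w = [c / n_draws for c in wins]       # w_s: estimated probability trial s is best
        mu_z = sum(w[s] * test_mean[s] for s in range(S))                            # Eq. 5
        var_z = sum(w[s] * (test_mean[s] ** 2 + test_var[s]) for s in range(S)) - mu_z ** 2  # Eq. 6
        return mu_z, max(var_z, 0.0) ** 0.5

    # Three fictitious trials with nearly tied validation errors but different test errors.
    mu, sigma = best_model_score(valid_mean=[0.230, 0.232, 0.260],
                                 valid_var=[0.0004, 0.0004, 0.0004],
                                 test_mean=[0.235, 0.228, 0.255],
                                 test_var=[0.0002, 0.0002, 0.0002])
    print(round(mu, 4), round(sigma, 4))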
2.2 Random Experiment Efficiency Curve

Figure 2 illustrates the results of a random experiment: an experiment of 256 trials training neural networks to classify the rectangles data set. Since the trials of a random experiment are independently identically distributed (i.i.d.), a random search experiment involving S i.i.d. trials can also be interpreted as N independent experiments of s trials, as long as sN ≤ S. This interpretation allows us to estimate statistics such as the minimum, maximum, median, and quantiles of any random experiment of size s, where s is a divisor of S.

There are two general trends in random experiment efficiency curves, such as the one in Figure 2: a sharp upward slope of the lower extremes as experiments grow, and a gentle downward slope of the upper extremes. The sharp upward slope occurs because when we take the maximum over larger subsets of the S trials, trials with poor performance are rarely the best within their subset. It is natural that larger experiments find trials with better scores. The shape of this curve indicates the frequency of good models under random search, and quantifies the relative volumes (in search space) of the various levels of performance.

The gentle downward slope occurs because as we take the maximum over larger subsets of trials (in Equation 6), we are less sure about which trial is actually the best. Large experiments average together good validation trials with unusually high test scores with other good validation trials with unusually low test scores, to arrive at a more accurate estimate of generalization.
Figure 2: A random experiment efficiency curve. The trials of a random experiment are i.i.d., so an experiment of many trials (here, 256 trials optimizing a neural network to classify the rectangles basic data set, Section 2.3) can be interpreted as several independent smaller experiments. For example, at horizontal axis position 8, we consider our 256 trials to be 32 experiments of 8 trials each. The vertical axis shows the test accuracy of the best trial(s) from experiments of a given size, as determined by Equation 5. When there are sufficiently many experiments of a given size (i.e., 10 or more), the distribution of performance is illustrated by a box plot whose boxed section spans the lower and upper quartiles and includes a line at the median. The whiskers above and below each boxed section show the position of the most extreme data point within 1.5 times the inter-quartile range of the nearest quartile. Data points beyond the whiskers are plotted with '+' symbols. When there are not enough experiments to support a box plot, as occurs here for experiments of 32 trials or more, the best generalization score of each experiment is shown by a scatter plot. The two thin black lines across the top of the figure mark the upper and lower boundaries of a 95% confidence interval on the generalization of the best trial overall (Equation 6).

For example, consider what Figure 2 would look like if the experiment had included a lucky trial whose validation score were around 77% as usual, but whose test score were 80%. In the bar plot for trials of size 1, we would see the top performer scoring 80%. In larger experiments, we would average that 80% performance together with other test set performances because 77% is not clearly the best validation score; this averaging would make the upper envelope of the efficiency curve slope downward from 80% to a point very close to the current test set estimate of 76%.

Figure 2 characterizes the range of performance that is to be expected from experiments of various sizes, which is valuable information to anyone trying to reproduce these results. For example, if we try to repeat the experiment and our first four random trials fail to find a score better than 70%, then the problem is likely not in hyper-parameter selection.
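A rough sketch of how such a curve can be tabulated, under our reading of Section 2.2 (not the released analysis code, and with synthetic scores): the S i.i.d. trials are partitioned into S/s experiments of size s and, for brevity, each experiment here reports the test accuracy of its single best-validation trial rather than the weighted average of Section 2.1.

    # Reinterpret 256 i.i.d. trials as S/s independent experiments of size s.
    import random

    rng = random.Random(1)
    S = 256
    trials = []                                 # (validation_accuracy, test_accuracy) pairs
    for _ in range(S):
        v = rng.uniform(0.5, 0.78)              # fictitious validation accuracy of one trial
        trials.append((v, v + rng.gauss(0.0, 0.01)))

    for s in [1, 2, 4, 8, 16, 32, 64, 128]:
        rng.shuffle(trials)                                   # i.i.d., so any partition is valid
        groups = [trials[i:i + s] for i in range(0, S, s)]    # S/s experiments of s trials each
        best_test = [max(g, key=lambda t: t[0])[1] for g in groups]  # test acc. of best-valid trial
        best_test.sort()
        print(f"size {s:3d}: median of best = {best_test[len(best_test) // 2]:.3f}, "
              f"min = {best_test[0]:.3f}, max = {best_test[-1]:.3f}")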
Figure 3: From top to bottom, samples from the mnist rotated, mnist background random, mnist background images, and mnist rotated background images data sets. In all data sets the task is to identify the digit (0-9) and ignore the various distracting factors of variation.

2.3 Data Sets

Following the work of Larochelle et al. (2007) and Vincent et al. (2008), we use a variety of classification data sets that include many factors of variation.²

The mnist basic data set is a subset of the well-known MNIST handwritten digit data set (LeCun et al., 1998a). This data set has 28x28 pixel grey-scale images of digits, each belonging to one of ten classes. We chose a different train/test/validation splitting in order to have faster experiments and see learning performance differences more clearly. We shuffled the original splits randomly, and used 10000 training examples, 2000 validation examples, and 50000 testing examples. These images are presented as white (1.0-valued) foreground digits against a black (0.0-valued) background.

The mnist background images data set is a variation on mnist basic in which the white foreground digit has been composited on top of a 28x28 natural image patch. Technically this was done by taking the maximum of the original MNIST image and the patch. Natural image patches with very low pixel variance were rejected. As with mnist basic there are 10 classes, 10000 training examples, 2000 validation examples, and 50000 test examples.

The mnist background random data set is a similar variation on mnist basic in which the white foreground digit has been composited on top of random uniform (0,1) pixel values. As with mnist basic there are 10 classes, 10000 training examples, 2000 validation examples, and 50000 test examples.

The mnist rotated data set is a variation on mnist basic in which the images have been rotated by an amount chosen randomly between 0 and 2π radians. This data set included 10000 training examples, 2000 validation examples, and 50000 test examples.

² Data sets can be found at http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007.
Figure 4: Top: Samples from the rectangles data set. Middle: Samples from the rectangles images data set. Bottom: Samples from the convex data set. In the rectangles data sets, the image is formed by overlaying a small rectangle on a background. The task is to label the small rectangle as being either tall or wide. In convex, the task is to identify whether the set of white pixels is convex (images 1 and 4) or not convex (images 2 and 3).

The mnist rotated background images data set is a variation on mnist rotated in which the images have been rotated by an amount chosen randomly between 0 and 2π radians, and then subsequently composited onto natural image patch backgrounds. This data set included 10000 training examples, 2000 validation examples, and 50000 test examples.

The rectangles data set (Figure 4, top) is a simple synthetic data set of outlines of rectangles. The images are 28x28, the outlines are white (1-valued) and the backgrounds are black (0-valued). The height and width of the rectangles were sampled uniformly, but when their difference was smaller than 3 pixels the samples were rejected. The top left corner of the rectangles was also sampled uniformly, with the constraint that the whole rectangle fits in the image. Each image is labelled as one of two classes: tall or wide. This task was easier than the MNIST digit classification, so we only used 1000 training examples and 200 validation examples, but we still used 50000 testing examples.

The rectangles images data set (Figure 4, middle) is a variation on rectangles in which the foreground rectangles were filled with one natural image patch, and composited on top of a different background natural image patch. The process for sampling rectangle shapes was similar to the one used for rectangles, except a) the area covered by the rectangles was constrained to be between 25% and 75% of the total image, b) the length and width of the rectangles were forced to be of at least 10 pixels, and c) their difference was forced to be of at least 5 pixels. This task was harder than rectangles, so we used 10000 training examples, 2000 validation examples, and 50000 testing examples.

The convex data set (Figure 4, bottom) is a binary image classification task. Each 28x28 image consists entirely of 1-valued and 0-valued pixels. If the 1-valued pixels form a convex region in image space, then the image is labelled as being convex, otherwise it is labelled as non-convex. The convex sets consist of a single convex region with pixels of value 1.0. Candidate convex images were constructed by taking the intersection of a number of half-planes whose location and orientation were chosen uniformly at random.
The number of intersecting half-planes was also sampled randomly according to a geometric distribution with parameter 0.195. A candidate convex image was rejected if there were less than 19 pixels in the convex region. Candidate non-convex images were constructed by taking the union of a random number of convex sets generated as above, but with the number of half-planes sampled from a geometric distribution with parameter 0.07 and with a minimum number of 10 pixels. The number of convex sets was sampled uniformly from 2 to 4. The candidate non-convex images were then tested by checking a convexity condition for every pair of pixels in the non-convex set. Those sets that failed the convexity test were added to the data set. The parameters for generating the convex and non-convex sets were balanced to ensure that the conditional overall pixel mean is the same for both classes.

2.4 Case Study: Neural Networks

In Larochelle et al. (2007), the hyper-parameters of the neural network were optimized by search over a grid of trials. We describe the hyper-parameter configuration space of our neural network learning algorithm in terms of the distribution that we will use to randomly sample from that configuration space. The first hyper-parameter in our configuration is the type of data preprocessing: with equal probability, one of (a) none, (b) normalize (center each feature dimension and divide by its standard deviation), or (c) PCA (after removing dimension-wise means, examples are projected onto principal components of the data whose norms have been divided by their eigenvalues). Part of PCA preprocessing is choosing how many components to keep. We choose a fraction of variance to keep with a uniform distribution between 0.5 and 1.0.

There have been several suggestions for how the random weights of a neural network should be initialized (we will look at unsupervised learning pretraining algorithms later in Section 5). We experimented with two distributions and two scaling heuristics. The possible distributions were (a) uniform on (−1, 1), and (b) unit normal. The two scaling heuristics were (a) a hyper-parameter multiplier between 0.1 and 10.0 divided by the square root of the number of inputs (LeCun et al., 1998b), and (b) the square root of 6 divided by the square root of the number of inputs plus hidden units (Bengio and Glorot, 2010). The weights themselves were chosen using one of three random seeds to the Mersenne Twister pseudo-random number generator. In the case of the first heuristic, we chose a multiplier uniformly from the range (0.2, 2.0). The number of hidden units was drawn geometrically³ from 18 to 1024. We selected either a sigmoidal or tanh nonlinearity with equal probability. The output weights from hidden units to prediction units were initialized to zero.

The cost function was the mean error over minibatches of either 20 or 100 (with equal probability) examples at a time: in expectation these give the same gradient directions, but with more or less variance. The optimization algorithm was stochastic gradient descent with [initial] learning rate ε₀ drawn geometrically from 0.001 to 10.0. We offered the possibility of an annealed learning rate via a time point t₀ drawn geometrically from 300 to 30000. The effective learning rate ε_t after t minibatch iterations was

$$\varepsilon_t = \frac{\varepsilon_0\, t_0}{\max(t,\, t_0)}. \qquad (7)$$

We permitted a minimum of 100 and a maximum of 1000 iterations over the training data, stopping if ever, at iteration t, the best validation performance was observed before iteration t/2. With 50% probability, an ℓ2 regularization penalty was applied, whose strength was drawn exponentially from 3.1 × 10⁻⁷ to 3.1 × 10⁻⁵.

³ We will use the phrase drawn geometrically from A to B for 0 < A < B to mean drawing uniformly in the log domain between log(A) and log(B), exponentiating to get a number between A and B, and then rounding to the nearest integer. The phrase drawn exponentially means the same thing but without rounding.
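A hedged sketch of drawing one configuration from the distribution just described (our paraphrase, not the authors' released sampler; the ranges follow the text, and a few details such as the initialization multiplier are omitted). It includes the "drawn geometrically" helper of footnote 3 and the learning-rate schedule of Equation 7.

    # Sample one neural-network configuration from a distribution like the one above.
    import math
    import random

    def drawn_geometrically(rng, a, b):
        """Uniform in the log domain between log(a) and log(b), then rounded to an int."""
        return int(round(math.exp(rng.uniform(math.log(a), math.log(b)))))

    def drawn_exponentially(rng, a, b):
        """Same as above but without rounding."""
        return math.exp(rng.uniform(math.log(a), math.log(b)))

    def sample_nnet_config(rng):
        preprocessing = rng.choice(["none", "normalize", "pca"])
        return {
            "preprocessing": preprocessing,
            "pca_variance_kept": rng.uniform(0.5, 1.0) if preprocessing == "pca" else None,
            "init_dist": rng.choice(["uniform", "normal"]),
            "init_scaling": rng.choice(["lecun", "glorot"]),
            "seed": rng.choice([1, 2, 3]),
            "n_hidden": drawn_geometrically(rng, 18, 1024),
            "nonlinearity": rng.choice(["sigmoid", "tanh"]),
            "batch_size": rng.choice([20, 100]),
            "lr0": drawn_exponentially(rng, 0.001, 10.0),
            "anneal_start": drawn_geometrically(rng, 300, 30000),
            "l2": drawn_exponentially(rng, 3.1e-7, 3.1e-5) if rng.random() < 0.5 else 0.0,
        }

    def effective_lr(lr0, anneal_start, t):
        """Equation 7: epsilon_t = epsilon_0 * t0 / max(t, t0)."""
        return lr0 * anneal_start / max(t, anneal_start)

    print(sample_nnet_config(random.Random(0)))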
This sampling process covers roughly the same domain with the same density as the grid used in Larochelle et al. (2007), except for the optional preprocessing steps. The grid optimization of Larochelle et al. (2007) did not consider normalizing or keeping only leading PCA dimensions of the inputs; we compare to random sampling with and without these restrictions.⁴

We formed experiments for each data set by drawing S = 256 trials from this distribution. The results of these experiments are illustrated in Figures 5 and 6. Random sampling of trials is surprisingly effective in these settings. Figure 5 shows that even among the fraction of jobs (71/256) that used no preprocessing, the random search with 8 trials is better than the grid search employed in Larochelle et al. (2007).

Typically, the extent of a grid search is determined by a computational budget. Figure 6 shows what is possible if we use random search in a larger space that requires more trials to explore. The larger search space includes the possibility of normalizing the input or applying PCA preprocessing. In the larger space, 32 trials were necessary to consistently outperform grid search rather than 8, indicating that there are many harmful ways to preprocess the data. However, when we allowed larger experiments of 64 trials or more, random search found superior results to those found more quickly within the more restricted search. This tradeoff between exploration and exploitation is central to the design of an effective random search.

The efficiency curves in Figures 5 and 6 reveal that different data sets give rise to functions Ψ with different shapes. The mnist basic results converge very rapidly toward what appears to be a global maximum. The fact that experiments of just 4 or 8 trials often have the same maximum as much larger experiments indicates that the region of Λ that gives rise to the best performance is approximately a quarter or an eighth respectively of the entire configuration space. Assuming that the random search has not missed a tiny region of significantly better performance, we can say that random search has solved this problem in 4 or 8 guesses. It is hard to imagine any optimization algorithm doing much better on a non-trivial 7-dimensional function. In contrast, the mnist rotated background images and convex curves show that even with 16 or 32 random trials, there is considerable variation in the generalization of the reportedly best model. This indicates that the Ψ function in these cases is more peaked, with small regions of good performance.

3. The Low Effective Dimension of Ψ

Section 2 showed that random sampling is more efficient than grid sampling for optimizing functions Ψ corresponding to several neural network families and classification tasks. In this section we show that indeed Ψ has a low effective dimension, which explains why randomly sampled trials found better values. One simple way to characterize the shape of a high-dimensional function is to look at how much it varies in each dimension. Gaussian process regression gives us the statistical machinery to look at Ψ and measure its effective dimensionality (Neal, 1998; Rasmussen and Williams, 2006).

We estimated the sensitivity of Ψ to each hyper-parameter by fitting a Gaussian process (GP) with squared exponential kernels to predict Ψ(λ) from λ. The squared exponential kernel (or Gaussian kernel) measures similarity between two real-valued hyper-parameter values a and b by exp(−(a − b)²/l²). The positive-valued length scale l governs the sensitivity of the GP to change in this hyper-parameter.

⁴ Source code for the simulations is available at https://github.com/jaberg/hyperopt.
Figure 5: Neural network performance without preprocessing. Random experiment efficiency curves of a single-layer neural network for eight of the data sets used in Larochelle et al. (2007), looking only at trials with no preprocessing (7 hyper-parameters to optimize). The vertical axis is test-set accuracy of the best model by cross-validation, the horizontal axis is the experiment size (the number of models compared in cross-validation). The dashed blue line represents grid search accuracy for neural network models based on a selection by grids averaging 100 trials (Larochelle et al., 2007). Random searches of 8 trials match or outperform grid searches of (on average) 100 trials. (Panel titles: mnist basic, mnist background images, mnist background random, mnist rotated, mnist rotated background images, convex, rectangles, rectangles images.)

The kernels defined for each hyper-parameter were combined by multiplication (joint Gaussian kernel). We fit a GP to samples of Ψ by finding the length scale (l) for each hyper-parameter that maximized the marginal likelihood. To ensure relevance could be compared between hyper-parameters, we shifted and scaled each one to the unit interval. For hyper-parameters that were drawn geometrically or exponentially (e.g., learning rate, number of hidden units), kernel calculations were based on the logarithm of the effective value.

Figure 6: Neural network performance when standard preprocessing algorithms are considered (9 hyper-parameters). The dashed blue line represents grid search accuracy using (on average) 100 trials (Larochelle et al., 2007), in which no preprocessing was done. Often the extent of a search is determined by a computational budget, and with random search 64 trials are enough to find better models in a larger, less promising space. Exploring just four PCA variance levels by grid search would have required 5 times as many (average 500) trials per data set. (The panels correspond to the same eight data sets as in Figure 5.)

Figure 7 shows the relevance of each component of Λ in modelling Ψ(λ). Finding the length scales that maximize marginal likelihood is not a convex problem and many local minima exist. To get a sense of what length scales were supported by the data, we fit each set of samples from Ψ 50 times, resampling different subsets of 80% of the observations every time, and reinitializing the length scale estimates randomly between 0.1 and 2. Figure 7 reveals two important properties of Ψ for neural networks that suggest why grid search performs so poorly relative to random experiments:

1. a small fraction of hyper-parameters matter for any one data set, but
2. different hyper-parameters matter on different data sets.
Figure 7: Automatic Relevance Determination (ARD) applied to hyper-parameters of neural network experiments (with raw preprocessing). For each data set, a small number of hyper-parameters dominate performance, but the relative importance of each hyper-parameter varies from each data set to the next. Section 2.4 describes the seven hyper-parameters in each panel. Box plots are obtained by randomizing the subset of data used to fit the length scales, and randomizing the length scale initialization. (Best viewed in color.)

Even in this simple 7-d problem, Ψ has a much lower effective dimension of between 1 and 4, depending on the data set. It would be impossible to cover just these few dimensions with a reliable grid however, because different data sets call for grids on different dimensions. The learning rate is always important, but sometimes the learning rate annealing rate was important (rectangles images), sometimes the ℓ2 penalty was important (convex, mnist rotated), sometimes the number of hidden units was important (rectangles), and so on. While random search optimized these Ψ functions with 8 to 16 trials, a grid with, say, four values in each of these axes would already require 256 trials, and yet provide no guarantee that Ψ for a new data set would be well optimized.

Figure 7 also allows us to establish a correlation between effective dimensionality and ease of optimization. The data sets for which the effective dimensionality was lowest (1 or 2) were mnist basic, mnist background images, mnist background random, and rectangles images. Looking back at the corresponding efficiency curves (Figure 5) we find that these are also the data sets whose curves plateau most sharply, indicating that these functions are the easiest to optimize. They are often optimized reasonably well by just 2 random trials. Looking to Figure 7 at the data sets with largest effective dimensionality (3 or 4), we identify convex, mnist rotated, and rectangles. Looking at their efficiency curves in Figure 5 reveals that they consistently required at least 8 random trials. This correlation offers another piece of evidence that the effective dimensionality of Ψ is playing a strong role in determining the difficulty of hyper-parameter optimization.
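The ARD analysis can be approximated with off-the-shelf GP software. The sketch below uses scikit-learn's GaussianProcessRegressor with an anisotropic squared exponential kernel as a stand-in for the authors' GP code (which this is not), on a synthetic response surface that depends strongly on one dimension; short fitted length scales flag the relevant hyper-parameters. The paper's analysis additionally refit 50 times on resampled 80% subsets and worked with log-transformed values for geometrically drawn hyper-parameters.

    # ARD-style relevance estimation with a per-dimension squared exponential kernel.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(256, 3))          # 256 trials, 3 hyper-parameters scaled to [0, 1]
    # Synthetic Psi: strongly sensitive to dimension 0, barely sensitive to dimension 1.
    y = (X[:, 0] - 0.3) ** 2 + 0.01 * X[:, 1] + rng.normal(0.0, 0.01, size=256)

    kernel = RBF(length_scale=np.ones(3), length_scale_bounds=(0.1, 2.0))  # one scale per dim
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4, normalize_y=True,
                                  n_restarts_optimizer=5, random_state=0).fit(X, y)
    print("fitted length scales:", np.round(gp.kernel_.length_scale, 2))
    # Expect a short length scale for dimension 0 (relevant) and long ones for the others.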
4. Grid Search and Sets with Low Effective Dimensionality

It is an interesting mathematical challenge to choose a set of trials for sampling functions of unknown, but low effective dimensionality. We would like it to be true that no matter which dimensions turn out to be important, our trials sample the important dimensions evenly. Sets of points with this property are well studied in the literature of Quasi-Random methods for numerical integration, where they are known as low-discrepancy sets because they try to match (minimize discrepancy with) the uniform distribution. Although there are several formal definitions of low discrepancy, they all capture the intuition that the points should be roughly equidistant from one another, in order that there be no "clumps" or "holes" in the point set.

Several procedures for constructing low-discrepancy point sets in multiple dimensions also try to ensure as much as possible that subspace projections remain low-discrepancy sets in the subspace. For example, the Sobol (Antonov and Saleev, 1979), Halton (Halton, 1960), and Niederreiter (Bratley et al., 1992) sequences, as well as latin hypercube sampling (McKay et al., 1979), are all more or less deterministic schemes for getting point sets that are more representative of random uniform draws than actual random uniform draws. In Quasi Monte-Carlo integration, such point sets are shown to asymptotically minimize the variance of finite integrals faster than true random uniform samples, but in this section we will look at these point sets in the setting of relatively small sample sizes, to see if they can be used for more efficient search than random draws.

Rather than repeat the very computationally expensive experiments conducted in Section 2, we used an artificial simulation to compare the efficiency of grids, random draws, and the four low-discrepancy point sets mentioned in the previous paragraph. The artificial search problem was to find a uniformly randomly placed multi-dimensional target interval, which occupies 1% of the volume of the unit hyper-cube. We looked at four variants of the search problem, in which the target was

1. a cube in a 3-dimensional space,
2. a hyper-rectangle in a 3-dimensional space,
3. a hyper-cube in a 5-dimensional space,
4. a hyper-rectangle in a 5-dimensional space.

The shape of the target rectangle in variants (2) and (4) was determined by sampling side lengths uniformly from the unit interval, and then scaling the rectangle to have a volume of 1%. This process gave the rectangles a shape that was often wide or tall, much longer along some axes than others. The position of the target was drawn uniformly among the positions totally inside the unit hyper-cube. In the case of tall or wide targets (2) and (4), the indicator function [of the target] had a lower effective dimension than the dimensionality of the overall space because the dimensions in which the target is elongated can be almost ignored.

The simulation experiment began with the generation of 100 random search problems. Then for each experiment design method (random, Sobol, latin hypercube, grid) we created experiments of 1, 2, 3, and so on up to 512 trials.⁵ The Sobol, Niederreiter, and Halton sequences yielded similar results, so we used the Sobol sequence to represent the performance of these low-discrepancy set construction methods. There are many possible grid experiments of any size in multiple dimensions (at least for non-prime experiment sizes). We did not test every possible grid; instead we tested every grid with a monotonic resolution. For example, for experiments of size 16 in 5 dimensions we tried the five grids with resolutions (1,1,1,1,16), (1,1,1,2,8), (1,1,2,2,4), (1,1,1,4,4), and (1,2,2,2,2); for experiments of some prime size P in 3 dimensions we tried one grid with resolution (1,1,P). Since the target intervals were generated in such a way that rectangles identical up to a permutation of side lengths have equal probability, grids with monotonic resolution are representative of all grids.

The score of an experiment design method for each experiment size was the fraction of the 100 targets that it found. To characterize the performance of random search, we used the analytic form of the expectation. The expected probability of finding the target is 1.0 minus the probability of missing the target with every single one of the T trials in the experiment. If the volume of the target relative to the unit hypercube is v/V = 0.01 and there are T trials, then this probability of finding the target is

$$1 - \Big(1 - \frac{v}{V}\Big)^{T} = 1 - 0.99^{T}.$$

Figure 8 illustrates the efficiency of each kind of point set at finding the multidimensional intervals. There were some grids that were best at finding cubes and hyper-cubes in 3-d and 5-d, but most grids were the worst performers. No grid was competitive with the other methods at finding the rectangular-shaped intervals, which had low effective dimension (cases 2 and 4; Figure 8, right panels). Latin hypercubes, commonly used to initialize experiments in Bayesian optimization, were no more efficient than the expected performance of random search. Interestingly, the Sobol sequence was consistently best by a few percentage points. The low-discrepancy property that makes the Sobol useful in integration helps here, where it has the effect of minimizing the size of holes where the target might pass undetected. The advantage of the Sobol sequence is most pronounced in experiments of 100-300 trials, where there are sufficiently many trials for the structure in the Sobol to depart significantly from i.i.d. points, but not sufficiently many trials for random search to succeed with high probability.

⁵ Samples from the Sobol sequence were provided by the GNU Scientific Library (M. Galassi et al., 2009).
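The flavour of this simulation is easy to reproduce. The sketch below is ours (the original used Sobol points from the GNU Scientific Library): it generates elongated boxes of roughly 1% volume in the unit 5-cube, as in variant (4), and compares how often a 3x3x3x3x3 grid and an equal number of pseudo-random points contain the box, alongside the analytic expectation 1 − 0.99^T for random search.

    # Compare grid vs. random search at locating a ~1%-volume axis-aligned box in [0,1]^5.
    import itertools
    import math
    import random

    rng = random.Random(0)
    DIM, TRIALS, PROBLEMS = 5, 243, 200        # 243 = 3^5, so a 3-per-axis grid fits exactly

    def random_box(rng, dim, volume=0.01):
        """An elongated axis-aligned box of roughly `volume`, placed uniformly at random."""
        sides = [rng.uniform(0.05, 1.0) for _ in range(dim)]
        scale = (volume / math.prod(sides)) ** (1.0 / dim)
        sides = [min(s * scale, 1.0) for s in sides]
        lows = [rng.uniform(0.0, 1.0 - s) for s in sides]
        return list(zip(lows, [lo + s for lo, s in zip(lows, sides)]))

    def hit(points, box):
        """True if any point falls inside the box on every dimension."""
        return any(all(lo <= x <= hi for x, (lo, hi) in zip(p, box)) for p in points)

    grid = list(itertools.product([1/6, 3/6, 5/6], repeat=DIM))     # 3 levels per axis
    grid_hits = rand_hits = 0
    for _ in range(PROBLEMS):
        box = random_box(rng, DIM)
        grid_hits += hit(grid, box)
        rand_hits += hit([[rng.random() for _ in range(DIM)] for _ in range(TRIALS)], box)
    print("grid   finds the box in", grid_hits, "/", PROBLEMS, "problems")
    print("random finds the box in", rand_hits, "/", PROBLEMS, "problems")
    print("analytic expectation for random search:", round(1 - 0.99 ** TRIALS, 3))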
Figure 8: The efficiency in simulation of low-discrepancy sequences relative to grid and pseudo-random experiments. The simulation tested how reliably various experiment design methods locate a multidimensional interval occupying 1% of a unit hyper-cube. There is one grey dot in each sub-plot for every grid of every experiment size that has at least two ticks in each dimension. The black dots indicate near-perfect grids whose finest and coarsest dimensional resolutions differ by either 0 or 1. Hyper-parameter search is most typically like the bottom-right scenario. Grid search experiments are inefficient for finding axis-aligned elongated regions in high dimensions (i.e., bottom-right). Pseudo-random samples are as efficient as latin hypercube samples, and slightly less efficient than the Sobol sequence.

A thought experiment gives some intuition for why grid search fails in the case of rectangles. Long thin rectangles tend to intersect with several points if they intersect with any, reducing the effective sample size of the search. If the rectangles had been rotated away from the axes used to build the grid, then depending on the angle the efficiency of grid could approach the efficiency of random or low-discrepancy trials. More generally, if the target manifold were not systematically aligned with subsets of trial points, then grid search would be as efficient as the random and quasi-random searches.

5. Random Search vs. Sequential Manual Optimization

To see how random search compares with a careful combination of grid search and hand-tuning in the context of a model with many hyper-parameters, we performed experiments with the Deep Belief Network (DBN) model (Hinton et al., 2006). A DBN is a multi-layer graphical model with directed and undirected components. It is parameterized similarly to a multilayer neural network for classification, and it has been argued that pretraining a multilayer neural network by unsupervised learning as a DBN acts both to regularize the neural network toward better generalization, and to ease the optimization associated with fine-tuning the neural network for a classification task (Erhan et al., 2010).

A DBN classifier has many more hyper-parameters than a neural network. Firstly, there is the number of units and the parameters of random initialization for each layer. Secondly, there are hyper-parameters governing the unsupervised pretraining algorithm for each layer. Finally, there are hyper-parameters governing the global fine-tuning of the whole model for classification. For the details of how DBN models are trained (stacking restricted Boltzmann machines trained by contrastive divergence), the reader is referred to Larochelle et al. (2007), Hinton et al. (2006) or Bengio (2009).

We evaluated random search by training 1-layer, 2-layer and 3-layer DBNs, sampling from the following distribution:

- We chose 1, 2, or 3 layers with equal probability.
- For each layer, we chose:
  - a number of hidden units (log-uniformly between 128 and 4000),
  - a weight initialization heuristic that followed from a distribution (uniform or normal), a multiplier (uniformly between 0.2 and 2), and a decision to divide by the fan-out (true or false),
  - a number of iterations of contrastive divergence to perform for pretraining (log-uniformly from 1 to 10000),
  - whether to treat the real-valued examples used for unsupervised pretraining as Bernoulli means (from which to draw binary-valued training samples) or as samples themselves (even though they are not binary),
  - an initial learning rate for contrastive divergence (log-uniformly between 0.0001 and 1.0),
  - a time point at which to start annealing the contrastive divergence learning rate as in Equation 7 (log-uniformly from 10 to 10000).
- There was also the choice of how to preprocess the data. Either we used the raw pixels or we removed some of the variance using a ZCA transform (in which examples are projected onto principal components, and then multiplied by the transpose of the principal components to place them back in the input space). If using ZCA preprocessing, we kept an amount of variance drawn uniformly from 0.5 to 1.0.
- We chose to seed our random number generator with one of 2, 3, or 4.
- We chose a learning rate for fine-tuning of the final classifier log-uniformly from 0.001 to 10.
- We chose an anneal start time for fine-tuning log-uniformly from 100 to 10000.
- We chose ℓ2 regularization of the weight matrices at each layer during fine-tuning to be either 0 (with probability 0.5), or log-uniformly from 10⁻⁷ to 10⁻⁴.

This hyper-parameter space includes 8 global hyper-parameters and 8 hyper-parameters for each layer, for a total of 32 hyper-parameters for 3-layer models.
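A hedged sketch of sampling from a conditional, layer-wise space like this one (our illustration, not the authors' sampler, and showing only a subset of the 32 hyper-parameters): random search copes with the hierarchical structure simply by drawing each per-layer block as needed.

    # Sample one DBN configuration with a variable number of per-layer blocks.
    import math
    import random

    def log_uniform(rng, a, b):
        return math.exp(rng.uniform(math.log(a), math.log(b)))

    def sample_dbn_config(rng):
        n_layers = rng.choice([1, 2, 3])
        layers = [{
            "n_hidden": int(round(log_uniform(rng, 128, 4000))),
            "cd_iterations": int(round(log_uniform(rng, 1, 10000))),
            "cd_lr": log_uniform(rng, 1e-4, 1.0),
            "cd_anneal_start": log_uniform(rng, 10, 10000),
            "inputs_as_bernoulli_means": rng.choice([True, False]),
        } for _ in range(n_layers)]
        return {
            "preprocessing": rng.choice(["raw", "zca"]),
            "finetune_lr": log_uniform(rng, 1e-3, 10.0),
            "finetune_anneal_start": log_uniform(rng, 100, 10000),
            "finetune_l2": 0.0 if rng.random() < 0.5 else log_uniform(rng, 1e-7, 1e-4),
            "layers": layers,
        }

    print(sample_dbn_config(random.Random(0)))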
A grid search is not practical for the 32-dimensional search problem of DBN model selection, because even just 2 possible values for each of 32 hyper-parameters would yield more trials than we could conduct (2³² > 10⁹ trials, and each can take hours). For many of the hyper-parameters, especially real-valued ones, we would really like to try more than two values.

The approach taken in Larochelle et al. (2007) was a combination of manual search, multi-resolution grid search and coordinate descent. The algorithm (including manual steps) is somewhat elaborate, but sensible, and we believe that it is representative of how model search is typically done in several research groups, if not the community at large. Larochelle et al. (2007) describe it as follows:

"The hyper-parameter search procedure we used alternates between fixing a neural network architecture and searching for good optimization hyper-parameters similarly to coordinate descent. More time would usually be spent on finding good optimization parameters, given some empirical evidence that we found indicating that the choice of the optimization hyper-parameters (mostly the learning rates) has much more influence on the obtained performance than the size of the network. We used the same procedure to find the hyper-parameters for DBN-1, which are the same as those of DBN-3 except the second hidden layer and third hidden layer sizes. We also allowed ourselves to test for much larger first-hidden layer sizes, in order to make the comparison between DBN-1 and DBN-3 fairer.

"We usually started by testing a relatively small architecture (between 500 and 700 units in the first and second hidden layer, and between 1000 and 2000 hidden units in the last layer). Given the results obtained on the validation set (compared to those of NNet for instance) after selecting appropriate optimization parameters, we would then consider growing the number of units in all layers simultaneously. The biggest networks we eventually tested had up to 3000, 4000 and 6000 hidden units in the first, second and third hidden layers respectively.

"As for the optimization hyper-parameters, we would proceed by first trying a few combinations of values for the stochastic gradient descent learning rate of the supervised and unsupervised phases (usually between 0.1 and 0.0001). We then refine the choice of tested values for these hyper-parameters. The first trials would simply give us a trend on the validation set error for these parameters (is a change in the hyper-parameter making things worse or better) and we would then consider that information in selecting appropriate additional trials. One could choose to use learning rate adaptation techniques (e.g., slowly decreasing the learning rate or using momentum) but we did not find these techniques to be crucial."

There was large variation in the number of trials used in Larochelle et al. (2007) to optimize the DBN-3. One data set (mnist background images) benefited from 102 trials, while another (mnist background random) only 13 because a good result was found more quickly. The average number of trials across data sets for the DBN-3 model was 41. In considering the number of trials per data set, it is important to bear in mind that the experiments on different data sets were not performed independently. Rather, later experiments benefited from the experience the authors had drawn from earlier ones. Although grid search was part of the optimization loop, the manual intervention turns the overall optimization process into something with more resemblance to an adaptive sequential algorithm.

Figure 9: Deep Belief Network (DBN) performance according to random search. Here random search is used to explore up to 32 hyper-parameters. Results obtained by grid-assisted manual search using an average of 41 trials are marked in finely-dashed green (1-layer DBN) and coarsely-dashed red (3-layer DBN). Random experiments of 128 random trials found an inferior best model for three data sets, a competitive model in four, and a superior model in one (convex). (Best viewed in color.)

Random search versions of the DBN experiments from Larochelle et al. (2007) are shown in Figure 9. In this more challenging optimization problem random search is still effective, but not superior as it was in the case of neural network optimization. Comparing to the 3-layer DBN results in Larochelle et al. (2007), random search found a better model than the manual search in one data set (convex), an equally good model in four (mnist basic, mnist rotated, rectangles, and rectangles images), and an inferior model in three (mnist background images, mnist background random, mnist rotated background images). Comparing to the 1-layer DBN results, random search of the 1-layer, 2-layer and 3-layer configuration spaces found at least as good a model in all cases. In comparing these scores, the reader should bear in mind that the scores in the original experiments were not computed using the same score-averaging technique that we described in Section 2.1, and our averaging technique is slightly biased toward underestimation.

In the DBN efficiency curves we see that even experiments with larger numbers of trials (64 and larger) feature significant variability. This indicates that the regions of the search space with the best performance are small, and randomly chosen i.i.d. trials do not reliably find them.

6. Future Work

Our result on the multidimensional interval task, together with the GPR characterization of the shape of Ψ, together with the computational constraint that hyper-parameter searches only draw on a few hundred trials, all suggest that pseudo-random or quasi-random trials are optimal for non-adaptive hyper-parameter search. There is still work to be done for each model family, to establish how it should be parametrized for i.i.d. random search to be as reliable as possible, but the most promising and interesting direction for future work is certainly in adaptive algorithms.

There is a large body of literature on global optimization, a great deal of which bears on the application of hyper-parameter optimization. General numeric methods such as simplex optimization (Nelder and Mead, 1965), constrained optimization by linear approximation (Powell, 1994; Weise, 2009), finite difference stochastic approximation and simultaneous perturbation stochastic approximation (Kleinman et al., 1999) could be useful, as well as methods for search in discrete spaces such as simulated annealing (Kirkpatrick et al., 1983) and evolutionary algorithms (Rechenberg, 1973; Hansen et al., 2003). Drew and de Mello (2006) have already proposed an optimization algorithm that identifies effective dimensions, for more efficient search. They present an algorithm that distinguishes between important and unimportant dimensions: a low-discrepancy point set is used to choose points in the important dimensions, and unimportant dimensions are "padded" with thinner coverage and cheaper samples. Their algorithm's success hinges on the rapid and successful identification of important dimensions. Sequential model-based optimization methods and particularly Bayesian optimization methods are perhaps more promising because they offer principled approaches to weighting the importance of each dimension (Hutter, 2009; Hutter et al., 2011; Srinivasan and Ramakrishnan, 2011).
With so many sophisticated algorithms to draw on, it may seem strange that grid search is still widely used, and, with straight faces, we now suggest using random search instead. We believe the reason for this state of affairs is a technical one. Manual optimization followed by grid search is easy to implement: grid search requires very little code infrastructure beyond access to a cluster of computers. Random search is just as simple to carry out, uses the same tools, and fits in the same workflow. Adaptive search algorithms on the other hand require more code complexity. They require client-server architectures in which a master process keeps track of the trials that have completed, the trials that are in progress, and the trials that were started but failed to complete. Some kind of shared database and inter-process communication mechanisms are required. Trials in an adaptive experiment cannot be queued up all at once; the master process must be involved somehow in the scheduling and timing of jobs on the cluster. These technical hurdles are not easy to jump with the standard tools of the trade such as MATLAB or Python; significant software engineering is required. Until that engineering is done and adopted by a community of researchers, progress on the study of sophisticated hyper-parameter optimization algorithms will be slow.

7. Conclusion

Grid search experiments are common in the literature of empirical machine learning, where they are used to optimize the hyper-parameters of learning algorithms. It is also common to perform multi-stage, multi-resolution grid experiments that are more or less automated, because a grid experiment with a fine-enough resolution for optimization would be prohibitively expensive. We have shown that random experiments are more efficient than grid experiments for hyper-parameter optimization in the case of several learning algorithms on several data sets. Our analysis of the hyper-parameter response surface (Ψ) suggests that random experiments are more efficient because not all hyper-parameters are equally important to tune. Grid search experiments allocate too many trials to the exploration of dimensions that do not matter and suffer from poor coverage in dimensions that are important. Compared with the grid search experiments of Larochelle et al. (2007), random search found better models in most cases and required less computational time.

Random experiments are also easier to carry out than grid experiments for practical reasons related to the statistical independence of every trial. The experiment can be stopped any time and the trials form a complete experiment. If extra computers become available, new trials can be added to an experiment without having to adjust the grid and commit to a much larger experiment. Every trial can be carried out asynchronously. If the computer carrying out a trial fails for any reason, its trial can be either abandoned or restarted without jeopardizing the experiment.

Random search is not incompatible with a controlled experiment. To investigate the effect of one hyper-parameter of interest X, we recommend random search (instead of grid search) for optimizing over other hyper-parameters. Choose one set of random values for these remaining hyper-parameters and use that same set for each value of X.
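A small sketch of this recommendation (ours, with hypothetical hyper-parameter names): draw one random setting of the nuisance hyper-parameters and reuse it for every value of the hyper-parameter of interest, so that trials differ only in that hyper-parameter.

    # Controlled study of one hyper-parameter with randomly chosen nuisance settings.
    import random

    rng = random.Random(0)
    # One shared random draw for the remaining ("nuisance") hyper-parameters.
    background = {
        "log10_l2": rng.uniform(-7, -4),
        "n_hidden": int(2 ** rng.uniform(5, 10)),
        "batch_size": rng.choice([20, 100]),
    }
    # Sweep only the hyper-parameter of interest, here the learning rate.
    for lr in [0.001, 0.01, 0.1, 1.0]:
        trial = dict(background, learning_rate=lr)
        print(trial)      # each trial would be trained and evaluated in a real study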
Random experiments with large numbers of trials also bring attention to the question of how to measure test error of an experiment when many trials have some claim to being best. When using a relatively small validation set, the uncertainty involved in selecting the best model by cross-validation can be larger than the uncertainty in measuring the test set performance of any one model. It is important to take both of these sources of uncertainty into account when reporting the uncertainty around the best model found by a model search algorithm. This technique is useful to all experiments (including both random and grid) in which multiple models achieve approximately the best validation set performance.

Low-discrepancy sequences developed for QMC integration are also good alternatives to grid-based experiments. In low dimensions (e.g., 1-5) our simulated results suggest that they can hold some advantage over pseudo-random experiments in terms of search efficiency. However, the trials of a low-discrepancy experiment are not i.i.d., which makes it inappropriate to analyze performance with the random efficiency curve. It is also more difficult in practice to conduct a quasi-random experiment because, like a grid experiment, the omission of a single point can be more severe. Finally, when there are many hyper-parameter dimensions relative to the computational budget for the experiment, a low-discrepancy trial set is not expected to behave very differently from a pseudo-random one.

Finally, the hyper-parameter optimization strategies considered here are non-adaptive: they do not vary the course of the experiment by considering any results that are already available. Random search was not generally as good as the sequential combination of manual and grid search from an expert (Larochelle et al., 2007) in the case of the 32-dimensional search problem of DBN optimization, because the efficiency of sequential optimization overcame the inefficiency of the grid search employed at each step of the procedure. Future work should consider sequential, adaptive search/optimization algorithms in settings where many hyper-parameters of an expensive function must be optimized jointly and the effective dimensionality is high. We hope that future work in that direction will consider random search of the form studied here as a baseline for performance, rather than grid search.

Acknowledgments

This work was supported by the National Science and Engineering Research Council of Canada and Compute Canada, and implemented with Theano (Bergstra et al., 2010).

References

I. A. Antonov and V. M. Saleev. An economic method of computing LPτ-sequences. USSR Computational Mathematics and Mathematical Physics, 19(1):252-256, 1979.

R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey, 1961.

Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009. doi: 10.1561/2200000006.

Y. Bengio and X. Glorot. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS'10), pages 249-256, 2010.

J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral.

C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, London, UK, 1995.

P. Bratley, B. L. Fox, and H. Niederreiter. Implementation and tests of low-discrepancy sequences. Transactions on Modeling and Computer Simulation (TOMACS), 2(3):195-213, 1992.

R. E. Caflisch, W. Morokoff, and A. Owen. Valuation of mortgage backed securities using Brownian bridges to reduce effective dimension, 1997.
C. Chang and C. Lin. LIBSVM: A Library for Support Vector Machines, 2001.

I. Czogiel, K. Luebke, and C. Weihs. Response surface methodology for optimizing hyper parameters. Technical report, Universität Dortmund Fachbereich Statistik, September 2005.

S. S. Drew and T. Homem de Mello. Quasi-Monte Carlo strategies for stochastic optimization. In Proc. of the 38th Conference on Winter Simulation, pages 774-782, 2006.

D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625-660, 2010.

J. H. Halton. On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik, 2:84-90, 1960.

N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1-18, 2003.

G. E. Hinton. A practical guide to training restricted Boltzmann machines. Technical Report 2010-003, University of Toronto, 2010. Version 1.

G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, 2009.

F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION-5, 2011. Extended version as UBC Tech report TR-2010-10.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.

N. L. Kleinman, J. C. Spall, and D. Q. Naiman. Simulation-based optimization with stochastic approximation using common random numbers. Management Science, 45(11):1570-1578, November 1999. doi: 10.1287/mnsc.45.11.1570.

H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML'07), pages 473-480. ACM, 2007.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998a.

Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer, 1998b.

M. Galassi et al. GNU Scientific Library Reference Manual, 3rd edition, 2009.
M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239-245, May 1979. doi: 10.2307/1268522.

A. Nareyek. Choosing search heuristics by non-stationary reinforcement learning. Applied Optimization, 86:523-544, 2003.

R. M. Neal. Assessing relevance determination methods using DELVE. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 97-129. Springer-Verlag, 1998.

J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7:308-313, 1965.

M. J. D. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. Advances in Optimization and Numerical Analysis, pages 51-67, 1994.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

I. Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.

A. Srinivasan and G. Ramakrishnan. Parameter screening and optimisation for ILP using designed experiments. Journal of Machine Learning Research, 12:627-662, February 2011.

P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1096-1103. ACM, 2008.

T. Weise. Global Optimization Algorithms - Theory and Application. Self-published, second edition, 2009. Available online at http://www.it-weise.de/.