Journal of Machine Learning Research 8 (2007) 1197-1215. Submitted 8/05; Revised 2/07; Published 5/07

Synergistic Face Detection and Pose Estimation with Energy-Based Models

Margarita Osadchy (RITA@CS.HAIFA.AC.IL)
Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel

Yann LeCun (YANN@CS.NYU.EDU)
The Courant Institute, New York University, New York, NY 10003, USA

Matthew L. Miller (MLM@NEC-LABS.COM)
NEC Labs America, Princeton NJ 08540, USA

Editor: Pietro Perona

© 2007 Margarita Osadchy, Yann LeCun and Matthew L. Miller.

Abstract

We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a low-dimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters. The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones.

The system is designed to handle a very large range of poses without retraining. The performance of the system was tested on three standard data sets (for frontal views, rotated faces, and profiles) and is comparable to previous systems that are designed to handle a single one of these data sets. We show that a system trained simultaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately. A more preliminary version of this work appears as Osadchy et al. (2005).

Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition

1. Introduction

The detection of human faces in natural images and videos is a key component in a wide variety of applications of human-computer interaction, search and indexing, security, and surveillance. Many real-world applications would profit from view-independent detectors that can detect faces under a wide range of poses: looking left or right (yaw axis), up or down (pitch axis), or tilting left or right (roll axis).

In this paper we describe a novel method that can not only detect faces independently of their poses, but also simultaneously estimate those poses. The system is highly reliable, runs in real time on standard hardware, and is robust to variations in yaw (±90°), roll (±45°), pitch (±60°), as well as partial occlusions.

The method is motivated by the idea that multi-view face detection and pose estimation are so closely related that they should not be performed separately. The tasks are related in the sense that they could use similar features and internal representations, and must be robust against the same sorts of variation: skin color, glasses, facial hair, lighting, scale, expressions, etc. We suspect that, when trained together, each task can serve as an inductive bias for the other, yielding better generalization or requiring fewer training examples (Caruana, 1997).

To exploit the synergy between these two tasks, we train a learning machine to map input images to points in a low-dimensional space. In the low-dimensional output space we embed a face manifold which is parameterized by facial pose parameters (e.g., pitch, yaw, and roll). A convolutional network is trained to map face images to points on the face manifold that correspond to the pose of the faces, and non-face images to points far away from that manifold. After training, a detection is performed by measuring whether the distance of the output point from the manifold is lower than a threshold. If the point is close to the manifold, indicating that a face is present in the image, its pose parameters can be inferred from the position of the projection of the point onto the manifold.

To map input images to points in the low-dimensional space, we employ a convolutional network architecture (LeCun et al., 1998). Convolutional networks are specifically designed to learn invariant representations of images. They can easily learn the type of shift-invariant local features that are relevant to face detection and pose estimation. More importantly, they can be replicated over large images (applied to every sub-window in a large image) at a small fraction of the cost of applying more traditional classifiers to every sub-window in an image. This is a considerable advantage for building real-time systems.

As a learning machine we use the recently proposed Energy-Based Models (EBM) that provide a description of the inference process and the learning process in a single, well-principled framework (LeCun and Huang, 2005; LeCun et al., 2006). Given an input (an image), an Energy-Based Model associates an energy to each configuration of the variables to be modeled (the face/non-face label and the pose parameters in our case). Making an inference with an EBM consists in searching for a configuration of the variables to be predicted that minimizes the energy, or comparing the energies of a small number of configurations of those variables. EBMs have a number of advantages over probabilistic models: (1) there is no need to compute partition functions (normalization constants) that may be intractable; (2) because there is no requirement for normalization, the repertoire of possible model architectures that can be used is considerably richer. In our application we define an Energy-Based Model as a scalar-valued energy function of three variables: image, label, and pose, and we treat pose as a deterministic latent variable. Thus both the label of an image and its pose are inferred through the energy-minimization process.

Training an EBM consists in finding values of the trainable parameters (which parameterize the energy function) that associate low energies to “desired” configurations of variables, and high energies to “undesired” configurations. With probabilistic models, making the probability of some values large automatically makes the probabilities of other values small because of the normalization. With EBMs, making the energy of desired configurations low may not necessarily make the energies of other configurations high. Therefore, one must be very careful when designing loss functions for EBMs. In our application to face detection we derive a new type of contrastive loss function that is tailored to such detection tasks.
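To make the inference rule described above concrete, here is a minimal Python sketch of comparing the energies of a small number of configurations. The energy values, the coarse pose grid, and the function name `infer` are illustrative assumptions, not taken from the paper. The point is only that the lowest-energy configuration can be selected without computing any normalization constant.

```python
import math

def infer(energies):
    """Return the configuration with the lowest energy.

    `energies` maps a configuration (label, pose) to a scalar energy.
    """
    return min(energies, key=energies.get)

# Hypothetical energies for one image: label 1 = face, label 0 = non-face.
# The non-face energy does not depend on pose, so its pose is None.
energies = {
    (1, -30.0): 2.1,
    (1, 0.0): 0.4,
    (1, 30.0): 1.7,
    (0, None): 1.0,
}

best = infer(energies)

# A Gibbs distribution over the same configurations ranks them identically,
# but it needs the normalizer (partition function) that plain energy
# minimization never has to compute.
beta = 1.0
partition = sum(math.exp(-beta * e) for e in energies.values())
probs = {c: math.exp(-beta * e) / partition for c, e in energies.items()}
most_probable = max(probs, key=probs.get)
```

In this toy setting the sum over configurations is trivial. The advantage claimed in the text concerns models where the corresponding sum or integral is intractable.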
The paper is organized as follows. First, some of the relevant prior works on multi-view face detection are briefly discussed. Section 2 discusses the synergy between pose estimation and face detection, and describes the basic methods for integrating them. Section 3 discusses the learning machine, and Section 4 gives the results of experiments conducted with our system. Section 5 draws some conclusions.

1.1 Previous Work

Learning-based approaches to face detection abound, including real-time methods (Viola and Jones, 2001), and approaches based on convolutional networks (Vaillant et al., 1994; Garcia and Delakis, 2002). Most multi-view systems take a view-based approach, which involves building separate detectors for different views and either applying them in parallel (Pentland et al., 1994; Sung and Poggio, 1998; Schneiderman and Kanade, 2000; Li et al., 2002) or using a pose estimator to select the most appropriate detector (Jones and Viola, 2003; Huang et al., 2004). Another approach is to estimate and correct in-plane rotations before applying a single pose-specific detector (Rowley et al., 1998b). Some attempts have been made at integrating pose search and detection, but in a much smaller space of pose parameters (Fleuret and Geman, 2001).

Closer to our approach is that of Li et al. (2000), in which a number of Support Vector Regressors are trained to approximate smooth functions, each of which has a maximum for a face at a particular pose. Another machine is trained to convert the resulting values to estimates of poses, and a third machine is trained to convert the values into a face/non-face score. The resulting system is rather slow. See Yang et al. (2002) for a survey of face detection methods.

2. Integrating Face Detection and Pose Estimation

To exploit the posited synergy between face detection and pose estimation, we must design a system that integrates the solutions to the two problems. Merely cascading two systems where the answer to one problem is used to assist in solving the other will not optimally take advantage of the synergy. Therefore, both answers must be derived from one underlying analysis of the input, and both tasks must be trained together.

Our approach is to build a trainable system that can map raw images X to points in a low-dimensional space (Figure 1). In that space, we pre-define a face manifold $F(Z)$ that we parameterize by the pose Z. We train the system to map face images with known poses to the corresponding points on the manifold. We also train it to map images of non-faces to points far away from the manifold. During recognition, the system maps the input image X to a point $G(X)$ in the low-dimensional space. The proximity of $G(X)$ to the manifold then tells us whether or not an image X is a face. By finding the pose parameters Z that correspond to the point on the manifold that is closest to the point $G(X)$ (projection), we obtain an estimate of the pose (Figure 2).

Figure 1: Manifold mapping, training. (Diagram labels: face manifold $F(Z)$ parameterized by pose; mapping $G$; low dimensional space.)

Figure 2: Manifold mapping, recognition and pose estimation. (Diagram labels: image X; mapping $G$; $G(X)$; face manifold $F(Z)$ parameterized by pose; distance $\|G(X) - F(Z)\|$; low dimensional space.)

2.1 Parameterizing the Face Manifold

We will now describe the details of the parameterizations of the face manifold. Three criteria directed the design of the face manifold: (1) preserving the topology and geometry of the problem; (2) providing enough space for mapping the background images far from the manifold (since the proximity to the manifold indicates whether the input image contains a face); and (3) minimizing
the computational cost of finding the parameters of the closest point on the manifold to any point in the space.

Let's start with the simplest case of one pose parameter $Z = \theta$, representing, say, yaw. If we want to preserve the natural topology and geometry of the problem (the first criterion), the face manifold under yaw variations in the interval [-90°, 90°] should be a half circle (with constant curvature). The natural way of representing a circle is with sine and cosine functions. In this case we embed the angle parameter in two-dimensional space. Now images of faces will be mapped to points on the half-circle manifold corresponding to $\theta$, and non-face images will be mapped to points in the rest of the two-dimensional space.

Having lots of free space to represent non-face images may be necessary, due to the considerable amount of variability in non-face images. So increasing the dimension of the embedding might help us better separate face and non-face images (the second criterion). In the case of a single pose parameter we suggest a 3D embedding.

To make the projection and parameter estimation simple (the third criterion), we embed this half circle in a three-dimensional space using three equally-spaced shifted cosine functions (Figure 3):

$$F_i(\theta) = \cos(\theta - \alpha_i); \quad i = 1, 2, 3; \quad \theta \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right]; \quad \alpha = \left\{-\frac{\pi}{3}, 0, \frac{\pi}{3}\right\}.$$

A point on the face manifold parameterized by the yaw angle $\theta$ is $F(\theta) = [F_1(\theta), F_2(\theta), F_3(\theta)]$. When we run the network on an image X, it outputs a vector $G(X)$. The yaw angle $\bar{\theta}$ corresponding to the point on the manifold that is closest to $G(X)$ can be expressed analytically as:

$$\bar{\theta} = \arctan \frac{\sum_{i=1}^{3} G_i(X) \sin(\alpha_i)}{\sum_{i=1}^{3} G_i(X) \cos(\alpha_i)}.$$

The point on the manifold closest to $G(X)$ is just $F(\bar{\theta})$. The function choice is not limited to cosine. However, cosines are preferable since they allow computing the pose analytically from the output of the network. Without this property, finding the pose could be an expensive optimization process, or even require the use of a second learning machine.

The same idea can be generalized to any number of pose parameters. Let us consider the set of all faces with yaw in [-90°, 90°] and roll in [-45°, 45°]. In an abstract way, this set is isomorphic to a portion of a sphere. Consequently, we can represent a point on the face manifold as a function of the two pose parameters by 9 basis functions that are the cross-products of three shifted cosines for one of the angles, and three shifted cosines for the other angle:

$$F_{ij}(\theta, \phi) = \cos(\theta - \alpha_i) \cos(\phi - \beta_j); \quad i, j = 1, 2, 3.$$

For convenience, we rescale the roll angles to the range [-90°, 90°], which allows us to set $\beta_i = \alpha_i$. With this parameterization, the manifold has constant curvature, which ensures that the effect of errors will be the same regardless of pose. Given a 9-dimensional output vector from the convolutional network $G_{ij}(X)$, we compute the corresponding yaw and roll angles $\bar{\theta}, \bar{\phi}$ as follows:

$$cc = \sum_{ij} G_{ij}(X) \cos(\alpha_i) \cos(\beta_j); \quad cs = \sum_{ij} G_{ij}(X) \cos(\alpha_i) \sin(\beta_j);$$

$$sc = \sum_{ij} G_{ij}(X) \sin(\alpha_i) \cos(\beta_j); \quad ss = \sum_{ij} G_{ij}(X) \sin(\alpha_i) \sin(\beta_j);$$

$$\bar{\theta} = 0.5 \, (\operatorname{atan2}(cs + sc, cc - ss) + \operatorname{atan2}(sc - cs, cc + ss));$$

$$\bar{\phi} = 0.5 \, (\operatorname{atan2}(cs + sc, cc - ss) - \operatorname{atan2}(sc - cs, cc + ss)).$$

The process can easily be extended to include pitch in addition to yaw and roll, as well as other parameters if necessary.

Figure 3: Left: face manifold embedding (axes $F_1$, $F_2$, $F_3$). Right: manifold parametrization by a single pose parameter. The values of the three cosine functions for one pose angle constitute the three components of a point on the face manifold corresponding to that pose.

3. Learning Machine

To map input images to points in the low-dimensional space, we employ a convolutional network architecture trained using the Energy Minimization Framework. Next we present the details of the learning machine.

3.1 Energy Minimization Framework

We propose the following configuration of the Energy-Based Model (LeCun and Huang, 2005; LeCun et al., 2006). Consider a scalar-valued function $E_W(Y, Z, X)$, where X is a raw image, Z is a facial pose (e.g., yaw and roll as defined above), and Y is a binary label: Y = 1 for face, Y = 0 for non-face. W is a parameter vector subject to learning. $E_W(Y, Z, X)$ can be interpreted as an energy function that measures the degree of compatibility between the values of X, Z, Y. The inference process consists in clamping X to the observed value (the image), and searching for configurations of Z and Y that minimize the energy $E_W(Y, Z, X)$:

$$(\bar{Y}, \bar{Z}) = \operatorname{argmin}_{Y \in \{Y\}, \, Z \in \{Z\}} E_W(Y, Z, X)$$

where $\{Y\} = \{0, 1\}$ and $\{Z\} = [-90, 90] \times [-45, 45]$ for yaw and roll variables.

Ideally, if the input X is the image of a face with pose Z, then a properly trained system should give a lower energy to the face label Y = 1 than to the non-face label Y = 0 for any pose: $E_W(1, Z, X) < E_W(0, Z', X), \ \forall Z'$. For accurate pose
estimation, the system should give a lower energy to the correct pose than to any other pose: $E_W(1, Z', X) > E_W(1, Z, X), \ \forall Z' \neq Z$. Training a machine to satisfy those two conditions for any image will guarantee that the energy-minimizing inference process will produce the correct answer.

Transforming energies to probabilities can easily be done via the Gibbs distribution:

$$P(Y, Z \mid X) = \frac{\exp(-\beta E_W(Y, Z, X))}{\int_{y \in \{Y\}, \, z \in \{Z\}} \exp(-\beta E_W(y, z, X))}$$

where $\beta$ is an arbitrary positive constant, and $\{Y\}$ and $\{Z\}$ are the sets of possible values of y and z. With this formulation, we can easily interpret the energy minimization with respect to Y and Z as a maximum conditional likelihood estimation of Y and Z. This probabilistic interpretation assumes that the integral in the denominator (the partition function) converges. It is easy to design such an energy function for our case. However, a proper probabilistic formulation would require us to use the negative log-likelihood of the training samples as our loss function for training. This would require us to compute the derivative of the denominator with respect to the trainable parameters W. This is an unnecessary complication which can be avoided by adopting the energy-based formulation. Removing the necessity for normalization gives us complete freedom in the choice of the internal architecture and parameterization of $E_W(Y, Z, X)$, as well as considerable flexibility in the choice of the loss function for training.

Figure 4: Architecture of the Minimum Energy Machine. (Diagram labels: X (image); W (param); convolutional network; $G_W(X)$; Z (pose); analytical mapping onto face manifold; $F(Z)$; $\|G_W(X) - F(Z)\|$; T; switch; Y (label); E (energy).)

Our energy function for a face $E_W(1, Z, X)$ is defined as the distance between the point produced by the network $G_W(X)$ and the point with pose Z on the manifold $F(Z)$:

$$E_W(1, Z, X) = \| G_W(X) - F(Z) \|.$$

The energy function for a non-face $E_W(0, Z, X)$ is equal to a constant T that we can interpret as a threshold (it is independent of Z and X). The complete energy function is:

$$E_W(Y, Z, X) = Y \, \| G_W(X) - F(Z) \| + (1 - Y) \, T.$$

The architecture of the machine is depicted in Figure 4. Operating this machine (finding the output label and pose with the smallest energy) comes down to first finding $\bar{Z} = \operatorname{argmin}_{Z \in \{Z\}} \| G_W(X) - F(Z) \|$, and then comparing this minimum distance, $\| G_W(X) - F(\bar{Z}) \|$
, to the threshold T. If it is smaller than T, then X is classified as a face, otherwise X is classified as a non-face. This decision is implemented in the architecture as a switch that depends upon the binary variable Y. For simplicity we fix T to be a constant, although it is also possible to make T a function of the pose Z.

3.2 Convolutional Network

We employ a Convolutional Network as the basic architecture for the $G_W(X)$ function that maps images to points in the face-space. The architecture of convolutional nets is somewhat inspired by the structure of biological visual systems. Convolutional nets have been used successfully in a number of vision applications such as handwriting recognition (LeCun et al., 1989, 1998), and generic object recognition (LeCun et al., 2004). Several authors have advocated the use of Convolutional Networks for object detection (Vaillant et al., 1994; Nowlan and Platt, 1995; LeCun et al., 1998; Garcia and Delakis, 2002).

Convolutional networks are “end-to-end” trainable systems that can operate on raw pixel images and learn low-level features and high-level representations in an integrated fashion. Each layer in a convolutional net is composed of units organized in planes called feature maps. Each unit in a feature map takes inputs from a small neighborhood within the feature maps of the previous layer. Neighboring units in a feature map are connected to neighboring (possibly overlapping) windows. Each unit computes a weighted sum of its inputs and passes the result through a sigmoid saturation function. All units within a feature map share the same weights. Therefore, each feature map can be seen as convolving the feature maps of the previous layers with small-size kernels, and passing the sum of those convolutions through sigmoid functions. Units in a feature map detect local features at all locations on the previous layer.

Convolutional nets are advantageous because they can operate on raw images and can easily learn the type of shift-invariant local features that are relevant to image recognition. Furthermore, they are very efficient computationally for detection and recognition tasks involving a sliding window over large images (Vaillant et al., 1994; LeCun et al., 1998).

The network architecture used for training is shown in Figure 5. It is similar to LeNet5 (LeCun et al., 1998), but contains more feature maps. The network input is a 32×32 pixel gray-scale image. The first layer C1 is a convolutional layer with 8 feature maps of size 28×28. Each unit in each feature map is connected to a 5×5 neighborhood in the input. Contiguous units in C1 take input from neighborhoods on the input that overlap by 4 pixels. The next layer, S2, is a so-called subsampling layer with 8 feature maps of size 14×14. Each unit in each map is connected to a 2×2 neighborhood in the corresponding feature map in C1. Contiguous units in S2 take input from contiguous, non-overlapping 2×2 neighborhoods in the corresponding map in C1. C3 is convolutional with 20 feature maps of size 10×10. Each unit in each feature map is connected to several 5×5 neighborhoods at identical locations in a subset of S2's feature maps. Different C3 maps take input from different subsets of S2 to break the symmetry and to force the maps to extract different features. S4 is a subsampling layer with 2×2 subsampling ratios containing 20 feature maps of size 5×5. Layer C5 is a convolutional layer with 120 feature maps of size 1×1 with 5×5 kernels. Each C5 map takes input from all 20 of S4's feature maps. The output layer has 9 outputs (since the face manifold is nine-dimensional) and is fully connected to C5 (such a full connection can be seen as a convolution with 1×1 kernels).

3.3 Training with a Contrastive Loss Function

The vector W contains all 63,493 weights and kernel coefficients in the convolutional network. They are all subject to training by minimizing a single loss function. A key element, and a novel contribution, of this paper is the design of the loss function.
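The feature-map sizes quoted in Section 3.2 follow from simple arithmetic on the 32×32 input window. The short Python sketch below recomputes them, assuming unpadded convolutions with stride 1 and non-overlapping 2×2 subsampling, which is what the stated sizes imply. The helper names and the layer table are illustrative and not from the paper.

```python
def conv_size(n, kernel):
    """Output width of an unpadded, stride-1 convolution."""
    return n - kernel + 1

def subsample_size(n, ratio):
    """Output width of non-overlapping subsampling."""
    return n // ratio

# (layer name, kind, kernel width or subsampling ratio, number of feature maps)
LAYERS = [
    ("C1", "conv", 5, 8),
    ("S2", "sub", 2, 8),
    ("C3", "conv", 5, 20),
    ("S4", "sub", 2, 20),
    ("C5", "conv", 5, 120),
    ("output", "conv", 1, 9),
]

def trace_sizes(input_size=32):
    """Return [(name, maps, width)] for each layer, given the input width."""
    size, out = input_size, []
    for name, kind, k, maps in LAYERS:
        size = conv_size(size, k) if kind == "conv" else subsample_size(size, k)
        out.append((name, maps, size))
    return out

sizes = trace_sizes()
for name, maps, width in sizes:
    print(f"{name}: {maps} maps of size {width}x{width}")

# Two 2x2 subsampling stages: adjacent output vectors are 4 input pixels apart.
output_stride = 2 * 2
```

Calling `trace_sizes(36)` gives a final width of 2, that is, one additional output vector for 4 additional input pixels, which matches the 4-pixel step at which the paper reports the network being replicated over larger images.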
Figure 5: Architecture of the convolutional network used for training. This represents one slice of the network with a 32×32 input window. The slice includes all the elements that are necessary to compute a single output vector. The trained network is replicated over the full input image, producing one output vector for each 32×32 window stepped every 4 pixels horizontally and vertically. The process is repeated at multiple scales.

The training set S is composed of two subsets: the set $S_1$ of training samples $(1, X_i, Z_i)$ containing a face annotated with the pose; and the set $S_0$ of training samples $(0, X_i)$ containing a non-face image (background). The loss function $L(W, S)$ is defined as the average over $S_1$ of a per-sample loss function $L_1(W, Z_i, X_i)$, plus the average over $S_0$ of a per-sample loss function $L_0(W, X_i)$:

$$L(W, S) = \frac{1}{|S_1|} \sum_{i \in S_1} L_1(W, Z_i, X_i) + \frac{1}{|S_0|} \sum_{i \in S_0} L_0(W, X_i). \qquad (1)$$

Face samples whose pose is unknown can easily be accommodated by viewing Z as a “deterministic latent variable” over which the energy must be minimized. However, experiments reported in this paper only use training samples manually labeled with the pose.

For a particular positive training sample $(X_i, Z_i, 1)$, the per-sample loss $L_1$ should be designed in such a way that its minimization with respect to W will make the energy of the correct answer lower than the energies of all possible incorrect answers. Minimizing such a loss function will make the machine produce the right answer when running the energy-minimizing inference procedure. We can write this condition as:

Condition 1: $E_W(Y_i = 1, Z_i, X_i) < E_W(Y, Z, X_i)$ for $Y \neq Y_i$ or $Z \neq Z_i$.

Satisfying this condition can be done by satisfying the two following conditions:

Condition 2: $E_W(1, Z_i, X_i) < T$ and $E_W(1, Z_i, X_i) < \min_{Z \neq Z_i} E_W(1, Z, X_i)$.
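As a concrete illustration of Condition 2, the following Python sketch uses the single-parameter (yaw) manifold of Section 2.1 and the face energy of Section 3.1. It is a sketch under stated assumptions: the vector `g` stands in for a network output $G_W(X_i)$ and is made up, as are the threshold value and the helper names. The closed-form yaw estimate relies on the fact that $\|F(\theta)\|$ is constant for the three shifted cosines, so the closest manifold point is the one that maximizes the inner product of $G$ and $F(\theta)$.

```python
import math

ALPHAS = (-math.pi / 3, 0.0, math.pi / 3)  # shifts of the three cosines

def manifold_point(theta):
    """F(theta): point on the face manifold for yaw theta (radians)."""
    return [math.cos(theta - a) for a in ALPHAS]

def face_energy(g, theta):
    """E(1, theta, X) = ||G(X) - F(theta)|| for a network output g."""
    return math.dist(g, manifold_point(theta))

def closest_yaw(g):
    """Yaw of the manifold point closest to g, in closed form."""
    s = sum(gi * math.sin(a) for gi, a in zip(g, ALPHAS))
    c = sum(gi * math.cos(a) for gi, a in zip(g, ALPHAS))
    return math.atan2(s, c)

def infer(g, threshold):
    """Return (label, yaw): label 1 (face) if the minimum energy is below T."""
    yaw = closest_yaw(g)
    return (1 if face_energy(g, yaw) < threshold else 0), yaw

# Hypothetical network output: the manifold point for 20 degrees, perturbed.
true_yaw = math.radians(20.0)
g = [v + d for v, d in zip(manifold_point(true_yaw), (0.02, -0.01, 0.015))]
T = 0.5  # assumed threshold, not a value from the paper

label, yaw = infer(g, T)

# Condition 2, checked on a grid of alternative yaws at least 5 degrees away.
grid = [math.radians(t) for t in range(-90, 91) if abs(t - 20) >= 5]
cond_threshold = face_energy(g, true_yaw) < T
cond_pose = face_energy(g, true_yaw) < min(face_energy(g, z) for z in grid)
```

With a small perturbation both parts of Condition 2 hold and the recovered yaw stays close to the annotated one. An output far from the manifold, such as the zero vector, has energy above the assumed threshold and is labeled as a non-face.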
Following LeCun and Huang (2005), we assume that the loss is a functional that depends on X only through the set of energies associated with X and all the possible values of Z and Y. This assumption allows us to decouple the design of the loss function from the internal structure (architecture) of the energy function. We also assume that there exists a W for which condition 2 is satisfied. This is a reasonable assumption: we merely ensure that the learning machine can produce the correct output for any single sample. We now show that, with our architecture, if we choose $L_1$ to be a strictly monotonically increasing function of $E_W(1, Z_i, X_i)$ (over the domain of E), then minimizing $L_1$ with respect to W will cause the machine to satisfy condition 2. The first inequality in condition 2 will obviously be satisfied by minimizing such a loss. The second inequality will be satisfied if $E_W(1, Z, X_i)$ has a single (non-degenerate) global minimum as a function of Z. The minimization of $L_1$ with respect to W will place this minimum at $Z_i$, and therefore will ensure that all other values of Z will have higher energy. Our energy function $E_W(1, Z, X) = \| G_W(X) - F(Z) \|$ indeed has a single global minimum as a function of Z, because $F(Z)$ is injective and the norm is convex. The single global minimum is attained for $G_W(X) = F(Z)$. For our experiments, we simply chose:

$$L_1(W, 1, Z, X) = E_W(1, Z, X)^2.$$

For a particular negative (non-face) training sample $(X_i, 0)$, the per-sample loss $L_0$ should be designed in such a way that its minimization with respect to W will make the energy for Y = 1 and any value of Z higher than the energy for Y = 0 (which is equal to T). Minimizing such a loss function will make the machine produce the right answer when running the energy-minimizing inference procedure. We can write the condition for correct output as:

Condition 3: $E_W(1, Z, X_i) > T \quad \forall Z$

which can be re-written as:

Condition 4: $E_W(1, \bar{Z}, X_i) > T, \quad \bar{Z} = \operatorname{argmin}_z E_W(1, z, X_i)$.

Again, we assume that there exists a W for which condition 4 is satisfied. It is easy to see that, with our architecture, if we choose $L_0$ to be a strictly monotonically decreasing function of $E_W(1, \bar{Z}, X_i)$ (over the domain of E), then minimizing $L_0$ with respect to W will cause the machine to satisfy condition 4. For our experiments, we simply chose:

$$L_0(W, 0, X_i) = K \exp[-E(1, \bar{Z}, X_i)]$$

where K is a positive constant. A nice property of this function is that it is bounded below by 0, and that its gradient vanishes as we approach the minimum.

The entire system was trained by minimizing the average value of the loss function in Eq. (1) with respect to the parameter W. We used a stochastic version of the Levenberg-Marquardt algorithm with a diagonal approximation of the Hessian (LeCun et al., 1998).

Figure 6: Screenshot from annotation tool.

3.4 Running the Machine

The detection system operates on raw grayscale images. The convolutional network is applied to all 32×32 sub-windows of the image, stepped every 4 pixels horizontally and vertically. Because the layers are convolutional, applying two replicas of the network in Figure 5 to two overlapping input windows leads to a considerable amount of redundant computation. Eliminating the redundant computation yields a dramatic speed-up: each layer of the convolutional network is extended so as to cover the entire input image. The output is also replicated the same way. Due to the two 2×2 subsampling layers, we obtain one output vector every 4×4 pixels.

To detect faces in a size-invariant fashion, the network is applied to multiple down-scaled versions of the image over a range of scales stepped by a factor of $\sqrt{2}$. At each scale and location, the network's 9-dimensional output vector is compared to the closest point on the face manifold (whose position indicates the pose of the candidate face). The system collects a list of all locations and scales of output vectors closer to the face manifold than the detection threshold. After examining all scales, the system identifies groups of overlapping detections in the list and discards all but the strongest (closest to the manifold) from each group within an exclusion area of a preset size. No attempt is made to combine detections or apply any voting scheme.

4. Experiments and Results

Using the architecture described in Section 3, we built a detector to locate faces and estimate two pose parameters: yaw from left to right profile, and in-plane rotation from -45 to 45 degrees. The machine was trained to be robust against pitch variation.

In this section, we first describe the training protocol for this network, and then give the results of two sets of experiments. The first set of experiments tests whether training for the two tasks together improves performance on both. The second set allows comparisons between our system and other published multi-view detectors.

4.1 Training

The images we used for training were collected at NEC Labs. All face images were manually annotated with appropriate poses. The annotation process was greatly simplified by using a simple tool for specifying the location and approximate pose of a face. The user interface for this tool is shown in Figure 6. Annotation is done by first clicking on the midpoint between the eyes and on the center of the mouth. The tool then draws a perspective grid in front of the face and the user adjusts it to be parallel to the face plane. This process yields estimates for all six pose parameters: location (x, y), three angles (yaw, pitch, and roll) and scale. The images were annotated in such a way that the midpoint between the eyes and the center of the mouth are positioned in the center of the image. This allows these two points to stay fixed when the pose changes from left to right profile. The downside is that profiles occupy only half of the image.

In this manner we annotated about 30,000 faces in images from various sources. Each face was then cropped and scaled so that the eye midpoint and the mouth midpoint appeared in canonical positions, 10 pixels apart, in a 32×32-pixel image with some moderate variation of location and scale. The resulting images were mirrored horizontally, to yield roughly 60,000 faces. We removed some portion of faces from this set to yield a roughly uniform distribution of poses from left profile to right profile. Unfortunately, the amount of variation in pitch (up/down) was not sufficient to do the same. This was the reason for training our system to be robust against pitch variation instead of estimating the pitch angle. The roll variation was added by randomly rotating the images in the range of [-45, 45] degrees. The resulting training set consisted of 52,850 32×32 grey-level images of faces with a uniform distribution of poses.

The initial set of negative training samples consisted of 52,850 image patches chosen randomly from non-face areas in a variety of images. For the second set of tests, half of these images were replaced with image patches obtained by running the initial version of the detector on the training images and collecting false detections.

Each training image was used 5 times during training, with random variations in
scale (from $x\sqrt{2}$ to $x(1+\sqrt{2})$), in-plane rotation (±45°), brightness (±20), and contrast (from 0.8 to 1.3). To train the network, we made 9 passes through this data, though it mostly converged after about the first 6 passes.

The training system was implemented in the Lush language (Bottou and LeCun, 2002). The total training time was about 26 hours on a 2 GHz Pentium 4. At the end of training, the network had converged to an equal error rate of 5% on the training data and 6% on a separate test set of 90,000 images.

A standalone version of the detection system was implemented in the C language. It can detect, locate, and estimate the pose of faces that are between 40 and 250 pixels high in a 640×480 image at roughly 5 frames per second on a 2.4 GHz Pentium 4.

4.2 Synergy Tests

The goal of the synergy test was to verify that both face detection and pose estimation benefit from learning and running in parallel. To test this claim we built three networks with almost identical architectures, but trained to perform different tasks. The first one was trained for simultaneous face detection and pose estimation (combined), the second was trained for detection only, and the third for pose estimation only. The “detection only” network had only one output for indicating whether or not its input was a face. The “pose only” network was identical to the combined network, but trained on faces only (no negative examples). Figure 7 shows the results of running these networks on our 10,000 test images. In both these graphs, we see that the pose-plus-detection network had better performance, confirming that training for each task benefits the other.

Figure 7: Synergy test. Left: ROC curves for the pose-plus-detection and detection-only networks (x axis: false positive rate per image; y axis: percentage of faces detected). Right: frequency with which the pose-plus-detection and pose-only networks correctly estimated the yaws within various error tolerances (x axis: yaw-error tolerance in degrees; y axis: percentage of yaws correctly estimated).

Figure 8: Results on standard data sets. Left: ROC curves for our detector on the three data sets (profile, frontal, rotated in plane). The x axis is the average number of false positives per image over all three sets, so each point corresponds to a single detection threshold; the y axis is the percentage of faces detected. Right: frequency with which yaw and roll (in-plane rotation) are estimated within various error tolerances (x axis: pose-error tolerance in degrees; y axis: percentage of poses correctly estimated).

4.3 Standard Data Sets

There is no standard data set that spans the range of poses our system is designed to handle. There are, however, data sets that have been used to test more restricted face detectors, each set focusing on a particular variation in pose. By testing a single detector with all of these sets, we can compare our performance against published systems. As far as we know, we are the first to publish results for a single detector on all these data sets. The details of these sets are described below:

MIT+CMU (Sung and Poggio, 1998; Rowley et al., 1998a): 130 images for testing frontal face detectors. We count 517 faces in this set, but the standard tests only use a subset of 507 faces, because 10 faces are in the wrong pose or otherwise not suitable for the test. (Note: about 2% of the faces in the standard subset are badly-drawn cartoons, which we do not intend our system to detect. Nevertheless, we include them in the results we report.)

TILTED (Rowley et al., 1998b): 50 images of frontal faces with in-plane rotations. 223 faces out of 225 are in the standard subset. (Note: about 20% of the faces in the standard subset are outside of the ±45° rotation range for which our system is designed. Again, we still include these in our results.)

PROFILE (Schneiderman and Kanade, 2000): 208 images of faces in profile. There seems to be some disagreement about the number of faces in the standard set of annotations: Schneiderman and Kanade (2000) report using 347 faces of the 462 that we found, Jones and Viola (2003) report using 355, and we found 353 annotations. However, these discrepancies should not significantly affect the reported results.

We counted a face as being detected if (1) at least one detection lay within a circle centered on the midpoint between the eyes, with a radius equal to 1.25 times the distance from that point to the midpoint of the mouth, and (2) that detection came at a scale within a factor of two of the correct scale for the face's size. We counted a detection as a false positive if it did not lie within this range for any of the faces in the image, including those faces not in the standard subset.

The left graph in Figure 8 shows ROC curves for our detector on the three data sets. Figures 9 and 10 show detection results on various poses. Table 1 shows our detection rates compared against other multi-view systems for which results were given on these data sets. We want to stress here that all these systems are tested in a pose-specific manner: for example, a detector tested on the TILTED set is trained only on frontal tilted faces. Such a detector will not be able to detect non-frontal tilted faces. Combining all pose variations in one system obviously will increase the number of false positives, since false positives of view-based detectors are not necessarily correlated. Our system is designed to handle all pose variations. This makes the comparison in Table 1 somewhat unfair to our system, but we don't see any other way of comparing against other systems.

The table shows that our results on the TILTED and PROFILE sets are similar to those of the two Jones & Viola detectors, and even approach those of the Rowley et al. and Schneiderman & Kanade non-real-time detectors. Those detectors, however, are not designed to handle all variations in pose, and do not yield pose estimates.

A more recent system reported in Huang et al. (2004) is also real-time and can handle all pose variation, but doesn't yield pose estimates. Unfortunately, they also report the results of pose-specific detectors. These results are not shown in Table 1, because they report different points on the ROC curve in the PROFILE experiment (86.2% for 0.42 false positives per image) and they didn't test on the TILTED set. Even though they trained a combined detector for all pose variations, they did not test it the way we did. Their test consists in running the full detector on the PROFILE set rotated by [-30, 30] degrees in-plane. Unfortunately, they do not provide enough details to recreate their test set.

The right side of Figure 8 shows our performance at pose estimation. To make this graph, we fixed the detection threshold at a value that resulted in about 0.5 false positives per image over all three data sets. We then compared the pose estimates for all detected faces (including those not in the standard subsets) against our manual pose annotations. Note that this test is more difficult than typical tests of pose estimation systems, where faces are first localized by hand. When we hand-localize these faces, 89% of yaws and 100% of in-plane rotations are correctly estimated to within 15°.

Figure 9: Some example face detections. Each white box shows the location of a detected face. The angle of each box indicates the estimated in-plane rotation. The black crosshairs within each box indicate the estimated yaw.

Figure 10: More examples of face detections.

Data set                             TILTED          PROFILE         MIT+CMU
False positives per image            4.42    26.90   0.44    3.36    0.50    1.28
Our detector *                       90%     97%     67%     83%     83%     88%
Jones and Viola (2003) (tilted) *    90%     95%     x       x       x       x
Jones and Viola (2003) (profile) *   x       x       70%     83%     x       x
Rowley et al. (1998a)                89%     96%     x       x       x       x
Schneiderman and Kanade (2000)       x       x       86%     93%     x       x

Table 1: Comparisons of our results with other multi-view detectors. Each column shows the detection rates for a given average number of false positives per image (these rates correspond to those for which other authors have reported results). Real-time detectors are marked with an asterisk. Note that ours is the only single detector that can be tested on all data sets simultaneously.

5. Conclusion

The system we have presented here integrates detection and pose estimation by training a convolutional network to map faces to points on a manifold, parameterized by pose, and non-faces to points far from the manifold. The network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. When the three variables match, the energy function is trained to have a small value; when they do not match, it is trained to have a large value.

This system has several desirable properties:

The use of a convolutional network makes it fast. At typical webcam resolutions, it can process 5 frames per second on a 2.4 GHz Pentium 4.

It is robust to a wide range of poses, including variations in yaw up to ±90°, in-plane rotation up to ±45°, and pitch up to ±60°. This has been verified with tests on three standard data sets, each designed to test robustness against a s
ingledimensionofposevariation.Atthesametimethatitdetectsfaces,itproducesestimatesoftheirpose.Onthestandarddatasets,theestimatesofyawandin-planerotationarewithin15ofmanualestimatesover80%and95%ofthetime,respectively.Wehaveshownexperimentallythatoursystem'saccuracyatbothposeestimationandfacedetectionisincreasedbytrainingforthetwotaskstogether.ReferencesL.BottouandY.LeCun.TheLushManual.http://lush.sf.net,2002.R.Caruana.Multitasklearning.MachineLearning,28:41–75,1997.F.FleuretandD.Geman.Coarse-to-nefacedetection.IJCV,pages85–107,2001.C.GarciaandM.Delakis.Aneuralarchitectureforfastandrobustfacedetection.IEEE-IAPRInt.ConferenceonPatternRecognition,pages40–43,2002.1213 OSADCHY,LECUNANDMILLERC.Huang,B.Wu,H.Ai,andS.Lao.Omni-directionalfacedetectionbasedonrealadaboost.InInternationalConferenceonImageProcessing,Singapore,2004.M.JonesandP.Viola.Fastmulti-viewfacedetection.TechnicalReportTR2003-96,MitsubishiElectricResearchLaboratories,2003.Y.LeCun,B.Boser,J.S.Denker,D.Henderson,R.E.Howard,W.Hubbard,andL.D.Jackel.Backpropagationappliedtohandwrittenzipcoderecognition.NeuralComputation,1(4):541–551,Winter1989.Y.LeCun,L.Bottou,Y.Bengio,andP.Haffner.Gradient-basedlearningappliedtodocumentrecognition.ProceedingsoftheIEEE,86(11):2278–2324,November1998.Y.LeCun,R.Hadsell,S.Chopra,F.-J.Huang,andM.-A.Ranzato.Atutorialonenergy-basedlearning.InPredictingStructuredOutputs.Bakiretal.(eds),MITPress,2006.Y.LeCunandF.J.Huang.Lossfunctionsfordiscriminativetrainingofenergy-basedmodels.InProc.ofthe10-thInternationalWorkshoponArticialIntelligenceandStatistics(AIStats'05),2005.Y.LeCun,F.-J.Huang,andL.Bottou.Learningmethodsforgenericobjectrecognitionwithinvari-ancetoposeandlighting.InProceedingsofCVPR'04.IEEEPress,2004.S.Z.Li,L.Zhu,Z.Zhang,A.Blake,H.Zhang,andH.Shum.Statisticallearningofmulti-viewfacedetection.InProceedingsofthe7thEuropeanConferenceonComputerVision-PartIV,2002.Y.Li,S.Gong,andH.Liddell.Supportvectorregressionandclassicationbasedmulti-viewfacedetectionandrecognition.InFaceandGesture,2000
.S.NowlanandJ.Platt.Aconvolutionalneuralnetworkhandtracker.InAdvancesinNeuralInformationProcessingSystems(NIPS1995),pages901–908,SanMateo,CA,1995.MorganKaufmann.M.Osadchy,M.Miller,andY.LeCun.Synergisticfacedetectionandposeestimationwithenergy-basedmodel.InAdvancesinNeuralInformationProcessingSystems(NIPS2004).MITPress,2005.A.Pentland,B.Moghaddam,andT.Starner.View-basedandmodulareigenspacesforfacerecog-nition.InCVPR,1994.H.A.Rowley,S.Baluja,andT.Kanade.Neuralnetwork-basedfacedetection.PAMI,20:22–38,1998a.H.A.Rowley,S.Baluja,andT.Kanade.Rotationinvariantneuralnetwork-basedfacedetection.InComputerVisionandPatternRecognition,1998b.H.SchneidermanandT.Kanade.Astatisticalmethodfor3dobjectdetectionappliedtofacesandcars.InComputerVisionandPatternRecognition,2000.K.SungandT.Poggio.Example-basedlearningofview-basedhumanfacedetection.PAMI,20:39–51,1998.1214 SYNERGISTICFACEDETECTIONANDPOSEESTIMATIONWITHENERGY-BASEDMODELSR.Vaillant,C.Monrocq,andY.LeCun.Originalapproachforthelocalisationofobjectsinimages.IEEProconVision,Image,andSignalProcessing,141(4):245–250,August1994.P.ViolaandM.Jones.Rapidobjectdetectionusingaboostedcascadeofsimplefeatures.InProceedingsIEEEConf.onComputerVisionandPatternRecognition,pages511–518,2001.M.-H.Yang,D.Kriegman,andN.Ahuja.Detectingfacesinimages:Asurvey.PAMI,24(1):34–58,2002.1215
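
Illustrative note. The pose-estimation accuracy figures reported in this paper are the share of yaw or in-plane rotation estimates that fall within 15° of the manual annotations. A minimal sketch of how such a figure could be computed is given here. The function names `angular_error` and `fraction_within`, the treatment of the tolerance as inclusive, and the wraparound at 360° are assumptions made for illustration; they are not taken from the authors' evaluation code.

```python
from typing import Sequence


def angular_error(estimate_deg: float, truth_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees.

    Wraparound at 360 degrees is handled so that, for example,
    350 and 10 are treated as 20 degrees apart (an assumption; the
    pose ranges discussed in the paper never span a full circle).
    """
    diff = (estimate_deg - truth_deg) % 360.0
    return min(diff, 360.0 - diff)


def fraction_within(
    estimates_deg: Sequence[float],
    truths_deg: Sequence[float],
    tolerance_deg: float = 15.0,
) -> float:
    """Fraction of estimates whose error is at most tolerance_deg.

    Each estimate is paired with the annotation at the same index.
    The comparison is inclusive (error <= tolerance), which is a
    hypothetical choice rather than a documented one.
    """
    if len(estimates_deg) != len(truths_deg):
        raise ValueError("estimates and annotations must have equal length")
    if not estimates_deg:
        raise ValueError("at least one estimate is required")
    hits = sum(
        1
        for est, truth in zip(estimates_deg, truths_deg)
        if angular_error(est, truth) <= tolerance_deg
    )
    return hits / len(estimates_deg)
```

Applied separately to the yaw and in-plane rotation estimates of all detected faces, a function of this kind would yield one accuracy value per pose parameter for a given tolerance; sweeping `tolerance_deg` would trace out a curve of accuracy against tolerance.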