Minimal Loss Hashing for Compact Binary Codes

...toronto.edu
David J. Fleet    fleet@cs.toronto.edu
Department of Computer Science, University of Toronto, Canada

Abstract

We propose a method for learning similarity-preserving hash functions that map high-dimensional data onto binary codes. The formulation is ...
Minimal Loss Hashing for Compact Binary Codes

where $\psi(x, h) \equiv \mathrm{vec}(h x^T)$. Here, $w^T \psi(x, h)$ acts as a scoring function that determines the relevance of input-code pairs, based on a weighted sum of features in the joint feature vector $\psi(x, h)$. Other forms of $\psi(\cdot, \cdot)$ are possible, leading to other hash functions. To motivate our upper bound on empirical loss, we begin with a short review of the bound commonly used for structural SVMs (Taskar et al., 2003; Tsochantaridis et al., 2004).

3.1. Structural SVM

In structural SVMs (SSVM), given input-output training pairs $\{(x_i, y_i)\}_{i=1}^N$, one aims to learn a mapping from inputs to outputs in terms of a parameterized scoring function $f(x, y; w)$:

    $\hat{y} = \mathrm{argmax}_{y} \, f(x, y; w)$ .    (7)

Given a loss function on the output domain, $L(\cdot, \cdot)$, the SSVM with margin-rescaling introduces a margin violation (slack) variable for each training pair, and minimizes the sum of slack variables. For a pair $(x, y)$, slack is defined as $\max_{\bar{y}} [L(\bar{y}, y) + f(x, \bar{y}; w)] - f(x, y; w)$. Importantly, the slack variables provide an upper bound on loss for the predictor $\hat{y}$; i.e.,

    $L(\hat{y}, y) \;\le\; \max_{\bar{y}} \, [L(\bar{y}, y) + f(x, \bar{y}; w)] - f(x, \hat{y}; w)$    (8)
    $\phantom{L(\hat{y}, y)} \;\le\; \max_{\bar{y}} \, [L(\bar{y}, y) + f(x, \bar{y}; w)] - f(x, y; w)$ .    (9)

To see the inequality in (8), note that, if the first term on the RHS of (8) is maximized by $\bar{y} = \hat{y}$, then the $f$ terms cancel, and (8) becomes an equality. Otherwise, the optimal value of the max term must be larger than when $\bar{y} = \hat{y}$, which causes the inequality. The second inequality (9) follows straightforwardly from the definition of $\hat{y}$ in (7); i.e., $f(x, \hat{y}; w) \ge f(x, y; w)$ for all $y$. The bound in (9) is piecewise linear, convex in $w$, and easier to optimize than the empirical loss.

3.2. Convex-concave bound for hashing

The difference between learning hash functions and the SSVM is that the binary codes for our training data are not known a priori. But note that the tighter bound in (8) uses $y$ only in the loss term, which is useful for hash function learning, because suitable loss functions for hashing, such as (4), do not require ground-truth labels. The bound (8) is piecewise linear, convex-concave (a sum of convex and concave terms), and is the basis for SSVMs with latent variables (Yu & Joachims, 2009). Below we formulate a similar bound for learning binary hash functions. Our upper bound on the loss function $L$, given a pair of inputs $x_i$ and $x_j$, a supervisory label $s_{ij}$, and the parameters of the hash function $w$, has the form

    $L(b(x_i; w), b(x_j; w), s_{ij}) \;\le\; \max_{g_i, g_j \in \mathcal{H}} \, [L(g_i, g_j, s_{ij}) + g_i^T W x_i + g_j^T W x_j] \;-\; \max_{h_i \in \mathcal{H}} h_i^T W x_i \;-\; \max_{h_j \in \mathcal{H}} h_j^T W x_j$ .    (10)

The proof for (10) is similar to that for (8) above. It follows from (5) that the second and third terms on the RHS of (10) are maximized by $h_i = b(x_i; w)$ and $h_j = b(x_j; w)$. If the first term were maximized by $g_i = b(x_i; w)$ and $g_j = b(x_j; w)$, then the inequality in (10) becomes an equality. For all other values of $g_i$ and $g_j$ that maximize the first term, the RHS can only increase, hence the inequality. The bound holds for $\ell_\rho$, $\ell_{bre}$, and any similar loss function, with binary labels $s_{ij}$ or real-valued labels $d_{ij}$.

We formulate the optimization for the weights $w$ of the hashing function in terms of minimization of the following convex-concave upper bound on empirical loss:

    $\sum_{(i,j) \in S} \Big( \max_{g_i, g_j \in \mathcal{H}} \, [L(g_i, g_j, s_{ij}) + g_i^T W x_i + g_j^T W x_j] - \max_{h_i \in \mathcal{H}} h_i^T W x_i - \max_{h_j \in \mathcal{H}} h_j^T W x_j \Big)$ .    (11)

4. Optimization

Minimizing (11) to find $w$ entails the maximization of three terms for each pair $(i, j) \in S$. The second and third terms are trivially maximized directly by the hash function (5). Maximizing the first term is, however, not trivial. It is similar to the loss-adjusted inference in SSVMs. The next section describes an efficient algorithm for finding the exact solution of loss-adjusted inference for hash function learning.

4.1. Binary hashing loss-adjusted inference

We solve loss-adjusted inference for general loss functions of the form $L(h, g, s) = \ell(\|h - g\|_H, s)$. This applies to both $\ell_{bre}$ and $\ell_\rho$. The loss-adjusted inference is to find the pair of binary codes given by

    $\big( \tilde{b}(x_i, x_j; w), \, \tilde{b}(x_j, x_i; w) \big) = \mathrm{argmax}_{g_i, g_j \in \mathcal{H}} \, [\ell(\|g_i - g_j\|_H, s_{ij}) + g_i^T W x_i + g_j^T W x_j]$ .    (12)

Before solving (12) in general, first consider the specific case for which we restrict the Hamming distance between $g_i$ and $g_j$ to be $m$, i.e., $\|g_i - g_j\|_H = m$. For $q$-bit codes, $m$ is an integer between 0 and $q$. When
$\|g_i - g_j\|_H = m$, the loss in (12) depends on $m$ but not on the specific bit sequences $g_i$ and $g_j$. Thus, instead of (12), we can now solve

    $\ell(m, s_{ij}) + \max_{g_i, g_j \in \mathcal{H}} \, [g_i^T W x_i + g_j^T W x_j]$   s.t. $\|g_i - g_j\|_H = m$ .    (13)

The key to finding the two codes that solve (13) is to decide which of the $m$ bits in the two codes should be different. Let $v[k]$ denote the $k$-th element of a vector $v$. We can compute the joint contribution of the $k$-th bits of $g_i$ and $g_j$ to $[g_i^T W x_i + g_j^T W x_j]$ by

    $\delta_k(g_i[k], g_j[k]) = g_i[k] \, (W x_i)[k] + g_j[k] \, (W x_j)[k]$ ,

and these contributions can be computed for the four possible states of the $k$-th bits independently. To this end,

    $\Delta_k = \max\big( \delta_k(1, 0), \, \delta_k(0, 1) \big) - \max\big( \delta_k(0, 0), \, \delta_k(1, 1) \big)$

represents how much is gained by setting the bits $g_i[k]$ and $g_j[k]$ to be different rather than the same. Because $g_i$ and $g_j$ differ in only $m$ bits, the solution to (13) is obtained by setting the $m$ bits with the $m$ largest $\Delta_k$'s to be different. All other bits in the two codes should be the same. When $g_i[k]$ and $g_j[k]$ must be different, they are found by comparing $\delta_k(1, 0)$ and $\delta_k(0, 1)$. Otherwise, they are determined by the larger of $\delta_k(0, 0)$ and $\delta_k(1, 1)$.

Now solve (13) for all $m$, noting that we only compute $\Delta_k$ for each bit, $1 \le k \le q$, once. To solve (12) it suffices to find the $m$ that provides the largest value for the objective function in (13). We first sort the $\Delta_k$'s once, and for different values of $m$, we compare the sum of the first $m$ largest $\Delta_k$'s plus $\ell(m, s_{ij})$, and choose the $m$ that achieves the highest score. Afterwards, we determine the values of the bits according to their contributions as described above.

Given the values of $W x_i$ and $W x_j$, this loss-adjusted inference algorithm takes time $O(q \log q)$. Other than sorting the $\Delta_k$'s, all other steps are linear in $q$, which makes the inference efficient and scalable to large code lengths. The computation of $W x_i$ can be done once per point, although it is used with many pairs.

4.2. Perceptron-like learning

In Sec. 3.2, we formulated a convex-concave bound (11) on empirical loss. In Sec. 4.1 we described how the value of the bound could be computed at a given $W$. Now consider optimizing the objective, i.e., lowering the bound. A standard technique for minimizing such objectives is called the concave-convex procedure (Yuille & Rangarajan, 2003). Applying this method to our problem, we should iteratively impute the missing data (the binary codes $b(x_i; w)$) and optimize for the convex term (the loss-adjusted terms in (11)). However, our preliminary experiments showed that this procedure is slow and not so effective for learning hash functions.

Alternatively, following the structured perceptron (Collins, 2002) and recent work of McAllester et al. (2010), we considered a stochastic gradient-based approach, based on an iterative, perceptron-like, learning rule. At iteration $t$, let the current weight vector be $w^t$, and let the new training pair be $(x_t, x'_t)$ with supervisory signal $s_t$. We update the parameters according to the following learning rule:

    $w^{t+1} = w^t + \eta \, \big[ \psi(x_t, b(x_t; w^t)) + \psi(x'_t, b(x'_t; w^t)) - \psi(x_t, \tilde{b}(x_t, x'_t; w^t)) - \psi(x'_t, \tilde{b}(x'_t, x_t; w^t)) \big]$    (14)

where $\eta$ is the learning rate, $\psi(x, h) = \mathrm{vec}(h x^T)$, and $\tilde{b}(x_t, x'_t; w^t)$ and $\tilde{b}(x'_t, x_t; w^t)$ are provided by the loss-adjusted inference above. This learning rule has been effective in our experiments.

One interpretation of this update rule is that it follows the noisy gradient descent direction of our convex-concave objective. To see this more clearly, we rewrite the objective (11) as

    $\sum_{(i,j) \in S} \big[ L_{ij} + w^T \psi(x_i, \tilde{b}(x_i, x_j; w)) + w^T \psi(x_j, \tilde{b}(x_j, x_i; w)) - w^T \psi(x_i, b(x_i; w)) - w^T \psi(x_j, b(x_j; w)) \big]$ .    (15)

The loss-adjusted inference (12) yields $\tilde{b}(x_i, x_j; w)$ and $\tilde{b}(x_j, x_i; w)$. Evaluating the loss function for these two binary codes gives $L_{ij}$ (which no longer depends on $w$). Taking the negative gradient of the objective (15) with respect to $w$, we get the exact learning rule of (14). However, note that this objective is piecewise linear, due to the max operations, and thus not differentiable at isolated points. While the theoretical properties of this update rule should be explored further (e.g., see McAllester et al. (2010)), we empirically verified that the update rule lowers the upper bound, and converges to a local minimum. For example, Fig. 1 plots the empirical loss and the bound, computed over $10^5$ training pairs, as a function of the iteration number.

5. Implementation details

We initialize $W$ using LSH; i.e., the entries of $W$ are sampled (IID) from a normal density $N(0, 1)$, and each
Figure 1. The upper bound in (11) and the empirical loss as functions of the iteration number.

row is then normalized to have unit length. The learning rule in (14) is used with several minor modifications: 1) In loss-adjusted inference (12), the loss is multiplied by a constant to balance the loss and the scoring function. This scaling does not affect our inequalities. 2) We constrain the rows of $W$ to have unit length, and they are renormalized after each gradient update. 3) We use mini-batches to compute the gradient, and a momentum term based on the gradient of the previous step is added (with a ratio of 0.9).

For each experiment, we select 10% of the training set as a validation set. We choose the learning rate and the loss hyper-parameters by validation on a few candidate choices. We allow $\rho$ to increase linearly with the code length. Each epoch includes a random sample of $10^5$ point pairs, independent of the mini-batch size or the number of training points. For validation we do 100 epochs, and for training we use 2000 epochs. For small datasets a smaller number of epochs was used.

6. Experiments

We compare our approach, minimal loss hashing (MLH), with several state-of-the-art methods. Results for binary reconstructive embedding (BRE) (Kulis & Darrell, 2009), spectral hashing (SH) (Weiss et al., 2008), shift-invariant kernel hashing (SIKH) (Raginsky & Lazebnik, 2009), and multilayer neural nets with semantic fine-tuning (NNCA) (Torralba et al., 2008) were obtained with implementations generously provided by their respective authors. For locality-sensitive hashing (LSH) (Charikar, 2002) we used our own implementation. We show results of SIKH for experiments with larger datasets and longer code lengths, because it was not competitive otherwise.

Each dataset comprises a training set, a test set, and a set of ground-truth neighbors. For evaluation, we compute precision and recall for points retrieved within a Hamming distance $R$ of codes associated with the test queries. Precision as a function of $R$ is $H/T$, where $T$ is the total number of points retrieved in the Hamming ball with radius $R$, and $H$ is the number of true neighbors among them. Recall as a function of $R$ is $H/G$, where $G$ is the total number of ground-truth neighbors.

6.1. Six datasets

We first mirror the experiments of Kulis and Darrell (2009) with five datasets [2]: Photo-tourism, a corpus of image patches represented as 128D SIFT features (Snavely et al., 2006); LabelMe and Peekaboom, collections of images represented as 512D Gist descriptors (Torralba et al., 2008); MNIST, 784D greyscale images of handwritten digits [3]; and Nursery, 8D features [4]. We also use a synthetic dataset comprising uniformly sampled points from a 10D hypercube (Weiss et al., 2008). Like Kulis and Darrell, we used 1000 random points for training, and 3000 points (where possible) for testing; all methods used identical training and test sets.

The neighbors of each data-point are defined with a dataset-specific threshold. On each training set we find the Euclidean distance at which each point has, on average, 50 neighbors. This defines ground-truth neighbors and non-neighbors for training, and for computing precision and recall statistics during testing.

For preprocessing, each dataset is mean-centered. For all but the 10D Uniform data, we then normalize each datum to have unit length. Because some methods (BRE, SH, SIKH) improve with dimensionality reduction prior to training and testing, we apply PCA to each dataset (except 10D Uniform and 8D Nursery) and retain a 40D subspace. MLH often performs slightly better on the full datasets, but we report results for the 40D subspace, to be consistent with the other methods.

For all methods with local minima or stochastic optimization (i.e., all but SH) we optimize 10 independent models at each of several code lengths. Fig. 2 plots precision (averaged over 10 models, with st. dev. bars) for points retrieved within a Hamming radius $R = 3$, using different code lengths. These results are similar to those in (Kulis & Darrell, 2009), where BRE yields higher precision than SH and LSH for different binary code lengths. The plots also show that MLH consistently yields higher precision than BRE. This behavior persists for a wide range of retrieval radii (see Fig. 3). For many retrieval tasks with large datasets, precision is more important than recall. Nevertheless, for other

[2] Kulis and Darrell treated Caltech-101 differently from the other 5 datasets, with a specific kernel, so experiments were not conducted on that dataset.
[3] http://yann.lecun.com/exdb/mnist/
[4] http://archive.ics.uci.edu/ml/datasets/Nursery

[Figure 2 panels: 10D Uniform, LabelMe, MNIST, Nursery, Peekaboom, Photo-tourism]
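The LSH initialization of Sec. 5 and the thresholding form of the hash function are easy to make concrete. The following pure-Python sketch is our own illustration, not the authors' code: `lsh_init` draws the rows of W IID from N(0, 1) and scales each to unit length, and `binary_code` sets bit k to 1 iff (Wx)[k] > 0, which is the code maximizing h^T W x over h in {0,1}^q.

```python
import math
import random

def lsh_init(q, d, seed=0):
    """Draw a q x d matrix W with IID N(0,1) entries, then scale each
    row to unit length (the LSH-style initialization of Sec. 5)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(q)]
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        for k in range(d):
            row[k] /= norm
    return W

def binary_code(W, x):
    """b(x; w): bit k is 1 iff (Wx)[k] > 0."""
    return [1 if sum(w * v for w, v in zip(row, x)) > 0 else 0 for row in W]
```

Renormalizing the rows of W after each gradient update (modification 2 in Sec. 5) reuses the same per-row normalization.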
Figure 2. Precision of points retrieved using Hamming radius 3 bits, as a function of code length. (view in color)

tasks such as recognition, high recall may be desired if one wants to find the majority of similar points to each query. To assess both recall and precision, Fig. 4 plots precision-recall curves (averaged over 10 models, with st. dev. bars) for two of the datasets (MNIST and LabelMe), and for binary codes of length 30 and 45. These plots are obtained by varying the retrieval radius $R$ from 0 to $q$. In almost all cases, the performance of MLH is clearly superior. MLH has high recall at all levels of precision. While space does not allow us to plot the corresponding curves for the other four datasets, the behavior is similar to that in Fig. 4.

6.2. Euclidean 22K LabelMe

We also tested a larger LabelMe dataset compiled by Torralba et al. (2008), which we call 22K LabelMe. It has 20,019 training images and 2000 test images, each with a 512D Gist descriptor. With 22K LabelMe we can examine how different methods scale to both larger datasets and longer binary codes. Data preprocessing was identical to that above (i.e., mean centering, normalization, 40D PCA). Neighbors were defined by the threshold in the Euclidean Gist space such that each

Figure 3. LabelMe: Precision for ANN retrieval within Hamming radii 1 (left) and 5 (right). (view in color)

[Figure 4 panels: LabelMe 30 bits, LabelMe 45 bits, MNIST 30 bits, MNIST 45 bits]
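The retrieval protocol above follows directly from the definitions in Sec. 6: precision is H/T and recall is H/G for points retrieved within Hamming radius R. A minimal pure-Python sketch, as our own illustration; the convention of returning zero precision when nothing is retrieved is an assumption, since the text does not specify that case.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit lists."""
    return sum(u != v for u, v in zip(a, b))

def precision_recall_at_radius(query_code, db_codes, true_neighbors, R):
    """Precision H/T and recall H/G for database points whose codes lie
    within Hamming radius R of the query code. true_neighbors is a set
    of database indices; G is its size."""
    retrieved = [i for i, c in enumerate(db_codes)
                 if hamming(query_code, c) <= R]
    T = len(retrieved)
    H = sum(1 for i in retrieved if i in true_neighbors)
    G = len(true_neighbors)
    precision = H / T if T > 0 else 0.0  # T = 0 convention: an assumption
    recall = H / G if G > 0 else 0.0
    return precision, recall
```

Sweeping R from 0 to q and plotting the resulting (recall, precision) pairs yields precision-recall curves like those in Fig. 4.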
Figure 4. Precision-recall curves for different methods, for different code lengths. Moving down the curves involves increasing Hamming distances for retrieval. (view in color)

training point has, on average, 100 neighbors. Fig. 5 shows precision-recall curves as a function of code length, from 16 to 256 bits. As above, it is clear that MLH outperforms all other methods for short and long code lengths. SH does not scale well to large code lengths. We could not run the BRE implementation on the full dataset due to its memory needs and run time. Instead we trained it with 1000 to 5000 points and observed that the results do not change dramatically. The results shown here are with 3000 training points, after which the database was populated with all 20,019 training points. At 256 bits, LSH approaches the performance of BRE, and actually outperforms SH and SIKH.

The dashed curves (MLH.5) in Fig. 5 are MLH precision-recall results but at half the code length (e.g., the dashed curve on the 64-bit plot is for 32-bit MLH). Note that MLH often outperforms other methods even with half the code length.

Finally, since the MLH framework admits general loss

[Figure 5 panels: 16 bits, 32 bits, 64 bits, 128 bits, 256 bits]
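One reason retrieval remains practical even at 256-bit codes is that Hamming distance between packed codes reduces to an XOR followed by a population count. A minimal sketch of this standard trick, as our own illustration; the paper does not describe its retrieval implementation.

```python
def pack_bits(bits):
    """Pack a list of 0/1 bits (most significant bit first) into one integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def hamming_packed(a, b):
    """Hamming distance between two packed codes: XOR, then popcount."""
    return bin(a ^ b).count("1")
```

Python integers are arbitrary precision, so the same two functions handle 16-bit and 256-bit codes alike.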
Figure 5. Precision-recall curves for different code lengths, using the Euclidean 22K LabelMe dataset. (view in color)

functions of the form $L(\|h - g\|_H, s)$, it is also interesting to consider the results of our learning framework with the BRE loss (2). The BRE2 curves in Fig. 5 show this approach to be on par with BRE. While our optimization technique is more efficient than the coordinate-descent algorithm of Kulis and Darrell (2009), the difference in performance between MLH and BRE is due mainly to the loss function, $\ell_\rho$ in (4).

6.3. Semantic 22K LabelMe

22K LabelMe also comes with a pairwise affinity matrix that is based on segmentations and object labels provided by humans. Hence the affinity matrix provides similarity scores based on semantic content. While Gist remains the input for our model, we used this affinity matrix to define a new set of neighbors for each training point. Hash functions learned using these semantic labels should be more useful for content-based retrieval than hash functions trained using Euclidean distance in Gist space. Multilayer neural nets trained by Torralba et al. (2008) (NNCA) are considered the superior method for semantic 22K LabelMe. Their model is fine-tuned using semantic labels and nonlinear neighborhood component analysis (Salakhutdinov & Hinton, 2007).

Figure 6. (top) Percentage of 50 ground-truth neighbors as a function of the number of images retrieved ($0 \le M \le 1000$), for MLH with 64 and 256 bits, and for NNCA with 256 bits. (bottom) Percentage of 50 neighbors retrieved as a function of code length, for $M = 50$ and $M = 500$. (view in color)

We trained MLH, using varying code lengths, on 512D Gist descriptors with semantic labels. Fig. 6 shows the performance of MLH and NNCA, along with a nearest-neighbor baseline that used cosine similarity (slightly better than Euclidean distance) in Gist space (NN). Note that NN is the bound on the performance of LSH and BRE, as they mimic Euclidean distance. MLH and NNCA exhibit similar performance for 32-bit codes, but for longer codes MLH is superior. NNCA is not significantly better than Gist-based NN, but MLH with 128 and 256 bits is better than NN, especially for larger $M$ (the number of images retrieved). Finally, Fig. 7 shows some interesting qualitative results on the Semantic 22K LabelMe model.

7. Conclusion

In this paper, based on the latent structural SVM framework, we formulated an approach to learning similarity-preserving binary codes under a general class of loss functions. We introduced a new loss function suitable for training using Euclidean distance or using sets of similar/dissimilar points. Our learning algorithm is online, efficient, and scales well to large code lengths. Empirical results on different datasets suggest that MLH outperforms existing methods.
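To make the learning algorithm concrete, the following pure-Python sketch implements the $O(q \log q)$ loss-adjusted inference of Sec. 4.1 and one step of the perceptron-like rule (14). It is a minimal illustration under stated assumptions, not the authors' code: `example_loss`, with hyper-parameters `rho` and `lam`, is a hinge-style stand-in for $\ell_\rho$, and the sketch omits the loss scaling, row renormalization, mini-batches, and momentum of Sec. 5.

```python
def scores(W, x):
    # a = Wx; a[k] is the score for setting bit k to 1.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def hash_bits(a):
    # b(x; w): bit k is 1 iff (Wx)[k] > 0.
    return [1 if s > 0 else 0 for s in a]

def loss_adjusted_inference(ai, aj, loss, s):
    """Maximize loss(m, s) + gi^T W xi + gj^T W xj over gi, gj in {0,1}^q,
    where ai = W xi and aj = W xj. delta[k] is the gain of making bits k
    differ rather than agree (Delta_k in Sec. 4.1)."""
    q = len(ai)
    same = [max(0.0, ai[k] + aj[k]) for k in range(q)]  # best of (0,0), (1,1)
    diff = [max(ai[k], aj[k]) for k in range(q)]        # best of (1,0), (0,1)
    delta = [diff[k] - same[k] for k in range(q)]
    order = sorted(range(q), key=lambda k: delta[k], reverse=True)
    # Scan m = 0..q, accumulating the m largest deltas; pick the best m.
    acc = sum(same)
    best_m, best_val = 0, loss(0, s) + acc
    for m in range(1, q + 1):
        acc += delta[order[m - 1]]
        if loss(m, s) + acc > best_val:
            best_m, best_val = m, loss(m, s) + acc
    differ = set(order[:best_m])
    gi, gj = [], []
    for k in range(q):
        if k in differ:            # pick the better of (1,0), (0,1)
            gi.append(1 if ai[k] >= aj[k] else 0)
            gj.append(1 - gi[k])
        else:                      # pick the better of (0,0), (1,1)
            bit = 1 if ai[k] + aj[k] >= 0 else 0
            gi.append(bit)
            gj.append(bit)
    return gi, gj

def mlh_update(W, xi, xj, s, loss, eta=0.1):
    """One application of the learning rule (14), written in matrix form:
    W += eta * [(bi - gi) xi^T + (bj - gj) xj^T]."""
    ai, aj = scores(W, xi), scores(W, xj)
    bi, bj = hash_bits(ai), hash_bits(aj)
    gi, gj = loss_adjusted_inference(ai, aj, loss, s)
    for k, row in enumerate(W):
        ci, cj = bi[k] - gi[k], bj[k] - gj[k]
        if ci or cj:
            for d in range(len(row)):
                row[d] += eta * (ci * xi[d] + cj * xj[d])
    return W

def example_loss(m, s, rho=2, lam=0.5):
    # Hinge-style stand-in (hypothetical hyper-parameters): similar pairs
    # (s=1) are penalized for m beyond rho; dissimilar pairs within rho.
    return float(max(0, m - rho)) if s == 1 else lam * float(max(0, rho - m + 1))
```

Here `delta[k]` plays the role of $\Delta_k$: sorting these gains once gives the exact maximization over all Hamming distances $m$, so each update costs $O(q \log q)$ beyond the two matrix-vector products.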
Figure 7. Qualitative results on Semantic 22K LabelMe. The first image of each row is a query image. The remaining 13 images in each row were retrieved using 256-bit MLH binary codes, in increasing order of their Hamming distance.

References

Charikar, M. Similarity estimation techniques from rounding algorithms. ACM STOC, 2002.

Collins, M. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP, 2002.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. ACM STOC, pp. 604-613, 1998.

Jegou, H., Douze, M., and Schmid, C. Hamming embedding and weak geometric consistency for large scale image search. ECCV, pp. 304-317, 2008.

Kulis, B. and Darrell, T. Learning to hash with binary reconstructive embeddings. NIPS, 2009.

Lin, R., Ross, D., and Yagnik, J. SPEC hashing: Similarity preserving algorithm for entropy-based coding. CVPR, 2010.

Lowe, D. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.

McAllester, D., Hazan, T., and Keshet, J. Direct loss minimization for structured prediction. ICML, 2010.

Raginsky, M. and Lazebnik, S. Locality-sensitive binary codes from shift-invariant kernels. NIPS, 2009.

Salakhutdinov, R. and Hinton, G. Learning a nonlinear embedding by preserving class neighbourhood structure. AISTATS, 2007.

Salakhutdinov, R. and Hinton, G. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969-978, 2009.

Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. ICCV, pp. 750-759, 2003.

Snavely, N., Seitz, S., and Szeliski, R. Photo tourism: Exploring photo collections in 3D. SIGGRAPH, pp. 835-846, 2006.

Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. NIPS, 2003.

Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. CVPR, 2008.

Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. Support vector machine learning for interdependent and structured output spaces. ICML, 2004.

Wang, J., Kumar, S., and Chang, S. Sequential projection learning for hashing with compact codes. ICML, 2010.

Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. NIPS, pp. 1753-1760, 2008.

Yu, C. and Joachims, T. Learning structural SVMs with latent variables. ICML, 2009.

Yuille, A. and Rangarajan, A. The concave-convex procedure. Neural Comput., 15(4):915-936, 2003.