Figure 2: Computing the Fragment and image-sentence similarities. Left: CNN representations (green) of detected objects are mapped to the fragment embedding space (blue, Section 3.2). Right: Dependency tree relations in the sentence are embedded (Section 3.1). Our model interprets inner products (shown as boxes) between fragments as a similarity score. The alignment (shaded boxes) is latent and inferred by our model (Section 3.3.1). The image-sentence similarity is computed as a fixed function of the pairwise fragment scores.

We first describe the neural networks that compute the Image and Sentence Fragment embeddings. Then we discuss the objective function, which is composed of the two aforementioned objectives.

3.1 Dependency Tree Relations as Sentence Fragments

We would like to extract and represent the set of visually identifiable entities described in a sentence. For instance, using the example in Figure 2, we would like to identify the entities (dog, child) and characterise their attributes (black, young) and their pairwise interactions (chasing). Inspired by previous work [5, 22] we observe that a dependency tree of a sentence provides a rich set of typed relationships that can serve this purpose more effectively than individual words or bigrams. We discard the tree structure in favor of a simpler model and interpret each relation (edge) as an individual sentence fragment (Figure 2, right shows 5 example dependency relations). Thus, we represent every word using a 1-of-k encoding vector w using a dictionary of 400,000 words and map every dependency triplet (R, w1, w2) into the embedding space as follows:

    s = f(W_R [W_e w1; W_e w2] + b_R).    (1)

Here, W_e is a d × 400,000 matrix that encodes a 1-of-k vector into a d-dimensional word vector representation (we use d = 200). We fix W_e to weights obtained through an unsupervised objective described in Huang et al. [34]. Note that every relation R can have its own set of weights W_R and biases b_R. We fix the element-wise nonlinearity f(.) to be the Rectified Linear Unit (ReLU), which computes x → max(0, x). The size of the embedded space is cross-validated, and we found that values of approximately 1000 generally work well.
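As a concrete illustration, Equation 1 is a single ReLU layer applied to two stacked word vectors; a minimal NumPy sketch with a toy dictionary size, random weights in place of the fixed Huang et al. [34] vectors, and two hypothetical relation types:

```python
import numpy as np

d, h = 200, 1000          # word-vector and fragment-space dimensions from the text
vocab = 1_000             # toy size; the paper's dictionary has 400,000 words

rng = np.random.default_rng(0)
We = rng.standard_normal((d, vocab)) * 0.01   # stand-in for the fixed word-vector matrix W_e
# One (W_R, b_R) pair per dependency relation type; two hypothetical relations shown.
params = {R: (rng.standard_normal((h, 2 * d)) * 0.01, np.zeros(h)) for R in ("amod", "nsubj")}

def embed_triplet(R, w1, w2):
    """Eq. 1: s = f(W_R [W_e w1; W_e w2] + b_R), with f = ReLU."""
    WR, bR = params[R]
    x = np.concatenate([We[:, w1], We[:, w2]])   # stacked word vectors, shape (2d,)
    return np.maximum(0.0, WR @ x + bR)

s = embed_triplet("amod", 12, 345)
print(s.shape)  # (1000,)
```

Each relation type owning its own (W_R, b_R) means the same word pair embeds differently under, say, an adjective-modifier edge than under a subject edge.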
3.2 Object Detections as Image Fragments

Similar to sentences, we wish to extract and describe the set of entities that images are composed of. Inspired by prior work [7], as a modeling assumption we observe that the subject of most sentence descriptions are attributes of objects and their context in a scene. This naturally motivates the use of objects and the global context as the fragments of an image. In particular, we follow Girshick et al. [27] and detect objects in every image with a Region Convolutional Neural Network (RCNN). The CNN is pre-trained on ImageNet [37] and finetuned on the 200 classes of the ImageNet Detection Challenge [38]. We use the top 19 detected locations and the entire image as the image fragments and compute the embedding vectors based on the pixels I_b inside each bounding box as follows:

    v = W_m [CNN_θc(I_b)] + b_m,    (2)

where CNN_θc(I_b) takes the image inside a given bounding box and returns the 4096-dimensional activations of the fully connected layer immediately before the classifier. The CNN architecture is identical to the one described in Girshick et al. [27]. It contains approximately 60 million parameters θ_c and closely resembles the architecture of Krizhevsky et al. [25].
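Equation 2 is a plain affine map from CNN activations into the same fragment space; a sketch in which a random vector stands in for the 4096-D CNN features (the real model runs the finetuned RCNN here):

```python
import numpy as np

h = 1000                                     # fragment-space dimension (Section 3.1)
rng = np.random.default_rng(1)
Wm, bm = rng.standard_normal((h, 4096)) * 0.01, np.zeros(h)

def cnn_activations(box_pixels):
    """Stand-in for CNN_theta_c(I_b): the real model returns the 4096-D activations of
    the fully connected layer before the classifier; random features for illustration."""
    return rng.standard_normal(4096)

def embed_box(box_pixels):
    """Eq. 2: v = W_m [CNN(I_b)] + b_m. Note there is no nonlinearity on this side."""
    return Wm @ cnn_activations(box_pixels) + bm

# Top 19 detections plus the whole image give 20 image fragments per image.
fragments = np.stack([embed_box(None) for _ in range(20)])
print(fragments.shape)  # (20, 1000)
```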
Figure 3: The two objectives for a batch of 2 examples. Left: Rows represent fragments v_i, columns s_j. Every square shows an ideal scenario of y_ij = sign(v_i^T s_j) in the MIL objective. Red boxes are y_ij = −1. Yellow indicates members of positive bags that happen to currently be y_ij = −1. Right: The scores are accumulated with Equation 6 into the image-sentence score matrix S_kl.

3.3 Objective Function

We are now ready to formulate the objective function. Recall that we are given a training set of N images and corresponding sentences. In the previous sections we described parameterized functions that map every sentence and image to a set of fragment vectors {s} and {v}, respectively. All parameters of our model are contained in these two functions. As shown in Figure 2, our model then interprets the inner product v_i^T s_j between an image fragment v_i and a sentence fragment s_j as a similarity score, and computes the image-sentence similarity as a fixed function of the scores of their respective fragments.

We are motivated by two criteria in designing the objective function. First, the image-sentence similarities should be consistent with the ground truth correspondences. That is, corresponding image-sentence pairs should have a higher score than all other image-sentence pairs. This will be enforced by the Global Ranking Objective. Second, we introduce a Fragment Alignment Objective that explicitly learns the appearance of sentence fragments (such as "black dog") in the visual domain. Our full objective is the sum of these two objectives and a regularization term:

    C(θ) = C_F(θ) + β C_G(θ) + α ||θ||_2^2,    (3)

where θ is a shorthand for the parameters of our neural network (θ = {W_e, W_R, b_R, W_m, b_m, θ_c}) and α and β are hyperparameters that we cross-validate. We now describe both objectives in more detail.

3.3.1 Fragment Alignment Objective

The Fragment Alignment Objective encodes the intuition that if a sentence contains a fragment (e.g. "blue ball", Figure 3), at least one of the boxes in the corresponding image should have a high score with this fragment, while all the other boxes in all the other images that have no mention of "blue ball" should have a low score. This assumption can be violated in multiple ways: a triplet may not refer to anything visually identifiable in the image. The box that the triplet refers to may not be detected by the RCNN. Lastly, other images may contain the described visual concept but its mention may be omitted in the associated sentence descriptions. Nonetheless, the assumption is still satisfied in many cases and can be used to formulate a cost function. Consider an (incomplete) Fragment Alignment Objective that assumes a dense alignment between every corresponding image and sentence fragments:

    C_0(θ) = Σ_i Σ_j max(0, 1 − y_ij v_i^T s_j).    (4)

Here, the sum is over all pairs of image and sentence fragments in the training set. The quantity v_i^T s_j can be interpreted as the alignment score of visual fragment v_i and sentence fragment s_j. In this incomplete objective, we define y_ij as +1 if fragments v_i and s_j occur together in a corresponding image-sentence pair, and −1 otherwise. Intuitively, C_0(θ) encourages scores in red regions of Figure 3 to be less than −1 and scores along the block diagonal (green and yellow) to be more than +1.
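The dense objective of Equation 4 is a standard hinge loss over all fragment pairs; a compact NumPy sketch:

```python
import numpy as np

def dense_alignment_cost(V, S, y):
    """Eq. 4: C_0 = sum_ij max(0, 1 - y_ij v_i^T s_j).
    V: (num_image_frags, d), S: (num_sentence_frags, d), y: labels in {-1, +1}."""
    scores = V @ S.T                          # all pairwise alignment scores v_i^T s_j
    return np.maximum(0.0, 1.0 - y * scores).sum()

V = np.array([[2.0, 0.0]])                    # one image fragment
S = np.array([[1.0, 0.0], [-1.0, 0.0]])       # two sentence fragments; scores +2 and -2
y = np.array([[1.0, -1.0]])                   # labels consistent with the scores
print(dense_alignment_cost(V, S, y))          # 0.0
```

Both pairs sit outside their margins (the positive pair above +1, the negative below −1), so the cost is zero, matching the Figure 3 intuition.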
Multiple Instance Learning extension. The problem with the objective C_0(θ) is that it assumes a dense alignment between all pairs of fragments in every corresponding image-sentence pair. However, this is hardly ever the case. For example, in Figure 3, the "boy playing" triplet refers to only one of the three detections. We now describe a Multiple Instance Learning (MIL) [39] extension of the objective C_0 that attempts to infer the latent alignment between fragments in corresponding image-sentence pairs. Concretely, for every triplet we put image fragments in the associated image into a positive bag, while image fragments in every other image become negative examples. Our precise formulation is inspired by the mi-SVM [40], which is a simple and natural extension of a Support Vector Machine to a Multiple Instance Learning setting. Instead of treating the y_ij as constants, we minimize over them by wrapping Equation 4 as follows:

    C_F(θ) = min_{y_ij} C_0(θ)
      s.t.  Σ_{i ∈ p_j} (y_ij + 1)/2 ≥ 1  ∀j
            y_ij = −1  ∀i, j  s.t.  m_v(i) ≠ m_s(j)
            y_ij ∈ {−1, 1}    (5)

Here, we define p_j to be the set of image fragments in the positive bag for sentence fragment j. m_v(i) and m_s(j) return the index of the image and sentence (an element of {1, ..., N}) that the fragments v_i and s_j belong to. Note that the inequality simply states that at least one of the y_ij should be positive for every sentence fragment j (i.e. at least one green box in every column in Figure 3). This objective cannot be solved efficiently [40] but a commonly used heuristic is to set y_ij = sign(v_i^T s_j). If the constraint is not satisfied for any positive bag (i.e. all scores were below zero), the highest-scoring item in the positive bag is set to have a positive label.
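The mi-SVM-style heuristic for Equation 5 can be sketched as follows (toy scores; rows are image fragments, columns are sentence fragments, and pos_bags lists the hypothetical positive bag for each column):

```python
import numpy as np

def mil_labels(scores, pos_bags):
    """Heuristic from the text: set y_ij = sign(v_i^T s_j), keep y_ij = -1 outside the
    positive bag, and if a bag ends up with no positive label, flip its highest-scoring
    member to +1 so the Eq. 5 constraint holds.
    scores: (num_image_frags, num_sentence_frags); pos_bags[j]: row indices in bag p_j."""
    y = -np.ones_like(scores)
    for j in range(scores.shape[1]):
        bag = np.asarray(pos_bags[j])
        y[bag, j] = np.where(scores[bag, j] > 0, 1.0, -1.0)
        if not (y[bag, j] > 0).any():                 # all scores <= 0: constraint violated
            y[bag[np.argmax(scores[bag, j])], j] = 1.0
    return y

scores = np.array([[0.5, -0.2],
                   [-0.1, -0.8],
                   [-3.0, -0.4]])
y = mil_labels(scores, pos_bags=[[0, 1], [0, 1, 2]])
print(y)
```

In the second column every bag member scores below zero, so the least-bad detection (score −0.2) is forced positive, guaranteeing at least one "green box" per column as Figure 3 requires.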
3.3.2 Global Ranking Objective

Recall that the Global Ranking Objective ensures that the computed image-sentence similarities are consistent with the ground truth annotation. First, we define the image-sentence alignment score to be the average thresholded score of their pairwise fragment scores:

    S_kl = 1 / (|g_k| (|g_l| + n)) Σ_{i ∈ g_k} Σ_{j ∈ g_l} max(0, v_i^T s_j).    (6)

Here, g_k is the set of image fragments in image k and g_l is the set of sentence fragments in sentence l. Both k, l range from 1, ..., N. We truncate scores at zero because in the mi-SVM objective, scores greater than 0 are considered correct alignments and scores less than 0 are considered to be incorrect alignments (i.e. false members of a positive bag). In practice, we found that it was helpful to add a smoothing term n, since short sentences can otherwise have an advantage (we found that n = 5 works well and that this setting is not very sensitive). The Global Ranking Objective then becomes:

    C_G(θ) = Σ_k [ Σ_l max(0, S_kl − S_kk + Δ)    (rank images)
                 + Σ_l max(0, S_lk − S_kk + Δ) ].   (rank sentences)    (7)

Here, Δ is a hyperparameter that we cross-validate. The objective stipulates that the score for true image-sentence pairs S_kk should be higher than S_kl or S_lk for any l ≠ k by at least a margin of Δ.

3.4 Optimization

We use Stochastic Gradient Descent (SGD) with mini-batches of 100, momentum of 0.9 and make 20 epochs through the training data. The learning rate is cross-validated and annealed by a fraction of 0.1 for the last two epochs. Since both Multiple Instance Learning and CNN finetuning benefit from a good initialization, we run the first 10 epochs with the fragment alignment objective C_0 and CNN weights θ_c fixed. After 10 epochs, we switch to the full MIL objective C_F and begin finetuning the CNN. The word embedding matrix W_e is kept fixed due to overfitting concerns. Our implementation runs at approximately 1 second per batch on a standard CPU workstation.
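Equations 6 and 7 are cheap to prototype; a NumPy sketch under the same notation (g_k, g_l as fragment sets, Δ as the margin; note that the l = k terms of Equation 7, as written, each contribute a constant Δ):

```python
import numpy as np

def image_sentence_scores(img_frags, sent_frags, n=5):
    """Eq. 6: S_kl = (1 / (|g_k| (|g_l| + n))) * sum_ij max(0, v_i^T s_j).
    img_frags[k]: (|g_k|, d) array; sent_frags[l]: (|g_l|, d) array; n smooths short sentences."""
    N = len(img_frags)
    S = np.zeros((N, N))
    for k, V in enumerate(img_frags):
        for l, T in enumerate(sent_frags):
            S[k, l] = np.maximum(0.0, V @ T.T).sum() / (len(V) * (len(T) + n))
    return S

def global_ranking_cost(S, delta=1.0):
    """Eq. 7: hinge the off-diagonal scores against the true pairs S_kk, both across
    rows (rank images) and across columns (rank sentences)."""
    diag = np.diag(S)
    rank_images = np.maximum(0.0, S - diag[:, None] + delta).sum()
    rank_sentences = np.maximum(0.0, S.T - diag[:, None] + delta).sum()
    return rank_images + rank_sentences

# A score matrix dominated by its diagonal only pays the constant l = k terms (2 N delta).
print(global_ranking_cost(np.diag([5.0, 5.0]), delta=1.0))  # 4.0
```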
4 Experiments

Datasets. We evaluate our image-sentence retrieval performance on the Pascal1K [2], Flickr8K [3] and Flickr30K [4] datasets. The datasets contain 1,000, 8,000 and 30,000 images respectively and each image is annotated using Amazon Mechanical Turk with 5 independent sentences.

Sentence Data Preprocessing. We did not explicitly filter, spell check or normalize any of the sentences for simplicity. We use the Stanford CoreNLP parser to compute the dependency trees for every sentence. Since there are many possible relation types (as many as hundreds), due to overfitting concerns and practical considerations we remove all relation types that occur less than 1% of the time in each dataset. In practice, this reduces the number of relations from 136 to 16 in Pascal1K, from 170 to 17 in Flickr8K, and from 212 to 21 in Flickr30K. Additionally, all words that are not found in our dictionary of 400,000 words [34] are discarded.

Image Data Preprocessing. We use the Caffe [41] implementation of the ImageNet Detection RCNN model [27] to detect objects in all images. On our machine with a Tesla K40 GPU, the RCNN processes one image in approximately 25 seconds. We discard the predictions for the 200 ImageNet detection classes and only keep the 4096-D activations of the fully connected layer immediately before the classifier at all of the top 19 detected locations and from the entire image.

Pascal1K                        Image Annotation          Image Search
Model                         R@1   R@5  R@10  Mean r   R@1   R@5  R@10  Mean r
Random Ranking                4.0   9.0  12.0    71.0   1.6   5.2  10.6    50.0
Socher et al. [22]           23.0  45.0  63.0    16.9  16.4  46.6  65.6    12.5
kCCA [22]                    21.0  47.0  61.0    18.0  16.4  41.4  58.0    15.9
DeViSE [21]                  17.0  57.0  68.0    11.9  21.6  54.6  72.4     9.5
SDT-RNN [22]                 25.0  56.0  70.0    13.4  25.4  65.2  84.4     7.0
Our model                    39.0  68.0  79.0    10.5  23.6  65.2  79.8     7.6

Table 1: Pascal1K ranking experiments. R@K is Recall@K (high is good). Mean r is the mean rank (low is good). Note that the test set only consists of 100 images.

Flickr8K                        Image Annotation         Image Search
Model                         R@1   R@5  R@10  Med r   R@1   R@5  R@10  Med r
Random Ranking                0.1   0.6   1.1    631   0.1   0.5   1.0    500
Socher et al. [22]            4.5  18.0  28.6     32   6.1  18.5  29.0     29
DeViSE [21]                   4.8  16.5  27.3     28   5.9  20.1  29.6     29
SDT-RNN [22]                  6.0  22.7  34.0     23   6.6  21.6  31.7     25
Fragment Alignment Objective  7.2  21.9  31.8     25   5.9  20.0  30.3     26
Global Ranking Objective      5.8  21.8  34.8     20   7.5  23.4  35.0     21
(†) Fragment + Global        12.5  29.4  43.8     14   8.6  26.7  38.7     17
† → Images: Full frame Only   5.9  19.2  27.3     34   5.2  17.6  26.5     32
† → Sentences: BOW            9.1  25.9  40.7     17   6.9  22.4  34.0     23
† → Sentences: Bigrams        8.7  28.5  41.0     16   8.5  25.2  37.0     20
Our model († + MIL)          12.6  32.9  44.0     14   9.7  29.6  42.5     15
* Hodosh et al. [3]           8.3  21.6  30.3     34   7.6  20.7  30.1     38
* Our model († + MIL)         9.3  24.9  37.4     21   8.8  27.9  41.3     17

Table 2: Flickr8K experiments. R@K is Recall@K (high is good). Med r is the median rank (low is good). The starred evaluation criterion (*) in [3] is slightly different since it discards 4,000 out of 5,000 test sentences.
Evaluation Protocol for Bidirectional Retrieval. For Pascal1K we follow Socher et al. [22] and use 800 images for training, 100 for validation and 100 for testing. For the Flickr datasets we use 1,000 images for validation, 1,000 for testing and the rest for training (consistent with [3]). We compute the dense image-sentence similarity S_kl between every image-sentence pair in the test set and rank images and sentences in order of decreasing score. For both Image Annotation and Image Search, we report the median rank of the closest ground truth result in the list, as well as Recall@K, which computes the fraction of times the correct result was found among the top K items. When comparing to Hodosh et al. [3] we closely follow their evaluation protocol and only keep a subset of N sentences out of the total 5N (we use the first sentence out of every group of 5).

4.1 Comparison Methods

SDT-RNN. Socher et al. [22] embed a full frame CNN representation with the sentence representation from a Semantic Dependency Tree Recursive Neural Network (SDT-RNN). Their loss matches our global ranking objective. We requested the source code of Socher et al. [22] and substituted the Flickr8K and Flickr30K datasets. To better understand the benefits of using our detection CNN and for a fairer comparison we also train their method with our CNN features. Since we have multiple objects per image, we average representations from all objects with detection confidence above a (cross-validated) threshold. We refer to the exact method of Socher et al. [22] with a single full frame CNN as "Socher et al.", and to their method with our combined CNN features as "SDT-RNN".

DeViSE. The DeViSE [21] source code is not publicly available but their approach is a special case of our method with the following modifications: we use the average (L2-normalized) word vectors as a sentence fragment, the average CNN activation of all objects above a detection threshold (as discussed in the case of SDT-RNN) as an image fragment, and only use the global ranking objective.
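The Recall@K and median-rank protocol above is straightforward to reproduce from a score matrix; a minimal sketch for the Image Annotation direction (ranking sentences for each image; transpose S for Image Search):

```python
import numpy as np

def retrieval_metrics(S, ks=(1, 5, 10)):
    """Rank ground-truth matches from a score matrix S (S[k, l]: image k vs. sentence l;
    the correct match sits on the diagonal). Returns Recall@K (%) and the median rank."""
    order = np.argsort(-S, axis=1)            # sentences sorted by decreasing score per image
    ranks = np.argmax(order == np.arange(len(S))[:, None], axis=1) + 1  # 1-based rank of the match
    recall = {k: 100.0 * np.mean(ranks <= k) for k in ks}
    return recall, np.median(ranks)

S = np.array([[0.9, 0.2, 0.1],
              [0.3, 0.8, 0.4],
              [0.6, 0.5, 0.2]])               # toy scores; row 2's true match ranks 3rd
recall, med = retrieval_metrics(S)
print(recall[1], med)                          # R@1 ≈ 66.7, median rank 1.0
```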
4.2 Quantitative Evaluation

Our model outperforms previous methods. Our full method consistently outperforms previous methods on the Flickr8K (Table 2) and Flickr30K (Table 3) datasets. On Pascal1K (Table 1) the SDT-RNN appears to be competitive on Image Search.

Fragment and Global Objectives are complementary. As seen in Tables 2 and 3, both objectives perform well independently, but benefit from the combination. Note that the Global Objective performs consistently better, possibly because it directly minimizes the evaluation criterion (ranking