
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, Radu Soricut
Google AI, Venice, CA 90291
{piyushsharma, dingnan, seabass, rsoricut}@google.com

Abstract

We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 (Szegedy et al., 2016) for image-feature extraction and Transformer (Vaswani et al., 2017) for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.

1 Introduction

Automatic image description is the task of producing a natural-language utterance (usually a sentence) which correctly reflects the visual content of an image. This task has seen an explosion in proposed solutions based on deep learning architectures (Bengio, 2009), starting with the winners of the 2015 COCO challenge (Vinyals et al., 2015a; Fang et al., 2015), and continuing with a variety of improvements (see e.g. Bernardi et al. (2016) for a review). Practical applications of automatic image description systems include leveraging descriptions for image indexing or retrieval, and helping those with visual impairments by transforming visual signals into information that can be communicated via text-to-speech technology. The scientific challenge is seen as aligning, exploiting, and pushing further the latest improvements at the intersection of Computer Vision and Natural Language Processing.

Figure 1: Examples of images and image descriptions from the Conceptual Captions dataset; we start from existing Alt-text descriptions and automatically process them into Conceptual Captions with a balance of cleanliness, informativeness, fluency, and learnability. [Example 1] Alt-text: "A Pakistani worker helps to clear the debris from the Taj Mahal Hotel November 7, 2005 in Balakot, Pakistan." Conceptual Captions: "a worker helps to clear the debris." [Example 2] Alt-text: "Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee." Conceptual Captions: "pop artist performs at the festival in a city."

There are two main categories of advances responsible for increased interest in this task. The first is the availability of large amounts of annotated data. Relevant datasets include the ImageNet dataset (Deng et al., 2009), with over 14 million images and 1 million bounding-box annotations, and the MS-COCO dataset (Lin et al., 2014), with 120,000 images and 5-way image-caption annotations. The second is the availability of powerful modeling mechanisms such as modern Convolutional Neural Networks (e.g. Krizhevsky et al. (2012)), which are capable of converting image pixels into high-level features with no manual feature-engineering.

In this paper, we make contributions to both the data and modeling categories. First, we present a new dataset of caption annotations, Conceptual Captions (Fig. 1; available at https://github.com/google-research-datasets/conceptual-captions), which has an order of magnitude more images than the COCO dataset. Conceptual Captions consists of about 3.3M ⟨image, description⟩ pairs. In contrast with the curated style of the COCO images, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. The raw descriptions are harvested from the Alt-text HTML attribute (https://en.wikipedia.org/wiki/Alt_attribute) associated with web images. We developed an automatic pipeline (Fig. 2) that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

As a contribution to the modeling category, we evaluate several image-captioning models. Based on the findings of Huang et al. (2016), we use Inception-ResNet-v2 (Szegedy et al., 2016) for image-feature extraction, which confers optimization benefits via residual connections and computationally efficient Inception units. For caption generation, we use both RNN-based (Hochreiter and Schmidhuber, 1997) and Transformer-based (Vaswani et al., 2017) models. Our results indicate that Transformer-based models achieve higher output accuracy; combined with the reports of Vaswani et al. (2017) regarding the reduced number of parameters and FLOPs required for training and serving (compared with RNNs), models such as T2T8x8 (Section 4) push forward the performance on image captioning and deserve further attention.

2 Related Work

Automatic image captioning has a long history (Hodosh et al., 2013; Donahue et al., 2014; Karpathy and Fei-Fei, 2015; Kiros et al., 2015). It has accelerated with the success of Deep Neural Networks (Bengio, 2009) and the availability of annotated data as offered by datasets such as Flickr30K (Young et al., 2014) and MS-COCO (Lin et al., 2014).

The COCO dataset is not large (order of 10^6 images), given the training needs of DNNs. In spite of that, it has been very popular, in part because it offers annotations for images with non-iconic views, or non-canonical perspectives of objects, and therefore reflects the composition of everyday scenes (the same is true about Flickr30K (Young et al., 2014)). COCO annotations – category labeling, instance spotting, and instance segmentation – are done for all objects in an image, including those in the background, in a cluttered environment, or partially occluded. Its images are also annotated with captions, i.e. sentences produced by human annotators to reflect the visual content of the images in terms of objects and their actions or relations.

A large number of DNN models for image caption generation have been trained and evaluated using COCO captions (Vinyals et al., 2015a; Fang et al., 2015; Xu et al., 2015; Ranzato et al., 2015; Yang et al., 2016; Liu et al., 2017; Ding and Soricut, 2017). These models are inspired by sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2015) but use CNN-based encodings instead of RNNs (Hochreiter and Schmidhuber, 1997; Chung et al., 2014). Recently, the Transformer architecture (Vaswani et al., 2017) has been shown to be a viable alternative to RNNs (and CNNs) for sequence modeling. In this work, we evaluate the impact of the Conceptual Captions dataset on the image captioning task using models that combine CNN, RNN, and Transformer layers.

Also related to this work is the Pinterest image and sentence-description dataset (Mao et al., 2016). It is a large dataset (order of 10^8 examples), but its text descriptions do not strictly reflect the visual content of the associated image, and therefore cannot be used directly for training image-captioning models.

3 Conceptual Captions Dataset Creation

The Conceptual Captions dataset is programmatically created using a Flume (Chambers et al., 2010) pipeline. This pipeline processes billions of Internet webpages in parallel. From these webpages, it extracts, filters, and processes candidate ⟨image, caption⟩ pairs. The filtering and processing steps are described in detail in the following sections.

Image-based Filtering
The first filtering stage, image-based filtering, discards images based on encoding format, size, aspect ratio, and offensive content. It only keeps JPEG images where both dimensions are greater than 400 pixels, and the ratio of larger to smaller dimension is no more than 2. It excludes images that trigger pornography or profanity detectors. These filters discard more than 65% of the candidates.

Text-based Filtering
The second filtering stage, text-based filtering, harvests Alt-text from HTML webpages. Alt-text generally accompanies images and intends to describe the nature or the content of the image. Because these Alt-text values are not in any way restricted or enforced to be good image descriptions, many of them have to be discarded, e.g., search engine optimization (SEO) terms, or Twitter hash-tag terms.

We analyze candidate Alt-text using the Google Cloud Natural Language APIs, specifically part-of-speech (POS), sentiment/polarity, and pornography/profanity annotations. On top of these annotations, we apply the following heuristics:
- a well-formed caption should have a high unique-word ratio covering various POS tags; candidates with no determiner, no noun, or no preposition are discarded; candidates with a high noun ratio are also discarded; candidates with a high rate of token repetition are discarded;
- capitalization is a good indicator of well-composed sentences; candidates where the first word is not capitalized, or with too high a capitalized-word ratio, are discarded;
- highly unlikely tokens are a good indicator of undesirable text; we use a vocabulary V_W of 1B token types, appearing at least 5 times in the English Wikipedia, and discard candidates that contain tokens not found in this vocabulary;
- candidates that score too high or too low on the polarity annotations, or trigger the pornography/profanity detectors, are discarded;
- predefined boiler-plate prefix/suffix sequences matching the text are cropped, e.g. "click to enlarge picture", "stock photo"; we also drop text which begins/ends in certain patterns, e.g. "embedded image permalink", "profile photo".
These filters only allow around 3% of the incoming candidates to pass to the later stages.
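To make the flavor of these checks concrete, here is a minimal Python sketch of the image-level constraints and a few of the text heuristics above. It is illustrative only: the repetition and capitalization thresholds are assumptions (the paper does not give exact values), and the POS tags and vocabulary are taken as given, standing in for the Google Cloud Natural Language annotations and the Wikipedia-derived vocabulary V_W.

```python
# Illustrative sketch (not the paper's Flume pipeline): the kinds of cheap
# image- and text-level checks described above, with made-up thresholds
# where the paper does not state exact values.

def passes_image_filter(fmt: str, width: int, height: int) -> bool:
    """Keep JPEGs with both dimensions > 400px and aspect ratio <= 2."""
    if fmt.upper() != "JPEG":
        return False
    if min(width, height) <= 400:
        return False
    return max(width, height) / min(width, height) <= 2.0

def passes_text_filter(tokens, pos_tags, vocab, max_caps_ratio=0.3):
    """A few of the Alt-text heuristics; pos_tags and vocab are assumed to
    come from an external annotator and a large reference vocabulary."""
    if not tokens:
        return False
    # Well-formedness: require a determiner, a noun, and a preposition.
    for required in ("DET", "NOUN", "ADP"):
        if required not in pos_tags:
            return False
    # High token-repetition rate is a bad sign.
    if len(set(tokens)) / len(tokens) < 0.5:           # assumed threshold
        return False
    # First word capitalized, but not too many capitalized words overall.
    if not tokens[0][0].isupper():
        return False
    caps = sum(t[0].isupper() for t in tokens)
    if caps / len(tokens) > max_caps_ratio:            # assumed threshold
        return False
    # Every token must be found in the large vocabulary V_W.
    return all(t.lower() in vocab for t in tokens)
```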

Image & Text-based Filtering
In addition to the separate filtering based on image and text content, we filter out candidates for which none of the text tokens can be mapped to the content of the image. To this end, we use classifiers available via the Google Cloud Vision APIs to assign class labels to images, using an image classifier with a large number of labels (order of magnitude of 10^5). Notably, these labels are also 100% covered by the V_W token types. Images are generally assigned between 5 and 20 labels, though the exact number depends on the image. We match these labels against the candidate text, taking into account morphology-based stemming as provided by the text annotation. Candidate ⟨image, caption⟩ pairs with no overlap are discarded. This filter discards around 60% of the incoming candidates.

Text Transformation with Hypernymization
In the current version of the dataset, we considered over 5 billion images from about 1 billion English webpages. The filtering criteria above are designed to be high-precision (which comes with potentially low recall). From the original input candidates, only 0.2% of ⟨image, caption⟩ pairs pass the filtering criteria described above.

While the remaining candidate captions tend to be appropriate Alt-text image descriptions (see Alt-text in Fig. 1), a majority of these candidate captions contain proper names (people, venues, locations, etc.), which would be extremely difficult to learn as part of the image captioning task. To give an idea of what would happen in such cases, we train an RNN-based captioning model (see Section 4) on non-hypernymized Alt-text data and present an output example in Fig. 3. If automatic determination of person identity, location, etc. is needed, it should be attempted as a separate task and would need to leverage image meta-information (e.g. location).

Figure 3: Example of model output trained on clean, non-hypernymized Alt-text data. Alt-text (ground truth): "Jimmy Barnes performs at the Sydney Entertainment Centre". Model output: "Singer Justin Bieber performs onstage during the Billboard Music Awards at the MGM".

Using the Google Cloud Natural Language APIs, we obtain named-entity and syntactic-dependency annotations. We then use the Google Knowledge Graph (KG) Search API to match the named entities to KG entries and exploit the associated hypernym terms. For instance, both "Harrison Ford" and "Calista Flockhart" identify as named entities, so we match them to their corresponding KG entries. These KG entries have "actor" as their hypernym, so we replace the original surface tokens with that hypernym.

The following steps are applied to achieve the text transformations:
- noun modifiers of certain types (proper nouns, numbers, units) are removed;
- dates, durations, and preposition-based locations (e.g., "in Los Angeles") are removed;
- named entities are identified, matched against the KG entries, and substituted with their hypernym;
- resulting coordination noun-phrases with the same head (e.g., "actor and actor") are resolved into a single-head, pluralized form (e.g., "actors").
Around 20% of samples are discarded during this transformation because it can leave sentences too short or inconsistent.

Finally, we perform another round of text analysis and entity resolution to identify concepts with low count. We cluster all resolved entities (e.g., "actor", "dog", "neighborhood", etc.) and keep only the candidates for which all detected types have a count of over 100 (around 55% of the candidates). These remaining ⟨image, caption⟩ pairs contain around 16,000 entity types, guaranteed to be well represented in terms of number of examples. Table 1 contains several examples of before/after-transformation pairs.

Table 1: Examples of Conceptual Captions as derived from their original Alt-text versions.
Original Alt-text: "Harrison Ford and Calista Flockhart attend the premiere of 'Hollywood Homicide' at the 29th American Film Festival September 5, 2003 in Deauville, France."
Conceptual Captions: "actors attend the premiere at festival."
What happened: "Harrison Ford and Calista Flockhart" mapped to "actors"; name, location, and date dropped.

Original Alt-text: "Side view of a British Airways Airbus A319 aircraft on approach to land with landing gear down - Stock Image"
Conceptual Captions: "side view of an aircraft on approach to land with landing gear down"
What happened: phrase "British Airways Airbus A319 aircraft" mapped to "aircraft"; boilerplate removed.

Original Alt-text: "Two sculptures by artist Duncan McKellar adorn trees outside the derelict Norwich Union offices in Bristol, UK - Stock Image"
Conceptual Captions: "sculptures by person adorn trees outside the derelict offices"
What happened: object count (e.g. "Two") dropped; proper noun-phrase hypernymized to "person"; proper-noun modifiers dropped; location dropped; boilerplate removed.

Figure 2: Conceptual Captions pipeline steps with examples and final output. The figure traces candidate Alt-text strings through the image-filtering, text-filtering, image&text-filtering, and text-transformation stages; some candidates are discarded along the way (e.g. undesired image format, aspect ratio, or size; text with no preposition or article; no text-vs-image object overlap), while a surviving Alt-text is transformed into the final caption "pop rock artist wearing a black gown and sandals at awards".

Conceptual Captions Quality
To evaluate the precision of our pipeline, we consider a random sample of 4K examples extracted from the test split of the Conceptual Captions dataset. We perform a human evaluation on this sample, using the same methodology described in Section 5.4.

Table 2: Human evaluation results on a sample from Conceptual Captions (GOOD out of 3).
                      1+      2+      3
Conceptual Captions   96.9%   90.3%   78.5%

The results are presented in Table 2 and show that, out of 3 annotations, over 90% of the captions receive a majority (2+) of GOOD judgments. This indicates that the Conceptual Captions pipeline, though involving extensive algorithmic processing, produces high-quality image captions.

Table 3: Statistics over Train/Validation/Test splits for Conceptual Captions.
          Examples    Unique Tokens   Tokens/Caption (Mean / StdDev / Median)
Train     3,318,333   51,201          10.3 / 4.5 / 9.0
Valid.    28,355      13,063          10.3 / 4.6 / 9.0
Test      22,530      11,731          10.1 / 4.5 / 9.0

We present in Table 3 statistics over the Train/Validation/Test splits for the Conceptual Captions dataset. The training set consists of slightly over 3.3M examples, while there are slightly over 28K examples in the validation set and 22.5K examples in the test set. The size of the training set vocabulary (unique tokens) is 51,201. Note that the test set has been cleaned using human judgments (2+ GOOD), while both the training and validation splits contain all the data, as produced by our automatic pipeline. The mean/stddev/median statistics for tokens-per-caption over the data splits are consistent with each other, at around 10.3/4.5/9.0, respectively.
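The hypernymization step can be pictured with a small, self-contained sketch. The HYPERNYMS table and the entity list below are hypothetical stand-ins for the Google Knowledge Graph Search API and the named-entity annotations, and the coordination collapse is reduced to a single regular expression, far simpler than the dependency-based processing described above.

```python
import re

# Toy stand-in for the Knowledge Graph lookup; in the real pipeline the
# hypernym comes from the matched KG entry for each named entity.
HYPERNYMS = {"Harrison Ford": "actor", "Calista Flockhart": "actor",
             "Justin Timberlake": "pop artist"}

def hypernymize(caption: str, entities) -> str:
    """Replace each detected named-entity span with its hypernym, then
    collapse same-head coordinations ('actor and actor' -> 'actors').
    `entities` is a list of surface strings assumed to come from an NER step."""
    text = caption
    for surface in entities:
        hyper = HYPERNYMS.get(surface)
        if hyper:
            text = text.replace(surface, hyper)
    # Single-head pluralization for same-head coordinations (very simplified).
    text = re.sub(r"\b(\w+) and \1\b", lambda m: m.group(1) + "s", text)
    return text

print(hypernymize(
    "Harrison Ford and Calista Flockhart attend the premiere",
    ["Harrison Ford", "Calista Flockhart"]))
# -> "actors attend the premiere"
```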

4 Image Captioning Models

In order to assess the impact of the Conceptual Captions dataset, we consider several image captioning models previously proposed in the literature. These models can be understood using the illustration in Fig. 4, as they mainly differ in the way in which they instantiate some of these components.

Figure 4: The main model components. The diagram shows a sequence of image embeddings X feeding an Encoder that produces H, and a Decoder that, starting from a GO symbol, generates the caption tokens (e.g. "people playing frisbee ...").

There are three main components to this architecture:
- a deep CNN that takes a (preprocessed) image and outputs a vector of image embeddings X = (x_1, x_2, ..., x_L);
- an Encoder module that takes the image embeddings and encodes them into a tensor H = f_enc(X);
- a Decoder model that generates outputs z_t = f_dec(Y_{1:t}, H) at each step t, conditioned on H as well as the decoder inputs Y_{1:t}.

We explore two main instantiations of this architecture. One uses RNNs with LSTM cells (Hochreiter and Schmidhuber, 1997) to implement the f_enc and f_dec functions, corresponding to the Show-And-Tell (Vinyals et al., 2015b) model. The other uses Transformer self-attention networks (Vaswani et al., 2017) to implement f_enc and f_dec. All models in this paper use Inception-ResNet-v2 as the CNN component (Szegedy et al., 2016).

4.1 RNN-based Models

Our instantiation of the RNN-based model is close to the Show-And-Tell (Vinyals et al., 2015b) model:

    h_l \triangleq \mathrm{RNN_{enc}}(x_l, h_{l-1}), \quad H = h_L
    z_t \triangleq \mathrm{RNN_{dec}}(y_t, z_{t-1}), \quad z_0 = H

In the original Show-And-Tell model, a single image embedding of the entire image is fed to the first cell of an RNN, which is also used for text generation. In our model, a single image embedding is fed to an RNN_enc with only one cell, and then a different RNN_dec is used for text generation. We tried both single-image (1x1) embeddings and 8x8 partitions of the image, where each partition has its own embedding. In the 8x8 case, image embeddings are fed in a sequence to the RNN_enc. In both cases, we apply plain RNNs without cross attention, same as the Show-And-Tell model. RNNs with cross attention were used in the Show-Attend-Tell model (Xu et al., 2015), but we find its performance to be inferior to the Show-And-Tell model.
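The two recurrences above amount to running one RNN over the image embeddings and a second RNN over the caption tokens, with the final encoder state seeding the decoder. The NumPy sketch below illustrates only that data flow; a plain tanh cell and a toy dimension stand in for the paper's 512-dimensional LSTM, and the output projection and training loop are omitted.

```python
import numpy as np

# Minimal sketch of the encoder/decoder recurrences above; a plain tanh RNN
# cell stands in for the LSTM, and sizes are tiny for illustration only.

rng = np.random.default_rng(0)
D = 8                                    # toy hidden/embedding size

def make_cell():
    Wx, Wh, b = rng.normal(size=(D, D)), rng.normal(size=(D, D)), np.zeros(D)
    return lambda x, h: np.tanh(x @ Wx + h @ Wh + b)

rnn_enc, rnn_dec = make_cell(), make_cell()

def encode(image_embeddings):            # X = (x_1, ..., x_L); L = 1 or 64
    h = np.zeros(D)
    for x in image_embeddings:           # h_l = RNN_enc(x_l, h_{l-1})
        h = rnn_enc(x, h)
    return h                             # H = h_L

def decode(H, token_embeddings):         # teacher-forced decoding
    z, outputs = H, []                   # z_0 = H
    for y in token_embeddings:           # z_t = RNN_dec(y_t, z_{t-1})
        z = rnn_dec(y, z)
        outputs.append(z)
    return np.stack(outputs)             # fed to the tied output projection

X = rng.normal(size=(64, D))             # 8x8 partition embeddings
Y = rng.normal(size=(5, D))              # embedded caption prefix
print(decode(encode(X), Y).shape)        # (5, 8)
```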

4.2 Transformer Model

In the Transformer-based models, both the encoder and the decoder contain a stack of N layers. We denote the n-th layer in the encoder by X_n = {x_{n,1}, ..., x_{n,L}}, with X_0 = X and H = X_N. Each of these layers contains two sub-layers: a multi-head self-attention layer ATTN, and a position-wise feedforward network FFN:

    x'_{n,j} = \mathrm{ATTN}(x_{n,j}, X_n; W_{eq}, W_{ek}, W_{ev}) \triangleq \mathrm{softmax}(\langle x_{n,j} W_{eq}, X_n W_{ek} \rangle)\, X_n W_{ev}
    x_{n+1,j} = \mathrm{FFN}(x'_{n,j}; W_{ef})

where W_{eq}, W_{ek}, and W_{ev} are the encoder weight matrices for query, key, and value transformation in the self-attention sub-layer, and W_{ef} denotes the encoder weight matrix of the feedforward sub-layer. Similar to the RNN-based model, we consider using a single image embedding (1x1) and a vector of 8x8 image embeddings.

In the decoder, we denote the n-th layer by Z_n = {z_{n,1}, ..., z_{n,T}}, with Z_0 = Y. There are two main differences between the decoder and encoder layers. First, the self-attention sub-layer in the decoder is masked to the right, in order to prevent attending to "future" positions (i.e. z_{n,j} does not attend to z_{n,j+1}, ..., z_{n,T}). Second, in between the self-attention layer and the feedforward layer, the decoder adds a third cross-attention layer that connects z_{n,j} to the top-layer encoder representation H = X_N:

    z'_{n,j} = \mathrm{ATTN}(z_{n,j}, Z_{n,1:j}; W_{dq}, W_{dk}, W_{dv})
    z''_{n,j} = \mathrm{ATTN}(z'_{n,j}, H; W_{cq}, W_{ck}, W_{cv})
    z_{n+1,j} = \mathrm{FFN}(z''_{n,j}; W_{df})

where W_{dq}, W_{dk}, and W_{dv} are the weight matrices for query, key, and value transformation in the decoder self-attention sub-layer; W_{cq}, W_{ck}, and W_{cv} are the corresponding decoder weight matrices in the cross-attention sub-layer; and W_{df} is the decoder weight matrix of the feedforward sub-layer.

The Transformer-based models utilize position information at the embedding layer. In the 8x8 case, the 64 embedding vectors are serialized to a 1D sequence with positions from [0, ..., 63]. The position information is modeled by applying sine and cosine functions at each position and with different frequencies for each embedding dimension, as in (Vaswani et al., 2017), and subsequently added to the embedding representations.
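As a rough illustration of the encoder equations, the sketch below implements one single-head encoder layer in NumPy. It follows the ATTN and FFN definitions above, with the FFN realized as two linear maps and a ReLU (an assumption; the paper only names W_ef), and it omits multi-head splitting, residual connections, layer normalization, and the 1/sqrt(d) scaling of Vaswani et al. (2017).

```python
import numpy as np

# Single-head sketch of one encoder layer: the self-attention sub-layer
# softmax(<x W_eq, X W_ek>) X W_ev followed by a position-wise FFN.

rng = np.random.default_rng(0)
D = 8                                    # toy model dimension

W_eq, W_ek, W_ev = (rng.normal(size=(D, D)) for _ in range(3))
W_ef1, W_ef2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attn(X_n):
    """x'_{n,j} = softmax(<x_{n,j} W_eq, X_n W_ek>) X_n W_ev, for all j."""
    scores = (X_n @ W_eq) @ (X_n @ W_ek).T      # L x L attention logits
    return softmax(scores, axis=-1) @ (X_n @ W_ev)

def ffn(X_prime):
    """Position-wise feedforward sub-layer (two linear maps with a ReLU)."""
    return np.maximum(X_prime @ W_ef1, 0.0) @ W_ef2

X0 = rng.normal(size=(64, D))            # serialized 8x8 image embeddings
X1 = ffn(attn(X0))                       # X_{n+1}; stack N = 6 such layers
print(X1.shape)                          # (64, 8)
```

The decoder layers follow the same pattern, with the self-attention restricted to positions up to j and a second attention call against the encoder output H.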

5 Experimental Results

In this section, we evaluate the impact of using the Conceptual Captions dataset (referred to as 'Conceptual' in what follows) for training image captioning models. To this end, we train the models described in Section 4 under two experimental conditions: using the training & development sets provided by the COCO dataset (Lin et al., 2014), versus training & development sets using the Conceptual dataset. We quantitatively evaluate the resulting models using three different test sets: the blind COCO-C40 test set (in-domain for COCO-trained models, out-of-domain for Conceptual-trained models); the Conceptual test set (out-of-domain for COCO-trained models, in-domain for Conceptual-trained models); and the Flickr (Young et al., 2014) 1K test set (out-of-domain for both COCO-trained and Conceptual-trained models).

5.1 Dataset Details

COCO Image Captions
The COCO image captioning dataset is normally divided into 82K images for training, and 40K images for validation. Each of these images comes with at least 5 ground-truth captions. Following standard practice, we combine the training set with most of the validation dataset for training our model, and only hold out a subset of 4K images for validation.

Conceptual Captions
The Conceptual Captions dataset contains around 3.3M images for training, 28K for validation, and 22.5K for the test set. For more detailed statistics, see Table 3.

5.2 Experimental Setup

Image Preprocessing
Each input image is first preprocessed by random distortion and cropping (using a random ratio from 50%-100%). This prevents models from overfitting individual pixels of the training images.

Encoder-Decoder
For RNN-based models, we use a 1-layer, 512-dim LSTM as the RNN cell. For the Transformer-based models, we use the default setup from (Vaswani et al., 2017), with N = 6 encoder and decoder layers, a hidden-layer size of 512, and 8 attention heads.

Text Handling
Training captions are truncated to a maximum of 15 tokens. We use a token-type min-count of 4, which results in around 9,000 token types for the COCO dataset, and around 25,000 token types for the Conceptual Captions dataset. All other tokens are replaced with the special token ⟨UNK⟩. The word embedding matrix has size 512 and is tied to the output projection matrix.

Optimization
All models are trained using MLE loss and optimized using Adagrad (Duchi et al., 2011) with learning rate 0.01. Mini-batch size is 25. All model parameters are trained for a total number of 5M steps, with batch updates asynchronously distributed across 40 workers. The final model is selected based on the best CIDEr score on the development set for the given training condition.

Inference
During inference, the decoder prediction of the previous position is fed to the input of the next position. We use a beam search of beam size 4 to compute the most likely output sequence.
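For concreteness, here is a generic beam-search sketch with beam size 4 and the 15-token cap used above. The log_probs function is a hypothetical stand-in for one decoding step of the trained model; the toy table at the bottom exists only so the example runs end to end.

```python
# Generic beam-search sketch over an arbitrary next-token scorer;
# log_probs(prefix) stands in for one decoder step, returning a
# {token: log-probability} dictionary.

def beam_search(log_probs, bos="<S>", eos="</S>", beam_size=4, max_len=15):
    beams = [([bos], 0.0)]                       # (token sequence, score)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                   # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, lp in log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# Tiny fake model for illustration only.
def toy_log_probs(prefix):
    table = {"<S>": {"people": -0.2, "a": -1.6},
             "people": {"playing": -0.3, "</S>": -1.2},
             "playing": {"frisbee": -0.1},
             "frisbee": {"</S>": -0.05},
             "a": {"</S>": -0.5}}
    return table.get(prefix[-1], {"</S>": 0.0})

print(" ".join(beam_search(toy_log_probs)[1:-1]))   # -> "people playing frisbee"
```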

5.3 Qualitative Results

Before we present the numerical results for our experiments, we discuss briefly the patterns that we have observed.

One difference between COCO-trained models and Conceptual-trained models is their ability to use the appropriate natural-language terms for the entities in an image. For the left-most image in Fig. 5, COCO-trained models use "group of men" to refer to the people in the image; Conceptual-based models use the more appropriate and informative term "graduates". The second image, from the Flickr test set, makes this even more clear. The Conceptual-trained T2T8x8 model perfectly renders the image content as "the cloister of the cathedral". None of the other models come close to producing such an accurate description.

A second difference is that COCO-trained models often seem to hallucinate objects. For instance, they hallucinate "front of building" for the first image, "clock and two doors" for the second, and "birthday cake" for the third image. In contrast, Conceptual-trained models do not seem to have this problem. We hypothesize that the hallucination issue for COCO-based models comes from the high correlations present in the COCO data (e.g., if there is a kid at a table, there is also cake). This high degree of correlation in the data does not allow the captioning model to correctly disentangle and learn representations at the right level of granularity.

A third difference is the resilience to a large spectrum of image types. COCO only contains natural images, and therefore a cartoon image like the fourth one results in massive hallucination effects for COCO-trained models ("stuffed animal", "fish", "side of car"). In contrast, Conceptual-trained models handle such images with ease.

Figure 5: Side-by-side comparison of model outputs under two training conditions. Conceptual-based models (lower half) tend to hallucinate less, are more expressive, and handle well a larger variety of images. The two images in the middle are from Flickr; the other two are from Conceptual Captions.
COCO-trained RNN8x8: "a group of men standing in front of a building" | "a couple of people walking down a walkway" | "a child sitting at a table with a cake on it" | "a close up of a stuffed animal on a table"
COCO-trained T2T8x8: "a group of men in uniform and ties are talking" | "a narrow hallway with a clock and two doors" | "a woman cutting a birthday cake at a party" | "a picture of a fish on the side of a car"
Conceptual-trained RNN8x8: "graduates line up for the commencement ceremony" | "a view of the nave" | "a child's drawing at a birthday party" | "a cartoon businessman thinking about something"
Conceptual-trained T2T8x8: "graduates line up to receive their diplomas" | "the cloister of the cathedral" | "learning about the arts and crafts" | "a cartoon businessman asking for help"

5.4 Quantitative Results

In this section, we present quantitative results on the quality of the outputs produced by several image captioning models. We present both automatic evaluation results and human evaluation results.

5.4.1 Human Evaluation Results

For human evaluations, we use a pool of professional raters (tens of raters), with a double-blind evaluation condition. Raters are asked to assign a GOOD or BAD label to a given ⟨image, caption⟩ input, using just common-sense judgment. This approximates the reaction of a typical user, who normally would not accept predefined notions of GOOD vs. BAD. We ask 3 separate raters to rate each input pair and report the percentage of pairs that receive k or more (k+) GOOD annotations.

Table 4: Human eval results on Flickr 1K Test.
Model    Training     1+      2+      3+
RNN8x8   COCO         0.390   0.276   0.173
T2T8x8   COCO         0.478   0.362   0.275
RNN8x8   Conceptual   0.571   0.418   0.277
T2T8x8   Conceptual   0.659   0.506   0.355

In Table 4, we report the results on the Flickr 1K test set. This evaluation is out-of-domain for both training conditions, so all models are on relatively equal footing. The results indicate that the Conceptual-based models are superior: in 50.6% of cases (for the T2T8x8 model), a majority of annotators (2+) assigned a GOOD label. The results also indicate that the Transformer-based models are superior to the RNN-based models by a good margin, by over 8 points (for 2+) under both COCO and Conceptual training conditions.
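The k+ aggregation used in Table 4 (and in Table 2 earlier) is straightforward to reproduce; the short sketch below assumes the three per-item labels have already been collected as GOOD/BAD strings.

```python
# Sketch of the "k or more GOOD out of 3 ratings" aggregation used for the
# human evaluation tables; `ratings` holds the three per-item labels.

def k_plus_rates(ratings, k_values=(1, 2, 3)):
    """ratings: list of per-item lists like ["GOOD", "BAD", "GOOD"].
    Returns the fraction of items with at least k GOOD labels, per k."""
    totals = {k: 0 for k in k_values}
    for item in ratings:
        good = sum(label == "GOOD" for label in item)
        for k in k_values:
            totals[k] += good >= k
    return {f"{k}+": totals[k] / len(ratings) for k in k_values}

# Example with three rated captions: fractions for 1+, 2+, 3+ are 1, 2/3, 1/3.
print(k_plus_rates([["GOOD", "GOOD", "GOOD"],
                    ["GOOD", "BAD", "GOOD"],
                    ["BAD", "BAD", "GOOD"]]))
```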

5.4.2 Automatic Evaluation Results

In this section, we report automatic evaluation results, using established image captioning metrics. For the COCO C40 test set (Table 5), we report the numerical values returned by the COCO online evaluation server (http://mscoco.org/dataset/#captions-eval), using the CIDEr (Vedantam et al., 2015), ROUGE-L (Lin and Och, 2004), and METEOR (Banerjee and Lavie, 2005) metrics. For the Conceptual Captions (Table 6) and Flickr (Table 7) test sets, we report numerical values for the CIDEr, ROUGE-L, and SPICE (Anderson et al., 2016) metrics (https://github.com/tylin/coco-caption). For all metrics, a higher number indicates closer agreement between the candidate and the ground-truth captions.

Table 5: Auto metrics on the COCO C40 Test.
Model    Training     CIDEr   ROUGE-L   METEOR
RNN1x1   COCO         1.021   0.694     0.348
RNN8x8   COCO         1.044   0.698     0.354
T2T1x1   COCO         1.032   0.700     0.358
T2T8x8   COCO         1.032   0.700     0.356
RNN1x1   Conceptual   0.403   0.445     0.191
RNN8x8   Conceptual   0.410   0.437     0.189
T2T1x1   Conceptual   0.348   0.403     0.171
T2T8x8   Conceptual   0.345   0.400     0.170

Table 6: Auto metrics on the 22.5K Conceptual Captions Test set.
Model    Training     CIDEr   ROUGE-L   SPICE
RNN1x1   COCO         0.183   0.149     0.062
RNN8x8   COCO         0.191   0.152     0.065
T2T1x1   COCO         0.184   0.148     0.062
T2T8x8   COCO         0.190   0.151     0.064
RNN1x1   Conceptual   1.351   0.326     0.235
RNN8x8   Conceptual   1.401   0.330     0.240
T2T1x1   Conceptual   1.588   0.331     0.254
T2T8x8   Conceptual   1.676   0.336     0.257

Table 7: Auto metrics on the Flickr 1K Test.
Model    Training     CIDEr   ROUGE-L   SPICE
RNN1x1   COCO         0.340   0.414     0.101
RNN8x8   COCO         0.356   0.413     0.103
T2T1x1   COCO         0.341   0.404     0.101
T2T8x8   COCO         0.359   0.416     0.103
RNN1x1   Conceptual   0.269   0.310     0.076
RNN8x8   Conceptual   0.275   0.309     0.076
T2T1x1   Conceptual   0.226   0.280     0.068
T2T8x8   Conceptual   0.227   0.277     0.066

The automatic metrics are good at detecting in- vs out-of-domain situations. For COCO-trained models tested on COCO, the results in Table 5 show CIDEr scores in the 1.02-1.04 range, for both RNN- and Transformer-based models; the scores drop into the 0.35-0.41 range (CIDEr) for the Conceptual-based models tested against COCO ground truth. For Conceptual-trained models tested on the Conceptual Captions test set, the results in Table 6 show scores as high as 1.468 CIDEr for the T2T8x8 model, which corroborates the human-eval results for the Transformer-based models being superior to the RNN-based models; the scores for the COCO-based models tested against Conceptual Captions ground truth are all below 0.2 CIDEr.

The automatic metrics fail to corroborate the human evaluation results. According to the automatic metrics, the COCO-trained models are superior to the Conceptual-trained models (CIDEr scores in the mid-0.3 for the COCO-trained condition, versus mid-0.2 for the Conceptual-trained condition), and the RNN-based models are superior to Transformer-based models. Notably, these are the same metrics which score humans lower than the methods that won the COCO 2015 challenge (Vinyals et al., 2015a; Fang et al., 2015), despite the fact that humans are still much better at this task. The failure of these metrics to align with the human evaluation results casts again grave doubts on their ability to drive progress in this field. A significant weakness of these metrics is that hallucination effects are under-penalized (a small precision penalty for tokens with no correspondent in the reference), compared to human judgments that tend to dive dramatically in the presence of hallucinations.

6 Conclusions

We present a new image captioning dataset, Conceptual Captions, which has several key characteristics: it has around 3.3M examples, an order of magnitude larger than the COCO image-captioning dataset; it consists of a wide variety of images, including natural images, product images, professional photos, cartoons, drawings, etc.; and its captions are based on descriptions taken from original Alt-text attributes, automatically transformed to achieve a balance between cleanliness, informativeness, and learnability. We evaluate both the quality of the resulting image/caption pairs, as well as the performance of several image-captioning models when trained on the Conceptual Captions data. The results indicate that such models achieve better performance, and avoid some of the pitfalls seen with COCO-trained models, such as object hallucination. We hope that the availability of the Conceptual Captions dataset will foster considerable progress on the automatic image-captioning task.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV.

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1-127.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. JAIR 55.

Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 363-375. http://dl.acm.org/citation.cfm?id=1806638

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.

Nan Ding and Radu Soricut. 2017. Cold-start reinforcement learning with softmax policy gradients. In NIPS.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014. Long-term recurrent convolutional networks for visual recognition and description. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121-2159.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. 2015. From captions to visual concepts and back. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR.

Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. 2016. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR abs/1611.10012.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2015. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics.

A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. CoRR abs/1405.0312.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In International Conference on Computer Vision (ICCV).

Junhua Mao, Jiajing Xu, Yushi Jing, and Alan Yuille. 2016. Training and evaluating multimodal word embeddings with large-scale web annotated images. In NIPS.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015a. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015b. Show and tell: A neural image caption generator. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of the 32nd International Conference on Machine Learning (ICML).

Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen. 2016. Review networks for caption generation. In NIPS.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2:67-78.
