A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

Mateusz Malinowski, Mario Fritz
Max Planck Institute for Informatics, Saarbrücken, Germany
{mmalinow, mfritz}@mpi-inf.mpg.de

Abstract

We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions by a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with a range of answers like counts, object classes, instances and lists of them. The system is directly trained from question-answer pairs. We establish a first benchmark for this task that can be seen as a modern attempt at a visual turing test.

1 Introduction

As vision techniques like segmentation and object recognition begin to mature, there has been an increasing interest in broadening the scope of research to full scene understanding. But what is meant by "understanding" of a scene and how do we measure the degree of "understanding"? Most often "understanding" refers to a correct labeling of pixels, regions or bounding boxes in terms of semantic annotations. All predictions made by such methods inevitably come with uncertainties attached due to limitations in features or data or even inherent ambiguity of the visual input.

Equally strong progress has been made on the language side, where methods have been proposed that can learn to answer questions solely from question-answer pairs [1]. These methods operate on a set of facts given to the system, which is referred to as a world. Based on that knowledge the answer is inferred by marginalizing over multiple interpretations of the question. However, the correctness of the facts is a core assumption.

We like to unite those two research directions by addressing a question answering task based on real-world images. To combine the probabilistic output of state-of-the-art scene segmentation algorithms, we propose a Bayesian formulation that marginalizes over multiple possible worlds that correspond to different interpretations of the scene.

To date, we are lacking a substantial dataset that serves as a benchmark for question answering
on real-world images. Such a test has high demands on "understanding" the visual input and tests a whole chain of perception, language understanding and deduction. This very much relates to the "AI-dream" of building a turing test for vision. While we are still not ready to test our vision system on completely unconstrained settings that were envisioned in early days of AI, we argue that a question-answering task on complex indoor scenes is a timely step in this direction.

Contributions: In this paper we combine automatic, semantic segmentations of real-world scenes with symbolic reasoning about questions in a Bayesian framework by proposing a multi-world approach for automatic question answering. We introduce a novel dataset of more than 12,000 question-answer pairs on RGBD images produced by humans, as a modern approach to a visual turing test. We benchmark our approach on this new challenge and show the advantages of our multi-world approach. Furthermore, we provide additional insights regarding the challenges that lie ahead of us by factoring out sources of error from different components.

2 Related work

Semantic parsers: Our work is mainly inspired by [1] that learns the semantic representation for the question answering task solely based on questions and answers in natural language. Although the architecture learns the mapping from weak supervision, it achieves comparable results to the semantic parsers that rely on manual annotations of logical forms ([2], [3]). In contrast to our work, [1] has never used the semantic parser to connect the natural language to the perceived world.

Language and perception: Previous work [4, 5] has proposed models for the language grounding problem with the goal of connecting the meaning of the natural language sentences to a perceived world. Both methods use images as the representation of the physical world, but concentrate rather on constrained domains with images consisting of very few objects. For instance [5] considers only two mugs, a monitor and a table in their dataset, whereas [4] examines objects such as blocks, plastic food, and building bricks. In contrast, our work focuses on a diverse collection of real-world indoor RGBD images [6], with many more objects in the scene and more complex spatial relationships between them. Moreover, our paper considers complex questions - beyond the
scope of [4] and [5] - and reasoning across different images using only textual question-answer pairs for training. This imposes additional challenges for the question-answering engines such as scalability of the semantic parser, good scene representation, dealing with uncertainty in the language and perception, efficient inference and spatial reasoning. Although others [7, 8] propose interesting alternatives for learning the language binding, it is unclear if such approaches can be used to provide answers on questions.

Integrated systems that execute commands: Others [9, 10, 11, 12, 13] focus on the task of learning the representation of natural language in the restricted setting of executing commands. In such a scenario, the integrated systems execute commands given natural language input with the goal of using them in navigation. In our work, we aim for a less restrictive scenario with the question-answering system in mind. For instance, the user may ask our architecture about counting and colors ('How many green tables are in the image?'), negations ('Which images do not have tables?') and superlatives ('What is the largest object in the image?').

Probabilistic databases: Similarly to [14] that reduces the Named Entity Recognition problem to an inference problem in a probabilistic database, we sample multiple worlds based on the uncertainty introduced by the semantic segmentation algorithm that we apply to the visual input.

3 Method

Our method answers questions based on images by combining natural language input with output from visual scene analysis in a probabilistic framework as illustrated in Figure 1. In the single world approach, we generate a single perceived world W based on segmentations - a unique interpretation of a visual scene. In contrast, our multi-world approach integrates over many latent worlds W, and hence takes different interpretations of the scene and question into account.

Figure 1: Overview of our approach to question answering with multiple latent worlds in contrast to the single world approach.

Single-world approach for question answering problem: We build on recent progress on end-to-end question answering systems that are solely trained on question-answer pairs (Q, A) [1]. The top part of Figure 1 outlines how we build on [1] by modeling the logical forms associated with a question as a latent variable T given a single world W. More formally, the task of predicting an answer A given a question Q and a world W is performed by computing the following posterior which marginalizes over the latent logical forms (semantic trees in [1]) T:

    P(A | Q, W) := Σ_T P(A | T, W) P(T | Q)    (1)

P(A | T, W) corresponds to the denotation of a logical form T on the world W. In this setting, the answer is unique given the logical form and the world: P(A | T, W) = 1[A ∈ σ_W(T)] with the evaluation function σ_W, which evaluates a logical form on the world W. Following [1] we use DCS trees that yield the following recursive evaluation function σ_W:

    σ_W(T) := ⋂_{j=1}^{d} { v : v ∈ σ_W(p), t ∈ σ_W(T_j), R_j(v, t) }

where T := ⟨p, (T_1, R_1), (T_2, R_2), ..., (T_d, R_d)⟩ is the semantic tree with a predicate p associated with the current node, its subtrees T_1, T_2, ..., T_d, and relations R_j that define the relationship between the current node and a subtree T_j.

In the predictions, we use a log-linear distribution P(T | Q) ∝ exp(θ^T φ(Q, T)) over the logical forms with a feature vector φ measuring compatibility between Q and T and parameters θ learnt from training data. Every component φ_j is the number of times that a specific feature template occurs in (Q, T). We use the same templates as [1]: string triggers a predicate, string is under a relation, string is under a trace predicate, two predicates are linked via relation, and a predicate has a child. The model learns by alternating between searching over a restricted space of valid trees and gradient descent updates of the model parameters θ. We use the Datalog inference engine to produce the answers from the latent logical forms. Linguistic phenomena such as superlatives and negations are handled by the logical forms and the inference engine. For a detailed exposition, we refer the reader to [1].

Question answering on real-world images based on a perceived world: Similar to [5], we extend the work of [1] to operate now on what we call a perceived world W. This still corresponds to the single world approach in our overview Figure 1. However our world is now populated with "facts" derived from automatic, semantic image segmentations S. For this purpose, we build the world by running a state-of-the-art semantic segmentation algorithm [15] over the images and collect the recognized information about objects such as object class, 3D position, and
color [16] (Figure 1, middle part). Every object hypothesis is therefore represented as an n-tuple: predicate(instance_id, image_id, color, spatial_loc) where predicate ∈ {bag, bed, books, ...}, instance_id is the object's id, image_id is the id of the image containing the object, color is the estimated color of the object [16], and spatial_loc is the object's position in the image. The latter is represented as (Xmin, Xmax, Xmean, Ymin, Ymax, Ymean, Zmin, Zmax, Zmean) and defines the minimal, maximal, and mean location of the object along the X, Y, Z axes. To obtain the coordinates we fit axis-parallel cuboids to the cropped 3d objects based on the semantic segmentation. Note that the X, Y, Z coordinate system is aligned with the direction of gravity [15]. As shown in Figure 2b, this is a more meaningful representation of the object's coordinates than simple image coordinates. The complete schema will be documented together with the code release.

Figure 2: Fig. 2a shows a few sampled worlds where only segments of the class 'person' are shown. In the clock-wise order: original picture, most confident world, and three possible worlds (gray-scale values denote the class confidence). Although at first glance the most confident world seems to be a reasonable approach, our experiments show the opposite - we can benefit from imperfect but multiple worlds. Fig. 2b shows the object's coordinates (original and Z, Y, X images in the clock-wise order), which better represent the spatial location of the objects than the image coordinates.

    Predicate            | Definition
    closeAbove(A, B)     | above(A, B) and (Ymin(B) < Ymax(A) + ε)
    closeLeftOf(A, B)    | leftOf(A, B) and (Xmin(B) < Xmax(A) + ε)
    closeInFrontOf(A, B) | inFrontOf(A, B) and (Zmin(B) < Zmax(A) + ε)
    Xaux(A, B)           | Xmean(A) < Xmax(B) and Xmin(B) < Xmean(A)
    Zaux(A, B)           | Zmean(A) < Zmax(B) and Zmin(B) < Zmean(A)
    haux(A, B)           | closeAbove(A, B) or closeBelow(A, B)
    vaux(A, B)           | closeLeftOf(A, B) or closeRightOf(A, B)
    daux(A, B)           | closeInFrontOf(A, B) or closeBehind(A, B)
    leftOf(A, B)         | Xmean(A) < Xmean(B)
    above(A, B)          | Ymean(A) < Ymean(B)
    inFrontOf(A, B)      | Zmean(A) < Zmean(B)
    on(A, B)             | closeAbove(A, B) and Zaux(A, B) and Xaux(A, B)
    close(A, B)          | haux(A, B) or vaux(A, B) or daux(A, B)

Table 1: Predicates defining spatial relations between A and B; the first eight are auxiliary relations, the last five are the actual spatial relations. The Y axis points downwards, functions Xmax, Xmin, ... take the appropriate values from the tuple predicate, and ε is a 'small' amount. Symmetrical relations such as rightOf, below, behind, etc. can readily be defined in terms of other relations (i.e. below(A, B) = above(B, A)).

We realize that the skilled use of spatial relations is a complex task and grounding spatial relations is a research thread on its own (e.g. [17], [18] and [19]). For our purposes, we focus on the predefined relations shown in Table 1, while the association of them as well as the object classes are still dealt with in the question answering architecture.

Multi-worlds approach for combining uncertain visual perception and symbolic reasoning: Up to now we have considered the output of the semantic segmentation as "hard facts", and hence ignored uncertainty in the class labeling. Every such labeling of the segments corresponds to a different interpretation of the scene - a different perceived world. Drawing on ideas from probabilistic databases [14], we propose a multi-world approach (Figure 1, lower part) that marginalizes over multiple possible worlds W - multiple interpretations of a visual scene - derived from the segmentation S. Therefore the posterior over the answer A given question Q and semantic segmentation S of the image marginalizes over the latent worlds W and logical forms T:

    P(A | Q, S) = Σ_W Σ_T P(A | W, T) P(W | S) P(T | Q)    (2)

The semantic segmentation of the image is a set of segments s_i with associated probabilities p_ij over the C object categories c_j. More precisely S = {(s_1, L_1), (s_2, L_2), ..., (s_k, L_k)} where L_i = {(c_j, p_ij)}_{j=1}^{C}, P(s_i = c_j) = p_ij, and k is the number of segments of a given image. Let Ŝ_f = ((s_1, c_f(1)), (s_2, c_f(2)), ..., (s_k, c_f(k)))
be an assignment of the categories to segments of the image according to the binding function f ∈ F = {1, ..., C}^{1, ..., k}. With such notation, for a fixed binding function f, a world W is a set of tuples consistent with Ŝ_f, and we define P(W | S) = ∏_i p_{i, f(i)}. Hence we have as many possible worlds as binding functions, that is C^k. Eq. 2 quickly becomes intractable for k and C seen in practice, wherefore we use a sampling strategy that draws a finite sample W̃ = (W_1, W_2, ..., W_N) from P(· | S) under the assumption that for each segment s_i every object category c_j is drawn independently according to p_ij. A few sampled perceived worlds are shown in Figure 2a.

Regarding computational efficiency, computing Σ_T P(A | W_i, T) P(T | Q) can be done independently for every W_i, and therefore in parallel without any need for synchronization. Since for small N the computational cost of summing up the computed probabilities is marginal, the overall cost is about the same as a single inference modulo parallelism.

The presented multi-world approach to question answering on real-world scenes is still an end-to-end architecture that is trained solely on the question-answer pairs.

Figure 3: NYU-Depth V2 dataset: image, Z axis, ground truth and predicted semantic segmentations.

    Description         | Template                                          | Example
    individual:
    counting            | How many {object} are in {image_id}?              | How many cabinets are in image1?
    counting and colors | How many {color} {object} are in {image_id}?      | How many gray cabinets are in image1?
    room type           | Which type of the room is depicted in {image_id}? | Which type of the room is depicted in image1?
    superlatives        | What is the largest {object} in {image_id}?       | What is the largest object in image1?
    set:
    counting and colors | How many {color} {object}?                        | How many black bags?
    negations type 1    | Which images do not have {object}?                | Which images do not have sofa?
    negations type 2    | Which images are not {room_type}?                 | Which images are not bedroom?
    negations type 3    | Which images have {object} but do not have a {object}? | Which images have desk but do not have a lamp?
Table 2: Synthetic question-answer pairs. The questions can be about individual images or about sets of images.

Implementation and Scalability: For worlds containing many facts and spatial relations the induction step becomes computationally demanding as it considers all pairs of the facts (we have about 4 million predicates in the worst case). Therefore we use a batch-based approximation in such situations. Every image induces a set of facts that we call a batch of facts. For every test image, we find the k nearest neighbors in the space of training batches with a boolean variant of TF.IDF to measure similarity [20]. This is equivalent to building a training world from the k images with most similar content to the perceived world of the test image. We use k = 3 and 25 worlds in our experiments. The dataset and the source code can be found on our website¹.

4 Experiments

4.1 DAtaset for QUestion Answering on Real-world images (DAQUAR)

Images and Semantic Segmentation: Our new dataset for question answering is built on top of the NYU-Depth V2 dataset [6]. NYU-Depth V2 contains 1449 RGBD images together with annotated semantic segmentations (Figure 3) where every pixel is labeled with some object class and a confidence score. Originally 894 classes are considered. Following [15], we preprocess the data to obtain canonical views of the scenes and use X, Y, Z coordinates from the depth sensor to define the spatial placement of the objects in 3D. To investigate the impact of uncertainty in the visual analysis of the scenes, we also employ computer vision techniques for automatic semantic segmentation. We use a state-of-the-art scene analysis method [15] which maps every pixel into 40 classes: 37 informative object classes as well as 'other structure', 'other furniture' and 'other prop'. We ignore the latter three. We use the same data split as [15]: 795 training and 654 test images. To use our spatial representation on the image content, we fit 3d cuboids to the segmentations.

New dataset of questions and answers: In the spirit of a visual turing test, we collect question-answer pairs from human annotators for the NYU dataset. In our work, we consider two types of annotations: synthetic and human. The synthetic question-answer pairs are automatically generated question-answer pairs, which are based on the templates shown in Table 2. These templates are then instantiated
with facts from the database. To collect 12468 human question-answer pairs we ask 5 in-house participants to provide questions and answers. They were instructed to give valid answers that are either basic colors [16], numbers or objects (894 categories) or sets of those. Besides the answers, we don't impose any constraints on the questions. We also don't correct the questions as we believe that the semantic parsers should be robust under human errors. Finally, we use 6794 training and 5674 test question-answer pairs - about 9 pairs per image on average (8.63, 8.75)².

¹ https://www.d2.mpi-inf.mpg.de/visual-turing-challenge
² Our notation (x, y) denotes mean x and trimean y. We use Tukey's trimean (1/4)(Q1 + 2Q2 + Q3), where Qj denotes the j-th quartile [21]. This measure combines the benefits of both median (robustness to the extremes) and empirical mean (attention to the hinge values).

The database exhibits some biases showing that humans tend to focus on a few prominent objects. For instance we have more than 400 occurrences of table and chair in the answers. On average the object's category occurs (14.25, 4) times in the training set and (22.48, 5.75) times in total. Figure 4 shows example question-answer pairs together with the corresponding image that illustrate some of the challenges captured in this dataset.

Performance Measure: While the quality of an answer that the system produces can be measured in terms of accuracy w.r.t. the ground truth (correct/wrong), we propose, inspired by the work on Fuzzy Sets [22], a soft measure based on the WUP score [23], which we call the WUPS (WUP Set) score. As the number of classes grows, the semantic boundaries between them become more fuzzy. For example, both concepts 'carton' and 'box' have similar meaning, or 'cup' and 'cup of coffee' are almost indifferent. Therefore we seek a metric that measures the quality of an answer and penalizes naive solutions where the architecture outputs too many or too few answers. Standard Accuracy is defined as:

    (1/N) Σ_{i=1}^{N} 1{A_i = T_i} · 100

where A_i, T_i are the i-th answer and ground truth respectively. Since both answers may include more than one object, it is beneficial to represent them as sets of objects T = {t_1, t_2, ...}. From this point of view we have for every i ∈ {1, 2, ..., N}:

    1{A_i = T_i} = 1{A_i ⊆ T_i ∧ T_i ⊆ A_i} = min{ 1{A_i ⊆ T_i}, 1{T_i ⊆ A_i} }    (3)
                 = min{ ∏_{a ∈ A_i} 1{a ∈ T_i}, ∏_{t ∈ T_i} 1{t ∈ A_i} } ≈ min{ ∏_{a ∈ A_i} μ(a ∈ T_i), ∏_{t ∈ T_i} μ(t ∈ A_i) }    (4)

We use a soft equivalent of the intersection operator in Eq. 3, and a set membership measure μ, with properties μ(x ∈ X) = 1 if x ∈ X, μ(x ∈ X) = max_{y ∈ X} μ(x = y) and μ(x = y) ∈ [0, 1], in Eq. 4 with equality whenever μ = 1. For μ we use a variant of the Wu-Palmer similarity [23, 24]. WUP(a, b) calculates similarity based on the depth of the two words a and b in the taxonomy [25, 26], and we define the WUPS score:

    WUPS(A, T) = (1/N) Σ_{i=1}^{N} min{ ∏_{a ∈ A_i} max_{t ∈ T_i} WUP(a, t), ∏_{t ∈ T_i} max_{a ∈ A_i} WUP(a, t) } · 100    (5)

Empirically, we have found that in our task a WUP score of around 0.9 is required for precise answers. Therefore we down-weight WUP(a, b) by one order of magnitude (0.1 · WUP) whenever WUP(a, b) < t for a threshold t. We plot a curve over thresholds t ranging from 0 to 1 (Figure 5). "WUPS at 0" refers to the most 'forgivable' measure without any down-weighting and "WUPS at 1.0" corresponds to plain accuracy. Figure 5 benchmarks architectures by requiring answers with precision ranging from low to high. Here we show some examples of the pure WUP score to give intuitions about the range: WUP(curtain, blinds) = 0.94, WUP(carton, box) = 0.94, WUP(stove, fire extinguisher) = 0.82.

4.2 Quantitative results

We perform a series of experiments to highlight particular challenges like uncertain segmentations, unknown true logical forms, and some linguistic phenomena, as well as to show the advantages of our proposed multi-world approach. In particular, we distinguish between experiments on synthetic question-answer pairs (SynthQA) based on templates and those collected by annotators (HumanQA), automatic scene segmentation (AutoSeg) with a computer vision algorithm [15] and human segmentations (HumanSeg) based on the ground-truth annotations in the NYU dataset, as well as single world (single) and multi-world (multi) approaches.

4.2.1 Synthetic question-answer pairs (SynthQA)

Based on human segmentations (HumanSeg, 37 classes) (1st and 2nd rows in Table 3): this setting uses automatically generated questions (we use the templates shown in Table 2) and human segmentations. We have generated 20 training and 40 test question-answer pairs per template category, in total 140 training and 280 test pairs (as an exception, negations type 1 and 2 have 10 training and 20 test examples each). This experiment shows how the architecture generalizes across similar types of questions provided that we have human annotation of the image segments. We have further removed negations of type 3 in the experiments as they have turned out to be particularly computationally demanding. Performance increases hereby from 56% to 59.9% with about 80% training Accuracy. Since some incorrect derivations give correct answers, the semantic parser learns wrong associations. Other difficulties stem from the limited training data and unseen object categories during training.

Based on automatic segmentations (AutoSeg, 37 classes, single) (3rd row in Table 3): this setting tests the architecture on uncertain facts obtained from automatic semantic segmentation [15] where the most likely object labels are used to create a single world. Here, we experience a severe drop in performance from 59.9% to 11.25% by switching from human to automatic segmentation. Note that there are only 37 classes available to us. This result suggests that the vision part is a serious bottleneck of the whole architecture.

Based on automatic segmentations using multi-world approach (AutoSeg, 37 classes, multi) (4th row in Table 3): this setting shows the benefits of using our multiple worlds approach to predict the answer. Here we recover part of the lost performance by an explicit treatment of the uncertainty in the segmentations. Performance increases from 11.25% to 13.75%.

4.3 Human question-answer pairs (HumanQA)

Based on human segmentations, 894 classes (HumanSeg, 894 classes) (1st row in Table 4): here we switch to human generated question-answer pairs. The increase in complexity is twofold. First, the human annotations exhibit more variation than the synthetic approach based on templates. Second, the questions are typically longer and include more spatially related objects. Figure 4 shows a few samples from our dataset that highlight challenges including complex and nested spatial references and the use of reference frames. We yield an accuracy of 7.86% in this scenario. As argued above, we also evaluate the experiments on the human data under the softer WUPS scores given different thresholds (Table 4 and Figure 5). In order to put these numbers in perspective, we also show performance numbers for two simple methods: predicting the most popular answer yields 4.4% Accuracy, and our untrained architecture gives 0.18% and 1.3% Accuracy and WUPS (at 0.9).

Based on human segmentations, 37 classes (HumanSeg, 37 classes) (2nd row in Table 4): this setting uses human segmentation and question-answer pairs. Since only 37 classes are supported by our automatic segmentation algorithm, we run on a subset of the whole dataset. We choose the 25 test images, yielding a total of 286 question-answer pairs for the following experiments. This yields 12.47% and 15.89% Accuracy and WUPS at 0.9 respectively.

Based on automatic segmentations (AutoSeg, 37 classes) (3rd row in Table 4): switching from the human segmentations to the automatic ones again yields a drop from 12.47% to 9.69% in Accuracy, and we observe a similar trend for the whole spectrum of the WUPS scores.

Based on automatic segmentations using multi-world approach (AutoSeg, 37 classes, multi) (4th row in Table 4): similar to the synthetic experiments, our proposed multi-world approach yields an improvement across all the measures that we investigate.

Human baseline (5th and 6th rows in Table 4 for 894 and 37 classes): this shows human predictions on our dataset. We ask independent annotators to provide answers to the questions we have collected. They are instructed to answer with a number, basic colors [16], or objects (from 37 or 894 categories) or sets of those. This performance gives a practical upper bound for the question-answering algorithms with an accuracy of 60.27% for the 37 class case and 50.20% for the 894 class case. We also ask them to compare the answers of the AutoSeg single world approach with the HumanSeg single world and AutoSeg multi-worlds methods. We use a two-sided binomial test to check if the difference in preferences is statistically significant. As a result AutoSeg single world is the least preferred method with a p-value below 0.01 in both cases. Hence the human preferences are aligned with our accuracy measures in Table 4.

4.4 Qualitative results

We choose examples in Fig. 6 to illustrate different failure cases - including the last example where all methods fail. Since our multi-world approach generates different sets of facts about the perceived worlds, we observe a trend towards a better representation of high level concepts like 'counting' (leftmost figure) as well as language associations. A substantial part of the incorrect answers is attributed to missing segments, e.g. no pillow detection in the third example in Fig. 6.

5 Summary

We propose a system and a dataset for question answering about real-world scenes that is reminiscent of a visual turing test. Despite the complexity of uncertain visual perception, language understanding and program induction, our results indicate promising progress in this direction. We bring together ideas from automatic scene analysis and semantic parsing with symbolic reasoning, and combine them under a multi-world approach. As we have mature techniques in machine learning, computer vision, natural language processing and deduction at our disposal, it seems timely to bring these disciplines together on this open challenge.

Figure 4: Examples of human generated question-answer pairs illustrating the associated challenges. In the descriptions we use the following notation: 'A' - answer, 'Q' - question, 'QA' - question-answer pair. The last two examples (bottom-right column) are from the extended dataset not used in our experiments.

Figure 5: WUPS scores for different thresholds.

    Synthetic question-answer pairs (SynthQA)
    Segmentation | World(s)           | # classes | Accuracy
    HumanSeg     | Single with Neg. 3 | 37        | 56.0%
    HumanSeg     | Single             | 37        | 59.5%
    AutoSeg      | Single             | 37        | 11.25%
    AutoSeg      | Multi              | 37        | 13.75%

Table 3: Accuracy results for the experiments with synthetic question-answer pairs.

    Human question-answer pairs (HumanQA)
    Segmentation   | World(s) | # classes | Accuracy | WUPS at 0.9 | WUPS at 0
    HumanSeg       | Single   | 894       | 7.86%    | 11.86%      | 38.79%
    HumanSeg       | Single   | 37        | 12.47%   | 16.49%      | 50.28%
    AutoSeg        | Single   | 37        | 9.69%    | 14.73%      | 48.57%
    AutoSeg        | Multi    | 37        | 12.73%   | 18.10%      | 51.47%
    Human Baseline |          | 894       | 50.20%   | 50.82%      | 67.27%
    Human Baseline |          | 37        | 60.27%   | 61.04%      | 78.96%

Table 4: Accuracy and WUPS scores for the experiments with human question-answer pairs. We show WUPS scores at two opposite sides of the WUPS spectrum.
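The sampling strategy behind the multi-world numbers in Tables 3 and 4 can be sketched in a few lines. This is a minimal illustration of Eq. (2) under the stated independence assumption, not the released code: `answer_in_world` is a hypothetical stand-in for the full semantic-parsing and inference pipeline, which here we simply treat as a black box that answers a question given one fixed world.

```python
import random
from collections import Counter

def sample_worlds(segments, n_worlds, seed=0):
    """Draw N worlds from P(W|S): for every segment, a class label is
    sampled independently according to its confidence distribution p_ij."""
    rng = random.Random(seed)
    worlds = []
    for _ in range(n_worlds):
        world = []
        for seg_id, class_probs in segments.items():
            labels, weights = zip(*class_probs.items())
            world.append((seg_id, rng.choices(labels, weights=weights)[0]))
        worlds.append(tuple(world))
    return worlds

def multi_world_answer(question, segments, answer_in_world, n_worlds=25):
    """Monte-Carlo approximation of Eq. (2): average the per-world answers,
    P(A|Q,S) ~ (1/N) sum_i P(A|Q,W_i), and return the most likely answer."""
    votes = Counter(
        answer_in_world(question, world)
        for world in sample_worlds(segments, n_worlds)
    )
    return votes.most_common(1)[0][0]
```

Since each call to `answer_in_world` is independent, the per-world inferences could run in parallel, matching the cost argument made in the text; N = 25 corresponds to the 25 worlds used in the experiments.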
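The WUPS measure of Eqs. (3)-(5) can likewise be sketched. This is one reasonable reading of the definition, not the official evaluation script: `similarity` is a caller-supplied word-similarity function (e.g. Wu-Palmer similarity over WordNet), answers are represented as sets of strings, and per-pair scores below the threshold are down-weighted by a factor of 0.1 as described above.

```python
def wups_score(answers, truths, similarity, threshold=0.9):
    """Soft set-based accuracy in the spirit of Eq. (5): exact membership
    is relaxed to a word-similarity score in [0, 1], and similarities
    below `threshold` are down-weighted by one order of magnitude."""
    def mu(a, b):
        s = similarity(a, b)
        return s if s >= threshold else 0.1 * s

    total = 0.0
    for A, T in zip(answers, truths):
        # Product over predicted answers of their best match in the truth set ...
        left = 1.0
        for a in A:
            left *= max(mu(a, t) for t in T)
        # ... and vice versa; min of the two penalizes over- and under-answering.
        right = 1.0
        for t in T:
            right *= max(mu(t, a) for a in A)
        total += min(left, right)
    return 100.0 * total / len(answers)
```

With an exact-match similarity (1 if the words are equal, 0 otherwise), the measure reduces to plain set accuracy, which is the "WUPS at 1.0" end of the spectrum plotted in Figure 5.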
Figure 6: Questions and predicted answers. Notation: 'Q' - question, 'H' - architecture based on human segmentation, 'M' - architecture with multiple worlds, 'C' - most confident architecture, '()' - no answer. Red color denotes the correct answer.

References

[1] Liang, P., Jordan, M.I., Klein, D.: Learning dependency-based compositional semantics. Computational Linguistics (2013)
[2] Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., Steedman, M.: Inducing probabilistic ccg grammars from logical form with higher-order unification. In: EMNLP. (2010)
[3] Zettlemoyer, L.S., Collins, M.: Online learning of relaxed ccg grammars for parsing to logical form. In: EMNLP-CoNLL. (2007)
[4] Matuszek, C., Fitzgerald, N., Zettlemoyer, L., Bo, L., Fox, D.: A joint model of language and perception for grounded attribute learning. In: ICML. (2012)
[5] Krishnamurthy, J., Kollar, T.: Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL (2013)
[6] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV. (2012)
[7] Kong, C., Lin, D., Bansal, M., Urtasun, R., Fidler, S.: What are you talking about? text-to-image coreference. In: CVPR. (2014)
[8] Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS. (2014)
[9] Matuszek, C., Herbst, E., Zettlemoyer, L., Fox, D.: Learning to parse natural language commands to a robot control system. In: Experimental Robotics. (2013)
[10] Levit, M., Roy, D.: Interpretation of spatial language in a map navigation task. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (2007)
[11] Vogel, A., Jurafsky, D.: Learning to follow navigational directions. In: ACL. (2010)
[12] Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S.J., Roy, N.: Understanding natural language commands for robotic navigation and mobile manipulation. In: AAAI. (2011)
[13] Kruijff, G.J.M., Zender, H., Jensfelt, P., Christensen, H.I.: Situated dialogue and spatial organization: What, where... and why. IJARS (2007)
[14] Wick, M., McCallum, A., Miklau, G.: Scalable probabilistic databases with factor graphs and mcmc. In: VLDB. (2010)
[15] Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from rgb-d images. In: CVPR. (2013)
[16] Van De Weijer, J., Schmid, C., Verbeek, J.: Learning color names from real-world images. In: CVPR. (2007)
[17] Regier, T., Carlson, L.A.: Grounding spatial language in perception: an empirical and computational investigation. Journal of Experimental Psychology: General (2001)
[18] Lan, T., Yang, W., Wang, Y., Mori, G.: Image retrieval with structured object queries using latent ranking svm. In: ECCV. (2012)
[19] Guadarrama, S., Riano, L., Golland, D., Gouhring, D., Jia, Y., Klein, D., Abbeel, P., Darrell, T.: Grounding spatial relations for human-robot interaction. In: IROS. (2013)
[20] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. Cambridge University Press (2008)
[21] Tukey, J.W.: Exploratory data analysis. (1977)
[22] Zadeh, L.A.: Fuzzy sets. Information and Control (1965)
[23] Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: ACL. (1994)
[24] Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: ICCV. (2013)
[25] Miller, G.A.: Wordnet: a lexical database for english. CACM (1995)
[26] Fellbaum, C.: WordNet. Wiley Online Library (1999)