



Recognizing Indoor Scenes

Ariadna Quattoni, CSAIL, MIT; UC Berkeley EECS & ICSI; ariadna@csail.mit.edu
Antonio Torralba, CSAIL, MIT, 32 Vassar St., Cambridge, MA 02139; torralba@csail.mit.edu

Abstract

Indoor scene recognition is a challenging open problem in high level vision. Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain. The main difficulty is that while some indoor scenes (e.g., corridors) can be well characterized by global spatial properties, others (e.g., bookstores) are better characterized by the objects they contain. More generally, to address the indoor scene recognition problem we need a model that can exploit local and global discriminative information. In this paper we propose a prototype based model that can successfully combine both sources of information. To test our approach we created a dataset of 67 indoor scene categories (the largest available) covering a wide range of domains. The results show that our approach can significantly outperform a state of the art classifier for the task.

1. Introduction

There are a number of approaches devoted to scene recognition that have been shown to be particularly successful in recognizing outdoor scenes. However, when these approaches are tested on indoor scene categories the results drop dramatically for most common indoor scenes. Fig. 1 shows results of a variety of state of the art scene recognition algorithms applied to a dataset of fifteen scene categories [9, 3, 7]. Common to all the approaches compared in this graph is their lower performance on indoor categories (RAW: 26.5%, Gist: 62.9%, Sift: 61.9%) in comparison with the performance achieved on the outdoor categories (RAW: 32.6%, Gist: 78.1%, Sift: 79.1%).¹

¹ Note that the performances differ from the ones reported in [7]. The difference is that here we have cropped all the images to be square and with 256 × 256 pixels. The original dataset has images of different resolutions and aspect ratios that correlate with the categories, providing non-visual discriminant cues.
Figure 1. Comparison of Spatial Sift and Gist features for a scene recognition task. Both sets of features have a strong correlation in the performance across the 15 scene categories. Average performance for the different features is: Gist: 73.0%, Pyramid matching: 73.4%, bag of words: 64.1%, and color pixels (SSD): 30.6%. In all cases we use an SVM. (Bar chart of percent correct, 0 to 100, for visual words (SIFT) and raw RGB image; categories: suburb, coast, forest, highway, inside city, mountain, open country, street, tall building, office, bedroom, industrial, kitchen, livingroom, store.)

There is some previous work devoted to the task of indoor scene recognition (e.g., [15, 16]), but to the best of our knowledge none of it has dealt with the general problem of recognizing a wide range of indoor scene categories. We believe that there are two main reasons for the slow progress in this area. The first reason is the lack of a large testbed of indoor scenes in which to train and test different approaches. With this in mind we created a new dataset for indoor scene recognition consisting of 67 scenes (the largest available) covering a wide range of domains including: leisure, working place, home, stores and public spaces scene categories.

The second reason is that in order to improve indoor scene recognition performance we need to develop image representations specifically tailored for this task. The main difficulty is that while most outdoor scenes can be well characterized by global image properties this is not true of all indoor scenes. Some indoor scenes (e.g., corridors) can indeed be characterized by global spatial properties but others (e.g., bookstores) are better characterized by the objects they contain. For most indoor scenes there is a wide range of both local and global discriminative information that needs to be leveraged to solve the recognition task.

Figure 2. Summary of the 67 indoor scene categories used in our study. To facilitate seeing the variety of different scene categories considered here we have organized them into 5 big scene groups. The database contains 15620 images. All images have a minimum resolution of 200 pixels in the smallest axis.

In this paper we propose a scene recognition model specifically tailored to the task of indoor scene recognition. The main idea is to use image prototypes to define a mapping between images and scene labels that can capture the fact that images containing similar objects must have similar scene labels and that some objects are more important than others in defining a scene's identity.

Our work is related to work on learning distance functions [4, 6, 8] for visual recognition. Both methods learn to combine local or elementary distance functions. There are two main differences between their approach and ours. First, their method learns a weighted combination of elementary distance functions for each training sample by minimizing a ranking objective function. In contrast, our method learns a weighted combination of elementary distance functions for a set of prototypes by directly minimizing a classification objective. Second, while they concentrated on object recognition and image retrieval, our focus is indoor scene recognition.

This paper makes two contributions. First, we provide a unique large and diverse database for indoor scene recognition. This database consists of 67 indoor categories covering a wide range of domains. Second, we introduce a model for indoor scene recognition that learns scene prototypes similar to star-constellation models and that can successfully combine local and global image information.

2. Indoor database

In this section we describe the dataset of indoor scene categories. Most current papers on scene recognition focus on a reduced set of indoor and outdoor categories. In contrast, our dataset contains a large number of indoor scene categories. The images in the dataset were collected from different sources: online image search tools (Google and Altavista), online photo sharing sites (Flickr) and the LabelMe dataset. Fig. 2 shows the 67 scene categories used in this study. The database contains 15620 images. All images have a minimum resolution of 200 pixels in the smallest axis.

This dataset poses a challenging classification problem. As an illustration of the in-class variability in the dataset, Fig. 3 shows average images for some indoor classes. Note that these averages have very few distinctive attributes in comparison with average images for the fifteen scene categories dataset and Caltech 101 [10]. These averages suggest that indoor scene classification might be a hard task.

3. Scene prototypes and ROIs

We will start by describing our scene model and the set of features used in the rest of the paper to compute similarities between two scenes.

3.1. Prototypes and ROI

As discussed in the previous section, indoor scene categories exhibit large amounts of in-class appearance variability. Our goal will be to find a set of prototypes that best describes each class. This notion of scene prototypes has been used in previous works [11, 17].

In this paper, each scene prototype will be defined by a model similar to a constellation model. The main difference with an object model is that the root node is not allowed to move. The parts (regions of interest, ROI) are allowed to move on a small window and their displacements are independent of each other. Each prototype $T_k$ (with $k = 1, \ldots, p$) will be composed of $m_k$ ROIs that we will denote by $t_{kj}$. Fig. 4 shows an example of a prototype and a set of candidate ROIs.
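The idea that each part may shift within a small window, independently of the other parts, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `search_roi` and `match_prototype` are hypothetical, a sum of squared pixel differences stands in for the ROI descriptor distance, and the search over scales is omitted.

```python
import numpy as np

def search_roi(image, template, top_left, radius):
    """Search a template within +/- radius pixels of its original location.

    Returns (best_distance, (row_offset, col_offset)). The distance is a
    sum of squared differences, used here only as a stand-in.
    """
    th, tw = template.shape
    r0, c0 = top_left
    best = (np.inf, (0, 0))
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = r0 + dr, c0 + dc
            # Skip candidate windows that fall outside the image.
            if r < 0 or c < 0 or r + th > image.shape[0] or c + tw > image.shape[1]:
                continue
            patch = image[r:r + th, c:c + tw]
            dist = float(np.sum((patch - template) ** 2))
            if dist < best[0]:
                best = (dist, (dr, dc))
    return best

def match_prototype(image, rois, radius=3):
    """Match every ROI of a prototype; each ROI is searched independently.

    `rois` is a list of (template, (row, col)) pairs, where (row, col) is
    the ROI's top-left corner in the prototype image.
    """
    return [search_roi(image, template, loc, radius) for template, loc in rois]
```

Because the root is fixed, the search is always centred on the ROI's location in the prototype, so the cost grows with the window size rather than with the image size.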
Figure 3. Average images for a sample of the indoor scene categories. Most images within each category average to a uniform field due to the large variability within each scene category (this is in contrast with Caltech 101 or the 15 scene categories dataset [9, 3, 7]). The bottom 8 averages correspond to the few categories that have more regularity among exemplars. (Panels: airport inside, bakery, bar, bathroom, bedroom, bookstore, buffet, casino, computer room, concert hall, corridor, deli, dining room, greenhouse, inside bus, kitchen, meeting room, movie theater, pool inside, restaurant.)

In order to define a set of candidate ROIs for a given prototype, we asked a human annotator to segment the objects contained in it. Annotators segmented prototype images for each scene category, resulting in a total of 2000 manually segmented images. We used those segmentations to propose a set of candidate ROIs (we selected 10 for each prototype that occupy at least 1% of the image size).

We will also show results where instead of using human annotators to generate the candidate ROIs, we used a segmentation algorithm. In particular, we produce candidate ROIs from a segmentation obtained using graph-cuts [13].

3.2. Image descriptors

In order to describe the prototypes and the ROIs we will use two sets of features that represent the state of the art on the task of scene recognition.

We will have one descriptor that will represent the root node of the prototype image ($T_k$) globally. For this we will use the Gist descriptor using the code available online [9]. This results in a vector of 384 dimensions describing the entire image. Comparison between two Gist descriptors is computed using Euclidean distance.
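As a small illustration of the global comparison, the sketch below computes the Euclidean distance between one 384-dimensional descriptor and a stack of prototype descriptors. The function name `gist_distances` is hypothetical, and the descriptors are assumed to have been computed elsewhere; nothing here computes Gist itself.

```python
import numpy as np

def gist_distances(image_gist, prototype_gists):
    """Euclidean distance from one descriptor to each prototype descriptor.

    image_gist:      array of shape (384,)
    prototype_gists: array of shape (p, 384), one row per prototype
    Returns an array of shape (p,).
    """
    diffs = prototype_gists - image_gist  # broadcast over the p rows
    return np.sqrt(np.sum(diffs ** 2, axis=1))
```

Each entry of the returned vector is one global distance between the image and a prototype root.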
Figure 4. Example of a scene prototype. a) Scene prototype with candidate ROI. b) Illustration of the visual words and the regions used to compute histograms (ROI descriptors). c) Search window to detect the ROI in a new image (search region, annotated 80 pixels and 60 pixels).

To represent each ROI we will use a spatial pyramid of visual words. The visual words are obtained as in [14]: we create vector quantized Sift descriptors by applying K-means to a random subset of images (following [7] we used 200 clusters, i.e., visual words). Fig. 4.b shows the visual words (the color of each pixel represents the visual word to which it was assigned). Each ROI is decomposed into a 2x2 grid and histograms of visual words are computed for each window [7, 1, 12]. Distances between two regions are computed using histogram intersection as in [7].

Histograms of visual words can be computed efficiently using integral images; this results in an algorithm whose computational cost is independent of window size. The detection of a ROI on a new image is performed by searching around a small spatial window and also across a few scale changes (Fig. 4.c). We assume that if two images are similar their respective ROIs will be roughly aligned (i.e., in similar spatial locations). Therefore, we only need to perform the search around a small window relative to the original location. Fig. 5 shows three ROIs and their detections on new images. For each ROI, the figure shows best and worst matches in the dataset. The figure illustrates the variety of ROIs that we will consider: some correspond to well defined objects (e.g., bed, lamp), regions (e.g., floor, wall with paintings) or less distinctive local features (e.g., a column, a floor tile). The next section will describe the learning algorithm used to select the most informative prototypes and ROIs for each scene category.

Figure 5. Example of detection of similar image patches. The top three images correspond to the query patterns. For each image, the algorithm tries to detect the selected region on the query image. The next three rows show the top three matches for each region. The last row shows the three worst matching regions.

4. Model

4.1. Model Formulation

In scene classification our goal is to learn a mapping from images $x$ to scene labels $y$. For simplicity, in this section we assume a binary classification setting. That is, each $y_i \in \{1, -1\}$ is a binary label indicating whether an image belongs to a given scene category or not. To model the multiclass case we use the standard approach of training one-versus-all classifiers for each scene; at test, we predict the scene label for which the corresponding classifier is most confident. However, we would like to note that our model can be easily adapted to an explicit multiclass training strategy.

As a form of supervision we are given a training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of $n$ pairs of labeled images and a set $S = \{T_1, T_2, \ldots, T_p\}$ of $p$ segmented images which we call prototypes. Each prototype $T_k = \{t_1, t_2, \ldots, t_{m_k}\}$ has been segmented into $m_k$ ROIs by a human annotator. Each ROI corresponds to some object in the scene, but we do not know their labels. Our goal is to use $D$ and $S$ to learn a mapping $h : X \to \mathbb{R}$. For binary classification, we would take the prediction of an image $x$ to be $\mathrm{sign}(h(x))$; in the multiclass setting, we will use $h(x)$ directly to compare it against other class predictions.

As in most supervised learning settings, choosing an appropriate mapping $h : X \to \mathbb{R}$ becomes critical. In particular, for the scene classification problem we would like to learn a mapping that can capture the fact that images containing similar objects must have similar scene labels and that some objects are more important than others in defining a scene's identity. For example, we would like to learn that an image of a library must contain books and shelves but might or might not contain tables.

In order to define a useful mapping that can capture the essence of a scene we are going to use $S$. More specifically, for each prototype $T_k$ we define a set of feature functions:

$$f_{kj}(x) = \min_s d(t_{kj}, x_s) \qquad (1)$$

Each of these features represents the distance between a prototype ROI $t_{kj}$ and its most similar segment in $x$ (see Section 3 for more details of how these features are computed). For some scene categories global image information can be very important; for this reason we will also include a global feature $g_k(x)$, which is computed as the L2 norm between the Gist representation of image $x$ and the Gist representation of prototype $k$. We can then combine all these feature functions to define a global mapping:

$$h(x) = \sum_{k=1}^{p} \beta_k \exp\left(-\sum_{j=1}^{m_k} \lambda_{kj} f_{kj}(x) - \lambda_{kG}\, g_k(x)\right) \qquad (2)$$

In the above formulation $\beta$ and $\lambda$ are the two parameter sets of our model. Intuitively, each $\beta_k$ represents how relevant the similarity to a prototype $k$ is for predicting the scene label. Similarly, each $\lambda_{kj}$ captures the importance of a particular ROI inside a given prototype. We can now use the mapping $h$ to define the standard regularized classification objective:

$$L(\beta, \lambda) = \sum_{i=1}^{n} l(h(x_i), y_i) + C_b \|\beta\|^2 + C_l \|\lambda\|^2 \qquad (3)$$

The left term of Equation 3 measures the error that the classifier incurs on training examples $D$ in terms of a loss function $l$. In this paper we use the hinge loss, given by $l(h(x), y) = \max(0, 1 - y\,h(x))$, but other losses such as logistic loss could be used instead. The right hand terms of Equation 3 are regularization terms and the constants $C_b$ and $C_l$ dictate the amount of regularization in the model.

Finally, we introduce non-negativity constraints on the $\lambda$. Since each $f_{kj}$ is a distance between image ROIs, these constraints ensure that their linear combination is also a global distance between a prototype and an image. This eases the interpretability of the results. Note that this global distance is used to induce a similarity measure in the classifier $h$.

4.2. Learning

In this section we describe how to estimate the model parameters $\{\beta^*, \lambda^*\} = \arg\min_{\beta,\, \lambda \geq 0} L(\beta, \lambda)$ from a training set $D$. The result of the learning stage will be the selection of the relevant prototypes for each class and the ROIs that should be used for each prototype.

We use an alternating optimization strategy, which consists of a series of iterations that optimize one set of parameters given fixed values for the others. Initially the parameters are set to random values, and the process iterates between fixing $\beta$ and minimizing $L$ with respect to $\lambda$, and fixing $\lambda$ and minimizing $L$ with respect to $\beta$.

We use a gradient-based method for each optimization step. Since our objective is non-differentiable because of the hinge loss, we use a sub-gradient of the objective, which we compute as follows.

Given parameter values, let $\Delta$ be the set of indices of examples in $D$ that attain non-zero loss. Also, to simplify notation assume that parameter $\lambda_{k0}$ and feature $f_{k0}$ correspond to $\lambda_{kG}$ and $g_k$ respectively, so that the sums over $j$ below start at $j = 0$. The subgradient with respect to $\beta$ is given by:

$$\frac{\partial L}{\partial \beta_k} = -\sum_{i \in \Delta} y_i \exp\left(-\sum_{j=0}^{m_k} \lambda_{kj} f_{kj}(x_i)\right) + 2 C_b \beta_k$$

and the subgradient with respect to $\lambda$ is given by:

$$\frac{\partial L}{\partial \lambda_{kj}} = \sum_{i \in \Delta} y_i\, \beta_k\, f_{kj}(x_i) \exp\left(-\sum_{j'=0}^{m_k} \lambda_{kj'} f_{kj'}(x_i)\right) + 2 C_l \lambda_{kj}$$

To enforce the non-negativity constraints on the $\lambda$ we combine sub-gradient steps with projections onto the non-negative orthant. In practice we observed that this is a simple and efficient method to solve the constrained optimization step.

5. Experiments

In this section we present experiments for indoor scene recognition performed on the dataset described in Section 2. We show that the model and representation proposed in this paper give significant improvement over a state of the art model for this task. We also perform experiments using different versions of our model and compare manual segmentations to segmentations obtained by running a segmentation algorithm.

In all cases the performance metric is the standard average multiclass prediction accuracy. This is calculated as the mean over the diagonal values of the confusion matrix. An advantage of this metric with respect to plain multiclass accuracy is that it is less sensitive to unbalanced distributions of classes. For all experiments we trained a one versus all classifier for each of the 67 scenes and combined their scores into a single prediction by taking the scene label with maximum confidence score. Other approaches are possible for combining the predictions of the different classifiers.

Figure 6. Multiclass average precision performance for the baseline and four different versions of our model.

We start by describing the four different variations of our model that were tested in these experiments. In a first setting we used the ROIs obtained from the manually annotated images and restricted the model to use local information only by removing the $g_k(x)$ features (ROI Annotation). In a second setting we allowed the model to use both local and global features (ROI+Gist Annotation). In a third setting we utilized the ROIs obtained by running a segmentation algorithm and restricted the model to use local information only (ROI Segmentation). Finally, in the fourth setting we used the ROIs obtained from the automatic segmentation but allowed the model to exploit both local and global features (ROI+Gist Segmentation). All these models were trained with 331 prototypes.

We also compared our approach with a state of the art model for this task. For this we trained an SVM with a Gist representation and an RBF kernel (Gist SVM). In principle other features could have been used for this baseline but, as was shown in Fig. 1, Gist is one of the most competitive representations for this task.

To train all the models we used 80 images of each class for training and 20 images for testing. To train a one versus all classifier for category $d$ we sample $n$ positive examples and $3n$ negative examples.

6. Results

Figure 6 shows the average multiclass accuracy for the five models: Gist SVM, ROI Segmentation, ROI Annotation, ROI+Gist Segmentation and ROI+Gist Annotation.

As we can see from this figure, combining local and global information leads to better performance. This suggests that both local and global information are useful for the indoor scene recognition task. Notice also that using automatic segmentations instead of manual segmentations causes only a small drop in performance.

Figure 7. The 67 indoor categories sorted by multiclass average precision (training with 80 images per class; testing on 20 images per class). Values as shown in the figure: church inside 63.2%; elevator 61.9%; auditorium 55.6%; buffet 55.0%; classroom 50.0%; greenhouse 50.0%; bowling 45.0%; cloister 45.0%; concert hall 45.0%; computerroom 44.4%; dentaloffice 42.9%; library 40.0%; inside bus 39.1%; closet 38.9%; corridor 38.1%; grocerystore 38.1%; locker room 38.1%; florist 36.8%; studiomusic 36.8%; hospitalroom 35.0%; nursery 35.0%; trainstation 35.0%; bathroom 33.3%; laundromat 31.8%; stairscase 30.0%; garage 27.8%; gym 27.8%; tv studio 27.8%; videostore 27.3%; gameroom 25.0%; pantry 25.0%; poolinside 25.0%; inside subway 23.8%; kitchen 23.8%; winecellar 23.8%; fastfood restaurant 23.5%; bar 22.2%; clothingstore 22.2%; casino 21.1%; deli 21.1%; bookstore 20.0%; waitingroom 19.0%; dining room 16.7%; bakery 15.8%; livingroom 15.0%; movietheater 15.0%; bedroom 14.3%; toystore 13.6%; operating room 10.5%; airport inside 10.0%; artstudio 10.0%; lobby 10.0%; prison cell 10.0%; hairsalon 9.5%; subway 9.5%; warehouse 9.5%; meeting room 9.1%; children room 5.6%; shoeshop 5.3%; kindergarden 5.0%; restaurant 5.0%; museum 4.3%; restaurant kitchen 4.3%; jewelleryshop 0.0%; laboratorywet 0.0%; mall 0.0%; office 0.0%.

Figure 7 shows the sorted accuracies for each class for the ROI+Gist Segmentation model. Interestingly, five of the categories (greenhouse, computer-room, inside-bus, corridor and pool-inside) for which we observed some global regularity (see Fig. 3) are ranked among the top half best performing categories. But among this top half we also find four categories (buffet, bathroom, concert hall, kitchen) for which we observed no global regularity. Figure 8 shows ranked images for a random subset of scene categories for the ROI+Gist Segmentation model.

Figure 8. Classified images for a subset of scene categories for the ROI+Gist Segmentation model. Each row corresponds to a scene category. The name on top of each image denotes the ground truth category. The number in parenthesis is the classification confidence. The first three columns correspond to the highest confidence scores. The next five columns show 5 images from the test set sampled so that they are at equal distance from each other in the ranking provided by the classifier. The goal is to show which images/classes are near and far away from the decision boundary.

Figure 9 shows the top and bottom prototypes selected for a subset of the categories. We can see from these results that the model leverages both global and local information at different scales.

Figure 9. Prototypes for a subset of scene categories, sorted by their weight. The first 7 columns correspond to the highest ranked prototypes and the last two columns show prototypes with the most negative weights. The thickness of each bounding box is proportional to the value of the weight for each ROI: $\lambda_{kj}$.

One question that we might ask is: how is the performance of the proposed model affected by the number of prototypes used? To answer this question we tested the performance of a version of our model that used global information only for different numbers of prototypes (1 to 200). We observed a logarithmic growth of the average precision as a function of the number of prototypes. This suggests that by allowing the model to exploit more prototypes we might be able to further improve the performance.

In summary, we have shown the importance of combining both local and global image information for indoor scene recognition. The model that we proposed leverages both and it can outperform a state of the art classifier for the task. In addition, our results let us conclude that using automatic segmentations is similar to using manual segmentations and thus our model can be trained with a minimum amount of supervision.

7. Conclusion

We have shown that the algorithms that constitute the current state of the art on the 15 scene categorization task [9, 3, 7] perform very poorly at the indoor recognition task. Indoor scene categorization represents a very challenging task due to the large variability across different exemplars within each class. This is not the case with many outdoor scene categories (e.g., beach, street, plaza, parking lot, field, etc.), which are easier to discriminate and for which several image descriptors have been shown to perform very well. Outdoor scene recognition, despite being a challenging task, has reached a degree of maturity that has allowed the emergence of several applications in computer vision (e.g., [16]) and computer graphics (e.g., [5]). However, most of those works have avoided dealing with indoor scenes as performances generally drop dramatically.

The goal of this paper is to draw the attention of the computer vision community working on scene recognition to this important class of scenes for which current algorithms seem to perform poorly. In this paper we have proposed a representation able to outperform representations that are the current state of the art on scene categorization. However, the performances presented in this paper are close to the performance of the first attempts on Caltech 101 [2].

Acknowledgment

Funding for this research was provided by National Science Foundation Career award (IIS 0747120).

References

[1] A. Bosch, A. Zisserman, and X. Munoz. Image classification using ROIs and multiple kernel learning. Intl. J. Computer Vision, 2008.
[2] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[3] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, pages 524–531, 2005.
[4] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, Oct. 2007.
[5] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics, 26, 2007.
[6] J. Krapac. Learning distance functions for automatic annotation of images. In Adaptive Multimedial Retrieval: Retrieval, User, and Semantics, 2008.
[7] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[8] T. Malisiewicz and A. A. Efros. Recognition by association via learning per-exemplar distances. In CVPR, June 2008.
[9] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal in Computer Vision, 42:145–175, 2001.
[10] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman. Dataset issues in object recognition. In Toward Category-Level Object Recognition, pages 29–48. Springer, 2006.
[11] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
[12] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
[13] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 1997.
[14] J. Sivic and A. Zisserman. Video Google: Efficient visual search of videos. In J. Ponce, M. Hebert, C. Schmid, and A. Zisserman, editors, Toward Category-Level Object Recognition, volume 4170 of LNCS, pages 127–144. Springer, 2006.
[15] M. Szummer and R. W. Picard. Indoor-outdoor image classification. In CAIVD '98: Proceedings of the 1998 International Workshop on Content-Based Access of Image and Video Databases (CAIVD '98), page 42, Washington, DC, USA, 1998. IEEE Computer Society.
[16] A. Torralba, K. Murphy, W. Freeman, and M. Rubin. Context-based vision system for place and object recognition. In Intl. Conf. Computer Vision, 2003.
[17] A. Torralba, A. Oliva, M. S. Castelhano, and J. M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4):766–786, October 2006.
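The prototype model of Section 4 (the mapping $h$, the hinge-loss objective with squared-norm regularizers, and projected sub-gradient updates) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: all function names are hypothetical, every prototype is assumed to have the same number $m$ of features (with index 0 holding the Gist distance), the distances are assumed to be precomputed in an array `F` of shape `(n, p, m)`, and a fixed step size `lr` is assumed. The regularizer gradients are taken directly as the derivatives of $C_b\|\beta\|^2$ and $C_l\|\lambda\|^2$.

```python
import numpy as np

def scores(F, beta, lam):
    """Return h(x_i) for every image and the per-prototype similarities.

    F: (n, p, m) distances, beta: (p,), lam: (p, m) with lam >= 0.
    """
    sim = np.exp(-np.einsum("ikj,kj->ik", F, lam))  # (n, p)
    return sim @ beta, sim

def objective(F, y, beta, lam, Cb, Cl):
    """Hinge loss summed over examples plus the two regularizers."""
    h, _ = scores(F, beta, lam)
    hinge = np.maximum(0.0, 1.0 - y * h)
    return hinge.sum() + Cb * np.sum(beta ** 2) + Cl * np.sum(lam ** 2)

def subgradients(F, y, beta, lam, Cb, Cl):
    """Sub-gradients of the objective with respect to beta and lam."""
    h, sim = scores(F, beta, lam)
    active = (1.0 - y * h) > 0  # examples with non-zero hinge loss
    ya = y[active]
    g_beta = -(ya[:, None] * sim[active]).sum(axis=0) + 2.0 * Cb * beta
    g_lam = (
        np.einsum("i,ik,ikj->kj", ya, sim[active] * beta, F[active])
        + 2.0 * Cl * lam
    )
    return g_beta, g_lam

def alternating_step(F, y, beta, lam, Cb, Cl, lr):
    """One beta update with lam fixed, then one projected lam update."""
    g_beta, _ = subgradients(F, y, beta, lam, Cb, Cl)
    beta = beta - lr * g_beta
    _, g_lam = subgradients(F, y, beta, lam, Cb, Cl)
    lam = np.maximum(0.0, lam - lr * g_lam)  # project onto lam >= 0
    return beta, lam
```

For the one-versus-all setting described in Section 5, one such model would be trained per category and the category with the largest score would be predicted; that outer loop is omitted here.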