SUBMISSION TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple datasets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, …


1.1 Contributions

Dataset: In earlier work [3], we introduced the Caltech Pedestrian Dataset, which includes 350,000 pedestrian bounding boxes labeled in 250,000 frames and remains the largest such dataset to date. Occlusions and temporal correspondences are also annotated. Using the extensive ground truth, we analyze the statistics of pedestrian scale, occlusion, and location and help establish conditions under which detection systems must operate.

Evaluation Methodology: We aim to quantify and rank detector performance in a realistic and unbiased manner. To this effect, we explore a number of choices in the evaluation protocol and their effect on reported performance. Overall, the methodology has changed substantially since [3], resulting in a more accurate and informative benchmark.

Evaluation: We evaluate sixteen representative state-of-the-art pedestrian detectors (previously we evaluated seven [3]). Our goal was to choose diverse detectors that were most promising in terms of originally reported performance. We avoid retraining or modifying the detectors to ensure each method was optimized by its authors. In addition to overall performance, we explore detection rates under varying levels of scale and occlusion and on clearly visible pedestrians. Moreover, we measure localization accuracy and analyze runtime. To increase the scope of our analysis, we also benchmark the sixteen detectors using a unified evaluation framework on six additional pedestrian detection datasets, including the ETH [4], TUD-Brussels [5], Daimler [6] and INRIA [7] datasets and two variants of the Caltech dataset (see Figure 1). By evaluating across multiple datasets, we can rank detector performance and analyze the statistical significance of the results and, more generally, draw conclusions both about the detectors and the datasets themselves.

Two groups have recently published surveys which are complementary to our own. Geronimo et al. [2] performed a comprehensive survey of pedestrian detection for advanced driver assistance systems, with a clear focus on full systems. Enzweiler and Gavrila [6] published the Daimler detection dataset and an accompanying evaluation of three detectors, performing additional experiments integrating the detectors into full systems. We instead focus on a more thorough and detailed evaluation of state-of-the-art detectors.

This paper is organized as follows: we introduce the Caltech Pedestrian Dataset and analyze its statistics in §2; a comparison of existing datasets is given in §2.4. In §3 we discuss evaluation methodology in detail. A survey of pedestrian detectors is given in §4.1 and in §4.2 we discuss the sixteen representative state-of-the-art detectors used in our evaluation. In §5 we report the results of the performance evaluation, both under varying conditions using the Caltech dataset and on six additional datasets. We conclude with a discussion of the state of the art in pedestrian detection in §6.

Fig. 2. Overview of the Caltech Pedestrian Dataset. (a) Camera setup. (b) Summary of dataset statistics (1k = 10^3):

  total frames        1000k
  labeled frames      250k
  frames w/ peds.     132k
  # bounding boxes    350k
  # occluded BBs      126k
  # unique peds.      2300
  ave. ped. duration  5 s
  ave. labels/frame   1.4
  labeling time       400 h

The dataset is large, realistic and well-annotated, allowing us to study statistics of the size, position and occlusion of pedestrians in urban scenes and also to accurately evaluate the state of the art in pedestrian detection.

2 THE CALTECH PEDESTRIAN DATASET

Challenging datasets are catalysts for progress in computer vision. The Barron et al. [8] and Middlebury [9] optical flow datasets, the Berkeley Segmentation Dataset [10], the Middlebury Stereo Dataset [11], and the Caltech 101 [12], Caltech 256 [13] and PASCAL [14] object recognition datasets all improved performance evaluation, added challenge, and helped drive innovation in their respective fields. Much in the same way, our goal in introducing the Caltech Pedestrian Dataset is to provide a better benchmark and to help identify conditions under which current detectors fail and thus focus research effort on these difficult cases.

2.1 Data Collection and Ground Truthing

We collected approximately 10 hours of 30 Hz video (10^6 frames) taken from a vehicle driving through regular traffic in an urban environment (camera setup shown in Figure 2(a)). The CCD video resolution is 640×480, and, not unexpectedly, the overall image quality is lower than that of still images of comparable resolution. There are minor variations in the camera position due to repeated mountings of
the camera. The driver was independent from the authors of this study and had instructions to drive normally through neighborhoods in the greater Los Angeles metropolitan area chosen for their relatively high concentration of pedestrians, including LAX, Santa Monica, Hollywood, Pasadena, and Little Tokyo. In order to remove effects of the vehicle pitching and thus simplify annotation, the video was stabilized using the inverse compositional algorithm for image alignment by Baker and Matthews [15].

After video stabilization, 250,000 frames (in 137 approximately minute-long segments extracted from the 10 hours of video) were annotated for a total of 350,000 bounding boxes around 2300 unique pedestrians. To make such a large scale labeling effort feasible we created a user-friendly labeling tool, shown in Figure 3. Its most salient aspect is an interactive procedure where the annotator labels a sparse set of frames and the system automatically predicts pedestrian positions in intermediate frames. Specifically, after an annotator labels a bounding box (BB) around the same pedestrian in at least two frames, BBs in intermediate frames are interpolated using cubic interpolation (applied independently to each coordinate of the BBs). Thereafter, every time an annotator alters a BB, BBs in all the unlabeled frames are re-interpolated. The annotator continues until satisfied with the result. We experimented with more sophisticated interpolation schemes, including relying on tracking; however, cubic interpolation proved best. Labeling the 2.3 hours of video, including verification, took 400 hours total (spread across multiple annotators).

DOLLÁR et al.: PEDESTRIAN DETECTION: AN EVALUATION OF THE STATE OF THE ART

Fig. 3. The annotation tool allows annotators to efficiently navigate and annotate a video in a minimum amount of time. Its most salient aspect is an interactive procedure where the annotator labels only a sparse set of frames and the system automatically predicts pedestrian positions in intermediate frames. The annotation tool is available on the project website.

For every frame in which a given pedestrian is visible, annotators mark a BB that indicates the full extent of the entire pedestrian (BB-full); for occluded pedestrians this involves estimating the location of hidden parts. In addition a second BB is used to delineate the visible region (BB-vis), see Figure 5(a). During an occlusion event, the estimated full BB stays relatively constant while the visible BB may change rapidly. For comparison, in the PASCAL labeling scheme [14] only the visible BB is labeled and occluded objects are marked as `truncated'.

Each sequence of BBs belonging to a single object was assigned one of three labels. Individual pedestrians were labeled `Person' (1900 instances). Large groups for which it would have been tedious or impossible to label individuals were delineated using a single BB and labeled as `People' (300). In addition, the label `Person?' was assigned when clear identification of a pedestrian was ambiguous or easily mistaken (110).

2.2 Dataset Statistics

A summary of the dataset is given in Figure 2(b). About 50% of the frames have no pedestrians, while 30% have two or more, and pedestrians are visible for 5 s on average. Below, we analyze the distribution of pedestrian scale, occlusion and location. This serves to establish the requirements of a real world system and to help identify constraints that can be used to improve automatic pedestrian detection systems.
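The keyframe annotation scheme of §2.1 — label a sparse set of frames by hand, then predict BBs in intermediate frames by cubic interpolation applied independently to each box coordinate — can be sketched as below. This is a minimal illustration, not the authors' actual tool; the function name and toy keyframe values are ours:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_boxes(keyframes, boxes, query_frames):
    """Predict bounding boxes on unlabeled frames from sparse keyframes.

    keyframes:    sorted frame indices with hand-labeled boxes
    boxes:        (len(keyframes), 4) array of [x, y, w, h]
    query_frames: frame indices to fill in

    Each of the four coordinates is interpolated independently with a
    cubic spline, mirroring the per-coordinate scheme described in Sec. 2.1.
    """
    boxes = np.asarray(boxes, dtype=float)
    splines = [CubicSpline(keyframes, boxes[:, c]) for c in range(4)]
    return np.stack([s(query_frames) for s in splines], axis=1)

# two labeled frames are the minimum; every added or edited keyframe
# triggers re-interpolation of all unlabeled frames in between
labeled = [0, 10, 20]
hand_boxes = [[100, 200, 20, 48], [120, 205, 21, 50], [150, 210, 22, 52]]
pred = interpolate_boxes(labeled, hand_boxes, np.arange(21))
```

Because the spline passes exactly through the keyframes, an annotator's corrections are always preserved while the in-between frames update smoothly.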
Fig. 4. (a) Distribution of pedestrian pixel heights. We define the near scale to include pedestrians over 80 pixels, the medium scale as 30-80 pixels, and the far scale as under 30 pixels. Most observed pedestrians (69%) are at the medium scale. (b) Distribution of BB aspect ratio; on average w ≈ 0.41h. (c) Using the pinhole camera model, a pedestrian's pixel height h is inversely proportional to the distance to the camera d: h = fH/d. (d) Pixel height h as a function of distance d. Assuming an urban speed of 55 km/h, an 80-pixel person is just 1.5 s away, while a 30-pixel person is 4 s away. Thus, for automotive settings, detection is most important at medium scales (see §2.2.1 for details).

2.2.1 Scale Statistics

We group pedestrians by their image size (height in pixels) into three scales: near (80 or more pixels), medium (between 30-80 pixels) and far (30 pixels or less). This division into three scales is motivated by the distribution of sizes in the dataset, human performance and automotive system requirements.

In Figure 4(a), we histogram the heights of the 350,000 BBs using logarithmically sized bins. The heights are roughly lognormally distributed with a median of 48 pixels and a log-average of 50 pixels (the log-average is equivalent to the geometric mean and is more representative of typical values for lognormally distributed data than the arithmetic mean, which is 60 pixels in this case). Cutoffs for the near/far scales are marked. Note that 69% of the pedestrians lie in the medium scale, and that the cutoffs for the near/far scales correspond to about 1 standard deviation (in log space) from the log-average height of 50 pixels. Below 30 pixels, annotators have difficulty identifying pedestrians reliably.

Pedestrian width is likewise lognormally distributed, and moreover so is the joint distribution of width and height (not shown). As any linear combination of the components of a multivariate normal distribution is also normally distributed, so should be the BB aspect ratio (defined as w/h), since log(w/h) = log(w) − log(h). A histogram of the aspect ratios, using logarithmic bins, is shown in Figure 4(b), and indeed the distribution is lognormal. The log-average aspect ratio is 0.41, meaning that typically w ≈ 0.41h. However, while BB height does not vary considerably given a constant distance to the camera, the BB width can change with the pedestrian's pose (especially arm positions and relative angle). Thus, although we could have defined the near, medium and far scales using the width, the consistency of the height makes it better suited.

Detection in the medium scale is essential for automotive applications. We chose a camera setup that mirrors expected automotive settings: 640×480 resolution, 27° vertical field of view, and focal length fixed at 7.5 mm. The focal length in pixels is f ≈ 1000 (obtained from 480/2 = f·tan(27°/2), or using the camera's pixel size of 7.5 µm). Using a pinhole camera model (see Figure 4(c)), an object's observed pixel height h is inversely proportional to the distance d to the camera: h ≈ Hf/d, where H is the true object height. Assuming H ≈ 1.8 m tall pedestrians, we obtain d ≈ 1800/h m. With the vehicle traveling at an urban speed of 55 km/h (15 m/s), an 80-pixel person is just 1.5 s away, while a 30-pixel person is 4 s away (see Figure 4(d)). Thus detecting near scale pedestrians may leave insufficient time to alert the driver, while far scale pedestrians are less relevant.

We shall use the near, medium, and far scale definitions throughout this work. Most pedestrians are observed at the medium scale, and for safety systems detection must occur at this scale as well. Human performance is also quite good in the near and medium scales but degrades noticeably at the far scale. However, most current detectors are designed for the near scale and perform poorly even at the medium scale (see §5). Thus there is an important mismatch in current research efforts and the requirements of real systems. Using higher resolution cameras would help; nevertheless, given the good human performance and lower cost, we believe that accurate detection in the medium scale is an important and reasonable goal.

2.2.2 Occlusion Statistics

Fig. 5. Occlusion statistics. (a) For all occluded pedestrians annotators labeled both the full extent of the pedestrian (BB-full) and the visible region (BB-vis). (b) Most pedestrians (70%) are occluded in at least one frame, underscoring the importance of detecting occluded people. (c) Fraction of occlusion can vary significantly (0% occlusion indicates that a BB could not represent the extent of the visible region). (d) Occlusion is far from uniform, with pedestrians typically occluded from below. (e) To observe further structure in the types of occlusions that actually occur, we quantize occlusion into a fixed number of types. (f) Over 97% of occluded pedestrians belong to just a small subset of the hundreds of possible occlusion types. Details in §2.2.2.

Occluded pedestrians were annotated with two BBs that denote the visible and full pedestrian extent (see Figure 5(a)). We plot frequency of occlusion in Figure 5(b), i.e., for each pedestrian we measure the fraction of frames in which the pedestrian was at least somewhat occluded. The distribution has three distinct regions: pedestrians that are never occluded (29%), occluded in some frames (53%) and occluded in all frames (19%). Over 70% of pedestrians are occluded in at least one frame, underscoring the importance of detecting occluded people. Nevertheless, little previous work has been done to quantify occlusion or detection performance in the presence of occlusion (using real data).

For each occluded pedestrian, we can compute the fraction of occlusion as one minus the visible pedestrian area divided by total pedestrian area (calculated from the visible and full BBs). Aggregating, we obtain the histogram in Figure 5(c). Over 80% occlusion typically indicates full occlusion, while 0% is used to indicate that a BB could not represent the extent of the visible region (e.g. due to a diagonal occluder). We further subdivide the cases in between into partial occlusion (1-35% area occluded) and heavy occlusion (35-80% occluded).
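The fraction-of-occlusion computation and the binning just described can be written out directly. A small sketch (function names and the [x, y, w, h] box format are our choices, not the authors' code):

```python
def occlusion_fraction(bb_full, bb_vis):
    """Fraction of the pedestrian that is occluded, from the full-extent
    and visible-region boxes (Sec. 2.2.2): 1 - area(visible)/area(full).
    Boxes are [x, y, w, h]."""
    area = lambda bb: bb[2] * bb[3]
    return 1.0 - area(bb_vis) / area(bb_full)

def occlusion_level(frac):
    """Bin a fraction of occlusion as in the paper: partial (1-35%),
    heavy (35-80%); over 80% typically indicates full occlusion."""
    if frac <= 0.0:
        return "none"
    if frac < 0.35:
        return "partial"
    if frac <= 0.80:
        return "heavy"
    return "full"

# a pedestrian whose lower half is hidden is 50% occluded
frac = occlusion_fraction([100, 200, 40, 100], [100, 200, 40, 50])
level = occlusion_level(frac)
```

Since both annotations are axis-aligned boxes, the fraction is a coarse but cheap estimate; the 0% bin catches cases where no box could represent the visible region (e.g. a diagonal occluder).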
We investigated which regions of a pedestrian were most likely to be occluded. For each frame in which a pedestrian was partially to heavily occluded (1-80% fraction of occlusion), we created a binary 50×100 pixel occlusion mask using the visible and full BBs. By averaging the resulting 54k occlusion masks, we computed the probability of occlusion for each pixel (conditioned on the person being partially occluded); the resulting heat map is shown in Figure 5(d). Observe the strong bias for the lower portion of the pedestrian to be occluded, particularly the feet, and for the top portion, especially the head, to be visible. An intuitive explanation is that most occluding objects are supported from below as opposed to hanging from above (another but less likely possibility is that it is difficult for annotators to detect pedestrians if only the feet are visible). Overall, occlusion is far from uniform, and exploiting this finding could help improve the performance of pedestrian detectors.

Not only is occlusion highly non-uniform, there is significant additional structure in the types of occlusions that actually occur. Below, we show that after quantizing occlusion masks into a large number of possible types, nearly all occluded pedestrians belong to just a handful of the resulting types. To quantize the occlusions, each BB-full is registered to a common reference BB that has been partitioned into qx by qy regularly spaced cells; each BB-vis can then be assigned a type according to the smallest set of cells that fully encompass it. Figure 5(e) shows 3 example types for qx = 3, qy = 6 (with two BB-vis per type). There are a total of Σ_{i=1..qx} Σ_{j=1..qy} i·j = qx·qy·(qx+1)·(qy+1)/4 possible types. For each, we compute the percentage of the 54k occlusions assigned to it and produce a heat map using the corresponding occlusion masks. The top 7 of 126 types for qx = 3, qy = 6 are shown in Figure 5(f). Together, these 7 types account for nearly 97% of all occlusions in the dataset. As can be seen, pedestrians are almost always occluded from either below or the side; more complex occlusions are rare. We repeated the same analysis with a finer partitioning of qx = 4, qy = 8 (not shown). Of the resulting 360 possible types, the top 14 accounted for nearly 95% of occlusions. The knowledge that very few occlusion patterns are common should prove useful in detector design.

2.2.3 Position Statistics

Viewpoint and ground plane geometry (Figure 4(c)) constrain pedestrians to appear only in certain regions of the image. We compute the expected center position and plot the resulting heat map, log-normalized, in Figure 6(a). As can be seen, pedestrians are typically located in a narrow band running horizontally across the center of the image (the y-coordinate varies somewhat with distance/height). Note that the same constraints are not valid when photographing a scene from arbitrary viewpoints, e.g. in the INRIA dataset.

In the collected data, many objects, not just pedestrians, tend to be concentrated in this same region. In Figure 6(b) we show a heat map obtained by using BBs generated by the HOG [7] pedestrian detector with a low threshold. About half of the detections, including both true and false positives, occur in the same band as the ground truth. Thus incorporating this constraint could considerably speed up detection, but it would only moderately reduce false positives.

Fig. 6. Expected center location of pedestrian BBs for (a) ground truth and (b) HOG detections. The heat maps are log-normalized, meaning pedestrian location is even more concentrated than immediately apparent.

2.3 Training and Testing Data

We split the dataset into training/testing sets and specify a precise evaluation methodology, allowing different research groups to compare detectors directly. We urge authors to adhere to one of four training/testing scenarios described below.

The data was captured over 11 sessions, each filmed in one of five city neighborhoods as described. We divide the data roughly in half, setting aside 6 sessions for training (S0-S5) and 5 sessions for testing (S6-S10). For detailed statistics about the amount of data see the bottom row of Table 1. Images from all sessions (S0-S10) have been publicly available, as have been annotations for the training sessions (S0-S5). At this time we are also releasing annotations for the testing sessions (S6-S10).

Detectors can be trained using either the Caltech training data (S0-S5) or any `external' data, and tested on either the Caltech training data (S0-S5) or testing data (S6-S10). This results in four evaluation scenarios:

Scenario ext0: Train on any external data, test
on S0-S5.
Scenario ext1: Train on any external data, test on S6-S10.
Scenario cal0: Perform 6-fold cross validation using S0-S5. In each phase use 5 sessions for training and the 6th for testing, then merge and report results over S0-S5.
Scenario cal1: Train using S0-S5, test on S6-S10.

Scenarios ext0/ext1 allow for evaluation of existing, pre-trained pedestrian detectors, while cal0/cal1 involve training using the Caltech training data (S0-S5). The results reported here use the ext0/ext1 scenarios, thus allowing for a broad survey of existing pre-trained pedestrian detectors. Authors are encouraged to re-train their systems on our large training set and evaluate under scenarios cal0/cal1. Authors should use ext0/cal0 during detector development, and only after finalizing all parameters evaluate under scenarios ext1/cal1.

TABLE 1
Comparison of Pedestrian Detection Datasets (see §2.4 for details)

  Dataset            Setup     Training               Testing                Height (pixels)    Props.  Publ.
                               #ped    #neg.   #pos.  #ped    #neg.   #pos.  10%    med.   90%
  MIT [16]           photo     924     –       –      –       –       –      128    128    128     1     2000
  USC-A [17]         photo     –       –       –      313     –       205    70     98     133     1     2005
  USC-B [17]         surv.     –       –       –      271     –       54     63     90     126     1     2005
  USC-C [18]         photo     –       –       –      232     –       100    74     108    145     1     2007
  CVC [19]           mobile    1000    6175†   –      –       –       –      46     83     164     2     2007
  TUD-det [20]       mobile    400     –       400    311     –       250    133    218    278     2     2008
  Daimler-CB [21]    mobile    2.4k    15k†    –      1.6k    10k†    –      36     36     36      1     2006
  NICTA [22]         mobile    18.7k   5.2k    –      6.9k    50k†    –      72     72     72      2     2008
  INRIA [7]          photo     1208    1218    614    566     453     288    139    279    456     1     2005
  ETH [4]            mobile    2388    –       499    12k     –       1804   50     90     189     4     2007
  TUD-Brussels [5]   mobile    1776    218     1092   1498    –       508    40     66     112     3     2009
  Daimler-DB [6]     mobile    15.6k   6.7k    –      56.5k   –       21.8k  21     47     84      3     2009
  Caltech [3]        mobile    192k    61k     67k    155k    56k     65k    27     48     97      6     2009

(Props. = number of additional dataset features available, out of: color images, per-image eval., no selection bias, video seqs., temporal corr., occlusion labels; see §2.4. A † indicates cropped negative windows only.)

2.4 Comparison of Pedestrian Datasets

Existing datasets may be grouped into two types: (1) `person' datasets containing people in unconstrained pose in a wide range of domains and (2) `pedestrian' datasets containing upright, possibly moving people. The most widely used `person' datasets include subsets of the MIT LabelMe data [23] and the PASCAL VOC datasets [14]. In this work we focus on pedestrian detection, which is more relevant to automotive safety.

Table 1 provides an overview of existing pedestrian datasets. The datasets are organized into three groups. The first includes older or more limited datasets. The second includes more comprehensive datasets including the INRIA [7], ETH [4] and TUD-Brussels [5] pedestrian datasets and the Daimler detection benchmark (Daimler-DB) [6]. The final row contains information about the Caltech Pedestrian Dataset. Details follow below.

Imaging setup: Pedestrians can be labeled in photographs [7], [16], surveillance video [17], [24], and images taken from a mobile recording setup, such as a robot or vehicle [4], [5], [6]. Datasets gathered from photographs suffer from selection bias, as photographs are often manually selected, while surveillance videos have restricted backgrounds and thus rarely serve as a basis for detection datasets. Datasets collected by continuously filming from a mobile recording setup, such as the Caltech Pedestrian Dataset, largely eliminate selection bias (unless some scenes are staged by actors, as in [6]) while having moderately diverse scenes.

Dataset size: The amount and type of data in each dataset is given in the next six columns. The columns are: number of pedestrian windows (not counting reflections, shifts, etc.), number of images with no pedestrians (a † indicates cropped negative windows only), and number of uncropped images containing at least one pedestrian. The Caltech Pedestrian Dataset is two orders of magnitude larger than most existing datasets.

Dataset type: Older datasets, including the MIT [16], CVC [19] and NICTA [22] pedestrian datasets and the Daimler classification benchmark (Daimler-CB) [21], tend to contain cropped pedestrian windows only. These are known as `classification' datasets, as their primary use is to train and test binary classification algorithms. In contrast, datasets that contain pedestrians in their original context are known as `detection' datasets and allow for the design and testing of full-image detection systems. The Caltech dataset along with all the datasets in the second set (INRIA, ETH, TUD-Brussels and Daimler-DB) can serve as `detection' datasets.

Pedestrian scale: Table 1 additionally lists the 10th percentile, median and 90th percentile pedestrian pixel heights for each dataset. While the INRIA dataset has fairly high resolution pedestrians, most datasets gathered from mobile platforms have median heights that range from 50-100 pixels. This emphasizes the importance of detection of low resolution pedestrians, especially for applications on mobile platforms.

Dataset properties: The final columns summarize additional dataset features, including the availability of color images, video data, temporal correspondence between BBs and occlusion labels, and whether `per-image' evaluation and unbiased selection criteria were used.

As mentioned, in our performance evaluation we additionally use the INRIA [7], ETH [4], TUD-Brussels [5] and Daimler-DB [6] datasets. The INRIA dataset helped drive recent advances in pedestrian detection and remains one of the most widely used despite its limitations. Much like the Caltech dataset, the ETH, TUD-Brussels and Daimler-DB datasets are all captured in urban settings using a camera mounted to a vehicle (or stroller in the case of ETH). While being annotated in less detail than the Caltech dataset (see Table 1), each can serve as a `detection' dataset and is thus suitable for use in our evaluation.

Fig. 12. Performance as a function of scale. All detectors improve rapidly with increasing scale, especially MULTIFTR+MOTION, HOGLBP and LATSVM-V2, which utilize motion, texture and parts, respectively. At small scales state-of-the-art performance has considerable room for improvement.

… pedestrians degrade most. We can see this trend clearly by plotting log-average miss rate as a function of scale. Figure 12 shows performance at five scales between 32 and 128 pixels (see also §3.2 and §3.3). Performance improves for all methods with increasing scale, but most for MULTIFTR+MOTION, HOGLBP and LATSVM-V2. These utilize motion, texture and parts, respectively, for which high resolutions appear to be particularly important.

Occlusion: The impact of occlusion on detecting 50-pixel or taller pedestrians is shown in Figures 11(d) and 11(e). As discussed in §2.2.2, we classify pedestrians as unoccluded, partially occluded (1-35% occluded) and heavily occluded (35-80% occluded). Performance drops significantly even under partial occlusion, leading to a log-average miss rate of 73% for CHNFTRS and MULTIFTR+MOTION. Surprisingly, performance of part-based detectors degrades as severely as for holistic detectors.

Reasonable: Performance for medium scale or partially occluded pedestrians is poor, while for far scales or under heavy occlusion it is abysmal (see Figure 16). This motivates us to evaluate performance on pedestrians over 50 pixels tall under no or partial occlusion (these are clearly visible without much context). We refer to this as the reasonable evaluation setting. Results are shown in Figure 11(f); MULTIFTR+MOTION, CHNFTRS and FPDW perform best with log-average miss rates of 51-57%. We believe this evaluation is more representative than overall performance on all pedestrians, and we use it for reporting results on all additional datasets in §5.2 and for the statistical significance analysis in §5.3.

Localization: Recall that the evaluation is insensitive to the exact overlap threshold used for matching so long as it is below 0.6 (see §3.1 and Figure 7). This implies that nearly all detections that overlap the ground truth overlap it by at least half. However, as the threshold is increased further and higher localization accuracy is required, performance of all detectors degrades rapidly. Detector ranking is mostly maintained, except MULTIFTR and PLS degrade more; this implies that all but these two detectors have roughly the same localization accuracy.

5.2 Evaluation on Multiple Datasets

To increase the scope of our analysis, we benchmarked the detectors on six additional pedestrian detection datasets: INRIA [7], TUD-Brussels [5], ETH [4], Daimler-DB [6], Caltech-Training, and Caltech-Japan. These datasets are discussed in §2.4 and Table 1; we also review their most salient aspects below. Evaluating across multiple datasets allows us to draw conclusions both about the detectors and the datasets. Here we focus on the datasets; we return to assessing detector performance using multiple datasets in §5.3. Performance results for every dataset are shown in Figure 13.

We begin with a brief review of the six datasets. INRIA contains images of high resolution pedestrians collected mostly from holiday photos (we use only the 288 test images that contain pedestrians; note that a few have incomplete labels). The remaining datasets were recorded with a moving camera in urban environments and all contain color except Daimler-DB. ETH has higher density and larger scale pedestrians than the remaining datasets (we use the refined annotations published in [5]). Caltech-Training refers to the training portion of the Caltech Pedestrian Dataset. Caltech-Japan refers to a dataset we gathered in Japan that is essentially identical in size and scope to the Caltech dataset (unfortunately it cannot be released publicly for legal reasons). Table 1 provides an overview and further details on each dataset's properties and statistics (see also Figure 1).

We benchmark performance using the reasonable evaluation setting (50-pixel or taller under partial or no occlusion), standardizing aspect ratios as described in §3.4. For Daimler-DB and INRIA, which contain only grayscale and static images, respectively, we run only detectors that do not require color and motion information. Also, FTRMINE and FTRSYNTH results are not always available; otherwise we evaluated every detector on every dataset. We make all datasets, along with annotations and detector outputs for each, available in a single standardized format on our project webpage.

Of all datasets, performance is best on INRIA, which contains high resolution pedestrians, with LATSVM-V2, CHNFTRS and FPDW achieving log-average miss rates of 20-22% (see Figure 13(a)). Performance is also fairly high on Daimler-DB (13(b)), with a 29% log-average miss rate attained by MULTIFTR+MOTION, possibly due to the good image quality resulting from use of a monochrome camera. ETH (13(c)), TUD-Brussels (13(d)), Caltech-Training (13(e)), and Caltech-Testing (11(f)) are more challenging, with log-average miss rates between 51-55%, and Caltech-Japan (13(f)) is even more difficult due to lower image quality. Overall, detector ranking is reasonably consistent across datasets, suggesting that evaluation is not overly dependent on the dataset used.

5.3 Statistical Significance

We aim to rank detector performance utilizing multiple datasets and assess whether the differences between …

Fig. 14. Summary of detector performance across multiple datasets and the statistical significance of the results. (a) Visualization of the number of times each detector achieved each rank. Detectors are ordered by improving mean rank (displayed in brackets); observe the high variance among the top performing detectors. (b) Critical difference diagram [87]: the x-axis shows mean rank, blue bars link detectors for which there is insufficient evidence to declare them statistically significantly different (due to the relatively low number of performance samples and fairly high variance).

… for which there is insufficient evidence to declare them statistically significantly different. For example, MULTIFTR+MOTION, CHNFTRS and FPDW are all significantly better than HOG (since they are not linked). Observe, however, that the differences between the top six detectors are not statistically significant; indeed, each detector tends to be linked to numerous others. This result does not change much if we relax the confidence to α = 0.1. A similar trend was observed in Everingham et al.'s [14] analysis on the PASCAL challenge. Unfortunately, the statistical analysis requires a large number of samples, and while the 28 data folds provide a considerably more thorough analysis of pedestrian detectors than previously attempted, given their inherent variability, even more data would be necessary. We emphasize, however, that simply because we have insufficient evidence to declare the detectors statistically significantly different does not imply that their performance is equal.

5.4 Runtime Analysis

In many applications of pedestrian detection, including automotive safety, surveillance, robotics, and human machine interfaces, fast detection rates are of the essence. Although throughout we have focused on accuracy, we conclude by jointly considering both accuracy and speed.

We measure runtime of each detector using images from the Caltech dataset (averaging runtime over multiple frames). To compensate for detectors running on different hardware, all runtimes are normalized to the rate of a single modern machine. We emphasize that we measure the speed of binaries provided by the authors and that faster implementations are likely possible.

In Figure 15 we plot log-average miss rate versus runtime for each detector on 640×480 images. Legends are ordered by detection speed measured in frames per second (fps). Detection speed for pedestrians over 100 pixels ranges from 0.02 fps to 6.5 fps, achieved by FPDW, a sped-up version of CHNFTRS. Detecting 50-pixel pedestrians typically requires image upsampling; the slowest detectors require around five minutes per frame. FPDW remains the fastest detector, operating at 2.7 fps. Overall, there does not seem to be a strong correlation between runtime and accuracy. While the slowest detector happens to also be the most accurate (MULTIFTR+MOTION), on pedestrians over 50 pixels the two fastest detectors, CHNFTRS and FPDW, are also the second and third most accurate, respectively.

While the frame rates may seem low, it is important to mention that all tested detectors can be employed as part of a full system (cf. [2]). Such systems may employ ground plane constraints and perform region-of-interest selection (e.g. from stereo disparity or motion), reducing runtime drastically. Moreover, numerous approaches have been proposed
for speeding up detection, including speeding up the detector itself [29], [44], [46], through use of approximations [63], [89], or by using special purpose hardware such as GPUs [90] (for a review of fast detection see [63]). Nevertheless, the above runtime analysis gives a sense of the speed of current detectors.

6 DISCUSSION

This study was carried out to assess the state of the art in pedestrian detection. Automatically detecting pedestrians from moving vehicles could have considerable economic impact and the potential to substantially reduce pedestrian injuries and fatalities. We make three main contributions: a new dataset, an improved evaluation methodology and an analysis of the state of the art.

First, we put together an unprecedented object detection dataset. The dataset is large, representative and relevant. It was collected with an imaging geometry and in multiple neighborhoods that match likely conditions for urban vehicle navigation. Second, we propose an evaluation methodology that allows us to carry out probing and informative comparisons between competing approaches to pedestrian detection in a realistic and unbiased manner. Third, we compare performance of sixteen pre-trained state-of-the-art detectors across six datasets. Performance is assessed as a function of scale, degree of occlusion, localization accuracy and computational cost; moreover we gauge the statistical significance of the ranking of detectors across multiple datasets.
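The results in §5 are summarized throughout by the log-average miss rate, which the evaluation methodology of §3 (not reproduced in this excerpt) defines as the geometric mean of the miss rate sampled at nine FPPI (false positives per image) values evenly spaced in log space over [10⁻², 10⁰]. A minimal sketch of that summary statistic; the toy curve values are ours, for illustration only:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Summarize a miss-rate vs. FPPI curve with one number: the
    geometric mean of the miss rate sampled at nine FPPI values evenly
    spaced in log space over [1e-2, 1e0]. The curve is interpolated in
    log-log space, so it need not be evaluated exactly at those points.
    """
    ref = np.logspace(-2, 0, 9)
    log_mr = np.interp(np.log(ref), np.log(fppi), np.log(miss_rate))
    return float(np.exp(log_mr.mean()))  # geometric mean

# toy curve: miss rate falls from 0.80 at 1e-2 FPPI to 0.35 at 1 FPPI
fppi = np.array([1e-2, 1e-1, 1e0])
mr = np.array([0.80, 0.55, 0.35])
lamr = log_average_miss_rate(fppi, mr)
```

Averaging in log space keeps the summary from being dominated by the high-miss-rate end of the curve, which is why it is preferred over the arithmetic mean for these roughly log-linear curves.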
[65] B. Babenko, P. Dollár, Z. Tu, and S. Belongie, "Simultaneous learning and alignment: Multi-instance and multi-pose learning," in ECCV Faces in Real-Life Images, 2008.
[66] S. Walk, K. Schindler, and B. Schiele, "Disparity statistics for pedestrian detection: Combining appearance, motion and stereo," in European Conf. Computer Vision, 2010.
[67] P. Dollár, Z. Tu, H. Tao, and S. Belongie, "Feature mining for image classification," in IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[68] A. Bar-Hillel, D. Levi, E. Krupka, and C. Goldberg, "Part-based feature synthesis for human detection," in European Conf. Computer Vision, 2010.
[69] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis, "Human detection using partial least squares analysis," in IEEE Intl. Conf. Computer Vision, 2009.
[70] Z. Lin and L. S. Davis, "A pose-invariant descriptor for human det. and seg.," in European Conf. Computer Vision, 2008.
[71] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[72] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 99, no. PrePrints, 2009.
[73] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object det. in images by components," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 4, pp. 349–361, Apr. 2001.
[74] K. Mikolajczyk, C. Schmid, and A. Zisserman, "Human detection based on a probabilistic assembly of robust part detectors," in European Conf. Computer Vision, 2004.
[75] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila, "Multi-cue ped. classification with partial occlusion handling," in IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[76] L. Bourdev and J. Malik, "Poselets: Body part detectors trained using 3d human pose annotations," in IEEE Intl. Conf. Computer Vision, 2009.
[77] M. Enzweiler and D. M. Gavrila, "Integrated pedestrian classification and orientation estimation," in IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[78] D. Tran and D. Forsyth, "Configuration estimates improve pedestrian finding," in Advances in Neural Information Processing Systems, 2008.
[79] M. Weber, M. Welling, and P. Perona, "Unsupervised learning of models for recog.," in European Conf. Computer Vision, 2000.
[80] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[81] S. Agarwal and D. Roth, "Learning a sparse representation for object det.," in European Conf. Computer Vision, 2002.
[82] P. Dollár, B. Babenko, S. Belongie, P. Perona, and Z. Tu, "Multiple component learning for object detection," in European Conf. Computer Vision, 2008.
[83] Z. Lin, G. Hua, and L. S. Davis, "Multiple instance feature for robust part-based object detection," in IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[84] D. Park, D. Ramanan, and C. Fowlkes, "Multiresolution models for obj. det.," in European Conf. Computer Vision, 2010.
[85] R. M. Haralick, K. Shanmugam, and I. Dinstein, "Textural features for image classification," IEEE Trans. on Systems, Man, and Cybernetics, vol. 3, no. 6, pp. 610–621, 1973.
[86] E. Shechtman and M. Irani, "Matching local self-similarities across images and videos," in IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[87] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[88] S. García and F. Herrera, "An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons," Journal of Machine Learning Research, vol. 9, pp. 2677–2694, 2008.
[89] W. Zhang, G. J. Zelinsky, and D. Samaras, "Real-time accurate object detection using multiple resolutions," in IEEE Intl. Conf. Computer Vision, 2007.
[90] C. Wojek, G. Dorkó, A. Schulz, and B. Schiele, "Sliding-windows for rapid object class localization: A parallel technique," in DAGM Symposium Pattern Recognition, 2008.
Piotr Dollár received his master's degree in computer science from Harvard University in 2002 and his PhD from the University of California, San Diego in 2007. He joined the Computational Vision lab at Caltech as a postdoctoral fellow in 2007, where he currently works. He has worked on behavior recognition, boundary learning, manifold learning, and object and pedestrian detection, including efficient feature representations and novel learning paradigms. His general interests lie in machine learning and pattern recognition and their application to computer vision.

Christian Wojek received his master's degree in computer science from the University of Karlsruhe in 2006 and his PhD from TU Darmstadt in 2010. He was awarded a DAAD scholarship to visit McGill University from 2004 to 2005. He was with MPI Informatics Saarbrücken as a postdoctoral fellow from 2010 to 2011 and in 2011 joined Carl Zeiss Corporate Research. His research interests are object detection, scene understanding, and activity recognition.

Bernt Schiele received his master's in computer science from the Univ. of Karlsruhe and INP Grenoble in 1994. In 1997 he obtained his PhD from INP Grenoble in computer vision. He was a postdoctoral associate and Visiting Assistant Professor at MIT between 1997 and 2000. From 1999 until 2004 he was an Assistant Professor at ETH Zurich, and from 2004 to 2010 he was a full professor of computer science at TU Darmstadt. In 2010, he was appointed a scientific member of the Max Planck Society and a director at the Max Planck Institute for Informatics. Since 2010 he has also been a Professor at Saarland University. His main interests are computer vision, perceptual computing, statistical learning methods, wearable computers, and the integration of multi-modal sensor data. He is particularly interested in developing methods that work under real-world conditions.
Pietro Perona graduated in Electrical Engineering from the Università di Padova in 1985 and received a PhD in EECS from the University of California at Berkeley in 1990. After a postdoctoral fellowship at MIT in 1990-91, he joined the faculty of Caltech in 1991, where he is now the Allen E. Puckett Professor of Electrical Engineering and Computation and Neural Systems. His current interests are visual recognition, modeling and measuring animal behavior, and Visipedia. He has worked on anisotropic diffusion, multiresolution multi-orientation filtering, human texture perception and segmentation, dynamic vision, grouping, analysis of human motion, recognition of object categories, and modeling visual search.