IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 21, NO. 5, MAY 1999

Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes

Andrew E. Johnson, Member, IEEE, and Martial Hebert, Member, IEEE

A.E. Johnson is with the Jet Propulsion Laboratory, Mail Stop 125-209, 4800 Oak Grove Dr., Pasadena, CA 91109. E-mail: aej@robotics.jpl.nasa.gov.
M. Hebert is with the Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213. E-mail: hebert@ri.cmu.edu.

Abstract—We present a 3D shape-based object recognition system for simultaneous recognition of multiple objects in scenes containing clutter and occlusion. Recognition is based on matching surfaces by matching points using the spin image representation. The spin image is a data level shape descriptor that is used to match surfaces represented as surface meshes. We present a compression scheme for spin images that results in efficient multiple object recognition, which we verify with results showing simultaneous recognition of multiple objects from a library of 20 models. Furthermore, we demonstrate robust recognition in the presence of clutter and occlusion through analysis of recognition trials on 100 scenes.

Index Terms—3D object recognition, surface matching, spin image, clutter, occlusion, oriented point, surface mesh, point matching.

1 INTRODUCTION

Surface matching is a technique from 3D computer vision that has many applications in the area of robotics and automation. Through surface matching, an object can be recognized in a scene by comparing a sensed surface to an object surface stored in memory. When the object surface is matched to the scene surface, an association is made between something known (the object) and something unknown (the scene); information about the world is obtained. Another application of surface matching is the alignment of two surfaces represented in different coordinate systems. By aligning the surfaces, the transformation between the surface coordinate systems is determined. Surface alignment has numerous applications, including localization for robot navigation [27] and modeling of complex scenes from multiple views [13].

Shape representations are used to collate the information stored in sensed 3D points so that surfaces can be compared efficiently. Shape can be represented in many different ways, and finding an appropriate representation for shape that is amenable to surface matching is still an open research issue. The variation among shape representations spans many different axes. For instance, shape representations can be classified by the number of parameters used to describe each primitive in the representation. Representing objects using planar surface patches [7] uses many primitives, each with a few parameters. On the other hand, representing an object with generalized cylinders [5] requires fewer primitives, but each has many parameters. Another axis of comparison is the local versus global nature of the representation. The Gaussian image [17] and related spherical representations [10] are global representations useful for describing single objects, while surface curvature [3] measures local surface properties and can be used for surface matching in complex scenes. The multitude of proposed surface representations indicates the lack of consensus on the best representation for surface matching.

Another factor determining the appropriate surface representation is the coordinate system in which the data is described. Surfaces can be defined in viewer-centered coordinate systems or object-centered coordinate systems. Viewer-centered representations [6] describe surface data with respect to a coordinate system dependent on the view of the surface. Although viewer-centered coordinate systems are easy to construct, the description of the surface changes as viewpoint changes, and surfaces must be aligned before they can be compared. Furthermore, to represent a surface from multiple views, a separate representation must be stored for each different viewpoint.

An object-centered coordinate system describes an object surface in a coordinate system fixed to the object.
In object-centered coordinates, the description of the surface is view-independent, so surfaces can be directly compared, without first aligning the surfaces. Object-centered representations can be more compact than viewer-centered representations because a single surface representation describes all views of the surface. Finding an object-centered coordinate system is difficult because these systems are generally based on global properties of the surface. However, if an object-centered coordinate system can be extracted robustly from surface data, then its view independence prompts its use over a viewer-centered coordinate system.

In 3D object recognition, an important application of surface matching, an object surface is searched for in a scene surface. Real scenes contain multiple objects, so surface data sensed in the real world will contain clutter, surfaces that are not part of the object surface being matched. Because clutter will corrupt global properties of the scene data, generating object-centered coordinate systems in cluttered scenes is difficult. The usual method for dealing with clutter is to segment the scene into object and non-object components [1], [7]; naturally, this is difficult if the position of the object is unknown. An alternative to segmentation is to construct object-centered coordinate systems using local features detected in the scene [9], [18]; here again there is the problem of differentiating object features from non-object features. Another difficulty occurs because surface data often has missing components, i.e., occlusions. Occlusions will alter global properties of the surfaces and, therefore, will complicate construction of object-centered coordinate systems. Consequently, if an object-centered surface matching representation is to be used to recognize objects in real scenes, it must be robust to clutter and occlusion.

Object representations should also enable efficient matching of surfaces from multiple models, so that recognition occurs in a timely fashion. Furthermore, the representation should be efficient in storage (i.e., compact), so that many models can be stored in the model library. Without efficiency, a recognition system will not be able to recognize the multitude of objects in the real world.

1.1 A Representation for Surface Matching

In our representation, surface shape is described by a dense collection of 3D points and surface normals. In addition, associated with each surface point is a descriptive image that encodes global properties of the surface using an object-centered coordinate system. By matching images, correspondences between surface points can be established and used to match surfaces independent of the transformation between surfaces. Taken together, the points, normals, and associated images make up our surface representation. Fig. 1 shows the components of our surface matching representation.

Representing surfaces using a dense collection of points is feasible because many 3D sensors and sensing algorithms return a dense sampling of surface shape.
Furthermore, from sensor geometry and scanning patterns, the adjacency on the surface of sensed 3D points can be established. Using the adjacency and position of sensed 3D points, surface normals can be computed. We use a polygonal surface mesh to combine information about the position of 3D surface points and the adjacency of points on the surface. In a surface mesh, the vertices of the surface mesh correspond to 3D surface points, and the edges between vertices convey adjacency. Given enough points, any object can be represented by points sensed on the object surface, so surface meshes can represent objects of general shape. Surface meshes can be generated from different types of sensors and do not generally contain sensor-specific information; they are sensor-independent representations. The use of surface meshes as representations for 3D shapes has been avoided in the past due to computational concerns. However, our research and the findings of other researchers have shown that processing power has reached a level where computations using surface meshes are now feasible [2], [26].

Fig. 1. Components of our surface representation. A surface described by a polygonal surface mesh can be represented for matching as a set of points with associated surface normals and spin images.

Our approach to surface matching is based on matching individual surface points in order to match complete surfaces. Two surfaces are said to be similar when the images from many points on the surfaces are similar. By matching points, we are breaking the problem of surface matching into many smaller localized problems. Consequently, matching points provides a method for handling clutter and occlusion in surface matching without first segmenting the scene; clutter points on one surface will not have matching points on the other, and occluded points on one surface will not be searched for on the other. If many points between the two surfaces match, then the surfaces can be matched. The main difficulty with matching surfaces in this way is describing surface points so that they can be differentiated from one another, while still allowing point matching in scenes containing clutter, occlusion, and surface noise.

To differentiate among points, we construct 2D images associated with each point. These images are created by constructing a local basis at an oriented point (a 3D point with surface normal) on the surface of an object. As in geometric hashing [18], the positions with respect to the basis of other points on the surface of the object can then be described by two parameters. By accumulating these parameters in a 2D histogram, a descriptive image associated with the oriented point is created. Because the image encodes the coordinates of points on the surface of an object with respect to the local basis, it is a local description of the global shape of the object and is invariant to rigid transformations. Since 3D points are described by images, we can apply powerful techniques from 2D template matching and pattern classification to the problem of surface matching.

The idea of matching points to match surfaces is not a novel concept. Stein and Medioni [24] recognize 3D objects by matching points using structural indexing and their "splash" representation. Similarly, Chua and Jarvis match points to align surfaces using principal curvatures [3] and "point-signatures" [4]. Our method differs from these in the way that points are represented for matching and the way that points, once matched, are grouped to match surfaces.
Our representation is a 2D image that accumulates information about a surface patch, while splashes and point signatures are 1D representations that accumulate surface information along a 3D curve. This difference makes our representation potentially more descriptive than the other two. Furthermore, our representation does not rely on the definition of a possibly ambiguous three-axis orthonormal frame at each point, but instead uses an oriented point basis which can be defined robustly everywhere on a surface using just surface position and normal. Finally, because we use an image-based representation, techniques from image processing can be used in our matching algorithms.

To distinguish our point matching representation from the camera images common in computer vision, we have chosen the name spin image, because the representation is a 2D array of values, and because the image generation process can be visualized as a sheet spinning about the normal of the point. Previous papers, [12] and [15], introduced the concept of spin images and showed how they can be used to match surfaces. Therefore, in Section 2, we briefly review spin image generation and its application to surface matching. This section also presents an analysis of the parameters used in spin image generation, showing the effect of different parameter values on the accuracy of surface matching.

The main contribution of this paper is the description and experimental analysis of the use of spin images in efficient multi-model object recognition in scenes containing clutter and occlusion. Two major improvements to spin image matching enable efficient object recognition. First, localization of spin images by reducing spin image generation parameters enables surface matching in scenes containing clutter and occlusion. Second, since the large number of spin images comprising our surface representation are redundant, statistical eigen-analysis can be employed to reduce the dimensionality of the images and speed up spin image matching. The techniques employed are similar to those used in appearance-based recognition [19]; in particular, the combination of localized images and image compression is similar to the work in eigen-features [21] and parts-based appearance recognition [11]. Section 3 describes our algorithm for multi-model object recognition using spin images, and Section 4 describes our experimental validation of recognition using spin images on 100 complex scenes. A shorter description of this work has appeared as a conference paper [14].

2 SURFACE MATCHING

This section provides the necessary background for understanding spin image generation and surface matching using spin images. More complete descriptions of spin images and our surface matching algorithms are given in [12], [15].

2.1 Spin Images

Oriented points, 3D points with associated directions, are used to create spin images. We define an oriented point at a surface mesh vertex using the 3D position of the vertex and the surface normal at the vertex. The surface normal at a vertex is computed by fitting a plane to the points connected to the vertex by edges of the surface mesh.

An oriented point defines a partial, object-centered, coordinate system. Two cylindrical coordinates can be defined with respect to an oriented point: the radial coordinate α, defined as the perpendicular distance to the line through the surface normal, and the elevation coordinate β, defined as the signed perpendicular distance to the tangent plane defined by vertex normal and position.
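Written out, for an oriented point with position p and unit normal n, the spin map sends a surface point x to these two coordinates; the following equation is a reconstruction that follows directly from the definitions above:

$$S_O(\mathbf{x}) \;=\; (\alpha, \beta) \;=\; \left(\sqrt{\lVert \mathbf{x}-\mathbf{p} \rVert^{2} - \big(\mathbf{n}\cdot(\mathbf{x}-\mathbf{p})\big)^{2}}, \;\; \mathbf{n}\cdot(\mathbf{x}-\mathbf{p})\right)$$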
The cylindrical angular coordinate is omitted because it cannot be defined robustly and unambiguously on planar surfaces.

A spin image is created for an oriented point at a vertex in the surface mesh as follows. A 2D accumulator indexed by α and β is created. Next, the coordinates (α, β) are computed for a vertex in the surface mesh that is within the support of the spin image (explained below). The bin indexed by (α, β) in the accumulator is then incremented; bilinear interpolation is used to smooth the contribution of the vertex. This procedure is repeated for all vertices within the support of the spin image. The resulting accumulator can be thought of as an image; dark areas in the image correspond to bins that contain many projected points. As long as the size of the bins in the accumulator is greater than the median distance between vertices in the mesh (the definition of mesh resolution), the position of individual vertices will be averaged out during spin image generation. Fig. 2 shows the projected (α, β) 2D coordinates and spin images for three oriented points on a duck model. For surface matching, spin images are constructed for every vertex in the surface mesh.

Fig. 2. Spin images of large support for three oriented points on the surface of a rubber duck model.

Spin images generated from two different surfaces representing the same object will be similar because they are based on the shape of the object, but they will not be exactly the same due to variations in surface sampling and noise. However, if the surfaces are uniformly sampled, then the spin images from corresponding points on the different surfaces will be linearly related. (Uniform surface sampling is enforced by preprocessing the surface meshes using a mesh resampling algorithm [16].) A standard method for comparing linearly related data sets is the linear correlation coefficient, so we use the correlation coefficient between two spin images to measure spin image similarity. As is shown in Section 3, for efficient object recognition, the similarity measure between spin images must be changed to the L2 distance between images. The L2 distance between images performs as well as the correlation coefficient for spin image matching, as long as the images are properly normalized.
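The generation procedure is compact enough to sketch directly. The following is a minimal NumPy version, not the implementation of [12], [15]; the centering of the β axis on the tangent plane and the boundary handling are assumptions:

```python
import numpy as np

def spin_image(p, n, points, bin_size, image_width):
    """Accumulate a spin image for the oriented point (p, n).

    p, n        : 3-vectors, vertex position and unit surface normal
    points      : (N, 3) array of mesh vertex positions within the support
    bin_size    : geometric bin width (typically the mesh resolution)
    image_width : number of rows and columns in the square image
    """
    img = np.zeros((image_width, image_width))
    d = points - p
    beta = d @ n                                            # signed elevation
    alpha = np.sqrt(np.maximum((d * d).sum(axis=1) - beta**2, 0.0))

    # Continuous bin coordinates; beta = 0 maps to the middle row.
    rows = beta / bin_size + image_width / 2.0
    cols = alpha / bin_size
    for r, c in zip(rows, cols):
        i, j = int(np.floor(r)), int(np.floor(c))
        if 0 <= i < image_width - 1 and 0 <= j < image_width - 1:
            a, b = r - i, c - j                             # bilinear weights
            img[i, j]         += (1 - a) * (1 - b)
            img[i + 1, j]     += a * (1 - b)
            img[i, j + 1]     += (1 - a) * b
            img[i + 1, j + 1] += a * b
    return img

def correlation(P, Q):
    """Linear correlation coefficient between two spin images."""
    return np.corrcoef(P.ravel(), Q.ravel())[0, 1]
```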
2.2 Spin Image Generation Parameters

The bin size b is the geometric width of the bins in the spin image. Bin size is an important parameter in spin image generation because it determines the storage size of the spin image and the averaging in spin images that reduces the effect of individual point positions. It also has an effect on the descriptiveness of the spin images. The bin size is set as a multiple of the resolution of the surface mesh in order to eliminate the dependence of bin size on object scale and resolution. Setting bin size based on mesh resolution is feasible because mesh resolution is related to the size of shape features on an object and the density of points in the surface mesh. Spin images generated for the duck model using different bin sizes are shown in Fig. 3. The spin image generated for a bin size of four times the mesh resolution is not very descriptive of the global shape of the model. The spin image generated with a bin size of one quarter the mesh resolution does not have enough averaging to eliminate the effect of surface sampling. The spin image generated with a bin size equal to the mesh resolution has the proper balance between encoding global shape and averaging the contributions of individual points.

Fig. 3. The effect of bin size on spin image appearance. Three spin images of decreasing bin size for a point on the duck model are shown. Setting the bin size to the model resolution creates descriptive spin images while averaging during point accumulation to eliminate the effect of individual vertex positions.

Fig. 6 gives a quantitative analysis of the effect of bin size on spin image matching. To create the graph, first, the spin images for all vertices on the model were created for a particular bin size. Next, each spin image was compared to all of the other spin images from the model, and the Euclidean distances between the vertex and the vertices corresponding to the best matching spin images were computed. After repeating this matching for all spin images on the model, the median Euclidean distance (match distance) was computed. By repeating this procedure for multiple bin sizes using the duck model, the graph on the left of Fig. 6 was created. Match distance is a single statistic that describes the correctness of spin image matches: the lower the match distance, the more correct the matches.

The graph shows that for bin sizes below the mesh resolution (1.0), the match distance is large, while for bin sizes greater than the mesh resolution, the match distance increases as well. Consequently, the best spin image matching occurs when the bin size is set close to the mesh resolution; this analysis confirms our qualitative observations from Fig. 3. For the results in this paper, the bin size is set to exactly the mesh resolution.

Although spin images can have any number of rows and columns, for simplicity, we generally make the number of rows and columns in a spin image equal. This results in square spin images whose size can be described by one parameter. We define the number of rows or columns in a square spin image to be the image width W. To create a spin image, an appropriate image width needs to be determined. Image width times the bin size is called the spin image support distance (D_s = Wb); support distance determines the amount of space swept out by a spin image. By setting the image width, the amount of global information in a spin image can be controlled. For a fixed bin size, decreasing image width will decrease the descriptiveness of a spin image because the amount of global shape included in the image will be reduced. However, decreasing image width will also reduce the chances of clutter corrupting a spin image. Image width is analogous to window size in 2D template matching. Fig. 4 shows spin images for a single oriented point on the duck model as the image width is decreased. This figure shows that as image width decreases, the descriptiveness of the images decreases.

The graph in the middle of Fig. 6 shows the effect of image width on spin image matching. As image width increases, match distance decreases. This confirms our observation from Fig. 4. In general, we set the image width so that the support distance is on the order of the size of the model. If the data is very cluttered, then we set the image width to a smaller value. For the results presented in this paper, the image width is set to 15, resulting in spin images containing 225 bins.
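The match distance statistic underlying the three graphs of Fig. 6 is simple to state in code. A sketch, assuming NumPy arrays and simplifying to a single best match per vertex:

```python
import numpy as np

def match_distance(vertices, spin_images):
    """Median Euclidean distance between each vertex and the vertex whose
    spin image best matches its own; lower values mean more correct matches.

    vertices    : (N, 3) array of vertex positions
    spin_images : list of N spin images (2D arrays), one per vertex
    """
    flat = np.array([s.ravel() for s in spin_images])
    corr = np.corrcoef(flat)                 # N x N correlation matrix
    np.fill_diagonal(corr, -np.inf)          # exclude self-matches
    best = corr.argmax(axis=1)               # best-matching vertex per vertex
    dists = np.linalg.norm(vertices - vertices[best], axis=1)
    return float(np.median(dists))
```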
The final spin image generation parameter is the support angle A_s. Support angle is the maximum angle between the direction of the oriented point basis of a spin image and the surface normal of points that are allowed to contribute to the spin image. Suppose we have an oriented point A with position and normal (p_A, n_A) for which we are creating a spin image. Furthermore, suppose there exists another oriented point B with position and normal (p_B, n_B). The support angle constraint can then be stated as: B will be accumulated in the spin image of A if

$$\arccos(\mathbf{n}_A \cdot \mathbf{n}_B) < A_s .$$

Support angle is used to limit the effect of self-occlusion and clutter during spin image matching. Fig. 5 shows the spin images generated for three different support angles, along with the vertices on the model that are mapped into the spin image. Support angle is used to reduce the number of points on the opposite side of the model that contribute to the model spin image. This parameter decreases the effect of occlusion on spin image matching; if a point has a significantly different normal from the normal of the oriented point, then it is unlikely that it will be visible when the oriented point is imaged by a rangefinder in some scene data.

Fig. 4. The effect of image width on spin images. As image width decreases, the volume swept out by the spin image (top) decreases, resulting in decreased spin image support (bottom). By varying the image width, spin images can vary smoothly from global to local representations. (a) A 40-pixel image width. (b) A 20-pixel image width. (c) A 10-pixel image width.

Fig. 5. The effect of support angle on spin image appearance. As support angle decreases, the number of points contributing to the spin image (top) decreases. This results in a reduction in the support of the spin images (bottom). (a) A 180 degree support angle. (b) A 90 degree support angle. (c) A 60 degree support angle.
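As a sketch, the constraint is a single cosine test against the normals; points failing the mask are excluded before spin image accumulation (unit normals assumed):

```python
import numpy as np

def support_angle_mask(n, normals, support_angle_deg):
    """Mask of points whose normals lie within the support angle A_s of the
    oriented point normal n, i.e., acos(n . n_B) < A_s."""
    return normals @ n > np.cos(np.radians(support_angle_deg))
```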
The scatterplot shows that the spin images are much more linearly corre- JOHNSON AND HEBERT: USING SPIN IMAGES FOR EFFICIENT OBJECT RECOGNITION IN CLUTTERED 3D SCENES439spin image because, as shown in the scatterplot of the twoimages, scene spin image pixels are being corrupted byclutter. When smaller support angle and distance are used,the spin images become similar; the pixel values shown inthe scatterplot of the images created with local parametersare linearly related ( = 0.958). By varying spin image gen-eration parameters, we are using knowledge of the spinimage generation process to eliminate outlier pixels, mak-ing the spin images much more similar.2.3Surface Matching EngineAs shown in Fig. 8, two surfaces are matched as follows.Spin images from points on one surface are compared bycomputing correlation coefficient with spin images frompoints on another surface; when two spin images arehighly correlated, a point correspondence between thesurfaces is established. More specifically, before matching,all of the spin images from one surface (the model) areconstructed and stored in a spin image stack. Next, a ver-tex is selected at random from the other surface (thescene) and its spin image is computed. Point correspon-dences are then established between the selected pointand the points with best matching spin images on theother surface. This procedure is repeated for many pointsresulting in a sizeable set of point correspondences (~100).Point correspondences are then grouped and outliers areeliminated using geometric consistency. Groups of geo-metrically consistent correspondences are then used tocalculate rigid transformations that aligns one surfacewith the other. After alignment, surface matches are veri-fied using a modified iterative closest point algorithm.The best match is selected as the one with the greatestoverlap between surfaces. Further details of the surfacematching engine are given in [12]. Fig. 8. Surface matching block diagram. Fig. 9. Spin images generated while traversing a path along the surface of the duck model. (a) Spin images from proximal orientsimilar, resulting in one cause of redundancy in spin images. (b) Two pairs of similar spin images caused by symmetry in the du 440IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 21, NO. 5, MAY 1999Surface matching using spin images can be extended toobject recognition as follows. Each model in the model li-brary is represented as a polygonal mesh. Before recogni-tion, the spin images for all vertices on all models are cre-ated and stored. At recognition time, a scene point is se-lected and its spin image is generated. Next, its spin imageis correlated with all of the spin images from all of themodels. The best matching model spin image will indicateboth the best matching model and model vertex. Aftermatching many scene spin images to model spin images,the point correspondences are input into the surfacematching engine described in Section 2.3. The result is si-multaneous recognition and localization of the models thatThis form of surface matching is inefficient for two rea-sons. First, each spin image comparison requires a correla-tion of two spin images, an operation on order of the rela-tively large (~200) number of bins in a spin image. Second,when a spin image is matched to the model library, it iscorrelated with all of the spin images from all of the mod-els. This operation is linear in the number of vertices ineach model and linear in the number of models. 
This line-arly growth rate is unacceptable for recognition from largemodel libraries. Fortunately, spin images can be com-pressed to speed up matching considerably.3.1Spin Image CompressionSpin images coming from the same surface can be correlatedfor two reasons: First, as shown in Fig. 9, spin images gener-ated from oriented point bases that are close to each other onthe surface will be correlated. Second, as shown in Fig. 9,surface symmetry and the inherit symmetry of spin imagegeneration will cause two oriented point bases on equal butopposite sides of a plane of symmetry to be correlated. Fur-thermore, surfaces from different objects can be similar onthe local scale, so there can exist a correlation between spinimages of small support generated for different objects.This correlation can be exploited to make spin imagematching more efficient through image compression. Forcompression, it is convenient to think of spin images asvectors in an -dimensional vector space where is thenumber of pixels in the spin image. Correlation betweenspin images places the set of spin images in a low dimen-A common technique for image compression in objectrecognition is principal component analysis (PCA) [19].PCA or Karhunen-Loeve expansion is a well-knownmethod for computing the directions of greatest variancefor a set of vectors [8]. By computing the eigenvectors of thecovariance matrix of the set of vectors, PCA determines anorthogonal basis, called the eigenspace, in which to de-PCA has become popular for efficient comparison of im-ages because it is optimal in the correlation sense. The distance between two spin images in spin image space isthe same as the distance between the two spin imagesrepresented in the eigenspace. Furthermore, when vectorsare projected into a subspace defined by the eigenvectors oflargest eigenvalue, the distance between projected vectorsis the best approximation (with respect to mean square er-ror) to the distance between the unprojected vectors,given the dimension of the subspace [8]. By minimizingmean-square error, PCA gives us an elegant way to balancecompression of images against ability to discriminate be-PCA is used to compress the spin images coming fromall models simultaneously as follows. Suppose the modellibrary contains spin images of size ; the mean of all Subtracting the mean of the spin images from each spinimage makes the principal directions computed by PCAmore effective for describing the variance between spin xxxbe the mean-subtracted set of spin images which can berepresented as a matrix with each column of the ma- Sxxx$$$The covariance of the spin images is the matrix CSSmmmThe eigenvectors of are then computed by solving theeigenvector problemeCeSince the dimension of the spin images is not too large(~200), the standard Jacobi algorithm from the book merical Recipes in C [22] is used to determine the eigenvec- and eigenvalues of . Since the eigenvectors of can be considered spin images, they will be called Next, the model projection dimension, , is determinedusing a reconstruction metric that depends on the neededfidelity in reconstruction and the variance among images(see [15]). 
3.1 Spin Image Compression

Spin images coming from the same surface can be correlated for two reasons. First, as shown in Fig. 9, spin images generated from oriented point bases that are close to each other on the surface will be correlated. Second, as shown in Fig. 9, surface symmetry and the inherent symmetry of spin image generation will cause two oriented point bases on equal but opposite sides of a plane of symmetry to be correlated. Furthermore, surfaces from different objects can be similar on the local scale, so there can exist a correlation between spin images of small support generated for different objects. This correlation can be exploited to make spin image matching more efficient through image compression. For compression, it is convenient to think of spin images as vectors in a D-dimensional vector space, where D is the number of pixels in the spin image. Correlation between spin images places the set of spin images in a low-dimensional subspace of this space.

A common technique for image compression in object recognition is principal component analysis (PCA) [19]. PCA, or Karhunen-Loeve expansion, is a well-known method for computing the directions of greatest variance for a set of vectors [8]. By computing the eigenvectors of the covariance matrix of the set of vectors, PCA determines an orthogonal basis, called the eigenspace, in which to describe the vectors.

PCA has become popular for efficient comparison of images because it is optimal in the correlation sense: the L2 distance between two spin images in spin image space is the same as the L2 distance between the two spin images represented in the eigenspace. Furthermore, when vectors are projected into a subspace defined by the eigenvectors of largest eigenvalue, the L2 distance between projected vectors is the best approximation (with respect to mean square error) to the L2 distance between the unprojected vectors, given the dimension of the subspace [8]. By minimizing mean-square error, PCA gives us an elegant way to balance compression of images against the ability to discriminate between them.

PCA is used to compress the spin images coming from all models simultaneously as follows. Suppose the model library contains N spin images x_i of size D; the mean of all spin images is

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i .$$

Subtracting the mean of the spin images from each spin image makes the principal directions computed by PCA more effective for describing the variance between spin images. Let

$$\hat{x}_i = x_i - \bar{x}$$

be the mean-subtracted set of spin images, which can be represented as a D x N matrix S with each column of the matrix one mean-subtracted spin image:

$$S = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_N] .$$

The covariance of the spin images is the D x D matrix

$$C = S S^{T} .$$

The eigenvectors e_i of C are then computed by solving the eigenvector problem

$$\lambda_i e_i = C e_i .$$

Since the dimension D of the spin images is not too large (~200), the standard Jacobi algorithm from the book Numerical Recipes in C [22] is used to determine the eigenvectors e_i and eigenvalues λ_i of C. Since the eigenvectors of C can be considered spin images, they will be called eigen-spin images.

Next, the model projection dimension, s, is determined using a reconstruction metric that depends on the needed fidelity in reconstruction and the variance among images (see [15]). Every spin image from each model is then projected into the s-dimensional subspace spanned by the s eigenvectors of largest eigenvalue; the s-tuple of projection coefficients

$$p_j = (\hat{x}_j \cdot e_1, \; \hat{x}_j \cdot e_2, \; \ldots, \; \hat{x}_j \cdot e_s)$$

becomes the compressed representation of the spin image. The amount of compression is determined by the ratio of the spin image dimension D to the projection dimension s. The compressed representation of a model library has two components: the s most significant eigenvectors and the set of s-tuples, one for each model spin image. Since the similarity between images is determined by computing the L2 distance between s-tuples, both the amount of storage for spin images and the time to compare them are reduced.

3.2 Matching Compressed Spin Images

During object recognition, scene spin images are matched to compressed model spin images represented as s-tuples. Given the low dimension of the s-tuples, it is possible to match spin images in time that is sublinear in the number of model spin images using efficient closest point search structures.

To match a scene spin image to a model s-tuple, a scene s-tuple must be generated for the scene spin image. The scene spin image is generated using the model spin image generation parameters. Suppose the scene spin image is represented in vector notation as y. The first step in constructing the scene s-tuple is to subtract the mean of the library spin images:

$$\hat{y} = y - \bar{x} .$$

Next, the mean-subtracted scene spin image is projected onto the top s library eigen-spin images to get the scene s-tuple

$$q = (\hat{y} \cdot e_1, \; \hat{y} \cdot e_2, \; \ldots, \; \hat{y} \cdot e_s) .$$

This s-tuple is the projection of the scene spin image onto the principal directions of the library spin images. To determine the best matching model spin image for a scene spin image, the L2 distance between the scene and model tuples is used. When comparing compressed model spin images, finding closest s-tuples replaces correlating spin images. Although the L2 distance between spin images is not the same as the correlation coefficient used in spin image matching (correlation is really the normalized dot product of two vectors), it is still a good measure of the similarity of two spin images.

To find closest points, we use the efficient closest point search structure proposed by Nene and Nayar [20]. The efficiency of their data structure is based on the assumption that one is interested only in the closest point if it is less than a predetermined distance ε from the query point. This assumption is reasonable in the context of spin image matching, so we chose their data structure. Furthermore, in our experimental comparison, we found that using their data structure resulted in an order of magnitude improvement in matching speed over matching using kd-trees or exhaustive search. The applicability of the algorithm to the problem of matching s-tuples is not surprising; the authors of the algorithm demonstrated its effectiveness in the domain of appearance-based recognition [19], a domain that is similar to spin image matching. In both domains, PCA is used to compress images, resulting in a set of structured s-tuples that must be searched for closest points. In our implementation, the search parameter ε was automatically set to the median of the distances between pairs of closest model s-tuples. Setting ε in this way balances the likelihood of finding closest points against the time spent searching for them.

Fig. 10. Procedure for simultaneous matching of multiple models to a single scene point.
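The compression of Section 3.1 and the query projection of Section 3.2 fit in a short sketch. Here NumPy's symmetric eigensolver stands in for the Jacobi routine of [22], and a kd-tree stands in for the Nene-Nayar search structure (which the comparison above found an order of magnitude faster):

```python
import numpy as np
from scipy.spatial import cKDTree   # stand-in for the Nene-Nayar structure

def compress_library(spin_images, s):
    """PCA-compress N flattened spin images (rows of an (N, D) array) to
    s-tuples. Returns the image mean, the top-s eigen-spin images (s, D),
    and the (N, s) projection coefficients."""
    X = np.asarray(spin_images, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                    # mean-subtracted spin images
    C = Xc.T @ Xc                    # D x D covariance matrix C = S S^T
    w, V = np.linalg.eigh(C)         # eigenvalues in ascending order
    E = V[:, np.argsort(w)[::-1][:s]].T
    return mean, E, Xc @ E.T

def match_compressed(scene_image, mean, E, tree, eps):
    """Project a scene spin image onto the library eigen-spin images and
    return indices of model s-tuples within eps of the scene s-tuple."""
    q = E @ (scene_image.ravel() - mean)
    return tree.query_ball_point(q, r=eps)

# Usage sketch: build once per model, query per scene point.
# mean, E, P = compress_library(model_images, s)   # s ~ D/10 gives 10:1
# tree = cKDTree(P)
# matches = match_compressed(scene_image, mean, E, tree, eps)
```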
Spin image matching with compression is very similar to the recognition algorithm without compression. Fig. 10 shows a pictorial description of the procedure for matching multiple models to a single scene point. Before recognition, all of the model surface meshes are resampled to the same resolution to avoid scale problems when comparing spin images from different models. Next, the spin images for each model in the model library are generated, and the library eigen-spin images are computed. The projection dimension s is then determined for the library. Next, s-tuples for the spin images in each model are computed by projecting model spin images onto the library eigen-spin images. Finally, model s-tuples are stored in an efficient closest point search structure.

At recognition time, a fraction of the oriented points is selected at random from the scene. For each scene oriented point, its spin image is computed using the scene data. Next, for each model, the scene spin image is projected onto the model's eigen-spin images to obtain a scene s-tuple. The scene s-tuple is then used as a query point into the current model's efficient closest point search structure, which returns a list of current model s-tuples close to the scene s-tuple. These point matches are then fed into the surface matching engine to find model/scene surface matches.

3.3 Results

To test our recognition system, we created a model library containing 20 complete object models. The models in the library are shown in Fig. 11; each was created by registering and integrating multiple range views of the objects [13]. Next, cluttered scenes were created by pushing objects into a pile and acquiring a range image with a K2T structured light range finder. The scene data was then processed to remove faces on occluding edges, isolated points, dangling edges, and small patches. This topological filter was followed by mesh smoothing without shrinking [25] and mesh resampling [16] to change the scene data resolution to that of the models in the model library. In all of the following results, the spin image generation parameters are: a bin size equal to the mesh resolution, an image width of 15 bins (225 bins per image), and a support angle of 60 degrees.

Fig. 11. The 20 models used for recognition. (a) Toy sublibrary. (b) Plumbing sublibrary.

Fig. 12 shows the simultaneous recognition of seven models from the library of 20 models. In the top right of the figure is shown the intensity image of the scene, and in the top left is shown the scene intensity image with the positions of recognized models superimposed as white dots. In the middle is shown a frontal 3D view of the scene data, shown as a wireframe mesh, and then the same view of the scene data with models superimposed as shaded surfaces. The bottom shows a top view of the scene and models. From the three views it is clear that the models are closely packed, a condition which creates a cluttered scene with occlusions. Because spin image matching has been designed to be resistant to clutter and occlusion, our algorithm is able to simultaneously recognize the seven most prominent objects in the scene with no incorrect recognitions. Some of the objects present were not recognized because insufficient surface data was present for matching.

Fig. 12. Simultaneous recognition of seven models from a library of 20 models in a cluttered scene.
Fig. 13 shows the simultaneous recognition of six objects from a library of 20 objects in a format similar to Fig. 12. Fig. 14 shows some additional results using the different libraries shown in Fig. 11. These results show that objects can be distinguished even when multiple objects of similar shape appear in the scene (results B, D, F, G, H). They also show that recognition does not fail when a significant portion of the scene surface comes from objects not in the model library (results B, C, D).

Fig. 13. Simultaneous recognition of six models from a library of 20 models in a cluttered scene.

Fig. 14. Additional recognition results using the 20-model library, plumbing library, and toy library shown in Fig. 11. Each result shows a scene and the models recognized in it.

4 ANALYSIS OF CLUTTER AND OCCLUSION

Any recognition algorithm designed for the real world must work in the presence of clutter and occlusion. In Section 2, we claim that creating spin images of small support will make our representation robust to clutter and occlusion. In this section, this claim is verified experimentally. We have developed an experiment to test the effectiveness of our algorithm in the presence of clutter and occlusion. Stated succinctly, the experiment consists of acquiring many scene data sets, running our recognition algorithms on the scenes, and then interactively measuring the clutter and occlusion in each scene along with the recognition success or failure. By plotting recognition success or failure against the amount of clutter or occlusion in the scene, the effect of clutter and occlusion on recognition can be determined.

4.1 Experiments

Recognition success or failure can be broken down into four possible recognition states. If the model exists in the scene and is recognized by the algorithm, this is termed a true-positive state. If the model does not exist in the scene, and the recognition algorithm concludes that the model does exist in the scene or places the model in an entirely incorrect position in the scene, this is termed a false-positive state. If the recognition algorithm concludes that the model does not exist in the scene when it actually does exist in the scene, this is termed a false-negative state. The true-negative state did not exist in our experiments because the model being searched for was always present in the scene.

In our experiment for measuring the effect of clutter and occlusion on recognition, a recognition trial consists of the following steps. First, a model is placed in the scene with some other objects. The other objects might occlude the model and will produce scene clutter. Next, the scene is imaged, and the scene data is processed as described in Section 3.3. A recognition algorithm that matches the model to the scene data is applied, and the result of the algorithm is presented to the user. Using a graphical interface, the user then interactively segments the surface patch that belongs to the model from the rest of the surface data in the scene. Given this segmentation, the amounts of clutter and occlusion are automatically calculated as explained below. By viewing the model superimposed on the scene, the user decides the recognition state; this state is then recorded with the computed clutter and occlusion.

Fig. 15. Recognition states versus clutter and occlusion for compressed and uncompressed spin images.
By executing many recognition trials using different models and many different scenes, a distribution of recognition state versus the amount of clutter and occlusion is generated. Occlusion is defined as

$$\text{occlusion} = 1 - \frac{\text{model surface patch area}}{\text{total model surface area}} .$$

Surface area for a mesh is calculated as the sum of the areas of the faces making up the mesh. The clutter in the scene is defined as

$$\text{clutter} = \frac{\text{clutter points in relevant volume}}{\text{total points in relevant volume}} .$$

Clutter points are vertices in the scene surface mesh that are not on the model surface patch. The relevant volume is the union of the volumes swept out by each spin image of all of the oriented points on the model surface patch. If the relevant volume contains points that are not on the model surface patch, then these points will corrupt scene spin images and are considered clutter points.

We created 100 scenes for analysis as follows. We selected four models from our library of models (Fig. 11), among them the bunny and Mr. Potato Head. We then created 100 scenes using these four models; each scene contained all four models. The models were placed in the scenes without any systematic method. It was our hope that random placement would result in a uniform sampling of all possible scenes containing the four objects. Using four models, we hoped to adequately sample the possible shapes to be recognized, given that sampling of all possible surface shapes is not experimentally feasible.

4.2 Analysis

For each model, we ran recognition without compression on each of the 100 scenes, resulting in 400 recognition trials. The recognition states are shown in a scatterplot in the top of Fig. 15. Each data point in the plot corresponds to a single recognition trial; the coordinates give the amount of clutter and occlusion, and the symbol describes the recognition state. This same procedure, using the same 100 scenes, was repeated for matching spin images with compression (a compression ratio of 10:1), resulting in 400 different recognition runs. A scatterplot of recognition states for compressed spin images is shown at the bottom of Fig. 15. Briefly looking at both scatterplots shows that the number of true-positive states is much larger than the number of false-negative and false-positive states. Furthermore, as the lines in the scatterplots indicate, no recognition errors occur below a fixed level of occlusion, independent of the amount of clutter.

Examining the scatterplots in Fig. 15, one notices that recognition rate is affected by occlusion. At low occlusion values, no recognition failures are reported, while at high occlusion values, recognition failures dominate. This indicates that recognition will almost always work if sufficient model surface area is visible. The decrease in recognition success after a fixed level of occlusion is reached (70 percent) indicates that spin image matching does not work well when only a small portion of the model is visible. This is no surprise, since spin image descriptiveness comes from the accumulation of surface area around a point. On the left in Fig. 16 are shown the experimental recognition rates versus scene occlusion. The rates are computed using a Gaussian-weighted running average (averaging on occlusion independent of clutter level) to avoid the problems with binning.

Fig. 16. Recognition state probability versus occlusion for compressed and uncompressed spin images (left). Recognition state probability versus clutter (right).
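Such a Gaussian-weighted running average can be sketched as follows; the kernel width sigma is an assumed value, not one from the paper:

```python
import numpy as np

def running_rate(occlusion, success, sigma=0.05, samples=100):
    """Gaussian-weighted running average of 0/1 recognition outcomes
    against per-trial occlusion values, avoiding hard binning.

    occlusion : (N,) array of per-trial occlusion values in [0, 1]
    success   : (N,) array of 0/1 recognition outcomes
    """
    xs = np.linspace(occlusion.min(), occlusion.max(), samples)
    w = np.exp(-0.5 * ((xs[:, None] - occlusion[None, :]) / sigma) ** 2)
    return xs, (w @ success) / w.sum(axis=1)   # weighted success rate
```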
These plots show that recognition rate remains high for both forms of matching until an occlusion of around 70 percent is reached; then the successful recognition rate begins to fall off.

Examining the experiment scatterplots in Fig. 15, one notices that the effect of clutter on recognition is uniform across all levels of occlusion until a high level of clutter is reached. This indicates that spin image matching is independent of the clutter in the scene. On the right in Fig. 16, plots of recognition rate versus amount of clutter also show that recognition rate is fairly independent of clutter. As clutter increases, there are slight variations about a fixed recognition rate. Most likely, these variations are due to nonuniform sampling of recognition runs and are not actual trends with respect to clutter. Above a high level of clutter, the number of successful recognitions declines, but from the scatterplots we see that at high levels of clutter, the number of experiments is small, so estimates of recognition rate are imprecise.

In all of the plots showing the effect of clutter and occlusion, the true-positive rates are higher for recognition with spin images without compression when compared with the true-positive rates for recognition with compression. This validates the expected decrease in the accuracy of spin image matching when using compressed spin images. However, it should be noted that the recognition rates for both matching algorithms remain high. For all levels of clutter and occlusion, matching without compression has an average recognition rate of 90.0 percent and matching with compression has an average recognition rate of 83.2 percent. Furthermore, the false-positive rates for both algorithms are low and nearly the same.

The right graph in Fig. 17 shows the result of an experiment that measured the average number of true-positive recognitions for ten scenes versus the number of models in the model library. As the number of models in the library increases, the number of models correctly recognized increases linearly. This is caused by the model library containing more and more of the models that are present in the scene. The graph shows that matching without compression matches slightly more models than matching with 10:1 compression, a consequence of uncompressed spin images being more discriminating.

The time needed to match a single scene spin image to all of the spin images in the model library as the number of models in the library increases is shown in the graph on the left in Fig. 17. All times are real wall clock times on a Silicon Graphics O2 with a 174-MHz R10000 processor. As expected, the matching time for spin images grows linearly with the number of models in the model library because the number of spin images being compared increases linearly with the number of models. This is true of matching with compression and matching without compression; however, the matching times with compression grow significantly slower than the matching times without compression. With 20 models in the library, matching with 10:1 compression is 20 times faster than matching without compression. Since there is only a slight decrease in recognition performance when using compression (right in Fig. 17), compressed spin images should be used in recognition. Another factor in matching time is the number of points in the scene.

Fig. 17. Numbers of models recognized (a) and spin image matching time (b) versus library size for compressed and uncompressed spin images.
To obtain total scene matching times, the times shown in Fig. 17 should be multiplied by the number of points selected from the scene for matching.

5 CONCLUSION

We have presented an algorithm for simultaneous shape-based recognition of multiple objects in cluttered scenes with occlusion. Our algorithm can handle objects of general shape because it is based on the spin image, a data level shape representation that places few restrictions on object shape. Through compression of spin images using PCA, we have made the spin image representation efficient enough for recognition from large model libraries. Finally, we have shown experimentally that the spin image representation is robust to clutter and occlusion. Through improvements and analysis, we have shown that the spin image representation is an appropriate representation for recognizing objects in complicated real scenes.

Spin images are a general shape representation, so their applicability to problems in 3D computer vision is broad. This paper has investigated the application of spin images to object recognition; however, other spin image applications exist. For instance, the general nature of spin images makes them an appropriate representation for shape analysis, the process that quantifies similarities and differences between the shapes of objects. Shape analysis can lead to object classification, analysis of object symmetry, and parts decomposition, all of which can be used to make object recognition more efficient. Other possible applications of spin images include 3D object tracking and volumetric image registration.

There still exist some algorithmic additions which could be implemented to make spin image matching more efficient and robust. Some extensions currently being investigated are multiresolution spin images for coarse-to-fine recognition, automated learning of descriptive spin images, and improved spin image parameterizations.

ACKNOWLEDGMENTS

We would like to thank Jim Osborn and all the members of the Artisan project for supporting this work. We would also like to thank Karun Shimoga for the use of the K2T sensor and Kaushik Merchant for his time spent segmenting 3D scenes. This research was performed at Carnegie Mellon University and was supported by the U.S. Department of Energy under contract DE-AC21-92MC29104 and by the National Science Foundation under grant IRI-9711853.

REFERENCES

[1] F. Arman and J.K. Aggarwal, "CAD-Based Vision: Object Recognition in Cluttered Range Images Using Recognition Strategies," Computer Vision, Graphics and Image Processing, vol. 58, no. 1.
[2] P. Besl, "The Triangle as Primary Representation," Object Representation in Computer Vision, M. Hebert et al., eds. Berlin: Springer-Verlag, 1995.
[3] C. Chua and R. Jarvis, "3-D Free-Form Surface Registration and Object Recognition," Int'l J. Computer Vision, vol. 17, no. 1.
[4] C. Chua and R. Jarvis, "Point Signatures: A New Representation for 3D Object Recognition," Int'l J. Computer Vision, vol. 25, no. 1.
[5] D. Dion Jr., D. Laurendeau, and R. Bergevin, "Generalized Cylinder Extraction in Range Images," Proc. IEEE Int'l Conf. Recent Advances in 3-D Digital Imaging and Modeling, pp. 141-147.
[6] C. Dorai and A. Jain, "COSMOS—A Representation Scheme for 3D Free-Form Objects," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 10, pp. 1,115-1,130, Oct. 1997.
[7] O. Faugeras and M. Hebert, "The Representation, Recognition, and Locating of 3-D Objects," Int'l J. Robotics Research, vol. 5, no. 3.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic Press, 1972.
[9] W.E.L. Grimson, "Localizing Overlapping Parts by Searching the Interpretation Tree," IEEE Trans. Pattern Analysis and Machine Intelligence.
[10] M. Hebert, K. Ikeuchi, and H. Delingette, "A Spherical Representation for Recognition of Free-Form Surfaces," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 7, pp. 681-689, July 1995.
[11] C.-Y. Huang, O. Camps, and T. Kanungo, "Object Recognition Using Appearance-Based Parts and Relations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 877-883.
[12] A. Johnson and M. Hebert, "Surface Matching for Object Recognition in Complex Three-Dimensional Scenes," Image and Vision Computing.
[13] A. Johnson and S. Kang, "Registration and Integration of Textured 3-D Data," Image and Vision Computing, vol. 17.
[14] A. Johnson and M. Hebert, "Efficient Multiple Model Recognition in Cluttered 3-D Scenes," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[15] A. Johnson, Spin-Images: A Representation for 3-D Surface Matching, doctoral dissertation, The Robotics Institute, Carnegie Mellon Univ., 1997.
[16] A. Johnson and M. Hebert, "Control of Polygonal Mesh Resolution for 3-D Computer Vision," Graphical Models and Image Processing.
[17] S. Kang and K. Ikeuchi, "The Complex EGI: New Representation for 3-D Pose Determination," IEEE Trans. Pattern Analysis and Machine Intelligence.
[18] Y. Lamdan and H. Wolfson, "Geometric Hashing: A General and Efficient Model-Based Recognition Scheme," Proc. Second Int'l Conf. Computer Vision.
[19] H. Murase and S. Nayar, "Visual Learning and Recognition of 3-D Objects From Appearance," Int'l J. Computer Vision, vol. 14.
[20] S. Nene and S. Nayar, "Closest Point Search in High Dimensions," Proc. IEEE Conf. Computer Vision and Pattern Recognition.
[21] K. Ohba and K. Ikeuchi, "Detectability, Uniqueness and Reliability of Eigen-Windows for Stable Verification of Partially Occluded Objects," IEEE Trans. Pattern Analysis and Machine Intelligence.
[22] W. Press et al., Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge, England: Cambridge Univ. Press.
[23] N. Raja and A. Jain, "Recognizing Geons From Superquadrics Fitted to Range Data," Image and Vision Computing, vol. 10, no. 3.
[24] F. Stein and G. Medioni, "Structural Indexing: Efficient 3-D Object Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence.
[25] G. Taubin, "A Signal Processing Approach to Fair Surface Design," Proc. ACM SIGGRAPH Conf. Computer Graphics.
[26] G. Taubin, "Discrete Surface Signal Processing: The Polygon as the Surface Element," Object Representation in Computer Vision, M. Hebert et al., eds. Berlin: Springer-Verlag, 1995.
[27] Z. Zhang, "Iterative Point Matching for Registration of Free-Form Curves and Surfaces," Int'l J. Computer Vision, vol. 13, no. 2, pp. 119-152, 1994.

Andrew E. Johnson graduated with Highest Distinction from the University of Kansas in 1991 with a BS in engineering physics and a BS in mathematics. In 1997, he received his PhD from the Robotics Institute at Carnegie Mellon University, where he studied 3D object representation and recognition.
Currently, he is a senior member of technical staff at the Jet Propulsion Laboratory, where he is researching image-based techniques for autonomous navigation around comets and asteroids. In addition, he is doing development work for the landing phase of NASA's Deep Space 4/Champollion comet rendezvous mission. His general research interests are 3D object recognition, surface registration, multiview integration, environment modeling, and 3D structure and motion recovery from image streams.

Martial Hebert received his doctorate in 1984 from the University of Paris. Dr. Hebert is a senior research scientist in Carnegie Mellon's Robotics Institute. His research interests are in the area of perception for robot systems. His work has focused on building and recognizing 3D models of environments and objects, including the development of the 3D perception system and obstacle detection, recognition, and terrain mapping using an imaging laser range finder for autonomous vehicles. In the area of object recognition, he has developed new techniques for representing and recognizing general, 3D free-form objects using mesh structures. He is currently exploring applications of those techniques to interior mapping for nuclear environments and other industrial environments, large-scale terrain mapping from aerial and ground-based 3D data, and accurate object modeling from multiple views. In addition to applications, Dr. Hebert is investigating new techniques for recognition, including the use of learning techniques, the automatic selection of optimal points for matching, and the use of harmonic mapping techniques for matching. Dr. Hebert is also investigating recognition from video images.