Off-Road Obstacle Avoidance through End-to-End Learning

Yann LeCun
Courant Institute of Mathematical Sciences, New York University, New York, NY 10004, USA
http://yann.lecun.com

Urs Muller
Net-Scale Technologies, Morganville, NJ 07751, USA
urs@net-scale.com

Jan Ben
Net-Scale Technologies, Morganville, NJ 07751, USA

Eric Cosatto
NEC Laboratories, Princeton, NJ 08540

Beat Flepp
Net-Scale Technologies, Morganville, NJ 07751, USA

Abstract

We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50 cm off-road truck with two forward-pointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.

Introduction

Autonomous off-road vehicles have vast potential applications in a wide spectrum of domains such as exploration, search and rescue, transport of supplies, environmental management, and reconnaissance. Building a fully autonomous off-road vehicle that can reliably navigate and avoid obstacles at high speed is a major challenge for robotics, and a new domain of application for machine learning research. The last few years have seen considerable progress toward that goal, particularly in areas such as mapping the environment from active range sensors and stereo cameras [11, 7], simultaneously navigating and building maps [6, 15], and classifying obstacle types.

Among the various sub-problems of off-road vehicle navigation, obstacle detection and avoidance is a subject of prime importance. The wide diversity of appearance of potential obstacles, and the variability of the surroundings, lighting conditions, and other factors, make the problem very challenging.

Many recent efforts have attacked the problem by relying on a multiplicity of sensors, including laser range finders and radar [11]. While active sensors make the problem considerably simpler, there seems to be an interest from potential users for purely passive systems that rely exclusively on camera input. Cameras are considerably less expensive, bulky,
power-hungry, and detectable than active sensors, allowing levels of miniaturization that are not otherwise possible. More importantly, active sensors can be slow, limited in range, and easily confused by vegetation, despite rapid progress in the area [2].

Avoiding obstacles by relying solely on camera input requires solving a highly complex vision problem. A time-honored approach is to derive range maps from multiple images obtained through multiple cameras or through motion [6]. Deriving steering angles that avoid obstacles from the range maps is then a simple matter. A large number of techniques have been proposed in the literature to construct range maps from stereo images. Such methods have been used successfully for many years for navigation in indoor environments where edge features can be reliably detected and matched [1], but navigation in outdoor environments, despite a long history, is still a challenge [14, 3]: real-time stereo algorithms are considerably less reliable in unconstrained outdoor environments. The extreme variability of lighting conditions, and the highly unstructured nature of natural objects such as tall grass, bushes and other vegetation, water surfaces, and objects with repeating textures, conspire to limit the reliability of this approach. In addition, stereo-based methods have a rather limited range, which dramatically limits the maximum driving speed.

End-to-End Learning for Obstacle Avoidance

In general, computing depth from stereo images is an ill-posed problem, but the depth map is only a means to an end. Ultimately, the output of an obstacle avoidance system is a set of possible steering angles that direct the robot toward traversable regions. Our approach is to view the entire problem of mapping input stereo images to possible steering angles as a single indivisible task to be learned from end to end.

Our learning system takes raw color images from two forward-pointing cameras mounted on the robot, and maps them to a set of possible steering angles through a single trained function. The training data was collected by recording the actions of a human driver together with the video data. The human driver remotely drives the robot straight ahead until the robot encounters a non-traversable obstacle, then avoids the obstacle by steering the robot in the appropriate direction. The learning system is trained in supervised mode: it takes a single pair of heavily subsampled images from the two cameras, and is trained to predict the steering angle produced by the human driver at that time.

The learning architecture is a 6-layer convolutional network [9]. The network takes the left and right 149x58 color images and produces two outputs. A large value on the first output is interpreted as a left steering command, while a large value on the second output indicates a right steering command. Each layer in a convolutional network can be viewed as a set of trainable, shift-invariant linear filters with local support, followed by a point-wise non-linear saturation function. All the parameters of all the filters in the various layers are trained simultaneously. The learning algorithm minimizes the discrepancy between the desired output vector and the output vector produced by the output layer.

The approach is somewhat reminiscent of the ALVINN and MANIAC systems [13, 4]. The main differences with ALVINN are: (1) our system uses stereo cameras; (2) it is trained for off-road obstacle avoidance rather than road following; (3) our trainable system uses a convolutional network rather than a traditional fully-connected neural net.
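To make the supervised objective concrete, the following is a minimal sketch of one training step under the two-output encoding described above, written in a modern framework (PyTorch). The mapping from the driver's steering angle to the two targets and the sign convention are illustrative assumptions, not the authors' exact recipe; the paper only states that a large first output means "turn left" and a large second output means "turn right".

```python
# Minimal sketch of the end-to-end supervised objective (assumes PyTorch).
import torch
import torch.nn.functional as F

def steering_to_targets(angle: float) -> torch.Tensor:
    # Assumed convention: negative angle = left turn, positive = right turn.
    return torch.tensor([max(0.0, -angle), max(0.0, angle)])

def training_step(net, optimizer, stereo_yuv, angle):
    # stereo_yuv: (6, 58, 149) tensor holding the YUV planes of both cameras.
    optimizer.zero_grad()
    outputs = net(stereo_yuv.unsqueeze(0))             # (1, 2)
    targets = steering_to_targets(angle).unsqueeze(0)  # (1, 2)
    loss = F.mse_loss(outputs, targets)                # squared-error loss
    loss.backward()                                    # backprop through all layers
    optimizer.step()
    return loss.item()
```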
Convolutional networks have two considerable advantages for this application. Their local and sparse connection scheme allows us to handle images of higher resolution than ALVINN while keeping the size of the network within reasonable limits. Convolutional nets are also particularly well suited for our task because local feature detectors that combine inputs from the left and right images can be useful for estimating distances to obstacles (possibly by estimating disparities). Furthermore, the local and shift-invariant property of the filters allows the system to learn relevant local features from a limited amount of training data.

The advantage of the approach is that the entire function from raw pixels to steering angles is trained from data, which completely eliminates the need for feature design and
selection, geometry, camera calibration, and hand-tuning of parameters. The main motivation for the use of end-to-end learning is, in fact, to eliminate the need for hand-crafted heuristics. Relying on automatic global optimization of an objective function over massive amounts of data may produce systems that are more robust to the unpredictable variability of the real world. Another potential benefit of a purely learning-based approach is that the system may use cues other than stereo disparity to detect obstacles, possibly alleviating the short-sightedness of methods based purely on stereo matching.

Vehicle Hardware

We built a small and light-weight vehicle which can be carried by a single person, so as to facilitate data collection and testing in a wide variety of environments. Using a small, rugged, and low-cost robot allowed us to drive at relatively high speed without fear of causing damage to people, property, or the robot itself. The downside of this approach is the limited payload, too limited to hold the computing power necessary for the visual processing. Therefore, the robot has no significant on-board computing power. It is remotely controlled by an off-board computer: a wireless link transmits video and sensor readings to the remote computer, while throttle and steering controls are sent from the computer to the robot through a regular radio control channel.

The robot chassis was built around a customized 1/10-th scale remote-controlled, electric-powered, four-wheel-drive truck, roughly 50 cm in length. The typical speed of the robot during data collection and testing sessions was roughly 2 meters per second. Two forward-pointing low-cost 1/3-inch CCD cameras were mounted 110 mm apart behind a clear lexan window. With 2.5 mm lenses, the horizontal field of view of each camera was about 100 degrees. A pair of 900 MHz analog video transmitters was used to send the camera outputs to the remote computer. The analog video links were subject to high signal noise, color shifts, frequent interference, and occasional video drop-outs, but the small size, light weight, and low cost provided clear advantages. The vehicle is shown in Figure 1. The remote control station consisted of a 1.4 GHz Athlon PC running Linux with video capture cards and an interface to an R/C transmitter.

Figure 1: Left: the robot is a modified 50 cm-long truck platform controlled by a remote computer. Middle: sample images from the training data. Right: poor reception occasionally caused bad-quality images.

Data Collection

During a data collection session, the human operator wears video goggles fed with the video signal from one of the robot cameras (no stereo), and controls the robot through a joystick connected to the PC. During each run, the PC records the output of the two video cameras at 15 frames per second, together with the steering angle and throttle setting from the operator.
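As an illustration, a per-frame record of the kind described above might look as follows. This is a hypothetical sketch: the field names and the camera/joystick interfaces are invented for the example and are not taken from the authors' software.

```python
# Hypothetical sketch of the per-frame record captured during data collection.
# The camera and joystick objects are assumed interfaces, not real libraries.
from dataclasses import dataclass
import numpy as np

@dataclass
class Frame:
    timestamp: float      # seconds since the start of the run
    left: np.ndarray      # 240x320x3 image from the left camera
    right: np.ndarray     # 240x320x3 image from the right camera
    steering: float       # steering angle commanded by the human driver
    throttle: float       # throttle setting from the joystick

def record_run(camera, joystick, duration_s, fps=15):
    """Capture stereo frames and driver commands at 15 Hz for one run."""
    frames = []
    for i in range(int(duration_s * fps)):
        left, right = camera.grab_pair()   # assumed: returns both images
        steer, thr = joystick.read()       # assumed: returns (steering, throttle)
        frames.append(Frame(i / fps, left, right, steer, thr))
    return frames
```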
A crucially important requirement of the data collection process was to collect large amounts of data with enough diversity of terrain, obstacles, and lighting conditions. It was also necessary for the human driver to adopt a consistent obstacle avoidance behaviour. To ensure this, the human driver was to drive the vehicle straight ahead whenever no obstacle was present within a threatening distance. Whenever the robot approached an obstacle, the human driver had to steer left or right so as to avoid it. The general strategy for collecting training data was as follows: (a) collecting data from as large a variety of off-road training grounds as possible: data was collected from a large number of parks, playgrounds, front yards and back yards of a number of suburban homes, and heavily cluttered construction areas; (b) collecting data under various lighting conditions, i.e., different weather conditions and different times of day; (c) collecting sequences where the vehicle starts driving straight and is then steered left or right as the robot approaches an obstacle; (d) avoiding turns when no obstacles were present; (e) including straight runs with no obstacles and no turns as part of the training set; (f) trying to be consistent in the turning behavior, i.e., always turning at approximately the same distance from an obstacle.

Even though great care was taken to collect the highest quality training data, a number of imperfections in the training data could not be avoided: (a) the small-form-factor, low-cost cameras presented significant differences in their default settings; in particular, the white balance of the two cameras was somewhat different; (b) to maximize image quality, the automatic gain control (AGC) and automatic exposure were activated. Because of differences in fabrication, the left and right images had slightly different brightness and contrast characteristics; in particular, the AGC adjustments seemed to react at different speeds and amplitudes; (c) because of AGC, driving into the sunlight caused the images to become very dark and obstacles to become hard to detect; (d) the wireless video connection caused dropouts and distortions in approximately 5% of the frames; an example is shown in Figure 1; (e) the cameras were mounted rigidly on the vehicle and were exposed to vibration, despite the suspension. Despite these difficult conditions, the system managed to learn the task quite well, as will be shown later.

The data was recorded and archived at a resolution of 320x240 pixels at 15 frames per second. The data was collected on 17 different days during the winter of 2003/2004 (the sun was very low on the horizon). A total of 1,500 clips were collected, with an average length of about 85 frames each. This resulted in a total of about 127,000 individual pairs of frames. Segments during which the robot was driven into position in preparation for a run were edited out. No other manual data cleaning took place. In the end, 95,000 frame pairs were used for training and 32,000 for validation/testing. The training pairs and testing pairs came from different sequences (and often different locations). Figure 1 shows example snapshots from the training data, including an image with poor reception. Note that only one of the two (stereo) images is shown. High noise and frame dropouts occurred in approximately 5% of the frames. It was decided to leave them in the training set and test set so as to train the system under realistic conditions.
The Learning System

The entire processing consists of a single convolutional network. The architecture of convolutional nets is somewhat inspired by the structure of biological visual systems. Convolutional nets have been used successfully in a number of vision applications such as handwriting recognition [9], object recognition [10], and face detection [12].

The input to the convolutional net consists of six planes of size 149x58 pixels. The six planes respectively contain the Y, U, and V components for the left camera and for the right camera. The input images were obtained by cropping the 320x240 camera images, then applying horizontal low-pass filtering and subsampling, followed by vertical low-pass filtering and subsampling. The horizontal resolution was set higher so as to preserve more accurate image disparity information.
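A hedged sketch of this preprocessing is shown below. The paper does not give the crop offsets, the filter, or the exact subsampling ratios, so the box filter, the 2x horizontal / 4x vertical decimation, and the crop window are assumptions chosen only to reproduce the stated 149x58 output size and the higher horizontal resolution.

```python
# Sketch of the input preprocessing (assumptions noted in the lead-in above).
import numpy as np

def rgb_to_yuv(img):
    """HxWx3 float RGB in [0,1] -> HxWx3 YUV (standard BT.601 weights)."""
    m = np.array([[ 0.299,  0.587,  0.114],
                  [-0.147, -0.289,  0.436],
                  [ 0.615, -0.515, -0.100]])
    return img @ m.T

def box_subsample(img, fy, fx):
    """Average-pool by (fy, fx): low-pass filtering plus decimation."""
    h = (img.shape[0] // fy) * fy
    w = (img.shape[1] // fx) * fx
    return img[:h, :w].reshape(h // fy, fy, w // fx, fx, 3).mean(axis=(1, 3))

def preprocess(frame):
    """240x320x3 uint8 camera frame -> 58x149x3 YUV input planes."""
    yuv = rgb_to_yuv(frame.astype(np.float32) / 255.0)
    small = box_subsample(yuv, fy=4, fx=2)   # 240x320 -> 60x160
    return small[1:59, 5:154]                # crop to 58x149 (offsets assumed)
```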
Each layer in a convolutional net is composed of units organized in planes called feature maps. Each unit in a feature map takes inputs from a small neighborhood within the feature maps of the previous layer. Neighboring units in a feature map are connected to neighboring (possibly overlapping) windows. Each unit computes a weighted sum of its inputs and passes the result through a sigmoid saturation function. All units within a feature map share the same weights. Therefore, each feature map can be seen as convolving the feature maps of the previous layers with small kernels, and passing the sum of those convolutions through a sigmoid function. Units in a feature map detect local features at all locations on the previous layer.

The first layer contains six feature maps of size 147x56, connected to various combinations of the input maps through 3x3 kernels. The first feature map is connected to the YUV planes of the left image, the second feature map to the YUV planes of the right image, and the remaining four feature maps to all six input planes. Those four feature maps are binocular and can learn filters that compare the locations of features in the left and right images. Because of the weight sharing, the first layer has merely 276 free parameters (30 kernels of size 3x3, plus 6 biases). The next layer is an averaging/subsampling layer of size 49x14 whose purpose is to reduce the spatial resolution of the feature maps so as to build invariance to small geometric distortions of the input. The subsampling ratios are 3 horizontally and 4 vertically. The third layer contains 24 feature maps of size 45x12. Each feature map is connected to various subsets of maps in the previous layer through a total of 96 kernels of size 5x3. The fourth layer is an averaging/subsampling layer of size 9x4 with 5x3 subsampling ratios. The fifth layer contains 100 feature maps of size 1x1, connected to the fourth layer through 2400 kernels of size 9x4 (full connection). Finally, the output layer contains two units fully connected to the 100 units in the fifth layer. The two outputs respectively code for "turn left" and "turn right" commands. The network has 3.15 million connections and about 72,000 trainable parameters. The bottom half of Figure 2 shows the states of the six layers of the convolutional net.

The size of the input, 149x58, was essentially limited by the computing power of the remote computer (a 1.4 GHz Athlon). The network as shown runs in about 60 ms per image pair on the remote computer. Including all the processing, the driving system ran at a rate of 10 cycles per second.

The system output is computed on a frame-by-frame basis, with no memory of the past and no time window. Using multiple successive frames as input would seem like a good idea, since the multiple views resulting from ego-motion facilitate the segmentation and detection of nearby obstacles. Unfortunately, the supervised learning approach precludes the use of multiple frames: since the steering is fairly smooth in time (with long, stable periods), the current rate of turn is an excellent predictor of the next desired steering angle, and the current rate of turn is easily derived from multiple successive frames. Hence, a system trained with multiple frames would merely predict a steering angle equal to the current rate of turn as observed through the camera. This would lead to catastrophic behavior in test mode: the robot would simply turn in circles.

The system was trained with a stochastic gradient-based method that automatically sets the relative step sizes of the parameters based on the local curvature of the loss surface [8]. Gradients were computed using the variant of back-propagation appropriate for convolutional nets.
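The following is a sketch of this architecture in a modern framework (PyTorch), under a few stated simplifications: tanh stands in for the sigmoid saturation, the third layer is given full connectivity (the paper connects each of its maps to subsets of the previous layer's maps, using 96 rather than 144 kernels), and plain average pooling replaces the averaging/subsampling layers. The layer shapes follow the text.

```python
# Sketch of the 6-layer convolutional net (simplifications noted above).
import torch
import torch.nn as nn

class DaveNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer 1: six 147x56 maps. Map 1 sees the left YUV planes, map 2 the
        # right YUV planes, and four binocular maps see all six input planes.
        # Parameter count matches the paper: 30 3x3 kernels + 6 biases = 276.
        self.left_map  = nn.Conv2d(3, 1, kernel_size=3)
        self.right_map = nn.Conv2d(3, 1, kernel_size=3)
        self.binoc     = nn.Conv2d(6, 4, kernel_size=3)
        # Layer 2: subsample by 3 horizontally, 4 vertically: 147x56 -> 49x14.
        self.pool1 = nn.AvgPool2d(kernel_size=(4, 3))
        # Layer 3: 24 maps of 45x12 via 5x3 kernels (full connectivity here).
        self.conv3 = nn.Conv2d(6, 24, kernel_size=(3, 5))
        # Layer 4: subsample with 5x3 ratios: 45x12 -> 9x4.
        self.pool2 = nn.AvgPool2d(kernel_size=(3, 5))
        # Layer 5: 100 maps of 1x1, fully connected through 9x4 kernels.
        self.conv5 = nn.Conv2d(24, 100, kernel_size=(4, 9))
        # Output layer: two units, "turn left" and "turn right".
        self.out = nn.Linear(100, 2)

    def forward(self, x):                  # x: (N, 6, 58, 149), YUV planes
        h = torch.tanh(torch.cat([self.left_map(x[:, :3]),
                                  self.right_map(x[:, 3:]),
                                  self.binoc(x)], dim=1))  # (N, 6, 56, 147)
        h = self.pool1(h)                  # (N, 6, 14, 49)
        h = torch.tanh(self.conv3(h))      # (N, 24, 12, 45)
        h = self.pool2(h)                  # (N, 24, 4, 9)
        h = torch.tanh(self.conv5(h))      # (N, 100, 1, 1)
        return self.out(h.flatten(1))      # (N, 2)
```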
Results

Two performance measurements were recorded: the average loss and the percentage of "correctly classified" steering angles. The average loss is the sum of squared differences between the outputs produced by the system and the target outputs, averaged over all samples. The percentage of correctly classified steering angles measures how often the predicted steering angle, quantized into three bins (left, straight, right), agrees with the steering angle provided by the human driver. Since the thresholds for deciding whether an angle counted as left, center, or right were somewhat arbitrary, the percentages cannot be interpreted in absolute terms, but merely as a relative figure of merit for comparing runs and architectures.
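As a sketch, the two measures could be computed as follows, assuming numpy arrays of the network's outputs, the targets, and the predicted and human steering angles. The bin threshold is arbitrary, as the paper itself notes.

```python
# Sketch of the two reported measures (the 0.1 threshold is an assumption).
import numpy as np

def quantize(angle, thresh=0.1):
    """Map angles to three bins: -1 (left), 0 (straight), +1 (right)."""
    return np.where(angle < -thresh, -1, np.where(angle > thresh, 1, 0))

def evaluate(outputs, targets, pred_angle, human_angle):
    # Average loss: squared output error averaged over all samples.
    avg_loss = np.mean(np.sum((outputs - targets) ** 2, axis=1))
    # Percent of frames whose quantized angle agrees with the driver's.
    agree = np.mean(quantize(pred_angle) == quantize(human_angle))
    return avg_loss, 100.0 * agree
```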
Figure 2: Internal state of the convolutional net for two sample frames. The top row shows the left/right image pairs extracted from the test set. The light-blue bars below show the steering angle produced by the system. The bottom halves show the states of the layers of the network, where each column is a layer (the penultimate layer is not shown). Each rectangular image is a feature map in which each pixel represents a unit activation. The YUV components of the left and right input images are in the leftmost column.

With 95,000 training image pairs, training took 18 epochs through the training set. No significant improvements in the error rate occurred thereafter. After training, the error rate was 25.1% on the training set and 35.8% on the test set. The average loss (mean-squared error) was 0.88 on the training set and 1.24 on the test set. A complete training session required about four days of CPU time on a 3.0 GHz Pentium/Xeon-based server.

Naturally, a classification error rate of 35.8% does not mean that the vehicle crashes into obstacles 35.8% of the time, but merely that the prediction of the system was in a different bin than that of the human driver for 35.8% of the frames. The seemingly high error rate is not an accurate reflection of the actual effectiveness of the robot in the field, for several reasons. First, there may be several legitimate steering angles for a given image pair: turning left or right around an obstacle may both be valid options, but our performance measure would record one of those options as incorrect. In addition, many illegitimate errors are recorded when the system starts turning at a different time than the human driver, or when the precise values of the steering angles are different enough to fall in different bins, yet close enough for the robot to avoid the obstacle.

Perhaps more informative is the diagram in Figure 3. It shows the steering angle produced by the system and the steering angle provided by the human driver for 8,000 frames from the test set. It is clear from the plot that only a small number of obstacles would not have been avoided by the robot. The best performance measure is a set of actual runs through representative testing grounds. Videos of typical test runs are available at http://www.cs.nyu.edu/~yann/research/dave/index.html.

Figure 2 shows a snapshot of the trained system in action. The network was presented with a scene that was not present in the training set. The figure shows that the system can detect obstacles and predict appropriate steering angles in the presence of back-lighting and with wild differences between the automatic gain settings of the left and right cameras.

Another visualization of the results can be seen in Figure 4. These are snapshots of video clips recorded from the vehicle cameras while the vehicle was driving itself autonomously. Only one of the two camera outputs is shown. Each picture also shows the steering angle produced by the system for that particular input.
Figure 3: The steering angle produced by the system (black) compared to the steering angle provided by the human operator (red line) for 8,000 frames from the test set. Very few obstacles would not have been avoided by the system.

Conclusion

We have demonstrated the applicability of end-to-end learning methods to the task of obstacle avoidance for off-road robots. A 6-layer convolutional network was trained with massive amounts of data to emulate the obstacle avoidance behavior of a human driver. The architecture of the system allowed it to learn low-level and high-level features that reliably predicted the bearing of traversable areas in the visual field.

The main advantage of the system is its robustness to the extreme diversity of situations in off-road environments. Its main design advantage is that it is trained from raw pixels to directly produce steering angles. The approach essentially eliminates the need for manual calibration, adjustments, parameter tuning, etc. Furthermore, the method gets around the need to design and select an appropriate set of feature detectors, as well as the need to design robust and fast stereo algorithms.

The construction of a fully autonomous driving system for ground robots will require several other components besides the purely reactive obstacle detection and avoidance system described here. The present work is merely one component of a future system that will include map building, visual odometry, spatial reasoning, path finding, and other strategies for the identification of traversable areas.

Acknowledgment

This project was a preliminary study for the DARPA project "Learning Applied to Ground Robots" (LAGR). The material presented is based upon work supported by the Defense Advanced Research Projects Agency, Information Processing Technology Office, ARPA Order No. Q458, Program Code No. 3D10, issued by DARPA/CMO under Contract #MDA972-03-C-0111.

References

[1] N. Ayache and O. Faugeras. Maintaining representations of the environment of a mobile robot. IEEE Trans. Robotics and Automation, 5(6):804–819, 1989.

[2] C. Bergh, B. Kennedy, L. Matthies, and A. Johnson. A compact, low power two-axis scanning laser rangefinder for mobile robots. In The 7th Mechatronics Forum International Conference, 2000.

[3] S. B. Goldberg, M. Maimone, and L. Matthies. Stereo vision and rover navigation software for planetary exploration. In IEEE Aerospace Conference Proceedings, March 2002.

[4] T. Jochem, D. Pomerleau, and C. Thorpe. Vision-based neural network road and intersection detection and traversal. In Proc. IEEE Conf. Intelligent Robots and Systems, volume 3, pages 344–349, August 1995.
Figure 4: Snapshots from the left camera while the robot drives itself through various environments. The black bar beneath each image indicates the steering angle produced by the system. Top row: four successive snapshots showing the robot navigating through a narrow passageway between a trailer, a backhoe, and some construction material. Bottom row: narrow obstacles such as table legs and poles (left), and solid obstacles such as fences (center-left), are easily detected and avoided. Highly textured objects on the ground do not distract the system from the correct response (center-right). One scenario where the vehicle occasionally made wrong decisions is when the sun is in the field of view: the system seems to systematically drive towards the sun whenever the sun is low on the horizon (right). Videos of these sequences are available at http://www.cs.nyu.edu/~yann/research/dave/index.html

[5] A. Kelly and A. Stentz. Stereo vision enhancements for low-cost outdoor autonomous vehicles. In International Conference on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles (ICRA '98), May 1998.

[6] D. J. Kriegman, E. Triendl, and T. O. Binford. Stereo vision and navigation in buildings for mobile robots. IEEE Trans. Robotics and Automation, 5(6):792–803, 1989.

[7] E. Krotkov and M. Hebert. Mapping and positioning for a prototype lunar rover. In Proc. IEEE Int'l Conf. Robotics and Automation, pages 2913–2919, May 1995.

[8] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.

[9] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[10] Yann LeCun, Fu-Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR'04. IEEE Press, 2004.

[11] L. Matthies, E. Gat, R. Harrison, B. Wilcox, R. Volpe, and T. Litwin. Mars microrover navigation: performance evaluation and enhancement. In Proc. IEEE Int'l Conf. Intelligent Robots and Systems, volume 1, pages 433–440, August 1995.

[12] R. Osadchy, M. Miller, and Y. LeCun. Synergistic face detection and pose estimation with energy-based models. In Advances in Neural Information Processing Systems (NIPS 2004). MIT Press, 2005.

[13] Dean A. Pomerleau. Knowledge-based training of artificial neural networks for autonomous robot driving. In J. Connell and S. Mahadevan, editors, Robot Learning. Kluwer Academic Publishing, 1993.

[14] C. Thorpe, M. Hebert, T. Kanade, and S. Shafer. Vision and navigation for the Carnegie-Mellon Navlab. IEEE Trans. Pattern Analysis and Machine Intelligence, 10(3):362–372, May 1988.

[15] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21–71, February 1998.
