Implementation of a Wide-Angle Lens Distortion Correction Algorithm on the Cell Broadband Engine
Konstantis Daloukas
Christos D. Antonopoulos
Nikolaos Bellas
Department of Computer and Communications Engineering
University of Thessaly
Volos, Greece

June 9, 2009, ICS 2009

Introduction
Wide-angle lenses (a.k.a. fisheye lenses) are traditionally used to enlarge the field of view in photography.
[Figure: three example photographs]
  Conventional rectilinear lens
  Full-frame fisheye lens: 98 degrees horizontal by 147 degrees vertical
  Full circular fisheye lens: 180 degrees horizontal and vertical

Introduction
Main applications:
  Meteorology
  Astronomy
  Robot navigation
  Video surveillance
  Video conferencing
  Digital cameras
The incoming rays are mapped onto a spherical surface.
Such mapping introduces barrel distortion.

Motivation
Distortion must be corrected in real time
  25-30 fps at VGA resolution for our application
Real-time distortion correction
  Not feasible with contemporary general-purpose processors
  Core 2 Quad: 15.82 fps with SSE and 4 threads
Use a high-performance, non-conventional processor such as the Cell Broadband Engine (CBE)

Outline
  Introduction
  The Cell Broadband Engine Architecture
  Wide-angle Lenses Distortion Correction Algorithm
  Mapping and Optimization Steps
  Conclusion

Cell BE Architecture
[Figure] From: J. A. Kahle et al., "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, 49(4/5):589-604, July/September 2005.

Cell BE Key Performance Characteristics
Peak performance: 256 Gflops for single-precision FP arithmetic
Offers a rich repertoire for exploiting the various levels of parallelism:
  8 SPEs: thread-level parallelism
  SIMD computational engines: data-level parallelism
  Dual-issue pipeline: instruction-level parallelism

Wide-angle Lenses Distortion Correction
Transformation of the distorted wide-angle images back to the central perspective space.

Projection Model of Wide-angle Lenses
[Figure: wide-angle projection compared with central perspective projection]

Algorithmic Flow (A)
Inverse mapping: maps each image point (i, j) to the corresponding point (x, y) in the wide-angle space
Need to approximate the value at fractional positions in the fisheye space
Complex, irregular memory access pattern
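
The slides do not state the lens model, so the following is only a minimal sketch of an inverse mapping under an assumed equidistant fisheye model (r = f * theta). The function name `inverse_map` and the focal-length parameters `f_persp` and `f_fish` are hypothetical, and both images are assumed to be centred on the optical axis.

```python
import math

def inverse_map(i, j, f_persp, f_fish):
    """Map a point (i, j) of the corrected (central perspective) image to
    fractional coordinates (x, y) in the wide-angle image.

    Assumption: equidistant fisheye model, r_fish = f_fish * theta, with
    both images centred on the optical axis. Not taken from the slides.
    """
    r_persp = math.hypot(i, j)             # radius in the perspective image
    if r_persp == 0.0:
        return 0.0, 0.0                    # the optical centre maps to itself
    theta = math.atan2(r_persp, f_persp)   # angle of the incoming ray
    r_fish = f_fish * theta                # radius in the fisheye image
    scale = r_fish / r_persp
    return i * scale, j * scale
```

The result is generally fractional, which is why an interpolation step is needed, and neighbouring output pixels may land on scattered input positions, which is the irregular access pattern noted above.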

Algorithmic Flow (B)
Bicubic interpolation: uses a 4x4 window of pixels to approximate intermediate points
Bicubic interpolation is broken into horizontal and vertical 1D interpolations
[Figure: 4x4 interpolation window; Ci are the pixel values, s and t are the fractional offsets]
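
The separable formulation can be sketched as follows. The slides do not name the cubic kernel, so Catmull-Rom weights are assumed here; the names `cubic_weights` and `bicubic`, and the choice of s as the horizontal and t as the vertical offset, are also assumptions.

```python
def cubic_weights(t):
    """Catmull-Rom weights for the four samples around fractional offset t
    (0 <= t < 1). The kernel choice is an assumption, not from the slides."""
    return (
        0.5 * (-t**3 + 2 * t**2 - t),
        0.5 * (3 * t**3 - 5 * t**2 + 2),
        0.5 * (-3 * t**3 + 4 * t**2 + t),
        0.5 * (t**3 - t**2),
    )

def bicubic(window, s, t):
    """Interpolate inside a 4x4 window of pixel values C_i.

    window[r][c] holds the pixel values; s is the horizontal and t the
    vertical fractional offset, both measured from window[1][1].
    """
    ws = cubic_weights(s)
    # Horizontal pass: one 1D interpolation per row of the window.
    rows = [sum(w * c for w, c in zip(ws, row)) for row in window]
    # Vertical pass: one 1D interpolation over the four row results.
    wt = cubic_weights(t)
    return sum(w * r for w, r in zip(wt, rows))
```

Splitting the 2D interpolation into four horizontal passes and one vertical pass keeps every step a short dot product over four values, which suits four-wide SIMD units.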

Complete Algorithm
For each pixel (i, j) in the central perspective space {
    Apply inverse mapping to find fractional coordinates (x, y) in the wide-angle space
    Use bicubic interpolation to approximate the pixel value at (x, y)
}
Apply a 2D low-pass filter and downscale the output image to VGA resolution (640x480)
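
The final filtering step is not detailed in the slides. As one possible illustration, the sketch below low-pass filters and downscales in a single pass by averaging square blocks (a box filter); the function name `box_downscale`, the box kernel, and the default factor of 2 are all assumptions.

```python
def box_downscale(image, factor=2):
    """Low-pass filter and downscale a single-channel image in one step by
    averaging each factor x factor block (a box filter).

    image is a list of equal-length rows whose dimensions are multiples of
    factor. The box kernel and the factor are assumptions for illustration.
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```

With a factor of 2, a hypothetical 1280x960 intermediate image would come out at the 640x480 VGA resolution.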

Block Tiling
Partition the output image into blocks and correct one block of pixels at a time
Advantages of the block tiling technique:
  Maximizes data reuse
  Facilitates the exploitation of the thread-level parallelism of the algorithm
Drawback:
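
Partitioning the output image into blocks can be sketched as a generator of tile rectangles. The name `tiles` and the tile dimensions in the usage note are hypothetical; the slides do not give the tile size used.

```python
def tiles(width, height, tile_w, tile_h):
    """Yield (x0, y0, w, h) for each tile of a width x height output image.
    Edge tiles are clipped when the image size is not a multiple of the
    tile size."""
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            yield x0, y0, min(tile_w, width - x0), min(tile_h, height - y0)
```

For example, `list(tiles(640, 480, 64, 16))` covers a VGA frame with 300 tiles, each of which can be corrected independently of the others.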

Performance after Block Tiling
Neither processor is capable of real-time execution.
[Figure: performance chart. Frame rates shown: 0.55 fps and 2.20 fps. Percentage labels shown: 65 %, 48.9 %, 27 %, 12.7 %, 7 %, 38 %.]

Tile Size
Tile size and shape: very important parameters in explicitly blocked codes
Tile size must be large enough in order to:
  Maximize data reuse and increase the working set
  Minimize communication overhead
Tile size must be small due to:
  The limited capacity of the local store (LS)
  The curvature of input tiles
Cell BE imposes strict alignment requirements on DMA transfers
  Additional limitations on the size and shape
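
The interplay of these constraints can be illustrated with a rough feasibility check. This is only a sketch: it assumes the 256 KB SPE local store and 16-byte DMA granularity of the Cell BE, treats the input footprint as the output tile plus a fixed margin for the 4x4 window (ignoring the curvature of input tiles), and reserves an arbitrary half of the local store for tile data. The names `round_up` and `tile_fits` are hypothetical.

```python
LS_BYTES = 256 * 1024   # SPE local store capacity
DMA_ALIGN = 16          # bytes; row transfers are padded to a multiple of this

def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple

def tile_fits(tile_w, tile_h, bytes_per_pixel=3, margin=3,
              budget=LS_BYTES // 2):
    """Return (fits, bytes_needed) for one output tile plus its input tile.

    Assumptions: the input footprint is the output tile grown by `margin`
    pixels per dimension, and `budget` bytes of the local store are
    available for tile data (the rest holds code and other buffers).
    """
    out_row = round_up(tile_w * bytes_per_pixel, DMA_ALIGN)
    in_row = round_up((tile_w + margin) * bytes_per_pixel, DMA_ALIGN)
    needed = out_row * tile_h + in_row * (tile_h + margin)
    return needed <= budget, needed
```

Under these assumptions a 64x16 RGB tile needs about 7 KB, while a whole 640x480 frame exceeds the budget many times over, which is why tiling is required in the first place.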

Thread Level Parallelism
Exploit thread-level parallelism
  Tiles of the output image are independent
Offload the most time-consuming kernels to the SPEs
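
Because the tiles are independent, they can be handed to a pool of workers with no synchronization between them. The sketch below shows only the structure, using Python threads as a loose stand-in for SPE offloading; `correct_tile` is a hypothetical placeholder for the real per-tile kernel and does no image processing.

```python
from concurrent.futures import ThreadPoolExecutor

def correct_tile(tile):
    """Stand-in for the per-tile kernel (inverse mapping + interpolation).
    Here it only returns the tile's pixel count."""
    x0, y0, w, h = tile
    return w * h

def correct_frame(tile_list, workers=8):
    """Process independent tiles on a pool of workers, loosely analogous
    to offloading tiles to SPEs. Results come back in tile order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(correct_tile, tile_list))
```

Python threads do not speed up CPU-bound work the way separate SPEs do, so this illustrates the decomposition rather than the performance.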

Function Offloading
[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 15.82 fps.]

Vectorization
Utilize the SIMD computation capabilities of the SPEs
Accelerate computations by:
  Clustering four FP operands in a vector (4x implicit loop unrolling)
As an additional positive effect, the branch misprediction penalty is reduced
  Backward branches in loops are predicted as not taken
  20 cycles misprediction penalty per branch eliminated
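
The implicit 4x unrolling can be illustrated with a plain-Python emulation in which each loop iteration handles four values, the way one 128-bit vector instruction handles four single-precision operands. The function names and the scaling operation are hypothetical; real SPE code would use vector intrinsics instead.

```python
LANES = 4  # four single-precision values per 128-bit vector

def scalar_scale(values, gain):
    """Scalar loop: one element and one loop branch per iteration."""
    out = []
    for v in values:
        out.append(v * gain)
    return out

def vector_scale(values, gain):
    """Emulated SIMD loop: each iteration handles LANES elements, so the
    loop branch executes a quarter as often (implicit 4x unrolling).
    len(values) must be a multiple of LANES."""
    out = []
    for k in range(0, len(values), LANES):
        a, b, c, d = values[k:k + LANES]                       # one vector load
        out.extend((a * gain, b * gain, c * gain, d * gain))   # one vector multiply
    return out
```

Both functions return the same result; the vector form simply executes a quarter of the loop branches, which is where the reduced misprediction cost comes from.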

Vectorization
[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 15.82 fps; 10.75 fps first appears on this slide.]

Color Loop Unrolling
The frames are in the (R, G, B) color space
Each doubly-nested loop contains an additional loop for the color components
Explicit 3x unrolling:
  Furthers the positive effects of branch elimination
  Increases the potential for efficient scheduling
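
A minimal sketch of the transformation, using a hypothetical weighted blend of two RGB pixels as the loop body (the actual kernels are not shown in the slides):

```python
def blend_rolled(p, q, w):
    """Weighted blend of two RGB pixels with an inner loop over colours."""
    out = [0.0, 0.0, 0.0]
    for c in range(3):
        out[c] = w * p[c] + (1.0 - w) * q[c]
    return out

def blend_unrolled(p, q, w):
    """Same computation with the colour loop unrolled 3x: no inner loop
    branch, and three independent statements available for scheduling."""
    r = w * p[0] + (1.0 - w) * q[0]
    g = w * p[1] + (1.0 - w) * q[1]
    b = w * p[2] + (1.0 - w) * q[2]
    return [r, g, b]
```

The three unrolled statements have no dependencies on one another, so a compiler or programmer is free to interleave their loads and arithmetic.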

Color Loop Unrolling
[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.82 fps; 14.28 fps first appears on this slide.]

Unaligned Loads
Unaligned memory accesses due to the formation of the 4x4 window in bicubic interpolation
Pipeline stalls due to vector loads
[Figure: layout of the 16 window elements (1-16) across vector registers r1-r4]
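
SPE vector loads fetch aligned 16-byte quadwords, so a window that starts at an arbitrary pixel generally straddles two of them. A common way to handle this, emulated below with Python lists, is to issue two aligned loads and combine them with a shuffle. This is a sketch of the general technique, not necessarily the authors' exact solution, and the function names are hypothetical.

```python
LANES = 4  # elements per 16-byte vector of 32-bit values

def aligned_load(mem, index):
    """Load one vector from an aligned position, as SPE quadword loads do."""
    assert index % LANES == 0, "vector loads must be aligned"
    return mem[index:index + LANES]

def unaligned_load(mem, index):
    """Build the vector starting at an arbitrary index from two aligned
    loads followed by a shuffle (emulated here by slicing)."""
    offset = index % LANES
    base = index - offset
    lo = aligned_load(mem, base)
    if offset == 0:
        return lo                          # already aligned, one load suffices
    hi = aligned_load(mem, base + LANES)
    return (lo + hi)[offset:offset + LANES]
```

The extra load and shuffle per window row are the kind of overhead that shows up as pipeline stalls when the loads are not overlapped with computation.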

Unaligned Loads
[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.38 fps, 15.82 fps; 15.38 fps first appears on this slide.]

Manual Instruction Scheduling
The compiler proved too conservative in rescheduling independent instructions
Manually interleaved vector load instructions with computational operations
  Reduced the remaining pipeline stalls
Manual scheduling is facilitated by the loop unrolling
The usage of the dual-issue pipeline increased from 22.6 % to 34.6 %

Manual Instruction Scheduling
[Figure: frame-rate chart. Values shown: 0.55 fps, 1.19 fps, 10.75 fps, 14.28 fps, 15.38 fps, 15.82 fps, 20 fps; 20 fps first appears on this slide.]

Inverse Mapping Amortization
The inverse mapping kernel has to be executed only when the Field-of-View or Region-of-Interest changes
  These parameters change infrequently in a typical usage scenario
We evaluated the option of executing this kernel on the PPE (using the VMX/AltiVec extensions)
  The coordinates are stored in main memory
  Each SPE fetches the appropriate coordinates
The execution time of the algorithm decreased to:
  0.045 sec./frame when 6 SPEs are used
  0.033 sec./frame when 8 SPEs are used
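
The amortization idea can be sketched as a small cache that recomputes the coordinate table only when the view parameters change. The class `MappingCache`, the `compute_table` callable, and the representation of the Field-of-View and Region-of-Interest as hashable values are all hypothetical; the helper at the end converts the reported per-frame times into frame rates.

```python
class MappingCache:
    """Recompute the inverse-mapping table only when the view changes."""

    def __init__(self, compute_table):
        self._compute = compute_table   # hypothetical callable(fov, roi)
        self._key = None
        self._table = None
        self.recomputations = 0

    def get(self, fov, roi):
        key = (fov, roi)
        if key != self._key:            # Field-of-View or Region-of-Interest changed
            self._table = self._compute(fov, roi)
            self._key = key
            self.recomputations += 1
        return self._table

def frames_per_second(seconds_per_frame):
    """Convert a per-frame execution time into a frame rate."""
    return 1.0 / seconds_per_frame
```

Applied to the figures above, 0.045 sec./frame corresponds to roughly 22.2 fps with 6 SPEs, and 0.033 sec./frame to roughly 30.3 fps with 8 SPEs, which reaches the 25-30 fps real-time target.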

Conclusions
Outlined and evaluated the various optimizations needed to achieve real-time wide-angle lens distortion correction on the Cell BE
  Most optimizations are applicable to many stencil computation algorithms
  Counter-intuitive optimizations are highly unlikely to be made automatically
Commercially available general-purpose multi-cores are not capable of handling real-time distortion correction
More mature compiler technology is needed