Slide1
JPEG-GPU: a GPGPU Implementation of JPEG Core Coding Systems
Ang Li
University of Wisconsin-Madison
Slide2
Outline
Brief Introduction to the Background
Implementation
Evaluation
Conclusion
3/20/2013
NVIDIA GTC 2013
Slide3
Background
JPEG Encoding
Parallelism Seeking
Pre-processing: Color Conversion
Block Encoding/Decoding
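Color conversion is attractive for parallelization because each output pixel depends only on the matching input pixel. The sketch below illustrates that property using the standard JFIF (full-range BT.601) RGB-to-YCbCr equations. The name rgb_to_ycc and the floating-point arithmetic are assumptions made for illustration; libjpeg's rgb_ycc_convert is organized differently. The OpenMP pragma is simply ignored when the file is compiled without OpenMP support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round to nearest and clamp to the 0..255 range of an 8-bit sample. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f)   return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Hypothetical helper: convert n_pixels interleaved RGB triples into
 * interleaved YCbCr triples. No iteration reads or writes data that
 * another iteration touches, so the loop can be split across threads. */
void rgb_to_ycc(const uint8_t *rgb, uint8_t *ycc, size_t n_pixels)
{
    assert(rgb != NULL && ycc != NULL);

    long i; /* signed loop index, accepted by every OpenMP version */
    #pragma omp parallel for
    for (i = 0; i < (long)n_pixels; i++) {
        float r = rgb[3 * i + 0];
        float g = rgb[3 * i + 1];
        float b = rgb[3 * i + 2];

        ycc[3 * i + 0] = clamp_u8( 0.29900f * r + 0.58700f * g + 0.11400f * b);
        ycc[3 * i + 1] = clamp_u8(-0.16874f * r - 0.33126f * g + 0.50000f * b + 128.0f);
        ycc[3 * i + 2] = clamp_u8( 0.50000f * r - 0.41869f * g - 0.08131f * b + 128.0f);
    }
}
```

Calling rgb_to_ycc(in, out, width * height) on a packed RGB buffer converts the whole image in one pass; the same per-pixel body is what a GPU kernel would execute once per thread.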
Slide4
Implementation
Step 1 – Find target functions
Encode: encode_mcu_huff, encode_one_block, emit_bits_s
Decode: decode_mcu_DC_first, decode_mcu_DC_refine
Profiling to find other functions
Using GPROF
Encode: rgb_ycc_convert
Decode: ycc_rgb_convert
Both take a little under half of the total execution time of encoding/decoding
Slide5
Implementation – Cont’d
Step 2 – Parallelize with CUDA
First, implement in OpenMP to make sure the code is understood correctly
E.g., in encode_one_block, emit_bits_s changes the state of the system => running it in multiple threads will lead to incorrect results!
Second, make a baseline GPGPU implementation of all critical functions
Third, optimize the GPGPU implementations
Using constant memory
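The emit_bits_s problem comes from shared output state: where each call writes depends on how many bits every earlier call produced. One common way to work around such a dependency, shown here only as a sketch and not necessarily what was done in this work, is to split the loop into an independent phase that computes (code, length) pairs and a sequential phase that packs them in order. The names bit_writer, emit_bits, bit_length and encode_two_phase, as well as the toy code assignment, are hypothetical and do not come from libjpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the encoder's output state. */
typedef struct {
    uint8_t *out;      /* caller-owned output buffer, zero-initialized */
    size_t   cap;      /* capacity of out in bytes */
    size_t   bit_pos;  /* number of bits written so far */
} bit_writer;

/* Append the low `len` bits of `code`, most significant bit first.
 * Returns 0 when the buffer is too small, 1 on success. */
static int emit_bits(bit_writer *w, uint32_t code, int len)
{
    assert(len >= 0 && len <= 32);
    if (w->bit_pos + (size_t)len > w->cap * 8)
        return 0;
    for (int b = len - 1; b >= 0; b--) {
        if ((code >> b) & 1u)
            w->out[w->bit_pos / 8] |= (uint8_t)(0x80u >> (w->bit_pos % 8));
        w->bit_pos++;
    }
    return 1;
}

/* Toy code assignment (an assumption, not a Huffman table):
 * a value is sent using the minimum number of bits that holds it. */
static int bit_length(uint32_t v)
{
    int n = 1;
    while ((v >>= 1) != 0)
        n++;
    return n;
}

int encode_two_phase(const uint32_t *vals, size_t n,
                     uint32_t *codes, int *lens, bit_writer *w)
{
    long i;
    /* Phase 1: no shared state, so iterations may run in parallel. */
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        codes[i] = vals[i];
        lens[i]  = bit_length(vals[i]);
    }
    /* Phase 2: packing mutates w, so it stays sequential and in order. */
    for (size_t k = 0; k < n; k++)
        if (!emit_bits(w, codes[k], lens[k]))
            return 0;
    return 1;
}
```

Only phase 1 benefits from extra threads; if the packing phase dominates the run time, this restructuring gains little.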
for (k = 1; k <= Se; k++) {
    …
    if (!emit_bits_s(…))
        return FALSE;
    …
    if (!emit_bits_s(…))
        return FALSE;
    …
    if (!emit_bits_s(…))
        return FALSE;
    …
}
Slide6
Evaluation
Evaluation Environment
CPU: Intel Nehalem Xeon E5520 2.26GHz processor
GPU: Tesla K20c
Picture used
My favorite picture
Compressing: 1280 x 768 pixels
Decompressing: the products after compressing
Correctness checked by "diff"
Slide7
Evaluation – Cont’d
            Sequential  OpenMP  GPGPU Base  GPGPU Optimized
Compress    2.886       2.648   14.700      22.412
Decompress  2.420       2.200   14.616      21.507
Timings are in milliseconds, averaged over 10 executions
Four threads are forked for the OpenMP implementation
For both GPU implementations, configurations are tuned for the best performance
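The timings can be compared by dividing the sequential time by each variant's time, so that a value above 1 is a speedup and a value below 1 is a slowdown. The helper names speedup and print_row below are hypothetical; the numbers passed in would be the measured values from the table.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup relative to the sequential baseline:
 * > 1 means faster than sequential, < 1 means slower. */
double speedup(double sequential_ms, double variant_ms)
{
    assert(variant_ms > 0.0);
    return sequential_ms / variant_ms;
}

/* Print time and speedup for one row of the timing table.
 * ms[0] must be the sequential time. */
void print_row(const char *name, const double ms[4])
{
    static const char *variant[4] = {
        "Sequential", "OpenMP", "GPGPU Base", "GPGPU Optimized"
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-15s %7.3f ms  %.2fx\n",
               name, variant[i], ms[i], speedup(ms[0], ms[i]));
}
```

For the Compress row (2.886, 2.648, 14.700, 22.412) this gives roughly 1.09x for OpenMP, 0.20x for the GPGPU baseline and 0.13x for the optimized GPGPU version; for Decompress the figures are roughly 1.10x, 0.17x and 0.11x.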
Results discussion
OpenMP is fastest. GPGPU basically degrades the performance, while the "optimized" version degrades it more (due to serialized constant memory accesses).
Observations after hacking the code:
Each kernel launch deals with at most 250 elements, which is too fine-grained.
Kernel launch is expensive (allocating & copying the data)
Using OpenMP is always going to be better off as long as there is enough parallelism & the loop iterations are not extremely trivial.
Slide8
Conclusion
For the JPEG encoding/decoding core system, GPGPU basically degrades the performance.
Coarser-grained parallelism is required.
OpenMP acceleration can be easily applied to gain some performance.
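The need for coarser granularity can be seen with a toy cost model: if every kernel launch pays a fixed overhead for allocation and copying, the total overhead grows with the number of launches, not with the amount of useful work. The functions launches_needed and modeled_time_us and all their parameters are hypothetical; they are not measurements from this evaluation.

```c
#include <assert.h>
#include <stddef.h>

/* Number of launches needed to cover n_elems when each launch
 * handles at most `batch` elements (ceiling division). */
size_t launches_needed(size_t n_elems, size_t batch)
{
    assert(batch > 0);
    return (n_elems + batch - 1) / batch;
}

/* Toy cost model (an assumption, not a measurement):
 * total = launches * fixed per-launch overhead
 *       + elements * per-element compute cost. */
double modeled_time_us(size_t n_elems, size_t batch,
                       double launch_overhead_us, double per_elem_us)
{
    return (double)launches_needed(n_elems, batch) * launch_overhead_us
         + (double)n_elems * per_elem_us;
}
```

As an illustration only: if the 983,040 pixels of a 1280 x 768 image were handled 250 at a time, launches_needed would report 3,933 launches, whereas a single coarse-grained launch pays the fixed overhead once.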
Slide9
Thank you.
Ang Li <ali28@wisc.edu>