JPEG-GPU: a GPGPU Implementation of JPEG Core Coding System

Uploaded On 2016-03-16



Presentation Transcript

Slide1

JPEG-GPU: a GPGPU Implementation of JPEG Core Coding Systems

Ang Li

University of Wisconsin-Madison

Slide2

Outline

Brief Introduction to the Background

Implementation

Evaluation

Conclusion

3/20/2013

NVIDIA GTC 2013

Slide3

Background

JPEG Encoding

Seeking Parallelism

Pre-processing: Color Conversion

Block Encoding/Decoding

Slide4

Implementation

Step 1 – Find target functions

Encode: encode_mcu_huff, encode_one_block, emit_bits_s

Decode: decode_mcu_DC_first, decode_mcu_DC_refine

Profiling with gprof to find other functions

Encode: rgb_ycc_convert

Decode: ycc_rgb_convert

Each accounts for a little under half of the total execution time of encoding/decoding
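Color conversion is a natural candidate for parallelization because each output pixel depends only on the matching input pixel. The sketch below illustrates this in C with an OpenMP pragma. It is not libjpeg's rgb_ycc_convert (which, as far as I know, uses fixed-point lookup tables); the names rgb_to_ycc and clamp_u8 are hypothetical, and the coefficients are the standard JFIF full-range RGB to YCbCr equations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round to nearest and clamp to the 0..255 range. */
static uint8_t clamp_u8(double v) {
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Convert n interleaved RGB pixels to interleaved YCbCr.
 * Hypothetical helper for illustration; not libjpeg's implementation. */
void rgb_to_ycc(const uint8_t *rgb, uint8_t *ycc, size_t n) {
    assert(rgb != NULL && ycc != NULL);
    long i;
    /* Iterations are independent, so they can be split across threads.
     * Without OpenMP enabled, the pragma is ignored and the loop runs
     * sequentially with the same result. */
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        double r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        ycc[3 * i]     = clamp_u8( 0.299 * r    + 0.587 * g    + 0.114 * b);
        ycc[3 * i + 1] = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5 * b + 128.0);
        ycc[3 * i + 2] = clamp_u8( 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0);
    }
}
```

Since no iteration reads anything another iteration writes, the loop needs no locks or reductions, which is what makes this stage easy to parallelize.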

Slide5

Implementation – Cont’d

Step 2 – Parallelize with CUDA

First, implement in OpenMP to confirm that the code is understood correctly

E.g., in encode_one_block, emit_bits_s changes the state of the system => running it in multiple threads in parallel will lead to incorrect results!

Second, make a baseline GPGPU implementation of all critical functions

Third, optimize the GPGPU implementations

Using constant memory
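The state dependency noted for emit_bits_s can be illustrated with a minimal bit writer. This is a simplified sketch, not libjpeg's code: the bit_writer type and emit_bits function are hypothetical, and JPEG details such as byte stuffing after 0xFF are left out. The point is only that every call reads and updates shared state (the pending bits and the output position), so the output depends on call order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified bit writer (no JPEG byte stuffing). */
typedef struct {
    uint32_t acc;    /* pending bits, right-aligned */
    int      nbits;  /* number of pending bits in acc (0..7 between calls) */
    uint8_t *out;    /* output buffer, caller-provided */
    size_t   len;    /* bytes written so far */
} bit_writer;

/* Append the low `size` bits of `code` to the stream. */
void emit_bits(bit_writer *w, uint32_t code, int size) {
    assert(size >= 1 && size <= 16);
    /* The new bits are placed after whatever earlier calls left pending. */
    w->acc = (w->acc << size) | (code & ((1u << size) - 1u));
    w->nbits += size;
    /* Flush every complete byte, most significant bits first. */
    while (w->nbits >= 8) {
        w->out[w->len++] = (uint8_t)(w->acc >> (w->nbits - 8));
        w->nbits -= 8;
    }
    /* Keep only the bits that are still pending. */
    w->acc &= (1u << w->nbits) - 1u;
}
```

Starting from an empty writer, emitting (5, 3), (3, 2), (1, 3) produces the byte 0xB9, while the same three calls in reverse order produce 0x3D. Running such calls from several threads without coordination would interleave them unpredictably, which is why a naive parallel loop over emit_bits_s gives wrong output.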


for (k = 1; k <= Se; k++) {
    if (!emit_bits_s(…))
        return FALSE;
    if (!emit_bits_s(…))
        return FALSE;
    if (!emit_bits_s(…))
        return FALSE;
}

Slide6

Evaluation

Evaluation Environment

CPU: Intel Nehalem Xeon E5520 2.26GHz processor

GPU: Tesla K20c

Picture used

My favorite picture

Compressing: 1280 x 768 pixels

Decompressing: the output of the compression step

Correctness checked by "diff"

Slide7

Evaluation – Cont’d

             Sequential   OpenMP   GPGPU Base   GPGPU Optimized
Compress     2.886        2.648    14.700       22.412
Decompress   2.420        2.200    14.616       21.507


Timings are in milliseconds, averaged over 10 executions

Four threads are forked for the OpenMP implementation

For both GPU implementations, configurations are tuned for best performance

Results discussion

OpenMP is fastest. The baseline GPGPU version degrades performance, and the "optimized" version degrades it further (due to serialized constant memory accesses).
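To put the timings in relative terms, the ratios to the sequential run can be computed directly from the reported numbers. The sketch below is only arithmetic on the slide's table; the array layout and the function name relative_time are hypothetical.

```c
#include <assert.h>

/* Reported timings in milliseconds, in the order:
 * Sequential, OpenMP, GPGPU Base, GPGPU Optimized. */
const double compress_ms[4]   = {2.886, 2.648, 14.700, 22.412};
const double decompress_ms[4] = {2.420, 2.200, 14.616, 21.507};

/* Time of a variant relative to the sequential time.
 * Below 1 means faster than sequential, above 1 means slower. */
double relative_time(double sequential_ms, double variant_ms) {
    assert(sequential_ms > 0.0 && variant_ms > 0.0);
    return variant_ms / sequential_ms;
}
```

For compression this gives roughly 0.92 for OpenMP (about 8% faster), about 5.1 for the baseline GPGPU version, and about 7.8 for the optimized one. For decompression the corresponding ratios are roughly 0.91, 6.0, and 8.9.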

Observations after hacking the code:

Each kernel launch deals with at most 250 elements, which is too fine-grained.

Kernel launch is expensive (allocating & copying the data)

Using OpenMP is always going to be the better choice as long as there is enough parallelism & loop iterations are not extremely trivial.

Slide8

Conclusion

For the JPEG core encoding/decoding system, the GPGPU implementation degrades performance.

Coarser-grained parallelism is required.

OpenMP acceleration can be easily applied to gain some performance.
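The call for coarser-grained parallelism can be made concrete with a small cost model: a fixed overhead per kernel launch plus work proportional to the number of elements. This is a sketch under stated assumptions, not a measurement. The function names and both cost parameters are hypothetical, and real overheads would need to be profiled.

```c
#include <assert.h>

/* Number of launches needed to cover n elements when each launch
 * handles at most per_launch elements (ceiling division). */
long launches_needed(long n, long per_launch) {
    assert(n >= 0 && per_launch > 0);
    return (n + per_launch - 1) / per_launch;
}

/* Modeled total time: a fixed overhead for every launch plus the
 * per-element work. Both cost parameters are hypothetical inputs. */
double modeled_time(long n, long per_launch,
                    double overhead_per_launch, double work_per_element) {
    return (double)launches_needed(n, per_launch) * overhead_per_launch
         + (double)n * work_per_element;
}
```

As a rough illustration, the 1280 x 768 test picture has 983,040 pixels. If each element were one pixel (an assumption; the slides do not say what an element is), at most 250 elements per launch would mean at least 3,933 launches, so the fixed overhead is paid thousands of times. Larger batches per launch reduce that term directly.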

Slide9

Thank you.

Ang Li <ali28@wisc.edu>