CULZSS: LZSS Lossless Data Compression on CUDA





Presentation Transcript

1. CULZSS: LZSS Lossless Data Compression on CUDA
Adnan Ozsoy & Martin Swany
DAMSL - Distributed and MetaSystems Lab
Department of Computer and Information Sciences
University of Delaware
September 2011

2. OUTLINE
- Introduction
- Background
- Implementation Details
- Performance Analysis
- Discussion and Limitations
- Related Work
- Future Work
- Conclusion
- Q & A

3. INTRODUCTION
- Compression reduces the use of expensive resources: memory, bandwidth, CPU
- Trade-off: compression increases running time
- GPUs can absorb that cost
- CULZSS: general-purpose data compression on the GPU
  - Reduces the added time compared to a CPU implementation
  - Keeps the compression ratio

4. BACKGROUND - LZSS Algorithm
- A variant of LZ77 (dictionary encoding)
- Sliding-window search buffer
- Uncoded lookahead buffer
[diagram of the algorithm flow - not preserved in transcript]

5. BACKGROUND - LZSS Algorithm
Example: [worked example figure - not preserved in transcript]

6. BACKGROUND - GPU Architecture
- Fermi architecture: up to 512 CUDA cores, 16 SMs of 32 cores each
- Hierarchy of memory types:
  - Global, constant, and texture memory are accessible by all threads
  - Shared memory only by threads in the same block
  - Registers and local memory only by the thread that owns them

7. IMPLEMENTATION DETAILS
- LZSS CPU implementation
- The CUDA implementation has two versions
- API:
  gpu_compress(*buffer, buf_length, **compressed_buffer, &comp_length, compression_parameters);
  gpu_decompress(*buffer, buf_length, **decompressed_buffer, &decomp_length, compression_parameters);
- Version 1 / Version 2
- Decompression
- Optimizations

8. IMPLEMENTATION DETAILS - LZSS CPU Implementation
- Straightforward implementation of the algorithm, mainly adapted from Dipperstein's work
- To be fair to the CPU, a threaded version is implemented with POSIX threads
- Each thread is given a chunk of the file; chunks are compressed concurrently and reassembled

9. IMPLEMENTATION DETAILS - CUDA Implementation, Version 1
- Similar to the Pthreads implementation
- Chunks of the file are distributed among blocks; within each block, smaller chunks are assigned to threads

10. IMPLEMENTATION DETAILS - CUDA Implementation, Version 2
- Exploits the algorithm's SIMD nature
- The work distributed among the threads of a block is the matching phase of the compression for a single chunk

11. IMPLEMENTATION DETAILS - Version 2
- Matching phase
- CPU steps
[step-by-step figure - not preserved in transcript]

12. IMPLEMENTATION DETAILS - Decompression
- Identical in both versions
- Each character is read, decoded, and written to the output
- Blocks behave independently of one another
- A list of compressed block sizes must be kept
  - The length of the list depends on the number of blocks
  - Negligible impact on compression ratio

13. IMPLEMENTATION DETAILS - Optimizations
- Coalesced access to global memory
  - Accesses that fit into one segment are served by a single memory transaction
  - Global memory is accessed before and after the matching stage
  - In Version 1, each thread reads/writes a buffer's worth of data
  - In Version 2, each thread reads/writes 1 byte of memory
  - Tests therefore give the best performance with a 128-byte buffer size and 128 threads per block, which yields one memory transaction on Fermi architectures

14. IMPLEMENTATION DETAILS - Optimizations (cont.)
- Shared memory usage
  - Shared memory is divided into banks, each of which can serve only one request at a time; requests that fall in different banks are all satisfied in parallel
  - Using shared memory for the sliding window in Version 1 gave a 30% speedup
  - Version 2's access pattern allows bank-conflict-free access, with one thread accessing one bank
  - Speedups are shown in the analysis

15. IMPLEMENTATION DETAILS - Optimizations (cont.)
- Configuration parameters
  - Threads per block: V1 is limited by shared memory
  - Buffer size: V2 consumes double the buffer size of a single buffer, which limits encoding offsets to 16-bit space
  - Chosen configuration: 128-byte buffers, 128 threads per block

16. PERFORMANCE ANALYSIS - Testbed Configuration
- GeForce GTX 480, CUDA version 3.2
- Intel® Core™ i7 CPU 920 at 2.67 GHz

17. PERFORMANCE ANALYSIS - Datasets
Five sets of data:
- Collection of C files - text-based input
- Delaware State Digital Raster Graphics and Digital Line Graphs Server - basemaps for georeferencing and visual analysis
- English dictionary - alphabetically ordered text
- Linux kernel tarball
- Highly compressible custom data set - repeating characters in substrings of 20

18. PERFORMANCE ANALYSIS
[results figure - not preserved in transcript]

19. PERFORMANCE ANALYSIS
[results figure - not preserved in transcript]

20. DISCUSSION AND LIMITATIONS
- Shared memory limits: Version 1 needs more shared memory when using more than 128 threads
- LZSS is not inherently parallel
  - Some parts are left to the CPU
  - Opportunity for overlap, for up to double the performance
- Main goal achieved: better performance than the CPU-based implementation
- BZIP2 is used for comparison as a well-known application

21. DISCUSSION AND LIMITATIONS
- Version 1 gives better performance on highly compressible data: it can skip matched data
- Version 2 is better on the other data sets: it cannot skip matched data, but better GPU utilization (coalesced accesses and avoiding bank conflicts) leads to better performance
- Two versions, plus the option to choose between them in the API, let users pick the best-matching implementation

22. RELATED WORK
- CPU based: parallelizing with threads
- GPU based:
  - Lossy data compression has been widely explored by the GPU community: image/video processing, texture compression, etc.
  - Lossless data compression:
    - O'Neil et al., GFC - specialized for double-precision floating-point data compression
    - Balevic proposes a data-parallel algorithm for length encoding

23. FUTURE WORK
- More detailed tuning of the configuration API
- Combined CPU + GPU heterogeneous implementation, to benefit from upcoming architectures that put both on the same die: AMD Fusion, Intel Nehalem, …
- Multi-GPU usage
- Overlapping CPU and GPU computation in a pipelined fashion

24. CONCLUSION
- CULZSS shows a promising use of GPUs for lossless data compression
- To the best of our knowledge, the first successful improvement attempt at lossless compression of general data on GPUs
- Outperformed serial LZSS by up to 18x and Pthreads LZSS by up to 3x in compression time
- Compression ratios kept very similar to the CPU version

25. Q & A