An Efficient GPU Implementation of a Tree-based
Author : myesha-ticknor | Published Date : 2025-05-10
Transcript: An Efficient GPU Implementation of a Tree-based n-Body Algorithm
An Efficient GPU Implementation of a Tree-based n-Body Algorithm
Martin Burtscher, Department of Computer Science

High-end CPU-GPU Comparison

                          Intel Xeon E5-2687W    Nvidia Kepler GTX 680
  Cores                   8 (superscalar)        1536 (simple)
  Active threads          2 per core             ~11 per core
  Frequency               3.1 GHz                1.0 GHz
  Peak performance (SP)   397 GFlop/s            3090 GFlop/s
  Peak mem. bandwidth     51 GB/s                192 GB/s
  Maximum power           150 W                  195 W*
  Price                   $1900                  $500*
  Release date            March 2012             March 2012
  *entire card

GPU Advantages

Performance: 8x as many operations executed per second
Main memory bandwidth: 4x as many bytes transferred per second
Cost-, energy-, and size-efficiency: 29x as much performance per dollar, 6x as much performance per watt, 11x as much performance per area
(based on peak values)

GPU Disadvantages

Clearly, we should use GPUs all the time. So why aren't we?
GPUs are harder to program and tune than CPUs; it is easy to make performance mistakes.
GPUs can only execute some types of code fast: they need lots of data parallelism, data reuse, and regularity.
Mostly regular codes have been ported to GPUs, e.g., matrix codes executing many ops/word: dense matrix operations, stencil codes (PDEs).
Our goal is to also handle irregular codes well.

Outline

Introduction
GPU programming
Barnes-Hut algorithm
CUDA implementation
Experimental results
Conclusions

CUDA Programming Model

Non-graphics programming: uses the GPU as a massively parallel co-processor; thousands of threads are needed for full efficiency.
C/C++ with extensions:
  Kernel launch: calling functions on the GPU
  Memory management: GPU memory allocation, copying data to/from the GPU
  Declaration qualifiers: device, shared, local, etc.
  Special instructions: barriers, fences, etc.
  Keywords: threadIdx, blockIdx

Calling GPU Kernels

Kernels are functions that run on the GPU. Callable by
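The CUDA Programming Model bullets above (kernel launch, memory management, declaration qualifiers, and the threadIdx/blockIdx keywords) can be illustrated with a minimal sketch; the kernel name, array size, and launch configuration below are illustrative choices, not taken from the slides:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Declaration qualifier: __global__ marks a kernel, a function that is
// launched from the host and runs on the GPU across many threads.
__global__ void scale(float *x, float a, int n) {
    // Keywords: each thread derives its global index from threadIdx/blockIdx.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // guard: the grid may have more threads than elements
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    // Memory management: allocate GPU memory and copy data host -> device.
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: <<<blocks, threads-per-block>>> sets the grid shape.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, 2.0f, n);

    // Copy the result device -> host and release the GPU allocation.
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);
    free(h);
    return 0;
}
```

Note the oversubscription the comparison slide hints at: the launch creates far more threads than the GPU has cores, which is exactly the "thousands of threads needed for full efficiency" point.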