A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications

Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc
Outline
- Motivation
- GPUPerf: Performance analysis framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
- Evaluations
- Conclusion
GPGPU Programming
- GPGPU architectures have become very powerful.
- Programmers want to convert CPU applications to GPGPU applications.
  - Case 1: 10x speed-up
  - Case 2: 1.1x speed-up
- For case 1, programmers might wonder whether 10x is the best achievable speed-up.
- For case 2, programmers might wonder why the benefit is so poor:
  - Maybe the algorithm is not parallelizable.
  - Maybe the GPGPU code is not well optimized.
- Programmers want to optimize code whenever possible!
GPGPU Optimizations
- Optimizing parallel programs is difficult^100!
- Most programmers apply optimization techniques one by one.
- Example: try one more optimization with shared memory. Which one to choose?
- (Chart: one optimization combination is the best for this kernel, and it remains the best after adding shared memory.)
- Programmers want to understand the performance benefit!
GPGPU Performance Guidance
- Providing performance guidance is not easy:
  - Program analysis: obtain as much program information as possible.
  - Performance modeling: have a sophisticated analytical model.
  - User-friendly metrics: convert the performance analysis information into performance guidance.
- We propose GPUPerf, a performance analysis framework that quantitatively predicts potential performance benefits.
- In this talk, we will focus on performance modeling and the potential benefit metrics.
Outline
- Motivation
- GPUPerf: Performance Analysis Framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
- Evaluations
- Conclusion
GPUPerf Overview
- What is required for performance guidance? Program analysis, performance modeling, and user-friendly metrics.
- GPUPerf pipeline: GPGPU kernel → Frontend Data Collector (ILP, #insts, ...) → Analytical Model (model output) → Performance Advisor (benefit metrics).
- For clarity, each component will be explained in reverse order.
Performance Advisor
- Goals of the performance advisor:
  - Convey performance bottleneck information.
  - Estimate the potential gains from reducing the bottlenecks.
- The performance advisor provides four potential benefit metrics:
  - Bitilp: benefit of increasing inter-thread ILP (ITILP)
  - Bmemlp: benefit of increasing memory-level parallelism (MLP)
  - Bserial: benefit of removing serialization effects
  - Bfp: benefit of improving computing inefficiency
- Programmers can get an idea of the potential benefit of a GPGPU kernel.
- The benefit metrics are provided by our analytical model.
Previous Work: MWP-CWP Model [Hong and Kim, ISCA'09]
- MWP (memory warp parallelism): an indicator of memory-level parallelism.
- CWP (compute warp parallelism): an indicator of how much computation is available per memory period.
- Depending on MWP and CWP, the model predicts the execution time.
- The MWP-CWP model can predict general cases. Problem: it did not model corner cases, which are critical for predicting the benefits of different program optimizations.
- (Timing diagram: 8 warps; four overlapping memory periods give MWP=4, and three compute periods per memory period give CWP=3.)
Analytical Model
- Our analytical model follows a top-down approach:
  - The model components are easy to interpret.
  - They relate directly to performance bottlenecks.
- Components:
  - Texec: final execution time
  - Tcomp: computation time
  - Tmem: memory time
  - Toverlap: overlapped time
- Texec = Tcomp + Tmem - Toverlap
- (Timing diagram: 4 warps, MWP=2; compute and memory periods overlap, so the overlapped portion is subtracted once.)
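The top-level relation above can be sketched directly; the cycle counts here are toy numbers for illustration, not values from the talk.

```python
def exec_time(t_comp, t_mem, t_overlap):
    """Top-level GPUPerf relation: total time is compute cost plus
    memory cost, minus the portion of the two that overlaps."""
    return t_comp + t_mem - t_overlap

# Hypothetical cycle counts: part of the 600-cycle memory cost
# hides behind the 400-cycle compute cost.
print(exec_time(t_comp=400, t_mem=600, t_overlap=300))  # -> 700
```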
Analytical Model: Tcomp
- Tcomp is the amount of time to execute compute instructions.
- Wparallel: work executed in parallel (useful work).
- Wserial: overhead due to serialization effects.
- Tcomp = Wparallel + Wserial
Analytical Model: Wparallel
- Wparallel is the amount of work that can be executed in parallel.
- Wparallel = total insts × effective inst. throughput
- Effective inst. throughput = f(warp_size, SIMD_width, #pipeline stages)
- ITILP represents the number of instructions that can be executed in parallel in the pipeline.
Analytical Model: ITILP (Inter-thread ILP)
- ITILP is inter-thread ILP: the ILP available across all concurrently scheduled warps.
- ITILP = MIN(ILP × N, ITILP_max), where N is the TLP (number of warps).
- ITILP_max = avg_inst_lat / (warp_size / SIMD_width)
- As TLP increases, stall time is reduced.
- (Timing diagram: with per-warp ILP = 4/3, TLP=1 gives ITILP=4/3 with stalls, TLP=2 gives ITILP=8/3, and at TLP=3 ITILP reaches ITILP_max: execution latency is already fully hidden.)
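The ITILP cap can be sketched with the slide's numbers. Exact fractions avoid floating-point noise; the ITILP_max value of 4 is an assumption chosen so that TLP=3 saturates, matching the diagram.

```python
from fractions import Fraction

def itilp(ilp, tlp, itilp_max):
    """Inter-thread ILP: independent instructions available across all
    scheduled warps, capped at ITILP_max once latency is fully hidden."""
    return min(ilp * tlp, itilp_max)

ilp = Fraction(4, 3)      # per-warp ILP from the slide
itilp_max = Fraction(4)   # assumed cap, chosen so TLP=3 saturates
for tlp in (1, 2, 3):
    print(tlp, itilp(ilp, tlp, itilp_max))  # 4/3, 8/3, then capped at 4
```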
Analytical Model: Wserial
- Wserial represents the overhead due to serialization effects.
- Wserial = Osync + OSFU + OCFDiv + Obank
  - Osync: synchronization overhead
  - OSFU: SFU contention overhead
  - OCFDiv: branch divergence overhead
  - Obank: bank-conflict overhead
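The Tcomp decomposition from the two slides above is a pair of sums; the cycle counts below are hypothetical, purely for illustration.

```python
def w_serial(o_sync, o_sfu, o_cfdiv, o_bank):
    """Serialization overhead: synchronization, SFU contention,
    branch divergence, and bank conflicts."""
    return o_sync + o_sfu + o_cfdiv + o_bank

def t_comp(w_parallel, w_ser):
    """Compute time: useful parallel work plus serialization overhead."""
    return w_parallel + w_ser

overhead = w_serial(o_sync=120, o_sfu=40, o_cfdiv=60, o_bank=30)
print(t_comp(w_parallel=1000, w_ser=overhead))  # -> 1250
```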
Analytical Model: SFU Overhead
- GPGPUs have SFUs (special function units) where expensive operations can be executed.
- With a good ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden.
- (Diagram: with a high inst-to-SFU ratio, SFU latency hides behind regular instructions; with a low inst-to-SFU ratio, the latency of SFU instructions is not completely hidden, producing the OSFU overhead.)
Analytical Model: Tmem
- Tmem represents the amount of time spent on memory requests and transfers.
- Tmem = effective mem. requests × AMAT
- (Diagram: with 4 memory requests, MWP=2 gives Tmem = 4·MEM/2, while MWP=1 gives Tmem = 4·MEM/1.)
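The 4·MEM/MWP example above can be reproduced with a one-line sketch: MWP requests proceed concurrently, so the effective (serialized) request count is the total divided by MWP. The 200-cycle latency is an assumed value.

```python
def t_mem(num_requests, mwp, amat):
    """Memory time: with MWP requests in flight concurrently, the
    effective number of serialized requests is num_requests / MWP,
    each costing one average memory access time (AMAT)."""
    return (num_requests / mwp) * amat

MEM = 200  # hypothetical average memory latency in cycles
print(t_mem(4, mwp=2, amat=MEM))  # 4*MEM/2 -> 400.0
print(t_mem(4, mwp=1, amat=MEM))  # 4*MEM/1 -> 800.0
```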
Analytical Model: Toverlap (MWP ≥ CWP)
- Toverlap represents how much of the memory cost can be hidden by multi-threading.
- When MWP ≥ CWP (e.g., MWP=3, CWP=3), all the memory costs are overlapped with computation: Toverlap ≈ Tmem.
Analytical Model: Toverlap (CWP > MWP)
- When CWP > MWP (e.g., MWP=2, CWP=4), the computation cost is hidden by the memory cost: Toverlap ≈ Tcomp.
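The two Toverlap cases above reduce to a simple case split. This is a simplification of the model (it ignores the partial-overlap corner cases the paper handles), intended only to show the direction of each case.

```python
def t_overlap(t_comp, t_mem, mwp, cwp):
    """Overlap between compute and memory cost (simplified case split).
    MWP >= CWP: enough in-flight memory warps, so memory hides behind
    compute (overlap ~ Tmem). CWP > MWP: compute hides behind memory
    (overlap ~ Tcomp)."""
    return t_mem if mwp >= cwp else t_comp

print(t_overlap(t_comp=600, t_mem=450, mwp=3, cwp=3))  # MWP >= CWP -> 450
print(t_overlap(t_comp=300, t_mem=800, mwp=2, cwp=4))  # CWP > MWP  -> 300
```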
Benefit Chart
- The time metrics are converted into potential benefit metrics.
- Tfp: ideal computation cost; Tmem_min: ideal memory cost.
- Benefit metrics:
  - Bmemlp: benefit of increasing MLP
  - Bserial: benefit of removing serialization effects
  - Bitilp: benefit of increasing inter-thread ILP
  - Bfp: benefit of improving computing efficiency
- (Chart: costs range from the single-thread case down to the optimized kernel; the gaps between Tcomp, Tmem, Tmem', Toverlap, Tmem_min, and Tfp define the benefit metrics.)
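One illustrative reading of the chart is that each benefit is the gap between a current cost and its idealized counterpart. These definitions are simplified assumptions for illustration, not the paper's exact formulas, and Bitilp is omitted here.

```python
def benefits(t_comp, t_mem, t_fp, t_mem_min, w_ser):
    """Illustrative benefit metrics (simplified; assumed definitions):
    each benefit is the gap between a current cost and an ideal cost."""
    return {
        "B_serial": w_ser,                # removing serialization effects
        "B_memlp": t_mem - t_mem_min,     # raising MLP to the ideal level
        "B_fp": (t_comp - w_ser) - t_fp,  # closing the gap to ideal compute
    }

# Hypothetical cycle counts, continuing the earlier toy examples.
print(benefits(t_comp=1250, t_mem=800, t_fp=700, t_mem_min=400, w_ser=250))
```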
Frontend Data Collector
- Compute Visual Profiler: architecture-related information from H/W counters (#insts, global LD/ST requests, cache info).
- Instruction Analyzer (IA), built on Ocelot [Diamos et al., PACT'10]: detailed information from emulating PTX executions (#SFU insts, #sync insts, loop counters).
- Static analysis tools: information from CUDA binaries (CUBIN) instead of PTX, i.e., after low-level compiler optimizations (ILP, MLP, occupancy).
- The collected information is fed into our analytical model.
Outline
- Motivation
- GPUPerf: A Performance Analysis Framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
- Evaluations
- Conclusion
Evaluation
- NVIDIA C2050 (Fermi architecture).
- FMM (fast multipole method): an approximation of the n-body problem [winner of the 2010 Gordon Bell Prize at Supercomputing].
  - 44 optimization combinations: prefetching, loop unrolling, SFU use, shared memory, vector packing, and loop optimization.
- Parboil benchmarks and Reduction (in the paper).
Results: FMM
- The vector packing + shared memory + unroll-jam + SFU combination shows the best performance.
Results: FMM
- Our model follows the speed-up trend quite well.
- Our model correctly pinpoints the best optimization combination that improves the kernel.
Results: Potential Benefits
- Bfp indicates computing inefficiency (higher is worse).
- A large Bfp implies that the kernel could still be improved via optimizations.
- A small Bfp value indicates that adding prefetching (Pref) does not lead to further performance improvement.
Conclusion
- We propose GPUPerf, a performance analysis framework consisting of a frontend data collector, an analytical model, and a performance advisor.
- The performance advisor provides potential benefit metrics (Bmemlp, Bserial, Bitilp, Bfp), which can guide performance tuning for GPGPU code.
- The 44 optimization combinations in FMM are predicted well.
- Future work: the potential benefit metrics can serve as inputs to compilers.