Presentation Transcript

Slide 1

A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications

Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc

Slide 2

Outline
- Motivation
- GPUPerf: Performance analysis framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
- Evaluations
- Conclusion

Slide 3

GPGPU Programming

[Figure: a CPU version of an application being converted to a GPGPU version]

GPGPU architectures have become very powerful, and programmers want to convert CPU applications to GPGPU applications.

Case 1: 10x speed-up
Case 2: 1.1x speed-up

For case 2, programmers might wonder why the benefit is so poor. Maybe the algorithm is not parallelizable; maybe the GPGPU code is not well optimized. For case 1, programmers might wonder whether 10x is the best possible speed-up.

Programmers want to optimize code whenever possible!

Slide 4

GPGPU Optimizations
Optimizing parallel programs is difficult^100!

Most programmers apply optimization techniques one by one (e.g., try one more optimization with shared memory). Which one should they choose?

[Chart: one optimization is "Best for this kernel"; after trying another, it is "Still the best!"]

Programmers want to understand the performance benefit!

Slide 5

GPGPU Performance Guidance
Providing performance guidance is not easy. It requires:
- Program analysis: obtain as much program information as possible
- Performance modeling: have a sophisticated analytical model
- User-friendly metrics: convert the performance analysis information into performance guidance

We propose GPUPerf, a performance analysis framework that quantitatively predicts potential performance benefits. In this talk, we will focus mostly on performance modeling and the potential benefit metrics.

Slide 6

Outline
- Motivation
- GPUPerf: Performance Analysis Framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
- Evaluations
- Conclusion

Slide 7

GPUPerf Overview
What is required for performance guidance?
- Program analysis
- Performance modeling
- User-friendly metrics

GPUPerf maps these onto three components: a GPGPU kernel goes into the Frontend Data Collector, which feeds program information (ILP, #insts, ...) to the Analytical Model; the model output is converted into benefit metrics by the Performance Advisor.

For clarity, each component will be explained in reverse order.

Slide 8

Performance Advisor
Goals of the performance advisor:
- Convey performance bottleneck information
- Estimate the potential gains from reducing the bottlenecks

The performance advisor provides four potential benefit metrics:
- Bitilp: benefit of increasing ITILP (inter-thread ILP)
- Bmemlp: benefit of increasing MLP
- Bserial: benefit of removing serialization effects
- Bfp: benefit of reducing computing inefficiency

With these metrics, programmers can get an idea of the potential benefit of a GPGPU kernel. The benefit metrics are computed from our analytical model.

Slide 9

Previous Work: The MWP-CWP Model [Hong and Kim, ISCA'09]
- MWP (Memory Warp Parallelism): an indicator of memory-level parallelism
- CWP (Compute Warp Parallelism)

Depending on MWP and CWP, the model predicts the execution time.

[Figure: a timeline of 8 warps in which up to 4 memory periods overlap (MWP=4), while one memory period spans the compute periods of 3 warps (CWP=3)]

The MWP-CWP model can predict general cases. Problem: it does not model corner cases, which is critical for predicting the benefits of different program optimizations!

Slide 10

Analytical Model
Our analytical model follows a top-down approach:
- Easy-to-interpret model components
- Components relate directly to performance bottlenecks

- Texec: final execution time
- Tcomp: computation time
- Tmem: memory time
- Toverlap: overlapped time

Texec = Tcomp + Tmem - Toverlap

[Figure: a timeline of 4 warps with MWP=2; compute periods form Tcomp, memory periods form Tmem, and the memory time hidden under computation is Toverlap.]
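As a minimal sketch (illustrative names, not the authors' implementation), the top-level decomposition maps directly to code; the following slides refine each term:

    # Top-level GPUPerf-style time decomposition (illustrative names).
    def t_exec(t_comp: float, t_mem: float, t_overlap: float) -> float:
        """Final execution time: compute plus memory, minus their overlap."""
        return t_comp + t_mem - t_overlap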

Slide 11

Analytical Model
Tcomp is the amount of time to execute compute instructions:
- Wparallel: work executed in parallel (useful work)
- Wserial: overhead due to serialization effects

Tcomp = Wparallel + Wserial

Slide 12

Analytical Model
Wparallel is the amount of work that can be executed in parallel:

Wparallel = Total insts × Effective inst. throughput
Effective inst. throughput = f(warp_size, SIMD_width, # pipeline stages)

ITILP represents the number of instructions that can be executed in parallel in the pipeline, as sketched below.
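A hedged sketch of this term: assuming the effective cost per instruction behaves like avg_inst_lat / ITILP cycles (ITILP instructions in flight hide one another's latency), Wparallel can be written as follows; the function and parameter names are illustrative:

    # Assumed form of W_parallel: with ITILP instructions in flight,
    # each instruction effectively costs avg_inst_lat / ITILP cycles.
    def w_parallel(total_insts: int, avg_inst_lat: float, itilp: float) -> float:
        return total_insts * (avg_inst_lat / itilp)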

Slide 13

Analytical Model
ITILP (Inter-thread ILP): ILP among multiple warps.

ITILP = MIN(ILP × N, ITILPmax), where N is the TLP (number of warps)
ITILPmax = avg_inst_lat / (warp_size / SIMD_width)

[Figure: instruction timelines with ILP = 4/3. With TLP = 1, ITILP = 4/3 and the pipeline stalls (low ITILP); with TLP = 2, ITILP = 8/3 and stall time shrinks; with TLP = 3, ITILP = ITILPmax and execution latency is already all hidden.]

As TLP increases, stall time reduces.
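A worked example of the ITILP formula in Python; the latency and machine parameters are illustrative choices (not values from the slide) picked so that ITILP saturates at TLP = 3, as in the figure:

    # ITILP = MIN(ILP x N, ITILP_max); parameter values are illustrative.
    def itilp(ilp: float, tlp: int, avg_inst_lat: float,
              warp_size: int = 32, simd_width: int = 16) -> float:
        issue_cycles = warp_size / simd_width      # cycles to issue one warp instruction
        itilp_max = avg_inst_lat / issue_cycles    # instructions needed to hide latency
        return min(ilp * tlp, itilp_max)

    for n in (1, 2, 3):
        print(n, itilp(ilp=4/3, tlp=n, avg_inst_lat=8.0))
    # prints ITILP of about 1.33 (stalls), 2.67 (fewer stalls), 4.0 (= ITILP_max)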

Slide 14

Analytical Model
Wserial represents the overhead due to serialization effects:

Wserial = Osync + OSFU + OCFDiv + Obank

- Osync: synchronization overhead
- OSFU: SFU contention overhead
- OCFDiv: branch divergence overhead
- Obank: bank-conflict overhead
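Putting the compute-side terms together, a minimal sketch (illustrative names):

    # W_serial sums the four serialization overheads; T_comp adds it
    # to the useful parallel work (slide 11).
    def w_serial(o_sync: float, o_sfu: float, o_cfdiv: float, o_bank: float) -> float:
        return o_sync + o_sfu + o_cfdiv + o_bank

    def t_comp(w_par: float, w_ser: float) -> float:
        return w_par + w_ser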

Slide 15

Analytical Model: SFU Overhead
GPGPUs have SFUs (special function units) on which expensive operations can be executed. With a good ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden.

[Figure: with a high inst-to-SFU ratio, SFU instructions overlap with regular instructions; with a low inst-to-SFU ratio, the latency of SFU instructions is not completely hidden, producing the OSFU overhead.]

Slide 16

Analytical Model
Tmem represents the amount of time spent on memory requests and transfers:

Tmem = Effective mem. requests × AMAT

[Figure: four memory requests; with MWP=2 they overlap two at a time, so Tmem = 4MEM/2, while with MWP=1 they serialize, so Tmem = 4MEM/1.]
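A hedged sketch of the memory term, assuming "effective memory requests" means the total request count divided by MWP, consistent with the 4MEM/2 example above; names are illustrative:

    # Requests overlap MWP at a time, so the serialized request count
    # is total / MWP, each costing the average memory access time (AMAT).
    def t_mem(num_mem_requests: int, mwp: float, amat: float) -> float:
        return (num_mem_requests / mwp) * amat

    assert t_mem(4, mwp=2, amat=400.0) == 2 * 400.0   # the MWP=2 case above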

Slide 17

Analytical Model
Toverlap represents how much of the memory cost can be hidden by multi-threading.

When MWP ≥ CWP, all of the memory costs are overlapped with computation: Toverlap ≈ Tmem.

[Figure: MWP=3, CWP=3; every memory period is hidden under other warps' computation.]

Slide 18

Analytical Model
Toverlap represents how much of the memory access cost can be hidden by multi-threading.

When CWP > MWP, the computation cost is hidden by the memory cost: Toverlap ≈ Tcomp.

[Figure: MWP=2, CWP=4; compute periods are hidden under memory periods.]
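Combining slides 17 and 18, a minimal sketch of the overlap term (illustrative, ignoring the model's corner cases):

    # When memory parallelism keeps up with compute demand (MWP >= CWP),
    # memory time is fully hidden; otherwise computation hides under memory.
    def t_overlap(t_comp: float, t_mem: float, mwp: float, cwp: float) -> float:
        return t_mem if mwp >= cwp else t_comp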

Slide 19

Benefit Chart
Time metrics are converted into potential benefit metrics.

[Benefit chart: compute cost (Tcomp) and memory cost (Tmem, Tmem') with Toverlap, Tmem_min, and Tfp marked, spanning from a single-thread version to the optimized kernel.]

Benefit metrics and what they measure:
- Bmemlp: increasing MLP
- Bserial: removing serialization effects
- Bitilp: increasing inter-thread ILP
- Bfp: improving computing efficiency

Here Tfp is the ideal computation cost and Tmem_min is the ideal memory cost.
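One plausible reading of the chart in code, treating each benefit as the gap between an achieved cost and its ideal counterpart. This is a sketch of the idea only; the paper's exact definitions differ in detail, and every name is illustrative:

    # Assumed reading of the benefit chart, not the paper's formulas.
    def benefit_metrics(t_comp: float, t_fp: float,
                        t_mem_prime: float, t_mem_min: float,
                        w_serial: float) -> dict:
        return {
            "B_serial": w_serial,                # removing serialization effects
            "B_memlp": t_mem_prime - t_mem_min,  # gap to the ideal memory cost
            "B_fp": t_comp - t_fp,               # gap to the ideal computation cost
        }
    # B_itilp would analogously measure the gap from reaching ITILP_max.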

Slide 20

Frontend Data Collector
The frontend collects the model's inputs from a CUDA executable using three tools:
- Compute Visual Profiler: architecture-related information from H/W counters (#insts, occupancy, global LD/ST requests, cache info)
- Instruction Analyzer (IA), built on Ocelot [Diamos et al., PACT'10]: detailed information from emulating PTX executions (#SFU insts, #sync insts, loop counters)
- Static Analysis Tools: information from CUDA binaries (CUBIN) instead of PTX, i.e., after low-level compiler optimizations (ILP, MLP, ...)

The collected information is fed into our analytical model.

Slide 21

Outline
- Motivation
- GPUPerf: A Performance Analysis Framework
  - Performance Advisor
  - Analytical Model
  - Frontend Data Collector
- Evaluations
- Conclusion

Slide 22

Evaluation
- NVIDIA C2050 (Fermi architecture)
- FMM (Fast Multipole Method): an approximation of the n-body problem [Winner, 2010 Gordon Bell Prize at Supercomputing]
- Parboil benchmarks and Reduction (in the paper)

Optimizations evaluated on FMM: prefetching, loop unrolling, SFU, shared memory, vector packing, and loop optimization, giving 44 optimization combinations.

Slide 23

Results - FMM

The vector packing + shared memory + unroll-jam + SFU combination shows the best performance.

Slide 24

Results - FMM

Our model follows the speed-up trend quite well, and it correctly pinpoints the best optimization combination for the kernel.

Slide 25

Results - Potential Benefits

Bfp measures computing inefficiency (higher is worse). A large Bfp implies that the kernel could still be improved via optimizations; a small Bfp value indicates that adding prefetching (Pref) does not lead to further performance improvement.

Slide 26

Conclusion
We propose GPUPerf, a performance analysis framework consisting of a frontend data collector, an analytical model, and a performance advisor.
The performance advisor provides potential benefit metrics (Bmemlp, Bserial, Bitilp, Bfp), which can guide performance tuning for GPGPU code.
The 44 optimization combinations in FMM are predicted well.
Future work: the performance benefit advisor's outputs can serve as inputs to compilers.