LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Presentation Transcript

Slide1

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, Onur Mutlu

Slide2

Register file size limits GPU scalability

- Register file (RF) already accounts for 60% of on-chip storage
- But there is still demand for more registers to achieve maximum performance and concurrency
  - Future slow memory accesses call for more threads: multi-socket, multi-GPU, RDMA, NVM, etc.
  - Compiler optimizations call for more registers per thread: loop unrolling, thread coarsening, etc.

[Chart: register file capacity (KB) trend, with 2.3x and 5.9x growth annotations]

Need mechanisms to expand RF capacity (without large area/power overheads)

Slide3

How to make register files larger?

- Emerging technologies [Jing'13][Mao'14][Wang'15][Abdel-Majeed'17]
- Register file compression [Lee'15]
- Register file virtualization [Jeon'15][Vijaykumar'16][Kloosterman'17]

Common challenge: latency overhead
- Example: an 8x larger register file built with NTV TFET is 5.3x slower

Goal: tolerate register file latencies with no latency overhead

Slide4

Contributions

Latency Tolerant Register File (LTRF)
- “2-level” main register file + register cache
- Performs prefetch operations while executing other warps
- Paves the way for several power/area optimizations

Compiler-driven Register Prefetching
- Break the control flow graph into “prefetch subgraphs”
- Prefetch registers at the beginning of each subgraph
- Interval analysis to identify prefetch subgraphs

LTRF tolerates up to 6x slower register files
Example LTRF use case: 8x larger RF → 34.8% higher performance

Slide5

Outline

- Background and challenges
- The case for compiler-driven register prefetching in GPUs
- LTRF architecture and compiler support
- Evaluation methodology
- Results

Slide6

Register file caching [Gebhart et al., ISCA 2011]

Promising approach for latency-tolerant register files.

Unfortunately, classic demand fetch-and-replace yields low hit rates in register caches.

[Diagram: Warp Scheduler; Register File Cache (multiple banks) connected by crossbars to an Operand Collector and SIMD Units; Main Register File (multiple banks) behind the cache]

Slide7

Demand fetch/replace register caching

- 8-30% hit rate. Why?
  - No spatial locality for registers
  - Values might be renamed to different registers, scrambling temporal locality
  - Lots of threads → cache thrashing
- Our solution: precise register prefetching

Slide8

Compiler-driven register prefetching

Possible to have near-perfect register prefetchers:
- Register working sets are known at compile time
- No indirection or address translation
- Prefetch latency can overlap with other warps' execution

Slide9

Compiler-driven register prefetching

Possible to have near-perfect register prefetchers:
- Register working sets are known at compile time
- No indirection or address translation
- Prefetch latency can overlap with other warps' execution

Key idea: “prefetch subgraphs”
- Prefetch register working sets into the cache at the beginning of each subgraph
- All register accesses in the prefetch subgraph hit in the register cache

Slide10

Compiler-driven register prefetching

Possible to have near-perfect register prefetchers:
- Register working sets are known at compile time
- No indirection or address translation
- Prefetch latency can overlap with other warps' execution

Key idea: “prefetch subgraphs”
- Prefetch register working sets into the cache at the beginning of each subgraph
- All register accesses in the prefetch subgraph hit in the register cache

[CFG diagram: blocks accessing R1, R3, R2, R4]

Slide11

Compiler-driven register prefetching

Possible to have near-perfect register prefetchers:
- Register working sets are known at compile time
- No indirection or address translation
- Prefetch latency can overlap with other warps' execution

Key idea: “prefetch subgraphs”
- Prefetch register working sets into the cache at the beginning of each subgraph
- All register accesses in the prefetch subgraph hit in the register cache

[CFG diagram: P(R1, R2) inserted before blocks using R1 and R2; P(R3) before the block using R3; P(R4) before the block using R4]

What are the best prefetch subgraphs?

Slide12

Objectives

Objectives → implications:
- Prefetch operations dominate register uses → single-entry subgraphs
- Minimum number of prefetch operations → largest possible subgraphs
- Fit entire loops, maximize dynamic instructions → capture backward branches

Optimal prefetch subgraphs

Slide13

Register intervals

- Intervals: disjoint single-entry subgraphs of the CFG
- Register intervals access at most k registers
  - Reserve k register slots for each warp to prevent eviction
- Entry block is the header of the first interval
- Greedily add child basic blocks iff:
  - incoming edges come only from within the interval, AND
  - |registers accessed in the interval| ≤ k
- Remaining children become new headers
- Repeat until the graph is irreducible

Prefetch register working sets at the beginning of register intervals
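The greedy construction above can be sketched in Python. This is our own simplified rendering of one pass of the multi-pass algorithm, not the paper's implementation; the names (`build_intervals`, `succs`, `regs`) are ours, and the CFG representation (successor lists plus per-block register sets) is an assumption.

```python
def build_intervals(succs, regs, entry, k):
    """One greedy pass: partition a CFG into single-entry intervals
    whose register working sets fit in k cache slots.

    succs: dict block -> list of successor blocks
    regs:  dict block -> set of registers the block accesses
    """
    # Derive predecessor sets from the successor lists.
    preds = {b: set() for b in succs}
    for b, ss in succs.items():
        for s in ss:
            preds[s].add(b)

    headers = [entry]
    seen_headers = {entry}
    intervals = {}  # header -> (blocks in interval, prefetch set)
    while headers:
        h = headers.pop(0)
        blocks = {h}
        working = set(regs[h])
        changed = True
        while changed:
            changed = False
            for b in succs:
                if b in blocks or b in seen_headers:
                    continue
                # Merge b iff all its incoming edges come from this
                # interval AND the combined working set still fits in k.
                if preds[b] <= blocks and len(working | regs[b]) <= k:
                    blocks.add(b)
                    working |= regs[b]
                    changed = True
        intervals[h] = (blocks, working)
        # Unmerged successors of the interval become new headers.
        for b in blocks:
            for s in succs[b]:
                if s not in blocks and s not in seen_headers:
                    seen_headers.add(s)
                    headers.append(s)
    return intervals
```

On a hypothetical diamond CFG, a large k yields a single interval covering every block, while a small k forces the blocks whose registers no longer fit to start new intervals, mirroring the "G can't merge, cache is full" step in the walkthrough.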

Slide14

Register intervals in action

[CFG diagram: blocks A {R1}, B {R1, R9}, C {R1}, D {R5, R6}, E {R7, R8}, F {R3, R8}, G {R2, R3}]

Slide15

Register intervals in action

A is the first interval header

[Diagram: same CFG; current interval {A}, prefetch set {R1}]

Slide16

Register intervals in action

E is the only candidate to merge

[Diagram: interval {A}, prefetch set {R1}; E highlighted as the merge candidate]

Slide17

Register intervals in action

E merges

[Diagram: interval {A, E}, prefetch set {R1, R7, R8}]

Slide18

Register intervals in action

F and G are potential candidates to merge

[Diagram: interval {A, E}, prefetch set {R1, R7, R8}; F and G highlighted]

Slide19

Register intervals in action

F merges but G can’t (register cache is full)

[Diagram: interval {A, E, F}, prefetch set {R1, R7, R8, R3}; all k register slots are used]

Slide20

Register intervals in action

Done with first interval --- B and G become headers

[Diagram: interval {A, E, F} with prefetch set {R1, R7, R8, R3}; new headers B {R1, R9} and G {R2, R3}]

Slide21

Register intervals in action

No candidate to merge into B --- C becomes header

[Diagram: as before; C {R1} becomes a new header]

Slide22

Register intervals in action

D becomes candidate to merge into C

[Diagram: as before; D highlighted as a merge candidate for C’s interval]

Slide23

Register intervals in action

D merges into C --- done with the first pass

[Diagram: intervals {A, E, F}, {B}, {C, D} with prefetch set {R1, R5, R6}, and {G}]

Slide24

Register intervals in action

Second pass: Yellow is able to merge into Red

[Diagram: the {C, D} interval (yellow) merges into the {B} interval (red)]

Slide25

Register intervals in action

Done with second pass --- the graph is not further reducible

Final result: three intervals with a PREFETCH inserted at each header; one interval captures an entire nested loop, and the algorithm terminates.
- PREFETCH R1, R9, R5, R6
- PREFETCH R1, R7, R8, R3
- PREFETCH R2, R3

Slide26

Register interval highlights

- Single-entry prefetch subgraphs → prefetch operations dominate register uses
- Maximal-length subgraphs → minimize prefetch overheads
- Minimal termination constraints → encapsulate entire loops, maximize dynamic instructions per interval
- Multi-pass construction algorithm based on classic interval analysis

Need hardware mechanisms to provide fixed-size cache partitions for register intervals
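The fixed-size partitioning the slide calls for can be illustrated with a toy index function, our own sketch (the name `cache_slot` and the flat-indexing scheme are assumptions, not the paper's hardware design): giving each active warp exactly k reserved slots guarantees one warp's interval prefetch can never evict another warp's cached registers.

```python
K = 4  # register cache slots reserved per active warp (k from interval analysis)

def cache_slot(active_warp_id, slot_in_interval, k=K):
    """Flat index of a cached register: each active warp owns the
    contiguous, non-overlapping range [warp_id*k, warp_id*k + k)."""
    assert 0 <= slot_in_interval < k, "interval exceeds its k-slot partition"
    return active_warp_id * k + slot_in_interval
```

Because partitions are disjoint by construction, eviction decisions never cross warp boundaries, which is exactly the isolation the compiler's k-register interval bound relies on.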

Slide27

Outline

- Background and challenges
- The case for compiler-driven register prefetching in GPUs
- LTRF architecture and compiler support
- Evaluation methodology
- Results

Slide28

Latency Tolerant Register File (LTRF)

“2-level” register file

[Diagram: Warp Scheduler issues to the Operand Collector and SIMD Units through a Register File Cache (connected by crossbars); the Main Register File sits behind the cache]

Slide29

Latency Tolerant Register File (LTRF)

“2-level” register file + warp scheduler

[Diagram: as before, with the warp scheduler replaced by a Two-Level Warp Scheduler]

Slide30

Latency Tolerant Register File (LTRF)

“2-level” register file + warp scheduler

- Cache registers only for the active warps

[Diagram: the register file cache becomes an Active Warps Cache in front of the Main Register File]

Slide31

Latency Tolerant Register File (LTRF)

“2-level” register file + warp scheduler

- Cache registers only for the active warps
- Dedicated register cache space for each warp (16 registers per warp)

[Diagram: Active Warps Cache partitioned into per-warp slots of 16 registers]

Slide32

Latency Tolerant Register File (LTRF)

“2-level” register file + warp scheduler

- Cache registers only for the active warps
- Dedicated register cache space for each warp (16 registers per warp)
- Swap warps on PREFETCH

[Diagram: same architecture; a warp issuing PREFETCH is swapped out of the active set while its registers are fetched]

Slide33

Outline

- Background and challenges
- The case for compiler-driven register prefetching in GPUs
- LTRF architecture and compiler support
- Evaluation methodology
- Results

Slide34

Evaluation methodology

- Simulator: GPGPU-Sim modeling an NVIDIA Maxwell GPU
- Workloads: CUDA-SDK, Rodinia, and Parboil suites
- Comparison points:
  - Baseline: no register caching
  - RFC: demand-fetch register file caching [Gebhart et al., ISCA 2011]
  - LTRF
  - Ideal: increased capacity with no latency overhead

Slide35

Latency tolerance

Maximum tolerable RF latency with IPC slowdown ≤ 5%:

LTRF tolerates the latencies of up to 6x slower register files

Slide36

Performance improvement

Example use case: increase register file capacity from 256 KB to 2 MB using NTV TFET
- Same power/area as the baseline register file (256 KB)
- 2nd-level RF accesses are 5.3x slower than the baseline

LTRF+TFET improves performance by 34.8%, within 2.3% of an ideal large register file

Slide37

Also in the paper…

- LTRF+: minimize register movement between the register file and register cache using liveness analysis
- Register intervals compared to other subgraphs (strands, superblocks, etc.)
- Detailed analysis of hardware overheads: 16% more area, 21% less power
- Various LTRF use cases with different register file technologies and optimizations

Slide38

Conclusion

- Register files are the main GPU scalability bottleneck
  - They already consume 60% of total on-chip memory
  - More registers are needed for highest performance
  - Standalone register caching solutions yield low hit rates
- Latency Tolerant Register File (LTRF)
  - “2-level” main register file + register cache
  - Prefetches register working sets ahead of time
  - Performs prefetch operations while executing other warps
  - Interval analysis for near-optimal prefetching
  - Tolerates up to 6x slower main register files
  - Paves the way for several power/area optimizations

Slide39

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, Onur Mutlu