Slide 1: LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, Onur Mutlu
Slide 2: Register file size limits GPU scalability
- Register file (RF) already accounts for 60% of on-chip storage
- But there is still demand for more registers to achieve maximum performance and concurrency
  - Future slow memory accesses call for more threads: multi-socket, multi-GPU, RDMA, NVM, etc.
  - Compiler optimizations call for more registers per thread: loop unrolling, thread coarsening, etc.
- Need mechanisms to expand RF capacity (without large area/power overheads)
[Figure: register file capacity (KB) across GPU generations, annotated with 2.3x and 5.9x growth]
Slide 3: How to make register files larger?
- Emerging technologies [Jing'13][Mao'14][Wang'15][Abdel-Majeed'17]
- Register file compression [Lee'15]
- Register file virtualization [Jeon'15][Vijaykumar'16][Kloosterman'17]
- Common challenge: latency overhead
  - Example: an 8x larger register file built with NTV TFET is 5.3x slower
- Goal: tolerate register file latencies, so the larger RF adds no effective latency overhead
Slide 4: Contributions
- Latency Tolerant Register File (LTRF)
  - "2-level" main register file + register cache
  - Performs prefetch ops while executing other warps
  - Paves the way for several power/area optimizations
- Compiler-driven register prefetching
  - Breaks the control flow graph into "prefetch subgraphs"
  - Prefetches registers at the beginning of each subgraph
  - Interval analysis to identify prefetch subgraphs
- LTRF tolerates up to 6x slower register files
- Example LTRF use case: 8x larger RF, 34.8% higher performance
Slide 5: Outline
- Background and challenges
- The case for compiler-driven register prefetching in GPUs
- LTRF architecture and compiler support
- Evaluation methodology
- Results
Slide 6: Register file caching [Gebhart, ISCA'11]
- Promising approach for latency-tolerant register files
- Unfortunately, classic demand fetch-and-replace yields low hit rates in register caches
[Diagram: warp scheduler → register file cache (multiple banks) → crossbar → operand collector → SIMD units, backed by the main register file (multiple banks) through a crossbar]
Slide 7: Demand fetch/replace register caching
- Only an 8-30% hit rate. Why?
  - No spatial locality for registers
  - Values might be renamed to different registers, scrambling temporal locality
  - Lots of threads cause cache thrashing
- Our solution: precise register prefetching
Slide 8: Compiler-driven register prefetching
- Possible to have near-perfect register prefetchers:
  - Register working sets are known at compile time
  - No indirection or address translation
  - Prefetch latency may overlap with other warps' execution
Slide 9: Compiler-driven register prefetching
- Key idea: "prefetch subgraphs"
  - Prefetch register working sets into the cache at the beginning of each subgraph
  - All register accesses in the prefetch subgraph then hit in the register cache
Slide 10: Compiler-driven register prefetching
[Diagram: example control flow graph whose basic blocks access R1, R3, R2, and R4]
Slide 11: Compiler-driven register prefetching
[Diagram: the same CFG with prefetch operations inserted at subgraph entries: P(R1, R2) before the blocks using R1 and R2, P(R3) before the block using R3, and P(R4) before the block using R4]
- What are the best prefetch subgraphs?
Slide 12: Optimal prefetch subgraphs
Objectives → implications:
- Prefetch operations dominate register uses → single-entry subgraphs
- Minimum number of prefetch operations → largest possible subgraphs
- Fit entire loops, maximize dynamic instructions → capture backward branches
Slide 13: Register intervals
- Intervals: disjoint single-entry subgraphs of the CFG
- Register intervals access at most k registers
  - Reserve k register-cache slots for each warp to prevent eviction
- The entry block is the header of the first interval
- Greedily add child basic blocks iff:
  - incoming edges come only from within the interval, AND
  - |registers accessed in the interval| ≤ k
- Remaining children become new headers
- Repeat until the graph is irreducible
- Prefetch register working sets at the beginning of each register interval
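The construction above can be sketched in a few dozen lines. The following is a runnable Python approximation (not the paper's implementation): `build_intervals` does one greedy pass, and `interval_analysis` collapses each interval into a super-node and repeats until the graph is irreducible, in the style of classic interval analysis. It assumes every block is reachable from the entry. In the usage example, the per-block register sets and k = 4 follow the worked example on slides 14-25, but the CFG edge set is our reconstruction of the figure.

```python
from collections import defaultdict

def build_intervals(blocks, edges, entry, k):
    """One greedy pass of register-interval construction.
    blocks: {block: set of registers it accesses}
    edges:  iterable of (src, dst) CFG edges
    k:      per-warp register-cache capacity in slots"""
    preds, succs = defaultdict(set), defaultdict(set)
    for s, d in edges:
        preds[d].add(s)
        succs[s].add(d)

    intervals, headers, seen = [], [entry], {entry}
    while headers:
        h = headers.pop(0)
        members, regs = {h}, set(blocks[h])
        changed = True
        while changed:
            changed = False
            for n in sorted({d for m in members for d in succs[m]} - members):
                # Merge a child iff all its incoming edges come from inside
                # the interval AND the combined working set still fits in k.
                if preds[n] <= members and len(regs | blocks[n]) <= k:
                    members.add(n)
                    regs |= blocks[n]
                    changed = True
        intervals.append((members, regs))
        for n in {d for m in members for d in succs[m]} - members:
            if n not in seen:          # unmerged children become new headers
                seen.add(n)
                headers.append(n)
    return intervals

def interval_analysis(blocks, edges, entry, k):
    """Multi-pass interval analysis: collapse each interval into a
    super-node and repeat until the graph is irreducible. Returns one
    register working set per final interval (one PREFETCH each)."""
    while True:
        intervals = build_intervals(blocks, edges, entry, k)
        if len(intervals) == len(blocks):   # no interval grew: irreducible
            return [regs for _, regs in intervals]
        owner = {b: i for i, (ms, _) in enumerate(intervals) for b in ms}
        blocks = {i: regs for i, (_, regs) in enumerate(intervals)}
        edges = {(owner[s], owner[d]) for s, d in edges
                 if owner[s] != owner[d]}
        entry = owner[entry]

# Example: the CFG from slides 14-25 (register sets and k = 4 follow the
# slides; the edges are our reconstruction of the figure).
blocks = {'A': {'R1'}, 'B': {'R1', 'R9'}, 'C': {'R1'}, 'D': {'R5', 'R6'},
          'E': {'R7', 'R8'}, 'F': {'R3', 'R8'}, 'G': {'R2', 'R3'}}
edges = [('A', 'B'), ('A', 'E'), ('E', 'F'), ('E', 'G'),
         ('B', 'C'), ('C', 'D'), ('D', 'C'), ('G', 'B')]
print(interval_analysis(blocks, edges, 'A', k=4))
```

With this reconstructed edge set, the analysis yields the three working sets of slide 25: {R1, R7, R8, R3}, {R1, R9, R5, R6}, and {R2, R3}, each small enough to fit in the k reserved cache slots.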
Slide 14: Register intervals in action
[Diagram: example CFG with basic blocks and their register accesses: A {R1}, B {R1, R9}, C {R1}, D {R5, R6}, E {R7, R8}, F {R3, R8}, G {R2, R3}]
Slides 15-25 animate interval construction on this CFG:
- Slide 15: A is the first interval header
- Slide 16: E is the only candidate to merge
- Slide 17: E merges
- Slide 18: F and G are potential candidates to merge
- Slide 19: F merges, but G can't (the register cache is full)
- Slide 20: Done with the first interval; B and G become headers
- Slide 21: No candidate to merge into B; C becomes a header
- Slide 22: D becomes a candidate to merge into C
- Slide 23: D merges into C; done with the first pass
- Slide 24: Second pass: the yellow interval is able to merge into the red one (C and D merge into B's interval)
- Slide 25: Done with the second pass; the graph is no further reducible. The entire nested loop terminates inside one interval, and three prefetch operations are inserted: PREFETCH R1, R9, R5, R6; PREFETCH R1, R7, R8, R3; PREFETCH R2, R3
Slide 26: Register interval highlights
- Single-entry prefetch subgraphs: prefetch operations dominate register uses
- Maximal-length subgraphs: minimize prefetch overheads
- Minimal termination constraints: encapsulate entire loops, maximizing dynamic instructions per interval
- Multi-pass construction algorithm based on classic interval analysis
- Need hardware mechanisms to provide fixed-size cache partitions for register intervals
Slide 27: Outline
- Background and challenges
- The case for compiler-driven register prefetching in GPUs
- LTRF architecture and compiler support
- Evaluation methodology
- Results
Slide 28: Latency Tolerant Register File (LTRF)
- "2-level" register file: a register file cache backed by the main register file
[Diagram: warp scheduler → register file cache → crossbar → operand collector → SIMD units, with a crossbar to the main register file]
Slides 29-32 build up the design step by step:
- Slide 29: "2-level" register file + a two-level warp scheduler
- Slide 30: cache registers only for the active warps (the cache becomes an active-warps cache)
- Slide 31: dedicated register cache space for each warp (16 registers per warp)
- Slide 32: swap warps on PREFETCH
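To make the per-warp partitioning concrete, here is a toy Python model of the active-warps cache. The class and method names are ours, not from the paper; the design point it illustrates (a fixed, dedicated partition per active warp, sized per slide 31's 16-register example) is from the slides.

```python
class RegisterCache:
    """Toy model of LTRF's per-warp register cache partitioning.
    Each active warp owns a contiguous, fixed k-slot partition, so one
    warp's PREFETCH can never evict another warp's working set."""

    def __init__(self, num_active_warps=8, k=16):
        self.k = k
        self.slots = [None] * (num_active_warps * k)

    def _index(self, warp_slot, cache_slot):
        # Slot addressing never crosses a partition boundary.
        assert 0 <= cache_slot < self.k, "interval working set must fit in k"
        return warp_slot * self.k + cache_slot

    def prefetch(self, warp_slot, working_set, main_rf):
        # PREFETCH: copy an interval's register working set from the
        # (slow) main register file into this warp's cache partition.
        # Cache-slot numbers are assigned at compile time.
        for cache_slot, reg in enumerate(working_set):
            self.slots[self._index(warp_slot, cache_slot)] = main_rf[reg]

    def read(self, warp_slot, cache_slot):
        # Within an interval, every register read hits in the cache.
        return self.slots[self._index(warp_slot, cache_slot)]
```

Because a warp's interval accesses at most k registers, a single PREFETCH fills its partition and all subsequent reads in the interval hit; the prefetch latency itself is hidden by swapping in another active warp.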
Slide 33: Outline
- Background and challenges
- The case for compiler-driven register prefetching in GPUs
- LTRF architecture and compiler support
- Evaluation methodology
- Results
Slide 34: Evaluation methodology
- Simulator: GPGPU-Sim, modeling NVIDIA Maxwell
- Workloads: CUDA-SDK, Rodinia, and Parboil suites
- Comparison points:
  - Baseline: no register caching
  - RFC: demand-fetch register file caching [Gebhart, ISCA 2011]
  - LTRF
  - Ideal: increased capacity with no latency overhead
Slide 35: Latency tolerance
- Metric: maximum tolerable RF latency with IPC slowdown ≤ 5%
- LTRF tolerates the latencies of up to 6x slower register files
Slide 36: Performance improvement
- Example use case: increase register file capacity from 256KB to 2MB using NTV TFET
  - Same power/area as the baseline 256KB register file
  - 2nd-level RF accesses are 5.3x slower than baseline
- LTRF+TFET improves performance by 34.8%, within 2.3% of an ideal large register file
Slide 37: Also in the paper…
- LTRF+: minimizes register movement between the register file and register cache using liveness analysis
- Register intervals compared to other subgraphs: strands, superblocks, etc.
- Detailed analysis of hardware overheads: 16% more area, 21% less power
- Various LTRF use cases with different register file technologies and optimizations
Slide 38: Conclusion
- Register files are the main GPU scalability bottleneck
  - They already consume 60% of total on-chip memory
  - More registers are needed for highest performance
- Standalone register caching solutions yield low hit rates
- Latency Tolerant Register File (LTRF)
  - "2-level" main register file + register cache
  - Prefetches register working sets ahead of time
  - Performs prefetch ops while executing other warps
  - Interval analysis for near-optimal prefetching
  - Tolerates up to 6x slower main register files
  - Paves the way for several power/area optimizations
Slide 39: LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun, Onur Mutlu