Slide1
Approximating Warps with Intra-warp Operand Value Similarity
Daniel Wong†, Nam Sung Kim‡, Murali Annavaram¥
†University of California, Riverside (dwong@ece.ucr.edu)
‡University of Illinois, Urbana-Champaign
¥University of Southern California
Slide2
Value Similarity
Values differ only in the least significant bits
d-similar: values that differ only in the d least significant bits
Example: 113 = 1110001₂ and 127 = 1111111₂ are 4-similar
Slide3
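The d-similar definition from the Value Similarity slide can be sketched in a few lines. This is only an illustration, assuming unsigned integer operands of a fixed bit width; the function names `is_d_similar` and `similarity_level` are made up for this sketch and do not come from the presentation.

```python
def is_d_similar(a: int, b: int, d: int, width: int = 32) -> bool:
    """True if a and b agree on every bit above the d least significant bits."""
    mask = (1 << width) - 1          # assumption: operands are width-bit unsigned
    return ((a & mask) >> d) == ((b & mask) >> d)


def similarity_level(a: int, b: int, width: int = 32) -> int:
    """Smallest d for which a and b are d-similar (0 means identical)."""
    mask = (1 << width) - 1
    return ((a ^ b) & mask).bit_length()


# Slide example: 113 = 1110001 and 127 = 1111111 in binary.
print(is_d_similar(113, 127, 4))
print(similarity_level(113, 127))
```

XOR exposes exactly the bit positions where the two values differ, so the position of the highest differing bit gives the smallest d.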
Intra-warp Operand Value Similarity
A warp consists of 32 threads
Each thread executes the same instruction on its own data
Intra-warp operand value similarity: input operands of all threads within a warp exhibit value similarity
[Figure: a warp executing ADD on per-thread operand pairs (8, 6), (10, 5), (9, 4). The first operands 8, 10, 9 are 2-similar; the second operands 6, 5, 4 are 2-similar.]
Slide4
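The intra-warp check from the ADD example can be sketched as follows: an operand vector is d-similar when all lanes share the same bits above the d least significant bits. The helper name `warp_is_d_similar` is hypothetical, and the sketch assumes non-negative integer operands.

```python
def warp_is_d_similar(operands, d):
    """True if every lane's operand shares the same bits above the d LSBs."""
    upper_bits = {value >> d for value in operands}
    return len(upper_bits) == 1


# Operand vectors from the slide's ADD example (only three lanes are shown there).
first_operands = [8, 10, 9]
second_operands = [6, 5, 4]

print(warp_is_d_similar(first_operands, 2))
print(warp_is_d_similar(second_operands, 2))
print(warp_is_d_similar(first_operands, 1))
```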
Significant Value Similarity Exists
Significant value similarity exists even with low d-similarity levels
Slide5
How can we leverage value similarity for approximate computing?
Slide6
Warp Approximation
[Figure: a warp executing ADD on operand pairs (8, 6), (10, 5), (9, 4), producing the precise results 14, 15, 13.]
Slide7
Warp Approximation
[Figure: the same warp ADD on operand pairs (8, 6), (10, 5), (9, 4), now with the single result 14.]
Slide8
Warp Approximation
Coarse-grain Annotation
Identify Operand Value Similarity (≈)
Representative Value Storage
Representative Thread Execution
Slide9
Coarse-grain Annotation
Identify regions of code where approximation can be tolerated, and opportunistically approximate when hardware detects value similarity
API / ISA extension:
Start / End of approximable region
Sets per warp ApproxBit in hardware
Level of approximation
Programmer / Compiler annotate coarse-grain approximable regions
Slide10
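The slides do not show the concrete API or ISA extension, so the following is only a software model of the idea behind coarse-grain annotation: a start marker sets a per-warp ApproxBit and a level of approximation, and an end marker clears them. `WarpState`, `approx_region`, and `d_level` are hypothetical names invented for this sketch.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class WarpState:
    """Hypothetical per-warp state; field names are illustrative only."""
    approx_bit: bool = False   # set while inside an approximable region
    d_level: int = 0           # requested level of approximation


@contextmanager
def approx_region(warp: WarpState, d_level: int):
    """Hypothetical start/end markers for an approximable region."""
    previous = (warp.approx_bit, warp.d_level)
    warp.approx_bit, warp.d_level = True, d_level   # region start
    try:
        yield warp
    finally:
        warp.approx_bit, warp.d_level = previous    # region end


warp = WarpState()
with approx_region(warp, d_level=2):
    inside = (warp.approx_bit, warp.d_level)
outside = (warp.approx_bit, warp.d_level)
print(inside, outside)
```

Outside the region the ApproxBit is clear, so in this model the hardware would never approximate code that was not annotated.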
Programmer Support
Slide11
Warp Approximation
Coarse-grain Annotation
Identify Operand Value Similarity (≈)
Representative Value Storage
Representative Thread Execution
Slide12
Detecting value similarity
Add value similarity detection stages before the Write-back stage to the Register File
Slide13
Detecting value similarity
Comparison selection stage selects a value from an active thread
Slide14
Detecting value similarity
Comparison stage checks for value similarity
SimilarBit is set if value similarity exists
Slide15
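The two detection stages can be modelled roughly as below. This is a sketch under stated assumptions, not the hardware design: it assumes the selected value comes from the first active lane and that every other active lane is compared against it on the bits above the d least significant bits. The function name `detect_similarity` is hypothetical.

```python
def detect_similarity(values, active_mask, d):
    """Return SimilarBit (0 or 1) for one warp-wide result vector."""
    active = [v for v, on in zip(values, active_mask) if on]
    if not active:
        return 0
    # Comparison selection stage: pick a value from an active thread.
    reference = active[0]
    # Comparison stage: all active lanes must match the reference above d LSBs.
    similar = all((v >> d) == (reference >> d) for v in active)
    return 1 if similar else 0


print(detect_similarity([8, 10, 9, 8], [1, 1, 1, 1], d=2))
print(detect_similarity([8, 10, 9, 16], [1, 1, 1, 1], d=2))
```

Inactive lanes are ignored, which is why the selection stage has to pick its reference from an active thread.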
Warp Approximation
Coarse-grain Annotation
Identify Operand Value Similarity (≈)
Representative Value Storage
Representative Thread Execution
Slide16
Representative Value Storage
Instead of storing the entire vector register entry, store a value-similar scalar, called the representative value
Reduces access energy for reading/writing the representative value to the register file
The representative value is picked from a Representative Thread
Slide17
Selecting Representative Thread
Pick lowest order lane
RT Selector logic is simply a priority encoder and decoder
Representative Thread Mask (RT Mask):
1 for representative thread
0 for all other threads
Slide18
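The RT Selector described above (priority encoder plus decoder) can be sketched in software. The sketch assumes that "lowest order lane" means the lowest-numbered active lane; the function name `rt_mask` is hypothetical.

```python
def rt_mask(active_mask):
    """One-hot mask marking the lowest-numbered active lane."""
    mask = [0] * len(active_mask)
    for lane, on in enumerate(active_mask):   # priority encoder: first set bit
        if on:
            mask[lane] = 1                    # decoder: lane index to one-hot
            break
    return mask


print(rt_mask([1, 0, 1, 1]))
print(rt_mask([0, 1, 1, 0]))
```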
Writing Representative Values
[Figure: the result vector 8, 10, …, 9 passes the similarity check (≈), so SimilarBit is 1; with RT Mask 1 0 … 0, the representative value 8 is written.]
Slide19
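Representative value storage can be modelled as a register entry that holds either one scalar or the full vector, tagged by the SimilarBit. This is a simplified sketch: `RegisterEntry`, `write_register`, and `read_register` are hypothetical names, all lanes are assumed active, and the lowest lane is used as the representative thread.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class RegisterEntry:
    similar_bit: int                 # 1: data holds one representative value
    data: Union[int, List[int]]      # scalar if similar_bit else full vector


def write_register(values, d):
    """Store a scalar when the vector is d-similar, else the full vector."""
    if len({v >> d for v in values}) == 1:
        return RegisterEntry(1, values[0])   # lowest lane as representative
    return RegisterEntry(0, list(values))


def read_register(entry, lanes):
    """Return one value per lane, expanding a representative value if needed."""
    if entry.similar_bit:
        return [entry.data] * lanes
    return list(entry.data)


entry = write_register([8, 10, 9, 8], d=2)
print(entry, read_register(entry, 4))
```

Reading the entry back yields 8 in every lane, which is where the approximation error comes from: lanes that originally held 10 or 9 now see 8.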
Handling Divergence
[Figure: basic blocks with active masks A/11…1, B/10…1, C/01…0. The register holds representative value 8 with SimilarBit 1 and RT Mask 1 0 … 0; the representative lane is active in B (✓) but not in C (✗).]
Need to copy representative value on divergence
Slide20
Handling Divergence
On divergence, copy representative value using dummy MOV instruction†
Supports only 1 level of divergence
On convergence or nested divergence, expand representative values
† Sangpil Lee, et al. Warped-compression: enabling power efficient GPUs through register compression. In ISCA'15
Slide21
Handling Divergence
[Figure: the same basic blocks A/11…1, B/10…1, C/01…0 after the copy. A representative value is available on both divergent paths (✓ ✓); the paths then hold the representative values 9 and 10.]
Slide22
Handling Divergence
[Figure: basic blocks with active masks A/11…1, B/10…1, C/01…0, D/11…1. At D, the representative values 9 and 10 are expanded into the vector 9, 10, …, 9.]
On convergence, expand each representative value w/ active mask
Slide23
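The copy-on-divergence and expand-on-convergence steps can be sketched as below. This is a software model under assumptions, not the dummy MOV mechanism itself: the 4-lane active masks for paths B and C are made up for illustration, and `copy_on_divergence` and `expand_on_convergence` are hypothetical names.

```python
def copy_on_divergence(rep_value, path_masks):
    """Give each divergent path its own copy of the representative value."""
    return {name: rep_value for name in path_masks}


def expand_on_convergence(path_values, path_masks, lanes):
    """Merge per-path representative values into a full per-lane vector."""
    vector = [None] * lanes
    for name, mask in path_masks.items():
        for lane, on in enumerate(mask):
            if on:
                vector[lane] = path_values[name]
    return vector


# Hypothetical 4-lane active masks for the two divergent paths.
masks = {"B": [1, 0, 0, 1], "C": [0, 1, 1, 0]}
copies = copy_on_divergence(8, masks)      # both paths start from the value 8
copies["B"], copies["C"] = 9, 10           # each path then writes its own value
print(expand_on_convergence(copies, masks, 4))
```

Each lane receives the representative value of the path it executed, which mirrors the figure's expansion of 9 and 10 into a per-lane vector.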
Warp Approximation
Coarse-grain Annotation
Identify Operand Value Similarity (≈)
Representative Value Storage
Representative Thread Execution
Slide24
Representative Thread Execution
Instead of executing all lanes in a warp precisely, we pick a single representative thread to execute
Executing a single lane reduces dynamic energy
Idle lanes can be power gated to reduce static energy†
† Mohammad Abdel-Majeed, et al. Warped gates: gating aware scheduling and power gating for GPGPUs. In MICRO'13
Qiumin Xu and Murali Annavaram. PATS: pattern aware scheduling and power gating for GPGPUs. In PACT'14
Slide25
Representative Thread Execution
Simply use RT Mask as Active Mask
If some input operands are not similar, and others are, then we need to expand the representative value read from the RF using broadcast logic
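The two cases described on this slide can be sketched for an ADD instruction. This is an illustrative model only: operands are assumed to arrive as `(similar_bit, data)` pairs, and the name `execute_add` and its return shape are invented for the sketch.

```python
def execute_add(operands, active_mask):
    """operands: list of (similar_bit, data) pairs; data is a scalar when
    similar_bit is 1, otherwise a per-lane list.
    Returns ((similar_bit, result), number_of_lanes_executed)."""
    lanes = len(active_mask)
    if all(bit for bit, _ in operands):
        # Every input is a representative value: only one lane executes.
        result = sum(data for _, data in operands)
        return (1, result), 1
    # Mixed case: broadcast representative values to all lanes first.
    expanded = [[data] * lanes if bit else list(data) for bit, data in operands]
    result = [sum(vals) if on else None
              for on, *vals in zip(active_mask, *expanded)]
    return (0, result), sum(active_mask)


print(execute_add([(1, 8), (1, 6)], [1, 1, 1, 1]))
print(execute_add([(1, 8), (0, [6, 5, 4, 7])], [1, 1, 1, 0]))
```

In the first call both operands are representative values, so a single lane produces 14 for the whole warp; in the second, the scalar 8 is broadcast and every active lane executes.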
Slide26
Warp Approximation
Coarse-grain Annotation
Identify Operand Value Similarity (≈)
Representative Value Storage
Representative Thread Execution
Slide27
EVALUATION
Slide28
Methodology
GPGPU-Sim v3.2.1
Nvidia GTX480-like architecture
GPUWattch + McPAT
Lane-level Power Gating†
Benchmarks: [table not preserved]
† Qiumin Xu and Murali Annavaram. PATS: pattern aware scheduling and power gating for GPGPUs. In PACT'14
Slide29
Application Error
Average error: 1.75%, Range: 0.006% - 5.5%
* Blackscholes d-similarity scaled by 2x
Slide30
Performance
Evaluated with various warp schedulers: Round Robin, Two-level, Greedy-then-oldest
Negligible performance impact, regardless of warp scheduler
Slide31
Execution Unit Energy
Compared with lane power gating† and scalar execution units‡
† Qiumin Xu and Murali Annavaram. PATS: pattern aware scheduling and power gating for GPGPUs. In PACT'14
‡ Syed Zohaib Gilani, Nam Sung Kim, and Michael J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In PACT'12
Slide32
Register File Energy
Compared with Clock Gating (Base) and Scalar register files†
Warp Approximation creates more opportunities to reduce read/write access energy
† Syed Zohaib Gilani, Nam Sung Kim, and Michael J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In PACT'12
Slide33
Energy Efficiency
Energy efficiency = IPC/W
Efficiency is related to the amount of approximation opportunities
Average 26% improvement in energy efficiency
Slide34
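The metric on the Energy Efficiency slide is IPC divided by power, and an improvement is the ratio of two such values minus one. The sketch below only illustrates the arithmetic; the IPC and wattage figures are made-up placeholders, not results from the evaluation, and the function names are hypothetical.

```python
def energy_efficiency(ipc, watts):
    """Energy efficiency as defined on the slide: IPC per watt."""
    return ipc / watts


def relative_improvement(baseline, improved):
    """Fractional gain of `improved` over `baseline`."""
    return improved / baseline - 1.0


# Placeholder numbers for illustration only (not measured data).
base = energy_efficiency(ipc=2.0, watts=4.0)
approx = energy_efficiency(ipc=2.0, watts=3.2)
print(f"{relative_improvement(base, approx):.1%}")
```

With IPC unchanged, which matches the negligible performance impact reported on the Performance slide, the whole gain in this model comes from the lower power.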
Overheads
Assuming 45nm technology
Comparison logic: 12.75 mW dynamic, 95.83 µW leakage, 0.017 mm²
Representative Thread selection and Broadcast: 19.76 mW dynamic, 127.80 µW leakage, 0.031 mm²
Total overheads: 1.7% dynamic energy, 0.0013% leakage energy, 0.098% area per SM
Slide35
More in paper
Optimizations
More evaluations
Discussion on error bounding
Slide36
Conclusion
Identify regions of code where approximation can be tolerated, and opportunistically approximate when hardware detects value similarity
Warp Approximation = Representative Value Storage + Representative Thread Execution
Improve energy efficiency by 26% with less than 5.5% error (on average 1.75%)
Slide37
Questions?
Thank you!