Approximating Warps with Intra-warp Operand - PowerPoint Presentation

Uploaded On 2018-01-15



Presentation Transcript

Slide1

Approximating Warps with Intra-warp Operand Value Similarity

Daniel Wong†, Nam Sung Kim‡, Murali Annavaram¥
†University of California, Riverside (dwong@ece.ucr.edu)
‡University of Illinois, Urbana-Champaign
¥University of Southern California

Slide2

Value Similarity

Values differ only in the least significant bits.
d-similar: values that differ only in the d least significant bits.
Example (d-similar values): 113 = 1110001₂ and 127 = 1111111₂.
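The d-similarity definition above can be checked mechanically. The following is a minimal Python sketch, not taken from the slides: the function names `min_similarity_level` and `is_d_similar` and the 32-bit default width are assumptions made for illustration.

```python
def min_similarity_level(values, width=32):
    """Smallest d such that all values agree on every bit above the d LSBs.

    Hypothetical helper; `width` is an assumed register width.
    """
    mask = (1 << width) - 1
    diff = 0
    for v in values:
        diff |= (v ^ values[0]) & mask  # collect every bit position that differs
    return diff.bit_length()            # 0 means all values are identical


def is_d_similar(values, d, width=32):
    """True if the values differ only within their d least significant bits."""
    return min_similarity_level(values, width) <= d


# 113 = 0b1110001 and 127 = 0b1111111 differ in bit positions 1 to 3.
print(min_similarity_level([113, 127]))  # prints 4
print(is_d_similar([8, 10, 9], 2))       # prints True
```

Under this reading of the definition, a smaller d means a stricter similarity requirement, and d = 0 requires identical values.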

Slide3

Intra-warp Operand Value Similarity

A warp consists of 32 threads.
Each thread executes the same instruction on its own data.
Intra-warp operand value similarity: the input operands of all threads within a warp exhibit value similarity.

[Figure: a warp executing ADD on operand values 8, 10, 9 and 6, 5, 4, each set labeled d-similar.]

Slide4

Significant Value Similarity Exists

Significant value similarity exists even at low d-similarity levels.

Slide5

How can we leverage value similarity for approximate computing?

Slide6

Warp Approximation

[Figure: a warp executing ADD; operand values shown are 14, 15, 13 and 4.]

Slide7

Warp Approximation

[Figure: a warp executing ADD; operand values shown are 14 and 4.]

Slide8

Warp Approximation

[Overview diagram with four components:]
Coarse-grain Annotation
Identify Operand Value Similarity (≈)
Representative Value Storage
Representative Thread Execution

Slide9

Coarse-grain Annotation

Identify regions of code where approximation can be tolerated, and opportunistically approximate when hardware detects value similarity.
API / ISA extension:
  Start / end of approximable region
  Sets a per-warp ApproxBit in hardware
  Level of approximation
The programmer or compiler annotates coarse-grain approximable regions.
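The annotation mechanism described above can be modeled in software. This is only a sketch under stated assumptions: `WarpState`, `approx_region`, and the field names are hypothetical and are not the API or ISA extension from the presentation. It shows region start and end markers toggling a per-warp ApproxBit and carrying a level of approximation.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class WarpState:
    """Hypothetical per-warp approximation state kept in hardware."""
    approx_bit: bool = False  # set while inside an approximable region
    d_level: int = 0          # requested level of approximation (d)


@contextmanager
def approx_region(warp, d_level):
    """Model the start/end markers of a coarse-grain approximable region."""
    saved = (warp.approx_bit, warp.d_level)
    warp.approx_bit, warp.d_level = True, d_level  # "start" marker
    try:
        yield warp
    finally:
        warp.approx_bit, warp.d_level = saved      # "end" marker


warp = WarpState()
with approx_region(warp, d_level=4):
    inside = (warp.approx_bit, warp.d_level)  # hardware may approximate here
outside = (warp.approx_bit, warp.d_level)     # precise execution resumes
```

The region only grants permission to approximate; whether a given instruction is actually approximated still depends on the hardware detecting value similarity.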

Slide10

Programmer Support

Slide11

Warp Approximation

[Overview diagram repeated from Slide8.]

Slide12

Detecting value similarity

Add value similarity detection stages before the write-back stage to the register file.

Slide13

Detecting value similarity

A comparison selection stage selects a value from an active thread.

Slide14

Detecting value similarity

The comparison stage checks for value similarity.
The SimilarBit is set if value similarity exists.
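The two detection stages described above (comparison selection, then comparison) can be sketched in Python. The sketch rests on assumptions not stated in the slides: the reference value is taken from the lowest-numbered active lane, similarity is tested by comparing all bits above the d least significant ones, and the name `detect_similarity` is hypothetical.

```python
WARP_SIZE = 32


def detect_similarity(lane_values, active_mask, d, width=32):
    """Return (similar_bit, reference_lane) for one warp-wide result."""
    active = [i for i in range(WARP_SIZE) if (active_mask >> i) & 1]
    if not active:
        return False, None
    # Comparison selection stage: pick a value from an active thread
    # (assumed here to be the lowest-numbered active lane).
    ref_lane = active[0]
    # Mask that keeps the upper bits and ignores the d least significant bits.
    upper = ((1 << width) - 1) & ~((1 << d) - 1)
    ref = lane_values[ref_lane] & upper
    # Comparison stage: every active lane must match the reference upper bits.
    similar = all((lane_values[i] & upper) == ref for i in active)
    return similar, ref_lane


values = [8, 10, 9] + [8] * 29
print(detect_similarity(values, 0xFFFFFFFF, d=2))  # prints (True, 0)
```

Inactive lanes are skipped, so stale values in masked-off threads do not affect the SimilarBit.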

Slide15

Warp Approximation

[Overview diagram repeated from Slide8.]

Slide16

Representative Value Storage

Instead of storing the entire vector register entry, store a value-similar scalar, called the representative value.
This reduces access energy for reading and writing the representative value to the register file.
The representative value is picked from a representative thread.

Slide17

Selecting Representative Thread

Pick the lowest-order lane.
The RT selector logic is simply a priority encoder and decoder.
Representative Thread Mask (RT Mask):
  1 for the representative thread
  0 for all other threads

Slide18

Writing Representative Values

[Figure: register file write path showing the SimilarBit and the RT Mask.]
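Representative thread selection and the representative value write can be sketched together in Python. This is an illustrative model, not the hardware design from the slides: `select_rt_mask` and `write_register` are hypothetical names, and the dictionary stands in for a register file entry.

```python
def select_rt_mask(active_mask):
    """One-hot mask of the lowest-order active lane (a priority encoder)."""
    return active_mask & -active_mask


def write_register(lane_values, active_mask, similar_bit):
    """Return what a register-file write would store for this register."""
    rt_mask = select_rt_mask(active_mask)
    if similar_bit and rt_mask:
        rt_lane = rt_mask.bit_length() - 1  # decode the one-hot mask to a lane index
        # Store a single scalar instead of the whole vector entry.
        return {"similar_bit": 1, "value": lane_values[rt_lane]}
    # No similarity detected: store the full per-lane vector.
    return {"similar_bit": 0, "value": list(lane_values)}


print(select_rt_mask(0b10110))                  # prints 2
print(write_register([14, 15, 13], 0b111, True))
```

With lane values 14, 15, 13 and all lanes active, the stored representative value is 14, the value of the lowest-order lane.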

Slide19

Handling Divergence

[Figure: control-flow example with basic blocks and active masks A/11…1, B/10…1, C/01…0, alongside the SimilarBit and RT Mask.]
Need to copy the representative value on divergence.

Slide20

Handling Divergence

On divergence, copy the representative value using a dummy MOV instruction†.
Supports only 1 level of divergence.
On convergence or nested divergence, expand representative values.

†Sangpil Lee et al. Warped-compression: enabling power efficient GPUs through register compression. In ISCA’15.

Slide21

Handling Divergence

[Figure: the control-flow example with basic blocks and active masks A/11…1, B/10…1, C/01…0, alongside the SimilarBit and RT Mask.]

Slide22

Handling Divergence

[Figure: control-flow example with basic blocks and active masks A/11…1, B/10…1, C/01…0, D/11…1, alongside the SimilarBit.]
On convergence, expand each representative value with the active mask.
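Expanding a representative value with an active mask can be sketched as follows. This Python model is an assumption-laden illustration: `expand_representative` is a hypothetical name, a 4-lane warp is used for brevity, and the masks and values in the example are made up rather than taken from the figure.

```python
def expand_representative(rep_value, active_mask, old_lanes):
    """Copy a representative scalar into every lane enabled in active_mask."""
    return [rep_value if (active_mask >> i) & 1 else old
            for i, old in enumerate(old_lanes)]


# Two divergent paths each produced one representative value.
lanes = [0, 0, 0, 0]
lanes = expand_representative(20, 0b1001, lanes)  # path taken by lanes 0 and 3
lanes = expand_representative(7, 0b0110, lanes)   # path taken by lanes 1 and 2
print(lanes)  # prints [20, 7, 7, 20]
```

After expansion the register holds an ordinary per-lane vector, so the reconverged warp can read it without knowing which path produced each lane's value.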

Slide23

Warp Approximation

[Overview diagram repeated from Slide8.]

Slide24

Representative Thread Execution

Instead of executing all lanes in a warp precisely, we pick a single representative thread to execute.
Executing a single lane reduces dynamic energy.
Idle lanes can be power gated to reduce static energy†.

†Mohammad Abdel-Majeed et al. Warped gates: gating aware scheduling and power gating for GPGPUs. In MICRO’13.
Qiumin Xu and Murali Annavaram. PATS: pattern aware scheduling and power gating for GPGPUs. In PACT’14.

Slide25

Representative Thread Execution

Simply use the RT Mask as the active mask.
If some input operands are similar and others are not, the representative value read from the register file must be expanded using broadcast logic.
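The two execution cases above can be sketched in Python. This is a simplified model under assumptions of my own: each source operand is represented as a `(similar_bit, payload)` pair, `warp_add` is a hypothetical name, and inactive lanes are marked with `None`.

```python
def warp_add(src_a, src_b, active_mask, num_lanes=32):
    """Execute ADD for one warp; each source is (similar_bit, payload).

    Returns (mask_used_for_execution, result).
    """
    rt_mask = active_mask & -active_mask  # lowest-order active lane, one-hot
    a_sim, a_val = src_a
    b_sim, b_val = src_b
    if a_sim and b_sim:
        # All inputs are representative scalars: run only the representative lane.
        return rt_mask, a_val + b_val
    # Mixed case: broadcast any scalar operand to all lanes, then run active lanes.
    a_vec = [a_val] * num_lanes if a_sim else a_val
    b_vec = [b_val] * num_lanes if b_sim else b_val
    result = [a_vec[i] + b_vec[i] if (active_mask >> i) & 1 else None
              for i in range(num_lanes)]
    return active_mask, result


print(warp_add((True, 14), (True, 4), 0b1111, num_lanes=4))  # prints (1, 18)
print(warp_add((True, 14), (False, [1, 2, 3, 4]), 0b1011, num_lanes=4))
```

In the first call only one lane executes, which is where the dynamic energy saving comes from; the second call falls back to full-width execution after broadcasting the scalar operand.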

Slide26

Warp Approximation

[Overview diagram repeated from Slide8.]

Slide27

EVALUATION

Slide28

Methodology

GPGPU-Sim v3.2.1
Nvidia GTX480-like architecture
GPUWattch + McPAT
Lane-level power gating†
Benchmarks: [Table: benchmarks]

†Qiumin Xu and Murali Annavaram. PATS: pattern aware scheduling and power gating for GPGPUs. In PACT’14.

Slide29

Application Error

Average error: 1.75%, range: 0.006% to 5.5%

* Blackscholes d-similarity scaled by 2x

Slide30

Performance

Evaluated with various warp schedulers: round robin, two-level, greedy-then-oldest.
Negligible performance impact, regardless of warp scheduler.

Slide31

Execution Unit Energy

Compared with lane power gating† and scalar execution units‡.

†Qiumin Xu and Murali Annavaram. PATS: pattern aware scheduling and power gating for GPGPUs. In PACT’14.
‡Syed Zohaib Gilani, Nam Sung Kim, and Michael J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In PACT '12.

Slide32

Warp Approximation creates more opportunities to reduce read/write access energy.

†Syed Zohaib Gilani, Nam Sung Kim, and Michael J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In PACT '12.

Slide33

Energy Efficiency

Energy efficiency = IPC/W
Efficiency is related to the amount of approximation opportunities.
Average 26% improvement in energy efficiency.

Slide34

Overheads

Assuming 45 nm technology:
Comparison logic: 12.75 mW dynamic, 95.83 µW leakage, 0.017 mm²
Representative thread selection and broadcast: 19.76 mW dynamic, 127.80 µW leakage, 0.031 mm²
Total overheads:
  1.7% dynamic energy
  0.0013% leakage energy
  0.098% area per SM

Slide35

More in paper

Optimizations
More evaluations
Discussion on error bounding

Slide36

Conclusion

Identify regions of code where approximation can be tolerated, and opportunistically approximate when hardware detects value similarity.
Warp Approximation = Representative Value Storage + Representative Thread Execution
Improve energy efficiency by 26% with less than 5.5% error (on average 1.75%).

Slide37

Questions?

Thank you!