
Center for Efficient, Scalable, and Reliable Computing - PowerPoint Presentation

Uploaded On 2016-03-16


Department of Electrical and Computer Engineering, North Carolina State University. Rami Sheikh, James Tuck, and Eric Rotenberg. Control-Flow Decoupling. Rami Sheikh © 2012 MICRO-45. ID: 257854



Presentation Transcript

Slide1

Center for Efficient, Scalable, and Reliable Computing

Department of Electrical and Computer Engineering

North Carolina State University

Rami Sheikh, James Tuck, and Eric Rotenberg

Control-Flow Decoupling

Rami Sheikh © 2012 MICRO-45

Slide2

Single-thread Performance is Important

[Figure labels: 96, 128, 168, 192; 15-cycle, 17-cycle, 14-17 cycles, 14-17 cycles (?)]

OoO scheduling window doubled

Pipeline depth remains high

*Source: Intel IDF presentations

14% - 33% generation-to-generation gains

*Source: AnandTech website

Slide3

Energy is Important

Slide4

ASTAR (Rivers)

Memory Latency Tolerance

Baseline uses ISL-TAGE predictor

[Chart data labels: 63%, 65%, 67%, 68%, 69%, 67%, 65%]

Energy Reduction

Slide5

Better Branch Handling is Important

Improves performance

Reduces energy consumption

Wrong path

Preparing for recovery

Recovery

Necessary catalyst for memory latency tolerance

Slide6

Interesting Observation

[Diagram labels: branch-slice, branch, control-dependent region]

Slide7

Control-Flow Decoupling

[Diagram labels, shown twice: branch-slice, branch, control-dependent region]

Slide8

Control-Flow Decoupling

Original Loop: branch-slice, branch, control-dependent region

CFD Loops:

First loop: branch-slice, Push_BQ (generate a vector of predicates)

Second loop: Branch_on_BQ, control-dependent region

BQ: 1 0 1 1 0 0 0 1 1 0 1 0

BQ drives fetch

Slide9

Control-Flow Decoupling

Original Loop

[Pipeline diagram: IF … EX, repeated six times]

Problem #1: No fetch separation: need branch prediction

Problem #2: No mechanism to communicate predicates to the Fetch Unit

Slide10

Control-Flow Decoupling

CFD Loops

[Pipeline diagram: IF … EX, repeated six times]

CFD provides:

Fetch Separation

Mechanism to communicate predicates to the Fetch Unit

BQ: 1 0 1

Slide11

Control-Flow Decoupling: Example

SOPLEX

31% contrib.

1 0 1 1 0 0 0 1 1 0 1 0
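
The source code shown on this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of the split between the original loop and the CFD loops. It is not the SOPLEX code. The loop body, the `threshold` parameter, and the function names are hypothetical, and a software queue plays the role of the BQ. The input data is chosen so that the pushed predicates match the bit vector above.

```python
from collections import deque

def original_loop(data, threshold):
    """Original loop: branch slice, branch, and control-dependent region together."""
    total = 0
    for x in data:
        pred = x > threshold      # branch slice: computes the predicate
        if pred:                  # the branch
            total += x            # control-dependent region (hypothetical work)
    return total

def cfd_loops(data, threshold):
    """CFD loops: predicates are produced first, then consumed in the same order."""
    bq = deque()                  # software stand-in for the BQ (assumption)
    for x in data:                # first loop: branch slice + push (Push_BQ)
        bq.append(1 if x > threshold else 0)
    predicates = list(bq)         # snapshot of the predicate vector
    total = 0
    for x in data:                # second loop: pop (Branch_on_BQ) + region
        if bq.popleft():
            total += x
    return total, predicates

# Hypothetical input whose predicates are 1 0 1 1 0 0 0 1 1 0 1 0
data = [3, -1, 4, 1, -5, -9, -2, 6, 5, -3, 5, -8]
total, predicates = cfd_loops(data, threshold=0)
```

Both versions compute the same result. The queue here is unbounded, so the sketch ignores the finite BQ size that the Software Side slide addresses.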

Slide12

Agenda

Methodology

Control-flow classification

Control-flow decoupling (CFD)

Evaluation

Conclusion

Slide13

Methodology

Where do state-of-the-art branch predictors fall short?

4 benchmark suites (80 apps)

PIN with x86 binaries

27 out of 80 have misprediction rate >= 2%

6 of which had problems with cross-compiler

Remaining 21 apps contribute 78% of total MPKI (in the 80 apps)

Slide14

Control-Flow Classification

Classify targeted mispredictions into four classes

Hammock: If-conversion

Separable branches: Control-Flow Decoupling (CFD)

Inseparable branches (very serial): Other solutions required

Not analyzed

Slide15

Control-Flow Classification

Classify targeted mispredictions into four classes

[Chart legend: Hammock, Separable, Inseparable, Not analyzed]

Slide16

Agenda

Methodology

Control-flow classification

Control-flow decoupling (CFD)

Evaluation

Conclusion

Slide17

Control-Flow Decoupling

Targets separable branches with large, complex CD regions

ISA support

Software side

Hardware side

Slide18

ISA Support

BQ specification:

Size (N): BQ holds N elements

Content: <predicate> (1-bit flag)

Length: Length register

Two purposes:

Needed to save/restore BQ state

Flexible implementation (e.g., circular vs. shifting buffer)

Slide19

ISA Support

New instructions:

Push_BQ (push)

Branch_on_BQ (pop)

Push-Pop Ordering Rules

Timeline: push-1, push-2, …, push-N, pop-1, pop-2, …, pop-N, push-1, …

Rule #1: a push must precede its corresponding pop

Rule #2: N consecutive pushes must be followed by exactly N consecutive pops (same order)

Rule #3: N cannot exceed the BQ size
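
The three ordering rules can be checked mechanically on a trace of pushes and pops. The sketch below is one way to do that in Python, under stated assumptions: the function name `check_push_pop_order` and the trace encoding (a list of the strings `"push"` and `"pop"` in program order) are hypothetical, and the trace is assumed to be complete, so every push must eventually be popped. Because the events are anonymous, the "same order" part of Rule #2 is implied by FIFO use of the queue and is not checked separately.

```python
def check_push_pop_order(events, bq_size):
    """Return True if a complete trace obeys Rules #1 to #3."""
    outstanding = 0      # pushes of the current burst not yet popped
    popping = False      # True once the current burst has started popping
    for ev in events:
        if ev == "push":
            if popping and outstanding > 0:
                return False          # Rule #2: pushes resumed before all N pops
            popping = False
            outstanding += 1
            if outstanding > bq_size:
                return False          # Rule #3: burst longer than the BQ size
        elif ev == "pop":
            if outstanding == 0:
                return False          # Rule #1: pop without a preceding push
            popping = True
            outstanding -= 1
        else:
            raise ValueError(f"unknown event: {ev!r}")
    return outstanding == 0           # complete trace: nothing left in the queue

ok_trace = ["push"] * 3 + ["pop"] * 3 + ["push", "pop"]
interleaved = ["push", "push", "pop", "push", "pop", "pop"]
```

With a BQ size of 4, `ok_trace` is accepted and `interleaved` is rejected because a push appears before the earlier burst has been fully popped.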

The ordering rules align predicates with their corresponding branches and prevent deadlock.

Slide20

Working with finite-size BQ

CFD Loops:

for (large_trip_count) {body of first loop}

for (large_trip_count) {body of second loop}

Strip-mined CFD Loops:

for (large_trip_count / N) {
    for (1 ... N) {body of first loop}
    for (1 ... N) {body of second loop}
}
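
The strip-mined form can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the loop bodies, the `threshold` parameter, and the function name `strip_mined_cfd` are hypothetical, and a software queue stands in for the BQ. Unlike the pseudocode above, which divides the trip count by N, the sketch also handles a final chunk shorter than N.

```python
from collections import deque

def strip_mined_cfd(data, threshold, bq_size):
    """Run the two CFD loops in chunks of at most bq_size iterations."""
    bq = deque()          # software stand-in for the BQ (assumption)
    total = 0
    max_len = 0           # largest queue occupancy observed
    for start in range(0, len(data), bq_size):
        chunk = data[start:start + bq_size]
        for x in chunk:                   # first loop: at most N pushes
            bq.append(x > threshold)
        max_len = max(max_len, len(bq))
        for x in chunk:                   # second loop: the same number of pops
            if bq.popleft():
                total += x                # control-dependent region (hypothetical)
    return total, max_len

# Hypothetical input: 11 iterations with a BQ of 4 entries gives chunks of 4, 4, 3
total, max_len = strip_mined_cfd(list(range(-5, 6)), threshold=0, bq_size=4)
```

The queue occupancy never exceeds `bq_size`, which is what Rule #3 requires of a finite BQ.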

Software Side

Slide21

Hardware Side

BQ implementation

Slide22

Hardware Side

Execution scenarios

BQ hit (Common Case)

BQ miss (Uncommon Case): Speculate or Stall

[Pipeline diagram for each scenario: slice between IF and EX, branch at IF]

Slide23

Hardware Side

BQ length

BQ size is N

Instruction Window (in time order): push-1, push-2, …, push-N, pop-1, pop-2, …, pop-N, push-1, …

BQ Length gauge: 0 to N

Slide24

Hardware Side

BQ length

BQ size is N

Instruction Window (in time order): push-1, push-2, …, push-N, pop-1, pop-2, …, pop-N, push-1, …

BQ Length: N

Stall push-1: BQ is full

Slide25

Hardware Side

BQ length

BQ size is N

Instruction Window (in time order): push-1, push-2, …, push-N, pop-1, pop-2, …, pop-N, push-1, …

BQ Length: N - 1

Unstall push-1

Slide26

Hardware Side

BQ length

BQ size is N

Instruction Window (in time order): push-1, push-2, …, push-N, pop-1, pop-2, …, pop-N, push-1, …

BQ Length: N
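
The BQ length sequence on these slides (N, stall, N - 1 and unstall, N) can be replayed with a small counter model. This is a sketch under assumptions: the class `BQLengthCounter` and the function `replay_animation` are hypothetical names, and the model does not say at which pipeline stage the length changes, only that a push is held back while the length equals the BQ size.

```python
class BQLengthCounter:
    """Tracks only the BQ length, not the predicates themselves."""

    def __init__(self, size):
        self.size = size
        self.length = 0

    def try_push(self):
        """Return False (stall) when the BQ is full, else count the push."""
        if self.length == self.size:
            return False
        self.length += 1
        return True

    def pop(self):
        if self.length == 0:
            raise RuntimeError("pop with no outstanding push")
        self.length -= 1

def replay_animation(size):
    """Replay: N pushes, a stalled push, one pop, then the push succeeds."""
    bq = BQLengthCounter(size)
    lengths = []
    for _ in range(size):
        bq.try_push()
    lengths.append(bq.length)        # N after push-1 .. push-N
    stalled = not bq.try_push()      # next push-1 finds the BQ full
    lengths.append(bq.length)        # still N while stalled
    bq.pop()                         # pop-1 frees one entry
    lengths.append(bq.length)        # N - 1
    unstalled = bq.try_push()        # push-1 can now proceed
    lengths.append(bq.length)        # back to N
    return lengths, stalled, unstalled
```

Calling `replay_animation(4)` walks through the same four states as the slides for N = 4.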

Slide27

Hardware Side

BQ recovery

Checkpoint: RMT, … etc

Committed State: AMT, … etc

Slide28

BQ recovery

Checkpoint: RMT, … etc, BQ head ptr, BQ tail ptr

Committed State: AMT, … etc, Arch. BQ head ptr, Arch. BQ tail ptr
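
One way to picture pointer-based recovery is the Python sketch below. It assumes a circular-buffer BQ and is not the authors' implementation: the class `BranchQueue` and its method names are hypothetical, the pointers are kept as ever-increasing counters that are reduced modulo the size when indexing, and entries are assumed to be freed only at commit so that a rolled-back pop can read its predicate again.

```python
class BranchQueue:
    def __init__(self, size):
        self.size = size
        self.bits = [0] * size
        self.head = 0         # speculative head ptr: next entry to pop
        self.tail = 0         # speculative tail ptr: next entry to push
        self.arch_head = 0    # committed (architectural) head ptr
        self.arch_tail = 0    # committed (architectural) tail ptr

    def push(self, pred):
        if self.tail - self.arch_head >= self.size:
            raise RuntimeError("BQ full: push must stall")
        self.bits[self.tail % self.size] = pred
        self.tail += 1

    def pop(self):
        if self.head == self.tail:
            raise RuntimeError("BQ miss: predicate not pushed yet")
        pred = self.bits[self.head % self.size]
        self.head += 1
        return pred

    def checkpoint(self):
        """Save the speculative pointers alongside other checkpointed state."""
        return (self.head, self.tail)

    def restore(self, saved):
        """Roll both speculative pointers back to a checkpoint."""
        self.head, self.tail = saved

    def commit(self):
        """Make the current pointers architectural (simplified: all at once)."""
        self.arch_head, self.arch_tail = self.head, self.tail

bq = BranchQueue(4)
for p in (1, 0, 1):
    bq.push(p)
saved = bq.checkpoint()
first = bq.pop()          # speculative pop
bq.push(0)                # wrong-path push
bq.restore(saved)         # recovery discards both speculative updates
replayed = [bq.pop() for _ in range(3)]
```

After the restore, the three original predicates are popped again in order and the wrong-path push is no longer visible.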

Hardware Side

Slide29

Other Interesting Aspects of CFD

Supports partially separable branches

[Diagram labels: branch-slice, branch, control-dependent region (repeated)]

Slide30

Other Interesting Aspects of CFD

Supports partially separable branches

[Diagram labels: branch-slice, branch, control-dependent region (repeated); branch-slice, Push_BQ, if-converted hammock, control-dependent region, Branch_on_BQ]

Slide31

Other Interesting Aspects of CFD

Works with nested branches:

Combine predicates (if safe)

Multi-level decoupling

CFD overheads can be reduced through value communication (see CFD+ in the paper)

Slide32

Agenda

Methodology

Control-flow classification

Control-flow decoupling (CFD)

Evaluation

Conclusion

Slide33

Evaluation Environment

Simulator

In-house detailed execution-driven, execute-at-execute, cycle-level Alpha simulator

CFD microarchitecture is faithfully modeled

McPAT and CACTI are used to measure energy consumption

Benchmarks

Compiled with gcc and -O3 level optimization

Modified benchmarks are validated by compiling and running to completion on x86 host (emulate BQ with software queue)

When simulating modified binaries, we simulate as many retired instructions as needed in order to perform the same amount of work as the unmodified binaries.

Slide34

Evaluation Environment

Baseline

Branch Prediction

BP: 64KB ISL-TAGE predictor

- 16 tables: 1 bimodal, 15 partially-tagged. In addition: IUM, SC, LP.

- History lengths: {0, 3, 8, 12, 17, 33, 35, 67, 97, 138, 195, 330, 517, 1193, 1741, 1930}

BTB: 4K entries, 4-way set-associative

RAS: 64 entries

Memory Hierarchy

Block size: 64B

Victim caches: each cache has a 16-entry FA victim cache

L1: split, 64KB each, 4-way set-associative, 1-cycle access latency

L2: unified, private for each core, 512KB, 8-way set-associative, 20-cycle access latency

- L2 stream prefetcher: 4 streams, each of depth 16

L3: unified, shared among cores, 8MB, 16-way set-associative, 40-cycle access latency

Memory: 200-cycle access latency

Fetch/Issue/Retire Width: 4 instr./cycle

ROB/IQ/LDQ/STQ: 168/54/64/36 (modeled after Sandy Bridge)

Fetch-to-Execute Latency: 10-cycle

Physical RF: 236

Checkpoints: 8, OoO reclamation, confidence estimator (8K entries, 4-bit resetting counter, gshare index)

CFD

BQ: 96B (128 6-bit entries)

VQ renamer: 128B (128 8-bit entries)
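
The two CFD storage figures follow directly from entries times bits per entry, divided by 8 bits per byte. A small Python check of that arithmetic is given below; the helper name `storage_bytes` is hypothetical.

```python
def storage_bytes(entries, bits_per_entry):
    """Storage of a structure in bytes, assuming a whole number of bytes."""
    total_bits = entries * bits_per_entry
    if total_bits % 8 != 0:
        raise ValueError("size is not a whole number of bytes")
    return total_bits // 8

bq_bytes = storage_bytes(128, 6)   # BQ: 128 entries of 6 bits
vq_bytes = storage_bytes(128, 8)   # VQ renamer: 128 entries of 8 bits
```

The results agree with the 96B and 128B listed for the BQ and the VQ renamer.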

Slide35

Results

Slide36

Results

Slide37

Results – Sensitivity Study

Fetch-to-execute depth

Bobcat/Power6: GeoMean = 1.16

Cortex A15: GeoMean = 1.18

Pentium 4: GeoMean = 1.22

Slide38

Results – Manual vs. Automated

Slide39

Conclusion

State-of-the-art branch predictors have limitations

A third of mispredictions come from separable branches

CFD is a software/hardware collaboration for exploiting separability with low complexity and high efficacy

CFD is comparable to if-conversion in terms of number of static branches and MPKI contribution

Slide40

Thanks!

Questions?