Presentation Transcript

Slide1

DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Byn Choi, Nima Honarmand, Rakesh Komuravelli, Robert Smolinski, Hyojin Sung, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou

University of Illinois, Urbana-Champaign, and Intel

Slide2

Motivation

Goal: power-, complexity-, performance-scalable hardware

Today: shared memory

Directory-based coherence: complex and unscalable

Difficult programming model: data races, non-determinism, no safety/composability/modularity

Mismatched interface between HW and SW, a.k.a. the memory model

Can't specify "what value can a read return"

Data races defy acceptable semantics

Fundamentally broken for hardware and software

Slide3

Solution? Banish shared memory?

"As you scale the number of cores on a cache coherent system (CC), 'cost' in 'time and memory' grows to a point beyond which the additional cores are not useful in a single parallel program. This is the coherency wall."

— A case for message-passing for many-core computing, Timothy Mattson et al.

Slide4

Solution

The problems are not inherent to the shared-memory paradigm.

[Figure: "Shared Memory" = global address space + implicit, unstructured communication and synchronization]

Slide5

Solution

Banish wild shared memory!

[Figure: "Wild Shared Memory" = global address space + implicit, unstructured communication and synchronization]

Slide6

Solution

Build disciplined shared memory!

[Figure: "Disciplined Shared Memory" = global address space + explicit, structured side-effects, replacing implicit, unstructured communication and synchronization]

Slide7

Disciplined Shared Memory

No data races, determinism-by-default, safe non-determinism

Simple semantics, safety, and composability

Explicit effects + structured parallel control

Simple coherence and consistency

Software-aware address/communication/coherence granularity

Power-, complexity-, performance-scalable HW

Slide8

Research Strategy: Software Scope

Many systems provide disciplined shared-memory features

Current driver is DPJ (Deterministic Parallel Java): determinism-by-default, safe non-determinism

End goal is a language-oblivious interface

Current focus on deterministic codes: the common and best case

Extend later to safe non-determinism and legacy codes

Slide9

Research Strategy: Hardware Scope

Today's focus: cache coherence

Limitations of current directory-based protocols

Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations

Storage overhead: directory overhead for sharer lists

Performance and power inefficiencies: invalidation and ack messages, false sharing, indirection through the directory, suboptimal communication granularity of a cache line, ...

Ongoing: rethink the communication architecture and data layout

Slide10

Contributions

Simplicity: compared protocol complexity with MESI; 25x fewer reachable states in model checking

Extensibility: direct cache-to-cache transfer; flexible communication granularity

Storage overhead: no storage overhead for directory information; storage overheads beat MESI beyond tens of cores and keep scaling

Performance/power: up to 73% reduction in memory stall time; up to 70% reduction in network traffic


Slide15

Outline

Motivation

Background: DPJ

DeNovo Protocol

DeNovo Extensions

Evaluation

Protocol Verification

Performance

Conclusion and Future Work

Slide16

Background: DPJ

Extension for modern OO languages

Structured parallel control: nested fork-join style (foreach, cobegin)

Type and effect system ensures non-interference (sketched below)

Region names partition the heap

Effects for methods declare which regions are read/written in the method

Type checking ensures race-freedom for parallel tasks → deterministic execution

Recent support for safe non-determinism
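As a rough illustration of these annotations (a sketch in DPJ-style syntax following the DPJ papers; the Body class, its regions, and the effect summaries are hypothetical examples, not from the talk):

    class Body {
        region Pos, Force;              // named partitions of the heap
        double pos in Pos;              // field placed in region Pos
        double force in Force;          // field placed in region Force

        double getPos() reads Pos { return pos; }
        void addForce(double f) writes Force { force += f; }
    }

    // foreach/cobegin tasks must have non-interfering effects; full DPJ
    // uses index-parameterized regions so each iteration of a foreach
    // writes a provably distinct region.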

Slide17

Memory Consistency Model

Guaranteed determinism: a read returns the value of the last write in sequential order

Either from the same task in this parallel phase

Or from before this parallel phase

[Figure: a load LD 0xa inside a parallel phase observes a prior ST 0xa from the same task, or the last ST 0xa before the phase; the coherence mechanism supplies the pre-phase value.]


Slide21

Cache Coherence

Coherence enforcement: invalidate stale copies in caches; track the up-to-date copy

Explicit effects: the compiler knows all regions written in this parallel phase, so each cache can self-invalidate before the next parallel phase, invalidating data in writeable regions that it did not access itself (see the sketch below)

Registration: the directory keeps track of the one up-to-date copy; the writer registers it before the next parallel phase
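A minimal sketch of that phase-boundary self-invalidation in plain Java (the CacheWord/L1Cache types are hypothetical stand-ins; real hardware operates on cache arrays, not objects, and the written-region set comes from the compiler):

    import java.util.Set;

    final class L1Cache {
        enum WordState { INVALID, VALID, REGISTERED }

        static final class CacheWord {
            WordState state = WordState.INVALID;
            int regionId;       // software-supplied region of this word
            boolean touched;    // set if this core read the word this phase
        }

        final CacheWord[] words;

        L1Cache(int n) {
            words = new CacheWord[n];
            for (int i = 0; i < n; i++) words[i] = new CacheWord();
        }

        // Phase boundary: drop Valid words in regions written this phase
        // that this core did not read; Registered words are this core's
        // own writes and stay.
        void selfInvalidate(Set<Integer> writtenRegions) {
            for (CacheWord w : words) {
                if (w.state == WordState.VALID
                        && writtenRegions.contains(w.regionId)
                        && !w.touched) {
                    w.state = WordState.INVALID;
                }
                w.touched = false;  // reset for the next phase
            }
        }
    }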

Slide22

Basic DeNovo Coherence

Assume (for now): private L1, shared L2; single-word line; data-race freedom at word granularity

L2 data arrays double as the directory/registry: each entry keeps either valid data or the registered core id, so there is no space overhead

L1/L2 states: Invalid, Valid, Registered; a read takes Invalid to Valid, and a write takes any state to Registered (sketched below)

"Touched" bit: set only if the word is read in the phase
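The per-word transition function implied by the state diagram, as a compact sketch (hypothetical helper; message handling and the registry update on a write are omitted):

    final class DeNovoWordState {
        enum S { INVALID, VALID, REGISTERED }

        // A write moves any state to Registered (the L2 registry is
        // updated to point at this core); a read miss on Invalid fetches
        // the word and makes it Valid; other reads leave the state alone.
        static S onAccess(S s, boolean isWrite) {
            if (isWrite) return S.REGISTERED;
            return (s == S.INVALID) ? S.VALID : s;
        }
    }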

Slide23

Example Run

    class S_type {
        X in DeNovo-region ;
        Y in DeNovo-region ;
    }
    S_type S[size];
    ...
    Phase1 writes  {   // DeNovo effect
        foreach i in 0, size {
            S[i].X = ...;
        }
        self_invalidate( );
    }

[Figure: per-word states (R = Registered, V = Valid, I = Invalid) in L1 of Core 1, L1 of Core 2, and the shared L2 for words X0..X5 and Y0..Y5 as Phase1 runs. Core 1 writes and registers X0-X2, Core 2 writes and registers X3-X5; all Y words stay Valid. Registration messages update the shared L2, whose data array records the registered core id (C1 or C2) in place of the X data, and acks return to the writers. After self_invalidate, each core's L1 holds its own X words Registered, the other core's X words Invalid, and the Y words Valid.]

Slide24

Addressing Limitations

Limitations of current directory-based protocols

Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations

Storage overhead: directory overhead for sharer lists

Performance and power overhead: invalidation and ack messages, false sharing, indirection through the directory

Slide25

Practical DeNovo Coherence

The basic protocol is impractical: high tag storage overhead (a tag per word)

Address/transfer granularity > coherence granularity

DeNovo line-based protocol: traditional software-oblivious spatial locality, but coherence granularity still at word level, so no word-level false sharing

"Line merging" (see the sketch below): [Figure: a cached line may mix per-word states, e.g. V V R; an incoming line response is merged word by word with the local copy.]
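A sketch of that merge rule (hypothetical arrays standing in for the data and per-word state of one line):

    final class LineMerge {
        enum S { INVALID, VALID, REGISTERED }

        // Merge an incoming line response into the locally cached line,
        // word by word: only Invalid words take the response data, so a
        // remote write to one word never clobbers its neighbors
        // (no word-level false sharing).
        static void merge(int[] localData, S[] localState, int[] respData) {
            for (int i = 0; i < localData.length; i++) {
                if (localState[i] == S.INVALID) {
                    localData[i] = respData[i];
                    localState[i] = S.VALID;
                }
            }
        }
    }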

Slide26

Storage Overhead

DeNovo line (64 bytes/line)

L1: 2 state bits, 1 touched bit, 5 region bits per word (128 bits/line)

L2: 1 valid bit/line, 1 dirty bit/line, 1 state bit/word (18 bits/line)

MESI (full-map, in-cache directory)

L1: 5 state bits (5 bits/line)

L2: 5 state bits + [# of cores] sharer bits (5 + P bits/line)

Size of L2 == 8 x [aggregate size of L1s]

Slide27

Storage Overhead

[Chart: total state overhead vs. core count. DeNovo's per-line overhead is constant, while MESI's grows with the sharer vector; MESI overtakes DeNovo at 29 cores.]
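The 29-core crossover can be reproduced from the per-line numbers on the previous slide (a back-of-the-envelope sketch in plain Java; it charges each L1 line plus the 8 L2 lines backing it, per core):

    class StorageCrossover {
        public static void main(String[] args) {
            // Per L1 line: DeNovo 128 bits, MESI 5 bits.
            // Per L2 line: DeNovo 18 bits, MESI 5 + P bits (full-map sharer vector).
            // L2 capacity is 8x the aggregate L1s, i.e. 8 L2 lines per L1 line.
            int denovoBits = 128 + 8 * 18;          // 272, independent of core count
            for (int p = 2; p <= 64; p++) {
                int mesiBits = 5 + 8 * (5 + p);     // grows with the core count P
                if (mesiBits > denovoBits) {
                    System.out.println("MESI overhead passes DeNovo at P = " + p); // 29
                    break;
                }
            }
        }
    }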

Slide28

Alternative Directory Designs

Duplicate tag directory: L1 tags duplicated in the directory; requires highly associative lookups (a 256-way associative lookup for a 64-core setup with 4-way L1s)

Sparse directory: significant performance overhead at higher core counts

Tagless directory (MICRO '09): imprecise sharer-list encoding using Bloom filters; even more protocol complexity (false positives); trades performance for less storage overhead/power

Slide29

Addressing Limitations

Limitations of current directory-based protocols

Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations

Storage overhead: directory overhead for sharer lists

Performance and power overhead: invalidation and ack messages, false sharing, indirection through the directory

Slide30

Extensions

Traditional directory-based protocols: sharer lists must always contain all the true sharers

DeNovo protocol: the registry only needs to point to the latest copy at the end of the phase, so valid data can be copied around freely

Slide31

Extensions (1 of 2): Basic with direct cache-to-cache transfer

Get data directly from the producer, through prediction and/or software assistance

Converts 3-hop misses into 2-hop misses

[Figure: Core 1 holds X3-X5 Invalid after self-invalidation while Core 2 holds them Registered; on LD X3, Core 1's request goes straight to Core 2's L1 instead of indirecting through the shared L2 registry.]
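A sketch of that miss path (hypothetical interfaces; the predictor and the fallback policy are illustrative, not the talk's exact design):

    final class DirectTransfer {
        interface L1Peer { Integer tryRead(long addr); }   // null if not held here
        interface Registry { L1Peer registeredOwner(long addr); }

        // Valid data may be copied freely, so a predicted producer can
        // answer directly (2 hops); a wrong guess falls back through the
        // L2 registry to the registered owner (the usual 3 hops).
        static int load(long addr, L1Peer predictedProducer, Registry l2) {
            Integer data = predictedProducer.tryRead(addr);
            if (data != null) return data;                 // 2-hop miss
            return l2.registeredOwner(addr).tryRead(addr); // 3-hop miss
        }
    }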

Slide32

Extensions (2 of 2): Basic with flexible communication

Software-directed data transfer: transfer "relevant" data together

Achieves the effect of an AoS-to-SoA transformation without programmer/compiler intervention

[Figure: on LD X3, the response packs X3, X4, and X5 (the remaining words of the same region) into Core 1's L1 together, instead of returning a contiguous line of interleaved X, Y, and Z words.]
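A sketch of such a region-granularity response (the Word record and the request shape are hypothetical; the real mechanism is software-directed):

    import java.util.ArrayList;
    import java.util.List;

    final class FlexTransfer {
        static final class Word {
            long addr; int regionId; int data;
        }

        // Answer a miss with every cached word of the requested region
        // (e.g. all X fields), approximating an AoS-to-SoA transfer
        // without changing the memory layout.
        static List<Word> respond(List<Word> cached, int requestedRegion) {
            List<Word> reply = new ArrayList<>();
            for (Word w : cached) {
                if (w.regionId == requestedRegion) reply.add(w);
            }
            return reply;
        }
    }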


Slide35

Outline

Motivation

Background: DPJ

DeNovo Protocol

DeNovo Extensions

Evaluation

Protocol Verification

Performance

Conclusion and Future Work

Slide36

Evaluation

Simplicity: formal verification of the cache coherence protocol, comparing reachable states

Performance/power: simulation experiments

Extensibility: DeNovo extensions

Slide37

Protocol Verification Methodology

Murphi model checking tool

Verified the DeNovo word and MESI word protocols; MESI taken from a state-of-the-art GEMS implementation

Abstract model: single address/region, two data values, two cores with private L1s and a unified L2, unordered network

Data-race freedom guaranteed for DeNovo; cross-phase interactions modeled

Slide38

Protocol Verification: Results

Correctness

Three bugs in the DeNovo protocol: mistakes in translation from the high-level specification; simple to fix

Six bugs in the MESI protocol: two deadlock scenarios and unhandled races due to L1 writebacks; several days to fix

Complexity

25x fewer reachable states for DeNovo; 30x difference in model-checking runtime

Slide39

Performance Evaluation Methodology

Simulation environment: Wisconsin GEMS + Simics + Princeton Garnet network

System parameters: 64 cores, private L1s (128KB), unified L2 (32MB)

Simple core model: 5-stage, one-issue, in-order

Results report memory stall time only

Slide40

Benchmarks

FFT and LU from SPLASH-2

kdTree: construction of k-D trees for ray tracing; developed within UPCRC, presented at HPG 2010

Two versions: kdFalse (false sharing in an auxiliary structure) and kdPad (padding to eliminate the false sharing)

Slide41

Simulated Protocols

Word-based (4B cache line): MESI and DeNovo

Line-based (64B cache line): MESI and DeNovo

DeNovo extensions (proof of concept, word-based): DeNovo direct and DeNovo flex

Slide42

Word Protocols

[Chart: memory stall time for FFT, LU, kdFalse, and kdPad under the word-based protocols; off-scale bars are labeled 563 and 547.]

Dword is comparable to Mword: simplicity doesn't compromise performance.

Slide43

DeNovo Direct

[Chart: memory stall time for FFT, LU, kdFalse, and kdPad; off-scale bars are labeled 563, 547, and 505.]

Ddirect reduces remote L1 hit time.

Slide44

Line Protocols

[Chart: memory stall time for FFT, LU, kdFalse, and kdPad under the line-based protocols; off-scale bars are labeled 563, 547, and 505.]

Dline is not susceptible to false sharing: up to 66% reduction in total time.

Slide45

Word vs. Line

[Chart: memory stall time for FFT, LU, kdFalse, and kdPad, word vs. line protocols; off-scale bars are labeled 563, 547, and 505.]

The benefit of lines is application dependent; on kdFalse, Mline is worse by 12%.

Slide46

DeNovo Flex

[Chart: memory stall time for FFT, LU, kdFalse, and kdPad; off-scale bars are labeled 563, 547, and 505.]

Dflex outperforms all the other systems: up to 73% reduction over Mline.

Slide47

Network Traffic

[Chart: network traffic for FFT, LU, kdFalse, and kdPad.]

Dline and Dflex generate less network traffic than Mline: up to a 70% reduction.

Slide48

Conclusion

Disciplined programming models are key for software; DeNovo rethinks hardware for disciplined models

Simplicity: 25x fewer reachable states with model checking; 30x runtime difference

Extensibility: direct cache-to-cache transfer; flexible communication granularity

Storage overhead: no storage overhead for directory information; storage overheads beat MESI after tens of cores

Performance/power: up to 73% less time spent on memory requests; up to 70% reduction in network traffic

Slide49

Future Work

Rethinking cache data layout

Extend to disciplined non-deterministic codes, synchronization, and legacy codes

Extend to off-chip memory

Automate generation of hardware regions

More extensive evaluations

Slide50

Thank You!

Slide51

Backup Slides

Slide52

Flexible Coherence Granularity

Byte-level sharing is uncommon: none of the apps studied so far need it

Handle it correctly, but not necessarily efficiently

The compiler aligns byte-granularity regions at word boundaries

If that fails, hardware "clones" the line into 4 cache frames; with at least 4-way associativity, all clones reside in the same set

[Figure: a normal frame holds Tag | State | Byte 0 | Byte 1 | Byte 2 | Byte 3. A cloned line occupies four frames of the same set, each holding Tag | State and a single byte lane (Byte 0, Byte 1, Byte 2, or Byte 3).]
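A sketch of the lookup this implies (the Frame layout and the per-clone byte-lane field are hypothetical; the slide only states that the four clones share one set):

    final class ClonedSet {
        static final class Frame {
            long tag;
            boolean cloned;
            int byteLane;                 // 0..3: which byte-of-word this clone holds
            byte[] data = new byte[64];
        }

        // Normal line: any frame with a matching tag hits. Cloned line:
        // hit only in the frame whose lane matches the access's
        // byte-within-word offset; all four clones sit in one set.
        static Frame lookup(Frame[] set, long tag, int byteInWord) {
            for (Frame f : set) {
                if (f != null && f.tag == tag
                        && (!f.cloned || f.byteLane == byteInWord)) {
                    return f;
                }
            }
            return null;  // miss
        }
    }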