DeNovo: A Software-Driven Rethinking of the Memory Hierarchy (PowerPoint Presentation)
Uploaded by windbey on 2020-07-02 (ID: 792466)
Presentation Transcript

Slide1

DeNovo: A Software-Driven Rethinking of the Memory Hierarchy

Sarita Adve, Vikram Adve, Rob Bocchino, Nicholas Carter, Byn Choi, Ching-Tsun Chou, Stephen Heumann, Nima Honarmand, Rakesh Komuravelli, Maria Kotsifakou, Tatiana Schpeisman, Matthew Sinclair, Robert Smolinski, Prakalp Srivastava, Hyojin Sung, Adam Welc

University of Illinois at Urbana-Champaign, Intel
denovo@cs.illinois.edu

Slide2

Silver Bullets for the Energy Crisis?

- Parallelism
- Specialization, heterogeneity, …

BUT: large impact on software, hardware, and the hardware-software interface

Slide3

Multicore Parallelism: Current Practice

Multicore parallelism today: shared memory
- Complex, power- and performance-inefficient hardware: complex directory coherence, unnecessary traffic, ...
- Difficult programming model: data races, non-determinism, composability?, testing?
- Mismatched interface between HW and SW, a.k.a. the memory model: can't specify "what value can a read return"; data races defy acceptable semantics

Fundamentally broken for hardware & software

Slide4

Specialization/Heterogeneity: Current Practice

A modern smartphone: CPU, GPU, DSP, vector units, multimedia and audio-video accelerators
- 6 different ISAs
- 7 different parallelism models
- Incompatible memory systems

Even more broken

Slide5

Energy Crisis Demands Rethinking HW, SW

How to (co-)design software? hardware? the HW/SW interface?
- Deterministic Parallel Java (DPJ)
- DeNovo
- Virtual Instruction Set Computing (VISC)

Today: focus on homogeneous parallelism

Slide6

(Repeat of Slide 3: Multicore Parallelism: Current Practice)

Slide7

(Repeat of Slide 3: Multicore Parallelism: Current Practice)

Banish shared memory?

Slide8

(Repeat of Slide 3: Multicore Parallelism: Current Practice)

Banish wild shared memory!
Need disciplined shared memory!

Slide9

What is Shared-Memory?

Shared-memory = global address space + implicit, anywhere communication and synchronization

Slide10

(Repeat of Slide 9)

Slide11

What is Shared-Memory?

Wild shared-memory = global address space + implicit, anywhere communication and synchronization

Slide12

(Repeat of Slide 11)

Slide13

What is Shared-Memory?

Disciplined shared-memory = global address space + implicit, anywhere communication and synchronization + explicit, structured side-effects

How to build disciplined shared-memory software?
If software is more disciplined, can hardware be more efficient?

Slide14

Our Approach: Simple Programming Model AND Efficient Hardware

Disciplined shared memory:
- Deterministic Parallel Java (DPJ): strong safety properties; no data races, determinism-by-default, safe non-determinism; simple semantics, safety, and composability
- DeNovo: complexity-, performance-, and power-efficiency; simplified coherence and consistency; optimized communication and data storage
- The link between them: explicit effects + structured parallel control

Slide15

Key Milestones

Software (DPJ):
- Determinism: OOPSLA'09
- Disciplined non-determinism: POPL'11
- Unstructured synchronization; legacy, OS: ongoing

Hardware (DeNovo):
- Coherence, consistency, communication: PACT'11 (best paper)
- DeNovoND: ASPLOS'13 & IEEE Micro Top Picks '14
- DeNovoSync: in review
- + Storage; a language-oblivious virtual ISA: ongoing

Slide16

Current Hardware Limitations

- Complexity: subtle races and numerous transient states in the protocol; hard to verify and extend for optimizations
- Storage overhead: directory overhead for sharer lists
- Performance and power inefficiencies: invalidation and ack messages; indirection through the directory; false sharing (cache-line-based coherence); bandwidth waste (cache-line-based communication); cache pollution (cache-line-based allocation)

Slide17

Results for Deterministic Codes: Base DeNovo

- Complexity: no transient states; simple to extend for optimizations; 20X faster to verify vs. MESI
- Still to address: directory storage overhead; invalidation/ack messages; indirection; false sharing; bandwidth waste; cache pollution

Slide18

Results for Deterministic Codes: Base DeNovo

- Complexity: no transient states; simple to extend for optimizations; 20X faster to verify vs. MESI
- Storage: no storage overhead for directory information
- Still to address: invalidation/ack messages; indirection; false sharing; bandwidth waste; cache pollution

Slide19

Results for Deterministic Codes: Base DeNovo

- Complexity: no transient states; simple to extend for optimizations; 20X faster to verify vs. MESI
- Storage: no storage overhead for directory information
- Performance and power: no invalidation or ack messages; no indirection through the directory; no false sharing (region-based coherence); region (not cache-line) communication; region (not cache-line) allocation (ongoing)
- Up to 77% lower memory stall time; up to 71% lower traffic

Slide20

DPJ Overview

- Structured parallel control: fork-join parallelism
- Region: a name for a set of memory locations; assign a region to each field and array cell
- Effect: a read or write on a region; summarize the effects of method bodies
- Compiler: a simple type check verifies that region types are consistent, effect summaries are correct, and parallel tasks don't interfere (race-free)

[Figure: fork-join task tree issuing ST/LD operations to disjoint regions of the heap]

Type-checked programs are guaranteed determinism (sequential semantics)
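The interference check that DPJ's type system performs statically can be illustrated with a small dynamic sketch in Python (the effect representation and region names here are illustrative, not DPJ syntax; DPJ itself proves this at compile time):

```python
from itertools import combinations

# Each task summarizes its side-effects as (kind, region) pairs,
# mirroring DPJ's per-method effect summaries.
def interferes(effects_a, effects_b):
    """Two tasks interfere if they touch the same region
    and at least one of the accesses is a write."""
    for kind_a, region_a in effects_a:
        for kind_b, region_b in effects_b:
            if region_a == region_b and "write" in (kind_a, kind_b):
                return True
    return False

def race_free(tasks):
    """A parallel phase is race-free if no pair of tasks interferes."""
    return not any(interferes(a, b) for a, b in combinations(tasks, 2))
```

For example, two tasks that write disjoint regions pass the check, while a write and a read on the same region are flagged as interference.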

Slide21

Memory Consistency Model

Guaranteed determinism: a read returns the value of the last write in sequential order, from
- the same task in this parallel phase, or
- before this parallel phase

[Figure: a ST 0xa before the parallel phase; within the phase, one task performs ST 0xa and another performs LD 0xa; the coherence mechanism must deliver the correct value]
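The read rule above can be modeled in a few lines of Python (a deliberately simplified, single-location model; the function and argument names are illustrative):

```python
def read_value(pre_phase_value, phase_writes, task):
    """Value a read by `task` must return under the deterministic model:
    the last write by the same task in this parallel phase, if any,
    otherwise the value from before the phase. `phase_writes` is a
    sequential-order list of (task, value) writes to this location."""
    my_writes = [v for t, v in phase_writes if t == task]
    return my_writes[-1] if my_writes else pre_phase_value
```

Note that the case "another task wrote this location in the same phase" never arises for a read here, because DPJ's type check rejects such interfering (racy) phases.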

Slide22

Cache Coherence

- Coherence enforcement: invalidate stale copies in caches; track one up-to-date copy
- Explicit effects: the compiler knows all writeable regions in this parallel phase, so a cache can self-invalidate before the next parallel phase, invalidating data in writeable regions it did not access itself
- Registration: the directory keeps track of the one up-to-date copy; the writer updates it before the next parallel phase

Slide23

Basic DeNovo Coherence [PACT'11]

Assume (for now): private L1, shared L2; single-word lines; data-race freedom at word granularity
- No transient states
- No invalidation traffic, no false sharing
- No directory storage overhead: the L2 data arrays double as the directory ("registry"), keeping either valid data or the registered core id
- Touched bit: set if the word was read in the phase

[State diagram, per word: Invalid, Valid, Registered. A read takes Invalid to Valid; a write takes any state to Registered; reads and writes leave Registered in place, and reads leave Valid in place.]
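The three-state, per-word protocol can be sketched as a tiny transition function in Python (a behavioral model only; the real protocol also exchanges registration messages with the L2, which this sketch omits):

```python
# Per-word coherence states in basic DeNovo.
INVALID, VALID, REGISTERED = "I", "V", "R"

def on_read(state):
    """A read of an Invalid word fetches data and makes it Valid;
    Valid and Registered words hit locally."""
    return VALID if state == INVALID else state

def on_write(state):
    """Any write registers the word with the directory (L2)."""
    return REGISTERED

def self_invalidate(state, touched, in_writeable_region):
    """At a phase boundary, invalidate Valid words in regions written
    this phase that this core neither read (touched bit) nor wrote
    (Registered)."""
    if in_writeable_region and state == VALID and not touched:
        return INVALID
    return state
```

Because every transition is a pure function of the current stable state, there are no transient states to verify, which is the source of the 20X verification speedup claimed against MESI.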

Slide24

Example Run

class S_type {
    X in DeNovo-region ;
    Y in DeNovo-region ;
}
S_type S[size];
...
Phase1 writes { // DeNovo effect
    foreach i in 0, size {
        S[i].X = ...;
    }
    self_invalidate( );
}

[Figure: cache-state tables for L1 of Core 1, L1 of Core 2, and the shared L2 over a six-element array (words X0..X5, Y0..Y5); legend R = Registered, V = Valid, I = Invalid. All words start Valid everywhere. Core 1 writes X0..X2 and Core 2 writes X3..X5; each registers its writes with the L2, whose data array records the writer's core id (C1 for X0..X2, C2 for X3..X5) in place of the data, and receives acks. After self_invalidate(), each core's L1 marks the X words written by the other core Invalid, while all Y words remain Valid.]

Slide25

Decoupling Coherence and Tag Granularity

- The basic protocol has a tag per word
- DeNovo line-based protocol: allocation/transfer granularity > coherence granularity
  - Allocate and transfer a cache line at a time
  - Coherence granularity stays at the word level: no word-level false sharing
- "Line merging": [figure: one tag covering a line whose words carry individual V/R states]

Slide26

Current Hardware Limitations (revisited)

- Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations
- Storage overhead: directory overhead for sharer lists (makes up for the new bits at ~20 cores)
- Performance and power inefficiencies: invalidation and ack messages; indirection through the directory; false sharing (cache-line-based coherence); traffic (cache-line-based communication); cache pollution (cache-line-based allocation)

Slide27

Flexible, Direct Communication

Insights:
1. A traditional directory must be updated at every transfer; DeNovo can copy valid data around freely
2. Traditional systems send a cache line at a time; DeNovo uses regions to transfer only the relevant data

Effect of an AoS-to-SoA transformation without programmer or compiler involvement
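The bandwidth effect of region-based transfer can be quantified with a toy Python model (the array layout, line size, and field names are illustrative assumptions, not taken from the slides):

```python
# An array of 6 structs {X, Y, Z}, stored AoS with one word per field,
# so X fields sit at addresses 0, 3, 6, 9, 12, 15.
NUM_STRUCTS, FIELDS = 6, ("X", "Y", "Z")
x_addrs = [i * len(FIELDS) for i in range(NUM_STRUCTS)]

def words_via_lines(addrs, line_words=4):
    """Cache-line transfer: every line touching a needed word moves whole,
    dragging along the unwanted Y and Z words."""
    return len({a // line_words for a in addrs}) * line_words

def words_via_region(addrs):
    """Region transfer: only the words of the requested region move."""
    return len(addrs)
```

With these assumptions, reading all X fields moves 16 words under line-granularity transfer but only 6 under region transfer, the same saving an AoS-to-SoA rewrite would buy, without touching the program.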

Slide28

Flexible, Direct Communication

[Figure: cache-state tables for L1 of Core 1, L1 of Core 2, and the shared L2 over words X0..X5, Y0..Y5, Z0..Z5 (R = Registered, V = Valid, I = Invalid). Core 1 holds X0..X2 Registered and X3..X5 Invalid; Core 2 holds the reverse; the L2 records the registered core (C1 or C2) for each X word. Core 1 issues LD X3, which misses in its L1 and is directed to the registered core.]

Slide29

Flexible, Direct Communication (continued)

[Figure: same setup as the previous slide. The response to Core 1's LD X3 returns X3, X4, and X5 together, i.e., only the words of X's region rather than whole cache lines containing Y and Z, leaving Core 1's L1 with X3..X5 Valid while its Y and Z words are untouched.]

Slide30

Current Hardware Limitations (revisited)

(Same list as Slide 26; the remaining unaddressed item is cache pollution from cache-line-based allocation.)

Stash = cache + scratchpad: another talk

Slide31

Evaluation

Verification: DeNovo vs. MESI (word granularity) with the Murphi model checker
- Correctness: six bugs in the MESI protocol, difficult to find and fix; three bugs in the DeNovo protocol, simple to fix
- Complexity: 15x fewer reachable states for DeNovo; 20x difference in runtime

Performance: Simics + GEMS + Garnet; 64 cores, simple in-order core model
Workloads: FFT, LU, Barnes-Hut, and radix from SPLASH-2; bodytrack and fluidanimate from PARSEC 2.1; kd-Tree (two versions) [HPG 09]

Slide32

Memory Stall Time for MESI vs. DeNovo

[Chart: memory stall time for FFT, LU, Barnes-Hut, kd-false, kd-padded, bodytrack, fluidanimate, and radix; bars M = MESI, D = DeNovo, Dopt = DeNovo+Opt]

- DeNovo is comparable to or better than MESI
- DeNovo + opts shows 32% lower memory stalls vs. MESI (max 77%)

Slide33

Network Traffic for MESI vs. DeNovo

[Chart: network traffic for the same benchmarks; M = MESI, D = DeNovo, Dopt = DeNovo+Opt]

- DeNovo has 36% less traffic than MESI (max 71%)

Slide34

(Key Milestones outline, repeated from Slide 15; next up: DeNovoND)

Slide35

DPJ Support for Disciplined Non-Determinism

- Non-determinism comes from conflicting concurrent accesses
- Isolate interfering accesses as atomic: enclosed in atomic sections, with atomic regions and effects
- Disciplined non-determinism: race freedom, strong isolation, determinism-by-default semantics
- DeNovoND converts atomic statements into locks

[Figure: parallel tasks whose ST and LD to an atomic region occur inside critical sections]

Slide36

Memory Consistency Model

A non-deterministic read returns the value of the last write from:
- before this parallel phase, or
- the same task in this phase, or
- a preceding critical section of the same lock

Self-invalidations as before.

[Figure: a ST 0xa before the phase; within the phase, a ST 0xa in one core's critical section is ordered before a LD 0xa in a later critical section on another core]

Slide37

Coherence for Non-Deterministic Data

- When to invalidate? Between the start of a critical section and the read
- What to invalidate? The entire cache? Regions with "atomic" effects? Instead: track atomic writes in a signature, transferred with the lock
- Registration: the writer updates the directory before the next critical section
- Coherence enforcement: invalidate stale copies in the private cache; track the up-to-date copy

Slide38

Tracking Data Write Signatures

- A small Bloom filter per core tracks the write signature; only atomic effects are tracked, and 256 bits suffice
- Operations on the Bloom filter:
  - On write: insert the address
  - On read: query the filter for the address, to decide self-invalidation
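A minimal sketch of such a write signature in Python, assuming a 256-bit filter with two hash functions (the hashing scheme here is illustrative; the hardware would use cheap fixed hash circuits, not SHA-256):

```python
import hashlib

class WriteSignature:
    """Per-core Bloom filter over addresses written in atomic regions."""

    def __init__(self, bits=256, num_hashes=2):
        self.bits = bits
        self.num_hashes = num_hashes
        self.field = 0  # the bit vector, packed into an int

    def _positions(self, addr):
        # Derive independent bit positions from one hash digest.
        digest = hashlib.sha256(str(addr).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "little")
            yield chunk % self.bits

    def insert(self, addr):            # on an atomic write
        for pos in self._positions(addr):
            self.field |= 1 << pos

    def may_contain(self, addr):       # on a read: self-invalidate if True
        return all(self.field >> pos & 1 for pos in self._positions(addr))

    def reset(self):                   # after self-invalidation
        self.field = 0
```

A Bloom filter can report false positives (causing a harmless extra self-invalidation) but never false negatives, which is exactly the conservative direction coherence requires.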

Slide39

Distributed Queue-based Locks

- A lock primitive that works on DeNovoND: no sharer lists, no write invalidations, so no spinning for the lock
- Modeled after the QOSB lock [Goodman et al. '89]: lock requests form a distributed queue, but much simpler
- Details in ASPLOS'13
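The queueing idea (though not the hardware mechanism detailed in ASPLOS'13) can be conveyed with a toy Python model in which requesters wait in FIFO order instead of spinning:

```python
from collections import deque

class QueueLock:
    """Toy model of a queue-based lock: requests enqueue, and a release
    hands the lock directly to the next waiter, so no core spins."""

    def __init__(self):
        self.holder = None
        self.waiters = deque()

    def acquire(self, core):
        """Return True if `core` got the lock immediately;
        otherwise it joins the queue and waits to be handed the lock."""
        if self.holder is None:
            self.holder = core
            return True
        self.waiters.append(core)
        return False

    def release(self):
        """Pass the lock (and, in DeNovoND, the write signature riding
        with it) to the head of the queue; return the new holder."""
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder
```

The hand-off on release is the key property: ownership transfers point-to-point along the queue, so no invalidation storm or repeated polling of a contended lock word is needed.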

Slide40

Example Run

X in DeNovo-region
Y in DeNovo-region
Z in atomic DeNovo-region
W in atomic DeNovo-region

[Figure: cache-state tables for L1 of Core 1, L1 of Core 2, and the shared L2 over words X, Y, Z, W (R = Registered, V = Valid, I = Invalid). Core 1 stores to Z inside a critical section and registers the write with the L2 (registration, ack). On lock transfer, the write signature {Z, W} travels with the lock; Core 2 self-invalidates those atomic words, takes a read miss on Z, then loads and stores it, registering its write with the L2. After each critical section, the core self-invalidates against the received signature and resets its filter.]

Slide41

Optimizations to Reduce Self-Invalidation

- Loads of words in Registered state need not self-invalidate
- Touched-atomic bit: set on the first atomic load; subsequent loads don't self-invalidate

(Regions as in the previous example: X, Y in DeNovo-regions; Z, W in atomic DeNovo-regions)

[Figure: a ST followed by repeated LDs of the same atomic word; after the first load sets the touched-atomic bit, later loads hit without self-invalidation]

Slide42

Overheads

- Hardware: a 256-bit Bloom filter per core
- Storage: one additional state, but no storage overhead (2 bits); a touched-atomic bit per word in L1
- Communication: the Bloom filter piggybacks on the lock-transfer message; writeback messages for locks; lock writebacks carry more info

Slide43

Evaluation of MESI vs. DeNovoND (16 cores)

- DeNovoND execution time is comparable to or better than MESI
- DeNovoND has 33% less traffic than MESI (67% max): no invalidation traffic, and reduced load misses due to the lack of false sharing

Slide44

(Key Milestones outline, repeated from Slide 15; next up: unstructured synchronization)

Slide45

Unstructured Synchronization

- Many programs (and libraries) use unstructured synchronization, e.g., non-blocking and wait-free constructs
- Arbitrary synchronization races: several reads and writes
- Data ordered by such synchronization may still be disciplined: use static or signature-driven self-invalidations
- But what about the synchronization accesses themselves?

Slide46

Unstructured Synchronization

- Memory model: sequential consistency
- What to invalidate, and when? Every read? No: every read to non-registered state
- Register reads (to enable future hits)
- Concurrent readers? Back off (delay) read registration

Slide47

Unstructured Synch: Execution Time (64 cores)

DeNovoSync reduces execution time by 28% over MESI (max 49%)

Slide48

Unstructured Synch: Network Traffic (64 cores)

- DeNovo reduces traffic by 44% vs. MESI (max 61%) for 11 of 12 cases
- Centralized barrier: many concurrent readers hurt DeNovo (and MESI); a tree barrier should be used even with MESI

Slide49

(Key Milestones outline, repeated from Slide 15; the remaining items are ongoing work)

Slide50

Conclusions and Future Work (1 of 3)

Simple programming model AND efficient hardware through disciplined shared memory (recap of Slide 14):
- DPJ: strong safety properties; no data races, determinism-by-default, safe non-determinism; simple semantics, safety, and composability
- DeNovo: complexity-, performance-, and power-efficiency; simplified coherence and consistency; optimized communication and storage
- Linked by explicit effects + structured parallel control

Slide51

Conclusions and Future Work (2 of 3)

DeNovo rethinks hardware for disciplined models. For deterministic codes:
- Complexity: no transient states (20X faster to verify than MESI); extensible, with optimizations requiring no new states
- Storage: no directory overhead
- Performance and power: no invalidations, acks, false sharing, or indirection; flexible (not cache-line) communication; up to 77% lower memory stall time and up to 71% lower traffic

Also added safe non-determinism and unstructured synchronization.

Slide52

Conclusions and Future Work (3 of 3)

- Broaden the software supported: OS, legacy, …
- Region-driven memory hierarchy: also apply to heterogeneous memory; global address space; region-driven coherence, communication, and layout; stash = the best of cache and scratchpad
- Hardware/software interface: a language-neutral virtual ISA
- Parallelism and specialization may solve the energy crisis, but they require rethinking software, hardware, and the interface