Presentation Transcript

Slide 1

Stash: Have Your Scratchpad and Cache it Too

Matthew D. Sinclair et al., UIUC
Presented by Sharmila Shridhar

Slide 2

SoCs Need an Efficient Memory Hierarchy

An energy-efficient memory hierarchy is essential.
Heterogeneous SoCs use specialized memories, e.g., scratchpads, FIFOs, stream buffers, ...

Scratchpad vs. cache:
- Directly addressed (no tags/TLB/conflicts): scratchpad yes, cache no
- Compact storage (no holes in cache lines): scratchpad yes, cache no

Slide 3

SoCs Need an Efficient Memory Hierarchy (cont.)

An energy-efficient memory hierarchy is essential.
Heterogeneous SoCs use specialized memories, e.g., scratchpads, FIFOs, stream buffers, ...

Can specialized memories be globally addressable and coherent? Can we have our scratchpad and cache it too?

Scratchpad vs. cache:
- Directly addressed (no tags/TLB/conflicts): scratchpad yes, cache no
- Compact storage (no holes in cache lines): scratchpad yes, cache no
- Global address space (implicit data movement): cache yes, scratchpad no
- Coherent (reuse, lazy writebacks): cache yes, scratchpad no

Slide 4

Can We Have Our Scratchpad and Cache it Too?

Make specialized memories globally addressable and coherent:
- Efficient address mapping
- Efficient coherence protocol

Focus: CPU-GPU systems with scratchpads and caches
Result: up to 31% less execution time, 51% less energy

The stash combines scratchpad and cache: directly addressable, compact storage, global address space, coherent.

Slide 5

Outline
- Motivation
- Background: Scratchpads & Caches
- Stash Overview
- Implementation
- Results
- Conclusion

Slide 6

Global Addressability

Scratchpads:
- Part of the private address space: not globally addressable
- Explicit data movement, pollution, poor support for conditional accesses

Caches:
- Globally addressable: part of the global address space
- Implicit copies, no pollution, support for conditional accesses

[Figure: CPU-GPU SoC with a CPU cache, GPU cache, scratchpad, and registers connected through an interconnection network to shared L2 cache banks]

Slide 7

Coherence: Globally Visible Data

Scratchpads:
- Part of the private address space: not globally visible
- Eager writebacks and invalidations on synchronization

Caches:
- Globally visible: data kept coherent
- Lazy writebacks as space is needed; data reuse across synchronization

Slide 8

Stash – A Scratchpad, Cache Hybrid

- Directly addressed (no tags/TLB/conflicts): scratchpad yes, cache no, stash yes
- Compact storage (no holes in cache lines): scratchpad yes, cache no, stash yes
- Global address space (implicit data movement): scratchpad no, cache yes, stash yes
- Coherent (reuse, lazy writebacks): scratchpad no, cache yes, stash yes

Slide 9

Outline
- Motivation
- Background: Scratchpads & Caches
- Stash Overview
- Implementation
- Results
- Conclusion

Slide 10

Stash: Directly & Globally Addressable

- Like a scratchpad: directly addressable (for hits)
- Like a cache: globally addressable (for misses)
- Implicit loads, no cache pollution

Scratchpad version (explicit copy into scratchpad locations 500-599):

  // A is a global memory address
  // scratch_base == 500
  for (i = 500; i < 600; i++) {
      reg ri = load[A + i - 500];
      scratch[i] = ri;
  }
  reg r = scratch_load[505];

Stash version (no explicit copy; the compiler records the mapping):

  // A is a global memory address
  // Compiler info: stash_base[500] -> A (map entry M0)
  // Rk = M0 (index in map)
  reg r = stash_load[505, Rk];

[Figure: on a stash miss at offset 505, the map entry (stash base 500 -> A, M0) generates load[A+5]]
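To make the implicit movement concrete, here is a minimal C++ sketch (my own, not from the talk) of the address arithmetic a stash miss implies: the per-allocation map entry records the stash base and the global base, so a miss at stash offset 505 regenerates the access to A + 5. The names StashMapEntry and miss_address, and the assumption of a simple contiguous mapping (ignoring the field/object/stride parameters introduced later), are illustrative only.

  #include <cstddef>
  #include <cstdint>

  // Hypothetical per-allocation map entry: stash base -> global base.
  struct StashMapEntry {
      std::size_t    stash_base;  // first stash word of the allocation (500 in the example)
      std::uintptr_t va_base;     // corresponding global virtual address (A in the example)
  };

  // On a miss, the global address backing a stash offset is
  // va_base + (offset - stash_base); for offset 505 this is A + 5,
  // matching the "generate load[A+5]" step in the figure.
  std::uintptr_t miss_address(const StashMapEntry& m, std::size_t stash_offset) {
      return m.va_base + (stash_offset - m.stash_base);
  }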

Slide 11

Stash: Globally Visible

- Stash data can be accessed by other units
- Needs coherence support
- Like a cache: keep data around with lazy writebacks
- Intra- or inter-kernel data reuse on the same core

[Figure: the GPU's stash and its map, the GPU cache and registers, and the CPU cache, connected through an interconnection network to shared L2 cache banks]

Slide 12

Stash: Compact Storage

- Caches store data at cache-line granularity ("holes" waste space) and do not compact data
- Like a scratchpad, the stash compacts data

[Figure: data scattered in global memory stored contiguously in the stash]

Slide 13

Outline
- Motivation
- Background: Scratchpads & Caches
- Stash Overview
- Implementation
- Results
- Conclusion

Slide 14

Stash Software Interface

Software gives a mapping for each stash allocation:

  AddMap(stashBase, globalBase, fieldSize, objectSize, rowSize, strideSize, numStrides, isCoherent)
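As an illustration of how this interface might be used, here is a hypothetical C++ call that stages one field of a tile of an array of structs into the stash; the parameter interpretations in the comments, the placeholder AddMap stub, and all concrete values are my assumptions, not taken from the talk.

  #include <cstddef>

  // Placeholder stub with the signature named on the slide; the real implementation
  // would live in the stash runtime/hardware.
  void AddMap(std::size_t stashBase, const void* globalBase,
              std::size_t fieldSize, std::size_t objectSize,
              std::size_t rowSize, std::size_t strideSize,
              std::size_t numStrides, bool isCoherent) {
      (void)stashBase; (void)globalBase; (void)fieldSize; (void)objectSize;
      (void)rowSize; (void)strideSize; (void)numStrides; (void)isCoherent;
  }

  struct Body { float x, y, z, w; };               // 16-byte object with a 4-byte field

  void map_tile(Body (&grid)[1024][1024]) {
      // Hypothetical mapping: stage the "x" field of a 64x64 tile into the stash.
      AddMap(/* stashBase  */ 0,                    // first stash word of the allocation
             /* globalBase */ &grid[0][0].x,        // global address of the first mapped field
             /* fieldSize  */ sizeof(float),        // bytes of the field being staged
             /* objectSize */ sizeof(Body),         // bytes per object in global memory
             /* rowSize    */ 64,                   // objects per tile row
             /* strideSize */ 1024 * sizeof(Body),  // bytes between consecutive tile rows
             /* numStrides */ 64,                   // number of tile rows
             /* isCoherent */ true);                // keep this allocation coherent
  }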

Slide 15

Stash Hardware

[Figure: stash hardware organization]
- Data array with per-word state bits
- Map index table, indexed by the stash instruction
- Stash-map: each entry holds a valid bit, stash base, VA base, field size, object size, row size, stride size, #strides, isCoherent, and a #DirtyData counter
- VP-map for VA/PA translation, with a TLB and RTLB

Slide 16

Stash Instruction Example

stash_load[505, Rk]
- HIT: the data array is accessed directly, like a scratchpad
- MISS: Rk selects the map index table entry, which points to the stash-map entry; its VA base and layout fields generate the global virtual address, and the VP-map/TLB provides the physical address

[Figure: the stash_load walking the data array and state bits, map index table, stash-map entry fields, VP-map, and TLB]
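A minimal software model of this lookup path (my own sketch with assumed structure names, not the hardware): a hit reads the data array directly, while a miss consults the stash-map entry selected through the map index table and regenerates the backing global address. The VP-map/TLB step that turns that virtual address into a physical one is omitted.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Simplified stash-map entry: a subset of the fields listed on the slide.
  struct StashMapEntry {
      std::size_t    stash_base;  // first stash word of the allocation
      std::uintptr_t va_base;     // global virtual base address
  };

  struct Stash {
      std::vector<std::uint32_t> data;       // data array
      std::vector<bool>          present;    // per-word state: valid in the stash?
      std::vector<std::size_t>   map_index;  // map index table (selected by Rk)
      std::vector<StashMapEntry> stash_map;  // stash-map entries

      // stash_load[offset, Rk]: hit reads the data array; miss fetches from global memory.
      std::uint32_t load(std::size_t offset, std::size_t rk,
                         std::uint32_t (*global_load)(std::uintptr_t)) {
          if (present[offset]) {                              // HIT: direct, scratchpad-like access
              return data[offset];
          }
          const StashMapEntry& m = stash_map[map_index[rk]];  // MISS: consult the map
          std::uintptr_t va = m.va_base + (offset - m.stash_base);
          std::uint32_t value = global_load(va);              // implicit global load
          data[offset]    = value;                            // fill the stash
          present[offset] = true;
          return value;
      }
  };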

Slide 17

Lazy Writebacks

Stash writebacks happen lazily, in chunks of 64 B with a per-chunk dirty bit.

On a store miss, for the chunk:
- Set the dirty bit
- Update the stash-map index
- Increment the #DirtyData counter

On eviction:
- Get the PA using the stash-map index and write back
- Decrement the #DirtyData counter
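A rough sketch of this per-chunk bookkeeping (my own, with assumed names and an assumed 16 KB stash of 64 B chunks): dirty bits and the stash-map index are recorded on store misses, and eviction lazily writes the chunk back and decrements the #DirtyData counter.

  #include <array>
  #include <cstddef>

  constexpr std::size_t kChunkBytes = 64;    // writeback granularity from the slide
  constexpr std::size_t kNumChunks  = 256;   // assumption: 256 chunks * 64 B = 16 KB stash

  struct StashWritebackState {
      std::array<bool, kNumChunks>        dirty{};      // per-chunk dirty bit
      std::array<std::size_t, kNumChunks> map_index{};  // stash-map entry backing each chunk
      std::size_t dirty_data = 0;                       // #DirtyData counter

      // Store miss to a chunk: set its dirty bit, record the stash-map index,
      // and bump #DirtyData (once per chunk).
      void on_store_miss(std::size_t chunk, std::size_t map_id) {
          if (!dirty[chunk]) {
              dirty[chunk] = true;
              ++dirty_data;
          }
          map_index[chunk] = map_id;
      }

      // Eviction of a dirty chunk: the recorded stash-map index is used to get the
      // physical address and write the chunk back, then #DirtyData is decremented.
      void on_evict(std::size_t chunk,
                    void (*writeback)(std::size_t map_id, std::size_t chunk)) {
          if (dirty[chunk]) {
              writeback(map_index[chunk], chunk);
              dirty[chunk] = false;
              --dirty_data;
          }
      }
  };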

Slide 18

Coherence Support for Stash

Stash data needs to be kept coherent. Extend a coherence protocol with three features:
- Track stash data at word granularity
- Capability to merge partial lines when the stash sends data
- Modify the directory to record the modifier and the stash-map ID

Extension to the DeNovo protocol: simple, low overhead, a hybrid of CPU and GPU protocols

Slide 19

DeNovo Coherence (1/3)

[DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism]

- Designed for deterministic code without conflicting accesses
- Line-granularity tags, word-granularity coherence
- Only three coherence states: Valid, Invalid, Registered
- Explicit self-invalidation at the end of each phase
- Lines written in the previous phase -> Registered state
- Keep valid data or the registered core ID in the shared LLC

Slide 20

DeNovo Coherence (2/3)

- Private L1s, shared L2; single-word lines
- Data-race freedom at word granularity

[Figure: state diagram over Invalid, Valid, and Registered; a read takes Invalid to Valid, a write takes Invalid or Valid to Registered, and reads and writes keep a word Registered]

- No transient states
- No invalidation traffic
- No directory storage overhead
- No false sharing (word coherence)
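The transitions in the diagram can be summarized as a small state function. This is my own sketch of the local Valid/Invalid/Registered behavior, not the protocol implementation; the registration messages to the shared LLC and the directory side are not modeled.

  enum class DeNovoState { Invalid, Valid, Registered };
  enum class Access { Read, Write };

  // Local state change for an access by this core, following the diagram:
  // a write always ends Registered (ownership is obtained by registering with
  // the shared LLC); a read makes an Invalid word Valid and otherwise keeps the state.
  DeNovoState next_state(DeNovoState s, Access a) {
      if (a == Access::Write) {
          return DeNovoState::Registered;
      }
      return (s == DeNovoState::Invalid) ? DeNovoState::Valid : s;
  }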

Slide 21

DeNovo Coherence (3/3): Extensions for Stash

- Store the stash-map ID along with the registered core ID
- Newly written data is in the Registered state
- At the end of the kernel, self-invalidate entries that are not registered (in contrast, a scratchpad invalidates all entries)
- Only three states; a 4th state is used for writeback
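To contrast the two invalidation policies in the bullets above, a small sketch (assumed structures, not the real protocol): at a kernel boundary the stash self-invalidates only words that are not Registered, so data this core wrote and registered stays usable, whereas a scratchpad conceptually discards everything.

  #include <vector>

  enum class WordState { Invalid, Valid, Registered };

  // Stash at a kernel boundary: only non-registered words are self-invalidated,
  // so registered (locally written) data survives for reuse by later kernels.
  void stash_end_of_kernel(std::vector<WordState>& words) {
      for (WordState& s : words) {
          if (s != WordState::Registered) {
              s = WordState::Invalid;
          }
      }
  }

  // Scratchpad at a kernel boundary: all entries are invalidated
  // (after eager writebacks of any dirty data).
  void scratchpad_end_of_kernel(std::vector<WordState>& words) {
      for (WordState& s : words) {
          s = WordState::Invalid;
      }
  }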

Slide 22

Outline
- Motivation
- Background: Scratchpads & Caches
- Stash Overview
- Implementation
- Results
- Conclusion

Slide 23

Evaluation

Simulation environment:
- GEMS + Simics + Princeton Garnet network + GPGPU-Sim
- McPAT and GPUWattch extended for energy evaluations

Workloads:
- 4 microbenchmarks: implicit, reuse, pollution, on-demand
- Heterogeneous workloads: Rodinia, Parboil, SURF

Configuration:
- 1 CPU core (15 for microbenchmarks), 15 GPU compute units (1 for microbenchmarks)
- 32 KB L1 caches, 16 KB stash/scratchpad

Slide 24

Evaluation (Microbenchmarks) – Execution Time

Configurations:
- Scr = baseline configuration
- C = all requests use the cache
- Scr+D = all requests use the scratchpad with DMA
- St = converts scratchpad requests to stash

[Figure: execution time for the Implicit microbenchmark under each configuration]

Slide 25

Evaluation (Microbenchmarks) – Execution Time

Implicit: no explicit loads/stores

Slide 26

Evaluation (Microbenchmarks) – Execution Time

Pollution: no cache pollution

Slide 27

Evaluation (Microbenchmarks) – Execution Time

On-Demand: only bring needed data

Slide 28

Evaluation (Microbenchmarks) – Execution Time

Reuse: data compaction and reuse

Slide 29

Evaluation (Microbenchmarks) – Execution Time

Average reduction in execution time: 27% vs. scratchpad, 13% vs. cache, 14% vs. DMA

[Figure: execution time for the implicit, pollution, on-demand, and reuse microbenchmarks, plus the average]

Slide 30

Evaluation (Microbenchmarks) – Energy

Average reduction in energy: 53% vs. scratchpad, 36% vs. cache, 32% vs. DMA

[Figure: energy for the implicit, pollution, on-demand, and reuse microbenchmarks, plus the average]

Slide 31

Evaluation (Apps) – Execution Time

Configurations:
- Scr = requests use the type specified by the original application
- C = all requests use the cache
- St = converts scratchpad requests to stash

[Figure: execution time for BP, NW, PF, SGEMM, ST, SURF, and the average]

Slide 32

Evaluation (Apps) – Execution Time

- Average reduction: 10% vs. scratchpad, 12% vs. cache (max: 22%, 31%)
- Source of the benefit: implicit data movement
- Comparable to scratchpad + DMA

[Figure: execution time for BP, LUD, NW, PF, SGEMM, ST, SURF, and the average]

Slide 33

Evaluation (Apps) – Energy

Average reduction: 16% vs. scratchpad, 32% vs. cache (max: 30%, 51%)

[Figure: energy for BP, LUD, NW, PF, SGEMM, ST, SURF, and the average]

Slide 34

Conclusion

- Make specialized memories globally addressable and coherent
  - Efficient address mapping (only for misses)
  - Efficient software-driven hardware coherence protocol
- Stash = scratchpad + cache
  - Like scratchpads: directly addressable and compact storage
  - Like caches: globally addressable and globally visible
- Reduced execution time and energy
- Future work: more accelerators & specialized memories; consistency models

Slide 35

Critique

- In GPUs, data in shared memory is visible per thread block, and __syncthreads is used to ensure the data is available. How is that behavior implemented?
- Otherwise multiple threads can miss on the same data. How is that handled?
- Why don't they compare against scratchpad + DMA for the GPU application results?