RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Uploaded On 2022-07-01

Presentation Transcript

Slide1

RowClone

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

Slide2

Executive Summary

Bulk data copy and initialization
  Unnecessarily move data on the memory channel
  Degrade system performance and energy efficiency
RowClone – perform copy in DRAM with low cost
  Uses row buffer to copy large quantity of data
  Source row → row buffer → destination row
  11X lower latency and 74X lower energy for a bulk copy
Accelerate Copy-on-Write and Bulk Zeroing
  Forking, checkpointing, zeroing (security), VM cloning
Improves performance and energy efficiency at low cost
  27% higher performance and 17% lower energy per instruction for 8-core systems (0.01% DRAM chip area)

Slide3

Memory Channel – Bottleneck

Diagram: Core, Core, Cache, MC (memory controller), Memory Channel

Limited Bandwidth

High Energy

Slide4

Goal: Reduce Memory Bandwidth Demand

Diagram: Core, Core, Cache, MC (memory controller), Memory Channel

Reduce unnecessary data movement

Slide5

Bulk Data Copy and Initialization

Bulk Data Copy: src → dst

Bulk Data Initialization: val → dst

Slide7

Bulk Copy and Initialization – Applications

Forking

Zero initialization (e.g., security)

VM Cloning

Deduplication

Checkpointing

Page Migration

Many more

Slide8

Shortcomings of Existing Approach

Diagram: Core, Core, Cache, MC, Channel, src, dst

High latency (1046 ns to copy 4 KB)

Interference

High energy (3600 nJ to copy 4 KB)

Slide9

Our Approach: In-DRAM Copy with Low Cost

Diagram: Core, Core, Cache, MC, Channel, src, dst; a "?" marks the open question of how to copy src to dst inside DRAM

High latency: X

Interference: X

High energy: X

Slide10

Outline

Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation

Slide11

DRAM Chip Organization

Memory Channel

Chip I/O

Bank

Bank I/O

Subarray

Row Buffer

Row of DRAM Cells

Slide12

DRAM Read Operation

Memory Channel

Chip I/O

Bank I/O

ACTIVATE: Copy data from row to row buffer

READ: Transfer data to channel using the shared bus

Slide13

DRAM Cell Operation

Diagram: DRAM cell and sense amplifier (row buffer). Voltage labels: VDD/2, VDD/2, 0, VDD.

Slide14

DRAM Cell Operation

Diagram: DRAM cell and sense amplifier (row buffer) during ACTIVATE. Voltage labels: VDD/2, VDD/2 + δ, 0, VDD.

ACTIVATE

Cell loses charge

Amplify the difference

Restore Cell Data

READ/WRITE

In the stable state, the sense amplifier drives the cell
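The ACTIVATE sequence on this slide can be sketched as a toy model of one cell and its sense amplifier. This is an illustration only: the supply voltage, the size of δ, and the function name `activate_cell` are assumptions, not values from the slides.

```python
VDD = 1.2  # assumed supply voltage in volts; illustrative only


def activate_cell(cell_voltage, delta=0.05):
    """Toy model of ACTIVATE for one cell and its sense amplifier.

    Returns (bitline_voltage, restored_cell_voltage).
    """
    bitline = VDD / 2  # precharged state before ACTIVATE
    # Cell loses charge: sharing charge nudges the bitline by +/- delta.
    if cell_voltage > VDD / 2:
        bitline += delta
    else:
        bitline -= delta
    # Amplify the difference: the sense amplifier drives the bitline to a rail.
    bitline = VDD if bitline > VDD / 2 else 0.0
    # Restore cell data: in the stable state the sense amplifier drives the cell.
    return bitline, bitline


print(activate_cell(VDD))        # cell storing a 1
print(activate_cell(0.0))        # cell storing a 0
print(activate_cell(0.8 * VDD))  # a partially leaked 1 is still restored
```

The last step is the property RowClone builds on: once the sense amplifier is stable, whatever cell is connected to it gets driven to the latched value.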

Slide15

Outline

Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation

Slide16

RowClone: Fast Parallel Mode (FPM)

Diagram: a subarray with a src row, a dst row, and the Row Buffer; a "?" marks the open question of how step 2 is performed

1. Source row to row buffer

2. Row buffer to destination row

Slide17

Fast Parallel Mode: Implementation

Diagram: src cell, dst cell, and sense amplifier (row buffer). Voltage labels: VDD/2, VDD/2 + δ, 0, VDD.

Amplify the difference

Data gets copied

Slide18

Fast Parallel Mode: Implementation

Diagram: src row, dst row, and Row Buffer

1. Activate src row (copy data from src to row buffer)

2. Activate dst row (disconnect src from row buffer, connect dst – copy data from row buffer to dst)
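The two back-to-back activations can be sketched with a small software model of a subarray. This is a minimal sketch, not the hardware design: the `Subarray` class, its method names, and the explicit `precharge` step (a standard DRAM command not shown on this slide) are assumptions made for illustration.

```python
class Subarray:
    """Toy subarray: a list of rows sharing one row buffer."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = bytearray(row_bytes)
        self.buffer_driven = False  # False models a precharged row buffer

    def precharge(self):
        self.buffer_driven = False

    def activate(self, row):
        if not self.buffer_driven:
            # Normal ACTIVATE: the row buffer latches the row's contents.
            self.row_buffer[:] = self.rows[row]
            self.buffer_driven = True
        else:
            # ACTIVATE while the row buffer is still driven: the newly
            # connected row is overwritten with the row buffer's contents.
            self.rows[row][:] = self.row_buffer


def fpm_copy(subarray, src, dst):
    """Copy row src to row dst using two back-to-back activations."""
    subarray.precharge()
    subarray.activate(src)  # step 1: src row to row buffer
    subarray.activate(dst)  # step 2: row buffer to dst row
    subarray.precharge()


sub = Subarray(num_rows=4, row_bytes=8)
sub.rows[0][:] = b"RowClone"
fpm_copy(sub, src=0, dst=3)
print(bytes(sub.rows[3]))
```

No data leaves the subarray in this model, which mirrors why the copy needs no memory channel bandwidth.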

Slide19

Fast Parallel Mode: Benefits

Bulk Data Copy

Latency: 11x lower (1046 ns to 90 ns)

Energy: 74x lower (3600 nJ to 40 nJ)

No bandwidth consumption

Very few changes to the DRAM chip
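A quick arithmetic check of the figures on this slide, as a sketch. It assumes "4 KB" means 4096 bytes and uses the rounded values as printed. The rounded energy values give a ratio of 90 rather than the reported 74x, which suggests the 40 nJ figure is rounded down from roughly 48 to 49 nJ; that reading is an inference, not something the slides state.

```python
COPY_BYTES = 4 * 1024  # "4 KB", assumed to mean 4096 bytes

baseline_latency_ns = 1046
fpm_latency_ns = 90
baseline_energy_nj = 3600
fpm_energy_nj = 40  # as printed on the slide; likely rounded

latency_ratio = baseline_latency_ns / fpm_latency_ns
energy_ratio_from_rounded = baseline_energy_nj / fpm_energy_nj
implied_fpm_energy_nj = baseline_energy_nj / 74  # energy implied by "74x"

# bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second)
baseline_copy_gb_per_s = COPY_BYTES / baseline_latency_ns
fpm_copy_gb_per_s = COPY_BYTES / fpm_latency_ns

print(f"latency reduction: {latency_ratio:.1f}x")
print(f"energy ratio from rounded values: {energy_ratio_from_rounded:.0f}x")
print(f"FPM energy implied by 74x: {implied_fpm_energy_nj:.1f} nJ")
print(f"copy throughput: {baseline_copy_gb_per_s:.1f} GB/s baseline, "
      f"{fpm_copy_gb_per_s:.1f} GB/s with FPM")
```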

Slide20

Fast Parallel Mode: Constraints

Location of source/destination
  Both should be in the same subarray
Size of the copy
  Copies all the data from source row to destination

Slide21

RowClone: Pipelined Serial Mode (PSM)

Memory Channel

Chip I/O

Bank

Shared internal bus

Overlap the latency of the read and the write

1.9X latency reduction, 3.2X energy reduction

Slide22

Bulk Copy using RowClone

Memory Channel

Chip I/O

Bank

Bank I/O

Subarray

Intra-subarray: Use FPM

Inter-bank: Use PSM

Inter-subarray: Use PSM twice
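The mode selection on this slide can be sketched as a small decision function. The `RowAddr` type and `choose_modes` name are hypothetical. The slide says only "Use PSM twice" for the inter-subarray case; staging the data through a row in another bank is one plausible reading and is labeled as an assumption in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowAddr:
    """Hypothetical decoded DRAM row address."""
    bank: int
    subarray: int
    row: int


def choose_modes(src, dst):
    """Return the sequence of RowClone operations for one row copy."""
    if src == dst:
        return []  # nothing to copy
    if src.bank != dst.bank:
        return ["PSM"]  # inter-bank: one pipelined serial copy
    if src.subarray != dst.subarray:
        # Same bank, different subarrays: two PSM copies. Assumption:
        # the data is staged through a temporary row in another bank.
        return ["PSM", "PSM"]
    return ["FPM"]  # same subarray: fast parallel mode


print(choose_modes(RowAddr(0, 1, 10), RowAddr(0, 1, 20)))  # intra-subarray
print(choose_modes(RowAddr(0, 1, 10), RowAddr(3, 5, 20)))  # inter-bank
print(choose_modes(RowAddr(0, 1, 10), RowAddr(0, 2, 20)))  # inter-subarray
```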

Slide23

Bulk Initialization

Initialization with arbitrary data
  Initialize one row
  Copy the data to other rows
Zero initialization (most common)
  Reserve a row in each subarray (always zero)
  Copy data from reserved row (FPM mode)
  6.0X lower latency, 41.5X lower DRAM energy
  0.2% loss in capacity
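A small sketch of both initialization schemes and of the capacity cost. The value of 512 rows per subarray is an assumption (the slides do not give it); with that value, reserving one row costs 1/512, about 0.2%, which is consistent with the figure above. The reserved row index and the function names are hypothetical, and each slice assignment stands in for one FPM copy.

```python
ROWS_PER_SUBARRAY = 512  # assumption; not stated in the slides
ZERO_ROW = 0             # hypothetical index of the reserved always-zero row


def capacity_loss_percent(rows_per_subarray=ROWS_PER_SUBARRAY):
    """Capacity lost by reserving one row per subarray, in percent."""
    return 100.0 / rows_per_subarray


def bulk_zero(subarray_rows, targets):
    """Zero the target rows by copying from the reserved zero row."""
    for r in targets:
        if r == ZERO_ROW:
            raise ValueError("the reserved zero row must not be overwritten")
        subarray_rows[r][:] = subarray_rows[ZERO_ROW]  # one FPM copy


def bulk_init(subarray_rows, first, others, pattern):
    """Arbitrary data: initialize one row, then copy it to the other rows."""
    subarray_rows[first][:] = pattern
    for r in others:
        subarray_rows[r][:] = subarray_rows[first]  # one FPM copy


rows = [bytearray(8) for _ in range(4)]
rows[2][:] = b"\xff" * 8
bulk_zero(rows, [2])
bulk_init(rows, first=1, others=[3], pattern=b"\xab" * 8)
print(f"capacity loss: {capacity_loss_percent():.2f}%")
```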

Slide24

Latency and Energy Benefits

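Operation                        Latency reduction   Energy reduction
Intra-subarray copy (FPM)        11.6x               74.4x
Inter-bank copy (PSM)            1.9x                3.2x
Inter-subarray copy (PSM twice)  1.0x                1.5x
Zero initialization (FPM)        6.0x                41.5x

Very low cost: 0.01% increase in die area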

Slide25

Outline

Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation

Slide26

End-to-end System Design

System stack: DRAM (RowClone), Microarchitecture, ISA, Operating System, Application

How does the software communicate occurrences of bulk copy/initialization to hardware?
How to maximize use of the Fast Parallel Mode?
How to ensure cache coherence?
Handling data reuse after zero initialization?

Slide27

1. Hardware/Software Interface

Two new instructions
  memcopy and meminit
  Similar instructions present in existing ISAs
Microarchitecture Implementation
  Checks if instructions can be sped up by RowClone
  Export instructions to the memory controller
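One way to picture the "can this be sped up by RowClone" check is a function that splits a copy request into whole-row pieces and leftover pieces. This is a hypothetical sketch: the slides do not give the operands of memcopy, the row size, or the actual check, so `split_memcopy`, its arguments, and the 4096-byte row size are all assumptions. Which mode (FPM or PSM) serves each row would be decided separately.

```python
def split_memcopy(src, dst, size, row_bytes=4096):
    """Split a copy into ("rowclone" | "cpu", src, dst, length) pieces.

    Whole, row-aligned pieces are candidates for RowClone; everything
    else falls back to the conventional copy path.
    """
    if size <= 0:
        return []
    if src % row_bytes != dst % row_bytes:
        # src and dst never line up on row boundaries together.
        return [("cpu", src, dst, size)]
    ops = []
    head = min((-src) % row_bytes, size)  # bytes before the first boundary
    if head:
        ops.append(("cpu", src, dst, head))
    full_rows = (size - head) // row_bytes
    for i in range(full_rows):
        off = head + i * row_bytes
        ops.append(("rowclone", src + off, dst + off, row_bytes))
    tail_off = head + full_rows * row_bytes
    if tail_off < size:
        ops.append(("cpu", src + tail_off, dst + tail_off, size - tail_off))
    return ops


for op in split_memcopy(src=4000, dst=12192, size=8400):
    print(op)
```

A meminit request could be split the same way, with the source replaced by the value to write.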

Slide28

2. Managing Cache Coherence

RowClone modifies data in memory
  Need to maintain coherence of cached data
Similar to DMA
  Source and destination in memory
  Can leverage hardware support for DMA
  Additional optimizations
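A minimal sketch of DMA-style coherence handling before an in-DRAM copy, under stated assumptions: dirty cached lines of the source are written back so DRAM holds the latest data, and cached lines of the destination are invalidated so no stale copy survives. The slides say only that the problem is similar to DMA; the dictionaries used for the cache and memory, the 64-byte line size, and the function names are illustrative.

```python
LINE_BYTES = 64  # assumed cache line size


def prepare_for_in_dram_copy(cache, memory, src, dst, size):
    """Make cache and memory consistent before copying inside DRAM."""
    for off in range(0, size, LINE_BYTES):
        line = cache.get(src + off)
        if line is not None and line["dirty"]:
            memory[src + off] = line["data"]  # write back dirty source line
            line["dirty"] = False
        cache.pop(dst + off, None)  # invalidate any cached destination line


def in_dram_copy(memory, src, dst, size):
    """Stand-in for the RowClone copy itself, at cache line granularity."""
    for off in range(0, size, LINE_BYTES):
        memory[dst + off] = memory.get(src + off, 0)


memory = {0: 1, 64: 2, 4096: 9, 4160: 9}
cache = {
    0: {"data": 7, "dirty": True},      # newer than memory
    4096: {"data": 9, "dirty": False},  # would become stale after the copy
}
prepare_for_in_dram_copy(cache, memory, src=0, dst=4096, size=128)
in_dram_copy(memory, src=0, dst=4096, size=128)
print(memory[4096], memory[4160], 4096 in cache)
```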

Slide29

3. Maximizing Use of the Fast Parallel Mode

Make operating system subarray-aware
Primitives amenable to use of FPM
  Copy-on-Write
    Allocate destination in same subarray as source
    Use FPM to copy
  Bulk Zeroing
    Use FPM to copy data from reserved zero row
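A sketch of what a subarray-aware allocation for Copy-on-Write might look like. Everything here is hypothetical: the slides do not describe the allocator, and the mapping from page number to subarray is simplified to an integer division (real address mappings with interleaving are more involved).

```python
from collections import defaultdict

PAGES_PER_SUBARRAY = 512  # assumed, simplified mapping


def subarray_of(page):
    return page // PAGES_PER_SUBARRAY


class SubarrayAwareAllocator:
    """Toy page allocator that keeps one free list per subarray."""

    def __init__(self, free_pages):
        self.free = defaultdict(list)
        for page in free_pages:
            self.free[subarray_of(page)].append(page)

    def alloc_near(self, src_page):
        """Return (page, can_use_fpm) for a Copy-on-Write destination."""
        same = self.free[subarray_of(src_page)]
        if same:
            return same.pop(), True  # same subarray: FPM applies
        for pages in self.free.values():
            if pages:
                return pages.pop(), False  # fall back to another subarray
        raise MemoryError("no free pages")


allocator = SubarrayAwareAllocator(free_pages=[5, 600])
print(allocator.alloc_near(7))  # destination in the same subarray as page 7
print(allocator.alloc_near(8))  # same subarray exhausted, falls back
```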

Slide30

4. Handling Data Reuse After Zeroing

Data reuse after zero initialization
  Phase 1: OS zeroes out the page
  Phase 2: Application uses cachelines of the page
RowClone
  Avoids misses in phase 1
  But incurs misses in phase 2
RowClone-Zero-Insert (RowClone-ZI)
  Insert clean zero cachelines
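The two-phase behavior can be illustrated with a toy miss counter. This is a deliberately simple model under assumptions not in the slides: a 4 KB page of 64 cache lines, a cache large enough to hold the whole page, a baseline that zeroes the page through the cache, and an application that touches every line once in phase 2.

```python
def count_misses(scheme, lines_per_page=64):
    """Return (phase1_misses, phase2_misses) for one zeroed page."""
    cached = set()
    phase1 = 0
    if scheme == "baseline":
        # OS zeroes the page through the cache: each line misses once.
        for line in range(lines_per_page):
            if line not in cached:
                phase1 += 1
                cached.add(line)
    elif scheme == "rowclone":
        pass  # zeroing happens in DRAM; the cache is not touched
    elif scheme == "rowclone-zi":
        # Zeroing happens in DRAM and clean zero lines are inserted.
        cached.update(range(lines_per_page))
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    phase2 = 0
    for line in range(lines_per_page):  # application uses the page
        if line not in cached:
            phase2 += 1
            cached.add(line)
    return phase1, phase2


for scheme in ("baseline", "rowclone", "rowclone-zi"):
    print(scheme, count_misses(scheme))
```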

Slide31

Outline

Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation

Slide32

Methodology

Out-of-order multi-core simulator
1MB/core last-level cache
Cycle-accurate DDR3 DRAM simulator
6 Copy/Initialization intensive applications
  +SPEC CPU2006 for multi-core
Performance
  Instruction throughput for single-core
  Weighted Speedup for multi-core
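For reference, the multi-core metrics named in this deck (weighted speedup here, harmonic speedup and maximum slowdown in the backup slides) can be computed as below. The formulas are the commonly used definitions, given per-application IPC when running alone and when sharing the system; the slides do not spell them out, and the sample IPC values are made up.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def harmonic_speedup(ipc_shared, ipc_alone):
    """Number of applications divided by the sum of their slowdowns."""
    return len(ipc_shared) / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


def max_slowdown(ipc_shared, ipc_alone):
    """Largest IPC_alone / IPC_shared across applications."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))


ipc_alone = [2.0, 1.0]   # made-up values for illustration
ipc_shared = [1.0, 0.5]
print(weighted_speedup(ipc_shared, ipc_alone))
print(harmonic_speedup(ipc_shared, ipc_alone))
print(max_slowdown(ipc_shared, ipc_alone))
```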

Slide33

Copy/Initialization Intensive Applications

System bootup (Booting the Debian OS)
Compile (GNU C compiler – executing cc1)
Forkbench (A fork microbenchmark)
Memcached (Inserting a large number of objects)
MySQL (Loading a database)
Shell script (find with ls on each subdirectory)

Slide34

Memory Traffic due to Copy/Initialization

Slide35

Single-Core – Performance and Energy

Improvements correlate with fraction of memory traffic due to copy/initialization

Slide36

Multi-Core Systems

Reduced bandwidth consumption benefits all applications.

Run copy/initialization intensive applications with memory intensive SPEC applications.

Half the cores run copy/initialization intensive applications. Remaining half run SPEC applications.

Slide37

Multi-Core Results: Summary

Performance improvement increases with increasing core count

Consistent improvement in energy/instruction

Slide38

Other Results and Discussion in the Paper

Discussion on interleaving and copy granularity
Detailed analysis of the fork benchmark
Detailed multi-core results and analysis
Results with the PSM mode
Analysis of RowClone-ZI
Comparison to memory-controller-based DMA

Slide39

Conclusion

Bulk data copy and initialization
  Unnecessarily move data on the memory channel
  Degrade system performance and energy efficiency
RowClone – perform copy in DRAM with low cost
  Uses row buffer to copy large quantity of data
  Source row → row buffer → destination row
  11X lower latency and 74X lower energy for a bulk copy
Accelerate Copy-on-Write and Bulk Zeroing
  Forking, checkpointing, zeroing (security), VM cloning
Improves performance and energy efficiency at low cost
  27% higher performance and 17% lower energy per instruction for 8-core systems (0.01% chip area overhead)

Slide40

RowClone

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

Slide41

Backup Slides

Slide42

Multi-core Metrics

Metric                            2-core   4-core   8-core
# Workloads                       138      50       40
Weighted Speedup                  15%      20%      27%
Instruction Throughput            14%      15%      25%
Harmonic Speedup                  13%      16%      29%
Max Slowdown Reduction            6%       12%      23%
Bandwidth/Instruction Reduction   29%      27%      28%
Energy/Instruction Reduction      19%      17%      17%

Slide43

RowClone-ZI Single-Core

Slide44

RowClone-ZI Multi-Core

Slide45

Forkbench – Fraction of Memory Traffic

Slide46

Forkbench – Performance

Slide47

Forkbench – Energy

Slide48

Comparison to Prior Work

Copy engines (Zhao et al. 2005, Jiang et al. 2009)
  Addresses cache pollution, pipeline stalls due to copy
  But requires data transfer over the memory channel
IRAM (Patterson et al. 1997)
  Compute + memory using same technology
  Exploit high DRAM bandwidth
  Goal: Wider range of SIMD operations
  High cost

Slide49

Why is FPM not done today?

Copy/Initialization is important
  But not well known
Opportunity to perform in DRAM
  Not well known
This paper: Proof of concept
  More challenges to be addressed