Slide1
RowClone
Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Vivek Seshadri
Slide2
Executive Summary
Bulk data copy and initialization
  Unnecessarily move data on the memory channel
  Degrade system performance and energy efficiency
RowClone – perform copy in DRAM with low cost
  Uses row buffer to copy large quantity of data
  Source row → row buffer → destination row
  11X lower latency and 74X lower energy for a bulk copy
Accelerate Copy-on-Write and Bulk Zeroing
  Forking, checkpointing, zeroing (security), VM cloning
Improves performance and energy efficiency at low cost
  27% and 17% for 8-core systems (0.01% DRAM chip area)
Slide3
Memory Channel – Bottleneck
[Figure: cores, cache, memory controller (MC), and memory connected by the memory channel]
Limited Bandwidth
High Energy
Slide4
Goal: Reduce Memory Bandwidth Demand
[Figure: cores, cache, memory controller (MC), and memory connected by the memory channel]
Reduce unnecessary data movement
Slide5
Bulk Data Copy and Initialization
Bulk Data Copy: src → dst
Bulk Data Initialization: val → dst
Slide7
Bulk Copy and Initialization – Applications
Forking
Zero initialization (e.g., security)
VM Cloning
Deduplication
Checkpointing
Page Migration
Many more
Slide8
Shortcomings of Existing Approach
[Figure: cores, cache, MC, and channel; src and dst both reside in memory]
High latency (1046 ns to copy 4 KB)
Interference
High energy (3600 nJ to copy 4 KB)
Slide9
Our Approach: In-DRAM Copy with Low Cost
[Figure: cores, cache, MC, and channel; src is copied to dst inside memory, and the three shortcomings (high latency, interference, high energy) are crossed out]
Slide10
Outline
Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation
Slide11
DRAM Chip Organization
Memory Channel
Chip I/O
Bank
Bank I/O
Subarray
Row Buffer
Row of DRAM Cells
Slide12
DRAM Read Operation
Memory Channel
Chip I/O
Bank I/O
ACTIVATE: Copy data from row to row buffer
READ: Transfer data to channel using the shared bus
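The two commands on this slide can be illustrated with a toy model. This is a minimal sketch, not a real DRAM interface: the `Bank` class, its method names, and the tiny row sizes are all assumptions made for illustration.

```python
class Bank:
    """Toy DRAM bank: rows of bytes plus one row buffer (hypothetical model)."""

    def __init__(self, num_rows=4, row_size=8):
        self.rows = [bytearray(row_size) for _ in range(num_rows)]
        self.row_buffer = None   # contents of the currently open row
        self.open_row = None     # index of the currently open row

    def activate(self, row):
        # ACTIVATE: copy the whole row into the row buffer.
        self.row_buffer = bytearray(self.rows[row])
        self.open_row = row

    def read(self, column):
        # READ: transfer one column of the open row towards the channel.
        if self.open_row is None:
            raise RuntimeError("READ requires an activated row")
        return self.row_buffer[column]


bank = Bank()
bank.rows[2][:] = b"ROWCLONE"
bank.activate(2)
first_byte = bank.read(0)   # first column of row 2
```

The point of the sketch is that ACTIVATE moves an entire row at once, while READ moves only a small piece of it over the shared bus.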
Slide13
DRAM Cell Operation
[Figure: DRAM cell connected to a sense amplifier (row buffer); voltage labels VDD/2, VDD/2, 0, VDD]
Slide14
DRAM Cell Operation
[Figure: DRAM cell connected to a sense amplifier (row buffer); voltage labels VDD/2, VDD/2 + δ, 0, VDD]
Cell loses charge
Amplify the difference
Restore Cell Data
READ/WRITE
In the stable state, the sense amplifier drives the cell
ACTIVATE
Slide15
Outline
Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation
Slide16
RowClone: Fast Parallel Mode (FPM)
[Figure: src row and dst row in the same subarray, sharing one row buffer]
1. Source row to row buffer
2. Row buffer to destination row
Slide17
Fast Parallel Mode: Implementation
[Figure: src and dst cells connected to the same sense amplifier (row buffer); voltage labels VDD/2, VDD/2 + δ, 0, VDD]
Amplify the difference
Data gets copied
Slide18
Fast Parallel Mode: Implementation
[Figure: src row and dst row in the same subarray, sharing one row buffer]
1. Activate src row (copy data from src to row buffer)
2. Activate dst row (disconnect src from row buffer, connect dst – copy data from row buffer to dst)
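The two back-to-back activations can be sketched with a toy model. This is only an illustration under stated assumptions: the `Subarray` class, the `fpm_copy` helper, and the use of `None` to represent a precharged row buffer are hypothetical and do not describe real DRAM circuitry.

```python
class Subarray:
    """Toy subarray whose rows share one row buffer (hypothetical model)."""

    def __init__(self, num_rows=4, row_size=8):
        self.rows = [bytearray(row_size) for _ in range(num_rows)]
        self.row_buffer = None  # None models the precharged (empty) state

    def activate(self, row):
        if self.row_buffer is None:
            # Precharged: the cells perturb the sense amplifiers,
            # which then latch the row's data.
            self.row_buffer = bytearray(self.rows[row])
        else:
            # Already latched: the sense amplifiers drive the newly
            # connected cells, overwriting them.
            self.rows[row][:] = self.row_buffer

    def precharge(self):
        self.row_buffer = None


def fpm_copy(subarray, src, dst):
    subarray.activate(src)   # step 1: src row -> row buffer
    subarray.activate(dst)   # step 2: row buffer -> dst row
    subarray.precharge()     # return the subarray to the idle state


sa = Subarray()
sa.rows[0][:] = b"ROWCLONE"
fpm_copy(sa, src=0, dst=3)
```

The sketch shows why the second activation acts as a write: once the row buffer is in a stable state, whichever row is connected to it takes on its contents.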
Slide19
Fast Parallel Mode: Benefits
Bulk Data Copy
  Latency: 1046 ns to 90 ns (11x)
  Energy: 3600 nJ to 40 nJ (74x)
No bandwidth consumption
Very few changes to the DRAM chip
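As a quick arithmetic check, the ratios can be recomputed from the absolute numbers quoted on this slide. The variable names are illustrative. Dividing the rounded energy values gives 90 rather than 74; presumably the 74x figure was derived from unrounded energy values that the slide does not show.

```python
baseline_latency_ns, fpm_latency_ns = 1046, 90
baseline_energy_nj, fpm_energy_nj = 3600, 40

# About 11.6, consistent with the 11x on the slide.
latency_ratio = baseline_latency_ns / fpm_latency_ns

# 90.0 when computed from the rounded values shown on the slide.
energy_ratio = baseline_energy_nj / fpm_energy_nj
```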
Slide20
Fast Parallel Mode: Constraints
Location of source/destination
  Both should be in the same subarray
Size of the copy
  Copies all the data from source row to destination
Slide21
RowClone: Pipelined Serial Mode (PSM)
Memory Channel
Chip I/O
Bank
Shared internal bus
Overlap the latency of the read and the write
1.9X latency reduction, 3.2X energy reduction
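Why overlapping the read and the write approaches a 2x latency gain can be seen in a toy timing model. Everything here is an assumption for illustration: a 4 KB copy moved as 64-byte pieces, one abstract time unit per read or write step, and no other DRAM timing. The 1.9X on the slide comes from the authors' evaluation, not from this model.

```python
def serial_copy_time(num_lines, step=1):
    # Read every piece first, then write every piece.
    return 2 * num_lines * step


def pipelined_copy_time(num_lines, step=1):
    # The write of piece i overlaps with the read of piece i + 1,
    # so only the final write is not hidden.
    return (num_lines + 1) * step


lines = 4096 // 64   # 64 pieces of 64 bytes in a 4 KB copy
speedup = serial_copy_time(lines) / pipelined_copy_time(lines)
```

With 64 pieces the model yields 128 / 65, slightly under 2x, which is in the same range as the reported latency reduction.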
Slide22
Bulk Copy using RowClone
Memory Channel
Chip I/O
Bank
Bank I/O
Subarray
Intra subarray: Use FPM
Inter bank: Use PSM
Inter subarray: Use PSM twice
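The three cases on this slide amount to a small decision rule. The sketch below assumes a row can be identified by a (bank, subarray, row) triple; the `RowAddress` type and `select_copy_mode` function are hypothetical names, not part of the RowClone design.

```python
from typing import NamedTuple


class RowAddress(NamedTuple):
    bank: int
    subarray: int
    row: int


def select_copy_mode(src: RowAddress, dst: RowAddress) -> str:
    if src.bank != dst.bank:
        return "PSM"         # inter bank
    if src.subarray != dst.subarray:
        return "PSM twice"   # inter subarray within one bank
    return "FPM"             # intra subarray


mode_same = select_copy_mode(RowAddress(0, 1, 5), RowAddress(0, 1, 9))
mode_bank = select_copy_mode(RowAddress(0, 1, 5), RowAddress(3, 1, 5))
mode_sub = select_copy_mode(RowAddress(0, 1, 5), RowAddress(0, 2, 5))
```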
Slide23
Bulk Initialization
Initialization with arbitrary data
  Initialize one row
  Copy the data to other rows
Zero initialization (most common)
  Reserve a row in each subarray (always zero)
  Copy data from reserved row (FPM mode)
  6.0X lower latency, 41.5X lower DRAM energy
  0.2% loss in capacity
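The reserved-row idea and its capacity cost can be sketched as follows. The slide does not state the subarray size, so the 512 rows per subarray used here is an assumption, chosen because one row in 512 rounds to the 0.2% quoted. `ZERO_ROW` and `bulk_zero` are hypothetical names.

```python
ROWS_PER_SUBARRAY = 512   # assumption: not stated on the slide
ZERO_ROW = 0              # hypothetical index of the reserved all-zero row


def bulk_zero(rows, dst):
    # Modelled as an FPM-style copy from the reserved zero row.
    rows[dst][:] = rows[ZERO_ROW]


rows = [bytearray(8) for _ in range(4)]
rows[2][:] = b"OLD DATA"
bulk_zero(rows, 2)

# One reserved row per subarray, expressed as a percentage of capacity.
capacity_loss_percent = round(100 / ROWS_PER_SUBARRAY, 1)
```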
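Slide24
Latency and Energy Benefits
Operation                          Latency reduction   Energy reduction
Intra-subarray copy (FPM)          11.6x               74.4x
Inter-bank copy (PSM)              1.9x                3.2x
Inter-subarray copy (PSM twice)    1.0x                1.5x
Bulk zeroing (FPM)                 6.0x                41.5x
Very low cost: 0.01% increase in die area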
Slide25
Outline
Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation
Slide26
End-to-end System Design
DRAM (RowClone)
Microarchitecture
ISA
Operating System
Application
How does the software communicate occurrences of bulk copy/initialization to hardware?
How to maximize use of the Fast Parallel Mode?
How to ensure cache coherence?
How to handle data reuse after zero initialization?
Slide27
1. Hardware/Software Interface
Two new instructions
  memcopy and meminit
  Similar instructions present in existing ISAs
Microarchitecture Implementation
  Checks if instructions can be sped up by RowClone
  Export instructions to the memory controller
Slide28
2. Managing Cache Coherence
RowClone modifies data in memory
  Need to maintain coherence of cached data
Similar to DMA
  Source and destination in memory
  Can leverage hardware support for DMA
Additional optimizations
Slide29
3. Maximizing Use of the Fast Parallel Mode
Make operating system subarray-aware
Primitives amenable to use of FPM
  Copy-on-Write
    Allocate destination in same subarray as source
    Use FPM to copy
  Bulk Zeroing
    Use FPM to copy data from reserved zero row
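A subarray-aware Copy-on-Write allocation could look roughly like the sketch below. It assumes the OS keeps a free list per subarray, which the slide does not specify. The function name, the `free_pages` structure, and the "fallback" label are hypothetical.

```python
def allocate_cow_destination(src_subarray, free_pages):
    """Pick a destination page for a Copy-on-Write copy.

    free_pages maps a subarray id to a list of free page numbers.
    Returns (subarray, page, mode).
    """
    same = free_pages.get(src_subarray)
    if same:
        # Destination in the same subarray as the source: FPM applies.
        return src_subarray, same.pop(), "FPM"
    for subarray, pages in free_pages.items():
        if pages:
            # No free page next to the source: fall back to a slower copy.
            return subarray, pages.pop(), "fallback"
    raise MemoryError("no free pages")


free = {0: [10], 1: [20, 21]}
first = allocate_cow_destination(1, free)
second = allocate_cow_destination(0, free)
third = allocate_cow_destination(0, free)
```

Preferring the source's own subarray is what lets the copy use FPM; the fallback path only runs once that subarray has no free pages left.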
Slide30
4. Handling Data Reuse After Zeroing
Data reuse after zero initialization
  Phase 1: OS zeroes out the page
  Phase 2: Application uses cachelines of the page
RowClone
  Avoids misses in phase 1
  But incurs misses in phase 2
RowClone-Zero-Insert (RowClone-ZI)
  Insert clean zero cachelines
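The phase 2 effect can be illustrated with a toy cache model. It assumes 4 KB pages with 64-byte cache lines and models RowClone-ZI as placing every line of the zeroed page in the cache; both are simplifications for illustration, and `phase2_misses` is a hypothetical helper.

```python
LINES_PER_PAGE = 64   # assumes 4 KB pages and 64-byte cache lines


def phase2_misses(lines_touched, zero_insert):
    """Count cache misses when an application uses a freshly zeroed page."""
    # In-DRAM zeroing leaves none of the page's lines in the cache;
    # zero insertion is modelled as caching clean zero lines for the page.
    cache = set(range(LINES_PER_PAGE)) if zero_insert else set()
    misses = 0
    for line in lines_touched:
        if line not in cache:
            misses += 1
            cache.add(line)   # the line is fetched from DRAM
    return misses


accesses = [0, 1, 2, 0]
misses_rowclone = phase2_misses(accesses, zero_insert=False)
misses_rowclone_zi = phase2_misses(accesses, zero_insert=True)
```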
Slide31
Outline
Introduction
DRAM Background
RowClone
  Fast Parallel Mode
  Pipelined Serial Mode
End-to-end Design
Evaluation
Slide32
Methodology
Out-of-order multi-core simulator
1MB/core last-level cache
Cycle-accurate DDR3 DRAM simulator
6 Copy/Initialization intensive applications
  +SPEC CPU2006 for multi-core
Performance
  Instruction throughput for single-core
  Weighted Speedup for multi-core
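The slides do not spell out the Weighted Speedup formula. The sketch below uses the commonly cited definition, the sum over applications of IPC when sharing the system divided by IPC when running alone, and assumes that is the definition meant here. The IPC values are made up for illustration.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    # Sum of each application's IPC in the shared run, normalised
    # by its IPC when it runs alone on the same system.
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


# Two applications, each running at half of its standalone IPC.
ws = weighted_speedup([1.0, 0.5], [2.0, 1.0])
```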
Slide33
Copy/Initialization Intensive Applications
System bootup (Booting the Debian OS)
Compile (GNU C compiler – executing cc1)
Forkbench (A fork microbenchmark)
Memcached (Inserting a large number of objects)
MySQL (Loading a database)
Shell script (find with ls on each subdirectory)
Slide34
Memory Traffic due to Copy/Initialization
Slide35
Single-Core – Performance and Energy
Improvements correlate with fraction of memory traffic due to copy/initialization
Slide36
Multi-Core Systems
Reduced bandwidth consumption benefits all applications.
Run copy/initialization intensive applications with memory intensive SPEC applications.
Half the cores run copy/initialization intensive applications. Remaining half run SPEC applications.
Slide37
Multi-Core Results: Summary
Performance improvement increases with increasing core count
Consistent improvement in energy/instruction
Slide38
Other Results and Discussion in the Paper
Discussion on interleaving and copy granularity
Detailed analysis of the fork benchmark
Detailed multi-core results and analysis
Results with the PSM mode
Analysis of RowClone-ZI
Comparison to memory-controller-based DMA
Slide39
Conclusion
Bulk data copy and initialization
  Unnecessarily move data on the memory channel
  Degrade system performance and energy efficiency
RowClone – perform copy in DRAM with low cost
  Uses row buffer to copy large quantity of data
  Source row → row buffer → destination row
  11X lower latency and 74X lower energy for a bulk copy
Accelerate Copy-on-Write and Bulk Zeroing
  Forking, checkpointing, zeroing (security), VM cloning
Improves performance and energy efficiency at low cost
  27% and 17% for 8-core systems (0.01% chip area overhead)
Slide40
RowClone
Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry
Vivek Seshadri
Slide41
Backup Slides
Slide42
Multi-core Metrics
Metric                             2-core   4-core   8-core
# Workloads                        138      50       40
Weighted Speedup                   15%      20%      27%
Instruction Throughput             14%      15%      25%
Harmonic Speedup                   13%      16%      29%
Max Slowdown Reduction             6%       12%      23%
Bandwidth/Instruction Reduction    29%      27%      28%
Energy/Instruction Reduction       19%      17%      17%
Slide43
RowClone-ZI Single-Core
Slide44
RowClone-ZI Multi-Core
Slide45
Forkbench – Fraction of Memory Traffic
Slide46
Forkbench – Performance
Slide47
Forkbench – Energy
Slide48
Comparison to Prior Work
Copy engines (Zhao et al. 2005, Jiang et al. 2009)
  Addresses cache pollution, pipeline stalls due to copy
  But requires data transfer over the memory channel
IRAM (Patterson et al. 1997)
  Compute + memory using same technology
  Exploit high DRAM bandwidth
  Goal: Wider range of SIMD operations
  High cost
Slide49
Why is FPM not done today?
Copy/Initialization is important
  But not well known
Opportunity to perform in DRAM
  Not well known
This paper: Proof of concept
  More challenges to be addressed