1
Exploiting 3D-Stacked Memory Devices
Rajeev Balasubramonian
School of Computing
University of Utah
Oct 2012
2
Power Contributions
[Chart: percentage of total server power, processor vs. memory]
3
Power Contributions
[Chart: percentage of total server power, processor vs. memory]
4
Example IBM Server
Source: P. Bose, WETI Workshop, 2012
5
Reasons for Memory Power Increase
Innovations for the processor, but not for memory
Harder to get to memory (buffer chips)
New workloads that demand more memory
SAP HANA in-memory databases
SAS in-memory analytics
6
The Cost of Data Movement
64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)
1 instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)
Fetching 256-bit block from a distant cache bank: 1.2 nJ (NSF CPOM Workshop report)
Fetching 256-bit block from an HMC device: 2.68 nJ
Fetching 256-bit block from a DDR3 device: 16.6 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
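To put these figures on a common scale, the short Python sketch below converts each one to picojoules and compares the block fetches against the FP MAC. The dictionary keys and function names are illustrative and not from the slides; only the energy values come from the list above.

```python
# Energy figures from the slide, all expressed in picojoules (1 nJ = 1000 pJ).
ENERGY_PJ = {
    "fp_mac_64bit": 50.0,
    "arm_a5_instruction": 80.0,
    "distant_cache_256bit": 1200.0,   # 1.2 nJ
    "hmc_256bit": 2680.0,             # 2.68 nJ
    "ddr3_256bit": 16600.0,           # 16.6 nJ
}

BLOCK_BITS = 256


def macs_per_fetch(source):
    """Number of FP MACs that cost the same energy as one 256-bit fetch."""
    return ENERGY_PJ[source] / ENERGY_PJ["fp_mac_64bit"]


def pj_per_bit(source):
    """Energy per transferred bit for a 256-bit block fetch."""
    return ENERGY_PJ[source] / BLOCK_BITS


for source in ("distant_cache_256bit", "hmc_256bit", "ddr3_256bit"):
    print(f"{source}: {macs_per_fetch(source):.1f} MACs, "
          f"{pj_per_bit(source):.2f} pJ/bit")
```

With these numbers, one DDR3 block fetch costs as much energy as 332 FP MACs, and roughly 6.2 times as much as the same fetch from an HMC device.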
7
Memory Basics
[Diagram: host multi-core processor with four memory controllers (MC)]
8
FB-DIMM
[Diagram: host multi-core processor with four memory controllers (MC) …]
9
SMB/SMI
[Diagram: host multi-core processor with four memory controllers (MC)]
10
Micron Hybrid Memory Cube Device
11
HMC Architecture
[Diagram: host multi-core processor with four memory controllers (MC)]
12
Key Points
HMC allows the logic layer to easily reach DRAM chips
Open question: new functionalities on the logic chip (cores, routing, refresh, scheduling)
Data transfer out of the HMC is just as expensive as before
Near Data Computing … to cut off-HMC movement
Intelligent Network-of-Memories … to reduce hops
13
Near Data Computing (NDC)
14
Timely Innovation
A low-cost way to achieve NDC
Workloads that are embarrassingly parallel
Workloads that are increasingly memory bound
Mature frameworks (MapReduce) in place
15
Open Questions
What workloads will benefit from this?
What causes the benefit?
16
Workloads
Initial focus on MapReduce, but any workload with localized data access patterns will be a good fit
Map phase in MapReduce: the dataset is partitioned and each Mapper works on its “split”; embarrassingly parallel, localized data access, often the bottleneck; e.g., count word occurrences in each individual document
Reduce phase in MapReduce: aggregates the results of many Mappers; requires random access of data, but deals with less data than the Mappers; e.g., summing up the occurrences for each word
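The word-count example mentioned above can be sketched in a few lines of Python. This is a minimal single-process illustration of the two phases, not the framework or code used in the talk; the function names and sample documents are made up.

```python
from collections import Counter


def map_phase(split):
    """Mapper: count word occurrences within one document (its 'split')."""
    return Counter(split.lower().split())


def reduce_phase(partial_counts):
    """Reducer: sum the per-word counts across all mapper outputs."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


documents = ["the cat sat", "the dog sat", "the end"]

# Each mapper touches only its own split, so these calls are independent
# and could run in parallel next to the data.
partials = [map_phase(doc) for doc in documents]

# The reducer sees only the (smaller) per-document summaries.
totals = reduce_phase(partials)
print(totals["the"], totals["sat"], totals["cat"])
```

The mapper reads every word of its split, while the reducer reads only one counter per split, which is the asymmetry the slide describes.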
17
Baseline Architecture
[Diagram: four memory controllers (MC)]
Mappers and Reducers both execute on the host processor
Many simple cores are better than a few complex cores
2 sockets, 256 GB memory, processing power budget 260 W, 512 ARM cores (EE-Cores) per socket, each core at 876 MHz
18
NDC Architecture
[Diagram: four memory controllers (MC)]
Mappers execute on ND Cores; Reducers execute on the host processor
32 cores per HMC; 2048 total ND Cores and 1024 total EE-Cores; 260 W total processing power budget
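The core counts quoted for the baseline and NDC configurations imply a few quantities the slides do not spell out. The sketch below derives them; it assumes the NDC system keeps the baseline's two sockets, and the variable names are illustrative.

```python
# Figures stated on the Baseline and NDC Architecture slides.
ND_CORES_TOTAL = 2048
ND_CORES_PER_HMC = 32
EE_CORES_TOTAL = 1024
SOCKETS = 2  # stated for the baseline; assumed unchanged for NDC

# Derived quantities.
hmc_count = ND_CORES_TOTAL // ND_CORES_PER_HMC      # HMC devices needed
ee_cores_per_socket = EE_CORES_TOTAL // SOCKETS     # host cores per socket
nd_to_ee_ratio = ND_CORES_TOTAL / EE_CORES_TOTAL    # ND cores per host core

print(hmc_count, ee_cores_per_socket, nd_to_ee_ratio)
```

Under that assumption the NDC system uses 64 HMC devices, keeps the baseline's 512 EE-Cores per socket, and adds two ND Cores for every EE-Core within the same 260 W budget.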
19
NDC Memory Hierarchy
[Diagram: four memory controllers (MC)]
Memory latency excludes delay for link queuing and traversal
Many row buffer hits
L1 I and D caches per ND Core
The vault has space reserved for intermediate outputs, and Mapper/Runtime code/data
20
Methodology
Three workloads:
Range-Aggregate: count occurrences of something
Group-By: count occurrences of everything
Equi-Join: for two databases, it counts the pairs that have similar attributes
Dataset: 1998 World Cup web server logs
Simulations of individual mappers and reducers on EE-cores on the TRAX simulator
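The three workloads can be illustrated on a toy log. The sketch below is one plausible reading of the one-line descriptions above, not the benchmark code: the record layout `(client_id, object_id)` is hypothetical, Range-Aggregate is interpreted as counting records whose key falls in a range, and Equi-Join matches on equal `client_id`.

```python
from collections import Counter

# Hypothetical log records: (client_id, object_id).
log_a = [(1, "index"), (2, "news"), (1, "news"), (3, "index")]
log_b = [(2, "scores"), (3, "index"), (4, "news")]


def range_aggregate(log, lo, hi):
    """Count records whose client_id falls in [lo, hi]."""
    return sum(1 for client, _ in log if lo <= client <= hi)


def group_by(log):
    """Count occurrences of every object_id."""
    return Counter(obj for _, obj in log)


def equi_join_count(left, right):
    """Count (left, right) record pairs that share the same client_id."""
    right_counts = Counter(client for client, _ in right)
    return sum(right_counts[client] for client, _ in left)


print(range_aggregate(log_a, 1, 2))
print(group_by(log_a))
print(equi_join_count(log_a, log_b))
```

All three scan their input once with simple per-record work, which is the localized access pattern the earlier Workloads slide identifies as a good fit.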
21
Single Thread Performance
22
Effect of Bandwidth
23
Exec Time vs. Frequency
24
Maximizing the Power Budget
25
Scaling the Core Count
26
Energy Reduction
27
Results Summary
Execution time reductions of 7%-89%
NDC performance scales better with core count
Energy reduction of 26%-91%
No bandwidth limitation
Lower memory access latency
Lower bit transport energy
28
Intelligent Network of Memories
How should several HMCs be connected to the processor?
How should data be placed in these HMCs?
29
Contributions
Evaluation of different network topologies
Route adaptivity does help
Page placement to bring popular data to nearby HMCs
Percolate-down based on page access counts
Use of router bypassing under low load
Use of deep sleep modes for distant HMCs
30
Topologies
31
Topologies
32
Topologies
(d) F-Tree (e) T-Tree
33
Network Properties
Supports 44-64 HMC devices with 2-4 rings
Adaptive routing (deadlock avoidance based on timers)
An entire page resides in one ring, but cache lines are striped across the channels
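The page-per-ring, line-per-channel layout can be pictured as a simple address mapping. The sketch below is only an illustration: the page size, cache line size, channel count, and the `page_to_ring` table are all assumed values, and the slides do not specify the actual mapping function.

```python
PAGE_BYTES = 4096    # assumed page size
LINE_BYTES = 64      # assumed cache line size
NUM_CHANNELS = 4     # assumed channel count


def locate(address, page_to_ring):
    """Return (ring, channel) for a physical address.

    The whole page maps to one ring; consecutive cache lines within
    the page rotate across the channels.
    """
    page = address // PAGE_BYTES
    line_in_page = (address % PAGE_BYTES) // LINE_BYTES
    ring = page_to_ring[page]
    channel = line_in_page % NUM_CHANNELS
    return ring, channel


# Hypothetical placement: page 0 in ring 0, page 1 in ring 2.
page_to_ring = {0: 0, 1: 2}

print(locate(0, page_to_ring))                 # first line of page 0
print(locate(64, page_to_ring))                # second line of page 0
print(locate(4096 + 5 * 64, page_to_ring))     # sixth line of page 1
```

Every line of a page shares one ring, so the hop count is a per-page property, while striping lets a page's lines be fetched over all channels at once.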
34
Percolate-Down Page Placement
New pages are placed in the nearest ring
Periodically, inactive pages are demoted to the next ring; thresholds matter because of queuing delays
Activity is tracked with the multi-queue algorithm: hierarchical queues, each entry has a timer and an access count, demotion to lower queue if timer expires, promotion to higher queue if access count is high
Page migration off the critical path, striped across many channels, distant links are under-utilized
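The activity tracking described above can be sketched as a small Python class. This is a simplified toy version of a multi-queue tracker, not the implementation evaluated in the talk: the number of levels, the promotion threshold, and the timer lifetime are assumed values, and all class and method names are hypothetical.

```python
class PageEntry:
    def __init__(self, now, lifetime):
        self.level = 0                  # queue the page currently sits in
        self.accesses = 0               # accesses since the last level change
        self.expires = now + lifetime   # timer deadline


class MultiQueueTracker:
    """Toy multi-queue activity tracker (all thresholds are assumptions)."""

    def __init__(self, levels=3, promote_after=4, lifetime=10):
        self.levels = levels
        self.promote_after = promote_after
        self.lifetime = lifetime
        self.pages = {}

    def access(self, page, now):
        if page not in self.pages:
            self.pages[page] = PageEntry(now, self.lifetime)
        entry = self.pages[page]
        entry.accesses += 1
        entry.expires = now + self.lifetime   # an access restarts the timer
        if (entry.accesses >= self.promote_after
                and entry.level < self.levels - 1):
            entry.level += 1                  # promotion: access count is high
            entry.accesses = 0

    def tick(self, now):
        """Demote by one level every page whose timer has expired."""
        for entry in self.pages.values():
            if now >= entry.expires and entry.level > 0:
                entry.level -= 1
                entry.accesses = 0
                entry.expires = now + self.lifetime

    def inactive_pages(self, now):
        """Pages in the lowest queue with an expired timer: these are the
        candidates for demotion to the next ring."""
        return [page for page, entry in self.pages.items()
                if entry.level == 0 and now >= entry.expires]


tracker = MultiQueueTracker()
for t in range(4):
    tracker.access("hot", t)     # four accesses promote "hot" to level 1
tracker.access("cold", 0)        # a single access leaves "cold" at level 0

print(tracker.inactive_pages(12))
```

In this run only the page touched once becomes a demotion candidate; the frequently accessed page sits in a higher queue and must first lose its level through timer expiry.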
35
Router Bypassing
Topologies with more links and adaptive routing (T-Tree) are better… but distant links experience relatively low load
While a complex router is required for the T-Tree, the router can often be bypassed
36
Power-Down Modes
Activity shifts to nearby rings, leaving distant HMCs under-utilized
Can power off the DRAM layers (PD-0) and the SerDes circuits (PD-1)
26% energy saving for a 5% performance penalty
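One way to weigh this trade-off is the energy-delay product (EDP). The slides do not report EDP; the metric is chosen here only for illustration, and the calculation treats the 5% penalty as a 5% increase in execution time.

```python
# Relative to a baseline with no power-down modes (both values = 1.0).
relative_energy = 1.0 - 0.26   # 26% energy saving
relative_time = 1.0 + 0.05     # 5% performance penalty, read as +5% time

# Energy-delay product relative to the baseline; below 1.0 is a net win.
relative_edp = relative_energy * relative_time
print(f"relative EDP: {relative_edp:.3f}")
```

Under that reading the relative EDP is about 0.78, so the energy saving outweighs the slowdown by this metric.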
37
Methodology
128-thread traces of NAS parallel benchmarks (capacity requirements of nearly 211 GB)
Detailed simulations with 1 billion memory access traces, confirmatory page-access simulations for the entire application
Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit for HMC logic layer, 3.9 pJ/bit for a 5x5 router
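The per-bit figures can be combined into a rough per-access energy estimate. The additive model below is an assumption made for illustration (one DRAM access, one logic-layer traversal, and one router traversal per hop); the slides give only the three per-bit costs, and the 512-bit transfer size is also assumed.

```python
# Per-bit energies from the slide, in picojoules.
DRAM_PJ_PER_BIT = 3.7
LOGIC_PJ_PER_BIT = 6.8
ROUTER_PJ_PER_BIT = 3.9


def access_energy_pj(bits, router_hops):
    """Assumed additive model: DRAM access + logic layer + one router per hop."""
    per_bit = (DRAM_PJ_PER_BIT + LOGIC_PJ_PER_BIT
               + router_hops * ROUTER_PJ_PER_BIT)
    return bits * per_bit


LINE_BITS = 512  # assumed 64-byte cache line

for hops in (0, 1, 2):
    print(hops, round(access_energy_pj(LINE_BITS, hops), 1))
```

In this model each extra router hop adds about 2 nJ per 512-bit line, which is why placing popular pages in nearby rings and bypassing routers both reduce energy.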
38
Results – Normalized Exec Time
T-Tree P-Down reduces exec time by 50%
86% of flits bypass the router
88% of requests serviced by Ring-0
39
Results – Energy
40
Summary
Must reduce data movement on off-chip memory links
NDC reduces energy, improves performance by overcoming the bandwidth wall
More work required to analyze workloads, build software frameworks, analyze thermals, etc.
iNoM uses OS page placement to minimize hops for popular data and increase power-down opportunities
Path diversity is useful, router overhead is small
41
Acknowledgements
Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor, Jeff Jestes, Al Davis, Feifei Li
Group funded by: NSF, HP, Samsung, IBM
42
Backup Slide
43
Backup Slide