
Presentation Transcript

Slide 1: Exploiting 3D-Stacked Memory Devices

Rajeev Balasubramonian, School of Computing, University of Utah, Oct 2012

Slide 2: Power Contributions

[Figure: percentage of total server power, processor vs. memory]

Slide 3: Power Contributions

[Figure: percentage of total server power, processor vs. memory (second build)]

Slide 4: Example IBM Server

[Figure; source: P. Bose, WETI Workshop, 2012]

Slide 5: Reasons for Memory Power Increase

- Innovations for the processor, but not for memory
- Harder to get to memory (buffer chips)
- New workloads that demand more memory:
  - SAP HANA in-memory databases
  - SAS in-memory analytics

Slide 6: The Cost of Data Movement

- 64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)
- One instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)
- Fetching a 256-bit block from a distant cache bank: 1.2 nJ (NSF CPOM Workshop report)
- Fetching a 256-bit block from an HMC device: 2.68 nJ
- Fetching a 256-bit block from a DDR3 device: 16.6 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
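
For scale, the per-bit costs implied by the 256-bit transfers above (a quick derivation, not numbers from the slides):

```python
# Per-bit energy implied by the 256-bit block-transfer figures above.
BLOCK_BITS = 256
block_nj = {"distant cache bank": 1.2, "HMC device": 2.68, "DDR3 device": 16.6}

for source, nj in block_nj.items():
    print(f"{source}: {nj * 1000 / BLOCK_BITS:.1f} pJ/bit")
# distant cache bank: 4.7 pJ/bit
# HMC device: 10.5 pJ/bit
# DDR3 device: 64.8 pJ/bit -- roughly 6x the HMC cost per bit
```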

Slide 7: Memory Basics

[Diagram: host multi-core processor with four memory controllers (MC), each driving a memory channel]

Slide 8: FB-DIMM

[Diagram: host multi-core processor with four memory controllers; each channel daisy-chains multiple buffered DIMMs]

Slide 9: SMB/SMI

[Diagram: host multi-core processor with four memory controllers reaching DIMMs through SMB/SMI buffer chips]

Slide 10: Micron Hybrid Memory Cube Device

[Figure]

Slide 11: HMC Architecture

[Diagram: host multi-core processor with four memory controllers connected to HMC devices]

Slide 12: Key Points

- HMC allows the logic layer to easily reach the DRAM chips
- Open question: new functionality on the logic chip (cores, routing, refresh, scheduling)
- Data transfer out of the HMC is just as expensive as before
- Near Data Computing … to cut off-HMC movement
- Intelligent Network of Memories … to reduce hops

Slide 13: Near Data Computing (NDC)

Slide 14: Timely Innovation

- A low-cost way to achieve NDC
- Workloads that are embarrassingly parallel
- Workloads that are increasingly memory-bound
- Mature frameworks (MapReduce) in place

Slide 15: Open Questions

- What workloads will benefit from this?
- What causes the benefit?

Slide 16: Workloads

Initial focus on MapReduce, but any workload with localized data access patterns is a good fit.

- Map phase: the dataset is partitioned, and each Map task works on its "split". Embarrassingly parallel, localized data access, often the bottleneck. E.g., counting word occurrences in each individual document.
- Reduce phase: aggregates the results of many Mappers. Requires random access to data, but deals with less data than the Mappers. E.g., summing up the occurrences of each word.
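
A minimal word-count sketch of the two phases (illustrative Python, not code from the talk):

```python
from collections import Counter

# Map: each mapper scans only its own split (localized, parallel).
def map_split(split):
    counts = Counter()
    for document in split:
        counts.update(document.split())
    return counts

# Reduce: aggregate every mapper's partial counts (random access,
# but far less data than the raw splits).
def reduce_counts(partials):
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

splits = [["the cat sat"], ["the dog sat", "the dog ran"]]
print(reduce_counts(map_split(s) for s in splits))
# Counter({'the': 3, 'sat': 2, 'dog': 2, 'cat': 1, 'ran': 1})
```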

Slide 17: Baseline Architecture

[Diagram: host processor with four memory controllers and attached memory]

- Mappers and Reducers both execute on the host processor
- Many simple cores are better than a few complex cores
- 2 sockets, 256 GB memory, 260 W processing power budget, 512 ARM cores (EE-Cores) per socket, each core at 876 MHz

Slide 18: NDC Architecture

[Diagram: host processor with four memory controllers; ND Cores sit on the HMC logic layers]

- Mappers execute on ND Cores; Reducers execute on the host processor
- 32 cores per HMC; 2048 total ND Cores and 1024 total EE-Cores; 260 W total processing power budget
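
These numbers are mutually consistent; a quick check (derived arithmetic, with the per-HMC capacity being my inference from the baseline's 256 GB):

```python
nd_cores, cores_per_hmc = 2048, 32
hmc_count = nd_cores // cores_per_hmc   # 64 HMC devices across both sockets
gb_per_hmc = 256 / hmc_count            # 4.0 GB per HMC (inferred)
ndc_total = nd_cores + 1024             # ND Cores plus host EE-Cores
baseline_total = 2 * 512                # baseline: 512 EE-Cores per socket
print(hmc_count, gb_per_hmc, ndc_total / baseline_total)   # 64 4.0 3.0
```

Within the same 260 W budget, the simpler ND Cores triple the total core count.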

Slide 19: NDC Memory Hierarchy

[Diagram: memory controllers and HMC vaults]

- Memory latency excludes the delay for link queuing and traversal
- Many row buffer hits
- L1 I and D caches per ND Core
- The vault has space reserved for intermediate outputs and for Mapper/runtime code and data

Slide 20: Methodology

- Three workloads:
  - Range-Aggregate: count occurrences of something
  - Group-By: count occurrences of everything
  - Equi-Join: for two databases, count the pairs that have matching attributes
- Dataset: 1998 World Cup web server logs
- Simulations of individual Mappers and Reducers on EE-Cores with the TRAX simulator
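
The three kernels, sketched over a toy log table (illustrative Python; the record layout is hypothetical, not the World Cup schema):

```python
from collections import Counter

# Hypothetical records: (client_id, url, status).
logs = [(1, "/a", 200), (2, "/b", 404), (1, "/a", 200), (3, "/c", 404)]

# Range-Aggregate: count occurrences of one thing (e.g., 404 responses).
range_agg = sum(1 for _, _, status in logs if status == 404)

# Group-By: count occurrences of everything (e.g., hits per URL).
group_by = Counter(url for _, url, _ in logs)

# Equi-Join: pair rows of two tables whose join attribute matches.
countries = {1: "es", 2: "fr"}   # second table: client_id -> country
equi_join = [(url, countries[cid]) for cid, url, _ in logs
             if cid in countries]

print(range_agg, group_by, equi_join)
```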

Slide 21: Single Thread Performance

[Figure]

Slide 22: Effect of Bandwidth

[Figure]

Slide 23: Exec Time vs. Frequency

[Figure]

Slide 24: Maximizing the Power Budget

[Figure]

Slide 25: Scaling the Core Count

[Figure]

Slide 26: Energy Reduction

[Figure]

Slide 27: Results Summary

- Execution time reductions of 7%-89%
- NDC performance scales better with core count
- Energy reductions of 26%-91%
- No bandwidth limitation
- Lower memory access latency
- Lower bit-transport energy

Slide 28: Intelligent Network of Memories

- How should several HMCs be connected to the processor?
- How should data be placed in these HMCs?

Slide 29: Contributions

- Evaluation of different network topologies; route adaptivity does help
- Page placement to bring popular data to nearby HMCs: percolate-down based on page access counts
- Use of router bypassing under low load
- Use of deep sleep modes for distant HMCs

Slide 30: Topologies

[Figure]

Slide 31: Topologies

[Figure]

Slide 32: Topologies

[Figure: (d) F-Tree, (e) T-Tree]

Slide 33: Network Properties

- Supports 44-64 HMC devices with 2-4 rings
- Adaptive routing (deadlock avoidance based on timers)
- An entire page resides in one ring, but cache lines are striped across the channels
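
A minimal sketch of that placement rule (the page size, line size, and channel count are illustrative assumptions, not the paper's parameters):

```python
PAGE_BYTES = 4096    # assumed page size
LINE_BYTES = 64      # assumed cache-line size
CHANNELS = 4         # illustrative channel count within a ring

def place(addr, page_to_ring):
    """The page picks the ring; the line within the page picks the channel."""
    page = addr // PAGE_BYTES
    line = (addr % PAGE_BYTES) // LINE_BYTES
    return page_to_ring[page], line % CHANNELS   # (ring, channel)

# Consecutive lines of one page hit different channels but the same ring:
print([place(a, {0: 0}) for a in range(0, 256, 64)])
# [(0, 0), (0, 1), (0, 2), (0, 3)]
```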

Slide 34: Percolate-Down Page Placement

- New pages are placed in the nearest ring
- Periodically, inactive pages are demoted to the next ring; the thresholds matter because of queuing delays
- Activity is tracked with the multi-queue algorithm: hierarchical queues where each entry has a timer and an access count; an entry is demoted to a lower queue when its timer expires and promoted to a higher queue when its access count is high (see the sketch below)
- Page migration is off the critical path and striped across many channels; the distant links are under-utilized
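
A minimal sketch of that multi-queue tracker (Python; the queue count, lifetime, and power-of-two promotion thresholds are illustrative assumptions, not the paper's tuned values):

```python
class MultiQueue:
    def __init__(self, levels=4, lifetime=100):
        self.levels, self.lifetime = levels, lifetime
        self.entries = {}   # page -> [queue level, access count, expiry time]
        self.now = 0        # logical clock, ticks once per access

    def access(self, page):
        self.now += 1
        level, count, _ = self.entries.get(page, (0, 0, 0))
        count += 1
        # Promote when the access count crosses the next queue's threshold.
        if level + 1 < self.levels and count >= 2 ** (level + 1):
            level += 1
        self.entries[page] = [level, count, self.now + self.lifetime]

    def percolate_candidates(self):
        """Demote pages whose timer expired; yield those already in queue 0."""
        for page, (level, count, expiry) in list(self.entries.items()):
            if self.now > expiry:
                if level > 0:    # demote to the next-lower queue
                    self.entries[page] = [level - 1, count,
                                          self.now + self.lifetime]
                else:            # inactive: candidate for a farther ring
                    del self.entries[page]
                    yield page
```

Pages yielded by percolate_candidates() are the ones the OS would migrate to the next ring, off the critical path.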

Slide 35: Router Bypassing

- Topologies with more links and adaptive routing (T-Tree) perform better, but the distant links see relatively low load
- The T-Tree requires a complex router, but that router can often be bypassed

Slide 36: Power-Down Modes

- Activity shifts to nearby rings, leaving distant HMCs under-utilized
- The DRAM layers (PD-0) and the SerDes circuits (PD-1) can be powered off
- 26% energy saving for a 5% performance penalty

Slide 37: Methodology

- 128-thread traces of the NAS parallel benchmarks (capacity requirements of nearly 211 GB)
- Detailed simulations of 1-billion-access memory traces; confirmatory page-access simulations of the entire application
- Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit for the HMC logic layer, 3.9 pJ/bit for a 5x5 router
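
These per-bit figures compose into a rough per-access model; the 64-byte access size and the linear hop term are my assumptions, not the paper's model:

```python
DRAM, LOGIC, ROUTER = 3.7, 6.8, 3.9   # pJ/bit, from the slide above

def access_energy_nj(hops, bits=512):
    """One 64-byte access crossing `hops` 5x5 routers (rough model)."""
    return (DRAM + LOGIC + hops * ROUTER) * bits / 1000

print(access_energy_nj(1), access_energy_nj(4))   # ~7.4 nJ vs. ~13.4 nJ
# Each extra hop adds ~2 nJ per access, which is why percolating hot
# pages into nearby rings saves energy.
```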

Slide 38: Results – Normalized Exec Time

[Figure]

- T-Tree with percolate-down reduces execution time by 50%
- 86% of flits bypass the router
- 88% of requests are serviced by Ring-0

Slide 39: Results – Energy

[Figure]

Slide 40: Summary

- Data movement on off-chip memory links must be reduced
- NDC reduces energy and improves performance by overcoming the bandwidth wall
- More work is required to analyze workloads, build software frameworks, analyze thermals, etc.
- iNoM uses OS page placement to minimize hops for popular data and to increase power-down opportunities
- Path diversity is useful, and the router overhead is small

Slide 41: Acknowledgements

Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor, Jeff Jestes, Al Davis, Feifei Li

Group funded by: NSF, HP, Samsung, IBM

Slide 42: Backup Slide

Slide 43: Backup Slide