Presentation Transcript

Slide1

Samira Khan
University of Virginia
Mar 3, 2016

COMPUTER ARCHITECTURE
CS 6354: Main Memory

The content and concepts of this course are adapted from CMU ECE 740.

Slide2

AGENDA

Logistics
Review from last lecture
Main Memory

Slide3

ANONYMOUS FEEDBACK

Course Pace: Okay (11), Fast (1)
Material: Right level (4), Hard (6), Too easy (1)

Slide4

ANONYMOUS FEEDBACK

Workload: Okay (8), Heavy (3)
Comments: Examples, Pictures, Textbook/Reading material, Basics, Exam vs. Project

Slide5

MATERIAL

Undergraduate Computer Architecture Course
Includes more than what we are covering in this course

Watch the lecture videos:
https://www.youtube.com/watch?v=BJ87rZCGWU0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ
https://www.youtube.com/watch?v=zLP_X4wyHbY&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq

Readings:
http://www.ece.cmu.edu/~ece447/s15/doku.php?id=readings

Slide6

TEXTBOOK

Textbooks do not provide the high-level intuition behind the ideas
Many of the ideas are not yet in the textbook
Want to learn the state of the art and its tradeoffs
Want to answer the questions: Why? What was done before? Why is this better? What are the downsides?

... But do consult the textbook if you need ...

Slide7

EXAM VS. PROJECT VS. GRADE

Focus on the project
Exams are just to make sure you understood the material
You want to learn the topics
Grades do not get you a job; acquired skills do

Your project shows:
You know the recent topics
You know the tools
You can implement and evaluate ideas

Slide8

REVIEW: USE OF ASYMMETRY FOR ENERGY EFFICIENCY

Kumar et al., "Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction," MICRO 2003.

Idea: Implement multiple types of cores on chip
Monitor characteristics of the running thread (e.g., sample energy/performance on each core periodically)
Dynamically pick the core that provides the best energy/performance tradeoff for a given phase
Best core → depends on the optimization metric

Slide9

REVIEW: USE OF ASYMMETRY FOR ENERGY EFFICIENCY

Advantages
+ More flexibility in the energy-performance tradeoff
+ Can steer computation to the core that is best suited for it (in terms of energy)

Disadvantages/issues
- Incorrect predictions/sampling → wrong core → reduced performance or increased energy
- Overhead of core switching
- Disadvantages of asymmetric CMP (e.g., designing multiple cores)
- Need phase monitoring and matching algorithms
  - What characteristics should be monitored?
  - Once the characteristics are known, how do you pick the core?

Slide10

REVIEW: SLIPSTREAM PROCESSORS

Goal: use multiple hardware contexts to speed up single-thread execution (implicitly parallelize the program)

Idea: Divide program execution into two threads:
- Advanced thread executes a reduced instruction stream, speculatively
- Redundant thread uses the results, prefetches, and predictions generated by the advanced thread and ensures correctness

Benefit: Execution time of the overall program is reduced
Core idea is similar to many thread-level speculation approaches, except with a reduced instruction stream

Sundaramoorthy et al., "Slipstream Processors: Improving both Performance and Fault Tolerance," ASPLOS 2000.

Slide11

REVIEW: DUAL-CORE EXECUTION

Idea: One thread context speculatively runs ahead on load misses and prefetches data for another thread context

Zhou, "Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window," PACT 2005.

Slide12

DUAL-CORE EXECUTION VS. SLIPSTREAM

Dual-core execution does not
- remove dead instructions
- reuse instruction register results
It uses the "leading" hardware context solely for prefetching and branch prediction

+ Easier to implement, smaller hardware cost and complexity
- The "leading" thread cannot run ahead as much as in slipstream when there are no cache misses
- Not reusing results in the trailing thread can reduce the overall performance benefit

Slide13

HETEROGENEITY (ASYMMETRY) → SPECIALIZATION

Heterogeneity and asymmetry have the same meaning
Contrast with homogeneity and symmetry
Heterogeneity is a very general system design concept (and life concept, as well)

Idea: Instead of making multiple instances of the same "resource" identical (i.e., homogeneous or symmetric), design some instances to be different (i.e., heterogeneous or asymmetric)
Different instances can be optimized to be more efficient in executing different types of workloads or satisfying different requirements/goals
Heterogeneity enables specialization/customization

Slide14

WHY ASYMMETRY IN DESIGN? (I)

Different workloads executing in a system can have different behavior
- Different applications can have different behavior
- Different execution phases of an application can have different behavior
- The same application executing at different times can have different behavior (due to input set changes and dynamic events)
- E.g., locality, predictability of branches, instruction-level parallelism, data dependencies, serial fraction, bottlenecks in the parallel portion, interference characteristics, ...

Systems are designed to satisfy different metrics at the same time
- There is almost never a single goal in design; it depends on the design point
- E.g., performance, energy efficiency, fairness, predictability, reliability, availability, cost, memory capacity, latency, bandwidth, ...

Slide15

WHY ASYMMETRY IN DESIGN? (II)

Problem: Symmetric design is one-size-fits-all
It tries to fit a single-size design to all workloads and metrics

It is very difficult to come up with a single design
- that satisfies all workloads, even for a single metric
- that satisfies all design metrics at the same time

This holds true for different system components, or resources
- Cores, caches, memory, controllers, interconnect, disks, servers, ...
- Algorithms, policies, ...

Slide16

FUTURE

[Figure: future directions, including specialized cores; hybrid memory with logic; and managing data flow across application, processor, and memory]

Slide17

MAIN MEMORY BASICS

Slide18

THE MAIN MEMORY SYSTEM

Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor

Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

[Figure: processor and caches, connected to main memory, connected to storage (SSD/HDD)]

Slide19

MEMORY SYSTEM: A SHARED RESOURCE VIEW

[Figure: the memory system as a resource shared by all cores, from the on-chip hierarchy down to storage]

Slide20

STATE OF THE MAIN MEMORY SYSTEM

Recent technology, architecture, and application trends
- lead to new requirements
- exacerbate old requirements

DRAM and memory controllers, as we know them today, are (or will be) unlikely to satisfy all requirements

Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging

We need to rethink the main memory system
- to fix DRAM issues and enable emerging technologies
- to satisfy all requirements

Slide21

MAJOR TRENDS AFFECTING MAIN MEMORY (I)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern

DRAM technology scaling is ending

Slide22

MAJOR TRENDS AFFECTING MAIN MEMORY (II)

Need for main memory capacity, bandwidth, QoS increasing
- Multi-core: increasing number of cores
- Data-intensive applications: increasing demand/hunger for data
- Consolidation: cloud computing, GPUs, mobile

Main memory energy/power is a key system design concern

DRAM technology scaling is ending

Slide23

EXAMPLE TREND: MANY CORES ON CHIP

Simpler and lower power than a single large core
Large-scale parallelism on chip

AMD Barcelona: 4 cores
Intel Core i7: 8 cores
IBM Cell BE: 8+1 cores
IBM POWER7: 8 cores
Sun Niagara II: 8 cores
Nvidia Fermi: 448 "cores"
Intel SCC: 48 cores, networked
Tilera TILE Gx: 100 cores, networked

Slide24

CONSEQUENCE: THE MEMORY CAPACITY GAP

Core count doubling ~every 2 years
DRAM DIMM capacity doubling ~every 3 years

Memory capacity per core expected to drop by 30% every two years
Trends are worse for memory bandwidth per core!
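A quick back-of-the-envelope check using just the two doubling rates above (the 30% figure on the slide reflects steeper industry projections than this simple extrapolation):

\[
\frac{\text{DIMM capacity growth over 2 years}}{\text{core count growth over 2 years}} = \frac{2^{2/3}}{2} = 2^{-1/3} \approx 0.79
\]

So these two trends alone shrink per-core capacity by roughly 21% every two years; per-core bandwidth trends make the gap even wider.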

Slide25

MAJOR TRENDS AFFECTING MAIN MEMORY (III)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern
- ~40-50% of energy spent in the off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]
- DRAM consumes power even when not used (periodic refresh)

DRAM technology scaling is ending

Slide26

MAJOR TRENDS AFFECTING MAIN MEMORY (IV)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern

DRAM technology scaling is ending
- ITRS projects DRAM will not scale easily below X nm
- Scaling has provided many benefits: higher capacity (density), lower cost, lower energy

Slide27

THE DRAM SCALING PROBLEM

DRAM stores charge in a capacitor (charge-based memory)
- Capacitor must be large enough for reliable sensing
- Access transistor should be large enough for low leakage and high retention time
- Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]

DRAM capacity, cost, and energy/power are hard to scale

Slide28

SOLUTIONS TO THE DRAM SCALING PROBLEM

Two potential solutions
1. Tolerate DRAM (by taking a fresh look at it)
2. Enable emerging memory technologies to eliminate/minimize DRAM

Do both: hybrid memory systems

Slide29

SOLUTION 1: TOLERATE DRAM

Overcome DRAM shortcomings with
- System-DRAM co-design
- Novel DRAM architectures, interfaces, functions
- Better waste management (efficient utilization)

Key issues to tackle
- Reduce refresh energy
- Improve bandwidth and latency
- Reduce waste
- Enable reliability at low cost

Slide30

SOLUTION 2: EMERGING MEMORY TECHNOLOGIES

Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)

Example: Phase Change Memory
- Expected to scale to 9nm (2022 [ITRS])
- Expected to be denser than DRAM: can store multiple bits/cell

But emerging technologies have shortcomings as well
Can they be enabled to replace/augment/surpass DRAM?

Slide31

HYBRID MEMORY SYSTEMS

[Figure: a CPU with both a DRAM controller and a PCM controller. DRAM: fast and durable, but small, leaky, volatile, and high-cost. Phase Change Memory (or Tech. X): large, non-volatile, and low-cost, but slow, wears out, and has high active energy]

Hardware/software manage data allocation and movement to achieve the best of multiple technologies (a toy placement policy is sketched below)
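As an illustration of what "manage data allocation and movement" could look like, here is a minimal epoch-based placement sketch: hot pages are promoted to the small, fast DRAM and cold pages demoted to the large, slow PCM. All names, thresholds, and the migration stubs are hypothetical, not a real system's API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of hybrid DRAM+PCM page placement (illustrative only). */

#define HOT_THRESHOLD 64        /* accesses/epoch that make a page "hot" */

typedef struct {
    uint64_t accesses;          /* accesses seen this epoch */
    bool     in_dram;           /* current placement of the page */
} page_info;

/* Stand-ins for the actual copy-and-remap machinery; promotion may
 * fail (e.g., DRAM full), in which case the page stays in PCM. */
static bool migrate_to_dram(page_info *p) { (void)p; return true; }
static bool migrate_to_pcm (page_info *p) { (void)p; return true; }

/* Run at the end of each monitoring epoch for every page. */
static void place_page(page_info *p)
{
    if (p->accesses >= HOT_THRESHOLD && !p->in_dram)
        p->in_dram = migrate_to_dram(p);    /* promote hot page  */
    else if (p->accesses < HOT_THRESHOLD && p->in_dram)
        p->in_dram = !migrate_to_pcm(p);    /* demote cold page  */
    p->accesses = 0;                        /* reset for next epoch */
}

int main(void)
{
    page_info p = { .accesses = 100, .in_dram = false };
    place_page(&p);                         /* hot page gets promoted */
    return p.in_dram ? 0 : 1;
}
```

Real proposals differ in how accesses are counted (hardware counters vs. OS sampling) and in migration granularity, but the promote-hot/demote-cold structure is the common core.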

Slide32

MAIN MEMORY IN THE SYSTEM

[Figure: chip floorplan with CORE 0-3, each with a private L2 cache (L2 CACHE 0-3), a SHARED L3 CACHE, and a DRAM INTERFACE leading to the DRAM MEMORY CONTROLLER and the DRAM BANKS]

Slide33

IDEAL MEMORY

Zero access time (latency)
Infinite capacity
Zero cost
Infinite bandwidth (to support multiple accesses in parallel)

Slide34

THE PROBLEM

Ideal memory's requirements oppose each other

Bigger is slower
- Bigger → takes longer to determine the location

Faster is more expensive
- Memory technology: SRAM vs. DRAM

Higher bandwidth is more expensive
- Need more banks, more ports, higher frequency, or faster technology

Slide35

MEMORY TECHNOLOGY: DRAM

Dynamic random access memory
Capacitor charge state indicates the stored value
- Whether the capacitor is charged or discharged indicates storage of 1 or 0
- 1 capacitor
- 1 access transistor

Capacitor leaks through the RC path
- DRAM cell loses charge over time
- DRAM cell needs to be refreshed

[Figure: 1T-1C DRAM cell, with the row enable line gating the access transistor onto the _bitline]
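A rough sense of what refresh costs, using assumed, DDR-typical parameters (every row refreshed within a 64 ms retention window, 8192 rows per bank, roughly 300 ns of bank-busy time per row refresh):

\[
\text{refresh busy fraction} \approx \frac{8192 \times 300\,\text{ns}}{64\,\text{ms}} \approx 3.8\%
\]

So even an idle DRAM bank spends a few percent of its time (and a corresponding energy budget) on refresh, which is why "reduce refresh energy" appears among the key issues above.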

Slide36

MEMORY TECHNOLOGY: SRAM

Static random access memory
Two cross-coupled inverters store a single bit
Feedback path enables the stored value to persist in the "cell"
4 transistors for storage
2 transistors for access

[Figure: 6T SRAM cell, with the row select line gating the access transistors onto the bitline and _bitline]

Slide37

AN ASIDE: PHASE CHANGE MEMORY

Phase change material (chalcogenide glass) exists in two states:
- Amorphous: low optical reflectivity and high electrical resistivity
- Crystalline: high optical reflectivity and low electrical resistivity

PCM is resistive memory: high resistance (0), low resistance (1)

Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.

Slide38

MEMORY BANK: A FUNDAMENTAL CONCEPT

Interleaving (banking)
- Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel
- Goal: reduce the latency of memory array access and enable multiple accesses in parallel
- Idea: divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
  - Each bank is smaller than the entire memory storage
  - Accesses to different banks can be overlapped
- An issue: how do you map data to different banks, i.e., how do you interleave data across banks? (one common answer is sketched below)
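A common answer is low-order interleaving: take the bank index from the low bits of the block address, so consecutive blocks land in different banks and their accesses can overlap. A minimal sketch; the block size and bank count are illustrative assumptions, not values from the slides:

```c
#include <stdint.h>
#include <stdio.h>

/* Low-order (block) interleaving across banks: 64 B blocks,
 * 8 banks. Consecutive blocks map to consecutive banks, so a
 * streaming access pattern keeps all banks busy in parallel. */

#define BLOCK_BITS 6            /* 64-byte blocks */
#define BANK_BITS  3            /* 8 banks        */

static unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr >> BLOCK_BITS) & ((1u << BANK_BITS) - 1));
}

int main(void)
{
    /* Four consecutive 64 B blocks land in banks 0, 1, 2, 3. */
    for (uint64_t a = 0; a < 4 * 64; a += 64)
        printf("addr 0x%03llx -> bank %u\n",
               (unsigned long long)a, bank_of(a));
    return 0;
}
```

Other mappings (e.g., row interleaving, or XOR-based hashes that spread conflict-prone strides) trade row-buffer locality against bank-level parallelism.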

Slide39

MEMORY BANK ORGANIZATION AND OPERATION

Read access sequence:
1. Decode row address & drive word-lines
2. Selected bits drive bit-lines (entire row read)
3. Amplify row data
4. Decode column address & select subset of row (send to output)
5. Precharge bit-lines (for next access)

Slide40

WHY MEMORY HIERARCHY?

We want both fast and large
But we cannot achieve both with a single level of memory

Idea: have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)

Slide41

MEMORY HIERARCHY

Fundamental tradeoff
- Fast memory: small
- Large memory: slow

Idea: memory hierarchy
- Latency, cost, size, bandwidth

[Figure: CPU with register file (RF) and cache, backed by main memory (DRAM) and a hard disk]

Slide42

CACHING BASICS: EXPLOIT TEMPORAL LOCALITY

Idea: store recently accessed data in automatically managed fast memory (called a cache)
Anticipation: the data will be accessed again soon

Temporal locality principle
- Recently accessed data will be accessed again in the near future

This is what Maurice Wilkes had in mind:
Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
"The use is discussed of a fast core memory of, say, 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory."
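Wilkes's "effective access time" claim is easy to quantify with the standard average-memory-access-time identity; the numbers here are assumed for illustration (1-cycle fast level, 100-cycle slow level, 5% miss rate):

\[
\text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}} = 1 + 0.05 \times 100 = 6\ \text{cycles}
\]

With even a modest hit rate, the effective access time sits far closer to the fast memory's latency than to the slow memory's.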

Slide43

CACHING BASICS: EXPLOIT SPATIAL LOCALITY

Idea: store addresses adjacent to the recently accessed one in automatically managed fast memory
- Logically divide memory into equal-size blocks
- Fetch to cache the accessed block in its entirety (the address breakdown this implies is sketched below)

Anticipation: nearby data will be accessed soon

Spatial locality principle
- Nearby data in memory will be accessed in the near future
- E.g., sequential instruction access, array traversal

This is what the IBM 360/85 implemented
- 16 Kbyte cache with 64-byte blocks
- Liptay, "Structural aspects of the System/360 Model 85 II: the cache," IBM Systems Journal, 1968.
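With 64-byte blocks as in the 360/85 example, an address splits into a block offset and a block number; a cache further splits the block number into a set index and a tag. A minimal sketch; the 256-set direct-mapped geometry is an assumption for illustration, not the 360/85's actual organization:

```c
#include <stdint.h>
#include <stdio.h>

/* Address decomposition for a cache with 64 B blocks. */

#define OFFSET_BITS 6                 /* log2(64 B block)   */
#define INDEX_BITS  8                 /* 256 sets (assumed) */

static uint64_t block_offset(uint64_t a) { return a & ((1u << OFFSET_BITS) - 1); }
static uint64_t set_index(uint64_t a)    { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint64_t tag_bits(uint64_t a)     { return a >> (OFFSET_BITS + INDEX_BITS); }

int main(void)
{
    uint64_t a = 0x12345;             /* arbitrary example address */
    printf("offset=%llu index=%llu tag=%llu\n",
           (unsigned long long)block_offset(a),
           (unsigned long long)set_index(a),
           (unsigned long long)tag_bits(a));
    return 0;
}
```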

Slide44

A NOTE ON MANUAL VS. AUTOMATIC MANAGEMENT

Manual: programmer manages data movement across levels
-- too painful for programmers on substantial programs
- "core" vs. "drum" memory in the 50's
- still done in some embedded processors (on-chip scratch pad SRAM in lieu of a cache)

Automatic: hardware manages data movement across levels, transparently to the programmer
++ programmer's life is easier
- simple heuristic: keep most recently used items in cache
- the average programmer doesn't need to know about it
- You don't need to know how big the cache is and how it works to write a "correct" program! (What if you want a "fast" program?)

Slide45

AUTOMATIC MANAGEMENT IN MEMORY HIERARCHY

Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.

"By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again."

Slide46

A MODERN MEMORY HIERARCHY

Register file: 32 words, sub-nsec (managed manually/by the compiler, via register spilling)
L1 cache: ~32 KB, ~nsec (automatic HW cache management)
L2 cache: 512 KB ~ 1 MB, many nsec
L3 cache: .....
Main memory (DRAM): GB, ~100 nsec (automatic demand paging below this level)
Swap disk: 100 GB, ~10 msec

Together, these levels implement the memory abstraction seen by the program.

Slide47

THE DRAM SUBSYSTEM

Slide48

DRAM SUBSYSTEM ORGANIZATION

Channel → DIMM → Rank → Chip → Bank → Row/Column

(A sketch of the address decoding this hierarchy implies follows.)
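One way to make this organization concrete is the physical-address decoding a memory controller performs. The field order and widths below are illustrative assumptions (real controllers differ); note that the chips within a rank operate in lockstep, so they do not consume address bits:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical physical-address breakdown across the DRAM
 * subsystem hierarchy: channel / rank / bank / row / column. */

typedef struct { unsigned channel, rank, bank, row, column; } dram_addr;

static dram_addr decode(uint64_t a)
{
    dram_addr d;
    d.column  = (unsigned)(a & 0x3FF); a >>= 10;  /* 1024 columns */
    d.bank    = (unsigned)(a & 0x7);   a >>= 3;   /* 8 banks      */
    d.rank    = (unsigned)(a & 0x1);   a >>= 1;   /* 2 ranks      */
    d.channel = (unsigned)(a & 0x1);   a >>= 1;   /* 2 channels   */
    d.row     = (unsigned)a;                      /* remaining bits */
    return d;
}

int main(void)
{
    dram_addr d = decode(0x12345678);
    printf("ch=%u rank=%u bank=%u row=%u col=%u\n",
           d.channel, d.rank, d.bank, d.row, d.column);
    return 0;
}
```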

Slide49

PAGE MODE DRAM

A DRAM bank is a 2D array of cells: rows x columns
A "DRAM row" is also called a "DRAM page"
"Sense amplifiers" are also called the "row buffer"
Each address is a <row, column> pair

Access to a "closed row"
- Activate command opens the row (places it into the row buffer)
- Read/write command reads/writes a column in the row buffer
- Precharge command closes the row and prepares the bank for the next access

Access to an "open row"
- No need for an activate command

(The resulting command sequences are sketched in code after the next slide.)

Slide50

DRAM BANK OPERATION

[Figure: a bank as a 2D array (rows x columns) with a row decoder, a row buffer, and a column mux. Access trace: (Row 0, Column 0) activates Row 0 into the initially empty row buffer; (Row 0, Column 1) and (Row 0, Column 85) are row-buffer HITs; (Row 1, Column 0) is a CONFLICT, so Row 0 must be closed before Row 1 is activated]
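A minimal sketch of the per-bank state machine behind the trace above, using the page-mode commands from the previous slide; the function names and printed trace are hypothetical, not a real controller's interface:

```c
#include <stdio.h>

/* Per-bank row-buffer state machine: a row-buffer hit needs only
 * a READ; a closed row needs ACTIVATE + READ; a conflict needs
 * PRECHARGE + ACTIVATE + READ. */

#define NO_ROW -1

typedef struct { int open_row; } bank_t;

/* Returns the number of DRAM commands the access required. */
static int access(bank_t *b, int row, int col)
{
    if (b->open_row == row) {
        printf("READ      row %d col %d (row-buffer hit)\n", row, col);
        return 1;
    }
    int cmds = 2;                       /* ACTIVATE + READ */
    if (b->open_row != NO_ROW) {
        printf("PRECHARGE row %d (conflict)\n", b->open_row);
        cmds = 3;                       /* + PRECHARGE */
    }
    printf("ACTIVATE  row %d\nREAD      row %d col %d\n", row, row, col);
    b->open_row = row;
    return cmds;
}

int main(void)
{
    bank_t bank = { NO_ROW };
    access(&bank, 0, 0);    /* empty buffer: ACTIVATE + READ          */
    access(&bank, 0, 1);    /* row-buffer HIT: READ only              */
    access(&bank, 0, 85);   /* row-buffer HIT: READ only              */
    access(&bank, 1, 0);    /* CONFLICT: PRECHARGE + ACTIVATE + READ  */
    return 0;
}
```

Running it reproduces the slide's pattern: one closed-row access, two row-buffer hits, then a conflict that costs three commands instead of one.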

Slide51

THE DRAM CHIP

Consists of multiple banks (2-16 in Synchronous DRAM)
Banks share command/address/data buses
The chip itself has a narrow interface (4-16 bits per read)

Slide52

128M x 8-BIT DRAM CHIP

[Figure: block diagram of a 128M x 8-bit DRAM chip]
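A sanity check on the geometry (the bank/row/column split is an illustrative assumption; real parts vary): 128M eight-bit locations require 27 address bits, which could divide as

\[
2^{27} = \underbrace{2^{3}}_{8\ \text{banks}} \times \underbrace{2^{14}}_{16384\ \text{rows}} \times \underbrace{2^{10}}_{1024\ \text{columns}}
\]

In practice the row and column halves of the address share the same pins and arrive in two phases (RAS, then CAS), which is part of how the chip keeps its interface narrow.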

Slide53

Samira Khan
University of Virginia
Mar 3, 2016

COMPUTER ARCHITECTURE
CS 6354: Main Memory

The content and concepts of this course are adapted from CMU ECE 740.