COMPUTER ARCHITECTURE
CS 6354: Main Memory

Samira Khan, University of Virginia
Mar 3, 2016

The content and concept of this course are adapted from CMU ECE 740
AGENDA
Logistics
Review from last lecture
Main Memory
ANONYMOUS FEEDBACK
Course pace: Okay (11), Fast (1)
Material: Right level (4), Hard (6), Too Easy (1)
ANONYMOUS FEEDBACK
Workload: Okay (8), Heavy (3)
Comments: Examples, Pictures, Textbook/Reading material, Basics, Exam vs. Project
MATERIAL
Undergraduate Computer Architecture Course
Includes more than what we are covering in this course
Watch the lecture videos:
https://www.youtube.com/watch?v=BJ87rZCGWU0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ
https://www.youtube.com/watch?v=zLP_X4wyHbY&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq
Readings:
http://www.ece.cmu.edu/~ece447/s15/doku.php?id=readings
TEXTBOOK
Textbooks do not provide the high-level intuition behind the ideas
Many of the ideas are not yet in the textbook
We want to learn the state of the art and its tradeoffs
We want to answer the questions:
Why? What was done before? Why is this better? What are the downsides?
... but do consult the textbook if you need to ...
EXAM VS. PROJECT VS. GRADE
Focus on the project
Exams are just to make sure you understood the material
You want to learn the topics
Grades do not get you a job; acquired skill does
Your project shows:
You know the recent topics
You know the tools
You can implement and evaluate ideas
REVIEW: USE OF ASYMMETRY FOR ENERGY EFFICIENCY
Kumar et al., "Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction," MICRO 2003.
Idea: Implement multiple types of cores on chip
Monitor characteristics of the running thread (e.g., sample energy/performance on each core periodically)
Dynamically pick the core that provides the best energy/performance tradeoff for a given phase (a sketch of this policy follows below)
The "best core" depends on the optimization metric
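To make the policy concrete, here is a minimal C sketch of the sample-then-switch loop: run the thread briefly on each core type, record energy and time, and pick the core with the lowest energy-delay product. The struct, sample numbers, and the EDP metric are illustrative assumptions; Kumar et al. evaluate several metrics, and a real system would read hardware energy/performance counters.

```c
/* Sketch of sampling-based core selection, in the spirit of
 * Kumar et al. (MICRO 2003). Numbers and names are illustrative;
 * a real system would read performance counters and energy sensors. */
#include <stdio.h>

struct sample {
    const char *core;   /* core type the thread was sampled on */
    double energy_j;    /* energy consumed during the sample (J) */
    double time_s;      /* time taken by the sample (s) */
};

/* Pick the core minimizing energy-delay product (one possible metric). */
static const char *best_core(const struct sample *s, int n)
{
    const char *best = s[0].core;
    double best_edp = s[0].energy_j * s[0].time_s;
    for (int i = 1; i < n; i++) {
        double edp = s[i].energy_j * s[i].time_s;
        if (edp < best_edp) {
            best_edp = edp;
            best = s[i].core;
        }
    }
    return best;
}

int main(void)
{
    /* Illustrative samples: the big core is faster but costs more energy. */
    struct sample s[] = {
        { "big-OoO",    2.0, 0.010 },
        { "little-InO", 0.6, 0.025 },
    };
    printf("switch to %s for this phase\n", best_core(s, 2));
    return 0;
}
```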
REVIEW: USE OF ASYMMETRY FOR ENERGY EFFICIENCY
Advantages
+ More flexibility in the energy-performance tradeoff
+ Can steer a computation to the core that is best suited for it (in terms of energy)
Disadvantages/issues
- Incorrect predictions/sampling -> wrong core -> reduced performance or increased energy
- Overhead of core switching
- Disadvantages of asymmetric CMP (e.g., designing multiple cores)
- Need phase monitoring and matching algorithms
  - What characteristics should be monitored?
  - Once the characteristics are known, how do you pick the core?
REVIEW: SLIPSTREAM PROCESSORS
Goal: Use multiple hardware contexts to speed up single-thread execution (implicitly parallelize the program)
Idea: Divide program execution into two threads:
The advanced thread executes a reduced instruction stream, speculatively
The redundant thread uses the results, prefetches, and predictions generated by the advanced thread and ensures correctness
Benefit: Execution time of the overall program is reduced
The core idea is similar to many thread-level speculation approaches, except with a reduced instruction stream
Sundaramoorthy et al., "Slipstream Processors: Improving both Performance and Fault Tolerance," ASPLOS 2000.
REVIEW: DUAL CORE EXECUTION
Idea: One thread context speculatively runs ahead on load misses and prefetches data for another thread context
Zhou, "Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window," PACT 2005.
DUAL CORE EXECUTION VS. SLIPSTREAM
Dual-core execution does not:
remove dead instructions
reuse instruction register results
It uses the "leading" hardware context solely for prefetching and branch prediction
+ Easier to implement; smaller hardware cost and complexity
- The "leading thread" cannot run ahead as much as in slipstream when there are no cache misses
- Not reusing results in the "trailing thread" can reduce the overall performance benefit
HETEROGENEITY (ASYMMETRY) -> SPECIALIZATION
Heterogeneity and asymmetry have the same meaning
Contrast with homogeneity and symmetry
Heterogeneity is a very general system design concept (and life concept, as well)
Idea: Instead of requiring multiple instances of the same "resource" to be the same (i.e., homogeneous or symmetric), design some instances to be different (i.e., heterogeneous or asymmetric)
Different instances can be optimized to be more efficient in executing different types of workloads or satisfying different requirements/goals
Heterogeneity enables specialization/customization
WHY ASYMMETRY IN DESIGN? (I)
Different workloads executing in a system can have different behavior
Different applications can have different behavior
Different execution phases of an application can have different behavior
The same application executing at different times can have different behavior (due to input set changes and dynamic events)
E.g., locality, predictability of branches, instruction-level parallelism, data dependencies, serial fraction, bottlenecks in the parallel portion, interference characteristics, ...
Systems are designed to satisfy different metrics at the same time
There is almost never a single goal in design; it depends on the design point
E.g., performance, energy efficiency, fairness, predictability, reliability, availability, cost, memory capacity, latency, bandwidth, ...
WHY ASYMMETRY IN DESIGN? (II)
Problem: A symmetric design is one-size-fits-all
It tries to fit a single-size design to all workloads and metrics
It is very difficult to come up with a single design
that satisfies all workloads, even for a single metric
that satisfies all design metrics at the same time
This holds true for different system components, or resources
Cores, caches, memory, controllers, interconnect, disks, servers, ...
Algorithms, policies, ...
FUTURE
[Figure: specialized cores and hybrid memory with logic, managing data flow across application, processor, and memory]
MAIN MEMORY BASICS
THE MAIN MEMORY SYSTEM
Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
The main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits
[Figure: processor and caches <-> main memory <-> storage (SSD/HDD)]
MEMORY SYSTEM: A SHARED RESOURCE VIEW
[Figure: the memory system, from processor caches down to storage, viewed as a resource shared by all cores]
STATE OF THE MAIN MEMORY SYSTEM
Recent technology, architecture, and application trends
lead to new requirements
exacerbate old requirements
DRAM and memory controllers, as we know them today, are (will be) unlikely to satisfy all requirements
Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging
We need to rethink the main memory system
to fix DRAM issues and enable emerging technologies
to satisfy all requirements
MAJOR TRENDS AFFECTING MAIN MEMORY (I)
Need for main memory capacity, bandwidth, QoS increasing
Main memory energy/power is a key system design concern
DRAM technology scaling is ending
MAJOR TRENDS AFFECTING MAIN MEMORY (II)
Need for main memory capacity, bandwidth, QoS increasing
Multi-core: increasing number of cores
Data-intensive applications: increasing demand/hunger for data
Consolidation: cloud computing, GPUs, mobile
Main memory energy/power is a key system design concern
DRAM technology scaling is ending
EXAMPLE TREND: MANY CORES ON CHIP
Simpler and lower power than a single large core
Large-scale parallelism on chip
IBM Cell BE: 8+1 cores
Intel Core i7: 8 cores
Tilera TILE Gx: 100 cores, networked
IBM POWER7: 8 cores
Intel SCC: 48 cores, networked
Nvidia Fermi: 448 "cores"
AMD Barcelona: 4 cores
Sun Niagara II: 8 cores
CONSEQUENCE: THE MEMORY CAPACITY GAP
Memory capacity per core is expected to drop by 30% every two years
Trends are worse for memory bandwidth per core!
Core count: doubling ~every 2 years
DRAM DIMM capacity: doubling ~every 3 years
MAJOR TRENDS AFFECTING MAIN MEMORY (III)
Need for main memory capacity, bandwidth, QoS increasing
Main memory energy/power is a key system design concern
~40-50% of system energy is spent in the off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]
DRAM consumes power even when not used (periodic refresh)
DRAM technology scaling is ending
MAJOR TRENDS AFFECTING MAIN MEMORY (IV)
Need for main memory capacity, bandwidth, QoS increasing
Main memory energy/power is a key system design concern
DRAM technology scaling is ending
ITRS projects DRAM will not scale easily below X nm
Scaling has provided many benefits: higher capacity (density), lower cost, lower energy
THE DRAM SCALING PROBLEM
DRAM stores charge in a capacitor (charge-based memory)
The capacitor must be large enough for reliable sensing
The access transistor should be large enough for low leakage and high retention time
Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]
DRAM capacity, cost, and energy/power are hard to scale
SOLUTIONS TO THE DRAM SCALING PROBLEM
Two potential solutions
Tolerate DRAM (by taking a fresh look at it)
Enable emerging memory technologies to eliminate/minimize DRAM
Do both
Hybrid memory systems
SOLUTION 1: TOLERATE DRAM
Overcome DRAM shortcomings with
System-DRAM co-design
Novel DRAM architectures, interfaces, functions
Better waste management (efficient utilization)
Key issues to tackle
Reduce refresh energy
Improve bandwidth and latency
Reduce waste
Enable reliability at low cost
SOLUTION 2: EMERGING MEMORY TECHNOLOGIES
Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
Example: Phase Change Memory
Expected to scale to 9nm (2022 [ITRS])
Expected to be denser than DRAM: can store multiple bits/cell
But emerging technologies have shortcomings as well
Can they be enabled to replace/augment/surpass DRAM?
HYBRID MEMORY SYSTEMS
[Figure: CPU with a DRAM controller driving DRAM and a PCM controller driving Phase Change Memory (or Tech. X)]
DRAM: fast and durable, but small, leaky, volatile, and high-cost
PCM: large, non-volatile, and low-cost, but slow, wears out, and has high active energy
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
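To make the management idea concrete, here is a minimal C sketch of one possible policy: count accesses per page and migrate hot pages into DRAM, leaving cold pages in PCM. The threshold, page table, and migration step are illustrative assumptions, not a specific published design.

```c
/* Sketch of hot/cold page placement for a DRAM+PCM hybrid memory.
 * All structures, thresholds, and helpers are illustrative. */
#include <stdio.h>

#define NUM_PAGES     8
#define HOT_THRESHOLD 4   /* accesses before a page is considered hot */

enum medium { IN_PCM, IN_DRAM };

struct page_info {
    enum medium where;
    unsigned access_count;
};

static struct page_info pages[NUM_PAGES];  /* all pages start in PCM */

static void access_page(int p)
{
    pages[p].access_count++;
    /* Migrate frequently used (hot) pages into the fast DRAM. */
    if (pages[p].where == IN_PCM && pages[p].access_count >= HOT_THRESHOLD) {
        pages[p].where = IN_DRAM;   /* a real system would copy the data */
        printf("page %d migrated to DRAM\n", p);
    }
}

int main(void)
{
    /* Page 2 is accessed repeatedly and becomes hot; page 5 stays cold. */
    for (int i = 0; i < 5; i++) access_page(2);
    access_page(5);
    for (int p = 0; p < NUM_PAGES; p++)
        printf("page %d: %s, %u accesses\n", p,
               pages[p].where == IN_DRAM ? "DRAM" : "PCM",
               pages[p].access_count);
    return 0;
}
```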
MAIN MEMORY IN THE SYSTEM
[Figure: chip floorplan with CORE 0-3, per-core L2 CACHE 0-3, a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER and DRAM INTERFACE, connecting to off-chip DRAM BANKS]
IDEAL MEMORY
Zero access time (latency)
Infinite capacity
Zero cost
Infinite bandwidth (to support multiple accesses in parallel)
THE PROBLEM
The ideal memory's requirements oppose each other
Bigger is slower
Bigger -> takes longer to determine the location
Faster is more expensive
Memory technology: SRAM vs. DRAM
Higher bandwidth is more expensive
Need more banks, more ports, higher frequency, or faster technology
MEMORY TECHNOLOGY: DRAM
Dynamic random access memory
Capacitor charge state indicates the stored value
Whether the capacitor is charged or discharged indicates storage of 1 or 0
1 capacitor
1 access transistor
The capacitor leaks through the RC path
The DRAM cell loses charge over time
The DRAM cell needs to be refreshed
[Figure: DRAM cell - an access transistor gated by the row enable line connects the capacitor to the _bitline]
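The refresh requirement has a simple cost model. As a hedged back-of-the-envelope sketch (the 64 ms retention window, row count, and refresh latency are typical DDR-era parameters assumed here, not taken from the slides), the following computes how often a refresh command must be issued and what fraction of time a bank spends refreshing:

```c
/* Back-of-the-envelope DRAM refresh overhead; parameters are
 * typical DDR-era values assumed for illustration. */
#include <stdio.h>

int main(void)
{
    double retention_ms = 64.0;    /* every cell must be refreshed within 64 ms */
    int    rows         = 8192;    /* refresh commands per retention window */
    double trfc_ns      = 350.0;   /* time one refresh command occupies the bank */

    double trefi_us = retention_ms * 1000.0 / rows;      /* refresh interval */
    double overhead = trfc_ns / (trefi_us * 1000.0);     /* busy fraction */

    printf("one refresh every %.2f us\n", trefi_us);     /* ~7.81 us */
    printf("bank busy refreshing %.1f%% of the time\n", overhead * 100.0);
    return 0;
}
```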
MEMORY TECHNOLOGY: SRAM
Static random access memory
Two cross-coupled inverters store a single bit
The feedback path enables the stored value to persist in the "cell"
4 transistors for storage
2 transistors for access
[Figure: 6T SRAM cell - two access transistors gated by the row select line connect the inverter pair to bitline and _bitline]
AN ASIDE: PHASE CHANGE MEMORY
Phase change material (chalcogenide glass) exists in two states:
Amorphous: low optical reflexivity and high electrical resistivity
Crystalline: high optical reflexivity and low electrical resistivity
PCM is a resistive memory: high resistance (0), low resistance (1)
Lee, Ipek, Mutlu, Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA 2009.
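Since PCM encodes bits as resistance, a read is essentially a threshold comparison. A minimal C sketch, with an assumed (purely illustrative) sensing threshold:

```c
/* Reading a PCM cell: the sensed resistance maps to a bit
 * (high resistance = amorphous = 0, low = crystalline = 1).
 * The threshold value is an illustrative assumption. */
#include <stdio.h>

#define R_THRESHOLD_OHMS 100000.0   /* assumed sensing threshold */

static int read_pcm_cell(double resistance_ohms)
{
    return resistance_ohms < R_THRESHOLD_OHMS;  /* low resistance -> 1 */
}

int main(void)
{
    printf("amorphous cell (1 MOhm): %d\n",  read_pcm_cell(1e6)); /* 0 */
    printf("crystalline cell (10 kOhm): %d\n", read_pcm_cell(1e4)); /* 1 */
    return 0;
}
```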
MEMORY BANK: A FUNDAMENTAL CONCEPT
Interleaving (banking)
Problem: A single monolithic memory array takes long to access and does not enable multiple accesses in parallel
Goal: Reduce the latency of memory array access and enable multiple accesses in parallel
Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles)
Each bank is smaller than the entire memory storage
Accesses to different banks can be overlapped
An issue: How do you map data to different banks? (i.e., how do you interleave data across banks? One common mapping is sketched below.)
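One common answer, shown as a hedged C sketch, is low-order interleaving: consecutive memory blocks go to consecutive banks, so a sequential stream spreads across all banks. The block size and bank count are arbitrary example values.

```c
/* Low-order (block) interleaving: consecutive blocks map to
 * consecutive banks, so sequential accesses overlap across banks.
 * Block size and bank count are arbitrary example values. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64   /* interleaving granularity */
#define NUM_BANKS    8

static unsigned bank_of(uint64_t addr)
{
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}

int main(void)
{
    /* A sequential stream touches banks 0, 1, 2, ... in turn. */
    for (uint64_t addr = 0; addr < 10 * BLOCK_BYTES; addr += BLOCK_BYTES)
        printf("addr 0x%04llx -> bank %u\n",
               (unsigned long long)addr, bank_of(addr));
    return 0;
}
```

Interleaving at a coarser granularity (e.g., one row per bank) trades sequential-stream parallelism for more row-buffer locality within a bank.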
MEMORY BANK ORGANIZATION AND OPERATION
Read access sequence:
1. Decode the row address and drive the word-lines
2. Selected bits drive the bit-lines (the entire row is read)
3. Amplify the row data
4. Decode the column address and select a subset of the row (send to output)
5. Precharge the bit-lines (for the next access)
WHY MEMORY HIERARCHY?
We want memory that is both fast and large
But we cannot achieve both with a single level of memory
Idea: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)
MEMORY HIERARCHY
Fundamental tradeoff
Fast memory: small
Large memory: slow
Idea: memory hierarchy
Balance latency, cost, size, and bandwidth
[Figure: CPU and register file, backed by a cache, backed by main memory (DRAM), backed by a hard disk]
CACHING BASICS: EXPLOIT TEMPORAL LOCALITY
Idea: Store recently accessed data in automatically managed fast memory (called a cache)
Anticipation: the data will be accessed again soon
Temporal locality principle
Recently accessed data will be accessed again in the near future
This is what Maurice Wilkes had in mind:
Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
"The use is discussed of a fast core memory of, say 32000 words as a slave to a slower core memory of, say, one million words in such a way that in practical cases the effective access time is nearer that of the fast memory than that of the slow memory."
CACHING BASICS: EXPLOIT SPATIAL LOCALITY
Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory
Logically divide memory into equal-size blocks
Fetch the accessed block into the cache in its entirety
Anticipation: nearby data will be accessed soon
Spatial locality principle
Nearby data in memory will be accessed in the near future
E.g., sequential instruction access, array traversal
This is what the IBM 360/85 implemented
16 Kbyte cache with 64-byte blocks
Liptay, "Structural aspects of the System/360 Model 85 II: the cache," IBM Systems Journal, 1968.
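Both principles show up in a toy simulator. Below is a hedged C sketch of a direct-mapped cache with 64-byte blocks (sizes chosen arbitrarily, echoing the 360/85's block size): a repeated access hits because of temporal locality, and an access to a neighboring address hits because the whole block was fetched.

```c
/* Toy direct-mapped cache: 16 sets of 64-byte blocks (sizes arbitrary).
 * Demonstrates temporal locality (re-access hits) and spatial locality
 * (neighbors in the same block hit after one fetch). */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_SETS    16

struct line { int valid; uint64_t tag; };
static struct line cache[NUM_SETS];

static int cache_access(uint64_t addr)
{
    uint64_t block = addr / BLOCK_BYTES;
    unsigned set   = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;
    if (cache[set].valid && cache[set].tag == tag)
        return 1;                       /* hit */
    cache[set].valid = 1;               /* miss: fetch the whole block */
    cache[set].tag   = tag;
    return 0;
}

int main(void)
{
    printf("access 0x100: %s\n", cache_access(0x100) ? "hit" : "miss"); /* miss */
    printf("access 0x100: %s\n", cache_access(0x100) ? "hit" : "miss"); /* temporal hit */
    printf("access 0x104: %s\n", cache_access(0x104) ? "hit" : "miss"); /* spatial hit */
    return 0;
}
```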
A NOTE ON MANUAL VS. AUTOMATIC MANAGEMENT
Manual: The programmer manages data movement across levels
-- too painful for programmers on substantial programs
"core" vs. "drum" memory in the 1950s
still done in some embedded processors (on-chip scratch-pad SRAM in lieu of a cache)
Automatic: Hardware manages data movement across levels, transparently to the programmer
++ the programmer's life is easier
simple heuristic: keep the most recently used items in the cache
the average programmer doesn't need to know about it
You don't need to know how big the cache is and how it works to write a "correct" program! (What if you want a "fast" program?)
AUTOMATIC MANAGEMENT IN MEMORY HIERARCHY
Wilkes, "Slave Memories and Dynamic Storage Allocation," IEEE Trans. on Electronic Computers, 1965.
"By a slave memory I mean one which automatically accumulates to itself words that come from a slower main memory, and keeps them available for subsequent use without it being necessary for the penalty of main memory access to be incurred again."
A MODERN MEMORY HIERARCHY
Register file: 32 words, sub-nsec -- manual/compiler register spilling
L1 cache: ~32 KB, ~nsec -- automatic HW cache management
L2 cache: 512 KB-1 MB, many nsec -- automatic HW cache management
L3 cache: ... -- automatic HW cache management
Main memory (DRAM): GBs, ~100 nsec -- automatic demand paging to/from swap
Swap disk: 100 GB, ~10 msec
Together, these levels present a single memory abstraction to the program
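The payoff of the hierarchy can be quantified with the standard average memory access time (AMAT) recursion. The C sketch below plugs in latencies of the same order as the hierarchy above, with assumed hit rates (all numbers illustrative):

```c
/* Average memory access time (AMAT) for a two-level cache hierarchy:
 * AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * t_mem).
 * Latencies and miss rates are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double t_l1 = 1.0, t_l2 = 10.0, t_mem = 100.0;  /* ns */
    double m_l1 = 0.10, m_l2 = 0.30;                /* miss rates */

    double amat = t_l1 + m_l1 * (t_l2 + m_l2 * t_mem);
    printf("AMAT = %.1f ns\n", amat);   /* 1 + 0.1*(10 + 0.3*100) = 5.0 ns */
    return 0;
}
```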
THE DRAM SUBSYSTEM
DRAM SUBSYSTEM ORGANIZATION
Channel
DIMM
Rank
Chip
Bank
Row/Column
(A physical address selects one element at each level; a sketch of one possible bit-field mapping follows.)
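As a hedged illustration, the C sketch below carves a physical address into channel/rank/bank/row/column fields. The field widths and their order are assumptions; real memory controllers pick mappings to maximize bank- and channel-level parallelism.

```c
/* One possible (assumed) physical-address-to-DRAM mapping.
 * Layout here, from low bits to high bits:
 * [byte-in-bus][column][bank][rank][channel][row] */
#include <stdio.h>
#include <stdint.h>

struct dram_addr { unsigned row, channel, rank, bank, column; };

static struct dram_addr decompose(uint64_t addr)
{
    struct dram_addr d;
    addr >>= 3;                             /* 8-byte bus: drop byte offset */
    d.column  = addr & 0x3FF; addr >>= 10;  /* 1024 columns per row */
    d.bank    = addr & 0x7;   addr >>= 3;   /* 8 banks */
    d.rank    = addr & 0x1;   addr >>= 1;   /* 2 ranks */
    d.channel = addr & 0x1;   addr >>= 1;   /* 2 channels */
    d.row     = (unsigned)addr;             /* remaining bits: row */
    return d;
}

int main(void)
{
    struct dram_addr d = decompose(0x12345678ULL);
    printf("ch %u rank %u bank %u row %u col %u\n",
           d.channel, d.rank, d.bank, d.row, d.column);
    return 0;
}
```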
PAGE MODE DRAM
A DRAM bank is a 2D array of cells: rows x columns
A "DRAM row" is also called a "DRAM page"
The "sense amplifiers" are also called the "row buffer"
Each address is a <row, column> pair
Access to a "closed row":
The Activate command opens the row (places it into the row buffer)
A Read/Write command reads/writes a column in the row buffer
The Precharge command closes the row and prepares the bank for the next access
Access to an "open row":
No need for an Activate command
DRAM BANK OPERATION
[Figure: a bank as rows x columns, with a row decoder, row buffer, and column mux, walking through an access sequence:]
Access (Row 0, Column 0): row buffer empty -> activate Row 0, then read Column 0
Access (Row 0, Column 1): Row 0 already in the row buffer -> HIT
Access (Row 0, Column 85): Row 0 already in the row buffer -> HIT
Access (Row 1, Column 0): a different row is in the row buffer -> CONFLICT! precharge, activate Row 1, then read Column 0
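A hedged C sketch of this behavior: a one-entry row buffer per bank, classifying each access as an empty-row access, a row hit, or a row conflict. It replays the access sequence above; command timings are omitted.

```c
/* Toy model of one DRAM bank's row buffer, replaying the access
 * sequence from the slide (timings omitted). */
#include <stdio.h>

static int open_row = -1;   /* -1 means the bank is precharged (empty) */

static void access_bank(int row, int col)
{
    if (open_row == row) {
        printf("(%d,%3d): row hit\n", row, col);
    } else if (open_row == -1) {
        printf("(%d,%3d): empty -> ACTIVATE row %d\n", row, col, row);
        open_row = row;
    } else {
        printf("(%d,%3d): conflict -> PRECHARGE, ACTIVATE row %d\n",
               row, col, row);
        open_row = row;
    }
    /* READ/WRITE of the column from the row buffer happens here. */
}

int main(void)
{
    access_bank(0, 0);    /* empty: activate row 0 */
    access_bank(0, 1);    /* hit */
    access_bank(0, 85);   /* hit */
    access_bank(1, 0);    /* conflict: precharge + activate row 1 */
    return 0;
}
```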
THE DRAM CHIP
Consists of multiple banks (2-16 in synchronous DRAM)
Banks share command/address/data buses
The chip itself has a narrow interface (4-16 bits per read)
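The narrow per-chip interface is why chips are ganged together into ranks. As a hedged worked example (the x8 chip width, 64-bit channel, and burst length 8 are typical DDR3-era values assumed here), the C sketch below computes how many chips form a rank and how many bytes one burst delivers:

```c
/* Why narrow chips are ganged into ranks: typical DDR3-era
 * parameters, assumed for illustration. */
#include <stdio.h>

int main(void)
{
    int chip_bits    = 8;    /* x8 chip: 8 data pins */
    int channel_bits = 64;   /* data-bus width of one channel */
    int burst_len    = 8;    /* transfers per read command */

    int chips_per_rank = channel_bits / chip_bits;       /* 8 chips */
    int bytes_per_read = channel_bits / 8 * burst_len;   /* 64 bytes */

    printf("%d x%d chips per rank\n", chips_per_rank, chip_bits);
    printf("one burst delivers %d bytes (one cache block)\n", bytes_per_read);
    return 0;
}
```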
128M x 8-BIT DRAM CHIP
[Figure: block diagram of a 128M x 8-bit DRAM chip]