Main Memory
ECE/CS 752 Fall 2017
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by Jim Smith and Mark Hill
Updated by Mikko Lipasti
Readings
Read on your own:
Review: Shen & Lipasti, Chapter 3
W.-H. Wang, J.-L. Baer, and H. M. Levy, “Organization of a Two-Level Virtual-Real Cache Hierarchy,” Proc. 16th ISCA, pp. 140-148, June 1989 (B6). Online PDF
Read Sec. 1, skim Sec. 2, read Sec. 3: Bruce Jacob, “The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It,” Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1-77, 2009. Online PDF
To be discussed in class:
Review #1 due 11/1/2017: Andreas Sembrant, Erik Hagersten, and David Black-Schaffer, “The Direct-to-Data (D2D) Cache: Navigating the Cache Hierarchy with a Single Lookup,” Proc. ISCA 2014, June 2014. Online PDF
Review #2 due 11/3/2017: Jishen Zhao, Sheng Li, Doe Hyun Yoon, Yuan Xie, and Norman P. Jouppi, “Kiln: Closing the Performance Gap Between Systems With and Without Persistence Support,” Proc. 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), pp. 421-432, 2013. Online PDF
Review #3 due 11/6/2017: T. Sha, M. Martin, and A. Roth, “NoSQ: Store-Load Communication without a Store Queue,” Proc. 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), 2006. Online PDF
Outline: Main Memory
DRAM chips
Memory organization: interleaving, banking
Memory controller design
Hybrid Memory Cube
Phase Change Memory (reading)
Virtual memory
TLBs
Interaction of caches and virtual memory (Wang et al.)
Large pages, virtualization
DRAM Chip Organization
Optimized for density, not speed
Data stored as charge in a capacitor
Discharge on reads => destructive reads
Charge leaks over time: refresh every 64 ms
Cycle time roughly twice the access time
Need to precharge bitlines before access
DRAM Chip Organization
Current-generation DRAM: 8 Gbit at 25 nm
Up to 1600 MHz synchronous interface
Double data rate: data toggles at 2x (3200 MHz), so 3200 MT/s peak (see the bandwidth sketch below)
Address pins are time-multiplexed:
Row address strobe (RAS)
Column address strobe (CAS)
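As a rough sanity check, peak bandwidth follows directly from the transfer rate and the bus width. The sketch below assumes a 64-bit (8-byte) channel, which is typical for a DIMM but not stated on the slide.

```python
# Back-of-the-envelope peak bandwidth, assuming a 64-bit (8-byte) DIMM data bus.
transfers_per_sec = 3200e6   # 3200 MT/s: double data rate on a 1600 MHz interface
bus_bytes = 8                # assumed 64-bit channel width
peak_bw = transfers_per_sec * bus_bytes
print(f"Peak channel bandwidth: {peak_bw / 1e9:.1f} GB/s")   # ~25.6 GB/s
```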
DRAM Chip Organization
New RAS results in:
Bitline precharge
Row decode, sense
Row buffer write (up to 8K)
New CAS:
Read from row buffer
Much faster (3-4x)
Streaming row accesses desirable
Simple Main Memory
Consider these parameters:
10 cycles to send address
60 cycles to access each word
10 cycles to send word back
Miss penalty for a 4-word block: (10 + 60 + 10) x 4 = 320 cycles (see the sketch below)
How can we speed this up?
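The arithmetic above can be captured in a tiny helper; miss_penalty() is a hypothetical function written for these notes, not part of any simulator.

```python
def miss_penalty(addr_cycles, access_cycles, xfer_cycles, words, words_per_access=1):
    """Cycles to fetch a block of `words` words, `words_per_access` words at a time."""
    accesses = -(-words // words_per_access)          # ceiling division
    return (addr_cycles + access_cycles + xfer_cycles) * accesses

# Simple memory: one word per access, 4-word block.
print(miss_penalty(10, 60, 10, words=4))              # (10 + 60 + 10) * 4 = 320
```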
Wider (Parallel) Main Memory
Make memory wider: read out all words in parallel
Memory parameters:
10 cycles to send address
60 cycles to access a doubleword
10 cycles to send it back
Miss penalty for a 4-word block: 2 x (10 + 60 + 10) = 160 cycles (see below)
Costs:
Wider bus
Larger minimum expansion unit (e.g., paired DIMMs)
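The same arithmetic as the sketch after the previous slide, redone for a doubleword-wide memory (two words come back per access, so only two accesses are needed):

```python
# Wider memory: a doubleword (2 words) returns per access, so a 4-word block needs 2 accesses.
addr_cycles, access_cycles, xfer_cycles = 10, 60, 10
accesses = 4 // 2
print((addr_cycles + access_cycles + xfer_cycles) * accesses)   # 2 * (10 + 60 + 10) = 160
```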
Interleaved Main Memory
Each bank has:
Private address lines
Private data lines
Private control lines (read/write)
[Figure: address split into fields — byte in word, word in doubleword, bank, doubleword in bank — selecting among Bank 0, Bank 1, Bank 2, Bank 3]
Break memory into M banks
Word A lives in bank A mod M, at location A div M within that bank (see the sketch below)
Banks can operate concurrently and independently
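A minimal sketch of low-order interleaving, assuming M = 4 banks; bank_and_index() is a made-up helper that just applies the A mod M / A div M mapping:

```python
M = 4   # assumed number of interleaved banks

def bank_and_index(word_addr, banks=M):
    """Low-order interleaving: word A lives in bank A mod M, at location A div M."""
    return word_addr % banks, word_addr // banks

for a in range(8):
    print(a, bank_and_index(a))   # consecutive words rotate across banks 0..3
```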
Interleaved and Parallel Organization
Interleaved Memory Examples
[Timing diagrams: Ai = address to bank i, Ti = data transfer; unit-stride vs. stride-3 access patterns]
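A small illustration of how stride interacts with a 4-bank layout; banks_touched() is hypothetical. Unit stride and stride 3 both rotate through all four banks, while a stride equal to the number of banks sends every access to the same bank, which is one motivation for choosing a prime number of banks (next slide).

```python
def banks_touched(start, stride, count, banks=4):
    """Bank visited by each access of a strided stream (assumed 4 banks)."""
    return [(start + i * stride) % banks for i in range(count)]

print(banks_touched(0, 1, 8))   # unit stride: [0, 1, 2, 3, 0, 1, 2, 3] -> all banks busy
print(banks_touched(0, 3, 8))   # stride 3:    [0, 3, 2, 1, 0, 3, 2, 1] -> still uses every bank
print(banks_touched(0, 4, 8))   # stride 4:    [0, 0, 0, 0, ...]        -> every access conflicts
```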
Interleaved Memory Summary
Parallel memory adequate for sequential accesses
Load cache block: multiple sequential words
Good for writeback caches
Banking useful otherwise
If many banks, choose a prime number
Can also do both:
Within each bank: parallel memory path
Across banks: can support multiple concurrent cache accesses (nonblocking)
DDR SDRAM Control
Raise level of abstraction: commands (see the sketch below)
Activate row: read row into row buffer
Column access: read data from addressed row
Bank precharge: get ready for new row access
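A sketch of the command stream an idealized controller might issue for one read, depending on whether the target row is already open in the bank; commands_for() is illustrative, not actual controller logic.

```python
def commands_for(open_row, target_row, target_col):
    """Commands an idealized controller might issue for one read to a single bank."""
    cmds = []
    if open_row != target_row:
        if open_row is not None:
            cmds.append("PRECHARGE")                    # close the currently open row
        cmds.append(f"ACTIVATE row {target_row}")       # read the row into the row buffer
    cmds.append(f"READ col {target_col}")               # column access from the row buffer
    return cmds

print(commands_for(None, 7, 3))   # idle bank: ['ACTIVATE row 7', 'READ col 3']
print(commands_for(7, 7, 4))      # row-buffer hit: ['READ col 4']
print(commands_for(7, 9, 0))      # row-buffer miss: ['PRECHARGE', 'ACTIVATE row 9', 'READ col 0']
```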
DDR SDRAM Timing
[Timing diagram: read access]
Constructing a Memory System
Combine chips in parallel to increase access width
E.g., 8 8-bit-wide DRAMs for a 64-bit parallel access
DIMM – Dual Inline Memory Module
Combine DIMMs to form multiple ranks
Attach a number of DIMMs to a memory channel
Memory controller manages a channel (or two lock-step channels)
Interleave patterns (decode sketched below):
Rank, Row, Bank, Column, [byte]
Row, Rank, Bank, Column, [byte]
Better dispersion of addresses
Works better with power-of-two ranks
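A sketch of the "Row, Rank, Bank, Column, [byte]" pattern as a bit-field decode; the field widths (8-byte bus, 1024 columns, 8 banks, 2 ranks) are assumptions chosen only for illustration.

```python
def decode(addr):
    """Split an address using the Row, Rank, Bank, Column, [byte] interleave.
    Field widths are assumptions: 8-byte bus, 1024 columns, 8 banks, 2 ranks."""
    byte = addr & 0x7             # [byte]: 3 bits (8-byte data bus)
    col  = (addr >> 3) & 0x3FF    # column: 10 bits
    bank = (addr >> 13) & 0x7     # bank: 3 bits
    rank = (addr >> 16) & 0x1     # rank: 1 bit
    row  = addr >> 17             # row: remaining high-order bits
    return {"row": row, "rank": rank, "bank": bank, "col": col, "byte": byte}

print(decode(0x12345678))
```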
Memory Controller and Channel
Memory Controllers
Contain buffering, in both directions
Schedulers manage resources: channel and banks
Resource Scheduling
An interesting optimization problem
Example (a toy model is sketched below):
Precharge: 3 cycles
Row activate: 3 cycles
Column access: 1 cycle
FR-FCFS: 20 cycles
Strict FIFO: 56 cycles
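A toy single-bank model using the latencies above, showing why first-ready FCFS beats strict FIFO when requests to one row are interleaved with requests to another. The request trace is made up, so the totals illustrate the trend rather than reproducing the 20- vs. 56-cycle figures from the slide.

```python
PRE, ACT, COL = 3, 3, 1          # latencies from the example above

def service_cycles(requests, reorder):
    """Toy single-bank model: each request is just a row number."""
    open_row, total, pending = None, 0, list(requests)
    while pending:
        if reorder:
            # FR-FCFS: prefer the oldest request that hits the currently open row
            req = next((r for r in pending if r == open_row), pending[0])
        else:
            req = pending[0]     # strict FIFO: always the oldest request
        pending.remove(req)
        if req == open_row:
            total += COL                                               # row-buffer hit
        else:
            total += (PRE if open_row is not None else 0) + ACT + COL  # miss: reopen the row
            open_row = req
    return total

trace = [0, 1, 0, 1, 0, 1]       # made-up trace: two rows interleaved
print("strict FIFO:", service_cycles(trace, reorder=False))   # 39 cycles
print("FR-FCFS    :", service_cycles(trace, reorder=True))    # 15 cycles
```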
DDR SDRAM Policies
Goal: try to maximize requests to an open row (page)
Close-row policy:
Always close row; hides precharge penalty
Lost opportunity if next access is to the same row
Open-row policy:
Leave row open
If an access goes to a different row, pay the precharge penalty
Also performance issues related to rank interleaving
Better dispersion of addresses
Memory Scheduling Contest
http://www.cs.utah.edu/~rajeev/jwac12/
Clean, simple infrastructure
Traces provided
Very easy to make fair comparisons
Comes with 6 schedulers
Also targets power-down modes (not just page open/close scheduling)
Three tracks:
Delay (or performance)
Energy-Delay Product (EDP)
Performance-Fairness Product (PFP)
Future: Hybrid Memory Cube
Micron proposal [Pawlowski, Hot Chips 11]
www.hybridmemorycube.org
Hybrid Memory Cube MCM
Micron proposal [Pawlowski, Hot Chips 11]
www.hybridmemorycube.org
Network of DRAM
Traditional DRAM: star topology
HMC: mesh, etc. are feasible
Hybrid Memory Cube
High-speed logic segregated in chip stack
3D TSVs for bandwidth
High Bandwidth Memory (HBM)
High-speed serial links vs. 2.5D silicon interposer
Commercialized; HBM2/HBM3 on the way
[Image credit: Shmuel Csaba Otto Traian]
Future: Resistive memory
PCM: store bit in phase state of material
Alternatives:
Memristor (HP Labs)
STT-MRAM
Nonvolatile
Dense: crosspoint architecture (no access device)
Relatively fast for reads
Very slow for writes (also high power)
Write endurance often limited:
Write leveling (also done for flash)
Avoid redundant writes (read, cmp, write; sketched below)
Fix individual bit errors (write, read, cmp, fix)
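A minimal sketch of the read-compare-write idea for limiting writes; write_line() and its dictionary-backed "memory" are stand-ins for illustration, not a real device interface.

```python
def write_line(memory, addr, new_line):
    """Read-compare-write: skip the write if nothing changed; otherwise only the
    differing bits actually need to be programmed. `memory` is just a dict here."""
    old_line = memory[addr]                     # read
    if old_line == new_line:                    # cmp
        return 0                                # redundant write avoided entirely
    memory[addr] = new_line                     # write
    return bin(old_line ^ new_line).count("1")  # number of bits that really flipped

mem = {0x40: 0b10110010}
print(write_line(mem, 0x40, 0b10110010))   # 0 (redundant write avoided)
print(write_line(mem, 0x40, 0b10110011))   # 1 (a single bit changes)
```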
Main Memory and Virtual Memory
Use of virtual memory:
Main memory becomes another level in the memory hierarchy
Enables programs with address space or working set that exceeds physically available memory
No need for programmer to manage overlays, etc.
Sparse use of large address space is OK
Allows multiple users or programs to timeshare a limited amount of physical memory space and address space
Bottom line: efficient use of an expensive resource, and ease of programming
Virtual Memory
Enables:
Use more memory than the system has
Think your program is the only one running
Don't have to manage address space usage across programs
E.g., think it always starts at address 0x0
Memory protection:
Each program has a private VA space; no one else can clobber it
Better performance:
Start running a large program before all of it has been loaded from disk
Virtual Memory – Placement
Main memory managed in larger blocks
Page size typically 4KB – 16KB
Fully flexible placement; fully associative
Operating system manages placement
Indirection through page table
Maintain mapping between:
Virtual address (seen by programmer)
Physical address (seen by main memory)
Virtual Memory – Placement
Fully associative implies expensive lookup?
In caches, yes: check multiple tags in parallel
In virtual memory, the expensive lookup is avoided by using a level of indirection
Lookup table or hash table
Called a page table
Virtual Memory – Identification
Similar to cache tag array
Page table entry contains VA, PA, dirty bit
Virtual address:
Matches programmer view; based on register values
Can be the same for multiple programs sharing the same system, without conflicts
Physical address:
Invisible to programmer, managed by O/S
Created/deleted on demand basis, can change
Virtual Address | Physical Address | Dirty bit
0x20004000      | 0x2000           | Y/N
Virtual Memory – Replacement
Similar to caches:
FIFO
LRU; overhead too high
Approximated with reference bit checks
"Clock algorithm" intermittently clears all bits (sketched below)
Random
O/S decides, manages (CS 537)
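A minimal sketch of the clock (second-chance) approximation: sweep a "hand" over the frames, clearing reference bits, and evict the first frame whose bit is already clear. Real implementations keep the hand position across invocations; this sketch restarts it for brevity.

```python
def clock_evict(frames):
    """frames: list of {'page': ..., 'ref': 0/1}. Returns the index of the victim frame.
    Sweep the hand, clearing reference bits, until a frame with ref == 0 is found."""
    hand = 0
    while True:
        if frames[hand]["ref"]:
            frames[hand]["ref"] = 0            # recently used: give it a second chance
            hand = (hand + 1) % len(frames)
        else:
            return hand                        # not recently referenced: evict

frames = [{"page": 3, "ref": 1}, {"page": 8, "ref": 0}, {"page": 5, "ref": 1}]
print(clock_evict(frames))                     # -> 1 (page 8: its reference bit was clear)
```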
Virtual Memory – Write Policy
Write back
Disks are too slow to write through
Page table maintains dirty bit
Hardware must set dirty bit on first write
O/S checks dirty bit on eviction
Dirty pages written to backing store
Disk write, 10+ ms
Virtual Memory Implementation
Caches have fixed policies, hardware FSM for control, pipeline stall
VM has very different miss penalties
Remember: disks are 10+ ms!
Hence engineered differently
Page Faults
A virtual memory miss is a page fault
Physical memory location does not exist
Exception is raised, save PC
Invoke O/S page fault handler:
Find a physical page (possibly evict)
Initiate fetch from disk
Switch to other task that is ready to run
Interrupt when disk access complete
Restart original instruction
Why use O/S and not hardware FSM?
Address Translation
O/S and hardware communicate via PTE
How do we find a PTE?
&PTE = PTBR + page number * sizeof(PTE) (sketched below)
PTBR is private for each program
Context switch replaces PTBR contents
VA         | PA     | Dirty | Ref | Protection
0x20004000 | 0x2000 | Y/N   | Y/N | Read/Write/Execute
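A sketch of the lookup implied by the &PTE formula, assuming 4 KB pages and 4-byte PTEs; pte_address() and translate() are hypothetical helpers, and the PPN value 0x2000 mirrors the example entry above.

```python
PAGE_SIZE, PTE_SIZE = 4096, 4       # assumed 4 KB pages and 4-byte PTEs

def pte_address(ptbr, vaddr):
    """&PTE = PTBR + page number * sizeof(PTE)"""
    return ptbr + (vaddr // PAGE_SIZE) * PTE_SIZE

def translate(vaddr, ppn):
    """Concatenate the PPN read from the PTE with the untranslated page offset."""
    return ppn * PAGE_SIZE + vaddr % PAGE_SIZE

print(hex(pte_address(0x100000, 0x20004000)))   # 0x180010: where the PTE lives
print(hex(translate(0x20004000, 0x2000)))       # 0x2000000: PA if the PTE holds PPN 0x2000
```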
Address Translation
[Figure: PTBR plus the virtual page number indexes the page table; the PTE supplies the PA and dirty bit, and the page offset passes through untranslated]
Page Table Size
How big is the page table?
2^32 / 4KB * 4B = 4MB per program
Much worse for 64-bit machines
To make it smaller:
Use limit register(s)
If VA exceeds limit, invoke O/S to grow region
Use a multi-level page table (a two-level walk is sketched below)
Make the page table pageable (use VM)
Multilevel Page Table
[Figure: PTBR plus the first VPN field indexes the first-level table; each entry plus the next VPN field indexes the next-level table; the final entry plus the offset forms the PA]
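A sketch of a two-level walk, assuming a 32-bit VA split into a 10-bit first-level index, a 10-bit second-level index, and a 12-bit offset; walk_two_level() and its read_word callback are made up for illustration.

```python
def walk_two_level(ptbr, vaddr, read_word):
    """Assumed 32-bit VA: 10-bit L1 index | 10-bit L2 index | 12-bit offset.
    read_word(pa) stands in for a memory read that returns a table entry."""
    l1_index = (vaddr >> 22) & 0x3FF
    l2_index = (vaddr >> 12) & 0x3FF
    offset   = vaddr & 0xFFF
    l2_base  = read_word(ptbr + l1_index * 4)     # level-1 entry points at a level-2 table
    ppn      = read_word(l2_base + l2_index * 4)  # level-2 entry holds the PPN
    return (ppn << 12) | offset

# Tiny fake memory holding one entry in each level:
tables = {0x1000 + 0x080 * 4: 0x2000,             # L1[0x080] -> L2 table at 0x2000
          0x2000 + 0x004 * 4: 0xABC}              # L2[0x004] -> PPN 0xABC
print(hex(walk_two_level(0x1000, 0x20004000, tables.__getitem__)))   # -> 0xabc000
```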
Hashed Page Table
Use a hash table or inverted page table
PT contains an entry for each real (physical) page
Instead of an entry for every virtual page
Entry is found by hashing the VA (a lookup is sketched below)
Oversize the PT to reduce collisions: #PTE = 4 x (#physical pages)
Hashed Page Table
[Figure: PTBR plus a hash of the virtual page number selects a group of entries (PTE0–PTE3); tags are compared to find the match, and the offset passes through unchanged]
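A sketch of a hashed page table lookup with chained collision handling; the entry format and the (ASID, VPN) hash are assumptions, while the 4x sizing follows the slide's rule of thumb.

```python
NUM_PHYS_PAGES = 1024
table = [None] * (4 * NUM_PHYS_PAGES)      # oversize: ~4 entries per physical page

def hashed_lookup(table, vpn, asid=0):
    """Hash the (ASID, VPN) pair to pick a slot, then follow the collision chain."""
    entry = table[hash((asid, vpn)) % len(table)]
    while entry is not None:
        if entry["vpn"] == vpn and entry["asid"] == asid:
            return entry["ppn"]            # hit
        entry = entry["next"]              # collision: try the next entry in the chain
    return None                            # miss -> page fault / software handler

slot = hash((0, 0x20004)) % len(table)
table[slot] = {"asid": 0, "vpn": 0x20004, "ppn": 0x2000, "next": None}
print(hex(hashed_lookup(table, 0x20004)))  # -> 0x2000
```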
High-Performance VM
VA translation:
Additional memory reference to PTE
Each instruction fetch/load/store now takes 2 memory references
Or more, with a multilevel table or hash collisions
Even if PTEs are cached, still slow
Hence, use a special-purpose cache for PTEs
Called a TLB (translation lookaside buffer)
Caches PTE entries
Exploits temporal and spatial locality (just a cache)
Translation Lookaside Buffer
Set-associative (a) or fully associative (b); both widely employed (a set-associative lookup is sketched below)
[Figure: VPN split into index and tag for the set-associative case]
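A sketch of a set-associative TLB lookup, assuming a 64-entry, 4-way (16-set) TLB: the low VPN bits pick the set (index) and the remaining bits are compared as the tag. tlb_lookup() and the entry format are made up for illustration.

```python
SETS, WAYS = 16, 4                          # assumed 64-entry, 4-way TLB

def tlb_lookup(tlb, vpn):
    """Low VPN bits index the set; the remaining bits are compared as the tag."""
    index, tag = vpn % SETS, vpn // SETS
    for entry in tlb[index]:                # check every way in the set
        if entry is not None and entry["tag"] == tag:
            return entry["ppn"]             # TLB hit
    return None                             # TLB miss -> walk the page table

tlb = [[None] * WAYS for _ in range(SETS)]
tlb[0x20004 % SETS][0] = {"tag": 0x20004 // SETS, "ppn": 0x2000}
print(hex(tlb_lookup(tlb, 0x20004)))        # -> 0x2000
```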
Interaction of TLB and Cache
Serial lookup: first TLB, then D-cache
Excessive cycle time
Virtually Indexed Physically Tagged L1
Parallel lookup of TLB and cache
Faster cycle time
Index bits must be untranslated
Restricts size of an n-way associative cache to n x (virtual page size)
E.g., a 4-way SA cache with 4KB pages: max size is 16KB
Virtual Memory Protection
Each process/program has a private virtual address space
Automatically protected from rogue programs
Sharing is possible, necessary, desirable
Avoid copying, staleness issues, etc.
Sharing in a controlled manner:
Grant specific permissions
Read
Write
Execute
Any combination
Protection
Process model:
Privileged kernel
Independent user processes
Privileges vs. policy:
Architecture provides primitives
OS implements policy
Problems arise when h/w implements policy
Separate policy from mechanism!
Protection Primitives
User vs. kernel:
At least one privileged mode
Usually implemented as mode bits
How do we switch to kernel mode?
Protected "gates" or system calls
Change mode and continue at pre-determined address
Hardware compares mode bits to access rights
Only access certain resources in kernel mode
E.g., modify page mappings
Protection Primitives
Base and bounds:
Privileged registers, base <= address <= bounds
Segmentation:
Multiple base and bounds registers
Protection bits for each segment
Page-level protection (most widely used):
Protection bits in page table entry
Cache them in TLB for speed
VM Sharing
Share memory locations by:
Mapping a shared physical location into both address spaces
E.g., PA 0xC00DA becomes:
VA 0x2D000DA for process 0
VA 0x4D000DA for process 1
Either process can read/write the shared location
However, this causes the synonym problem
VM Homonyms
Process-private address spaces
The same VA can map to multiple PAs:
E.g., VA 0xC00DA becomes:
PA 0x2D000DA for process 0
PA 0x4D000DA for process 1
Either process can install the line into the cache
However, this causes the homonym problem
Virtually-Addressed Caches
Virtually-addressed caches are desirable
No need to translate VA to PA before cache lookup
Faster hit time; translate only on misses
However, VA homonyms & synonyms cause problems
Can end up with homonym blocks in the cache
Can end up with two copies of the same physical line
Causes coherence problems [Wang et al. reading]
Solutions to homonyms:
Flush caches/TLBs on context switch
Extend cache tags to include PID or ASID
Effectively a shared VA space (PID becomes part of the address)
Enforce a global shared VA space (PowerPC)
Requires another level of addressing (EA -> VA -> PA)
Solutions to synonyms:
Prevent multiple copies through reverse address translation
Or, keep pointers in the PA L2 cache [Wang et al.]
Additional issues
Large page support
Most ISAs support 4K/1M/1G
Page table & TLB designs must support them
Renewed interest in segments as an alternative
Recent work from Multifacet [Basu thesis, 2013] [Gandhi thesis, 2016]
Can be complementary to paging
Multiple levels of translation in virtualized systems
Virtual machines run unmodified OS
Each OS manages translations, page tables
Hypervisor manages translations across VMs
Hardware still has to provide efficient translation
Summary: Main Memory
DRAM chips
Memory organization: interleaving, banking
Memory controller design
Hybrid Memory Cube
Phase Change Memory (reading)
Virtual memory
TLBs
Interaction of caches and virtual memory (Wang et al.)
Large pages, virtualization