
Memory & Cache

Memories: Review

Memory is required for storing:
- Data
- Instructions

Different memory types:
- Dynamic RAM (DRAM)
- Static RAM (SRAM)
- Read-only memory (ROM)

Characteristics: access time, price, volatility.

Principle of Locality

Users want indefinitely large memory with fast access to the data items in it.
Principle of locality:
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
To take advantage of the principle of locality, the memory of a computer is implemented as a memory hierarchy.

Comparing Memories

Memory technology | Typical access time | $ per GB in 2004
SRAM              | 0.5-5 ns            | $4000-$10000
DRAM              | 50-70 ns            | $100-$200
Magnetic disk     | 5-20 million ns     | $0.50-$2

Memory Hierarchy

[Diagram: memory levels arranged from the CPU outward. Moving away from the CPU, speed goes from fastest to slowest, size from smallest to largest, and cost ($/bit) from highest to lowest.]

Organization of the Hierarchy

Data in a memory level closer to the processor is a subset of the data in any level further away. All the data is stored in the lowest level.

Access to the Data

Data transfer takes place between two adjacent levels. The minimum unit of information transferred is called a block.
If the data requested by the processor appears in some block in the upper level, this is called a hit; otherwise a miss occurs.
Hit rate (or hit ratio) is the fraction of memory accesses found in the upper level; it is used to measure the performance of the memory hierarchy.
Miss rate is the fraction of memory accesses not found in the upper memory level (= 1 - hit rate).

Hit & Miss

Hit time is the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss.
Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor.
Hit time is much smaller than the miss penalty:
- Read from a register: one cycle
- Read from the 1st-level cache: one to two cycles
- Read from the 2nd-level cache: four to five cycles
- Read from main memory: 20-100 cycles

Memory Pyramid

[Diagram: pyramid of levels 1 through n below the CPU; the size of the memory at each level grows, and access time increases, with distance from the CPU.]

Taking Advantage of Locality

Temporal locality: keep the most recently accessed items closer to the processor, usually in a fast memory called a cache.
Spatial locality: move blocks consisting of multiple contiguous words to the upper levels of the hierarchy.

The Basics of Cache

Cache is a term used for any storage that takes advantage of locality of access. In general, it is the fast memory between the CPU and main memory. Caches first appeared in machines in the early 1960s; virtually every general-purpose machine built today, from the fastest to the slowest, includes one.
[Diagram: CPU <-> Cache <-> Main memory]

Cache Example

Before the reference to Xn, the cache holds X1, X2, ..., Xn-1. The access to word Xn is a miss, so Xn is brought from memory into the cache; after the reference, the cache holds X1, X2, ..., Xn-1 and Xn.

Direct-Mapped Cache

Two issues are involved:
- How do we know if a data item is in the cache?
- If it is, how do we find it?
In a direct-mapped cache, each memory location is mapped to exactly one location in the cache, so many items at the lower level share locations in the cache. The mapping is simple:
(block address) mod (number of blocks in the cache)

Direct-Mapped Cache

[Diagram: an 8-block direct-mapped cache with indices 000-111. Memory addresses 00001, 01001, 10001, 11001 all map to cache index 001, and addresses 00101, 01101, 10101, 11101 all map to cache index 101.]

Fields in the Cache

If the number of blocks in the cache is a power of two, the lower log2(cache size in blocks) bits of the address are used as the cache index. The remaining upper bits are used as a tag, to identify whether the requested block is in the cache:
memory address = tag || cache index
A valid bit indicates whether a location in the cache contains a valid entry (e.g., at startup, no entry is valid).
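To make the tag/index split concrete, here is a minimal Python sketch (illustrative, not from the slides; the function name is made up) that maps a block address to its cache index and tag in a direct-mapped cache:

    def split_address(block_address, num_blocks):
        # Direct-mapped lookup fields: the lower log2(num_blocks) bits of the
        # block address form the cache index; the remaining upper bits are the tag.
        index_bits = num_blocks.bit_length() - 1   # log2, num_blocks a power of two
        index = block_address % num_blocks
        tag = block_address >> index_bits
        return tag, index

    # Block address 10110 (binary) in an 8-block cache, as on the example slides:
    tag, index = split_address(0b10110, 8)
    print(bin(tag), bin(index))   # 0b10 0b110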

Ex: 8-Word Direct-Mapped Cache

A sequence of memory references to an 8-block direct-mapped cache:

Address | Hit or miss | Tag | Assigned cache block
10110   | Miss        | 10  | 110
11010   | Miss        | 11  | 010
10110   | Hit         | 10  | 110
11010   | Hit         | 11  | 010
10000   | Miss        | 10  | 000
00011   | Miss        | 00  | 011
10000   | Hit         | 10  | 000
10010   | Miss        | 10  | 010

Ex: 8-Word Direct-Mapped Cache

Initial state: all eight entries (indices 000-111) are invalid.

Memory reference 10110 => Miss. After handling the miss:

Index | Valid | Tag | Data
110   | Y     | 10  | Mem(10110)
(all other entries invalid)

Ex: 8-Word Direct-Mapped Cache

Memory reference 11010 => Miss. After handling the miss:

Index | Valid | Tag | Data
010   | Y     | 11  | Mem(11010)
110   | Y     | 10  | Mem(10110)
(all other entries invalid)

Ex: 8-Word Direct-Mapped Cache

Memory reference 10110 => Hit. Memory reference 11010 => Hit. Memory reference 10000 => Miss. After handling the miss:

Index | Valid | Tag | Data
000   | Y     | 10  | Mem(10000)
010   | Y     | 11  | Mem(11010)
110   | Y     | 10  | Mem(10110)
(all other entries invalid)

Ex: 8-Word Direct-Mapped Cache

Memory reference 00011 => Miss. After handling the miss:

Index | Valid | Tag | Data
000   | Y     | 10  | Mem(10000)
010   | Y     | 11  | Mem(11010)
011   | Y     | 00  | Mem(00011)
110   | Y     | 10  | Mem(10110)
(all other entries invalid)

Ex: 8-Word Direct-Mapped Cache

Memory reference 10000 => Hit. Memory reference 10010 => Miss; the block at index 010 (tag 11, Mem(11010)) is replaced:

Index | Valid | Tag | Data
000   | Y     | 10  | Mem(10000)
010   | Y     | 10  | Mem(10010)
011   | Y     | 00  | Mem(00011)
110   | Y     | 10  | Mem(10110)
(all other entries invalid)

A More Realistic Cache

32-bit data, 32-bit address. Cache size is 1 K (= 1024 = 2^10) words; block size is 1 word (32 bits).
- cache index size = 10 bits
- tag size = 32 - 10 - 2 = 20 bits
- 2-bit byte offset
- a valid bit per entry

A More Realistic Cache

[Diagram: a 1024-entry cache with valid, 20-bit tag, and 32-bit data fields. Address bits [31-12] form the tag, bits [11-2] the index, bits [1-0] the byte offset; a comparator between the stored tag and the address tag produces the hit signal.]

Cache Size

A formula for computing cache size:
2^n × (block size + tag size + 1)
where 2^n is the number of blocks in the cache.
Example: What is the size of a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
- 64 KB = 16 K words = 2^14 blocks
- Tag size = 32 - 14 - 2 = 16 bits
- Valid bit: 1 bit
- Total bits in the cache: 2^14 × (32 + 16 + 1) = 802,816 bits
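The same computation as a small Python sketch (the helper name is ours, for illustration):

    import math

    def cache_bits(num_blocks, block_bits, address_bits, byte_offset_bits=2):
        # Total storage = blocks x (data bits + tag bits + 1 valid bit)
        index_bits = int(math.log2(num_blocks))
        tag_bits = address_bits - index_bits - byte_offset_bits
        return num_blocks * (block_bits + tag_bits + 1)

    # 64 KB of data in one-word (32-bit) blocks with 32-bit addresses:
    print(cache_bits(2**14, 32, 32))   # 802816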

Handling Cache Misses

When the requested data is found in the cache, the processor continues its normal execution. Cache misses are handled by the CPU control unit together with a separate cache controller. When a cache miss occurs:
- Stall the processor
- Activate the memory controller
- Get the requested data from memory
- Load it into the cache
- Continue as if it had been a hit

Read & Write

Read misses: stall the CPU, fetch the requested block from memory, deliver it to the cache, and restart.
Write hits & misses: a write can make the cache and memory inconsistent. Two policies:
- write the data into both the cache and memory (write-through)
- write the data only into the cache, and write the block back to memory later (write-back)

Write-Through Scheme

A memory write takes an additional 100 cycles. In the SPECint2000 benchmark, 10% of all instructions are stores, and the CPI without write stalls is about 1.17. Counting the write stalls:
CPI = 1.17 + 0.1 × 100 = 11.17
A write buffer can hold the data while it waits to be written to memory; meanwhile, the processor can continue execution. However, if the rate at which the processor generates writes exceeds the rate at which the memory system can accept them, buffering is not a solution.
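The slide's arithmetic as a tiny Python check (variable names are ours):

    base_cpi = 1.17        # CPI without write stalls
    store_fraction = 0.10  # 10% of instructions are stores
    write_penalty = 100    # extra cycles per memory write

    print(base_cpi + store_fraction * write_penalty)   # 11.17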

Write-Back Scheme

When a write occurs, the new value is written only to the block in the cache; the modified block is written to memory only when it is replaced. The write-back scheme is especially useful when the processor generates writes faster than main memory can handle them, but write-back schemes are more complicated to implement.

Unified vs. Split Cache

For instructions and data, there are two approaches:
Split caches:
- Higher miss rate, due to their smaller individual sizes
- Higher bandwidth, due to separate datapaths
- No conflict when accessing instructions and data at the same time
Unified cache:
- Lower miss rate, thanks to its larger size
- Lower bandwidth, due to a single datapath
- Possible stalls due to simultaneous accesses to data and instructions

Taking Advantage of Spatial Locality

The cache described so far takes advantage of temporal locality but not spatial locality.
Basic idea: whenever there is a miss, load a group of adjacent memory words into the cache (i.e., use blocks longer than one word and transfer the entire block from memory to the cache on a miss).
Block mapping: cache index = (block address) mod (number of blocks in the cache)

An Example Cache

The Intrinsity FastMATH processor:
- Embedded processor using the MIPS architecture
- 12-stage pipeline
- Separate instruction and data caches
- Each cache is 16 KB (4 K words), with 16-word blocks (256 blocks per cache)
- Tag size = 32 - 8 - 4 - 2 = 18 bits

Intrinsity FastMATH Processor

[Diagram: a 256-entry cache with valid, 18-bit tag, and 16-word data fields. Address bits [31-14] form the 18-bit tag, bits [13-6] the 8-bit index, bits [5-2] the 4-bit block offset; a multiplexor selects one of the 16 words in the block, and a tag comparator produces the hit signal.]

16-Word Cache Blocks

Tag: bits [31-14]. Index: bits [13-6]. Block offset: bits [5-2]. Byte offset: bits [1-0].
Example: What is the block address that byte address 1800 corresponds to?
Block address = (byte address) / (bytes per block) = 1800 / 64 = 28 (integer division)
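A sketch of the field extraction for this layout in Python (the function name is ours):

    def decompose(addr):
        # FastMATH data-cache fields: 18-bit tag, 8-bit index,
        # 4-bit block offset, 2-bit byte offset.
        byte_offset  = addr        & 0x3    # bits [1-0]
        block_offset = (addr >> 2) & 0xF    # bits [5-2]
        index        = (addr >> 6) & 0xFF   # bits [13-6]
        tag          = addr >> 14           # bits [31-14]
        return tag, index, block_offset, byte_offset

    # Byte address 1800: block address 1800 // 64 = 28, which is word 2 of index 28.
    print(decompose(1800))   # (0, 28, 2, 0)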

Reads & Writes in Multi-Word Caches

Read misses: always bring in the entire block.
Write hits & misses are more complicated. Compare the tag in the cache with the upper address bits:
- If they match, it is a hit; continue with write-back or write-through.
- If the tags are not identical, this is a write miss: read the entire block from memory into the cache, then rewrite the word that caused the miss.
Unlike the case with one-word blocks, write misses with multi-word blocks therefore require reading from memory.

Performance of the Caches

Intrinsity FastMATH on SPEC2000 (16 KB instruction cache, 16 KB data cache):

Instruction miss rate | Data miss rate | Effective combined miss rate
0.4%                  | 11.4%          | 3.2%

The effective combined miss rate for a unified cache of the same total size is 3.18%.

Block Size

Small block size:
- High miss rate: does not take full advantage of spatial locality
- Short block loading time
Large block size:
- Low miss rate
- Long time for loading the entire block, hence a higher miss penalty
Two techniques hide part of the loading time:
- Early restart: resume execution as soon as the requested word arrives in the cache.
- Critical word first: the requested word is returned first; the rest of the block is transferred later.

Miss Rate vs. Block Size

[Figure: miss rate as a function of block size, for several cache sizes.]

Memory System to Support the Cache

DRAM (Dynamic Random Access Memory) access time: the time between when a read is requested and when the desired word arrives in the CPU. A hypothetical memory access:
- 1 clock cycle to send the address
- 15 clock cycles to initiate a DRAM access (for each word)
- 1 clock cycle to send a word of data

One-Word-Wide Memory

Given a cache block of four words, the miss penalty for the one-word-wide memory organization is:
1 + 4 × 15 + 4 × 1 = 1 + 60 + 4 = 65 cycles
Bandwidth (number of bytes transferred per clock cycle) = (4 × 4) / 65 ≈ 0.25
[Diagram: CPU - Cache - Bus - Memory, all one word wide.]

Wide Memory Organization

With a main memory that is 4 words wide, the miss penalty for a 4-word block is:
1 + 15 + 1 × 1 = 17 cycles
Bandwidth = (4 × 4) / 17 ≈ 0.94
[Diagram: CPU - multiplexor - Cache - Bus - wide Memory.]

Interleaved Memory Organization

With a main memory of 4 banks, the miss penalty for a 4-word block is:
1 + 15 + 4 × 1 = 20 cycles
Bandwidth = (4 × 4) / 20 = 0.80
[Diagram: CPU - Cache - Bus - memory banks 0 through 3.]
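The three miss penalties and bandwidths above can be tabulated with a short Python sketch (assuming, as on the slides, 1 cycle to send the address, 15 cycles per DRAM access, and 1 cycle per word on the bus):

    words, send, access, transfer = 4, 1, 15, 1

    penalties = {
        "one-word-wide":         send + words * access + words * transfer,  # 65
        "four-word-wide":        send + access + transfer,                  # 17
        "four-bank interleaved": send + access + words * transfer,          # 20
    }
    for name, p in penalties.items():
        # bandwidth = bytes moved per cycle for a 16-byte (4-word) block
        print(f"{name}: {p} cycles, {4 * words / p:.2f} bytes/cycle")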

Example 1/2

Block size: 1 word; memory bus width: 1 word; miss rate: 3%; memory accesses per instruction: 1.2; CPI = 2.
- With a block size of 2 words, the miss rate is 2%.
- With a block size of 4 words, the miss rate is 1%.
What is the improvement in performance of interleaving two ways and four ways versus doubling the width of the memory and the bus, assuming access times of 1, 15, and 1 clock cycles?

Example 2/2

CPI for the one-word-wide machine:
CPI = 2 + (1.2 × 3% × 17) = 2.612
Two-word block:
- one-word bus & memory, no interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 × 2 + 1 × 2)) = 2.792
- one-word bus & memory, interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 + 2 × 1)) = 2.432
- two-word bus & memory, no interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 + 1)) = 2.408
Four-word block:
- one-word bus & memory, no interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 × 4 + 1 × 4)) = 2.780
- one-word bus & memory, interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 + 4 × 1)) = 2.240
- two-word bus & memory, no interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 × 2 + 2 × 1)) = 2.396
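All seven CPIs follow from one formula; here is an illustrative Python check (function and parameter names are ours):

    def cpi(miss_rate, penalty, base=2.0, refs_per_instr=1.2):
        # CPI = base + memory references/instruction x miss rate x miss penalty
        return base + refs_per_instr * miss_rate * penalty

    print(cpi(0.03, 1 + 15 + 1))        # 1-word block, baseline:   2.612
    print(cpi(0.02, 1 + 2*15 + 2*1))    # 2-word, narrow bus:       2.792
    print(cpi(0.02, 1 + 15 + 2*1))      # 2-word, interleaved:      2.432
    print(cpi(0.02, 1 + 15 + 1))        # 2-word, wide bus:         2.408
    print(cpi(0.01, 1 + 4*15 + 4*1))    # 4-word, narrow bus:       2.780
    print(cpi(0.01, 1 + 15 + 4*1))      # 4-word, interleaved:      2.240
    print(cpi(0.01, 1 + 2*15 + 2*1))    # 4-word, 2-word bus:       2.396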

Improving Cache Performance

- Reduce the miss rate, by reducing the probability that different memory blocks contend for the same cache location.
- Multilevel caching: second- and third-level caches are good for reducing the miss penalty as well.

Flexible Placement of Cache Blocks

Direct-mapped cache: a memory block goes to exactly one location in the cache.
- Easy to find: (block no.) mod (number of blocks in the cache), then compare the tags.
- Many blocks contend for the same location.
Fully associative cache: a memory block can go in any cache line.
- Difficult to find: search all the tags to see if the requested block is in the cache.

Flexible Placement of Cache Blocks

Set-associative cache: there is a fixed number of cache locations (at least two) where each memory block can be placed. A set-associative cache with n locations per set is called an n-way set-associative cache; the minimum set size is 2. Finding a block in the cache is easier than in a fully associative cache:
- (block no.) mod (number of sets in the cache)
- tags are compared only within the set

Locating Memory Blocks in the Cache

[Diagram: where a block with address 12 can go in an 8-block cache. Direct mapped: only block 12 mod 8 = 4 is searched. 2-way set-associative: both ways of set 12 mod 4 = 0 are searched. Fully associative: all 8 blocks are searched.]

Example

Consider the following successive memory accesses to a cache of four one-word blocks, organized as direct-mapped, two-way set-associative, and four-way (fully associative). Access pattern: 0, 8, 0, 6, 8.

Direct-mapped cache:

Address | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0       | Miss        | Mem[0]  |         |         |
8       | Miss        | Mem[8]  |         |         |
0       | Miss        | Mem[0]  |         |         |
6       | Miss        | Mem[0]  |         | Mem[6]  |
8       | Miss        | Mem[8]  |         | Mem[6]  |

Example

Two-way set-associative cache, memory accesses 0, 8, 0, 6, 8:

Address | Hit or miss | Set 0  | Set 0  | Set 1 | Set 1
0       | Miss        | Mem[0] |        |       |
8       | Miss        | Mem[0] | Mem[8] |       |
0       | Hit         | Mem[0] | Mem[8] |       |
6       | Miss        | Mem[0] | Mem[6] |       |
8       | Miss        | Mem[8] | Mem[6] |       |

Example

Fully associative cache, memory accesses 0, 8, 0, 6, 8:

Address | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0       | Miss        | Mem[0]  |         |         |
8       | Miss        | Mem[0]  | Mem[8]  |         |
0       | Hit         | Mem[0]  | Mem[8]  |         |
6       | Miss        | Mem[0]  | Mem[8]  | Mem[6]  |
8       | Hit         | Mem[0]  | Mem[8]  | Mem[6]  |
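The three walkthroughs above can be reproduced with a small simulator. This is an illustrative sketch (not from the slides) of a cache with LRU replacement, where ways = 1 gives direct-mapped and ways = num_blocks gives fully associative:

    def simulate(accesses, num_blocks, ways):
        # Each set is a list in LRU order, oldest entry first.
        num_sets = num_blocks // ways
        sets = [[] for _ in range(num_sets)]
        results = []
        for addr in accesses:
            s = sets[addr % num_sets]          # set index = block address mod #sets
            if addr in s:
                s.remove(addr)                 # hit: move to most-recent position
                results.append("hit")
            else:
                if len(s) == ways:
                    s.pop(0)                   # evict the least recently used block
                results.append("miss")
            s.append(addr)
        return results

    pattern = [0, 8, 0, 6, 8]
    print(simulate(pattern, 4, 1))   # direct-mapped:     5 misses
    print(simulate(pattern, 4, 2))   # 2-way:             miss, miss, hit, miss, miss
    print(simulate(pattern, 4, 4))   # fully associative: miss, miss, hit, miss, hit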

Performance Improvement

Data miss rates for a 64 KB cache with 16-word blocks, on SPEC2000:

Associativity | Data miss rate
1             | 10.3%
2             | 8.6%
4             | 8.3%
8             | 8.1%

Locating a Block in the Cache

[Diagram: a four-way set-associative cache. The index field of the address selects a set; the four tags in the set are compared in parallel against the address tag; a multiplexor selects the data from the matching way, and any match asserts the Hit signal.]

Replacement Strategy

Direct mapped: no choice.
Set-associative and fully associative: any location in the set is possible for a block. Which one should be replaced?
Most commonly used schemes:
- LRU (Least Recently Used): keep track of which cache line has been accessed most recently, and replace the one that has been unused for the longest time.
- Random: easy to implement, and only slightly worse than LRU.

Performance Equations

Formula 1:
CPU time = (CPU execution clock cycles + memory-stall clock cycles) × clock cycle time
Memory-stall clock cycles come primarily from cache misses.
Formula 2:
Memory-stall clock cycles = (memory accesses / program) × miss rate × miss penalty

Performance Equations

Formula 6:
Memory-stall clock cycles = (instructions / program) × (memory accesses / instruction) × miss rate × miss penalty
Example: CPI = 2; instruction miss rate: 2%; data miss rate: 4%; miss penalty: 100 clock cycles for all misses; frequency of loads and stores: 25% + 11% = 36%. What is the CPI with memory stalls?
CPI = 2 + (2% × 100) + 36% × (4% × 100) = 2 + 2 + 1.44 = 5.44
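The same computation in Python (a sketch with our own variable names):

    base_cpi  = 2.0
    inst_miss = 0.02   # instruction miss rate
    data_miss = 0.04   # data miss rate
    penalty   = 100    # cycles per miss
    ls_freq   = 0.36   # loads + stores per instruction

    stalls = inst_miss * penalty + ls_freq * data_miss * penalty
    print(base_cpi + stalls)   # 2 + 2.00 + 1.44 = 5.44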

Continuing the Same Example

What if the processor is made faster, but the memory system stays the same?
- If the CPI is reduced from 2 to 1, the fraction of execution time spent on memory stalls rises from 3.44/5.44 = 0.63 to 3.44/4.44 = 0.77.
- If the processor clock rate doubles, the miss penalty for all misses becomes 200 clock cycles:
Total miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 6.88
CPI = 8.88 (compare this to 5.44)
Speedup = 5.44 / (8.88 × 0.5) = 1.23

Multilevel Caches

First-level caches are implemented on-chip in contemporary processors. Second-level caches, which can be on-chip or off-chip in a separate set of SRAMs, are accessed whenever a miss occurs in the primary cache. Compared with the first level, the second level has:
- larger size
- larger block size
- higher hit time (but it is still much faster than main memory)

Example: Multilevel Caches

CPI = 1.0, clock rate = 5 GHz, main memory access time = 100 ns, miss rate per instruction at the primary cache = 2%. How much faster is the machine if we add a secondary cache with a 5 ns access time that reduces the miss rate to main memory to 0.5%?
- The miss penalty to main memory = 100 ns / 0.2 ns = 500 cycles.
- Total CPI = 1.0 + memory-stall cycles per instruction = 1.0 + 2% × 500 = 11.0 (without the secondary cache)
- The miss penalty for an access satisfied by the secondary cache = 5 ns / 0.2 ns = 25 cycles.
- Total CPI = 1.0 + 2% × 25 + 0.5% × 500 = 4.0 (with the secondary cache)
Speedup = 11.0 / 4.0 = 2.75
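A quick Python check of the two CPIs (names are ours):

    cycle_ns     = 1 / 5.0          # 5 GHz clock -> 0.2 ns per cycle
    main_penalty = 100 / cycle_ns   # 500 cycles
    l2_penalty   = 5 / cycle_ns     # 25 cycles

    cpi_l1_only = 1.0 + 0.02 * main_penalty                       # 11.0
    cpi_with_l2 = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty  # 4.0
    print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)    # speedup 2.75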

Design Considerations

First-level cache:
- The focus is to minimize hit time.
- The miss rate may be slightly higher.
- Smaller block size; the cache itself tends to be smaller.
Second-level cache:
- The focus is to reduce the overall miss rate; access time is less important.
- Its local miss rate can be large.
- Larger, and uses a larger block size.

Global vs. Local Miss Rate

Level 1 cache: 2% local miss rate. Level 2 cache: 50% local miss rate. What is the overall (global) miss rate?
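Working it out: the level-2 cache is accessed only on the 2% of references that miss in level 1, so the global miss rate = 2% × 50% = 1% of all memory accesses go to main memory.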