Slide1
Implementing a Hybrid SRAM / eDRAM NUCA Architecture
Javier Lira (UPC, Spain), javier.lira@ac.upc.edu
Carlos Molina (URV, Spain), carlos.molina@urv.net
David Brooks (Harvard, USA), dbrooks@eecs.harvard.edu
Antonio González (Intel-UPC, Spain), antonio.gonzalez@intel.com
HiPC 2011, Bangalore (India) – December 21, 2011
Slide2
Motivation
CMPs incorporate a large LLC (40-45% of chip area).
POWER7 implements its L3 cache with eDRAM:
  3x density.
  3.5x lower energy consumption.
  Increases latency by a few cycles.
We propose a placement policy to accommodate both technologies in a NUCA cache.
Slide3
NUCA caches [1]
NUCA divides a large cache into smaller and faster banks.
Cache access latency consists of the routing and bank access latencies.
Banks close to the cache controller have smaller latencies than banks farther away.
[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS'02
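The latency decomposition on this slide can be sketched as a small model. This is only an illustrative sketch, not the simulator's timing model: the function name `nuca_access_latency`, the assumption that every hop costs one router traversal plus one wire segment, and the choice to count only the one-way trip to the bank are all assumptions. The default cycle counts are the bank, router and wire latencies listed on the experimental framework slide of this deck.

```python
def nuca_access_latency(hops, bank_latency=4, router_delay=1, wire_delay=1):
    """Cycles to reach a bank `hops` network hops away and access it.

    Assumption (not from the slides): each hop costs one router traversal
    plus one wire segment, and only the one-way trip is counted.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    routing_latency = hops * (router_delay + wire_delay)
    return routing_latency + bank_latency


# Banks close to the cache controller see smaller latencies than distant ones.
for hops in (1, 4, 8):
    print(f"{hops} hops -> {nuca_access_latency(hops)} cycles")
```

Under these assumptions the bank access cost is constant and the routing term grows linearly with distance, which is why placement and migration matter in a NUCA cache.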
Slide4
SRAM vs. eDRAM
SRAM provides high performance.
eDRAM provides low power and high density.
                 SRAM   eDRAM
Latency          X      1.5x
Density          X      3x
Leakage          2x     X
Dynamic energy   1.5x   X
Need refresh?    No     Yes
Slide5
Outline
Introduction
Methodology
Implementing a hybrid NUCA cache
Analysis of our design
Exploiting architectural benefits
Conclusions
Slide6
Baseline architecture [2]
[Figure: baseline architecture with Core 0 to Core 7]
Placement: 16 positions per data
Access: Partitioned multicast
Migration: Gradual promotion
Replacement: LRU + Zero-copy
[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04
Slide7
Experimental framework
Number of cores         8 – UltraSPARC IIIi
Frequency               1.5 GHz
Main memory size        4 GBytes
Memory bandwidth        512 Bytes/cycle
Private L1 caches       8 x 32 KBytes, 2-way
Shared L2 NUCA cache    8 MBytes, 128 banks
NUCA bank               64 KBytes, 8-way
L1 cache latency        3 cycles
NUCA bank latency       4 cycles
Router delay            1 cycle
On-chip wire delay      1 cycle
Main memory latency     250 cycles (from core)
[Figure: simulation framework: Simics, GEMS, Ruby, Garnet, Orion; Solaris 10; 8 x UltraSPARC IIIi; PARSEC and SPEC CPU2006]
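The cache geometry listed on this slide can be cross-checked with a few lines of arithmetic. The variable names below are hypothetical and the snippet only multiplies out the stated bank and cache counts.

```python
KBYTE = 1024
MBYTE = 1024 * KBYTE

# Values taken from the experimental framework parameters.
num_banks = 128
bank_size = 64 * KBYTE
num_l1_caches = 8
l1_size = 32 * KBYTE

nuca_size = num_banks * bank_size
total_l1_size = num_l1_caches * l1_size

print(f"NUCA capacity: {nuca_size // MBYTE} MBytes")
print(f"Aggregate private L1 capacity: {total_l1_size // KBYTE} KBytes")
```

The 128 banks of 64 KBytes account for the full 8 MBytes of shared L2 NUCA cache.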
Slide9
Homogeneous approach
Fast SRAM banks are located close to the cores.
Slower eDRAM banks are in the center of the NUCA cache.
PROBLEM: Migration tends to concentrate shared data in central banks.
[Figure: Core 0 to Core 7 with SRAM and eDRAM banks]
Slide10
Data usage analysis
A significant amount of data in the LLC is not accessed during its lifetime.
SRAM banks store the most frequently accessed data.
eDRAM banks allocate data blocks that either:
  Just arrived in the NUCA, or
  Were evicted from SRAM banks.
Slide11
Heterogeneous approach
Data first goes to an eDRAM bank.
If accessed, it moves to SRAM.
Features:
  Migration between SRAM banks.
  Lack of communication in eDRAM.
  No eviction from SRAM banks.
  eDRAM is extra storage for SRAM.
PROBLEM: The access scheme must search twice as many banks.
[Figure: Core 0 to Core 7 with SRAM and eDRAM banks]
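The placement flow of the heterogeneous approach can be sketched with a toy model. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `HybridNucaSketch` and its methods are hypothetical, banks and migration between SRAM banks are not modeled, each technology is treated as one LRU-ordered pool, and a block displaced from SRAM is assumed to fall back to eDRAM.

```python
from collections import OrderedDict


class HybridNucaSketch:
    """Toy model of the heterogeneous placement flow (hypothetical names)."""

    def __init__(self, sram_blocks, edram_blocks):
        if sram_blocks < 1 or edram_blocks < 1:
            raise ValueError("both capacities must be at least one block")
        self.sram_blocks = sram_blocks
        self.edram_blocks = edram_blocks
        self.sram = OrderedDict()   # oldest (LRU) entry first
        self.edram = OrderedDict()  # oldest (LRU) entry first

    def _insert_edram(self, addr):
        if len(self.edram) >= self.edram_blocks:
            self.edram.popitem(last=False)  # LRU block leaves the NUCA
        self.edram[addr] = True

    def access(self, addr):
        if addr in self.sram:
            self.sram.move_to_end(addr)     # refresh LRU position
            return "sram_hit"
        if addr in self.edram:
            # Accessed while in eDRAM: the block moves to SRAM.
            del self.edram[addr]
            if len(self.sram) >= self.sram_blocks:
                victim, _ = self.sram.popitem(last=False)
                self._insert_edram(victim)  # assumed: SRAM victim falls back to eDRAM
            self.sram[addr] = True
            return "edram_hit"
        # Miss: incoming data is first placed in eDRAM.
        self._insert_edram(addr)
        return "miss"


cache = HybridNucaSketch(sram_blocks=2, edram_blocks=2)
print([cache.access(a) for a in ("A", "A", "A", "B")])
# -> ['miss', 'edram_hit', 'sram_hit', 'miss']
```

In this sketch a block that is never reused stays in eDRAM until it is replaced, so only reused blocks occupy SRAM capacity.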
Slide12
TDA
The Tag Directory Array (TDA) stores the tags of eDRAM banks.
Using the TDA, the access scheme searches up to 17 banks.
The TDA requires 512 KBytes for an 8 MByte (4S-4D) hybrid NUCA cache.
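One way to read the bank counts is sketched below. It is an interpretation, not something the slides spell out: it assumes 16 candidate SRAM banks per block (from the "16 positions per data" placement of the baseline), an equal number of candidate eDRAM banks, and a TDA that names the single eDRAM bank holding a block. The function name `banks_to_probe` and the dictionary-based TDA are hypothetical.

```python
SRAM_POSITIONS = 16   # assumed: candidate SRAM banks per block
EDRAM_POSITIONS = 16  # assumed: candidate eDRAM banks per block


def banks_to_probe(addr, tda=None):
    """Return the (technology, position) banks a lookup for `addr` must probe.

    `tda` is a hypothetical dict mapping a block address to the eDRAM
    position that holds it; None models the design without a TDA.
    """
    probes = [("sram", p) for p in range(SRAM_POSITIONS)]
    if tda is None:
        # Without a TDA every candidate eDRAM bank must be searched as well.
        probes += [("edram", p) for p in range(EDRAM_POSITIONS)]
    elif addr in tda:
        # The TDA holds eDRAM tags, so at most one eDRAM bank is probed.
        probes.append(("edram", tda[addr]))
    return probes


tda = {0x1000: 5}
print(len(banks_to_probe(0x1000)))        # without TDA
print(len(banks_to_probe(0x1000, tda)))   # block tracked by the TDA
print(len(banks_to_probe(0x2000, tda)))   # block not in eDRAM
```

Under these assumptions the lookup touches twice as many banks without the TDA, and at most one more bank than the SRAM candidates with it.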
Slide13
Performance results
Heterogeneous + TDA outperforms the other hybrid alternatives.
We use Heterogeneous + TDA as the hybrid NUCA cache in further analysis.
Slide15
Performance
Well-balanced configurations achieve performance similar to an all-SRAM NUCA cache.
The majority of hits are in SRAM banks.
Slide16
Power and Area
Hybrid NUCA pays for the TDA.
The less SRAM the hybrid NUCA uses, the better.
Slide17
The best configuration: 4S-4D
Performance similar to all-SRAM.
Reduces power consumption by 10%.
Occupies 15% less area than all-SRAM.
Slide19
New configurations
[Figure: SRAM and eDRAM bank layouts for each configuration]
all SRAM banks
4S-4D: SRAM 4 MBytes, eDRAM 4 MBytes; 15% reduction in area
5S-4D: +1 MByte in SRAM banks
4S-6D: +2 MBytes in eDRAM banks
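The configuration labels appear to encode the SRAM and eDRAM capacities in MBytes, a reading inferred from 4S-4D (4 MBytes each), 5S-4D (+1 MByte of SRAM) and 4S-6D (+2 MBytes of eDRAM). A small sketch under that assumption, with the hypothetical helper `parse_config`:

```python
import re


def parse_config(label):
    """Split a label such as '4S-4D' into (sram_mbytes, edram_mbytes).

    Assumes the 'xS-yD' naming means x MBytes of SRAM and y MBytes of eDRAM.
    """
    match = re.fullmatch(r"(\d+)S-(\d+)D", label)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    return int(match.group(1)), int(match.group(2))


for label in ("4S-4D", "5S-4D", "4S-6D"):
    sram_mb, edram_mb = parse_config(label)
    print(f"{label}: {sram_mb} MB SRAM + {edram_mb} MB eDRAM = {sram_mb + edram_mb} MB")
```

Read this way, the two new configurations grow total capacity to 9 and 10 MBytes respectively, compared with 8 MBytes for 4S-4D.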
Slide20
Exploiting benefits
Both configurations increase performance by 4% and do not increase power consumption.
Slide22
Conclusions
IBM® integrates eDRAM in its latest general-purpose processor.
We implement a hybrid NUCA cache that effectively combines SRAM and eDRAM technologies.
Our placement policy succeeds in concentrating most accesses in the SRAM banks.
A well-balanced hybrid cache achieves performance similar to the all-SRAM configuration, but occupies 15% less area and dissipates 10% less power.
Exploiting architectural benefits, we achieve up to 10% performance improvement, and 4% on average.
Slide23
Implementing a Hybrid SRAM / eDRAM NUCA Architecture
Questions?
HiPC 2011, Bangalore (India) – December 21, 2011