Slide1
HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
Javier Lira ψ, Carlos Molina ф, Antonio González ψ,λ
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, antonio.gonzalez@intel.com
ф Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain, carlos.molina@urv.net
ψ Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain, javier.lira@ac.upc.edu
IPDPS 2011, Anchorage, AK (USA), May 17, 2011
Slide2
Introduction
[Figure: 8-core CMP (Core 0 to Core 7) sharing a NUCA cache]
S-NUCA (Static NUCA):
One possible location in the NUCA
Simple
Trivial search of data
Does not leverage locality
D-NUCA (Dynamic NUCA):
Multiple candidate banks
Migration increases complexity
Not easy to find data
Optimizes cache access latency
Slide3
Motivation
Significant performance potential
Limited by the access scheme
Slide4
Access schemes in D-NUCA
Directory is not an alternative:
Needs to update block location on every migration
Reduces D-NUCA's potential
Potential bottleneck
Algorithmic-based schemes:
Partitioned multicast (hybrid access scheme)
1st step: Local bank + central banks (9 banks)
2nd step: The other cores' local banks

Scheme     Performance   Energy
Serial     Low           Low
Parallel   High          High
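The two-step partitioned multicast above can be sketched as a small message-count model. This is only an illustration under assumptions the slide does not state: each block is assumed to have 16 candidate banks (one local bank per core plus 8 central banks), which is consistent with the 9-bank first step for 8 cores. The function and constant names are hypothetical.

```python
NUM_CORES = 8          # from the methodology slide
NUM_CENTRAL_BANKS = 8  # ASSUMPTION: 1 local + 8 central = the 9 banks of step 1


def partitioned_multicast_messages(core, holder):
    """Count request messages sent to NUCA banks for one access.

    holder is ("local", core_id), ("central", bank_id), or None when the
    block is not in the NUCA cache at all.
    """
    # Step 1: the requester's local bank plus all central banks, in parallel.
    step1 = [("local", core)] + [("central", b) for b in range(NUM_CENTRAL_BANKS)]
    if holder in step1:
        return len(step1)
    # Step 2: the local banks of the other cores, in parallel.
    step2 = [("local", c) for c in range(NUM_CORES) if c != core]
    return len(step1) + len(step2)
```

Under this model a hit in the first step costs 9 messages, and any other outcome (a hit in another core's local bank, or a NUCA miss) costs 16.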
Slide5
Serial vs Parallel
Reducing the number of messages required per access is crucial
Slide6
Objectives
Optimize NUCA features
Provide fast access when the data is near the requesting core
Reduce network contention
Crucial in both performance and energy
Slide7
Outline
Introduction and motivation
Methodology
HK-NUCA
Results
Conclusions
Slide8
Methodology
Simulation tools:
Simics + GEMS
CACTI v6.0
Two scenarios:
Multi-programmed: Mix of SPEC CPU2006
Parallel applications: PARSEC

Number of cores: 8 (UltraSPARC IIIi)
Frequency: 1.5 GHz
Main Memory Size: 4 GBytes
Memory Bandwidth: 512 Bytes/cycle
Private L1 caches: 8 x 32 KBytes, 2-way
Shared L2 NUCA cache: 8 MBytes, 128 Banks
NUCA Bank: 64 KBytes, 8-way
L1 cache latency: 3 cycles
NUCA bank latency: 4 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 250 cycles (from core)
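As a quick consistency check on the cache parameters listed for the methodology, the bank count and bank size can be multiplied out. This is a minimal sketch; the dictionary keys are hypothetical names, and only the numeric values come from the slide.

```python
KB = 1024
MB = 1024 * KB

# Values copied from the methodology slide; key names are illustrative.
config = {
    "cores": 8,
    "nuca_banks": 128,
    "bank_size_bytes": 64 * KB,
    "nuca_size_bytes": 8 * MB,
}

# 128 banks of 64 KBytes each should add up to the 8 MBytes shared L2.
total_bank_bytes = config["nuca_banks"] * config["bank_size_bytes"]
capacity_matches = total_bank_bytes == config["nuca_size_bytes"]
print(total_bank_bytes // MB, capacity_matches)
```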
Slide9
Baseline architecture
D-NUCA cache: 8 MBytes, 128 Banks
Bank: 64 KBytes, 8-way
Migration scheme: Gradual Promotion
Replacement: LRU
Access: Partitioned Multicast
[Figure: baseline 8-core CMP (Core 0 to Core 7) with the D-NUCA cache]
Slide10
Outline
Introduction and motivation
Methodology
HK-NUCA
Results
Conclusions
Slide11
HK-NUCA
Home Knows where to find data in the NUCA cache
Home bank knows which other banks have at least one data block that it manages
There is a HK-PTR per cache set in all banks
HK-PTR (example): 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0
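The HK-PTR can be read as a bit vector with one bit per candidate bank. The sketch below decodes the example vector shown on this slide into the list of banks worth searching. It assumes that bit i, read left to right, corresponds to bank i; the slide does not specify the ordering, and the function name is hypothetical.

```python
def banks_to_search(hk_ptr):
    """Return the indices of banks whose HK-PTR bit is set."""
    return [bank for bank, bit in enumerate(hk_ptr) if bit]


# Example HK-PTR from the slide (16 bits).
example = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(banks_to_search(example))  # [2, 4, 5, 12, 14]
```

With this vector only 5 of the 16 candidate banks would need to be queried.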
Slide12
HK-NUCA
(1) Fast access
(2) Call Home
(3) Parallel access
[Figure: a request from Core 0 in the 8-core CMP, resolved using the HK-PTR 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 0]
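The three stages named on this slide can be sketched as a toy search routine. This is an interpretation rather than the authors' implementation: it assumes that Fast access probes the candidate bank nearest the requesting core, that Call Home queries the home bank, and that Parallel access probes only the banks flagged in the home bank's HK-PTR. It also assumes the nearest bank and the home bank are different banks. All names are hypothetical.

```python
def hk_nuca_search(block, nearest_bank, home_bank, hk_ptr, contents):
    """Return (bank holding the block or None, request messages sent).

    contents maps bank index -> set of blocks stored there (toy model).
    hk_ptr is the home bank's bit list for the block's cache set.
    """
    messages = 1  # (1) Fast access: probe the bank nearest the requesting core
    if block in contents.get(nearest_bank, set()):
        return nearest_bank, messages

    messages += 1  # (2) Call Home: ask the home bank
    if block in contents.get(home_bank, set()):
        return home_bank, messages

    # (3) Parallel access: probe only banks flagged in the HK-PTR,
    # skipping the two banks that were already checked.
    targets = [b for b, bit in enumerate(hk_ptr)
               if bit and b not in (nearest_bank, home_bank)]
    messages += len(targets)
    for bank in targets:
        if block in contents.get(bank, set()):
            return bank, messages
    return None, messages  # miss in the NUCA cache
```

In this model a hit in the nearest bank costs one message, a hit at home costs two, and later hits cost two plus the number of set HK-PTR bits.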
Slide13
Managing Home knowledge
Actions that provoke an update of HK-PTR:
New data enters the cache
Eviction from the NUCA cache
Migration movements
Migrations are synchronized with HK-PTR updates
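The three update triggers listed above can be modelled for a single cache set of a home bank. This is a minimal sketch with hypothetical names: the per-bank counter is a modelling shortcut (not something the slides describe) so that a bit is cleared only when the last block managed by this home set leaves a bank.

```python
from collections import Counter


class HomeKnowledge:
    """Toy HK-PTR for one cache set of a home bank (names are hypothetical)."""

    def __init__(self, num_banks=16):
        # ASSUMPTION: per-bank block counts stand in for checking set contents.
        self.counts = Counter()
        self.num_banks = num_banks

    def insert(self, bank):
        """New data enters the cache in `bank`."""
        self.counts[bank] += 1

    def evict(self, bank):
        """A block is evicted from `bank`."""
        if self.counts[bank] == 0:
            raise ValueError("no block recorded in this bank")
        self.counts[bank] -= 1

    def migrate(self, src, dst):
        """A block moves from src to dst; both entries change together."""
        self.evict(src)
        self.insert(dst)

    def bits(self):
        """Current HK-PTR: 1 where a bank holds at least one managed block."""
        return [1 if self.counts[b] > 0 else 0 for b in range(self.num_banks)]
```

For example, after two insertions into bank 1 and one migration from bank 1 to bank 2, both bits stay set until the remaining block in bank 1 is evicted.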
Slide14
Overheads
Hardware: Implementation of HK-PTRs
Network: Home knowledge updates
NUCA cache: 8 MBytes
HK-PTRs: 32 KBytes
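The 32 KBytes figure can be reproduced with a short calculation, under two assumptions that are not stated on the slides: a 64-byte cache block, and a 16-bit HK-PTR (one bit per entry in the pictured vector). With one HK-PTR per cache set in every bank, the arithmetic works out as follows.

```python
KB = 1024
bank_bytes = 64 * KB   # slide: 64 KBytes per bank
ways = 8               # slide: 8-way
banks = 128            # slide: 128 banks
block_bytes = 64       # ASSUMPTION: block size is not given on the slides
hk_ptr_bits = 16       # ASSUMPTION: one bit per entry in the pictured HK-PTR

sets_per_bank = bank_bytes // (block_bytes * ways)
total_bytes = banks * sets_per_bank * hk_ptr_bits // 8
overhead = total_bytes / (8 * KB * KB)  # relative to the 8 MBytes NUCA cache
print(sets_per_bank, total_bytes // KB, f"{overhead:.2%}")  # 128 32 0.39%
```

Under these assumptions the HK-PTRs add about 0.39% to the storage of the NUCA cache.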
Slide15
Outline
Introduction and motivation
Methodology
HK-NUCA
Results
Conclusions
Slide16
Performance results
Overall performance improvement of 4-6%
Chart annotations:
Workloads with high miss rate
Low miss rate, but high hit rate in the first two HK-NUCA stages
Low miss rate, high hit rate in the parallel access stage of HK-NUCA
Slide17
HK-NUCA accuracy
85% of memory requests send fewer than 6 messages to the NUCA
Slide18
On-chip network traffic

Access scheme           Avg messages sent per request
Part. Multicast         10.03
HK-NUCA (3-steps)       3.82
HK-NUCA (2-steps)       4.06
Perfect Search          1
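The averages reported on this slide can be turned into relative reductions against partitioned multicast. The sketch below uses only the four numbers from the slide; the variable names are illustrative.

```python
# Average messages sent per request, as reported on the slide.
avg_messages = {
    "Partitioned Multicast": 10.03,
    "HK-NUCA (3-steps)": 3.82,
    "HK-NUCA (2-steps)": 4.06,
    "Perfect Search": 1.0,
}

baseline = avg_messages["Partitioned Multicast"]
# Percentage reduction in messages relative to the baseline scheme.
reduction = {
    name: 100 * (baseline - value) / baseline
    for name, value in avg_messages.items()
}
for name, pct in reduction.items():
    print(f"{name}: {pct:.1f}% fewer messages")
```

This gives roughly 61.9% fewer messages for the 3-step variant and 59.5% fewer for the 2-step variant.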
Slide19
Energy consumption results
HK-NUCA reduces dynamic energy consumption by more than 50%
Slide20
Outline
Introduction and motivation
Methodology
HK-NUCA
Results
Conclusions
Slide21
Conclusions
D-NUCA makes it possible to exploit the non-uniformity of NUCA caches
D-NUCA benefits are restricted by the access scheme used
HK-NUCA is an access scheme for D-NUCA organizations:
Allows fast accesses to data that is near the requesting core
Home knowledge reduces miss resolution time and network contention
Outperforms the best-performing access scheme by 6%
Reduces dynamic energy consumption by 50%
Slide22
HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs
Questions?
Slide23
Migration is not the problem
[Figure: S-NUCA vs D-NUCA comparison]
Access scheme is the main limitation in D-NUCA