CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories
Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, Vaishnav Srinivas

Main Memory Matters
- Software: In-Memory DBs, Key-Value Stores, Graph Algorithms, Deep Learning
- Technology: DDR4, HMC, HBM, NVM
- Architecture: Commodity CPUs, Accelerators
- Shift in bottlenecks
- Example innovations: NDP; DDR to GDDR5, 3x TOPS in TPU
- The Innovation Hub is Moving to Memory

Two Silos
- CACTI 7 can be used out of the box when defining memory parameters for traditional memory systems
- CACTI 7 primitives can be leveraged to model and evaluate new memory architectures

Talk Outline
- CACTI for the main memory
  - Inputs/outputs
  - The nuts and bolts
  - Modeling I/O power
  - Design space exploration
- Case studies: two novel architectures
  - Cascaded Channels
  - Narrow Channels

CACTI for Memory: Inputs and Outputs
- Inputs: capacity; #channels; ECC vs. not; DRAM type (DDR3, DDR4); access pattern (BW, row buffer hits, Rd/Wr ratio)
- Tables and parameters: cost table, bandwidth table, power parameters
- Exhaustive search
- Outputs: channel configs, energy per access

DIMM Cost
- Cost factors: technology, capacity, support for ECC, max bandwidth, vendor
- Aggregated costs from online sources
- Cost is volatile and should be updated periodically
- Cost and capacity relationship is not linear

Cost in dollars:
               4GB    8GB   16GB   32GB   64GB
DDR3 UDIMM      40     76      -      -      -
DDR3 RDIMM      42     64    122    304      -
DDR3 LRDIMM      -      -    211    287   1079
DDR4 UDIMM      26     46      -      -      -
DDR4 RDIMM      33     60    126    310      -
DDR4 LRDIMM      -      -    279    331   1474

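The non-linear cost/capacity relationship is easiest to see as dollars per gigabyte. The sketch below is illustrative only: it copies the DDR3 prices from the slide, and the capacity assigned to each price is an assumption inferred from the flattened table (UDIMM at 4-8 GB, RDIMM at 4-32 GB, LRDIMM at 16-64 GB). The function name is hypothetical and not part of CACTI 7.

```python
# DDR3 prices in US dollars, copied from the DIMM Cost slide.
# ASSUMPTION: the capacity (GB) paired with each price is inferred from
# the flattened table, not stated explicitly in the transcript.
DDR3_COST = {
    "UDIMM": {4: 40, 8: 76},
    "RDIMM": {4: 42, 8: 64, 16: 122, 32: 304},
    "LRDIMM": {16: 211, 32: 287, 64: 1079},
}


def dollars_per_gb(cost_table):
    """Return {dimm_type: {capacity_gb: dollars per GB}}."""
    return {
        dimm: {cap: price / cap for cap, price in by_cap.items()}
        for dimm, by_cap in cost_table.items()
    }


per_gb = dollars_per_gb(DDR3_COST)
for dimm, by_cap in per_gb.items():
    for cap, rate in by_cap.items():
        print(f"DDR3 {dimm:6s} {cap:2d} GB: ${rate:.2f}/GB")
```

Under these assumptions the per-gigabyte price of an RDIMM first falls and then rises again with capacity, which is why a search over configurations can beat simply picking the largest or smallest DIMM.
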
Bandwidth
- Bandwidth depends on load, voltage, and DIMM type

Bus frequency in MHz. Columns are 1DPC, 2DPC, and 3DPC, each at 1.35V and 1.5V for DDR3 and at 1.2V for DDR4. Not every DIMM type has an entry in every column.
DDR3
  UDIMM-DR:  533, 667, 533, 667
  RDIMM-DR:  667, 800, 667, 667, 533
  RDIMM-QR:  667, 667
  LRDIMM-QR: 667, 667, 667, 667, 533, 533
DDR4
  RDIMM-DR:  1066, 933, 800
  RDIMM-QR:  933, 800
  LRDIMM-QR: 1066, 1066, 800

Power Modeling
- Extending CACTI-I/O
- DDR4 and SerDes support added
- SerDes parameters from literature for different lengths/speeds
- For parallel buses, support for more accurate termination power with HSPICE simulations
- Different termination models for each bus type
- Different frequency, DIMMs per channel
- On-DIMM and on-board
- Different range (short or long)

Interconnect Model API

Power Analysis (DDR3)

Power Analysis (DDR4)

Cost and Bandwidth Analysis
- Highest possible BW for the demanded capacity
- Lowest possible cost for the demanded capacity

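The two questions on this slide can be illustrated with a small exhaustive search. This is a simplified sketch, not CACTI 7's implementation: it considers only homogeneous DDR4 configurations, the function names are hypothetical, the capacities paired with each price are inferred from the flattened cost table, and the RDIMM-DR and LRDIMM-QR frequency rows are assumed to stand in for the RDIMM and LRDIMM price rows.

```python
from itertools import product

# ASSUMPTION: DDR4 prices (dollars) keyed by capacity (GB); the capacity
# of each price is inferred from the flattened cost table.
DDR4_COST = {
    "RDIMM": {4: 33, 8: 60, 16: 126, 32: 310},
    "LRDIMM": {16: 279, 32: 331, 64: 1474},
}
# ASSUMPTION: bus frequency (MHz) at 1, 2, and 3 DIMMs per channel, using
# the RDIMM-DR and LRDIMM-QR rows of the bandwidth table.
DDR4_FREQ_MHZ = {
    "RDIMM": {1: 1066, 2: 933, 3: 800},
    "LRDIMM": {1: 1066, 2: 1066, 3: 800},
}


def enumerate_configs(capacity_gb, channels):
    """Yield every homogeneous config that provides at least capacity_gb."""
    for dimm, dpc in product(DDR4_COST, (1, 2, 3)):
        n_dimms = channels * dpc
        for size_gb, price in DDR4_COST[dimm].items():
            if n_dimms * size_gb >= capacity_gb:
                yield {
                    "dimm": dimm,
                    "size_gb": size_gb,
                    "dpc": dpc,
                    "cost": n_dimms * price,
                    "freq_mhz": DDR4_FREQ_MHZ[dimm][dpc],
                }


def cheapest(capacity_gb, channels):
    """Lowest cost for the demanded capacity (ties: higher frequency)."""
    return min(enumerate_configs(capacity_gb, channels),
               key=lambda c: (c["cost"], -c["freq_mhz"]))


def fastest(capacity_gb, channels):
    """Highest bus frequency for the demanded capacity (ties: lower cost)."""
    return max(enumerate_configs(capacity_gb, channels),
               key=lambda c: (c["freq_mhz"], -c["cost"]))


low_cost = cheapest(128, channels=2)
high_bw = fastest(128, channels=2)
print("cheapest:", low_cost)
print("fastest: ", high_bw)
```

For 128 GB on two channels, the cheapest and the fastest configuration differ under these assumptions, which is the trade-off the cost and bandwidth analysis exposes. Both helpers raise `ValueError` when no configuration reaches the demanded capacity.
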
Two Case Studies
- Key observations
  - High DPC leads to less BW
  - More channels give high BW and low cost
- New Idea I: Cascaded Segments
  - Each segment has few DIMMs, hence higher BW
- New Idea II: Narrow Channels
  - Partition the channel into many parallel channels
  - Fewer DIMMs per data wire and new ECC, hence higher BW
  - Lower power on DIMM

Cascaded Channels
- Same DPC, higher BW: three DIMMs on one channel at 533 MHz vs. cascaded segments at 667 MHz each
- Same BW, lower cost: 64 GB DIMMs at 667 MHz vs. one 64 GB and two 32 GB DIMMs on cascaded segments at 667 MHz each
- One memory cycle increase in latency
- RoB: Relay on Board chip

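The bandwidth benefit of raising the bus clock from 533 MHz to 667 MHz can be checked with a short calculation. The sketch assumes a standard 64-bit data bus per channel (ECC bits excluded) and double-data-rate signaling, that is, two transfers per bus clock; the function name is hypothetical.

```python
BUS_BYTES = 8  # ASSUMPTION: 64-bit data bus per channel, ECC bits excluded


def peak_bandwidth_gbs(bus_mhz, bus_bytes=BUS_BYTES):
    """Peak bandwidth in GB/s with two transfers per bus clock (DDR)."""
    return 2 * bus_mhz * 1e6 * bus_bytes / 1e9


base = peak_bandwidth_gbs(533)      # three DIMMs on one channel
cascaded = peak_bandwidth_gbs(667)  # fewer DIMMs per cascaded segment
gain = cascaded / base - 1
print(f"{base:.2f} GB/s -> {cascaded:.2f} GB/s (+{gain:.0%})")
```

The gain depends only on the frequency ratio 667/533, so it holds for any bus width; it comes to roughly 25% in peak bandwidth.
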
Hybrid Memory
- NVM is slow; software is optimized to access DRAM more
- One channel DRAM, one channel NVM: unbalanced channel load
- Frontend DRAM, backend NVM: balanced channel load
- Figure labels: D = DRAM DIMM, N = NVM DIMM

Narrow Channels
- Higher bandwidth but higher latency
- Lower frequency/power for DRAM chips!
- ECC on DIMM and CRC for link to reduce BW
- Command/address bus is shared between channels

Methodology
- Trace-based simulation
- Trace fed to USIMM
- Memory-intensive benchmarks (NPB and SPEC2006)
- Trace generated by Simics
- 8-core at 3.2 GHz
- L1D = 32KB, L1I = 32KB, L2 = 8MB
- Power: CACTI 7

Cascaded Channels
- DDR3: 25% higher BW, 22% higher IPC
- DDR4: 13% higher BW, 12% higher IPC

Cascaded Latency

Cascaded Power: DRAM Cartridge
            DIMM     BoB      I/O      Total    Power/BW
Baseline    23.2W    5.5W     9.4W     38.1W    7.9 nJ/B
Cascaded    22.6W    6.4W     12.2W    41.2W    6.7 nJ/B
Figure labels: 533 MHz at 70% utilization; 667 MHz at 70% utilization; 667 MHz at 35% utilization

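The Power/BW column is energy per byte, so dividing total power by it gives the bandwidth each row implies. The sketch below copies the slide's numbers, checks that the components add up to the stated totals, and derives that implied bandwidth. The derived bandwidths are a back-calculation for illustration and do not appear in the slides; the function name is hypothetical.

```python
# Power figures in watts and energy per byte in nJ/B, copied from the slide.
POWER_W = {
    "Baseline": {"DIMM": 23.2, "BoB": 5.5, "I/O": 9.4, "Total": 38.1},
    "Cascaded": {"DIMM": 22.6, "BoB": 6.4, "I/O": 12.2, "Total": 41.2},
}
ENERGY_NJ_PER_BYTE = {"Baseline": 7.9, "Cascaded": 6.7}


def implied_bandwidth_gbs(total_w, nj_per_byte):
    """W divided by nJ/B gives 1e9 B/s, which is GB/s."""
    return total_w / nj_per_byte


implied = {}
for name, parts in POWER_W.items():
    component_sum = parts["DIMM"] + parts["BoB"] + parts["I/O"]
    # The three components should reproduce the stated total.
    assert abs(component_sum - parts["Total"]) < 0.05
    implied[name] = implied_bandwidth_gbs(parts["Total"],
                                          ENERGY_NJ_PER_BYTE[name])
    print(f"{name}: {parts['Total']} W, {implied[name]:.2f} GB/s implied")
```

The cascaded design draws more total power yet spends less energy per byte, because its delivered bandwidth grows faster than its power.
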
Cascaded Cost

Cascaded Hybrid
Figure axis: Percentage of Load on DRAM

Narrow Channel: Performance
Performance improvement:
- 2-channel-x36: 18%
- 3-channel-x24: 17%

Narrow Channel: Power
- 23% overall memory power reduction

Conclusion
- CACTI 7: models off-chip memories and I/O
  - Detailed I/O power model
  - Design space exploration
- Analyzes trade-offs: capacity, power, bandwidth, and cost
- Two novel architectures
  - Cascaded channels
  - Narrow channels