Samuel Naffziger, AMD Corporate Fellow. June 14th, 2011. VLSI Technology Symposium 2011. Introduction. The new workloads and demands on computation. Characteristics of serial and parallel computation.
Slide1
Technology Impacts from the New Wave of Architectures for Media-rich Workloads
Samuel Naffziger
AMD Corporate Fellow
June 14th, 2011
VLSI Technology Symposium 2011
Slide2
Outline
Introduction
The new workloads and demands on computation
Characteristics of serial and parallel computation
The Accelerated Processing Unit (APU) architecture
APU architecture implications for technology
Summary
Slide3
The Big Experience/Small Form Factor Paradox
Technology | Mid 1990s | Mid 2000s | Now: Parallel/Data-Dense
Display | 4:3 @ 0.5 megapixel | 4:3 @ 1.2 megapixels | 16:9 @ 7 megapixels
Content | Email, film & scanners | Digital cameras, SD webcams (1-5 MB files) | HD video flipcams, phones, webcams (1GB)
Online | Text and low res photos | WWW and streaming SD video | 3D Internet apps and HD video online, social networking w/HD files
Multimedia | CD-ROM | DVDs | 3D Blu-ray HD
Interface | Mouse & keyboard | Mouse & keyboard | Multi-touch, facial/gesture/voice recognition + mouse & keyboard
Battery Life* | 1-2 Hours | 3-4 Hours | All day computing (8+ Hours)
Workloads | Early Internet and Multimedia Experiences | Standard-definition Internet | Immersive and interactive performance
Form Factors | (pictured on slide)
*Resting battery life as measured with industry standard tests.
Slide4
Focusing on the experiences that matter
Chart: Consumer PC Usage (0% to 100%)
Email
Web browsing
Office productivity
Listen to music
Online chat
Watching online video
Photo editing
Personal finances
Taking notes
Online web-based games
Social networking
Calendar management
Locally installed games
Educational apps
Video editing
Internet phone
New Experiences
Immersive Gaming
Simplified Content Management
Accelerated Internet and HD Video
Source: IDC's 2009 Consumer PC Buyer Survey
Slide5
People Prefer Visual Communications
Verbal Perception
Words are processed at only 150 words per minute
Visual Perception
Pictures and video are processed 400 to 2000 times faster
Augmenting Today’s Content:
Rich visual experiences
Multiple content sources
Multi-Display
Stereo 3D
Slide6
The Emerging World of New Data Rich Applications
Communicating
IM, Email, Facebook Video Chat, NetMeeting
Gaming
Mainstream Games
3D games
Using photos
Viewing & Sharing
Search, Recognition, Labeling?
Advanced Editing
Using video
DVD, BLU-RAY™, HD
Search, Recognition, Labeling
Advanced Editing & Mixing
The Ultimate Visual Experience™
Fast Rich Web content, favorite HD Movies, games with realistic graphics
Music
Listening and Sharing
Editing and Mixing
Composing and compositing
Applications shown:
ArcSoft TotalMedia® Theatre 5
ArcSoft MediaConverter® 7
CyberLink MediaEspresso 6
CyberLink PowerDirector 9
Corel VideoStudio Pro
Corel Digital Studio 2010
Internet Explorer 9
Microsoft® PowerPoint® 2010
Windows Live Essentials
Codemasters F1 2010
Nuvixa Be Present
ViVu Desktop Telepresence
Viewdle Uploader
Slide7
New Workload Examples: Changing Consumer Behavior
24 hours of video uploaded to YouTube every minute
50 million+ digital media files added to personal content libraries every day
Approximately 9 billion video files owned are high-definition
1000 images are uploaded to Facebook every second
Slide8
What Are the Implications for Computation?
Insatiable demand for high bandwidth processing
Visual image processing
Natural user interfaces
Massive data mining for associative searches, recognition
Some of these compute needs can be offloaded to servers, some must be done on the mobile device
Similar compute needs and massive growth in both spaces
How must CPU architecture change to deal with these trends?
Slide9
Parallel and Serial Computation
Serial Code
Loops, branches and conditional evaluation
Conditional branches
i=0
i++
load x(i)
fmul
store
cmp i (16)
bc
…
Data Parallel Code
Loop 1M times for 1M pieces of data
i=0
i++
load x(i)
fmul
store
cmp i (1000000)
bc
…
2D array representing very large dataset
i,j=0
i++
j++
load x(i,j)
fmul
store
cmp j (100000)
bc
cmp i (100000)
bc
Slide10
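The contrast on the "Parallel and Serial Computation" slide can be sketched in Python. This is a minimal illustration, not code from the presentation: the function names, the SCALE constant standing in for the slide's fmul, and the thread pool are all assumptions. The serial loop carries state from one iteration into the next branch decision, while the data-parallel loop applies the same multiply to every element independently, so its chunks can be handed to separate workers in any order.

```python
from concurrent.futures import ThreadPoolExecutor

SCALE = 2.0  # hypothetical multiplier standing in for the slide's fmul


def serial_with_branches(x):
    """Serial code: each step depends on the running total and on a branch."""
    total = 0.0
    for value in x:
        # The branch outcome depends on state carried from earlier iterations,
        # so iterations cannot be reordered or run independently.
        if total < 10.0:
            total += value * SCALE
        else:
            total -= value
    return total


def scale_chunk(chunk):
    """Data-parallel kernel: the same multiply applied to every element."""
    return [value * SCALE for value in chunk]


def data_parallel_scale(x, workers=4):
    """Split the array into chunks and process each chunk independently."""
    size = max(1, (len(x) + workers - 1) // workers)
    chunks = [x[i:i + size] for i in range(0, len(x), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(scale_chunk, chunks)  # map preserves chunk order
    return [value for chunk in results for value in chunk]


if __name__ == "__main__":
    data = [float(i) for i in range(16)]
    print(serial_with_branches(data))
    print(data_parallel_scale(data)[:4])
```

The thread pool only demonstrates that the chunks are independent; it is not meant to show a speedup in Python. On a GPU the same kernel would be spread across many parallel lanes.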
GPU/CPU Design Differences
CPU (Serial compute)
GPU (parallel compute)
Slide11
Three Eras of Processor Performance
Single-Core Era
Chart: Single-thread Performance vs. Time, marked "we are here" and "?"
Enabled by:
Moore’s Law
Voltage & Process Scaling
Micro Architecture
Constrained by:
Power
Complexity
Multi-Core Era
Chart: Throughput Performance vs. Time (# of Processors), marked "we are here"
Enabled by:
Moore’s Law
Desire for Throughput
20 years of SMP arch
Constrained by:
Power
Parallel SW availability
Scalability
Heterogeneous Systems Era
Chart: Targeted Application Performance vs. Time (Data-parallel exploitation), marked "we are here"
Enabled by:
Moore’s Law
Abundant data parallelism
Power efficient GPUs
Temporarily constrained by:
Programming models
Communication overheads
Workloads
Slide12
Heterogeneous Computing with an APU Architecture
2010 IGP-based (“Danube”) Platform
Diagram labels: CPU Chip, FCH Chip, CPU Cores, GPU, UVD, SB Functions, UNB, MC, DDR3 DIMM Memory, PCIe®; link bandwidths ~7 GB/sec, ~17 GB/sec, ~17 GB/sec
Bandwidth pinch points and latency hold back the GPU capabilities
2011 APU-based (“Llano”) Platform
Diagram labels: APU Chip, CPU Cores, GPU, UVD, UNB / MC, DDR3 DIMM Memory, PCIe, optional GPU; link bandwidths ~27 GB/sec, ~27 GB/sec
Integration Provides Improvement
Eliminate power and latency of extra chip crossing
3X bandwidth between GPU and Memory!
Same sized GPU is substantially more effective
Power efficient, advanced technology for both CPU and GPU
Graphics requires memory BW to bring full capabilities to life
Slide13
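To give a feel for the bandwidth labels on the "Heterogeneous Computing with an APU Architecture" slide, the sketch below computes how long it takes to move a fixed amount of data at each quoted rate. The 1 GB working set and the helper name are assumptions for illustration only, and the sketch does not assume which diagram link each bandwidth figure belongs to, since the slide text does not say.

```python
def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Time in milliseconds to move size_gb at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s * 1000.0


# Approximate bandwidth labels quoted on the slide, in GB/sec.
SLIDE_BANDWIDTHS = {"~7 GB/sec": 7.0, "~17 GB/sec": 17.0, "~27 GB/sec": 27.0}

# Hypothetical working set: one gigabyte of frame or texture data.
WORKING_SET_GB = 1.0

if __name__ == "__main__":
    for label, bandwidth in SLIDE_BANDWIDTHS.items():
        t = transfer_time_ms(WORKING_SET_GB, bandwidth)
        print(f"{label}: {t:.1f} ms per {WORKING_SET_GB} GB")
```

These are idealized sustained-rate figures. Real transfers also pay the latency of any chip crossing, which the slide lists as a separate cost.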
The Challenges of Integration
Diagram: CPU toward Performance, GPU toward Density
CPU: Thick, fast metal; big devices
GPU: Dense, thin metal, small devices
CPU flop area = 2.14
GPU flop area = 1.0
Flop count for 4 Llano CPU cores = 0.66M
Flop count for Llano GPU = 3.5M
Slide14
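The flop figures on "The Challenges of Integration" slide can be combined with a little arithmetic. The sketch below uses only the four numbers quoted on the slide; the function names are illustrative, and it assumes that "flop" means a flip-flop (storage element) and that the areas are relative to a GPU flop of 1.0.

```python
# Values quoted on the slide (areas are relative, GPU flop = 1.0).
CPU_FLOP_AREA = 2.14
GPU_FLOP_AREA = 1.0
CPU_FLOP_COUNT = 0.66e6  # four Llano CPU cores
GPU_FLOP_COUNT = 3.5e6   # Llano GPU


def flop_count_ratio():
    """How many GPU flops there are per CPU flop."""
    return GPU_FLOP_COUNT / CPU_FLOP_COUNT


def flop_area_ratio():
    """Total GPU flop area relative to total CPU flop area."""
    gpu_area = GPU_FLOP_COUNT * GPU_FLOP_AREA
    cpu_area = CPU_FLOP_COUNT * CPU_FLOP_AREA
    return gpu_area / cpu_area


if __name__ == "__main__":
    print(f"GPU/CPU flop count ratio: {flop_count_ratio():.2f}")
    print(f"GPU/CPU total flop area ratio: {flop_area_ratio():.2f}")
```

With these figures the GPU has about 5.3 times as many flops as the four CPU cores, but only about 2.5 times the total flop area, because each GPU flop is less than half the size of a CPU flop.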
How to Balance the Metal Stack?
Plot: Cu resistivity (uohm-cm, 1.5 to 2.5) vs. line width (um, 0 to 1), with barrier and without barrier
With the 20nm node, even local metal will see a large RC increase, making compromises more difficult
Add metal layers?
Thin, dense layers for the GPU
Thick, low resistance layers for the CPU
Cost issues?
Via resistance?
Technology improvements in BEOL are required
Diagram: CPU toward Performance, GPU toward Density
Slide15
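The trade-off between thick and thin metal on the "How to Balance the Metal Stack?" slide follows from the standard wire resistance relation R = rho * L / (W * T). The sketch below is a rough illustration under stated assumptions: the wire dimensions and the two resistivity values are made up for the example (the resistivities are merely chosen inside the plot's 1.5 to 2.5 uohm-cm range), and none of them are read off the slide's curve.

```python
def wire_resistance_ohm(resistivity_uohm_cm, length_um, width_um, thickness_um):
    """R = rho * L / (W * T), with rho in uohm-cm and dimensions in um."""
    rho_ohm_m = resistivity_uohm_cm * 1e-8  # 1 uohm-cm = 1e-8 ohm-m
    length_m = length_um * 1e-6
    area_m2 = (width_um * 1e-6) * (thickness_um * 1e-6)
    return rho_ohm_m * length_m / area_m2


if __name__ == "__main__":
    # Assumed example wires, both 100 um long.
    thick = wire_resistance_ohm(1.8, 100.0, 0.5, 1.0)    # wide, thick line
    thin = wire_resistance_ohm(2.5, 100.0, 0.05, 0.1)    # narrow, thin line
    print(f"thick wire: {thick:.1f} ohm")
    print(f"thin wire:  {thin:.1f} ohm")
```

Shrinking the cross-section raises resistance directly, and the rising resistivity at narrow line widths adds to it, which is why dense GPU-style layers and fast CPU-style layers pull the metal stack in opposite directions.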
Device Optimization
Diagram: CPU toward Performance, GPU
To achieve breakthrough APU performance, the Llano GPU has ~5X the flops and ~5X the device count of the CPUs
Plot: Speed vs. Leakage (RO speed vs. device Ioff) for LVT, LC-RVT, RVT, HVT, LC-HVT, with the desired device range marked for CPU and GPU
Broader span of devices required
A broader device suite is required
Slide16
Power Transfers
Temperature figure panels:
Balanced workload
GPU-centric data parallel workload
CPU-centric serial workload
Voltage range is critical to enabling the efficient power transfers that make for compelling APU performance
Slide17
Operating Voltage Range
Operating voltage requirements:
Low voltage necessary for power efficiency
High voltage necessary for a snappy user experience enabled by turbo mode
Slide18
Operating Voltage Challenges
To maintain cost effective performance growth with technology node, the GPU must:
Hold power density constant
Exploit density gains to add compute units
This necessarily drives operating voltage down
This would be good for energy efficiency except …
Variation impacts are much greater at low voltage
Slide19
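The argument on the "Operating Voltage Challenges" slide, that holding power density constant while adding compute units drives voltage down, can be checked with the first-order dynamic power relation P = a * C * V^2 * f. The sketch below is a simplification under stated assumptions: it ignores leakage, treats power density as proportional to (devices per area) * C * V^2 * f, and uses made-up scaling factors; the function and parameter names are illustrative.

```python
import math


def voltage_for_constant_power_density(v_old, density_gain,
                                       cap_scale=1.0, freq_scale=1.0):
    """Solve n * C * V^2 * f = constant for the new supply voltage.

    density_gain: factor by which devices per unit area increase
    cap_scale:    factor applied to switched capacitance per device
    freq_scale:   factor applied to clock frequency
    """
    return v_old / math.sqrt(density_gain * cap_scale * freq_scale)


if __name__ == "__main__":
    # Assumed example: 2x density, 0.7x capacitance per device, same frequency.
    v_new = voltage_for_constant_power_density(1.0, 2.0, cap_scale=0.7)
    print(f"new voltage: {v_new:.3f} V per 1.000 V previously")
```

Unless capacitance per device shrinks as fast as density grows, the voltage has to fall each generation, which is where the slide's concern about variation at low voltage comes in.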
The Operating Voltage Challenge
Many barriers to maintaining both high and low voltage as technology scales
TDDB vs. SCE control
ULK breakdown vs. denser pitches
Variation control
Figure labels: BOX, Poly, Fin, S, D, Current Flow ->
FD devices should enable maintaining the functional range for a generation or two
Will turbo modes be too compromised?
What’s next?
Slide20
3D Integration to the Rescue?
Diagram labels: Heat Sink; TIM (Thermal Interface Material); CPU Die; GPU Die; Analog Die (SB, Power); Metal Layers; Through Silicon Vias (TSVs); Micro-bumps; DRAM; South Bridge; Package Substrate
Stacking offers many attractive benefits
Higher bandwidth to local memory
Enables parallel and serial compute die to be in their own separate optimized technology: interconnect speed vs. density, device optimization, etc.
Allows IO and southbridge content to remain in older, more analog-friendly technology
Slide21
3D Integration Challenges
Economical 3D stacking in high volume manufacturing presents many challenges
Benefits must exceed the additional costs of TSVs and yield fallout
Logistics of testing and assembling die from multiple sources can be immense
Countless mechanical and thermal issues to solve in high volume mfg
Clearly 3D provides compelling solutions to many problems, but the barriers to entry mean heavy R&D $$ and partnerships are required
Slide22
Summary
Insatiable demand for high bandwidth computation
Visual image processing
Natural user interfaces
Massive data mining for associative searches, recognition
Some of these compute needs can be offloaded to servers, some must be done on the mobile device
Similar compute needs and massive growth in both spaces
Combined serial and parallel computation architectures are key in both spaces
Huge technology challenges to meeting this opportunity
Interconnect scaling is hitting a wall that must be overcome
A broad device suite is necessary that operates efficiently at low voltage while enabling high speed for response time
3D integration offers a promising long term solution