Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, Krisztián Flautner. University of Michigan, Ann Arbor; ARM. ISCA 2010, June 22, 2010.
Slide1
Evolution of Thread-Level Parallelism in Desktop Applications
Geoffrey Blake*, Ronald G. Dreslinski*, Trevor Mudge*, Krisztián Flautner†
*University of Michigan, Ann Arbor
†ARM
ISCA 2010
June 22, 2010
Slide2
Introduction
2000
  Single core machines common
  Clock speed steadily increasing; Intel predicts reaching 10GHz by 2010
2005
  Consumer CMPs announced (from Intel/AMD)
  Aggressive clock speed increases halt
2010 and Future
  Multi-core machines common
  Core counts steadily increasing
[Figure: processor timeline labeled Pentium 4, Core Duo, Core i7, Nehalem EX]
“If you build it, they will come” -- “Field of Dreams” (1989)
Slide3
Motivation
Server and scientific workloads are already parallel
What about desktop/laptop workloads?
Flautner et al. at ASPLOS-IX studied interactive desktop workloads
  Almost all applications behaved as single threaded
  1 extra core helped system responsiveness
Multiprocessor desktop/laptop systems are now common
Has desktop/laptop software followed?
[Figure: Performance vs. # of cores, with curves labeled Ideal Scaling, Server/Scientific Scaling, and Desktop Scaling?]
Slide4
Motivation
Over the past 5 years of ISCA, server and scientific workload research has dominated
The market shows the opposite
Is this the correct domain to invest disproportionate effort into? Is there a trickle-down effect?
*source IDC and Gartner 2009
*source ISCA ‘05-’09
Slide5
Metrics
Replicate contemporary experiments of previous work
Measure “Thread Level Parallelism” (TLP) instead of utilization
c_i = fraction of time i CPUs are doing work
  c_0 = idle (0 CPUs are doing work)
  c_1 = 1 CPU doing work, c_2 = 2 CPUs doing work
Example: c_0 = 0.5, c_1 = 0.25, c_2 = 0.25
  Util = 0.25 * 1 + 0.25 * 2 = 0.75
  TLP = (0.25 * 1 + 0.25 * 2) / (1 - 0.5) = 1.5
TLP is a measure of how efficiently the system uses parallel resources when work needs to be done
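The worked example can be reproduced with a short Python sketch. The function names `util` and `tlp` are illustrative and not taken from the authors' tools. The general form used here, TLP = sum(i * c_i for i >= 1) / (1 - c_0), is inferred from the example on this slide.

```python
def util(c):
    """Sum of i * c_i, as in the slide's example (not divided by the CPU count)."""
    return sum(i * ci for i, ci in enumerate(c))


def tlp(c):
    """Average number of busy CPUs over the non-idle time: sum(i * c_i) / (1 - c_0)."""
    busy_fraction = 1.0 - c[0]
    if busy_fraction <= 0.0:
        raise ValueError("system idle for the whole trace; TLP is undefined")
    return util(c) / busy_fraction


# c[i] = fraction of time exactly i CPUs are doing work (values from the slide)
c = [0.5, 0.25, 0.25]
print(util(c))  # 0.75
print(tlp(c))   # 1.5
```

Applied to the Bioshock row of the Atom backup table (2%, 25%, 35%, 27%, 11%), the same function gives about 2.24, which is consistent with the reported TLP of 2.2.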
*[Flautner et al. ASPLOS’00]
Slide6
Test Systems
2009 Mac Pro (fast, high-end)
  2x 2.26GHz Intel Xeon E5520 (8 cores, 16 hardware threads)
  NVIDIA GTX 285/GT120 (240/32 CUDA cores)
  Mac OS X 10.6 + Windows 7 x64
ASUS ASRock (slow, cheap)
  1x 1.6GHz Intel Atom 330 (2 cores, 4 hardware threads)
  NVIDIA ION (16 CUDA cores)
  Windows 7 x64
Machines were chosen to measure the effect of system speed on TLP.
A fast system may allow the OS to schedule tasks on the same core all the time.
Slide7
Measurement Infrastructure
Developed system-wide monitoring programs for both client OSes
Mac OS X 10.6
  DTrace to track thread context switches
  I/O Kit probing of the GPU driver for GPU utilization
Windows 7
  Event Tracing for Windows (ETW) to track thread context switches
  NVPerfKit SDK for GPU utilization
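As a rough illustration of how context-switch records could be reduced to the c_i values used for TLP, here is a minimal Python sketch. The event format (timestamp, CPU id, busy flag) is a hypothetical simplification and is not the actual DTrace or ETW record layout. The function name `concurrency_histogram` is also an assumption made for this example.

```python
from collections import defaultdict


def concurrency_histogram(events, end_time):
    """Return {i: fraction of time exactly i CPUs were busy}.

    events: iterable of (timestamp, cpu_id, busy) tuples (hypothetical format),
    where busy is True when a non-idle thread is switched in on cpu_id and
    False when the idle thread is switched in. All CPUs are assumed idle at
    time 0, and end_time must be positive.
    """
    busy_cpus = set()
    time_at_level = defaultdict(float)
    last_t = 0.0
    for t, cpu, busy in sorted(events):
        # Charge the elapsed interval to the concurrency level that held during it.
        time_at_level[len(busy_cpus)] += t - last_t
        last_t = t
        if busy:
            busy_cpus.add(cpu)
        else:
            busy_cpus.discard(cpu)
    time_at_level[len(busy_cpus)] += end_time - last_t
    return {i: d / end_time for i, d in sorted(time_at_level.items())}


# Hypothetical 4-second trace on a 2-CPU machine.
trace = [
    (1.0, 0, True),   # CPU0 starts running a thread
    (2.0, 1, True),   # CPU1 starts running a thread
    (3.0, 1, False),  # CPU1 goes idle
    (4.0, 0, False),  # CPU0 goes idle
]
hist = concurrency_histogram(trace, end_time=4.0)
print(hist)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

With this histogram, the TLP definition from the Metrics slide gives (0.5 * 1 + 0.25 * 2) / (1 - 0.25), or about 1.33.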
Slide8
Benchmarking
Tested software in six categories
  Games
  Image Authoring
  Office Productivity
  Multimedia Playback
  Video Authoring/CUDA-enabled video authoring
  Web Browsing
Used detailed task sets and input parameters
Tests performed by user; results fully reproducible, low variance
Test length was 5 minutes or more
Details, input sets and tracing tools can be found here:
http://itlpbench.eecs.umich.edu
Slide9
Are threads used?
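Benchmark        Threads Created   Avg. Threads Alive
Handbrake 0.9    2251124 (column boundary lost in source)
Call of Duty 4   77                44
Photoshop CS4    8275 (column boundary lost in source)
Adobe Reader 9   23924 (column boundary lost in source)
Quicktime-HD     5352 (column boundary lost in source)
Firefox 3.5      522               38
Many threads created
Many threads alive and visible to the OS during runtime
Slide10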
10 Year Comparison
[Chart not reproduced; annotations:]
Requires high performance, progress made in 10 years
Requires high performance, little progress in 10 years
Elsewhere, very little progress in 10 years
Slide11
Overall TLP Results – Xeon Windows 7
8 Core SMT - System Wide TLP
[Heat map of time spent with 0 (idle) to 16 hardware threads active, shaded from 0% to 100%, not reproduced]
Category          Application        TLP   AVG TLP
Game              Bioshock           1.6   1.7
                  Call of Duty 4     2.1
                  Crysis             1.4
Image Authoring   Maya3D 2010        2.4   2.2
                  Photoshop CS4      2.0
Office            Adobe Reader 9     1.3   1.2
                  Excel 2007         1.2
                  PowerPoint 2007    1.2
                  Streets 2010       1.4
                  Word 2007          1.2
Playback          iTunes 9           1.3   1.6
                  Quicktime          1.3
                  QuicktimeHD        2.1
CUDA              Badaboom           1.3   2.2
                  PowerDirector v9   3.2
Video Authoring   Handbrake          8.4   6.6
                  PowerDirector v9   4.8
Web Browsing      Firefox 3.5*       1.5   1.6
                  Safari 4.0*        1.6
Slide12
Overall TLP Results – Atom Windows 7
System TLP
[Heat map of time spent with 0 (idle) to 4 hardware threads active, shaded from 0% to 100%, not reproduced]
Category          Application        TLP   AVG
Game              Bioshock           2.2   2.0
                  Call of Duty 4     2.1
                  Crysis             1.8
Image Authoring   Maya3D 2010        2.1   1.9
                  Photoshop CS4      1.6
Office            Adobe Reader 9     1.5   1.4
                  Excel 2007         1.3
                  PowerPoint 2007    1.4
                  Streets 2010       1.6
                  Word 2007          1.3
Playback          iTunes 9           1.5   2.1
                  Quicktime          2.1
                  Quicktime HD       2.6
CUDA              Badaboom           1.3   2.0
                  PowerDirector v8   2.8
Video Authoring   Handbrake          3.8   3.6
                  PowerDirector v8   3.3
Web Browsing      Firefox 3.5*       1.6   1.7
                  Safari 4.0*        1.8
Slide13
Call of Duty 4: TLP vs Time – Xeon Windows 7
[Figure: active/idle state of CPU0 through CPU15 plotted against Time (3s)]
75 threads spawned during execution, average of 44 live threads at any time
[Hauser et al., SIGOPS ‘93]
Slide14
TLP vs Time – Atom Windows 7
[Figure: two panels, Call of Duty 4 and Firefox 3.5, each showing CPU0 through CPU3 plotted against Time (3s)]
522 threads spawned, average of 38 live threads at any time
Slide15
Photoshop CS4 – TLP vs Time (Xeon)
Slide16
GPU Measurements - Throughput
[Lee et al., ISCA’10]
Slide17
Conclusions
Little change in TLP over ten years
Single thread speed has little impact on TLP
Single thread performance is still important
Specific applications do take advantage of resources
Large amounts of silicon are underutilized
Debatable whether aggressively increasing core count is the correct direction
  Can desktop applications be parallelized effectively?
  Would architecture specialization be more beneficial?
Slide18
Future Directions and Work
Categorizing thread use in desktop applications
Better performance metrics
Perform critical path analysis
Detailed characterization of instruction stream
Slide19
Questions
Slide20
Backup Slides
Slide21
Motivation
The mobile market is even larger: ~1.2 billion units shipped in 2009
  174.1 million units were smartphones
Source: IDC
Slide22
Overall Results – Fast System
Slide23
Overall Results – Slow System
Slide24
Web Browser Results Detail
Slide25
Firefox 3.5: TLP vs Time (Xeon)
[Figure: active/idle state of each hardware context plotted against Time (3s)]
Slide26
TLP vs Time – PowerDirector
Slide27
TLP vs Time – Handbrake
Slide28
TLP – OS Comparison
Small differences between Windows and OS X
Applications written originally for a particular platform perform better than the port
Slide29
Discussion - Atom
Reduced-performance cores do not appreciably increase TLP
Lack of TLP appears to be due more to software design
Single thread performance is still important for desktop/laptop
Many slow cores over few fast cores may be a bad fit for the desktop/laptop space
Slide30
Overall TLP Results - Xeon
8 Core SMT - System Wide TLP
Columns 0 to 16: share of time with that many hardware threads active (0 = idle). "###" cells are unreadable in the source.
Application           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   TLP     σ  GPU Util
Game
Bioshock             1%  57%  31%   7%   1%   2%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.6  0.05       75%
Call of Duty 4       0%  12%  35%  20%  14%   8%   7%   2%   1%   0%   0%   0%   0%   0%   0%   0%   0%   2.1  0.21       86%
Crysis               1%  72%  23%   4%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.4  0.07       84%
Image Authoring
Maya3D 2010         55%  34%   6%   1%   0%   0%   0%   0%   1%   0%   0%   0%   0%   0%   0%   0%   2%   2.4  0.53       18%
Photoshop CS4       43%  43%   7%   1%   0%   0%   0%   0%   3%   0%   0%   0%   0%   0%   0%   0%   1%   2.0  0.56       17%
Office
Adobe Reader 9      65%  25%   8%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.05       23%
Excel 2007          72%  23%   4%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.02       10%
PowerPoint 2007     69%  25%   5%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.03       16%
Streets 2010        68%  23%   7%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.4  0.01       14%
Word 2007           74%  22%   4%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.04       16%
Playback
iTunes 9            71%  23%   5%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.16       22%
Quicktime           50%  38%  10%   2%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.01       43%
QuicktimeHD         66%  22%   5%   1%   1%   0%   0%   0%   3%   0%   0%   0%   0%   0%   0%   0%   0%   2.1  0.06       40%
CUDA
Badaboom            54%  35%   9%   2%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.03       95%
PowerDirector v9    42%  20%  12%   6%   5%   4%   3%   2%   3%   1%   0%   0%   0%   0%   0%   0%   0%   3.2  0.52       28%
Video Authoring
Handbrake            1%   0%   0%   0%   1%   3%   9%  17%  22%  20%  14%   8%   4%   1%   0%   0%   0%   8.4  0.02        8%
PowerDirector v9    27%  20%  11%   6%   4%   3%   2%   6%   8%   5%   5%   3%   0%   0%   0%   0%   0%   4.8  0.15       18%
Web Browsing
Firefox 3.5*        66%   24    6    1  ###  ###  ###  ###    1  ###  ###  ###  ###  ###  ###  ###  ###   1.5  0.05       24%
Safari 4.0*         50%   34   11    3    1  ###  ###  ###    1  ###  ###  ###  ###  ###  ###  ###  ###   1.6  0.06       24%
Slide31
Overall TLP Results - Atom
System TLP
Columns 0 to 4: share of time with that many hardware threads active (0 = idle).
Application           0    1    2    3    4   TLP     σ
Game
Bioshock             2%  25%  35%  27%  11%   2.2  0.04
Call of Duty 4       1%  37%  27%  26%  10%   2.1  0.05
Crysis               0%  39%  44%  14%   2%   1.8  0.02
Image Authoring
Maya3D 2010         20%  41%  13%   5%  21%   2.1  0.05
Photoshop CS4        8%  59%  17%   6%  10%   1.6  0.11
Office
Adobe Reader 9      40%  36%  19%   5%   1%   1.5  0.03
Excel 2007          45%  40%  12%   2%   0%   1.3  0.01
PowerPoint 2007     38%  42%  16%   4%   1%   1.4  0.01
Streets 2010        39%  34%  20%   5%   2%   1.6  0.02
Word 2007           35%  49%  13%   3%   0%   1.3  0.01
Playback
iTunes 9            24%  45%  24%   6%   1%   1.5  0.09
Quicktime            4%  28%  39%  23%   6%   2.1  0.10
Quicktime HD        11%  19%  22%  22%  26%   2.6  0.01
CUDA
Badaboom            68%  23%   9%   1%   0%   1.3  0.04
PowerDirector v8     8%  17%  20%  23%  32%   2.8  0.07
Video Authoring
Handbrake            0%   0%   2%  10%  88%   3.8  0.04
PowerDirector v8     3%   8%  11%  19%  58%   3.3  0.04
Web Browsing
Firefox 3.5*        25%  42%  19%   9%   5%   1.6  0.04
Safari 4.0*         23%  35%  21%  12%   8%   1.8  0.08
Slide32
Discussion
Many threads, but few used concurrently
Lack of concurrency appears to be due to software design issues
Single thread performance is still important
Underutilized GPU may offer additional opportunities
Unlikely that programmers will quickly take advantage of multi-cores
Focus on desktop/laptop applications should be greater
  Well-understood programs like video transcoding are already parallel
  Others, like web browsers, use only 1-2 cores