Evolution of Thread-Level Parallelism in Desktop Applications


Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, Krisztián Flautner. University of Michigan, Ann Arbor, and ARM. ISCA 2010, June 22, 2010.



Presentation Transcript

Slide1

Evolution of Thread-Level Parallelism in Desktop Applications

Geoffrey Blake*, Ronald G. Dreslinski*, Trevor Mudge*, Krisztián Flautner†

*University of Michigan, Ann Arbor
†ARM

ISCA 2010

June 22, 2010

Slide2

Introduction

2000
Single-core machines common
Clock speed steadily increasing; Intel predicts it will reach 10GHz by 2010

2005
Consumer CMPs announced (from Intel/AMD)
Aggressive clock speed increases halt

2010 and Future
Multi-core machines common
Core counts steadily increasing

[Figure: processor timeline showing Pentium 4, Core Duo, Core i7, Nehalem EX]

“If you build it, they will come” -- “Field of Dreams,” 1989

Slide3

Motivation

Server and scientific workloads are already parallel
What about desktop/laptop workloads?
Flautner et al. at ASPLOS-IX studied interactive desktop workloads
Almost all applications behaved as single threaded
1 extra core helped system responsiveness
Multiprocessor desktop/laptop systems are now common
Has desktop/laptop software followed?

[Figure: performance vs. # of cores, with curves labeled Ideal Scaling, Server/Scientific Scaling, and Desktop Scaling?]

Slide4

Motivation

Over the past 5 years of ISCA, server and scientific workload research has dominated
The market shows the opposite
Is this the correct domain to invest disproportionate effort into? Is there a trickle-down effect?

*Source: IDC and Gartner 2009
*Source: ISCA ‘05-’09

Slide5

Metrics

Replicate contemporary experiments of previous work
Measure “Thread Level Parallelism” (TLP) instead of utilization
c_i = fraction of time i CPUs are doing work
c_0 = idle (0 CPUs are doing work)
c_1 = 1 CPU doing work, c_2 = 2 CPUs doing work

Example: c_0 = 0.5, c_1 = 0.25, c_2 = 0.25
Util = 0.25 * 1 + 0.25 * 2 = 0.75
TLP = (0.25 * 1 + 0.25 * 2) / (1 - 0.5) = 1.5

TLP is a measure of how efficiently the system is using parallel resources when work needs to be done
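
The slide defines the metric only through a worked example. The Python sketch below repeats that arithmetic, assuming the general form implied by the example: TLP is the sum of i * c_i for i >= 1, divided by (1 - c_0). The function names `utilization` and `tlp` are illustrative and do not come from the slides.

```python
from typing import Sequence


def utilization(c: Sequence[float]) -> float:
    """Average number of busy CPUs: sum of i * c[i] (the slide's 'Util')."""
    return sum(i * ci for i, ci in enumerate(c))


def tlp(c: Sequence[float]) -> float:
    """Average number of busy CPUs, counted over the non-idle time only."""
    busy_fraction = 1.0 - c[0]
    if busy_fraction <= 0.0:
        raise ValueError("system was idle the whole time; TLP is undefined")
    return utilization(c) / busy_fraction


# Example from the slide: idle 50% of the time, 1 CPU busy 25%, 2 CPUs busy 25%.
slide_example = [0.5, 0.25, 0.25]
print(utilization(slide_example))  # 0.75
print(tlp(slide_example))          # 1.5
```

The same function can be used to sanity-check the backup tables: the Bioshock row for the Atom system (2%, 25%, 35%, 27%, 11%) gives roughly 2.2, which agrees with the TLP reported there.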

*[Flautner et al., ASPLOS’00]

Slide6

Test Systems

2009 Mac Pro (fast, high-end)
2x 2.26GHz Intel Xeon E5520 (8 cores, 16 hardware threads)
NVIDIA GTX 285/GT120 (240/32 CUDA cores)
Mac OS 10.6 + Windows 7 x64

ASUS ASRock (slow, cheap)
1x 1.6GHz Intel Atom 330 (2 cores, 4 hardware threads)
NVIDIA ION (16 CUDA cores)
Windows 7 x64

Machines were chosen to measure the effect of system speed on TLP.
A fast system may allow the OS to schedule tasks on the same core all the time.

Slide7

Measurement Infrastructure

Developed system-wide monitoring programs for both client OSes

Mac OS X 10.6
  DTrace to track thread context switches
  I/O Kit probing of GPU driver for GPU utilization

Windows 7
  Event Tracing for Windows (ETW) to track thread context switches
  NVPerfKit SDK for GPU utilization
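
A context-switch trace can be reduced to the c_i values used by the TLP metric. The sketch below is one way to do that in Python and is not the authors' tooling. It assumes a hypothetical, already-parsed event format `(timestamp, cpu, busy)`, where `busy` is true when a non-idle thread is switched in on that CPU. It also assumes all CPUs are idle before the first event. The names `Event` and `concurrency_histogram` are illustrative.

```python
from typing import Iterable, List, Tuple

# Hypothetical event: (timestamp_seconds, cpu_id, busy), where busy is True
# when a non-idle thread is switched in and False when the CPU goes idle.
Event = Tuple[float, int, bool]


def concurrency_histogram(events: Iterable[Event], n_cpus: int,
                          end_time: float) -> List[float]:
    """Return c[0..n_cpus]: the fraction of time exactly i CPUs were busy."""
    ordered = sorted(events)
    if not ordered:
        raise ValueError("trace contains no events")
    busy = [False] * n_cpus                 # current state of each CPU
    time_at_level = [0.0] * (n_cpus + 1)    # seconds spent with i CPUs busy
    prev_t = ordered[0][0]
    for t, cpu, is_busy in ordered:
        # Charge the elapsed interval to the concurrency level that held in it.
        time_at_level[sum(busy)] += t - prev_t
        busy[cpu] = is_busy
        prev_t = t
    time_at_level[sum(busy)] += end_time - prev_t
    total = sum(time_at_level)
    if total <= 0.0:
        raise ValueError("trace covers no time")
    return [x / total for x in time_at_level]


# Two CPUs over 4 seconds: 1 busy, then 2 busy, then 1 busy, then idle.
trace = [(0.0, 0, True), (1.0, 1, True), (2.0, 1, False), (3.0, 0, False)]
print(concurrency_histogram(trace, n_cpus=2, end_time=4.0))  # [0.25, 0.5, 0.25]
```

Charging each interval between consecutive events to the number of busy CPUs during that interval yields time-weighted fractions, which is what the c_i definition calls for.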

Slide8

Benchmarking

Tested software in six categories
  Games
  Image Authoring
  Office Productivity
  Multimedia Playback
  Video Authoring/CUDA-enabled video authoring
  Web Browsing

Used detailed task sets and input parameters

Tests performed by user – results fully reproducible, low variance

Test length was 5 minutes or more

Details, input sets and tracing tools can be found here: http://itlpbench.eecs.umich.edu

Slide9

Are threads used?

Benchmark          Threads Created / Avg. Threads Alive
Handbrake 0.9      2251124
Call of Duty 4     77 / 44
Photoshop CS4      8275
Adobe Reader 9     23924
Quicktime-HD       5352
Firefox 3.5        522 / 38

Many threads created

Many threads alive and visible to the OS during runtime
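
The slides do not say how "Avg. Threads Alive" was computed. One plausible reading is a time-weighted average of the live-thread count over the run, sketched below in Python. The input format, a list of `(created, exited)` timestamps per thread, and the name `average_threads_alive` are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def average_threads_alive(lifetimes: Iterable[Tuple[float, float]],
                          start: float, end: float) -> float:
    """Time-weighted average number of live threads over [start, end]."""
    if end <= start:
        raise ValueError("measurement window must have positive length")
    alive_time = 0.0
    for created, exited in lifetimes:
        # Count only the part of each lifetime that falls inside the window.
        overlap = min(exited, end) - max(created, start)
        if overlap > 0:
            alive_time += overlap
    return alive_time / (end - start)


# Four threads created during a 10-second run.
lifetimes = [(0.0, 10.0), (0.0, 5.0), (5.0, 10.0), (2.0, 4.0)]
print(len(lifetimes))                                  # threads created: 4
print(average_threads_alive(lifetimes, 0.0, 10.0))     # 2.2
```

Under this definition, a program can create many short-lived threads while keeping the average live count low, which is why the two columns can differ widely.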

Slide10

10 Year Comparison

Requires high performance, progress made in 10 years
Requires high performance, little progress in 10 years
Elsewhere, very little progress in 10 years

Slide11

Overall TLP Results – Xeon Windows 7

[Heat map: 8 Core SMT, system-wide TLP; share of time (0% to 100%) with 0 (idle) to 16 hardware contexts active, per application]

Category          Application         TLP   AVG TLP
Game              Bioshock            1.6   1.7
                  Call of Duty 4      2.1
                  Crysis              1.4
Image Authoring   Maya3D 2010         2.4   2.2
                  Photoshop CS4       2.0
Office            Adobe Reader 9      1.3   1.2
                  Excel 2007          1.2
                  PowerPoint 2007     1.2
                  Streets 2010        1.4
                  Word 2007           1.2
Playback          iTunes 9            1.3   1.6
                  Quicktime           1.3
                  QuicktimeHD         2.1
CUDA              Badaboom            1.3   2.2
                  PowerDirector v9    3.2
Video Authoring   Handbrake           8.4   6.6
                  PowerDirector v9    4.8
Web Browsing      Firefox 3.5*        1.5   1.6
                  Safari 4.0*         1.6

Slide12

Overall TLP Results – Atom Windows 7

[Heat map: system TLP; share of time (0% to 100%) with 0 (idle) to 4 hardware contexts active, per application]

Category          Application         TLP   AVG TLP
Game              Bioshock            2.2   2.0
                  Call of Duty 4      2.1
                  Crysis              1.8
Image Authoring   Maya3D 2010         2.1   1.9
                  Photoshop CS4       1.6
Office            Adobe Reader 9      1.5   1.4
                  Excel 2007          1.3
                  PowerPoint 2007     1.4
                  Streets 2010        1.6
                  Word 2007           1.3
Playback          iTunes 9            1.5   2.1
                  Quicktime           2.1
                  Quicktime HD        2.6
CUDA              Badaboom            1.3   2.0
                  PowerDirector v8    2.8
Video Authoring   Handbrake           3.8   3.6
                  PowerDirector v8    3.3
Web Browsing      Firefox 3.5*        1.6   1.7
                  Safari 4.0*         1.8

Slide13

Call of Duty 4: TLP vs Time – Xeon Windows 7

[Figure: active/idle state of CPU0 through CPU15 over time, 3 s window]

75 threads spawned during execution, average of 44 live threads at any time

[Hauser et al., SIGOPS ‘93]

Slide14

TLP vs Time – Atom Windows 7

[Figure: activity of CPU0 through CPU3 over time, 3 s window, one panel each for Call of Duty 4 and Firefox 3.5]

522 threads spawned, average of 38 live threads at any time

Slide15

Photoshop CS4 – TLP vs Time (Xeon)

Slide16

GPU Measurements - Throughput

[Lee et al., ISCA’10]

Slide17

Conclusions

Little change in TLP over ten years
Single thread speed has little impact on TLP
Single thread performance is still important
Specific applications do take advantage of resources
Large amounts of silicon are underutilized
Debatable whether aggressively increasing core count is the correct direction
Can desktop applications be parallelized effectively?
Would architecture specialization be more beneficial?

Slide18

Future Directions and Work

Categorizing thread use in desktop applications
Better performance metrics
Perform critical path analysis
Detailed characterization of instruction stream

Slide19

Questions?

Slide20

Backup Slides

Slide21

Motivation

The mobile market is even larger:
~1.2 billion units shipped in 2009
174.1 million units were smartphones
Source: IDC

Slide22

Overall Results – Fast System

Slide23

Overall Results – Slow System

Slide24

Web Browser Results Detail

Slide25

Firefox 3.5: TLP vs Time (Xeon)

[Figure: active/idle state of each hardware context over time, 3 s window]

Slide26

TLP vs Time – PowerDirector

Slide27

TLP vs Time – Handbrake

Slide28

TLP – OS Comparison

Small differences between Windows and OS X
Applications written originally for a particular platform perform better than the port

Slide29

Discussion - Atom

Reduced performance cores do not appreciably increase TLP
Lack of TLP appears to be due more to software design
Single thread performance is still important for desktop/laptop
Many slow cores over few fast cores may be a bad fit for the desktop/laptop space

Slide30

Overall TLP Results - Xeon

8 Core SMT - System Wide TLP: share of time with 0 (idle) to 16 hardware contexts active, followed by TLP, σ, and GPU utilization

Application           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   TLP     σ  GPU Util

Game
Bioshock             1%  57%  31%   7%   1%   2%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.6  0.05   75%
Call of Duty 4       0%  12%  35%  20%  14%   8%   7%   2%   1%   0%   0%   0%   0%   0%   0%   0%   0%   2.1  0.21   86%
Crysis               1%  72%  23%   4%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.4  0.07   84%

Image Authoring
Maya3D 2010         55%  34%   6%   1%   0%   0%   0%   0%   1%   0%   0%   0%   0%   0%   0%   0%   2%   2.4  0.53   18%
Photoshop CS4       43%  43%   7%   1%   0%   0%   0%   0%   3%   0%   0%   0%   0%   0%   0%   0%   1%   2.0  0.56   17%

Office
Adobe Reader 9      65%  25%   8%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.05   23%
Excel 2007          72%  23%   4%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.02   10%
PowerPoint 2007     69%  25%   5%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.03   16%
Streets 2010        68%  23%   7%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.4  0.01   14%
Word 2007           74%  22%   4%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.2  0.04   16%

Playback
iTunes 9            71%  23%   5%   1%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.16   22%
Quicktime           50%  38%  10%   2%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.01   43%
QuicktimeHD         66%  22%   5%   1%   1%   0%   0%   0%   3%   0%   0%   0%   0%   0%   0%   0%   0%   2.1  0.06   40%

CUDA
Badaboom            54%  35%   9%   2%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   0%   1.3  0.03   95%
PowerDirector v9    42%  20%  12%   6%   5%   4%   3%   2%   3%   1%   0%   0%   0%   0%   0%   0%   0%   3.2  0.52   28%

Video Authoring
Handbrake            1%   0%   0%   0%   1%   3%   9%  17%  22%  20%  14%   8%   4%   1%   0%   0%   0%   8.4  0.02    8%
PowerDirector v9    27%  20%  11%   6%   4%   3%   2%   6%   8%   5%   5%   3%   0%   0%   0%   0%   0%   4.8  0.15   18%

Web Browsing
Firefox 3.5*        66%  24%   6%   1%  ###  ###  ###  ###   1%  ###  ###  ###  ###  ###  ###  ###  ###   1.5  0.05   24%
Safari 4.0*         50%  34%  11%   3%   1%  ###  ###  ###   1%  ###  ###  ###  ###  ###  ###  ###  ###   1.6  0.06   24%

Slide31

Overall TLP Results - Atom

System TLP: share of time with 0 (idle) to 4 hardware contexts active, followed by TLP and σ

Application           0    1    2    3    4   TLP     σ

Game
Bioshock             2%  25%  35%  27%  11%   2.2  0.04
Call of Duty 4       1%  37%  27%  26%  10%   2.1  0.05
Crysis               0%  39%  44%  14%   2%   1.8  0.02

Image Authoring
Maya3D 2010         20%  41%  13%   5%  21%   2.1  0.05
Photoshop CS4        8%  59%  17%   6%  10%   1.6  0.11

Office
Adobe Reader 9      40%  36%  19%   5%   1%   1.5  0.03
Excel 2007          45%  40%  12%   2%   0%   1.3  0.01
PowerPoint 2007     38%  42%  16%   4%   1%   1.4  0.01
Streets 2010        39%  34%  20%   5%   2%   1.6  0.02
Word 2007           35%  49%  13%   3%   0%   1.3  0.01

Playback
iTunes 9            24%  45%  24%   6%   1%   1.5  0.09
Quicktime            4%  28%  39%  23%   6%   2.1  0.10
Quicktime HD        11%  19%  22%  22%  26%   2.6  0.01

CUDA
Badaboom            68%  23%   9%   1%   0%   1.3  0.04
PowerDirector v8     8%  17%  20%  23%  32%   2.8  0.07

Video Authoring
Handbrake            0%   0%   2%  10%  88%   3.8  0.04
PowerDirector v8     3%   8%  11%  19%  58%   3.3  0.04

Web Browsing
Firefox 3.5*        25%  42%  19%   9%   5%   1.6  0.04
Safari 4.0*         23%  35%  21%  12%   8%   1.8  0.08

Slide32

Discussion

Many threads, but few used concurrently
Lack of concurrency appears to be due to software design issues
Single thread performance is still important
Underutilized GPU may offer additional opportunities
Unlikely that programmers will quickly take advantage of multi-cores
Focus on desktop/laptop applications should be greater
Well-understood programs like video transcoding are already parallel
Others, like web browsers, use only 1 – 2 cores