
Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Uploaded On 2020-08-04




Presentation Transcript

Slide1

Rapid Identification of Architectural Bottlenecks via Precise Event Counting

John Demme, Simha Sethumadhavan

Columbia University

{jdd,simha}@cs.columbia.edu

Slide2

2002

Platforms

Source: TIOBE Index, http://www.tiobe.com/index.php/tiobe_index

CASTL: Computer Architecture and Security Technologies Lab

Slide3

2011

Platforms

Multicore

Moore’s Law

Source: TIOBE Index, http://www.tiobe.com/index.php/tiobe_index

Slide4

How can we possibly keep up?


Slide5

Architectural Lifecycle


Slide6

Performance Data Collection

Analytical Models

Fast, but questionable accuracy

Simulation

Often the gold standard

Very detailed information

Very slow

Production Hardware (performance counters)

Very fast

Not very detailed

Slide7

Performance Data Collection

Analytical Models

Fast, but questionable accuracy

Simulation

Often the gold standard

Very detailed information

Very slow

Production Hardware (Performance Counters)

Very fast

Not very detailed

Relatively detailed

Slide8

Accuracy, Precision & Perturbation

A comparison of performance monitoring techniques and the uncertainty principle

Slide9

Accuracy, Precision & Perturbation

In normal execution, the program interacts with the microarchitecture as expected

[Figure: timeline of normal program execution and the corresponding machine state (cache, branch predictor, etc.) over time]

Slide10

Precise Instrumentation

When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate

[Figure: timeline of monitored program execution with instrumentation at the start of mutex_lock, mutex_unlock, and barrier_wait, comparing the “correct” machine state (cache, branch predictor, etc.) with the measured machine state over time]

Slide11

Performance Counter SW Landscape

Precise

Reads counters whenever the program or instrumentation requests a read

Heavyweight

Examples: PAPI, perf_event

Overhead: proportional to # of reads; PAPI: 1048ns, perf_event: 262ns

Slide12

Sampling vs. Instrumentation

[Figure: two timelines. Sampled program execution is interrupted every n cycles; traditional instrumented program execution reads counters at the start of mutex_lock, mutex_unlock, and barrier_wait.]

Traditional instrumentation is like polling

Sampling uses interrupts

Slide13

Performance Counter SW Landscape

Sampling

Interrupts every n cycles and extrapolates

Examples: vTune, OProfile

Overhead: inversely proportional to n; up to 20%, usually much less

Precise

Reads counters whenever the program or instrumentation requests a read

Heavyweight

Examples: PAPI, perf_event

Overhead: proportional to # of reads; PAPI: 1048ns, perf_event: 262ns

Slide14

The Problem with Sampling

Sample Interrupt

Is this a critical section?

Slide15

Corrected with Precision

Read counter

Read counter

Slide16

But, Precision Adds Overhead

[Figure: timeline of monitored program execution comparing the “correct” machine state (cache, branch predictor, etc.) with the measured machine state over time]

Slide17

Instrumentation Adds Perturbation

If instrumentation sections are short, perturbation is reduced and measurements become more accurate

[Figure: timeline of monitored program execution comparing the “correct” machine state (cache, branch predictor, etc.) with the measured machine state over time]

Slide18

Performance Counter SW Landscape

Sampling

Interrupts every n cycles and extrapolates

Examples: vTune, OProfile

Overhead: inversely proportional to n; up to 20%, usually much less

Precise

Reads counters whenever the program or instrumentation requests a read

Heavyweight

Examples: PAPI, perf_event

Overhead: proportional to # of reads; PAPI: 1048ns, perf_event: 262ns

Lightweight

Slide19

Performance Counter SW Landscape

Sampling

Interrupts every n cycles and extrapolates

Examples: vTune, OProfile

Overhead: inversely proportional to n; up to 20%, usually much less

Precise

Reads counters whenever the program or instrumentation requests a read

Heavyweight

Examples: PAPI, perf_event

Overhead: proportional to # of reads; PAPI: 1048ns, perf_event: 262ns

Lightweight

Examples: LiMiT

Overhead: proportional to # of reads; 11ns
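The two overhead shapes on this slide can be sketched in a few lines. This is a rough model, not anything taken from the talk: the per-read costs are the figures quoted on the slide, while `interrupt_cost_cycles` and the function names are hypothetical, chosen only to show that sampling overhead falls as 1/n and precise overhead grows linearly with the number of reads.

```python
# Per-read costs quoted on the slide, in nanoseconds.
READ_COST_NS = {"PAPI": 1048, "perf_event": 262, "LiMiT": 11}


def precise_overhead_s(tool: str, reads: int) -> float:
    """Time spent reading counters: grows linearly with the number of reads."""
    return reads * READ_COST_NS[tool] * 1e-9


def sampling_overhead_fraction(interrupt_cost_cycles: float, n: int) -> float:
    """Rough fraction of run time spent servicing sample interrupts.

    interrupt_cost_cycles is a hypothetical per-interrupt cost, not a
    number from the talk; only the 1/n shape matters here.
    """
    return interrupt_cost_cycles / n


if __name__ == "__main__":
    for tool in READ_COST_NS:
        print(tool, precise_overhead_s(tool, 1_000_000), "s per million reads")
    for n in (10_000, 100_000, 1_000_000):
        print("n =", n, "fraction =", sampling_overhead_fraction(2_000, n))
```

With a cheap enough read, instrumenting every lock or function call stays affordable, which is the gap the lightweight column is meant to fill.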

Slide20

Related Work

No recent papers for better precise counting

Original PAPI paper: Browne et al. 2000

Some software, none offering LiMiT’s features

Characterizing performance counters

Weaver & Dongarra 2010

Sampling

Counter multiplexing techniques

Mytkowicz et al. 2007

Azimi et al. 2005

Trace Alignment

Mytkowicz et al. 2006

Slide21

Reducing counter read overheads

Implementing lightweight, precise monitoring

Slide22

Why Precision is Slow

Avoid system calls to avoid overhead

Perfmon2 & perf_event: the program requests a counter read, a system call enters the kernel, the kernel reads the counter and returns the result, the system call returns, and the program uses the value

LiMiT: the program reads the counter, then the program uses the value

Why is this so hard?

Slide23

A Self-Monitoring Process

Slide24

Run, process, run

[Figure: hardware counters for L1 Misses, Branches, and Cycles incrementing as the process runs]

Slide25

Overflow

[Figure: counters for L1 Misses, Branches, and Cycles; one counter reaches its limit, wraps to zero, and signals an overflow (“Psst!”)]

Slide26

Overflow

[Figure: hardware counters for L1 Misses, Branches, and Cycles next to a per-counter Overflow Space; the wrapped counter reads 0 and its overflow space is raised to 100]

Slide27

Modified Read

[Figure: hardware counters for L1 Misses, Branches, and Cycles next to their Overflow Space; a read returns overflow space plus hardware counter, here 100 + 20 = 120]
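The modified read can be modeled as a toy: a narrow hardware counter that wraps, plus a software "overflow space" that absorbs each wrap. This is a minimal sketch under stated assumptions, not LiMiT's implementation: the wrap limit of 100 mirrors the numbers in the figure, and `VirtualCounter` and its method names are hypothetical.

```python
COUNTER_LIMIT = 100  # toy wrap point, matching the numbers in the figure


class VirtualCounter:
    """Narrow hardware counter plus software-maintained overflow space."""

    def __init__(self) -> None:
        self.hw = 0        # hardware counter, always below COUNTER_LIMIT
        self.overflow = 0  # total of all wrapped-around counts

    def count(self, events: int = 1) -> None:
        for _ in range(events):
            self.hw += 1
            if self.hw == COUNTER_LIMIT:
                # Overflow interrupt: fold the wrapped count into software state.
                self.hw = 0
                self.overflow += COUNTER_LIMIT

    def read(self) -> int:
        # Modified read: overflow space + current hardware value.
        return self.overflow + self.hw


if __name__ == "__main__":
    c = VirtualCounter()
    c.count(120)
    print(c.overflow, c.hw, c.read())
```

The read is two separate steps (fetch the hardware value, add the overflow space), which is what opens the door to the race shown on the following slides.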

Slide28

Overflow During Read

[Figure: the counter being read stands at 99 with its overflow space at 0; the reader has fetched 99 from the hardware counter]

Slide29

Overflow!

[Figure: the counter wraps to 00 and its overflow space goes from 0 to 100 while the reader still holds 99]

Slide30

Atomicity Violation!

[Figure: the reader adds the new overflow space, 100, to its stale counter value, 99, and gets 199]

Slide31

OS Detection & Correction

[Figure: the same overflow again: the counter wraps to 00, its overflow space goes from 0 to 100, and the reader still holds 99]

Slide32

OS Detection & Correction

[Figure: the OS notices the process was in the middle of a read (“Looks like he was reading that…”) and replaces the reader’s stale 99 with 0]

Slide33

Atomicity Violation Corrected

[Figure: the reader adds the overflow space, 100, to the corrected value, 0, and gets 100]
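The race and its correction can be replayed in a small simulation. This is a sketch under loud assumptions: the zero-the-stale-value fix-up is inferred from the figures on these slides rather than confirmed as LiMiT's actual kernel mechanism, and `Machine`, `in_read`, and `os_fixup` are hypothetical names.

```python
COUNTER_LIMIT = 100  # toy wrap point, matching the numbers in the figures


class Machine:
    def __init__(self, hw: int, overflow: int, os_fixup: bool) -> None:
        self.hw = hw              # hardware counter
        self.overflow = overflow  # software overflow space
        self.os_fixup = os_fixup  # whether the OS repairs interrupted reads
        self.reg = 0              # register holding the value read from hardware
        self.in_read = False      # True while the process is mid-read

    def tick(self) -> None:
        """One counted event; on wrap, the overflow interrupt handler runs."""
        self.hw += 1
        if self.hw == COUNTER_LIMIT:
            self.hw = 0
            self.overflow += COUNTER_LIMIT
            if self.os_fixup and self.in_read:
                # Assumed fix-up: the value already in the register has just
                # been folded into the overflow space, so zero the register.
                self.reg = 0

    def read(self, events_mid_read: int = 0) -> int:
        self.in_read = True
        self.reg = self.hw                # step 1: fetch hardware counter
        for _ in range(events_mid_read):  # events landing between the steps
            self.tick()
        total = self.overflow + self.reg  # step 2: add overflow space
        self.in_read = False
        return total


if __name__ == "__main__":
    print(Machine(99, 0, os_fixup=False).read(events_mid_read=1))
    print(Machine(99, 0, os_fixup=True).read(events_mid_read=1))
```

Without the fix-up, the wrapped count is added twice (once as the stale register value, once through the overflow space); with it, the reader sees a value consistent with the overflow space it adds.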

So what does all this effort buy us?

Slide34

Time to collect 3×10^7 readings

Time      PAPI      perf_event   LiMiT    Speedup
User      1.26s     0.53s        0.34s    3.7x / 1.56x
System    30.10s    7.30s        0
Wall      31.44s    7.87s        0.34s    92x / 23.1x

Average LiMiT Readout

Number of instructions   5
Number of cycles         37.14
Time                     11.3 ns
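As a quick cross-check, the wall-clock figures on this slide can be turned back into a per-read cost and a speedup. The sketch below uses only the slide's wall times and the 3×10^7 reading count; the function names are illustrative, not part of any tool.

```python
READINGS = 3 * 10**7

# Wall-clock seconds reported on the slide.
WALL_S = {"PAPI": 31.44, "perf_event": 7.87, "LiMiT": 0.34}


def ns_per_read(tool: str) -> float:
    """Average wall-clock cost of one counter read, in nanoseconds."""
    return WALL_S[tool] / READINGS * 1e9


def speedup_of_limit_over(tool: str) -> float:
    """How many times faster LiMiT collected the same readings."""
    return WALL_S[tool] / WALL_S["LiMiT"]


if __name__ == "__main__":
    for tool in WALL_S:
        print(tool, round(ns_per_read(tool), 1), "ns/read",
              round(speedup_of_limit_over(tool), 1), "x")
```

The per-read costs recovered this way line up with the 1048ns, 262ns, and roughly 11ns figures quoted in the software landscape slides.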

Slide35

LiMiT Enables Detailed Study

Short counter reads decrease perturbation

Little perturbation allows detailed study of

Short synchronization regions

Short function calls

Three Case Studies

Synchronization in production web applications

Not presented here, see paper

Synchronization changes in MySQL over time

User/Kernel code behavior in runtime libraries

Slide36

Case Study: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL

Has MySQL gotten better since the advent of multi-cores?

Slide37

Evolution of Locking in MySQL

Questions to answer

Has MySQL gotten better at locking?

What techniques have been used?

Methodology

Intercept pthread locking calls

Count overheads and critical sections
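The methodology above (intercept locking calls, then account separately for lock overhead and critical-section time) can be illustrated with a wrapper around a lock. This is only a sketch of the idea: the study intercepted pthread calls and read hardware counters, whereas here `time.perf_counter_ns` stands in for a counter read, and `InstrumentedLock` and `run_demo` are hypothetical names.

```python
import threading
import time


class InstrumentedLock:
    """Lock wrapper that records acquisition cost and critical-section time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.acquire_ns = 0  # time spent obtaining the lock (overhead + waiting)
        self.held_ns = 0     # time spent inside critical sections
        self._entered = 0

    def __enter__(self) -> "InstrumentedLock":
        start = time.perf_counter_ns()      # stand-in for a counter read
        self._lock.acquire()
        now = time.perf_counter_ns()
        # Safe to update the statistics here: we hold the lock.
        self.acquire_ns += now - start
        self.acquisitions += 1
        self._entered = now
        return self

    def __exit__(self, *exc) -> bool:
        self.held_ns += time.perf_counter_ns() - self._entered
        self._lock.release()
        return False


def run_demo(workers: int = 4, iterations: int = 1000) -> InstrumentedLock:
    lock = InstrumentedLock()
    shared = []

    def worker() -> None:
        for i in range(iterations):
            with lock:
                shared.append(i)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock


if __name__ == "__main__":
    stats = run_demo()
    print(stats.acquisitions, stats.acquire_ns, stats.held_ns)
```

Because every lock and unlock triggers two reads, the cost of a single read directly bounds how short a critical section can be measured without distorting it.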

Slide38

MySQL Synchronization Times

Slide39

MySQL Critical Sections

Slide40

Number of Locks in MySQL

Slide41

Observations & Implications

Coarser granularity, better performance

Total critical section time has decreased

Average CS times have increased

Number of locks has decreased

Performance counters useful for software engineering studies

Slide42

Case Study: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY

Does code in the kernel and runtime library behave?

Slide43

Full System Analysis w/o Simulation

Questions to answer

How much time do system applications spend in runtime libraries?

How well do they perform in them? Why?

Methodology

Intercept common libc, libm and libpthread calls

Count user-/kernel-space events during the calls

Break down by purpose (I/O, Memory, Pthread)

Applications

MySQL, Apache

Intel Nehalem Microarchitecture

Slide44

Execution Cycles in Library Calls

Slide45

MySQL Clocks per Instruction

Slide46

L3 Cache MPKI

Slide47

I-Cache Stall Cycles

22.4%

12.0%

Slide48

Observations & Implications

Apache is fundamentally I/O bound

Optimization of the I/O subsystem necessary

Kernel code suffers from I-Cache stalls

Speculation: bad interrupt instruction prefetching

LiMiT yields detailed performance data

Not as accurate or detailed as simulation

But gathered in hours rather than weeks

Slide49

Conclusions

Research Methodology Implications, Closing thoughts

Slide50

Conclusions

Implications from case studies

MySQL’s multicore experience helped scalability

Performance counting for non-architecture

Libraries and kernels perform very differently

I/O subsystems can be slow

Research Methodology

LiMiT can provide detailed results quickly

Simulators are more detailed but slow

Opportunity to build microbenchmarks

Identify bottlenecks with counters

Verify representativeness with counters

Then simulate

Slide51

Questions?


Slide52

Backup slides

Man down! Need backup!

Slide53

Performance Evaluation Methods

[Table: Simulators, Analytical Models, Prototype Hardware, and Production Hardware rated on Accuracy, Precision, Speed, and Cost; cell ratings not recoverable]

Accuracy and Precision are traded off

Production hardware provides performance counters

However, existing interfaces make the accuracy/precision tradeoff difficult

Slide54

Sampling vs. LiMiT

[Figure: two timelines. Sampled program execution is interrupted every n cycles; LiMiT instrumented program execution reads counters at the start of mutex_lock, mutex_unlock, and barrier_wait.]

Slide55

Another process runs

[Figure: the counters, now labeled Miles, Pushups, and Situps, keep counting while another process runs]

Slide56

Fix: Virtualization

[Figure: counters labeled Miles, Pushups, and Situps, with the dialogue: “30 Miles! I did pretty well today.” “No you didn’t.”]

Slide57

Avoiding Communication

[Figure: two sets of counters labeled Miles, Pushups, and Situps, one holding 24, 39, 7 and the other 0, 0, 30]

Slide58

LiMiT Operation


Slide59

RDTSC


Slide60

MySQL Instrumentation Overhead

Slide61

Case Study A: LOCKING IN WEB WORKLOADS

How does web-related software use locks?

Slide62

Locking on the Web

Questions to answer

Is locking a significant concern?

How can architects help?

Are traditional benchmarks similar?

Methodology

Intercept pthread mutex calls, time w/ LiMiT

Applications

Firefox

Apache

MySQL

PARSEC

Slide63

Execution Time by Region

Slide64

Locking Statistics

                                Firefox   Apache   PARSEC   MySQL
Avg. Lock Held Time (cycles)    789       149      118      1076
Dynamic Locks per 10k Cycles    3.24      1.12     0.545    3.18
Static Locks                    57        1        17       13853

Slide65

Observations & Implications

Applications like Firefox and MySQL use locks differently from Apache and PARSEC

Many notions of synchronization based on scientific computing probably don’t apply

Locking overheads up to 8 - 13%

More efficient mechanisms may be helpful

But, 13% is upper bound on speedup

MySQL has some very long critical sections

Prime targets for micro-arch optimization

If they run faster, MySQL scales better
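The upper-bound remark follows from Amdahl-style arithmetic: if locking accounts for a fraction f of run time, removing it entirely saves at most f of the run time. A minimal sketch, with hypothetical function names and the slide's 8% and 13% figures plugged in:

```python
def max_runtime_saving(lock_fraction: float) -> float:
    """Largest share of run time that free locking could remove."""
    return lock_fraction


def max_speedup(lock_fraction: float) -> float:
    """Speedup factor if all locking overhead disappeared."""
    return 1.0 / (1.0 - lock_fraction)


if __name__ == "__main__":
    for f in (0.08, 0.13):
        print(f, max_runtime_saving(f), round(max_speedup(f), 3))
```

Even in the best case, cheaper locking alone buys a modest gain, which is why the slide turns next to the long critical sections themselves.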

Slide66

Hardware Enhancements

64-bit Reads and Writes

Overflows are primary source of complexity

64-bit counters w/ full read/write eliminates it

Destructive Reads

Difference = 2 reads, store, load & subtract

Destructive read difference = 2 reads

Combined Reads

x86 counter read requires 2 instructions

Combining should reduce overhead

AMD’s Lightweight Profiling Proposal

Really good, depending on microarchitecture
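The destructive-read proposal can be illustrated with a toy counter. This is a software model of the idea only, not a description of any real hardware interface; `ToyCounter` and both helper functions are hypothetical.

```python
class ToyCounter:
    """Software stand-in for a hardware event counter."""

    def __init__(self) -> None:
        self.value = 0

    def count(self, events: int) -> None:
        self.value += events

    def read(self) -> int:
        return self.value

    def destructive_read(self) -> int:
        # Return the current value and reset the counter in one step.
        value, self.value = self.value, 0
        return value


def measure_with_plain_reads(counter: ToyCounter, events: int) -> int:
    start = counter.read()         # read 1, then store the start value
    counter.count(events)          # the region being measured
    return counter.read() - start  # read 2, load the start value, subtract


def measure_with_destructive_reads(counter: ToyCounter, events: int) -> int:
    counter.destructive_read()         # read 1: discard and reset
    counter.count(events)              # the region being measured
    return counter.destructive_read()  # read 2: value is already the difference


if __name__ == "__main__":
    c = ToyCounter()
    c.count(7)
    print(measure_with_plain_reads(c, 30), measure_with_destructive_reads(c, 30))
```

Both helpers report the same difference; the destructive variant simply avoids the store, load, and subtract around the measured region.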