Rapid Identification of Architectural Bottlenecks via Precise Event Counting - PowerPoint Presentation

Uploaded On 2020-08-04

Presentation Transcript

Slide1

Rapid Identification of Architectural Bottlenecks via Precise Event Counting

John Demme, Simha Sethumadhavan

Columbia University

{jdd,simha}@cs.columbia.edu

Slide2

2002

CASTL: Computer Architecture and Security Technologies Lab

Platforms

Source: TIOBE Index, http://www.tiobe.com/index.php/tiobe_index

Slide3

2011

Source: TIOBE Index, http://www.tiobe.com/index.php/tiobe_index

Platforms

Multicore

Moore’s Law

Slide4

How can we possibly keep up?

Slide5

Architectural Lifecycle

Slide6

Performance Data Collection

Analytical Models

Fast, but questionable accuracy

Simulation

Often the gold standard

Very detailed information

Very slow

Production Hardware (performance counters)

Very fast

Not very detailed

Slide7

Performance Data Collection

Analytical Models

Fast, but questionable accuracy

Simulation

Often the gold standard

Very detailed information

Very slow

Production Hardware (Performance Counters)

Very fast

Not very detailed

Relatively detailed

Slide8

Accuracy, Precision & Perturbation

A comparison of performance monitoring techniques and the uncertainty principle

Slide9

Accuracy, Precision & Perturbation

In normal execution, the program interacts with the microarchitecture as expected

[Figure: timeline of normal program execution and the corresponding machine state (cache, branch predictor, etc.)]

Slide10

Precise Instrumentation

When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate

[Figure: timeline of monitored program execution, comparing the “correct” machine state with the measured machine state (cache, branch predictor, etc.); instrumentation runs at the start of mutex_lock, mutex_unlock, and barrier_wait]

Slide11

Performance Counter SW Landscape

Precise, heavyweight

Mechanism: reads counters whenever the program or instrumentation requests a read

Examples: PAPI, perf_event

Overhead: proportional to # of reads (PAPI: 1048 ns per read; perf_event: 262 ns per read)

Slide12

Sampling vs. Instrumentation

[Figure: two execution timelines. Sampled program execution is interrupted every n cycles. Traditional instrumented program execution reads counters at the start of mutex_lock, mutex_unlock, and barrier_wait.]

Traditional instrumentation is like polling

Sampling uses interrupts

Slide13

Performance Counter SW Landscape

Sampling

Mechanism: interrupts every n cycles and extrapolates

Examples: vTune, OProfile

Overhead: inversely proportional to n; up to 20%, usually much less

Precise, heavyweight

Mechanism: reads counters whenever the program or instrumentation requests a read

Examples: PAPI, perf_event

Overhead: proportional to # of reads (PAPI: 1048 ns per read; perf_event: 262 ns per read)

Slide14

The Problem with Sampling

Sample Interrupt

Is this a critical section?

Slide15

Corrected with Precision

Read counter
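The contrast between the last two slides can be illustrated with a small simulation. This is a sketch and not part of the presentation: the workload, the sampling periods, and the function names are all hypothetical. It compares an extrapolated sampling estimate of time spent in short critical sections with the exact total obtained by reading a cycle counter at entry and exit.

```python
def sampled_estimate(sections, total_cycles, n):
    """Estimate cycles spent in critical sections from one sample every n cycles."""
    hits = 0
    for t in range(0, total_cycles, n):
        if any(start <= t < end for start, end in sections):
            hits += 1
    return hits * n  # each hit is extrapolated to a full sampling period


def precise_total(sections):
    """Read the cycle counter at entry and exit of each section; sum the deltas."""
    return sum(end - start for start, end in sections)


# Hypothetical workload: one 50-cycle critical section per 1,000 cycles.
TOTAL = 1_000_000
sections = [(base + 100, base + 150) for base in range(0, TOTAL, 1_000)]

precise = precise_total(sections)
aligned = sampled_estimate(sections, TOTAL, 10_000)  # period aligned with the workload
drifting = sampled_estimate(sections, TOTAL, 997)    # period that drifts across it

print(precise, aligned, drifting)
```

With the aligned period no sample ever lands inside a critical section, so sampling reports nothing at all. The drifting period recovers a reasonable aggregate, but it still cannot say how long any individual section lasted, which is what the precise reads provide.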

Slide16

But, Precision Adds Overhead

[Figure: timeline of monitored program execution, comparing the “correct” machine state with the measured machine state (cache, branch predictor, etc.)]

Slide17

Instrumentation Adds Perturbation

If instrumentation sections are short, perturbation is reduced and measurements become more accurate

[Figure: timeline of monitored program execution, comparing the “correct” machine state with the measured machine state (cache, branch predictor, etc.)]

Slide18

Performance Counter SW Landscape

Sampling

Mechanism: interrupts every n cycles and extrapolates

Examples: vTune, OProfile

Overhead: inversely proportional to n; up to 20%, usually much less

Precise, heavyweight

Mechanism: reads counters whenever the program or instrumentation requests a read

Examples: PAPI, perf_event

Overhead: proportional to # of reads (PAPI: 1048 ns per read; perf_event: 262 ns per read)

Precise, lightweight

(no examples listed)

Slide19

Performance Counter SW Landscape

Sampling

Mechanism: interrupts every n cycles and extrapolates

Examples: vTune, OProfile

Overhead: inversely proportional to n; up to 20%, usually much less

Precise, heavyweight

Mechanism: reads counters whenever the program or instrumentation requests a read

Examples: PAPI, perf_event

Overhead: proportional to # of reads (PAPI: 1048 ns per read; perf_event: 262 ns per read)

Precise, lightweight

Example: LiMiT

Overhead: proportional to # of reads (11 ns per read)
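The two overhead models on this slide can be written down directly. The sketch below uses the per-read costs quoted on the slide; the workload size and the function names are assumptions made for illustration only.

```python
# Per-read costs quoted on the slide, in nanoseconds.
READ_COST_NS = {"PAPI": 1048, "perf_event": 262, "LiMiT": 11}


def precise_overhead_seconds(tool, reads):
    """Precise counting: overhead grows in proportion to the number of reads."""
    return reads * READ_COST_NS[tool] * 1e-9


def sampling_interrupts(total_cycles, n):
    """Sampling: the number of interrupts is inversely proportional to the period n."""
    return total_cycles // n


# Hypothetical workload: one million counter reads.
reads = 1_000_000
overheads = {tool: precise_overhead_seconds(tool, reads) for tool in READ_COST_NS}

for tool, seconds in overheads.items():
    print(f"{tool}: {seconds:.3f} s of read overhead")
```

Doubling n halves the number of sampling interrupts, while the precise tools pay a fixed price per read, so the per-read cost is what decides whether fine-grained instrumentation is affordable.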

Slide20

Related Work

No recent papers for better precise counting

Original PAPI paper: Browne et al. 2000

Some software, none offering LiMiT’s features

Characterizing performance counters: Weaver & Dongarra 2010

Sampling

Counter multiplexing techniques: Mytkowicz et al. 2007; Azimi et al. 2005

Trace alignment: Mytkowicz et al. 2006

Slide21

Reducing counter read overheads

Implementing lightweight, precise monitoring

Slide22

Why Precision is Slow

Avoid system calls to avoid overhead

Perfmon2 & perf_event: program requests counter read → system call → kernel reads counter and returns result → system return → program uses value

LiMiT: program reads counter → program uses value

Why is this so hard?

Slide23

A Self-Monitoring Process

Slide24

Run, process, run

L1 Misses

Branches

Cycles

Slide25

Overflow

L1 Misses

Branches

Cycles

Psst

Slide26

Overflow

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; value shown: 100]

Slide27

Modified Read

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; values shown: 100 and 120]

Slide28

Overflow During Read

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space]

Slide29

Overflow!

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; value shown: 100]

Slide30

Atomicity Violation!

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; value shown: 100]

Slide31

OS Detection & Correction

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; value shown: 100]

Slide32

OS Detection & Correction

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; value shown: 100]

Looks like he was reading that…

Slide33

Atomicity Violation Corrected

[Figure: hardware counters (L1 Misses, Branches, Cycles) next to a matching set of counters in overflow space; value shown: 100]
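The overflow sequence on the preceding slides can be replayed in a toy model. This is a sketch under stated assumptions, not the LiMiT implementation: the 8-bit counter width, the class and method names, and the restart-the-read correction are all hypothetical choices made to show the race between the two halves of a read. The slides do not spell out the exact mechanism the OS uses.

```python
WIDTH = 8                 # hypothetical narrow counter, chosen for illustration
MODULUS = 1 << WIDTH


class VirtualCounter:
    def __init__(self):
        self.hw = 0              # value held in the hardware register
        self.overflow_space = 0  # software accumulator maintained by the "OS"
        self.in_read = False     # True while the program is inside a read
        self.restart = False     # set by the "OS" when it interrupts a read

    def count(self, events, fix=True):
        """Advance the hardware counter; on wraparound, run the overflow handler."""
        self.hw += events
        while self.hw >= MODULUS:
            self.hw -= MODULUS
            self.overflow_space += MODULUS
            if fix and self.in_read:
                self.restart = True  # the OS noticed it interrupted a read

    def read(self, events_mid_read=0, fix=True):
        """Two-step read: load overflow space, then read the hardware register."""
        while True:
            self.in_read, self.restart = True, False
            high = self.overflow_space        # step 1
            self.count(events_mid_read, fix)  # events arriving between the steps
            events_mid_read = 0               # they only arrive once
            low = self.hw                     # step 2
            self.in_read = False
            if not self.restart:
                return high + low


broken = VirtualCounter()
broken.count(250)
wrong = broken.read(events_mid_read=10, fix=False)  # stale high half + wrapped low half

fixed = VirtualCounter()
fixed.count(250)
right = fixed.read(events_mid_read=10, fix=True)    # read is redone after the overflow
```

In the uncorrected run the program combines an overflow-space value loaded before the wraparound with a hardware value read after it, so the result is short by one full counter period. In the corrected run the overflow handler flags the interrupted read and the read is repeated.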

So what does all this effort buy us?

Slide34

Time to collect 3*10^7 readings

Time      PAPI      perf_event   LiMiT    Speedup (vs. PAPI / vs. perf_event)
User      1.26s     0.53s        0.34s    3.7x / 1.56x
System    30.10s    7.30s        -        ∞
Wall      31.44s    7.87s        0.34s    92x / 23.1x

Average LiMiT Readout

Number of instructions: [value missing]

Number of cycles: 37.14

Time: 11.3 ns
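The wall-clock figures above are consistent with the per-read costs quoted earlier in the talk. A short check, using only the Wall row and the number of readings from this slide (the variable names are illustrative):

```python
READINGS = 3 * 10**7

# Wall-clock seconds from the Wall row above.
wall = {"PAPI": 31.44, "perf_event": 7.87, "LiMiT": 0.34}

# Average cost of one read, in nanoseconds.
ns_per_read = {tool: secs / READINGS * 1e9 for tool, secs in wall.items()}

# Speedup of LiMiT over each of the other interfaces.
speedup = {tool: wall[tool] / wall["LiMiT"] for tool in ("PAPI", "perf_event")}

for tool, ns in ns_per_read.items():
    print(f"{tool}: {ns:.1f} ns per read")
```

Dividing each wall time by the number of readings gives roughly 1048 ns, 262 ns, and 11.3 ns per read, and the wall-time ratios reproduce the 92x and 23.1x speedups.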

Slide35

LiMiT Enables Detailed Study

Short counter reads decrease perturbation

Little perturbation allows detailed study of

Short synchronization regions

Short function calls

Three Case Studies

Synchronization in production web applications (not presented here, see paper)

Synchronization changes in MySQL over time

User/kernel code behavior in runtime libraries

Slide36

Case Study: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL

Has MySQL gotten better since the advent of multi-cores?

Slide37

Evolution of Locking in MySQL

Questions to answer

Has MySQL gotten better at locking?

What techniques have been used?

Methodology

Intercept pthread locking calls

Count overheads and critical sections
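The methodology above can be sketched as a lock wrapper that reads a counter at three points: on entry to the lock call, once the lock is held, and at unlock. This is an illustration only. The study intercepts pthread calls and reads hardware counters with LiMiT, whereas the hypothetical CountingLock class below wraps Python's threading.Lock and uses time.perf_counter_ns() as a stand-in for a cheap counter read.

```python
import threading
import time


class CountingLock:
    """Lock wrapper that records acquisition overhead and critical-section length."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.acquire_ns = 0   # time spent obtaining the lock
        self.held_ns = 0      # time spent inside critical sections
        self._entered = 0

    def __enter__(self):
        start = time.perf_counter_ns()          # read at start of the lock call
        self._lock.acquire()
        self._entered = time.perf_counter_ns()  # read once the lock is held
        self.acquire_ns += self._entered - start
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        # Read at unlock, while still holding the lock, then release.
        self.held_ns += time.perf_counter_ns() - self._entered
        self._lock.release()
        return False


lock = CountingLock()
total = 0
for i in range(1000):
    with lock:
        total += i

print(lock.acquisitions, lock.acquire_ns, lock.held_ns)
```

Each critical section costs three counter reads, which is why the per-read cost matters: a read that takes hundreds of nanoseconds would dwarf the short critical sections being measured.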

Slide38

MySQL Synchronization Times

Slide39

MySQL Critical Sections

Slide40

Number of Locks in MySQL

Slide41

Observations & Implications

Coarser granularity, better performance

Total critical section time has decreased

Average CS times have increased

Number of locks has decreased

Performance counters are useful for software engineering studies

Slide42

Case Study: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY

Does code in the kernel and runtime library behave?

Slide43

Full System Analysis w/o Simulation

Questions to answer

How much time do system applications spend in runtime libraries?

How well do they perform in them? Why?

Methodology

Intercept common libc, libm and libpthread calls

Count user-/kernel-space events during the calls

Break down by purpose (I/O, Memory, Pthread)

Applications

MySQL, Apache

Intel Nehalem Microarchitecture
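The interception-and-breakdown step can be sketched with a decorator that charges the time spent in each wrapped call to a category. Everything here is an assumption for illustration: the study wraps real libc, libm and libpthread calls and counts user- and kernel-space hardware events separately, while this sketch wraps two hypothetical stand-in functions and uses time.perf_counter_ns() in place of hardware counters.

```python
import functools
import time
from collections import defaultdict

ns_by_purpose = defaultdict(int)     # time charged to each category
calls_by_purpose = defaultdict(int)  # intercepted calls per category


def intercept(purpose):
    """Wrap a call so that the time spent inside it is charged to `purpose`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter_ns()  # counter read on entry
            try:
                return func(*args, **kwargs)
            finally:
                ns_by_purpose[purpose] += time.perf_counter_ns() - start
                calls_by_purpose[purpose] += 1
        return wrapper
    return decorator


@intercept("I/O")
def fake_read(n):      # hypothetical stand-in for a libc read
    return bytes(n)


@intercept("Memory")
def fake_malloc(n):    # hypothetical stand-in for malloc
    return bytearray(n)


fake_read(64)
fake_read(128)
fake_malloc(4096)

print(dict(calls_by_purpose))
```

Summing per category rather than per function is what allows a breakdown by purpose such as I/O, Memory, and Pthread.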

Slide44

Execution Cycles in Library Calls

Slide45

MySQL Clocks per Instruction

Slide46

L3 Cache MPKI

Slide47

I-Cache Stall Cycles

22.4%

12.0%

Slide48

Observations & Implications

Apache is fundamentally I/O bound

Optimization of the I/O subsystem necessary

Kernel code suffers from I-Cache stalls

Speculation: bad interrupt instruction prefetching

LiMiT yields detailed performance data

Not as accurate or detailed as simulation

But gathered in hours rather than weeks

Slide49

Conclusions

Research methodology implications, closing thoughts

Slide50

Conclusions

Implications from case studies

MySQL’s multicore experience helped scalability

Performance counting for non-architecture

Libraries and kernels perform very differently

I/O subsystems can be slow

Research Methodology

LiMiT can provide detailed results quickly

Simulators are more detailed but slow

Opportunity to build microbenchmarks

Identify bottlenecks with counters

Verify representativeness with counters

Then simulate

Slide51

Questions?

Slide52

Backup slides

Man down! Need backup!

Slide53

Performance Evaluation Methods

[Table: Simulators, Analytical Models, Prototype Hardware, and Production Hardware, each rated ↑ or ↓ on Accuracy, Precision, Speed, and Cost]

Accuracy and precision are traded off

Production hardware provides performance counters

However, existing interfaces make the accuracy/precision tradeoff difficult

Slide54

Sampling vs. LiMiT

[Figure: two execution timelines. Sampled program execution is interrupted every n cycles. LiMiT instrumented program execution reads counters at the start of mutex_lock, mutex_unlock, and barrier_wait.]

Slide55

Another process runs

Miles

Pushups

Situps

Slide56

Fix: Virtualization

Miles

Pushups

Situps

30 Miles!

I did pretty well today.

No you didn’t.
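The cartoon describes a counter that keeps running while another process uses the core, so the first process is credited with work it did not do. One common way to virtualize a counter is to save and restore it on every context switch. The slides do not detail the mechanism, so the sketch below is an assumption, and the Core class with its methods is hypothetical.

```python
class Core:
    """One hardware counter shared by whichever process is currently running."""

    def __init__(self):
        self.counter = 0
        self.saved = {}       # per-process counter values kept by the "OS"
        self.running = None

    def context_switch(self, pid):
        """Save the outgoing process's count and restore the incoming one."""
        if self.running is not None:
            self.saved[self.running] = self.counter
        self.counter = self.saved.get(pid, 0)
        self.running = pid

    def run(self, events):
        """The running process generates events that the hardware counts."""
        self.counter += events


core = Core()
core.context_switch("A")
core.run(5)                # process A runs 5 "miles"
core.context_switch("B")
core.run(25)               # process B runs 25 more on the same core
core.context_switch("A")
a_sees = core.counter      # what A reads after being rescheduled
```

Without the save and restore, process A would read the combined count of both processes, which is the situation the cartoon pokes fun at.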

Slide57

Miles

Pushups

Situps

Avoiding Communication

Miles

Pushups

Situps

Slide58

LiMiT Operation

Slide59

RDTSC

Slide60

MySQL Instrumentation Overhead

Slide61

Case Study A: LOCKING IN WEB WORKLOADS

How does web-related software use locks?

Slide62

Locking on the Web

Questions to answer

Is locking a significant concern?

How can architects help?

Are traditional benchmarks similar?

Methodology

Intercept pthread mutex calls, time w/ LiMiT

Applications

Firefox

Apache

MySQL

PARSEC

Slide63

Execution Time by Region

Slide64

Locking Statistics

Columns: Firefox, Apache, PARSEC, MySQL

Avg. Lock Held Time (cycles)

789149 1181076

Dynamic Locks per 10k Cycles

3.24, 1.12, 0.545, 3.18

Static Locks

13853

Slide65

Observations & Implications

Applications like Firefox and MySQL use locks differently from Apache and PARSEC

Many notions of synchronization based on scientific computing probably don’t apply

Locking overheads up to 8 - 13%

More efficient mechanisms may be helpful

But, 13% is upper bound on speedup

MySQL has some very long critical sections

Prime targets for micro-arch optimization

If they run faster, MySQL scales better
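The bound on locking overhead can be made concrete with Amdahl's law. The sketch assumes that the 8 - 13% figures are fractions of total execution time and that locking could be made free without changing anything else. The function name is illustrative.

```python
def max_speedup(lock_fraction):
    """Speedup if all locking overhead vanished and nothing else changed."""
    return 1.0 / (1.0 - lock_fraction)


low, high = max_speedup(0.08), max_speedup(0.13)
print(f"best case: {low:.3f}x to {high:.3f}x")
```

Under these assumptions, removing locking overhead entirely saves at most 13% of the run time, which corresponds to a speedup factor of about 1.15x.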

Slide66

Hardware Enhancements

64-bit Reads and Writes

Overflows are primary source of complexity

64-bit counters w/ full read/write eliminates it

Destructive Reads

Difference = 2 reads, store, load & subtract

Destructive read difference = 2 reads

Combined Reads

x86 counter read requires 2 instructions

Combining should reduce overhead

AMD’s Lightweight Profiling Proposal

Really good, depending on microarchitecture
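The destructive-read proposal can be illustrated with a toy counter model. This is a sketch of the idea only. The Counter class and its destructive_read method are hypothetical and do not correspond to any existing instruction.

```python
class Counter:
    def __init__(self):
        self.value = 0

    def tick(self, events):
        self.value += events

    def read(self):
        return self.value

    def destructive_read(self):
        """Return the count and reset it to zero in a single operation."""
        value, self.value = self.value, 0
        return value


def measure_plain(counter, work):
    start = counter.read()          # read 1, then store `start`
    counter.tick(work)
    return counter.read() - start   # read 2, load `start`, subtract


def measure_destructive(counter, work):
    counter.destructive_read()      # read 1 clears the counter
    counter.tick(work)
    return counter.destructive_read()  # read 2 already holds the difference


c = Counter()
c.tick(1000)                        # events from before the measured region
plain = measure_plain(c, 42)
destructive = measure_destructive(c, 42)
```

Both approaches report the same difference, but the destructive version needs no store, load, or subtract around the measured region, which shortens the instrumentation and therefore the perturbation.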