
Slide1

A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection

Michael Laurenzano¹, Joshua Peraza¹, Laura Carrington¹, Ananta Tiwari¹, William A. Ward², Roy Campbell²

¹ Performance Modeling and Characterization (PMaC) Laboratory, San Diego Supercomputer Center

² High Performance Computing Modernization Program (HPCMP), United States Department of Defense

Slide2

Memory-driven HPC

Many HPC applications are memory bound

Understanding an application requires understanding its memory behavior

Measurement? (e.g., timers or hardware counters)

Measuring at fine grain, transparently, and with reasonable overhead is HARD

How to get sufficient detail (e.g., reuse distance)?

Binary instrumentation

Obtains low-level details of address stream

Details are attached to specific structures within the application

Slide3

Convolution Methods

Convolution methods map Application Signatures onto Machine Profiles to produce a performance prediction.

Machine Profile – characteristics of the HPC target system: characterizations of the rates at which a machine can carry out fundamental operations. Measured or projected via simple benchmarks on 1-2 nodes of the system.

Application Signature – requirements of the HPC application: detailed summaries of the fundamental operations to be carried out by the application. Collected via trace tools.

PMaC Performance/Energy Models combine the two to predict the performance of the application on the target system.

Performance Model – a calculable expression of the runtime, efficiency, memory use, etc. of an HPC program on some machine.
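To make the convolution step concrete, here is a minimal first-order sketch under assumed numbers (the operation categories, counts, and rates below are purely illustrative, not the actual PMaC model): the application signature supplies per-operation counts, the machine profile supplies per-operation rates, and a rough runtime estimate is their ratio, summed.

/* Minimal sketch of the convolution idea; all categories and numbers
   are illustrative assumptions, not the actual PMaC model. */
#include <stdio.h>

#define NUM_OP_TYPES 3  /* e.g., L1 hits, L2 hits, main-memory accesses */

int main(void)
{
    /* application signature: how many of each operation the app performs */
    double counts[NUM_OP_TYPES] = { 8.0e9, 1.5e9, 2.0e8 };
    /* machine profile: how many of each operation the machine does per second */
    double rates[NUM_OP_TYPES]  = { 4.0e9, 1.0e9, 1.0e8 };

    double predicted_seconds = 0.0;
    for (int i = 0; i < NUM_OP_TYPES; i++)
        predicted_seconds += counts[i] / rates[i];

    printf("predicted runtime: %.2f s\n", predicted_seconds);
    return 0;
}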

Slide4

Runtime Overhead is a Big Deal

Real HPC applications

Relatively long runtimes: minutes, hours, days?

Lots of CPUs: O(10^7) in the largest supercomputers

High slowdowns create problems

Too long for queue

Unsympathetic administrators/managers

Inconvenience

Unnecessarily uses resources

PEBIL = PMaC's Efficient Binary Instrumentation for x86/Linux

Slide5

What’s New in PEBIL?

It can instrument multithreaded code

Developers use OpenMP and pthreads!

x86_64 only

Provide access to thread-local instrumentation data at runtime

Supports turning instrumentation on/off

Very lightweight operation

Swap nops with inserted instrumentation code at runtime

Overhead close to zero when all instrumentation is removed
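A rough idea of how lightweight such an on/off switch can be: conceptually, turning instrumentation off overwrites the inserted call site with nops, and turning it back on rewrites the call. The sketch below illustrates that general mechanism on x86_64/Linux; the function names, the 5-byte call-site width, and the two-page mprotect are my assumptions, and this is not PEBIL's actual implementation.

/* Illustrative only: toggle a 5-byte patch site between a nop sled (off)
   and a near call to an instrumentation snippet (on). Assumes the snippet
   lies within +/-2 GB of the patch site. */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void toggle_instrumentation(uint8_t *site, void *snippet, int enable)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    uint8_t *page = (uint8_t *)((uintptr_t)site & ~(uintptr_t)(pagesz - 1));

    /* make the code page(s) writable; a real tool would restore W^X afterwards */
    mprotect(page, (size_t)pagesz * 2, PROT_READ | PROT_WRITE | PROT_EXEC);

    if (enable) {
        int32_t rel = (int32_t)((uint8_t *)snippet - (site + 5));
        site[0] = 0xE8;                 /* call rel32 to the instrumentation snippet */
        memcpy(site + 1, &rel, sizeof rel);
    } else {
        memset(site, 0x90, 5);          /* back to nops: near-zero overhead */
    }

    mprotect(page, (size_t)pagesz * 2, PROT_READ | PROT_EXEC);
}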

Slide6

Binary Instrumentation in HPC

Tuning and Analysis Utilities (TAU) – Dyninst and PEBIL

HPCToolkit – Dyninst

Open|SpeedShop – Dyninst

Intel Parallel Studio – Pin

Memcheck memory bug detector – Valgrind

valgrind --leak-check=yes ...

Understanding performance and energy

Many research projects (not just HPC)

BI used in 3000+ papers in the last 15 years

Slide7

Binary Instrumentation Basics

Memory Address Tracing

Original:

0000c000 <foo>:
  c000: 48 89 7d f8   mov  %rdi,-0x8(%rbp)
  c004: 5e            pop  %rsi
  c005: 75 f8         jne  0xc004
  c007: c9            leaveq
  c008: c3            retq

Instrumented:

0000c000 <foo>:
  c000: // compute -0x8(%rbp) and copy it to a buffer
  c008: 48 89 7d f8   mov  %rdi,-0x8(%rbp)
  c00c: // compute (%rsp) and copy it to a buffer
  c014: 5e            pop  %rsi
  c015: 75 f8         jne  0xc00c
  c017: c9            leaveq
  c018: c3            retq
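At the C level, the "compute ... and copy it to a buffer" snippets amount to roughly the following; the buffer size, names, and flush hook are assumptions, and PEBIL emits the equivalent inline as machine code rather than calling a function.

#include <stdint.h>
#include <stddef.h>

#define BUFFER_ENTRIES 65536                 /* assumed buffer size */

static uintptr_t addr_buffer[BUFFER_ENTRIES];
static size_t    addr_count;

/* In the overhead experiments the addresses are simply discarded;
   an analysis tool would consume them here instead. */
static void flush_buffer(const uintptr_t *buf, size_t n)
{
    (void)buf;
    (void)n;
}

/* Conceptually inserted before each instrumented memory instruction,
   receiving the effective address it is about to touch,
   e.g. -0x8(%rbp) or (%rsp) in the listing above. */
static inline void record_address(uintptr_t effective_address)
{
    addr_buffer[addr_count++] = effective_address;
    if (addr_count == BUFFER_ENTRIES) {      /* buffer full: flush and reuse */
        flush_buffer(addr_buffer, addr_count);
        addr_count = 0;
    }
}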

Slide8

Enter Multithreaded Apps

All threads use a single buffer?

Don’t need to know which thread is executing

A buffer for each thread?

Faster. No concurrency operations needed

More interesting. Per-thread behavior != average thread behavior

PEBIL uses the latter

Fast method for computing location of thread-local data

Cache that location in a register if possible

0000c000 <foo>:
  c000: // compute -0x8(%rbp) and copy it to a buffer
  c008: 48 89 7d f8   mov  %rdi,-0x8(%rbp)
  c00c: // compute (%rsp) and copy it to a buffer
  c014: 5e            pop  %rsi
  c015: 75 f8         jne  0xc00c
  c017: c9            leaveq
  c018: c3            retq
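The trade-off between the two buffering schemes can be sketched like this (a rough illustration, not PEBIL code; sizes and names are assumed): a single shared buffer needs an atomic index on every record, while per-thread buffers need no concurrency operations but require a fast way to find the current thread's buffer.

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define ENTRIES 65536                          /* assumed buffer size */

/* Option 1: one buffer shared by all threads; every record pays for an
   atomic increment, and per-thread ordering is lost. */
static uintptr_t     shared_buf[ENTRIES];
static atomic_size_t shared_idx;

static inline void record_shared(uintptr_t addr)
{
    size_t i = atomic_fetch_add(&shared_idx, 1) % ENTRIES;
    shared_buf[i] = addr;
}

/* Option 2 (the per-thread scheme PEBIL uses): no concurrency operations
   and per-thread behavior is preserved; the cost moves to locating the
   current thread's buffer quickly. */
struct thread_buf { uintptr_t entries[ENTRIES]; size_t idx; };

static inline void record_per_thread(struct thread_buf *tb, uintptr_t addr)
{
    tb->entries[tb->idx++ % ENTRIES] = addr;
}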

Slide9

Thread-local Instrumentation Data in PEBIL

Provide a large table to each process (2M)

Each entry is a small pool of memory (32 bytes)

Must be VERY fast

Get thread id (1 instruction)

Simple hash of thread id (2 instructions)

Index table with hashed id (1 instruction)

Assume no collisions (so far so good)

(Diagram: a hash function maps each thread's id, e.g. thread 1 through thread 4, to its own entry in the table of thread-local memory pools, such as thread 4's memory pool.)
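In C terms the lookup looks roughly like the following. The table size, the hash, and the gettid call are assumptions for illustration only; PEBIL inlines the equivalent in about four instructions, and getting the thread id does not actually require a system call per access.

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define TABLE_ENTRIES 65536            /* assumed entry count for the large table */
#define POOL_BYTES    32               /* small per-thread memory pool */

static uint8_t pool_table[TABLE_ENTRIES][POOL_BYTES];

static inline uint8_t *thread_local_pool(void)
{
    /* 1. get the thread id; shown here as a syscall for clarity, while the
          real lookup uses a single instruction */
    uint64_t tid = (uint64_t)syscall(SYS_gettid);

    /* 2. simple hash of the thread id (two instructions' worth of work) */
    uint64_t idx = (tid ^ (tid >> 16)) & (TABLE_ENTRIES - 1);

    /* 3. index the table with the hashed id; collisions are assumed not to occur */
    return pool_table[idx];
}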

Slide10

Caching Thread-local Data

Cache the address of thread-local data

Dead registers are known at instrumentation time

Is there 1 register in a function which is dead everywhere?

Compute thread-local data address only at function [re]entry

Should use smaller scopes! (loops, blocks)

Significant overhead reductions
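A rough C analogy for the register-caching idea (not actual PEBIL output; the helper names come from the earlier sketches and are assumptions): hoist the lookup out of every instrumentation point and run it once per function [re]entry, keeping the result in what would be a dead register.

#include <stdint.h>

extern uint8_t *thread_local_pool(void);                    /* lookup from the earlier sketch */
extern void     record_into(uint8_t *pool, uintptr_t addr); /* assumed analysis hook */

void instrumented_function(uintptr_t a, uintptr_t b)
{
    /* computed once at function entry; in PEBIL this value lives in a
       register known to be dead everywhere in the function */
    uint8_t *pool = thread_local_pool();

    record_into(pool, a);   /* instrumentation point 1 reuses the cached value */
    record_into(pool, b);   /* instrumentation point 2 reuses the cached value */
}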

Slide11

Other x86/Linux Binary Instrumentation

Pin [1] – Dynamic. Thread-local data access: a register is stolen from the program, which is JIT-compiled around the lost register. Threading overhead: very low. Runtime overhead: medium.

Dyninst [2] – Either. Thread-local data access: thread ID computed (layered function call) at every instrumentation point. Threading overhead: high. Runtime overhead: varies.

PEBIL [3] – Static. Thread-local data access: table + fast hash function (4 instructions), result cached in dead registers. Threading overhead: low. Runtime overhead: low.

[1] Luk, C., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.

[2] Buck, B. and Hollingsworth, J. An API for Runtime Code Patching. International Journal of High Performance Computing Applications, 2000.

[3] Laurenzano, M., Tikir, M., Carrington, L., and Snavely, A. PEBIL: Efficient Static Binary Instrumentation for Linux. International Symposium on the Performance Analysis of Systems and Software, 2010.

Slide12

Runtime Overhead Experiments

Basic block counting

Classic test in binary instrumentation literature

Increment a counter each time a basic block is executed

Per-block, per-process, per-thread counters

Memory address tracing

Fill a process/thread-local buffer with memory addresses, then discard those addresses

Interval-based sampling

Take the first 10% of each billion memory accesses

Toggle instrumentation on/off when moving between sampling/non-sampling
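For the basic block counting experiment, the per-thread counting inserted at each block amounts to roughly this (the counter layout and names are assumptions):

#include <stdint.h>

#define NUM_BASIC_BLOCKS 4096        /* assumed; the real count is known at instrumentation time */

/* one counter array per process or per thread, e.g. carved out of the
   thread-local memory pool */
struct bb_counters {
    uint64_t count[NUM_BASIC_BLOCKS];
};

/* conceptually inserted at the entry of basic block 'bb_id' */
static inline void count_block(struct bb_counters *c, uint32_t bb_id)
{
    c->count[bb_id]++;               /* one increment per executed block */
}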

Slide13

Methodology

2 quad-core Xeon X3450, 2.67 GHz

32K L1 and 256K L2 cache per core, 8M L3 per processor

NAS Parallel Benchmarks

2 sets: OpenMP and MPI, built with gcc/GOMP and gcc/mpich

8 threads/processes: CG, DC (OpenMP only), EP, FT, IS, LU, MG

4 threads/processes: BT, SP

Dyninst 7.0 (dynamic)

Timing started when the instrumented app begins running

Pin 2.12

PEBIL 2.0

Slide14

Basic Block Counting (MPI)

All results are the average of 3 runs

Slowdown relative to un-instrumented run

1 == no slowdown

Slide15

Basic Block Counting (OpenMP)

Y-axis = log-scale slowdown factor

Dyninst: thread ID lookup at every basic block

Slide16

Threading Support Overhead (BB Counting)

Slide17

Memory Tracing (MPI)

Slowdown relative to un-instrumented application

Tool      BT     CG     EP    FT    IS    LU     MG     SP     MEAN
PEBIL     14.93  4.18   2.17  3.53  2.85  6.08   4.77   4.13   5.33
Pin       5.89   4.43   2.91  3.89  3.54  4.67   4.85   2.48   4.08
Dyninst   22.44  13.76  5.64  9.53  7.25  12.90  15.31  10.19  12.12

Slide18

Memory Tracing (OpenMP)

Instrumentation code inserted at every memory instruction

Dyninst computes the thread ID at every memop

Pin runtime-optimizes instrumented code

Lots of opportunity to optimize

Tool      BT      CG      DC      EP      FT      IS      LU       MG       SP     MEAN
PEBIL     16.64   6.05    2.00    2.55    6.21    5.94    10.55    10.18    9.32   7.71
Pin       6.19    4.42    3.01    3.01    3.85    5.56    5.89     5.26     4.77   4.66
Dyninst   ???     862.86  530.55  448.89  921.89  752.18  1759.24  1555.76  ???    975.90*

* 30s → 7h45m

Slide19

Interval-based Sampling

Extract useful information from a subset of the memory address stream

Simple approach: the first 10% of every billion addresses

In practice we use a window 100x smaller

Obvious: avoid processing addresses (e.g., just collect and throw away)

Not so obvious: avoid collecting addresses

Instrumentation tools can disable/re-enable instrumentation

PEBIL: binary on/off. Very lightweight, but limited

Pin and Dyninst: arbitrary removal/re-instrumentation. Heavyweight, but versatile

Sampling only requires on/off functionality
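A minimal sketch of the sampling policy described on this slide. The on/off hooks and the place this check runs are assumptions; in PEBIL, turning instrumentation off swaps the inserted code back to nops.

#include <stdbool.h>
#include <stdint.h>

#define INTERVAL_SIZE 1000000000ULL            /* one billion memory accesses per interval */
#define SAMPLE_SIZE   (INTERVAL_SIZE / 10)     /* trace the first 10% of each interval */

/* assumed hooks standing in for the tool's enable/disable mechanism */
extern void instrumentation_on(void);
extern void instrumentation_off(void);

static inline bool in_sampling_window(uint64_t accesses_so_far)
{
    return (accesses_so_far % INTERVAL_SIZE) < SAMPLE_SIZE;
}

/* Called periodically (e.g. whenever a trace buffer is flushed) to decide
   whether tracing should stay on for the current interval. How the access
   count is maintained while tracing is off is left out of this sketch. */
void update_sampling_state(uint64_t accesses_so_far, bool currently_on)
{
    bool want_on = in_sampling_window(accesses_so_far);

    if (want_on && !currently_on)
        instrumentation_on();
    else if (!want_on && currently_on)
        instrumentation_off();
}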

Slide20

Sampled Memory Tracing (MPI)

PEBIL always improves, and significantly

Pin usually, but not always, improves

The amount and complexity of code re-instrumented during each interval probably drives this

Dyninst never improves

Tool                  BT     CG     EP    FT    IS    LU     MG     SP     MEAN
PEBIL Full            14.93  4.18   2.17  3.53  2.85  6.08   4.77   4.13   5.33
PEBIL 10%             4.36   2.08   1.48  1.77  1.70  2.97   2.20   1.73   2.28
Pin Full              5.89   4.43   2.91  3.89  3.54  4.67   4.85   2.48   4.08
Pin 10%               5.42   5.01   2.61  2.74  3.19  5.05   4.51   2.98   3.93
Pin Best              5.42   4.43   2.61  2.74  3.19  4.67   4.51   2.48   3.75
Dyninst Full = Best   22.44  13.76  5.64  9.53  7.25  12.90  15.31  10.19  12.12

Slide21

Sampled Memory Tracing (OpenMP)

Tool                   BT     CG      DC      EP      FT      IS      LU       MG       SP     MEAN
PEBIL Full             16.64  6.05    2.00    2.55    6.21    5.94    10.55    10.18    9.32   7.71
PEBIL 10%              4.59   2.75    1.78    1.59    2.43    2.61    3.29     3.36     3.29   2.85
Pin Full               6.19   4.42    3.01    3.01    3.85    5.56    5.89     5.26     4.77   4.66
Pin 10%                6.04   4.48    3.59    2.80    3.05    3.88    10.26    6.47     8.96   5.50
Pin Best               6.04   4.42    3.01    2.80    3.05    3.88    5.89     5.26     4.77   4.35
Dyninst Full = Best*   ???    862.86  530.55  448.89  921.89  752.18  1759.24  1555.76  ???    975.90

Slide22

Conclusions

New PEBIL features

Instrument multithreaded binaries

Turn instrumentation on/off

Fast access to per-thread memory pool to support per-thread data collection

Reasonable overheads

Cache memory pool location

Currently done at function level

Future work: smaller scopes

PEBIL is useful for practical memory address stream collection

Message passing or threaded

Slide23

https://github.com/mlaurenzano/PEBIL

michaell@sdsc.edu

Questions?

