Presentation Transcript

Slide1

Reference-Driven Performance Anomaly Identification

Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li

6/16/2009

SIGMETRICS 2009


University of Rochester

Slide2

Performance Anomalies

Complex software systems (like operating systems and distributed systems):

Many system features and configuration settings

Wide-ranging workload behaviors and concurrency

Their interactions

Performance anomalies:

Low performance against expectation

Due to implementation errors, mis-configurations, or mis-managed interactions, …

Anomalies degrade system performance and make system behaviors undependable.

Slide3

An Example Identified by Our Research

Linux anticipatory I/O scheduler

HZ is the number of timer ticks per second, so (HZ / 150) ticks is around 6.7 ms. However, integer division is inaccurate:

In earlier Linux versions HZ defaults to 1000, so the anticipation timeout is 6 ticks.

In Linux 2.6.23 it defaults to 250, so the timeout becomes one tick. Premature timeouts lead to additional disk seeks.

/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
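The effect of the truncating integer division can be reproduced outside the kernel; a minimal sketch (Python standing in for the C preprocessor arithmetic, `//` mirroring C's integer division):

```python
def default_antic_expire(hz):
    """Mirror of the kernel macro ((HZ / 150) ? HZ / 150 : 1),
    using C-style truncating integer division."""
    ticks = hz // 150
    return ticks if ticks else 1

def timeout_ms(hz):
    # One timer tick lasts 1000/HZ milliseconds.
    return default_antic_expire(hz) * 1000 / hz

print(default_antic_expire(1000), timeout_ms(1000))  # 6 ticks -> 6.0 ms
print(default_antic_expire(250), timeout_ms(250))    # 1 tick  -> 4.0 ms
```

With HZ = 1000 the timeout is close to the intended 6.7 ms, but with HZ = 250 the division truncates to a single tick, well short of the intended anticipation window.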

Slide4

Challenges and Goals

Challenges:

Often involve the semantics of multiple system components

No obvious failure symptoms; normal performance isn’t always known or even clearly defined

Performance anomaly identifications are relatively rare: 4% of resolved Linux 2.4/2.6 I/O bugs are performance-oriented

Goals:

Systematic techniques to identify performance anomalies; improve performance dependability

Consider wide-ranging configurations and workload conditions

Slide5

Reference-driven Anomaly Identification

Given two executions T (target) and R (reference): if T performs much worse than R against expectation, we identify T as anomalous to R.

Examples:

How to systematically derive the expectations?

Slide6

Change Profiles

Goal – derive expected performance deviations between reference and target (or with a change of system parameters)

Approach – inference from real system measurements

Change profile – probabilistic distribution of performance deviations

p-value(0.5) = 0.039
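A change profile can be read as an empirical tail distribution: the p-value of an observed target-vs-reference deviation is the fraction of the profile's mass at or beyond it. A minimal sketch, with made-up sample deviations for illustration:

```python
def p_value(profile, observed_deviation):
    """Fraction of profiled deviations at least as extreme as the
    observed target-vs-reference performance deviation."""
    extreme = sum(1 for d in profile if d >= observed_deviation)
    return extreme / len(profile)

# Hypothetical change profile: measured performance deviations
# (relative slowdowns) between execution pairs.
profile = [0.0, 0.05, 0.1, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.9]
print(p_value(profile, 0.5))  # -> 0.2
```

A small p-value means a deviation that large is rarely explained by the expected effects in the profile, flagging the target as anomalous.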

Slide7

Scalable Anomaly Quantification

Approach:

Construct single-parameter profiles through real system measurements

Analytically synthesize multiple single-parameter profiles for scalability

Convolution-like synthesis:

Assumes independent performance effects of different parameters

Assemble the multi-parameter performance deviation distribution using convolutions of single-parameter change profiles

Generally applicable bounding analysis:

Bound the multi-parameter p-value anomaly from single-parameter p-values (no need for parameter independence)

Find a tight bound (small p-value) through a Monte Carlo method
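Under the independence assumption, synthesizing two single-parameter profiles amounts to convolving their deviation distributions. A minimal discrete sketch (the bin values, probabilities, and the additive-deviation model are illustrative assumptions):

```python
from collections import defaultdict

def convolve_profiles(p1, p2):
    """Combine two single-parameter change profiles, each a dict mapping
    a (binned) performance deviation to its probability, assuming the two
    parameters' effects are independent and additive."""
    out = defaultdict(float)
    for d1, pr1 in p1.items():
        for d2, pr2 in p2.items():
            out[round(d1 + d2, 6)] += pr1 * pr2
    return dict(out)

# Hypothetical single-parameter profiles: deviation -> probability.
cpu_profile  = {0.0: 0.7, 0.2: 0.3}
disk_profile = {0.0: 0.5, 0.4: 0.5}
combined = convolve_profiles(cpu_profile, disk_profile)
print(combined)
```

The combined distribution then plays the role of the multi-parameter change profile when quantifying how anomalous a target execution is.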

Slide8

Evaluation

Linux I/O case study:

Five workload parameters and three system configuration parameters

Performance measurements at 300 sampled executions; the executions use each other as references to identify anomalies

Anomalies are target executions with p-values of 0.05 or less

Validate through cause analysis; an identification without a validated cause is a probable false positive

Results:

Linux 2.6.10 – 35 identified; 34 validated; 1 probable false positive

Linux 2.6.23 – 12 identified; 9 validated; 3 probable false positives

Linux 2.6.23 (target) vs. 2.6.10 (reference) – 15 identified; all validated

Slide9

Comparison

Bounding analysis for multi-parameter anomaly quantification

Convolution synthesis assuming parameter independence

Ranking target-reference anomalies by raw performance difference

Convolution identifies more anomalies, but with more false positives

Slide10

Anomaly Cause Analysis

Given the symptom (anomalous performance degradation from reference to target), root cause analysis is still challenging:

The root cause sometimes lies in complex component interactions

The most useful hints often relate to low-level system activities

Efficient mechanisms exist to acquire large amounts of system metrics (some anomaly-related), but they are difficult to sift through

Approach: reference-driven filtering of anomaly-related metrics

Compare metric manifestations of an anomalous target and its normal reference

Those that differ significantly may be anomaly-related
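The filtering step can be sketched as ranking metrics by how much their manifestations differ between target and reference. Here a normalized mean difference stands in for the comparison statistic, and the metric names and samples are made up:

```python
def rank_metrics(target, reference):
    """Rank system metrics by the normalized difference of their mean
    values in the anomalous target vs. the normal reference.
    target/reference: dict of metric name -> list of samples."""
    scores = {}
    for name in target:
        t = sum(target[name]) / len(target[name])
        r = sum(reference[name]) / len(reference[name])
        denom = max(abs(t), abs(r), 1e-12)
        scores[name] = abs(t - r) / denom
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical traces: anticipation timeouts differ sharply between
# target and reference; context switches barely at all.
target    = {"antic_timeouts": [40, 42, 39], "ctx_switches": [510, 495]}
reference = {"antic_timeouts": [2, 1, 3],    "ctx_switches": [500, 505]}
print(rank_metrics(target, reference))
```

Metrics at the top of the ranking are the candidates handed to the human for cause analysis.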

Slide11

System Events and Metrics

Traced events:

Process management – creation of a kernel thread; process fork or clone; process exit; process wait; process signal; wake up a process; CPU context switch

System call – enter a system call; exit a system call

Memory system – allocating pages; freeing pages; swapping pages in; swapping pages out

File system – file exec; file open; file close; file read; file write; file seek; file ioctl; file prefetch operation; start waiting for a data buffer; end of waiting for a data buffer

I/O scheduling – I/O request arrival at the block level; re-queue an I/O request; dispatch an I/O request; remove an I/O request; I/O request completion

SCSI device – SCSI read request; SCSI write request

Interrupt – enter an interrupt handler; exit an interrupt handler

Network socket – socket call; socket send; socket receive; socket creation

Up to 1361 metrics in Linux 2.6.23

Derived system metrics:

Inter-arrival time of each type of event

Delays between causal events:

delay between a system call enter and exit

delay between file system buffer wait start and end

delay between a block-level I/O request arrival and its dispatch

delay between a block-level I/O request dispatch and its completion

Parameters of events:

file prefetch size

SCSI I/O request size

file offset of each I/O operation to a block device

I/O concurrency:

system call level

block level

SCSI device level
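Computing a derived metric such as the delay between causal events can be sketched as pairing each request's start event with its end event over a time-ordered trace (the trace format and event names here are made up for illustration):

```python
def causal_delays(events, start_type, end_type):
    """Given a time-ordered trace of (timestamp, event_type, request_id)
    tuples, return the delay between each request's start event
    (e.g., block-level I/O arrival) and its end event (e.g., dispatch)."""
    starts, delays = {}, []
    for ts, etype, rid in events:
        if etype == start_type:
            starts[rid] = ts
        elif etype == end_type and rid in starts:
            delays.append(ts - starts.pop(rid))
    return delays

# Hypothetical block-level trace (timestamps in ms).
trace = [
    (0.0, "io_arrival", 1),
    (1.0, "io_arrival", 2),
    (2.5, "io_dispatch", 1),
    (7.0, "io_dispatch", 2),
]
print(causal_delays(trace, "io_arrival", "io_dispatch"))  # [2.5, 6.0]
```

The same pairing works for system call enter/exit and buffer wait start/end, yielding per-request delay samples whose distributions can then be compared between target and reference.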

Slide12

A Case Result

Anomaly cause: incorrect timeout setting when timer ticks per second (HZ) changes from 1000 to 250 in Linux 2.6.23

Top-ranked metrics – anticipatory I/O timeouts and anticipation breaks

#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)

Slide13

Effects of Anomaly Corrections

Anomaly corrections lead to predictable performance behavior patterns

Slide14

Related Work

Peer differencing for debugging:

Delta debugging [Zeller’02]: differencing program runs of various inputs

PeerPressure [Wang et al.’04]: differencing Windows registry settings

Triage [Tucek et al.’07]: differencing basic block execution frequency

These target program/system failures; the failure symptoms are easily identifiable, and correct peers are presumably known

Our performance anomaly identification:

Challenge: both anomalous and normal performance behaviors are hard to identify in complex systems

Key contribution: scalable construction of performance deviation profiles

Slide15

Summary

Principled use of references in performance anomaly identification:

Scalable construction of performance deviation profiles to identify anomaly symptoms

Target-reference differencing of system metric manifestations to help identify anomaly causes

Identified real performance problems in Linux and a J2EE-based distributed system