Slide1
Reference-Driven Performance Anomaly Identification
Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li
6/16/2009
SIGMETRICS 2009
University of Rochester
Slide2
Performance Anomalies
Complex software systems (like operating systems and distributed systems):
Many system features and configuration settings
Wide-ranging workload behaviors and concurrency
Their interactions
Performance anomalies:
Low performance against expectation
Due to implementation errors, mis-configurations, or mis-managed interactions, …
Anomalies degrade system performance and make system behaviors undependable
Slide3
An Example Identified by Our Research
Linux anticipatory I/O scheduler
HZ is the number of timer ticks per second, so (HZ/150) ticks is around 6.7 ms.
However, the integer division is inexact:
HZ defaults to 1000 in earlier Linux versions, so the anticipation timeout is 6 ticks (6 ms).
HZ defaults to 250 in Linux 2.6.23, so the timeout becomes one tick (4 ms). Premature timeouts lead to additional disk seeks.
/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
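The truncation can be checked with a few lines of arithmetic. The sketch below mirrors the macro in Python, where `//` behaves like C integer division for positive values. The function names are made up for illustration, and HZ = 100 is an extra case not discussed on the slide, added only to show the macro's fallback branch.

```python
def antic_expire_ticks(hz):
    """Mirror of ((HZ / 150) ? HZ / 150 : 1) with integer division."""
    ticks = hz // 150          # truncates, like C integer division
    return ticks if ticks else 1


def ticks_to_ms(ticks, hz):
    """Convert a tick count to milliseconds for a given tick rate."""
    return ticks * 1000 / hz


# 1000 and 250 come from the slide; 100 is an illustrative extra case.
for hz in (1000, 250, 100):
    ticks = antic_expire_ticks(hz)
    print(f"HZ={hz}: {ticks} tick(s) = {ticks_to_ms(ticks, hz)} ms")
```

The timeout in milliseconds depends on HZ even though the comment in the source promises roughly 6 ms.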
Slide4
Challenges and Goals
Challenges:
Often involving semantics of multiple system components
No obvious failure symptoms; normal performance isn’t always known or even clearly defined
Performance anomaly identifications are relatively rare: 4% of resolved Linux 2.4/2.6 I/O bugs are performance-oriented
Goals:
Systematic techniques to identify performance anomalies; improve performance dependability
Consider wide-ranging configurations and workload conditions
Slide5
Reference-driven Anomaly Identification
Given two executions T (target) and R (reference):
If T performs much worse than R against expectation, we identify T as anomalous to R.
Examples:
How to systematically derive the expectations?
Slide6
Change Profiles
Goal – derive expected performance deviations between reference and target (or with a change of system parameters)
Approach – inference from real system measurements
Change profile – probabilistic distribution of performance deviations
p-value(-0.5) = 0.039
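A minimal sketch of how a p-value could be read off an empirical change profile. The slide does not define the deviation measure, so the relative change `(target - reference) / reference` used here is an assumption, as are the function names and the sample values.

```python
def relative_deviation(target_perf, reference_perf):
    """Assumed deviation measure: relative performance change."""
    return (target_perf - reference_perf) / reference_perf


def p_value(profile, observed):
    """Fraction of profile samples at least as bad (as low) as `observed`."""
    worse_or_equal = sum(1 for dev in profile if dev <= observed)
    return worse_or_equal / len(profile)


# Made-up deviations measured for one kind of parameter change.
profile = [-0.6, -0.2, -0.1, -0.05, 0.0, 0.0, 0.05, 0.1, 0.2, 0.3]

observed = relative_deviation(target_perf=50.0, reference_perf=100.0)
print(observed, p_value(profile, observed))
```

A small p-value means the observed drop is rarely seen for this kind of change, which is what marks the target as anomalous relative to its reference.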
Slide7
Scalable Anomaly Quantification
Approach:
Construct single-parameter profiles through real system measurements
Analytically synthesize multiple single-parameter profiles for scalability
Convolution-like synthesis
Assuming independent performance effects of different parameters
Assemble the multi-parameter performance deviation distribution using convolutions of single-parameter change profiles
Generally applicable bounding analysis
Bound the multi-parameter anomaly p-value from single-parameter p-values (no need for parameter independence)
Find a tight bound (small p-value) through a Monte Carlo method
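Both synthesis routes can be sketched on toy data. The slide gives no formulas, so everything here is an assumption rather than the authors' method: deviations are taken to add across parameters, the profiles are made-up discrete distributions (deviation in percent mapped to probability), and the bound is a plain union bound, P(A + B ≤ a + b) ≤ P(A ≤ a) + P(B ≤ b), which holds without independence. The random search over splits stands in for the Monte Carlo step.

```python
import random


def p_value(profile, observed):
    """P(deviation <= observed) under a discrete change profile."""
    return sum(prob for dev, prob in profile.items() if dev <= observed)


def convolve(profile_a, profile_b):
    """Distribution of the summed deviation, assuming independence."""
    out = {}
    for dev_a, prob_a in profile_a.items():
        for dev_b, prob_b in profile_b.items():
            key = dev_a + dev_b
            out[key] = out.get(key, 0.0) + prob_a * prob_b
    return out


def union_bound(profile_a, profile_b, split_a, split_b):
    """Upper bound on P(A + B <= split_a + split_b), no independence needed."""
    return min(1.0, p_value(profile_a, split_a) + p_value(profile_b, split_b))


def monte_carlo_bound(profile_a, profile_b, observed, trials=2000, seed=0):
    """Try random splits of `observed` (assumed <= 0); keep the smallest bound."""
    rng = random.Random(seed)
    best = 1.0
    for _ in range(trials):
        split_a = rng.randint(observed, 0)
        bound = union_bound(profile_a, profile_b, split_a, observed - split_a)
        best = min(best, bound)
    return best


# Made-up single-parameter profiles: deviation (percent) -> probability.
profile_a = {-50: 0.1, 0: 0.9}
profile_b = {-50: 0.2, 0: 0.8}

combined = convolve(profile_a, profile_b)
print(p_value(combined, -100))                         # convolution estimate
print(monte_carlo_bound(profile_a, profile_b, -100))   # independence-free bound
```

The convolution estimate is never larger than the bound in this sketch, which would fit convolution flagging more anomalies; the bound stays valid when parameter effects are correlated.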
Slide8
Evaluation
Linux I/O case study:
Five workload parameters and three system configuration parameters
Performance measurements at 300 sampled executions; use each other as references to identify anomalies
Anomalies are target executions with p-values of 0.05 or less
Validate through cause analysis; an identified anomaly without a validated cause is a probable false positive
Results
Linux 2.6.10 – 35 identified; 34 validated; 1 probable false positive
Linux 2.6.23 – 12 identified; 9 validated; 3 probable false positives
Linux 2.6.23 (target) vs. 2.6.10 (reference) – 15 identified; all validated
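The validated fraction for each row follows directly from the counts above. The helper name is made up for illustration; the counts are the ones on the slide.

```python
def validated_fraction(identified, validated):
    """Share of identified anomalies whose cause was validated."""
    return validated / identified


# (identified, validated) counts taken from the results above.
results = {
    "Linux 2.6.10": (35, 34),
    "Linux 2.6.23": (12, 9),
    "Linux 2.6.23 vs. 2.6.10": (15, 15),
}

for name, (identified, validated) in results.items():
    print(f"{name}: {validated_fraction(identified, validated):.1%} validated")
```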
Slide9
Comparison
Bounding analysis for multi-parameter anomaly quantification
Convolution synthesis assuming parameter independence
Rank target-reference anomaly using raw performance difference
Convolution identifies more anomalies, but with more false positives
Slide10
Anomaly Cause Analysis
Given symptom (anomalous performance degradation from reference to target), root cause analysis is still challenging
Root cause sometimes lies in complex component interactions
Most useful hints often relate to low-level system activities
Efficient mechanisms are available to acquire large amounts of system metrics (some anomaly-related), but they are difficult to sift through
Approach: reference-driven filtering of anomaly-related metrics
Compare metric manifestations of an anomalous target and its normal reference
Those that differ significantly may be anomaly-related
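The filtering step can be sketched as a ranking of metrics by how much they differ between the two executions. The slide does not name the difference measure, so the normalized difference of means below is an assumption, and the metric names and sample values are made up.

```python
from statistics import mean


def rank_metrics(target, reference):
    """Rank metrics shared by both runs by normalized difference of means."""
    scores = {}
    for name in target.keys() & reference.keys():
        t_mean, r_mean = mean(target[name]), mean(reference[name])
        scale = max(abs(t_mean), abs(r_mean))
        scores[name] = 0.0 if scale == 0 else abs(t_mean - r_mean) / scale
    # Largest difference first: these metrics are the likeliest hints.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical metric samples from an anomalous target and its reference.
target = {"antic_timeouts_per_sec": [40, 44, 42], "syscall_delay_us": [10, 11, 9]}
reference = {"antic_timeouts_per_sec": [4, 5, 6], "syscall_delay_us": [10, 10, 10]}

for name, score in rank_metrics(target, reference):
    print(f"{name}: {score:.3f}")
```

Comparing means discards the shape of each distribution; a distribution-level distance would be a natural replacement if the samples are available.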
Slide11
System Events and Metrics
Traced Events:
Process management
creation of a kernel thread; process fork or clone; process exit; process wait; process signal; wake up a process; CPU context switch
System call
enter a system call; exit a system call
Memory system
allocating pages; freeing pages; swapping pages in; swapping pages out
File system
file exec; file open; file close; file read; file write; file seek; file ioctl; file prefetch operation; starting to wait for a data buffer; end to wait for a data buffer
IO scheduling
I/O request arrival at the block level; re-queue an I/O request; dispatch an I/O request; remove an I/O request; I/O request completion
SCSI device
SCSI read request; SCSI write request
Interrupt
enter an interrupt handler; exit an interrupt handler
Network socket
socket call; socket send; socket receive; socket creation
Derived System Metrics (up to 1361 metrics in Linux 2.6.23):
Inter-arrival time of each type of events
Delays between causal events
delay between a system call enter and exit
delay between file system buffer wait start and end
delay between a block-level I/O request arrival and its dispatch
delay between a block-level I/O request dispatch and its completion
Parameter of events
file prefetch size
SCSI I/O request size
file offset of each I/O operation to block device
I/O concurrency
system call level
block level
SCSI device level
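A minimal sketch of how two of the derived metric families could be computed from a trace. The event tuple layout `(timestamp_us, event_type, request_id)`, the event type names, and the sample trace are all hypothetical; the slide does not describe the trace format.

```python
def inter_arrival_times(events, event_type):
    """Gaps between consecutive events of one type."""
    times = sorted(ts for ts, kind, _ in events if kind == event_type)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def causal_delays(events, start_type, end_type):
    """Delay from each start event to the matching end event (same id)."""
    pending = {}
    delays = []
    for ts, kind, request_id in sorted(events):
        if kind == start_type:
            pending[request_id] = ts
        elif kind == end_type and request_id in pending:
            delays.append(ts - pending.pop(request_id))
    return delays


# Hypothetical block-level trace: (timestamp_us, event_type, request_id).
trace = [
    (0, "io_dispatch", 1),
    (100, "io_dispatch", 2),
    (250, "io_complete", 1),
    (400, "io_complete", 2),
    (450, "io_dispatch", 3),
]

print(inter_arrival_times(trace, "io_dispatch"))
print(causal_delays(trace, "io_dispatch", "io_complete"))
```

Request 3 has no completion in the trace, so it contributes an inter-arrival gap but no delay.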
Slide12
A Case Result
Anomaly cause: incorrect timeout setting when timer ticks per second (HZ) changes from 1000 to 250 in Linux 2.6.23
Top-ranked metrics – anticipatory I/O timeouts and anticipation breaks
#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
Slide13
Effects of Anomaly Corrections
Anomaly corrections lead to predictable performance behavior patterns
Slide14
Related Work
Peer differencing for debugging
Delta debugging [Zeller’02]: differencing program runs of various inputs
PeerPressure [Wang et al.’04]: differencing Windows registry settings
Triage [Tucek et al.’07]: differencing basic block execution frequency
→ Target program/system failures; failure symptoms easily identifiable; correct peers presumably known
Our performance anomaly identification
Challenge: both anomalous and normal performance behaviors are hard to identify in complex systems
Key contribution: scalable construction of performance deviation profiles
Slide15
Summary
Principled use of references in performance anomaly identification
Scalable construction of performance deviation profiles to identify anomaly symptoms
Target-reference differencing of system metric manifestations to help identify anomaly causes
Identified real performance problems in Linux and a J2EE-based distributed system