Sayandeep Sen, Bell Labs India. Joint work with Sourjya Bhaumik & Rijin John.
Slide1
Automated Cellular Root Cause Analysis
Sayandeep Sen, Bell Labs India
Joint work with Sourjya Bhaumik & Rijin John
Slide2
Cellular Base Station Monitoring
[Figure: cell sites report to the monitoring centre every 15 minutes]
Slide3
Cellular Base Station Monitoring
Performance counters
Example: connected users, average signal strength, cell radius, etc.
[Figure: cell sites send performance counters to the monitoring centre every 15 minutes]
Slide4
Cellular Base Station Monitoring
KPI: Key Performance Indicator
Example: call drop rate, successful connection setup rate, throughput
[Figure: cell sites send KPIs to the monitoring centre every 15 minutes]
Slide5
Root cause analysis
Why did the KPI go below its threshold?
Answered manually
[Figure: cell sites send KPIs and performance counters to the monitoring centre]
Slide6
Root Cause Analysis – Issues
Manual debugging is inefficient
Too many variables: ~300 parameters
1 engineer per O(100) cell sites
[Figure: time series of the KPI and of Parameter 1 through Parameter N]
Slide7
Root Cause Analysis – Issues
Manual debugging is inefficient
Too many variables: ~300 parameters
1 engineer per O(100) cell sites
Sporadic parameter dips
[Figure: time series of the KPI and of Parameter 1 through Parameter N, with a dip marked "???"]
Slide8
Root Cause Analysis – Issues
Manual debugging is inefficient
Too many variables: ~300 parameters
1 engineer per O(100) cell sites
Sporadic parameter dips
Multiple parameter interaction
[Figure: time series of the KPI and of Parameter 1 through Parameter N]
Slide9
Problem Statement
Carry out automated (fast) root cause analysis which accounts for sporadic dips and multiple parameter interactions while ensuring human-readable output.
Slide10
Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work
Slide11
Key Intuition
The KPI-parameter relationship depends on other parameter values
Slide12
Key Intuition
[Figure: call success plotted against connection requests (Conn. Req.) and handoff rate]
Slide13
Key Intuition
[Figure: call success vs. Conn. Req. and handoff rate, with the threshold marked; region labelled "Conn. Req. > X & H/o = y"]
Slide14
Key Intuition
The KPI-parameter relationship depends on other parameter values
[Figure: call success vs. Conn. Req. and handoff rate; region labelled "Conn. Req. > X' & H/o = y'"]
Determine the rules for the various combinations of parameter values using regression trees
Slide15
Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work
Slide16
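Regression trees
Form clusters of points
To minimize the sum of the distance metric for sub-clusters
[Figure: call success scatter plot; a cluster with metric Δ is divided into sub-clusters with metrics Δ' and Δ'']
Slide17
Regression trees
Form clusters of points
To minimize the sum of the distance metric for sub-clusters
Distance metric: sum of Euclidean distances of points in a sub-cluster
Provide a human-readable rule for each cluster
[Figure: call success scatter plot; a cluster with metric Δ is divided into sub-clusters with metrics Δ' and Δ'']
Slide18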
Regression trees
1) Pick an axis
2) Calculate Δ
[Figure: call success vs. Conn. Req.]
Slide19
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
[Figure: call success vs. Conn. Req., pivot at X]
Slide20
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
[Figure: call success vs. Conn. Req., pivot at X; sub-clusters with metrics Δ' and Δ'']
Slide21
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
Repeat for all pivots
[Figure: call success vs. Conn. Req., several candidate pivots X]
Slide22
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
Repeat for all pivots
Repeat for all axes
[Figure: call success vs. Conn. Req.]
Slide23
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ' + Δ''
[Figure: call success vs. Conn. Req., pivot at X; tree nodes: Conn. Req. < X, Conn. Req. >= X]
Slide24
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ' + Δ''
Repeat for sub-clusters
[Figure: call success vs. Conn. Req., pivot at X; tree nodes: Conn. Req. < X, Conn. Req. >= X]
Slide25
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ' + Δ''
Repeat for sub-clusters
[Figure: call success vs. Conn. Req. (pivot X) and handoff rate (pivot Y); tree nodes: Conn. Req. < X, Conn. Req. >= X, Handoff Rate >= Y, Handoff Rate < Y]
Slide26
Regression trees
1) Pick an axis
2) Calculate Δ
3) Pick a pivot to divide the points into two clusters
4) Calculate Δ' + Δ''
Repeat for all pivots
Repeat for all axes
5) Pick the pivot with minimum Δ' + Δ''
Repeat for sub-clusters
Select rules corresponding to low KPI values
[Figure: call success vs. Conn. Req. (pivot X) and handoff rate (pivot Y); tree nodes: Conn. Req. < X, Conn. Req. >= X, Handoff Rate >= Y, Handoff Rate < Y]
Slide27
Regression trees
Human readable
Captures multiple variable interactions
Captures sporadic events due to time-agnostic clustering
[Figure: call success vs. Conn. Req. (pivot X) and handoff rate (pivot Y); tree nodes: Conn. Req. < X, Conn. Req. >= X, Handoff Rate >= Y, Handoff Rate < Y]
Slide28
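The regression-tree steps on the preceding slides (pick an axis, try every pivot, keep the pivot with minimum Δ' + Δ'', repeat for sub-clusters, select rules for low KPI values) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes Δ is the sum of squared distances of the KPI values from the cluster mean (the usual regression-tree criterion; the slides only say "sum of Euclidean distance"), and the stopping parameters `min_gain` and `max_depth`, the parameter names and the toy data are all invented for the example.

```python
from statistics import mean

def delta(kpis):
    """Cluster spread: sum of squared distances of KPI values from their mean (assumed metric)."""
    if not kpis:
        return 0.0
    m = mean(kpis)
    return sum((k - m) ** 2 for k in kpis)

def best_split(rows, kpis):
    """Steps 1-5: try every axis and pivot, keep the one with minimum delta' + delta''."""
    best = None
    for axis in rows[0]:                                   # repeat for all axes
        for pivot in sorted({r[axis] for r in rows})[1:]:  # repeat for all pivots
            left = [k for r, k in zip(rows, kpis) if r[axis] < pivot]
            right = [k for r, k in zip(rows, kpis) if r[axis] >= pivot]
            cost = delta(left) + delta(right)
            if best is None or cost < best[0]:
                best = (cost, axis, pivot)
    return best  # None when no pivot separates the points

def grow(rows, kpis, conditions=(), min_gain=1.0, max_depth=3):
    """Repeat for sub-clusters; return one (rule, mean KPI, size) tuple per leaf."""
    split = best_split(rows, kpis) if max_depth > 0 else None
    if split is None or delta(kpis) - split[0] < min_gain:
        return [(" AND ".join(conditions) or "TRUE", mean(kpis), len(kpis))]
    _, axis, pivot = split
    leaves = []
    for op, side in (("<", True), (">=", False)):
        idx = [i for i, r in enumerate(rows) if (r[axis] < pivot) == side]
        leaves += grow([rows[i] for i in idx], [kpis[i] for i in idx],
                       conditions + (f"{axis} {op} {pivot}",), min_gain, max_depth - 1)
    return leaves

# Toy data (invented): call success dips only when both parameters are high.
samples = [  # (conn_req, handoff, call_success)
    (10, 1, 99.5), (20, 6, 99.4), (30, 2, 99.6), (40, 7, 99.5), (50, 6, 99.5),
    (90, 1, 99.4), (95, 2, 99.5), (90, 6, 95.0), (95, 7, 95.2),
]
rows = [{"conn_req": c, "handoff": h} for c, h, _ in samples]
kpis = [k for _, _, k in samples]

# Select rules corresponding to low KPI values (98.5% threshold as used in the talk).
low_kpi_rules = [rule for rule, avg, _ in grow(rows, kpis) if avg < 98.5]
print(low_kpi_rules)
```

On this toy data the only low-KPI leaf combines both parameters, which is the multiple-parameter interaction the talk is after. Timestamps never enter the split, so sporadic dips are grouped with other dips that share the same parameter values.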
Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work
Slide29
Regression trees – Issues
Distance metric oblivious of significance of KPI values
Curse of dimensionality
Slide30
Metric oblivious of KPI value significance
Need big separation between good and bad values
[Figure: call success vs. Conn. Req. and handoff rate]
Slide31
Metric oblivious of KPI value significance
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, Bad]
Slide32
Metric oblivious of KPI value significance
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]
Slide33
Metric oblivious of KPI value significance
Distinction between good and bad is small
Stratify KPI values
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]
Slide34
Metric oblivious of KPI value significance
Distinction between good and bad is small
Multiply KPI value with custom step function
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]
Slide35
Stratification of data
Distinction between good and bad is small
Multiply KPI value with custom step function
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]
Slide36
Stratification of data
Distinction between good and bad is small
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, 98.6%, 98.7%, Bad]
Slide37
Stratification of data
Distinction between good and bad is small
[Figure: call success vs. Conn. Req. and handoff rate; labels: 98.5%, 98.6%, 98.7%, 98.5%, Bad]
Slide38
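A minimal sketch of the stratification idea from the preceding slides: multiplying the KPI by a step function so that bad values sit far from good ones before the tree is built. The talk does not give the actual step function; the 98.5% threshold matches the one used for call success elsewhere in the deck, while the `bad_weight` of 0.5 and the sample values are assumptions made only for illustration.

```python
def stratify(kpi, threshold=98.5, bad_weight=0.5):
    """Multiply the KPI by a step function: 1.0 for good values, bad_weight below the threshold."""
    weight = 1.0 if kpi >= threshold else bad_weight
    return kpi * weight

raw = [98.7, 98.6, 98.5, 98.4]             # the last value counts as a KPI dip
stratified = [stratify(v) for v in raw]

gap_before = raw[2] - raw[3]               # distance between good and bad, raw
gap_after = stratified[2] - stratified[3]  # same distance after stratification
print(stratified, gap_before, gap_after)
```

With the raw values, the distance metric sees good and bad samples as near neighbours. After stratification the bad sample is far away, so a split that isolates it reduces Δ' + Δ'' sharply.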
Regression trees – Issues
Distance metric oblivious of significance of KPI values
Stratify KPI values
Curse of dimensionality
Dimensionality reduction
Slide39
Curse of Dimensionality
[Figure: call success vs. traffic load and interference]
Traffic Load > X & Interference > Y
Handoff rate < X & Conn. Req. < Y
Cell Radius > X & Allotted Power < Y
Slide40
Curse of Dimensionality
[Figure: call success vs. traffic load and interference]
Traffic Load > X & Interference > Y
Handoff rate < X & Conn. Req. < Y
Cell Radius > X & Allotted Power < Y
~300 variables lead to 2^300 combinations
The regression tree can be misled
Slide41
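To put the 2^300 figure from the preceding slide in perspective, the short calculation below compares it with an upper bound on the number of samples in the data set reported in the Results part of this talk (28 days, 217 cell sites, one sample every 15 minutes). The assumption that every site reports every interval is mine; the real count can only be lower.

```python
combinations = 2 ** 300                       # parameter combinations quoted on the slide
samples_per_day = 24 * 60 // 15               # one sample every 15 minutes
max_samples = 217 * 28 * samples_per_day      # cell sites x days x samples per day

print(f"combinations ~ {combinations:.2e}")
print(f"samples      <= {max_samples}")
```

Even the upper bound on samples is dozens of orders of magnitude smaller than the number of combinations, which is why the tree can latch onto parameters that only appear related to the KPI by chance.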
Dimensionality reduction
Preprocessing
Remove correlated, barely changing parameters, etc.
Domain knowledge based filtering
Remove unrelated parameters, apply weights
Heuristics
Spike, Correlation, 3 more …
Slide42
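The preprocessing step on the preceding slide (remove correlated and barely changing parameters) could look roughly like the sketch below. The thresholds `min_std` and `max_corr`, the function names and the sample columns are hypothetical; the talk does not say how either filter is tuned.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_parameters(columns, min_std=1e-6, max_corr=0.95):
    """Drop barely changing parameters and parameters that duplicate an already kept one."""
    kept = {}
    for name, values in columns.items():
        if pstdev(values) <= min_std:
            continue                      # barely changing: carries no information
        if any(abs(pearson(values, other)) >= max_corr for other in kept.values()):
            continue                      # strongly correlated with a kept parameter
        kept[name] = values
    return kept

columns = {  # invented counters, four samples each
    "connected_users": [10, 20, 30, 40],
    "active_sessions": [20, 40, 60, 80],   # perfectly correlated with connected_users
    "antenna_height":  [30, 30, 30, 30],   # never changes
    "handoff_rate":    [4, 1, 3, 2],
}
kept_names = list(filter_parameters(columns))
print(kept_names)
```

Which of two correlated parameters survives depends on iteration order here; a domain-knowledge weighting, as the slide suggests, would be a more principled tie-breaker.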
Spike heuristic
Values spike around the same time
[Figure: time series of call success and of a parameter]
Slide43
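One way the spike heuristic from the preceding slide might be realised is sketched below: flag samples that deviate from the series mean by more than `k` standard deviations, and keep a parameter only if every KPI spike has a parameter spike within `window` samples. The spike definition, `k`, `window` and the series are all assumptions; the slide only states that the values spike around the same time.

```python
from statistics import mean, pstdev

def spike_times(series, k=2.0):
    """Indices whose value lies more than k standard deviations from the mean."""
    m, s = mean(series), pstdev(series)
    if s == 0:
        return []
    return [t for t, v in enumerate(series) if abs(v - m) > k * s]

def spikes_together(kpi, param, k=2.0, window=1):
    """True if every KPI spike has a parameter spike at most `window` samples away."""
    kpi_spikes = spike_times(kpi, k)
    param_spikes = spike_times(param, k)
    if not kpi_spikes:
        return False
    return all(any(abs(t - u) <= window for u in param_spikes) for t in kpi_spikes)

# Invented 15-minute series: call success dips at t=6, connection requests jump at t=5.
call_success = [99.5, 99.5, 99.5, 99.5, 99.5, 99.5, 95.0, 99.5, 99.5, 99.5]
conn_req     = [20, 20, 20, 20, 20, 90, 20, 20, 20, 20]
cell_radius  = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6]   # fluctuates but never spikes

keep_conn_req = spikes_together(call_success, conn_req)
keep_cell_radius = spikes_together(call_success, cell_radius)
print(keep_conn_req, keep_cell_radius)
```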
Correlation heuristic
Correlation changes significantly
[Figure: call success vs. Conn. Req., plotted separately for Call Success > 98.5% and Call Success <= 98.5%]
Slide44
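A rough sketch of the correlation heuristic from the preceding slide: compute the correlation between a parameter and call success separately for good samples (above 98.5%) and bad samples (at or below 98.5%), and treat the parameter as interesting when the two values differ a lot. Using Pearson correlation and measuring the change as an absolute difference are assumptions, as are the parameter names and data.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_shift(param, kpi, threshold=98.5):
    """Absolute change in parameter-KPI correlation between good and bad KPI samples."""
    good = [(p, k) for p, k in zip(param, kpi) if k > threshold]
    bad = [(p, k) for p, k in zip(param, kpi) if k <= threshold]
    if len(good) < 2 or len(bad) < 2:
        return 0.0                        # not enough samples on one side
    return abs(pearson(*zip(*good)) - pearson(*zip(*bad)))

# Invented samples: four good intervals followed by four bad ones.
call_success  = [99.0, 99.1, 98.9, 99.0, 98.0, 97.0, 96.0, 95.0]
conn_req      = [10, 20, 30, 40, 80, 85, 90, 95]          # matters only when the KPI is bad
failed_setups = [1.0, 0.9, 1.1, 1.0, 2.0, 3.0, 4.0, 5.0]  # tracks the KPI in both regimes

shift_conn_req = correlation_shift(conn_req, call_success)
shift_failed_setups = correlation_shift(failed_setups, call_success)
print(shift_conn_req, shift_failed_setups)
```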
Rule generation
[Figure: pipeline from data store to rule store with the stages Apply filters, Stratify KPI data, Regression tree, Select rules]
Slide45
Rule application
[Figure: the rule store returns the matching rules]
Slide46
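Rule application, as drawn on the preceding slide, amounts to looking up the stored rules for a dipping KPI and returning those whose conditions hold for the current counters. The sketch below uses the example rule set shown in the Results part of this talk; the data layout, the key names (such as `users_5_to_10_km_pct`) and the sample counter values are hypothetical.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

# Rule store: KPI dip -> list of rules; each rule is a list of AND-ed conditions.
RULE_STORE = {
    "call_success_rate < 98.5%": [
        [("users_5_to_10_km_pct", ">", 63)],
        [("users_bad_rss_pct", ">", 21), ("uplink_load_mb", ">", 831)],
        [("download_traffic_kbytes", "<", 500), ("active_users", "<", 200)],
    ],
}

def matching_rules(kpi_dip, counters):
    """Return the stored rules for kpi_dip whose conditions all hold for `counters`."""
    return [rule for rule in RULE_STORE.get(kpi_dip, [])
            if all(OPS[op](counters[name], value) for name, op, value in rule)]

# Invented counters for one sector in one 15-minute interval.
counters = {
    "users_5_to_10_km_pct": 70,
    "users_bad_rss_pct": 25,
    "uplink_load_mb": 600,
    "download_traffic_kbytes": 900,
    "active_users": 350,
}
matched = matching_rules("call_success_rate < 98.5%", counters)
print(matched)
```

Only the first rule matches here, which corresponds to the "users concentrated at cell edge" explanation given for it in the Results slides.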
Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work
Slide47
Training & Verification Data
Analyzed 28 days of data from 217 cell sites
2 countries, 2 OEMs
317 parameters @ 15-minute interval
80% of data to train and 20% to validate
Slide48
Find rules for all KPI dips
Cell sites with at least 4 KPIs with more than 100 bad instances selected
Country #1 (18 cell sites)
Country #2 (60 cell sites)
[Figure: instances per KPI for each country]
Slide49
Rule Verification
Picked rules for 50 randomly selected KPI dips
Showed rules to 15 RF engineers (ongoing)
80% of rules were actionable
For all the KPI dips, at least one actionable rule in the rule set
Slide50
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 km from base station > 63%
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND Total active users < 200
Slide51
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 km from base station > 63%
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND Total active users < 200
Rule 1: users concentrated at cell edge
Slide52
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 km from base station > 63%
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND Total active users < 200
Rule 2: 21% users with bad RSSI and high traffic load
Slide53
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 km from base station > 63%
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND Total active users < 200
Rule 3: does not point to a meaningful cause?
Slide54
Example rule set
KPI dip: Call success rate < 98.5%
1) Total users in 5 to 10 km from base station > 63%
2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND Total active users < 200
Coarse timescale leading to multiple other failures
Don't have access to relevant parameters
Specific problem is a rare event in current sector
Slide55
Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work
Slide56
Ongoing Work
Recommending solution for a problem
Parameter list: remotely configurable parameters
Example: antenna tilt, min. signal strength to associate, allowable idle time, etc.
[Figure: cell sites send their parameter lists to the monitoring centre]
Slide57
Ongoing Work
Recommending solution for a problem
[Figure: cell sites send their parameter lists to the monitoring centre]
When a KPI dips:
Generate rules
Find sectors where the rules do not lead to a KPI dip
Return the parameter list for those sectors
Slide58
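The recommendation steps on the preceding slide can be sketched as below, assuming a rule has already been generated for the dip. The sector records, the counter and configuration names, and the choice to represent a rule as a Python predicate are all hypothetical; the talk describes this part as ongoing work and gives no implementation details.

```python
def recommend(rule, sectors, kpi_threshold=98.5):
    """Return the parameter lists of sectors where `rule` holds yet the KPI does not dip."""
    return {name: sector["config"] for name, sector in sectors.items()
            if rule(sector["counters"]) and sector["kpi"] >= kpi_threshold}

def bad_rss_and_high_load(counters):
    """Rule 2 of the example rule set, expressed as a predicate over counters."""
    return counters["users_bad_rss_pct"] > 21 and counters["uplink_load_mb"] > 831

# Invented sectors: A is the one with the KPI dip, B copes with the same conditions.
sectors = {
    "A": {"counters": {"users_bad_rss_pct": 30, "uplink_load_mb": 900},
          "kpi": 97.9, "config": {"antenna_tilt_deg": 2, "idle_timeout_s": 30}},
    "B": {"counters": {"users_bad_rss_pct": 28, "uplink_load_mb": 950},
          "kpi": 99.2, "config": {"antenna_tilt_deg": 6, "idle_timeout_s": 10}},
    "C": {"counters": {"users_bad_rss_pct": 5, "uplink_load_mb": 100},
          "kpi": 99.4, "config": {"antenna_tilt_deg": 4, "idle_timeout_s": 30}},
}
suggestions = recommend(bad_rss_and_high_load, sectors)
print(suggestions)
```

Sector C is excluded because the rule does not apply there, so its healthy KPI says nothing about how to handle the problem conditions.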
Ongoing Work
Recommending solution for problem
More customizations necessary …
Slide59
Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work
Slide60
Value aware networking
All bits of a video application are not created equal
MPEG4/H.264 encoded video: I, P and B frames
Nearer the deadline, more valuable the packet
[Figure: packet value for I, P and B frames; deadlines labelled < 5 msec and < 105 msec]
Slide61
Value aware application layer
[Figure: protocol stack (Application, Transport, Network, MAC, PHY); I, P and B frame data is passed down the stack through an API]
Slide62
Value aware networking
Can protocol decisions be taken in a value aware manner? Yes
Order of sending data
Times to retransmit
MAC data rate
Almost no data overhead
[Figure: protocol stack (Application, Transport, Network, MAC, PHY); I, P and B frame data is passed down the stack through an API]
Slide63
Questions?
Slide64
Backup
Slide65
Future work
Online regression tree formation
Fast emulation systems for what-if analysis
Slide66
Research overview
PhD Dissertation
Areas: Systems & Protocols, Cross-Layer design, Measurement & Analysis
Projects and venues, in the order they appear on the slide: Scout [Submitted], [DySPAN 2012], Range-Write [OSDI 2008], Apex [Sigcomm 2010], Medusa [NSDI 2010], MOM [Submitted], RDP-TS, DGP, [MobiCom 2006], MCB-Mesh [IMC 2008], Fractel [INFOCOM 2008], WiScape [IMC 2011], [WWW 2008], Topo-cons, WhiteCell, Root-cause, MultIfaceT [HotMobile'10]
Slide67
Multi-Interface systems
Benefits
Higher bandwidth
Reliability
Home repeater, vehicular whitespace, whitespace femto
[Figure: transmitter (Tx) and receiver (Rx) pairs]
Slide68
Multi-Interface systems
Challenges
API with higher layers
Striping decision
Channel selection
Feedback gathering
[Figure: transmitter (Tx) and receiver (Rx) pairs]