Automated Cellular Root Cause Analysis
Presentation Transcript

Slide 1

Automated Cellular Root Cause Analysis
Sayandeep Sen, Bell Labs India
Joint work with Sourjya Bhaumik & Rijin John

Slide 2

Cellular Base Station Monitoring
Cell sites report to the Monitoring Centre every 15 minutes.

Slide 3

Cellular Base Station Monitoring
Cell sites send performance counters to the Monitoring Centre every 15 minutes.
Performance counters - example: connected users, average signal strength, cell radius, etc.

Slide 4

Cellular Base Station Monitoring
Cell sites send KPIs to the Monitoring Centre every 15 minutes.
KPI: Key Performance Indicator - example: call drop rate, successful connection setup rate, throughput.

Slide 5

Root cause analysis
Cell sites send KPIs and performance counters to the Monitoring Centre.
Why did a KPI go below its threshold? Today this question is answered manually.

Slide 6

Root Cause Analysis - Issues
(KPI and Parameter 1 ... Parameter N plotted over time.)
Too many variables: ~300 parameters, 1 engineer per O(100) cell sites.
Manual debugging is inefficient.

Slide 7

Root Cause Analysis - Issues
Sporadic parameter dips.
Too many variables: ~300 parameters, 1 engineer per O(100) cell sites.
Manual debugging is inefficient.

Slide 8

Root Cause Analysis - Issues
Multiple parameter interaction.
Sporadic parameter dips.
Too many variables: ~300 parameters, 1 engineer per O(100) cell sites.
Manual debugging is inefficient.

Slide 9

Problem Statement
Carry out automated (fast) root cause analysis that accounts for sporadic dips and multiple parameter interactions while ensuring human-readable output.

Slide 10

Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work

Slide 11

Key Intuition
The KPI-parameter relationship depends on the values of the other parameters.

Slide 12

Key Intuition
(Scatter plot of Call Success against Conn. Req. and Handoff rate.)

Slide 13

Key Intuition
Below a Call Success threshold, the points satisfy: Conn. Req. > X & H/o = y.

Slide 14

Key Intuition
In a different operating region the rule changes: Conn. Req. > X' & H/o = y'.
The KPI-parameter relationship depends on the values of the other parameters.
Determine the rules for the various parameter value combinations using regression trees.
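
To make the intuition concrete: the rules the deck has in mind are conjunctions of threshold conditions on counters, such as "Conn. Req. > X & H/o = y". A minimal Python sketch of such a rule (the Condition/Rule classes and the numeric thresholds are illustrative assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Condition:
    parameter: str   # e.g. "conn_req"
    op: str          # one of "<", "<=", ">", ">=", "=="
    threshold: float

@dataclass
class Rule:
    conditions: List[Condition]   # all conditions must hold (conjunction)

    def matches(self, sample: dict) -> bool:
        ops = {"<": lambda a, b: a < b, "<=": lambda a, b: a <= b,
               ">": lambda a, b: a > b, ">=": lambda a, b: a >= b,
               "==": lambda a, b: a == b}
        return all(ops[c.op](sample[c.parameter], c.threshold) for c in self.conditions)

# Hypothetical thresholds for illustration: "Conn. Req. > 120 & Handoff rate == 0.3".
rule = Rule([Condition("conn_req", ">", 120.0), Condition("handoff_rate", "==", 0.3)])
print(rule.matches({"conn_req": 150.0, "handoff_rate": 0.3}))   # True
```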

Slide 15

Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work

Slide 16

Regression trees
Form clusters of points so as to minimize the sum of the distance metric (Δ) over the sub-clusters.
(Plot: Call Success values grouped into sub-clusters, each with its own Δ.)

Slide 17

Regression trees
Distance metric: sum of the Euclidean distances of the points in a sub-cluster.
Form clusters of points so as to minimize the sum of the distance metric over the sub-clusters.
Provide a human-readable rule for each cluster.
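
A minimal sketch of the distance metric Δ, reading "sum of Euclidean distance of points in a sub-cluster" as the sum of distances of each point's Call Success value from the sub-cluster mean (classic regression trees use squared error instead; this reading is an assumption):

```python
import numpy as np

def delta(kpi_values: np.ndarray) -> float:
    """Delta for one sub-cluster: sum of distances of the KPI values from the cluster mean."""
    if len(kpi_values) == 0:
        return 0.0
    return float(np.abs(kpi_values - kpi_values.mean()).sum())

def split_cost(kpi: np.ndarray, param: np.ndarray, pivot: float) -> float:
    """Delta' + Delta'' after splitting on one parameter axis at the given pivot."""
    return delta(kpi[param < pivot]) + delta(kpi[param >= pivot])
```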

Slide 18

Regression trees (Call Success vs. Conn. Req.)
1) Pick an axis.
2) Calculate Δ.

Slide 19

Regression trees
1) Pick an axis.
2) Calculate Δ.
3) Pick a pivot X to divide the points into two clusters.

Slide 20

Regression trees
1) Pick an axis.
2) Calculate Δ.
3) Pick a pivot X to divide the points into two clusters.
4) Calculate Δ' + Δ".

Slide 21

Regression trees
Steps 1) to 4) as before; repeat for all pivots.

Slide 22

Regression trees
Steps 1) to 4) as before; repeat for all pivots and for all axes.

Slide 23

Regression trees
5) Pick the pivot with minimum Δ' + Δ".
Resulting split: Conn. Req. < X vs. Conn. Req. >= X.

Slide 24

Regression trees
Steps 1) to 5) as before; repeat for the sub-clusters.
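
Putting steps 1) to 5) together: try every axis and every candidate pivot, keep the split with the minimum Δ' + Δ", and recurse into the two sub-clusters. A minimal sketch (NumPy assumed; the min_size stopping rule is an added assumption, since the deck does not say when splitting stops):

```python
import numpy as np

def split_cost(kpi, param, pivot):
    # Delta' + Delta'': sum of absolute deviations from each sub-cluster's mean KPI.
    return sum(float(np.abs(s - s.mean()).sum())
               for s in (kpi[param < pivot], kpi[param >= pivot]) if len(s))

def best_split(X, kpi):
    """Steps 1)-5): repeat over all axes and all pivots, keep the minimum Delta' + Delta''."""
    best = None   # (cost, axis, pivot)
    for axis in range(X.shape[1]):                   # 1) pick an axis; repeat for all axes
        for pivot in np.unique(X[:, axis])[1:]:      # 3) pick a pivot; repeat for all pivots
            cost = split_cost(kpi, X[:, axis], pivot)    # 2), 4) calculate Delta' + Delta''
            if best is None or cost < best[0]:
                best = (cost, axis, pivot)           # 5) keep the pivot with minimum cost
    return best

def build_tree(X, kpi, conditions=(), min_size=20):
    """Repeat for the sub-clusters; return a list of (conditions, mean KPI) leaves."""
    split = best_split(X, kpi) if len(kpi) >= min_size else None
    if split is None:
        return [(conditions, float(kpi.mean()))]
    _, axis, pivot = split
    left = X[:, axis] < pivot
    return (build_tree(X[left], kpi[left], conditions + ((axis, "<", pivot),), min_size) +
            build_tree(X[~left], kpi[~left], conditions + ((axis, ">=", pivot),), min_size))
```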

Slide 25

Regression trees
Repeating for the sub-clusters gives a second split, on Handoff rate at Y.
The resulting rules combine the splits: Conn. Req. < X / Conn. Req. >= X and Handoff Rate >= Y / Handoff Rate < Y.

Slide 26

Regression trees
Select the rules corresponding to low KPI (Call Success) values.
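
The final step on this slide, selecting the rules that correspond to low KPI values, could look like the sketch below. It consumes the (conditions, mean KPI) leaves from the hypothetical build_tree sketch above; the 98.5% threshold follows the example used later in the deck:

```python
def select_low_kpi_rules(leaves, parameter_names, bad_threshold=98.5):
    """Keep leaves whose average KPI is below the bad threshold, as human-readable rules."""
    rules = []
    for conditions, mean_kpi in leaves:
        if mean_kpi < bad_threshold:
            named = [(parameter_names[axis], op, pivot) for axis, op, pivot in conditions]
            text = " AND ".join(f"{name} {op} {pivot:g}" for name, op, pivot in named)
            rules.append((text, named, mean_kpi))
    return rules

# e.g. select_low_kpi_rules(build_tree(X, kpi), ["Conn. Req.", "Handoff rate"])
# might yield [("Conn. Req. >= 120 AND Handoff rate < 0.3", ..., 97.9)].
```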

Slide 27

Regression trees
Human readable.
Capture multiple-variable interaction.
Capture sporadic events, because the clustering is time agnostic.

Slide 28

Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work

Slide 29

Regression trees - Issues
Distance metric is oblivious to the significance of KPI values.
Curse of dimensionality.

Slide 30

Metric oblivious to KPI value significance
(Plot: Call Success vs. Conn. Req. and Handoff rate.)
Need a big separation between good and bad values.

Slide 31

Metric oblivious to KPI value significance
Call Success below 98.5% is considered bad.

Slide 32

Metric oblivious to KPI value significance
Points at 98.5%, 98.6%, and 98.7% sit next to each other, with 98.5% already "bad".

Slide 33

Metric oblivious to KPI value significance
The distinction between good and bad is small.
Stratify the KPI values.

Slide 34

Metric oblivious to KPI value significance
The distinction between good and bad is small.
Multiply the KPI value with a custom step function.
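
One way to realize "multiply the KPI value with a custom step function" is sketched below. The 98.5% threshold and the step levels are assumptions, chosen only to push good and bad samples far apart before Δ is computed:

```python
import numpy as np

def stratify_kpi(kpi: np.ndarray, bad_threshold: float = 98.5) -> np.ndarray:
    """Multiply the KPI with a step function: bad values keep their raw scale,
    good values are pushed far away so the distance metric cannot merge the two."""
    step = np.where(kpi < bad_threshold, 1.0, 100.0)   # custom step function (assumed levels)
    return kpi * step
```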

Slide 35

Stratification of data
The distinction between good and bad is small.
Multiply the KPI value with a custom step function.

Slide 36

Stratification of data
(Plot: Call Success vs. Conn. Req. and Handoff rate; the values 98.5% (bad), 98.6%, and 98.7% marked.)
The distinction between good and bad is small.

Slide 37

Stratification of data
(Plot: after stratification, the bad 98.5% points stand apart from the good values.)
The distinction between good and bad was small.

Slide 38

Regression trees - Issues
Distance metric oblivious to the significance of KPI values → stratify KPI values.
Curse of dimensionality → dimensionality reduction.

Slide 39

Curse of Dimensionality
(Plot: Call Success vs. Interference and Traffic Load.)
Candidate rules: Traffic Load > X & Interference > Y; Handoff rate < X & Conn. Req. < Y; Cell Radius > X & Allotted Power < Y.

Slide 40

Curse of Dimensionality
~300 variables lead to 2^300 combinations; the regression tree can be misled.

Slide 41

Dimensionality reduction
Preprocessing: remove correlated and barely changing parameters, etc.
Domain-knowledge-based filtering: remove unrelated parameters, apply weights.
Heuristics: Spike, Correlation, and 3 more.
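
A minimal preprocessing sketch for the first bullet (pandas assumed; the variance and correlation cut-offs are illustrative, not values from the deck):

```python
import pandas as pd

def reduce_dimensions(df: pd.DataFrame, min_std=1e-3, max_corr=0.95) -> pd.DataFrame:
    # 1) Remove barely changing parameters.
    df = df.loc[:, df.std() > min_std]
    # 2) Remove one of every pair of highly correlated parameters.
    corr = df.corr().abs()
    drop = set()
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in drop and b not in drop and corr.loc[a, b] > max_corr:
                drop.add(b)
    return df.drop(columns=sorted(drop))
```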

Slide 42

Spike heuristic
A parameter is relevant if its values and the KPI (Call Success) spike around the same time.
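
A possible reading of the spike heuristic, sketched below: keep a parameter if it spikes around the same time instants at which the KPI dips. The z-score spike test and the one-sample window are assumptions:

```python
import numpy as np

def spikes(series: np.ndarray, z: float = 3.0) -> np.ndarray:
    """Boolean mask marking instants where the series jumps much more than usual."""
    diffs = np.abs(np.diff(series, prepend=series[0]))
    return diffs > z * (diffs.std() + 1e-9)

def spike_score(param: np.ndarray, kpi_dips: np.ndarray, window: int = 1) -> float:
    """Fraction of KPI-dip instants with a parameter spike within +/- window samples."""
    p = spikes(param)
    hits = [p[max(0, t - window):t + window + 1].any() for t in np.flatnonzero(kpi_dips)]
    return float(np.mean(hits)) if hits else 0.0
```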

Slide 43

Correlation heuristic
The correlation between a parameter (e.g., Conn. Req.) and the KPI changes significantly between Call Success > 98.5% and Call Success <= 98.5%.
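
A possible reading of the correlation heuristic, sketched below: keep a parameter if its correlation with the KPI differs substantially between the good (Call Success > 98.5%) and bad (Call Success <= 98.5%) samples. The 0.5 cut-off on the difference is an assumption:

```python
import numpy as np

def correlation_shift(param: np.ndarray, kpi: np.ndarray, threshold: float = 98.5) -> float:
    """Absolute change in parameter-KPI correlation between good and bad samples."""
    good, bad = kpi > threshold, kpi <= threshold
    if good.sum() < 2 or bad.sum() < 2:
        return 0.0
    r_good = np.corrcoef(param[good], kpi[good])[0, 1]
    r_bad = np.corrcoef(param[bad], kpi[bad])[0, 1]
    return abs(r_good - r_bad)

def keep_by_correlation(param, kpi, min_shift=0.5):
    return correlation_shift(param, kpi) >= min_shift
```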

Slide 44

Rule generation
Data store → apply filters → stratify KPI data → regression tree → select rules → rule store.
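
Wiring the pieces together, the rule-generation pipeline on this slide could be sketched as below. It reuses the hypothetical reduce_dimensions, stratify_kpi, build_tree, and select_low_kpi_rules helpers from the earlier sketches; data_store is assumed to be a pandas DataFrame and rule_store a plain list. The bad threshold still applies after stratification because bad values are left on their raw scale:

```python
def generate_rules(data_store, kpi_name, rule_store, bad_threshold=98.5):
    counters = reduce_dimensions(data_store.drop(columns=[kpi_name]))      # apply filters
    kpi = stratify_kpi(data_store[kpi_name].to_numpy(), bad_threshold)     # stratify KPI data
    leaves = build_tree(counters.to_numpy(), kpi)                          # regression tree
    rules = select_low_kpi_rules(leaves, list(counters.columns),           # select rules
                                 bad_threshold=bad_threshold)
    rule_store.extend(rules)                                               # rule store
    return rules
```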

Slide 45

Rule application
Incoming data is matched against the rule store to find the matching rules.
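
Rule application could then be as simple as the sketch below, assuming the rule store holds the (text, conditions, mean KPI) triples produced by the rule-generation sketch:

```python
def rule_matches(sample: dict, conditions) -> bool:
    """conditions: list of (parameter name, op, threshold) with op in {'<', '>='}."""
    ops = {"<": lambda a, b: a < b, ">=": lambda a, b: a >= b}
    return all(ops[op](sample[name], thr) for name, op, thr in conditions)

def matching_rules(sample: dict, rule_store):
    """Return the texts of the stored rules that a fresh 15-minute sample satisfies."""
    return [text for text, conditions, _ in rule_store if rule_matches(sample, conditions)]
```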

Slide 46

Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work

Slide 47

Training & Verification Data
Analyzed 28 days of data from 217 cell sites (2 countries, 2 OEMs).
317 parameters at 15-minute intervals.
80% of the data used to train, 20% to validate.

Slide 48

Find rules for all KPI dips
Country #1 (18 cell sites) and Country #2 (60 cell sites).
Cell sites with at least 4 KPIs having more than 100 bad instances were selected.
(Charts: bad instances per KPI for each country.)

Slide 49

Rule Verification
Picked rules for 50 randomly selected KPI dips.
Showed the rules to 15 RF engineers (ongoing): 80% of the rules were actionable.
For every KPI dip there was at least one actionable rule in the rule set.

Slide 50

Example rule set - KPI dip: Call success rate < 98.5%
1) Total users 5 to 10 km from the base station > 63%
2) Total users in bad RSS region > 21% AND total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND total active users < 200

Slide 51

Example rule set - KPI dip: Call success rate < 98.5%
1) Total users 5 to 10 km from the base station > 63% → users concentrated at the cell edge.
2) Total users in bad RSS region > 21% AND total uplink load > 831 MB
3) Download traffic < 500 Kbytes AND total active users < 200

Slide 52

Example rule set - KPI dip: Call success rate < 98.5%
2) Total users in bad RSS region > 21% AND total uplink load > 831 MB → 21% of users with bad RSSI and a high traffic load.

Slide 53

Example rule set - KPI dip: Call success rate < 98.5%
3) Download traffic < 500 Kbytes AND total active users < 200 → does not point to a meaningful cause?

Slide 54

Example rule set - KPI dip: Call success rate < 98.5%
Why some rules do not point to a meaningful cause:
Coarse timescale leading to multiple other failures.
No access to the relevant parameters.
The specific problem is a rare event in the current sector.

Slide 55

Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work

Slide 56

Ongoing Work
Recommending a solution for a problem.
Cell sites send their parameter list to the Monitoring Centre.
Parameter list: remotely configurable parameters, e.g., antenna tilt, minimum signal strength to associate, allowable idle time, etc.

Slide 57

Ongoing Work
Recommending a solution for a problem. When a KPI dips:
Generate rules.
Find sectors where the rules do not lead to a KPI dip.
Return the parameter list for those sectors.
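
A sketch of the recommendation idea described on this slide (all data structures are assumptions for illustration; rule_matches is the hypothetical helper from the rule-application sketch): for a dipping sector, find peer sectors in which the same rule conditions hold yet the KPI stays healthy, and return their remotely configurable parameter settings.

```python
def recommend(rules, sectors, kpi_name, bad_threshold=98.5):
    """rules: (text, conditions, mean KPI) triples generated for the dipping sector.
    sectors: iterable of dicts with 'counters', 'kpi', and 'config' entries."""
    candidates = []
    for sector in sectors:
        hit = any(rule_matches(sector["counters"], conds) for _, conds, _ in rules)
        healthy = sector["kpi"][kpi_name] >= bad_threshold
        if hit and healthy:
            candidates.append(sector["config"])   # e.g. antenna tilt, min. signal strength
    return candidates
```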

Slide 58

Ongoing Work
Recommending a solution for a problem.
More customizations necessary ...

Slide 59

Outline
Motivation
Problem statement
Approach: Insight, Mechanism, Customizations
Results
Ongoing work
Other work

Slide 60

Value-aware networking
All bits of a video application are not created equal.
MPEG-4 / H.264 encoded video consists of I, P, and B frames with different values and deadlines (e.g., < 5 msec, < 105 msec).
The nearer the deadline, the more valuable the packet.

Slide 61

Value-aware application layer
The stack (Application, Transport, Network, MAC, PHY) normally sees video as an opaque bit stream.
An API exposes the frame value (I, P, B) from the application layer to the lower layers.

Slide 62

Value-aware networking
Can protocol decisions - the order of sending data, the number of times to retransmit, the MAC data rate - be taken in a value-aware manner?
Yes, with almost no data overhead (via the API).
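
Purely as an illustration of one value-aware protocol decision (this is not the authors' system; the frame weights and urgency formula are assumptions): ordering queued video packets so that more valuable frame types (I over P over B) and nearer deadlines are sent first.

```python
import time

FRAME_VALUE = {"I": 3, "P": 2, "B": 1}   # assumed relative importance of frame types

def send_order(queue, now=None):
    """queue: iterable of (frame_type, deadline_seconds, payload); returns payloads in send order."""
    now = time.monotonic() if now is None else now

    def urgency(ftype, deadline):
        # More valuable frame types and nearer deadlines rank higher.
        return FRAME_VALUE[ftype] / max(deadline - now, 1e-3)

    ranked = sorted(queue, key=lambda item: -urgency(item[0], item[1]))
    return [payload for _, _, payload in ranked]
```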

Slide 63

Questions?

Slide 64

Backup

Slide 65

Future work
Online regression tree formation.
Fast emulation systems for what-if analysis.

Slide 66

Research overview
Themes: Systems & Protocols, Cross-Layer design, Measurement & Analysis, Root-cause.
Projects: Scout [submitted], [DySPAN 2012], Range-Write [OSDI 2008], Apex [SIGCOMM 2010], Medusa [NSDI 2010], MOM [submitted], RDP-TS, DGP [MobiCom 2006], MCB-Mesh [IMC 2008], Fractel [INFOCOM 2008], WiScape [IMC 2011], [WWW 2008], Topo-cons, WhiteCell, MultIfaceT [HotMobile 2010], PhD Dissertation.

Slide 67

Multi-Interface systems - Benefits
Higher bandwidth, home repeater, vehicular whitespace, reliability, whitespace femto.
(Diagram: multiple Tx/Rx interfaces.)

Slide 68

Multi-Interface systems - Challenges
API with higher layers, striping decision, channel selection, feedback gathering.
(Diagram: multiple Tx/Rx interfaces.)