
Internet Measurement for Self-Driving Networks. Matt Calder, Minerva Chen, Jose Nunez de Caceres Estrada, Diego Perez Botero, Madhura Phadke, Manuel Schröder. April 4, 2019.



Presentation Transcript

Slide1

Internet Measurement for

Self-Driving Networks

Matt Calder

Minerva Chen, Jose Nunez de Caceres Estrada, Diego Perez Botero, Madhura Phadke, Manuel Schröder 

April 4, 2019

Slide2

Background

Azure Frontdoor: Microsoft’s content delivery network
  1st and (recently) 3rd party CDN
  Servers deployed on Microsoft's network edge
  Global application load balancing
  Reverse proxy / Split TCP
Dedicated Internet measurement team
  Systems used across Microsoft
  Network experimentation
  Network monitoring
  User-facing analytics
  Build out and capacity planning
  Traffic engineering

Slide3

We need Self-driving Networks

Use cases
  Availability drop / outages. Seattle Comcast users failure rate of 10% over the last 10 minutes.
  Latency regression. P90 RTT of users in Taiwan increased 100 ms.
Frequent and unending
Operate self-driving networks to mitigate
  Traffic engineering
Building blocks to enable self-driving networks took years
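The two use cases on this slide can be read as simple checks over recent measurements. The sketch below is illustrative only and is not taken from the talk: the `Sample` record, the function names, and the choice of nearest-rank percentiles are all assumptions. The 10% failure rate, 10-minute window, and 100 ms P90 increase mirror the numbers in the slide.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Sample:
    timestamp: float  # seconds since epoch
    rtt_ms: float     # round-trip time observed for the request
    failed: bool      # True if the request did not complete


def in_window(samples, now, seconds=600):
    """Keep samples from the last `seconds` (10 minutes by default)."""
    return [s for s in samples if now - seconds <= s.timestamp <= now]


def failure_rate(samples):
    """Fraction of failed requests; 0.0 when there is no data."""
    return sum(s.failed for s in samples) / len(samples) if samples else 0.0


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check(recent, baseline, max_failure_rate=0.10, max_p90_increase_ms=100.0):
    """Return the issues found when comparing recent data to a baseline."""
    issues = []
    if failure_rate(recent) >= max_failure_rate:
        issues.append("availability drop")
    ok_recent = [s.rtt_ms for s in recent if not s.failed]
    ok_base = [s.rtt_ms for s in baseline if not s.failed]
    if ok_recent and ok_base:
        increase = percentile(ok_recent, 90) - percentile(ok_base, 90)
        if increase >= max_p90_increase_ms:
            issues.append("latency regression")
    return issues
```

In practice the samples would be grouped first (for example by metro and ISP, or by country) so that each check runs on one user population at a time.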

Slide4

Outline

Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions

Slide5

Leading up to Self-driving Networks

Customer reported incidents
  Can't or unaware of how to measure it
Ad-hoc measurement and analysis
  Get info from customer
  TCP dump on production machine, copy trace locally, write script to analyze
  Traceroute from prod machine. Have customer send you traceroute from their end or find looking glass.
Time consuming investigations for engineers
Best outcome is troubleshooting guide

#1 No insight

Slide6

Leading up to Self-driving Networks

Data is there when you need it
  Measurements
  Telemetry
Large geo-distributed service
  Data ingestion latency
  Raw or aggregate
  Queryability

#2 Automate Data Collection

#1 No insight

Slide7

Leading up to Self-Driving Networks

Methodology
  Invest in statistics, data quality, and validation
  Translate networking domain knowledge into process
Schedule recurring jobs to look for issues
Produce reports

#3 Automate Issue Detection

#2 Automate Data Collection

#1 No insight
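A recurring detection job of the kind described in step #3 might look roughly like the following. This is a sketch under stated assumptions rather than the team's actual pipeline: the function name, the per-group dictionaries, the use of medians, and the `min_samples` data-quality guard are all hypothetical choices made for illustration.

```python
from statistics import median


def detect_regressions(baseline, current, min_samples=30, threshold_ms=100.0):
    """Compare per-group RTTs against a baseline and produce report rows.

    `baseline` and `current` map a group name (e.g. a country code) to a
    list of RTT samples in milliseconds. Groups with too little data are
    reported as such instead of being judged on noisy statistics.
    """
    report = []
    for group, rtts in sorted(current.items()):
        base = baseline.get(group, [])
        if len(rtts) < min_samples or len(base) < min_samples:
            report.append((group, "insufficient data", None))
            continue
        delta = median(rtts) - median(base)
        status = "regression" if delta >= threshold_ms else "ok"
        report.append((group, status, delta))
    return report
```

The "insufficient data" row is the data-quality part: a group with only a handful of samples is flagged for a human to look at rather than silently counted as healthy or broken.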

Slide8

Leading up to Self-Driving Networks

True test of #3
Raise alert to on-call engineer
Follow troubleshooting guide
Too many alerts -> on-call burnout
  Issues are mostly short-lived

#4 Alerting

#3 Automate Issue Detection

#2 Automate Data Collection

#1 No insight
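Because most issues are short-lived, one common way to reduce alert volume is to page only when a problem persists across several consecutive evaluations. The talk does not say how its alerting handles this; the class below is a minimal sketch of that general technique, and its name and the default of three evaluations are assumptions.

```python
class PersistentAlert:
    """Raise an alert only after `required` consecutive unhealthy checks."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0       # consecutive unhealthy evaluations so far
        self.firing = False   # True once the alert has been raised

    def observe(self, unhealthy):
        """Record one evaluation; return True only when the alert should fire."""
        if not unhealthy:
            # Recovery clears the streak and re-arms the alert.
            self.streak = 0
            self.firing = False
            return False
        self.streak += 1
        if self.streak >= self.required and not self.firing:
            self.firing = True
            return True
        return False
```

A single bad evaluation followed by recovery never pages anyone, and an ongoing issue pages once rather than on every evaluation.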

Slide9

Arriving at Self-driving Networks

Feed data to traffic engineering system
  Auth DNS, BGP
Missing piece is measurements of alternate paths
Examples
  Change egress traffic links
  Change ingress traffic PoPs

#5 Closing the loop

#4 Alerting

#3 Automate Issue Detection

#2 Automate Data Collection

#1 No insight
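Closing the loop depends on having measurements for the alternate paths as well as the current one. As a rough illustration of how such data could drive a traffic engineering decision, the sketch below picks a path by P90 latency. The function, the `min_gain_ms` hysteresis margin, and the PoP names in the usage note are hypothetical and not taken from the talk.

```python
def choose_path(current, p90_by_path, min_gain_ms=20.0):
    """Pick the path (e.g. an ingress PoP or egress link) to use next.

    `p90_by_path` maps a path name to its measured P90 latency in ms, or
    None when no usable measurement exists. Traffic moves only when an
    alternate beats the current path by at least `min_gain_ms`, which
    avoids flapping between near-equal paths.
    """
    measured = {p: v for p, v in p90_by_path.items() if v is not None}
    if not measured:
        return current  # nothing to compare against, stay put
    best = min(measured, key=measured.get)
    if current not in measured:
        return best     # current path has no data, use the best measured one
    if measured[current] - measured[best] >= min_gain_ms:
        return best
    return current
```

For example, `choose_path("sea", {"sea": 150.0, "pdx": 60.0})` moves traffic to `"pdx"`, while `choose_path("sea", {"sea": 70.0, "pdx": 60.0})` stays on `"sea"` because the gain is below the margin.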

Slide10

Outline

Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions

Slide11

Data Collection at Azure Frontdoor

Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics

Slide12

Data Collection at Azure Frontdoor

Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Active Server-side
  Traceroute, ping
  From servers to Internet destinations

Slide13

Data Collection at Azure Frontdoor

Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Active Server-side
  Traceroute, ping
  From servers to Internet destinations
Active Client-side
  + HTTP(S)
  From Microsoft users to Internet destinations

Slide14

Data Collection at Azure Frontdoor

Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Active Server-side
  Traceroute, ping
  From servers to Internet destinations
Active Client-side
  + HTTP(S)
  From Microsoft users to Internet destinations
Azure Global Telemetry
Data Access
  Real-time
  Near Real-time
  Offline

Slide15

Measurement Limitations

Passive server-side
  Issue 1: No explicit outage signal
  Issue 2: Alternate path exploration adds risk

Slide16

Measurement Limitations

Passive server-side
  Issue 1: No explicit outage signal
  Issue 2: Alternate path exploration adds risk
Active layer 3 measurements from servers
  Issue 1: Poor coverage
    74% of end-users are unresponsive
  Issue 2: Missing layer 7 behaviors
    HTTP redirection
    SSL/TLS

Slide17

Outline

Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions

Slide18

Odin Design

[Diagram: a client running Odin sends GET tiny.png over HTTP(S) to Server A at Microsoft, measures 20 ms, and uploads "A: 20ms" to the Report Endpoint, which feeds Offline Analysis and Online Alerting.]

1. Client-side Platform
2. Active Measurement
3. Application Layer

Slide19

Odin Design

[Diagram: a Stock Ticker Desktop User running Odin sends GET tiny.png over HTTP(S) to Server B at Microsoft, measures 20 ms, and uploads "B: 20ms" to the Report Endpoint, which feeds Offline Analysis and Online Alerting.]

1. Client-side Platform
2. Active Measurement
3. Application Layer
4. Both Web and Rich Clients

Slide20

Odin Design

[Diagram: the Stock Ticker Desktop User's GET tiny.png over HTTP(S) to Server B at Microsoft fails, and Odin uploads "B: ERROR" to the Report Endpoint, which feeds Offline Analysis and Online Alerting.]

1. Client-side Platform
2. Active Measurement
3. Application Layer
4. Both Web and Rich Clients
5. Explicit Failure Notification
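The measurement shown in these slides (fetch a small object, time it, and report either the latency or an explicit error) can be sketched as follows. This is not Odin's client code, which runs inside web and rich clients; the `measure` function, the report fields, and the injectable `fetch` and `clock` parameters are assumptions made so the example is self-contained.

```python
import time


def measure(target, fetch, clock=time.monotonic):
    """Time one fetch of `target` and return a report dictionary.

    `fetch` is any callable that downloads the URL and raises on failure.
    A failed fetch still produces a report, so the backend receives an
    explicit failure signal instead of silence.
    """
    start = clock()
    try:
        fetch(target)
    except Exception as exc:
        return {"target": target, "ok": False, "error": type(exc).__name__}
    elapsed_ms = (clock() - start) * 1000.0
    return {"target": target, "ok": True, "latency_ms": elapsed_ms}
```

With the standard library, `fetch` could be something like `lambda url: urllib.request.urlopen(url, timeout=5).read()`. Because this exercises a real HTTP(S) request, it also covers the layer 7 behaviors (redirects, TLS) that layer 3 probes miss.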

Slide21

Odin Design

[Diagram labels: Mail Server B, Odin, Microsoft, Report Upload Endpoint, Offline Analysis, Online Alerting]

Examples showed measurements to the application server
Want richer measurements

Slide22

Odin Design

[Diagram labels: Odin, Orchestration Service, Microsoft, Report Upload Endpoint, Offline Analysis, Online Alerting. The Orchestration Service is annotated with "Primary and backup report endpoints" and "Target URLs".]

Slide23

Odin Design

[Diagram: numbered steps 1 to 3 involving Odin, the Orchestration Service, Server B, and the Report Endpoint, across Microsoft U.S. and Microsoft Europe. Labels: m3.contoso.com, GET tiny.png, 20ms, m1.contoso.com: 20ms, Offline Analysis, Online Alerting.]

Slide24

Odin Design: Fault tolerance

[Diagram: Odin's GET tiny.gif to Server B fails and the report is "B: ERROR". Labels: Orchestration Service, Report Endpoint, Microsoft, Offline Analysis, Online Alerting.]

Need to receive measurements even if Microsoft’s network is unavailable

Slide25

Odin Design: Fault tolerance

[Diagram: Odin's GET tiny.gif to Server B fails and the report "B: ERROR" goes through a Report Proxy in a 3rd Party Network. Labels: Orchestration Service, Report Endpoint, Microsoft, Offline Analysis, Online Alerting.]
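The fault-tolerance idea (a primary report endpoint plus a backup reachable through a third-party network) amounts to trying endpoints in order until one accepts the report. The sketch below illustrates that ordering only; the `upload_report` function, the injectable `send` callable, and the endpoint URLs in the usage note are hypothetical.

```python
def upload_report(report, endpoints, send):
    """Try each report endpoint in order and return the one that accepted.

    `endpoints` lists the primary endpoint first, followed by backups such
    as a proxy hosted in a third-party network. `send(endpoint, report)`
    performs the upload and raises OSError (for example ConnectionError)
    when the endpoint is unreachable. Returns None if every endpoint fails.
    """
    for endpoint in endpoints:
        try:
            send(endpoint, report)
        except OSError:
            continue  # endpoint unreachable, fall through to the next one
        return endpoint
    return None
```

A client given `["https://report.example.com", "https://proxy.example.net"]` would still deliver a "B: ERROR" report through the second entry when the first is unreachable, which is the situation where the report matters most.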

Slide26

Summary: Odin enables Self-driving Networks

Coverage
  No better vantage points than your actual customers
Safety
  Don’t need to experiment/measure with prod traffic
  Ability to validate
Flexibility
  Supports enterprise network and privacy requirements
Fault tolerance
  Measurements available during outages

Slide27

Outline

Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions

Slide28

Challenges and Future Directions

Data Quality
  How do we build systems which avoid making bad decisions?
When do we get humans back in the loop?
  TE keeps fixing recurring problems
  May require change in service, additional capacity
Need for collaboration
  Issues impacting common resources e.g. IXPs, transit, end-users
  Self-driving networks will route accordingly
  Want to help fix underlying issue
  Still need to email NOC
  Signals published by content providers
    Network operators subscribe

Slide29

Thanks!