Slide1
Internet Measurement for
Self-Driving Networks
Matt Calder
Minerva Chen, Jose Nunez de Caceres Estrada, Diego Perez Botero, Madhura Phadke, Manuel Schröder
April 4, 2019
Slide2
Background
Azure Frontdoor – Microsoft’s content delivery network
  1st and (recently) 3rd party CDN
  Servers deployed on Microsoft's network edge
  Global application load balancing
  Reverse proxy / Split TCP
Dedicated Internet measurement team
Systems used across Microsoft
  Network experimentation
  Network monitoring
  User-facing analytics
  Build out and capacity planning
  Traffic engineering
Slide3
We need Self-driving Networks
Use cases
  Availability drop / outages
    Example: Seattle Comcast users see a failure rate of 10% over the last 10 minutes.
  Latency regression
    Example: P90 RTT of users in Taiwan increased by 100 ms.
Frequent and unending
Operate self-driving networks to mitigate
  Traffic engineering
Building blocks to enable self-driving networks took years
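As a rough illustration of the availability use case, the sketch below computes a failure rate per (metro, ISP) group over the last 10 minutes. The record layout, the `failure_rates` helper, and the sample data are assumptions made for this example and are not taken from the presented system.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # "the last 10 minutes"


def failure_rates(records, now, window=WINDOW_SECONDS):
    """Return {(metro, isp): failure_rate} for records inside the window.

    Each record is a hypothetical tuple: (timestamp, metro, isp, succeeded).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for ts, metro, isp, ok in records:
        if now - window < ts <= now:
            key = (metro, isp)
            totals[key] += 1
            if not ok:
                failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Made-up sample: 1 failure out of 10 recent Seattle/Comcast requests.
sample = [(1000 + i, "Seattle", "Comcast", i != 0) for i in range(10)]
sample.append((100, "Seattle", "Comcast", False))  # outside the window, ignored
rates = failure_rates(sample, now=1200)
```

Grouping by metro and ISP mirrors the "Seattle Comcast" example on the slide; a real deployment would likely choose its own grouping keys.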
Slide4
Outline
Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions
Slide5
Leading up to Self-driving Networks
Customer reported incidents
  Can't measure it, or unaware of how to
Ad-hoc measurement and analysis
  Get info from customer
  TCP dump on production machine, copy trace locally, write script to analyze
  Traceroute from prod machine. Have customer send you a traceroute from their end, or find a looking glass.
Time-consuming investigations for engineers
Best outcome is a troubleshooting guide
#1 No insight
Slide6
Leading up to Self-driving Networks
Data is there when you need it
  Measurements
  Telemetry
Large geo-distributed service
  Data ingestion latency
  Raw or aggregate
  Queryability
#2 Automate Data Collection
#1 No insight
Slide7
Leading up to Self-Driving Networks
Methodology
  Invest in statistics, data quality, and validation
  Translate networking domain knowledge into process
Schedule recurring jobs to look for issues
Produce reports
#3 Automate Issue Detection
#2 Automate Data Collection
#1 No insight
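One way to picture a recurring detection job is a function that compares the current P90 RTT for a user group against a baseline. This is a minimal sketch; the nearest-rank percentile, the 100 ms threshold (borrowed from the Taiwan example as a cutoff), and all sample values are assumptions, not the team's actual methodology.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def latency_regression(baseline_rtts, current_rtts, threshold_ms=100, pct=90):
    """Return (regressed, delta_ms) comparing current P90 against a baseline."""
    delta = percentile(current_rtts, pct) - percentile(baseline_rtts, pct)
    return delta >= threshold_ms, delta


# Made-up RTT samples in milliseconds for one user group.
baseline = [40, 42, 45, 50, 55, 60, 62, 65, 70, 80]
current = [45, 50, 60, 90, 120, 150, 160, 165, 175, 200]
regressed, delta = latency_regression(baseline, current)
```

A scheduler would call `latency_regression` per group on each run and write the flagged groups into a report.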
Slide8
Leading up to Self-Driving Networks
True test of #3
Raise alert to on-call engineer
Follow troubleshooting guide
Too many alerts -> on-call burnout
Issues are mostly short-lived
#4 Alerting
#3 Automate Issue Detection
#2 Automate Data Collection
#1 No insight
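Because issues are mostly short-lived, one common way to reduce alert volume is to page only when an issue persists across several consecutive detection windows. The sketch below illustrates that idea; the `should_alert` helper and the choice of three windows are assumptions for this example, not something the slides specify.

```python
def should_alert(window_flags, min_consecutive=3):
    """Alert only if the most recent `min_consecutive` windows all flagged an issue.

    `window_flags` is a chronological list of booleans, one per detection window.
    """
    if len(window_flags) < min_consecutive:
        return False
    return all(window_flags[-min_consecutive:])


blip = [False, True, False, False, True]       # short-lived issues, no page
sustained = [False, False, True, True, True]   # persistent issue, page on-call
```

The trade-off is detection delay: a larger `min_consecutive` suppresses more blips but pages later for real outages.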
Slide9
Arriving at Self-driving Networks
Feed data to traffic engineering system
  Auth DNS, BGP
Missing piece is measurements of alternate paths
Examples
  Change egress traffic links
  Change ingress traffic PoPs
#5 Closing the loop
#4 Alerting
#3 Automate Issue Detection
#2 Automate Data Collection
#1 No insight
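To make "closing the loop" concrete, the sketch below picks an ingress PoP for one client group from latency measurements that include alternate paths. It is only an illustration: the `pick_pop` helper, the use of the median, the 10 ms switching margin, and the PoP names are all assumptions.

```python
from statistics import median


def pick_pop(samples_by_pop, current_pop, min_gain_ms=10):
    """Return the PoP to use for one client group.

    `samples_by_pop` maps a PoP name to a list of latency samples in ms.
    The current PoP is kept unless another one is better by `min_gain_ms`,
    which avoids flapping between nearly equivalent paths.
    """
    medians = {pop: median(s) for pop, s in samples_by_pop.items() if s}
    if not medians:
        return current_pop  # no data: do not move traffic
    best = min(medians, key=medians.get)
    if current_pop not in medians:
        return best
    if medians[current_pop] - medians[best] >= min_gain_ms:
        return best
    return current_pop


samples = {"pop-a": [80, 85, 90], "pop-b": [40, 45, 50], "pop-c": []}
choice = pick_pop(samples, current_pop="pop-a")
```

The decision itself would then be applied through the mechanisms named on the slide, such as authoritative DNS answers or BGP announcements.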
Slide10
Outline
Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions
Slide11
Data Collection at Azure Frontdoor
Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Slide12
Data Collection at Azure Frontdoor
Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Active Server-side
  Traceroute, ping
  From servers to Internet destinations
Slide13
Data Collection at Azure Frontdoor
Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Active Server-side
  Traceroute, ping
  From servers to Internet destinations
Active Client-side
  + HTTP(S)
  From Microsoft users to Internet destinations
Slide14
Data Collection at Azure Frontdoor
Passive Server-side
  Client requests instrumented at server
  Collect TCP and application layer metrics
Active Server-side
  Traceroute, ping
  From servers to Internet destinations
Active Client-side
  + HTTP(S)
  From Microsoft users to Internet destinations
Azure Global Telemetry
Data Access
  Real-time
  Near Real-time
  Offline
Slide15
Measurement Limitations
Passive server-side
  Issue 1: No explicit outage signal
  Issue 2: Alternate path exploration adds risk
Slide16
Measurement Limitations
Passive server-side
  Issue 1: No explicit outage signal
  Issue 2: Alternate path exploration adds risk
Active layer 3 measurements from servers
  Issue 1: Poor coverage
    74% of end-users are unresponsive
  Issue 2: Missing layer 7 behaviors
    HTTP redirection
    SSL/TLS
Slide17
Outline
Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions
Slide18
Odin Design
1. Client-side Platform
2. Active Measurement
3. Application Layer
[Diagram: Odin issues GET tiny.png over HTTP(S) to Server A, measures 20 ms, and reports "A: 20ms" to the Report Endpoint, which feeds Offline Analysis and Online Alerting. Other label: Microsoft.]
Slide19
Odin Design
1. Client-side Platform
2. Active Measurement
3. Application Layer
4. Both Web and Rich Clients
[Diagram: Odin, running for a Stock Ticker Desktop User, issues GET tiny.png over HTTP(S) to Server B, measures 20 ms, and reports "B: 20ms" to the Report Endpoint, which feeds Offline Analysis and Online Alerting. Other label: Microsoft.]
Slide20
Odin Design
1. Client-side Platform
2. Active Measurement
3. Application Layer
4. Both Web and Rich Clients
5. Explicit Failure Notification
[Diagram: Odin, running for a Stock Ticker Desktop User, issues GET tiny.png over HTTP(S) to Server B, the request fails, and Odin reports "B: ERROR" to the Report Endpoint, which feeds Offline Analysis and Online Alerting. Other label: Microsoft.]
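The measurement flow in these Odin slides (fetch a small object, time it, report either a latency or an explicit error) can be sketched as follows. This is an illustrative approximation and not Odin's client code: the `measure` function, the report fields, and the `.example` URLs are hypothetical, and the fetcher and clock are injected so the sketch runs without network access.

```python
import time


def measure(target, fetch, clock=time.monotonic):
    """Time one fetch of `target` and return a report dict.

    `fetch` is any callable that retrieves the URL and raises on failure.
    A failure becomes an explicit "ERROR" report rather than a missing sample.
    """
    start = clock()
    try:
        fetch(target)
    except Exception as exc:
        return {"target": target, "result": "ERROR", "detail": type(exc).__name__}
    elapsed_ms = round((clock() - start) * 1000)
    return {"target": target, "result": "OK", "latency_ms": elapsed_ms}


def fake_clock(times):
    """Return a clock that yields the given instants (seconds), one per call."""
    it = iter(times)
    return lambda: next(it)


def failing_fetch(url):
    raise TimeoutError("no response")


ok_report = measure("https://server-a.example/tiny.png", lambda url: None,
                    clock=fake_clock([0.0, 0.020]))
err_report = measure("https://server-b.example/tiny.png", failing_fetch)
```

Outside of a test, `fetch` could wrap a real HTTP(S) request made with whatever HTTP client the host application already uses.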
Slide21
Odin Design
Examples showed measurements to the application server
Want richer measurements
[Diagram labels: Odin, Mail Server B, Microsoft, Report Upload Endpoint, Offline Analysis, Online Alerting.]
Slide22
Odin Design
[Diagram labels: Odin, Orchestration Service, Microsoft, Report Upload Endpoint, Offline Analysis, Online Alerting. Callouts: Target URLs; Primary and backup report endpoints.]
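Assuming the Orchestration Service is what hands each client its target URLs and its primary and backup report endpoints, a client-side view of that configuration might look like the sketch below. The JSON layout, field names, and hostnames are invented for illustration; the slides do not describe the actual wire format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementConfig:
    target_urls: tuple
    primary_report_endpoint: str
    backup_report_endpoints: tuple


def parse_config(payload):
    """Parse a (hypothetical) JSON configuration document into a MeasurementConfig."""
    doc = json.loads(payload)
    return MeasurementConfig(
        target_urls=tuple(doc["targets"]),
        primary_report_endpoint=doc["report"]["primary"],
        backup_report_endpoints=tuple(doc["report"].get("backups", [])),
    )


# Made-up payload standing in for an orchestration response.
payload = json.dumps({
    "targets": ["https://m1.example/tiny.png", "https://m3.example/tiny.png"],
    "report": {
        "primary": "https://report.example/upload",
        "backups": ["https://report-backup.example/upload"],
    },
})
config = parse_config(payload)
```

Keeping targets and report endpoints in server-supplied configuration means measurements can be retargeted without shipping a new client.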
Slide23
Odin Design
[Diagram: numbered steps 1 to 3 among Odin, the Orchestration Service, Server B, and the Report Endpoint, spanning Microsoft U.S. and Microsoft Europe. Labels: m3.contoso.com, GET tiny.png, 20ms, "m1.contoso.com: 20ms". Other labels: Offline Analysis, Online Alerting.]
Slide24
Odin Design: Fault tolerance
Need to receive measurements even if Microsoft’s network is unavailable
[Diagram: Odin issues GET tiny.gif to Server B and obtains "B: ERROR". Other labels: Orchestration Service, Microsoft, Report Endpoint, Offline Analysis, Online Alerting.]
Slide25
Odin Design: Fault tolerance
[Diagram: Odin issues GET tiny.gif to Server B and obtains "B: ERROR". A Report Proxy in a 3rd Party Network is added alongside the Report Endpoint. Other labels: Orchestration Service, Microsoft, Offline Analysis, Online Alerting.]
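The fault-tolerance idea, getting reports out even when the primary path into Microsoft's network is down, can be sketched as trying report endpoints in order and falling back to a proxy on a third-party network. The `upload_report` helper, the endpoint URLs, and the injected `send` callable are assumptions for illustration only.

```python
def upload_report(report, endpoints, send):
    """Try each endpoint in order; return the first one that accepts the report.

    `send(endpoint, report)` is any callable that raises OSError (for example
    ConnectionError) when the endpoint is unreachable. Returns None if every
    endpoint fails.
    """
    for endpoint in endpoints:
        try:
            send(endpoint, report)
            return endpoint
        except OSError:
            continue
    return None


delivered = []


def fake_send(endpoint, report):
    """Stand-in transport: the primary endpoint is unreachable in this demo."""
    if "primary" in endpoint:
        raise ConnectionError("network unreachable")
    delivered.append((endpoint, report))


used = upload_report(
    {"B": "ERROR"},
    ["https://primary.example/report", "https://proxy.third-party.example/report"],
    fake_send,
)
```

Ordering the list as primary first and proxy last keeps normal traffic on the direct path and uses the third-party network only during an outage.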
Slide26
Summary: Odin enables Self-driving Networks
Coverage
  No better vantage points than your actual customers
Safety
  Don’t need to experiment/measure with prod traffic
  Ability to validate
Flexibility
  Supports enterprise network and privacy requirements
Fault tolerance
  Measurements available during outages
Slide27
Outline
Introduction
Path to Self-Driving Solutions
Data Collection
Odin Measurement Platform
Challenges and Future Directions
Slide28
Challenges and Future Directions
Data Quality
  How do we build systems which avoid making bad decisions?
When do we get humans back in the loop?
  TE keeps fixing recurring problems
  May require change in service, additional capacity
Need for collaboration
  Issues impacting common resources, e.g. IXPs, transit, end-users
  Self-driving networks will route accordingly
  Want to help fix underlying issue
  Still need to email NOC
  Signals published by content providers
  Network operators subscribe
Slide29
Thanks!