
CS 4700 / CS 5700 - PowerPoint Presentation

Uploaded On 2016-06-04



Presentation Transcript

Slide1

CS 4700 / CS 5700: Network Fundamentals

Lecture 10: Inter Domain Routing (It’s all about the Money)

Revised 2/4/2014

Slide2

Network Layer, Control Plane

Function:
  Set up routes between networks
Key challenges:
  Implementing provider policies
  Creating stable paths

[Figure: protocol stack (Application, Presentation, Session, Transport, Network, Data Link, Physical). The Network layer is split into a Control Plane (BGP, RIP, OSPF) and a Data Plane.]

Slide3

Outline

BGP Basics
Stable Paths Problem
BGP in the Real World
Debugging BGP Path Problems

Slide4

ASs, Revisited

[Figure: three autonomous systems (AS-1, AS-2, AS-3), with labels for Interior Routers and BGP Routers.]

Slide5

AS Numbers

Each AS is identified by an AS number (ASN)
  16-bit values (latest protocol supports 32-bit ones)
  64512 – 65535 are reserved
Currently, there are > 20000 ASNs
  AT&T: 5074, 6341, 7018, …
  Sprint: 1239, 1240, 6211, 6242, …
  Northeastern: 156
North America ASs: ftp://ftp.arin.net/info/asn.txt

Slide6

Inter-Domain Routing

Global connectivity is at stake!
  Thus, all ASs must use the same protocol
  Contrast with intra-domain routing
What are the requirements?
  Scalability
  Flexibility in choosing routes
  Cost
  Routing around failures
Question: link state or distance vector?
  Trick question: BGP is a path vector protocol

Slide7

BGP

Border Gateway Protocol
  De facto inter-domain protocol of the Internet
  Policy-based routing protocol
  Uses a Bellman-Ford path vector protocol
Relatively simple protocol, but…
  Complex, manual configuration
  Entire world sees advertisements
    Errors can screw up traffic globally
  Policies driven by economics
    How much $$$ does it cost to route along a given path?
    Not by performance (e.g. shortest paths)

Slide8

BGP Relationships

[Figure: a Provider with Customers beneath it, and three peers (Peer 1, Peer 2, Peer 3) side by side.]

Customer pays provider
Peers do not pay each other
Peer 2 has no incentive to route 1 → 3

Slide9

Tier-1 ISP Peering

[Figure: peering among Tier-1 ISPs: AT&T, Centurylink, XO Communications, Inteliquent, Verizon Business, Sprint, Level 3.]

Slide10

Slide11

Peering Wars

Peer:
  Reduce upstream costs
  Improve end-to-end performance
  May be the only way to connect to parts of the Internet
Don’t Peer:
  You would rather have customers
  Peers are often competitors
  Peering agreements require periodic renegotiation

Peering struggles in the ISP world are extremely contentious; agreements are usually confidential

Slide12

Two Types of BGP Neighbors

[Figure: eBGP sessions between ASs; iBGP sessions and an IGP inside each AS.]

Exterior routers also speak IGP

Slide13

Full iBGP Meshes

Question: why do we need iBGP?
  OSPF does not include BGP policy info
  Prevents routing loops within the AS
iBGP updates do not trigger announcements

[Figure: eBGP and iBGP sessions.]

Slide14

Path Vector Protocol

AS-path: sequence of ASs a route traverses
  Like distance vector, plus additional information
Used for loop detection and to apply policy
Default choice: route with fewest # of ASs
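
The two uses of the AS-path named above, loop detection and the default fewest-ASs choice, can be sketched in a few lines. This is only an illustration: the function names are made up for this example and the AS numbers are taken from the figure on this slide, not from any BGP implementation.

```python
def accept_route(my_asn, as_path):
    """Loop detection: reject any route whose AS-path already contains us."""
    return my_asn not in as_path


def advertise(my_asn, as_path):
    """Prepend our own ASN before passing the route on to a neighbor."""
    return [my_asn] + list(as_path)


def default_choice(candidate_paths):
    """Default choice: the candidate with the fewest ASs (ties: first seen)."""
    return min(candidate_paths, key=len)


# AS 2 learned 120.10.0.0/16 via AS 3 -> AS 4 and re-advertises it.
path_from_as2 = advertise(2, [3, 4])

print(path_from_as2)                    # AS-path as a neighbor of AS 2 sees it
print(accept_route(1, path_from_as2))   # AS 1 is not on the path yet
print(accept_route(3, path_from_as2))   # AS 3 is on the path: a loop
print(default_choice([[2, 3, 4], [5, 6, 7, 4]]))
```

Real routers apply policy before falling back to path length, so `default_choice` only models the behavior when no policy overrides it.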

[Figure: AS 1 through AS 5, with prefixes 110.10.0.0/16, 120.10.0.0/16 and 130.10.0.0/16.]

120.10.0.0/16: AS 2 → AS 3 → AS 4
130.10.0.0/16: AS 2 → AS 3
110.10.0.0/16: AS 2 → AS 5

Slide15

BGP Operations (Simplified)

Establish session on TCP port 179
Exchange active routes
Exchange incremental updates

[Figure: a BGP session between AS-1 and AS-2.]

Slide16

Four Types of BGP Messages

Open: Establish a peering session.
Keep Alive: Handshake at regular intervals.
Notification: Shuts down a peering session.
Update: Announce new routes or withdraw previously announced routes.

announcement = IP prefix + attribute values

Slide17

CS 4700 / CS 5700

ECON 4700/5700: Network Fundamentals

Lecture 10: Inter Domain Routing

(It’s all about the Money)

Slide18

BGP Attributes

Attributes used to select “best” path
LocalPref
  Local preference policy to choose most preferred route
  Overrides default fewest AS behavior
Multi-exit Discriminator (MED)
  Specifies path for external traffic destined for an internal network
  Chooses peering point for your network
Import Rules
  What route advertisements do I accept?
Export Rules
  Which routes do I forward to whom?

Slide19

Route Selection Summary

In order of priority:
Highest Local Preference (enforce relationships)
Shortest AS Path (traffic engineering)
Lowest MED (traffic engineering)
Lowest IGP Cost to BGP Egress (traffic engineering)
Lowest Router ID (when all else fails, break ties)
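
The ordering on this slide can be expressed as a single sort key. The sketch below is a simplification under stated assumptions: the `Route` class and its field names are invented for illustration, the sample values are arbitrary, and real BGP implementations apply additional steps and only compare MED under certain conditions.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Route:
    local_pref: int    # higher is better
    as_path: tuple     # shorter is better
    med: int           # lower is better
    igp_cost: int      # lower is better (cost to the BGP egress)
    router_id: str     # lower is better (final tie-break)


def selection_key(route):
    """Tuple that mirrors the five steps in order; the smallest key wins."""
    return (
        -route.local_pref,                  # highest LocalPref first
        len(route.as_path),                 # then shortest AS path
        route.med,                          # then lowest MED
        route.igp_cost,                     # then lowest IGP cost
        int(IPv4Address(route.router_id)),  # then lowest router ID
    )


def best_route(routes):
    return min(routes, key=selection_key)


candidates = [
    Route(100, (7018, 156), 0, 10, "10.0.0.2"),
    Route(200, (1239, 6211, 156), 0, 50, "10.0.0.9"),
    Route(200, (1239, 6211, 156), 0, 20, "10.0.0.5"),
]
print(best_route(candidates).router_id)
```

Note that the first candidate has the shortest AS path but loses on LocalPref, which is the point of the slide: policy is evaluated before path length.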

Slide20

Shortest AS Path != Shortest Path

[Figure: two candidate routes from Source to Destination. One is 4 hops across 4 ASs; the other is 9 hops across 2 ASs.]

Slide21

Hot Potato Routing

[Figure: two candidate routes from Source to Destination, labeled “3 hops total, 3 hops cost” and “5 hops total, 2 hops cost”.]

Slide22

Importing Routes

[Figure: ISP routes imported from a provider, from two peers, and from a customer.]

Slide23

Exporting Routes

To Customer: customers get all routes
To Peer, To Provider: customer and ISP routes only

$$$ generating routes

Slide24

Modeling BGP

AS relationships
  Customer/provider
  Peer
  Sibling, IXP
Gao-Rexford model
  AS prefers to use customer path, then peer, then provider
  Follow the money!
  Valley-free routing
  Hierarchical view of routing (incorrect but frequently used)
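
A small sketch of the two Gao-Rexford ideas listed above: the customer, peer, provider preference order, and the valley-free property of a path. The relationship labels (`c2p`, `p2p`, `p2c`) and function names are assumptions chosen for this example.

```python
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}


def preferred(routes):
    """routes: list of (learned_from, as_path). Prefer customer routes,
    then peer, then provider; break ties on AS-path length."""
    return min(routes, key=lambda r: (PREFERENCE[r[0]], len(r[1])))


def is_valley_free(links):
    """links: relationship of each hop along the path, in order:
    'c2p' (customer to provider), 'p2p' (peer to peer), 'p2c'.
    Valley-free means: zero or more c2p hops, then at most one p2p hop,
    then zero or more p2c hops."""
    phase = 0  # 0 = climbing, 1 = crossed a peer link, 2 = descending
    for link in links:
        if link == "c2p":
            if phase != 0:
                return False
        elif link == "p2p":
            if phase != 0:
                return False
            phase = 1
        elif link == "p2c":
            phase = 2
        else:
            raise ValueError(f"unknown relationship: {link}")
    return True


print(is_valley_free(["c2p", "p2p", "p2c"]))   # up, across, down
print(is_valley_free(["p2c", "c2p"]))          # down then up: a valley
print(preferred([("provider", [1]), ("peer", [2, 3]), ("customer", [4, 5, 6])]))
```

The last call picks the customer route even though it has the longest AS path, which is the "follow the money" rule in action.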

[Figure: AS-level hierarchy with links labeled C-P, P-P and P-C.]

Slide25

AS Relationships: It’s Complicated

GR Model is strictly hierarchical
  Each AS pair has exactly one relationship
  Each relationship is the same for all prefixes
In practice it’s much more complicated
  Rise of widespread peering
  Regional, per-prefix peerings
  Tier-1’s being shoved out by “hypergiants”
  IXPs dominating traffic volume
Modeling is very hard, very prone to error
  Huge potential impact for understanding Internet behavior

Slide26

Other BGP Attributes

AS_SET
  Instead of a single AS appearing at a slot, it’s a set of ASes
  Why?
Communities
  Arbitrary number that is used by neighbors for routing decisions
    Export this route only in Europe
    Do not export to your peers
  Usually stripped after first interdomain hop
  Why?
Prepending
  Lengthening the route by adding multiple instances of ASN
  Why?

Slide27

Outline

BGP Basics
Stable Paths Problem
BGP in the Real World
Debugging BGP Path Problems

Slide28

What Problem is BGP Solving?

Underlying Problem    Distributed Solution
Shortest Paths        RIP, OSPF, IS-IS, etc.
???                   BGP

Knowing ??? can:
  Aid in the analysis of BGP policy
  Aid in the design of BGP extensions
  Help explain BGP routing anomalies
  Give us a deeper understanding of the protocol

Slide29

The Stable Paths Problem

An instance of the SPP:
  Graph of nodes and edges
  Node 0, called the origin
  A set of permitted paths from each node to the origin
    Each set contains the null path
  Each set of paths is ranked
    Null path is always least preferred

[Figure: graph with origin 0 and nodes 1–5. Permitted paths as listed on the slide:
  node 1: 1 3 0, 1 0
  node 2: 2 1 0, 2 0
  node 3: 3 0
  node 4: 4 2 0, 4 3 0
  node 5: 5 2 1 0]

Slide30

A solution is an assignment of permitted paths to each node such that:
  Node u’s path is either null or uwP, where path wP is assigned to node w and edge (u, w) exists
  Each node is assigned the highest ranked path that is consistent with its neighbors
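
The two conditions above translate directly into a checker. The sketch below uses the permitted paths from the example graph on the Stable Paths Problem slide and assumes each list is ordered most preferred first; the function names are illustrative only.

```python
NULL = ()  # the null path


def best_consistent(node, permitted, assignment):
    """Highest-ranked permitted path u w P such that w P is assigned to w."""
    for path in permitted[node]:              # assumed: most preferred first
        next_hop = path[1]
        if assignment[next_hop] == path[1:]:
            return path
    return NULL


def is_solution(permitted, assignment):
    """True if every node holds its best path consistent with its neighbors."""
    return all(
        assignment[node] == best_consistent(node, permitted, assignment)
        for node in permitted
    )


permitted = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 0)],
    4: [(4, 2, 0), (4, 3, 0)],
    5: [(5, 2, 1, 0)],
}

stable = {0: (0,), 1: (1, 3, 0), 2: (2, 0), 3: (3, 0), 4: (4, 2, 0), 5: NULL}
shortest = {0: (0,), 1: (1, 0), 2: (2, 0), 3: (3, 0), 4: (4, 2, 0), 5: NULL}

print(is_solution(permitted, stable))     # True
print(is_solution(permitted, shortest))   # False: node 1 prefers 1 3 0
```

In the stable assignment node 1 uses a longer path than necessary and node 5 ends up with the null path, so a solution need not be made of shortest paths or reach every node.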

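A Solution to the SPP

[Figure: graph with origin 0 and nodes 1–5. Permitted paths as listed on the slide:
  node 1: 1 3 0, 1 0
  node 2: 2 1 0, 2 0
  node 3: 3 0
  node 4: 4 2 0, 4 3 0
  node 5: 5 2 1 0]

Solutions need not use the shortest paths, or form a spanning tree

Slide31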

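Simple SPP Example

[Figure: graph with origin 0 and nodes 1–4. Permitted paths as listed on the slide:
  node 1: 1 0, 1 3 0
  node 2: 2 0, 2 1 0
  node 3: 3 0
  node 4: 4 2 0, 4 3 0]

Each node gets its preferred route
Totally stable topology

Slide32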

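Good Gadget

[Figure: graph with origin 0 and nodes 1–4. Permitted paths as listed on the slide:
  node 1: 1 3 0, 1 0
  node 2: 2 1 0, 2 0
  node 3: 3 0
  node 4: 4 3 0, 4 2 0]

Not every node gets preferred route
Topology is still stable
Only one stable configuration
  No matter which router chooses first!

Slide33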

SPP May Have Multiple Solutions

[Figure: graph with origin 0 and nodes 1 and 2, shown three times. Permitted paths as listed on the slide:
  node 1: 1 2 0, 1 0
  node 2: 2 1 0, 2 0]

Slide34

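Bad Gadget

[Figure: graph with origin 0 and nodes 1–4. Permitted paths as listed on the slide:
  node 1: 1 3 0, 1 0
  node 2: 2 1 0, 2 0
  node 3: 3 4 2 0, 3 0
  node 4: 4 2 0, 4 3 0]

That was only one round of oscillation!
This keeps going, infinitely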
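
The oscillation can be reproduced with a small simulation. This sketch assumes the permitted paths listed for the Bad Gadget and Good Gadget figures are ranked most preferred first, and it activates nodes one at a time in numeric order; other activation orders are possible and the function names are illustrative.

```python
def best_path(node, permitted, assignment):
    """Highest-ranked permitted path consistent with the next hop's choice."""
    for path in permitted[node]:              # assumed: most preferred first
        if assignment[path[1]] == path[1:]:
            return path
    return ()                                 # null path


def simulate(permitted, max_rounds=100):
    """Let every node re-pick its best path, round after round.
    Returns ("stable", assignment), ("oscillates", cycle_length),
    or ("undecided", None) if max_rounds is exhausted."""
    assignment = {0: (0,)}
    assignment.update({node: () for node in permitted})
    seen = {}
    for round_no in range(1, max_rounds + 1):
        changed = False
        for node in sorted(permitted):
            choice = best_path(node, permitted, assignment)
            if choice != assignment[node]:
                assignment[node] = choice
                changed = True
        if not changed:
            return "stable", assignment
        state = tuple(sorted(assignment.items()))
        if state in seen:                     # same state again: a cycle
            return "oscillates", round_no - seen[state]
        seen[state] = round_no
    return "undecided", None


bad_gadget = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 4, 2, 0), (3, 0)],
    4: [(4, 2, 0), (4, 3, 0)],
}
good_gadget = dict(bad_gadget)
good_gadget[3] = [(3, 0)]
good_gadget[4] = [(4, 3, 0), (4, 2, 0)]

print(simulate(bad_gadget)[0])
print(simulate(good_gadget)[0])
```

Because each node acts only on its neighbors' current choices, a repeated global state under a fixed activation order means the same sequence of changes will repeat indefinitely.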

Problem stems from:
  Local (not global) decisions
  Ability of one node to improve its path selection

Slide35

SPP Explains BGP Divergence

BGP is not guaranteed to converge to stable routing
  Policy inconsistencies may lead to “livelock”
  Protocol oscillation

[Figure: diagram with regions labeled Solvable, Must Converge, Can Diverge and Must Diverge, and with Good Gadgets, Bad Gadgets and Naughty Gadgets marked.]

Slide36

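Beware of Backup Policies

[Figure: graph with origin 0 and nodes 1–4. Permitted paths as listed on the slide:
  node 1: 1 3 0, 1 0
  node 2: 2 1 0, 2 0
  node 3: 3 4 2 0, 3 0
  node 4: 4 0, 4 2 0, 4 3 0]

BGP is not robust
  It may not recover from link failure

Slide37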

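BGP is Precarious

[Figure: graph with origin 0 and nodes 1–6. Permitted paths as listed on the slide:
  node 1: 1 2 0, 1 0
  node 2: 2 1 0, 2 0
  node 3: 3 1 0, 3 1 2 0
  node 4: 4 3 1 0, 4 5 3 1 2 0, 4 3 1 2 0
  node 5: 5 3 1 0, 5 6 3 1 2 0, 5 3 1 2 0
  node 6: 6 3 1 0, 6 4 3 1 2 0, 6 3 1 2 0]

If node 1 uses path 1 → 0, this is solvable
No longer stable

Slide38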

Can BGP Be Fixed?

Unfortunately, SPP is NP-complete

Possible Solutions

Static Approach
  Inter-AS coordination
  Automated Analysis of Routing Policies (This is very hard)
Dynamic Approach
  Extend BGP to detect and suppress policy-based oscillations?

These approaches are complementary

Slide39

Outline

BGP Basics
Stable Paths Problem
BGP in the Real World
Debugging BGP Path Problems

Slide40

Motivation

Routing reliability/fault-tolerance on small time scales (minutes) not previously a priority
Transaction-oriented and interactive applications (e.g. Internet Telephony) will require higher levels of end-to-end network reliability
How well does the Internet routing infrastructure tolerate faults?

Slide41

Conventional Wisdom

Internet routing is robust under faults
  Supports path re-routing
  Path restoration on the order of seconds
BGP has good convergence properties
  Does not exhibit looping/bouncing problems of RIP
Internet fail-over will improve with faster routers and faster links
More redundant connections (multi-homing) will always improve fault-tolerance

Slide42

Delayed Routing Convergence

Conventional wisdom about routing convergence is not accurate
  Measurement of BGP convergence in the Internet
  Analysis/intuition behind delayed BGP routing convergence
  Modifications to BGP implementations which would improve convergence times

Slide43

Open Question

After a fault in a path to a multi-homed site, how long does it take for the majority of Internet routers to fail over to the secondary path?

[Figure: a Customer connected to a Primary ISP and a Backup ISP. Labels: Route Withdrawn, Traffic, Routing table convergence, Stable end-to-end paths.]

Slide44

Bad News

With unconstrained policies:
  Divergence
  Possible to create unsatisfiable policies
  NP-complete to identify these policies
  Happening today?
With constrained policies (e.g. shortest path first)
  Transient oscillations
  BGP usually converges
  It may take a very long time…
BGP Beacons: focuses on constrained policies

Slide45

16 Month Study of Convergence

Instrument the Internet
  Inject BGP faults (announcements/withdrawals) of varied prefix and AS path length into topologically and geographically diverse ISP peering sessions
  Monitor the impact of faults through
    Recording BGP peering sessions with 20 tier1/tier2 ISPs
    Active ICMP measurements (512 byte/second to 100 random web sites)
Wait two years (and 250,000 faults)

Slide46

Measurement Architecture

[Figure: measurement setup with two points labeled “Researchers pretending to be an AS”.]

Slide47

Announcement Scenarios

Tup – A new route is advertised
Tdown – A route is withdrawn
  i.e. single-homed failure
Tshort – Advertise a shorter/better AS path
  i.e. primary path repaired
Tlong – Advertise a longer/worse AS path
  i.e. primary path fails

Slide48

Major Convergence Results

Routing convergence requires an order of magnitude longer than expected
  10s of minutes
Routes converge more quickly following Tup/Repair than Tdown/Failure events
  Bad news travels more slowly
Withdrawals (Tdown) generate several more announcements than new routes (Tup)

Slide49

Example

BGP log of updates from AS2117 for route via AS2129
One withdrawal triggers 6 announcements and one withdrawal from 2117
Increasing AS path length until final withdrawal

Slide50

Why So Many Announcements?

Events from AS 2117:
  Route Fails: AS 2129
  Announce: 5696 2129
  Announce: 1 5696 2129
  Announce: 2041 3508 2129
  Announce: 1 2041 3508 2129
  Route Withdrawn: 2129

[Figure: topology of AS 2129, AS 5696, AS 1, AS 2117, AS 2041 and AS 3508.]

Slide51

How Many Announcements Does it Take For an AS to Withdraw a Route?

Answer: up to 19

Slide52

BGP Routing Table Convergence Times

[Figure: convergence time distributions for four event types: New Route, Failure, Short->Long Fail-Over, Long->Short Fail-Over.]

Less than half of Tdown events converge within two minutes
Tup/Tshort and Tdown/Tlong form equivalence classes
Long tailed distribution (up to 15 minutes)

Slide53

Failures, Fail-overs and Repairs

Bad news does not travel fast…
Repairs (Tup) exhibit similar convergence as long-short AS path fail-over
Failures (Tdown) and short-long fail-overs (e.g. primary to secondary path) also similar
  Slower than Tup (e.g. a repair)
  80% take longer than two minutes
  Fail-over times degrade the greater the degree of multi-homing

Slide54

Intuition for Delayed Convergence

There exists a possible ordering of messages such that BGP will explore ALL possible AS paths of ALL possible lengths
BGP is O(N!), where N is the number of default-free BGP routers in a complete graph with default policy

Slide55

Impact of Delayed Convergence

Why do we care about routing table convergence?
  It impacts end-to-end connectivity for Internet paths
ICMP experiment results
  Loss of connectivity, packet loss, latency, and packet re-ordering for an average of 3-5 minutes after a fault
Why?
  Routers drop packets when next hop is unknown
  Path switching spikes latency/delay
  Multi-pathing causes reordering

Slide56

In real life …

Discussed worst case BGP behavior
In practice, BGP policy prevents worst case from happening
BGP timers also provide synchronization and limit possible orderings of messages

Slide57

Outline

BGP Basics
Stable Paths Problem
BGP in the Real World
Debugging BGP Path Problems

Slide58

Control plane vs. Data Plane

Control:
  Make sure that if there’s a path available, data is forwarded over it
  BGP sets up such paths at the AS-level
Data:
  For a destination, send packet to most-preferred next hop
  Routers forward data along IP paths
How does the control plane know if a data path is broken?
  Direct-neighbor connectivity
  What if the outage isn’t in the direct neighbor?

Slide59

Why Network Reliability Remains Hard

Visibility
  IP provides no built-in monitoring
  Economic disincentives to share information publicly
Control
  Routing protocols optimize for policy, not reliability
  Outage affecting your traffic may be caused by distant network
Detecting, isolating and repairing network problems for Internet paths remains largely a slow, manual process

Slide60

Improving Internet Availability

New Internet design
  Monitoring everywhere in the network
  Visibility into all available routes
  Any operator can impact routes affecting her traffic
Challenges
  What should we monitor?
  What do we do with additional visibility?
  How to use additional control?

Slide61

A Practical Approach

We can do this already in today’s Internet
  Crowdsourcing monitoring
  Use existing protocols/systems in unintended ways
Allows us to address problems today
Also informs future Internet designs

Slide62

Operators Struggle to Locate Failures

Mailing List User 1
1 Home router
2 Verizon in Baltimore
3 Verizon in Philly
4 Alter.net in DC
5 Level3 in DC
6 * * *
7 * * *

Mailing List User 2
1 Home router
2 Verizon in DC
3 Alter.net in DC
4 Level3 in DC
5 Level3 in Chicago
6 Level3 in Denver
7 * * *
8 * * *

“Traffic attempting to pass through Level3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level3.” Outages mailing list, Dec. 2010

Slide63

Reasons for Long-Lasting Outages

Long-term outages are:
  Repaired over slow, human timescales
  Not well understood
  Caused by routers advertising paths that do not work
    E.g., corrupted memory on line card causes black hole
    E.g., bad cross-layer interactions cause failed MPLS tunnel

Slide64

Key Challenges for Internet Repair

Lack of visibility
  Where is the outage?
  Which networks are (un)affected?
  Who caused the outage?
Lack of control
  Reverse paths determined by possibly distant ASes
  Limited means to affect such paths

Slide65

Goals and Approach

Improve availability through:
  Failure isolation and remediation
  Identifying the AS(es) responsible for path changes
Key techniques:
  Visibility
    Active measurements from distributed vantage points
    Passive collection of BGP feeds
  Control
    On-demand BGP prepending to route around outages
    Active BGP measurements to identify alternative paths

Slide66

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically

Locate the ISP / link causing the problem
  Building blocks
  Example
  Description of technique
Suggest that other ISPs reroute around the problem

Slide67

Building blocks for failure isolation

LIFEGUARD can use:
  Ping to test reachability
  Traceroute to measure forward path
  Distributed vantage points (VPs)
    PlanetLab for our experiments
    Some can source spoof
  Reverse traceroute to measure reverse path (NSDI ’10)
    I’ll teach you about this during the security lecture
  Atlas of historical forward/reverse paths between VPs and targets

Slide68

How does LIFEGUARD locate a failure?

Before outage:
  Historical atlas enables reasoning about changes
  Traceroute yields only path from GMU to target
  Reverse traceroute reveals path asymmetry

[Figure: Historical and Current paths.]

Slide69

How does LIFEGUARD locate a failure?

During outage:
  Forward path works
  Problem with ZSTTK?

[Figure: Historical and Current paths, with probe labels “Ping? Fr: VP” and “Ping! To: VP”.]

Slide70

How does LIFEGUARD locate a failure?

During outage:
  Forward path works

[Figure: Historical and Current paths, with probe labels “NTT: Ping? Fr: GMU” and “GMU: Ping! Fr: NTT”.]

Slide71

How does LIFEGUARD locate a failure?

During outage:
  Forward path works
  Rostelcom is not forwarding traffic towards GMU

[Figure: Historical and Current paths, with probe label “Rostele: Ping? Fr: GMU”.]

Slide72

How LIFEGUARD Locates Failures

LIFEGUARD:
  Maintains background historical atlas
  Isolates direction of failure, measures working direction
  Tests historical paths in failing direction in order to prune candidate failure locations
  Locates failure as being at the horizon of reachability
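
The last step, finding the horizon of reachability, can be illustrated with a toy function. This is not LIFEGUARD's code: the function name, the `responds` callback, and the hop ordering are assumptions made for the sketch, and the hop names are simply borrowed from the earlier slides.

```python
def locate_horizon(historical_hops, responds):
    """historical_hops: hops of a historical path in the failing direction,
    ordered from the vantage point outward (an assumption of this sketch).
    responds: callable(hop) -> bool, e.g. backed by ping measurements.
    Returns (last reachable hop, first unreachable hop)."""
    last_ok = None
    for hop in historical_hops:
        if not responds(hop):
            return last_ok, hop      # the failure lies at this boundary
        last_ok = hop
    return last_ok, None             # every hop answered: no horizon found


# Hypothetical measurement result: only NTT still answers the vantage point.
still_reachable = {"NTT"}
hops = ["NTT", "Rostelcom", "ZSTTK"]

print(locate_horizon(hops, lambda hop: hop in still_reachable))
```

The returned pair brackets the suspected failure: the last hop that still answers and the first hop that does not.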

Slide73

Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically

Locate the ISP / link causing the problem
Suggest that other ISPs reroute around the problem
  What would we like to add to BGP to enable this?
  What can we deploy today, using only available protocols and router support?

Slide74

Our Goal for Failure Avoidance

Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them

Setting
  Assume we can locate problem
  Assume we are multi-homed / have multiple data centers
  Assume we speak BGP
We use TransitPortal to speak BGP to the real Internet:
  5 US universities as providers

Slide75

Self-Repair of Forward Paths

Slide76

A Mechanism for Failure Avoidance

Forward path: Choose route that avoids ISP or ISP-ISP link
Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X
Want a BGP announcement AVOID(X,P):
  Any ISP with a route to P that avoids X uses such a route
  Any ISP not using X need only pass on the announcement

Slide77

Ideal Self-Repair of Reverse Paths

[Figure: the announcement AVOID(L3,WS) shown at three points in the network.]

Slide78

Do paths exist that AVOID problem?

LIFEGUARD repairs outages by instructing others to avoid particular routes.
Q: Do alternative routes exist?
A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X,P) announcements.
  Simulated 10 million AVOIDs on actual measured routes.

Slide79

Practical Self-Repair of Reverse Paths

[Figure: networks WS, ATT, UW, L3, Sprint, AISP and Qwest. AS paths to WS as listed on the slide:
  ATT → WS
  Sprint → Qwest → WS
  AISP → Qwest → WS
  L3 → ATT → WS
  Qwest → WS]

Slide80

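[Figure: the same networks, with WS announcing the path WS → L3 → WS in place of AVOID(L3,WS). AS paths as listed on the slide:
  Before: ATT → WS; Sprint → Qwest → WS; AISP → Qwest → WS; Qwest → WS; L3: ?
  After: WS → L3 → WS; ATT → WS → L3 → WS; Qwest → WS → L3 → WS; Sprint → Qwest → WS → L3 → WS; AISP → Qwest → WS → L3 → WS; UW → Sprint → Qwest → WS → L3 → WS]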

BGP loop prevention encourages switch to working path.
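
A minimal sketch of why the poisoned announcement works: the origin places the AS to avoid inside its own AS-path, and ordinary loop prevention makes that AS drop the route. The function names are illustrative; the AS labels are the ones used on the slide.

```python
def accepts(asn, as_path):
    """Standard BGP loop prevention: drop a route whose AS-path contains us."""
    return asn not in as_path


def poisoned_announcement(origin, avoid):
    """The origin sandwiches the AS to avoid between two copies of itself,
    so the path still starts and ends at the origin."""
    return [origin, avoid, origin]


path = poisoned_announcement("WS", "L3")

for asn in ["ATT", "Qwest", "L3"]:
    if accepts(asn, path):
        # The neighbor prepends itself and keeps propagating the route.
        print(asn, "uses:", " → ".join([asn] + path))
    else:
        print(asn, "rejects the route (sees itself in the AS-path)")
```

Every AS other than L3 still accepts the route, so networks that previously reached WS through L3 have to select a path that does not traverse it.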

Practical Self-Repair of Reverse Paths

Slide81

Other results

Results from real poisonings
  Poisoning in the wild / poisoning anomalies
  Case study of restoring connectivity
Making poisoning flexible
  Monitoring broken path while it is disabled
  Allowing ISPs w/o alternatives to use disabled route
LIFEGUARD’s scalability
  Overhead and speed of failure location
  Router update load if many ISPs deploy our approach
Alternatives to poisoning
  Compatibility with secure routing (BGPSEC, etc.)
  Comparing to other route control mechanisms

Slide82

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
Q: Does poisoning disrupt working routes?
A: No. As I will describe:
  Under certain circumstances, we can disable a link without disabling the full ISP.
  We can speed BGP convergence by carefully crafting announcements.

Slide83

What if some routes in an ISP still work?

We only want C3 to change its route, to avoid A-B2
Forward direction is easy: choose a different route
Poisoning seems blunt, disabling an entire ISP
Selective advertising via just D1 is also blunt
If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1

Slide94

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.
Q: Does poisoning disrupt working routes?
A: No. As I will describe:
  “Selective poisoning” can avoid 73% of links without disabling entire AS.
    Real-world results from 5 provider BGP-Mux testbed
  We can speed BGP convergence by carefully crafting announcements.

Slide95

Naive Poisoning Causes Transient Loss

Some ISPs may have working paths that avoid problem ISP X
Naively, poisoning causes path exploration even for these ISPs
Path exploration causes transient loss

[Figure: the announcement AVOID(X,P).]

Slide103

Prepend to Reduce Path Exploration

Most routing decisions based on:
  (1) next hop ISP
  (2) path length
Keep these fixed to speed convergence
Prepending prepares ISPs for later poison
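
One way to keep both quantities fixed is to announce a prepended baseline of the same length as the later poisoned path. The sketch below assumes a baseline of three copies of the origin ASN, which is an illustrative choice and is not stated on the slide; all names are hypothetical.

```python
def baseline(origin, prepend=True):
    """Steady-state announcement: the origin repeated three times if
    prepending is used, otherwise just the origin."""
    return [origin] * 3 if prepend else [origin]


def poison(origin, avoid):
    """Poisoned announcement: same endpoints, problem AS in the middle."""
    return [origin, avoid, origin]


def looks_unchanged(before, after):
    """An unaffected ISP sees the same first AS and the same path length."""
    return before[0] == after[0] and len(before) == len(after)


print(looks_unchanged(baseline("WS"), poison("WS", "X")))
print(looks_unchanged(baseline("WS", prepend=False), poison("WS", "X")))
```

With the prepended baseline, switching to the poisoned path changes only the middle entry, so ISPs that never used X have no reason to re-run path selection. Without it, the path length jumps from 1 to 3 and can trigger exploration everywhere.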

[Figure: the announcement AVOID(X,P).]

Slide108

Prepending Speeds Convergence

With no prepend, only 65% of unaffected ISPs converge instantly
With prepending, 95% of unaffected ISPs re-converge instantly, 98% within 1/2 min.
Also speeds convergence to new paths for affected peers

Slide109

LIFEGUARD Summary

We increasingly depend on the Internet, but availability lags
Much of Internet unavailability due to long-lasting outages
LIFEGUARD: Let edge networks reroute around failures
Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
  Use reverse traceroute, isolate directions, use historical view
Avoidance challenge: Reroute without participation of transit networks
  BGP poisoning gives control to the destination
  Well-crafted announcements ease concerns

Slide110

Inter-Domain Routing Summary

BGP4 is the only inter-domain routing protocol currently in use world-wide
Issues?
  Lack of security
  Ease of misconfiguration
  Poorly understood interaction between local policies
  Poor convergence
  Lack of appropriate information hiding
  Non-determinism
  Poor overload behavior

Slide111

Lots of research into how to fix this

Security
  BGPSEC, RPKI
Misconfigurations, inflexible policy
  SDN
Policy Interactions
  PoiRoot (root cause analysis)
Convergence
  Consensus Routing
Inconsistent behavior
  LIFEGUARD, among others

Slide112

Why are these still issues?

Backward compatibility
Buy-in / incentives for operators
Stubbornness

Very similar issues to IPv6 deployment