F10: A Fault-Tolerant Engineered Network

Uploaded On 2015-09-23



Presentation Transcript

Slide1

F10: A Fault-Tolerant Engineered Network

Vincent Liu, Daniel Halperin, Arvind Krishnamurthy, Thomas Anderson

University of Washington

Slide2

Today’s Data Centers

Today’s data centers are built using multi-rooted trees

Commodity switches for cost, bisection bandwidth, and resilience to failures

*From Al-Fares et al., SIGCOMM ’08

Slide3

FatTree Example: PortLand

Heartbeats to detect failures

Centralized controller installs updated routes

Exploits path redundancy

Slide4

Unsolved Issues with FatTrees

Slow Detection

Commodity switches fail often

Not always sure they failed (gray/partial failures)

Slow Recovery

Failure recovery is not local

Topology does not support local reroutes

Suboptimal Flow Assignment

Failures result in an unbalanced tree

Loses load balancing properties

Slide5

F10

Co-design of topology, routing protocols and failure detector

Novel topology that enables local, fast recovery

Cascading protocols for optimal recovery

Fine-grained failure detector for fast detection

Same # of switches/links as FatTrees

Slide6

Outline

Motivation & Approach

Topology: AB FatTree

Cascaded Failover Protocols

Failure Detection

Evaluation

Conclusion

Slide7

Why is FatTree Recovery Slow?

Lots of redundancy on the upward path

Immediately restore connectivity at the point of failure

[Figure labels: dst, src]

Slide8

Why is FatTree Recovery Slow?

No redundancy on the way down

Alternatives are many hops away

[Figure labels: dst, src; legend: No direct path, Has alternate path]

Slide9

Type A Subtree

[Figure labels: x, y, 1, 2, 3, 4]

Consecutive Parents

Slide10

Type B Subtree

[Figure labels: x, y, 1, 2, 3, 4]

Strided Parents

Slide11

AB FatTree
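The two subtree wirings can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each aggregation switch has `p` parents among `p * p` core switches numbered from 0, and the names `parents` and `shared_parents` are hypothetical. Type A uses consecutive parents and Type B uses strided parents, as on the previous two slides.

```python
from typing import List, Set


def parents(index: int, p: int, subtree_type: str) -> List[int]:
    """Return the core-switch parents of aggregation switch `index`.

    Assumption for illustration: p parents per switch, cores numbered 0..p*p-1.
    """
    if not 0 <= index < p:
        raise ValueError("index must be in range(p)")
    if subtree_type == "A":
        # Consecutive parents: switch 0 -> cores 0..p-1, switch 1 -> cores p..2p-1
        return [index * p + i for i in range(p)]
    if subtree_type == "B":
        # Strided parents: switch 0 -> cores 0, p, 2p, ...; switch 1 -> 1, p+1, ...
        return [index + i * p for i in range(p)]
    raise ValueError("subtree_type must be 'A' or 'B'")


def shared_parents(a: int, b: int, p: int) -> Set[int]:
    """Cores connected to both Type A switch `a` and Type B switch `b`."""
    return set(parents(a, p, "A")) & set(parents(b, p, "B"))
```

Under these assumptions, every Type A switch shares exactly one core with every Type B switch, so a core's children in the two subtree types see different sets of parents.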

Slide12

Alternatives in AB FatTrees

More nodes have alternative, direct paths

One hop away from node with an alternative

[Figure labels: dst, src; legend: No direct path, Has alternate path]

Slide13

Cascaded Failover Protocols

A local rerouting mechanism

Immediate restoration

A pushback notification scheme

Restore direct paths

An epoch-based centralized scheduler

Globally re-optimizes traffic

[Figure labels: μs, ms, s]

Slide14

Local Rerouting

Route to a sibling in an opposite-type subtree

Immediate, local rerouting around the failure
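As a rough sketch of why the opposite-type sibling helps, the Python below reuses the consecutive/strided numbering from the Type A and Type B slides. It assumes `p` parents per switch and `p * p` cores numbered from 0, a destination in a Type A subtree, and a detour through a Type B sibling. Both helper names are hypothetical, and this is not code from F10.

```python
from typing import List


def type_a_child(core: int, p: int) -> int:
    """Aggregation switch in a Type A subtree that `core` connects to."""
    return core // p


def detour_cores(core: int, p: int) -> List[int]:
    """Cores reachable from `core` via its Type B child, excluding `core`.

    Assumed numbering (illustrative only): in a Type B subtree, core c
    hangs off aggregation switch c % p, whose parents are strided.
    """
    if not 0 <= core < p * p:
        raise ValueError("core must be in range(p * p)")
    sibling = core % p  # the Type B child of this core
    # Strided parents of that child: sibling, sibling + p, sibling + 2p, ...
    return [sibling + i * p for i in range(p) if sibling + i * p != core]
```

With `p = 2`, core 0 normally enters a Type A subtree through switch 0. If that link fails, `detour_cores(0, 2)` gives `[2]`, and `type_a_child(2, 2)` is 1, so the packet re-enters the same subtree through a different switch after one bounce through the sibling.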

[Figure labels: dst, u]

Slide15

Local Rerouting – Multiple Failures

Resilient to multiple failures (refer to the paper)

Increased load and path dilation

[Figure labels: dst, u]

Slide16

Pushback Notification

Detecting switch broadcasts notification

Restores direct paths, but not finished yet

[Figure labels: u; legend: No direct path, Has alternate path]

Slide17

Centralized Scheduler

Related to existing work (Hedera, MicroTE)

Gather traffic matrices

Place long-lived flows based on their size

Place shorter flows with weighted ECMP

Slide18

Outline

Motivation & Approach

Topology: AB FatTree

Cascaded Failover Protocols

Failure Detection

Evaluation

Conclusion

Slide19

Why are Today’s Detectors Slow?

Based on loss of multiple heartbeats

Detector is separated from failure

Slow because:

Congestion

Gray failures

Don’t want to waste too many resources

Slide20

F10 Failure Detector

Look at the link itself

Send traffic to physical neighbors when idle

Monitor incoming bit transitions and packets

Stop sending and reroute the very next packet

Can be fast because rerouting is cheap
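The detection rule on this slide can be sketched as a per-link monitor. This is a simplified illustration under stated assumptions: the class name `LinkMonitor`, the explicit timestamps, and the `idle_limit` threshold are all hypothetical, and a real detector would run in the switch rather than in Python.

```python
class LinkMonitor:
    """Declare a link dead when nothing has arrived for `idle_limit` seconds.

    Because neighbors send traffic even when idle, silence on the link is
    itself the failure signal. All names and thresholds are illustrative.
    """

    def __init__(self, idle_limit: float, now: float = 0.0) -> None:
        self.idle_limit = idle_limit
        self.last_seen = now
        self.alive = True

    def on_receive(self, now: float) -> None:
        # Any incoming packet (or bit transition) counts as evidence of life.
        self.last_seen = now

    def check(self, now: float) -> bool:
        # Mark the link dead once the silence exceeds the limit.
        # This sketch never revives a link that has been marked dead.
        if now - self.last_seen > self.idle_limit:
            self.alive = False
        return self.alive

    def next_hop(self, primary: str, backup: str, now: float) -> str:
        # Stop using a dead link: the very next packet takes the backup.
        return primary if self.check(now) else backup
```

For example, a monitor created with `idle_limit=1.0` that last heard from its neighbor at time 0.5 still returns the primary next hop at time 1.0 and switches to the backup at time 2.0.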

Slide21

Outline

Motivation & Approach

Topology: AB FatTree

Cascaded Failover Protocols

Failure Detection

Evaluation

Conclusion

Slide22

Evaluation

Can F10 reroute quickly?

Can F10 avoid congestion loss that results from failures?

How much does this affect application performance?

Slide23

Methodology

Testbed

Emulab w/ Click implementation

Used smaller packets to account for slower speed

Packet-level simulator

24-port 10GbE switches, 3 levels

Traffic model from Benson et al., IMC 2010

Failure model from Gill et al., SIGCOMM 2011

Validated using testbed

Slide24

F10 Can Reroute Quickly

F10 can recover from failures in under a millisecond

Much less time than a TCP timeout

Slide25

F10 Can Avoid Congestion Loss

PortLand has 7.6x the congestion loss of F10 under realistic traffic and failure conditions

Slide26

F10 Improves App Performance

Median speedup is 1.3x

[Figure: Speedup of a MapReduce computation]

Slide27

Conclusion

F10 is a co-design of topology, routing protocols, and failure detector:

AB FatTrees to allow local recovery and increase path diversity

Pushback and global re-optimization restore congestion-free operation

Significant benefit to application performance on typical workloads and failure conditions

Thanks!