Tail Latency: Networking


Presentation Transcript

Slide 1

Tail Latency: Networking

Slide 2

The story thus far

Tail latency is bad

Causes:

Resource contention with background jobs

Device failure

Uneven split of data between tasks

Network congestion for reducers

Slide 3

Ways to address tail latency

Clone all tasks

Clone slow tasks

Copy intermediate data

Remove/replace frequently failing machines

Spread out reducers

Slide 4

What is missing from this picture?

Networking:

Spreading out reducers is not sufficient.

The network is crucial

Studies on Facebook traces show that [Orchestra]

in 26% of jobs, shuffle is 50% of runtime.

in 16% of jobs, shuffle is more than 70% of runtime

42% of tasks spend over 50% of their time writing to HDFS

Slide 5

Other implications of Network Limits

Scalability

Scalability of a Netflix-like recommendation system is bottlenecked by communication


Did not scale beyond 60 nodes

Communication time increased faster than computation time decreased

Slide 6

What is the Impact of the Network?

Assume a 10ms deadline for tasks [DCTCP]

Simulate job completion times based on distributions of task completion times (focus on the 99.9th percentile)

For 40 tasks, about 4 tasks (14%) fail; for 400 tasks, about 14 tasks (3%) fail

Slide 10

What Causes this Variation in Network Transfer Times?

First, let's look at the types of traffic in the network

Background Traffic

Latency-sensitive short control messages, e.g., heartbeats, job status

Large files: e.g. HDFS replication, loading of new data

Map-reduce jobs

Small RPC-request/response with tight deadlines

HDFS reads or writes with tight deadlines

Slide 11

What Causes this Variation in Network Transfer Times?

No notion of priority

Latency-sensitive and non-latency-sensitive traffic share the network equally.

Uneven load-balancing

ECMP doesn’t schedule flows evenly across all paths

Long and short flows are treated the same (see the sketch after this list)

Bursts of traffic

Networks have buffers which reduce loss but introduce latency (time waiting in buffer is variable)

Kernel optimizations introduce burstiness
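As a concrete illustration of the ECMP point above, here is a minimal Python sketch (the 5-tuples, flow sizes, and path count are made up, not from the slides) of hashing flows onto equal-cost paths; because the hash ignores flow size, two elephant flows can collide on one path while others sit idle.

```python
# Minimal ECMP sketch: flows are hashed onto equal-cost paths by 5-tuple,
# with no regard to flow size. All values below are made up for illustration.
import hashlib

NUM_PATHS = 4

def ecmp_path(five_tuple, num_paths=NUM_PATHS):
    """Pick a path from a hash of the flow's 5-tuple, ECMP-style."""
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

flows = [
    # ((src_ip, dst_ip, src_port, dst_port, proto), size in MB)
    (("10.0.0.1", "10.0.1.1", 40001, 5001, "tcp"), 1000),  # elephant
    (("10.0.0.2", "10.0.1.1", 40002, 5001, "tcp"), 1000),  # elephant
    (("10.0.0.3", "10.0.1.2", 40003, 5001, "tcp"), 1),     # mouse
    (("10.0.0.4", "10.0.1.3", 40004, 5001, "tcp"), 1),     # mouse
]

load = [0] * NUM_PATHS
for tup, size_mb in flows:
    load[ecmp_path(tup)] += size_mb

print("per-path load (MB):", load)
# Because the hash ignores size, both elephants may land on the same path,
# leaving other paths idle while that path's queue (and latency) grows.
```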

Slide 12

Ways to Eliminate Variation and Improve Tail Latency

Make the network faster

HULL, DeTail, DCTCP

Faster networks == smaller tail

Optimize how applications use the network

Orchestra, CoFlows

Identify big-data transfer patterns and optimize them to reduce transfer time

Make the network aware of deadlines

D3, PDQ

Tasks have deadlines; there is no point doing any work if the deadline won't be met

Try to prioritize flows and schedule them based on deadlines.

Slide 13

Fair-Sharing or Deadline-based sharing

Fair-share (Status-Quo)

Everyone plays nice, but some deadlines can be missed

Deadline-based

Deadlines are met, but may require non-trivial implementation

Two ways to do deadline-based sharing

Earliest deadline first (PDQ)

Make BW reservations for each flow

Flow rate = flow size/flow deadline

Flow size & deadline are known aprioriSlide14
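A minimal Python sketch of the two options above, with made-up flow sizes and deadlines (an illustration of the idea, not code from D3 or PDQ): the reservation approach computes rate = flow size / flow deadline per flow, while earliest-deadline-first simply orders flows by deadline and gives the head flow the full link.

```python
# Hypothetical flows: (name, size in Mb, deadline in ms). Illustrative values;
# real systems learn size and deadline from the application (known a priori).
flows = [("f1", 10.0, 20.0), ("f2", 2.0, 5.0), ("f3", 40.0, 100.0)]

# Reservation approach: reserve rate = flow size / flow deadline, so an
# admitted flow finishes exactly at its deadline.
for name, size_mb, deadline_ms in flows:
    rate_gbps = (size_mb / 1e3) / (deadline_ms / 1e3)
    print(f"{name}: reserve {rate_gbps:.2f} Gb/s")

# Earliest-deadline-first (PDQ-style): the flow with the nearest deadline
# gets the link first, preempting flows with later deadlines.
for name, size_mb, deadline_ms in sorted(flows, key=lambda f: f[2]):
    print(f"schedule {name} (deadline {deadline_ms:.0f} ms)")
```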

Slide 15

Issues with Deadline Based Scheduling

Implications for non-deadline-based jobs

Starvation? Poor completion times?

Implementation Issues

Assign deadlines to flows not packets

Reservation approach

Requires reservation for each flow

Big data flows: can be small & have small RTT

Control loop must be extremely fast

Earliest deadline first

Requires coordination between switches & servers

Servers: specify flow deadlines

Switches: prioritize flows and determine rates

May require complex switch mechanisms

Slide 16

How do you make the Network Faster?

Throw more hardware at the problem

Fat-Tree, VL2, B-Cube, Dragonfly

Increases bandwidth (throughput) but does not necessarily reduce latency

Slide 17

So, how do you reduce latency?

Trade bandwidth for latency

Buffering adds variation (unpredictability)

Eliminate network buffering & bursts

Optimize the network stack

Use link-level information to detect congestion

Inform the application to adapt by using a different path

Slide 18

HULL: Trading BW for Latency

Buffering introduces latency

Buffers are used to accommodate bursts

To allow congestion control to get good throughput

Removing buffers means

Lower throughput for large flows

Network can’t handle bursts

Predictable low latency

Slide 19

Why Do Bursts Exist?

Systems review:

NIC (network card) informs the OS of packets via interrupts

Interrupts consume CPU

If there were one interrupt for each packet, the CPU would be overwhelmed

Optimization: batch packets up before raising an interrupt

The size of the batch is the size of the burst
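To put rough numbers on this (the packet size, batch size, and link speed below are assumptions for illustration, not figures from the slides): a batch of packets released together hits the wire back-to-back as one burst.

```python
# Back-of-the-envelope burst size from interrupt/batching optimizations.
# Packet size, batch size, and link speed are assumptions for illustration.
PACKET_BYTES = 1500      # one MTU-sized packet
BATCH_PACKETS = 64       # packets coalesced per interrupt / per batched send
LINK_GBPS = 10           # link speed in Gb/s

burst_bytes = PACKET_BYTES * BATCH_PACKETS
burst_us = burst_bytes * 8 / (LINK_GBPS * 1e3)   # bits / (bits per microsecond)

print(f"burst: {burst_bytes / 1e3:.0f} KB, about {burst_us:.0f} us back-to-back on the wire")
# ~96 KB arriving at line rate must sit in a switch buffer, and that variable
# queueing delay is exactly what shows up in the latency tail.
```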

Slide 21

Why Does Congestion Control Need Buffers?

Congestion Control AKA TCP

Detects bottleneck link capacity through packet loss

On packet loss, it halves its sending rate.

Buffers help keep the network busy

Important for when TCP reduces its sending rate by half

Essentially, the network must double its capacity for TCP to work well.

Buffers allow for this doubling

Slide 22

TCP Review

Bandwidth-delay product rule of thumb: a single flow needs B = C×RTT of buffering for 100% throughput.

[Figure: throughput vs. buffer size — throughput reaches 100% when B ≥ C×RTT and stays below 100% when B < C×RTT]

Slide 23

Key Idea Behind HULL

Eliminate Bursts

Add a token bucket (Pacer) into the network

The pacer must be in the network so that pacing happens after the system optimizations that cause bursts.

Eliminate Buffering

Send congestion notification messages before the link is fully utilized

Make applications believe the link is full when there’s still capacity

TCP has a poor congestion control algorithm

Replace it with DCTCP
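A minimal Python sketch of the token-bucket pacing idea (illustrative parameters; the slides note the real pacer sits in the network, past the host optimizations that create bursts): a burst arriving all at once leaves as a paced stream slightly below line rate, which is the bandwidth HULL trades away for latency headroom.

```python
import collections

class TokenBucketPacer:
    """Sketch of a token-bucket pacer: release packets at `rate_bps`, allowing
    bursts of at most `bucket_bytes`. Parameters are illustrative, not HULL's."""

    def __init__(self, rate_bps, bucket_bytes):
        self.rate_bps = rate_bps
        self.bucket_max = bucket_bytes
        self.tokens = bucket_bytes          # available tokens, in bytes
        self.queue = collections.deque()    # queued packet sizes, in bytes

    def enqueue(self, pkt_bytes):
        self.queue.append(pkt_bytes)

    def tick(self, dt_s):
        """Advance time by dt_s seconds; return the packets released."""
        self.tokens = min(self.bucket_max,
                          self.tokens + self.rate_bps / 8 * dt_s)
        released = []
        while self.queue and self.queue[0] <= self.tokens:
            pkt = self.queue.popleft()
            self.tokens -= pkt
            released.append(pkt)
        return released

# A 64-packet burst arrives at once but drains as a paced stream at ~9 Gb/s
# (deliberately below the 10 Gb/s line rate: bandwidth traded for latency).
pacer = TokenBucketPacer(rate_bps=9e9, bucket_bytes=3000)
for _ in range(64):
    pacer.enqueue(1500)
for step in range(1, 11):
    out = pacer.tick(dt_s=1e-6)   # 1 microsecond per tick
    print(f"t={step} us: released {len(out)} packet(s)")
```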

Slide 25

Orchestra: Managing Data Transfers in Computer Clusters

Group all flows belonging to a stage into a transfer

Perform inter-transfer coordination

Optimize at the level of transfers rather than individual flows

Slide 26

Transfer Patterns

Transfer: set of all flows transporting data between two stages of a job

Acts as a barrier

Completion time: time for the last receiver to finish

[Figure: transfer patterns — Shuffle, Broadcast, and Incast* — in a job with Map and Reduce stages reading from and writing to HDFS]

Slide 27

Orchestra

[Figure: Orchestra architecture — an Inter-Transfer Controller (ITC) with fair sharing / FIFO / priority policies sits above per-transfer controllers: broadcast TCs (HDFS, tree, Cornet) and shuffle TCs (Hadoop shuffle, WSS)]

Cooperative broadcast (Cornet)

Infer and utilize topology information

Weighted Shuffle Scheduling (WSS)

Assign flow rates to optimize shuffle completion time

Inter-Transfer Controller (ITC)

Implement weighted fair sharing between transfers

End-to-end performance

Slide 28

Cornet: Cooperative broadcast

Broadcast the same data to every receiver

Fast, scalable, adaptive to bandwidth, and resilient

Peer-to-peer mechanism optimized for cooperative environments

Use BitTorrent to distribute data

Observations → Cornet design decisions:

High-bandwidth, low-latency network → Large block size (4-16 MB)

No selfish or malicious peers → No need for incentives (e.g., TFT); no (un)choking; everyone stays till the end

Topology matters → Topology-aware broadcast

Slide 29

Topology-aware Cornet

Many data center networks employ tree topologies

Each rack should receive exactly one copy of the broadcast

Minimize cross-rack communication

Topology information reduces cross-rack data transfer

Mixture of spherical Gaussians to infer network topology
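A minimal sketch of the "exactly one copy per rack" idea (the node-to-rack mapping below is hypothetical; Cornet infers the topology and then runs its BitTorrent-like protocol within and across racks): pick one receiver per rack to fetch the data across racks, and let the other receivers in that rack fetch locally.

```python
from collections import defaultdict

# Hypothetical receiver-to-rack mapping; Cornet infers this topology itself
# (the slide above mentions a mixture of spherical Gaussians for inference).
receivers = {
    "node1": "rackA", "node2": "rackA", "node3": "rackA",
    "node4": "rackB", "node5": "rackB",
    "node6": "rackC",
}

by_rack = defaultdict(list)
for node, rack in sorted(receivers.items()):
    by_rack[rack].append(node)

cross_rack, intra_rack = [], []
for rack, nodes in sorted(by_rack.items()):
    leader = nodes[0]
    cross_rack.append(("source", leader))                  # one cross-rack copy per rack
    intra_rack += [(leader, peer) for peer in nodes[1:]]   # the rest fetch locally

print("cross-rack transfers:", cross_rack)   # 3 instead of 6 cross-rack copies
print("intra-rack transfers:", intra_rack)
```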

Slide 30

Shuffle bottlenecks

An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer


At a sender

At a receiver

In the network
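One way to make the bottleneck argument concrete: if D_{s,r} is the data sender s must send to receiver r and every server NIC runs at capacity C (assuming the bottleneck is at the edge rather than inside the fabric), then the busiest sender and the busiest receiver give a lower bound on shuffle completion time:

```latex
T_{\mathrm{shuffle}} \;\ge\; \frac{1}{C}\,
  \max\!\left( \max_{s}\ \sum_{r} D_{s,r},\ \ \max_{r}\ \sum_{s} D_{s,r} \right)
```

In the example on the next two slides this bound works out to 4 time units: fair sharing takes 5, while weighted shuffle scheduling attains 4.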

Slide 31

Status quo in Shuffle

[Figure: shuffle with senders s1–s5 and receivers r1, r2]

Links to r1 and r2 are full: 3 time units

Link from s3 is full: 2 time units

Completion time: 5 time units

Weighted Shuffle Scheduling

Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent

Completion time: 4 time units (up to 1.5x improvement)

[Figure: the same shuffle, with per-flow weights 1, 1, 2, 2, 1, 1]
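A small Python sketch of this allocation on the example above (the demand matrix is reconstructed from the per-flow weights in the figure, so treat it as illustrative): each flow is weighted by its data size, each sender/receiver link splits its capacity in proportion to those weights, and a flow runs at the smaller of its two shares.

```python
# Weighted fair sharing sketch for the shuffle example on this slide.
# The demand matrix is reconstructed from the per-flow weights (illustrative).
demand = {
    ("s1", "r1"): 1, ("s2", "r1"): 1, ("s3", "r1"): 2,
    ("s3", "r2"): 2, ("s4", "r2"): 1, ("s5", "r2"): 1,
}
CAP = 1.0  # each sender/receiver link carries 1 data unit per time unit

def share(link_flows, flow):
    """A flow's share of one link: capacity split in proportion to data size."""
    total = sum(demand[f] for f in link_flows)
    return CAP * demand[flow] / total

senders, receivers = {}, {}
for s, r in demand:
    senders.setdefault(s, []).append((s, r))
    receivers.setdefault(r, []).append((s, r))

rates = {f: min(share(senders[f[0]], f), share(receivers[f[1]], f))
         for f in demand}
finish = {f: demand[f] / rates[f] for f in demand}
print("shuffle completion time:", max(finish.values()), "time units")  # 4.0
```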

Slide 33

Faster spam classification


Communication reduced from 42% to 28% of the iteration time

Overall 22% reduction in iteration time

Slide 34

Summary

Discussed tail latency in the network

Types of traffic in network

Implications on jobs

Causes of tail latency

Discussed HULL:

Trade Bandwidth for latency

Penalize huge flows

Eliminate bursts and buffering

Discussed Orchestra:

Optimize transfers instead of individual flows

Utilize knowledge about application semantics


http://www.mosharaf.com/