Tail Latency: Networking
The story thus far
Tail latency is bad
Causes:
Resource contention with background jobs
Device failure
Uneven split of data between tasks
Network congestion for reducers
Ways to address tail latency
Clone all tasks
Clone slow tasks
Copy intermediate data
Remove/replace frequently failing machines
Spread out reducers
What is missing from this picture?
Networking:
Spreading out reducers is not sufficient.
The network is critical
Studies of Facebook traces show that [Orchestra]:
In 26% of jobs, shuffle is at least 50% of the runtime
In 16% of jobs, shuffle is more than 70% of the runtime
42% of tasks spend over 50% of their time writing to HDFS
Other Implications of Network Limits
Scalability
Scalability of Netflix-like recommendation system is bottlenecked by communication
Did not scale beyond 60 nodes
Communication time increased faster than computation time decreased
What is the Impact of the Network?
Assume a 10ms deadline for tasks [DCTCP]
Simulate job completion times based on the distribution of task completion times (focus on the 99.9th percentile)
For 40 tasks, about 4 tasks (14%) fail; for 400 tasks, about 14 tasks (3%) fail
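One way to see why larger jobs feel the tail more (a back-of-the-envelope sketch with an assumed, independent per-task miss probability p, not the trace-driven simulation above): the probability that at least one of n tasks misses the deadline is 1 - (1 - p)^n, which grows quickly with n.

# Sketch: probability that at least one of n tasks blows the deadline,
# assuming each task independently misses with probability p (assumed value).
def job_miss_probability(p, n):
    return 1.0 - (1.0 - p) ** n

for n in (40, 400):
    print(n, round(job_miss_probability(0.01, n), 3))  # p = 0.01 is illustrative

With p = 0.01, roughly 33% of 40-task jobs and 98% of 400-task jobs would see at least one straggler, which is why the slides focus on the far tail.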
What Causes this Variation in Network Transfer Times?
First, let's look at the types of traffic in the network
Background traffic
Latency-sensitive short control messages, e.g. heartbeats, job status
Large files, e.g. HDFS replication, loading of new data
Map-reduce jobs
Small RPC request/response with tight deadlines
HDFS reads or writes with tight deadlines
What Causes this Variation in Network Transfer Times?
No notion of priority
Latency-sensitive and non-latency-sensitive traffic share the network equally
Uneven load-balancing
ECMP doesn't schedule flows evenly across all paths; it treats long and short flows the same (see the sketch below)
Bursts of traffic
Networks have buffers, which reduce loss but introduce latency (time spent waiting in a buffer is variable)
Kernel optimizations introduce burstiness
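A minimal sketch of why ECMP-style load balancing can be uneven (the hash function and path count are assumptions, not any particular switch's implementation): flows are pinned to paths by a hash of their headers regardless of size, so two long flows can collide on one path while other paths sit idle.

import hashlib

NUM_PATHS = 4  # assumed number of equal-cost paths

def ecmp_path(src, dst, sport, dport, proto="tcp"):
    # Static hash of the 5-tuple picks the path; flow size plays no role.
    key = f"{src}:{dst}:{sport}:{dport}:{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_PATHS

big_flows = [("10.0.0.1", "10.0.1.9", 40001, 80), ("10.0.0.2", "10.0.1.9", 40002, 80)]
print([ecmp_path(*f) for f in big_flows])  # both large flows may land on the same path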
Ways to Eliminate Variation and Improve tail latency
Make the network faster
HULL, DeTail, DCTCP
Faster networks == smaller tail
Optimize how applications use the network
Orchestra, CoFlows
Specific big-data transfer patterns; optimize the patterns to reduce transfer time
Make the network aware of deadlines
D3, PDQ
Tasks have deadlines; there is no point doing any work if the deadline won't be met
Try to prioritize flows and schedule them based on deadlines
Fair-Sharing or Deadline-based sharing
Fair-share (Status-Quo)
Everyone plays nice, but some deadlines can be missed
Deadline-based
Deadlines are met, but may require a non-trivial implementation
Two ways to do deadline-based sharing (see the sketch below):
Earliest deadline first (PDQ)
Make bandwidth reservations for each flow
Flow rate = flow size / flow deadline
Flow size & deadline are known a priori
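A minimal sketch of the two flavors of deadline-based sharing described above (flow sizes, deadlines, and units are made up for illustration; this is not the D3 or PDQ implementation):

# Hypothetical flows: (name, size in bits, deadline in seconds)
flows = [("a", 8e6, 0.010), ("b", 2e6, 0.005), ("c", 40e6, 0.100)]

# Reservation approach: reserve rate = flow size / flow deadline for each flow.
for name, size, deadline in flows:
    print(f"{name}: reserve {size / deadline / 1e9:.2f} Gbps")

# Earliest-deadline-first (PDQ-style): give the full link to the flow with the
# earliest deadline, preempting flows with later deadlines.
print("EDF order:", [name for name, _, _ in sorted(flows, key=lambda f: f[2])])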
Issues with Deadline Based Scheduling
Implications for non-deadline jobs
Starvation? Poor completion times?
Implementation issues
Assign deadlines to flows, not packets
Reservation approach
Requires a reservation for each flow
Big-data flows can be small & have small RTTs
The control loop must be extremely fast
Earliest deadline first
Requires coordination between switches & servers
Servers: specify flow deadlines
Switches: prioritize flows and determine rates
May require complex switch mechanisms
How do you make the Network Faster
Throw more hardware at the problem
Fat-Tree, VL2, B-Cube, Dragonfly
Increases bandwidth (throughput) but not necessarily latency
So, how do you reduce latency
Trade bandwidth for latency
Buffering adds variation (unpredictability)
Eliminate network buffering & bursts
Optimize the network stack
Use link level information to detect congestion
Inform the application to adapt by using a different path
HULL: Trading BW for Latency
Buffering introduces latency
Buffers are used to accommodate bursts
and to allow congestion control to get good throughput
Removing buffers means
Lower throughput for large flows
Network can’t handle bursts
Predictable low latency
Why do Bursts Exist?
Systems review:
The NIC (network card) informs the OS of arriving packets via interrupts
Interrupts consume CPU
With one interrupt per packet, the CPU would be overwhelmed
Optimization: batch packets before raising an interrupt
The size of the batch is the size of the burst (see the arithmetic below)
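Rough arithmetic for the burst size (the coalescing batch size and MTU below are assumed, illustrative values):

BATCH = 64   # packets handled per interrupt (assumed coalescing setting)
MTU = 1500   # bytes per packet
print(f"burst ≈ {BATCH * MTU} bytes sent back-to-back")  # 96,000 bytes at line rate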
Why Does Congestion Control Need Buffers?
Congestion control (TCP)
Detects the bottleneck link capacity through packet loss
On a loss, it halves its sending rate
Buffers help keep the network busy
Important for when TCP halves its sending rate
Essentially, the network must absorb this halving and doubling of the rate for TCP to work well; buffers allow for it
TCP Review
Bandwidth-delay product rule of thumb: a single flow needs B ≥ C × RTT of buffering for 100% throughput
[Figure: throughput vs. buffer size B — throughput stays at 100% when B ≥ C×RTT and falls below 100% when B < C×RTT]
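For concreteness (assumed numbers, not from the slides): a 10 Gb/s link with a 100 µs RTT needs B = C × RTT = 10^10 b/s × 10^-4 s = 10^6 bits ≈ 125 KB of buffering for a single flow to sustain 100% throughput; with less than that, the link drains before TCP ramps back up after a loss.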
Key Idea Behind HULL
Eliminate bursts
Add a token bucket (pacer) into the network (see the sketch below)
The pacer must be in the network so that it acts after the system optimizations that cause bursts
Eliminate buffering
Send congestion notification messages before the link is fully utilized
Make applications believe the link is full when there is still capacity
TCP has a poor congestion control algorithm; replace it with DCTCP
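A minimal software sketch of the token-bucket pacer idea (HULL paces in the network, past the host optimizations that cause bursts; the rate and bucket parameters here are assumed):

import time
from collections import deque

class Pacer:
    # Token-bucket pacer: releases traffic at `rate` bytes/s and holds back bursts.
    def __init__(self, rate, bucket):
        self.rate = rate              # drain rate, bytes per second
        self.capacity = bucket        # largest burst allowed, bytes
        self.tokens = bucket
        self.last = time.monotonic()
        self.backlog = deque()        # packets waiting for tokens

    def send(self, pkt_len):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return True               # transmit now
        self.backlog.append(pkt_len)  # smooth the burst out over time
        return False

On the buffering side, DCTCP (which HULL builds on) has senders keep a running estimate α of the fraction of ECN-marked packets and cut the window by cwnd·α/2 rather than halving it, so small, early congestion signals translate into small rate reductions.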
Orchestra:
Managing Data Transfers in Computer Clusters
Group all flows belonging to a stage into a transfer
Perform inter-transfer coordination
Optimize at the level of transfers rather than individual flows
Transfer Patterns
Transfer: the set of all flows transporting data between two stages of a job
Acts as a barrier
Completion time: time for the last receiver to finish
Patterns: shuffle, broadcast, incast*
[Figure: a job's map tasks read from HDFS, shuffle to reduce tasks, and results are written back to HDFS]
Orchestra
[Figure: Orchestra architecture — an Inter-Transfer Controller (ITC; fair sharing, FIFO, or priority policies) sits above per-transfer controllers (TCs): broadcast TCs (HDFS, tree, Cornet) for broadcast 1 and broadcast 2, and a shuffle TC (Hadoop shuffle, WSS)]
Cooperative broadcast (Cornet)
Infer and utilize topology information
Weighted Shuffle Scheduling (WSS)
Assign flow rates to optimize shuffle completion time
Inter-Transfer Controller (ITC)
Implement weighted fair sharing between transfers
End-to-end performance
Cornet: Cooperative Broadcast
Broadcast the same data to every receiver
Fast, scalable, adaptive to bandwidth, and resilient
Peer-to-peer mechanism optimized for cooperative environments; uses BitTorrent-style distribution
Observations → Cornet design decisions:
High-bandwidth, low-latency network → large block size (4-16 MB)
No selfish or malicious peers → no need for incentives (e.g., TFT) and no (un)choking; everyone stays till the end
Topology matters → topology-aware broadcast
Topology-aware Cornet
Many data center networks employ tree topologies
Each rack should receive exactly one copy of the broadcast
Minimizes cross-rack communication
Topology information reduces cross-rack data transfer
A mixture of spherical Gaussians is used to infer the network topology
Shuffle Bottlenecks
An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer
The bottleneck link can be at a sender, at a receiver, or in the network
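A small sketch of where those bounds come from (unit link capacities are assumed; the demand matrix matches the example on the next two slides): the shuffle cannot finish before its most-loaded sender uplink or receiver downlink drains.

# Lower bound on shuffle completion time with unit-capacity access links (assumed).
def shuffle_lower_bound(demand):
    # demand[s][r] = data units sender s must ship to receiver r
    send_load = {s: sum(rs.values()) for s, rs in demand.items()}
    recv_load = {}
    for rs in demand.values():
        for r, d in rs.items():
            recv_load[r] = recv_load.get(r, 0) + d
    return max(max(send_load.values()), max(recv_load.values()))

# s3 sends 2 units to each receiver; the other senders send 1 unit each.
demand = {"s1": {"r1": 1}, "s2": {"r1": 1}, "s3": {"r1": 2, "r2": 2},
          "s4": {"r2": 1}, "s5": {"r2": 1}}
print(shuffle_lower_bound(demand))  # -> 4 time units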
Status Quo in Shuffle
[Figure: five senders (s1-s5) shuffling to two receivers (r1, r2) under per-flow fair sharing]
Links to r1 and r2 are full: 3 time units
Then the link from s3 is full: 2 time units
Completion time: 5 time units
Weighted Shuffle Scheduling
Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent
[Figure: same shuffle; flows from s1 and s2 to r1 and from s4 and s5 to r2 have weight 1, while s3's flows to r1 and r2 have weight 2]
Completion time: 4 time units
Up to 1.5X improvement
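A minimal sketch of the weighted-sharing rule on r1's downlink (unit link capacity assumed; the weights are the data sizes from the figure):

# Each flow into a receiver gets a share of the downlink proportional to its bytes.
def wss_rates(flow_bytes, link_capacity=1.0):
    total = sum(flow_bytes.values())
    return {f: link_capacity * b / total for f, b in flow_bytes.items()}

print(wss_rates({"s1": 1, "s2": 1, "s3": 2}))
# -> {'s1': 0.25, 's2': 0.25, 's3': 0.5}: all three flows finish together at t = 4,
#    instead of s3 being left to finish alone as under per-flow fair sharing.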
Faster Spam Classification
Communication reduced from 42% to 28% of the iteration time
Overall 22% reduction in iteration time
Summary
Discussed tail latency in the network
Types of traffic in the network
Implications for jobs
Causes of tail latency
Discussed HULL:
Trades bandwidth for latency
Penalizes huge flows
Eliminates bursts and buffering
Discussed Orchestra:
Optimizes transfers instead of individual flows
Utilizes knowledge about application semantics
http://www.mosharaf.com/