Slide 1: Ananta: Cloud Scale Load Balancing
Presenter: Donghwi Kim
Slide 2: Background: Datacenter
- Each server has a hypervisor and VMs
- Each VM is assigned a Direct IP (DIP)
- Each service has zero or more external end-points
- Each service is assigned one Virtual IP (VIP)
Slide 3: Background: Datacenter
- Each datacenter has many services
- A service may work with:
  - Another service in the same datacenter
  - Another service in a different datacenter
  - A client over the internet
Slide 4: Background: Load-balancer
- The entrance of a server pool
- Distributes workload across worker servers
- Hides the server pool from clients with a network address translator (NAT)
Slide 5: Inbound VIP Communication
The load balancer performs destination address translation (DNAT).
[Figure: a client packet arrives as src: Client, dst: VIP; the LB rewrites the destination to one of the front-end VMs' DIPs (src: Client, dst: DIP1/DIP2/DIP3) while the source stays the client.]
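To make the rewrite concrete, here is a minimal Python sketch of DNAT, assuming a 5-tuple hash picks the DIP; the pool, addresses, and field names are illustrative, not Ananta's actual implementation.

```python
# Minimal DNAT sketch: rewrite dst from the VIP to a concrete DIP.
import hashlib

DIP_POOL = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # DIP1..DIP3 (illustrative)

def pick_dip(packet: dict) -> str:
    """Hash the 5-tuple so every packet of a connection hits the same DIP."""
    five_tuple = (packet["src_ip"], packet["src_port"],
                  packet["dst_ip"], packet["dst_port"], packet["proto"])
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return DIP_POOL[int.from_bytes(digest[:4], "big") % len(DIP_POOL)]

def dnat(packet: dict) -> dict:
    """Rewrite dst from the VIP to a DIP; src stays the client."""
    rewritten = dict(packet)
    rewritten["dst_ip"] = pick_dip(packet)
    return rewritten

inbound = {"src_ip": "203.0.113.7", "src_port": 51000,
           "dst_ip": "198.51.100.10", "dst_port": 80, "proto": "tcp"}
print(dnat(inbound))  # dst_ip is now one of the DIPs
```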
Slide 6: Outbound VIP Communication
The load balancer performs source address translation (SNAT).
[Figure: a back-end VM of Service 1 (DIP2) sends a packet to Service 2's VIP2 across the datacenter network; it leaves as src: DIP2, dst: VIP2 and the LB rewrites the source to Service 1's VIP (src: VIP1, dst: VIP2) before it reaches Service 2's front-end VMs (DIP3, DIP4, DIP5).]
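A minimal sketch of the SNAT bookkeeping implied here: (DIP, port) pairs map to (VIP, port) pairs, with a reverse table for return traffic. Port allocation is reduced to a local counter for illustration; in Ananta it goes through Ananta Manager (slide 16).

```python
class SnatTable:
    """Maps (DIP, dip_port) <-> (VIP, vip_port) for outbound connections."""
    def __init__(self, vip: str, first_port: int = 1024):
        self.vip = vip
        self.next_port = first_port  # toy allocator, not Ananta Manager
        self.out = {}   # (dip, dip_port) -> vip_port
        self.back = {}  # vip_port -> (dip, dip_port)

    def translate_out(self, dip: str, dip_port: int):
        """Rewrite an outbound source from (DIP, port) to (VIP, port)."""
        key = (dip, dip_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.vip, self.out[key]

    def translate_back(self, vip_port: int):
        """Map a return packet's (VIP, port) back to the original VM."""
        return self.back[vip_port]

snat = SnatTable("198.51.100.1")
print(snat.translate_out("10.0.0.2", 555))   # ('198.51.100.1', 1024)
print(snat.translate_back(1024))             # ('10.0.0.2', 555)
```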
Slide 7: State of the Art
A load balancer is a hardware device: expensive, slow to fail over, and not scalable.
Slide 8: Cloud Requirements
Scale:
- Requirement: ~40 Tbps of throughput using 400 servers; 100 Gbps for a single VIP
- State of the art: 20 Gbps for $80,000; up to 20 Gbps per VIP
Reliability:
- Requirement: N+1 redundancy; quick failover
- State of the art: 1+1 redundancy or slow failover
Slide 9: Cloud Requirements
Any service anywhere:
- Requirement: servers and LB/NAT are placed across L2 boundaries
- State of the art: NAT supported only within the same L2 domain
Tenant isolation:
- Requirement: an overloaded or abusive tenant cannot affect other tenants
- State of the art: excessive SNAT from one tenant causes a complete outage
Slide 10: Ananta
Slide 11: SDN
SDN: managing a flexible data plane via a centralized control plane.
[Figure: a Controller in the control plane programs a Switch in the data plane.]
Slide 12: Breaking Down the Load-balancer's Functionality
- Control plane: VIP configuration, monitoring
- Data plane: destination/source selection, address translation
Slide 13: Design
- Ananta Manager: source selection; not scalable (like an SDN controller)
- Multiplexer (Mux): destination selection
- Host Agent: address translation; resides in each server's hypervisor
Slide 14: Data plane
[Figure: routers spread packets destined for VIP1 and VIP2 across a row of Multiplexers; each Mux forwards them toward the right DIP, and the Host Agent in each server's VM switch delivers them to its VMs.]
- 1st tier (Router): packet-level load spreading via ECMP
- 2nd tier (Multiplexer): connection-level load spreading, destination selection
- 3rd tier (Host Agent): stateful NAT
A sketch of why per-packet spreading at tier 1 is consistent with per-connection decisions at tier 2 follows.
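Per-packet ECMP at tier 1 is safe because every Mux computes the same deterministic mapping from a connection's 5-tuple to a DIP, so whichever Mux a router picks, a given connection lands on the same server. A sketch under that assumption, with an illustrative hash:

```python
import hashlib

DIPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # DIPs behind one VIP

def select_dip(five_tuple: tuple) -> str:
    """Deterministic 5-tuple -> DIP mapping; identical on every Mux."""
    h = int.from_bytes(hashlib.sha256(repr(five_tuple).encode()).digest()[:4], "big")
    return DIPS[h % len(DIPS)]

flow = ("203.0.113.7", 51000, "198.51.100.10", 80, "tcp")
# Three Mux replicas, each running the same code over the same DIP list:
answers = {select_dip(flow) for _mux in range(3)}
assert len(answers) == 1  # any Mux the router picks sends this flow to one DIP
print(answers.pop())
```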
Slide 15: Inbound connections
[Figure: packet flow for an inbound connection.]
1. A client packet (s: CLI, d: VIP) reaches a first-tier router, which spreads it to one of the Muxes.
2. The Mux selects a DIP for the connection and encapsulates the packet toward that host (s: MUX, d: DIP).
3. The Host Agent on that server decapsulates the packet and performs NAT (s: CLI, d: DIP), then delivers it to the VM.
4. The VM replies with s: DIP, d: CLI.
5. The Host Agent reverses the NAT (s: VIP, d: CLI) and the reply is routed directly to the client, bypassing the Mux.
A toy trace of these rewrites follows.
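A toy trace of the rewrites above, modeling packets as dicts; the encapsulation header and field names are illustrative:

```python
def mux_encap(pkt, dip, mux_ip):
    """Step 2: the Mux wraps the packet in an outer header toward the DIP."""
    return {"outer": {"src": mux_ip, "dst": dip}, "inner": dict(pkt)}

def host_agent_deliver(encap, vip, dip):
    """Step 3: strip the outer header and NAT the VIP to the local DIP."""
    pkt = encap["inner"]
    assert pkt["dst"] == vip
    return {**pkt, "dst": dip}        # handed to the VM

def host_agent_return(pkt, vip, dip):
    """Step 5: reverse the NAT on the VM's reply; it bypasses the Mux."""
    assert pkt["src"] == dip
    return {**pkt, "src": vip}

inbound = {"src": "CLI", "dst": "VIP"}
to_vm = host_agent_deliver(mux_encap(inbound, "DIP", "MUX"), "VIP", "DIP")
reply = host_agent_return({"src": "DIP", "dst": "CLI"}, "VIP", "DIP")
print(to_vm)   # {'src': 'CLI', 'dst': 'DIP'}
print(reply)   # {'src': 'VIP', 'dst': 'CLI'} -- the Mux never sees the reply
```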
Slide 16: Outbound (SNAT) connections
[Figure: packet flow for an outbound connection.]
1. A VM opens an outbound connection (s: DIP:555, d: SVR:80); the Host Agent has no VIP port for it yet ("Port??").
2. The Host Agent asks Ananta Manager for a port; the manager allocates VIP:777 and programs the VIP:777-to-DIP mapping into the Host Agent and the Muxes.
3. The Host Agent rewrites the packet (s: VIP:777, d: SVR:80) and sends it out.
4. The return packet (s: SVR:80, d: VIP:777) lands on a Mux, which encapsulates it toward the DIP's host (s: MUX, d: DIP:555).
5. The Host Agent decapsulates and reverses the NAT (s: SVR:80, d: DIP:555), delivering it to the VM.
Slide 17: Reducing the Load on Ananta Manager
Optimizations:
- Batching: allocate 8 ports at a time instead of one
- Pre-allocation: 160 ports per VM
- Demand prediction: consider recent request history
As a result, less than 1% of outbound connections ever hit Ananta Manager, and SNAT request latency is reduced. A sketch of the batching idea follows.
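A sketch of the batching optimization, with a hypothetical manager stub; the batch size of 8 comes from the slide:

```python
BATCH = 8  # ports allocated per request, per the slide

class ManagerStub:
    """Stand-in for Ananta Manager's port allocator (illustrative only)."""
    def __init__(self):
        self.next_port = 1024
        self.calls = 0
    def allocate(self, n):
        self.calls += 1
        ports = list(range(self.next_port, self.next_port + n))
        self.next_port += n
        return ports

class HostAgent:
    """Serves SNAT ports from a local pool, refilling in batches."""
    def __init__(self, manager):
        self.manager = manager
        self.pool = []
    def get_port(self):
        if not self.pool:                 # round-trip only when the pool is dry
            self.pool = self.manager.allocate(BATCH)
        return self.pool.pop()

mgr = ManagerStub()
agent = HostAgent(mgr)
for _ in range(100):
    agent.get_port()
print(mgr.calls)  # 13 manager calls for 100 connections instead of 100
```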
Slide 18: VIP traffic in a datacenter
A large portion of the traffic through the load balancer is intra-DC.
Slide 19: Fastpath Step 1: Forward Traffic
[Figure: a VM behind VIP1 (on the host with DIP1) sends data packets to VIP2; (1) they reach MUX2, and (2) MUX2 forwards them to the destination host (DIP2).]

Slide 20: Fastpath Step 2: Return Traffic
[Figure: (3) the reply from DIP2 toward VIP1 reaches MUX1, and (4) MUX1 forwards it to DIP1; both directions still traverse the Muxes.]

Slide 21: Fastpath Step 3: Redirect Messages
[Figure: once the connection is established, (5, 6) redirect messages tell the source side that this connection's VIP2 traffic is actually served by DIP2, and (7) the destination side likewise learns DIP1.]

Slide 22: Fastpath Step 4: Direct Connection
[Figure: (8) subsequent data packets flow directly between the two hosts, bypassing the Muxes entirely.]
A toy model of the redirect handshake follows.
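A toy model of the Fastpath redirect handshake: the first packets travel via the Muxes, the redirects teach both host agents the real DIPs, and later packets go host to host. Class and message names here are hypothetical:

```python
class HostAgentFP:
    """Host agent that learns VIP -> DIP shortcuts from redirect messages."""
    def __init__(self, dip):
        self.dip = dip
        self.direct = {}            # vip -> dip learned from redirects

    def on_redirect(self, vip, dip):
        self.direct[vip] = dip      # step 3: remember the shortcut

    def next_hop(self, dst_vip):
        # Step 4: after a redirect, bypass the Mux entirely.
        return self.direct.get(dst_vip, "via-mux")

a = HostAgentFP("DIP1")            # hosts the VM behind VIP1
b = HostAgentFP("DIP2")            # hosts the VM behind VIP2
print(a.next_hop("VIP2"))          # 'via-mux': first packets go through a Mux
a.on_redirect("VIP2", "DIP2")      # the source side learns the real DIP
b.on_redirect("VIP1", "DIP1")      # the destination side learns it too
print(a.next_hop("VIP2"))          # 'DIP2': direct connection from now on
```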
Slide 23: SNAT Fairness
Ananta Manager is not scalable: more VMs demand more of its resources.
[Figure: SNAT request queues feeding Ananta Manager.]
- Pending SNAT requests are kept per DIP, at most one per DIP.
- Pending requests are queued per VIP.
- A global queue dequeues round-robin from the VIP queues and is processed by a thread pool.
A toy model of this scheduler follows.
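A toy model of this scheduler, under the assumptions stated on the slide (at most one pending request per DIP, per-VIP queues, global round-robin dequeue):

```python
from collections import deque

class SnatScheduler:
    def __init__(self):
        self.vip_queues = {}        # vip -> deque of (dip, request)
        self.pending_dips = set()   # at most one outstanding request per DIP
        self.rr = deque()           # round-robin order over VIPs

    def submit(self, vip, dip, request):
        if dip in self.pending_dips:
            return False            # this DIP already has a request queued
        self.pending_dips.add(dip)
        if vip not in self.vip_queues:
            self.vip_queues[vip] = deque()
            self.rr.append(vip)
        self.vip_queues[vip].append((dip, request))
        return True

    def next_request(self):
        """Round-robin over VIP queues so one tenant cannot starve others."""
        for _ in range(len(self.rr)):
            vip = self.rr[0]
            self.rr.rotate(-1)
            if self.vip_queues[vip]:
                dip, req = self.vip_queues[vip].popleft()
                self.pending_dips.discard(dip)
                return vip, dip, req
        return None

sched = SnatScheduler()
for i in range(3):
    sched.submit("VIP1", f"DIP{i}", "alloc")   # heavy tenant floods the queue
sched.submit("VIP2", "DIP9", "alloc")          # light tenant
print(sched.next_request()[0], sched.next_request()[0])  # VIP1 VIP2
```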
Slide 24: Packet Rate Fairness
- Each Mux keeps track of its top-talkers (the VIPs with the highest packet rates).
- When packet drops happen, Ananta Manager withdraws the topmost top-talker from all Muxes.
A sketch of this mechanism follows.
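A sketch of the mechanism, with illustrative counters; the real system tracks packet rates rather than the raw counts used here:

```python
from collections import Counter

class Mux:
    def __init__(self):
        self.packets = Counter()    # per-VIP packet counts (stand-in for rates)
        self.serving = set()        # VIPs this Mux currently advertises

    def on_packet(self, vip):
        if vip in self.serving:
            self.packets[vip] += 1

    def top_talker(self):
        return self.packets.most_common(1)[0][0]

muxes = [Mux() for _ in range(3)]
for m in muxes:
    m.serving = {"VIP1", "VIP2"}
for m in muxes:
    for _ in range(100): m.on_packet("VIP1")   # abusive tenant
    for _ in range(5):   m.on_packet("VIP2")   # well-behaved tenant

def on_packet_drop(muxes):
    victim = muxes[0].top_talker()             # manager picks the top talker
    for m in muxes:
        m.serving.discard(victim)              # withdraw it from every Mux

on_packet_drop(muxes)
print(muxes[0].serving)  # {'VIP2'}: the well-behaved tenant keeps service
```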
Slide 25: Reliability
- When Ananta Manager fails: Paxos provides fault tolerance by replication (typically 5 replicas).
- When a Mux fails: the first-tier routers detect the failure via BGP and stop sending traffic to that Mux.
Slide 26: Evaluation
Slide 27: Impact of Fastpath
Experiment: one 20-VM tenant as the server and two 10-VM tenants as clients; each VM sets up 10 connections and uploads 1 MB of data.
Slide 28: Ananta Manager's SNAT Latency
Ananta Manager's port allocation latency over a 24-hour observation.
Slide 29: SNAT Fairness
Normal users (N) make 150 outbound connections per minute, while a heavy user (H) keeps increasing its outbound connection rate. Observing SYN retransmits and SNAT latency shows that normal users are not affected by the heavy user.
Slide 30: Overall Availability
Average availability over a month: 99.95%.
Slide 31: Summary
How Ananta meets the cloud requirements:
- Scale: Mux (ECMP); Host Agent (scales out naturally)
- Reliability: Ananta Manager (Paxos); Mux (BGP)
- Any service anywhere: Ananta operates at layer 4 (the transport layer)
- Tenant isolation: SNAT fairness; packet rate fairness
Slide 32: Discussion
Ananta may lose some connections when it recovers from a Mux failure, because there is no way to copy a Mux's internal state.
[Figure: an existing Mux holds a 5-tuple-to-DIP table (e.g., one flow mapped to DIP1, another to DIP2) for in-flight TCP flows; when the first-tier router shifts those flows to a new Mux, its table has no entries ("???") for them.]
Slide 33: Discussion
- Detection of a Mux failure takes up to 30 seconds (the BGP hold timer). Why not use additional health monitoring?
- Fastpath does not preserve the order of packets.
- Passing through a software component (the Mux) may increase connection-establishment latency,* and Fastpath does not relieve this.
- The scale of the evaluation is small (e.g., bandwidth of 2.5 Gbps, not Tbps). Another paper argues that Ananta would require 8,000 Muxes to cover a mid-size datacenter.*

*DUET: Cloud Scale Load Balancing with Hardware and Software, SIGCOMM '14
Slide 34: Thanks! Any questions?
Slide 35: Lessons Learnt
- Centralized controllers work: there are significant challenges in doing per-flow processing (e.g., SNAT), but they provide overall higher reliability and an easier-to-manage system.
- Co-location of the control plane and data plane provides faster local recovery; fate sharing eliminates the need for a separate, highly-available management channel.
- Protocol semantics are violated on the Internet: bugs in external code forced us to change the network MTU.
- Owning our own software has been a key enabler for faster turn-around on bugs, DoS detection, flexibility to design new features, and better monitoring and management.
Slide 36: Backup: ECMP
Equal-Cost Multi-Path routing: hash the packet header and choose one of the equal-cost paths. A compact sketch follows.
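A compact sketch of the idea, with an illustrative hash (real routers compute this in hardware):

```python
import hashlib

def ecmp_next_hop(header: tuple, next_hops: list) -> str:
    """Hash the header fields and pick one of the equal-cost next hops."""
    h = int.from_bytes(hashlib.sha256(repr(header).encode()).digest()[:4], "big")
    return next_hops[h % len(next_hops)]

# All packets of one flow share a header, hence a path; different flows spread out.
print(ecmp_next_hop(("10.0.0.1", 51000, "198.51.100.10", 80, 6),
                    ["mux1", "mux2", "mux3"]))
```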
Slide 37: Backup: SEDA
Slide 38: Backup: SNAT
Slide 39: VIP Traffic in a Data Center
Slide 40: CPU Usage of Mux
CPU usage over a typical 24-hour period by 14 Muxes in a single Ananta instance.
Slide 41: Remarkable Points
- The first middlebox architecture that moves parts of itself to the host.
- Deployed and serving Microsoft datacenters for more than two years.