
Slide 1

Designing a New Multicast Infrastructure for Linux

Ken Birman
Cornell University, CS5410, Fall 2008

Slide 2

Mission Impossible…

Today, multicast is persona non grata in most cloud settings.
Amazon's stories of their experience with violent load oscillations have frightened most people in the industry.
They weren't the only ones…
Today: design a better multicast infrastructure for using the Linux Red Hat operating system in enterprise settings.
Target: trading floor in a big bank (if any are left) on Wall Street, or cloud computing in data centers.

Slide 3

What do they need?

Quick, scalable, pretty reliable message delivery.
This argues for IPMC or a protocol like Ricochet.
Virtual synchrony, Paxos, transactions: all would be examples of higher-level solutions running over the basic layer we want to design.
But we don't want our base layer to misbehave.
And it needs to be a good "team player" side by side with TCP and other protocols.

Slide 4

Reminder: What goes wrong?

Earlier in the semester we touched on the issues with IPMC in existing cloud platforms.
Applications were unstable and exhibited violent load swings.
Usually totally lossless, but sometimes the network drops zillions of packets all over the place.
Worst of all? The problems impacted not just IPMC but also disrupted UDP, TCP, etc.
And they got worse, not better, with faster networks!
Start by trying to understand the big picture: why is this happening?

Slide 5

Misbehavior pattern

Noticed when an application-layer solution, like a virtual synchrony protocol, begins to exhibit wild load swings for no obvious reason.
For example, we saw this in QSM (Quicksilver Scalable Multicast).
Fixing the problem at the end-to-end layer was really hard!

[Figure: QSM oscillated in this 200-node experiment when its damping and prioritization mechanisms were disabled.]

Slide 6

Aside: QSM works well now

We did all sorts of things to stabilize it:
A novel "minimal memory footprint" design.
Incredibly low CPU loads minimize delays.
Prioritization mechanisms ensure that lost data is repaired first, before new good data piles up behind a gap.
But most systems lack these sorts of unusual solutions.
Hence most systems simply destabilize, like QSM did before we studied and fixed these issues!
Linux goal: a system-wide solution.

Slide 7

It wasn't just QSM…

This graph was for Quicksilver, but most products are prone to destabilization of this sort.
And they often break down in ways that disrupt everyone else.
What are the main forms of data-center-wide issues?

Slide 8

Convoy effects

One problem is called a convoy effect.
It tends to occur suddenly: the system transitions from being load-balanced to a state in which there are always a small set of hot spots, very overloaded, while everything else is pretty much idle.
Hot spots might move around (like Heisenbugs).
Underlying cause? Related to phenomena that cause traffic to slow down on highways…

Slide 9

Convoy effects

Imagine that you are on a highway driving fast.
With a low density of cars, small speed changes make no difference.
But if you are close to other cars, a "coupling" effect will occur: the car in front slows down, so you slow down, and suddenly cars bunch up – they form a convoy.
In big distributed systems this is also seen when load rises to a point where minor packet scheduling delays start to cascade. Seen in any componentized system.

Slide 10

Convoys were just one culprit…

Why was QSM acting this way? When we started work, this wasn't easy to fix…
… the issue occurred only with 200 nodes and high data rates.
But we tracked down a loss-related pattern: under heavy load, the network was delivering packets to our receivers faster than they could handle them.
This caused kernel-level queues to overflow… hence widespread loss.
Retransmission requests and resends made things worse.
So: goodput drops to zero, overhead to infinity. Finally the problem is repaired and we restart… only to do it all again!

Slide 11

IPMC and loss

In fact, one finds that in systems that don't use IPMC, packet loss is rare, especially if they use TCP throughout.
But with IPMC, all forms of communication get lossy!
This impacts not just IPMC, but also UDP and TCP.
TCP reacts by choking back (it interprets loss as congestion).
Implication: somehow IPMC is triggering loss not seen when IPMC is not in use. Why???

Slide 12

Assumption?

Assume that if we enable IP multicast:
Some applications will use it heavily.
Testing will be mostly on smaller configurations.
Thus, as they scale up and encounter loss, many will be at risk of oscillatory meltdowns.
Fixing the protocol is obviously the best solution…
… but we want the data center (the cloud) to also protect itself against the disruptive impact of such events!

Slide 13

So why did receivers get so lossy?

To understand the issue, we need to understand the history of network speeds and a little about the hardware.

[Figure: the path of a packet from sender to receiver, across the Ethernet and through each host's NIC, kernel, and user levels]
(1) App sends packet
(2) UDP adds header, fragments it
(3) Enqueued for NIC to send
(4) NIC sends…
(5) NIC receives…
(6) Copied into a handy mbuf
(7) UDP queues on socket
(8) App receives

Slide 14

Network speeds

When Linux was developed, Ethernet ran at 10 Mbits and the NIC was able to keep up.
Then the network sped up: 100 Mbits became common, 1 Gbit more and more often seen, 10 or 40 "soon".
But typical PCs didn't speed up remotely that much!
Why did PC speed lag? Ethernets transitioned to optical hardware, while PCs are limited by concerns about heat and expense. The trend favors multicore solutions that run slower… so why invest to create a NIC that can run faster than the bus?

Slide 15

NIC as a "rate matcher"

A modern NIC has two sides running at different rates:
The Ethernet side is blazingly fast, uses ECL memory…
The main-memory side is slower.
So how can this work? Key insight: the NIC usually receives one packet, but then doesn't need to accept the "next" packet. That gives it time to unload the incoming data.
But why does it get away with this?

Slide 16

NIC as a "rate matcher"

When would a machine get several back-to-back packets?
A server with many clients.
A pair of machines with a stream between them: but here, the rate is limited because the sending NIC will run at the speed of its interface to the machine's main memory – in today's systems, usually 100 Mbits.
In a busy setting, only servers are likely to see back-to-back traffic, and even the server is unlikely to see a long run of packets that it needs to accept!

Slide 17

… So normally

The NIC sees big gaps between messages it needs to accept.
This gives us time…
… for the OS to replenish the supply of memory buffers,
… to hand messages off to the application.
In effect, the whole "system" is well balanced.
But notice the hidden assumption: all of this requires that most communication be point-to-point… with high rates of multicast, it breaks down!

Slide 18

Multicast: wrench in the works

What happens when we use multicast heavily?
A NIC that on average received 1 out of k packets suddenly might receive many in a row (just thinking in terms of the "odds").
Hence it will see far more back-to-back packets.
But this stresses our speed limits: the NIC kept up with fast network traffic partly because it rarely needed to accept a packet, letting it match the fast and the slow sides… With high rates of incoming traffic we overload it.

Slide 19

Intuition: like a highway off-ramp

With a real highway, cars just end up in a jam.
With a high-speed optical net coupled to a slower NIC, packets are dropped by the receiver!

Slide 20

More NIC worries

The next issue relates to the implementation of multicast.
An Ethernet NIC actually is a pattern-match machine.
The kernel loads it with a list of {mask, value} pairs.
An incoming packet has a destination address; the NIC computes (dest & mask) == value and, if so, accepts the packet.
A NIC usually has 8 or 16 such pairs available.
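
The match logic on this slide is easy to picture as code. The fragment below is a minimal sketch of such a filter table in C; the structure, slot count, field names, and the promiscuous-mode fallback flag are illustrative assumptions, not a real driver interface.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical model of a NIC's hardware filter table: a small, fixed
     * number of {mask, value} pairs, as described above. */
    #define NIC_FILTER_SLOTS 16

    struct nic_filter {
        uint32_t mask[NIC_FILTER_SLOTS];   /* bits of the destination to test */
        uint32_t value[NIC_FILTER_SLOTS];  /* expected result of dest & mask  */
        size_t   nslots;                   /* how many slots are in use       */
        bool     promiscuous;              /* set once the table overflows    */
    };

    /* Accept a frame if any programmed pattern matches its destination,
     * or if the NIC has fallen back to promiscuous mode. */
    static bool nic_accepts(const struct nic_filter *f, uint32_t dest)
    {
        if (f->promiscuous)
            return true;                   /* accept everything; kernel sorts it out */
        for (size_t i = 0; i < f->nslots; i++)
            if ((dest & f->mask[i]) == f->value[i])
                return true;
        return false;
    }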

Slide 21

More NIC worries

If the set of patterns is full… the kernel puts the NIC into what we call "promiscuous" mode.
It starts to accept all incoming traffic.
Then the OS protocol stack makes sense of it: if a packet is not-for-me, ignore it.
But this requires an interrupt and work by the kernel.
All of which adds up to:
Sharply higher CPU costs (and slowdown due to cache/TLB effects).
A higher loss rate, because the more packets the NIC needs to receive, the more it will drop due to overrunning queues.

Slide 22

More NIC worries

We can see this effect in an experiment done by Yoav Tock at IBM Research in Haifa.
When many groups are used, CPU loads rise and the machine can't keep up with the traffic, so the O/S drops packets.

[Figure: once all the NIC filter slots are full, the costs of "filtering" load the CPU…]

Slide 23

What about the switch/router?

Modern data centers use a switched network architecture.
Question to ask: how does a switch handle multicast?

Slide 24

Concept of a Bloom filter

Goal of the router? Packet p arrives on port a; quickly decide which port(s) to forward it on.
Bit-vector filter approach:
Take the IPMC address of p, hash it to a value in some range like [0..1023].
Each output port has an associated bit vector… Forward p on each port with that bit set.
Bit vector -> Bloom filter: just do the hash multiple times, and test against multiple vectors. The packet must match in all of them (this reduces collisions).

Slide 25

Concept of a Bloom filter

So… take our class-D multicast address (233.0.0.0/8), say 233.17.31.129… hash it 3 times to a bit number.
Now look at outgoing link A:
Check bit 19 in [….0101010010000001010000010101000000100000….]
Check bit 33 in [….101000001010100000010101001000000100000….]
Check bit 8 in [….000000101010000001101010010000001010000..]
… all matched, so we relay a copy.
Next look at outgoing link B… match failed.
… etc.
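
The per-port test in this example is the classic Bloom-filter membership check. The sketch below shows that logic in C under simple assumptions: a 1024-bit vector per outgoing link and three hashes of the class-D address. The hash function itself is made up for illustration; real switches use their own hardware hashes.

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_BITS  1024          /* per-port bit vector, range [0..1023] */
    #define NUM_HASHES   3             /* hash the group address k times       */

    struct port_filter {
        uint8_t bits[FILTER_BITS / 8]; /* Bloom filter for one outgoing link   */
    };

    static bool test_bit(const struct port_filter *pf, unsigned idx)
    {
        return pf->bits[idx / 8] & (1u << (idx % 8));
    }

    /* Illustrative hash family: mix the class-D address with a per-hash seed. */
    static unsigned hash_addr(uint32_t ipmc_addr, unsigned seed)
    {
        uint32_t h = (ipmc_addr * 2654435761u) ^ (seed * 40503u);
        h ^= h >> 16;
        return h % FILTER_BITS;
    }

    /* Forward a packet on this port only if ALL k hashed bits are set.
     * False positives are possible (hash collisions); false negatives are not. */
    static bool should_forward(const struct port_filter *pf, uint32_t ipmc_addr)
    {
        for (unsigned k = 0; k < NUM_HASHES; k++)
            if (!test_bit(pf, hash_addr(ipmc_addr, k)))
                return false;
        return true;
    }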

Slide 26

What about the switch/router?

Modern data centers use a switched network architecture.
Question to ask: how does a switch handle multicast?

[Figure: a multicast packet p arriving at the switch and being forwarded on some output links but not others]

Slide 27

Aggressive use of multicast

Bloom filters "fill up" (all bits set) – not for a good reason, but because of hash conflicts.
Hence the switch becomes promiscuous: it forwards every multicast on every network link.
This amplifies the problems confronting the NIC, especially if the NIC itself is in promiscuous mode.

Slide 28

Worse and worse…

Most of these mechanisms have long memories.
Once an IPMC address is used by a node, the NIC tends to retain memory of it, and so does the switch, for a long time.
This is an artifact of a "stateless" architecture: nobody remembers why the IPMC address was in use.
An application can leave, but no "delete" will occur for a while.
The underlying mechanisms are lease-based: state is periodically "replaced" with fresh data (but not instantly).

Slide 29

…pulling the story into focus

We've seen that multicast loss phenomena can ultimately be traced to two major factors:
Modern systems have a serious rate mismatch vis-à-vis the network.
The multicast delivery pattern and routing mechanisms scale poorly.
A better Linux architecture needs to:
Allow us to cap the rate of multicasts.
Allow us to control which apps can use multicast.
Control allocation of a limited set of multicast groups.

Slide 30

Dr. Multicast (the MCMD)

Rx for your multicast woes.
Intercepts use of IPMC; does this by library interposition, exploiting a feature of DLL linkage.
Then it maps the logical IPMC address used by the application to either:
A set of point-to-point UDP sends, or
A physical IPMC address, for lucky applications.
Multiple groups can share the same IPMC address for efficiency.
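
On the send path, the mapping described above boils down to a two-way branch: one real multicast send, or a fan-out of unicast sends. The sketch below illustrates that branch and nothing more; the group_mapping structure and the mcmd_send name are hypothetical, not the actual MCMD code.

    #include <netinet/in.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Hypothetical per-group state kept by the interposition layer: a logical
     * group is either backed by a real (possibly shared) IPMC address, or by
     * the list of current members for point-to-point emulation. */
    struct group_mapping {
        bool has_physical;             /* true if an IPMC address was allocated */
        struct in_addr physical_addr;  /* the (possibly shared) IPMC address    */
        struct sockaddr_in *members;   /* unicast fallback: one entry per member */
        size_t nmembers;
        in_port_t port;                /* destination port, network byte order  */
    };

    /* Send one datagram to a logical group: one IPMC send, or N UDP sends. */
    static ssize_t mcmd_send(int sock, const struct group_mapping *g,
                             const void *buf, size_t len)
    {
        if (g->has_physical) {
            struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = g->port };
            dst.sin_addr = g->physical_addr;      /* lucky group: real multicast */
            return sendto(sock, buf, len, 0,
                          (struct sockaddr *)&dst, sizeof dst);
        }
        ssize_t last = -1;                        /* unlucky group: point-to-point */
        for (size_t i = 0; i < g->nmembers; i++)
            last = sendto(sock, buf, len, 0,
                          (struct sockaddr *)&g->members[i],
                          sizeof g->members[i]);
        return last;
    }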

Slide 31

Criteria used

Dr. Multicast has an "acceptable use policy".
Currently expressed as low-level, firewall-type rules, but could easily integrate with higher-level tools.
Examples:
Application such-and-such can/cannot use IPMC.
Limit the system as a whole to 50 IPMC addresses.
Can revoke IPMC permission rapidly in case of trouble.
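
The slides do not give a concrete rule format, but purely as a hypothetical illustration, a policy of this kind might be represented as a small table like the one below; the field names and example entries are invented, not the real syntax.

    /* Hypothetical representation of acceptable-use rules for Dr. Multicast. */
    struct aup_rule {
        const char *app_pattern;   /* which application(s) the rule covers  */
        int allow_ipmc;            /* may this application use IPMC at all? */
        int max_rate_pps;          /* per-application multicast rate cap    */
    };

    static const struct aup_rule example_policy[] = {
        { "market-data-feed", 1, 10000 },  /* latency-critical app: generous cap      */
        { "batch-*",          0, 0     },  /* batch jobs: mapped to point-to-point UDP */
        { "*",                1, 500   },  /* everyone else: allowed but tightly limited */
    };

    /* A system-wide cap, e.g. at most 50 physical IPMC addresses in use. */
    static const int max_ipmc_addresses = 50;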

Slide 32

How it works

[Figure: an application at the source node uses IPMC through the socket interface; the event is carried over the UDP multicast interface to each receiver (one of many)]

Slide 33

How it works

[Figure: the same picture as the previous slide, but with the UDP multicast interface replaced by some other multicast protocol, like Ricochet]

Slide 34

UDP multicast interface

Main UDP system calls:
socket() – creates a socket.
bind() – connects that socket to the UDP multicast distribution network.
sendmsg()/recvmsg() – send and receive data.
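
For reference, a minimal receiver built from exactly these calls might look as follows; the group address 233.1.2.3 and port 5410 are arbitrary examples, and the IP_ADD_MEMBERSHIP join is the standard way the kernel is told to program the NIC filter and announce group membership.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* socket(): create the UDP endpoint */
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        /* bind(): attach it to the port the group uses (example port 5410) */
        struct sockaddr_in local = { .sin_family = AF_INET,
                                     .sin_port = htons(5410),
                                     .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(sock, (struct sockaddr *)&local, sizeof local);

        /* Join the class-D group (example address) so the kernel programs
         * the NIC filter and sends the membership report. */
        struct ip_mreq mreq;
        inet_pton(AF_INET, "233.1.2.3", &mreq.imr_multiaddr);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

        /* recvmsg(): pull one datagram off the socket queue */
        char buf[1500];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
        ssize_t n = recvmsg(sock, &msg, 0);
        printf("received %zd bytes\n", n);

        close(sock);
        return 0;
    }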

Slide 35

Dynamic library "trick"

Using multiple dynamically linked libraries, we can intercept system calls.
The application used to link to, say, libc.so/libc.sa.
We add a library libx.so/libx.sa earlier on the lib search path.
It defines the socket interfaces, hence is linked in; but it also makes calls to __Xbind, __Xsendmsg, etc.
Now we add liby.so/liby.sa; these define __Xbind – it just calls bind – etc.
We get to either "handle" calls ourselves, in libx, or pass them to the normal libc "via" liby.
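
The two-library link-order scheme above is one way to interpose; a more common variant of the same idea is a single LD_PRELOAD library that looks up the real call with dlsym(RTLD_NEXT, ...). The sketch below takes that route for bind(); the mcmd_translate() hook is a hypothetical stand-in for whatever remapping Dr. Multicast would apply.

    /* Build as a shared object and run as: LD_PRELOAD=./libmcmd.so <application> */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <dlfcn.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    typedef int (*bind_fn)(int, const struct sockaddr *, socklen_t);

    /* Hypothetical hook: consult the group-to-IPMC mapping and rewrite the
     * class-D address (left as a do-nothing stub in this sketch). */
    static void mcmd_translate(struct sockaddr_in *addr) { (void)addr; }

    /* Our bind() shadows libc's: the application's calls resolve here first. */
    int bind(int fd, const struct sockaddr *addr, socklen_t len)
    {
        static bind_fn real_bind;
        if (!real_bind)
            real_bind = (bind_fn)dlsym(RTLD_NEXT, "bind");   /* find libc's bind */

        if (addr && addr->sa_family == AF_INET && len >= sizeof(struct sockaddr_in)) {
            struct sockaddr_in copy = *(const struct sockaddr_in *)addr;
            if (IN_MULTICAST(ntohl(copy.sin_addr.s_addr)))
                mcmd_translate(&copy);                       /* apply the MCMD mapping */
            return real_bind(fd, (struct sockaddr *)&copy, sizeof copy);
        }
        return real_bind(fd, addr, len);                     /* everything else: pass through */
    }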

Slide 36

Mimicry

Many options could mimic IPMC:
Point-to-point UDP or TCP, or even HTTP.
Overlay multicast.
Ricochet (adds reliability).
The MCMD can potentially swap any of these in under user control.

Slide 37

Optimization

The problem of finding an optimal group-to-IPMC mapping is surprisingly hard.
The goal is to have an "exact mapping" (apps receive exactly the traffic they should receive). Identical groups get the same IPMC address.
But we can also fragment some groups…
Should we give an IPMC address to A, to B, to AB?
Turns out to be NP-complete!

[Figure: two overlapping groups, A and B]

Slide 38

Greedy heuristic

Dr. Multicast currently uses a greedy heuristic.
It looks for big, busy groups and allocates IPMC addresses to them first.
Limited use of group fragmentation.
We've explored more aggressive options for fragmenting big groups into smaller ones, but the quality of the result is very sensitive to the properties of the pattern of group use.
The solution is fast, not optimal, but works well.
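
Under simple assumptions, the greedy idea amounts to: score each group by how much unicast fan-out a real address would save (for example, members times message rate), sort, and hand the limited pool of IPMC addresses to the top scorers, letting the rest fall back to point-to-point UDP. The code below illustrates exactly that; the scoring function and structures are assumptions, not the published MCMD heuristic.

    #include <stdlib.h>

    struct group {
        int members;          /* receivers in the group            */
        double msgs_per_sec;  /* observed send rate                */
        int has_ipmc;         /* output: 1 if given a real address */
    };

    /* Score: how much unicast fan-out traffic a physical IPMC address saves. */
    static double score(const struct group *g)
    {
        return g->members * g->msgs_per_sec;
    }

    static int by_score_desc(const void *a, const void *b)
    {
        double sa = score(a), sb = score(b);
        return (sa < sb) - (sa > sb);     /* descending order */
    }

    /* Greedy allocation: the biggest, busiest groups get the limited pool of
     * IPMC addresses first; everyone else is emulated with UDP sends. */
    static void allocate_ipmc(struct group *groups, size_t n, int budget)
    {
        qsort(groups, n, sizeof *groups, by_score_desc);
        for (size_t i = 0; i < n; i++)
            groups[i].has_ipmc = (int)i < budget;
    }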

Slide 39

Flow control

How can we address rate concerns?
A good way to avoid broadcast storms is to somehow impose an AUP of the type "at most xx IPMC/sec".
Two sides of the coin:
Most applications are greedy and try to send as fast as they can… but would work on a slower or more congested network. For these, we can safely "slow down" their rate.
But some need guaranteed real-time delivery, and currently we can't even specify this in Linux.

Slide 40

Flow control

Approach taken in Dr. Multicast: again, it starts with an AUP.
The AUP puts limits on the aggregate IPMC rate in the data center, and can exempt specific applications from rate limiting.
Next, senders in a group monitor the traffic in it. Conceptually, this happens in the network driver.
They use this to apportion the limited bandwidth.
Sliding scale: heavy users give up more.

Slide 41

Flow control

To make this work, the kernel send layer can delay sending packets…
… and to prevent the application from overrunning the kernel, delay the application.
For a sender using non-blocking mode, packets can be dropped if the sender side becomes overloaded.
This highlights a weakness of the standard Linux interface: there is no easy way to send "upcalls" notifying the application when conditions change, congestion arises, etc.
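
A standard way to enforce an "at most xx IPMC/sec" cap at the send layer is a token bucket, sketched below in user-space C purely as an illustration (the real mechanism lives in the kernel path and the AJIL module described on the next slide). A blocking sender would sleep and retry when allow_send() fails; a non-blocking sender would treat the failure as a drop.

    #include <stdbool.h>
    #include <time.h>

    /* Token bucket: tokens accumulate at `rate` per second up to `burst`;
     * each multicast send consumes one token. */
    struct token_bucket {
        double tokens;
        double rate;            /* allowed IPMC sends per second        */
        double burst;           /* maximum backlog of unused allowance  */
        struct timespec last;   /* time of the previous refill          */
    };

    static void refill(struct token_bucket *tb)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        double dt = (now.tv_sec - tb->last.tv_sec)
                  + (now.tv_nsec - tb->last.tv_nsec) / 1e9;
        tb->tokens += dt * tb->rate;
        if (tb->tokens > tb->burst)
            tb->tokens = tb->burst;
        tb->last = now;
    }

    /* Returns true if the send may go out now; false means delay or drop. */
    static bool allow_send(struct token_bucket *tb)
    {
        refill(tb);
        if (tb->tokens >= 1.0) {
            tb->tokens -= 1.0;
            return true;
        }
        return false;
    }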

Slide 42

The "AJIL" protocol in action

The protocol adds a rate-limiting module to the Dr. Multicast stack.
It uses a gossip-like mechanism to figure out the rate limits.
Work by Hussam Abu-Libdeh and others in my research group.

Slide 43

Fast join/leave patterns

Currently Dr. Multicast doesn't do very much if applications thrash by joining and leaving groups rapidly.
We have ideas on how to rate limit them, and it seems like it won't be hard to support.
The real question is: how should this behave?

Slide 44

End-to-end philosophy / debate

In the dark ages, the E2E idea was proposed as a way to standardize rules for what should be done in the network and what should happen at the endpoints.
In the network? Minimal mechanism, no reliability, just routing.
(The idea is that anything more costs overhead, yet the end points would need the same mechanisms anyhow, since the best guarantees will still be too weak.)
End points do security, reliability, flow control.

Slide 45

A religion… but inconsistent…

E2E took hold and became a kind of battle cry of the Internet community.
But they don't always stick with their own story:
Routers drop packets when overloaded.
TCP assumes this is the main reason for loss and backs down.
When these assumptions break down, as in wireless or WAN settings, TCP "out of the box" performs poorly.

Slide 46

E2E and Dr. Multicast

How would the E2E philosophy view Dr. Multicast?
On the positive side, the mechanisms being interposed operate mostly on the edges and under AUP control.
On the negative side, they are network-wide mechanisms imposed on all users.
The original E2E paper had exceptions; perhaps this falls into that class of things? E2E, except when doing something in the network layer brings a big win, costs little, and can't be done on the edges in any case…

Slide 47

Summary

Dr. Multicast brings a vision of a new world of controlled IPMC.
The operator decides who can use it, when, and how much.
The data center is no longer at risk of instability from malfunctioning applications.
Hence the operator lets IPMC in: trust (but verify, and if problems emerge, intervene).
This could reopen the door for use of IPMC in many settings.