A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

Sally Floyd, Van Jacobson, Ching-Gung Liu, Steven McCanne, and Lixia Zhang
To appear in IEEE/ACM Transactions on Networking, December 1997.

This paper describes SRM (Scalable Reliable Multicast), a reliable multicast framework for light-weight sessions and application level framing. The algorithms of this framework are efficient, robust, and scale well to both very large networks and very large sessions. The SRM framework has been prototyped in wb, a distributed whiteboard application, which has been used on a global scale with sessions ranging from a few to a few hundred participants. The paper describes the principles that have guided the SRM design, including the IP multicast group delivery model, an end-to-end, receiver-based model of reliability, and the application level framing protocol model. As with unicast communications, the performance of a reliable multicast delivery algorithm depends on the underlying topology and operational environment. We investigate that dependence via analysis and simulation, and demonstrate an adaptive algorithm that uses the results of previous loss recovery events to adapt the control parameters used for future loss recovery. With the adaptive algorithm, our reliable multicast delivery algorithm provides good performance over a wide range of underlying topologies.

S. Floyd and V. Jacobson are both with the Network Research Group, Lawrence Berkeley Laboratory, Berkeley CA, and S. McCanne is with the University of California, Berkeley CA (email: floyd, van@ee.lbl.gov, mccanne@eecs.berkeley.edu). S. Floyd, V. Jacobson, and S. McCanne were supported by the Director, Office of Energy Research, Scientific Computing Staff, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. Ching-Gung Liu is with the University of Southern California, Los Angeles, CA (email: charley@carlsbad.usc.edu). Lixia Zhang is with UCLA, Los Angeles, CA (email: lixia@cs.ucla.edu). Ching-Gung Liu and Lixia Zhang were supported in part by the Advanced Research Projects Agency, monitored by Fort Huachuca under contract DABT63-94-C-0073. An earlier version of this paper appeared in ACM SIGCOMM 95.

1 Introduction

Several researchers have proposed generic reliable multicast protocols, much as TCP is a generic transport protocol for reliable unicast transmission. In this paper we take a different view: unlike the unicast case where requirements for reliable, sequenced data delivery are fairly general, different multicast applications have widely different requirements for reliability. For example, some applications require that delivery obey a total ordering while many others do not. Some applications have many or all the members sending data while others have only one data source. Some applications have replicated data, for example in an n-redundant file store, so several members are capable of transmitting a data item while for others all data originates at a single source. These differences all affect the design of a reliable multicast protocol. Although one could design a protocol for the worst-case requirements, e.g., guaranteeing totally ordered delivery of replicated data from a large number of sources, such an approach results in substantial overhead for applications with more modest requirements. One cannot make a single reliable multicast delivery scheme that optimally meets the functionality, scalability, and efficiency requirements of all applications.

The weakness of "one size fits all" protocols has long been recognized. In 1990 Clark and Tennenhouse proposed a new protocol model called Application Level Framing (ALF) which explicitly includes an application's semantics in the design of that application's protocol [6]. ALF was later elaborated with a light-weight rendezvous mechanism based on the IP multicast distribution model, and with a notion of receiver-based adaptation for unreliable, real-time applications such as audio and video conferencing. The result, known as Light-Weight Sessions (LWS) [19], has been very successful in the design of wide-area, large-scale, conferencing applications.

This paper further evolves the principles of ALF and LWS to add a framework for Scalable Reliable Multicast (SRM).

ALF says that the best way to meet diverse application requirements is to leave as much functionality and flexibility as possible to the application. Therefore SRM is designed to meet only the minimal definition of reliable multicast, i.e., eventual delivery of all the data to all the group members, without enforcing any particular delivery order. We believe that if the need arises, machinery to enforce a particular delivery order can be easily added on top of this reliable delivery service.

It has been argued [36, 34] that a single dynamically configurable protocol should be used to accommodate different application requirements. The ALF argument is even stronger: not only do different applications require different types of error recovery, flow control, and rate control mechanisms, but further, these mechanisms must explicitly account for the structure of the underlying application data itself.

SRM is also heavily based on the group delivery model that is the centerpiece of the IP multicast protocol [8]. In IP multicast, data sources simply send to the group's multicast address (a normal IP address chosen from a reserved range of addresses) without needing any advance knowledge of the group membership. To receive any data sent to the group, receivers simply announce that they are interested (via a "join" message multicast on the local subnet); no knowledge of the group membership or active senders is required. Each receiver joins and leaves the group individually, without affecting the data transmission to any other member. SRM further enhances the multicast group concept by maximizing information and data sharing among all the members, and strengthens the individuality of membership by making each member responsible for its own correct reception of all the data.

Finally, SRM attempts to follow the core design principles of TCP/IP. First, SRM requires only the basic IP delivery model (best-effort with possible duplication and reordering of packets) and builds reliability on an end-to-end basis. No change or special support is required from the underlying IP network.
Second, in a fashion similar to TCP adaptively setting timers or congestion control windows, the algorithms in SRM dynamically adjust their control parameters based on the observed performance within a session. This allows applications using the SRM framework to adapt to a wide range of group sizes, topologies and link bandwidths while maintaining robust and high performance.

Wb, the distributed whiteboard tool designed and implemented by McCanne and Jacobson [17, 23], is the first application based on the SRM framework. In this paper we discuss wb in some detail, to illustrate the use of SRM in a specific application.

The paper proceeds as follows: Section 2 discusses general issues for reliable multicast delivery. Section 3 describes the SRM framework, and discusses the wb instantiation of this framework. Section 4 discusses the performance of SRM in simple topologies such as chains, stars, and bounded-degree trees, and Section 5 presents simulation results from more complex topologies. Section 6 examines the behavior of the loss recovery algorithm in SRM as a function of the timer parameters. Section 7 discusses extensions to the basic reliable multicast framework, such as adaptive algorithms for adjusting the timer parameters and algorithms for local recovery. Section 8 discusses related work on reliable multicast.

Section 9 discusses future work on SRM.

2 The design of reliable multicast

2.1 Reliable data delivery: adding the word "multicast"

The problem of reliable unicast data delivery is well understood and a variety of well-tested solutions are available. However, for the reliable transmission of data to a potentially large group of receivers, multicast transmission offers the most promising approach. If a sender were to open N separate unicast TCP connections to N different receivers, then N copies of each packet might have to be sent over links close to the sender, making poor use of the available bandwidth. In addition, the sender would have to keep track of the status of each of the receivers. Multicast delivery permits much more efficient use of the available bandwidth, with at most one copy of each packet sent over each link in the absence of dropped packets. In addition, IP multicast allows the sender to send to a group without having to have any knowledge of the group membership.

At the same time, adding "multicast" to the data transport problem significantly changes the solution set for reliable delivery. For example, in any reliable protocol some party must take responsibility for loss detection and recovery. Because of the "fate-sharing" implicit in unicast communication, i.e., the data transmission fails if either of the two ends fails, either the sender or receiver can take on this role. In TCP, the sender times transmissions and keeps retransmitting until an acknowledgment is received. NETBLT [7] uses the opposite model and makes the receiver responsible for all loss detection and recovery. Both approaches have been shown to work well for unicast.

However, if a TCP-style, sender-based approach is applied to multicast distribution, a number of problems occur. First, because data packets trigger acknowledgments (positive or negative) from all the receivers, the sender is subject to the well-known ACK implosion effect [10]. Also, if the sender is responsible for reliable delivery, it must continuously track the changing set of active receivers and the reception state of each. Since the IP multicast model deliberately imposes a level of indirection between senders and receivers (i.e., data is sent to the multicast group, not to the set of receivers), the receiver set may be expensive or impossible to obtain. Finally, the algorithms that are used to adapt to changing network conditions tend to lose their meaning in the case of multicast. E.g., how should the round-trip time estimate for a retransmit timer be computed when there may be several orders of magnitude difference in propagation time to different receivers? What is a congestion window if the delay-bandwidth product to different receivers varies by orders of magnitude? What self-clocking information exists in the ACK stream(s) if some receivers share one bottleneck link and some another?

These problems illustrate that single-point, sender-based control does not adapt or scale well for multicast delivery. Since members of a multicast group have different communication paths and may come and go at any time, the "fate-shared" coupling of sender and receiver in unicast transmissions does not generalize to multicast. Thus it is clear that receiver-based reliability is a far better building block for reliable multicast [33].

Another unicast convention that migrates poorly to multicast has to do with the vocabulary used by the sender and receiver(s) to describe the progress of their communication. A receiver can request a retransmission either in application data units ("sector 5 of file sigcomm-slides.ps") or in terms of the shared communication state ("sequence numbers 2560 to 3071 of this conversation"). Both models have been used successfully (e.g., NFS uses the former and TCP the latter) but, because the use of communication state for naming data allows the protocol to be entirely independent of any application's namespace, it is by far the most popular approach for unicast applications. However, since multicast transmission tends to have much weaker and more diverse state synchronization than does unicast, using shared communication state to name data does not work well in the multicast case.

For example, if a receiver joins a conversation late and receives sequence numbers 2560 to 3071, it has no idea of what's been missed (since the sender's starting number is arbitrary) and so can neither do anything useful with the data nor make an intelligent request for retransmission. If receivers hear from a sender again after a lengthy network partition, they have no way of knowing whether "2560" is a retransmission of data they received before the partition or is completely new (due to sequence number wrapping during the partition). Thus the "naming in application data units (ADUs)" model works far better for multicast.

Use of this model also has two beneficial side effects. As Clark and Tennenhouse [6] point out, a separate protocol namespace can impose delays and inefficiencies on an application, e.g., TCP will only deliver data in sequence even though a file transfer application might be perfectly happy to receive sectors in any order. The ADU model eliminates this delay and puts the application back in control. Also, since ADU names can be made independent of the sending host, it is possible to use the anonymity of IP multicast to exploit the redundancy of multiple receivers. E.g., if some receiver asks for a retransmit of "sigcomm-slides.ps sector 5", any member who has a copy of the data, not just the original sender, can carry out the retransmission.

2.2 Reliable multicast requirements

While the ALF model says that applications should be actively involved in their communications and that communication should be done in terms of ADUs rather than some generic protocol namespace, we do not claim that every application's protocol must be completely different from every other's or that there can be no shared design or code. A great deal of design commonality is imposed simply because different applications are attempting to solve the same problem: scalable, reliable, multipoint communication over the Internet. As Section 2.1 pointed out, just going from unicast to multicast greatly limits the viable protocol design choices. In addition, experience with the Internet has shown that successful protocols must accommodate many orders of magnitude variation in every possible dimension. While several algorithms meet the constraints of Section 2.1, very few of them continue to work if the delay, bandwidth and user population are all varied by factors of 1000 or more.

In the end we believe the ALF model results in a framework that is then filled in with application specific details. Portions of the SRM framework are completely determined by network dynamics and scaling considerations and apply to any application. For example, the scalable request and repair algorithms described in Sections 3 through 7 are completely generic and apply to a wide variety of reliable multicast applications. Each different application supplies this reliability framework with a namespace to talk about what data has been sent and received; a policy and machinery to determine how much bandwidth is available to the group as a whole; a policy to determine how the available bandwidth should be apportioned between the participants in the group; and a local send policy that a participant uses to arbitrate the different demands on its bandwidth (e.g., locally originated data, requests and responses, etc.). It is the intent of this paper to describe the framework common to scalable, reliable multicast applications.

In particular, this paper focuses on reliability rather than on congestion control. We believe that for multicast applications, the congestion control mechanisms will have to take into account application-specific needs and capabilities.

To make the SRM framework concrete, we first describe a widely used application, wb, the LBNL network whiteboard, that has been implemented according to the SRM framework. One component of wb is an application-level reliable multicast protocol that is the precursor to SRM. However, the goal of this paper is not to explore the specifics of wb, but to use wb to illustrate the underlying reliable multicast framework. After mentioning some details of wb's operation that are direct results of the design considerations outlined in Section 2.1, we then factor out the wb specifics to expose the generic SRM framework underneath. The remaining sections of this paper are an exploration of that framework.

2.3 Wb's assumptions about reliable multicast

This section briefly describes wb, a network conferencing tool that provides a distributed whiteboard, and explores some of the assumptions made in wb's use of reliable multicast.

Wb separates the drawing into pages, where a new page can correspond to a new viewgraph in a talk or the clearing of the screen by a member of a meeting. Any member can create a page and any member can draw on any page. There are floor control mechanisms, largely external to wb, that can be used if necessary to control who can create or draw on pages. These can be combined with normal Internet privacy mechanisms (e.g., symmetric-key encryption of all the wb data) to limit participation to a particular group and/or with normal authentication mechanisms (e.g., participants signing their drawing operations via public-key encryption of a cryptographic hash over the data) [24, 18]. Each member is identified by a globally unique identifier, the Source-ID, and each page is identified by a Page-ID consisting of the Source-ID of the initiator of the page and a page number locally unique to that initiator. Each member drawing on the whiteboard produces a stream of drawing operations, or "drawops", that are timestamped and assigned sequence numbers, relative to the sender. Each sequence of drawops is sent with the Page-ID of the relevant page. An example would be a drawop to draw a blue line at a particular set of coordinates on a page.

Wb has no requirement for ordered delivery because most drawing operations are idempotent and are rendered immediately upon receipt; out of order drawops are sorted upon arrival according to their timestamps. Each member's graphics stream is thus independent from that of other sites. Operations that are not strictly idempotent, such as a "delete" that references an earlier drawop, can be patched after the fact, when the missing data arrives.

The following assumptions are made in wb's reliable multicast design:

- All data has a unique, persistent name. This global name consists of the end host's Source-ID and a locally-unique sequence number.
- The name always refers to the same data. It is impossible to achieve consistency among different receivers in the face of late arrivals and network partitions if, say, drawop "floyd:5" initially means to draw a blue line and later means to draw a red circle. This does not mean that the drawing can't change, only that drawops must effect the change. E.g., to change a blue line to a red circle, a "delete" drawop for "floyd:5" is sent, then a drawop for the circle is sent.
- Source-IDs are persistent. A user will often quit a session and later re-join, obtaining the session's history from the network. By ensuring that Source-IDs are persistent across invocations of the application, the user maintains ownership of any data created before quitting.
- IP multicast datagram delivery is available.
- All participants join the same multicast group; there is no distinction between senders and receivers.

3 The SRM framework

SRM is the reliable multicast framework intended for a range of applications that share wb's assumptions above, including that of IP multicast datagram delivery. One assumption central to SRM is that the data has unique, persistent names. An open research challenge is to design a data naming scheme that reflects the flexibility of ALF yet allows the SRM framework to manipulate names in a generic fashion. A second assumption is that the application naming conventions allow us to impose a hierarchy over the name space. For the rest of this paper we assume that the data space is subdivided into groups or containers that we call "pages", and that the locally unique name is a simple sequence number with sufficient precision to never wrap. (The term "page" refers to a general concept even though it reflects our whiteboard-biased design.)

Whenever a member generates new data, the data is multicast to the group. Each member of the group is individually responsible for detecting loss, generally by detecting a gap in the sequence space, and requesting retransmission. However, since it is possible that the last object of a sequence is dropped, each member multicasts low-rate, periodic, session messages that announce the highest sequence number received from every member for the current page. In addition to the reception state, the session messages contain timestamps that are used to estimate the distance (in time) from each member to every other (described in Section 3.1).

To prevent the implosion of control packets sent from receivers in a multicast group, receivers in the Xpress Transport Protocol (XTP) design [36] multicast control packets to the entire group. Using the slotting and damping mechanisms in the XTP design, receivers wait for a random time before sending a control packet, and refrain from sending a control packet if they see a control packet from another receiver with the same information. SRM uses similar mechanisms to control the sending of request and repair packets, with the addition that in the SRM design, the random delay before sending a request or repair packet is a function of that member's distance in seconds from the node that triggered the request or repair. The timer calculations are described in detail in Section 3.2.

As with the original data, repair requests and retransmissions are always multicast to the whole group. Thus, although a number of hosts may all miss the same packet, a host close to the point of failure is likely to timeout first and multicast the request. Other hosts that are also missing the data hear that request and suppress their own request. Any host that has a copy of the requested data can answer a request.
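As a minimal sketch of the gap-based loss detection described above, the following Python class tracks the reception state for one source on one page. The class and method names are hypothetical, and the sketch assumes that sequence numbers start at 0 and never wrap. It only reports which sequence numbers would need a repair request; scheduling and suppressing the requests themselves is left to the timer mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class ReceptionState:
    """Reception state for one source on one page (hypothetical structure)."""

    highest_known: int = -1  # highest sequence number received or announced
    received: set = field(default_factory=set)

    def on_data(self, seq: int) -> list:
        """Record a data packet; return sequence numbers newly detected as missing."""
        # The range is empty unless this packet skips ahead of what we knew about.
        gap = list(range(self.highest_known + 1, seq))
        self.received.add(seq)
        self.highest_known = max(self.highest_known, seq)
        return gap

    def on_session_report(self, highest_announced: int) -> list:
        """Handle a session message, which reveals losses at the tail of a sequence."""
        tail = list(range(self.highest_known + 1, highest_announced + 1))
        self.highest_known = max(self.highest_known, highest_announced)
        return tail

    def missing(self) -> list:
        """All sequence numbers still outstanding (candidates for repair requests)."""
        return [s for s in range(self.highest_known + 1) if s not in self.received]
```

For instance, after packets 0 and 1 have arrived, `on_data(4)` reports `[2, 3]` as missing, and a later session message announcing sequence number 5 reports `[5]`, which is the case of a dropped final packet that a gap check alone cannot detect.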

A host answering a request sets a repair timer and multicasts the repair when the timer goes off. Other hosts that had the data and scheduled repairs will cancel their repair timers when they hear the multicast from the first host. This does not require that all session members keep all of the data all of the time; reliable data delivery is ensured as long as each data item is available from at least one member. Ideally, a lost packet triggers only a single request from a host just downstream of the point of failure and a single repair from a host just upstream of the point of failure. Section 4 explores in more detail the number of requests and repairs in different topologies.

3.1 Session messages

In SRM, each member multicasts periodic session messages that report the sequence number state for active sources. Session messages for reliable multicast [10] have been previously proposed to enable receivers to detect the loss of the last packet in a burst, and to enable the sender to monitor the status of receivers. Members can also use session messages in SRM to determine the current participants of the session. The average bandwidth consumed by session messages is limited to a small fraction (e.g., 5%) of the aggregate data bandwidth, whether pre-allocated by a reservation protocol or measured adaptively by a congestion control algorithm. SRM members use the algorithm developed for vat and described in [32] for dynamically adjusting the generation rate of session messages in proportion to the multicast group size.

In a large, long-lived session, the state would become unmanageable if each receiver had to report the sequence numbers of everyone who had ever sent data to the group. To prevent this explosion, we impose a hierarchy on the data by partitioning the state space into 'pages'. Each member only reports the state of the page it is currently viewing. A receiver browsing over previous pages may issue page requests to learn the sequence number state for that page. If a receiver joins late, it may issue page requests to learn the existence of previous pages. We omit the details of the page state recovery protocol as it is almost identical to the repair request/response protocol for data.

In addition to state exchange, receivers use the session messages to estimate the one-way distance between nodes. All packets for that group, including session packets, include a Source-ID and a timestamp. The session packet timestamps are used to estimate the host-to-host distances needed by the repair algorithm.

The timestamps are used in a highly simplified version of the NTP time synchronization algorithm [26]. Assume that host A sends a session packet $P_1$ at time $t_1$ and host B receives $P_1$ at time $t_2$. At some later time, $t_3$, host B generates a session packet $P_2$ marked with $(t_1, \Delta)$, where $\Delta = t_3 - t_2$ (time $t_1$ is included in $P_2$ to make the algorithm robust to lost session packets). Upon receiving $P_2$ at time $t_4$, host A can estimate the latency from host B to host A as $(t_4 - t_1 - \Delta)/2$, or equivalently as $((t_4 - t_3) + (t_2 - t_1))/2$. Note that while this estimate does not assume synchronized clocks, it does assume that paths are roughly symmetric. We have not yet explored the performance of these algorithms in topologies with strong asymmetry in the one-way delays of forward and reverse paths.

3.2 Loss recovery

This section describes SRM's loss recovery algorithm, which provides the foundation for reliable delivery. Section 7.1 describes a modified version of this algorithm with an adaptive adjustment of the timer parameters. Section 7.2 discusses the local recovery algorithms that would be a critical component of SRM for efficient operation in large multicast groups in a congested environment.

In SRM, members who detect a loss wait a random time and then multicast their repair request, to suppress requests from other members sharing that loss. These repair requests differ from traditional negative acknowledgements (NACKs) in two respects: they are not addressed to a specific sender, and they request data by its unique, persistent name.

When host A detects a loss, it schedules a repair request for a random time in the future. When the request timer expires, host A multicasts a request for the missing data, and doubles the request timer to wait for the repair.

In SRM, the interval over which the request timer is set is a function of the member's estimated distance to the source of the packet. The request timer is chosen from the uniform distribution on $[C_1 d_{S,A}, (C_1 + C_2) d_{S,A}]$ seconds, where $d_{S,A}$ is host A's estimate of the one-way delay to the original source S of the missing data. The numbers $C_1$ and $C_2$ are parameters of the request algorithm that are discussed at length later in the paper. If host A receives a request for the missing data before its own request timer for that data expires, then host A does a (random) exponential backoff, and resets its request timer. That is, if the current timer had been chosen from the uniform distribution on $2^i [C_1 d_{S,A}, (C_1 + C_2) d_{S,A}]$, then the backed-off timer is randomly chosen from the uniform distribution on $2^{i+1} [C_1 d_{S,A}, (C_1 + C_2) d_{S,A}]$.

When some other host B (where B may be the original source S) receives a request from A that host B is capable of answering, host B sets a repair timer to a value from the uniform distribution on $[D_1 d_{A,B}, (D_1 + D_2) d_{A,B}]$ seconds, where $d_{A,B}$ is host B's estimate of the one-way delay to host A, and the numbers $D_1$ and $D_2$ are parameters of the repair algorithm discussed later in the paper. If host B receives a repair for the missing data before its repair timer expires, then host B cancels its repair timer. Otherwise, when host B's repair timer expires host B multicasts the repair. In keeping with the philosophy that the receiver is responsible for ensuring its own correct reception of the data, host B does not verify whether host A actually receives the repair.

Due to the probabilistic nature of these algorithms, it is not unusual for a dropped packet to be followed by more than one request. When two or more hosts generate a request for the same data at roughly the same time, we have redundant control traffic (i.e., wasted bandwidth) and the colliding participants should increase the spread in their retransmission distribution to avoid similar collisions in the future.

Some care is required in deciding when to back off an already backed-off timer. In our simulator we use a heuristic to detect requests that belong to the same iteration of loss recovery. When member A backs off the request timer, then member A sets an ignore-backoff variable to a time halfway between the current time and the expiration time, and ignores additional duplicate requests until ignore-backoff time. Requests received before the ignore-backoff time are assumed to belong to the same iteration of the loss recovery as the request that resulted in the most recent backoff. A request received after the ignore-backoff time is assumed to belong to the next iteration, and causes member A to again back off its request timer.

Because there can be more than one request, a host could receive a duplicate request immediately after sending a repair, or immediately after receiving a repair in response to its own earlier request. In order to prevent duplicate requests from triggering a responding set of duplicate repairs, host B ignores requests for data D for $3 d_{S,B}$ seconds after sending or receiving a repair for that data, where host S is either the original source of data D or the source of the first request.
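The timer rules of this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the wb or simulator code: the function names are hypothetical, the delay estimate assumes the NTP-style halving of the round trip measured from session-packet timestamps, and the request and repair timers are assumed to be drawn uniformly from intervals that scale with the estimated one-way distance, with the request interval doubling on each backoff. The arguments `c1`, `c2`, `d1`, `d2` stand for the request and repair parameters and carry no recommended values.

```python
import random


def estimate_one_way_delay(t1: float, delta: float, t4: float) -> float:
    """Half of the round trip, excluding the time the peer held the packet.

    t1: our send time as echoed back by the peer; delta: the peer's holding
    time before it replied; t4: our receive time. Assumes symmetric paths.
    """
    return (t4 - t1 - delta) / 2.0


def request_interval(c1, c2, dist, backoffs=0):
    """Bounds in seconds of the request timer after `backoffs` doublings."""
    scale = 2 ** backoffs
    return scale * c1 * dist, scale * (c1 + c2) * dist


def repair_interval(d1, d2, dist):
    """Bounds in seconds of the repair timer for a requester at distance `dist`."""
    return d1 * dist, (d1 + d2) * dist


def draw_timer(bounds, rng=random):
    """Pick a timer value uniformly from the (low, high) bounds."""
    low, high = bounds
    return rng.uniform(low, high)


def ignore_backoff_until(now, expiry):
    """Halfway point before which duplicate requests cause no further backoff."""
    return now + (expiry - now) / 2.0
```

A host at an estimated distance of 0.5 seconds from the source, with both request parameters set to 2 purely for illustration, would draw its first request timer from 1 to 2 seconds and, after one backoff, from 2 to 4 seconds. Scaling the interval by distance is what lets hosts nearer the point of loss fire first and suppress the others.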

3.3 Congestion control

The simplest congestion control mechanism for SRM would be for all members of the multicast group to assume a fixed bandwidth constraint over the aggregate session. This would be appropriate, for example, if members of the multicast group used an out-of-band mechanism (e.g., explicit bandwidth reservations, or the informal, consensus-based procedures of the current Mbone) to verify bandwidth availability. However, different congestion control mechanisms are likely to be required for different applications and different contexts. Congestion control mechanisms for SRM are discussed further in Section 9.3.

Because data represents idempotent operations, loss recovery can proceed independently from the transmission of new data. Similarly, recovery for losses from two different sources can also proceed independently. Since transmission bandwidth is often limited, a single transmission rate is allocated to control the throughput across all these different modes of operation, while the application determines the order of packet transmission according to their relative importance.

3.4 Network partitioning and other concerns

Because SRM relies on the underlying concept of an IP multicast group, where members can arrive and depart independently, SRM does not distinguish a network partition from a normal departure of members from the multicast session. During a partition, members can continue to send data in the connected components of the partitions. After recovery, all data will still have unique names and the repair mechanism will distribute any new state throughout the entire group.

For applications that may require partial or total data ordering, the SRM framework could be used to reliably deliver the data to all group members, and a partial or total ordering protocol could be built on top that is specifically tailored to the ordering needs of that application. Ordering is further complicated by disagreements about how the ordering itself should be defined: Cheriton and Skeen [5] have argued (and Birman [1] has rebutted) that for applications with ordering requirements, preserving the ordering of messages as they appear in the network can be an expensive and inadequate substitute for preserving the "semantic ordering" of the messages appropriate for the application.

Potential applications for SRM other than wb, including routing protocol updates, Usenet news, and adaptive web caches, are discussed briefly in [12, 13].
Page 6
3.5 Wb's instantiation of SRM

This section describes both the design and the current state of the implementation of reliable multicast for wb. As discussed below, the rate-control mechanism and the estimates of one-way delay are aspects of the design that are not yet included in the current implementation of wb.

In the present implementation of wb (version 1.59), members set a request timer to a random value from the interval $[c, 2c]$, where $c$ is set to a fixed value of 30 msec. The estimation of the distance to other members has not yet been included in the current implementation. Similarly, after receiving a request, members set a repair timer to a random value from the interval $[d, 2d]$. For the original source of the data, $d$ is set to a fixed value of 100 msec, and for other members $d$ is set to 200 msec. These fixed values for $c$ and $d$ were chosen after examination of traces taken over several typical wide-area wb sessions. The current values for $c$ and $d$ are sufficiently large to ensure that there is generally only one request and one repair. When the original source of the data is still on-line, the repair generally comes from that original source.

The current implementation of wb relies on the informal, consensus-based "admissions-control procedure" of the current Mbone. The congestion control mechanism in the design for wb assumes a fixed maximum bandwidth allocation for each session. In this design, each wb session would have a sender bandwidth limit advertised as part of the session announcement, and individual members would use a token bucket rate limiter to enforce this peak rate on transmissions. As of the writing of this paper, this rate control mechanism has not yet been added to the wb implementation. In practice, wb sessions generally use considerably less average bandwidth than their accompanying audio sessions. However, the need for this rate control can at times be made painfully obvious, for example, when new members join a session and ask for back history.

One application-specific issue concerns the relative priorities between sending new data, requests, and repairs. When a member of a wb session is able to send a packet, the highest priority goes to requests or repairs for the current page, middle priority to new data, and lowest priority to requests or repairs for previous pages.

One issue that has been made obvious from implementation experience has been the persistence of the data. Wb does not necessarily store all of the data on backup storage on disk; data for current pages is kept only in memory. If data somehow becomes corrupt, either due to internal application bugs or because of external system failures, it can spread like a virus throughout the wb session. When the corrupted data is used to answer repair requests, the corrupted data is distributed throughout the multicast group, and persists for the life of the session. To avoid this, each piece of data can be accompanied by a tag that not only authenticates the source of the data but also verifies its integrity.

4 Request/repair algorithms for simple topologies

We now turn to a more detailed investigation of the loss recovery algorithms in SRM. Because multiple hosts may detect the same losses, and multiple hosts may attempt to handle the same repair request, the goal of the request/repair timer algorithms is to de-synchronize host actions to keep the number of duplicates low. Among hosts that have diverse delays to other hosts in the same group, this difference in delay can be used to differentiate hosts; for hosts that have similar delays to reach others, we can only rely on randomization to de-synchronize their actions.

This section discusses a few simple, yet representative, topologies, namely chain, star, and tree topologies, to provide a foundation for understanding the loss recovery algorithms in more complex environments. For a chain the essential feature of a loss recovery algorithm is that the timer value is a function of distance. For a star topology the essential feature of the loss recovery algorithm is the randomization used to reduce implosion. Request/repair algorithms in a tree combine both the randomization and the setting of the timer as a function of distance. This section shows that the performance of the loss recovery algorithms depends on the underlying network topology.

4.1 Chains

Figure 1 shows a chain topology where all nodes in the chain are members of the multicast session. Each node in the underlying multicast tree has degree at most two. The chain is an extreme topology where a simple deterministic loss recovery algorithm suffices. In this section we assume that the timer parameters $C_1$ and $D_1$ are set to 1, and that $C_2$ and $D_2$ are set to 0. This results in request timers set deterministically to $d_{S,A}$ and repair timers set to $d_{A,B}$. For the chain, as in most of the other scenarios in this paper, link distance and delay are both normalized. We assume that packets take one unit of time to travel each link, i.e., all links have distance of 1.

[Figure 1: A chain topology. Nodes, left to right: L(j+1), Lj, ..., L2, L1, R1, R2, ..., Rk. The legend marks the source of the dropped packet and the congested edge.]

In Figure 1 the nodes in the chain are labeled as either to the right or to the left of the congested link. Assume that source $L_j$ multicasts a packet that is subsequently dropped on link $(L_1, R_1)$, and that the second packet sent from source $L_j$ is not dropped. We call the edge that dropped the packet, whether due to congestion or to other problems, the congested link. Let the right-hand nodes each detect the failure when they receive the second packet from $L_j$. Let node $R_1$ first detect the loss at time $t$, and let each link have distance 1. Then node $R_1$ multicasts a request at time $t + j$.
Page 7
Node $L_1$ receives the request at time $t + j + 1$ and multicasts a repair at time $t + j + 2$. Node $R_k$ receives the repair at time $t + j + k + 2$. Note that all nodes to the right of node $R_1$ receive the request from $R_1$ before their own request timers expire. We call this deterministic suppression.
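The timeline above can be replayed with a few lines of arithmetic. The following is a sketch under the section's assumptions only: unit-delay links, a request timer equal to `c1` times the distance to the source, a repair timer equal to `d1` times the distance to the requester, and no random component. The function names and dictionary keys are hypothetical.

```python
def chain_recovery_times(j, k, t=0.0, c1=1.0, d1=1.0):
    """Event times for a chain with unit-delay links.

    The source is j hops from the first node past the congested link (R_1);
    R_k is the furthest right-hand node, k hops from L_1.
    """
    request_sent = t + c1 * j             # R_1's timer: c1 * dist(R_1, source)
    request_at_l1 = request_sent + 1      # one hop back across the congested link
    repair_sent = request_at_l1 + d1 * 1  # L_1's timer: d1 * dist(L_1, R_1)
    repair_at_rk = repair_sent + k        # L_1 to R_k is k hops
    # Unicast alternative: R_k detects the loss at t + (k - 1), then needs a
    # full round trip to the source at distance j + k - 1.
    unicast_repair_at_rk = t + (k - 1) + 2 * (j + k - 1)
    return {
        "request_sent": request_sent,
        "request_at_l1": request_at_l1,
        "repair_sent": repair_sent,
        "repair_at_rk": repair_at_rk,
        "unicast_repair_at_rk": unicast_repair_at_rk,
    }


def request_suppressed(j, m, t=0.0, c1=1.0):
    """True if R_m (m >= 2) hears R_1's request before its own timer expires."""
    detect = t + (m - 1)                   # the next packet reaches R_m later
    own_timer = detect + c1 * (j + m - 1)  # c1 * dist(R_m, source)
    hears_request = t + c1 * j + (m - 1)   # R_1's request travels m - 1 hops
    return hears_request < own_timer
```

With positive `c1`, `request_suppressed` holds for every `m >= 2`, which is the deterministic suppression described in the text.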

The reader can verify that, due to deterministic suppression, there will be only one request and one repair. For example, node $R_k$ detects the loss at time $t + k - 1$, sets its request timer for time $t + j + 2k - 2$, and receives the request from node $R_1$ at time $t + j + k - 1$, well before its own request timer expires. Had the loss repair been done by unicast, i.e., had node $R_k$ sent a unicast request to the source $L_j$ as soon as it detected the failure and had $L_j$ sent a unicast repair to $R_k$ as soon as it received the request, node $R_k$ would not receive the repair until time $t + 2j + 3(k-1)$. Thus, with a chain and with a simple deterministic loss recovery algorithm, the furthest node receives the repair sooner than it would if it had to rely on its own unicast communication with the original source, because both the request and the repair come from nodes immediately adjacent to the congested link.

4.2 Stars

For the star topology in Figure 2 we assume that all links are identical and that the center node is not a member of the multicast group. For a star topology, setting the request timer as a function of the distance from the source is not an essential feature, as all nodes detect a loss at exactly the same time. Instead, the essential feature of the loss recovery algorithm in a star is the randomization used to reduce implosion; we call this probabilistic suppression.

[Figure 2: A star topology. Nodes N1, N2, N3, N4, N5, N6, ..., Ng. The legend marks the source of the dropped packet and the congested edge.]

For the star topology in Figure 2 assume that the first packet from node $N_1$ is dropped on the adjacent link. There are $G$ members of the multicast session, and the other $G-1$ members detect the loss at exactly the same time. For the discussion of this topology we assume that the timer parameters $C_1$ and $D_1$ are set to 0; because all nodes detect losses and receive requests at the same time, $C_1$ and $D_1$ are not needed to amplify differences in delay. For a star topology the only benefits in setting $C_1$ greater than 0 are to avoid unnecessary requests from out-of-order packets and to ensure a minimum delay when a request timer is backed off.

If $C_2$ is at most 1, then there will always be $G-1$ requests. Increasing $C_2$ reduces the expected number of requests but increases the expected time until the first request is sent. For $C_2 > 1$, the expected number of requests is roughly $1 + (G-2)/C_2$, and the expected delay until the first timer expires is $2C_2/G$ seconds (where one unit of time is one second). For example, if $C_2$ is set to $\sqrt{G}$, then the expected number of requests is roughly $\sqrt{G}$, and the expected delay until the first timer expires is $2/\sqrt{G}$ seconds. Note that if a member other than $N_1$ were the source of the dropped packet (with the congested link unchanged), then $N_1$ would be the only node to send a request, and the other session members would receive the request at the same time. The same remarks as above would then apply to $D_2$ with respect to repairs.

4.3 Bounded-degree trees

The loss recovery performance in a tree topology uses both the deterministic suppression described for chain topologies and the probabilistic suppression described for star topologies. Consider a network topology of a bounded-degree tree with $N$ nodes where interior nodes have degree $g$. A tree topology combines aspects of both chains and stars. The timer value should be a function of distance, to enable requests and repairs to suppress request and repair timers at nodes further down in the tree. In addition, randomization is needed to reduce request/repair implosion from nodes that are at an equal distance from the source (of the dropped packet, or of the first request). In this section, we show that the behavior of the request algorithms in a tree topology depends principally on the distance of the sender from the congested link, and on the ratio between the timer parameters $C_2$ and $C_1$.

Assume that node $S$ in the tree is the source of the dropped packet, and that link $(B,A)$ drops a packet from source $S$. We call nodes on the source's side of the congested link (including node $B$) good nodes, and nodes on the other side of the congested link (including node $A$) bad nodes. Node $A$ detects the dropped packet at time $t$, when it receives the next packet from node $S$. We designate node $A$ as a level-0 node, and we call a bad node a level-$i$ node if it is at distance $i$ from node $A$.

Assume that the source of the dropped packet is at distance $j$ from node $A$. Node $A$'s request timer expires at time $t + C_1 j + U[C_2 j]$, where $U[C_2 j]$ denotes a uniform random variable between 0 and $C_2 j$. Assuming that node $A$'s request is not suppressed, a level-$i$ node receives node $A$'s request at time $t + C_1 j + U[C_2 j] + i$. Node $B$ receives node $A$'s repair request at time $t + C_1 j + U[C_2 j] + 1$. A bad level-$i$ node detects the loss at time $t + i$, and such a node's request timer expires at some time $t + i + C_1 (i+j) + U[C_2 (i+j)]$. Note that regardless of the values of $C_1$ and $C_2$, a level-$i$ node receives node $A$'s request by time $t + (C_1 + C_2) j + i$,

Footnote (Section 4.2, expected number of requests in a star): The $G-1$ nodes all detect the failure at the same time, and all set their timers to a uniform value in an interval of width $2C_2$. If the first timer expires at time $t$, then the other $G-2$ receivers receive that first request at time $t+2$. So the expected number of duplicate requests is equal to the expected number of timers that expire in the interval $[t, t+2]$.
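The footnote's counting argument for the star topology can be checked numerically. The sketch below is a hypothetical Monte Carlo experiment, not the simulator used in this paper. It assumes a zero timer offset, unit-delay links, and a center node that is not a member, so any two members are two time units apart and each timer is drawn uniformly from an interval of width `2 * c2`. The heuristic estimates it is compared against are one plus `(g - 2) / c2` requests and a first-timer delay of `2 * c2 / g`.

```python
import random


def simulate_star(g, c2, trials=20000, seed=1):
    """Monte Carlo estimate of (mean number of requests, mean delay to first request)."""
    rng = random.Random(seed)
    dist = 2.0  # member to member via the center, in time units
    total_requests = 0
    total_first = 0.0
    for _ in range(trials):
        # The g - 1 members that missed the packet all set timers at once.
        timers = [rng.uniform(0.0, c2 * dist) for _ in range(g - 1)]
        first = min(timers)
        # A member still sends its own request if its timer expires before
        # the first request reaches it, `dist` time units after it was sent.
        total_requests += sum(1 for x in timers if x < first + dist)
        total_first += first
    return total_requests / trials, total_first / trials


def predicted(g, c2):
    """Heuristic estimates: (expected number of requests, expected first-timer delay)."""
    return 1 + (g - 2) / c2, 2 * c2 / g
```

Comparing `simulate_star(100, 10.0)` with `predicted(100, 10.0)` is one way to see how close the heuristic is; the request count is only approximate because the surviving timers are not uniform on the whole interval once the minimum is known.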
Page 8
and a level-$i$ node's request timer expires no sooner than $t + i + C_1 (i+j)$. If $(C_1 + C_2) j + i \le i + C_1 (i+j)$, that is, if $j/i \le C_1/C_2$, then the level-$i$ node's request timer will always be suppressed by the request from the level-0 node. Thus, the smaller the ratio $C_2/C_1$, the fewer the number of levels that could be involved in duplicate requests. This relation also demonstrates why the number of duplicate requests or repairs is smaller when the source (of the dropped packet, or of the request) is close to the congested link.

Note that the parameter $C_1$ serves two different functions. A smaller value for $C_1$ gives a smaller delay for node $B$ to receive the first request. At the same time, for nodes further away from the congested link, a larger value for $C_1$ contributes to suppressing additional levels of request timers. A similar tradeoff occurs with the parameter $C_2$. A smaller value for $C_2$ gives a smaller delay for node $B$ to receive the first repair request. At the same time, for topologies such as star topologies, a larger value for $C_2$ helps to prevent duplicate requests from session members at the same distance from the congested link. Similar remarks apply to the functions of $D_1$ and $D_2$ in the repair timer algorithm.
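The level-by-level suppression bound for trees can be written out directly. In this sketch `c1` and `c2` are the request timer parameters (timer offset and random width, each scaled by the distance to the source), `j` is the distance from the level-0 node to the source, and `i` is the level of a node below the congested link. Links have unit delay, and `c1` is assumed positive. The function names are hypothetical.

```python
import math


def latest_request_arrival(c1, c2, j, i, t=0.0):
    """Latest time a level-i node can hear the level-0 node's request."""
    return t + (c1 + c2) * j + i


def earliest_own_expiry(c1, j, i, t=0.0):
    """Earliest time a level-i node's own request timer can expire."""
    return t + i + c1 * (i + j)


def always_suppressed(c1, c2, j, i):
    """True if the level-0 request always arrives before the level-i timer fires."""
    return latest_request_arrival(c1, c2, j, i) <= earliest_own_expiry(c1, j, i)


def first_suppressed_level(c1, c2, j):
    """Smallest level i >= 1 that is always suppressed, i.e. i >= (c2 / c1) * j."""
    return max(1, math.ceil(c2 * j / c1))
```

Levels below `first_suppressed_level` may still produce duplicate requests, which is why a source close to the congested link (small `j`) or a small ratio of `c2` to `c1` limits the number of levels involved.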

5 Simulations of the request and repair algorithms

For a given underlying network, set of session members, session sources, and congested link, it should be feasible to analyze the behavior of the repair and request algorithms with fixed timer parameters $C_1$, $C_2$, $D_1$, and $D_2$. However, we are interested in the repair and request algorithms across a wide range of topologies and scenarios. We use simulations to examine the performance of the loss recovery algorithms for individual packet drops in random and bounded-degree trees. We do not claim to be presenting realistic topologies or typical patterns of packet loss.

We define the density of a session as the fraction of nodes in the underlying network that are members of the multicast session. The simulations in this section show that the loss recovery algorithms with fixed timer parameters perform well in a random or bounded-degree tree for dense sessions, where many of the nodes in the underlying tree are members of the multicast session. The loss recovery algorithms perform somewhat less well for a sparse session, where the session size is small relative to the size of the underlying network, and the members might be scattered throughout the net. This motivates the development of the adaptive loss recovery algorithm in Section 7.1, where the timer parameters $C_1$, $C_2$, $D_1$, and $D_2$ are adjusted in response to past performance.

In these simulations the fixed timer parameters are set as follows: $C_1 = C_2 = 2$, and $D_1 = D_2 = \log_{10} G$, where $G$ is the number of members in the same multicast session. The choice of $\log_{10} G$ for $D_1$ and $D_2$ is not critical, but gives slightly better performance than $D_1 = D_2 = 1$ for large $G$.

Each simulation constructs either a random tree or a bounded-degree tree with $N$ nodes as the network topology. Next, $G$ of the $N$ nodes are randomly chosen to be session members; these session members are not necessarily leaf nodes in the network topology. Finally, a source $S$ is randomly chosen from the $G$ session members.

We assume that messages are multicast to members of the multicast group along a shortest-path tree from the source of the message. In each simulation we randomly choose a link $L$ on the shortest-path tree from source $S$ to the members of the multicast group. We assume that the first packet from source $S$ is dropped by link $L$, and that receivers detect this loss when they receive the subsequent packet from source $S$.

[13] discusses the tools that we used to verify that our simulator is correctly implementing the loss recovery algorithms. The simulator that we used for the simulations in this paper is not publicly available. However, much of the same functionality has been implemented in the ns-2 simulator [28]. Further progress will be reported on the SRM web page [38].

5.1 Simulations on random trees

In this section we consider networks of random labeled trees, where all nodes in the networks are session members. The next section considers large networks with nodes of degree four where only a fraction of the nodes are members of the multicast group.

For the simulations on random labeled trees of $N$ nodes, the random labeled trees are constructed according to the labeling algorithm in [30, p.99]. These trees have unbounded degree, but for large $N$ the probability that a particular vertex in a random labeled tree has degree at most four approaches (approximately) 0.98 [30, p.114]. Figure 3 shows simulations of the loss recovery algorithm for this case, where all $N$ nodes in the tree are members of the multicast session (that is, $G = N$). For each graph the $x$-axis shows the session size $G$; twenty simulations were run for each value of $G$. For each simulation, a new random tree was constructed, and session members, source, and congested link were randomly chosen. Each simulation is represented by a jittered dot, and the median from the twenty simulations is shown by a solid line. The two dotted lines mark the upper and lower quartiles; thus, the results from half of the simulations lie between the two dotted lines. While there are not enough simulations to make accurate predictions of the behavior of the loss recovery algorithms, the simulations do illustrate the loss recovery algorithms under a range of circumstances.

The top two graphs in Figure 3 show the number of requests and repairs to recover from a single loss. In these graphs the median, lower quartile, and upper quartile lines are the same; the $y$-axis was chosen for an easy visual comparison with other simulations later in the paper. For each member affected by the loss, we define the loss recovery delay as the time from when the member first detects the loss until the member first receives a repair. For each simulation, there is a dot in the bottom graph in Figure 3 showing the loss recovery delay for the last member of the multicast session to receive the repair. This loss recovery delay is given as a multiple of the RTT, the roundtrip time from that member to the original source of the dropped packet. While this member has the largest loss recovery delay in absolute terms, this member generally does not have the largest delay when expressed in units of its own RTT.

Footnote: A jittered dot is a dot for which some small random jitter has been added to the $x$ and $y$ coordinates. In this way the reader can differentiate between a single dot, and multiple dots all with the same coordinates.

Page 9

[Figure 3: Random trees with a random congested link and a single packet loss, where all nodes are members of the multicast session. Three panels plot Number of Requests, Number of Repairs, and Delay (in units of RTT) against Session Size.]

Note that with unicast communications the ratio of loss recovery delay to RTT is at least one. For a unicast receiver that detects a packet loss by waiting for a retransmit timer to time out, the typical ratio of delay to RTT is closer to 2. With multicast loss recovery algorithms the ratio of delay to RTT can be less than one, because the request and repair could each come from a node close to the point of failure.

Figure 3 shows that the repair/request algorithm with fixed timer parameters works well for a tree topology where all nodes of the tree are members of the multicast session. There is usually only one request and one repair. (Some lack of symmetry results from the fact that the original source of the dropped packet might be far from the point of failure, while the first request comes from a node close to the point of failure.) The average recovery delay for the farthest node is less than 2 RTT, competitive with the average delay available from a unicast algorithm such as TCP. The results are similar in simulations where the congested link is chosen adjacent to the source of the dropped packet, and for simulations on a bounded-degree tree of size $N$ where interior nodes have degree four. (We do not claim that this is the average degree for a router in the Internet, in the current Mbone, or in the likely multicast backbone of the foreseeable future. From looking at a map of the current Mbone topology, choosing a degree of four seemed as reasonable a choice as any other that we might have made.)

5.2 Simulations on large bounded-degree trees

The loss recovery algorithms with fixed timer parameters perform less well for a sparse session in a large bounded-degree tree. The underlying topology for the simulations in this section is a balanced bounded-degree tree of $N = 1000$ nodes, with interior nodes of degree four. In these simulations the session size $G$ is significantly less than $N$. For a session that is sparse relative to the underlying network, the nodes close to the congested link might not be members of the session.

[Figure 4: Bounded-degree tree, degree 4, 1000 nodes, with a random congested link. Three panels plot Number of Requests, Number of Repairs, and Delay (in units of RTT) against Session Size.]

As Figure 4 shows, the average number of repairs for each loss is somewhat high. In simulations shown in [13] where the congested link is always adjacent to the source, the number of repairs is low but the average number of requests is high.

The performance of the loss recovery algorithm on a range of topologies is shown in [13]. These include topologies where each of the nodes in the underlying network is a router with an adjacent Ethernet with workstations, point-to-point topologies where the edges have a range of propagation delays, and topologies where the underlying network is more dense than a tree. None of these variations that we have explored has significantly affected the performance of the loss recovery algorithms with fixed timer parameters.
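The scenario setup used throughout this section (build a tree, pick session members, pick a source, pick a congested link on the source's multicast tree) can be sketched as follows. This is an illustration only and not the simulator used for the figures. In particular, decoding a random Prüfer sequence is a stand-in assumption for generating a uniform random labeled tree and is not claimed to be the algorithm of [30]. All function names are hypothetical, and at least two nodes and two members are assumed.

```python
import heapq
import random
from collections import deque


def random_labeled_tree(n, rng):
    """Uniform random labeled tree on nodes 0..n-1 (n >= 2) via Prüfer decoding."""
    prufer = [rng.randrange(n) for _ in range(n - 2)]
    degree = [1] * n
    for v in prufer:
        degree[v] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    adj = {v: set() for v in range(n)}
    for v in prufer:
        leaf = heapq.heappop(leaves)  # smallest remaining leaf
        adj[leaf].add(v)
        adj[v].add(leaf)
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    u, w = heapq.heappop(leaves), heapq.heappop(leaves)
    adj[u].add(w)
    adj[w].add(u)
    return adj


def pick_scenario(adj, g, rng):
    """Choose g >= 2 members, a source among them, and a congested link on
    the source-rooted tree that spans the members."""
    members = rng.sample(sorted(adj), g)
    source = rng.choice(members)
    # Breadth-first search from the source; paths in a tree are unique,
    # so the parent pointers describe the shortest-path tree.
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    # Keep only links on a path from the source to some member.
    links = set()
    for m in members:
        v = m
        while parent[v] is not None:
            links.add((parent[v], v))
            v = parent[v]
    congested = rng.choice(sorted(links))  # (upstream node, downstream node)
    return members, source, congested
```

Members downstream of the chosen link are the ones that miss the first packet; a timer model such as the one analyzed in Section 4 would then be run on top of this scenario.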
Page 10
6 Exploring the parameter space

As the previous section showed, a particular set of values for the timer parameters $C_1$ and $C_2$ that performs well in one scenario might not perform well in another scenario. In this section we choose a few simple topologies, and explore the behavior of the request/repair algorithms as a function of the request timer parameter $C_2$. In the following section we discuss adaptive algorithms where the timer parameters are adjusted as a function of the past performance of the loss recovery algorithms.

The results in this section can be briefly summarized as follows. The only simulations in this section that give unacceptably large numbers of requests are those with small values for $C_2$ on stars or for sparse sessions on trees. For these scenarios, increasing $C_2$ reduces the number of duplicate requests, accompanied by moderate increases in the loss recovery delay. For a star topology there is a clear tradeoff between the delay and the number of duplicates. In contrast, with a chain topology setting $C_2$ to zero gives the optimal behavior both in terms of delay and in the number of duplicates. For a dense session in a tree topology, a small value for $C_2$ gives good performance in terms of both delay and duplicates.

For the simulations in this section, $C_1$ is set to 2. As Section 4.1 showed, for a chain with a deterministic loss recovery algorithm, it is sufficient to set $C_1$ to 1. However, for a chain with a randomized loss recovery algorithm, a higher value of $C_1$ is needed to ensure that members further from the congested link receive a request before their own request timer expires.

Figure 5 shows the tradeoffs between delay and duplicates in a star topology of size 100, where the congested link is adjacent to the source of the dropped packet. We define the request delay for a session member as the delay from when the request timer is set until a request was either sent by that member or received from another member. The top graph in Figure 5 contains a dot for each integer value of $C_2$ from 1 to 100, for the star topology described in Section 4.2. For each dot, the $x$-coordinate is the expected request delay for that value of $C_2$ and the $y$-coordinate is the expected number of requests. More precisely, the $x$-coordinate is given by the expected request delay for the bad member closest to the source of the dropped packet, expressed as a multiple of the roundtrip time from that member to the source of the dropped packet. When there is not a unique bad member at the minimum distance from the source, as in a star topology, then the $x$-axis shows the expected smallest request delay from those members at the minimum distance from the source. For a star topology this is the request delay for that member whose request timer expires first.

From the heuristic analysis in Section 4.2, the expected request delay (in units of the RTT of $2d$) is as follows:

$$\frac{C_1 d + C_2 d / G}{2d} = \frac{C_1}{2} + \frac{C_2}{2G},$$

where $d$ is the distance in seconds from the source to a session member. From Section 4.2, the expected number of requests is estimated as $1 + (G-2)/C_2$. The "x" in Figure 5 shows the results for $C_2 = 1$, and the circle shows the results for $C_2 = \sqrt{G}$. Thus the top graph of Figure 5 shows that increasing $C_2$ in a star topology increases the expected request delay slightly while significantly decreasing the expected number of requests.

[Figure 5: Tradeoff between delay and duplicates in a star topology. Top panel: Star Topology, Expected Number of Requests versus Expected Request Delay (in units of RTT). Bottom panel: Star Topology, Simulation Results, Average Number of Requests versus Average Request Delay (in units of RTT).]

[Figure 6: Tradeoff between delay and duplicates in a chain topology. Chain Topology, Simulation Results: Average Number of Requests versus Average Request Delay (in units of RTT).]

The bottom graph in Figure 5 shows the results from simulations, which concur with the analytical results in the top graph. For each integer value of $C_2$ from 1 to 100, twenty simulations are run, and the request delay and total number of requests are calculated for each simulation. Each simulation is represented by a jittered dot, and the line shows the average for each value of $C_2$. For example, for $C_2$ set to one hundred the average number of requests is 1.5 and the average request delay as a multiple of the RTT is 1.42. The minimum request delay of 1 comes from the fixed value of 2 for request parameter $C_1$.

These results generally concur with those of [31], which investigates the relative benefits of using unicast or multicast NACKs. La Porta and Schwartz [31] conclude that for a scenario similar to our star topology, where a message sent by any member is received by all other members exactly a fixed number of seconds later, and for a multicast group with ten members, the random interval over which NACK timers were set would have to be at least 10 times that delay for the multicasting of NACKs to result in bandwidth savings over a scheme of unicasting NACKs to the source. La Porta and Schwartz [31] conclude that unicasting NACKs would be desirable in some scenarios, but for multicast groups that could have hundreds of members, and for multicast groups where the receivers were somewhat tolerant of delay, multicasting NACKs would be quite effective in reducing the unnecessary use of bandwidth.

Page 11

Figure 6 shows the results from the chain topology discussed in Section 4.1. For a chain, with $C_2$ set to zero there will be exactly one request, with a request delay of one roundtrip time (for $C_1 = 2$). Increasing $C_2$ can increase both the expected request delay and the expected number of duplicates. The four lines in Figure 6 show the results for a chain topology with a failed edge 1, 2, 5, or 10 hops, respectively, from the source of the dropped packet. For the simulations with a failed edge one hop from the source, the individual simulations are shown by a dot. For each scenario $C_2$ ranges from 0 to 10 in increments of 1, and then from 10 to 100 in increments of 10. While increasing $C_2$ can increase the number of duplicates, the magnitude of the increase is quite small.

[Figure 7: Tradeoff between delay and duplicates for dense sessions in tree topologies. Two panels (Tree Topology, Degree 4, Session Membership Density 1; and Session Membership Density 0.1) plot Simulation Results: Average Number of Requests versus Average Request Delay (in units of RTT).]

[Figure 8: Tradeoff between delay and duplicates for sparse sessions in a tree topology. One panel (Tree Topology, Degree 4, Session Membership Density 0.02) plots Simulation Results: Average Number of Requests versus Average Request Delay (in units of RTT).]

Figures 7 and 8 show the results for a range of tree topologies. Each line shows the results for a particular fixed scenario, as $C_2$ varies from 0 to 100. In all of the scenarios the session size is at least 100. In each graph, the lines represent scenarios that differ only in the number of hops between the source and the failed edge. The four lines represent scenarios with failed edges that are one, two, three, or four hops, respectively, from the source of the dropped packet. For all of the topologies, the failed edge closest to the source gives the line with the worst-case number of duplicate requests. For this line, the individual simulations are each shown by a jittered dot. The graphs are sized for easy comparisons, and do not necessarily show all of the dots. As an example, the top graph in Figure 7 shows the results for trees of density 1. For each of the lines the average number of duplicates is minimized for $C_2 = 0$ and maximized for an intermediate value of $C_2$.

7 Extending the basic approach

7.1 Adaptive adjustment of random timer algorithms

The results in the previous section suggest that the SRM loss recovery algorithms with fixed timer parameters give acceptable performance for sessions willing to tolerate a small number of duplicate requests and repairs and willing to accept a moderate request and repair delay (in terms of the roundtrip times of the underlying multicast group). However, there is not a single setting for the timer parameters that gives optimal performance for all topologies, session memberships, and loss patterns. For applications where it is desirable to optimize the tradeoff between delay and the number of duplicate requests and repairs, an adaptive algorithm can be used that adjusts the timer parameters $C_1$, $C_2$, $D_1$, and $D_2$ in response to the past behavior of the loss recovery algorithms.

In this section we describe an adaptive algorithm that adjusts the timer parameters as a function of both the delay and of the number of duplicate requests and repairs in recent loss recovery exchanges. A related strategy to minimize the number of duplicate requests is to rely on deterministic suppression, with members closest to the point of failure sending requests first. The rest of Section 7.1 describes the adaptive algorithm for adjusting the timer parameters in some detail. Section 7.2 continues with a discussion of local recovery mechanisms.

One mechanism for encouraging deterministic suppression is for members to reduce $C_1$ after they send a request. Because members who frequently send requests are likely to also be members who are close to the point of failure, reducing $C_1$ for those members aids the deterministic suppression. In a star topology, where otherwise the loss recovery mechanisms rely on probabilistic suppression, reducing $C_1$ in this fashion helps to break symmetry, encouraging certain members to continue sending requests early.

A second mechanism for encouraging deterministic suppression is for members who have sent requests to reduce $C_2$ if they have received duplicate requests from members significantly further from the source of the failed packet. This mechanism for requests requires that requests include the requestor's estimated distance from the original source of the requested packet. The
Page 12
corresponding mechanism for replies requires that replies in- clude the replier© estimated distance from the source of the re- quest. After sending request: decrease

start of req. timer interval Before each new request timer is set: if requests sent in prev. rounds, and any dup. requests were from further away: decrease request timer interval else if ave. dup. requests high: increase request timer interval else if ave. dup. requests low and ave. req. delay too high: decrease request timer interval Figure 9: Dynamic adjustment algorithm for request timer in- terv al. Figure gi es the outline of the dynamic adjustment algo- rithm for adjusting the request timer parameters. correspond- ing algorithm applies for adjusting the reply timer parameters. This

adapti algorithm combines the general adaptation per formed by all members when the set request timer with more speci˛c adaptations performed only by members who ha re- cently sent requests. member determines if the erage num- ber of duplicate requests is ˚too highº by comparing the ob- serv ed erage to prede˛ned threshold; in this paper the pre- de˛ned threshold is one duplicate request. If the erage num- ber of duplicate requests is too high, then the adapti algorithm increases the request timer interv al. Alternately if the erage number of duplicates is okay ut the erage

delay in sending a request is too high, then the adaptive algorithm decreases the request timer interval. In this fashion the algorithm can adapt the timer parameters not only to fit the generally-fixed underlying topology, but also to fit a changing session membership and pattern of congestion.

First we describe how a session member measures the average delay and number of duplicate requests in previous loss recovery rounds in which that member has been a participant. A request period begins when a member first detects a loss and sets a request timer, and ends when that member detects

a subsequent loss and begins a new request period. The variable dup_req keeps count of the number of duplicate requests received during one request period; these could be duplicates of the most recent request or of some previous request, but do not include requests for data for which that member never set a request timer. At the end of each request period, the member updates ave_dup_req, the average number of duplicate requests per request period, before resetting dup_req to zero. The average is computed as an exponential-weighted moving average,

   ave_dup_req <- (1 - α) ave_dup_req + α dup_req,

with α = 1/4 in our simulations. Thus, ave_dup_req gives the average number

of duplicate requests for those request events for which that member has actually set a request timer.

When a request timer either expires or is reset for the first time, indicating that either this member or some other member has sent a request for that data, the member computes the delay from the time the request timer was first set (following the detection of a loss) until a request was sent (as indicated by the time that the request timer either expired or was reset). The variable req_delay expresses this delay as a multiple of the roundtrip time to the source of the missing data. The member computes the average

request delay, ave_req_delay.

In a similar fashion, a repair period begins when a member receives a request and sets a repair timer, and ends when a member receives a request and sets a repair timer for a different data item. In computing dup_rep, the number of duplicate repairs, the member considers only those repairs for which that member at some point set a repair timer. At the end of a repair period the member updates ave_dup_rep, the average number of duplicate repairs.

When a repair timer either expires or is cleared, indicating that this member or some other member sent a repair for that data, the member computes the delay from the

time the repair timer was set (following the receipt of a request) until a repair was sent (as indicated by the time that the repair timer either expired or was cleared). As above, the variable rep_delay expresses this delay as a multiple of the roundtrip time to the source of the missing data. The member computes the average repair delay, ave_rep_delay.

   After a request timer expires or is first reset:
      update ave_req_delay
   After sending a request:
      ...
   Before each new request timer is set:
      update ave_dup_req
      if closest_requestor on past requests:
         ...
      else if ave_dup_req is high relative to AveDups:
         ...
      else if ave_dup_req is low relative to AveDups:
         if ave_req_delay is above AveDelay:
            ...
         if ave_dup_req is below 1/4:
            ...
         else
            ...

Figure 10: Dynamic adjustment algorithm for request timer parameters.

Figure 10 gives the adaptive adjustment algorithm used in our simulator to adjust the request timer parameters C1 and C2. The adaptive algorithm is based on comparing the measurements ave_dup_req and ave_req_delay with AveDups and AveDelay, the target bounds for the average number of duplicates and the average delay. An identical adjustment algorithm is used to adapt the repair timer parameters D1 and D2, based on the measurements ave_dup_rep and ave_rep_delay. Figure 11 gives the initial values used in our simulations

for the timer parameters. All four timer parameters are constrained to stay within the minimum and maximum values in Figure 11.

Figure 11: Parameters for adaptive algorithms (initial values and fixed parameters).

Figure 12: The non-adaptive algorithm. Panels: number of repairs vs. round number; average delay (in units of RTT) vs. round number.

Figure 13: The adaptive algorithm, with AveDups=1 and AveDelay=1. Panels: number of repairs vs. round number; average delay (in units of RTT) vs. round number.

The numerical parameters in Figure 10 of 0.05, 0.1, and 0.5 were chosen somewhat arbitrarily. While this might look like a multitude of constants, the exact value of these constants is not important; all that matters is that they represent small adjustments to the timer parameters C1 and C2 as a function of the past observed behavior of the loss recovery algorithms. The adjustments of C1 are small, as the adjustment of C1 is not the primary mechanism for controlling the number of duplicates. The

adjustments of C2 are sufficiently small to minimize oscillations in the setting of the timer parameters. Sample trajectories of the loss recovery algorithms confirm that the variations from the random component of the timer algorithms dominate the behavior of the algorithms, minimizing the effect of oscillations.

In our simulations we use a multiplicative factor larger than 2 for the request timer backoff described in Section 3.2. With a multiplicative factor of 2, and with an adaptive algorithm with small minimum values for the timer parameters, a single node that experiences a packet loss could have its backed-off request timer expire before receiving the repair packet, resulting in an unnecessary duplicate request.

We have not attempted to devise an optimal adaptive algorithm for reducing some function of both delay and of the number of duplicates; such an optimal algorithm could involve rather complex decisions about whether to adjust mainly C1 or C2, possibly depending on such factors as that member's relative distance to the source of the lost packet. For a sparse session in a tree topology, increasing C2 reduces the number of duplicate requests; our adaptive algorithm relies largely on increases of C2 to reduce

duplicates. Our adaptive algorithm also decreases C2 for members who have sent requests, if duplicate requests have come from members further from the source of the requested packet. (In our simulations "further from the source" is defined as "at a reported distance greater than 1.5 times the distance of the current member".) Our adaptive algorithm only decreases C1 for members who have sent requests, or when the average number of duplicates is already small.

Figures 12 and 13 show simulations comparing adaptive and non-adaptive algorithms. The simulation set in Figure 12 uses fixed values for the

timer parameters, and the one in Figure 13 uses the adaptive algorithm. From the simulation set in Figure 4, we chose a network topology, session membership, and drop scenario that resulted in a large number of duplicate requests with the non-adaptive algorithm. The network topology is a bounded-degree tree of 1000 nodes with degree 4 for interior nodes, and the multicast session consists of 50 members. Each of the two figures shows ten runs of the simulation, with 100 loss recovery rounds in each run. The same topology and loss scenario is used for each of the ten runs, but each run uses a new seed

for the pseudo-random number generator to control the timer choices for the requests and repairs.

In each loss recovery round a packet from the source is dropped on the congested link, a second packet from the source is not dropped, and the loss recovery algorithms are run until all members have received the dropped packet. The x-axis of each graph shows the round number. For each figure, the top graph shows the number of requests in that round, and the bottom graph shows the loss recovery delay. Each round of each simulation is marked with a jittered dot, and a solid line shows the median

from the ten simulations. The dotted lines show the upper and lower quartiles.
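The adaptive request-timer adjustment outlined in Figure 9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulator's code: the class and method names, the initial values and bounds for C1 and C2, the EWMA weight, and the assignment of the step sizes 0.1 and 0.5 to particular branches are all hypothetical choices made for illustration. It also assumes that a request timer is drawn uniformly from the interval [C1*d, (C1+C2)*d], where d is the member's estimated distance to the source.

```python
import random
from dataclasses import dataclass


def clamp(value, low, high):
    """Keep a timer parameter inside its allowed range."""
    return max(low, min(high, value))


@dataclass
class RequestTimerAdapter:
    # All numeric defaults below are illustrative assumptions.
    c1: float = 2.0             # start of the request timer interval
    c2: float = 2.0             # width of the request timer interval
    ave_dups: float = 1.0       # AveDups: target bound on duplicates
    ave_delay: float = 1.0      # AveDelay: target bound on delay (in RTTs)
    alpha: float = 0.25         # EWMA weight
    c1_range: tuple = (0.5, 2.0)
    c2_range: tuple = (1.0, 8.0)
    ave_dup_req: float = 0.0
    ave_req_delay: float = 0.0

    def _ewma(self, average, sample):
        return (1 - self.alpha) * average + self.alpha * sample

    def end_request_period(self, dup_req):
        """Fold this period's duplicate-request count into the average."""
        self.ave_dup_req = self._ewma(self.ave_dup_req, dup_req)

    def record_request_delay(self, req_delay):
        """req_delay is expressed as a multiple of the RTT to the source."""
        self.ave_req_delay = self._ewma(self.ave_req_delay, req_delay)

    def after_sending_request(self):
        """Members that send requests start their next timers earlier."""
        self.c1 = clamp(self.c1 - 0.1, *self.c1_range)

    def before_setting_timer(self, closest_requestor):
        """Apply the three-way decision from the Figure 9 outline."""
        if closest_requestor:
            # Sent requests before, and duplicates came from further away.
            self.c2 -= 0.1
        elif self.ave_dup_req >= self.ave_dups:
            self.c2 += 0.5      # too many duplicates: widen the interval
        elif self.ave_req_delay > self.ave_delay:
            self.c2 -= 0.1      # few duplicates but slow: narrow the interval
        self.c2 = clamp(self.c2, *self.c2_range)

    def draw_timer(self, distance, rng=random):
        """Pick a request timer uniformly from [c1*d, (c1+c2)*d]."""
        low = self.c1 * distance
        return rng.uniform(low, low + self.c2 * distance)
```

A member would call `end_request_period` and `record_request_delay` as each loss recovery round completes, `after_sending_request` whenever its own request goes out, and `before_setting_timer` followed by `draw_timer` when the next loss is detected. The same structure would apply to the repair timer parameters with the repair-side measurements.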
For the simulations in Figure 12 with fixed timer parameters, one round differs from another only in that each round uses a different set of random numbers for choosing the timers. For the simulations with the adaptive algorithm in Figure 13, after each round of the simulation each session member uses the adaptive algorithms to adjust the timer parameters, based on the results from previous rounds. Figure 13 shows that for this scenario, the adaptive algorithms quickly reduce the average number of repairs, reaching steady state after about forty iterations. Figure 13 also shows a small reduction in delay.

Figure 14: Adaptive algorithm on round 40, for a bounded-degree tree of 1000 nodes with degree 4 and a randomly picked congested link. Panels: number of requests, number of repairs, and delay (in units of RTT), each plotted against session size.

To explore the adaptive algorithms in a range of scenarios, Figure 14 shows the results of the adaptive algorithm on the same set of scenarios as that in Figure 4. For each scenario (i.e., network

topology, session membership, source member, and congested link) in Figure 14, the adaptive algorithm is run repeatedly for 40 loss recovery rounds, and Figure 14 shows the results from the 40th loss recovery round. Comparing Figures 4 and 14 shows that the adaptive algorithm is effective in controlling the number of duplicates over a range of scenarios.

Simulations in [13] show that the adaptive algorithm works well in a wide range of conditions. These include scenarios where only one session member experiences the packet loss; where the congested link is chosen adjacent to the source of the packet to be

dropped; and for a range of underlying topologies, including 5000-node trees; trees with interior nodes of degree 10; and connected graphs that are more dense than trees, with 1000 nodes and 1500 edges.

In actual multicast sessions, successive packet losses are not necessarily from the same source or on the same network link. Simulations in [13] show that in this case, the adaptive timer algorithms tune themselves to give good average performance for the range of packet drops encountered. Simulations in [13] show that, by choosing different values for AveDelay and AveDups, tradeoffs can be made between

the relative importance of low delay and a low number of duplicates.

In the simulations in this section, there is only one congested link, and each packet that is dropped is dropped on only that one link. More realistic simulations would include scenarios with multiple locations for drops of a single packet, and would use an extended SRM that incorporates local recovery mechanisms into the loss recovery algorithms.

Similarly, in the simulations in this section, none of the requests or repairs are themselves dropped. In more realistic scenarios where not only data messages but requests and repairs

can be dropped at congested links as well, members have to rely on retransmit timer algorithms to retransmit requests and repairs as needed. Obviously this will increase not only the delay, but also the number of duplicate requests and repairs in different parts of the network. The use of local recovery, described in the following section, would help to reduce the unnecessary use of bandwidth in the loss recovery algorithms.

7.2 Local recovery

With SRM's global loss recovery algorithm described above, even if a packet is dropped on a link to a single member, both the request and the repair are multicast to

the entire group. In cases where the neighborhood affected by the loss is small, the bandwidth costs of the loss recovery algorithm can be reduced if requests and repairs are multicast to a limited area. In this section we suggest that local recovery can be quite effective in reducing the unnecessary use of bandwidth.

Scenarios that could benefit from local recovery include sessions with persistent losses to a small neighborhood of members, and isolated late arrivals to a multicast session asking for back history. Studies of packet loss patterns in the current Mbone [39] suggest that packet loss

in multicast traffic is most likely to occur not in the "backbone" but in the "edges" of the multicast network. In addition, the larger the multicast group, the more likely it is that a packet will be dropped somewhere along the multicast tree, even in the absence of a particular persistent point of congestion. Forward Error Correction (FEC) [29] and Explicit Congestion Notification (ECN) [11] both have great potential for reducing the negative impacts of transient or mild congestion for reliable multicast applications. However, links with persistent congestion and persistent

packet drops are likely to remain. In this case, local recovery is needed to ensure that the fraction of bandwidth used for request and repair messages scales well as the multicast group grows.

We are not at this stage proposing a complete set of algorithms for implementing local recovery. We explore in this section a set of mechanisms that can be used to limit the scope of a request and of an answering repair. The question of how a member decides the scope to use for a particular request is an area for future research.

Local recovery assumes that the member sending the request has some information about the neighborhood of members sharing recent losses. We define a loss neighborhood as a set of members who are all experiencing the same set of losses. End nodes should not know about network topology, but end nodes can learn about "loss neighborhoods" from information in session messages, without learning about the network topology. For each member, we call a loss a local loss if the number of members experiencing the loss is much smaller than the total number of members in the session. To help identify loss neighborhoods, session messages could report a member's loss rate, that

is, the fraction of data for which a request timer was set. In addition, session messages could report a "loss fingerprint", i.e., the names of the last few local losses.

A member should send a request with local scope when recent losses have been confined to a single loss neighborhood, and when this local request seems likely to reach some member capable of answering it. If no repair is received before a backed-off request timer expires, then the next request can be sent with a wider scope.

7.2.1 Administrative scoping

One simple and now widely available mechanism for local recovery is the use of

administrative scope in IP multicast. If a member believes that both the loss neighborhood and a potential source of repairs are contained in the local administratively-scoped neighborhood, then both the request and the repair can be sent with administrative scoping, so that both messages are restricted to that neighborhood. This is most likely to be of use for larger administratively-scoped neighborhoods.

7.2.2 Separate multicast groups

Another potential mechanism under investigation is the use of separate multicast groups for local recovery [22]. In this scheme, the initial requestor creates

a separate multicast group for local recovery and invites other nearby members to join that multicast group. The multicast group must include some member capable of sending repairs. This mechanism is appropriate when there is a stable loss neighborhood that results from a particular lossy link, or when an isolated member joins a group late and asks for past history. Kasera, Kurose, and Towsley [20] explore a somewhat-different use of multiple multicast groups for recovery, aimed primarily at reducing the costs of processing unwanted packets at receivers.

7.2.3 TTL-based scoping

A third possible

mechanism for local recovery is for members to use time-to-live or TTL-based scope to limit the reach of request and repair messages. In the current Mbone, each link (more precisely, each interface or tunnel) is assigned a threshold, with a default threshold of one. The threshold is the minimum TTL required for an IP multicast packet to be forwarded on that link, and is used to control the scope of multicast packets. Every multicast router decrements the TTL of a forwarded packet by one. In order to limit the scope of a request or repair message, the sender simply sets each packet's TTL

field to an appropriate value. By including the initial TTL in a separate packet field, members receiving the request (or reply) message explicitly learn the original TTL as well as the hop count for the path from the source.

The simplest version of TTL-based local recovery is a one-step repair algorithm. In this approach, a request sent with a given TTL might be answered with a repair whose TTL is the request's TTL plus the number of hops to the original requestor. In this way the repair would be guaranteed to reach all of the members reached by the original request (if we optimistically assume that multicast

routes and thresholds are symmetric). However, simulations suggest that one-step repair is not very effective; there is significant unnecessary use of bandwidth by the repair packets.

A two-step repair message is considerably more effective in limiting the unnecessary use of bandwidth. In the first step of the repair, a local repair is sent with the same TTL used in the request. This TTL should be sufficiently large to reach the original requestor, given sufficient symmetry, but not necessarily sufficiently large to reach all of the members reached by the original

request. The local repair includes the name of the member whose request triggered the repair. In the second step of the repair, the requestor, upon receiving the first local repair naming itself as the original requestor, resends the repair using the same TTL as in the original request. In this way the repair is received by all of the members who saw the original request.

We use simulations to explore the optimal behavior that could be achieved from two-step local recovery. First we examine networks where all links have a link threshold of one, and next we examine networks with a range of values for

the link thresholds. To explore the optimal possible performance, we assume that the loss neighborhood is stable, and that members have some method for estimating both the minimum TTL needed to reach all members in the loss neighborhood and the minimum TTL needed to reach some member not in the loss neighborhood. Further, we assume that for each loss recovery event, the request/repair algorithms exhibit their optimal behavior. That is, we assume that there is a single request and a single repair, and that both come from the members closest to the point of failure. We restrict attention to scenarios

where the loss neighborhood contains at most 1/10-th of the session members.

Figure 15 shows the results of such an optimal execution of the two-step local recovery algorithms in a large bounded-degree network of degree four with link thresholds of one. The x-axis in each graph shows the session size. For each session size, twenty simulations are run, each with a different session membership, source, and randomly-chosen congested link for the dropped packet. The results of each simulation are represented by a jittered dot. The three lines indicate the first, second, and third quartiles.
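The difference between the one-step and two-step TTL-scoped repairs described above can be illustrated with a small Python sketch. It uses a deliberately simplified model that is an assumption of this example rather than of the paper: every link has a threshold of one, routes are symmetric, and a packet sent with TTL t reaches exactly the nodes within t hops of the sender. The function names and the example topology are hypothetical.

```python
from collections import deque


def bfs_dist(adj, src):
    """Hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def reach(adj, src, ttl):
    """Nodes that see a packet sent from src with the given TTL."""
    return {n for n, d in bfs_dist(adj, src).items() if d <= ttl}


def one_step_repair(adj, requestor, replier, request_ttl):
    """Repair TTL = request TTL + hops from replier to requestor."""
    hops = bfs_dist(adj, replier)[requestor]
    return reach(adj, replier, request_ttl + hops)


def two_step_repair(adj, requestor, replier, request_ttl):
    """Local repair at the request TTL, then a resend by the requestor."""
    first = reach(adj, replier, request_ttl)
    if requestor not in first:
        return first  # the local repair never reached the requestor
    return first | reach(adj, requestor, request_ttl)


# Example: a chain of seven nodes, 0 - 1 - 2 - 3 - 4 - 5 - 6.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
request_area = reach(chain, 0, 2)
one_step = one_step_repair(chain, requestor=0, replier=2, request_ttl=2)
two_step = two_step_repair(chain, requestor=0, replier=2, request_ttl=2)
```

In this example both variants cover every node that saw the request, but the one-step repair uses a larger TTL and so reaches more nodes beyond the request area than the two-step repair does.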

In the top graph of Figure 15, the y-axis shows the fraction of session members reached by the repair. In the bottom graph of Figure 15, the y-axis shows the number of session members in the repair neighborhood, that is, the number of session members reached by the repair, as a multiple of the number of members in the loss neighborhood.

Figure 15: Local recovery with two-step repairs in bounded-degree trees with 1000 nodes, thresholds of one. Panels: fraction of nodes reached vs. session size; ratio of repair to loss neighborhoods vs. session size.

Additional simulations not reported here show that local recovery with two-step repairs can work well in networks with a range of topologies and link thresholds. Simulations in [13] show that, in contrast to two-step repairs, one-step repairs are fairly inefficient in their use of bandwidth, even given an optimal setting of the TTL of the original request.

8 Related research on reliable multicast

The literature is rich with architectures for reliable multicast [27, 9]. Several of the centralized approaches to reliable multicast are discussed briefly in [12, 13]. In this section

we focus on those approaches to reliable multicast that are more closely related to SRM.

The Xpress Transport Protocol (XTP) [36, 37] is designed for either unicast or one-to-many multicast communication. Reliable communication is based on negative acknowledgments. The sender may also initiate a synchronizing handshake, to determine the status of the receivers. In this case, receivers each use a "slotting" technique to wait a random delay before sending their control packet, to reduce control packet implosion. The combined slotting and damping techniques proposed in [36] for NACK suppression have been described earlier in this paper. In XTP, receivers or routers can impose a maximum data rate and maximum burst size on the sender.

Several proposals for reliable multicast use a secondary server (also called a Designated Router or Group Controller in different proposals) to handle retransmissions within a subgroup of the multicast group. One such protocol, Log-Based Receiver-reliable Multicast (LBRM) [15], was designed to support Distributed Interactive Simulation (DIS). The receiver-based reliability is provided by primary and secondary logging servers. Receivers request

retransmissions from the secondary logging servers, which request retransmissions from the primary logging server. Both the source and the secondary logging servers use either deterministic or probabilistic requests to select between unicast and multicast retransmissions.

LBRM uses a variable heartbeat scheme that sends heartbeat messages (e.g., session messages) more frequently immediately after a data transmission. In an environment where the basic transmission rate is low, this variable heartbeat enables receivers to detect losses sooner, with no penalty in terms of the total number of

heartbeat messages transmitted. While the variable heartbeat scheme would not be appropriate for an application such as wb, where the original congestion could itself result from many senders sending data at the same time, the variable heartbeat scheme could be quite useful for an application with a natural limit on the worst-case number of concurrent senders, and would be easily implementable in SRM.

Like LBRM and SRM, the Reliable Multicast Transport Protocol (RMTP) [21] also includes among its goals scalability and receiver-based reliability. RMTP accomplishes this by using Designated Routers

(DRs) in each region of the multicast group, where the DRs receive incoming acknowledgements and perform retransmissions as needed. RMTP uses windowed flow control tuned to the requirements of the worst-case receiver. The problem of dynamically choosing DRs for a given multicast tree is left for continued research.

A Local Group Concept is proposed in [14], where the multicast group is divided into Local Groups, each represented by a Group Controller that handles retransmissions for members in the Local Group. The Group Controller is not a router or separate server, but simply one of the members

of the multicast group. Hofmann in [14] aims at the dynamic generation of Local Groups and of Group Controllers, but does not explore in detail the algorithms for finding the nearby Local Group, responding to the failure of a local Group Controller, or choosing a new Group Controller.

Perhaps the most well-known work on reliable multicast is the ISIS distributed programming system developed at Cornell University [2, 16]. ISIS provides causal ordering and, if desired, total ordering of messages on top of a reliable multicast delivery protocol. Therefore the ISIS work is to some extent orthogonal to

the work described in this paper, and further confirms our notion that partial or total ordering, when desired, can always be added on top of a reliable multicast delivery system.

There is also a growing literature on the analysis of reliable multicast schemes. As one example, Bhagwat, Mishra, and Tripathi [4] consider the performance of one-to-many reliable multicast with a block-based ACK scheme. The paper investigates the regime where transfer sizes are large, receivers have limited buffering, and all retransmissions come from the original sender. Pejhan, Schwartz, and Anastassiou [31] compare

several retransmission schemes for multicast protocols for real-time media. The retransmission schemes are intended for real-time media with playback times, so that packets received after the playback time are dropped. They assume that receivers unicast NACKs to the sender and retransmissions are done by the sender. Note that these assumptions differ from those of SRM, which is intended for applications without fixed deadlines by which packets have to be received, and which allows retransmissions from members other than the original source.
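The variable heartbeat idea mentioned above for LBRM can be illustrated with a short Python sketch. The doubling backoff between a minimum and a maximum interval, the function name, and the parameter values are assumptions made for illustration; the text above says only that heartbeats are sent more frequently immediately after a data transmission.

```python
def heartbeat_times(h_min, h_max, horizon, backoff=2.0):
    """Times (seconds after the last data packet) at which heartbeats go out.

    The interval starts at h_min and is multiplied by `backoff` after each
    heartbeat until it reaches h_max. Sending new data would restart the
    schedule at h_min (not modelled here).
    """
    times = []
    now, interval = 0.0, h_min
    while now + interval <= horizon:
        now += interval
        times.append(now)
        interval = min(interval * backoff, h_max)
    return times


# Hypothetical values: first heartbeat after 0.25 s, at most 4 s apart.
schedule = heartbeat_times(h_min=0.25, h_max=4.0, horizon=16.0)
```

With a schedule of this shape, a receiver that misses the last data packet before an idle period learns of the loss from the first heartbeat, shortly after the transmission, rather than after a full fixed heartbeat interval.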
9 Future work

9.1 Future work on scalable session messages

The SRM framework outlined in this paper assumes that members of the multicast group send session messages and estimate the distance to each of the other group members. For larger groups, we are investigating a hierarchical approach for scalable session messages [35], where members in a local area dynamically select one of the local members to be the representative as far as session messages are concerned. The representatives would each send global session messages, and maintain an estimate of their distance in seconds from each of the other representatives. All other members would send local session messages with limited scope sufficient to reach their representative.

9.2 Future work on local recovery

Section 7.2 has shown that local recovery can be effective in limiting the unnecessary use of bandwidth in loss recovery events, if members can estimate the scope to use in sending local requests. While we discuss in [13] some of the issues in implementing TTL-based local recovery, there are many open questions about which mechanisms should be used to define local-recovery neighborhoods, how individual members should determine whether to send requests with local or global scope, etc. For local recovery based on separate multicast groups, there is ongoing research on algorithms for initiating, joining, and leaving such multicast groups, and for soliciting additional members to join such groups.

In many topologies, the effectiveness of local recovery could be improved by adding members to the multicast group in strategic locations. For example, consider the known stable topologies discussed in [15], where losses are expected to occur mainly on the tail circuits, rather than in the backbone or in the LANs,

and the design priority is to keep unnecessary traffic off of the tail circuits. The addition of a session member (i.e., a cache) on a node near the local end of the tail circuit, coupled with a local-recovery neighborhood defined to include all members on that end of the tail circuit, would allow local recovery to continue for losses on the local area without adding any unnecessary traffic to the tail circuit itself. For losses on the tail circuit itself, a larger local recovery area that spanned the tail circuit just into the backbone would isolate individual local recovery to independent

tail circuits.

9.3 Future work on congestion control

SRM's basic framework for congestion control assumes that the members of the multicast session have an estimate of the available bandwidth for the session, and constrain the data transmitted to be within this estimated bandwidth. This framework raises several somewhat separate issues, such as how members determine this available bandwidth; how to detect congestion or avoid potential congestion; and, given available bandwidth, which piece of data a member should send first.

Multicast congestion control is a relatively new area for research. For

unicast traffic, there is a single path from source to receiver, with a feedback loop provided by returning packets sent by the receiver. In contrast, in a multicast group there could be several sources, and the various communication paths from an active source to the members of the multicast group can have a range of bandwidth, propagation delay, and competing congestion. In this case, how does one define and detect congestion?

With multicast traffic, there are application-specific policy decisions about whether or not to tune the congestion control procedures to the needs of the

worst-case receiver; these questions do not arise with unicast transmissions. Tuning the sending rate to the worst-case receiver is only viable for a multicast group with controlled membership; otherwise, the multicast group would be vulnerable to denial-of-service attacks by members joining the group from an extremely-low-bandwidth path. Given an uncontrolled membership, and a group where the bandwidth along different paths in the multicast group differs substantially, the sender could tune the sending rate to the needs of the majority of receivers, requiring that receivers on more

congested paths unsubscribe from the multicast group.

A receiver-based approach under investigation for the video tool vic [24] is to divide the total data transmission into several substreams, with each being sent to a separate multicast group [25]. Members that detect congestion unsubscribe from higher-bandwidth groups. When this approach is used for reliable multicast, reliable delivery would be provided separately within each group. This implies that unsubscribing receivers would either not receive all of the data, or would receive some of the data later, at a slower rate than that used for

the rest of the multicast group. In either case, we can exploit this tradeoff through the use of progressively refinable or layered data representations.

While considerable research has been done on layering techniques for video, layering techniques are application-specific, and layering for wb data remains an area for further research. Possibilities would be to encode embedded images using Progressive-JPEG or some other layered scheme, or to trade off free-hand drawing resolution for rate (i.e., one could send line drawings at 50 points/sec for good interactive performance over a high-rate channel but at 1 point/sec over a constrained, low-rate channel).

As another approach to bandwidth adaptation, receivers could reserve resources where such network services were available; an example of such services are the guaranteed and controlled load services currently being developed for the Internet [3]. Such resource reservation could complement other congestion control mechanisms of the multicast session.

9.4 Future work on an SRM "toolkit"

Although we have proposed SRM as a framework that applies to many different applications, we have developed just one such application, wb. Further,

because we based the implementation on ALF and deliberately factored many application semantics into the design of the wb transport, it is relatively difficult to extract and re-use wb's network implementation in another application. However, this limitation resulted from our lack of prior experience with ALF-based design, and we argue now that an ALF protocol architecture does not necessarily preclude substantial code re-use.

Based on our subsequent experience with another ALF architecture, the Real-time Transport Protocol (RTP) [32] that underlies the MBone tools

vic and vat, we know that the core of an ALF-based design can be easily tailored for a range of application types. For example, we developed a generic RTP toolkit as an object-oriented class hierarchy, where the base class implements the common RTP framework and derived subclasses implement application-specific semantics. Our RTP toolkit supports a wide range of applications including layered video, traditional H.261-coded video, LPC-coded audio, generic audio/video recording and playback tools, and RTP monitoring and debugging tools. Each of these tools shares most of its network implementation with all of the others, yet each still reflects its individual semantics through ALF; RTP is not a generic protocol layer.

In current work, we are applying these same design principles to both the next generation of the wb protocol and a new set of SRM-based applications. We are developing an object-oriented SRM toolkit that in a base class implements the SRM framework described in this paper, and in a derived subclass reflects application semantics like those described in Section 2.3. For example, the application portion of the SRM class hierarchy determines the packet generation order

and priority, that is, whether to send or answer repairs before sending new data, or favoring repairs of one source over another, etc. At the same time, the SRM base class handles the more generic SRM functionality like the timer adaptation algorithms and the basic request/repair event scheduling.

10 Conclusions

This paper has described in detail SRM, a framework for scalable reliable multicast. The SRM framework meets a minimal reliability definition of delivering all data to all group members, deferring more advanced functionality, when needed, to individual applications. SRM is based on the
assumptions of IP multicast delivery and of unique persistent names for both data and sources. This paper has focused on SRM's request and repair algorithms for the reliable delivery of data. The paper has not proposed a complete set of algorithms for implementing local recovery, but has explored a model for local recovery with two-step repairs. Future work on scalable session messages, local recovery, congestion control, and an SRM "toolkit" has also been discussed.

Acknowledgments

This work benefited from discussions with Dave Clark and with the End-to-End Task Force about general issues
of sender-based vs. receiver-based protocols. We thank Peter Danzig for discussions about reliable multicasting and web-caching. We also thank Mark Allman, Todd Montgomery, Kannan Varadhan, and the anonymous referees for useful feedback on the paper.

References

[1] K. Birman, "A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication", Operating Systems Review, 28(1):11-21, January 1994. URL http://cs-tr.cs.cornell.edu/Dienst/UI/2.0/Contents/ncstrl.cornell/TR93-1390.
[2] K. Birman, A. Schiper, and P. Stephenson, "Lightweight Causal and Atomic Group Multicast", ACM Transactions on Computer Systems, Vol. 9, No. 3, pp. 272-314, Aug. 1991.
[3] B. Braden, D. Clark, and S. Shenker, "Integrated Services in the Internet Architecture: an Overview", Request for Comments (RFC) 1633, IETF, June 1994.
[4] P. Bhagwat, P. Mishra, and S. Tripathi, "Effect of Topology on Performance of Reliable Multicast Communication", Infocom 94, pp. 602-609.
[5] D. Cheriton and D. Skeen, "Understanding the Limitations of Causally and Totally Ordered Communication", Proceedings of the 14th Symposium on Operating System Principles, ACM, December 1993.
[6] D. Clark and D.
Tennenhouse, "Architectural Considerations for a New Generation of Protocols", Proceedings of ACM SIGCOMM '90, Sept. 1990, pp. 201-208.
[7] D. Clark, M. Lambert, and L. Zhang, "NETBLT: A High Throughput Transport Protocol", Proceedings of ACM SIGCOMM '87, pp. 353-359, Aug. 1987.
[8] S. Deering, "Multicast Routing in a Datagram Internetwork", PhD thesis, Stanford University, Palo Alto, California, Dec. 1991.
[9] C. Diot, W. Dabbous, and J. Crowcroft, "Multipoint Communication: A Survey of Protocols, Functions, and Mechanisms", IEEE Journal on Selected Areas in Communications, Special Issue on Group Communication, May 1997.
[10] A. Erramilli and R.P. Singh, "A Reliable and Efficient Multicast Protocol for Broadband Broadcast Networks", Proceedings of ACM SIGCOMM '87, pp. 343-352, August 1987.
[11] Floyd, S., "TCP and Explicit Congestion Notification", ACM Computer Communication Review, V. 24 N. 5, October 1994, pp. 10-23.
[12] S. Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing", ACM SIGCOMM 95, August 1995, URL ftp://ftp.ee.lbl.gov/papers/srm_sigcomm.ps.Z, pp. 342-356.
[13] S.
Floyd, V. Jacobson, C. Liu, S. McCanne, and L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing, Extended Report", LBNL Technical Report, URL ftp://ftp.ee.lbl.gov/papers/wb.tech.ps.Z, Sept. 1995.
[14] M. Hofmann, "A Generic Concept for Large-Scale Multicast", Proceedings of International Zurich Seminar on Digital Communications (IZS '96), URL http://www.telematik.informatik.uni-karlsruhe.de/~hofmann/paper-izs96.ps, Feb 1996.
[15] H. Holbrook, S. Singhal, and D. Cheriton, "Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation", Proceedings of ACM SIGCOMM '95, August 1995.
[16] ISIS and Horus WWW page, URL http://www.cs.cornell.edu/Info/Projects/ISIS/ISIS.html.
[17] V. Jacobson, "A Portable, Public Domain Network 'Whiteboard'", Xerox PARC, viewgraphs, April 28, 1992. Unpublished document (cited for acknowledgement purposes).
[18] V. Jacobson, "A Privacy and Security Architecture for Lightweight Sessions", Santa Fe, NM, Sept. 94. URL ftp://ftp.ee.lbl.gov/talks/lws-privacy.ps.Z.
[19] V. Jacobson, "Lightweight Sessions - A new architecture for realtime applications and protocols", 3rd
Annual Principal Investigators Meeting, ARPA, Santa Rosa, CA, Sept. 1, 1993. URL ftp://ftp.ee.lbl.gov/talks/vj-nws93-2.ps.Z.
[20] S.K. Kasera, J. Kurose, and D. Towsley, "Scalable Reliable Multicast Using Multiple Multicast Groups", Proceedings of 1997 ACM Sigmetrics Conference, June 1997.
[21] J.C. Lin and S. Paul, "RMTP: A Reliable Multicast Transport Protocol", IEEE INFOCOM '96, pp. 1414-1424.
[22] Liu, C.-G., Estrin, D., Shenker, S., and Zhang, L., "Local Error Recovery in SRM: Comparison of Two Approaches", USC Technical Report 97-648, January 1997, URL http://www.usc.edu/dept/cs/technical reports.html.
[23] S. McCanne, "A Distributed Whiteboard for Network Conferencing", May 1992, UC Berkeley CS 268 Computer Networks term project. Unpublished report. URL http://www.cs.berkeley.edu/~mccanne/papers/mccanne wb92.ps.gz.
[24] S. McCanne and V. Jacobson, "vic: A Flexible Framework for Packet Video", ACM Multimedia 1995, Nov. 1995, San Francisco, CA, pp. 511-522.
[25] S. McCanne, V. Jacobson, and M. Vetterli, "Receiver-driven Layered Multicast", ACM SIGCOMM 96, August 1996, Stanford, CA, pp. 117-130.
[26] D.L. Mills, "Network Time Protocol (Version
3)", RFC (Request For Comments) 1305, March 1992.
[27] Multicast Transport Protocols WWW page, URL http://hill.lut.ac.uk/DS-Archive/MTP.html.
[28] McCanne, S., "UCB/LBNL Network Simulator ns", URL http://www-mash.cs.berkeley.edu/ns/.
[29] Nonnenmacher, J., Biersack, E., and Towsley, D., "Parity-Based Loss Recovery for Reliable Multicast Transmission", ACM SIGCOMM 97.
[30] E. Palmer, Graphical Evolution: An Introduction to the Theory of Random Graphs, John Wiley & Sons, 1985.
[31] S. Pejhan, M. Schwartz, and D. Anastassiou, "Error Control Using Retransmission Schemes in Multicast Transport Protocols for Real-Time Media", IEEE/ACM Transactions on Networking, Vol. 4, no. 3, pp. 413-427, June 1996.
[32] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.
[33] S. Pingali, D. Towsley, and J. Kurose, "A Comparison of Sender-Initiated and Receiver-Initiated Reliable Multicast Protocols", IEEE JSAC, Volume 15, Issue 3, April 1991. URL ftp://gaia.cs.umass.edu/pub/T ws96:Comparison.ps.Z. An earlier version of this paper appeared in SIGMETRICS '94, May 1994.
[34] Thomas La Porta and Mischa Schwartz,
"The Multi-Stream Protocol: A Highly Flexible High-speed Transport Protocol", IEEE Journal on Selected Areas in Communications, Vol. 11, pp. 519-530, May 1993.
[35] Sharma, P., Estrin, D., Floyd, S., and Zhang, L., "Scalable Session Messages in SRM", unpublished manuscript, August 1997. URL ftp://catarina.usc.edu/pub/puneetsh/papers/infocom98.ps.
[36] W.T. Strayer, B.J. Dempsey, and A.C. Weaver, XTP: The Xpress Transfer Protocol, Addison-Wesley, Reading, Mass., 1992. URL http://heg-school.aw.com/cseng/authors/dempsey/xtp/xtp.nclk.
[37] Xpress Transport Protocol Specification, XTP Revision 4.0, XTP Forum, Mar. 1995.
[38] Floyd, S., SRM Web Page, URL http://ftp.ee.lbl.gov/floyd/srm.html.
[39] M. Yajnik, J. Kurose, and D. Towsley, "Packet Loss Correlation in the MBone Multicast Network", IEEE Global Internet mini-conference at Globecom '96, Nov. 1996.