Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web
David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, Rina Panigrahy

Abstract

We describe a family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows.

Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.

1 Introduction

In this paper, we describe caching protocols for distributed networks that can be used to decrease or eliminate the occurrences of “hot spots”. Hot spots occur any time a large number of clients wish to simultaneously access data from a single server. If the site is not provisioned to deal with all of these clients simultaneously, service may be degraded or lost.

Many of us have experienced the hot spot phenomenon in the context of the Web. A Web site can suddenly become extremely popular and receive far more requests in a relatively short time than it was originally configured to handle. In fact, a site may receive so many requests that it becomes “swamped,” which typically renders it unusable. Besides making the one site inaccessible, heavy traffic destined to one location can congest the network near it, interfering with traffic at nearby sites.

As use of the Web has increased, so has the occurrence and impact of hot spots. Recent famous examples of hot spots on the Web include the JPL site after the Shoemaker-Levy comet struck Jupiter, an IBM site during the Deep Blue-Kasparov chess tournament, and several political sites on the night of the election. In some of these cases, users were denied access to a site for hours or even days. Other examples include sites identified as “Web-site-of-the-day” and sites that provide new versions of popular software.

Our work was originally motivated by the problem of hot spots on the World Wide Web. We believe the tools we develop may be relevant to many client-server models, because centralized servers on the Internet such as Domain Name servers, Multicast servers, and Content Label servers are also susceptible to hot spots.

This research was supported in part by DARPA contracts N00014-95-1-1246 and DABT63-95-C-0009, Army Contract DAAH04-95-1-0607, and NSF contract CCR-9624239.
Authors' affiliations: Laboratory for Computer Science, MIT, Cambridge, MA 02139; Department of Mathematics, MIT, Cambridge, MA 02139.
Email: {karger, e lehman, danl, ftl, mslevine, danl, rina}@theory…
A full version of this paper is available at: http://theory…

1.1 Past Work

Several approaches to overcoming the hot spots have been proposed. Most use some kind of replication strategy to store copies of hot pages throughout the Internet; this spreads the work of serving a hot page across several servers. In one approach, already in wide use, several clients share a proxy cache. All user requests are forwarded through the proxy, which tries to keep copies of frequently requested pages. It tries to satisfy requests with a cached copy; failing this, it forwards the request to the home server. The dilemma in this scheme is that there is more benefit if more users share the same cache, but then the cache itself is liable to get swamped.

Malpani et al. [6] work around this problem by making a group of caches function as one. A user's request for a page is directed to an arbitrary cache. If the page is stored there, it is returned to the user. Otherwise, the cache forwards the request to all other caches via a special protocol called “IP Multicast”. If the page is cached nowhere, the request is forwarded to the home site of the page. The disadvantage of this technique is that as the number of participating caches grows, even with the use of multicast, the number of messages between caches can become unmanageable. A tool that we develop in this paper, consistent hashing, gives a way to implement such a distributed cache without requiring that the caches communicate all the time. We discuss this in Section 4.

Chankhunthod et al. [1] developed the Harvest Cache, a more scalable approach using a tree of caches. A user obtains a page by asking a nearby leaf cache. If neither this cache nor its siblings have the page, the request is forwarded to the cache's parent. If a page is stored by no cache in the tree, the request eventually reaches the root and is forwarded to the home site of the page. A cache retains a copy of any page it obtains for some time. The advantage of a cache tree is that a cache receives page requests only from its children (and siblings), ensuring that not too many requests arrive simultaneously. Thus, many requests for a page in a short period of time will only cause one request to the home server of the page, and won't overload the caches either. A disadvantage, at least in theory, is that the same tree is used for all pages, meaning that the root receives at least one request for every distinct page requested of the entire cache tree. This can swamp the root if the number of distinct page requests grows too large, meaning that this scheme also suffers from potential scaling problems.

Plaxton and Rajaraman [9] show how to balance the load among all caches by using randomization and hashing. In particular, they use a hierarchy of progressively larger sets of “virtual cache sites” for each page and use a random hash function to assign responsibility for each virtual site to an actual cache in the network. Clients send a request to a random element in each set in the hierarchy. Caches assigned to a given set copy the page to some members of the next, larger set when they discover that their load is too heavy. This gives fast responses even for popular pages, because the largest set that has the page is not overloaded. It also gives good load balancing, because a machine in a small (thus loaded) set for one page is likely to be in a large (thus unloaded) set for another. Plaxton and Rajaraman's technique is also fault tolerant.

The Plaxton/Rajaraman algorithm has drawbacks, however. For example, since their algorithm sends a copy of each page request to a random element in every set, the small sets for a popular page are guaranteed to be swamped.
In fact, the algorithm uses swamping as a feature since swamping is used to trigger replication. This works well in their model of a synchronous parallel system, where a swamped processor is assumed to receive a subset of the incoming messages, but otherwise continues to function normally. On the Internet, however, swamping has much more serious consequences. Swamped machines cannot be relied upon to recover quickly and may even crash. Moreover, the intentional swamping of large numbers of random machines could well be viewed unfavorably by the owners of those machines. The Plaxton/Rajaraman algorithm also requires that all communications be synchronous and/or that messages have priorities, and that the set of caches available be fixed and known to all users.

1.2 Our Contribution

Here, we describe two tools for data replication and use them to give a caching algorithm that overcomes the drawbacks of the preceding approaches and has several additional, desirable properties.

Our first tool, random cache trees, combines aspects of the structures used by Chankhunthod et al. and Plaxton/Rajaraman. Like Chankhunthod et al., we use a tree of caches to coalesce requests. Like Plaxton and Rajaraman, we balance load by using a different tree for each page and assigning tree nodes to caches via a random hash function. By combining the best features of Chankhunthod et al. and Plaxton/Rajaraman with our own methods, we prevent any server from becoming swamped with high probability, a property not possessed by either Chankhunthod et al. or Plaxton/Rajaraman. In addition, our protocol shows how to minimize memory requirements (without significantly increasing cache miss rates) by only caching pages that have been requested a sufficient number of times.

We believe that the extra delay introduced by a tree of caches should be quite small in practice. The time to request a page is multiplied by the tree depth. However, the page request typically takes so little time that the extra delay is not great. The return of a page can be pipelined; a cache need not wait until it receives a whole page before sending data to its child in the tree. Therefore, the return of a page also takes only slightly longer. Altogether, the added delay seen by a user is small.

Our second tool is a new hashing scheme we call consistent hashing. This hashing scheme differs substantially from that used in
Plaxton/Rajaraman and other practical systems. Typical hashing based schemes do a good job of spreading load through a known, fixed collection of servers. The Internet, however, does not have a fixed collection of machines. Instead, machines come and go as they crash or are brought into the network. Even worse, the information about what machines are functional propagates slowly through the network, so that clients may have incompatible “views” of which machines are available to replicate data. This makes standard hashing useless since it relies on clients agreeing on which caches are responsible for serving a particular page. For example, Feeley et al. [3] implement a distributed global shared memory system for a network of workstations that uses a hash table distributed among the machines to resolve references. Each time a new machine joins the network, they require a central server to redistribute a completely updated hash table to all the machines.

Consistent hashing may help solve such problems. Like most hashing schemes, consistent hashing assigns a set of items to buckets so that each bin receives roughly the same number of items. Unlike standard hashing schemes, a small change in the bucket set does not induce a total remapping of items to buckets. In addition, hashing items into slightly different sets of buckets gives only slightly different assignments of items to buckets. We apply consistent hashing to our tree-of-caches scheme, and show how this makes the scheme work well even if each client is aware of only a constant fraction of all the caching machines. In [5] Litwin et al. propose a hash function that allows buckets to be added one at a time sequentially. However, our hash function allows the buckets to be added in an arbitrary order. Another scheme that we can improve on is given by Devine [2]. In addition, we believe that consistent hashing will be useful in other applications (such as quorum systems [7] [8] or distributed name servers) where multiple machines with different views of the network must agree on a common storage location for an object without communication.

1.3 Presentation

In Section 2 we describe our model of the Web and the hot spot problem. Our model is necessarily simplistic, but is rich enough to develop and analyze protocols that we believe may be useful in practice. In Section 3, we describe our random tree method and use it in a caching
protocol that effectively eliminates hot spots under a simplified model. Independent of Section 3, in Section 4 we present our consistent hashing method and use it to solve hot spots under a different simplified model involving inconsistent views. In Section 5 we show how our two techniques can be effectively combined. In Section 6 we propose a simple delay model that captures hierarchical clustering of machines on the Internet. We show that our protocol can be easily extended to work in this more realistic delay model. In Sections 7 and 8 we consider faults and the behavior of the protocol over time, respectively. In Section 9 we discuss some extensions and open problems.

In several places we make use of hash functions that map objects into a range. For clarity we assume that these functions map objects in a truly random fashion, i.e. uniformly and independently. In practice, hash functions with limited independence are more plausible since they economize on space and randomness. We have proven all theorems of this paper with only limited independence using methods similar to those in [11]. However, in this extended abstract we only state the degree of independence required for results to hold. Proofs assuming limited independence will appear in the full version of this paper.

2 Model

This section presents our model of the Web and the hot spot problem. We classify computers on the Web into three categories. All requests for Web pages are initiated by browsers. The permanent homes of Web pages are servers. Caches are extra machines which we use to protect servers from the barrage of browser requests. Throughout the paper, the set of caches is $\mathcal{C}$ and the number of caches is $C$. Each server is home to a fixed set of pages. Caches are also able to
store a number of pages, but this set may change over time as dictated by a caching protocol. We generally assume that the content of each page is unchanging, though a later section contains a discussion of this issue. The set of all pages is denoted $P$.

Any machine can send a message directly to any other, with the restriction that a machine may not be aware of the existence of all caches; we require only that each machine is aware of a $1/t$ fraction of the caches for some constant $t$. The two typical types of messages are requests for pages and the pages themselves. A machine which receives too many messages too quickly ceases to function properly and is said to be “swamped”.

Latency measures the time for a message from machine $m_1$ to arrive at machine $m_2$. We denote this quantity $\delta(m_1, m_2)$. In practice, of course, delays on the Internet are not so simply characterized. The value of $\delta$ should be regarded as a “best guess” that we optimize on for lack of better information; the correctness of a protocol should not depend on values of $\delta$ (which could actually measure anything such as throughput, price of connection or congestion) being exactly accurate. Note that we do not make latency a function of message size; this issue is discussed in Section 3.2.1.

All cache and server behavior and some browser behavior is specified in our protocol. In particular, the protocol specifies how caches and servers respond to page requests and which pages are stored in a cache. The protocol also specifies the cache or server to which a browser sends each page request. All control must be local; the behavior of a machine can depend only on messages it receives.

An adversary decides which pages are requested by browsers. However, the adversary cannot see random values generated in our protocol and cannot adapt his requests based on observed delays in obtaining pages. We consider two models. First, we consider a static model in which a single “batch” of requests is processed, and require that the number of page requests be at most $R = \rho C$, where $\rho$ is a constant and $C$ is the number of caches. We then consider a temporal model in which the adversary may initiate new requests for pages at a bounded rate; that is, in any time interval he may initiate at most a number of requests proportional to the length of the interval.

The “hot spot problem” is to satisfy all browser page requests while ensuring that with high probability no cache or server is swamped. The phrase “with high probability” means “with probability at least $1 - 1/N$”, where $N$ is a confidence parameter used throughout the paper.

While our basic requirement is to prevent swamping, we also have two additional objectives. The first is to minimize cache memory requirements. A protocol should work well without requiring any cache to store a large number of pages. A second objective is, naturally, to minimize the delay a browser experiences in obtaining a page.

3 Random Trees

In this section we introduce our first tool, random trees. To simplify the presentation, we give a simple caching protocol that would work
well in a simpler world. In particular, we make the following simplifications to the model:

1. All machines know about all caches.
2. $\delta(m_i, m_j) = 1$ for all $i \neq j$.
3. All requests are made at the same time.

This restricted model is “static” in the sense that there is only one batch of requests; we need not consider the long-term stability of the network.

Under these restrictions we show a protocol that has good behavior. That is, with high probability no machine is swamped. We achieve a total delay of $O(\log C)$ and prove that it is optimal. We use total cache space which is a fraction of the number of requests, and evenly divided among the caches. In subsequent sections we will show how to extend the protocol so as to preserve the good behavior without the simplifying assumptions.

The basic idea of our protocol is an extension of the “tree of caches” approach discussed in the introduction. We use this tree to ensure that no cache has many “children” asking it for a particular page. As discussed in the introduction, levels near the root get many requests for a page even if the page is relatively unpopular, so being the root for many pages causes swamping. Our technique, similar to Plaxton/Rajaraman's, is to use a different, randomly generated tree for each page. This ensures that no machine is near the root for many pages, thus providing good load balancing. Note that we cannot make use of the analysis given by Plaxton/Rajaraman, because our main concern is to prevent swamping, whereas they allow machines to be swamped.

In Section 3.1 below, we define our protocol precisely. In Section 3.2, we analyze the protocol, bounding the load on any cache, the storage each cache uses, and the delay a browser experiences before getting the page.

3.1 Protocol

We associate a rooted $d$-ary tree, called an abstract tree, with each page. We use the term nodes only in reference to the nodes of these abstract trees. The number of nodes in each tree is equal to the number of caches, and the tree is as balanced as possible (so all levels but the bottom are full). We refer to nodes of the tree by their rank in breadth-first search order. The protocol is described as running on these abstract trees; to support this, all requests for pages take the form of a 4-tuple consisting of the identity of the requester, the name of the desired page, a sequence of nodes through which the request should be directed, and a sequence of caches that should act as those nodes. To determine the latter sequence, that is, which cache actually does the work for a given node, the nodes are mapped to machines. The root of a tree is always mapped to the server for the page. All the other nodes are mapped to the caches by a hash function $h : P \times [1 \ldots C] \to \mathcal{C}$, which must be distributed to all browsers and caches. In order not to create copies of pages for which there are few requests, we have another parameter, $q$, for how many requests a cache must see before it bothers to store a copy of the page.

Now, given a hash function $h$, and parameters $d$ and $q$, our protocol is as follows:

Browser: When a browser wants a page, it picks a random leaf to root path, maps the nodes to machines with $h$, and asks the leaf node for the page. The request includes the name of the browser, the name of the page, the path, and the result of the mapping.
Cache: When a cache receives a request, it first checks to see if it is caching a copy of the page or is in the process of getting one to cache. If so, it returns the page to the requester (after it gets its copy, if necessary). Otherwise it increments a counter for the page and the node it is acting as, and asks the next machine on the path for the page. If the counter reaches $q$, it caches a copy of the page. In either case the cache passes the page on to the requester when it is obtained.

Server: When a server receives a request, it sends the requester a copy of the page.

3.2 Analysis

The analysis is broken into three parts. We begin by showing that the latency in processing a request is likely to be small, under the assumption that no server is swamped. We then show that no machine is likely to be swamped. We conclude by showing that no cache need store too many pages for the protocol to work properly. The analysis of swamping runs much the same way, except that the “weights” on our abstract nodes are now the number of requests arriving at those nodes. As above, the number of requests that hit a machine is bounded by the weight of nodes mapped to it.

3.2.1 Latency

Under our protocol, the delay a browser experiences in obtaining a page is determined by the height of the tree. If a request is forwarded from a leaf to the root, the latency is twice the length of the path, $2\log_d C$. If the request is satisfied with a cached copy, the latency is only less. If a request stops at a cache that is waiting for a cached copy, the latency is still less since a request has already started up the tree. Note that $d$ can probably be made large in practice, so this latency will be quite small.

Note that in practice, the time required to obtain a large page is not multiplied by the number of steps in a path over which it travels. The reason is that the page can be transmitted along the path in a pipelined fashion. A cache in the middle of the path can start sending data to the next cache as soon as it receives some; it need not wait to receive the whole page. This means that although this protocol will increase the delay in getting small pages, the overhead for large pages is negligible. The existence of tree schemes, like the Harvest Cache, suggests that this is acceptable in practice.

Our bound is optimal (up to constant factors) for any protocol that forbids swamping. To see this, consider making requests for a single page. Look at the graph with nodes corresponding to machines and edges corresponding to links over which the page is sent. Small latency implies that this graph has small diameter, which implies that some node must have high degree, which implies swamping.

3.2.2 Swamping

The intuition behind our analysis is the following.
First we analyze the number of requests directed to the abstract tree nodes of various pages. These give “weights” to the tree nodes. We then analyze the outcome when the tree nodes are mapped by a hash function onto the actual caching machines: a machine gets as many requests as the total weight of nodes mapped to it. To bound the projected weight, we first give a bound for the case where each node is assigned to a random machine. This is a weighted version of the familiar balls-in-bins type of analysis. Our analysis gives a bound with an exponential tail. We can therefore argue as in [11] that it applies even when the balls are assigned to bins only $\log N$-way independently. This can be achieved by using a $k$-universal hash function to map the abstract tree nodes to machines.

We will now analyze our protocol under the simplified model. In this “static” analysis we assume for now that caches have enough space that they never have to evict pages; this means that if a cache has already made $q$ requests for a page it will not make another request for the same page. In Theorem 3.1 we provide high probability bounds on the number of requests a cache gets, assuming that all the outputs of the hash function $h$ are independent and random.
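The protocol of Section 3.1 and the static setting just described can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions, not the authors' implementation: requests in the batch are processed one after another, SHA-256 stands in for the random hash function $h$, nodes are numbered from 0 in breadth-first order with node 0 as the root (the home server), and the names `node_to_cache`, `leaf_to_root_path` and `simulate` are hypothetical.

```python
import hashlib
import random
from collections import Counter


def node_to_cache(page, node, num_caches):
    """Stand-in for h(page, node): maps an abstract tree node to a cache index."""
    digest = hashlib.sha256(f"{page}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_caches


def leaf_to_root_path(leaf, d):
    """Nodes from `leaf` up to the root (node 0) of a d-ary tree in BFS order."""
    path = [leaf]
    while path[-1] != 0:
        path.append((path[-1] - 1) // d)
    return path


def simulate(requests, num_caches, d, q, seed=0):
    """Return (requests received per cache, requests reaching each home server)."""
    rng = random.Random(seed)
    # Node i has children d*i+1 .. d*i+d, so leaves start after the last parent.
    first_leaf = (num_caches - 2) // d + 1
    counters = Counter()      # (page, node) -> requests forwarded so far
    cached = set()            # (page, node) pairs that hold a copy
    load = Counter()          # cache index -> requests received
    server_hits = Counter()   # page -> requests that reached the root
    for page in requests:
        leaf = rng.randrange(first_leaf, num_caches)
        for node in leaf_to_root_path(leaf, d):
            if node == 0:                        # the root is the home server
                server_hits[page] += 1
                break
            load[node_to_cache(page, node, num_caches)] += 1
            if (page, node) in cached:           # served from the cached copy
                break
            counters[(page, node)] += 1
            if counters[(page, node)] >= q:      # q requests seen: keep a copy
                cached.add((page, node))
    return load, server_hits
```

Calling `simulate(["p"] * 1000, num_caches=64, d=2, q=2)` sends a thousand requests for one page through its tree. Because every node forwards at most $q$ requests before it caches the page, the home server sees at most $dq$ of them, which is the coalescing effect the analysis relies on.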

Theorem 3.4 extends our high probability analysis to the case when $h$ is a $k$-way independent function. In particular we show that it suffices to have $k$ logarithmic in the system parameters to achieve the same high probability bounds as with full independence.

Theorem 3.1 If $h$ is chosen uniformly and at random from the space of functions $P \times [1 \ldots C] \to \mathcal{C}$ then with probability at least $1 - 1/N$, where $N$ is a parameter, the number of requests a given cache gets is no more than
\[ \rho\left(2\log_d C + O\!\left(\frac{\log N}{\log\log N}\right)\right) + O\!\left(\frac{dq\log N}{\log\left(\frac{dq}{\rho}\log N\right)} + \log N\right). \]

Note that $\rho\log_d C$ is the average number of requests per cache since each browser request could give rise to $\log_d C$ requests up the trees. The $O(\rho\log N/\log\log N)$ term arises because at the leaf nodes of some page's tree some cache could occur $O(\log N/\log\log N)$ times (balls-in-bins) and the adversary could choose to devote all $R$ requests to that page. We prove the above theorem in the rest of the section.

We split the analysis into two parts. First we analyze the requests to a cache due to its presence in the leaf nodes of the abstract trees, then analyze the requests due to its presence at the internal nodes, and then add them up.

Due to space limitations, we give a proof that applies only in a restricted range of the parameters. Its extension to the remaining case is straightforward but long. Observe that the requests for each page are being mapped randomly onto the leaf nodes of its abstract tree. And then these leaf nodes are mapped randomly onto the set of caches. Look at the collection of all the leaf nodes and the number of requests (weight) associated with each one of them. The variance among the “weights” of the leaf nodes is maximized when all the requests are made for one page. This is also the case which maximizes the number of leaf node requests on a cache.

Each page's tree has about $C(1 - 1/d)$ leaf nodes. Since a machine $m$ has a $1/C$ chance of occurring at a particular leaf node, with probability $1 - 1/N$ it will occur in $O(\log N/\log\log N)$ leaf nodes. In fact, since there are at most $R$ requests, $m$ will occur $O(\log N/\log\log N)$ times in all those requested pages' trees with probability $1 - R/N$. Given an assignment of machines to leaf nodes so that $m$ occurs $O(\log N/\log\log N)$ times in each tree, the expected number of requests $m$ gets is $\frac{R}{C} \cdot O(\log N/\log\log N)$, which is $O(\rho\log N/\log\log N)$. Also, once the assignment of machines to leaf nodes is fixed, the number of requests $m$ gets is a sum of independent Bernoulli variables. So by Chernoff bounds $m$ gets $O(\rho\log N/\log\log N + \log N)$ requests with probability $1 - 1/N$. So we conclude that $m$ gets $O(\rho\log N/\log\log N + \log N)$ requests with probability at least $1 - (R+1)/N$. Replacing $N$ by $N^2$ and assuming $R < N$, we can say that the same bound holds with probability $1 - 1/N$. It is easy to extend this proof so that the bound holds even for $R \ge N$.
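The balls-in-bins step of this argument can be explored empirically. The sketch below is an illustration only, not part of the paper's proof: it assigns each leaf node of one abstract tree to a uniformly random cache and reports the largest number of leaves any single cache receives. The function name `max_leaf_occurrences`, the seeded generator, and the approximation of the leaf count by `num_caches - num_caches // d` are assumptions made for this example.

```python
import random
from collections import Counter


def max_leaf_occurrences(num_caches, d, seed=0):
    """Throw the leaves of one tree into caches uniformly at random.

    Returns (number of leaves, largest number of leaves held by one cache).
    """
    rng = random.Random(seed)
    num_leaves = num_caches - num_caches // d   # roughly C * (1 - 1/d)
    counts = Counter(rng.randrange(num_caches) for _ in range(num_leaves))
    return num_leaves, max(counts.values())
```

Running this for growing values of `num_caches` shows how slowly the maximum grows compared with the number of leaves, which is the behavior the $O(\log N/\log\log N)$ bound captures.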
Again we think of the protocol as first running on the abstract trees. Now no abstract internal node gets more than $dq$ requests because each child node gives out at most $q$ requests for a page. Consider an arbitrary arrangement of paths for all the $R$ requests up their respective trees. Since there are only $R$ requests in all we can bound the number of abstract nodes that get $dq$ requests. In fact we will bound the number of abstract nodes over all trees which receive between $2^j$ and $2^{j+1}$ requests where $0 \le j \le \log(dq)$. Let $n_j$ denote the number of abstract nodes that receive between $2^j$ and $2^{j+1}$ requests. Let $r_p$ be the number of requests for page $p$. Then $\sum_p r_p \le R$. Since each of the $R$ requests gives rise to at most $\log_d C$ requests up the trees, the total number of requests is no more than $R\log_d C$. So,
\[ \sum_{j=0}^{\log(dq)} 2^j n_j \le R\log_d C. \tag{1} \]

Lemma 3.2 The total number of internal nodes which receive at least $x$ requests is at most $2R/x$ if $x > 1$.

Proof (sketch): Look at the tree induced by the request paths, contract out degree 2 nodes, and count internal nodes. For $x = 1$, there can clearly be no more than $R\log_d C$ requests.

The preceding lemma tells us that $n_j$, the number of abstract nodes that receive between $2^j$ and $2^{j+1}$ requests, is at most $2R/2^j$, except for $j = 0$. For $j = 0$, $n_j$ will be at most $R\log_d C$. Now the probability that machine $m$ assumes a given one of these $n_j$ nodes is $1/C$. Since assignments of nodes to machines are independent, the probability that a machine $m$ receives more than $z$ of these nodes is at most
\[ \binom{n_j}{z}\left(\frac{1}{C}\right)^{z} \le \left(\frac{e\,n_j}{Cz}\right)^{z}. \]
In order for the right hand side to be as small as $1/N$ we must have $z = \Omega\left(\frac{n_j}{C} + \frac{\log N}{\log\left(\frac{C}{n_j}\log N\right)}\right)$. Note that the latter term will be present only if $n_j/C < \log N$. So $z$ is $O\left(\frac{n_j}{C} + \frac{\log N}{\log\left(\frac{C}{n_j}\log N\right)}\right)$ with probability at least $1 - 1/N$. So with probability at least $1 - \log(dq)/N$ the total number of requests received by $m$ due to internal nodes will be of the order of
\[ \sum_{j=0}^{\log(dq)} 2^{j+1}\left(\frac{n_j}{C} + \frac{\log N}{\log\left(\frac{C}{n_j}\log N\right)}\right) = 2\rho\log_d C + O\!\left(\frac{dq\log N}{\log\left(\frac{dq}{\rho}\log N\right)}\right), \]
where the first term follows from (1). By combining the high probability bounds for internal and leaf nodes, we can say that a machine gets
\[ \rho\left(2\log_d C + O\!\left(\frac{\log N}{\log\log N}\right)\right) + O\!\left(\frac{dq\log N}{\log\left(\frac{dq}{\rho}\log N\right)} + \log N\right) \]
requests with probability at least $1 - O(\log(dq))/N$. Replacing $N$ by $N\log(dq)$ and ignoring $\log\log(dq)$ in comparison with $\log N$, we get Theorem 3.1.

In this section we show that the high probability bound we have proven for the number of requests received by a machine is tight.

Lemma 3.3 There exists a distribution of requests to pages so that a given machine gets
\[ \Omega\!\left(\rho\log_d C + \rho\,\frac{\log N}{\log\log N} + \frac{dq\log N}{\log\left(\frac{dq}{\rho}\log N\right)}\right) \]
requests with probability at least $1/N$.

Proof: Full paper.

We now extend our high probability
analysis to functions that are chosen at random from -uni ersal hash amily Theor em 3.4 If is hosen at andom fr om -univer sal hash family then with pr obability at least -6 0 given cac he eceives no mor than #H? @B  3 0/D 65-7 .  ')(  ')( equests. Pr oof: The full proof is deferred to the final ersion of the paper This result does not follo immediately from the results of [11 ], ut in olv es similar ar gument. Setting !$?A@ BH0 we get the follo wing corollary Cor ollary 3.5 The high pr obability bound pr ved in theor em 3.1 for the number of equests cac he ets holds ven if

is selected fr om ?A@ BC0 -univer sal hash family In act, this can be sho wn to be true for all the bounds that we will pro later i.e., it suf fices to be logarithmic in the system size. &  &  '  In this section, we discuss the amount of storage each cache must ha in order to mak our protocol ork. The amount of storage required at cache is simply the number of pages for which it re- cei es more than requests. Lemma 3.6 The total number of cac hed pa es, ver all mac hines, is ?A@ BC ?A@B  with pr obability at least -$ 98 given cac he has @BC cac hed copies with high pr

oba- bility Pr oof (sk etch): The analysis is ery similar to that in proof of The- orem 3.1. again play the protocol on the abstract trees. Since page is cached only if it requested times, we assign each abstract node weight of one if it gets more than requests and zero other wise. These abstract nodes are then mapped randomly onto the set of caches. can bound the total weight recei ed by particular cache, which is xactly the number of pages it caches.     + '  In this section we define ne hashing technique called consis- tent hashing moti ate this technique by reference to

a simple scheme for data replication on the Internet. Consider a single server that has a large number of objects that other clients might want to access. It is natural to introduce a layer of caches between the clients and the server in order to reduce the load on the server. In such a scheme, the objects should be distributed across the caches, so that each is responsible for a roughly equal share. In addition, clients need to know which cache to query for a specific object. The obvious approach is hashing. The server can use a hash function that evenly distributes the objects across the caches.
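A minimal sketch of this obvious approach, under assumptions that are not part of the paper: object names are strings, SHA-256 stands in for the hash function, and the helper names `classical_cache` and `fraction_moved` are hypothetical. It checks that objects are spread roughly evenly over ten caches, and also measures how many objects change cache when an eleventh cache is added, which matters as soon as the set of caches is not fixed.

```python
import hashlib


def classical_cache(obj: str, num_caches: int) -> int:
    """Map an object name to a cache index with a plain modular hash."""
    digest = hashlib.sha256(obj.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_caches


def fraction_moved(objects, old_n: int, new_n: int) -> float:
    """Fraction of objects whose cache index differs between old_n and new_n caches."""
    moved = sum(
        1 for o in objects
        if classical_cache(o, old_n) != classical_cache(o, new_n)
    )
    return moved / len(objects)


# Hypothetical workload: 10,000 object names spread over 10 caches.
objects = [f"page-{i}" for i in range(10_000)]
counts = [0] * 10
for o in objects:
    counts[classical_cache(o, 10)] += 1

# Going from 10 to 11 caches relocates the large majority of objects.
moved = fraction_moved(objects, 10, 11)
print(counts, moved)
```

With a modular hash an object keeps its cache only when its hash value gives the same remainder for both cache counts, so most objects are relocated by a single added cache.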

Clients can use the hash function to discover which cache stores an object.

Consider now what happens when the set of active caching machines changes, or when each client is aware of a different set of caches. (Such situations are very plausible on the Internet.) If the distribution was done with a classical hash function (for example, the linear congruential function $x \mapsto ax + b \pmod{p}$), such inconsistencies would be catastrophic. When the range of the hash function ($p$ in the example) changed, almost every item would be hashed to a new location. Suddenly, all cached data is useless because clients are looking for it in a different location.

Consistent hashing solves this problem of different “views.” We define a view to be the set of caches of which a particular client is aware. We assume that while views can be inconsistent, they are substantial: each machine is aware of a constant fraction of the currently operating caches. A client uses a consistent hash function to map an object to one of the caches in its view. We analyze and construct hash functions with the following consistency properties. First, there is a “smoothness” property. When a machine is added to or removed from the set of caches, the

expected fraction of objects that must be moved to a new cache is the minimum needed to maintain a balanced load across the caches. Second, over all the client views, the total number of different caches to which an object is assigned is small. We call this property “spread”. Similarly, over all the client views, the number of distinct objects assigned to a particular cache is small. We call this property “load”.

Consistent hashing therefore solves the problems discussed above. The “spread” property implies that even in the presence of inconsistent views of the world, references for a given object are directed only to a small number of caching machines. Distributing an object to this small set of caches will ensure access for all clients, without using a lot of storage. The “load” property implies that no one cache is assigned an unreasonable number of objects. The “smoothness” property implies that smooth changes in the set of caching machines are matched by a smooth evolution in the location of cached objects.

Since there are many ways to formalize the notion of consistency as described above, we will not commit to a precise definition. Rather, in Section 4.1 we define a “ranged hash function” and

then precisely define several quantities that capture different aspects of “consistency”. In Section 4.2 we construct practical hash functions which exhibit all four to some extent. In Section 4.4, we discuss other aspects of consistent hashing which, though not germane to this paper, indicate some of the richness underlying the theory.

4.1 Definitions

In this section, we formalize and relate four notions of consistency. Let $\mathcal{I}$ be the set of items and $\mathcal{B}$ be the set of buckets. Let $I = |\mathcal{I}|$ be the number of items. A view is any subset of the buckets $\mathcal{B}$. A ranged hash function is a function of the form $f : 2^{\mathcal{B}} \times \mathcal{I} \to \mathcal{B}$. Such

a function specifies an assignment of items to buckets for every possible view. That is, $f(V, i)$ is the bucket to which item $i$ is assigned in view $V$. (We will use the notation $f_V(i)$ in place of $f(V, i)$ from now on.) Since items should only be assigned to usable buckets, we require $f_V(\mathcal{I}) \subseteq V$ for every view $V$. A ranged hash family is a family of ranged hash functions. A random ranged hash function is a function drawn at random from a particular ranged hash family.

In the remainder of this section, we state and relate some reasonable notions of consistency regarding ranged hash families. Throughout, we use the following

notational conventions: $\mathcal{F}$ is a ranged hash family, $f$ is a ranged hash function, $V$ is a view, $i$ is an item, and $b$ is a bucket.

Balance: A ranged hash family is balanced if, given a particular view $V$, a set of items, and a randomly chosen function selected from the hash family, with high probability the fraction of items mapped to each bucket is $O(1/|V|)$. The balance property is what is prized about standard hash functions: they distribute items among buckets in a balanced fashion.

Monotonicity: A ranged hash function $f$ is monotone if for all views $V_1 \subseteq V_2 \subseteq \mathcal{B}$, $f_{V_2}(i) \in V_1$ implies $f_{V_1}(i) = f_{V_2}(i)$. A ranged hash family is monotone if every ranged hash

function in it is. This property says that if items are initially assigned to a set of buckets $V_1$ and then some new buckets are added to form $V_2$, then an item may move from an old bucket to a new bucket, but not from one old bucket to another. This reflects one intuition about consistency: when the set of usable buckets changes, items should only move if necessary to preserve an even distribution.

Spread: Let $V_1, \ldots, V_V$ be a set of views, altogether containing $C$ distinct buckets and each individually containing at least $C/t$ buckets. For a ranged hash function $f$ and a particular item $i$, the spread $\sigma(i)$ is the quantity $|\{f_{V_j}(i)\}_{j=1}^{V}|$. The spread of a hash function $\sigma(f)$ is the maximum spread of an item. The spread of a hash family is $\sigma$ if with high probability the spread of a random hash function from the family is $\sigma$.

The idea behind spread is that there are $V$ people, each of whom can see at least a constant fraction ($1/t$) of the buckets that are visible to anyone. Each person tries to assign an item $i$ to a bucket using a consistent hash function. The property says that across the entire group, there are at most $\sigma(i)$ different opinions about which bucket should contain the item. Clearly, a good consistent hash function should have low spread

over all items.

Load: Define a set of views as before. For a ranged hash function $f$ and bucket $b$, the load $\lambda(b)$ is the quantity $|\bigcup_{j} f_{V_j}^{-1}(b)|$. The load of a hash function is the maximum load of a bucket. The load of a hash family is $\lambda$ if with high probability a randomly chosen hash function has load $\lambda$. (Note that $f_{V}^{-1}(b)$ is the set of items assigned to bucket $b$ in view $V$.)

The load property is similar to spread. The same $V$ people are back, but this time we consider a particular bucket $b$ instead of an item. The property says that there are at most $\lambda(b)$ distinct items that at least one person thinks belong in the bucket. A good consistent hash function should also have low load.

Our main result for consistent hashing is Theorem 4.1, which shows the existence of an efficiently computable monotonic ranged hash family with logarithmic spread and balance.

4.2 Construction

We now give a construction of a ranged hash family with good properties. Suppose that we have two random functions $r_B$ and $r_I$. The function $r_B$ maps buckets randomly to the unit interval, and $r_I$ does the same for items. $f_V(i)$ is defined to be the bucket $b \in V$ that minimizes $|r_B(b) - r_I(i)|$. In other words, $i$ is mapped to the bucket “closest” to $i$. For reasons that will become apparent, we actually need

to have more than one point in the unit interval associated with each bucket. Assuming that the number of buckets in the range is always less than $C$, we will need $\kappa \log(C)$ points for each bucket for some constant $\kappa$. The easiest way to view this is that each bucket is replicated $\kappa \log(C)$ times, and then $r_B$ maps each replicated bucket randomly. In order to economize on the space to represent a function in the family, and on the use of random bits, we only demand that the functions $r_B$ and $r_I$ map points $\log(C)$-way independently and uniformly to $[0, 1]$. Note that for each point we pick in the unit interval, we need only pick enough random bits to distinguish the point from all other points. Thus it is unlikely that we need more than $\log(\text{number of points})$ bits for each point. Denote the above described hash family as $\mathcal{F}$.

Theorem 4.1 The ranged hash family $\mathcal{F}$ described above has the following properties:
1. $\mathcal{F}$ is monotone.
2. Balance: For a fixed view $V$, $\Pr[f_V(i) = b] \le O(1)/|V|$ for $i \in \mathcal{I}$ and $b \in V$, and, conditioned on the choice of $r_B$, the assignments of items to buckets are $\log(C)$-way independent.
3. Spread: If the number of views $V = \rho C$ for some constant $\rho$, and the number of items $I = C$, then for $i \in \mathcal{I}$, $\sigma(i)$ is $O(t \log(C))$ with probability greater than $1 - 1/C^{\Omega(1)}$.
4. Load: If $V$ and $I$ are as above, then for $b \in \mathcal{B}$, $\lambda(b)$ is $O(t \log(C))$ with probability greater than $1 - 1/C^{\Omega(1)}$.

Proof (sketch): Monotonicity is immediate. When a new bucket is added, the only items that move are those that are now closest to one of the new bucket's associated points. No items move between old buckets. The spread and load properties follow from the observation that with high probability a point from every view falls into an interval of length $O(t/C)$. Spread follows by observing that the number of bucket points that fall in this size interval around an item point is an upper bound on the spread of

that item, since no other bucket can be closer in any view. Standard Chernoff arguments apply to this case. Load follows by a similar argument, where we count the number of item points that fall in the region “owned” by a bucket's associated points. Balance follows from the fact that when $\kappa \log(C)$ points are randomly mapped to the unit interval, each bucket is with high probability responsible for no more than an $O(1)/|V|$ fraction of the interval. The key here is to count the number of combinatorially distinct ways of assigning this large fraction to the $\kappa \log(C)$ points associated with a bucket. This turns out to be polynomial in $C$. We then argue that with high probability none of these possibilities could actually occur, by showing that in each one an additional bucket point is likely to fall. We deduce that the actual length must be smaller than $O(1)/|V|$. All of the above proofs can be done with only $\log(C)$-way independent mappings.

The following corollary is immediate and is useful in the rest of the paper.

Corollary 4.2 With the same conditions of the previous theorem, .8 < in any view . ')( for and

4.3 Implementation

In this section we show how the hash family just described can be implemented efficiently. Specifically, the expected running time for a single hash computation will be $O(1)$. The expectation is over the choice of hash function. The expected running time for adding or deleting a bucket will be $O(\log C)$, where $C$ is an upper bound on the total number of buckets in all views.

A simple implementation uses a balanced binary search tree to store the correspondence between segments of the unit interval and buckets. If there are $C$ buckets, then there will be $\kappa C \log(C)$ intervals, so the search tree will have depth $O(\log C)$. Thus, a single hash computation takes $O(\log C)$ time. The time for an

addition or removal of a bucket is $O(\log^2 C)$, since we insert or delete $\kappa \log(C)$ points for each bucket.

The following trick reduces the expected running time of a hash computation to $O(1)$. The idea is to divide the interval into roughly $\kappa C \log(C)$ equal length segments, and to keep a separate search tree for each segment. Thus, the time to compute the hash function is the time to determine which interval $r_I(i)$ is in, plus the time to look up the bucket in the corresponding search tree. The first time is always $O(1)$. Since the expected number of points in each segment is $O(1)$, the second time is $O(1)$ in expectation.

One caveat to the above is that as the number of buckets grows, the size of the subintervals needs to shrink. In order to deal with this issue, we will use intervals only of length $1/2^x$ for some $x$. At first we choose the largest $x$ such that $2^x \le \kappa C \log(C)$. Then, as points are added, we bisect segments gradually so that when we reach the next power of 2 we have already divided all the segments. In this way we amortize the work of dividing search trees over all of the additions and removals. Another point is that the search trees in adjacent empty intervals may all need to be updated when

a bucket is added, since they may all now be closest to that bucket. Since the expected length of a run of empty intervals is small, the additional cost is negligible. For a more complete analysis of the running time we refer to the complete version of the paper.

4.4 Theory

In this section, we discuss some additional features of consistent hashing which, though unnecessary for the remainder of the paper, demonstrate some of its interesting properties. To give insight into the monotone property, we will define a new class of hash functions and then show that this is equivalent to the

class of monotone ranged hash functions.

A $\pi$-hash function is a hash function of the familiar form $f : 2^{\mathcal{B}} \times \mathcal{I} \to \mathcal{B}$ constructed as follows. With each item $i \in \mathcal{I}$, associate a permutation $\pi(i)$ of all the buckets $\mathcal{B}$. Define $f_V(i)$ to be the first bucket in the permutation $\pi(i)$ that is contained in the view $V$. Note that the permutations need not be chosen uniformly or independently.

Theorem 4.3 Every monotone ranged hash function is a $\pi$-hash function and vice versa.

Proof (sketch): For a ranged hash function $f$, associate item $i$ with the permutation $\pi(i) = (f_{\{b_1, \ldots, b_n\}}(i), f_{\{b_2, \ldots, b_n\}}(i), \ldots)$. Suppose $b_j$ is the first element of an arbitrary view $V$ in this permutation. Then $V \subseteq \{b_j, \ldots, b_n\}$. Since $f_{\{b_j, \ldots, b_n\}}(i) = b_j \in V$, monotonicity implies $f_V(i) = b_j$.

The equivalence stated in Theorem 4.3 allows us to reason about monotonic ranged hash functions in terms of permutations associated with items.

Universality: A ranged hash family is universal if restricting every function in the family to a single view creates a universal hash family. This property is one way of requiring that a ranged hash function be well-behaved in every view. The above condition is rather stringent; it says that if a view is fixed, items are assigned randomly to the bins in that view. This implies that in

any view $V$ the expected fraction of items assigned to $j$ of the buckets is $j/|V|$. Using only monotonicity and this fact about the uniformity of the assignment, we can determine the expected number of items reassigned when the set of usable buckets changes. This relates to the informal notion of “smoothness”.

Theorem 4.4 Let $f$ be a monotonic, universal ranged hash function. Let $V_1$ and $V_2$ be views. The expected fraction of items $i$ for which $f_{V_1}(i) = f_{V_2}(i)$ is $|V_1 \cap V_2| / |V_1 \cup V_2|$.

Proof (sketch): Count the number of items that move as we add buckets to $V_1$ until the view is $V_1 \cup V_2$, and then delete buckets down to $V_2$.

Note that monotonicity is used only to show an upper bound on the number of items reassigned to a new bucket; this implies that one cannot obtain a “more consistent” universal hash function by relaxing the monotone condition.

We have shown that every monotone ranged hash function can be obtained by associating each item with a random permutation of buckets. The most natural monotone consistent hash function is obtained by choosing these permutations independently and uniformly at random. We denote this function by $\hat{f}$.

Theorem 4.5 The function $\hat{f}$ is monotonic and universal. For item $i$ and bucket $b$, each of the

following hold with pr obability at least -< 0 O8 ?A@ B* and  ?< .?A@ B* 0 2. Pr oof: Monotonicity and uni ersality are immediate; this lea es spread and load. Define: ?A@ B* . .? @B 0 use  8 to denote list of the uck ets in NNN which are ordered as in O8 First, consider spread. Recall that in particular vie item is assigned to the first uck et in O8 which is also in the vie Therefore, if ery vie contains one of the first uck ets in 8 then in ery vie item will be assigned to one of the first uck ets in 8 This implies that item is assigned to

at most distinct uck ets er all the vie ws. ha to sho that with high probability ery vie contains one of the first uck ets in 8 do this by sho wing that the complement has lo probability; that is, the probability that some vie contains none of the first uck ets is at most 0 The probability that particular vie does not contain the first uck et in 8 is at most O-   since each vie contains at least   fraction of all uck ets. The act that the first uck et is not in vie only reduces the probability that subsequent uck ets are not in the vie Therefore, the

probability that particular vie contains none of the first uck ets is at most :-$   !) :-   ')(  2  By the union bound, the probability that en one of the vie ws contains none of the first uck ets is at most 0 No consider load. By similar reasoning, ery item in ery vie is assigned to one of the first ?A@ B* uck ets in 8 with probability at least :-   sho belo that fix ed uck et appears among the first @B* 0 uck ets in 8 for at most items with probability at least ->  By the union bound, both ents occur with high probability This

implies that at most items are assigned to uck et er all the vie ws. All that remains is to pro the second statement. The x- pected number of items for which the uck et appears among the first ?A@ B* uck ets in 8 is .?A@ 0  Us- ing Chernof bounds, we find that uck et appears among the first ?A@ B* uck ets in  8 for at most items with probability at least -<  -<  simple approach to constructing consistent hash function is to assign random scores to uck ets, independently for each item. Sorting the scores defines random permutation, and therefore has the

good properties proved in this section. However, finding the bucket an item belongs in requires computing all the scores. This could be restrictively slow for large bucket sets.

5 Random Trees in an Inconsistent World

In this section we apply the techniques developed in the last section to the simple hot spot protocol developed in Section 3. We now relax the assumption that clients know about all of the caches. We assume only that each machine knows about a $1/t$ fraction of the caches, chosen by an adversary. There is no difference in the protocol, except that the mapping $h$ is a consistent hash function. This

change will not affect latency. Therefore, we only analyze the effects on swamping and storage. The basic properties of consistent hashing are crucial in showing that the protocol still works well. In particular, the blowup in the number of requests and storage is proportional to the maximum spread and load of the hash function.

Theorem 5.1 If $h$ is implemented using the $\log(C)$-way independent consistent hash function of Theorem 4.1 and if each view consists of $C/t$ caches, then with probability at least -6 an arbitrary cache gets no more than # ?A@ @B D @B # ?A@B requests.

Pr oof (sk etch): look at the dif ferent trees of caches for dif- ferent vie ws for one page, Let   denote the number of caches in each tree. erlay these dif ferent trees on one an- other to get ne tree where in each node, there is set of caches. Due to the spr ead property of the consistent hash function at most  ?A@ BH caches appear at an node in this combined tree with high probability In act since there are only requests, this will be true for the nodes of all the trees for the requested pages. If  denotes the ent that appears in the  node of the combined tree for page then we kno

from Corrollary 4.2 that the probability of this ent is '( * where is the load which is ?A@B with high probability condition on the ent that and are ?A@ which happens with high probability Since cache in node sends out at most requests, each node in the combined tree sends out at most requests. no adapt the proof of Theorem 3.1 to this case. In Theorem 3.1 where ery machine as are of all the caches, an abstract node as as- signed to an gi en machine with probability  no assign and abstract node to gi en machine with probability => '( * So we ha scenario with !"R caches where each

abstract node sends out up to requests to its parent and occurs at each abstract node independently and with probability => '( * . The rest of the proof is very similar to that of Theorem 3.1.

Using techniques similar to those in the proof of Theorem 5.1 we get the following lemma. The proof is deferred to the final version of the paper.

Lemma 5.2 The total number of cached pages, over all machines, is "!( ? @BC ?A@ with probability of R-$ ; a given cache has "!( # L?A@BC cached copies with high probability.

6 Adding Distance to the Model

So far we assumed that every pair of

machines can communicate with equal ease. In this section we extend our protocol to take the latency between machines into account. The latency of the whole request will be the sum of the latencies of the machine-machine links crossed by the request. For simplicity, we assume in this section that all clients are aware of all caches. We extend our protocol to a restricted class of functions $\delta$. In particular, we assume that $\delta$ is an ultrametric. Formally, an ultrametric is a metric which obeys a more strict form of the triangle inequality: $\delta(p_1, p_2) \le \max(\delta(p_1, p_3), \delta(p_2, p_3))$.
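The strict triangle inequality can be checked mechanically on a small example. The sketch below is illustrative only: the machine names, the three-level hierarchy (continent, university, host) and the latency values in `LEVEL_LATENCY` are assumptions, not data from the paper. The distance between two machines is taken to be the latency of the highest-level link that separates them.

```python
from itertools import product

# Hypothetical hierarchy: each machine is named by (continent, university, host).
machines = {
    "a1": ("na", "mit", "a1"),
    "a2": ("na", "mit", "a2"),
    "b1": ("na", "cmu", "b1"),
    "c1": ("eu", "eth", "c1"),
}

# Assumed latency of the link crossed when two paths first differ at a level:
# level 0 = different continents, 1 = different universities, 2 = different hosts.
LEVEL_LATENCY = {0: 100.0, 1: 10.0, 2: 1.0}


def distance(x: str, y: str) -> float:
    """Latency determined by the highest-level link separating x and y."""
    if x == y:
        return 0.0
    for level, (u, v) in enumerate(zip(machines[x], machines[y])):
        if u != v:
            return LEVEL_LATENCY[level]
    return 0.0


def is_ultrametric(points) -> bool:
    """Check d(a, c) <= max(d(a, b), d(b, c)) for every triple of points."""
    return all(
        distance(a, c) <= max(distance(a, b), distance(b, c))
        for a, b, c in product(points, repeat=3)
    )


print(is_ultrametric(list(machines)))
```

The check succeeds for any such hierarchy as long as higher-level links are never faster than lower-level ones, because any third machine must be separated from one of the other two at the same level or higher.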
The ultrametric is a natural model of Internet distances, since it essentially captures the hierarchical nature of the Internet topology, under which, for example, all machines in a given university are equidistant, but all of them are farther away from another university, and still farther from another continent. The logical point-to-point connectivity is established atop a physical network, and it is generally the case that the latency between two sites is determined by the “highest level” physical communication link that must be traversed on the path between them. Indeed, another definition of an ultrametric is as a hierarchical

clustering of points. The distance in the ultrametric between two points is completely determined by the smallest cluster containing both of the points.

The only modification we make to the protocol is that when a browser maps the tree nodes to caches, it only uses caches that are as close to it as the server of the desired page. By doing this, we ensure that our path to the server does not contain any caches that are unnecessarily far away in the metric. The mapping is done using a consistent hash function, which is the vital element of the solution.

Clearly, requiring that browsers use “nearby” caches can cause swamping if there is only one cache and server near many browsers. Thus, in order to avoid cases of degenerate ultrametrics where there are browsers that are not close to any cache, and where there are clusters in the ultrametric without any caches in them, we restrict the set of ultrametrics that may be presented to the protocol. The restriction is that in any cluster the ratio of the number of caches to the number of browsers may not fall below # (recall that 7! # ). This restriction makes sense in the real world, where caches are likely to be evenly

spread out over the Internet. It is also necessary, as we can prove that a large number of browsers clustered around one cache can be forced to swamp that cache in some circumstances.

It is clear from the protocol and the definition of an ultrametric that the latency will be no more than the depth of the tree, $\log_d C$, times the latency between the browser and the server. So once again we need only look at swamping and storage. The intuition is that inside each cluster the bounds we proved for the unit distance model apply. The monotone property of consistent hashing will allow us to

restrict our analysis to ?A@ B* clusters. Thus, summing er these clusters we ha only ?A@ B* blo wup in the bound. &  '  Theor em 6.1 Let be an ultr ametric. Suppose that eac br owser mak es at most one equest. Then in the pr otocol abo ve an arbi- tr ary cac he ets no mor than @B > ?A@ ')(,* ')(+% ')(,*  ')(&* ')( ( /1 ')(,* ?A@BC0 equests with pr obability at least 0 wher is par ameter Pr oof (sk etch): The intuition behind are proof is the follo wing. bound the load on machine Consider the ranking of machines  NNN according to their distance from Suppose asks for

page from machine closer to itself than Then according to our modified protocol, it will ne er in olv in the request. So we need only consider machine if it asks for page at least as ar ay from itself as It follo ws from the definition of ultrametrics that ery <8 is also used in the re vised protocol by Intuiti ely our original protocol spread load among the ma- chines so that the probability machine got on the path for par ticular page requests as ? @B  In our ultrametric pro- tocol, plays the protocol on set of at least machines. So is on the path of the request from with

probability ?A@  8 Summing er the xpected load on is ?A@ BC Stating things slightly more formally we consider set of ?A@ BH nested “virtual clusters NNN  Note that an bro wser in will use all machines in in the protocol. modify the protocol so that such machine uses only the machines in 3 This only reduces the number of machines it uses. According to the monotonicity property of our consistent hash functions, this only increases the load on machine No we can consider each separately and apply the static analysis. The total number of requests arri ving in one of the clusters under

the modified protocol is proportional to the number of caches in the cluster, so our static analysis applies to the cluster. This gives us a bound of ?A@ on the load induced on by . Summing over the ?A@B clusters proves the theorem.

Using techniques similar to those in the proof of Theorem 6.1 we get the following lemma.

Lemma 6.2 The total number of cached pages, over all machines, is ?A@ BC ?A@ ?A@ with probability of R-< 98 ; a given cache has ?A@ > # ?A@BH cached copies with high probability.

7 Fault Tolerance

Basically, as in Plaxton/Rajaraman, the fact that our

protocol uses random short paths to the server makes it fault tolerant. We consider a model in which an adversary designates that some of the caching machines may be down, that is, ignore all attempts at communication. Remember that our adversary does not get to see our random bits, and thus cannot simply designate all machines at the top of a tree to be down. The only restriction is that a specified fraction of the machines in every view must be up. Under our protocol, no preemptive caching of pages is done. Thus, if a server goes down, all pages that it has not distributed become

inaccessible to any algorithm. This problem can be eliminated using standard techniques, such as Rabin's Information Dispersal Algorithm [10]. So we ignore server faults.

Observe that a request is satisfied if and only if all the caches serving for the nodes of the tree path are not down. Since each node is mapped to a machine ($k$-wise) independently, it is trivial (using standard Chernoff bounds) to lower bound the number of abstract nodes that have working paths to the root. This leads to the following lemma:

Lemma 7.1 Suppose that D/! ?A@ BH0 . With high probability, the fraction of abstract-tree leaves that have a working path to the root is ')( . In particular, if T- ?A@ , this fraction is a constant.

The modification to the protocol is therefore quite simple. Choose a parameter and simultaneously send that many requests for the page. A logarithmic number of requests is sufficient to give a high probability that one of the requests goes through. This change in the protocol will of course have an impact on the system. This impact is described in the full paper.

Note that since communication is a chancy thing on the Internet, failure to get a quick response from a machine is not a particularly good indication that it is down. Thus, we focused on the tolerance of faults, and not on their detection. However, given some way to decide that a machine is down, our consistent hash functions make it trivial to reassign the work to other machines. If you decide a machine is down, remove it from your view.
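A minimal sketch of removing a machine from a view, under assumptions that are not the paper's implementation: SHA-256 stands in for the random functions that place buckets and items on the unit interval, a sorted list with binary search replaces the search trees, and the class name `ConsistentHash`, the parameter `kappa` and the cache names are hypothetical. Each bucket gets about `kappa * log2(C)` points, and an item goes to the bucket owning the closest point.

```python
import bisect
import hashlib
import math


def unit_point(label: str) -> float:
    """Pseudo-random point in [0, 1) derived from a label (assumed stand-in for a random function)."""
    digest = hashlib.sha256(label.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ConsistentHash:
    """One client's view: a set of buckets, each replicated at several points."""

    def __init__(self, max_buckets: int, kappa: int = 4):
        # kappa * log2(C) points per bucket; kappa = 4 is an arbitrary choice.
        self.copies = max(1, kappa * math.ceil(math.log2(max_buckets)))
        self.points = []  # sorted list of (position, bucket name)

    def add(self, bucket: str) -> None:
        for j in range(self.copies):
            bisect.insort(self.points, (unit_point(f"bucket:{bucket}:{j}"), bucket))

    def remove(self, bucket: str) -> None:
        self.points = [p for p in self.points if p[1] != bucket]

    def lookup(self, item: str) -> str:
        """Return the bucket owning the point closest to the item's point (no wrap-around)."""
        x = unit_point(f"item:{item}")
        k = bisect.bisect_left(self.points, (x, ""))
        candidates = self.points[max(0, k - 1): k + 1]
        return min(candidates, key=lambda p: abs(p[0] - x))[1]


# Usage: decide that "cache-3" is down and remove it from the view.
view = ConsistentHash(max_buckets=16)
for n in range(8):
    view.add(f"cache-{n}")
items = [f"page-{i}" for i in range(2000)]
before = {i: view.lookup(i) for i in items}
view.remove("cache-3")
after = {i: view.lookup(i) for i in items}
moved = [i for i in items if before[i] != after[i]]
print(len(moved))
```

Only the items that were assigned to the removed cache change location; every other item keeps its cache, which is the monotone behavior the text relies on.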
8 Adding Time to the Model

So far we have omitted any real mention of time from our analyses. We have instead considered and analyzed a single “batch” of requests, and argued that this batch causes a limited amount of caching (storage usage) at every machine, while simultaneously arguing that no machine gets swamped by the batch. In this section, we show how this static analysis carries implications for a temporal model in which requests arrive over time. Recall that our temporal model says that browsers issue requests at a certain rate.

Time is a problematic issue when modeling the Internet, because the communication protocols for it have no guarantees regarding time of delivery. Thus any one request could take arbitrarily long. However, we can consider the rate at which servers receive requests. This seems like an overly simplistic measure, but the rate at which a machine can receive requests

is in act the statis- tic that hardw are manuf acturers adv ertise. consider an interv al of time and apply our “requests all come at once analysis to the requests that come in this interv al. can write the bounds from the static analysis on requests as follo ws: cache size :     cache load :    Suppose machines ha cache size Consider time interv al small enough to mak "! %*& small enough so that J'    In other ords, the number of requests that arri in this interv al is insuf ficient, according to our static analysis, to use storage xceed- ing per machine. Thus once

machine caches page during this interv al, it eeps it for the remainder of the interv al. Thus our static analysis will apply er this interv al. This gi es us bound on ho man requests can arri in the interv al. Di viding by the interv al length, we get the rate at which caches see requests: ?:   Plugging in the bounds from Section 3, we get the follo wing: Theor em 8.1 If our mac hines have ?A@BC0 stor for some constant then with pr obability 0 the bound on the ate of ne equests per cac he when we have mac hines of size is ')(   !/! Observ the tradeof fs implicit in this

theorem. Increasing causes the load to decrease proportionately ut ne er belo %R?A@ BC Increasing increases the load linearly (b ut re- duces the number of hops on request path). Increasing seems only to hurt, suggesting that we should al ays tak The abo analysis used the rate at which requests were issued to measure the rate at which connections are established to ma- chines. If we also assume that each connection lasts for finite duration, this immediately translates into bound on the number of connections open at machine at an gi en time.    This paper has focused on one

particular caching problem, that of handling read requests on the Web. We believe the ideas have broader applicability. In particular, consistent hashing may be a useful tool for distributing information from name servers such as DNS and label servers such as PICS in a load-balanced and fault-tolerant fashion. Our two schemes may together provide an interesting method for constructing multicast trees [4].

Another important way in which our ideas could be extended is in handling pages whose information changes over time, due to either server or client activity. If we augment our protocol to let the server know which machines are currently caching its page, then the server can notify such caches whenever the data on its pages changes. This might work particularly well in conjunction with the multicast protocols currently under development [4] that broadcast information from a server to all the client members of a multicast “group.” Our protocol can be mapped into this model if we assume that every machine “caching” a page joins a multicast group for that page. Even without multicast, if each cache keeps track, for each page it caches, of the at most $d$ other caches it has given the page to, then notification

of changes can be sent down the tree to only the caches that have copies.

It remains open how to deal with time when modeling the Internet, because the communication protocols have no guarantees regarding time of delivery. Indeed, at the packet level, there are not even guarantees regarding eventual delivery. This suggests modeling the Internet as some kind of distributed system. Clearly, in a model in which there are no guarantees regarding delivery times, the best one can hope to prove is some of the classical liveness and safety properties underlying distributed algorithms. It is not clear what one can prove about caching and swamping in such a model. We think that there is significant research to be done on the proper way to model this aspect of the Internet.

We also believe that interesting open questions remain regarding the method of consistent hashing that we present in this paper. Among them are the following. Is there a $k$-universal consistent hash function that can be evaluated efficiently? What tradeoffs can be achieved between spread and load? Are there some kind of “perfect” consistent hash functions that can be constructed deterministically with the same spread and load

bounds we give? On what other theoretical problems can consistent hashing give us a handle?

References

[1] Anawat Chankhunthod, Peter Danzig, Chuck Neerdaels, Michael Schwartz and Kurt Worrell. A Hierarchical Internet Object Cache. In USENIX Proceedings, 1996.
[2] Robert Devine. Design and Implementation of DDH: A Distributed Dynamic Hashing Algorithm. In Proceedings of 4th International Conference on Foundations of Data Organizations and Algorithms, 1993.
[3] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, H. M. Levy and C. A. Thekkath. Implementing Global Memory Management in a Workstation Cluster. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, 1995.
[4] Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu and Lixia Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. SIGCOMM '95.
[5] Witold Litwin, Marie-Anne Neimat and Donovan A. Schneider. LH* - A Scalable, Distributed Data Structure. ACM Transactions on Database Systems, Dec. 1996.
[6] Radhika Malpani, Jacob Lorch and David Berger. Making World Wide Web Caching Servers Cooperate. In Proceedings of World Wide Web Conference, 1996.
[7] M. Naor and A. Wool. The load, capacity, and availability of quorum systems. In Proceedings of the 35th IEEE Symposium on Foundations of Computer Science, pages 214-225, November 1994.
[8] D. Peleg and A. Wool. The availability of quorum systems. Information and Computation 123(2):210-233, 1995.
[9] Greg Plaxton and Rajmohan Rajaraman. Fast Fault-Tolerant Concurrent Access to Shared Objects. In Proceedings of 37th IEEE Symposium on Foundations of Computer Science, 1996.
[10] M. O. Rabin. Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance. Journal of the ACM 36:335–348, 1989.
[11] Jeanette Schmidt, Alan Siegel and Aravind Srinivasan. Chernoff-Hoeffding Bounds for Applications with Limited Independence. In Proc. 4th ACM-SIAM Symposium on Discrete Algorithms, 1993.