Slide 1: Peer-to-Peer Systems and Distributed Hash Tables
COS 518: Advanced Computer Systems, Lecture 16
Michael Freedman
[Credit: Slides adapted from Kyle Jamieson and Daniel Suo]
Slide 2: Today
- Peer-to-Peer Systems: Napster, Gnutella, BitTorrent, challenges
- Distributed Hash Tables
- The Chord Lookup Service
- Concluding thoughts on DHTs, P2P
Slide 3: What is a Peer-to-Peer (P2P) system?
A distributed system architecture with:
- No centralized control
- Nodes that are roughly symmetric in function
- A large number of unreliable nodes
[Figure: many nodes connected to one another through the Internet]
Slide 4: Why might P2P be a win?
High capacity for services through parallelism:
- Many disks
- Many network connections
- Many CPUs
Absence of a centralized server may mean:
- Less chance of service overload as load increases
- Easier deployment
- A single failure won't wreck the whole system
- The system as a whole is harder to attack
Slide 5: P2P adoption
Successful adoption in some niche areas:
- Client-to-client (legal and illegal) file sharing
- Digital currency: no natural single owner (Bitcoin)
- Voice/video telephony: user-to-user anyway
- Issues: privacy and control
Slide 6: Example: Classic BitTorrent
1. User clicks a download link and gets a torrent file with the content hash and the IP address of the tracker
2. User's BitTorrent (BT) client talks to the tracker, which returns a list of peers that have the file
3. User's BT client downloads the file from those peers
4. User's BT client tells the tracker that it now has a copy, too
5. User's BT client serves the file to others for a while
Provides huge download bandwidth without expensive servers or network links.
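To make the flow concrete, here is a minimal single-process sketch in Python; the Tracker class and its announce method are hypothetical stand-ins for the real BitTorrent tracker protocol, not its actual wire format:

    import hashlib

    class Tracker:
        """Hypothetical in-process tracker: maps infohash -> set of peer addresses."""
        def __init__(self):
            self.peers = {}

        def announce(self, infohash, peer_addr):
            # Return the peers already registered for this file, then add the caller.
            others = list(self.peers.setdefault(infohash, set()))
            self.peers[infohash].add(peer_addr)
            return others

    torrent = b"fake .torrent metadata"
    infohash = hashlib.sha1(torrent).hexdigest()   # content hash identifies the file

    tracker = Tracker()
    tracker.announce(infohash, "10.0.0.1:6881")          # an existing seeder registers
    peers = tracker.announce(infohash, "10.0.0.2:6881")  # a new client gets the peer list...
    print(peers)   # ['10.0.0.1:6881'] ...downloads from them, then serves the file too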
Slide 7: The lookup problem
Given a data item published somewhere in the system, how does a client find it?
[Figure: nodes N1-N6 connected through the Internet; publisher N4 calls put("Pacific Rim.mp4", [content]) while a client elsewhere calls get("Pacific Rim.mp4")]
Slide 8: Centralized lookup (Napster)
[Figure: publisher N4 sends SetLoc("Pacific Rim.mp4", IP address of N4) to a central database; the client sends Lookup("Pacific Rim.mp4") to the DB and is directed to N4, which holds key="Pacific Rim.mp4", value=[content]]
Simple, but O(N) state and a single point of failure.
Slide 9: Flooded queries (original Gnutella)
[Figure: the client's Lookup("Pacific Rim.mp4") is flooded to all peers N1-N6 until it reaches publisher N4, which holds key="Pacific Rim.mp4", value=[content]]
Robust, but O(N = number of peers) messages per lookup.
Slide 10: Routed DHT queries (Chord)
[Figure: the client's Lookup(H(data)) is routed hop by hop toward publisher N4, which holds key=H(data), value=[content]]
Can we make lookup robust, with reasonable state and a reasonable number of hops?
Slide 11: Today
- Peer-to-Peer Systems
- Distributed Hash Tables
- The Chord Lookup Service
- Concluding thoughts on DHTs, P2P
Slide 12: What is a DHT (and why)?
A local hash table offers:
- key = Hash(name)
- put(key, value)
- get(key) -> value
- Service: constant-time insertion and lookup
How can we do (roughly) this across millions of hosts on the Internet? Answer: a Distributed Hash Table (DHT).
Slide 13: What is a DHT (and why)?
Distributed Hash Table:
- key = hash(data)
- lookup(key) -> IP addr   (the Chord lookup service)
- send-RPC(IP address, put, key, data)
- send-RPC(IP address, get, key) -> data
Partitioning data in large-scale distributed systems:
- Tuples in a global database engine
- Data blocks in a global file system
- Files in a P2P file-sharing system
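A single-process sketch of this split between lookup and data transfer; the ToyDHT class, its node addresses, and the use of SHA-1 over addresses are illustrative assumptions, with dictionary writes standing in for the send-RPC calls:

    import hashlib

    def ident(s):
        """Hash a string (key or node address) into the ID space."""
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    class ToyDHT:
        def __init__(self, addrs):
            self.stores = {a: {} for a in addrs}   # each node's local hash table
            self.ring = sorted(addrs, key=ident)   # nodes ordered by ID

        def lookup(self, key):
            """The lookup layer: map a key to the responsible node's address."""
            kid = ident(key)
            for addr in self.ring:
                if ident(addr) >= kid:
                    return addr
            return self.ring[0]                    # wrap around the circular ID space

        def put(self, key, data):                  # stands in for send-RPC(addr, put, ...)
            self.stores[self.lookup(key)][key] = data

        def get(self, key):                        # stands in for send-RPC(addr, get, ...)
            return self.stores[self.lookup(key)].get(key)

    dht = ToyDHT(["10.0.0.%d" % i for i in range(1, 6)])
    dht.put("Pacific Rim.mp4", b"[content]")
    print(dht.get("Pacific Rim.mp4"))              # b'[content]'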
Slide 14: Cooperative storage with a DHT
- The application may be distributed over many nodes
- The DHT distributes data storage over many nodes
[Figure: a distributed application calls put(key, data) / get(key) -> data on the distributed hash table (DHash), which in turn calls lookup(key) -> node IP address on the lookup service (Chord); both layers span many nodes]
Slide 15: BitTorrent over DHT
BitTorrent can use a DHT instead of (or alongside) a tracker. BT clients use the DHT with:
- Key = file content hash ("infohash")
- Value = IP address of a peer willing to serve the file
- The DHT can store multiple values (i.e., IP addresses) per key
A client does:
- get(infohash) to find other clients willing to serve
- put(infohash, my-ipaddr) to identify itself as willing
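A sketch of this usage with a hypothetical multi-valued DHT interface (dht_put / dht_get are assumed names, faked here with an in-process dictionary rather than a real DHT client):

    dht = {}                                  # stands in for the real distributed table

    def dht_put(key, value):
        dht.setdefault(key, set()).add(value) # a key may hold many values

    def dht_get(key):
        return sorted(dht.get(key, ()))

    infohash = "9f2c1a"                       # hash of the file contents (made up here)
    dht_put(infohash, "10.0.0.1:6881")        # a seeder advertises itself
    peers = dht_get(infohash)                 # a new client finds peers to download from
    dht_put(infohash, "10.0.0.2:6881")        # ...then advertises itself as willing, too
    print(peers)                              # ['10.0.0.1:6881']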
Slide 16: Why might DHT be a win for BitTorrent?
- The DHT comprises a single giant tracker, less fragmented than many trackers, so peers are more likely to find each other
- A classic tracker is too exposed to legal and copyright attacks
Slide 17: Why the put/get DHT interface?
- The API supports a wide range of applications: the DHT imposes no structure or meaning on keys
- Key/value pairs are persistent and global, so you can store keys inside other DHT values, and thus build complex data structures
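For example, because values can contain other keys, a linked list can be stored across the DHT. This sketch fakes put/get with a local dictionary and uses content-hash keys (an assumption about the key scheme, matching DHash later in the lecture):

    import hashlib, json

    store = {}                                  # stands in for the DHT's put/get

    def put(value: bytes) -> str:
        key = hashlib.sha1(value).hexdigest()   # content-hash key
        store[key] = value
        return key

    def get(key: str) -> bytes:
        return store[key]

    # Each list cell stores its data plus the key of the next cell (or None).
    k2 = put(json.dumps({"data": "second", "next": None}).encode())
    k1 = put(json.dumps({"data": "first", "next": k2}).encode())

    key = k1
    while key is not None:                      # walk the list by following stored keys
        cell = json.loads(get(key))
        print(cell["data"])
        key = cell["next"]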
Slide 18: Why might DHT design be hard?
- Decentralized: no central authority
- Scalable: low network traffic overhead
- Efficient: find items quickly (low latency)
- Dynamic: nodes fail, new nodes join
Slide 19: Today
- Peer-to-Peer Systems
- Distributed Hash Tables
- The Chord Lookup Service: basic design; integration with the DHash DHT, performance
Slide 20: Chord lookup algorithm properties
- Interface: lookup(key) -> IP address
- Efficient: O(log N) messages per lookup, where N is the total number of servers
- Scalable: O(log N) state per node
- Robust: survives massive failures
- Simple to analyze
Slide 21: Chord identifiers
- Key identifier = SHA-1(key)
- Node identifier = SHA-1(IP address)
- SHA-1 distributes both uniformly
- How does Chord partition data? i.e., how does it map key IDs to node IDs?
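Concretely, a small sketch (the example key and IP address are made up):

    import hashlib

    def sha1_id(s):
        """Map a string into Chord's circular 160-bit ID space."""
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    key_id  = sha1_id("Pacific Rim.mp4")    # key identifier = SHA-1(key)
    node_id = sha1_id("128.112.7.15")       # node identifier = SHA-1(IP address)
    print(key_id, node_id)                  # both roughly uniform over [0, 2^160)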
Slide 22: Consistent hashing [Karger '97]
A key is stored at its successor: the node with the next-higher ID.
[Figure: circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80; e.g., K80 is stored at N90]
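The successor rule in code, as a sketch over a static, globally known list of node IDs (a simplification: Chord's whole point is that no node knows the full list):

    import bisect

    def successor(node_ids, key_id):
        """The node storing key_id: the first node ID >= key_id, wrapping around."""
        ids = sorted(node_ids)
        i = bisect.bisect_left(ids, key_id)
        return ids[i % len(ids)]         # past the top of the ring, wrap to the start

    nodes = [32, 90, 105]                # the node IDs from the figure
    print(successor(nodes, 80))          # 90:  K80 is stored at N90
    print(successor(nodes, 20))          # 32:  K20 is stored at N32
    print(successor(nodes, 113))         # 32:  K113 would wrap around the ring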
Slide 23: Chord: Successor pointers
[Figure: nodes N10, N32, N60, N90, N105, N120 on the ring, each with a pointer to its successor; key K80 is stored at N90]
Slide 24: Basic lookup
[Figure: N10 asks "Where is K80?"; the query is forwarded around the ring via successor pointers until the answer "N90 has K80" comes back]
Slide 25: Simple lookup algorithm

    Lookup(key-id)
      succ <- my successor
      if my-id < succ < key-id        // next hop (IDs compared on the circular space)
        call Lookup(key-id) on succ
      else                            // done
        return succ

Correctness depends only on successors.

Slide 26: Improving performance
- Problem: forwarding through successors is slow; the data structure is a linked list, so lookups take O(n) hops
- Idea: can we make it more like a binary search? We would need to halve the distance to the key at each step.
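A runnable sketch of the successor-only lookup above, with a toy in-process ring (the Node class and ring contents are assumptions for illustration); the hop count makes the linked-list behavior visible:

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = None            # wired up below

    def between(a, x, b):
        """Is x in the half-open interval (a, b] on the circular ID space?"""
        return (a < x <= b) if a < b else (x > a or x <= b)

    def lookup(node, key_id, hops=0):
        # If the key falls between us and our successor, the successor owns it.
        if between(node.id, key_id, node.successor.id):
            return node.successor, hops
        return lookup(node.successor, key_id, hops + 1)   # O(N) hops worst case

    # Build a small ring: 10 -> 32 -> 60 -> 90 -> 105 -> 120 -> (back to 10)
    nodes = [Node(i) for i in (10, 32, 60, 90, 105, 120)]
    for a, b in zip(nodes, nodes[1:] + nodes[:1]):
        a.successor = b

    owner, hops = lookup(nodes[0], 80)
    print(owner.id, hops)                    # 90 2: walked one successor at a time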
Slide 27: "Finger table" allows log N-time lookups
[Figure: node N80's fingers point to the nodes 1/2, 1/4, 1/8, 1/16, 1/32, and 1/64 of the way around the ring]
Slide 28: Finger i points to successor of n + 2^i
[Figure: N80's finger for i = 5 targets 80 + 2^5 = 112; since K112's successor is N120, the finger points at N120]
Slide 29: Implication of finger tables
- A binary lookup tree is rooted at every node, threaded through other nodes' finger tables
- Better than arranging nodes in a single tree: every node acts as a root, so there is no root hotspot and no single point of failure
- But a lot more state in total
Slide 30: Lookup with finger table

    Lookup(key-id)
      look in local finger table for
        highest n: my-id < n < key-id
      if n exists
        call Lookup(key-id) on node n   // next hop
      else
        return my successor             // done
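The same toy ring with precomputed finger tables shows the speedup. This is a sketch under the same assumptions as before; global knowledge is used only to build the fingers, which real Chord constructs via lookups:

    BITS = 7                                 # a 7-bit ID space, as in the figures

    def between(a, x, b):
        """x strictly inside (a, b) on the circular ID space."""
        return (a < x < b) if a < b else (x > a or x < b)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.fingers = []                # fingers[i] = successor(id + 2^i)

    def build_ring(ids):
        ids = sorted(ids)
        nodes = {i: Node(i) for i in ids}
        def succ(x):                         # first node ID >= x, wrapping around
            return nodes[min((i for i in ids if i >= x), default=ids[0])]
        for n in nodes.values():
            n.fingers = [succ((n.id + 2 ** k) % 2 ** BITS) for k in range(BITS)]
        return nodes

    def lookup(node, key_id, hops=0):
        succ = node.fingers[0]               # finger 0 is the immediate successor
        if between(node.id, key_id, succ.id) or key_id == succ.id:
            return succ, hops                # done: the successor owns the key
        for f in reversed(node.fingers):     # highest finger preceding key_id...
            if between(node.id, f.id, key_id):
                return lookup(f, key_id, hops + 1)   # ...halves the remaining distance
        return lookup(succ, key_id, hops + 1)

    ring = build_ring([5, 10, 20, 32, 60, 80, 99, 110])
    owner, hops = lookup(ring[32], 19)
    print(owner.id, hops)                    # 20 3: K19 is at N20, in O(log N) hops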
Slide 31: Lookups take O(log N) hops
[Figure: Lookup(K19) starting at N32 hops via fingers across the ring of N5, N10, N20, N32, N60, N80, N99, N110 until it reaches K19's successor, N20]
Slide 32: An aside: is log(n) fast or slow?
- For a million nodes, log(n) is 20 hops
- If each hop takes 50 ms, a lookup takes a second
- If each hop has a 10% chance of failure, expect a couple of timeouts per lookup
- So in practice, log(n) is better than O(n), but not great
Slide 33: Joining: linked list insert
[Figure: N36 joins between N25 and N40 (which holds K30 and K38). Step 1: N36 does Lookup(36) to find its place on the ring]
Slide 34: Join (2)
[Figure: Step 2: N36 sets its own successor pointer to N40]
Slide 35: Join (3)
[Figure: Step 3: copy keys 26..36 (here K30) from N40 to N36; K38 remains at N40]
Slide 36: Notify maintains predecessors
[Figure: N36 notifies N40 (so N40's predecessor becomes N36), and N25 notifies N36 (so N36's predecessor becomes N25)]
Slide 37: Stabilize message fixes successor
[Figure: N25 stabilizes with its successor N40, which replies "My predecessor is N36." N25 then corrects its successor pointer from N40 to N36]
Slide 38: Joining: summary
- The predecessor pointer allows linking in the new node
- Finger pointers are updated in the background
- Correct successors are what produce correct lookups
[Figure: final state: N25 -> N36 -> N40, with K30 now also at N36 and K38 at N40]
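A sketch of join, notify, and stabilize for the N36 example, using hypothetical in-process objects (real Chord runs stabilize periodically over RPC; key copying is omitted here):

    def between(a, x, b):
        return (a < x < b) if a < b else (x > a or x < b)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = self            # a one-node ring points at itself
            self.predecessor = None

    def find_successor(node, key_id):
        while not (between(node.id, key_id, node.successor.id)
                   or key_id == node.successor.id):
            node = node.successor
        return node.successor

    def join(new, any_ring_node):
        # Steps 1-2: look up our own ID, set our successor pointer.
        new.successor = find_successor(any_ring_node, new.id)

    def notify(node, candidate):
        # candidate says: "I may be your predecessor."
        if node.predecessor is None or between(node.predecessor.id, candidate.id, node.id):
            node.predecessor = candidate

    def stabilize(node):
        # Ask our successor for its predecessor; adopt it if it sits between us.
        x = node.successor.predecessor
        if x is not None and between(node.id, x.id, node.successor.id):
            node.successor = x
        notify(node.successor, node)

    n25, n40 = Node(25), Node(40)
    n25.successor, n40.successor = n40, n25
    n25.predecessor, n40.predecessor = n40, n25

    n36 = Node(36)
    join(n36, n25)                           # N36 learns its successor is N40
    stabilize(n36)                           # N36 notifies N40; N40's predecessor -> N36
    stabilize(n25)                           # N25 hears "my predecessor is N36", re-points
    print(n25.successor.id, n36.successor.id)   # 36 40: the ring is consistent again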
Slide 39: Failures may cause incorrect lookups
[Figure: N10 issues Lookup(K90); N85, N102, and N113 have failed, so N80 does not know its correct successor and the lookup goes wrong]
Slide 40: Successor lists
- Each node stores a list of its r immediate successors
- After a failure, a node will know its first live successor
- Correct successors guarantee correct lookups
- The guarantee holds with some probability
Slide 41: Choosing successor list length
- Assume one half of the nodes fail
- P(successor list all dead) = (1/2)^r, i.e., P(this node breaks the Chord ring)
- This depends on failures being independent
- A successor list of size r = O(log N) makes this probability 1/N: with r = log2(N), (1/2)^r = 1/N, which is low for large N
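Checking the arithmetic with a quick sketch:

    import math

    # If half the nodes fail independently, a node's whole successor list is dead
    # with probability (1/2)^r; choosing r = log2(N) drives that to about 1/N.
    for n in (1_000, 1_000_000):
        r = math.ceil(math.log2(n))
        print(n, r, 0.5 ** r)            # 1000 10 ~0.001, 1000000 20 ~1e-06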
Slide 42: Lookup with fault tolerance

    Lookup(key-id)
      look in local finger table and successor list for
        highest n: my-id < n < key-id
      if n exists
        call Lookup(key-id) on node n   // next hop
        if the call failed,
          remove n from finger table and/or successor list
          return Lookup(key-id)
      else
        return my successor             // done
Slide 43: Today
- Peer-to-Peer Systems
- Distributed Hash Tables
- The Chord Lookup Service: basic design; integration with the DHash DHT, performance
Slide 44: The DHash DHT
- Builds key/value storage on Chord
- Replicates blocks for availability: stores k replicas at the k successors after the block on the Chord ring
- Caches blocks for load balancing: the client sends a copy of the block to each of the servers it contacted along the lookup path
- Authenticates block contents
Slide 45: DHash data authentication
Two types of DHash blocks:
- Content-hash: key = SHA-1(data)
- Public-key: data signed by the corresponding private key
[Figure: Chord File System example]
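A sketch of content-hash authentication (the store dictionary stands in for blocks held across DHash; any node holding a block can be checked the same way):

    import hashlib

    store = {}

    def put_content_hash(data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest() # the key is determined by the data itself
        store[key] = data
        return key

    def get_verified(key: str) -> bytes:
        data = store[key]
        # Re-hash and compare: a corrupted or forged block cannot match its key.
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("block failed content-hash authentication")
        return data

    k = put_content_hash(b"a file block")
    print(get_verified(k) == b"a file block")    # True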
Slide 46: DHash replicates blocks at r successors
- Replicas are easy to find if the successor fails
- Hashed node IDs spread a block's successors across unrelated machines, making their failures independent
[Figure: ring of N5 through N110; Block 17 is stored at its successor and replicated at the r nodes that follow it]
Slide 47: Experimental overview
Goal: experimentally confirm the theoretical results:
- Quick lookups in large systems
- Low variation in lookup costs
- Robustness despite massive failure
Slide 48: Chord lookup cost is O(log N)
[Plot: average messages per lookup vs. number of nodes; the curve grows logarithmically with constant 1/2, i.e., about (1/2) log2(N) messages]
Slide 49: Failure experiment setup
- Start 1,000 Chord servers; each server's successor list has 20 entries
- Wait until they stabilize
- Insert 1,000 key/value pairs, with five replicas of each
- Stop X% of the servers, then immediately make 1,000 lookups
Slide 50: Massive failures have little impact
[Plot: failed lookups (percent) vs. failed nodes (percent); even with half the nodes gone, only about (1/2)^6 = 1.6% of lookups fail]
Slide 51: Today
- Peer-to-Peer Systems
- Distributed Hash Tables
- The Chord Lookup Service: basic design; integration with the DHash DHT, performance
- Concluding thoughts on DHTs, P2P
Slide 52: DHTs: Impact
- The original DHTs (CAN, Chord, Kademlia, Pastry, Tapestry) were proposed in 2001-02
- The next 5-6 years saw a proliferation of DHT-based applications:
  - Filesystems (e.g., CFS, Ivy, OceanStore, Pond, PAST)
  - Naming systems (e.g., SFR, Beehive)
  - Distributed databases and query processing (e.g., PIER)
  - Content distribution systems (e.g., CoralCDN)
Slide 53: Why don't all services use P2P?
- High latency and limited bandwidth between peers (vs. intra- and inter-datacenter links)
- User computers are less reliable than managed servers
- Lack of trust in peers' correct behavior: securing DHT routing is hard, and unsolved in practice
Slide 54: DHTs in retrospect
- Seemed promising for finding data in large P2P systems
- Decentralization seems good for load and fault tolerance
- But: the security problems are difficult
- But: churn is a problem, particularly if log(n) is big
- DHTs have not had the hoped-for impact
Slide 55: What DHTs got right
Consistent hashing:
- An elegant way to divide a workload across machines
- Very useful in clusters: actively used today in Amazon Dynamo and other systems
- Replication for high availability and efficient recovery
- Incremental scalability
- Self-management: minimal configuration
- Unique trait: no single server to shut down or monitor