Distributed Hash Tables: Chord and Dynamo

Costin Raiciu
Advanced Topics in Distributed Systems
18/12/2012

Motivation: file sharing

Many users want to share files online
If a file’s location is known, downloading is easy
The challenge is to find who stores the file we want
Early attempts:
Napster (centralized)
Kazaa
Gnutella (March 2000): completely decentralized

How should we fix Gnutella’s problems?

Decouple storage from lookup
Gnutella: a node only answers queries for files it stores locally
Requirements:
Extreme scalability: millions of nodes
Load balance: spread load evenly across nodes
Availability: must cope with node churn (nodes joining/leaving/failing)

Chord [Stoica et al, Sigcomm 2001]

Opens a new body of research on “Distributed Hash Tables”
Together with Content Addressable Networks (also Sigcomm 2001)
Most popular application: a Distributed Hash Table (DHT)

Chord basics

A single fundamental operation: lookup(key)
Given a key, find the node responsible for that key
How do we do this?

Consistent hashing

Assign unique m-bit identifiers to both nodes and objects (e.g. files)
E.g. m=160, using SHA-1
Node identifier: hash of its IP address
Object identifier: hash of its name
Split the key space across all servers
Not necessary to store keys for the files you have!
Who is responsible for storing the metadata relating to a given key?
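A minimal sketch of the identifier assignment, assuming SHA-1 as on the slide (the helper name chord_id and the small m are illustrative only):

```python
import hashlib

M = 16                       # identifier bits; the slide suggests m = 160

def chord_id(data: str) -> int:
    """Map an arbitrary string to an m-bit identifier using SHA-1."""
    digest = hashlib.sha1(data.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

node_id = chord_id("192.168.0.7")   # node: hash of its IP address
key_id = chord_id("song.mp3")       # object: hash of its name
```
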
Key assignment

Identifiers are ordered on an identifier circle modulo 2^m
Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space
This node is called the successor node of k, written successor(k)
If identifiers are represented as a circle of numbers from 0 to 2^m − 1, then successor(k) is the first node clockwise from k
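A hedged sketch of successor(k) over a sorted list of node identifiers (illustrative only; real Chord resolves this by routing messages rather than consulting a global list):

```python
import bisect

def successor(k: int, node_ids: list[int]) -> int:
    """Return the first node clockwise from k (node_ids sorted ascending)."""
    i = bisect.bisect_left(node_ids, k)
    return node_ids[i % len(node_ids)]   # wrap around past the largest id

ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
assert successor(54, ring) == 56
assert successor(60, ring) == 1          # wraps around the circle
```
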
Consistent hashing example

Lookup

Each node n maintains a routing table with (at most) m entries, called the finger table
The i-th entry in the table at node n contains the identity of the first node that succeeds n by at least 2^(i−1) on the circle:
n.finger[i] = successor(n + 2^(i−1)), for 1 ≤ i ≤ m
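To make the formula concrete, a sketch that derives the finger targets using the successor() helper and ring above (a small m is chosen for readability):

```python
def finger_table(n: int, node_ids: list[int], m: int) -> list[int]:
    """finger[i] = successor(n + 2^(i-1)) for i = 1..m on a 2^m ring."""
    return [successor((n + 2 ** (i - 1)) % (2 ** m), node_ids)
            for i in range(1, m + 1)]

# Node 8 on a 6-bit ring: targets 9, 10, 12, 16, 24 and 40
print(finger_table(8, ring, m=6))        # [14, 14, 14, 21, 32, 42]
```
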
Lookup (2)

Each node stores information about only a small number of other nodes: O(log N)
Nodes know more about nodes closely following them on the circle than about nodes farther away
Is there enough information in the finger table to find the successor of an arbitrary key?

How should we use finger pointers to guide the lookup?

Lookup algorithm
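The lookup recurses through the closest preceding finger. A compact sketch in the spirit of the paper’s pseudocode, with nodes modeled as local objects and remote calls elided (the predecessor and store fields are used by sketches on later slides):

```python
def in_interval(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), optionally (a, b]."""
    if a < b:
        return a < x < b or (inclusive_right and x == b)
    return x > a or x < b or (inclusive_right and x == b)   # wrapped interval

class Node:
    def __init__(self, id, m):
        self.id, self.m = id, m
        self.successor = self            # set properly during join
        self.predecessor = None
        self.finger = [self] * m         # finger[i] ~ successor(id + 2^i)
        self.store = {}                  # key/value data this node holds

    def find_successor(self, k):
        if in_interval(k, self.id, self.successor.id, inclusive_right=True):
            return self.successor
        return self.closest_preceding_node(k).find_successor(k)

    def closest_preceding_node(self, k):
        for f in reversed(self.finger):  # scan fingers, farthest first
            if in_interval(f.id, self.id, k):
                return f
        return self
```
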
How many hops are required to find a key?

Node joins

To maintain correctness, Chord maintains two invariants:
Each node’s successor is correctly maintained
For every key k, successor(k) is responsible for k

Node joins: detail

Chord uses a predecessor pointer to walk counterclockwise: each node maintains the Chord ID and IP address of the previous node. Why?
When a node n joins the network, Chord:
Initializes the predecessor and fingers of node n
Updates the fingers and predecessors of existing nodes to reflect the addition of n
Notifies the higher-layer software so that it can transfer the state associated with keys that n is now responsible for

Stabilization: dealing with concurrent joins and failures

In practice, Chord needs to deal with nodes joining the system concurrently and with nodes that fail or leave voluntarily
Solution: every node periodically runs a stabilize process
When n runs stabilize, it asks n’s successor for the successor’s predecessor p, and decides whether p should be n’s successor instead
stabilize also notifies n’s successor of n’s existence, giving the successor the chance to change its predecessor to n
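A sketch of the stabilize/notify pair described above, extending the Node class from the lookup slide (periodic scheduling and failure handling elided):

```python
    # --- methods added to class Node ---
    def stabilize(self):
        """Adopt our successor's predecessor if it sits between us and it."""
        p = self.successor.predecessor
        if p is not None and in_interval(p.id, self.id, self.successor.id):
            self.successor = p
        self.successor.notify(self)   # tell the successor we may precede it

    def notify(self, other):
        """Accept 'other' as predecessor if it is closer than the current one."""
        if self.predecessor is None or \
                in_interval(other.id, self.predecessor.id, self.id):
            self.predecessor = other
```
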
Implementing a Distributed Hash Table over Chord

put(k, v): look up n, the node responsible for k, and store v on n
get(k): look up the node responsible for k and return the value
How long does it take to join/leave Chord?
Fix: store each key on n and on a few of its successors
Locally broadcast the query
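A toy put/get layered over find_successor (a hedged sketch using the store dict from the Node class above; a real deployment would also replicate to a few successors, per the fix just mentioned):

```python
    # --- methods added to class Node ---
    def put(self, key: str, value):
        k = chord_id(key)
        self.find_successor(k).store[k] = value   # store on responsible node

    def get(self, key: str):
        k = chord_id(key)
        return self.find_successor(k).store.get(k)
```
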
Other aspects of Distributed Hash Tables

How do we deal with security?
Nodes that return wrong answers
Nodes that do not forward messages
…

Applications of Distributed Hash Tables?

A whole body of research:
Distributed filesystems (PAST, OceanStore)
Distributed search
None deployed. Why?
Today: Kademlia is used for “tracker-less” torrents

Amazon Dynamo [DeCandia et al, SOSP 2007]

(slides adapted from DeCandia et al)

Context

Amazon wants a distributed storage system to support tasks such as:
best-seller lists
shopping carts
customer preferences
session management
sales rank
product catalog
Traditional databases scale poorly and have poor availability

Amazon Dynamo

Requirements:
Scale
Simplicity: a key-value store is the abstraction
Highly available
Guarantee Service Level Agreements (SLAs)

System assumptions and requirements

Query model: read and write operations on a data item uniquely identified by a key
No schema needed
Small objects (< 1 MB) stored as blobs
ACID properties? Atomicity and durability, but weaker consistency (traded for availability)
Efficiency: commodity hardware, and mind the SLA!
Other assumptions: the environment is friendly (no security issues)

Amazon request handling: SLAs at the 99.9th percentile

Design considerations

Sacrifice strong consistency for availability
Why are consistency and availability at odds?
Optimistic replication increases availability
Allow disconnected operation
This may lead to concurrent updates to the same object: a conflict
When to perform conflict resolution?
Delaying writes is unacceptable (e.g. shopping cart updates)
Solve conflicts during reads instead of writes, i.e. stay “always writeable”
Who resolves conflicts?
The application, e.g. by merging shopping cart contents
The datastore, e.g. last write wins

Other design considerations

Incremental scalability
Symmetry
Decentralization
Heterogeneity

Partitioning algorithm

Dynamo uses consistent hashing
Consistent hashing issues: load imbalance, and dealing with heterogeneity
“Virtual nodes”: each physical node can be responsible for more than one virtual node
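A hedged sketch of a ring with virtual nodes, reusing chord_id from the Chord slides (class and host names are illustrative): each physical host appears at many ring positions, so its key range is split into many small arcs:

```python
import bisect

class VNodeRing:
    def __init__(self, hosts: dict[str, int]):
        """hosts maps host name -> number of virtual nodes (scale by capacity)."""
        self.ring = sorted((chord_id(f"{host}#{i}"), host)
                           for host, n in hosts.items() for i in range(n))
        self.ids = [vid for vid, _ in self.ring]

    def owner(self, key: str) -> str:
        """Physical host owning the first virtual node clockwise from the key."""
        i = bisect.bisect_left(self.ids, chord_id(key)) % len(self.ring)
        return self.ring[i][1]

vring = VNodeRing({"hostA": 32, "hostB": 32, "hostC": 64})  # C: 2x capacity
print(vring.owner("cart:12345"))
```
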
Advantages of using virtual nodes

If a node becomes unavailable, the load it handled is dispersed evenly across the remaining available nodes
When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes
The number of virtual nodes a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure

Replication

Each data item is replicated at N hosts
N is specified per Dynamo instance
“Preference list”: the nodes that store a given key, i.e. the key’s coordinator (its successor) and the next N−1 successors
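A sketch of deriving a preference list from the virtual-node ring above (a hypothetical method; it skips duplicate hosts so the N replicas land on distinct physical machines):

```python
    # --- method added to class VNodeRing ---
    def preference_list(self, key: str, n: int) -> list[str]:
        """Walk clockwise from the key, collecting n distinct physical hosts."""
        start = bisect.bisect_left(self.ids, chord_id(key))
        hosts: list[str] = []
        for step in range(len(self.ring)):
            host = self.ring[(start + step) % len(self.ring)][1]
            if host not in hosts:
                hosts.append(host)
            if len(hosts) == n:
                break
        return hosts
```
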
Data versioning

A put() call may return to its caller before the update has been applied at all replicas
A get() call may return many versions of the same object
Challenge: an object can acquire distinct version sub-histories, which the system will need to reconcile in the future
Solution: use vector clocks to capture causality between different versions of the same object

Vector clocks

A vector clock is a list of (node, counter) pairs
Every version of every object is associated with one vector clock
If every counter in the first object’s clock is less than or equal to the corresponding counter in the second clock, then the first version is an ancestor of the second and can be forgotten
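A minimal sketch of that ancestor test, with clocks as dicts from node name to counter (missing entries count as zero; the node names are illustrative):

```python
def is_ancestor(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if every counter in clock a is <= its counterpart in clock b."""
    return all(c <= b.get(node, 0) for node, c in a.items())

v1 = {"Sx": 2}
v2 = {"Sx": 2, "Sy": 1}
assert is_ancestor(v1, v2)             # v1 can be forgotten
assert not is_ancestor(v2, v1)         # otherwise the versions conflict
assert not is_ancestor(v2, {"Sz": 3})  # concurrent: reconcile at read time
```
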
Vector clock example

Executing get() and put() operations

A client can route its request through a generic load balancer that selects a node based on load information,
or use a partition-aware client library that routes requests directly to the appropriate coordinator node

Quorum systems

We balance writes and reads over N nodes
How do we make sure a read sees the latest write?
Write to all nodes and wait for replies from all; then read from any one node
Or write to one node and read from all
Quorum systems generalize this: write to W nodes and read from R nodes such that W + R > N
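The reason W + R > N works: any write set of size W and any read set of size R drawn from N nodes must overlap, so every read quorum contains at least one node holding the latest write. A brute-force check of that claim (the helper is hypothetical):

```python
from itertools import combinations

def quorums_intersect(n: int, w: int, r: int) -> bool:
    """Exhaustively verify that every W-set overlaps every R-set of n nodes."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

assert quorums_intersect(n=3, w=2, r=2)       # W + R > N: always overlaps
assert not quorums_intersect(n=3, w=1, r=2)   # W + R = N: a read can miss
```
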
Dynamo uses a sloppy quorum

Send the write to all nodes; return once W reply
Send the read to all nodes; return the result(s) once R reply
What did we lose?

Hinted handoff

Assume N = 3. When B is temporarily down or unreachable during a write, send its replica to E instead
E’s metadata hints that the replica belongs to B, and E will deliver it back to B once B recovers
The write succeeds as long as any W nodes are available in the system
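A hedged sketch of the hint bookkeeping on the stand-in node (class and method names are illustrative, not Dynamo’s actual interfaces):

```python
class HintStore:
    """Kept on the stand-in node (E above): replicas accepted for a down peer."""
    def __init__(self):
        self.hints = []                        # (intended_owner, key, value)

    def accept(self, intended_owner: str, key: str, value):
        self.hints.append((intended_owner, key, value))

    def on_peer_recovered(self, peer: str, send):
        """Deliver each hinted replica back to its intended owner, then drop it."""
        keep = []
        for owner, key, value in self.hints:
            if owner == peer:
                send(peer, key, value)         # hand the replica back to B
            else:
                keep.append((owner, key, value))
        self.hints = keep
```
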
Dynamo membership

Membership changes are configured manually
A gossip-based protocol propagates membership information
Every node knows every other node’s key range
Failures are detected by each node via timeouts
This enables hinted handoff, etc.

Implementation

Written in Java
A local persistence component allows different storage engines to be plugged in:
Berkeley Database (BDB) Transactional Data Store: for objects of tens of kilobytes
MySQL: for objects larger than tens of kilobytes
BDB Java Edition, etc.

Evaluation