Slide1

Distributed Hash Tables: Chord and Dynamo

Costin Raiciu,

Advanced Topics in Distributed Systems

18/12/2012

Slide2

Motivation: file sharing

Many users want to share files online

If a file’s location is known, downloading is easy

The challenge is to find who stores the file we want

Early attempts:

Napster (centralized)

Kazaa

Gnutella (March 2000): completely decentralized

Slide3

How should we fix Gnutella’s problems?

Decouple storage from lookup

Gnutella: a node only answers queries for files it stores locally

Requirements:

Extreme scalability: millions of nodes

Load balance: spread load evenly across nodes

Availability: must cope with node churn (nodes joining/leaving/failing)

Slide4

Chord [Stoica et al., SIGCOMM 2001]

Opens a new body of research on "Distributed Hash Tables", together with Content Addressable Networks (also SIGCOMM 2001)

Most popular application: a Distributed Hash Table (DHT)

Slide5

Chord basics

A single fundamental operation: lookup(key)

Given a key, find the node responsible for that key

How do we do this?

Slide6

Consistent hashing

Assign unique m-bit identifiers to both nodes and objects (e.g. files)

E.g. m = 160, using SHA-1

Node identifier: hash of IP address

Object identifier: hash of name

Split the key space across all servers

Not necessary to store keys for the files you have!

Who is responsible for storing the metadata relating to a given key?

Slide7

Key assignment

Identifiers are ordered on an identifier circle modulo 2^m

Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space. This node is called the successor node of k (successor(k)).

If identifiers are represented as a circle of numbers from 0 to 2^m − 1, then successor(k) is the first node clockwise from k.

Slide8
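The consistent-hashing example on the next slide is a figure that did not survive the transcript; a minimal runnable sketch of the same idea, using a toy 16-bit identifier space (the IP addresses and file name are made up for illustration):

```python
import hashlib
from bisect import bisect_left

M = 16  # toy m-bit identifier space; Chord uses m = 160

def chord_id(name: str) -> int:
    """Map a name (IP address or file name) to an m-bit identifier via SHA-1."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def successor(key_id: int, node_ids: list[int]) -> int:
    """First node identifier equal to or clockwise-following key_id."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]  # wrap around past the largest identifier

nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
key = chord_id("song.mp3")
print(f"key {key} is stored on node {successor(key, nodes)}")
```

The wrap-around in `successor` is what makes the identifier space a circle: a key larger than every node identifier is assigned to the first node after zero.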

Consistent hashing example

Slide9

Lookup

Each node n maintains a routing table with (at most) m entries, called the finger table

The i-th entry in the table at node n contains the identity of the first node that succeeds n by at least 2^(i−1) on the circle:

n.finger[i] = successor(n + 2^(i−1)), 1 ≤ i ≤ m

Slide10
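A quick sketch of the finger-table definition above, assuming a toy 6-bit identifier space and the node identifiers from the Chord paper's running example (1, 8, 14, 21, 32, 38, 42, 48, 51, 56):

```python
M = 6  # toy 6-bit identifier space: identifiers 0..63

def finger_table(n: int, node_ids: list[int]) -> list[int]:
    """finger[i] = successor((n + 2^(i-1)) mod 2^M) for i = 1..M."""
    ring = sorted(node_ids)

    def successor(k: int) -> int:
        for nid in ring:
            if nid >= k:
                return nid
        return ring[0]  # wrap around the circle

    return [successor((n + 2 ** (i - 1)) % 2 ** M) for i in range(1, M + 1)]

print(finger_table(8, [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]))
# → [14, 14, 14, 21, 32, 42]
```

Note how the early fingers collapse onto the immediate successor (14) while the later ones jump roughly halfway around the circle: that geometric spacing is why nodes know more about near neighbors than far ones.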

Lookup (2)

Each node stores information about only a small number of other nodes: O(log N)

Nodes know more about nodes closely following them on the circle than about nodes farther away

Is there enough information in the finger table to find the successor of an arbitrary key?

Slide11

How should we use finger pointers to guide the lookup?

Slide12

Lookup algorithm

Slide13
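The lookup algorithm on the original slide is a figure that did not survive the transcript; here is a small self-contained simulation of the standard Chord routing rule (forward to the closest finger preceding the key), assuming a toy 6-bit space with node identifiers from the Chord paper's example. The hop counter also hints at the answer to the question below: with fingers halving the remaining distance, lookups take O(log N) hops.

```python
M = 6                                    # toy 6-bit identifier space
SPACE = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def in_interval(x, a, b, right_closed=False):
    """x in the circular interval (a, b), or (a, b] if right_closed."""
    if a < b:
        return a < x < b or (right_closed and x == b)
    return x > a or x < b or (right_closed and x == b)

def successor(k):
    for nid in NODES:
        if nid >= k:
            return nid
    return NODES[0]

def finger(n, i):
    """i-th finger of node n: successor(n + 2^(i-1))."""
    return successor((n + 2 ** (i - 1)) % SPACE)

def find_successor(n, key, hops=0):
    succ = finger(n, 1)                  # finger[1] is n's immediate successor
    if in_interval(key, n, succ, right_closed=True):
        return succ, hops
    for i in range(M, 0, -1):            # closest preceding finger, farthest first
        f = finger(n, i)
        if in_interval(f, n, key):
            return find_successor(f, key, hops + 1)
    return succ, hops                    # with correct fingers this is unreachable

owner, hops = find_successor(8, 54)
print(f"key 54 belongs to node {owner} after {hops} forwarding hops")
# → key 54 belongs to node 56 after 2 forwarding hops
```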

How many hops are required to find a key?

Slide14

Node joins

To maintain correctness, Chord maintains two invariants:

Each node's successor is correctly maintained

For every key k, successor(k) is responsible for k

Slide15

Node joins: detail

Chord uses a predecessor pointer to walk counterclockwise

It maintains the Chord ID and IP address of the previous node. Why?

When a node n joins the network, Chord:

Initializes the predecessor and fingers of node n

Updates the fingers and predecessors of existing nodes to reflect the addition of n

Notifies the higher-layer software so that it can transfer state associated with keys that n is now responsible for

Slide16

Stabilization: Dealing with Concurrent Joins and Failures

In practice Chord needs to deal with nodes joining the system concurrently and with nodes that fail or leave voluntarily

Solution: every node periodically runs a stabilize process

When n runs stabilize, it asks n's successor for the successor's predecessor p, and decides whether p should be n's successor instead

stabilize also notifies n's successor of n's existence, giving the successor the chance to change its predecessor to n

Slide17
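The stabilize/notify pair described above can be sketched as follows; the Node class and attribute names are illustrative, not Chord's actual RPC interface:

```python
class Node:
    def __init__(self, nid):
        self.id = nid
        self.successor = self
        self.predecessor = None

def between_open(x, a, b):
    """x in the open circular interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)

def stabilize(n):
    """Ask n's successor for its predecessor p; adopt p as successor if p
    lies between n and the current successor on the circle, then notify."""
    p = n.successor.predecessor
    if p is not None and between_open(p.id, n.id, n.successor.id):
        n.successor = p
    notify(n.successor, n)

def notify(s, n):
    """n tells s about its existence; s adopts n as predecessor if closer."""
    if s.predecessor is None or between_open(n.id, s.predecessor.id, s.id):
        s.predecessor = n

# Two-node ring {10, 50}; node 30 joins pointing at its successor 50.
a, b, c = Node(10), Node(50), Node(30)
a.successor, b.successor = b, a
a.predecessor, b.predecessor = b, a
c.successor = b
stabilize(c)        # 50 learns that 30 is its new predecessor
stabilize(a)        # 10 learns that its successor is now 30
print(a.successor.id, c.predecessor.id)  # → 30 10
```

Note how correctness emerges incrementally: the join only sets the new node's successor, and two rounds of stabilize repair the remaining pointers.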

Implementing a Distributed Hash Table over Chord

put(k, v): look up n, the node responsible for k, and store v on n

get(k): look up the node responsible for k, return the value

How long does it take to join/leave Chord?

Fix: store on n and a few of its successors

Locally broadcast the query

Slide18
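A toy put/get layer over consistent hashing, including the fix above of storing each value on the responsible node plus a successor (the class, the REPLICAS parameter, and the node identifiers are made up for illustration):

```python
import hashlib

M = 16           # toy identifier space
REPLICAS = 2     # responsible node + 1 successor, per the fix above

def chord_id(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % 2 ** M

class DHT:
    def __init__(self, node_ids):
        self.ring = sorted(node_ids)
        self.store = {nid: {} for nid in self.ring}   # per-node key/value maps

    def _successors(self, key_id, count):
        """The `count` nodes starting at the first node clockwise of key_id."""
        i = next((j for j, nid in enumerate(self.ring) if nid >= key_id), 0)
        return [self.ring[(i + k) % len(self.ring)] for k in range(count)]

    def put(self, key, value):
        for nid in self._successors(chord_id(key), REPLICAS):
            self.store[nid][key] = value

    def get(self, key):
        nid = self._successors(chord_id(key), 1)[0]
        return self.store[nid].get(key)

dht = DHT([10, 1000, 20000, 40000, 60000])
dht.put("song.mp3", "metadata for song.mp3")
print(dht.get("song.mp3"))
```

Storing on a few successors is what keeps a key readable while the ring repairs itself after its primary node leaves.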

Other aspects of Distributed Hash Tables

How do we deal with security?

Nodes that return wrong answers

Nodes that do not forward messages

…

Slide19

Applications of Distributed Hash Tables?

A whole body of research:

Distributed filesystems (PAST, OceanStore)

Distributed search

None deployed. Why?

Today: Kademlia is used for "tracker-less" torrents

Slide20

Amazon Dynamo [DeCandia et al., SOSP 2007]

(slides adapted from DeCandia et al.)

Slide21

Context

Want a distributed storage system to support some of Amazon's tasks:

best-seller lists

shopping carts

customer preferences

session management

sales rank

product catalog

Traditional databases scale poorly and have poor availability

Slide22

Amazon Dynamo

Requirements:

Scale

Simple: key-value

Highly available

Guarantee Service Level Agreements (SLAs)

Uses a key-value store as the abstraction

Slide23

System Assumptions and Requirements

Query model: read and write operations to a data item that is uniquely identified by a key

No schema needed

Small objects (< 1 MB) stored as blobs

ACID properties? Atomicity and durability, but weaker consistency

Efficiency: commodity hardware; mind the SLA!

Other assumptions: the environment is friendly (no security issues)

Slide24

Amazon request handling: 99.9% SLAs

Slide25

Design Considerations

Sacrifice strong consistency for availability. Why are consistency and availability at odds?

Optimistic replication increases availability

Allow disconnected operations; this may lead to concurrent updates to the same object: a conflict

When to perform conflict resolution? Delaying writes is unacceptable (e.g. shopping-cart updates), so solve conflicts during read instead of write, i.e. "always writeable"

Who resolves conflicts?

App: e.g. merge shopping-cart contents

Datastore: last write wins

Slide26

Other design considerations

Incremental scalability

Symmetry

Decentralization

Heterogeneity

Slide27

Partitioning Algorithm

Dynamo uses consistent hashing

Consistent hashing issues:

Load imbalance

Dealing with heterogeneity

"Virtual nodes": each node can be responsible for more than one virtual node

Slide28

Advantages of using virtual nodes

If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes.

When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.

The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

Slide29
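A minimal sketch of capacity-weighted virtual nodes: each physical host places several tokens on the ring in proportion to its capacity, so a larger host owns proportionally more of the key space (host names, capacities, and token counts are made up):

```python
import hashlib
from bisect import bisect
from collections import Counter

M = 16

def token(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % 2 ** M

# Bigger machines get proportionally more virtual nodes on the ring.
capacities = {"host-a": 1, "host-b": 2, "host-c": 4}
ring = sorted((token(f"{host}#{v}"), host)
              for host, cap in capacities.items()
              for v in range(cap * 8))        # 8 tokens per capacity unit

def owner(key: str) -> str:
    """Physical host owning a key: first virtual node clockwise from it."""
    i = bisect(ring, (token(key),)) % len(ring)
    return ring[i][1]

load = Counter(owner(f"key-{i}") for i in range(10000))
print(load)   # load should be roughly proportional to capacity
```

Removing a host only deletes its tokens; each of the arcs it owned falls to a different surviving host, which is why its load disperses evenly rather than landing on one neighbor.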

Replication

Each data item is replicated at N hosts

N is specified per instance

"Preference list": the N−1 successors of the key that store it

Slide30

Data Versioning

A put() call may return to its caller before the update has been applied at all the replicas

A get() call may return many versions of the same object

Challenge: an object can have distinct version sub-histories, which the system will need to reconcile in the future

Solution: use vector clocks to capture causality between different versions of the same object

Slide31

Vector Clock

A vector clock is a list of (node, counter) pairs.

Every version of every object is associated with one vector clock.

If the counters on the first object's clock are less-than-or-equal to all of the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.

Slide32
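The vector-clock example on the next slide is a figure that did not survive the transcript; the ancestry test can be sketched directly, with clocks as dicts and node names Sx/Sy/Sz as in the Dynamo paper's example:

```python
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` descends from clock `b`,
    i.e. every counter in b is <= its counterpart in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock dominates: divergent versions the app must reconcile."""
    return not descends(a, b) and not descends(b, a)

d2 = {"Sx": 2}                      # write handled by node Sx
d3 = {"Sx": 2, "Sy": 1}             # later write through Sy, having seen d2
d4 = {"Sx": 2, "Sz": 1}             # write through Sz, also based on d2
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}    # reconciled version written through Sx

print(descends(d3, d2))    # → True  (d2 is an ancestor and can be forgotten)
print(concurrent(d3, d4))  # → True  (conflicting branches to reconcile)
print(descends(d5, d3) and descends(d5, d4))  # → True
```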

Vector clock example

Slide33

Execution of get () and put () operations

Route the request through a generic load balancer that will select a node based on load information, or

Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes

Slide34

Quorum systems

We are balancing writes and reads over N nodes

How do we make sure a read sees the latest write?

Write on all nodes and wait for replies from all; then read from any node

Or write to one node, and read from all

Quorum systems: write to W nodes and read from R nodes such that W + R > N

Slide35
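The intersection property behind W + R > N can be checked by brute force over a tiny cluster: any write quorum and any read quorum must share at least one node, so a read always contacts some replica that saw the latest write.

```python
from itertools import combinations

def quorums_intersect(n: int, w: int, r: int) -> bool:
    """True if every size-w write quorum overlaps every size-r read quorum."""
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))

print(quorums_intersect(3, 2, 2))  # W + R = 4 > N = 3 → True
print(quorums_intersect(3, 1, 2))  # W + R = 3 = N: disjoint quorums exist → False
```

This is just the pigeonhole principle: two subsets of an N-node set with sizes summing to more than N cannot be disjoint.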

Dynamo uses Sloppy Quorum

Send the write to all nodes; return when W reply

Send the read to all nodes; return result(s) when R reply

What did we lose?

Slide36

Hinted handoff

Assume N = 3. When B is temporarily down or unreachable during a write, the replica is sent to E instead.

E's metadata hints that the replica belongs to B, and E will deliver it to B when B recovers.

The write succeeds as long as there are W nodes (any) available in the system

Slide37

Dynamo membership

Membership changes are manually configured

A gossip-based protocol propagates membership information: every node knows every other node's range

Failures are detected by each node via timeouts; this enables hinted handoffs, etc.

Slide38

Implementation

Written in Java

A local persistence component allows different storage engines to be plugged in:

Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes

MySQL: objects larger than tens of kilobytes

BDB Java Edition, etc.

Slide39

Evaluation

Slide40

Evaluation