Slide1
Network Algorithms
Presenter: Kurchi Subhra Hazra
Slide2
Agenda
Basic Algorithms such as Leader Election
Consensus in Distributed Systems
Replication and Fault Tolerance in Distributed Systems
GFS as an example of a Distributed System
Slide3
Network Algorithms
A distributed system is a collection of entities where
each entity is autonomous, asynchronous and failure-prone,
communicating through unreliable channels
to perform some common function.
Network algorithms enable such distributed systems to perform these “common functions” effectively
Slide4
Global State in Distributed Systems
We want to estimate a “consistent” state of a distributed system
Required for determining whether the system is deadlocked or has terminated, and for debugging
Two approaches:
1. Centralized – all processes and channels report to a central process
2. Distributed – the Chandy-Lamport algorithm
Slide5
Chandy-Lamport Algorithm
Based on marker messages M
On receiving M over channel c:
If own state is not yet recorded:
a) Record own state
b) Start recording the state of incoming channels
c) Send marker messages on all outgoing channels
Else
a) Record the state of c (the messages received on c since recording began)
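A minimal, single-process sketch of these marker rules (the class, channel names, and the send callback are illustrative assumptions, not part of the slides):

```python
class SnapshotProcess:
    """Illustrative sketch of the Chandy-Lamport marker rules (not from the slides)."""

    def __init__(self, pid, state, incoming, outgoing, send):
        self.pid = pid
        self.state = state                  # local application state
        self.incoming = incoming            # names of incoming channels
        self.outgoing = outgoing            # names of outgoing channels
        self.send = send                    # send(channel, message) callback
        self.recorded_state = None          # own state at snapshot time
        self.channel_state = {}             # channel -> messages recorded for it
        self.closed = set()                 # channels whose recording has finished

    def initiate_snapshot(self):
        self._record_own_state_and_send_markers()

    def on_message(self, channel, message):
        if message == "MARKER":
            if self.recorded_state is None:
                # First marker seen: record own state, start recording the other
                # incoming channels, and send markers on all outgoing channels.
                self._record_own_state_and_send_markers()
            # The state of channel c is whatever was recorded before this marker
            # arrived (empty if this was the first marker); stop recording c.
            self.closed.add(channel)
        elif self.recorded_state is not None and channel not in self.closed:
            # An in-flight message: it belongs to the recorded state of channel c.
            self.channel_state[channel].append(message)

    def _record_own_state_and_send_markers(self):
        self.recorded_state = self.state
        self.channel_state = {c: [] for c in self.incoming}
        for c in self.outgoing:
            self.send(c, "MARKER")
```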
Slide6
Chandy-Lamport Algorithm
(Space-time diagram with processes P1, P2, P3, their events e, and marker messages M)
1. P1 initiates the snapshot: records its state (S1); sends markers to P2 and P3; turns on recording for channels Ch21 and Ch31
2. P2 receives a marker over Ch12, records its state (S2), sets state(Ch12) = {}, sends markers to P1 and P3, and turns on recording for channel Ch32
3. P1 receives a marker over Ch21 and sets state(Ch21) = {a}
4. P3 receives a marker over Ch13, records its state (S3), sets state(Ch13) = {}, sends markers to P1 and P2, and turns on recording for channel Ch23
5. P2 receives a marker over Ch32 and sets state(Ch32) = {b}
6. P3 receives a marker over Ch23 and sets state(Ch23) = {}
7. P1 receives a marker over Ch31 and sets state(Ch31) = {}
Taken from CS 425/UIUC/Fall 2009
Slide7
Leader Election
Suppose you want to
- elect a master server out of n servers
- elect a coordinator among different mobile systems
Common leader election algorithms:
- Ring Election
- Bully Election
Two requirements:
Safety (the process with the best attribute is elected)
Liveness (the election terminates)
Slide8
Ring Election
Processes organized in a ring
A process starts an election by sending a message clockwise to the next process in the ring, carrying its id and its own attribute value
The next process checks the election message:
If its own attribute value is greater, it replaces the id and attribute in the message with its own and forwards the message
If its own attribute value is less, it simply passes the message on
If the attribute value equals its own (the message has gone all the way around), it declares itself the leader and passes on an “elected” message
What happens when a node fails?
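A compact simulation of these rules, assuming each process's attribute is simply its id and no node fails mid-election (the function name and the example ids are made up for illustration):

```python
def ring_election(ids, initiator_index):
    """Illustrative ring election: the message travels clockwise until it
    returns to the process whose id it carries, which becomes the leader.
    The follow-up 'elected' announcement pass is omitted for brevity."""
    n = len(ids)
    message = ids[initiator_index]      # initiator puts its own id in the message
    i = (initiator_index + 1) % n
    while True:
        own = ids[i]
        if message == own:
            return own                  # message came back around: elected
        if own > message:
            message = own               # own attribute greater: overwrite message
        # own attribute smaller: forward the message unchanged
        i = (i + 1) % n


print(ring_election([3, 32, 5, 80, 6, 12], initiator_index=0))   # -> 80
```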
Slide9
Ring Election – Example (diagram)
Taken from CS 425/UIUC/Fall 2009
Slide10
Ring Election – Example, continued (diagram)
Taken from CS 425/UIUC/Fall 2009
Slide11
Bully Algorithm
Best-case and worst-case scenarios (diagram)
Taken from CS 425/UIUC/Fall 2009
Slide12
Consensus
A set of n processes/systems attempt to “agree” on some information
Each process Pi begins in the undecided state and proposes a value vi ∈ D
Processes communicate by exchanging values
Pi sets its decision value di and enters the decided state
Requirements:
1. Termination: eventually all correct processes decide, i.e., each correct process sets its decision variable
2. Agreement: the decision value of all correct processes is the same
3. Integrity: if all correct processes proposed v, then any correct decided process has di = v
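Purely to illustrate the three requirements (this is not one of the protocols on the following slides): in a failure-free setting with reliable channels, consensus is trivial because every process can gather all proposals and apply the same deterministic rule.

```python
def failure_free_consensus(proposals):
    """Toy illustration only: with no failures, every process sees all proposed
    values and decides the minimum, so Termination, Agreement and Integrity all
    hold. Real protocols (2PC, 3PC, Paxos, PBFT below) exist precisely because
    processes and channels can fail."""
    decision = min(proposals)                     # same deterministic rule everywhere
    return {pid: decision for pid in range(len(proposals))}


print(failure_free_consensus([7, 3, 9]))          # every process decides 3
```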
Slide13
2 Phase Commit Protocol (2PC)
Useful in distributed transactions to perform an atomic commit
Atomic commit: a set of distinct changes applied as a single operation
Suppose $300 is transferred from A’s account to B’s account:
A = A - 300
B = B + 300
These two operations must either both take effect or neither, to keep the accounts consistent
Slide14
2 Phase Commit Protocol
(Protocol diagram)
What happens if the coordinator and a participant fail after doCommit?
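A minimal sketch of the coordinator's side of 2PC (the Participant class and method names are illustrative; timeouts and crash recovery, the subject of the question above, are deliberately omitted):

```python
class Participant:
    """Toy participant that votes in phase 1 and applies the outcome in phase 2."""

    def __init__(self, name, willing=True):
        self.name, self.willing = name, willing

    def can_commit(self):          # phase 1: vote on canCommit?
        return self.willing

    def do_commit(self):           # phase 2: commit
        print(self.name, "committed")

    def do_abort(self):            # phase 2: abort
        print(self.name, "aborted")


def two_phase_commit(participants):
    """Phase 1 collects canCommit? votes; phase 2 broadcasts the decision."""
    if all(p.can_commit() for p in participants):
        for p in participants:
            p.do_commit()
        return "commit"
    for p in participants:
        p.do_abort()
    return "abort"


print(two_phase_commit([Participant("A"), Participant("B")]))   # -> commit
```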
Slide15
Issue with 2PC
The coordinator sends canCommit? to participants A and B
Slide16
Issue with 2PC
A and B both reply Yes
Slide17
Issue with 2PC
The coordinator sends doCommit; A crashes; the coordinator crashes; B commits
A new coordinator cannot know whether A had committed.
Slide18
3 Phase Commit Protocol (3PC)
Use an additional stage (preCommit) between the vote and the commit
Slide19
3PC Cont…
(Message flow between the coordinator and cohorts 1–3)
canCommit -> ack -> preCommit -> ack -> commit sent to every cohort
Slide20
3PC Cont…
Why is this better?
2PC: execute the transaction when everyone is willing to COMMIT it
3PC: execute the transaction when everyone knows it will COMMIT
(http://www.coralcdn.org/07wi-cs244b/notes/l4d.txt)
But 3PC is expensive
Timeouts are triggered by slow machines
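For contrast with the 2PC sketch earlier, here is the same coordinator with the extra preCommit stage (the Cohort class and method names are again illustrative): after a successful preCommit round every cohort knows the outcome will be commit, which is what a replacement coordinator can rely on after a crash.

```python
class Cohort:
    """Toy cohort that acknowledges every phase of 3PC."""

    def can_commit(self):
        return True

    def pre_commit(self):
        print("ack preCommit")

    def do_commit(self):
        print("committed")

    def do_abort(self):
        print("aborted")


def three_phase_commit(cohorts):
    """canCommit -> preCommit -> commit; timeouts are omitted for brevity."""
    if not all(c.can_commit() for c in cohorts):
        for c in cohorts:
            c.do_abort()
        return "abort"
    for c in cohorts:            # extra stage: everyone learns the decision first
        c.pre_commit()
    for c in cohorts:            # only then does anyone actually commit
        c.do_commit()
    return "commit"


print(three_phase_commit([Cohort(), Cohort(), Cohort()]))   # -> commit
```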
Slide21
Paxos Protocol
A consensus algorithm
Important safety conditions:
Only one value is chosen
Only a proposed value is chosen
Important liveness conditions:
Some proposed value is eventually chosen
Once a value is chosen, a process can eventually learn the value
Nodes act as proposers, acceptors and learners
Slide22
Paxos Protocol – Phase 1
The proposer selects a number n for a proposal of value v
It sends a Prepare message to the acceptors
What about an acceptor that misses the message? A majority of acceptors is enough
Acceptors respond (acknowledge) with the highest n each has seen
Slide23
Paxos Protocol – Phase 2
(Diagram: acceptors respond for proposal n)
A majority of acceptors agree on proposal n with value v
Slide24
Paxos Protocol – Phase 2
Once a majority of acceptors agree on proposal n with value v, the proposer sends Accept messages and the acceptors accept
What if v is null?
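A compact sketch of the acceptor logic behind these two phases, plus the proposer's value-selection rule that answers the question above (all class and function names are illustrative; message transport and learners are omitted):

```python
class Acceptor:
    """Illustrative Paxos acceptor state for Phase 1 and Phase 2 (sketch only)."""

    def __init__(self):
        self.promised_n = -1      # highest proposal number promised so far
        self.accepted_n = -1      # highest proposal number accepted so far
        self.accepted_v = None    # value accepted with accepted_n

    def on_prepare(self, n):
        # Phase 1b: promise not to accept proposals numbered below n, and report
        # the highest-numbered proposal accepted so far (if any).
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, v):
        # Phase 2b: accept unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("reject", self.promised_n, None)


def choose_value(promises, my_value):
    """Phase 2a rule: if any acceptor in the majority already accepted a value,
    adopt the one with the highest proposal number; only if every promise came
    back with a null value is the proposer free to use its own value."""
    accepted = [(n, v) for (_, n, v) in promises if v is not None]
    return max(accepted)[1] if accepted else my_value
```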
Slide25
Paxos Protocol Cont…
What if an arbitrary number of proposers is allowed?
(Diagram: proposers P and Q submit proposals n1 and n2 to an acceptor in rounds 1 and 2)
Slide26
Paxos Protocol Cont…
What if an arbitrary number of proposers is allowed?
Competing proposers can keep pre-empting each other with higher-numbered proposals (n3, n4, …) round after round, so no value is ever chosen
To ensure progress, use a distinguished proposer
Slide27
Paxos Protocol Contd…
Some issues:
How to choose the proposer?
How do we ensure a unique n?
Expensive protocol
No primary if a distinguished proposer is used
Originally used by the Paxons to run their part-time parliament
Slide28
Replication
Replication is important for
Fault Tolerance
Load Balancing
Increased Availability
Requirements:
Transparency
Consistency
Slide29
Failure in Distributed Systems
An important consideration in every design decision
Failure detectors should be:
Complete – able to detect a fault whenever one occurs
Accurate – does not raise false positives
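A toy heartbeat-based detector illustrating why the two properties are in tension (the timeout value and API are invented for this sketch): a short timeout detects real crashes quickly, helping completeness, but may falsely suspect slow-but-alive processes, hurting accuracy.

```python
import time


class HeartbeatDetector:
    """Toy failure detector: suspect a process if no heartbeat arrives within
    `timeout` seconds. Illustrative only; a slow process can be falsely
    suspected, trading accuracy for completeness."""

    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_heartbeat = {}          # pid -> time of last heartbeat

    def heartbeat(self, pid):
        self.last_heartbeat[pid] = time.monotonic()

    def suspected(self, pid):
        last = self.last_heartbeat.get(pid)
        return last is None or time.monotonic() - last > self.timeout
```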
Slide30
Byzantine Faults
Arbitrary messages and transitions
Causes: e.g., software bugs, malicious attacks
Byzantine Agreement Problem: “Can a set of concurrent processes achieve coordination in spite of the faulty behavior of some of them?”
The concurrent processes could be replicas in a distributed system
Slide31
Practical Byzantine Fault Tolerance (PBFT)
A replication algorithm that is able to tolerate Byzantine faults
Useful for software faults
Why “Practical”?
-> because it can be used in an asynchronous environment such as the Internet
Important assumptions:
At most f nodes can be faulty (out of n ≥ 3f + 1 replicas)
All replicas start in the same state
Failures are independent – practical?
Slide32
PBFT Cont…
(Message flow between client C and replicas R1–R4)
request -> pre-prepare -> prepare -> commit -> reply
C: client
R1: primary replica
The client blocks and waits for f+1 replies
A request is prepared after accepting 2f prepares
Execution happens after 2f+1 commits
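A small sketch of the quorum arithmetic behind those thresholds, assuming the usual PBFT replica count n = 3f + 1 (that relation comes from the published PBFT paper, not from the slide text):

```python
def pbft_quorums(f):
    """Quorum sizes used in the PBFT message flow above, under the assumption
    n = 3f + 1 (taken from the PBFT paper rather than the slides)."""
    n = 3 * f + 1
    return {
        "replicas": n,
        "client_reply_quorum": f + 1,     # matching replies the client waits for
        "prepare_quorum": 2 * f,          # prepares accepted before 'prepared'
        "commit_quorum": 2 * f + 1,       # commits needed before execution
    }


print(pbft_quorums(1))   # {'replicas': 4, 'client_reply_quorum': 2, 'prepare_quorum': 2, 'commit_quorum': 3}
```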
Slide33
PBFT Cont…
The algorithm provides:
-> Safety: by guaranteeing linearizability; the pre-prepare and prepare phases ensure a total order on messages
-> Liveness: by providing for a view change when the primary replica fails; here, some synchrony is assumed
How do we know the value of f a priori?
Slide34
Google File System
Revisited traditional file system design
1. Component failures are the norm
2. Multi-GB files are common
3. Files are mutated by appending new data
4. Relaxed consistency model
Slide35
GFS Architecture
(Architecture diagram: clients, a single master, and chunkservers)
Leader election / replication among chunk replicas
The master maintains metadata: the namespace, chunk metadata, etc.
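To make the split between metadata and data concrete, here is an illustrative read path (not the real GFS API; all names are invented): the client contacts the master only for chunk locations and then reads the data directly from a chunkserver.

```python
CHUNK_SIZE = 64 * 2**20      # 64 MB chunks, as in the GFS paper


class ToyMaster:
    """Stand-in for the GFS master: holds only metadata, never file data."""

    def __init__(self, chunk_table):
        # (filename, chunk index) -> (chunk handle, [chunkserver names])
        self.chunk_table = chunk_table

    def lookup(self, filename, offset):
        return self.chunk_table[(filename, offset // CHUNK_SIZE)]


class ToyChunkserver:
    """Stand-in chunkserver keeping chunk contents in memory."""

    def __init__(self, chunks):
        self.chunks = chunks                      # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


def gfs_read(master, chunkservers, filename, offset, length):
    """One metadata RPC to the master, then the data comes from a replica."""
    handle, replicas = master.lookup(filename, offset)
    return chunkservers[replicas[0]].read(handle, offset % CHUNK_SIZE, length)


# Tiny usage example with made-up file and server names.
cs = {"cs1": ToyChunkserver({"h1": b"hello gfs"})}
m = ToyMaster({("/logs/a", 0): ("h1", ["cs1"])})
print(gfs_read(m, cs, "/logs/a", 0, 5))          # b'hello'
```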
Slide36
GFS – Relaxed Consistency (diagram)
Slide37
GFS – Design Issues
Single master
Rationale: keep things simple
Problems:
Increasing volume of underlying storage -> increase in metadata
The master could not keep up with the growing number of clients -> the master became a bottleneck
Current: multiple masters per data center
Ref: http://queue.acm.org/detail.cfm?id=1594206
Slide38
GFS – Design Issues
Replication of chunks
Replication across racks – the default replica count is 3
Allowing concurrent changes to the same file
-> In retrospect, they would rather have had a single writer
The primary replica serializes mutations to chunks
- They do not use any consensus protocol before applying mutations to the chunks
Ref: http://queue.acm.org/detail.cfm?id=1594206
Slide39
Thank You