Networks and Distributed Snapshot
Sukumar Ghosh
Department of Computer Science
University of Iowa
Contents
Part 1. The evolution of network topologies
Part 2. Distributed snapshot
Part 3. Tolerating failures
Random Graphs
How a connected topology evolves in the real world:
-- Erdös-Rényi graphs (ER graphs)
-- Power-law graphs
-- Small-world graphs
Random graphs: Erdös-Rényi model
The ER model is one of several models of random graphs. It presents a theory of how social webs are formed.
-- Start with a set of n isolated nodes.
-- Connect each pair of nodes with a probability p.
The resulting graph is known as G(n, p).
Erdös-Rényi model
The G(n, p) model is different from the G(n, M) model. The G(n, M) model randomly selects one graph from the entire family of graphs with n nodes and M edges.
Properties of ER graphs
Property 1. The expected number of edges is p · n(n-1)/2.
Property 2. The expected degree per node is p · (n-1).
Property 3. The expected diameter of G(n, p) is log n / log deg [deg = expected degree of a node].
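Properties 1 and 2 can be checked empirically. Below is a minimal Python sketch (the function name `gnp` and the parameter values are illustrative, not from the slides) that samples a G(n, p) graph and compares the observed edge count against the expectation:

```python
import random

def gnp(n, p, seed=0):
    """Sample an Erdös-Rényi graph G(n, p): link each pair with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

n, p = 200, 0.1
edges = gnp(n, p)
expected_edges = p * n * (n - 1) / 2    # Property 1: 1990.0
expected_degree = p * (n - 1)           # Property 2: 19.9
print(len(edges), expected_edges, expected_degree)
```

The sampled edge count concentrates tightly around p · n(n-1)/2 for graphs of this size.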
Diameter of a network
Let d(u, v) denote the distance of the shortest path between a pair of nodes u and v. For all such pairs of nodes, the largest value of d(u, v) is known as the diameter of the network.
Degree distribution in random graphs
The probability that a node connects with a given set of k nodes (and not to the remaining n-1-k nodes) is p^k (1-p)^(n-1-k). One can choose k out of the remaining n-1 nodes in C(n-1, k) ways. So the probability that a node has degree k is
P(k) = C(n-1, k) p^k (1-p)^(n-1-k) (binomial distribution)
Degree distribution in random graphs
N(k) = number of nodes with degree k
Properties of ER graphs
-- When p < 1/n, an ER graph is a collection of disjoint trees.
-- When p = 1/n, suddenly one giant (connected) component emerges. The other components have a much smaller size. [Phase change]
Properties of ER graphs
When p > ln(n)/n, the graph is almost always connected.
These give "ideas" about how a network can evolve. But not all random topologies are ER graphs! For example, social networks are often "clustered", but ER graphs have a poor (i.e. very low) clustering coefficient. (What is a clustering coefficient?)
Clustering coefficient
For a given node, its local clustering coefficient (CC) measures what fraction of its various pairs of neighbors are neighbors of each other.
CC(B) = 3/6 = 1/2, CC(D) = 2/3 = CC(E)
B's neighbors are {A, C, D, E}. Only the pairs (A, D), (D, E), (E, C) are connected.
The CC of a graph is the mean of the CCs of its various nodes.
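The example above can be checked with a short Python sketch; the graph below encodes exactly the adjacency described on this slide:

```python
from itertools import combinations

# The slide's example graph: B's neighbors are {A, C, D, E};
# among them only the pairs (A, D), (D, E), (C, E) are connected.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("B", "E"),
         ("A", "D"), ("D", "E"), ("C", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def local_cc(node):
    """Fraction of the node's neighbor pairs that are themselves linked."""
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

print(local_cc("B"))                 # 3/6 = 0.5
print(local_cc("D"), local_cc("E"))  # 2/3 each
```

Averaging `local_cc` over all nodes gives the clustering coefficient of the whole graph.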
The connectors
Malcolm Gladwell, a staff writer at the New Yorker magazine, describes in his book The Tipping Point a simple experiment to measure how social a person is.
He started with a list of 248 last names. A person scores a point if he or she knows someone with a last name from this list. If he/she knows three persons with the same last name, then he/she scores 3 points.
The connectors
(Outcome of the Tipping Point experiment) Altogether 400 persons from different groups were tested. It was found that
(min) 9, (max) 118 {from a random sample}
(min) 16, (max) 108 {from a highly homogeneous group}
(min) 2, (max) 95 {from a college class}
[Conclusion: some people are very social, even in small or homogeneous samples. They are the connectors.]
Connectors
Barabási observed that connectors are not unique to human society; they appear in many complex networks ranging from biology to computer science, where some nodes have an anomalously large number of links. This was not quite expected in ER graphs.
The World Wide Web, the ultimate forum of democracy, is not a random network, as Barabási's web-mapping project revealed.
Anatomy of the World Wide Web
Barabási experimented with the Univ. of Notre Dame's web:
-- 325,000 pages
-- 270,000 pages (i.e. 82%) had three or fewer links
-- 42 pages had 1000+ incoming links each
The entire WWW exhibited even more disparity: 90% of the pages had ≤ 10 links, whereas a few (4-5), like Yahoo, were referenced by close to a million pages! These are the hubs of the web. They help create short paths between nodes (mean distance = 19 for the WWW, obtained via extrapolation; some dispute this figure now).
Power law graph
The degree distribution of the web pages in the World Wide Web follows a power law. In a power-law graph, the number of nodes N(k) with degree k satisfies the condition N(k) ∝ k^(-γ). Such a graph is also known as a scale-free graph. Other examples are
-- income and the number of people with that income
-- magnitude and the number of earthquakes of that magnitude
-- population and the number of cities with that population
Random vs. Power-law Graphs
The degree distribution of the web pages in the World Wide Web follows a power law.
Random vs. Power-Law networks
Example: Airline Routes
Think of how new routes are added to an existing network.
Preferential attachment
A new node connects with an existing node with a probability proportional to its degree.
[Figure: a new node joining an existing network; the sum of the node degrees = 8]
Also known as the "rich gets richer" policy.
Preferential attachment (continued)
Barabási and Albert showed that when large networks are formed via preferential attachment, the resulting graph exhibits a power-law distribution of the node degrees.
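A minimal Python sketch of degree-proportional attachment (an illustrative simplification of the Barabási-Albert process: every name and parameter below is chosen for the example, not taken from the slides):

```python
import random

def preferential_attachment(n, m=2, seed=42):
    """Grow a graph node by node: each new node attaches m links, picking
    targets with probability proportional to their current degree."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}      # start with one edge between nodes 0 and 1
    pool = [0, 1]              # each node appears once per unit of degree
    for new in range(2, n):
        chosen = set()
        while len(chosen) < min(m, new):
            chosen.add(rng.choice(pool))   # degree-proportional pick
        degree[new] = 0
        for t in chosen:
            degree[t] += 1
            degree[new] += 1
            pool += [t, new]               # keep the pool degree-weighted
    return degree

deg = preferential_attachment(2000)
# early nodes tend to become hubs ("rich gets richer")
print(max(deg.values()), min(deg.values()))
```

Plotting a histogram of `deg.values()` on log-log axes shows the heavy tail that the slide describes.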
Other properties of power law graphs
Graphs following a power-law distribution have a small diameter, of the order of log n (n = number of nodes). The clustering coefficient decreases as the node degree increases (as a power law).
Graphs following a power-law distribution tend to be highly resilient to random edge removal, but quite vulnerable to targeted attacks on the hubs.
The small-world model
Due to Watts and Strogatz (1998). They followed up on Milgram's work (on six degrees of separation) and reasoned about why there is a small degree of separation between individuals in a social network.
The research was originally inspired by Watts' efforts to understand the synchronization of cricket chirps, which show a high degree of coordination over long ranges, as though the insects were being guided by an invisible conductor.
Disease spreads faster over a small-world network.
Questions not answered by Milgram
Milgram's experiment tried to validate the theory of six degrees of separation between any two individuals on the planet.
Why six degrees of separation? Any scientific reason? What properties do these social graphs have? Are there other situations in which this model is applicable?
Time to reverse-engineer this.
What are small-world graphs
Small-world graphs sit between the completely regular and the completely random graphs (n >> k > ln(n) > 1), where n = number of nodes and k = number of neighbors of each node.
Completely regular
A ring lattice: each node is linked to its k nearest neighbors. If n >> k, then the diameter is roughly n/k. The diameter is too large!
Completely random
The diameter is small, but the clustering coefficient is small too!
Small-world graphs
Start with the regular graph and, with probability p, rewire each link to a randomly selected node. The result is a graph that has a high clustering coefficient but a low diameter …
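The rewiring step can be sketched in a few lines of Python (illustrative names and parameters; the original Watts-Strogatz procedure likewise rewires one endpoint of each lattice edge):

```python
import random

def watts_strogatz(n, k, p, seed=1):
    """Ring lattice: n nodes, each linked to its k nearest neighbors;
    then each edge (i, j) with j > i is rewired with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):                       # build the regular ring lattice
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    for i in range(n):                       # rewiring pass
        for j in list(adj[i]):
            if j > i and rng.random() < p:
                new = rng.randrange(n)
                if new != i and new not in adj[i]:  # no self-loops/duplicates
                    adj[i].discard(j); adj[j].discard(i)
                    adj[i].add(new); adj[new].add(i)
    return adj

g = watts_strogatz(100, 4, 0.1)
# rewiring replaces edges one-for-one, so the edge count n*k/2 is preserved
print(sum(len(s) for s in g.values()) // 2)  # 200
```

For small p, measuring the graph confirms the slide's claim: the clustering coefficient stays close to that of the lattice while the diameter drops sharply.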
Small-world graphs
The small-world properties hold.
Limitation of the Watts-Strogatz model
Jon Kleinberg argued that the Watts-Strogatz small-world model illustrates the existence of short paths between pairs of nodes, but it does not give any clue about how those short paths will be discovered. A greedy search for the destination will not lead to the discovery of these short paths.
Kleinberg's Small-World Model
Consider an n × n grid. Each node has a link to every node at lattice distance ≤ p (short-range neighbors) and q long-range links. Choose a long-range link to a node at lattice distance d with a probability proportional to d^(-r).
[Figure: an n × n grid with r = 2, p = 1, q = 2]
Results
Theorem 1. There is a constant α0 (depending on p and q but independent of n), such that when r = 0, the expected delivery time of any decentralized algorithm is at least α0 · n^(2/3).
More results
Theorem 2. There is a decentralized algorithm A and a constant α2 (dependent on p and q but independent of n), such that when r = 2 and p = q = 1, the expected delivery time of A is at most α2 · (log n)^2.
Variation of search time with r
[Figure: log T plotted against the exponent r]
Distributed Snapshot
Think about these
How many messages are in transit on the internet?
What is the total cash reserve in the Bank of America?
How many cars are on the streets of Kolkata now? How much pollutant is there in the air (or the water) now?
What are most people in the US thinking about the election?
How do we compute these?
UAV surveillance of traffic
Importance of snapshots
Major uses in
-- data collection
-- surveillance
-- deadlock detection
-- termination detection
-- rollback recovery
-- global predicate computation
Importance of snapshots
A snapshot may consist of the internal states of the recording processes, or it may consist of the state of external shared objects updated by an updater process.
Distributed Snapshot: First Case
Assume that the snapshot consists of the internal states of the recording processes. The main issue is synchronization. An ad hoc combination of the local snapshots will not lead to a meaningful distributed snapshot.
One-dollar bank
Let a $1 coin circulate in a network of a million banks. How can someone count the total $ in circulation? If it is not counted "properly," then one may think the total $ in circulation is one million.
Review of Causal Ordering
Causality helps identify sequential and concurrent events in distributed systems, since clocks are not always reliable.
1. Local ordering: a → b → c (based on the local clock)
2. Message sent → message received [thus joke → Re: joke]
3. If a → b and b → c then a → c
(→ denotes the "causally ordered before" or "happened before" relation)
Consistent cut
A cut is a set of events. A cut C is consistent if, for every event e in C, every event e' with e' → e is also in C. If this is not true, then the cut C is inconsistent.
[Figure: events on process timelines, with time increasing to the right; a cut line separates the recorded past from the future]
Consistent snapshot
The set of states immediately following the events (actions) in a consistent cut forms a consistent snapshot of a distributed system.
A snapshot that is of practical interest is the most recent one. Let C1 and C2 be two consistent cuts with C1 ⊆ C2. Then C2 is more recent than C1.
Analyze why certain cuts in the one-dollar bank are inconsistent.
Consistent snapshot
How to record a consistent snapshot? Note that
1. The recording must be non-invasive.
2. The recording must be done on-the-fly. You cannot stop the system.
Chandy-Lamport Algorithm
Works on a graph that is (1) strongly connected, where (2) each channel is FIFO. An initiator initiates the algorithm by sending out a marker.
White and red processes
Initially every process is white. When a process receives a marker, it turns red and remains red.
Every action by a process, and every message sent by a process, gets the color of that process. So,
white action = action by a white process
red action = action by a red process
white message = message sent by a white process
red message = message sent by a red process
Two steps
Step 1. In one atomic action, the initiator (a) turns red, (b) records its own state, and (c) sends a marker along all outgoing channels.
Step 2. Every other process, upon receiving a marker for the first time (and before doing anything else), (a) turns red, (b) records its own state, and (c) sends markers along all outgoing channels.
The algorithm terminates when (1) every process has turned red, and (2) every process has received a marker through each incoming channel.
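The two steps can be exercised in a toy simulation. The sketch below (a 3-process ring with FIFO channels; all names are illustrative, not part of the original algorithm description) records process states plus each incoming channel, and counts exactly one circulating token even though that token is in flight when the snapshot starts:

```python
from collections import deque

MARKER = "M"

def chandy_lamport_ring():
    """3 processes in a ring 0 -> 1 -> 2 -> 0 with FIFO channels and one
    token. Process 2 sends the token just before process 0 initiates, so
    the token is in flight and must be captured as channel state."""
    succ = {0: 1, 1: 2, 2: 0}
    pred = {0: 2, 1: 0, 2: 1}
    chan = {(p, succ[p]): deque() for p in range(3)}
    tokens = {0: 0, 1: 0, 2: 1}             # local states
    red = {p: False for p in range(3)}
    state = {}                              # recorded process states
    chan_state = {p: [] for p in range(3)}  # recorded state of (pred[p], p)
    done = {p: False for p in range(3)}     # marker seen on (pred[p], p)?

    def start_snapshot(p):                  # turn red, record, send marker
        red[p] = True
        state[p] = tokens[p]
        chan[(p, succ[p])].append(MARKER)

    tokens[2] -= 1
    chan[(2, 0)].append("TOKEN")            # token leaves process 2 ...
    start_snapshot(0)                       # ... then process 0 initiates

    while any(chan.values()):               # deliver until channels drain
        for p in range(3):
            q = chan[(pred[p], p)]
            if not q:
                continue
            msg = q.popleft()
            if msg == MARKER:
                if not red[p]:
                    start_snapshot(p)
                done[p] = True              # stop recording this channel
            else:
                if red[p] and not done[p]:
                    chan_state[p].append(msg)  # in flight at snapshot time
                tokens[p] += 1

    return sum(state.values()) + sum(len(v) for v in chan_state.values())

print(chandy_lamport_ring())  # 1
```

All three recorded process states are 0, yet the snapshot still totals one token, because the initiator records the in-flight token as part of the state of its incoming channel.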
Why does it work?
Lemma 1. No red message is received in a white action.
Why does it work?
Theorem. The global state recorded by the Chandy-Lamport algorithm is equivalent to the ideal snapshot state SSS.
Hint. A pair of actions (a, b) can be scheduled in any order if there is no causal order between them, so (a; b) is equivalent to (b; a).
Easy conceptualization of the snapshot state SSS: all white actions, followed by all red actions.
Why does it work?
Let an observer observe the following actions:
w[i] w[k] r[k] w[j] r[i] w[l] r[j] r[l] …
≡ w[i] w[k] w[j] r[k] r[i] w[l] r[j] r[l] … [Lemma 1]
≡ w[i] w[k] w[j] r[k] w[l] r[i] r[j] r[l] … [Lemma 1]
≡ w[i] w[k] w[j] w[l] r[k] r[i] r[j] r[l] … [done!]
The recorded state lies at the boundary between the white and the red actions.
Example 1: Count the tokens
Let us verify that the Chandy-Lamport snapshot algorithm correctly counts the tokens circulating in the system.
[Figure: three processes A, B, C passing a token, shown with two cuts labeled 1 and 2; at each configuration every process either holds a token or no token]
How to account for the channel states? Compute this using the sent and received variables for each process. Are these consistent cuts?
Example 2: Communicating State Machines
Something unusual
Let machine i start the Chandy-Lamport snapshot before it has sent M along ch1. Also, let machine j receive the marker after it sends out M' along ch2. Observe that the snapshot state is SSS = (down, ∅, up, M').
Doesn't this appear strange? This state was never reached during the computation!
Understanding snapshot
The observed state is a feasible state that is reachable from the initial configuration. It may not actually be visited during a specific execution. The final state of the original computation is always reachable from the observed state.
Discussions
What good is a snapshot if that state has never been visited by the system?
-- It is relevant for the detection of stable predicates.
-- It is useful for checkpointing.
Discussions
What if the channels are not FIFO? Study how the Lai-Yang algorithm works. It does not use any markers.
LY1. The initiator records its own state. When it needs to send a message m to another process, it sends the message (m, red).
LY2. When a process receives a message (m, red), it records its state if it has not already done so, and then accepts the message m.
Question 1. Why will it work?
Question 2. Are there any limitations of this approach?
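A toy Python sketch of the Lai-Yang idea (illustrative and simplified: here every process records at initiation rather than on its first red message, and a white message received by an already-red process is counted as channel state; all names are invented for the example):

```python
def lai_yang_demo():
    """3 processes, non-FIFO channels, one token. Process 0 sends the
    token while still white; the snapshot then starts everywhere."""
    chan = {p: [] for p in range(3)}     # incoming message bags (no FIFO)
    tokens = {0: 1, 1: 0, 2: 0}
    color = {p: "white" for p in range(3)}
    recorded, in_flight = {}, []

    def record(p):                       # record once, then turn red
        if p not in recorded:
            recorded[p] = tokens[p]
            color[p] = "red"

    tokens[0] -= 1
    chan[1].append(("TOKEN", color[0]))  # a white message leaves process 0
    for p in range(3):
        record(p)                        # simplified: all record at once

    while any(chan.values()):
        p = next(q for q in range(3) if chan[q])
        msg, c = chan[p].pop()
        if c == "red":
            record(p)                    # LY2: record before accepting red
        elif color[p] == "red":
            in_flight.append(msg)        # white msg crossing the snapshot
        tokens[p] += 1

    return sum(recorded.values()) + len(in_flight)

print(lai_yang_demo())  # 1
```

The white token crosses the snapshot boundary, so it shows up in `in_flight` rather than in any recorded process state; without that channel-state bookkeeping the count would be 0.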
Food for thought
Distributed snapshot = distributed read.
Distributed reset = distributed write.
How difficult is distributed reset?
Distributed debugging
(Marzullo and Neiger, 1991)
[Figure: an observer receiving each event e with its vector clock VC(e) from the distributed system]
Distributed debugging
Uses vector clocks. S_ij is a global state after the i-th action by process 0 and the j-th action by process 1.
Distributed debugging
Possibly φ: at least one consistent global state S is reachable from the initial global state such that φ(S) = true.
Definitely φ: all computations pass through some consistent global state S such that φ(S) = true.
Never φ: no computation passes through any consistent global state S such that φ(S) = true.
Definitely φ ⇒ Possibly φ
Examples
φ = (x + y = 12) is true at S21: Possibly φ
φ = (x + y > 15) is true at S31: Definitely φ
φ = (x = y = 5) is true at S40 and S22: Never φ [*neither S40 nor S22 is a consistent state*]
Distributed Snapshot: Second case
The snapshot consists of the external observations of the recording processes -- distributed snapshots of shared external objects.
How many cars are on the streets now? How many trees have been downed by the storm?
Distributed snapshot of shared objects
The first algorithm
[Figure: a shared array with locations 0, 1, 2, …, i]
Algorithm double collect
function read
    while true
        X[0..n-1] := collect;
        Y[0..n-1] := collect;
        if ∀i ∈ {0, .., n-1}: location i was not changed between the two collects
            then return Y;
end
function update(i, v)
    M[i] := v;
end
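A minimal Python sketch of double collect (illustrative: a version counter is attached to each location so that "location i was not changed" can be detected reliably, which also sidesteps the ABA problem of comparing values alone):

```python
class DoubleCollect:
    """Shared array read via two collects; a read succeeds only when
    no location changed between them."""
    def __init__(self, n):
        self.mem = [(0, 0)] * n      # (value, version) per location

    def update(self, i, v):
        val, ver = self.mem[i]
        self.mem[i] = (v, ver + 1)   # bump version on every write

    def collect(self):
        return list(self.mem)        # one atomic-per-location pass

    def read(self):
        while True:
            x = self.collect()
            y = self.collect()
            if x == y:               # no version moved between collects
                return [val for val, _ in y]

dc = DoubleCollect(3)
dc.update(1, 7)
print(dc.read())  # [0, 7, 0]
```

With no concurrent writer the first pair of collects already agrees; a writer active between the two collects forces another iteration, which is exactly the non-termination risk discussed next.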
Limitations of double collect
Read may never terminate! Why? (A writer can keep changing some location between every pair of collects.) We need a better algorithm that guarantees termination.
Coordinated snapshot
Engage multiple readers and ask them to record snapshots at the same time. It will work if the writer is sluggish and the clocks are accurately synchronized.
Faulty recorder
Assume that there are n recorders. Each records a snapshot and shares it with the others, so that each can form a complete snapshot.
This is easy when all recorders record correctly and transmit the information reliably. But what if one or more recorders are faulty, or the communication is error-prone?
Distributed Consensus
Consensus is very important for taking coordinated action. How can the recorders reach consensus in the presence of communication failure? The problem reduces to the classic Byzantine Generals Problem.
Byzantine Generals Problem
Describes and solves the consensus problem on the synchronous model of communication. The network topology is a completely connected graph. Processes undergo Byzantine failures, the worst possible kind of failure. Shows the power of the adversary.
Byzantine Generals Problem
n generals {0, 1, 2, ..., n-1} decide whether to "attack" or to "retreat" during a particular phase of a war. The goal is to agree upon the same plan of action.
Some generals may be "traitors" and therefore send either no input, or conflicting inputs, to prevent the "loyal" generals from reaching an agreement.
Devise a strategy by which every loyal general eventually agrees upon the same plan, regardless of the actions of the traitors.
Byzantine Generals
Every general broadcasts his/her judgment to everyone else. These are the inputs to the consensus protocol. The traitor may send conflicting input values.
[Figure: four generals 0-3; two vote Attack = 1, two vote Retreat = 0, and general 3 is a traitor. The loyal generals collect {1, 1, 0, 0}, but because of the traitor's conflicting inputs one of them may instead collect {1, 1, 0, 1}.]
Byzantine Generals
We need to devise a protocol so that every peer (call it a lieutenant) receives the same value from any given general (call it a commander). Clearly, the lieutenants will have to use secondary information.
Note that the roles of the commander and the lieutenants rotate among the generals.
Interactive consistency specifications
IC1. Every loyal lieutenant receives the same order from the commander.
IC2. If the commander is loyal, then every loyal lieutenant receives the order that the commander sends.
[Figure: a commander sending an order to its lieutenants]
The Communication Model
Oral Messages:
1. Messages are not corrupted in transit. (Why? If a message were altered, the receiver could blame the sender.)
2. Messages can be lost, but the absence of a message can be detected.
3. When a message is received (or its absence is detected), the receiver knows the identity of the sender (or of the defaulter).
OM(m) represents an interactive consistency protocol in the presence of at most m traitors.
An Impossibility Result
Using oral messages, no solution to the Byzantine Generals problem exists with three or fewer generals and one traitor. Consider the two cases: in case (a), to satisfy IC2, lieutenant 1 must trust the commander; but in case (b), the same idea leads to the violation of IC1.
Impossibility result (continued)
Using oral messages, no solution to the Byzantine Generals problem exists with 3m or fewer generals and m traitors (m > 0).
The proof is by contradiction. Assume that such a solution exists. Now divide the 3m generals into three groups of m generals each, such that all the traitors belong to one group. Let one general simulate each of these three groups. This scenario is equivalent to the case of three generals and one traitor. We already know that such a solution does not exist.
The OM(m) algorithm
A recursive algorithm: OM(m) invokes OM(m-1), which invokes OM(m-2), and so on down to OM(0).
OM(m) = the consensus algorithm with oral messages in the presence of up to m traitors.
OM(0) = direct broadcast.
The OM(m) algorithm
1. Commander i sends out a value v (0 or 1).
2. If m > 0, then every lieutenant j ≠ i, after receiving v, acts as a commander and initiates OM(m-1) with everyone except i.
3. Every lieutenant collects (n-1) values: the (n-2) values received from the other lieutenants using OM(m-1), and one value received directly from the commander. Then he picks the majority of these values as the order from i.
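The recursion can be sketched compactly in Python (illustrative: the traitor here deterministically flips the bit, whereas a real traitor may send arbitrary, per-receiver values; all names are invented for the example):

```python
from collections import Counter

def send(general, value, traitors):
    """What `general` transmits: loyal generals relay faithfully;
    the traitor flips the bit (a worst case would vary per receiver)."""
    return 1 - value if general in traitors else value

def om(m, commander, value, lieutenants, traitors):
    """Returns {lieutenant: decided value}. OM(0) is direct broadcast;
    in OM(m) each lieutenant re-broadcasts what it heard via OM(m-1),
    then takes the majority of the direct and relayed values."""
    if m == 0:
        return {l: send(commander, value, traitors) for l in lieutenants}
    received = {l: send(commander, value, traitors) for l in lieutenants}
    relayed = {l: om(m - 1, l, received[l],
                     [x for x in lieutenants if x != l], traitors)
               for l in lieutenants}
    return {l: Counter([received[l]] +
                       [relayed[src][l] for src in lieutenants if src != l]
                       ).most_common(1)[0][0]
            for l in lieutenants}

# n = 4 generals, m = 1 traitor (general 3); loyal commander 0 orders 1
result = om(1, 0, 1, [1, 2, 3], traitors={3})
print(result[1], result[2])  # the loyal lieutenants both decide 1
```

Lieutenant 1, for example, collects {1 (direct), 1 (via 2), 0 (via the traitor)} and the majority vote discards the traitor's lie, so IC1 and IC2 both hold, as the following lemma and theorem establish in general.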
Example of OM(1)
Example of OM(2)
[Figure: OM(2) invokes OM(1), which invokes OM(0)]
Proof of OM(m)
Lemma. Let the commander be loyal, and let n > 2m + k, where m = the maximum number of traitors. Then OM(k) satisfies IC2.
Proof of OM(m)
Proof. If k = 0, then the result trivially holds. Let it hold for k = r (r > 0), i.e. OM(r) satisfies IC2. We have to show that it holds for k = r + 1 too.
By definition n > 2m + r + 1, so n - 1 > 2m + r. So OM(r) holds for the lieutenants in the bottom row. Each loyal lieutenant collects n - m - 1 identical good values and m bad values. So the bad values are voted out (n - m - 1 > m + r implies n - m - 1 > m).
["OM(r) holds" means that each loyal lieutenant receives identical values from every loyal commander.]
The final theorem
Theorem. If n > 3m, where m is the maximum number of traitors, then OM(m) satisfies both IC1 and IC2.
Proof. Consider two cases:
Case 1. The commander is loyal. The theorem follows from the previous lemma (substitute k = m).
Case 2. The commander is a traitor. We prove it by induction.
Base case. m = 0: trivial.
(Induction hypothesis) Let the theorem hold for m = r.
(Inductive step) We have to show that it holds for m = r + 1 too.
Proof (continued)
There are n > 3(r + 1) generals and r + 1 traitors. Excluding the commander, there are > 3r + 2 generals, of which r are traitors. So > 2r + 2 lieutenants are loyal. Since 3r + 2 > 3r, OM(r) satisfies IC1 and IC2.
Proof (continued)
In OM(r+1), a loyal lieutenant chooses the majority from
(1) the > 2r + 1 values obtained from the other loyal lieutenants via OM(r),
(2) the r values from the traitors, and
(3) the value received directly from the commander.
The set of values collected in parts (1) & (3) is the same for all loyal lieutenants – it is the same set of values that these lieutenants received from the commander. Also, by the induction hypothesis, in part (2) each loyal lieutenant receives identical values from each traitor. So every loyal lieutenant eventually collects the same set of values.
Conclusion
Distributed snapshots of shared objects can be tricky when the writer does not cooperate.
Approximate snapshots are useful for a rough view.
Failures add a new twist to the recording of snapshots.
Much work remains to be done for the upper layers of snapshot integration. (What can you make out from a trail of Twitter data with not much correlation?)