Data-Centric Reconfiguration with Network-Attached Disks
Alex Shraer (Technion)
Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR) and I. Keidar (Technion)
Preview
The setting: data-centric replicated storage
Simple network-attached storage nodes
Our contributions:
First distributed reconfigurable R/W storage
Asynchronous vs. consensus-based reconfiguration
Allows storage nodes to be added and removed dynamically
Enterprise Storage Systems
Highly reliable, customized hardware
Controllers and I/O ports may become a bottleneck
Expensive
Usually not extensible
Different solutions for different scale
Example (HP): high end – XP (1152 disks), mid range – EVA (324 disks)
Alternative – Distributed Storage
Made up of many storage nodes
Unreliable, cheap hardware
Failures are the norm, not the exception
Challenges:
Achieving reliability and consistency
Supporting reconfigurations
Distributed Storage Architecture
Unpredictable network delays (asynchrony)
[Figure: dynamic, fault-prone storage clients issue reads and writes over a LAN/WAN to fault-prone storage nodes in cloud storage]
A Case for Data-Centric Replication
Client-side code runs replication logic
Communicates with multiple storage nodes
Simple storage nodes (servers)
Can be network-attached disks, not necessarily PCs with disks
Do not run application-specific code
Less fault-prone components
Simply respond to client requests
High throughput
Do not communicate with each other
If storage nodes communicate, their failures are likely to be correlated!
Oblivious to where other replicas of each object are stored
Scalable: the same storage node can be used for many replication sets
[Figure: a "not-so-thin" client talks directly to "thin" storage nodes]
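To make this division of labor concrete, here is a minimal, single-process Go sketch of ABD-style client-side replication (the slide on the coordination module below notes that DynaDisk updates R/W objects similarly to ABD). All names (StorageNode, TimestampedValue, etc.) are illustrative assumptions, not DynaDisk's code, and the plain loops stand in for contacting a majority of nodes in parallel over a network.

```go
package main

import "fmt"

// TimestampedValue pairs a value with a logical timestamp,
// as in ABD-style replication.
type TimestampedValue struct {
	Ts  int
	Val string
}

// StorageNode models a simple storage node: it only answers
// read/write requests and runs no replication logic itself.
type StorageNode struct {
	stored map[string]TimestampedValue
}

func (n *StorageNode) Write(key string, tv TimestampedValue) {
	if cur, ok := n.stored[key]; !ok || tv.Ts > cur.Ts {
		n.stored[key] = tv // keep only the newest timestamp
	}
}

func (n *StorageNode) Read(key string) TimestampedValue {
	return n.stored[key]
}

// Client holds all the replication logic and talks to the nodes directly.
type Client struct{ nodes []*StorageNode }

func (c *Client) Write(key, val string) {
	// Phase 1: collect timestamps (from a majority, in reality).
	maxTs := 0
	for _, n := range c.nodes {
		if tv := n.Read(key); tv.Ts > maxTs {
			maxTs = tv.Ts
		}
	}
	// Phase 2: store the value with a higher timestamp at a majority.
	for _, n := range c.nodes {
		n.Write(key, TimestampedValue{maxTs + 1, val})
	}
}

func (c *Client) Read(key string) string {
	// Collect from a majority and pick the newest value.
	var newest TimestampedValue
	for _, n := range c.nodes {
		if tv := n.Read(key); tv.Ts > newest.Ts {
			newest = tv
		}
	}
	// Write back the newest value so no later read returns an older one.
	for _, n := range c.nodes {
		n.Write(key, newest)
	}
	return newest.Val
}

func main() {
	nodes := make([]*StorageNode, 3)
	for i := range nodes {
		nodes[i] = &StorageNode{stored: map[string]TimestampedValue{}}
	}
	c := &Client{nodes}
	c.Write("x", "Spain")
	fmt.Println(c.Read("x")) // prints "Spain"
}
```

The two-phase write (read timestamps, then write a higher one) and the read-side write-back are what keep the register atomic while the nodes stay oblivious to each other.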
Real Systems Are Dynamic
The challenge: maintain consistency, reliability, availability
[Figure: storage nodes A–E over a LAN/WAN; one client invokes reconfig{–A, –B} while another invokes reconfig{–C, +F,…,+I}, adding new nodes F–I]
Pitfall of Naïve Reconfiguration
[Figure: nodes A–D each initially store the configuration {A, B, C, D}. One client issues reconfig{+E} and another issues reconfig{–D}; with several messages delayed, some replicas end up with {A, B, C, D, E} while others end up with {A, B, C}]
Pitfall of Naïve Reconfiguration (continued)
[Figure: with the replicas split between {A, B, C, D, E} and {A, B, C}, a writer stores x = ("Spain", timestamp 2) at a majority of {A, B, C, D, E}, while the other replicas still hold x = ("Italy", timestamp 1). A reader using configuration {A, B, C} contacts a majority that missed the write and returns "Italy". Split brain: the two configurations' majorities need not intersect!]
Reconfiguration Option 1: Centralized
Can be automatic, e.g., Ursa Minor [Abd-El-Malek et al., FAST 05]
Downtime: most solutions stop R/W while reconfiguring
Single point of failure: what if the manager crashes while changing the system?
("Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am. Virtually yours, Moshe Barak")
Reconfiguration Option 2: Distributed Agreement
Servers agree on the next configuration
Previous solutions: not data-centric
No downtime
In theory, might never terminate [FLP85]
In practice, we have partial synchrony, so it usually works
Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, Shraer, PODC 2009]
Distributed & completely asynchronous
No downtime
Always terminates
Not data-centric
In this work: DynaDisk – dynamic data-centric R/W storage
First distributed data-centric solution
No downtime
Tunable reconfiguration method
Modular design: coordination is separate from data
Allows easily setting and comparing the coordination method
Consensus-based vs. asynchronous reconfiguration
Many shared objects
Running a protocol instance per object is too costly
Transferring all state at once might be infeasible
Our solution: incremental state transfer
Built with an external (weak) location service
We formally state the requirements on such a service
Location Service
Used in practice, ignored in theory
We formalize the weak external service as an oracle:
oracle.query() returns some "legal" configuration
Not enough to solve reconfiguration
If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
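A possible rendering of this oracle specification as a Go interface sketch; the names and types here are ours, not the paper's:

```go
package dynadisk

// Config is a configuration: a set of storage-node IDs
// (representation chosen for illustration).
type Config []string

// LocationOracle captures the weak location service formalized above.
type LocationOracle interface {
	// Query returns some "legal" configuration. A legal but stale
	// answer is allowed, which is why the oracle alone is not
	// enough to solve reconfiguration. Liveness: if reconfigurations
	// stop and Query is invoked infinitely many times, it eventually
	// returns the last configuration of the system.
	Query() Config
}
```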
The Coordination Module in DynaDisk
Storage devices in a configuration conf = {+A, +B, +C}
[Figure: each of the devices A, B, C stores the R/W objects x, y, z plus a "next config" field]
Distributed R/W objects, updated similarly to ABD
Distributed "weak snapshot" object
API:
update(set of changes) → OK
scan() → set of updates
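The weak snapshot API, transcribed as a Go interface sketch; the Change representation is an assumption made for illustration:

```go
package dynadisk

// Change is a single configuration change such as +D or –C
// (the representation is illustrative).
type Change struct {
	Add  bool   // true for +node, false for –node
	Node string // storage-node ID
}

// WeakSnapshot is the per-configuration coordination object,
// with the API given on this slide.
type WeakSnapshot interface {
	// Update proposes a set of configuration changes.
	Update(changes []Change)
	// Scan returns the sets of changes proposed so far; every two
	// non-empty results of Scan intersect (see the next slides).
	Scan() [][]Change
}
```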
Coordination with Consensus
[Figure: clients concurrently invoke reconfig({–C}) and reconfig({+D}); a consensus instance among the devices A, B, C decides on +D, which is then written to the "next config" field of every device]
update: propose the changes to consensus and write the decided value to a majority
scan: read & write back the next config from a majority
Every scan returns {+D} or the empty set
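A minimal in-memory sketch of this consensus-based variant, reusing the Change type from the interface sketch above. The Consensus type is a first-proposal-wins stand-in for Active Disk Paxos, and a single field stands in for the "next config" register that is really replicated on a majority of devices:

```go
package dynadisk

import "sync"

// Consensus stands in for Active Disk Paxos: the first proposal
// wins, and every Propose returns the decided value.
type Consensus struct {
	mu      sync.Mutex
	decided []Change
}

func (c *Consensus) Propose(v []Change) []Change {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.decided == nil {
		c.decided = v
	}
	return c.decided
}

// ConsensusSnapshot implements the weak snapshot API on top of consensus.
type ConsensusSnapshot struct {
	cons Consensus
	mu   sync.Mutex
	next []Change // models the majority-replicated "next config" field
}

func (s *ConsensusSnapshot) Update(changes []Change) {
	decided := s.cons.Propose(changes) // concurrent updates converge on one value
	s.mu.Lock()
	s.next = decided // in DynaDisk: write the decided value to a majority
	s.mu.Unlock()
}

// Scan reads the next config (from a majority, with write-back, in the
// real protocol); every scan therefore returns the decided set or nothing.
func (s *ConsensusSnapshot) Scan() [][]Change {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.next == nil {
		return nil
	}
	return [][]Change{s.next}
}
```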
Weak Snapshot – Weaker than Consensus
No need to agree on the next configuration, as long as each process has a set of possible next configurations and all such sets intersect
Intersection allows clients to converge and again use a single config
Non-empty intersection property of weak snapshot: every two non-empty sets returned by scan() intersect
Example:
Client 1's scan   Client 2's scan
{+D}              {+D}          (what consensus gives: identical sets)
{–C}              {+D, –C}      (allowed by weak snapshot: the sets intersect)
{+D}              {–C}          (forbidden: the non-empty sets are disjoint)
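For illustration, a hypothetical helper that checks the intersection property on two scan results (Change structs are comparable, so == works):

```go
package dynadisk

// setsIntersect reports whether two sets of changes share an element;
// for a correct weak snapshot, it must hold for every pair of
// non-empty scan results.
func setsIntersect(a, b []Change) bool {
	for _, x := range a {
		for _, y := range b {
			if x == y {
				return true
			}
		}
	}
	return false
}

// setsIntersect([]Change{{false, "C"}}, []Change{{true, "D"}, {false, "C"}})
//   → true  (allowed: both scans contain –C)
// setsIntersect([]Change{{true, "D"}}, []Change{{false, "C"}})
//   → false (such a pair of non-empty scans is forbidden)
```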
Coordination without consensus
[Figure: each of the devices A, B, C stores the R/W objects plus a vector of proposal slots 0, 1, 2. The reconfig({+D}) client installs +D in slot 0; the reconfig({–C}) client's CAS({–C}, ⊥, 0) then fails because slot 0 already holds +D, so it retries with CAS({–C}, ⊥, 1) and installs –C in slot 1]
update: install the proposal in the first empty slot of the vector, using CAS on a majority of devices
scan: read & write-back proposals from majority (twice)
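A sketch of this asynchronous variant in the same style, modeling a single device's slot vector; in DynaDisk every CAS and every collect goes to a majority of the configuration's devices, and the write-back during scan (elided here) is what enforces the intersection property:

```go
package dynadisk

import "sync"

// AsyncSnapshot sketches the consensus-free weak snapshot.
type AsyncSnapshot struct {
	mu    sync.Mutex
	slots [][]Change // slot i holds the i-th installed proposal, or nil
}

// cas installs v in slot i only if the slot is still empty,
// mimicking the devices' compare-and-swap primitive.
func (s *AsyncSnapshot) cas(i int, v []Change) bool {
	for len(s.slots) <= i {
		s.slots = append(s.slots, nil)
	}
	if s.slots[i] == nil {
		s.slots[i] = v
		return true
	}
	return false
}

// Update walks the vector and CASes the proposal into the first empty
// slot; a failed CAS means another proposal got there first, as with
// CAS({–C}, ⊥, 0) failing against +D in the figure above.
func (s *AsyncSnapshot) Update(changes []Change) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := 0; ; i++ {
		if s.cas(i, changes) {
			return
		}
	}
}

func (s *AsyncSnapshot) collect() [][]Change {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([][]Change(nil), s.slots...)
}

// Scan does a double collect: read the proposals twice and retry if an
// update slipped in between (the "twice" on the slide).
func (s *AsyncSnapshot) Scan() [][]Change {
	for {
		first := s.collect()
		second := s.collect()
		if len(first) == len(second) { // no update raced between collects
			return second
		}
	}
}
```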
Tracking Evolving Configs
With consensus: agree on next configuration
Without consensus – usually a chain, sometimes a DAG:
[Figure: with consensus, the configurations form a chain, e.g., {A, B, C} → {A, B, C, D} after +D. Without consensus, one client's scan() may return {+D} (leading to {A, B, C, D}) while another's returns {+D, –C} (leading to {A, B, D}); because all non-empty scans intersect, the inconsistent updates are found and merged, converging to {A, B, D}]
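As a small worked example of the merge step, a hypothetical helper that derives a next configuration from the current members and the union of scanned changes (again reusing the Change type; not the paper's code):

```go
package dynadisk

import "sort"

// applyChanges applies the union of the update sets found by scan()
// to the current member set and returns the resulting configuration.
func applyChanges(conf []string, updates [][]Change) []string {
	members := map[string]bool{}
	for _, m := range conf {
		members[m] = true
	}
	for _, set := range updates {
		for _, ch := range set {
			members[ch.Node] = ch.Add // +node adds, –node removes
		}
	}
	var next []string
	for m, in := range members {
		if in {
			next = append(next, m)
		}
	}
	sort.Strings(next)
	return next
}

// Example from the figure: merging the branches reached via {+D} and
// {+D, –C} from {A, B, C} converges to {A, B, D}:
// applyChanges([]string{"A", "B", "C"},
//	[][]Change{{{true, "D"}}, {{false, "C"}}}) → [A B D]
```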
Consensus-based vs. Asynchronous Coordination
Two implementations of weak snapshots:
Asynchronous
Partially synchronous (consensus-based)
Active Disk Paxos [Chockler, Malkhi, 2005]
Exponential backoff for leader election
Unlike asynchronous coordination, consensus-based might not terminate [FLP85]
Storage overhead (per storage device and configuration):
Asynchronous: a vector of updates; vector size ≤ min(#reconfigs, #members in config)
Consensus-based: 4 integers and the chosen update
Strong progress guarantees are not for free
Compared with consensus-based coordination, asynchronous (no consensus) coordination has:
A significant negative effect on R/W latency while reconfigurations execute
Slightly better, and much more predictable, reconfig latency when many reconfigs execute simultaneously
The same R/W performance when no reconfigurations execute
Future & Ongoing Work
Combine asynchronous and partially synchronous coordination
Consider other weak snapshot implementations, e.g., using randomized consensus
Use weak snapshots to reconfigure other services, not just R/W storage
Summary
DynaDisk – dynamic data-centric R/W storage
First decentralized solution
No downtime
Supports many objects, provides incremental reconfig
Uses one coordination object per config (not per object)
Tunable reconfiguration method
We implemented asynchronous and consensus-based coordination; many other implementations of weak snapshots are possible
Asynchronous coordination in practice: works in more circumstances → more robust
But at a cost: it significantly affects ongoing R/W ops