Consistent Main-Memory Database Federations under Deferred Disk Writes

Rodrigo Schmidt, Fernando Pedone
École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
Università della Svizzera Italiana (USI), CH-6904 Lugano, Switzerland

Abstract. Current cluster architectures provide the ideal environment to run federations of main-memory database systems (FMMDBs). In FMMDBs, data resides in the main memory of the federation servers, significantly improving performance by avoiding I/O

during the execution of read operations. To maximize the performance of update transactions as well, some applications recur to deferred disk writes. This means that update transactions commit before their modifications are written on stable storage, and durability must be ensured outside the database. While deferred disk writes in centralized MMDBs relax only the durability property of transactions, in FMMDBs transaction atomicity may also be violated in case of failures. We address this issue from the perspective of log-based rollback-recovery in distributed systems and provide an

efficient solution to the problem.

Keywords: dependency tracking, consistency, rollback-recovery, distributed transactions, MMDBs.

1. Introduction

Continuous technology improvements have reduced the cost and boosted the performance and memory capacity of commodity computers. As a consequence, powerful computer clusters are becoming increasingly affordable and common. These architectures provide the ideal environment for mechanisms targeting high-performance computing, such as main-memory database systems (MMDBs [11]). Although originally designed for specific classes of

applications (e.g., telecommunication) running in single servers, recent work has suggested that MMDBs can also be used in broader contexts (e.g., web servers [18]) and environments (e.g., clustered architectures [24]). Shortly, MMDBs overcome the latency limitations of traditional disk-based databases by storing the data items in the main memory of the servers [12]. By avoiding disk I/O, both transaction throughput and response time can be improved. (The work presented in this paper has been partially supported by the Hasler Foundation, Switzerland, project #1899.) Moreover, as

transactions do not have to wait for data to be fetched from disk, concurrency becomes less important for performance, and some approaches have considered lowering the overhead of transaction synchronization by reducing concurrency (e.g., locking tables instead of rows, or executing transactions sequentially [11, 15]). For recovery reasons, MMDBs also keep a copy of the database on disk. Queries execute entirely using data in main memory, but update transactions have to modify the state on disk. In fact, accessing the disk is the main overhead incurred by update transactions executing in an MMDB. To maximize the

performance in such cases, some applications recur to deferred disk writes. This means that update transactions commit before their modifications are written on stable storage. Since disk access is deferred until after transactions commit, various transaction logs can be grouped and asynchronously written at once on disk. This approach alone harms the durability property of transactions, but some applications may prefer to ensure durability outside the database for performance reasons. As an example of such applications, database replication schemes based on atomic broadcast primitives

(e.g., the database state machine approach [19]) in the crash-recovery model will have durability ensured by the group communication primitive (see the work in [22]) and, therefore, it is redundant to also have it in each database replica. This paper considers a federation of main-memory database systems (FMMDB) where data is partitioned among different servers running local MMDBs. Global transaction termination is implemented by atomically grouping the commit decisions of the various local sub-transactions. As in a centralized database, applications can choose to use deferred disk writes in order to

improve system performance. Deferred disk writes, however, introduce additional complexities in an FMMDB. In a single-server system, only the durability property
may be violated in case of a database crash, and this holds as long as log writes respect the commit order of their respective transactions. By contrast, in a federation a crash may render a server inconsistent with respect to the others, compromising atomicity as well. Consider a simple federation composed of two database servers. If a transaction updates data in both servers, commits, and one of the servers crashes before making

the updates locally persistent, then when the failed server recovers from the failure it will have forgotten the transaction's local execution. In this case, atomicity is violated by the fact that only part of the transaction persists: the part at the server that did not crash. We address the problem of deferred disk writes in a federation of MMDBs using a novel approach that borrows from the theory of rollback-recovery in distributed systems [9]. The basis of this theory is the identification of dependencies between process states. This allows the recognition of consistent global states (i.e., those composed of local states such

that no one depends on the other) to which the application should be rolled back in case of a failure. Efficiently applying these results in the context of transaction processing systems, however, is not straightforward and requires revisiting the original theory. Transaction processing systems create dependencies between database states differently from usual message-passing distributed systems. In the latter, dependencies are based on causality; in the former, dependencies are created by read and write operations on database objects during the execution of transactions. Consider, for

example, a simple distributed transaction execution composed of two servers and one client. Two transactions execute sequentially. Figure 1 depicts the execution, where read requests are denoted by R, write requests by W, and commit requests by C, and where the database states at the two servers are also represented. A database server changes its state after an update transaction commits at it; the state remains the same if the transaction only reads the local state or aborts. In a usual message-passing system, the first server's state would precede the second server's, since there is a causal path between the two states (depicted in bold in Figure 1). However, since

the second transaction only reads the first server's state, it turns out that the two states are in fact concurrent. This example shows that causality is actually too strong to capture database state dependencies, and a more appropriate formalism is needed. We revisit the original dependency definitions, developed for message-passing systems, and propose a new one based on database states, minimal for distributed transaction environments and allowing an efficient tracking implementation. (Event e causally precedes event e' if (i) they execute in the same process, e before e'; or (ii) e refers to the sending of a message and e' refers to its receipt; or (iii) e and e' are related by the transitive closure of the two previous conditions [17].)

[Figure 1. False (causal) dependency]

Moreover, this paper illustrates the applicability of our approach in the context of an FMMDB with deferred disk writes. Our solution is optimistic in the sense that we do not force servers to synchronize their accesses to disk (e.g., using a two-phase-commit-like protocol), but track dependencies between database states during normal execution and, in case of failure, bring the system to a consistent state during recovery.

This paper is structured as follows. Section 2

introduces our computational and execution models. Section 3 explores consistency and dependencies in a transactional system. Section 4 presents our algorithms to ensure correctness of execution in a federation of main-memory databases with deferred disk writes. We compare our approach with existing works in the field in Section 5 and conclude the paper in Section 6. Due to space limitations, theorems and correctness proofs are presented in the full paper [25].

2. System model

We assume a system composed of two disjoint sets of processes: the set of servers and the set of clients. Servers are

stateful (their state is given by the data values stored on them) and clients are stateless (their state can be recreated from the servers' state in case of a crash). We assume that clients interact with servers only by submitting transaction requests and waiting for their responses. All communication between clients and servers is done through message exchange. The system is asynchronous: we make no assumptions about the time needed for processes to execute and for messages to be transmitted. Communication links may lose messages, but if both sender and receiver remain up "long enough," lost messages can

be retransmitted and are eventually received. A process can fail by crashing, stopping its execution and losing its volatile state, but

(The implementation of a distributed transactional environment may require stronger assumptions, e.g., failure suspicion. The ideas described in this paper, however, are oblivious to such assumptions.)
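The deferred-disk-write behavior assumed throughout can be illustrated with a small sketch (hypothetical names, not the paper's code): committed transactions' redo-only log entries are buffered in volatile memory, and a whole batch of them is later made durable with a single flush.

```python
class DeferredRedoLog:
    def __init__(self):
        self.volatile = []   # committed but not yet durable entries
        self.stable = []     # entries already on "disk" (a list stands in)

    def commit(self, tid, ops):
        # the transaction commits as soon as its redo entry is buffered
        self.volatile.append((tid, ops))

    def flush(self):
        # one sequential write makes a whole batch of transactions durable
        self.stable.extend(self.volatile)
        n = len(self.volatile)
        self.volatile = []
        return n

log = DeferredRedoLog()
log.commit("t1", [("w", "x", 1)])
log.commit("t2", [("w", "y", 2)])
flushed = log.flush()
```

A crash before `flush` loses both transactions even though they committed, which is exactly the window the rest of the paper reasons about.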
it eventually recovers. Servers are equipped with stable storage whose contents survive crashes. The system execution alternates between normal execution periods and recovery sessions. A recovery session starts when a failure is noticed and ends after the servers

are ensured to be in a globally consistent state.

2.1 Database servers and transactions

Servers store disjoint subsets of the entire database accessible to the clients and run local main-memory databases. We call the complete set of servers a main-memory database federation. Each server executes local transactions, where a transaction is a (most likely short) sequence of read and write operations on data items, followed by a commit or an abort operation, but not both. A transaction is called read-only if it does not contain any write operations, and update otherwise. Transactions are abstracted by the

following traditional properties [13]:

Atomicity: a transaction's changes to the database state are atomic: either all happen or none happen.
Consistency: a transaction is a correct transformation of the database state.
Isolation: an execution of a set of transactions is equivalent to a serial execution of the same transactions.

Durability is relaxed as a result of deferred disk writes. If there is a failure before a transaction is made durable, but after its commit, such a transaction is lost. In that case, after recovery, the execution has to proceed as if the transaction had never executed. Lost transactions differ

from aborted ones because their commit and their results may have been seen by other transactions. A transaction that is not lost throughout the execution is called persistent. We redefine transaction durability under deferred disk writes through the two properties below:

Weak Durability: if an update transaction commits and the system does not crash for "long enough," the transaction is persistent.
Consistent Persistence: a persistent transaction is preceded only by other persistent transactions.

In order to make the previous definitions sound, two things still have to be defined: equivalence between executions of sets of transactions, and precedence between transactions. Let a transaction history H be a partial order on all the operations executed by a set of transactions, necessarily defined for all conflicting operations; two operations are said to conflict if they both operate on the same data item and one of them is a write [4]. H represents a real execution (not necessarily serial) of the transactions in the system. Two histories over the same set of transactions are equivalent if they order the conflicting operations of non-aborted persistent transactions in the

same way. We say that transaction t directly precedes transaction t' in H if there is a pair of conflicting operations op (of t) and op' (of t') such that op precedes op' in H. The precedence relation between transactions is given by the transitive closure of the direct precedence relation. Having clarified our definitions, we would like to reinforce that our concern is to extend Weak Durability and Consistent Persistence from the local database servers to the federation, and to ensure that none of the other transaction properties are violated in the presence of failures. We assume the concurrency control in each server is based

on shared read locks and exclusive write locks over the whole local database, characterizing the multiple-reader single-writer behavior found in some MMDBs (e.g., [15]). This allows us to abstract client operations as Reads and Writes performed over an entire database state. We show how our approach can be extended to more complicated concurrency control mechanisms, such as two-phase locking, in Section 4.5. A server updates its state to a new one after committing a transaction that wrote some value on the server. This creates a sequence of states s_x^0, s_x^1, ..., where s_x^i represents the state of server x after committing the i-th local update

transaction.

2.2 Clients' execution model

Clients execute a sequence of steps. In each step, a client (a) performs some local computation, (b) submits a request to a database in the federation, and (c) waits for its response. We abstract the set of possible database requests by the following primitives, where op represents an operation to be submitted to the database. Details about their implementation are given in Section 4.2.

Read(tid, op): operation op reads some data item stored in the server on behalf of transaction tid.
Write(tid, op): operation op updates some data item stored in the server, or creates it, on behalf of

tid.
Commit(tid): requests the global commit of transaction tid in the federation.
Abort(tid): requests the global abort of transaction tid in the federation.

To start a new transaction, a client generates a new unique transaction identification number (tid), to be used in all servers. When a server receives the first operation on behalf of tid (either a Read or a Write), it creates a new transaction abstraction in the local database and relates it to tid, in order to submit future operations to the database within the same local transaction abstraction. When all the operations in all servers

referent to a certain
transaction have been executed, the client executes the Commit request to ensure global commit. After a Commit or Abort request, no more requests with the same tid are executed by the client. At any point during a transaction's execution, a server that is participating in it can unilaterally abort its local sub-transaction. This is done, for example, if the local sub-transaction is involved in a deadlock or the server suspects that the client responsible for this transaction has crashed. To ensure transactions' atomic commit in the absence of failures, we use a simple blocking protocol: the client sends a message to all involved servers asking them to prepare to commit. Every involved server sends its commit/abort vote to the client and to the other servers. A server commits the transaction if it receives a "commit" vote from every involved server. Moreover, if the client receives the "commit" vote from every server, it knows the transaction has been committed. To abort a transaction, a client simply sends an "abort" message to all involved servers. If the client fails and some server does not receive such a message, eventually this server will unilaterally abort the

transaction. It is clear that this algorithm (derived from two-phase commit [4, 13]) works in the absence of failures. Section 4.2 shows how Atomicity is preserved in the presence of failures, albeit no disk write is executed during transaction commit.

3. Consistent global database states

When a failure occurs, we must make sure that the system will restart from a previous consistent global state. In this section we precisely define the notion of consistency, analyze the conditions that make a global database state consistent, and show what must be done by our algorithm to have it recoverable.

3.1

Database-state dependencies

When it comes to the creation of database-state dependencies, we are only interested in committed transactions. Therefore, we consider only committed transactions in the definitions and theorems presented in this section and, for simplicity, omit this condition in their statements. Additionally, some extra notation is necessary. We use AS(t) to represent the set of server states accessed by transaction t throughout its execution, and WS(t) for the set of server states updated by t. This means that if s_x^i ∈ WS(t) and t commits, a new database state s_x^{i+1} is created by t at server x. Furthermore, we

define RS(t) = AS(t) \ WS(t) to be the set of server states read by t. State dependencies in the transactional model are due to the three well-known types of transaction dependencies: write-read, write-write, and read-write [4, 13]. Definition 1 below captures the notion of transaction dependency using our terminology in a simplified manner, where write-read and write-write dependencies are represented by condition (a), read-write dependencies by condition (b), and transitive dependencies by condition (c). In this context, a database state precedes another one if the former is overwritten by a transaction that

either creates the latter or precedes the transaction that does it. This means that the first state will have already been overwritten by the time the second one is created and, therefore, no transaction (or external viewer) can see both of them together in the same global database state. Definition 2 presents this idea more formally.

Definition 1. Transaction t precedes transaction t' (t → t') if (a) WS(t) ∩ AS(t') ≠ ∅; or (b) RS(t) ∩ WS(t') ≠ ∅; or (c) there is a transaction t'' such that t → t'' and t'' → t'.

Definition 2. State s precedes state s' (s → s') if there is a committed transaction t with s ∈ WS(t) such that (a) t creates s'; or (b) t → t' and t' creates s'; or (c) s → s'' and s'' → s' for some state s''.

3.2 Consistent and recoverable database states

A global state of the federation is a set composed of one local state for each database server

in the system. We base our consistency criterion on the notion of serializability [4] and formalize it in Definition 3.

Definition 3. A global database state in a given history H is consistent if it represents the database state after the serial execution of an ordered set of transactions such that: (a) all these transactions are non-aborted persistent transactions in H; (b) the set is left-closed under the precedence relation in H; and (c) the serial order is consistent with the precedence relation in H.

From Definition 3, a global state is consistent if it is created by the execution, in a correct order, of a subset of the executed transactions left-closed under the transaction dependency relation. Theorem 1 shows a simpler characterization of a consistent global state based on the database-state dependency relation we introduced in Definition 2.

Theorem 1. A global state is consistent if, for every pair of states s and s' it contains, s does not precede s'.
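Theorem 1's criterion can be sketched directly in code (a minimal illustration with our own encoding: states are (server, index) pairs and the direct precedence relation is given explicitly).

```python
def transitive_closure(pairs):
    # naive closure of a binary relation, sufficient for small examples
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def is_consistent(global_state, direct_precedences):
    # Theorem 1: no chosen state may precede another chosen state
    prec = transitive_closure(direct_precedences)
    return not any((s, s2) in prec for s in global_state for s2 in global_state)

prec = {(("x", 0), ("y", 1)), (("y", 1), ("x", 2))}
ok = is_consistent({("x", 0), ("y", 0)}, prec)    # no precedence inside the set
bad = is_consistent({("x", 0), ("y", 1)}, prec)   # ("x", 0) precedes ("y", 1)
```

The second call fails precisely because state ("x", 0) was already overwritten when ("y", 1) was created, so no execution could exhibit both together.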
As an example, consider Figure 2(a), where we show a possible execution scenario in which transactions are applied to a federation of two database servers. We omit message exchanges between clients and servers and depict only the operations performed against the databases, grouped by transaction, where W means a database write and R means a database read. Figure 2(b) shows the dependencies between the database

states created by the executed transactions. We depict only the direct dependencies and omit the transitive ones. Based on these dependencies, it is possible to identify a total of seven consistent global states according to Theorem 1, all of them depicted in Figure 2(b). Each of these global states is reached after the serial execution of the corresponding left-closed subset of the transactions. By the Weak Durability property described in Section 2.1, if one server crashes, it might not recover in the same state it was in just before the crash. According to Consistent Persistence, locally ensured by the MMDB running in the server, an entire suffix of the local execution may be lost after a failure. As this new local state may be inconsistent with the state of the other servers, to ensure Consistent Persistence globally the entire system may have to roll back to a previous consistent global state. Clearly, we want this state to be as recent as possible, to roll back the least number of committed transactions. In order to satisfy this condition we have to distinguish between stable database states, already written on the servers' disks, and volatile database states, whose local durability has not been ensured yet.
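The rollback just described can be sketched as follows (our names and a deliberately simplified per-server search, not the paper's algorithm): a state that depends, directly or transitively, on a volatile state cannot survive a crash that loses the volatile suffix, so each server falls back to its most recent stable state free of such dependencies.

```python
def recovery_line(last_stable, deps):
    """last_stable: {server: highest state index already on disk};
       deps: {(server, i): set of (server, j) states it depends on}."""
    def volatile(state):
        srv, i = state
        return i > last_stable[srv]

    def depends_on_volatile(state, seen=None):
        seen = set() if seen is None else seen
        if state in seen:          # guard against cyclic inputs
            return False
        seen.add(state)
        return any(volatile(d) or depends_on_volatile(d, seen)
                   for d in deps.get(state, ()))

    line = {}
    for srv, last in last_stable.items():
        # walk back to the newest stable state free of volatile dependencies
        i = last
        while i > 0 and depends_on_volatile((srv, i)):
            i -= 1
        line[srv] = i
    return line

# server x's second state depends on a volatile state of server y
line = recovery_line({"x": 2, "y": 1}, {("x", 2): {("y", 2)}})
```

Here x must roll back one state even though that state was already on disk, which is the global coordination problem the following sections formalize.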

A consistent global state is recoverable if it is composed of stable database states. When some database servers crash, the recovery algorithm must make the system roll back to its most recent recoverable consistent global state, or recovery line. A non-faulty server that wants to make its volatile states part of the recovery line should make them stable before executing the recovery algorithm.

[Figure 2. Consistent global states]

[Figure 3. Recovery-line determination]

The main determiner of the recovery line in some history is the last stable state of each server, which we denote by last_x for server x. As Theorem 2 shows, the recovery line for a given execution scenario is composed, at each server, of the last persistent state not

preceded by any volatile state, i.e., by any state more recent than some server's last stable state.

Theorem 2. The recovery line for a given history is obtained by taking, for each server x, the most recent stable state of x that is not preceded by any state succeeding last_y at any server y.

Figure 3 depicts an example of recovery-line determination based on the scenario presented in Figure 2 (volatile states are depicted between square brackets). The figure shows a dependency graph with all the states that depend on some state succeeding a last_y drawn as empty circles. Therefore, the recovery line is formed by the state represented by the last filled circle in each database server.

4. Database-oriented rollback-recovery

4.1 Thrifty dependency tracking

Definition 2 relates

database-state dependencies to transaction dependencies. Theorem 3 below shows that it is also possible to keep track of database-state dependencies without having to gather information about transaction dependencies.

Theorem 3. Whether a server state precedes another can be decided from the partial federation states accessed and generated by a single committed transaction; conditions (a), (b), and (c) of the theorem correspond to the scenarios illustrated in Figure 4.

Theorem 3 comes from the fact that a transaction accesses a consistent partial state of the federation and generates, after its execution, another consistent partial state. These states work like partial snapshots of the execution and, therefore, impose constraints on the ordering of events. As in the real world, if an event is

captured in a snapshot and another one is not (i.e., it took place after the snapshot was taken), then the snapshot is "proof" that the first event happened before the second.
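The snapshot argument can be made concrete by deriving state precedences mechanically from the transactions' read and write sets, following Definitions 1 and 2 (our encoding; server-granularity conflicts mirror the whole-database locks assumed in Section 2.1).

```python
def state_precedences(order, reads, writes):
    """order: transactions in commit order; reads/writes: {t: set of
       servers read / updated}. Returns state precedences as pairs of
       (server, index) states."""
    n = {}            # current state index per server
    overwritten = {}  # states each transaction overwrites
    created = {}      # states each transaction creates
    for t in order:
        overwritten[t] = {(s, n.get(s, 0)) for s in writes[t]}
        for s in writes[t]:
            n[s] = n.get(s, 0) + 1
        created[t] = {(s, n[s]) for s in writes[t]}
    # Definition 1: t -> t2 if they conflict on a server, one of them writing
    tprec = set()
    for i, t in enumerate(order):
        for t2 in order[i + 1:]:
            if (writes[t] & (reads[t2] | writes[t2])) or (reads[t] & writes[t2]):
                tprec.add((t, t2))
    changed = True
    while changed:  # transitive closure over transactions
        changed = False
        for a, b in list(tprec):
            for c, d in list(tprec):
                if b == c and (a, d) not in tprec:
                    tprec.add((a, d))
                    changed = True
    # Definition 2: a state t overwrites precedes every state created by t
    # or by a transaction t precedes
    sprec = set()
    for t in order:
        for s in overwritten[t]:
            for t2 in order:
                if t2 == t or (t, t2) in tprec:
                    sprec |= {(s, s2) for s2 in created[t2]}
    return sprec

# an execution shaped like Figure 1: t1 updates x, then t2 reads x, updates y
prec = state_precedences(
    ["t1", "t2"],
    reads={"t1": set(), "t2": {"x"}},
    writes={"t1": {"x"}, "t2": {"y"}},
)
```

The derived relation contains a precedence from x's overwritten state to y's new state, but none between the state t2 read and the state it created: those two are concurrent, matching the point made about Figure 1 in the introduction.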
We exemplify conditions (a), (b), and (c) of Theorem 3 in Figure 4, where Before(t) refers to the (partial) federation state accessed by transaction t, either a read-only or an update transaction, and After(t) refers to the federation state generated after t's execution. In the figure, scenarios (a1) and (a2) correspond to condition (a) of Theorem 3, and scenarios (b) and (c) correspond to conditions (b) and (c), respectively. Figure 4(a1) depicts a situation in which one server state is overwritten by t while another belongs to After(t): when t commits, the new state it creates coexists with the latter, and since the overwritten state necessarily precedes the new state, it also precedes every state in After(t). Figure 4(a2) represents the case in which both states involved are written by t: the new state is created by t and thus did not exist before t's commit, whilst the old one existed only until t commits, since it is updated by t. This means that no other transaction can see the two states together, between Before(t) and After(t). In Figure 4(b), the two states belong to the federation state accessed by t; similarly to the situation depicted in Figure 4(a1), the state overwritten by t must precede the other. Lastly, let us consider the case

shown in Figure 4(c). The state generated after t's commit contains both states; since one of them had already been updated before the other was created, and both belong to After(t), the precedence follows. The remaining scenario of condition (c) resembles the situation depicted in Figure 4(b), just exchanging Before(t) for After(t).

4.2 Dependency tracking algorithm

Theorem 3 leads to a simple way to gather database-state dependencies on the fly during the system execution.

[Figure 4. Dependencies based on the server states accessed by a transaction]

Assume each state has associated with it a data structure Dep representing the set of states it depends on (we show later how this structure can be implemented efficiently). To update Dep upon committing, every transaction executes the steps described in Algorithm 1, where D_x is an auxiliary data structure local to server x, initially empty, representing the dependencies that must be attributed to the next state to be created at server x. Lines 1–3 are directly associated with the three possible database-state precedences presented in Theorem

3. Line 4 associates a dependency data structure with every new database state created by the transaction.

Algorithm 1: Dependency tracking. During the commit of transaction t at server x:
1–3: add to D_x the dependencies given by the three state precedences of Theorem 3;
4: associate D_x with the new state created at x.

We now explain how Algorithm 1 can be implemented in practice. We start by analyzing how MMDBs write database state changes on stable storage. In MMDBs, data changes are stored on disk only after an update transaction has issued a commit request. This means that no action must be undone in case of failures, and the transaction log is typically redo-only and can be implemented by simply storing the set of operations performed by each

transaction [8]. Regardless of its particular implementation details, each entry in a redo-only log represents the new state created by the respective update transaction. We can therefore associate the i-th database state with the i-th entry in the log of its server. To keep track of dependencies, the only thing we have to do is to write the Dep structure with its respective transaction entry in the transaction log. For a practical implementation, we must provide a way to implement the Dep data structure efficiently with respect to space complexity. As dependencies are transitive and contiguous in the sequence of states of a

database server, it is not difficult to see that to keep track of the complete set of dependencies of a given state we need to store only the last state of each server on which it depends: if a state depends on the i-th state of some server, it clearly also depends on all earlier states of that server. Therefore, a complete set of state dependencies can be represented by a dependency vector with n entries, in which entry y stores the index of the most recent state dependency from server y. This idea and nomenclature is
inspired by dependency tracking for rollback-recovery in the message-passing model [28]. We divide our dependency tracking algorithm into two parts:

the client stub and the server wrapper, both shown in Algorithm 2. Only one when clause executes at a time, and only after its condition holds. If more than one when-clause condition holds at the same time, any one is chosen to execute. We assume, however, that the execution is fair, that is, unless the server crashes, every when clause whose condition holds will eventually be executed. To submit transaction operations to the local MMDB, the server makes use of the submit interface. Moreover, to make it clear that our approach does not introduce any extra disk operations, all log operations are dealt with by our algorithm, that

is, all submit calls access only data in the server's main memory. At the client side it is only necessary to keep track of the set of servers accessed during the execution of a transaction (line 2). Basically, all operations performed by the client stub are straightforward and have little to do with dependency tracking. Dependency tracking takes place at commit, making use of the synchronization messages exchanged by the servers to ensure transactions' atomicity. While analyzing the algorithm, remember that we assume Isolation is ensured by a simple database-locking mechanism and global Atomicity

during normal execution is given by the variation of two-phase commit described in Sections 2.1 and 2.2, respectively. Although we make no explicit use of these two properties, they ensure the dependencies captured by our algorithm are consistent with the dependencies indeed created in the distributed database. Briefly, each server keeps two dependency vectors during execution: D implements Dep (the dependencies to be attributed to the next state created), and last stores the dependencies of the current database state. A server sends, together with the answer to the request issued by the

client, a dependency vector containing the dependencies the transaction should forward to all accessed servers, based on the operations performed in the local database (lines 35–41). This information is sent not only to the client but also to the other involved servers. Finally, when a server receives the messages from all servers involved in the transaction, it updates its D vector (lines 45–46) and, if the transaction wrote some data in the database, the server performs a local state transition (lines 48–49). The correct implementation of Algorithm 1 is ensured by the dependencies propagated by the servers in these messages. (For code simplicity, we assume a single client does not execute two transactions concurrently.) Dependencies referent to line 1 of Algorithm 1 are gathered in line 38 of Algorithm 2. Dependencies given by line 2 of Algorithm 1 are gathered in line 39 of Algorithm 2 if the server was only read by the transaction, or in line 37 if the server was updated. Line 37 also captures dependencies referent to line 3 of Algorithm 1. Correctness proofs of Algorithms 1 and 2 appear in [25]. As mentioned before, the atomic commit mechanism we assumed can block processes in case of failure, forcing them to wait for a message from a process that has crashed. A blocked process is unblocked when the crashed server upon which it depends recovers and starts the global recovery procedure explained in the next section. During the recovery phase, all running transactions are aborted and global state consistency is ensured by the rollback-recovery mechanism. When execution resumes, no server is blocked anymore. A blocked client has to wait for a recovery notification to unblock and check with the database servers whether some transaction was lost. Unblocked clients may also start some recovery

procedure after receiving such a notification if they rely on something outside the database to ensure transaction durability.

4.3 Rollback-recovery

Once we have managed to perform dependency tracking efficiently during the execution, we can make use of one of the numerous existing approaches to orchestrate rollback-recovery in the message-passing model [14, 26, 28]. We illustrate the idea by extending the algorithm presented in [26], adapted to our execution model. The system runs as a sequence of incarnations, each started after recovery from some failure. Each server keeps track of the current

incarnation. In order to start a new one, an agreement among servers must be reached to determine the recovery line used for the federation restart. Therefore, processes exchange messages containing information about their last stable database state. When all information is received by a server, it computes its local state that takes part in the recovery line, based on Theorem 3, and rolls back to it by erasing inconsistent log entries. Due to the possibility of failures, information about the current incarnation and the last recovery line used for recovery must be kept in the stable storage of each server. A detailed description of this algorithm is presented in [25].

4.4 Algorithm analysis

Algorithm 2 incurs no extra cost during transaction execution with respect to the number of messages and
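To make the recovery-line computation of Section 4.3 concrete, here is a minimal sketch in Python. It is our own illustration, not the paper's algorithm: the state representation, the function name `recovery_line`, and the rollback rule are assumptions modeled on optimistic-message-logging recovery; the paper's precise criterion is Theorem 3 and its proof in [25].

```python
# Our own hypothetical sketch (names and representation are assumptions).
# stable[i][k] is the dependency vector of server i's k-th logged stable
# state: stable[i][k][j] = index of the latest state of server j that
# state (i, k) depends on. State 0 of every server depends on nothing
# newer than state 0, which guarantees termination.

def recovery_line(stable):
    n = len(stable)
    # Start optimistically from every server's most recent stable state.
    line = [len(states) - 1 for states in stable]
    changed = True
    while changed:
        changed = False
        for j in range(n):
            for i in range(n):
                # Orphan check: server j's chosen state depends on a state
                # of server i newer than i's chosen state, so j rolls back.
                if stable[j][line[j]][i] > line[i]:
                    line[j] -= 1
                    changed = True
                    break
    return line
```

For example, if server 1's latest stable state depends on a state of server 0 that was never made stable, server 1 is rolled back one state and the fixpoint yields a consistent cut.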
Algorithm 2 Complete algorithm for dependency tracking

{Client c}
1: Data Structures
2:   S: set of servers
3: Begin_Transaction()
4:   S ← ∅
5:   return unique tid
6: Read/Write(tid, op)
7:   S ← S ∪ {s}   {s is the server responsible for op}
8:   send (tid, op) to s
9:   wait for result from s
10:  return result
11: Commit(tid)
12:   send (COMMIT, tid, S) to all s ∈ S
13:   wait for (tid, vote) from all s ∈ S
14:   return YES   {if every vote is YES}
15: Abort(tid)
16:   send (ABORT, tid) to all s ∈ S

{Server s_k}
17: Data Structures
18:   opSet[tid]: ordered set of operations
19:   last: array [1..n] of integer
20:   S[tid]: set of servers
21: Initialization
22:   ∀ tid: opSet[tid] ← ∅, S[tid] ← ∅
23:   …
24:   last ← [0, …, 0]
25: The server continuously waits for an event:
26: when receive (tid, op) from client c   {read operation}
27:   result ← submit(tid, op)
28:   send result to c
29: when receive (tid, op) from client c   {update operation}
30:   result ← submit(tid, op)
31:   append op to opSet[tid]
32:   send result to c
33: when receive (COMMIT, tid, S) from client c
34:   S[tid] ← S
35:   if willing to commit then
36:     if opSet[tid] ≠ ∅ then
37:       aux ← …
38:       aux[…] ← last[…]
39:     else aux ← last
40:     send (tid, YES, aux) to S[tid]
41:   else send (tid, NO, ⊥) to S[tid]
42: when ∀ s ∈ S[tid]: received (tid, vote_s, aux_s) from s
43:   if ∀ s: vote_s = YES then
44:     submit(tid, commit)
45:     for all s ∈ S[tid] do
46:       last[s] ← max(last[s], aux_s[s])
47:     if opSet[tid] ≠ ∅ then
48:       last[k] ← last[k] + 1
49:     asynchronously write entry (tid, opSet[tid]) in the transaction log
50: when receive (ABORT, tid) from client c
51:   submit(tid, abort)

communication steps. The algorithm just piggybacks a vector timestamp in the messages related to the transaction commit and updates local variables according to the timestamps received. Our approach ensures the minimum possible "window of vulnerability" for transactions, since it depends only on the time each server takes to physically write the transaction log entry on stable storage. Every server does that at its own pace
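As a concrete illustration of the commit-time bookkeeping in lines 45-49 of Algorithm 2, the sketch below is our own Python rendering: the class name, message shapes, and the in-memory `log` list are assumptions (the real log write is asynchronous and goes to stable storage). It merges the dependency vectors piggybacked on the YES votes and advances the server's own state counter when the transaction wrote locally.

```python
# Hypothetical sketch of a server's commit-time bookkeeping; names are
# ours, not the paper's.

class Server:
    def __init__(self, my_id, n):
        self.my_id = my_id
        self.last = [0] * n   # last[j]: latest known state of server j
        self.log = []         # stand-in for the asynchronous transaction log

    def on_commit(self, aux_vectors, write_set):
        # Merge the dependency vectors received with the YES votes
        # (component-wise maximum, as in lines 45-46).
        for aux in aux_vectors:
            for j, ts in enumerate(aux):
                self.last[j] = max(self.last[j], ts)
        if write_set:  # the transaction updated this server (lines 47-49)
            self.last[self.my_id] += 1           # local state transition
            self.log.append((self.last[self.my_id], list(write_set)))
```

A server that only read for the transaction merges the vectors but performs no state transition and logs nothing.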

without synchronizing with the others; as soon as all of them complete their writes, the transaction is durable.

It is possible to come up with alternative solutions to the problem of ensuring consistency in a federation of main-memory databases under deferred disk writes. For instance, non-blocking synchronous checkpointing approaches for the message-passing model, like [6] and [16], can be adapted to the transactional model by considering database-state dependencies in the way we have defined. These algorithms, however, incur control messages during disk-write synchronization and may force the propagation of timestamps in the application messages to overcome the absence of FIFO communication channels [9], or two disk writes per synchronization to record the fact that the current instance has finished and new ones are allowed [16]. Although some difficulties can be avoided by stronger system assumptions, as in [24], the problem of increasing the window of vulnerability, making it as large as that of the slowest server for all servers, will always be present in synchronous algorithms.

Table 1 summarizes the comparison between the approaches we have mentioned. We aggregate synchronous checkpointing protocols (e.g., [6] and [16]) since they present similar behavior with respect to the variables analyzed in the table. Moreover, "MySQL Cluster" refers to the synchronous approach adopted in [24]. We represent the disk latency (i.e., the time it takes for a disk write request to be completed) of server s_i by lat_i, and use MAX to refer to max_i lat_i. The network latency, used to quantify a communication step, is represented by δ. Besides requiring FIFO channels, synchronous checkpointing protocols include the clients in their synchronization, since they are involved in the

creation of database-state dependencies. MySQL Cluster assumes
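The window-of-vulnerability entries in Table 1 reduce to simple arithmetic over the per-server disk latencies. The toy sketch below is our own illustration with made-up numbers: under synchronized disk writes, every server is exposed for the slowest disk's latency, MAX = max_i lat_i, while under deferred, unsynchronized writes each server i is exposed only for its own lat_i.

```python
# Our own toy illustration of the Table 1 latency comparison; the
# function name and numbers are assumptions.

def windows(latencies):
    """Per-server exposure windows under the two write disciplines."""
    sync = [max(latencies)] * len(latencies)  # synchronized disk writes
    deferred = list(latencies)                # each server at its own pace
    return sync, deferred

# Three servers with disk latencies of 5, 8 and 20 ms.
sync, deferred = windows([5, 8, 20])
```

With these numbers, the fast servers are exposed for 20 ms each under synchronization but only 5 and 8 ms under our approach.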
Algorithm           | Communication channels | Client synchronization | Window of vulnerability | Extra messages per execution
--------------------|------------------------|------------------------|-------------------------|-----------------------------
Sync. Checkpointing | FIFO                   | Clients participate    | MAX (…)                 | …
MySQL Cluster       | Partially Sync.        | Clients coordinate     | MAX (…)                 | …
Our approach        | Any                    | None                   | lat_i                   | 0

Table 1. Comparison of the different approaches

partially synchronous channels (i.e., with bounded message delivery) and have clients coordinate the task in order to simplify the algorithm. Differently, our approach makes no assumptions about communication channels

and only propagates timestamps on some of the messages already exchanged by the system. As the role of the clients participating in synchronous approaches is not very clear, possibly forcing more messages to be exchanged, for such approaches we only show the lower bound on the extra messages required for server synchronization.

4.5 Dealing with complex concurrency control

So far we have assumed a very simple concurrency control mechanism inside every single database server, with concurrent access for read-only transactions and exclusive access for update transactions. However, our results can be easily

extended to more general cases. For example, the well-known two-phase-locking (2PL) algorithm can be seen as an extension of our simple concurrency control where each piece of data plays the role of a "virtual database": multiple transactions can read the data concurrently, but only one can update it at a time. As a consequence, though, vector timestamps will have as many entries as the number of virtual databases. Clearly, the implementation of such a system can be simplified, since all virtual databases inside the same physical one will always be synchronized with each other. Reducing the size of

the timestamps will involve either the use of direct instead of transitive dependency tracking (and a more complex recovery algorithm [9, 26]), or the identification of false dependencies, as happens when logical clocks are used instead of vector clocks to gather causal dependencies between events [17]. Studying such alternatives is out of the scope of this paper and subject to further work. (In practice, larger timestamps might not incur large overheads, since in most MMDBs concurrency control is usually performed at a coarse granularity [11].)

5. Related Work

Although MMDBs do not represent a new concept in database design, only recently have they been applied to more general scenarios. Specifically, to our knowledge, the only work that makes use of MMDBs in a cluster of servers is [24] (derived from [23]), where performance and availability are enhanced by replicating and fragmenting the database among the database servers in the system. To ensure good performance for update transactions as well, the approach makes use of deferred disk writes, even for transactions that access multiple servers. In this case, consistency is ensured by synchronizing the servers' disk writes, as mentioned in

the previous section.

Rollback-recovery has been extensively studied in the message-passing model [1, 14, 16, 26, 28]. Nevertheless, very few of these works have been exploited in different environments. The work in [2] presents a framework to analyze consistency in different shared-memory and message-passing systems. In [3], their results are extended to the transactional model, motivated by the problem of building a consistent snapshot of a centralized database without stopping the execution of transactions. Actually, the problem of building a consistent database snapshot has triggered a lot of research on the analysis of database-state dependencies [3, 10, 20, 21, 27]. Different approaches have considered dependencies created between transactions due to concurrency control [5], or between data accessed within a single process, which should be consistently transferred to stable storage [7]. Some of the ideas presented in these works, especially in [3] and [10], resemble our transaction- and state-dependency definitions. However, none of them presents a practical characterization of database-state dependencies (e.g., Theorem 3). Our approach differs from these works by (a) assuming a distributed

scenario, where synchronization between different processes must be minimized, and (b) aiming at applying rollback-recovery techniques to bring the application back to a consistent state in case of a failure.

6. Concluding Remarks

In this paper we tackled the problem of deferred disk writes in federations of main-memory database systems. Our approach was motivated by previous research on rollback-recovery for message-passing distributed systems. We described how database-state dependencies are created in the transactional model and how they can
be tracked efficiently during execution. A possible extension to our algorithms is to use direct instead of transitive dependency tracking [9, 26], as this can possibly lead to smaller timestamps if transactions do not tend to access many servers. Moreover, our algorithms borrow from optimistic message logging. It is also possible to exploit other rollback-recovery techniques, like causal message logging and quasi-synchronous checkpointing, and compare their performance and advantages under different transaction scenarios. Research domains that may take advantage of this theory include optimistic concurrency control

mechanisms and the management of nested transactions. Investigating such issues is the subject of future work.

Acknowledgments

We thank the anonymous reviewers for their comments, which helped us improve the paper.

References

[1] L. Alvisi and K. Marzullo. Message logging: Pessimistic, optimistic, causal and optimal. IEEE Trans. on Software Engineering, 24(2):149–159, Feb. 1998.
[2] R. Baldoni, J.-M. Hélary, and M. Raynal. Consistent records in asynchronous computations. Acta Informatica, 35(6):441–455, June 1998.
[3] R. Baldoni, F. Quaglia, and M. Raynal. Consistent checkpointing for transaction systems. The Computer Journal, 44(2):92–100, 2001.
[4] P. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[5] B. Bhargava. Concurrency control in database systems. IEEE Transactions on Knowledge and Data Engineering, 11(1):3–16, 1999.
[6] K. M. Chandy and L. Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Trans. on Computer Systems, 3(1):63–75, Feb. 1985.
[7] F. Cristian, S. Mishra, and S. Hyun. Implementation and performance of a stable-storage service in Unix. In Proceedings of the 15th IEEE Symposium on Reliable Distributed Systems, 1996.
[8] D. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. Stonebraker, and D. A. Wood. Implementation techniques for main memory database systems. In SIGMOD'84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, pages 1–8. ACM Press, 1984.
[9] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, Sept. 2002.
[10] I. C. Garcia and L. E. Buzato. Asynchronous construction of consistent global snapshots in the object and action model. In Proc. of the 4th IEEE Int. Conference on Configurable Distributed Systems, 1998.
[11] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. IEEE Transactions on Knowledge and Data Engineering, 4(6):509–516, Dec. 1992.
[12] J. Gray. The revolution in database architecture. Technical Report MSR-TR-2004-31, Microsoft Research, 2004.
[13] J. N. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[14] D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using optimistic message logging and checkpointing. Journal of Algorithms, 11(3):462–491, 1990.
[15] K. Knizhnik. FastDB: Main-memory relational database management system. http://www…knizhnik/fastdb.html.
[16] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Trans. on Software Engineering, 13:23–31, Jan. 1987.
[17] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.
[18] D. Morse. In-memory database web server. Dedicated Systems Magazine, 4:12–14, 2000.
[19] F. Pedone, R. Guerraoui, and A. Schiper. The database state machine approach. Journal of Distributed and Parallel Databases and Technology, 14(1):71–98, 2003.
[20] S. Pilarski and T. Kameda. Checkpointing for distributed databases: Starting from the basics. IEEE Trans. on Parallel and Distributed Systems, 3(5):602–610, 1992.
[21] C. Pu. On-the-fly, incremental, consistent reading of entire databases. Algorithmica, 1(3):271–287, 1986.
[22] L. Rodrigues and M. Raynal. Atomic broadcast in asynchronous crash-recovery distributed systems and its use in quorum-based replication. IEEE Transactions on Knowledge and Data Engineering, 15(5):1206–1217, 2003.
[23] M. Ronström. The NDB cluster: A parallel data server for telecommunications applications. Ericsson Review, no. 4, 1997.
[24] M. Ronström and L. Thalmann. MySQL Cluster architecture overview. MySQL Technical White Paper, 2004.
[25] R. Schmidt and F. Pedone. Consistent main-memory database federations under deferred disk writes. Technical Report IC/2005/17, School of Computer and Communication Sciences, EPFL, 2005.
[26] A. Sistla and J. L. Welch. Efficient distributed recovery using message logging. In Proceedings of the 8th ACM Symposium on the Principles of Distributed Computing, pages 233–238, 1989.
[27] S. H. Son and A. K. Agrawala. Distributed checkpointing for globally consistent states of databases. IEEE Trans. on Software Engineering, 15(19):1157–1166, 1989.
[28] R. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Trans. on Computing Systems, 3(3):204–226, Aug. 1985.