The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine
Open Systems Laboratory, Indiana University
{ssankara, jsquyres, brbarret, lums}@lam-mpi.org

Jason Duell, Paul Hargrove, Eric Roman
Lawrence Berkeley National Laboratory
{jcduell, phhargrove, eroman}@lbl.gov

Abstract

As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.

1 Introduction

In recent years, the supercomputing community has seen a significant increase in the CPU count of large-scale computational resources. Seven of the top ten machines in the November 2002 Top500 list [1] utilize at least 2000 processors. With machines such as ASCI White, Q, and Red Storm, the processor count for the largest systems is now on the order of 10,000 processors, and this increasing trend will only continue. While the growth in CPU count has provided great increases in computing power, it also presents significant reliability challenges to
applications. In particular, since the individual nodes of these large-scale systems are comprised of commodity hardware, the reliability of the individual nodes is targeted for the commodity market. As the node count increases, the reliability of the parallel system decreases (roughly proportional to the node count). Indeed, anecdotal evidence suggests that failures in the computing environment are making it more difficult to complete long-running jobs, and that reliability is becoming a limiting factor on scalability.

The Message Passing Interface (MPI) is the de facto standard for message passing parallel programming for large-scale distributed systems [12, 14, 16, 17, 24, 30]. Implementations of MPI comprise the middleware layer for many large-scale high-performance applications [3, 15, 18, 37]. However, the MPI standard itself does not specify any particular kind of fault tolerant behavior. In addition, the most widely used MPI implementations have not been designed to be fault-tolerant.

To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Several factors were considered for our design.

Generality. Our design is an extension of the component framework comprising the most recent version of LAM/MPI [32, 33]. In general, the framework itself can be used to support a wide variety of fault tolerance mechanisms; we report on one such mechanism here. In particular, our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface.

Transparency. The particular implementation of coordinated checkpointing and rollback recovery that we report here was designed with transparency in mind. That is, our system can be used to checkpoint parallel MPI applications without making any changes to the application code. Involuntary checkpointing is consequently supported.
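Transparency in practice means the checkpoint request originates entirely outside the unmodified application. Assuming the BLCR cr_checkpoint utility described in Section 4, which is handed the process ID of mpirun, a session might look like the following sketch (the PID and mpirun options are illustrative, not prescribed by the paper):

    $ mpirun -np 4 ./my_app &      # unmodified MPI application
    $ cr_checkpoint 4312           # 4312 = PID of mpirun; requests a
                                   #   coordinated checkpoint of the whole job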
Performance. As shown by our experimental results, the addition of checkpointing support capabilities to LAM/MPI has insignificant impact on its message passing performance. And, since checkpoint support is run-time selectable, it can be bypassed altogether for applications that do not wish to use it.

Portability. Our implementation has been incorporated into the most recent release of LAM/MPI, a widely used and industrial strength open-source implementation of MPI. Although the BLCR checkpointer is currently available for Linux, LAM/MPI will operate on almost all POSIX systems. The general approach taken in this work will allow it to be easily extended to other single process checkpoint systems and to other operating systems.

The remainder of the paper is organized as follows. Section 2 discusses background information and related work. The design of our system is given in Section 3 and details of its implementation in Section 4.
Performance results are provided in Section 5. Future work and our conclusions are given in Sections 6 and 7.

2 Background

2.1 Checkpoint-Based Rollback Recovery

In the context of message-passing parallel applications, a global state is a collection of the individual states of all participating processes and of the states of the communication channels. A consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. Within a consistent global state, if a given process has a local state that indicates a particular message has been received, then the state of the corresponding sender must indicate that the message has been sent [4]. Figure 1 shows two examples of global states, one of which is consistent, and the other of which is inconsistent. A consistent global checkpoint is a set of local checkpoints, one for each process, forming a consistent global state. Any consistent global checkpoint can be used to restart process execution upon failure.

Checkpoint/restart techniques for parallel jobs can be broadly classified into three categories: uncoordinated, coordinated, and communication-induced. (These approaches are analyzed in detail in [10].)
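The consistency condition can be stated operationally: a global state is consistent exactly when no message is recorded as received without also being recorded as sent. A minimal sketch of that check (type and helper names are ours for illustration, not from LAM/MPI):

```c
#include <stdbool.h>
#include <stddef.h>

/* One message as recorded in a snapshot of the global state.
 * (Illustrative names, not part of any real API.) */
struct msg_record {
    bool recorded_sent;     /* sender's local state logged the send */
    bool recorded_received; /* receiver's local state logged the receive */
};

/* A global state is consistent iff every message recorded as
 * received is also recorded as sent (cf. Figure 1); a message
 * sent but not yet received is merely in flight. */
static bool is_consistent(const struct msg_record *msgs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (msgs[i].recorded_received && !msgs[i].recorded_sent)
            return false;
    return true;
}
```

Under this test, the situation of Figure 1(a) (sent, not yet received) passes, while that of Figure 1(b) (received, never sent) fails.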

2.1.1 Uncoordinated Checkpointing

In the uncoordinated approach, the processes determine their local checkpoints independently. During restart, these processes search the set of saved checkpoints for a consistent state from which execution can resume. The main advantage of this autonomy is that each process can take a checkpoint when it is most convenient. For efficiency, a process may take checkpoints when the amount of state information to be saved is small [39].

Figure 1: A message-passing system consisting of two processes. (a) shows an example of a consistent global state, in which a message is recorded as having been sent but not yet received; (b) shows an example of an inconsistent global state, in which a message is recorded as having been received but not yet sent.

However, this approach has several disadvantages. First, there is the possibility of the domino effect [26], which causes the system to roll back to the beginning of the computation, resulting in the loss of a large amount of useful work. Second, a process may take checkpoints that will never be part of a global consistent state. Third, uncoordinated checkpointing forces each process to maintain multiple checkpoints, thereby incurring a large storage overhead.

2.1.2 Coordinated Checkpointing

With the coordinated approach, the determination of local checkpoints by individual processes is orchestrated in such a way that the resulting global checkpoint is guaranteed to be consistent [4, 19, 35, 38]. Coordinated checkpointing simplifies recovery from failure and is not susceptible to the domino effect, since every process always restarts from its most recent checkpoint. Also, coordinated checkpointing minimizes storage overhead, since only one permanent
checkpoint needs to be maintained on stable storage. The main disadvantage of coordinated checkpointing, however, is the large latency involved in saving the checkpoints, since a global checkpoint needs to be determined before the checkpoints can be written to stable storage.

2.1.3 Communication-Induced Checkpointing

The communication-induced checkpointing approach forces each process to take checkpoints based on protocol-related information piggybacked on the application messages it receives from other processes [27]. Checkpoints are taken such that a system-wide consistent state always exists on stable storage, thereby avoiding the domino effect [2]. Processes are allowed to take some of their checkpoints independently. However, in order to determine a consistent global state, processes may be forced to take additional checkpoints. The checkpoints that a process takes independently are called local checkpoints, while those that a process is forced to take are called forced checkpoints. The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint. The forced checkpoint must be taken before the
application may process the contents of the message, possibly incurring high latency and overhead. In contrast with coordinated checkpointing, no special coordination messages are exchanged in this approach.

2.2 Other Uses of Checkpoint/Restart

The ability to checkpoint and restore applications has a number of uses in a parallel environment besides fault tolerance. Gang scheduling (checkpointing and restarting all the processes that are part of a single parallel application) allows for more flexible scheduling. For example, jobs with large resource requirements can be intermittently scheduled at off-peak times using the checkpoint/restart capability. Without intermittent scheduling, such large jobs may use all available resources for long periods, locking out other jobs during that time. Hence, the ability to stop and resume large jobs allows scheduling of other available jobs in such a way that the overall system throughput is maximized.

Process migration is another feature that is made possible by the ability to save a process image. If a process needs to be moved from one node to another (because imminent failure of a node is predicted, or for scheduling reasons), it is possible to transfer the state of the processes running on that node to another node by writing the process image directly to the remote node. The process can then resume execution on this new node, without having to kill the entire application and start it all over again. Process migration has also proved extremely valuable for systems whose network topology constrains the placement of processes in order to achieve optimal performance. The Cray T3E interconnect, for instance, uses a three-dimensional torus that requires processes that are part of the same parallel application to be placed in contiguous locations on the
torus. This results in fragmentation as jobs of different sizes enter and exit the system. With process migration, jobs can be packed together to eliminate fragmentation, resulting in significantly higher utilization [40]. Networks with such constraining topologies have become less common recently; however, the IBM Blue Gene/L project plans to constrain communication among processors [36], and more cluster projects may use them in the future.

2.3 Related Work

Checkpoint/restart for sequential programs has been somewhat well studied. Libckpt [25] is an open source library for transparent checkpointing of Unix processes. It contains support for incremental checkpoints, in which only pages that have been modified since the last checkpoint are saved. Condor [22, 23] is another system that provides checkpointing services for single process jobs on a number of Unix platforms. The CRAK (Checkpoint/Restart As a Kernel module) project [41] provides a kernel implementation of checkpoint/restart for Linux. CRAK also supports migration of networked processes by adopting a novel approach to socket migration. BLCR (Berkeley Lab Checkpoint/Restart) [8] is a kernel implementation of checkpoint/restart for multi-threaded applications on Linux. Libtckpt [7] is a user-level checkpoint/restart library that can also checkpoint POSIX threads applications.

In the context of parallel programs, there are vendor implementations of checkpoint/restart for MPI applications running on some commercial parallel computers [6]. Some implementations are also available for checkpointing MPI applications running on commodity hardware. CoCheck [34] is one such tool for PVM and MPI applications. It is built into a native MPI library called tuMPI and layered on top of a portable single-process
checkpointing mechanism [13, 21]. CoCheck uses a special process to coordinate checkpoints, which sends a checkpoint request notification to all the processes belonging to the MPI job. On receiving this trigger, each process sends a ready message (RM) to all other processes, and stores all incoming messages from each process in specially reserved buffers until all the RMs have been received. The underlying checkpointer then saves the execution context of each process to stable storage. At restart, a receive operation first checks the buffers for a matching message. If there is such a message, it is retrieved from the buffer. Otherwise, a real receive operation fetches the next matching message from the network. One drawback to CoCheck is that a checkpoint request cannot be processed when a send operation is in progress. Consequently, if a matching receive has not been posted by the peer, there is no finite bound on the time taken for the checkpoint request to complete. Also, checkpointing could change the semantics of MPI synchronous sends in CoCheck: an anticipated receive could cause the return of the send instead of the actual receive by the application.

A checkpoint/restart implementation for MPI at NCCU Taiwan uses a combination of coordinated and uncoordinated strategies for checkpointing MPI applications [20]. It is built on top of the NCCU MPI implementation [5], and
uses Libckpt as the back-end checkpointer. Checkpointing of processes running on the same node is coordinated by a local daemon process, while processes on different nodes are checkpointed in an uncoordinated manner using message logging.

A limitation of the existing systems for checkpointing MPI applications on commodity clusters is that they are implemented using MPI libraries that primarily serve as research platforms and are not widely used. Another drawback of some of these checkpoint/restart systems is that they are tightly coupled to a specific single-process checkpointer. Since single-process checkpointers usually support a limited number of platforms, this limits the range of systems on which MPI applications can be checkpointed to those that are supported by the underlying checkpointer.

3 Design

This section presents an overview of the design of the checkpoint/restart system in LAM/MPI. This implementation does not alter the semantics of any of the MPI functions, and fully supports all of MPI-1. The checkpoint/restart system has been designed in such a way that there is a clear separation between the checkpoint/restart functionality and MPI-specific functionality in LAM. Also, the checkpoint/restart system can plug in multiple back-end checkpointers with minimal changes to the main LAM/MPI code base, as a result of which there is a wide range of platforms that can potentially be supported by our system. The current implementation in LAM/MPI uses the BLCR [8] checkpointer that is available for Linux.

3.1 Checkpointing Approach in LAM/MPI
A checkpoint of an MPI job is initiated by a user or a batch scheduler by delivering a checkpoint request to mpirun. The precise mechanism for delivering this request is implementation-dependent. On receiving this request, mpirun propagates the request to all the processes in the MPI job.

LAM/MPI uses a coordinated approach to checkpointing MPI jobs. The current implementation in LAM supports a TCP-based communication sub-system (see Sections 3.2.1 and 4). Upon receiving the checkpoint request from mpirun, all the MPI processes interact with each other to guarantee that their local checkpoints will result in a consistent global checkpoint. In [4], a consistent global state is described as the set of process states and the states of their communication channels. The approach adopted in LAM ensures that all the MPI communication channels between the processes are empty when a checkpoint is taken. During restart, all the processes resume execution from their saved states, with the communication channels restored to their known (empty) states.

    void bookmark_exchange(void)
    {
        int i, j;
        struct bookmark bookmarks_arr[num_procs];

        for (i = (num_procs + myidx + 1) % num_procs, j = 0;
             j < num_procs - 1;
             i = (i + 1) % num_procs, ++j) {
            if (myidx > i) {
                /* send our bookmark status, then receive into
                   appropriate location in bookmarks array */
                send_bookmarks(i);
                recv_bookmarks(i, bookmarks_arr);
            } else if (myidx < i) {
                /* receive remote bookmark status into appropriate
                   location in bookmarks array, then send */
                recv_bookmarks(i, bookmarks_arr);
                send_bookmarks(i);
            }
        }
    }

Figure 2: Staggered all-to-all algorithm used for communicating network status.

The interaction between the processes to clear the data in the MPI communication channels uses a staggered all-to-all algorithm over out-of-band communication channels that are available in LAM, as shown in Figure 2. This algorithm
starts with each process choosing a unique peer to exchange information about how much data it has sent to and received from that peer. This exchange then continues with other peers in increasing order of ranks in a circular fashion until each process has exchanged this information with its immediate lower-ranked peer. Then, based on this information, each process receives the remaining data from the MPI communication channels, and all the in-flight data are drained.

The LAM checkpoint algorithm is summarized below. mpirun acts as a coordination point between all processes of an MPI application, and is the process signaled by the run-time system or user when a checkpoint is to be initiated.

1. mpirun: receives a checkpoint request from the user or batch scheduler.

2. mpirun: propagates the checkpoint request to each MPI process.

3. mpirun: indicates that it is ready to be checkpointed.

4. each MPI process: coordinates with the others to reach a consistent global state in which the MPI job can be checkpointed. For example, processes using TCP for MPI message passing drain in-flight messages from the network to achieve a consistent global state.

5. each MPI process: indicates that
it is ready to be individually checkpointed.

6. underlying checkpointer: saves the execution context of each process to stable storage.

7. each MPI process: continues execution after the checkpoint is taken.

The following sequence of events occurs at restart:

1. mpirun: restarts all the processes from the saved process images.

2. each MPI process: sends its new process information to mpirun.

3. mpirun: updates the global list containing information about each process in the MPI job and broadcasts it to all processes.

4. each MPI process: receives information about all the other processes from mpirun.

5. each MPI process: re-builds its communication channels with the other processes.

6. each MPI process: resumes execution from the saved state.

This algorithm has been successfully implemented using the BLCR [8] checkpointer. The details of the implementation are given in Section 4.

3.2 LAM/MPI Architecture

LAM/MPI is designed with two major layers: the LAM layer and the MPI layer, as shown in Figure 3. The LAM layer provides a framework and run-time environment upon which the MPI layer executes. The LAM layer provides services such as message passing, process control, remote file access, and I/O forwarding. The MPI layer provides the MPI interface and an infrastructure for direct, process-to-process communication over high-speed networks.

LAM provides a daemon-based run-time environment (RTE). A user-level daemon (the lamd) is used to provide many of the services needed for the MPI RTE. The lamboot command is used to start a lamd on every node at the beginning of an execution. At the end of an execution session, these lamds are killed using the lamhalt command.

The lamds provide process control for all MPI jobs executed under LAM/MPI. mpirun launches an MPI
application by sending a request to the appropriate daemons, which in turn fork()/exec() the application. When an application terminates, the daemons are notified through the standard Unix SIGCHLD mechanisms, and they relay this information back to mpirun. The LAM daemons also provide message-passing services over UDP channels.

Figure 3: The layered design of LAM/MPI.

Figure 4: A two-way MPI job on two nodes.

The MPI library consists of two layers. The upper layer is portable and independent of the communication sub-system (i.e., MPI function calls and accounting utility functions). The lower layer consists of a modular framework for components called SSI (see Section 3.2.1). One such component type is the MPI Request Progression Interface (RPI), which provides device-dependent point-to-point message-passing between the MPI peer processes. LAM/MPI includes RPIs that implement message-passing using TCP, shared memory, gm (the low-level message-passing system for Myrinet networks), and the message-passing service provided by the lamds. Figure 4 shows the LAM/MPI RTE for a two-way MPI job running on two nodes and using the TCP RPI.

3.2.1 System Services Interface

LAM/MPI has recently been redesigned to provide a component framework for the various services provided by the LAM infrastructure. This framework, the System Services Interface (SSI), is composed of a number of component types, each of which provides a single service to the LAM RTE or MPI implementation [32]. Each SSI type can have one or more run-time selectable
instances available. Component instances are implemented as plug-in modules, and are chosen at run-time, either automatically by the SSI infrastructure or manually by the user, allowing a particular version of LAM/MPI to support multiple underlying infrastructures.

Figure 5: The LAM SSI component architecture has multiple different component types. At run-time, module instances will be chosen from each component type.

Currently there are SSI interfaces for launching the LAM RTE, the MPI device-dependent point-to-point communication layer, MPI collective communication algorithms, and checkpoint/restart of MPI applications. Figure 5 shows the SSI framework, and how an MPI application can choose between modules of each component type at run-time.

The two component types shown in Figure 5 are the Request Progression Interface (RPI) and checkpoint/restart (CR). The RPI component type is responsible for all MPI point-to-point communications. The CR component type is the sole interface to the back-end checkpointing system to actually perform checkpoint and restart functionality.

Although LAM has multiple RPI modules available for
selection at run-time, there is currently only one CR module available: blcr, which utilizes the BLCR single-node checkpointer (see Section 3.3). The design and implementation of the CR SSI and the blcr module were the main focuses of this work.

For an MPI job to be checkpointable, it must have a valid CR module, and each of the other SSI modules that it has chosen at run-time must support some abstract checkpoint/restart functionality. The internal SSI checkpoint/restart interfaces were carefully designed to preserve strict abstraction barriers between the CR SSI and the other SSI modules.
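The abstraction barrier is essentially a table of entry points: the MPI layer calls through the table and never sees the back-end checkpointer directly, so a new checkpointer only has to supply a new table. A minimal sketch (struct and function names are ours for illustration; the actual CR SSI API is the one listed in Section 3.2.2):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shape of a CR SSI module: one function pointer
 * per abstract action the MPI layer may request. */
struct cr_module {
    const char *name;       /* e.g. "blcr" */
    bool (*init)(void);     /* attach to checkpointer, register callbacks */
    void (*finalize)(void); /* detach from the checkpointer */
};

/* The MPI layer holds only a pointer to the selected module. */
static bool cr_select_and_init(const struct cr_module *m)
{
    return (m != NULL && m->init != NULL) ? m->init() : false;
}

/* A stand-in back end: plugging in a different checkpointer
 * means providing a different table; callers do not change. */
static bool stub_init(void) { return true; }
static const struct cr_module stub_module = { "stub", stub_init, NULL };
```

With this shape, the RPI and other component types keep calling the same entry points regardless of which back end was selected at run-time.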

Hence, the strict separation of back-end checkpointing services and communication allows new back-end checkpointing systems to be plugged in simply by providing a new CR SSI module; the existing RPI modules (and other SSI component types) will be able to utilize its services with no modifications.

3.2.2 The CR SSI

At the start of execution of an MPI job, the SSI framework chooses the set of modules from each SSI component type that will be used. In the case of the CR SSI, it determines whether checkpoint/restart support was requested, and if so, a CR module is selected to run (in this case, it is blcr, since it is the only module available). All modules in the CR SSI provide a common set of APIs to be used by the MPI layer, and another set of APIs that can be used by mpirun. The detailed design of the CR SSI component type is described in [28]. Broadly, these APIs provide the following functionality:

initialize: used by the MPI layer to attach to the underlying checkpointer and register callback(s) that will be invoked at checkpoint.

suspend: used by the MPI application thread to suspend execution when it is interrupted by the callback thread (see Section 4.1).

disable checkpoint: used by mpirun to enter a critical section during which it cannot be interrupted by a checkpoint request.

enable checkpoint: used by mpirun to exit a critical section and allow incoming checkpoint requests.

finalize: used by the MPI layer to perform cleanup actions and detach from the underlying checkpointer.

Most of the work in the CR SSI is done in a separate thread, to allow preparation for a checkpoint to happen asynchronously without blocking the execution of the main application thread. In the blcr module, this thread is created by the BLCR checkpointer when a callback is registered during the
module initialize action. However, the design of the CR SSI type does not require the underlying checkpointer to provide this thread. If a checkpointer does not implicitly provide a separate thread for callbacks, the module itself can create this extra thread during initialize and block its execution until a checkpoint request arrives. This design strategy serves to reduce the requirements imposed on the underlying checkpointing systems, thereby potentially increasing the range of checkpointers that can be supported.

3.2.3 The RPI SSI

To support checkpointing, an RPI module must have the ability to generically prepare for a checkpoint, continue after a checkpoint, and restore from a checkpoint. A checkpointable RPI module must therefore provide API functions to perform this functionality. The following functions will be invoked from the thread-based callback in the CR SSI:

checkpoint: invoked when a checkpoint request comes in, usually to consume any in-flight messages.

continue: invoked to perform any operations that might be required when a process continues execution after a checkpoint is taken.

restart: invoked to re-establish connections and any other operations
that might be required when a process restarts execution from a saved state.

Note that these functions are independent of which back-end checkpointing system is used; for example, the actions required for the TCP RPI to checkpoint, continue, and restart are the same regardless of which CR SSI module is selected. The detailed design of the RPI SSI is described in [31].

3.3 The BLCR Checkpointer

The Berkeley Lab Linux Checkpoint/Restart project (BLCR) [8] is a robust, kernel-level checkpoint/restart implementation. It can be used either as a stand-alone system for checkpointing applications on a single node, or by a scheduling system or parallel communication library for checkpointing and restarting parallel jobs running on multiple nodes. BLCR is implemented as a Linux kernel module (for recent 2.4 versions of the kernel, such as 2.4.18) and a user-level library. A kernel module implementation has the benefit that it allows BLCR to be easily deployed by new users without requiring them to patch, recompile, and reboot their kernel. While the current implementation of BLCR only supports checkpointing of single processes (including multi-threaded processes), checkpointing of process groups, sessions, and a full range of Unix tools will be supported in the future.

BLCR provides a simple user-level interface to libraries/applications that need to interact with checkpoint/restart. It provides a mechanism to register user-level callback functions that are triggered whenever a checkpoint occurs, and that continue when the process restarts (or a periodic checkpoint for backup purposes completes). Two kinds of callbacks can be registered: signal-based callbacks that execute in signal-handler context, and thread-based callbacks that execute in a separate thread. These callbacks allow the
application to shut down its network activity (and perform analogous actions on some other uncheckpointable resource) before a checkpoint is taken, and restore them later. Callbacks are designed to be written as shown in Figure 6.

BLCR also provides user-level code with critical sections to allow groups of instructions to be performed atomically with respect to checkpoints. This allows applications to ensure that special cases such as network initialization are not interrupted by a checkpoint. In some cases, such atomicity is not merely a matter of convenience but is vital for correct program operation.

    void callback(void *data_ptr)
    {
        struct my_data *pdata = (struct my_data *) data_ptr;
        int did_restart;

        /* Do checkpoint-time shutdown logic */

        /* Tell the system to take the checkpoint */
        did_restart = cr_checkpoint();

        if (did_restart) {
            /* Actions to restart from a checkpoint */
        } else {
            /* Actions to continue after a checkpoint */
        }
    }

Figure 6: Template for signal-based and thread-based callback functions. The state of the entire process (including the callback execution) is saved in the cr_checkpoint call, and restored at restart or after the checkpoint is complete.

4 Implementation Details

The checkpoint/restart implementation in LAM/MPI relies on the availability of the message-passing service provided by the LAM layer. This service is used for out-of-band signaling and communication between the processes during checkpoint and restart. Although they play an important role during checkpoint, the lamds are not a logical part of an MPI application, and are themselves not checkpointed. The design of this system also presupposes the availability of a threads package on the target platform. Currently, support for checkpoint/restart has been implemented only for a modified version of the TCP RPI. However, this
functionality will soon be xtended to include all the RPIs. This section describes the details of the implementation in the conte xt of the sequence of steps that occur in the system during checkpoint, upon continuing from checkpoint, and when restarting from sa ed conte xt. 4.1 Checkpoint Since mpirun is the startup coordination point for MPI processes, it as the natural choice to serv as the entry point for checkpoint request to be sent to LAM/MPI job At the start of ecution, mpirun in ok es the initialization function of the blcr checkpoint/restart SSI module to re g- ister both

thread-based and signal-based callback functions with BLCR. The thread-based callback is required to propagate the checkpoint requests to the MPI processes. This cannot be done in signal context because the propagation of the checkpoint request uses some non-reentrant library calls, and the use of non-reentrant functions from signal context can cause deadlocks. When a checkpoint request is sent by a user or batch scheduler (by invoking the BLCR utility cr_checkpoint
with the process ID of mpirun), it triggers the callbacks to start executing. The thread-based callback computes

the names under which the images of each MPI process will be stored on disk, and saves the process topology of the MPI job (called an application schema in LAM) in mpirun's address space, to be used for restoring the applications at restart. It then signals all the MPI processes about the pending checkpoint request by instructing the relevant lamd to invoke cr_checkpoint for every process that is part of the MPI job. Once this is done, the callback thread indicates that mpirun is ready to be checkpointed. In the MPI library, MPI_INIT has been modified to invoke the initialization

function of the blcr checkpoint/restart SSI module; this function registers thread-based and signal-based callbacks with BLCR that will be executed when a checkpoint request arrives. To avoid race conditions, the current implementation specifies that it is not possible to checkpoint an MPI job in which one of the processes has already completed executing MPI_FINALIZE. To prevent this situation from occurring, a barrier synchronization has been introduced in MPI_FINALIZE.

When a checkpoint request is received by an MPI process from mpirun, the threaded callback in the blcr module starts executing. The use of a threaded callback here allows the application to continue running even while the thread-based callback executes. Another reason for using a threaded callback is the non-reentrancy issue mentioned above. Consequently, we have to explicitly synchronize these threads so that the application thread does not execute an MPI call while the callback thread is quiescing the network. Synchronization of threads is already done in LAM/MPI when the thread level is MPI_THREAD_SERIALIZED, effectively preventing multiple threads from making MPI calls simultaneously. This is accomplished by

placing a mutex at the entry and exit points of all MPI library calls. This same mechanism is reused in the checkpoint/restart implementation to prevent the application thread from calling into the MPI library while the callback thread is performing checkpoint or restart functions, and vice versa. Hence, all MPI applications that request checkpoint/restart support are assigned a thread level of at least MPI_THREAD_SERIALIZED.

At checkpoint time, the callback thread in each process waits for the application thread to exit its current MPI call (if any), and then instructs the RPI to prepare itself for checkpoint. It is possible, however, that the application thread could be blocking on an MPI operation whose corresponding peer operation has not been posted. To handle this case, the callback thread of that process signals the application thread to interrupt its blocking behavior. At this point, the application thread realizes that it has been interrupted by the callback thread and yields control to it by releasing the mutex. The callback thread can then trigger the RPI to quiesce the network and perform any other operations that are required to prepare the process to be checkpointed. At restart time, the interrupted MPI call is automatically resumed without the user being aware of the interruption.

    Figure 7: Sequence of events when the application thread is executing
    outside the MPI library when a checkpoint request arrives.

    CR thread:  sleep -> wakeup -> acquire mutex -> prepare RPI for
                checkpoint -> checkpoint -> RPI continue/restart ->
                release lock -> sleep
    App thread: execute outside MPI library -> call MPI function, block
                on mutex -> ... -> acquire lock, execute MPI call

Figures 7 and 8 depict the synchronization that is enforced between the application and callback

threads. In order to drain the in-flight data on the network, each process needs to know how much data has been sent across a TCP socket by its peer. This is accomplished by having each MPI process keep a bookmark for each of its peers. A bookmark is a pair of integers containing the number of bytes the process has sent to and received from a given peer. At checkpoint time, the callback threads in each process exchange their "sent" bookmarks with each of their peers using the LAM out-of-band channel (see Figure 2). If the "sent" bookmark received from a peer does not match the "received" bookmark that the process

has for that peer, then there must be some messages on the network that have not yet been received. If this is the case, the callback threads call the RPI modules to progress the receives in their internal message-passing state machines and consume data from the TCP sockets until each "received" bookmark matches its corresponding "sent" bookmark. The RPI state machine executes the normal progression of MPI receive requests by matching the posted receives with incoming messages, and by creating unexpected-message buffers for unmatched incoming messages. For example, if a process had posted an MPI receive

before a checkpoint and the message arrives after the quiesce process begins, it will be received into the actual destination buffer when the RPI drains the network. Hence, no secondary buffers or rollback mechanisms need to be utilized [10].
At this time, MPI send requests are prevented from making progress so that no more messages are sent. When all the bookmarks match, the RPI has drained all the in-flight data on the network, and the callback thread in each process indicates that the process is ready to be checkpointed. The underlying checkpointer then writes the process image to stable storage.

    Figure 8: Sequence of events when the application thread is executing
    a blocking system call inside the MPI library when a checkpoint
    request arrives.

    CR thread:  sleep -> wakeup -> try to acquire mutex, fail -> signal
                app thread -> acquire mutex -> prepare for checkpoint ->
                checkpoint -> continue/restart -> release lock -> sleep
    App thread: call MPI function, acquire mutex -> execute blocking
                system call in MPI library -> system call interrupted,
                release mutex -> block on mutex -> ... -> acquire lock,
                resume MPI function

and the draining of in-flight data for tw o-process MPI job 4.2 Continue After checkpoints are tak en, the MPI processes are al- lo wed to continue ecution. At checkpoint time, the TCP sock ets are not closed so the MPI processes need not per form additional ork to re-establish connections or re- ne gotiate per -job parameters when the continue from checkpoint. The MPI library is unlock ed and control is sim- ply returned to the application thread and processing contin- ues as if nothing happened. 4.3 Restart When checkpointed MPI job is restarted by in ok- ing the BLCR utility cr

restart with the name of mpirun sa ed process conte xt, the signal-based callback function exec() ne mpirun mpirun restarts all the MPI processes from the application schema that as sa ed at checkpoint-time, with the same process-topology as before checkpointing. signal-based callback is re- quired here because in oking exec() from another thread app (B) lamd Node 0 app (A) lamd Node 1 "sent" bookmarks app (B) app (A) pending MPI messages Node 0 Node 1 (1) (2) Figure 9: Clearing the communication channels before checkpoint. (1) processes and xchange the sent bookmarks that the ha for each

other using the out-of- band channel. (2) processes and recei data from the in-band channel until their recei ed bookmarks match the sent bookmarks sent by the peer in (1). ould result in changed process ID on current Linux er nels (v ersion 2.4 or earlier). When the MPI processes resume ecution, the thread- based callbacks still ha the MPI library lock ed, with the application threads either block ed at the entry point to an MPI function, safely interrupted in their MPI func- tion calls, or running entirely outside the MPI library The checkpoint/restart implementation in LAM/MPI does not

rely on the existence of support for transparent migration of sockets in the back-end checkpointer, both for performance reasons and to minimize the requirements on the underlying system. Hence, the threaded callback re-establishes new TCP sockets with each of its MPI peers. Once these connections have been re-established, the MPI library is unlocked, the callback thread completes execution, and the application thread continues.

5 Communication Performance

Experiments were conducted to measure the communication and computation performance of the checkpoint/restart system in LAM/MPI 7.0 using NetPIPE (A Network Protocol Independent Performance Evaluator) [29] and the NAS Parallel Benchmarks [11] on a Linux cluster consisting of 208 2.4 GHz Xeon processors with a Fast Ethernet interconnect.

NetPIPE is a program that performs ping-pong tests, bouncing messages of increasing size between two processes across a network in order to measure communication performance. The NAS Parallel Benchmarks are a suite of application kernels that test several different computational and communication patterns in parallel environments.

[Figure 10: Performance comparison of raw TCP, the plain TCP RPI (MPI_THREAD_SINGLE and MPI_THREAD_SERIALIZED), and the checkpoint/restart-enabled TCP RPI (MPI_THREAD_SERIALIZED), using NetPIPE. Axes: block size (bytes) versus bandwidth (Mbps).]

Experiments were conducted to measure the communications overhead of adding checkpoint/restart capability to LAM/MPI. First, the drop in performance caused by the addition of checkpoint/restart support to the TCP RPI was measured. NetPIPE was used to compare the throughput of the plain TCP RPI with that of

the TCP RPI with checkpoint/restart. The graph of throughput versus block size is shown in Figure 10. The percentage of bandwidth loss in the checkpoint/restart-enabled TCP RPI as compared to the plain TCP RPI is shown in Figure 11. There are three reasons for the drop in performance of the TCP RPI with the addition of checkpoint/restart support. First, there is a fast mode of communication in the RPI layer: in certain cases, when the MPI request queues are empty, LAM bypasses the entire RPI state machine and directly uses sends and receives for performance reasons. The current

implementation of the checkpoint/restart-enabled TCP RPI does not support this fast mode of communication, and based on running tests with the fast mode disabled in the TCP RPI, it has been determined that this accounts for part of the deterioration in performance seen in the graphs (see Figure 11). Second, when an MPI job requests checkpoint/restart support, the thread level is automatically upgraded to MPI_THREAD_SERIALIZED. In this situation, LAM uses mutexes to synchronize the threads, and this leads to additional overhead due to the lock/unlock operations that need to be

performed every time an MPI call is made. A third reason for the degradation in performance is the additional bookkeeping that is done in the RPI layer to support checkpoint/restart. Since the checkpoint/restart functionality adds a constant overhead to the MPI layer, the performance drop is largest for small messages; for messages larger than 1 KB, the performance degradation is less than 0.5 percent.

[Figure 11: Performance degradation of the checkpoint/restart-enabled TCP RPI (MPI_THREAD_SERIALIZED) and the plain TCP RPI (MPI_THREAD_SERIALIZED) without fast mode, both relative to the plain TCP RPI (MPI_THREAD_SERIALIZED) with fast mode. Axes: block size (bytes) versus percentage drop in bandwidth.]

To assess the impact of LAM's checkpoint/restart infrastructure on the computational performance of parallel applications, the entire suite of NAS Parallel Benchmarks (problem size class A) was run on four nodes, both with and without checkpoint/restart support. There was no discernible difference in the wall-clock execution time of any of the benchmark

applications.

6 Future Work

Future work on this project is planned in several directions. Our first priority is to implement the fast mode of communication in the modified TCP RPI and to extend checkpoint/restart support to all the remaining RPIs, so that it will be possible to checkpoint/restart all MPI jobs running in LAM/MPI. The next step will be to extend the implementation to include MPI-2 functionality. Later, we plan to look into the possibility of building checkpoint/restart SSI modules on top of other back-end checkpointing systems, possibly including

Condor [23], Libckpt [25], and CRAK [41], to extend our implementation to multiple platforms. All of these efforts will be complemented with extensive performance testing and tuning to understand and identify run-time bottlenecks. Another possibility for future work in this project is full support for process migration. Our current implementation lets us restore an entire checkpointed job on a different set of nodes in some cases, but
Page 11
it does not permit us to migrate a subset of the processes while the others are still running. While support for real-time migration would be

contingent upon the underlying system's ability to do this, additional work also needs to be done in the MPI library itself to make this possible. Finally, a long-term goal is to investigate the implementation of an uncoordinated approach to checkpointing MPI jobs in LAM/MPI.

7 Conclusions

This paper presented a checkpoint/restart implementation for MPI jobs that has been implemented in LAM/MPI using BLCR [8] as the underlying checkpointer. This implementation adopts a coordinated approach to checkpointing jobs. The performance of this system was tested to measure the overhead of adding checkpoint/restart

functionality and the time to checkpoint MPI jobs. Experiments have shown that the drop in performance caused by the introduction of additional functionality in the MPI layer and the communication subsystem is negligible, and that the time to checkpoint jobs increases linearly with the number of processes. The checkpoint/restart system and all other modifications to the LAM infrastructure that grew out of this project are currently available in the LAM CVS tree. Anonymous read-only access is available to users who wish to utilize the latest features in LAM/MPI. The checkpoint/restart functionality is also scheduled to be included in the upcoming LAM/MPI 7.0 release. More information on the project can be found on the web.

Acknowledgments

This work was supported by a grant from the Lilly Endowment, by National Science Foundation grant 0116050, and by the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. Brian Barrett was supported by a Department of Energy High Performance Computer Science fellowship.

References

[1] Top500 supercomputer list, November 2002. http://www.top500.org/.
[2] D. Briatico, A. Ciuffoletti, and L. Simoncini. A

distributed domino-effect-free recovery algorithm. In Proceedings of the Fourth International Symposium on Reliability in Distributed Software and Databases, pages 207-215, 1984.
[3] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In J. Ross, editor, Proceedings of Supercomputing Symposium '94, pages 379-386. University of Toronto, 1994.
[4] K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computing Systems, 3(1):63-75, 1985.
[5] Z. Chang, K. S. Ding, and J. J. Tsay. Efficient Implementation of the Message Passing Interface on Local Area Networks, 1996.
[6] Y. Chen, K. Li, and J. S. Plank. CLIP: A checkpointing tool for message-passing parallel programs. In SC97: High Performance Networking and Computing: Proceedings of the 1997 ACM/IEEE SC97 Conference, November 15-21, 1997, San Jose, California, USA. ACM Press and IEEE Computer Society Press, 1997.
[7] W. R. Dieter and J. E. Lumpp, Jr. A user-level checkpointing library for POSIX threads programs. In Symposium on Fault-Tolerant

Computing, pages 224-227, 1999.
[8] J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart, 2002.
[9] E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 39-47, Oct. 1992.
[10] M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, 1996.
[11] D. H. Bailey et al.

The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, Moffett Field, CA, 1994.
[12] A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, W. Saphir, A. Skjellum, and M. Snir. MPI-2: Extending the Message-Passing Interface. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par '96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 128-135. Springer-Verlag, 1996.
[13] Genias Software GmbH. CODINE User's Guide, 1993.
[14] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir. MPI: The Complete Reference:

Volume 2, The MPI-2 Extensions. MIT Press, 1998.
[15] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, Sept. 1996.
[16] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1994.
[17] W. Gropp, E. Lusk, and R. Thakur. Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, 1999.
[18] W. D. Gropp and E. Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer

Science Division, Argonne National Laboratory, 1996. ANL-96/6.
[19] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. Technical Report TR85-706, Cornell University Computer Science Department, 1985.
[20] W.-J. Li and J.-J. Tsay. Checkpointing Message-Passing Interface (MPI) Parallel Programs. In Proceedings of the Pacific Rim International Symposium on Fault-Tolerant Systems, 1997.
[21] M. Litzkow, M. Livny, and M. Mutka. Condor: A Hunter of Idle Workstations. In Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111,

1988.
[22] M. Litzkow and M. Solomon. The Evolution of Condor Checkpointing, 1998.
[23] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report CS-TR-1997-1346, University of Wisconsin, Madison, Apr. 1997.
[24] Message Passing Interface Forum. MPI: A Message Passing Interface. In Proc. of Supercomputing '93, pages 878-883. IEEE Computer Society Press, November 1993.
[25] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent Checkpointing under Unix. In Proceedings of the 1995 Winter USENIX Technical Conference, 1995.
[26] B. Randell. Systems structure for software fault tolerance. IEEE Transactions on Software Engineering, 1(2):220-232, 1975.
[27] D. L. Russell. State restoration in systems of communicating processes. IEEE Transactions on Software Engineering, 6(2):183-194, Mar. 1980.
[28] S. Sankaran, J. M. Squyres, B. Barrett, and A. Lumsdaine. Checkpoint-restart support system services interface (SSI) modules for LAM/MPI. Technical Report TR578, Indiana University Computer Science Department, 2003.
[29] Q. O. Snell, A. R. Mikler, and J. L. Gustafson. NetPIPE: A Network Protocol

Independent Performance Evaluator, 1996.
[30] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, MA, 1996.
[31] J. M. Squyres, B. Barrett, and A. Lumsdaine. Request progression interface (RPI) system services interface (SSI) modules for LAM/MPI. Technical Report TR579, Indiana University Computer Science Department, 2003.
[32] J. M. Squyres, B. Barrett, and A. Lumsdaine. The system services interface (SSI) to LAM/MPI. Technical Report TR575, Indiana University Computer Science Department, 2003.
[33] J. M. Squyres and A. Lumsdaine. A

Component Architecture for LAM/MPI. In Proceedings, Euro PVM/MPI, October 2003.
[34] G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the 10th International Parallel Processing Symposium, Honolulu, HI, 1996.
[35] Y. Tamir and C. H. Sequin. Error recovery in multicomputers using global checkpoints. In Proceedings of the 1984 International Conference on Parallel Processing, pages 32-41, Bellaire, Michigan, Aug. 1984. IEEE.
[36] The BlueGene/L Team. An Overview of the BlueGene/L Supercomputer, 2002.
[37] The LAM Team. Getting Started with LAM/MPI. University of

Notre Dame, Department of Computer Science, 1998.
[38] Z. Tong, R. Y. Kain, and W. T. Tsai. Rollback recovery in distributed systems using loosely synchronized clocks. IEEE Transactions on Parallel and Distributed Systems, 3(2):246-251, 1992.
[39] Y.-M. Wang, P.-Y. Chung, I.-J. Lin, and W. K. Fuchs. Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 6(5):546-554, 1995.
[40] A. Wong, L. Oliker, W. Kramer, T. Kaltz, and D. Bailey. System Utilization Benchmark on the Cray T3E and IBM SP, April 2000.
[41] H.

Zhong and J. Nieh. CRAK: Linux checkpoint restart as a kernel module. Technical Report CUCS-014-01, Department of Computer Science, Columbia University, 2001.