[en] Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the large number of unreliable components involved in an execution. We present our approach and preliminary results on a new checkpoint/rollback protocol based on a coordinated scheme. The application is described by a dataflow graph, which is an abstract representation of the execution. Thanks to this representation, fault recovery in our protocol requires only a partial restart of the other processes. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.
Disciplines :
Computer science
Author, co-author :
BESSERON, Xavier ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Engineering Research Unit ; Laboratoire d'Informatique de Grenoble > MOAIS project
Gautier, Thierry; Laboratoire d'Informatique de Grenoble > MOAIS project
Language :
English
Title :
Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model
Publication date :
22 January 2009
Event name :
Workshop APRETAF : Algorithmes Parallèles, Répartis Et Tolérance Aux Fautes