[en] Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. We present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. The application is described using a dataflow graph, which is an abstract representation of the execution. Thanks to this representation, fault recovery in our protocol requires only a partial restart of other processes. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.
Disciplines:
Computer science
Author, co-author:
BESSERON, Xavier; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Engineering Research Unit; Laboratoire d'Informatique de Grenoble > MOAIS project
Gautier, Thierry; Laboratoire d'Informatique de Grenoble > MOAIS project
Document language:
English
Title:
Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model
Publication/release date:
22 January 2009
Event name:
Workshop APRETAF : Algorithmes Parallèles, Répartis Et Tolérance Aux Fautes