Reference : Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model
Scientific presentation in universities or research centers
Engineering, computing & technology : Computer science
Besseron, Xavier [University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Engineering Research Unit; Laboratoire d'Informatique de Grenoble > MOAIS project]
Gautier, Thierry [Laboratoire d'Informatique de Grenoble > MOAIS project]
Workshop APRETAF : Algorithmes Parallèles, Répartis Et Tolérance Aux Fautes
from 22-01-2009 to 23-01-2009
[en] Grid ; Distributed Computing ; Fault Tolerance ; Dataflow graph
[en] Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. We present our approach and preliminary results on a new checkpoint/rollback protocol based on a coordinated scheme. The application is described using a dataflow graph, which is an abstract representation of the execution. Thanks to this representation, fault recovery in our protocol requires only a partial restart of the other processes. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.

File(s) associated to this reference

Fulltext file(s):

Open access
talk_2009_apretaf.pdf (author postprint, 984.11 kB)
