BESSERON, Xavier ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Engineering Research Unit ; The Ohio State University > Department of Computer Science and Engineering
Gautier, Thierry; INRIA, France > MOAIS Project
External co-authors :
yes
Language :
English
Title :
Impact of over-decomposition on coordinated checkpoint/rollback protocol
Publication date :
August 2011
Event name :
Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids (Resilience'11), held in conjunction with EuroPar'11
Badia, R. M., Herrero, J. R., Labarta, J., Pérez, J. M., Quintana-Ortí, E. S., Quintana-Ortí, G. : Parallelizing dense and banded linear algebra libraries using SMPSs. Concurr. Comput. : Pract. Exper. (2009)
Besseron, X., Gautier, T. : Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications. In: MCO 2008 (2008)
Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., Zhou, Y. : Cilk: An efficient multithreaded runtime system. Parallel and Distributed Computing (1996)
Bongo, L. A., Vinter, B., Anshus, O. J., Larsen, T., Bjorndalen, J. M. : Using overdecomposition to overlap communication latencies with computation and take advantage of SMT processors. In: ICPP Workshops (2006)
Bouteiller, A., Hérault, T., Krawezik, G., Lemarinier, P., Cappello, F. : MPICH-V project: A multiprotocol automatic fault tolerant MPI. High Performance Computing Applications (2006)
Chakravorty, S., Kale, L. V. : A fault tolerant protocol for massively parallel systems. In: IPDPS (2004)
Chandy, K. M., Lamport, L. : Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (1985)
Elnozahy, E. N., Alvisi, L., Wang, Y. M., Johnson, D. B. : A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (2002)
Galilée, F., Roch, J. L., Cavalheiro, G., Doreille, M. : Athapascan-1: On-line building data flow graph in a parallel language. In: PACT 1998 (1998)
Gao, Q., Yu, W., Huang, W., Panda, D. K. : Application-transparent checkpoint/restart for MPI programs over InfiniBand. In: ICPP 2006 (2006)
Gautier, T., Besseron, X., Pigeon, L. : Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In: PASCO 2007 (2007)
Guermouche, A., Ropars, T., Brunet, E., Snir, M., Cappello, F. : Uncoordinated checkpointing without domino effect for send-deterministic message passing applications. In: IPDPS (2011)
Hursey, J., Squyres, J. M., Mattox, T. I., Lumsdaine, A. : The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS (2007)
Jafar, S., Krings, A. W., Gautier, T. : Flexible rollback recovery in dynamic heterogeneous grid computing. IEEE Transactions on Dependable and Secure Computing (2008)
Jafar, S., Pigeon, L., Gautier, T., Roch, J. L. : Self-adaptation of parallel applications in heterogeneous and dynamic architectures. In: ICTTA 2006 (2006)
Jose, J., Luo, M., Sur, S., Panda, D. K. : Unifying UPC and MPI Runtimes: Experience with MVAPICH. In: PGAS 2010 (2010)
Kale, L. V., Mendes, C., Meneses, E. : Adaptive runtime support for fault tolerance. Talk at Los Alamos Computer Science Symposium 2009 (2009)
Kale, L. V., Zheng, G. : Charm++ and AMPI: Adaptive runtime strategies via migratable objects. In: Advanced Computational Infrastructures for Parallel and Distributed Applications. Wiley-Interscience (2009)
Naik, V. K., Setia, S. K., Squillante, M. S. : Processor allocation in multiprogrammed distributed-memory parallel computer systems. Parallel Distributed Computing (1997)
Rabenseifner, R., Hager, G., Jost, G. : Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In: PDP 2009 (2009)
Song, F., YarKhan, A., Dongarra, J. : Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In: SC 2009 (2009)
Tamir, Y., Séquin, C. H. : Error recovery in multicomputers using global checkpoints. In: ICPP 1984 (1984)
Zheng, G., Shi, L., Kale, L. V. : FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. Cluster Computing (2004)