Elnozahy, E.N.M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375-408 (2002)
Zheng, G., Shi, L., Kalé, L.V.: FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In: 2004 IEEE International Conference on Cluster Computing, San Diego, CA (September 2004)
Elnozahy, E.N., Plank, J.S.: Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing 1(2), 97-108 (2004)
Jafar, S., Krings, A.W., Gautier, T., Roch, J.L.: Theft-induced checkpointing for reconfigurable dataflow applications. In: IEEE, (ed.): IEEE Electro/Information Technology Conference (EIT), Lincoln, Nebraska (May 2005). This paper received the EIT 2005 Best Paper Award
Bouteiller, A., Lemarinier, P., Krawezik, G., Cappello, F.: Coordinated checkpoint versus message log for fault tolerant MPI. In: Proceedings of The 2003 IEEE International Conference on Cluster Computing, Hong Kong, China (2003)
Jafar, S., Krings, A., Gautier, T.: Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing. IEEE Transactions on Dependable and Secure Computing (TDSC) (in print, 2008)
Xie, M., Dai, Y.S., Poh, K.L.: Reliability of Grid Computing Systems. In: Computing System Reliability, pp. 179-205. Springer, US (2004)
Neokleous, K., Dikaiakos, M., Fragopoulou, P., Markatos, E.: Grid reliability: A study of failures on the EGEE infrastructure. In: Gorlatch, S., Bubak, M., Priol, T. (eds.) Proceedings of the CoreGRID Integration Workshop 2006, pp. 165-176 (October 2006)
Anstreicher, K.M., Brixius, N.W., Goux, J.P., Linderoth, J.: Solving large quadratic assignment problems on computational grids. Technical report, Iowa City, Iowa 52242 (2000)
Wang, Y.M., Huang, Y., Vo, K.P., Chung, P.Y., Kintala, C.: Checkpointing and its applications. In: Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers, Twenty-Fifth International Symposium on (27-30 Jun 1995), pp. 22-31 (1995)
Jafar, S., Pigeon, L., Gautier, T., Roch, J.L.: Self-adaptation of parallel applications in heterogeneous and dynamic architectures. In: IEEE, (ed.): ICTTA 2006, IEEE Conference on Information and Communication Technologies: from Theory to Applications, Damascus, Syria, pp. 3347-3352 (April 2006)
Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fédak, G., Germain, C., Hérault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Néri, V., Selikhov, A.: MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In: SuperComputing, Baltimore, USA (2002)
Jafar, S., Gautier, T., Krings, A.W., Roch, J.-L.: A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005. LNCS, vol. 3648, pp. 675-684. Springer, Heidelberg (2005)
Baude, F., Caromel, D., Delbé, C., Henrio, L.: A hybrid message logging-CIC protocol for constrained checkpointability. In: Cunha, J.C., Medeiros, P.D. (eds.) Euro-Par 2005. LNCS, vol. 3648, pp. 644-653. Springer, Heidelberg (2005)
Gautier, T., Besseron, X., Pigeon, L.: Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In: PASCO 2007: Proceedings of the 2007 international workshop on Parallel symbolic computation, pp. 15-23 (2007)
Kalé, L., Skeel, R., Bhandarkar, M., Brunner, R., Gursoy, A., Krawetz, N., Phillips, J., Shinozaki, A., Varadarajan, K., Schulten, K.: NAMD2: Greater scalability for parallel molecular dynamics. J. Comput. Phys. 151(1), 283-312 (1999)
Revire, R., Zara, F., Gautier, T.: Efficient and easy parallel implementation of large numerical simulation. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840, pp. 663-666. Springer, Heidelberg (2003)
Wiesmann, M., Pedone, F., Schiper, A.: A systematic classification of replicated database protocols based on atomic broadcast. In: Proceedings of the 3rd European Research Seminar on Advances in Distributed Systems (ERSADS 1999), Madeira Island, Portugal (1999)
Alvisi, L., Marzullo, K.: Message logging: Pessimistic, optimistic, causal, and optimal. IEEE Transactions on Software Engineering 24(2), 149-159 (1998)
Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63-75 (1985)
Randell, B.: System structure for software fault tolerance. In: Proceedings of the international conference on Reliable software, pp. 437-449 (1975)
Baldoni, R.: A communication-induced checkpointing protocol that ensures rollback-dependency trackability. In: Proc. of the 27th International Symposium on Fault-Tolerant Computing (FTCS 1997), p. 68. IEEE Computer Society, Los Alamitos (1997)
Plank, J.S., Beck, M., Kingsley, G., Li, K.: Libckpt: Transparent Checkpointing under Unix. In: Proceedings of USENIX Winter 1995 Technical Conference, New Orleans, Louisiana, USA, pp. 213-224 (January 1995)
Galilée, F., Roch, J.L., Cavalheiro, G., Doreille, M.: Athapascan-1: On-line building data flow graph in a parallel language. In: IEEE, (ed.): Pact 1998, Paris, France, pp. 88-95 (October 1998)
Roch, J.L., Gautier, T., Revire, R.: Athapascan: API for asynchronous parallel programming. Technical Report RT-0276, Projet APACHE, INRIA (February 2003)
Pellegrini, F., Roman, J.: Experimental analysis of the dual recursive bipartitioning algorithm for static mapping. Technical Report 1038-96, LaBRI, Université Bordeaux I (1996)
Karypis, G., Aggarwal, R., Kumar, V., Shekhar, S.: Multilevel hypergraph partitioning: Application in VLSI domain. In: Proceedings of the 34th annual conference on Design automation, pp. 526-529. ACM Press, New York (1997)
Koo, R., Toueg, S.: Checkpointing and rollback-recovery for distributed systems. IEEE Trans. Softw. Eng. 13(1), 23-31 (1987)
Besseron, X., Jafar, S., Gautier, T., Roch, J.L.: CCK: An improved coordinated checkpoint/rollback protocol for dataflow applications in Kaapi. In: IEEE, (ed.): ICTTA 2006, IEEE Conference on Information and Communication Technologies: from Theory to Applications, Damascus, Syria, pp. 3353-3358 (April 2006)
Besseron, X., Pigeon, L., Gautier, T., Jafar, S.: Un protocole de sauvegarde/reprise coordonné pour les applications à flot de données reconfigurables. Technique et Science Informatiques numéro spécial RenPar 17 27 (2008)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press, Cambridge (2001)