Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.: DataStager: Scalable data staging services for petascale applications. In: HPDC (2009)
Bent, J., Gibson, G., Grider, G., McClelland, B., Nowoczynski, P., Nunez, J., Polte, M., Wingate, M.: PLFS: A checkpoint filesystem for parallel applications. In: SC (2009)
Buntinas, D., Coti, C., Herault, T., Lemarinier, P., Pilard, L., Rezmerita, A., Rodriguez, E., Cappello, F.: Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols. Future Generation Computer Systems (2008)
Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale resilience. IJHPCA (2009)
Gao, Q., Yu, W., Huang, W., Panda, D. K.: Application-transparent checkpoint/restart for MPI programs over InfiniBand. In: ICPP (2006)
Gupta, R., Beckman, P., Park, B. H., Lusk, E., Hargrove, P., Geist, A., Panda, D. K., Lumsdaine, A., Dongarra, J.: CIFTS: A coordinated infrastructure for fault-tolerant systems. In: ICPP (2009)
Hargrove, P. H., Duell, J. C.: Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters. In: SciDAC (2006)
Hursey, J., Lumsdaine, A.: A composable runtime recovery policy framework supporting resilient HPC applications. Tech. rep., University of Tennessee (2010)
Hursey, J., Squyres, J. M., Mattox, T. I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS (2007)
InfiniBand Trade Association: The InfiniBand Architecture, http://www.infinibandta.org
Isaila, F., Garcia Blas, J., Carretero, J., Latham, R., Ross, R.: Design and evaluation of multiple-level data staging for Blue Gene systems. TPDS (2011)
Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B. R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: SC (2010)
Ouyang, X., Gopalakrishnan, K., Gangadharappa, T., Panda, D. K.: Fast checkpointing by write aggregation with dynamic buffer and interleaving on multicore architecture. In: HiPC (2009)
Ouyang, X., Rajachandrasekar, R., Besseron, X., Wang, H., Huang, J., Panda, D. K.: CRFS: A lightweight user-level filesystem for generic checkpoint/restart. In: ICPP (2011) (to appear)
Plank, J. S., Chen, Y., Li, K., Beck, M., Kingsley, G.: Memory exclusion: Optimizing the performance of checkpointing systems. Software: Practice and Experience (1999)
Schroeder, B., Gibson, G. A.: Understanding failures in petascale computers. Journal of Physics: Conference Series (2007)