"MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE," http://mvapich.cse.ohio-state.edu/.
"Open MPI: Open Source High Performance Computing," http://mvapich.cse.ohio-state.edu/.
"MPICH2: High-Performance and Widely Portable MPI," http://www.mcs.anl.gov/research/projects/mpich2/.
X. Ouyang, K. Gopalakrishnan, and D. K. Panda, "Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems," in ICPP 2009, September 2009.
J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, March 2007.
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, no. 3, pp. 375-408, 2002.
P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," in SciDAC, June 2006.
Q. Gao, W. Yu, W. Huang, and D. K. Panda, "Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand," in International Conference on Parallel Processing (ICPP), August 2006.
K. M. Chandy and L. Lamport, "Distributed snapshots: determining global states of distributed systems," ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63-75, 1985.
J. S. Plank, Y. Chen, K. Li, M. Beck, and G. Kingsley, "Memory exclusion: Optimizing the performance of checkpointing systems," in Software: Practice and Experience, 1999.
J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, "PLFS: a checkpoint filesystem for parallel applications," in Proc. of SC '09, 2009.
X. Ouyang, K. Gopalakrishnan, T. Gangadharappa, and D. K. Panda, "Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture," in HiPC 2009, December 2009.
"Filesystem in Userspace," http://fuse.sourceforge.net/.
"A flow-chart diagram which shows how FUSE works," http://en.wikipedia.org/wiki/Filesystem-in-Userspace.
F. C. Wong, R. P. Martin, et al., "Architectural requirements and scalability of the NAS parallel benchmarks," in Proc. of Supercomputing '99, 1999, p. 41.
G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," in Proc. of the 10th International Parallel Processing Symposium (IPPS '96), 1996.
A. Agbaria and R. Friedman, "Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations," in International Symposium on High-Performance Distributed Computing, 1999, p. 31.
S. Sankaran, J. M. Squyres, B. Barrett, et al., "The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing," in LACSI, October 2003.
I. R. Philp, "Software failures and the road to a petaflop machine," in First Workshop on High Performance Computing Reliability Issues (HPCRI), February 2005.
M. Polte, J. Simsa, et al., "Fast log-based concurrent writing of checkpoints," in PDSI 2008 workshop in conjunction with SC08, November 2008.
"PVFS2," http://www.pvfs.org/.
S. Al-Kiswany, M. Ripeanu, S. Vazhkudai, and A. Gharaibeh, "stdchk: A Checkpoint Storage System for Desktop Grid Computing," in ICDCS 2008, June 2008.
K. Li, J. F. Naughton, and J. S. Plank, "Low-latency, concurrent checkpointing for parallel programs," IEEE Trans. Parallel Distrib. Syst., 1994.