References of "Besseron, Xavier 50000761"
Full Text
Peer Reviewed
Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI
Rajachandrasekar, Raghunath; Besseron, Xavier UL; Panda, Dhabaleswar K.

in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (2012)

Fault detection and prediction in HPC clusters and cloud-computing systems are increasingly challenging issues. Several kinds of system middleware, such as job schedulers and MPI implementations, provide support for both reactive and proactive mechanisms to tolerate faults. These techniques rely on external components such as system logs and infrastructure monitors to provide information about hardware/software failures, either through detection or as a prediction. However, these middleware components work in isolation, without disseminating the knowledge of faults encountered. In this context, we propose a lightweight multi-threaded service, namely FTB-IPMI, which provides distributed fault monitoring using the Intelligent Platform Management Interface (IPMI) and coordinated propagation of fault information using the Fault-Tolerance Backplane (FTB). In essence, it serves as a middleman between system hardware and the software stack by translating raw hardware events into structured software events and delivering them to any interested component using a publish-subscribe framework. Fault predictors and other decision-making engines that rely on distributed failure information can benefit from FTB-IPMI to facilitate proactive fault-tolerance mechanisms such as preemptive job migration. We have developed a fault-prediction engine within MVAPICH2, an RDMA-based MPI implementation, to demonstrate this capability. Failure predictions made by this engine are used to trigger migration of processes from failing nodes to healthy spare nodes, thereby providing resilience to the MPI application. Experimental evaluation indicates that a single instance of FTB-IPMI can scale to several hundred nodes with a low resource-utilization footprint. A deployment of FTB-IPMI that services a cluster with 128 compute nodes sweeps the entire cluster and collects IPMI sensor information on CPU temperature, system voltages and fan speeds in about 0.75 seconds. The average CPU utilization of this service running on a single node is 0.35%.

Full Text
Peer Reviewed
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart
Ouyang, Xiangyong; Rajachandrasekar, Raghunath; Besseron, Xavier UL et al

in 2011 International Conference on Parallel Processing (2011, September)

Full Text
Peer Reviewed
Can a Decentralized Metadata Service Layer benefit Parallel Filesystems?
Meshram, Vilobh; Besseron, Xavier UL; Ouyang, Xiangyong et al

in 2011 IEEE International Conference on Cluster Computing (2011, September)

Full Text
Peer Reviewed
Impact of over-decomposition on coordinated checkpoint/rollback protocol
Besseron, Xavier UL; Gautier, Thierry

in Euro-Par 2011: Parallel Processing Workshops (2011, August)

Full Text
Peer Reviewed
Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?
Rajachandrasekar, Raghunath; Ouyang, Xiangyong; Besseron, Xavier UL et al

in Euro-Par 2011: Parallel Processing Workshops (2011, August)

Full Text
Peer Reviewed
High Performance Pipelined Process Migration with RDMA
Ouyang, Xiangyong; Rajachandrasekar, Raghunath; Besseron, Xavier UL et al

in 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2011, May)

Full Text
Proactive Fault-Resilience with Process Migration in MVAPICH2: A demonstration with Tachyon
Ouyang, Xiangyong; Rajachandrasekar, Raghunath; Besseron, Xavier UL et al

Presentation (2010, November)

Full Text
Kaapi / Charm++ preliminary comparison
Besseron, Xavier UL; Gautier, Thierry; Zheng, Gengbin et al

Presentation (2010, June)

Full Text
Tolérance aux fautes et reconfiguration dynamique pour les applications distribuées à grande échelle [Fault tolerance and dynamic reconfiguration for large-scale distributed applications]
Besseron, Xavier UL

Doctoral thesis (2010)

This work deals with high-performance computing on large-scale platforms such as computing grids. Computing grids are characterized by (1) frequent changes in execution context and, especially, by (2) a high failure probability caused by the large number of components. Running an application efficiently in such an environment requires taking these parameters into account. Our research work is based on the abstract representation of the application as a data flow graph from the parallel and distributed programming model Athapascan/Kaapi. This abstract representation is used to provide solutions for (1) dynamic reconfiguration and (2) fault tolerance issues. First, we propose a dynamic reconfiguration mechanism that manages, transparently for the reconfiguration programmer, concurrent operations on the application state and mutual consistency of states for distributed reconfiguration. Second, we present an original fault tolerance protocol that allows partial rollback of the application in case of failure. For this purpose, the set of computation tasks strictly required for recovery is computed. These contributions are evaluated through the Kaapi and X-Kaapi software on the Grid'5000 computing platform.

Full Text
Fault tolerance for a data flow model
Besseron, Xavier UL

Presentation (2010, March)

Full Text
Peer Reviewed
Fault tolerance and availability awareness in computational grids
Besseron, Xavier UL; Bouguerra, Slim; Gautier, Thierry et al

in Fundamentals of Grid Computing: Theory, Algorithms and Technologies (2009)

Full Text
Peer Reviewed
X-Kaapi : Une nouvelle implémentation eXtrême du vol de travail [X-Kaapi: a new eXtreme implementation of work stealing]
Besseron, Xavier UL; Laferriere, Christophe; Traore, Daouda et al

in Rencontres Francophones du Parallélisme (RenPar'19) (2009, September)

Full Text
Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model
Besseron, Xavier UL; Gautier, Thierry

Presentation (2009, January 22)

Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure may be significant due to the number of unreliable components involved during an execution. We present our approach and preliminary results on a new checkpoint/rollback protocol based on a coordinated scheme. The application is described using a dataflow graph, which is an abstract representation of the execution. Thanks to this representation, fault recovery in our protocol requires only a partial restart of the other processes. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.

Full Text
Peer Reviewed
Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications
Besseron, Xavier UL; Gautier, Thierry

in Modelling, Computation and Optimization in Information Systems and Management Sciences. MCO 2008 (2008, September)

Full Text
IV Grid Plugtests: composing dedicated tools to run an application efficiently on Grid'5000
Besseron, Xavier UL; Danjean, Vincent; Gautier, Thierry et al

Presentation (2008, February 12)

Efficiently exploiting the resources of the whole of Grid'5000 with a single application requires solving several issues: 1) resource reservation; 2) deployment of the application's processes; 3) scheduling of the application's tasks. For the IV Grid Plugtests, we used a dedicated tool for each issue. The N-Queens contest rules imposed ProActive for resource reservation (issue 1). Issue 2 was solved using TakTuk, which can deploy to a large set of remote nodes. Deployed nodes take part in the deployment using an adaptive algorithm that makes it very efficient. For issue 3, we wrote our application with the Athapascan API, whose model is based on the concepts of tasks and shared data. The application is described as a data-flow graph using the Shared and Fork keywords. This high-level abstraction of the hardware gives us an efficient execution with the Kaapi runtime engine, which uses a work-stealing scheduling algorithm to balance the workload between all the distributed processes.

Full Text
Peer Reviewed
Un protocole de sauvegarde / reprise coordonné pour les applications à flot de données reconfigurables [A coordinated checkpoint/restart protocol for reconfigurable data-flow applications]
Besseron, Xavier UL; Pigeon, Laurent; Gautier, Thierry et al

in Technique et Science Informatiques (2007)

Full Text
Peer Reviewed
Kaapi: A Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multi-Processors
Gautier, Thierry; Besseron, Xavier UL; Pigeon, Laurent

in PASCO '07 Proceedings of the 2007 international workshop on Parallel symbolic computation (2007, July)

Full Text
Peer Reviewed
CCK: An Improved Coordinated Checkpoint/Rollback Protocol for Dataflow Applications in Kaapi
Besseron, Xavier UL; Jafar, Samir; Gautier, Thierry et al

in 2006 2nd International Conference on Information & Communication Technologies (2006, April)
