Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been demonstrated; and they require the use of the MPI profiling interface, which precludes the use of tools. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms (failure detection, notification, and recovery) between MPI and the resource manager, in contrast to current approaches, in which these mechanisms are implemented in MPI only. We demonstrate EReinit on three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
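The global-restart recovery flow the abstract describes can be illustrated with a small, purely conceptual sketch. This is not EReinit's actual API: the function name, the checkpoint interval, and the failure-injection logic below are invented for exposition. The point it shows is that, after a failure, every process rolls back to the last globally consistent checkpoint and resumes, rather than relaunching the entire job.

```python
# Conceptual simulation of the global-restart model: on a failure, all
# processes discard in-flight work and resume from the last global
# checkpoint. All names here are hypothetical, for illustration only.

def run_bulk_synchronous(total_steps, fail_at=None):
    """Run `total_steps` supersteps; optionally inject one failure at
    step `fail_at` to trigger a global restart."""
    state = 0          # application progress (number of completed steps)
    checkpoint = 0     # last globally consistent checkpoint
    step = 0
    failed_once = False
    restarts = 0
    while step < total_steps:
        if fail_at is not None and step == fail_at and not failed_once:
            failed_once = True
            restarts += 1          # "reinitialize": everyone rolls back
            state = checkpoint     # work since the checkpoint is redone
            step = checkpoint
            continue
        state += 1                 # one bulk-synchronous superstep
        step += 1
        if step % 4 == 0:          # periodic global checkpoint
            checkpoint = step
    return state, restarts
```

In this toy model, a failure at step 10 costs only the two supersteps since the checkpoint at step 8, mirroring how global restart bounds recovery time by checkpoint frequency rather than total runtime.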
Chakraborty, Sourav, et al. "EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications." Concurrency and Computation: Practice and Experience, vol. 32, no. 3, Aug. 2018. https://doi.org/10.1002/cpe.4863
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, & Subramoni, Hari (2018). EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. Concurrency and Computation: Practice and Experience, 32(3). https://doi.org/10.1002/cpe.4863
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, et al., "EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications," Concurrency and Computation: Practice and Experience 32, no. 3 (2018), https://doi.org/10.1002/cpe.4863
@article{osti_1708993,
author = {Chakraborty, Sourav and Laguna, Ignacio and Emani, Murali and Mohror, Kathryn and Panda, Dhabaleswar K. and Schulz, Martin and Subramoni, Hari},
title = {EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications},
annote = {Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been demonstrated; and they require the use of the MPI profiling interface, which precludes the use of tools. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms (failure detection, notification, and recovery) between MPI and the resource manager, in contrast to current approaches, in which these mechanisms are implemented in MPI only. We demonstrate EReinit on three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.},
doi = {10.1002/cpe.4863},
url = {https://www.osti.gov/biblio/1708993},
journal = {Concurrency and Computation: Practice and Experience},
issn = {1532-0626},
number = {3},
volume = {32},
place = {United States},
publisher = {Wiley},
year = {2018},
month = {08}}
Journal Article · 2018 · EuroMPI'18 Proceedings of the 25th European MPI Users' Group Meeting, Barcelona, Spain, September 23-26, 2018 · OSTI ID: 1544207