Covering Resilience: A Recent Development for Binomial Checkpointing
In terms of computing time, adjoint methods offer a very attractive way to compute gradient information, as required, e.g., for optimization purposes. However, this favorable temporal complexity comes with a memory requirement that is in essence proportional to the operation count of the underlying function, e.g., if algorithmic differentiation is used to provide the adjoints. For this reason, checkpointing approaches in many variants have become popular. This paper analyzes an extension of the so-called binomial approach that also covers possible failures of the computing system. Such a precaution is of special interest for massively parallel simulations and adjoint calculations where the mean time between failures of the large-scale computing system is smaller than the time needed to complete the calculation of the adjoint information. We describe the extensions of standard checkpointing approaches required for such resilience, provide a corresponding implementation, and discuss first numerical results.
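As background for the binomial approach named in the abstract, the following is a minimal sketch of the classical bound it is based on, not of the resilience extension the paper analyzes. It assumes the standard result for binomial checkpointing: with s checkpoints and at most r forward evaluations of any step, up to C(s + r, s) time steps can be reversed. The function names `max_steps` and `min_repetitions` are hypothetical, and the counting convention (for example, whether the initial state occupies a checkpoint) varies between presentations.

```python
from math import comb


def max_steps(checkpoints: int, repetitions: int) -> int:
    """Classical binomial bound: number of time steps reversible with
    `checkpoints` stored states when no step is evaluated forward more
    than `repetitions` times. Assumed convention: C(s + r, s)."""
    if checkpoints < 0 or repetitions < 0:
        raise ValueError("checkpoints and repetitions must be non-negative")
    return comb(checkpoints + repetitions, checkpoints)


def min_repetitions(steps: int, checkpoints: int) -> int:
    """Smallest repetition count r such that `steps` time steps fit
    within the bound for the given number of checkpoints."""
    if steps < 1 or checkpoints < 1:
        raise ValueError("steps and checkpoints must be positive")
    r = 0
    while max_steps(checkpoints, r) < steps:
        r += 1
    return r


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for steps in (10, 1000, 100000):
        print(steps, "steps, 10 checkpoints ->",
              min_repetitions(steps, 10), "repetitions")
```

The sketch illustrates why checkpointing is attractive: for a fixed number of checkpoints, the number of reversible steps grows polynomially in the repetition count, so memory stays bounded while recomputation grows only slowly with the length of the simulation.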
- Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Organization: USDOE Office of Science (SC)
- DOE Contract Number: AC02-06CH11357
- OSTI ID: 1366299
- Resource Relation: Conference: 7th International Conference on Algorithmic Differentiation, 09/12/16 - 09/15/16, Oxford, GB
- Country of Publication: United States
- Language: English
Similar Records
Resiliency in numerical algorithm design for extreme scale simulations
Node failure resiliency for Uintah without checkpointing