# Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints in the Spectral-Element Solver Nek5000

## Abstract

Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance, in which efficient checkpointing methods play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for multistep numerical time discretizations targeted at large-scale numerical simulations. The checkpointing scheme combines bandwidth-limited disk checkpointing and binomial memory checkpointing. Based on assumptions about the target petascale systems, which we later demonstrate to be realistic on the IBM Blue Gene/Q system Mira, we create a model of the expected performance of our checkpointing approach and validate it using the highly scalable Navier-Stokes spectral-element solver Nek5000 on small to moderate subsystems of the Mira supercomputer. In turn, this allows us to predict optimal algorithmic choices when using all of Mira. We also demonstrate that two-level checkpointing is significantly superior to single-level checkpointing when adjoining a large number of time integration steps. To our knowledge, this is the first time two-level checkpointing has been designed, implemented, tuned, and demonstrated on fluid dynamics codes at a large scale of 50k+ cores.
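The storage/recomputation balance mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation or performance model. It relies on the classical binomial checkpointing bound: with `s` in-memory snapshots and repetition number `r` (loosely, the maximum number of times any single step is advanced forward), at most C(s + r, s) time steps can be reversed. The function names, the evenly spaced disk segments, and the example numbers are assumptions made purely for illustration.

```python
from math import comb


def max_steps(snapshots: int, repetitions: int) -> int:
    """Binomial bound: most time steps reversible with `snapshots`
    memory checkpoints and repetition number `repetitions`."""
    return comb(snapshots + repetitions, snapshots)


def min_repetitions(steps: int, snapshots: int) -> int:
    """Smallest repetition number whose binomial bound covers `steps`."""
    r = 0
    while max_steps(snapshots, r) < steps:
        r += 1
    return r


def two_level_repetitions(steps: int, disk_segments: int, snapshots: int) -> int:
    """Repetition number per segment when disk checkpoints split the run
    into `disk_segments` equal pieces (an illustrative assumption) and
    each piece is reversed with binomial memory checkpointing."""
    segment_length = -(-steps // disk_segments)  # ceiling division
    return min_repetitions(segment_length, snapshots)


# Hypothetical example: 10,000 time steps, 10 memory snapshots.
single_level = min_repetitions(10_000, 10)
two_level = two_level_repetitions(10_000, 100, 10)
```

In this hypothetical setting, the single-level scheme needs a repetition number of 7, whereas splitting the run into 100 disk-backed segments lowers it to 3 per segment. The sketch ignores disk bandwidth and asynchronous I/O, both of which the paper's model takes into account.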

- Authors:
- Schanen, Michel; Marin, Oana; Zhang, Hong; Anitescu, Mihai

- Publication Date:

- Research Org.:
- Argonne National Lab. (ANL), Argonne, IL (United States)

- Sponsoring Org.:
- USDOE Office of Science (SC)

- OSTI Identifier:
- 1394784

- DOE Contract Number:
- AC02-06CH11357

- Resource Type:
- Conference

- Resource Relation:
- Conference: 2016 International Conference on Computational Science, 06/06/16 - 06/08/16, San Diego, CA, US

- Country of Publication:
- United States

- Language:
- English

- Subject:
- Adjoints; CFD; Gradient; Large Scale; Nek5000; PETSc; Two-Level Checkpointing

### Citation Formats

Schanen, Michel, Marin, Oana, Zhang, Hong, and Anitescu, Mihai. *Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints in the Spectral-Element Solver Nek5000*. United States: N. p., 2016. Web. doi:10.1016/j.procs.2016.05.444.

Schanen, Michel, Marin, Oana, Zhang, Hong, & Anitescu, Mihai. *Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints in the Spectral-Element Solver Nek5000*. United States. doi:10.1016/j.procs.2016.05.444.

Schanen, Michel, Marin, Oana, Zhang, Hong, and Anitescu, Mihai. Fri . "Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints in the Spectral-Element Solver Nek5000". United States. doi:10.1016/j.procs.2016.05.444. https://www.osti.gov/servlets/purl/1394784.

```
@article{osti_1394784,
  title = {Asynchronous Two-Level Checkpointing Scheme for Large-Scale Adjoints in the Spectral-Element Solver Nek5000},
  author = {Schanen, Michel and Marin, Oana and Zhang, Hong and Anitescu, Mihai},
  abstractNote = {Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance, in which efficient checkpointing methods play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for multistep numerical time discretizations targeted at large-scale numerical simulations. The checkpointing scheme combines bandwidth-limited disk checkpointing and binomial memory checkpointing. Based on assumptions about the target petascale systems, which we later demonstrate to be realistic on the IBM Blue Gene/Q system Mira, we create a model of the expected performance of our checkpointing approach and validate it using the highly scalable Navier-Stokes spectral-element solver Nek5000 on small to moderate subsystems of the Mira supercomputer. In turn, this allows us to predict optimal algorithmic choices when using all of Mira. We also demonstrate that two-level checkpointing is significantly superior to single-level checkpointing when adjoining a large number of time integration steps. To our knowledge, this is the first time two-level checkpointing has been designed, implemented, tuned, and demonstrated on fluid dynamics codes at a large scale of 50k+ cores.},
  doi = {10.1016/j.procs.2016.05.444},
  journal = {},
  number = {},
  volume = {},
  place = {United States},
  year = {2016},
  month = {1}
}
```