CosmoFlow: Using Deep Learning to Learn the Universe at Scale
Abstract
Deep learning is a promising tool for determining the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improved threading for many element-wise operations, to improve training performance on Intel® Xeon Phi™ processors. We also utilize the Cray PE Machine Learning Plugin for efficient scaling to multiple nodes. We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training. These enhancements enable us to process large 3D dark matter distributions and predict the cosmological parameters Ω_M, σ_8, and n_s with unprecedented accuracy.
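The scaling figures quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below (plain Python; every derived quantity is a back-of-envelope estimate, not a number from the paper) relates the three reported figures: 8192 nodes, 77% parallel efficiency, and 3.5 Pflop/s sustained throughput.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Parallel efficiency = achieved speedup divided by ideal linear speedup."""
    return speedup / nodes

nodes = 8192             # reported node count on Cori
efficiency = 0.77        # reported parallel efficiency
sustained_pflops = 3.5   # reported sustained performance, Pflop/s

# Effective speedup implied by the reported efficiency (~6308x).
speedup = efficiency * nodes

# Implied per-node sustained throughput (~0.43 Tflop/s per node).
per_node_tflops = sustained_pflops * 1000 / nodes

print(f"effective speedup: {speedup:.0f}x")
print(f"per-node sustained: {per_node_tflops * 1000:.0f} Gflop/s")
```

These derived values (roughly a 6308x effective speedup and about 427 Gflop/s per node) are consistency checks only; the paper itself reports only the aggregate figures.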
- Authors:
- Intel Corp., Hillsboro, OR (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Cray Inc., Seattle, WA (United States)
- Univ. of California, Berkeley, CA (United States)
- Intel Corp., Santa Clara, CA (United States)
- Flatiron Inst., New York, NY (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Carnegie Mellon Univ., Pittsburgh, PA (United States)
- Publication Date:
- Mar 14, 2019
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1510756
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- International Conference for High Performance Computing, Networking, Storage and Analysis
- Additional Journal Information:
- Journal Volume: 2018; Conference: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX (United States), 11-16 Nov 2018; Journal ID: ISSN 2167-4329
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 98 NUCLEAR DISARMAMENT, SAFEGUARDS, AND PHYSICAL PROTECTION; Cosmology; Deep Learning; Machine Learning; TensorFlow; High Performance Computing
Citation Formats
Mathuriya, Amrita, Bard, Deborah, Mendygral, Peter, Meadows, Lawrence, Arnemann, James, Shao, Lei, He, Siyu, Karna, Tuomas, Moise, Diana, Pennycook, Simon J., Maschhoff, Kristyn, Sewall, Jason, Kumar, Nalini, Ho, Shirley, Ringenburg, Michael F., Prabhat, Prabhat, and Lee, Victor. CosmoFlow: Using Deep Learning to Learn the Universe at Scale. United States: N. p., 2019.
Web. doi:10.1109/sc.2018.00068.
Mathuriya, Amrita, Bard, Deborah, Mendygral, Peter, Meadows, Lawrence, Arnemann, James, Shao, Lei, He, Siyu, Karna, Tuomas, Moise, Diana, Pennycook, Simon J., Maschhoff, Kristyn, Sewall, Jason, Kumar, Nalini, Ho, Shirley, Ringenburg, Michael F., Prabhat, Prabhat, & Lee, Victor. CosmoFlow: Using Deep Learning to Learn the Universe at Scale. United States. https://doi.org/10.1109/sc.2018.00068
Mathuriya, Amrita, Bard, Deborah, Mendygral, Peter, Meadows, Lawrence, Arnemann, James, Shao, Lei, He, Siyu, Karna, Tuomas, Moise, Diana, Pennycook, Simon J., Maschhoff, Kristyn, Sewall, Jason, Kumar, Nalini, Ho, Shirley, Ringenburg, Michael F., Prabhat, Prabhat, and Lee, Victor. 2019.
"CosmoFlow: Using Deep Learning to Learn the Universe at Scale". United States. https://doi.org/10.1109/sc.2018.00068. https://www.osti.gov/servlets/purl/1510756.
@article{osti_1510756,
title = {CosmoFlow: Using Deep Learning to Learn the Universe at Scale},
author = {Mathuriya, Amrita and Bard, Deborah and Mendygral, Peter and Meadows, Lawrence and Arnemann, James and Shao, Lei and He, Siyu and Karna, Tuomas and Moise, Diana and Pennycook, Simon J. and Maschhoff, Kristyn and Sewall, Jason and Kumar, Nalini and Ho, Shirley and Ringenburg, Michael F. and Prabhat, Prabhat and Lee, Victor},
abstractNote = {Deep learning is a promising tool for determining the physical model that describes our universe. To handle the considerable computational cost of this problem, we present CosmoFlow: a highly scalable deep learning application built on top of the TensorFlow framework. CosmoFlow uses efficient implementations of 3D convolution and pooling primitives, together with improved threading for many element-wise operations, to improve training performance on Intel® Xeon Phi™ processors. We also utilize the Cray PE Machine Learning Plugin for efficient scaling to multiple nodes. We demonstrate fully synchronous data-parallel training on 8192 nodes of Cori with 77% parallel efficiency, achieving 3.5 Pflop/s sustained performance. To our knowledge, this is the first large-scale science application of the TensorFlow framework at supercomputer scale with fully synchronous training. These enhancements enable us to process large 3D dark matter distributions and predict the cosmological parameters Ω_M, σ_8, and n_s with unprecedented accuracy.},
doi = {10.1109/sc.2018.00068},
journal = {International Conference for High Performance Computing, Networking, Storage and Analysis},
number = {},
volume = {2018},
place = {United States},
year = {2019},
month = {3}
}
Works referenced in this record:
Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
preprint, January 2016
- Das, Dipankar; Avancha, Sasikanth; Mudigere, Dheevatsa
- arXiv
Solving large scale structure in ten easy steps with COLA
journal, June 2013
- Tassev, Svetlin; Zaldarriaga, Matias; Eisenstein, Daniel J.
- Journal of Cosmology and Astroparticle Physics, Vol. 2013, Issue 06
sCOLA: The N-body COLA Method Extended to the Spatial Domain
preprint, January 2015
- Tassev, Svetlin; Eisenstein, Daniel J.; Wandelt, Benjamin D.
- arXiv
Cosmological model discrimination with Deep Learning
preprint, January 2017
- Schmelzle, Jorit; Lucchi, Aurelien; Kacprzak, Tomasz
- arXiv
cuDNN: Efficient Primitives for Deep Learning
preprint, January 2014
- Chetlur, Sharan; Woolley, Cliff; Vandermersch, Philippe
- arXiv
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters
conference, June 2016
- Iandola, Forrest N.; Moskewicz, Matthew W.; Ashraf, Khalid
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Enabling Dark Energy Science with Deep Generative Models of Galaxy Images
preprint, January 2016
- Ravanbakhsh, Siamak; Lanusse, Francois; Mandelbaum, Rachel
- arXiv
WOMBAT: A Scalable and High Performance Astrophysical MHD Code
text, January 2017
- Mendygral, Peter; Radcliffe, Nick; Kandalla, Krishna
- arXiv
Planck 2015 results: XIII. Cosmological parameters
journal, September 2016
- Ade, P. A. R.; Aghanim, N.; Arnaud, M.
- Astronomy & Astrophysics, Vol. 594
Evaluating the networking characteristics of the Cray XC-40 Intel Knights Landing-based Cori supercomputer at NERSC
journal, September 2017
- Doerfler, Douglas; Austin, Brian; Cook, Brandon
- Concurrency and Computation: Practice and Experience, Vol. 30, Issue 1
Multi-scale initial conditions for cosmological simulations: Multi-scale initial conditions
journal, July 2011
- Hahn, Oliver; Abel, Tom
- Monthly Notices of the Royal Astronomical Society, Vol. 415, Issue 3
WOMBAT: A Scalable and High-performance Astrophysical Magnetohydrodynamics Code
journal, February 2017
- Mendygral, P. J.; Radcliffe, N.; Kandalla, K.
- The Astrophysical Journal Supplement Series, Vol. 228, Issue 2
In-Datacenter Performance Analysis of a Tensor Processing Unit
conference, January 2017
- Jouppi, Norman P.; Borchers, Al; Boyle, Rick
- Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA '17
Deep Residual Learning for Image Recognition
conference, June 2016
- He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Evolving Deep Networks Using HPC
conference, January 2017
- Young, Steven R.; Rose, Derek C.; Johnston, Travis
- Proceedings of the Machine Learning on HPC Environments - MLHPC'17
Deep learning at 15PF: supervised and semi-supervised classification for scientific data
conference, January 2017
- Kurth, Thorsten; Smorkalov, Mikhail; Deslippe, Jack
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
Rotation-invariant convolutional neural networks for galaxy morphology prediction
journal, April 2015
- Dieleman, Sander; Willett, Kyle W.; Dambre, Joni
- Monthly Notices of the Royal Astronomical Society, Vol. 450, Issue 2
Distributed asynchronous deterministic and stochastic gradient optimization algorithms
journal, September 1986
- Tsitsiklis, J.; Bertsekas, D.; Athans, M.
- IEEE Transactions on Automatic Control, Vol. 31, Issue 9
Works referencing / citing this record:
Response to NITRD, NCO, NSF Request for Information on "Update to the 2016 National Artificial Intelligence Research and Development Strategic Plan"
preprint, January 2019
- Amundson, J.; Annis, J.; Avestruz, C.
- arXiv
Parallelizing Training of Deep Generative Models on Massive Scientific Datasets
preprint, January 2019
- Jacobs, Sam Ade; Van Essen, Brian; Hysom, David
- arXiv
Derivation and Analysis of Fast Bilinear Algorithms for Convolution
preprint, January 2019
- Ju, Caleb; Solomonik, Edgar
- arXiv
Quasar Detection using Linear Support Vector Machine with Learning From Mistakes Methodology
text, January 2020
- Herle, Aniruddh; Channegowda, Janamejaya; Prabhu, Dinakar
- arXiv
Exascale Deep Learning for Climate Analytics
conference, November 2018
- Kurth, Thorsten; Treichler, Sean; Romero, Joshua
- SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
DisCo: Physics-Based Unsupervised Discovery of Coherent Structures in Spatiotemporal Systems
conference, November 2019
- Rupe, Adam; Prabhat, Mr; Crutchfield, James P.
- 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
preprint, January 2020
- Oyama, Yosuke; Maruyama, Naoya; Dryden, Nikoli
- arXiv
Learning to Predict the Cosmological Structure Formation
text, January 2018
- He, Siyu; Li, Yin; Feng, Yu
- arXiv
HPC AI500: A Benchmark Suite for HPC AI Systems
preprint, January 2019
- Jiang, Zihan; Gao, Wanling; Wang, Lei
- arXiv
Learning to predict the cosmological structure formation
journal, June 2019
- He, Siyu; Li, Yin; Feng, Yu
- Proceedings of the National Academy of Sciences, Vol. 116, Issue 28
A computational-graph partitioning method for training memory-constrained DNNs
journal, July 2021
- Qararyah, Fareed; Wahib, Mohamed; Dikbayır, Doğa
- Parallel Computing, Vol. 104-105
Clairvoyant prefetching for distributed machine learning I/O
conference, November 2021
- Dryden, Nikoli; Böhringer, Roman; Ben-Nun, Tal
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis