OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Train Like a (Var)Pro: Efficient Training of Neural Networks with Variable Projection

Journal Article · SIAM Journal on Mathematics of Data Science
DOI: https://doi.org/10.1137/20m1359511 · OSTI ID: 1834344

Deep neural networks (DNNs) have achieved state-of-the-art performance across a variety of traditional machine learning tasks, e.g., speech recognition, image classification, and segmentation. The ability of DNNs to efficiently approximate high-dimensional functions has also motivated their use in scientific applications, e.g., to solve partial differential equations and to generate surrogate models. In this paper, we consider the supervised training of DNNs, which arises in many of the above applications. We focus on the central problem of optimizing the weights of the given DNN such that it accurately approximates the relation between observed input and target data. Devising effective solvers for this optimization problem is notoriously challenging due to the large number of weights, nonconvexity, data sparsity, and nontrivial choice of hyperparameters. To solve the optimization problem more efficiently, we propose the use of variable projection (VarPro), a method originally designed for separable nonlinear least-squares problems. Our main contribution is the Gauss–Newton VarPro method (GNvpro), which extends the reach of the VarPro idea to nonquadratic objective functions, most notably the cross-entropy loss functions arising in classification. These extensions make GNvpro applicable to all training problems that involve a DNN whose last layer is an affine mapping, which is common in many state-of-the-art architectures. In our four numerical experiments from surrogate modeling, segmentation, and classification, GNvpro solves the optimization problem more efficiently than commonly used stochastic gradient descent (SGD) schemes. Finally, GNvpro finds solutions that generalize well to unseen data points, in all but one example better than well-tuned SGD methods.
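The core VarPro idea described above can be illustrated on a toy separable problem. The sketch below is not the paper's GNvpro method; it shows only the projection step, under assumed names (`features`, `projected_loss`) and with a crude random search standing in for the Gauss–Newton scheme: for each candidate setting of the nonlinear parameters theta, the linear weights W of the final affine layer are eliminated in closed form via linear least squares, so the outer optimization runs over theta alone.

```python
import numpy as np

# Toy regression data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))                           # inputs, shape (n, 1)
Y = np.sin(3.0 * X) + 0.05 * rng.standard_normal(X.shape)   # noisy targets

def features(theta, X):
    """Nonlinear feature map: one hidden tanh layer with 5 units.

    theta packs the hidden weights (5) and biases (5); the model is
    f(x) = features(theta, x) @ W, with W entering linearly.
    """
    w, b = theta[:5], theta[5:]
    return np.tanh(X @ w[None, :] + b[None, :])             # shape (n, 5)

def projected_loss(theta):
    """VarPro objective: eliminate the linear weights W in closed form.

    For fixed theta, the optimal W solves a linear least-squares problem,
    so the loss becomes a function of theta only.
    """
    Phi = features(theta, X)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)             # optimal W given theta
    r = Phi @ W - Y
    return 0.5 * np.sum(r**2)

# Optimize only over the nonlinear parameters theta. A random search is
# used here purely to keep the sketch short; the paper's GNvpro applies a
# Gauss--Newton scheme to this reduced objective instead.
best_theta = rng.standard_normal(10)
best_loss = projected_loss(best_theta)
for _ in range(500):
    cand = best_theta + 0.1 * rng.standard_normal(10)
    loss = projected_loss(cand)
    if loss < best_loss:
        best_theta, best_loss = cand, loss
```

Because W is recomputed exactly at every step, the search never has to coordinate linear and nonlinear weights, which is the efficiency argument the abstract makes for training DNNs whose last layer is affine.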

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
NA0003525; 2003941; DMS 1751636
OSTI ID:
1834344
Report Number(s):
SAND-2020-8481J; 689974
Journal Information:
SIAM Journal on Mathematics of Data Science, Vol. 3, Issue 4; ISSN 2577-0187
Publisher:
Society for Industrial and Applied Mathematics (SIAM)
Country of Publication:
United States
Language:
English

