OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems

Abstract

Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n²) memory to form an n-by-n kernel matrix and Θ(n³) flops to compute the final model. These time and storage costs prohibit KRR from scaling to large datasets, because n is usually much larger than d in real-world applications. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR needs about 2 TB of memory just to store the kernel matrix. Weak scaling is also a problem: if we keep d and n/p fixed as p (the number of machines) grows, the memory needed grows as Θ(p) per processor and the flops as Θ(p²) per processor. Under perfect weak scaling, both memory and flops per processor would remain Θ(1) (i.e., constant). The traditional distributed KRR implementation (DKRR) achieves only 0.32% weak scaling efficiency when scaling from 96 to 1536 processors. In this work, we propose two new methods to address these problems: Balanced KRR (BKRR) and K-means KRR (KKRR). Both partition the input dataset into p parts, generate p different models, and then select the best model among them. Compared to a conventional implementation, KKRR2 (the optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup at the same accuracy, using the same data and hardware (1536 processors). BKRR2 (the optimized version of BKRR) achieves higher accuracy than the current fastest method with less training time on a variety of datasets. For applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves a 3505x speedup (theoretical speedup: 4096x).
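The Θ(n²) memory and Θ(n³) flop costs the abstract cites come from forming the full kernel matrix and solving a dense linear system. A minimal single-machine sketch (my own illustrative code, not the paper's implementation) makes both costs concrete and reproduces the abstract's 2 TB storage arithmetic for a 520k-sample kernel matrix in double precision:

```python
import numpy as np

def naive_krr_fit(X, y, gamma=0.5, lam=1e-6):
    """Fit KRR by forming the full n-by-n RBF kernel matrix.
    Storage is Theta(n^2); the dense solve is Theta(n^3)."""
    sq = np.sum(X**2, axis=1)
    # Pairwise squared distances -> Gaussian (RBF) kernel, n x n.
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    n = X.shape[0]
    # Solve (K + lam*I) alpha = y -- the Theta(n^3) step.
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(X_train, X_test, alpha, gamma=0.5):
    """Evaluate the fitted model at new points."""
    sq_tr = np.sum(X_train**2, axis=1)
    sq_te = np.sum(X_test**2, axis=1)
    K_te = np.exp(-gamma * (sq_te[:, None] + sq_tr[None, :]
                            - 2.0 * X_test @ X_train.T))
    return K_te @ alpha

# The abstract's storage arithmetic: a 520k-by-520k kernel matrix in
# 8-byte doubles needs 520_000**2 * 8 bytes ~ 2.16e12 bytes ~ 2 TB.
kernel_bytes = 520_000**2 * 8
```

Here only the asymptotics matter; the 2 TB figure follows directly from n² entries at 8 bytes each, independent of d.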
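The partition-train-select idea behind BKRR and KKRR can be sketched serially as follows. This is a hedged illustration, not the authors' distributed code: a random balanced split stands in for the paper's k-means (KKRR) or balanced (BKRR) partitioning, and the names `partitioned_krr` and `rbf_kernel` are mine. The payoff is visible in the kernel sizes: each of the p local models only forms an (n/p)-by-(n/p) kernel, so per-model memory drops by a factor of p².

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between row sets A and B."""
    sa = np.sum(A**2, axis=1)
    sb = np.sum(B**2, axis=1)
    return np.exp(-gamma * (sa[:, None] + sb[None, :] - 2.0 * A @ B.T))

def partitioned_krr(X, y, X_val, y_val, p=4, gamma=1.0, lam=1e-6, seed=0):
    """Split the training set into p balanced parts, fit one KRR model
    per part (each needs only an (n/p)^2 kernel), and keep the model
    with the lowest validation error -- a serial sketch of the
    partition/train/select scheme described in the abstract."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    best = None
    for part in np.array_split(idx, p):
        Xp, yp = X[part], y[part]
        K = rbf_kernel(Xp, Xp, gamma)
        alpha = np.linalg.solve(K + lam * np.eye(len(part)), yp)
        err = np.mean((rbf_kernel(X_val, Xp, gamma) @ alpha - y_val) ** 2)
        if best is None or err < best[0]:
            best = (err, Xp, alpha)
    return best  # (validation error, support points, dual weights)
```

In the paper's distributed setting each part lives on its own processor and the p models are trained concurrently; the loop above simulates that on one machine.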

Authors:
 You, Yang [1]; Demmel, James [1]; Hsieh, Cho-Jui [2]; Vuduc, Richard [3]
  1. Univ. of California, Berkeley, CA (United States)
  2. Univ. of California, Davis, CA (United States)
  3. Georgia Inst. of Technology, Atlanta, GA (United States)
Publication Date:
2018
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1544213
DOE Contract Number:  
AC02-05CH11231; SC0008700
Resource Type:
Conference
Resource Relation:
Conference: 2018 International Conference on Supercomputing, Beijing (China), 12-15 Jun 2018
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

You, Yang, Demmel, James, Hsieh, Cho-Jui, and Vuduc, Richard. Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems. United States: N. p., 2018. Web. doi:10.1145/3205289.3205290.
You, Yang, Demmel, James, Hsieh, Cho-Jui, & Vuduc, Richard. Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems. United States. doi:10.1145/3205289.3205290.
You, Yang, Demmel, James, Hsieh, Cho-Jui, and Vuduc, Richard. 2018. "Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems". United States. doi:10.1145/3205289.3205290. https://www.osti.gov/servlets/purl/1544213.
@article{osti_1544213,
title = {Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems},
author = {You, Yang and Demmel, James and Hsieh, Cho-Jui and Vuduc, Richard},
abstractNote = {Kernel Ridge Regression (KRR) is a fundamental method in machine learning. Given an n-by-d data matrix as input, a traditional implementation requires Θ(n²) memory to form an n-by-n kernel matrix and Θ(n³) flops to compute the final model. These time and storage costs prohibit KRR from scaling to large datasets, because n is usually much larger than d in real-world applications. For example, even on a relatively small dataset (a 520k-by-90 input requiring 357 MB), KRR needs about 2 TB of memory just to store the kernel matrix. Weak scaling is also a problem: if we keep d and n/p fixed as p (the number of machines) grows, the memory needed grows as Θ(p) per processor and the flops as Θ(p²) per processor. Under perfect weak scaling, both memory and flops per processor would remain Θ(1) (i.e., constant). The traditional distributed KRR implementation (DKRR) achieves only 0.32% weak scaling efficiency when scaling from 96 to 1536 processors. In this work, we propose two new methods to address these problems: Balanced KRR (BKRR) and K-means KRR (KKRR). Both partition the input dataset into p parts, generate p different models, and then select the best model among them. Compared to a conventional implementation, KKRR2 (the optimized version of KKRR) improves the weak scaling efficiency from 0.32% to 38% and achieves a 591x speedup at the same accuracy, using the same data and hardware (1536 processors). BKRR2 (the optimized version of BKRR) achieves higher accuracy than the current fastest method with less training time on a variety of datasets. For applications requiring only approximate solutions, BKRR2 improves the weak scaling efficiency to 92% and achieves a 3505x speedup (theoretical speedup: 4096x).},
doi = {10.1145/3205289.3205290},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {6}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

