OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Adaptive Learning for Concept Drift in Application Performance Modeling

Abstract

Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. However, in production HPC systems these conditions might not hold because the conditions of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrade, and configuration updates. These changes could alter the data distribution in a way that affects the accuracy of the predictive performance models and render them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that can automatically identify the location of events that lead to concept drift in near-real time and (2) a moment-matching transformation inspired by transfer learning that converts the training data collected before the drift to be useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.
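The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a univariate observation stream, a Normal-Gamma conjugate model with a constant hazard rate for the changepoint detector (in the style of Adams and MacKay's Bayesian online changepoint detection), and a first-two-moments transform for the pre-drift training targets. All function names and parameter values here are illustrative.

```python
import numpy as np
from math import lgamma

def student_t_pdf(z, df):
    """Standard Student-t density, vectorized over z and df."""
    df = np.asarray(df, dtype=float)
    log_c = (np.vectorize(lgamma)((df + 1) / 2)
             - np.vectorize(lgamma)(df / 2)
             - 0.5 * np.log(df * np.pi))
    return np.exp(log_c) * (1 + z ** 2 / df) ** (-(df + 1) / 2)

def bocpd_mean_shift(data, hazard=0.01, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Bayesian online changepoint detection with a Normal-Gamma conjugate model.

    Returns the MAP run length (points since the last changepoint) after each
    observation; a sudden drop back toward 0 signals a detected drift event.
    """
    T = len(data)
    R = np.zeros((T + 1, T + 1))  # run-length posterior; row t = after t points
    R[0, 0] = 1.0
    mu = np.array([mu0]); kappa = np.array([kappa0])
    alpha = np.array([alpha0]); beta = np.array([beta0])
    map_run = []
    for t, x in enumerate(data):
        # Predictive probability of x under each run-length hypothesis (Student-t)
        scale = np.sqrt(beta * (kappa + 1) / (alpha * kappa))
        pred = student_t_pdf((x - mu) / scale, 2 * alpha) / scale
        # Message passing: either the current run grows, or a changepoint resets it
        R[t + 1, 1:t + 2] = R[t, :t + 1] * pred * (1 - hazard)
        R[t + 1, 0] = np.sum(R[t, :t + 1] * pred * hazard)
        R[t + 1] /= R[t + 1].sum()
        # Conjugate posterior updates, prepending the reset (prior) hypothesis
        mu, kappa, alpha, beta = (
            np.concatenate(([mu0], (kappa * mu + x) / (kappa + 1))),
            np.concatenate(([kappa0], kappa + 1)),
            np.concatenate(([alpha0], alpha + 0.5)),
            np.concatenate(([beta0], beta + kappa * (x - mu) ** 2 / (2 * (kappa + 1)))),
        )
        map_run.append(int(np.argmax(R[t + 1, :t + 2])))
    return map_run

def moment_match(y_old, y_new_sample):
    """Rescale pre-drift targets so their first two moments match post-drift data,
    making the old training set reusable for retraining after a detected drift."""
    y_old = np.asarray(y_old, dtype=float)
    y_new_sample = np.asarray(y_new_sample, dtype=float)
    z = (y_old - y_old.mean()) / y_old.std()
    return z * y_new_sample.std() + y_new_sample.mean()
```

On a synthetic stream whose mean jumps partway through, the MAP run length grows steadily before the jump and collapses to a small value just after it; the small post-drift sample then parameterizes `moment_match` to recycle the pre-drift data.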

Authors:
Madireddy, Sandeep; Balaprakash, Prasanna; Carns, Philip; Latham, Robert; Lockwood, Glenn K.; Ross, Robert; Snyder, Shane; Wild, Stefan M.
Publication Date:
2019
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR) - Scientific Discovery through Advanced Computing (SciDAC)
OSTI Identifier:
1574301
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 48th International Conference on Parallel Processing, 08/05/19 - 08/08/19, Kyoto, JP
Country of Publication:
United States
Language:
English
Subject:
HPC performance modeling; I/O performance models; adaptive learning; concept drift; online change point detection; temporal learning

Citation Formats

Madireddy, Sandeep, Balaprakash, Prasanna, Carns, Philip, Latham, Robert, Lockwood, Glenn K., Ross, Robert, Snyder, Shane, and Wild, Stefan M. Adaptive Learning for Concept Drift in Application Performance Modeling. United States: N. p., 2019. Web. doi:10.1145/3337821.3337922.
Madireddy, Sandeep, Balaprakash, Prasanna, Carns, Philip, Latham, Robert, Lockwood, Glenn K., Ross, Robert, Snyder, Shane, & Wild, Stefan M. Adaptive Learning for Concept Drift in Application Performance Modeling. United States. https://doi.org/10.1145/3337821.3337922
Madireddy, Sandeep, Balaprakash, Prasanna, Carns, Philip, Latham, Robert, Lockwood, Glenn K., Ross, Robert, Snyder, Shane, and Wild, Stefan M. 2019. "Adaptive Learning for Concept Drift in Application Performance Modeling". United States. https://doi.org/10.1145/3337821.3337922. https://www.osti.gov/servlets/purl/1574301.
@article{osti_1574301,
title = {Adaptive Learning for Concept Drift in Application Performance Modeling},
author = {Madireddy, Sandeep and Balaprakash, Prasanna and Carns, Philip and Latham, Robert and Lockwood, Glenn K. and Ross, Robert and Snyder, Shane and Wild, Stefan M.},
abstractNote = {Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. However, in production HPC systems these conditions might not hold because the conditions of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrade, and configuration updates. These changes could alter the data distribution in a way that affects the accuracy of the predictive performance models and render them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that can automatically identify the location of events that lead to concept drift in near-real time and (2) a moment-matching transformation inspired by transfer learning that converts the training data collected before the drift to be useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.},
doi = {10.1145/3337821.3337922},
url = {https://www.osti.gov/biblio/1574301},
place = {United States},
year = {2019},
month = {1}
}


Works referenced in this record:

Collective I/O Tuning Using Analytical and Machine Learning Models
conference, September 2015


24/7 Characterization of petascale I/O workloads
conference, August 2009


Detection of Recovery Patterns in Cluster Systems Using Resource Usage Data
conference, January 2017


Robust Online Time Series Prediction with Recurrent Neural Networks
conference, October 2016


Problem Determination in Enterprise Middleware Systems using Change Point Correlation of Time Series Data
conference, January 2006


A survey of methods for time series change point detection
journal, September 2016


Extremely randomized trees
journal, March 2006


Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
journal, November 2018


The Changepoint Model for Statistical Process Control
journal, October 2003


PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing
conference, January 2005


Pilot: A Framework that Understands How to Do Performance Benchmarks the Right Way
conference, September 2016

  • Li, Yan; Gupta, Yash; Miller, Ethan L.
  • 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
  • https://doi.org/10.1109/MASCOTS.2016.31

A Year in the Life of a Parallel File System
conference, November 2018


Performance modeling under resource constraints using deep transfer learning
conference, January 2017

  • Marathe, Aniruddha; Anirudh, Rushil; Jain, Nikhil
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
  • https://doi.org/10.1145/3126908.3126969

A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data
journal, January 2014


Two Nonparametric Control Charts for Detecting Arbitrary Distribution Changes
journal, April 2012


Bayesian Online Learning of the Hazard Rate in Change-Point Problems
journal, September 2010


Comparisons of various types of normality tests
journal, December 2011