OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Failure prediction using machine learning in a virtualised HPC system and application

Abstract

Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remain challenging research problems. Existing fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective and proactive failure management approach aimed at minimizing the effect of failure within the system. Machine learning techniques, which learn from past information to predict future patterns of behaviour, make it possible to predict potential system failure more accurately. In this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that the average prediction accuracy of our model using SVM is 90%, making it accurate and effective compared to the other algorithms. This finding implies that our method can effectively predict all possible future system and application failures within the system.
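The abstract describes combining time series with classifiers such as KNN and then comparing prediction accuracy. The following is a minimal sketch of that general idea, not the authors' model: it uses only the Python standard library, and the synthetic load trace, the window width, the value of k, and every function name are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def make_windows(series, labels, width):
    """Turn a metric time series into (feature_window, label) pairs.

    The label of a window is the label of the sample right after it,
    so the classifier predicts one step ahead.
    """
    return [
        (series[i:i + width], labels[i + width])
        for i in range(len(series) - width)
    ]


def knn_predict(train, window, k=5):
    """Majority vote among the k training windows closest to `window`."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], window))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=5):
    """Fraction of test windows whose predicted label matches the true one."""
    hits = sum(knn_predict(train, w, k) == y for w, y in test)
    return hits / len(test)


def synthetic_trace(n, seed=0):
    """Hypothetical load trace (an assumption): load ramps up before a failure."""
    rng = random.Random(seed)
    series, labels = [], []
    for t in range(n):
        phase = t % 50
        ramp = max(0, phase - 40) * 0.08  # load climbs late in each 50-step cycle
        series.append(0.3 + ramp + rng.gauss(0, 0.02))
        labels.append(1 if phase >= 45 else 0)  # 1 = failure state
    return series, labels


series, labels = synthetic_trace(1000)
pairs = make_windows(series, labels, width=5)
split = int(0.7 * len(pairs))  # chronological split keeps the time order intact
train, test = pairs[:split], pairs[split:]
acc = accuracy(train, test, k=5)
print(f"held-out accuracy: {acc:.2f}")
```

Because failures are rare in this synthetic trace, a classifier that always predicts "no failure" would already score about 90% here, so accuracy alone can overstate usefulness; a comparison across algorithms would normally be run on the same windows and the same chronological split.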

Authors:
Mohammed, Bashir [1]; Awan, Irfan [1]; Ugail, Hassan [1]; Younas, Muhammad [2]
  1. Univ. of Bradford (United Kingdom). School of Electrical Engineering and Computer Science
  2. Oxford Brookes Univ. (United Kingdom). Dept. of Computing & Communication Technologies
Publication Date:
Research Org.:
Lawrence Berkeley National Laboratory-National Energy Research Scientific Computing Center (NERSC)
Sponsoring Org.:
USDOE
OSTI Identifier:
1526981
Resource Type:
Journal Article
Journal Name:
Cluster Computing
Additional Journal Information:
Journal Volume: 22; Journal Issue: 2; Journal ID: ISSN 1386-7857
Country of Publication:
United States
Language:
English

Citation Formats

Mohammed, Bashir, Awan, Irfan, Ugail, Hassan, and Younas, Muhammad. Failure prediction using machine learning in a virtualised HPC system and application. United States: N. p., 2019. Web. doi:10.1007/s10586-019-02917-1.
Mohammed, Bashir, Awan, Irfan, Ugail, Hassan, & Younas, Muhammad. Failure prediction using machine learning in a virtualised HPC system and application. United States. doi:10.1007/s10586-019-02917-1.
Mohammed, Bashir, Awan, Irfan, Ugail, Hassan, and Younas, Muhammad. Thu . "Failure prediction using machine learning in a virtualised HPC system and application". United States. doi:10.1007/s10586-019-02917-1.
@article{osti_1526981,
title = {Failure prediction using machine learning in a virtualised HPC system and application},
author = {Mohammed, Bashir and Awan, Irfan and Ugail, Hassan and Younas, Muhammad},
abstractNote = {Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remains a challenging research problem. Traditional existing fault-tolerance strategies such as regular check-pointing and replication are not adequate because of the emerging complexities of high performance computing systems. This necessitates the importance of having an effective as well as proactive failure management approach in place aimed at minimizing the effect of failure within the system. With the advent of machine learning techniques, the ability to learn from past information to predict future pattern of behaviours makes it possible to predict potential system failure more accurately. Thus, in this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that the average prediction accuracy of our model using SVM when predicting failure is 90% accurate and effective compared to other algorithms. This finding implies that our method can effectively predict all possible future system and application failures within the system.},
doi = {10.1007/s10586-019-02917-1},
journal = {Cluster Computing},
issn = {1386-7857},
number = 2,
volume = 22,
place = {United States},
year = {2019},
month = {3}
}

Works referenced in this record:

Analyzing real cluster data for formulating allocation algorithms in cloud platforms
journal, May 2016

  • Beaumont, Olivier; Eyraud-Dubois, Lionel; Lorenzo-del-Castillo, Juan-Angel
  • Parallel Computing, Vol. 54
  • DOI: 10.1016/j.parco.2015.07.001

Failover strategy for fault tolerance in cloud computing environment
journal, April 2017

  • Mohammed, Bashir; Kiran, Mariam; Maiyama, Kabiru M.
  • Software: Practice and Experience, Vol. 47, Issue 9
  • DOI: 10.1002/spe.2491

A Large-Scale Study of Failures in High-Performance Computing Systems
journal, October 2010

  • Schroeder, Bianca; Gibson, Garth A.
  • IEEE Transactions on Dependable and Secure Computing, Vol. 7, Issue 4
  • DOI: 10.1109/TDSC.2009.4

Software defect prediction techniques using metrics based on neural network classifier
journal, February 2018


Software reliability modeling using increased failure interval with ANN
journal, March 2018


Risk Prediction Model Based on Improved AdaBoost Method for Cloud Users
journal, February 2015


Automatic classification of data-warehouse-data for information lifecycle management using machine learning techniques
journal, July 2016

  • Büsch, Sebastian; Nissen, Volker; Wünscher, Arndt
  • Information Systems Frontiers, Vol. 19, Issue 5
  • DOI: 10.1007/s10796-016-9680-8

Performance prediction of parallel computing models to analyze cloud-based big data applications
journal, November 2017


A survey of deep learning-based network anomaly detection
journal, September 2017


Recent advancements in resource allocation techniques for cloud computing environment: a systematic review
journal, December 2016

  • Madni, Syed Hamid Hussain; Latiff, Muhammad Shafie Abd; Coulibaly, Yahaya
  • Cluster Computing, Vol. 20, Issue 3
  • DOI: 10.1007/s10586-016-0684-4

An overview of statistical learning theory
journal, January 1999

  • Vapnik, V. N.
  • IEEE Transactions on Neural Networks, Vol. 10, Issue 5
  • DOI: 10.1109/72.788640

A Combinatorial Approach to Piecewise Linear Time Series Analysis
journal, March 2002

  • Medeiros, Marcelo C.; Veiga, Alvaro; Resende, Mauricio G. C.
  • Journal of Computational and Graphical Statistics, Vol. 11, Issue 1
  • DOI: 10.1198/106186002317375712

A comparative study of neural network and Box-Jenkins ARIMA modeling in time series prediction
journal, April 2002


A study on performance measures for auto-scaling CPU-intensive containerized applications
journal, January 2019