skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information
  1. Network Bandwidth Utilization Forecast Model on High Bandwidth Network

    With the increasing number of geographically distributed scientific collaborations and the scale of the data size growth, it has become more challenging for users to achieve the best possible network performance on a shared network. We have developed a forecast model to predict expected bandwidth utilization for high-bandwidth wide area network. The forecast model can improve the efficiency of resource utilization and scheduling data movements on high-bandwidth network to accommodate ever increasing data volume for large-scale scientific data applications. Univariate model is developed with STL and ARIMA on SNMP path utilization data. Compared with traditional approach such as Box-Jenkins methodology,more » our forecast model reduces computation time by 83.2percent. It also shows resilience against abrupt network usage change. The accuracy of the forecast model is within the standard deviation of the monitored measurements.« less
  2. Time-Series Forecast Modeling on High-Bandwidth Network Measurements

    With the increasing number of geographically distributed scientific collaborations and the growing sizes of scientific data, it has become challenging for users to achieve the best possible network performance on a shared network. In this paper, we have developed a model to forecast expected bandwidth utilization on high-bandwidth wide area networks. The forecast model can improve the efficiency of the resource utilization and scheduling of data movements on high-bandwidth networks to accommodate ever increasing data volume for large-scale scientific data applications. A univariate time-series forecast model is developed with the Seasonal decomposition of Time series by Loess (STL) and themore » AutoRegressive Integrated Moving Average (ARIMA) on Simple Network Management Protocol (SNMP) path utilization measurement data. Compared with the traditional approach such as Box-Jenkins methodology to train the ARIMA model, our forecast model reduces computation time up to 92.6 %. It also shows resilience against abrupt network usage changes. Finally, our forecast model conducts the large number of multi-step forecast, and the forecast errors are within the mean absolute deviation (MAD) of the monitored measurements.« less
  3. UMAMI: A Recipe for Generating Meaningful Metrics through Holistic I/O Performance Analysis

    I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metricsmore » interface (UMAMI) to provide a normalized, holistic view of I/O behavior. We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct largescale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.« less
  4. A lightweight network anomaly detection technique

    While the network anomaly detection is essential in network operations and management, it becomes further challenging to perform the first line of detection against the exponentially increasing volume of network traffic. In this paper, we develop a technique for the first line of online anomaly detection with two important considerations: (i) availability of traffic attributes during the monitoring time, and (ii) computational scalability for streaming data. The presented learning technique is lightweight and highly scalable with the beauty of approximation based on the grid partitioning of the given dimensional space. With the public traffic traces of KDD Cup 1999 andmore » NSL-KDD, we show that our technique yields 98.5% and 83% of detection accuracy, respectively, only with a couple of readily available traffic attributes that can be obtained without the help of post-processing. Finally, the results are at least comparable with the classical learning methods including decision tree and random forest, with approximately two orders of magnitude faster learning performance.« less
  5. Machine learning based job status prediction in scientific clusters

    Large high-performance computing systems are built with increasing number of components with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent to the unsuccessful jobs. Here, we set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forestsmore » algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% the cases with 83.6% recall and 94.8% precision. Lastly, this prediction accuracy can be sufficiently high that it can be used to mitigation procedures of predicted failures.« less
  6. Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

    Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance involving terabytes or petabytes of workflow data or measurement data of the executions, from complex workflows over a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks or debug the performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework, using state-ofthe-more » art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply the most sophisticated statistical tools and data mining methods on the performance data. It utilizes an efficient data processing engine to allow users to interactively analyze a large amount of different types of logs and measurements. To illustrate the functionality of the big data analysis framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and the job logs from the genome analysis scientific cluster. Our study processed many terabytes of system logs and application performance measurements collected on the HPC systems at NERSC. The implementation of our tool is generic enough to be used for analyzing the performance of other HPC systems and Big Data workows.« less
  7. PATHA: Performance Analysis Tool for HPC Applications

    Large science projects rely on complex workflows to analyze terabytes or petabytes of data. These jobs are often running over thousands of CPU cores and simultaneously performing data accesses, data movements, and computation. It is difficult to identify bottlenecks or to debug the performance issues in these large workflows. In order to address these challenges, we have developed Performance Analysis Tool for HPC Applications (PATHA) using the state-of-art open source big data processing tools. Our framework can ingest system logs to extract key performance measures, and apply the most sophisticated statistical tools and data mining methods on the performance data.more » Furthermore, it utilizes an efficient data processing engine to allow users to interactively analyze a large amount of different types of logs and measurements. To illustrate the functionality of PATHA, we conduct a case study on the workflows from an astronomy project known as the Palomar Transient Factory (PTF). This study processed 1.6 TB of system logs collected on the NERSC supercomputer Edison. Using PATHA, we were able to identify performance bottlenecks, which reside in three tasks of PTF workflow with the dependency on the density of celestial objects.« less

Search for:
All Records
Creator / Author
"Yoo, Wucherl"

Refine by:
Resource Type
Availability
Creator / Author
Research Organization