OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable Fault Detection and Localization of Network Issues

Technical Report
OSTI ID: 1432840
 [1];  [1];  [1];  [1];  [1];  [2];  [3]
  1. Intelligent Automation, Inc., Rockville, MD (United States)
  2. Univ. of South Florida, Tampa, FL (United States)
  3. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

In this report, we describe the progress we have made towards developing NetFaultSONAR: a modular, cloud-based, scalable and extensible advanced network and cyber analysis tool to address the above needs. Towards this goal, our Phase I effort focused mainly on a feasibility study, a preliminary system implementation, and an initial evaluation to demonstrate the feasibility of a cloud-based network and cyber anomaly detection and analysis system. Our accomplishments in Phase I are summarized as follows:

  1. Investigated network faults and symptoms and the relationship between them. We identified a preliminary list of faults and symptoms across different protocol layers. We also identified a list of typical cyber threats to enterprise networks.
  2. Identified suitable network measurement tools that provide raw data input for anomaly detection and analysis in our Phase I implementation, including Traceroute, NetFlow, Syslog, PingER, BWCTL, and OWAMP. We also investigated the metrics for measuring the status of the monitored network and how to interpret the network status based on these metrics.
  3. Developed anomaly detection algorithms for the different types of measurement data collected from the identified tools, including multiple kinds of perfSONAR data and non-perfSONAR data (such as NetFlow and Syslog). A wide range of existing detection algorithms was studied and evaluated. We also studied how to correlate information across different types of data traces to localize problems, or narrow down their potential location(s), and perform resolution.
  4. Designed and developed the root-cause analysis scheme. We developed a root-cause graphical model to assist fault diagnosis. Specifically, we studied the inter-dependency among various faults and symptoms, and built a comprehensive global root-cause graphical model that contains all available information, including service dependency, fault-symptom dependency and network topology. Our proposed scheme then compresses the global root-cause graph into a bipartite graph in order to simplify the graph model and achieve better scalability and computational efficiency. Using this bipartite graph, NetFaultSONAR can help users (i.e., network operators) troubleshoot network issues based on the observed symptoms.
  5. Implemented the proposed scheme for feasibility validation and performance evaluation, using real data traces collected from the identified measurement tools, and emulated data when real data were not available. Specifically, we developed anomaly detection algorithms for each type of measurement data, and investigated and developed event (anomaly) correlation algorithms for the different types of perfSONAR measurement data. We also implemented our root-cause analysis approach on a simulation platform, in the same scenario, to test and validate its feasibility, scalability and complexity.

In brief, we have performed feasibility studies, a preliminary system implementation and an initial performance evaluation of our proposed NetFaultSONAR approach for network and cyber analysis. The final goal is a system that can automatically collect network data, status and configuration, perform anomaly detection and analysis to pinpoint the root cause, and assist the network operator(s) in determining the most efficient action to correct existing issues. With the above general introduction, the rest of the document presents our Phase I research results and findings.
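The abstract does not say which anomaly detection algorithms were selected, so the following is only a minimal sketch of one common approach that could be applied to a delay series such as OWAMP one-way latency: a robust z-score computed from the median and median absolute deviation (MAD) of a trailing window. The function name `mad_anomalies`, the window size, the threshold and the sample values are all assumptions made for illustration and do not come from the report.

```python
from statistics import median


def mad_anomalies(samples, window=10, threshold=3.5):
    """Return indices of samples that deviate strongly from the trailing window.

    Illustrative detector only; the report does not specify its algorithms.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        center = median(history)
        mad = median(abs(x - center) for x in history)
        # 1.4826 rescales the MAD so it is comparable to a standard deviation
        # for normally distributed data; the floor avoids division by zero
        # when the history is perfectly flat.
        scale = max(1.4826 * mad, 1e-9)
        if abs(samples[i] - center) / scale > threshold:
            flagged.append(i)
    return flagged


# Hypothetical one-way delay samples in milliseconds, with a single spike.
delays_ms = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 19.8, 20.0, 20.2, 20.1, 55.0, 20.0]
print(mad_anomalies(delays_ms))  # [10]
```

Median-based statistics are used here because a single spike inside the window barely shifts them, so the sample that follows the spike is not falsely flagged.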
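The compression of the global root-cause graph into a bipartite graph can be pictured with a small sketch. The abstract does not describe the compression method, so this example assumes one plausible reading: the global model is a directed acyclic graph of cause-to-effect edges, and each root fault is linked directly to every leaf symptom reachable from it. The ranking step, the function names and the fault and symptom labels are likewise hypothetical.

```python
def compress_to_bipartite(edges):
    """Collapse a causal DAG into {root fault: set of leaf symptoms it can reach}."""
    children = {}
    nodes, has_parent = set(), set()
    for cause, effect in edges:
        children.setdefault(cause, set()).add(effect)
        nodes.update((cause, effect))
        has_parent.add(effect)

    bipartite = {}
    for root in nodes - has_parent:  # root faults have no incoming edge
        seen, stack, leaves = set(), [root], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            kids = children.get(node, ())
            if not kids:  # a node with no outgoing edge is an observable symptom
                leaves.add(node)
            stack.extend(kids)
        bipartite[root] = leaves
    return bipartite


def rank_faults(bipartite, observed):
    """Order candidate faults by Jaccard overlap with the observed symptoms."""
    scores = []
    for fault, symptoms in bipartite.items():
        explained = len(symptoms & observed)
        if explained:
            scores.append((explained / len(symptoms | observed), fault))
    return [fault for _, fault in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical global graph: faults, intermediate effects, and symptoms.
edges = [
    ("link_failure", "route_flap"),
    ("route_flap", "packet_loss"),
    ("route_flap", "high_delay"),
    ("congestion", "queue_buildup"),
    ("queue_buildup", "high_delay"),
    ("queue_buildup", "low_throughput"),
    ("dns_misconfig", "name_lookup_timeout"),
]
graph = compress_to_bipartite(edges)
print(rank_faults(graph, {"packet_loss", "high_delay"}))
# ['link_failure', 'congestion']
```

Once the intermediate nodes are collapsed, diagnosis reduces to set operations between each fault's symptom set and the observed symptoms, which is one way a bipartite model can lower the cost of each query.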

Research Organization:
Intelligent Automation, Inc., Rockville, MD (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
SC0011380
OSTI ID:
1432840
Type / Phase:
SBIR (Phase I)
Country of Publication:
United States
Language:
English