skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Data Intensive Architecture for Scalable Cyber Analytics

Conference ·

Cyber analysts are tasked with the identification and mitigation of network exploits and threats. These compromises are difficult to identify due to the characteristics of cyber communication, the volume of traffic, and the duration of possible attack. It is necessary to have analytical tools to help analysts identify anomalies that span seconds, days, and weeks. Unfortunately, providing analytical tools effective access to the volumes of underlying data requires novel architectures, which is often overlooked in operational deployments. Our work is focused on a summary record of communication, called a flow. Flow records are intended to summarize a communication session between a source and a destination, providing a level of aggregation from the base data. Despite this aggregation, many enterprise network perimeter sensors store millions of network flow records per day. The volume of data makes analytics difficult, requiring the development of new techniques to efficiently identify temporal patterns and potential threats. The massive volume makes analytics difficult, but there are other characteristics in the data which compound the problem. Within the billions of records of communication that transact, there are millions of distinct IP addresses involved. Characterizing patterns of entity behavior is very difficult with the vast number of entities that exist in the data. Research has struggled to validate a model for typical network behavior with hopes it will enable the identification of atypical behavior. Complicating matters more, typically analysts are only able to visualize and interact with fractions of data and have the potential to miss long term trends and behaviors. Our analysis approach focuses on aggregate views and visualization techniques to enable flexible and efficient data exploration as well as the capability to view trends over long periods of time. Realizing that interactively exploring summary data allowed analysts to effectively identify events, we utilized multidimensional OLAP data cubes. The data cube structure supports interactive analysis of summary data across multiple dimensions, such as location, time, and protocol. Cube technology also allows the analyst to drill-down into the underlying data set, when events of interest are identified and detailed analysis is required. Unfortunately, when creating these cubes, we ran into significant performance issues with our initial architecture, caused by a combination of the data volume and attribute characteristics. Overcoming, these issues required us to develop a novel, data intensive computing infrastructure. In particular, we ended up combining a Netezza Twin Fin data warehouse appliance, a solid state Fusion IO ioDrive, and the Tableau Desktop business intelligence analytic software. Using this architecture, we were able to analyze a month's worth of flow records comprising 4.9B records, totaling approximately 600GB of data. This paper describes our architecture, the challenges that we encountered, and the work that remains to deploy a fully generalized cyber analytical infrastructure.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1038400
Report Number(s):
PNNL-SA-79393; TRN: US201208%%490
Resource Relation:
Conference: Proceedings of the 2011 IEEE International Conference on Technologies of Homeland Security, November 15-17, 2011, Waltham, Massachusetts, 390-395
Country of Publication:
United States
Language:
English