A Data Processing Pipeline To Extract A Knowledge Graph From Heterogeneous Data For Socio-technical Analysis Of Critical Infrastructure Influence
- Idaho National Laboratory (INL), Idaho Falls, ID (United States)
The code is written in Python and is organized as a pipeline implemented in Apache Airflow. The pipeline is intended to identify the companies that are directly or indirectly involved with a type of critical infrastructure (CI) system at some point in that system's lifecycle. It takes a configuration file that specifies a list of initial companies to consider, a geographic region of interest, and a set of SEC form types as well as other data sources (e.g. Crunchbase) from which to extract entities and relations. There are four main components to this pipeline as currently implemented: Entity Extraction, Network Construction, Analysis, and Visualization.

First, Entity Extraction is implemented as the `topgear-extract_organizations` Apache Airflow workflow. Given an initial query that specifies a geographic region of interest and a time interval, the software extracts CI facilities of interest and organizations that have a direct influence relationship to those facilities (e.g. ownership). During the course of the LDRD, we focused on Electric Vehicle (EV) charging stations; this information is available via the Department of Energy (DOE) database on fueling stations maintained by NREL. Within the context of the DOE CESER project, we have focused on Battery Energy Storage Systems (BESS).

Second, the Network Construction component iteratively constructs a social network graph from the set of organizations and people extracted in the previous step. Organizations (and eventually people, if desired) are fed as a query to the `topgear-construct_social_network` Apache Airflow workflow, which takes a set of initial companies and data sets (e.g. SEC EDGAR form types, OpenCorporates, Crunchbase). This workflow iteratively queries those data sources to discover relationships with new organizations and people.
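The iterative discovery described above can be pictured as a breadth-first expansion from the seed organizations. The following is only a minimal sketch under stated assumptions, not the TOPGEAR implementation: the function `expand_network`, the `max_depth` parameter, the relation labels, and the two in-memory "data sources" are all hypothetical stand-ins for real queries against sources such as SEC EDGAR or OpenCorporates.

```python
from collections import deque
from typing import Callable, Iterable, Tuple

# A data source maps an entity name to (relation, related_entity) pairs.
# This type and everything below are illustrative assumptions.
DataSource = Callable[[str], Iterable[Tuple[str, str]]]


def expand_network(seeds, sources, max_depth=2):
    """Breadth-first expansion from seed organizations.

    Returns a set of (entity, relation, related_entity) edges.
    """
    edges = set()
    seen = set(seeds)
    queue = deque((name, 0) for name in seeds)
    while queue:
        entity, depth = queue.popleft()
        if depth >= max_depth:
            continue  # entities at the depth limit are kept but not queried
        for source in sources:
            for relation, other in source(entity):
                edges.add((entity, relation, other))
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
    return edges


def toy_filings(entity):
    """Hypothetical stand-in for relations parsed from filings."""
    table = {
        "Acme Charging": [("has_parent", "Acme Holdings")],
        "Acme Holdings": [("has_investor", "Example Capital")],
    }
    return table.get(entity, [])


def toy_registry(entity):
    """Hypothetical stand-in for a corporate registry lookup."""
    table = {"Acme Charging": [("has_officer", "J. Doe")]}
    return table.get(entity, [])


network = expand_network(["Acme Charging"], [toy_filings, toy_registry])
```

The `seen` set keeps each entity from being queried more than once, and the depth limit bounds how far the expansion travels from the initial companies; both are design choices of this sketch rather than documented behavior of the workflow.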
For example, this module can iteratively query SEC EDGAR for metadata that documents the number of each type of form for the given set of companies and their location. This form metadata represents a catalog of data sources from SEC EDGAR for the extracted social network knowledge graph. The pipeline then downloads these forms from the website and saves them in a build directory for further processing. These documents are then parsed for entities and relations. In addition to SEC data sources, this step can also pull in information on organizations via API services such as Crunchbase and OpenCorporates, or from bulk data sources. At the end of this step, the resultant social network, the Critical Infrastructure network, and the edges that encode relationships between organizations and CI facilities together form the Adversarial Socio-Technical Network (ASTN) that informs the analysis.

Third, the Analysis component processes the generated ASTN. Previously, this has included the ability to compare the prevalence of different vendors for a given infrastructure component type across different regions, as well as to identify common public and private investors across those vendors. This was demonstrated for EV charging stations across several different metropolitan areas in an IEEE PES GridEdge publication. More recently, we have looked at ways to identify the infrastructure owners and operators of BESS with the most nameplate capacity across different states, as well as other indicators of risk resulting from changes in ownership over time.

Finally, the Visualization component consists of an HTML/CSS/JS framework through which users can interact with geospatial, operational, and organizational relationships across a given portfolio of Critical Infrastructure facilities. The objective is to provide a library of UI/UX modules that can be repurposed for stakeholder-specific dashboards.
All of the modules are related via a common event model that enables UI actions in one view to percolate across the other views.
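As an illustration of the kind of question the Analysis component addresses, the sketch below ranks BESS owners by total nameplate capacity within each state. It is a hedged example only: the edge layout `(owner, facility, state, nameplate_capacity_mw)`, the function names, and the sample values are assumptions made for illustration and are not drawn from the TOPGEAR code or its data.

```python
from collections import defaultdict

# Hypothetical ownership edges: (owner, facility, state, nameplate_capacity_mw).
ownership_edges = [
    ("Owner A", "BESS-1", "ID", 50.0),
    ("Owner A", "BESS-2", "ID", 25.0),
    ("Owner B", "BESS-3", "ID", 60.0),
    ("Owner B", "BESS-4", "CA", 100.0),
    ("Owner C", "BESS-5", "CA", 40.0),
]


def capacity_by_owner(edges):
    """Sum nameplate capacity per state and owner."""
    totals = defaultdict(lambda: defaultdict(float))
    for owner, _facility, state, capacity_mw in edges:
        totals[state][owner] += capacity_mw
    return totals


def top_owners(edges, n=1):
    """Return, for each state, the n owners with the most total capacity."""
    ranked = {}
    for state, owners in capacity_by_owner(edges).items():
        ordered = sorted(owners.items(), key=lambda item: item[1], reverse=True)
        ranked[state] = ordered[:n]
    return ranked


leaders = top_owners(ownership_edges)
```

Running the same aggregation over ownership edges from different dates would be one way to surface changes in ownership over time, although how the software actually tracks such changes is not described here.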
- Short Name / Acronym:
- TOPGEAR: Technology, Organization, or Person of i
- Project Type:
- Closed Source
- Software Type:
- Scientific
- Research Organization:
- Idaho National Laboratory (INL), Idaho Falls, ID (United States)
- Sponsoring Organization:
- USDOE Office of Nuclear Energy (NE)
- Primary Award/Contract Number:
- AC07-05ID14517
- DOE Contract Number:
- AC07-05ID14517
- Code ID:
- 178006
- OSTI ID:
- code-178006
- Country of Origin:
- United States
Similar Records
- A Data Processing Pipeline To Extract A Knowledge Graph From Sec Documents For Socio-technical Analysis Of Critical Infrastructure Influence
- Software · Mon May 06 20:00:00 EDT 2024 · OSTI ID: code-145418
- A Data Processing Pipeline for Socio-Technical Network Analysis [Slides]
- Technical Report · Sun Apr 30 20:00:00 EDT 2023 · OSTI ID: 2007807
- A Data Processing Pipeline for Adversarial Socio-Technical Network Analysis
- Conference · Mon Jun 12 20:00:00 EDT 2023 · OSTI ID: 2006801