Abstract
The code is written in Python and implements a pipeline in Apache Airflow. The pipeline aims to identify the companies that are directly or indirectly involved with a type of critical infrastructure system at some point in that system's lifecycle. It takes a configuration file that specifies a list of initial companies to consider, a geographic region of interest (a disk) expressed as a latitude/longitude point and a distance, and a set of SEC form types from which to extract entities and relations. The pipeline as currently implemented has three main components: Social Network Extraction, Critical Infrastructure Network Extraction, and Inference and Fusion.
First, Social Network Extraction, implemented as the `organizations_sec` component of the workflow graph, queries the SEC EDGAR web service using the list of initial companies from the configuration file. From the results, it extracts metadata that documents the number of forms of each type for the given set of companies and their location. This form metadata serves as a catalog of data sources for the extracted social network knowledge graph. The pipeline then downloads these forms from the website and saves them in a build directory for further processing. These documents are then parsed for entities and relations.
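The configuration and region test described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not the project's actual code: the `PipelineConfig` class and its field names, the dictionary layout of a station record, the `owns` predicate, the placeholder company names, and the choice of the haversine formula to decide whether a point lies inside the disk.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical stand-in for the pipeline's configuration file."""
    companies: Tuple[str, ...]   # initial companies to consider
    center_lat: float            # disk center latitude, in degrees
    center_lon: float            # disk center longitude, in degrees
    radius_km: float             # disk radius, in kilometers
    form_types: Tuple[str, ...]  # SEC form types to process


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_region(cfg: PipelineConfig, lat: float, lon: float) -> bool:
    """True if the point lies inside the configured disk."""
    return haversine_km(cfg.center_lat, cfg.center_lon, lat, lon) <= cfg.radius_km


def fuse(cfg: PipelineConfig, stations: List[Dict]) -> List[Tuple[str, str, str]]:
    """Emit (subject, predicate, object) triples that link configured
    companies to the stations they own inside the disk (assumed relation)."""
    return [
        (s["owner"], "owns", s["id"])
        for s in stations
        if s["owner"] in cfg.companies and in_region(cfg, s["lat"], s["lon"])
    ]


# Illustrative data only: placeholder companies and coordinates.
cfg = PipelineConfig(
    companies=("ExampleCo",),
    center_lat=43.4917, center_lon=-112.0339,
    radius_km=50.0,
    form_types=("10-K",),
)
stations = [
    {"id": "station-1", "owner": "ExampleCo", "lat": 43.50, "lon": -112.03},
    {"id": "station-2", "owner": "ExampleCo", "lat": 43.6150, "lon": -116.2023},
    {"id": "station-3", "owner": "OtherCo", "lat": 43.50, "lon": -112.03},
]
triples = fuse(cfg, stations)
print(triples)
```

In this sketch only `station-1` produces a triple: `station-2` lies far outside the 50 km disk, and `station-3` belongs to a company that is not in the configured list. A real fusion step would presumably draw on relations parsed from the SEC forms rather than a literal owner field.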
- Developers:
- Weaver, Gabriel [1]
- [1] Idaho National Laboratory (INL), Idaho Falls, ID (United States)
- Release Date:
- 2024-05-07
- Project Type:
- Closed Source
- Software Type:
- Scientific
- Programming Languages:
- Python
- Sponsoring Org.:
- USDOE Office of Nuclear Energy (NE)
- Primary Award/Contract Number:
- AC07-05ID14517
- Code ID:
- 145418
- Research Org.:
- Idaho National Laboratory (INL), Idaho Falls, ID (United States)
- Country of Origin:
- United States
- Keywords:
- socio-technical network analysis (STNA); Electric Vehicles; multilayer networks
Citation Formats
Weaver, Gabriel A. A Data Processing Pipeline To Extract A Knowledge Graph From Sec Documents For Socio-technical Analysis Of Critical Infrastructure Influence. Computer Software. USDOE Office of Nuclear Energy (NE). 07 May. 2024. Web. doi:10.11578/dc.20241010.1.
Weaver, Gabriel A. (2024, May 07). A Data Processing Pipeline To Extract A Knowledge Graph From Sec Documents For Socio-technical Analysis Of Critical Infrastructure Influence. [Computer software]. https://doi.org/10.11578/dc.20241010.1.
Weaver, Gabriel A. "A Data Processing Pipeline To Extract A Knowledge Graph From Sec Documents For Socio-technical Analysis Of Critical Infrastructure Influence." Computer software. May 07, 2024. https://doi.org/10.11578/dc.20241010.1.
@misc{doecode_145418,
title = {A Data Processing Pipeline To Extract A Knowledge Graph From Sec Documents For Socio-technical Analysis Of Critical Infrastructure Influence},
author = {Weaver, Gabriel A.},
abstractNote = {The code is written in Python and implements a pipeline in Apache Airflow. The pipeline aims to identify the companies that are directly or indirectly involved with a type of critical infrastructure system at some point in that system's lifecycle. It takes a configuration file that specifies a list of initial companies to consider, a geographic region of interest (a disk) expressed as a latitude/longitude point and a distance, and a set of SEC form types from which to extract entities and relations. The pipeline as currently implemented has three main components: Social Network Extraction, Critical Infrastructure Network Extraction, and Inference and Fusion.
First, Social Network Extraction, implemented as the `organizations_sec` component of the workflow graph, queries the SEC EDGAR web service using the list of initial companies from the configuration file. From the results, it extracts metadata that documents the number of forms of each type for the given set of companies and their location. This form metadata serves as a catalog of data sources for the extracted social network knowledge graph. The pipeline then downloads these forms from the website and saves them in a build directory for further processing. These documents are then parsed for entities and relations.
Second, the Critical Infrastructure Network Extraction component extracts entities and relations for a critical infrastructure sector. Currently, we focus on Electric Vehicle (EV) charging stations; this information is available via the Department of Energy (DOE) database on fueling stations maintained by NREL.
Third, the Inference and Fusion component relates the social network graph to the critical infrastructure graph in order to understand the impact of a company within a geographic region. Relations include ownership of the EV charging station asset as well as maintenance/ownership of the EV payment networks. The fused network can be represented in many ways; currently we emit a knowledge graph.},
doi = {10.11578/dc.20241010.1},
url = {https://doi.org/10.11578/dc.20241010.1},
howpublished = {[Computer Software] \url{https://doi.org/10.11578/dc.20241010.1}},
year = {2024},
month = {may}
}