TY  - COMP
TI  - A Data Processing Pipeline To Extract A Knowledge Graph From SEC Documents For Socio-technical Analysis Of Critical Infrastructure Influence
AB  - The code is written in Python and consists of a pipeline implemented in Apache Airflow. The pipeline is intended to identify the companies that are directly or indirectly involved with a type of critical infrastructure system at some point in that system's lifecycle. It takes a configuration file that specifies a list of initial companies to consider, a geographic region of interest (a disk, expressed as a latitude/longitude point and a distance), and a set of SEC form types from which to extract entities and relations. There are three main components to the pipeline as currently implemented: Social Network Extraction, Critical Infrastructure Network Extraction, and Inference and Fusion. First, Social Network Extraction, implemented as the `organizations_sec` component of the workflow graph, queries the SEC EDGAR web service using the list of initial companies from the configuration file. From the results, it extracts metadata that documents the number of forms of each type for the given set of companies and their location. This form metadata represents a catalog of data sources for the extracted social network knowledge graph. The pipeline then downloads these forms from the website and saves them in a build directory for further processing. These documents are then parsed for entities and relations. Second, the Critical Infrastructure Network Extraction component extracts entities and relations for a critical infrastructure sector. Currently, we focus on electric vehicle (EV) charging stations, using information available from the Department of Energy (DOE) database on fueling stations maintained by NREL. Third, the Inference and Fusion component relates the social network graph to the critical infrastructure graph in order to understand the impact of a company within a geographic region.
Relations include ownership of the EV Charging Station asset as well as maintenance/ownership of the EV payment networks. The fused network can be represented in many ways and currently we emit a knowledge graph.
AU  - Weaver, Gabriel
DO  - https://doi.org/10.11578/dc.20241010.1
UR  - https://www.osti.gov/doecode/biblio/145418
CY  - United States
PY  - 2024
DA  - 2024-05-07
LA  - English
C1  - Research Org.: Idaho National Laboratory (INL), Idaho Falls, ID (United States)
C2  - Sponsor Org.: USDOE Office of Nuclear Energy (NE)
C4  - Contract Number: AC07-05ID14517
ER  -