OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors

Abstract

We address a crucial element of applied information extraction, the accurate identification of basic security entities in text, by evaluating previous methods and presenting new labelers. Our survey reveals that previous efforts have not been tested on documents similar to the targeted sources (news articles, blogs, tweets, etc.) and that no sufficiently large, publicly available annotated corpus of these documents exists. By assembling a representative test corpus, we perform a quantitative evaluation of previous methods in a realistic setting, revealing an overall lack of recall and giving insight into the models' beneficial and inhibiting elements. In particular, our results show that many previous efforts overfit to the non-representative test corpora in this domain. Informed by this evaluation, we present three novel cyber entity extractors, which seek to leverage the available labeled data but remain worthwhile on the more diverse documents encountered in the wild. Each new model advances the state of the art in recall, with maximal or near-maximal F1 score. Our results establish that the state of the art in cyber entity tagging is characterized by F1 = 0.61.

Authors:
Bridges, Robert A. [1]; Huffer, Kelly M. [1]; Jones, Corinne L. [1]; Iannacone, Michael D. [1]; Goodall, John R. [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1424492
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 12/20/2017 to 12/24/2017
Country of Publication:
United States
Language:
English

Citation Formats

Bridges, Robert A., Huffer, Kelly M., Jones, Corinne L., Iannacone, Michael D., and Goodall, John R. Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors. United States: N. p., 2018. Web. doi:10.1109/ICMLA.2017.0-122.
Bridges, Robert A., Huffer, Kelly M., Jones, Corinne L., Iannacone, Michael D., & Goodall, John R. (2018). Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors. United States. doi:10.1109/ICMLA.2017.0-122.
Bridges, Robert A., Huffer, Kelly M., Jones, Corinne L., Iannacone, Michael D., and Goodall, John R. 2018. "Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors". United States. doi:10.1109/ICMLA.2017.0-122. https://www.osti.gov/servlets/purl/1424492.
@article{osti_1424492,
title = {Cybersecurity Automated Information Extraction Techniques: Drawbacks of Current Methods, and Enhanced Extractors},
author = {Bridges, Robert A. and Huffer, Kelly M. and Jones, Corinne L. and Iannacone, Michael D. and Goodall, John R.},
abstractNote = {We address a crucial element of applied information extraction, the accurate identification of basic security entities in text, by evaluating previous methods and presenting new labelers. Our survey reveals that previous efforts have not been tested on documents similar to the targeted sources (news articles, blogs, tweets, etc.) and that no sufficiently large, publicly available annotated corpus of these documents exists. By assembling a representative test corpus, we perform a quantitative evaluation of previous methods in a realistic setting, revealing an overall lack of recall and giving insight into the models' beneficial and inhibiting elements. In particular, our results show that many previous efforts overfit to the non-representative test corpora in this domain. Informed by this evaluation, we present three novel cyber entity extractors, which seek to leverage the available labeled data but remain worthwhile on the more diverse documents encountered in the wild. Each new model advances the state of the art in recall, with maximal or near-maximal F1 score. Our results establish that the state of the art in cyber entity tagging is characterized by F1 = 0.61.},
doi = {10.1109/ICMLA.2017.0-122},
journal = {},
place = {United States},
year = {2018},
month = {1}
}
