PDF Entity Annotation Tool (PEAT)

Stahl, Christopher G.; Markey, Kristan J.; Jewell, Brian C.; Shams, Dahnish; Taylor, Michele M.; Wilkins, A. Amina; Watford, Sean; Shapiro, Andy; Angrish, Michelle

doi:10.21105/joss.05336

Journal Article · April 8, 2025 · Journal of Open Source Software

DOI: https://doi.org/10.21105/joss.05336 · OSTI ID: 2573694

Author affiliations, in author order: [1]; [2]; [1]; [2]; [2]; [2]; [2]; [2]; [2]

[1] Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
[2] US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States)

While text mining approaches, including Artificial Intelligence (AI) and other machine-based methods, continue to expand at a rapid pace, the tools used by researchers to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of a car and a truck would generally require a set of images with a quantitative description of the underlying features of each vehicle type. Development of labeled textual data that can be used to build natural language machine learning (ML) models for scientific literature is not currently integrated into existing manual workflows used by domain experts. Published literature is rich with important information, such as different types of embedded text, plots, and tables, that can all be used as inputs to train ML/natural language processing (NLP) models when extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult from formats such as PDF, which are optimized for layout and human readability, not machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications within their current print format, while also allowing those annotations to be captured in a machine-readable format. One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses all structure and provenance that was inherent in the underlying PDF. Textual data extraction from PDFs can also be an error-prone process. Challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result of these challenges, using existing tools for development of NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable. We created a system that allows annotations to be completed on the original PDF document structure, with no plain text extraction. The result is an application that allows for easier and more accurate annotations. In addition, by including a feature that lets the user easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

Grant/Contract Number: AC05-00OR22725

OSTI ID: 2573694

Journal Information: Journal of Open Source Software, Vol. 10, Issue 108; ISSN 2475-9066

Publisher: Open Source Initiative - NumFOCUS

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
Python
annotation
pdf
text extraction