DeepPDF: A Deep Learning Approach to Extracting Text from PDFs

Stahl, Christopher; Young, Steven; Herrmannova, Dasha; Patton, Robert; Wells, Jack

Conference · Tue May 01 04:00:00 EDT 2018

OSTI ID: 1460210

ORNL

Scientific publications contain a plethora of important information, not only for researchers but also for their managers and institutions. Many researchers try to collect and extract this information in quantities large enough to require machine automation, but because publications were historically intended for print rather than machine consumption, the digital document formats used today (primarily PDF) have created many hurdles for text extraction. Existing tools have primarily relied on converting PDFs to plain-text documents for machine processing by reverse engineering the PDF standard. This is a complex process because once a PDF is created, it is more closely related to an image file than to a document markup language. In this paper we explore the feasibility of treating these PDF documents as images as opposed to a proprietary markup language. We believe that by using deep learning and image analysis we can create more accurate PDF-to-text extraction tools than those that currently exist.

Keywords: deep learning, text extraction, information extraction, PDF extraction, scholarly publications.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

Country of Publication: United States

Language: English

Similar Records

PDF Entity Annotation Tool (PEAT)

Journal Article · Mon Apr 07 20:00:00 EDT 2025 · Journal of Open Source Software · OSTI ID:2573694

PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format

Journal Article · Mon Mar 28 20:00:00 EDT 2022 · Journal of Chemical Information and Modeling · OSTI ID:1981870
