OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: DeepPDF: A Deep Learning Approach to Extracting Text from PDFs

Abstract

Scientific publications contain a plethora of important information, not only for researchers but also for their managers and institutions. Many researchers try to collect and extract this information in quantities large enough to require machine automation, but because publications were historically intended for print rather than machine consumption, the digital document formats used today (primarily PDF) have created many hurdles for text extraction. Existing tools have primarily relied on converting PDFs to plain-text documents for machine processing by reverse engineering the PDF standard. This is a complex process because, once a PDF is created, it is more closely related to an image file than to a document markup language. In this paper we explore the feasibility of treating these PDF documents as images rather than as a proprietary markup language. We believe that by using deep learning and image analysis we can create more accurate PDF-to-text extraction tools than those that currently exist.

Keywords: deep learning, text extraction, information extraction, PDF extraction, scholarly publications.


Citation Formats

Stahl, Christopher G., Young, Steven R., Herrmannova, Drahomira, Patton, Robert M., and Wells, Jack C. DeepPDF: A Deep Learning Approach to Extracting Text from PDFs. United States: N. p., 2018. Web.
Stahl, Christopher G., Young, Steven R., Herrmannova, Drahomira, Patton, Robert M., & Wells, Jack C. DeepPDF: A Deep Learning Approach to Extracting Text from PDFs. United States.
Stahl, Christopher G., Young, Steven R., Herrmannova, Drahomira, Patton, Robert M., and Wells, Jack C. 2018. "DeepPDF: A Deep Learning Approach to Extracting Text from PDFs". United States. https://www.osti.gov/servlets/purl/1460210.
@article{osti_1460210,
title = {DeepPDF: A Deep Learning Approach to Extracting Text from PDFs},
author = {Stahl, Christopher G. and Young, Steven R. and Herrmannova, Drahomira and Patton, Robert M. and Wells, Jack C.},
abstractNote = {Scientific publications contain a plethora of important information, not only for researchers but also for their managers and institutions. Many researchers try to collect and extract this information in quantities large enough to require machine automation, but because publications were historically intended for print rather than machine consumption, the digital document formats used today (primarily PDF) have created many hurdles for text extraction. Existing tools have primarily relied on converting PDFs to plain-text documents for machine processing by reverse engineering the PDF standard. This is a complex process because, once a PDF is created, it is more closely related to an image file than to a document markup language. In this paper we explore the feasibility of treating these PDF documents as images rather than as a proprietary markup language. We believe that by using deep learning and image analysis we can create more accurate PDF-to-text extraction tools than those that currently exist. Keywords: deep learning, text extraction, information extraction, PDF extraction, scholarly publications.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2018},
month = {5}
}

Conference: