PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format

Zhu, Miao; Cole, Jacqueline M.

doi:10.1021/acs.jcim.1c01198

Journal Article · March 29, 2022 · Journal of Chemical Information and Modeling

DOI: https://doi.org/10.1021/acs.jcim.1c01198 · OSTI ID: 1981870

Zhu, Miao ^[1]; Cole, Jacqueline M. ^[2]

^[1] University of Cambridge (United Kingdom). Cavendish Laboratory
^[2] University of Cambridge (United Kingdom). Cavendish Laboratory; Harwell Science and Innovation Campus, Didcot (United Kingdom); University of Cambridge (United Kingdom)

The layout of portable document format (PDF) files is the same on any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or to have its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited capabilities for extracting data from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical named-entity-recognition capabilities of ChemDataExtractor, and it substantially improves the intrinsic PDF-reading abilities of ChemDataExtractor. The system features a template-based architecture, which enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct their logical structure. While other existing PDF-extraction tools focus on quantity mining, this template-based system focuses on quality mining across different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
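The idea of template-based metadata extraction with JSON output can be illustrated with a minimal Python sketch. It is not PDFDataExtractor's API or its actual rules: the `FIELD_PATTERNS` template, the `extract_metadata` helper, the regular expressions, and the sample text (including the placeholder email address) are all assumptions made for illustration, and the sketch starts from already-extracted plain text rather than from a PDF file.

```python
import json
import re

# Hypothetical template: one regular expression per metadata field.
# These patterns are illustrative assumptions, not the tool's real rules.
FIELD_PATTERNS = {
    "doi": re.compile(r"10\.\d{4,9}/\S+"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_metadata(text, patterns=FIELD_PATTERNS):
    """Return the first match for each field, or None when nothing matches."""
    record = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        record[name] = match.group(0) if match else None
    return record


# Sample text standing in for the first page of an article (made up for the demo).
sample = (
    "Journal of Chemical Information and Modeling, Vol. 62, Issue 7, 2022 "
    "doi:10.1021/acs.jcim.1c01198 "
    "Contact: first.author@example.org"
)

record = extract_metadata(sample)
print(json.dumps(record, indent=2))
```

A real template would likely also carry layout cues (font size, position on the page) for fields such as title and authors, which plain regular expressions cannot capture; supporting a new journal layout would then mean adding a template rather than changing the extraction loop.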

Research Organization: Argonne National Laboratory (ANL), Argonne, IL (United States); Univ. of Cambridge (United Kingdom)

Sponsoring Organization: Science & Technology Facilities Council (STFC); USDOE Office of Science (SC)

Grant/Contract Number: AC02-06CH11357

OSTI ID: 1981870

Journal Information: Journal of Chemical Information and Modeling, Vol. 62, Issue 7; ISSN 1549-9596

Publisher: American Chemical Society

Country of Publication: United States

Language: English

Related Subjects

37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY
97 MATHEMATICS AND COMPUTING