DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Unsupervised word embeddings capture latent knowledge from materials science literature

Abstract

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Lastly, our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.

Authors:
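Tshitoyan, Vahe [1]; Dagdelen, John [2]; Weston, Leigh [3]; Dunn, Alexander [2]; Rong, Ziqin [3]; Kononova, Olga [4]; Persson, Kristin A. [2]; Ceder, Gerbrand [2]; Jain, Anubhav [3]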
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Google LLC, Mountain View, CA (United States)
  2. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
  3. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  4. Univ. of California, Berkeley, CA (United States)
Publication Date:
2019-07-03
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Basic Energy Sciences (BES)
OSTI Identifier:
1608271
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Nature (London)
Additional Journal Information:
Journal Name: Nature (London); Journal Volume: 571; Journal Issue: 7763; Journal ID: ISSN 0028-0836
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Subject:
36 MATERIALS SCIENCE

Citation Formats

Tshitoyan, Vahe, Dagdelen, John, Weston, Leigh, Dunn, Alexander, Rong, Ziqin, Kononova, Olga, Persson, Kristin A., Ceder, Gerbrand, and Jain, Anubhav. Unsupervised word embeddings capture latent knowledge from materials science literature. United States: N. p., 2019. Web. doi:10.1038/s41586-019-1335-8.
Tshitoyan, Vahe, Dagdelen, John, Weston, Leigh, Dunn, Alexander, Rong, Ziqin, Kononova, Olga, Persson, Kristin A., Ceder, Gerbrand, & Jain, Anubhav. Unsupervised word embeddings capture latent knowledge from materials science literature. United States. https://doi.org/10.1038/s41586-019-1335-8
Tshitoyan, Vahe, Dagdelen, John, Weston, Leigh, Dunn, Alexander, Rong, Ziqin, Kononova, Olga, Persson, Kristin A., Ceder, Gerbrand, and Jain, Anubhav. 2019. "Unsupervised word embeddings capture latent knowledge from materials science literature". United States. https://doi.org/10.1038/s41586-019-1335-8. https://www.osti.gov/servlets/purl/1608271.
@article{osti_1608271,
title = {Unsupervised word embeddings capture latent knowledge from materials science literature},
author = {Tshitoyan, Vahe and Dagdelen, John and Weston, Leigh and Dunn, Alexander and Rong, Ziqin and Kononova, Olga and Persson, Kristin A. and Ceder, Gerbrand and Jain, Anubhav},
abstractNote = {The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Lastly, our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.},
doi = {10.1038/s41586-019-1335-8},
journal = {Nature (London)},
number = 7763,
volume = 571,
place = {United States},
year = {2019},
month = {7}
}

Figures / Tables:

Fig. 1: Word2vec skip-gram and analogies. a, Target words ‘LiCoO2’ and ‘LiMn2O4’ are represented as vectors with ones at their corresponding vocabulary indices (for example, 5 and 8 in the schematic) and zeros everywhere else (one-hot encoding). These one-hot encoded vectors are used as inputs for a neural network with a single linear hidden layer (for example, 200 neurons), which is trained to predict all words mentioned within a certain distance (context words) from the given target word. For similar battery cathode materials such as LiCoO2 and LiMn2O4, the context words that occur in the text are mostly the same (for example, ‘cathodes’, ‘electrochemical’, and so on), which leads to similar hidden layer weights after the training is complete. These hidden layer weights are the actual word embeddings. The softmax function is used at the output layer to normalize the probabilities. b, Word embeddings for Zr, Cr and Ni, their principal oxides and crystal symmetries (at standard conditions) projected onto two dimensions using principal component analysis and represented as points in space. The relative positioning of the words encodes materials science relationships, such that there exist consistent vector operations between words that represent concepts such as ‘oxide of’ and ‘structure of’.
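
The skip-gram setup described in panel a can be illustrated with a small sketch. The code below is not the authors' implementation: it is a toy, standard-library-only version that assumes a hypothetical four-sentence corpus, an 8-dimensional hidden layer (the caption's example uses 200 neurons), and plain full-softmax gradient descent. The function names (`build_vocab`, `skipgram_pairs`, `train`, `cosine`) and all hyperparameters are illustrative choices, not values from the paper.

```python
import math
import random

# Toy corpus: hypothetical three-word "sentences", NOT the paper's training data.
SENTENCES = [
    ["LiCoO2", "cathodes", "electrochemical"],
    ["LiMn2O4", "cathodes", "electrochemical"],
    ["Bi2Te3", "thermoelectric", "Seebeck"],
    ["PbTe", "thermoelectric", "Seebeck"],
]

def build_vocab(sentences):
    """Map each distinct word to an index (the position of its one-hot 1)."""
    words = sorted({w for s in sentences for w in s})
    return {w: i for i, w in enumerate(words)}

def skipgram_pairs(sentences, vocab, window):
    """List (target, context) index pairs for words within `window` positions."""
    pairs = []
    for s in sentences:
        for i, target in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    pairs.append((vocab[target], vocab[s[j]]))
    return pairs

def softmax(scores):
    """Normalize raw scores into probabilities (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(pairs, vocab_size, dim=8, epochs=300, lr=0.1, seed=0):
    """Train input (hidden-layer) and output weights; return the input weights."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    for _ in range(epochs):
        for t, c in pairs:
            # A one-hot input times w_in simply selects row t: the linear hidden layer.
            h = w_in[t][:]
            scores = [sum(a * b for a, b in zip(h, w_out[k])) for k in range(vocab_size)]
            probs = softmax(scores)
            grad_h = [0.0] * dim
            for k in range(vocab_size):
                err = probs[k] - (1.0 if k == c else 0.0)  # d(cross-entropy)/d(score_k)
                for d in range(dim):
                    grad_h[d] += err * w_out[k][d]   # uses the pre-update output weight
                    w_out[k][d] -= lr * err * h[d]
            for d in range(dim):
                w_in[t][d] -= lr * grad_h[d]
    return w_in  # rows of the hidden-layer weights are the word embeddings

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vocab = build_vocab(SENTENCES)
pairs = skipgram_pairs(SENTENCES, vocab, window=2)
emb = train(pairs, len(vocab))

sim_same = cosine(emb[vocab["LiCoO2"]], emb[vocab["LiMn2O4"]])
sim_diff = cosine(emb[vocab["LiCoO2"]], emb[vocab["Bi2Te3"]])
print(f"LiCoO2 vs LiMn2O4: {sim_same:.3f}")
print(f"LiCoO2 vs Bi2Te3:  {sim_diff:.3f}")
```

In this toy setting the two cathode materials share all of their context words, so their embeddings should end up closer to each other than to a word from the unrelated sentences. A practical model would train on a large corpus and would typically replace the full softmax with a cheaper approximation such as negative sampling.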
