Unsupervised word embeddings capture latent knowledge from materials science literature
Abstract
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Lastly, our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
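The abstract describes words as vectors and says an unsupervised method can recommend materials for an application. A minimal sketch of one way such a recommendation could work is given below, assuming candidates are ranked by cosine similarity to an application word; the abstract itself does not name the similarity measure. The vectors are hypothetical three-dimensional toy values chosen for illustration, not trained embeddings, and the function names are placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (assumption): real embeddings would be learned
# from a text corpus and have far more dimensions.
toy_embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "CuGaTe2": [0.8, 0.2, 0.1],
    "SnSe": [0.7, 0.3, 0.3],
    "NaCl": [0.1, 0.9, 0.4],
}


def rank_candidates(query, embeddings):
    """Rank every other word by cosine similarity to the query word."""
    query_vec = embeddings[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates("thermoelectric", toy_embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

With these toy values, the two words placed near "thermoelectric" rank above the unrelated one. That outcome only reflects how the vectors were chosen and says nothing about the real materials.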
- Authors:
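- Tshitoyan, Vahe; Dagdelen, John; Weston, Leigh; Dunn, Alexander; Rong, Ziqin; Kononova, Olga; Persson, Kristin A.; Ceder, Gerbrand; Jain, Anubhav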
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Google LLC, Mountain View, CA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Univ. of California, Berkeley, CA (United States)
- Publication Date:
- 2019-07-03
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Basic Energy Sciences (BES)
- OSTI Identifier:
- 1608271
- Grant/Contract Number:
- AC02-05CH11231
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Nature (London)
- Additional Journal Information:
- Journal Name: Nature (London); Journal Volume: 571; Journal Issue: 7763; Journal ID: ISSN 0028-0836
- Publisher:
- Nature Publishing Group
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 36 MATERIALS SCIENCE
Citation Formats
Tshitoyan, Vahe, Dagdelen, John, Weston, Leigh, Dunn, Alexander, Rong, Ziqin, Kononova, Olga, Persson, Kristin A., Ceder, Gerbrand, and Jain, Anubhav. Unsupervised word embeddings capture latent knowledge from materials science literature. United States: N. p., 2019. Web. doi:10.1038/s41586-019-1335-8.
Tshitoyan, Vahe, Dagdelen, John, Weston, Leigh, Dunn, Alexander, Rong, Ziqin, Kononova, Olga, Persson, Kristin A., Ceder, Gerbrand, & Jain, Anubhav. Unsupervised word embeddings capture latent knowledge from materials science literature. United States. https://doi.org/10.1038/s41586-019-1335-8
Tshitoyan, Vahe, Dagdelen, John, Weston, Leigh, Dunn, Alexander, Rong, Ziqin, Kononova, Olga, Persson, Kristin A., Ceder, Gerbrand, and Jain, Anubhav. 2019. "Unsupervised word embeddings capture latent knowledge from materials science literature". United States. https://doi.org/10.1038/s41586-019-1335-8. https://www.osti.gov/servlets/purl/1608271.
@article{osti_1608271,
title = {Unsupervised word embeddings capture latent knowledge from materials science literature},
author = {Tshitoyan, Vahe and Dagdelen, John and Weston, Leigh and Dunn, Alexander and Rong, Ziqin and Kononova, Olga and Persson, Kristin A. and Ceder, Gerbrand and Jain, Anubhav},
abstractNote = {The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Lastly, our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.},
doi = {10.1038/s41586-019-1335-8},
journal = {Nature (London)},
number = 7763,
volume = 571,
place = {United States},
year = {2019},
month = jul
}
Works referenced in this record:
Information Retrieval and Text Mining Technologies for Chemistry
journal, May 2017
- Krallinger, Martin; Rabal, Obdulia; Lourenço, Anália
- Chemical Reviews, Vol. 117, Issue 12
Generalized Gradient Approximation Made Simple
journal, October 1996
- Perdew, John P.; Burke, Kieron; Ernzerhof, Matthias
- Physical Review Letters, Vol. 77, Issue 18, p. 3865-3868
Learning atoms for materials discovery
journal, June 2018
- Zhou, Quan; Tang, Peizhe; Liu, Shenxiu
- Proceedings of the National Academy of Sciences, Vol. 115, Issue 28
An ab initio electronic transport database for inorganic materials
journal, July 2017
- Ricci, Francesco; Chen, Wei; Aydemir, Umut
- Scientific Data, Vol. 4, Issue 1
Ultralow lattice thermal conductivity and electronic properties of monolayer 1T phase semimetal SiTe2 and SnTe2
journal, April 2019
- Wang, Yi; Gao, Zhibin; Zhou, Jun
- Physica E: Low-dimensional Systems and Nanostructures, Vol. 108
Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows
journal, November 2017
- Mathew, Kiran; Montoya, Joseph H.; Faghaninia, Alireza
- Computational Materials Science, Vol. 139
FireWorks: A Dynamic Workflow System Designed for High-Throughput Applications
journal, May 2015
- Jain, Anubhav; Ong, Shyue Ping; Chen, Wei
- Concurrency and Computation: Practice and Experience, Vol. 27, Issue 17
From ultrasoft pseudopotentials to the projector augmented-wave method
journal, January 1999
- Kresse, G.; Joubert, D.
- Physical Review B, Vol. 59, Issue 3, p. 1758-1775
Glove: Global Vectors for Word Representation
conference, January 2014
- Pennington, Jeffrey; Socher, Richard; Manning, Christopher
- Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Low-Symmetry Two-Dimensional Materials for Electronic and Photonic Applications
text, January 2017
- Tian, He; Tice, Jesse; Fei, Ruixiang
- arXiv
Low-symmetry two-dimensional materials for electronic and photonic applications
journal, December 2016
- Tian, He; Tice, Jesse; Fei, Ruixiang
- Nano Today, Vol. 11, Issue 6
High-resolution X-ray luminescence extension imaging
journal, February 2021
- Ou, Xiangyu; Qin, Xian; Huang, Bolong
- Nature, Vol. 590, Issue 7846
Materials science with large-scale data and informatics: Unlocking new opportunities
journal, May 2016
- Hill, Joanne; Mulholland, Gregory; Persson, Kristin
- MRS Bulletin, Vol. 41, Issue 5
New trends, strategies and opportunities in thermoelectric materials: A perspective
journal, June 2017
- Liu, Weishu; Hu, Jizhen; Zhang, Shuangmeng
- Materials Today Physics, Vol. 1
Chalcopyrite CuGaTe2: A High-Efficiency Bulk Thermoelectric Material
journal, June 2012
- Plirdpring, Theerayuth; Kurosaki, Ken; Kosuga, Atsuko
- Advanced Materials, Vol. 24, Issue 27
BoltzTraP. A code for calculating band-structure dependent quantities
journal, July 2006
- Madsen, Georg K. H.; Singh, David J.
- Computer Physics Communications, Vol. 175, Issue 1
tmChem: a high performance approach for chemical named entity recognition and normalization
journal, January 2015
- Leaman, Robert; Wei, Chih-Hsuan; Lu, Zhiyong
- Journal of Cheminformatics, Vol. 7, Issue S1
Chemical named entities recognition: a review on approaches and applications
journal, April 2014
- Eltyeb, Safaa; Salim, Naomie
- Journal of Cheminformatics, Vol. 6, Issue 1
Ultralow thermal conductivity and high thermoelectric figure of merit in SnSe crystals
journal, April 2014
- Zhao, Li-Dong; Lo, Shih-Han; Zhang, Yongsheng
- Nature, Vol. 508, Issue 7496, p. 373-377
Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning
journal, October 2017
- Kim, Edward; Huang, Kevin; Saunders, Adam
- Chemistry of Materials, Vol. 29, Issue 21
Advances in thermoelectric materials research: Looking back and moving forward
journal, September 2017
- He, Jian; Tritt, Terry M.
- Science, Vol. 357, Issue 6358
Self-Consistent Equations Including Exchange and Correlation Effects
journal, November 1965
- Kohn, W.; Sham, L. J.
- Physical Review, Vol. 140, Issue 4A, p. A1133-A1138
Thermoelectric properties of defect chalcopyrites
conference, January 2017
- Pandey, Chhama; Sharma, Ramesh; Sharma, Yamini
- DAE SOLID STATE PHYSICS SYMPOSIUM 2016, AIP Conference Proceedings
ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature.
text, January 2016
- Swain, Matthew C.; Cole, Jacqui
- Apollo - University of Cambridge Repository
ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature
journal, October 2016
- Swain, Matthew C.; Cole, Jacqueline M.
- Journal of Chemical Information and Modeling, Vol. 56, Issue 10
High Thermoelectric Figure of Merit via Tunable Valley Convergence Coupled Low Thermal Conductivity in $\mathrm{A^{II}B^{IV}C^{V}_{2}}$ Chalcopyrites
journal, November 2018
- Mukherjee, Madhubanti; Yumnam, George; Singh, Abhishek K.
- The Journal of Physical Chemistry C, Vol. 122, Issue 51
Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set
journal, July 1996
- Kresse, G.; Furthmüller, J.
- Computational Materials Science, Vol. 6, Issue 1, p. 15-50
Data-Driven Review of Thermoelectric Materials: Performance and Resource Considerations
journal, May 2013
- Gaultois, Michael W.; Sparks, Taylor D.; Borg, Christopher K. H.
- Chemistry of Materials, Vol. 25, Issue 15
Low lattice thermal conductivity and excellent thermoelectric behavior in Li3Sb and Li3Bi
journal, October 2018
- Yang, Xiuxian; Dai, Zhenhong; Zhao, Yinchang
- Journal of Physics: Condensed Matter, Vol. 30, Issue 42
Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set
journal, October 1996
- Kresse, G.; Furthmüller, J.
- Physical Review B, Vol. 54, Issue 16, p. 11169-11186
The Proof and Measurement of Association between Two Things
journal, October 1987
- Spearman, C.
- The American Journal of Psychology, Vol. 100, Issue 3/4
Commentary: The Materials Project: A materials genome approach to accelerating materials innovation
journal, July 2013
- Jain, Anubhav; Ong, Shyue Ping; Hautier, Geoffroy
- APL Materials, Vol. 1, Issue 1
Inhomogeneous Electron Gas
journal, November 1964
- Hohenberg, P.; Kohn, W.
- Physical Review, Vol. 136, Issue 3B, p. B864-B871
Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature
journal, September 2004
- Müller, Hans-Michael; Kenny, Eimear E.; Sternberg, Paul W.
- PLoS Biology, Vol. 2, Issue 11
Inhomogeneous Electron Gas
journal, March 1973
- Rajagopal, A. K.; Callaway, J.
- Physical Review B, Vol. 7, Issue 5
Reducing Dzyaloshinskii-Moriya interaction and field-free spin-orbit torque switching in synthetic antiferromagnets
journal, May 2021
- Chen, Ruyi; Cui, Qirui; Liao, Liyang
- Nature Communications, Vol. 12, Issue 1
The Proof and Measurement of Association between Two Things
journal, January 1904
- Spearman, C.
- The American Journal of Psychology, Vol. 15, Issue 1
Machine Learning Energies of 2 Million Elpasolite Crystals
journal, September 2016
- Faber, Felix A.; Lindmaa, Alexander; von Lilienfeld, O. Anatole
- Physical Review Letters, Vol. 117, Issue 13
Machine-learned and codified synthesis parameters of oxide materials
journal, September 2017
- Kim, Edward; Huang, Kevin; Tomala, Alex
- Scientific Data, Vol. 4, Issue 1
Machine learning for molecular and materials science
journal, July 2018
- Butler, Keith T.; Davies, Daniel W.; Cartwright, Hugh
- Nature, Vol. 559, Issue 7715
The proof and measurement of association between two things
journal, October 2010
- Spearman, C.
- International Journal of Epidemiology, Vol. 39, Issue 5
Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis
journal, February 2013
- Ong, Shyue Ping; Richards, William Davidson; Jain, Anubhav
- Computational Materials Science, Vol. 68
Automated hypothesis generation based on mining scientific literature
conference, August 2014
- Spangler, Scott; Wilkins, Angela D.; Bachman, Benjamin J.
- Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
Works referencing / citing this record:
Opportunities for artificial intelligence in advancing precision medicine
preprint, January 2019
- Filipp, Fabian V.
- arXiv
High-temperature materials for structural applications: New perspectives on high-entropy alloys, bulk metallic glasses, and nanomaterials
journal, November 2019
- Huang, E-Wen; Liaw, Peter K.
- MRS Bulletin, Vol. 44, Issue 11
Revealing ferroelectric switching character using deep recurrent neural networks
journal, October 2019
- Agar, Joshua C.; Naul, Brett; Pandya, Shishir
- Nature Communications, Vol. 10, Issue 1
Nanoinformatics, and the big challenges for the science of small things
journal, January 2019
- Barnard, A. S.; Motevalli, B.; Parker, A. J.
- Nanoscale, Vol. 11, Issue 41
A Critical Review of Machine Learning of Energy Materials
journal, January 2020
- Chen, Chi; Zuo, Yunxing; Ye, Weike
- Advanced Energy Materials, Vol. 10, Issue 8
Ultralow lattice thermal conductivity of monolayer penta-silicene and penta-germanene
text, January 2019
- Gao, Zhibin; Zhang, Zhaofu; Liu, Gang
- arXiv
Key genes and co-expression modules involved in asthma pathogenesis
journal, February 2020
- Huang, Yuyi; Liu, Hui; Zuo, Li
- PeerJ, Vol. 8
Text mining facilitates materials discovery
journal, July 2019
- Isayev, Olexandr
- Nature, Vol. 571, Issue 7763
Ultra-low lattice thermal conductivity of monolayer penta-silicene and penta-germanene
journal, January 2019
- Gao, Zhibin; Zhang, Zhaofu; Liu, Gang
- Physical Chemistry Chemical Physics, Vol. 21, Issue 47
Representing Multiword Chemical Terms through Phrase-Level Preprocessing and Word Embedding
journal, October 2019
- Huang, Liyuan; Ling, Chen
- ACS Omega, Vol. 4, Issue 20
Assessment of text coherence using an ontology‐based relatedness measurement method
journal, December 2019
- Giray, Görkem; Ünalır, Murat Osman
- Expert Systems, Vol. 37, Issue 3
Opportunities for Artificial Intelligence in Advancing Precision Medicine
journal, December 2019
- Filipp, Fabian V.
- Current Genetic Medicine Reports, Vol. 7, Issue 4