Title: Leveraging natural language processing to curate the tmCAT, tmPHOTO, tmBIO, and tmSCO datasets of functional transition metal complexes

Journal Article · Faraday Discussions
DOI: https://doi.org/10.1039/D4FD00087K · OSTI ID:2447510
[1]; [2]; [1]; [1]; [1]; [3]; [3]; [3]
  1. Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  2. Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  3. Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Leveraging natural language processing models, including transformers, we curate four distinct datasets of functional transition metal complexes: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism.

Sponsoring Organization:
USDOE
Grant/Contract Number:
SC0016214
Journal Information:
Faraday Discussions, Vol. 256; ISSN 1359-6640; CODEN FDISE6
Publisher:
Royal Society of Chemistry (RSC)
Country of Publication:
United Kingdom
Language:
English
