U.S. Department of Energy
Office of Scientific and Technical Information

Assessment of fine-tuned large language models for real-world chemistry and material science applications

Journal Article · Chemical Science
DOI: https://doi.org/10.1039/d4sc04401k · OSTI ID: 2586547
 [1]; [2]; [3]; [4]; [5]; [4]; [6]; [7]; [8]; [8]; [9]; [1]; [6]; [10]; [10]; [11]; [12]; [13]; [1]; [1]; [14]; [9]; [4]; [8]; [7]; [1]; [15]; [10]; [16]; [1]; [17]; [18]; [10]; [16]; [4]; [19]; [20]; [21]; [22]; [7]; [23]; [17]; [24]; [13]; [20]; [25]; [13]; [13]; [26]; [1]
  1. Ecole Polytechnique Federale Lausanne (EPFL), Sion (Switzerland)
  2. Ecole Polytechnique Federale Lausanne (EPFL), Sion (Switzerland); Consejo Superior de Investigaciones Cientificas (CSIC), Oviedo (Spain). Instituto de Ciencia y Tecnología del Carbono (INCAR)
  3. Ecole Polytechnique Federale Lausanne (EPFL), Sion (Switzerland); Friedrich Schiller University, Jena (Germany); Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) (Germany)
  4. University of Cambridge (United Kingdom)
  5. Technical University of Denmark, Lyngby (Denmark); University of Oxford (United Kingdom)
  6. University of Chicago, IL (United States); Argonne National Laboratory (ANL), Argonne, IL (United States)
  7. Politecnico di Torino (Italy)
  8. Ecole Polytechnique Federale Lausanne (EPFL) (Switzerland)
  9. Koc University, Istanbul (Turkey)
  10. Heriot-Watt University, Edinburgh (United Kingdom)
  11. BIGCHEM GmbH, Unterschleißheim (Germany)
  12. University of Cambridge (United Kingdom); National Institutes of Health (NIH), Bethesda, MD (United States)
  13. Helmholtz Zentrum Hereon, Geesthacht (Germany)
  14. Monash University, Clayton, VIC (Australia)
  15. University of Waterloo, ON (Canada)
  16. University of Toronto, ON (Canada)
  17. University of Pisa (Italy)
  18. Consejo Superior de Investigaciones Cientificas (CSIC), Oviedo (Spain). Instituto de Ciencia y Tecnología del Carbono (INCAR)
  19. University of Mohaghegh Ardabili (Iran)
  20. University of Tehran (Iran)
  21. University of Chicago, IL (United States)
  22. Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States); University of Notre Dame, IN (United States)
  23. Technical University of Vienna (Austria)
  24. BIGCHEM GmbH, Unterschleißheim (Germany); Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg (Germany)
  25. University of Notre Dame, IN (United States)
  26. Ecole Polytechnique Federale Lausanne (EPFL), Sion (Switzerland); Technical University of Denmark, Lyngby (Denmark)

The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models opens doors to a wider chemical audience, as field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We studied the performance of fine-tuning three open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) for a range of different chemical questions. We benchmark their performances against “traditional” machine learning models and find that, in most cases, the fine-tuning approach is superior for a simple classification problem. Depending on the size of the dataset and the type of questions, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, their conversion into an LLM fine-tuning training set is straightforward and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing unnecessary experiments or computations.
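The abstract notes that converting a dataset into an LLM fine-tuning training set is straightforward. As a rough illustration of what such a conversion could look like for a simple binary classification question, here is a minimal Python sketch. The prompt wording, the "high"/"low" labels, the threshold, the function names, and the example records are all assumptions made for illustration; they are not taken from the article and do not represent the authors' actual templates or data.

```python
import json


def to_example(name, value, threshold, prop):
    """Turn one (identifier, measured value) record into a prompt/completion pair.

    The question template and the two class labels are hypothetical choices.
    """
    label = "high" if value >= threshold else "low"
    return {
        "prompt": f"What is the {prop} of {name}?",
        "completion": label,
    }


def to_jsonl(records, threshold, prop):
    """Serialize all records as JSON Lines, one training example per line."""
    return "\n".join(
        json.dumps(to_example(name, value, threshold, prop))
        for name, value in records
    )


# Placeholder records: SMILES strings paired with made-up numbers (not real data).
records = [("CCO", 0.8), ("c1ccccc1", 0.2)]

if __name__ == "__main__":
    print(to_jsonl(records, threshold=0.5, prop="solubility class"))
```

Each output line is one natural-language training example, which is why no field-specific featurization step appears in the sketch: the chemical identifier is passed to the model as plain text.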

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2586547
Journal Information:
Chemical Science, Vol. 16, Issue 2; ISSN 2041-6539; ISSN 2041-6520
Publisher:
Royal Society of Chemistry (RSC)
Country of Publication:
United States
Language:
English
