U.S. Department of Energy
Office of Scientific and Technical Information

CACTUS: Chemistry Agent Connecting Tool Usage to Science

Journal Article · ACS Omega

Large language models (LLMs) have shown remarkable potential in various domains but often lack the ability to access and reason over domain-specific knowledge and tools. In this article, we introduce Chemistry Agent Connecting Tool Usage to Science (CACTUS), an LLM-based agent that integrates existing cheminformatics tools to enable accurate and advanced reasoning and problem-solving in chemistry and molecular discovery. We evaluate the performance of CACTUS using a diverse set of open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b, and Mistral-7b, on a benchmark of thousands of chemistry questions. Our results demonstrate that CACTUS significantly outperforms baseline LLMs, with the Gemma-7b, Mistral-7b, and Llama3-8b models achieving the highest accuracy regardless of the prompting strategy used. Moreover, we explore the impact of domain-specific prompting and hardware configurations on model performance, highlighting the importance of prompt engineering and the potential for deploying smaller models on consumer-grade hardware without a significant loss in accuracy. By combining the cognitive capabilities of open-source LLMs with widely used domain-specific tools provided by RDKit, CACTUS can assist researchers in tasks such as molecular property prediction, similarity searching, and drug-likeness assessment.

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES); USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC05-76RL01830
OSTI ID:
2478063
Alternate ID(s):
OSTI ID: 2478848
Journal Information:
Journal Name: ACS Omega; Journal Volume: 9; Journal Issue: 46; ISSN 2470-1343
Publisher:
American Chemical Society (ACS)
Country of Publication:
United States
Language:
English
