Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics

Nielson, Felicity F.; Kay, Bill; Young, Stephen J.; Colby, Sean M.; Renslow, Ryan S.; Metz, Thomas O.

doi:10.3390/metabo13010105

Journal Article · January 9, 2023 · Metabolites

DOI: https://doi.org/10.3390/metabo13010105 · OSTI ID: 2332980

Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)

Computational methods for creating in silico libraries of molecular descriptors (e.g., collision cross sections) are becoming increasingly prevalent because few authentic reference materials are available for traditional library building. These so-called “reference-free metabolomics” methods require sampling sets of molecular conformers to produce high-accuracy property predictions. Because the subsequent calculations for each conformer are computationally costly, there is a need to sample the most relevant subset and avoid repeating calculations on nearly identical conformers. The goal of this study is to introduce a heuristic method for finding the most dissimilar conformers in a larger population, in order to speed up reference-free calculation methods while maintaining high property prediction accuracy.

Finding the set of n items most dissimilar from each other in a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows. Because a pairwise relationship exists between each item and every other item in the population, finding the n most dissimilar items is different from simply sorting an array of numbers. For instance, one or more items in the most dissimilar set of size n = 4 might not appear in the most dissimilar set of size n = 5. An exact solution would have to search all possible combinations of size n in the population exhaustively.

We present open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that, for finding the set of the n most dissimilar conformers, SDS is not only orders of magnitude faster but also more accurate than running Monte Carlo for 1,000,000 iterations, each searching for set sizes n = 3–7 out of a population of 50,000. We also benchmark SDS against the exact solution for small example populations, showing that SDS produces a solution close to the exact solution in these instances. Using theoretical approaches, we also demonstrate the constraints of the greedy algorithm and its efficacy as a ratio to the exact solution.
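To make the three approaches named in the abstract concrete, the following is a minimal Python sketch of a greedy heuristic, an exhaustive exact search, and a Monte Carlo search for the n most dissimilar items. It is not the SDS implementation: the objective (the plain sum of pairwise dissimilarities), the seeding rule (start from the single most dissimilar pair), the toy 2D points standing in for conformers, and all function names are assumptions made for illustration only.

```python
from itertools import combinations
import math
import random


def total_dissimilarity(dist, subset):
    """Sum of pairwise dissimilarities within subset (assumed objective)."""
    return sum(dist[i][j] for i, j in combinations(subset, 2))


def exact_select(dist, n):
    """Exhaustive search over every size-n combination (small populations only)."""
    best = max(combinations(range(len(dist)), n),
               key=lambda s: total_dissimilarity(dist, s))
    return list(best)


def greedy_select(dist, n):
    """Hypothetical greedy heuristic: seed with the most dissimilar pair,
    then repeatedly add the item farthest (in total) from those chosen."""
    size = len(dist)
    if not 2 <= n <= size:
        raise ValueError("n must be between 2 and the population size")
    chosen = list(max(combinations(range(size), 2),
                      key=lambda p: dist[p[0]][p[1]]))
    while len(chosen) < n:
        remaining = (k for k in range(size) if k not in chosen)
        chosen.append(max(remaining,
                          key=lambda k: sum(dist[k][c] for c in chosen)))
    return sorted(chosen)


def monte_carlo_select(dist, n, iterations, seed=0):
    """Keep the best of `iterations` uniformly random size-n subsets."""
    rng = random.Random(seed)
    best, best_score = None, -math.inf
    for _ in range(iterations):
        candidate = rng.sample(range(len(dist)), n)
        score = total_dissimilarity(dist, candidate)
        if score > best_score:
            best, best_score = sorted(candidate), score
    return best


# Toy population: random 2D points stand in for conformers, and Euclidean
# distance stands in for a structural dissimilarity measure such as RMSD.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(12)]
dist = [[math.dist(p, q) for q in points] for p in points]
n = 4

for name, subset in [("exact", exact_select(dist, n)),
                     ("greedy", greedy_select(dist, n)),
                     ("monte carlo", monte_carlo_select(dist, n, 200))]:
    print(name, subset, round(total_dissimilarity(dist, subset), 3))
```

In this sketch the exhaustive search is only feasible because the toy population is tiny (495 combinations for n = 4 out of 12). For a population of 50,000 and n = 5 the number of combinations is on the order of 10^21, which is why a heuristic or sampling approach is needed at that scale. The greedy routine here costs one pass over all pairs for the seed plus one pass over the population per added item.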

Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE

Grant/Contract Number: AC05-76RL01830

OSTI ID: 2332980

Report Number(s): PNNL-SA-157372

Journal Information: Metabolites, Vol. 13, Issue 1; ISSN 2218-1989

Publisher: MDPI

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
conformer
downselection
graph
metabolomics
molecule
Monte Carlo
Python
sampling
structure
similarity