DOE PAGES, U.S. Department of Energy, Office of Scientific and Technical Information

Title: Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets

Abstract

The specificity of toxicant-target biomolecule interactions contributes to the highly imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such imbalance. However, removing inactive chemical compound instances from the majority class by undersampling can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlap and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples and then clean the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which the F1 score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. We also found a strong negative correlation between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
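
The resampling idea summarized in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (imbalance_ratio, smote, enn), the toy 2-D "descriptor" data, the choice of k = 3, and the use of a majority-vote ENN rule applied to both classes are all assumptions made for illustration only. The sketch computes the imbalance ratio as defined above, oversamples the active class by interpolating between neighbouring actives, and then removes instances whose label disagrees with their nearest neighbours.

```python
import math
import random


def imbalance_ratio(labels):
    """IR = number of inactive (0) compounds / number of active (1) compounds."""
    actives = sum(1 for label in labels if label == 1)
    inactives = sum(1 for label in labels if label == 0)
    return inactives / actives


def _knn(idx, X, k, candidates):
    """Indices of the (up to) k candidates nearest to X[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]


def smote(X, y, k=3, rng=None):
    """SMOTE-style oversampling: add synthetic actives on the line segment
    between an active and one of its nearest active neighbours until the
    two classes are the same size. Needs at least two actives."""
    rng = rng or random.Random(0)
    minority = [i for i, label in enumerate(y) if label == 1]
    n_new = (len(y) - len(minority)) - len(minority)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(minority)
        j = rng.choice(_knn(i, X, k, minority))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_new.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_new.append(1)
    return X_new, y_new


def enn(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant, assumed here):
    drop every sample whose label differs from the majority label of its
    k nearest neighbours. Use an odd k to avoid ties."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = [y[j] for j in _knn(i, X, k, everyone)]
        majority = 1 if 2 * sum(votes) > len(votes) else 0
        if majority == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 12 inactives on a grid, 3 actives in a separate
# cluster, and 1 inactive that sits inside the active cluster.
X = [(gx, gy) for gx in range(4) for gy in range(3)]
X += [(6.0, 6.0), (7.0, 6.0), (6.0, 7.0), (6.5, 6.5)]
y = [0] * 12 + [1, 1, 1, 0]

X_s, y_s = smote(X, y)      # oversample the actives
X_c, y_c = enn(X_s, y_s)    # then clean instances that contradict neighbours

print(f"IR before:      {imbalance_ratio(y):.2f}")
print(f"IR after SMOTE: {imbalance_ratio(y_s):.2f}")
print(f"IR after ENN:   {imbalance_ratio(y_c):.2f}")
```

On this toy set the oversampling step balances the two classes, and the cleaning step then removes the single inactive that lies inside the active cluster, which is the kind of overlapping instance the abstract refers to. In practice one would use a maintained library implementation and apply resampling to training folds only.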

Authors:
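Idakwo, Gabriel; Thangapandian, Sundar; Luttrell, Joseph; Li, Yan; Wang, Nan; Zhou, Zhaoxian; Hong, Huixiao; Yang, Bei; Zhang, Chaoyang; Gong, Ping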
Publication Date:
Sponsoring Org.:
USDOE
OSTI Identifier:
1690302
Resource Type:
Published Article
Journal Name:
Journal of Cheminformatics
Additional Journal Information:
Journal Name: Journal of Cheminformatics; Journal Volume: 12; Journal Issue: 1; Journal ID: ISSN 1758-2946
Publisher:
Springer Science + Business Media
Country of Publication:
United Kingdom
Language:
English

Citation Formats

Idakwo, Gabriel, Thangapandian, Sundar, Luttrell, Joseph, Li, Yan, Wang, Nan, Zhou, Zhaoxian, Hong, Huixiao, Yang, Bei, Zhang, Chaoyang, and Gong, Ping. Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets. United Kingdom: N. p., 2020. Web. doi:10.1186/s13321-020-00468-x.
Idakwo, Gabriel, Thangapandian, Sundar, Luttrell, Joseph, Li, Yan, Wang, Nan, Zhou, Zhaoxian, Hong, Huixiao, Yang, Bei, Zhang, Chaoyang, & Gong, Ping. Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets. United Kingdom. https://doi.org/10.1186/s13321-020-00468-x
Idakwo, Gabriel, Thangapandian, Sundar, Luttrell, Joseph, Li, Yan, Wang, Nan, Zhou, Zhaoxian, Hong, Huixiao, Yang, Bei, Zhang, Chaoyang, and Gong, Ping. Tue . "Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets". United Kingdom. https://doi.org/10.1186/s13321-020-00468-x.
@article{osti_1690302,
title = {Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets},
author = {Idakwo, Gabriel and Thangapandian, Sundar and Luttrell, Joseph and Li, Yan and Wang, Nan and Zhou, Zhaoxian and Hong, Huixiao and Yang, Bei and Zhang, Chaoyang and Gong, Ping},
doi = {10.1186/s13321-020-00468-x},
journal = {Journal of Cheminformatics},
number = 1,
volume = 12,
place = {United Kingdom},
year = {2020},
month = {10}
}
