DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Real-time structural motif searching in proteins using an inverted index strategy

Abstract

Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.
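
The inverted index idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the descriptor (a pair of residue names plus a binned distance), the 1.0 Å bin width, the function names, and the toy structures are all assumptions made for illustration. The sketch only shows how a lookup table from descriptors to structure identifiers lets a query skip structures that cannot contain the motif.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def pair_descriptors(residues, bin_width=1.0):
    """Yield one descriptor per residue pair: sorted names plus a binned distance.

    `residues` is a list of (residue_name, (x, y, z)) tuples. The descriptor
    layout is a hypothetical simplification chosen for this sketch.
    """
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(residues, 2):
        first, second = sorted((name_a, name_b))
        yield (first, second, int(dist(xyz_a, xyz_b) // bin_width))


def build_index(structures, bin_width=1.0):
    """Map every descriptor to the set of structure ids in which it occurs."""
    index = defaultdict(set)
    for structure_id, residues in structures.items():
        for descriptor in pair_descriptors(residues, bin_width):
            index[descriptor].add(structure_id)
    return index


def candidate_structures(index, motif, bin_width=1.0):
    """Return ids of structures that contain every pair descriptor of the motif."""
    postings = [index.get(d, set()) for d in pair_descriptors(motif, bin_width)]
    if not postings:
        return set()
    # Intersect starting from the shortest posting list to discard
    # irrelevant structures as early as possible.
    postings.sort(key=len)
    return set.intersection(*postings)


# Toy data (made up, not real PDB entries): one representative point per residue.
structures = {
    "toy_A": [("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.2, 0.0, 0.0)), ("SER", (0.0, 4.5, 0.0))],
    "toy_B": [("HIS", (0.0, 0.0, 0.0)), ("ASP", (9.0, 0.0, 0.0)), ("SER", (0.0, 4.5, 0.0))],
    "toy_C": [("GLY", (0.0, 0.0, 0.0)), ("ALA", (3.0, 0.0, 0.0))],
}
index = build_index(structures)

# Query motif with similar geometry to toy_A, translated in space.
motif = [("HIS", (1.0, 1.0, 1.0)), ("ASP", (4.3, 1.0, 1.0)), ("SER", (1.0, 5.4, 1.0))]
hits = candidate_structures(index, motif)
```

In this toy setting only `toy_A` survives the intersection. A practical system would presumably also need a tolerance for distances that fall near bin boundaries and a geometric verification step (for example, superposition and RMSD scoring) on the surviving candidates, since shared descriptors alone do not guarantee a consistent 3D arrangement.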

Authors:
Bittrich, Sebastian [1]; Burley, Stephen K. [2]; Rose, Alexander S. [1]
  1. Univ. of California, San Diego, La Jolla, CA (United States)
  2. Univ. of California, San Diego, La Jolla, CA (United States); Rutgers, The State Univ. of New Jersey, Piscataway, NJ (United States)
Publication Date:
Research Org.:
Univ. of California, San Diego, La Jolla, CA (United States). RCSB Protein Data Bank
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER). Earth and Environmental Systems Science Division; USDOE
OSTI Identifier:
1736347
Alternate Identifier(s):
OSTI ID: 1734438
Grant/Contract Number:  
SC0019749
Resource Type:
Accepted Manuscript
Journal Name:
PLoS Computational Biology (Online)
Additional Journal Information:
Journal Name: PLoS Computational Biology (Online); Journal Volume: 16; Journal Issue: 12; Journal ID: ISSN 1553-7358
Publisher:
Public Library of Science
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; Sequence motif analysis; RNA structure; Biological databases; Serine proteases; Polypeptides; Nucleic acids; Protein structure databases; Zinc

Citation Formats

Bittrich, Sebastian, Burley, Stephen K., and Rose, Alexander S. Real-time structural motif searching in proteins using an inverted index strategy. United States: N. p., 2020. Web. doi:10.1371/journal.pcbi.1008502.
Bittrich, Sebastian, Burley, Stephen K., & Rose, Alexander S. Real-time structural motif searching in proteins using an inverted index strategy. United States. https://doi.org/10.1371/journal.pcbi.1008502
Bittrich, Sebastian, Burley, Stephen K., and Rose, Alexander S. 2020. "Real-time structural motif searching in proteins using an inverted index strategy". United States. https://doi.org/10.1371/journal.pcbi.1008502. https://www.osti.gov/servlets/purl/1736347.
@article{osti_1736347,
title = {Real-time structural motif searching in proteins using an inverted index strategy},
author = {Bittrich, Sebastian and Burley, Stephen K. and Rose, Alexander S.},
abstractNote = {Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.},
doi = {10.1371/journal.pcbi.1008502},
journal = {PLoS Computational Biology (Online)},
number = 12,
volume = 16,
place = {United States},
year = {2020},
month = {12}
}
