Real-time structural motif searching in proteins using an inverted index strategy
Abstract
Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.
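The inverted index idea named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the descriptor used here (two residue names plus a binned distance between one representative coordinate per residue), the function names, the bin width, and the structure identifiers S1 to S3 are all assumptions chosen for illustration. The sketch only shows why a lookup touches the few structures that contain every residue pair of the query and never visits the rest.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def pair_keys(residues, bin_width=1.0, cutoff=20.0):
    """Yield one hashable descriptor per residue pair.

    `residues` is a list of (residue_name, (x, y, z)) tuples. The descriptor
    (sorted names + binned distance) is a simplified, hypothetical stand-in.
    """
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(residues, 2):
        d = dist(xyz_a, xyz_b)
        if d <= cutoff:
            first, second = sorted((name_a, name_b))
            yield (first, second, int(d // bin_width))


def build_index(structures):
    """Map each pair descriptor to the set of structure IDs that contain it."""
    index = defaultdict(set)
    for structure_id, residues in structures.items():
        for key in pair_keys(residues):
            index[key].add(structure_id)
    return index


def candidate_structures(index, motif):
    """Intersect the posting sets of all residue pairs in the query motif."""
    postings = [index.get(key, set()) for key in set(pair_keys(motif))]
    if not postings:
        return set()
    return set.intersection(*postings)


# Toy data (hypothetical coordinates, one point per residue).
structures = {
    "S1": [("HIS", (0, 0, 0)), ("ASP", (3.5, 0, 0)), ("SER", (0, 4.2, 0)),
           ("GLY", (10, 10, 10))],
    "S2": [("HIS", (1, 1, 1)), ("ASP", (4.6, 1, 1)), ("SER", (1, 5.4, 1))],
    "S3": [("HIS", (0, 0, 0)), ("ASP", (8, 0, 0)), ("SER", (0, 4.2, 0))],
}
index = build_index(structures)

# Query motif: a His-Asp-Ser arrangement with similar pairwise distances.
motif = [("HIS", (0, 0, 0)), ("ASP", (3.4, 0, 0)), ("SER", (0, 4.5, 0))]
print(sorted(candidate_structures(index, motif)))  # ['S1', 'S2']
```

In this toy version, hard bin edges can miss near matches, so a practical search would also query neighboring bins within a tolerance. The surviving candidates would still need geometric verification, for example by superposition as the abstract describes.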
- Authors:
- Bittrich, Sebastian; Burley, Stephen K.; Rose, Alexander S.
- Univ. of California, San Diego, La Jolla, CA (United States)
- Univ. of California, San Diego, La Jolla, CA (United States); Rutgers, The State Univ. of New Jersey, Piscataway, NJ (United States)
- Publication Date:
- Research Org.:
- Univ. of California, San Diego, La Jolla, CA (United States). RCSB Protein Data Bank
- Sponsoring Org.:
- USDOE Office of Science (SC), Biological and Environmental Research (BER). Earth and Environmental Systems Science Division; USDOE
- OSTI Identifier:
- 1736347
- Alternate Identifier(s):
- OSTI ID: 1734438
- Grant/Contract Number:
- SC0019749
- Resource Type:
- Accepted Manuscript
- Journal Name:
- PLoS Computational Biology (Online)
- Additional Journal Information:
- Journal Name: PLoS Computational Biology (Online); Journal Volume: 16; Journal Issue: 12; Journal ID: ISSN 1553-7358
- Publisher:
- Public Library of Science
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; Sequence motif analysis; RNA structure; Biological databases; Serine proteases; Polypeptides; Nucleic acids; Protein structure databases; Zinc
Citation Formats
Bittrich, Sebastian, Burley, Stephen K., and Rose, Alexander S. Real-time structural motif searching in proteins using an inverted index strategy. United States: N. p., 2020. Web. doi:10.1371/journal.pcbi.1008502.
Bittrich, Sebastian, Burley, Stephen K., & Rose, Alexander S. Real-time structural motif searching in proteins using an inverted index strategy. United States. https://doi.org/10.1371/journal.pcbi.1008502
Bittrich, Sebastian, Burley, Stephen K., and Rose, Alexander S. Mon . "Real-time structural motif searching in proteins using an inverted index strategy". United States. https://doi.org/10.1371/journal.pcbi.1008502. https://www.osti.gov/servlets/purl/1736347.
@article{osti_1736347,
title = {Real-time structural motif searching in proteins using an inverted index strategy},
author = {Bittrich, Sebastian and Burley, Stephen K. and Rose, Alexander S.},
abstractNote = {Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.},
doi = {10.1371/journal.pcbi.1008502},
journal = {PLoS Computational Biology (Online)},
number = 12,
volume = 16,
place = {United States},
year = {2020},
month = {12}
}