DOE PAGES

Title: PaperBLAST: Text Mining Papers for Information about Homologs

Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators.
With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.
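The workflow described in the abstract (find identifiers in article text, link them to protein sequences, then look up proteins similar to a query) can be illustrated with a minimal Python sketch. This is not PaperBLAST's implementation. The identifiers, sequences, article IDs, and article text are invented for illustration, the function names are hypothetical, and a crude k-mer Jaccard score stands in for BLAST, which the real tool uses for the similarity search.

```python
import re
from collections import defaultdict

# Hypothetical toy data: identifiers, sequences, and article text are made up.
PROTEINS = {
    "XYZ_0001": "MKTAYIAKQRQISFVKSHFSRQ",
    "XYZ_0002": "MSDNELLKRAVEGHWTPLQC",
}
ARTICLES = {
    "ART1": "We show that XYZ_0002 is required for growth on minimal medium.",
    "ART2": "Deletion of XYZ_0001 had no detectable effect.",
}


def build_index(proteins, articles, width=30):
    """Link each protein identifier to (article id, snippet) pairs that mention it."""
    index = defaultdict(list)
    for art_id, text in articles.items():
        for ident in proteins:
            pattern = r"\b" + re.escape(ident) + r"\b"
            for m in re.finditer(pattern, text):
                start = max(0, m.start() - width)
                end = min(len(text), m.end() + width)
                index[ident].append((art_id, text[start:end]))
    return index


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets: a crude stand-in for BLAST, not BLAST itself."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def find_papers(query, proteins, index, min_sim=0.3):
    """Return (score, identifier, mentions) for similar proteins that have mentions."""
    hits = []
    for ident, seq in proteins.items():
        score = similarity(query, seq)
        mentions = index.get(ident)
        if score >= min_sim and mentions:
            hits.append((score, ident, mentions))
    hits.sort(key=lambda h: h[0], reverse=True)  # best match first
    return hits
```

Calling `find_papers(query, PROTEINS, build_index(PROTEINS, ARTICLES))` with a query sequence returns the similar proteins that have literature mentions, best match first, each with its article IDs and text snippets. The `min_sim` threshold and the snippet `width` are arbitrary choices for this sketch.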
Authors:
Price, Morgan N. [1]; Arkin, Adam P. [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
Grant/Contract Number:
AC02-05CH11231
Type:
Accepted Manuscript
Journal Name:
mSystems
Additional Journal Information:
Journal Volume: 2; Journal Issue: 4; Journal ID: ISSN 2379-5077
Publisher:
American Society for Microbiology
Research Org:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org:
USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)
Country of Publication:
United States
Language:
English
Subject:
60 APPLIED LIFE SCIENCES; 96 KNOWLEDGE MANAGEMENT AND PRESERVATION; 59 BASIC BIOLOGICAL SCIENCES; annotation; text mining
OSTI Identifier:
1399464

Price, Morgan N., and Arkin, Adam P. PaperBLAST: Text Mining Papers for Information about Homologs. United States: N. p., Web. doi:10.1128/msystems.00039-17.
Price, Morgan N., & Arkin, Adam P. PaperBLAST: Text Mining Papers for Information about Homologs. United States. doi:10.1128/msystems.00039-17.
Price, Morgan N., and Arkin, Adam P. 2017. "PaperBLAST: Text Mining Papers for Information about Homologs". United States. doi:10.1128/msystems.00039-17. https://www.osti.gov/servlets/purl/1399464.
@article{osti_1399464,
title = {PaperBLAST: Text Mining Papers for Information about Homologs},
author = {Price, Morgan N. and Arkin, Adam P.},
abstractNote = {Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.},
doi = {10.1128/msystems.00039-17},
journal = {mSystems},
number = 4,
volume = 2,
place = {United States},
year = {2017},
month = {8}
}