OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Bringing large-scale multiple genome analysis one step closer: ScalaBLAST and beyond

Technical Report · DOI: https://doi.org/10.2172/960403 · OSTI ID: 960403

Genome sequence comparisons of exponentially growing data sets form the foundation for the comparative analysis tools provided by community biological data resources such as the Integrated Microbial Genomes (IMG) system at the Joint Genome Institute (JGI). We present an example of how ScalaBLAST, a high-throughput sequence analysis program, harnesses high-performance computing to perform the sequence analysis that is a critical component of maintaining a state-of-the-art sequence data repository. The IMG system [1] is a data management and analysis platform for microbial genomes hosted at the JGI. IMG contains both draft and complete JGI genomes integrated with other publicly available microbial genomes from all three domains of life. IMG provides tools and viewers for interactive analysis of genomes, genes and functions, individually or in a comparative context. Most of these tools are based on pre-computed pairwise sequence similarities involving millions of genes. These computations are becoming prohibitively time-consuming, both because of the rapid increase in the number of newly sequenced genomes incorporated into IMG and because the content of IMG must be refreshed regularly to reflect changes in the annotations of existing genomes. Thus, building IMG 2.0 (released on December 1, 2006) entailed reloading from NCBI's RefSeq all the genomes in the previous version of IMG (IMG 1.6, as of September 1, 2006) together with 1,541 new public microbial, viral and eukaryal genomes, bringing the total of IMG genomes to 2,301. A critical part of building IMG 2.0 involved using the PNNL ScalaBLAST software to compute pairwise similarities for over 2.2 million genes in under 26 hours on 1,000 processors, illustrating the impact that new-generation bioinformatics tools are poised to make in biology.
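To put the quoted run in perspective, the figures above (over 2.2 million genes, under 26 hours, 1,000 processors) can be turned into rough throughput bounds. The short calculation below is only back-of-the-envelope arithmetic on those three numbers; the variable names are illustrative, and because the abstract gives bounds rather than exact values, the resulting rates should be read as lower bounds.

```python
# Figures quoted in the abstract. The run covered "over" 2.2 million genes
# in "under" 26 hours, so the rates computed here are lower bounds.
genes = 2_200_000
hours = 26
processors = 1_000

# Total compute budget consumed by the run.
processor_hours = hours * processors

# Query genes processed per processor per hour, and per second overall.
queries_per_processor_hour = genes / processor_hours
queries_per_second_overall = genes / (hours * 3600)

# An all-vs-all search compares every gene against every gene
# (counted here as ordered pairs, self-comparisons included).
pairs = genes * genes

print(f"processor-hours:            {processor_hours:,}")
print(f"queries per processor-hour: {queries_per_processor_hour:.1f}")
print(f"queries per second overall: {queries_per_second_overall:.1f}")
print(f"query-target pairs:         {pairs:.2e}")
```

Under these assumptions the run consumed at most 26,000 processor-hours, which corresponds to roughly 85 query genes per processor per hour.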
The BLAST algorithm [2, 3] is a familiar bioinformatics application for computing sequence similarity, and has become a workhorse in large-scale genomics projects. The rapid growth of genome resources such as IMG cannot be sustained without more powerful tools such as ScalaBLAST that make more effective use of large-scale computing resources to perform the core BLAST calculations. ScalaBLAST is a high-performance computing implementation of BLAST designed to deliver high-throughput results on high-end supercomputers. Other parallel sequence comparison applications have been developed [4-6]. However, problems with scaling generally prevent these applications from being used for very large searches. ScalaBLAST [7] is the first BLAST application to scale well with both the size of the database and the number of processors, on high-end hardware as well as on commodity clusters. ScalaBLAST achieves high throughput by dividing a large collection of query sequences into independent subgroups. These smaller tasks are assigned to independent process groups. Efficient scaling is achieved by sharing, transparently to the user, only one copy of the target database across all processors using the Global Arrays toolkit [8, 9], which provides a software implementation of a shared-memory interface. ScalaBLAST was initially deployed on the 1,960-processor MPP2 cluster in the William R. Wiley Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory, and has since been ported to a variety of Linux-based clusters and shared-memory architectures, including SGI Altix, AMD Opteron, and Intel Xeon-based clusters. Future targets include IBM BlueGene, Cray, and SGI Altix XE architectures. High-throughput calculations must be performed rapidly because of the rate at which sequence data is growing.
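The query-partitioning idea described above can be sketched in a few lines. This is a minimal illustration only, not ScalaBLAST's actual code: the function names `partition_queries` and `make_process_groups` are hypothetical, the static contiguous split is an assumption (the abstract does not say how subgroups are formed or scheduled), and the shared target database handled by the Global Arrays toolkit is not modeled at all.

```python
from typing import List, Sequence


def partition_queries(queries: Sequence[str], n_groups: int) -> List[List[str]]:
    """Split queries into n_groups contiguous subgroups of near-equal size.

    Hypothetical helper: a static split is assumed for illustration.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    base, extra = divmod(len(queries), n_groups)
    groups: List[List[str]] = []
    start = 0
    for g in range(n_groups):
        # The first `extra` groups each take one additional query.
        size = base + (1 if g < extra else 0)
        groups.append(list(queries[start:start + size]))
        start += size
    return groups


def make_process_groups(n_procs: int, procs_per_group: int) -> List[List[int]]:
    """Arrange processor ranks 0..n_procs-1 into independent groups.

    Hypothetical helper: the last group may be smaller if the division
    is not exact.
    """
    if procs_per_group < 1:
        raise ValueError("procs_per_group must be at least 1")
    return [
        list(range(first, min(first + procs_per_group, n_procs)))
        for first in range(0, n_procs, procs_per_group)
    ]


# Example: 10 query sequences spread over 8 processors in groups of 4.
queries = [f"gene_{i}" for i in range(10)]
process_groups = make_process_groups(n_procs=8, procs_per_group=4)
subgroups = partition_queries(queries, n_groups=len(process_groups))

for ranks, work in zip(process_groups, subgroups):
    print(f"ranks {ranks} search {len(work)} queries: {work}")
```

Because each subgroup is searched independently, the process groups need not communicate with one another about queries; in this sketch the only shared resource would be the single copy of the target database.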
For a genome sequencing center to provide multiple-genome comparison capabilities, it must keep pace with an exponentially growing collection of protein data, both from its own genomes and from public genome information. As sequence data continues to grow exponentially, this challenge will only increase with time. Solving the BLAST throughput challenge for centralized data resources like IMG has the potential to unlock the power of emerging analysis methods which, until recently, were limited by the availability of multiple-genome comparison data. Fig. 1 illustrates how the run time achieved by efficient scaling in ScalaBLAST enabled the IMG all vs. all BLAST calculations to complete in roughly 1 day. Note that to keep pace with the growing IMG database, we will have to double the number of processors used in these calculations during the upcoming year. Grid-based solutions for improving the throughput of BLAST searches have become a popular and attractive option for some centers. The Institute for Genomic Research (http://www.tigr.org/), for instance, has implemented a grid-based BLAST tool that allows users to submit requests to be farmed out to available computers on an on-demand basis.

Research Organization:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
Genomics Division
DOE Contract Number:
DE-AC02-05CH11231
OSTI ID:
960403
Report Number(s):
LBNL-62882; TRN: US200923%%492
Country of Publication:
United States
Language:
English