Simrank: Rapid and sensitive general-purpose k-mer search tool
Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset. Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity.
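As a rough illustration of the k-mer matching idea described in the abstract, the sketch below ranks database strings by the fraction of a query's distinct k-mers that they share. This is not Simrank's implementation or interface; the function names (`kmers`, `build_index`, `rank`), the scoring rule, the value k = 4, and the toy sequences are all assumptions chosen for the example.

```python
from collections import defaultdict

def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def build_index(database, k):
    """Map each k-mer to the set of database entry names that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def rank(query, index, k, top=3):
    """Rank entries by the fraction of the query's distinct k-mers they share.

    The scoring rule is an assumption made for this sketch.
    """
    q = kmers(query, k)
    if not q:  # query shorter than k: nothing to match
        return []
    counts = defaultdict(int)
    for kmer in q:
        for name in index.get(kmer, ()):
            counts[name] += 1
    scored = [(name, n / len(q)) for name, n in counts.items()]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda t: (-t[1], t[0]))
    return scored[:top]

# Hypothetical toy database and query, for illustration only.
database = {"seq_a": "ACGTACGTGG", "seq_b": "ACGTTTTTGG", "seq_c": "GGGGCCCCAA"}
index = build_index(database, k=4)
hits = rank("ACGTACGTGA", index, k=4)
```

Here `hits` lists `seq_a` ahead of `seq_b`, and `seq_c` is absent because it shares no 4-mers with the query. Building the inverted index once and reusing it across queries is what keeps each lookup cheap in this kind of approach.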
- Research Organization: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: Earth Sciences Division
- DOE Contract Number: DE-AC02-05CH11231
- OSTI ID: 1016705
- Report Number(s): LBNL-4596E; TRN: US201112%%504
- Journal Information: BMC Ecology, Vol. 11, Issue 11; Related Information: Journal Publication Date: 2011
- Country of Publication: United States
- Language: English
Similar Records
IMG/M 4 version of the integrated metagenome comparative analysis system
An optimized FM-index library for nucleotide and amino acid search