Summary: Reference-Based Alignment in Large Sequence Databases
Panagiotis Papapetrou 1, Vassilis Athitsos 2, George Kollios 1, and Dimitrios Gunopulos 3,4
1 Computer Science Department, Boston University
2 Computer Science and Engineering Department, University of Texas at Arlington
3 Department of Informatics and Telecommunications, University of Athens
4 Computer Science and Engineering Department, UC Riverside
ABSTRACT
This paper introduces a novel method, called Reference-Based String
Alignment (RBSA), that speeds up retrieval of optimal subsequence
matches in large databases of sequences under the edit distance and
the Smith-Waterman similarity measure. RBSA operates under the
assumption that the optimal match deviates from the query by a
relatively small amount, one that does not exceed a prespecified
fraction of the query length. RBSA has an exact version that
guarantees no false dismissals and can handle large queries
efficiently. An approximate version of RBSA is also described; it
achieves significant additional improvements over the exact version,
with negligible losses in retrieval accuracy. RBSA performs
filtering of candidate matches using precomputed alignment scores
between the database sequence and a set of fixed-length reference