Summary: A Simple Statistical Algorithm for Biological Sequence
Compression
Minh Duc Cao Trevor I. Dix Lloyd Allison
Chris Mears
Faculty of Information Technology,
Monash University, Australia
Email: {minhc,trevor,lloyd,cmears}@infotech.monash.edu.au
Abstract
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding.
Experiments show that our algorithm outperforms existing compressors on typical DNA and
protein sequence datasets while maintaining a practical running time.
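The pipeline described in the abstract (a panel of experts, a combined distribution, and a per-symbol information sequence) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the names `combine`, `information_sequence`, `uniform_expert` and `order1_expert` are hypothetical, the two toy experts stand in for whatever experts the paper actually uses, and the weighting rule (each expert's weight is multiplied by the probability it gave to the observed symbol, a simple Bayesian average) is only one plausible way to combine expert probabilities. The arithmetic coder itself is omitted; the information content -log2 p of each symbol approximates the number of bits an arithmetic coder would spend on it.

```python
import math

ALPHABET = "ACGT"  # DNA alphabet; a protein alphabet would work the same way


def combine(predictions, weights):
    """Blend the experts' distributions into one, using normalised weights."""
    total = sum(weights)
    return {s: sum(w * p[s] for w, p in zip(weights, predictions)) / total
            for s in ALPHABET}


def information_sequence(seq, experts):
    """Return the information content (in bits) of each symbol of seq."""
    weights = [1.0] * len(experts)  # assumed: all experts start equally trusted
    bits = []
    for i, symbol in enumerate(seq):
        # Each expert predicts the next symbol from the history seen so far.
        predictions = [expert(seq[:i]) for expert in experts]
        mixed = combine(predictions, weights)
        bits.append(-math.log2(mixed[symbol]))
        # Assumed update rule: reward experts that predicted the symbol well.
        weights = [w * p[symbol] for w, p in zip(weights, predictions)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalise to avoid underflow
    return bits


def uniform_expert(history):
    """Toy expert: every symbol is equally likely."""
    return {s: 1.0 / len(ALPHABET) for s in ALPHABET}


def order1_expert(history):
    """Toy expert: order-1 context counts with add-one smoothing."""
    counts = {s: 1 for s in ALPHABET}
    if history:
        context = history[-1]
        for prev, nxt in zip(history, history[1:]):
            if prev == context:
                counts[nxt] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


if __name__ == "__main__":
    seq = "ACACACACACAC"
    bits = information_sequence(seq, [uniform_expert, order1_expert])
    print(" ".join(f"{b:.2f}" for b in bits))
    print(f"total: {sum(bits):.2f} bits for {len(seq)} symbols")
```

On a repetitive input such as the one in the usage lines, the per-symbol cost should fall below the 2 bits per base of a fixed-length encoding once the context expert has seen the pattern; plotting the returned list along the sequence gives the kind of information sequence the abstract refers to.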
1. Introduction
Modelling DNA and protein sequences is an important step in understanding
biology. Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),