 
Summary: A Simple Statistical Algorithm for Biological Sequence Compression
Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears
Faculty of Information Technology,
Monash University, Australia
Email: {minhc,trevor,lloyd,cmears}@infotech.monash.edu.au
Abstract
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments
show that our algorithm outperforms existing compressors on typical DNA and
protein sequence datasets while maintaining a practical running time.
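The expert-panel idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two experts, their probabilities, the fixed weights, and the function names `blend` and `information_content` are all hypothetical, and a plain weighted average is assumed as the combination rule, since the abstract does not state how the experts are combined. The sketch only shows how several per-symbol distributions could be merged into one, and how the cost in bits that an ideal arithmetic coder would pay for the observed symbol follows from the merged distribution.

```python
import math

ALPHABET = "ACGT"  # the four DNA nucleotides


def blend(expert_dists, weights):
    """Weighted average of expert distributions (assumed combination rule)."""
    total = sum(weights)
    return {
        s: sum(w * d[s] for w, d in zip(weights, expert_dists)) / total
        for s in ALPHABET
    }


def information_content(dist, symbol):
    """Bits an ideal arithmetic coder would spend on `symbol` under `dist`."""
    return -math.log2(dist[symbol])


# Two hypothetical experts: one with no preference, one (e.g. repeat-based)
# that strongly predicts 'A' as the next symbol.
uniform = {s: 0.25 for s in ALPHABET}
repeat = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}

# Hypothetical weights; here the repeat expert is trusted three times as much.
mixed = blend([uniform, repeat], weights=[1.0, 3.0])
bits = information_content(mixed, "A")
```

Computing `information_content` for every position of a sequence would give a per-symbol series of costs, which is one way to read the "information sequence" mentioned in the abstract. When the trusted expert predicts the symbol correctly, the cost falls below the 2 bits per nucleotide of a uniform code.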
1. Introduction
Modelling DNA and protein sequences is an important step in understanding biology.
Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),
