A Simple Statistical Algorithm for Biological Sequence Compression

Summary:
Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears
Faculty of Information Technology,
Monash University, Australia
Email: {minhc,trevor,lloyd,cmears}@infotech.monash.edu.au
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding.
Experiments show that our algorithm outperforms existing compressors on typical DNA and
protein sequence datasets while maintaining a practical running time.
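
The idea of combining a panel of experts can be illustrated with a minimal sketch. It is not the authors' implementation: the expert type (an order-k context model with add-one smoothing), the Bayesian weighting over the whole history, and the names MarkovExpert and information_sequence are all assumptions made for illustration. The sketch returns the information sequence, that is, the cost in bits of each symbol under the combined distribution, which is the code length an arithmetic coder would approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class MarkovExpert:
    """Hypothetical expert: order-k context model with add-one smoothing."""

    def __init__(self, order):
        self.order = order
        # Every context starts with a count of 1 per symbol (uniform prior).
        self.counts = defaultdict(lambda: {s: 1 for s in ALPHABET})

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def information_sequence(seq, experts):
    """Return the per-symbol cost in bits under the combined distribution."""
    # Assumption: each expert's weight is proportional to the probability
    # it assigned to the symbols seen so far (kept in log2 for stability).
    log_w = [0.0] * len(experts)
    info = []
    for i, sym in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        top = max(log_w)
        w = [2.0 ** (lw - top) for lw in log_w]
        z = sum(w)
        # Weighted average of the expert distributions.
        mixed = {s: sum(wi * p[s] for wi, p in zip(w, preds)) / z
                 for s in ALPHABET}
        info.append(-math.log2(mixed[sym]))
        # Reward experts that predicted the symbol well, then let them learn.
        for k, (e, p) in enumerate(zip(experts, preds)):
            log_w[k] += math.log2(p[sym])
            e.update(history, sym)
    return info


if __name__ == "__main__":
    seq = "ACGT" * 50
    bits = information_sequence(seq, [MarkovExpert(k) for k in (0, 1, 2)])
    print(f"{sum(bits) / len(seq):.3f} bits per symbol")
```

Without any model, a four-letter alphabet costs 2 bits per symbol, so a sequence compresses whenever the average of the information sequence falls below that figure. Runs of low values in the sequence mark regions the experts predict well, such as repeats.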
1. Introduction
Modelling DNA and protein sequences is an important step in understanding biology.
Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),


Source: Allison, Lloyd - Caulfield School of Information Technology, Monash University


Collections: Computer Technologies and Information Sciences