A Simple Statistical Algorithm for Biological Sequence Compression
 

Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears
Faculty of Information Technology,
Monash University, Australia
Email: {minhc,trevor,lloyd,cmears}@infotech.monash.edu.au
Abstract
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding.
Experiments show that our algorithm outperforms existing compressors on typical DNA and
protein sequence datasets while maintaining a practical running time.
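
To make the expert-panel idea concrete, here is a minimal sketch. It is not the authors' implementation. The abstract does not say what the experts are or how their probabilities are combined, so everything specific below is an assumption made for illustration: the experts are fixed-order context models with add-one smoothing, the blending weights come from how well each expert predicted the last `window` symbols, and the names `MarkovExpert` and `information_sequence` are hypothetical. The sketch reports the ideal code length in bits for each symbol (the information sequence) instead of running an arithmetic coder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class MarkovExpert:
    """Hypothetical expert: an order-k context model with add-one smoothing."""

    def __init__(self, order):
        self.order = order
        # Every context starts with a count of 1 per symbol (uniform prior).
        self.counts = defaultdict(lambda: {s: 1.0 for s in ALPHABET})

    def _context(self, history):
        # Last `order` symbols of the history (fewer near the start).
        return history[max(0, len(history) - self.order):] if self.order else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1.0


def information_sequence(seq, orders=(0, 1, 2), window=16):
    """Return the per-symbol cost in bits under the blended prediction."""
    experts = [MarkovExpert(k) for k in orders]
    recent = [[] for _ in experts]  # recent log2-probabilities, one list per expert
    info = []
    for i, symbol in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        # Assumed combination rule: weight each expert by the probability
        # it assigned to the last `window` symbols.
        weights = [2.0 ** sum(r) for r in recent]
        z = sum(weights)
        p = sum(w * pr[symbol] for w, pr in zip(weights, preds)) / z
        info.append(-math.log2(p))
        # Score each expert on the symbol just seen, then let it learn.
        for e, r, pr in zip(experts, recent, preds):
            r.append(math.log2(pr[symbol]))
            if len(r) > window:
                r.pop(0)
            e.update(history, symbol)
    return info
```

Calling `information_sequence("ACGT" * 20)` returns one cost per symbol. On such a periodic input the cost falls as the higher-order experts learn the pattern and gain weight. An arithmetic coder driven by the same blended distribution would produce output close to the sum of these costs.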
1. Introduction
Modelling DNA and protein sequences is an important step in understanding
biology. Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),

  

Source: Allison, Lloyd - Caulfield School of Information Technology, Monash University

 

Collections: Computer Technologies and Information Sciences