Home

About

Advanced Search

Browse by Discipline

Scientific Societies

E-print Alerts

Add E-prints

E-print Network
FAQHELPSITE MAPCONTACT US


  Advanced Search  

 
Discovering simple DNA sequences by compression David R. Powell, David L. Dowe, Lloyd Allison and Trevor I. Dix
 

Summary: Discovering simple DNA sequences by compression
David R. Powell, David L. Dowe, Lloyd Allison and Trevor I. Dix
Department of Computer Science, Monash University, Clayton,
Vic. 3168, Australia
e-mail: fpowell, dld, lloyd, trevorg@cs.monash.edu.au
Abstract
An information-theoretic DNAcompression scheme devised by Milosavl-
jevic and Jurka 1993 has been used in many places in the literature
for both the discovery of new genes and the compression of DNA. Their
compression method applies an encoding of previously occurring runs.
They use 5 di erent code-words: four being the DNA bases, A, C, G and
T, and the other being a pointer to a previously occurring run. They
advocate a code-word of length log2 5 for each of these and then encoding
a run by a code-word of length 2 log2 n, where n is the length of the
sequence. This scheme encodes the start of the sequence with a code-
word of length log2 n and likewise encodes the end of the sequence with
a code-word of length log2 n. In this paper, we show the above coding
scheme to be ine cient in various ways and improve upon it so that it
can compress DNA. We discuss our implementation of various schemes
some of which run in linear time.

  

Source: Allison, Lloyd - Caulfield School of Information Technology, Monash University

 

Collections: Computer Technologies and Information Sciences