A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison

Xin Chen[1] (xchen@cs.cityu.edu.hk)
Sam Kwong[2] (cssamk@cityu.edu.hk)
Ming Li[3] (mli@wh.math.uwaterloo.ca)

[1] Department of Computer Science, City University of Hong Kong
Hong Kong, China.
Permanent address: School of Mathematical Sciences, Peking University, China
[2] Department of Computer Science, City University of Hong Kong
Hong Kong, China
[3] Department of Computer Science, University of Waterloo
ONT, N2L 3G1, Canada


Abstract

We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences.

We then describe a theory of measuring the relatedness between two DNA sequences. Using our algorithm, we present strong experimental support for this theory, and demonstrate its application in comparing genomes and constructing evolutionary trees.

[ Full-text PDF | Table of Contents ]


Japanese Society for Bioinformatics