Evaluating Distance Functions for Clustering Tandem Repeats

Suyog Rao[1],[2] (suyog@bu.edu)
Alfredo Rodriguez[2] (alfredo@bu.edu)
Gary Benson[2],[3] (gbenson@bu.edu)

[1]Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA
[2]Laboratory for Biocomputing and Informatics, Boston University, Boston, MA, USA
[3]Departments of Computer Science and Biology, Graduate Program in Bioinformatics, Boston University, Boston, MA, USA


Abstract

Tandem repeats are an important class of DNA repeats, and much research has focused on their efficient identification[2,4,5,11,12], their use in DNA typing and fingerprinting[6,16,18], and their causative role in trinucleotide repeat diseases such as Huntington disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups, or families, based on sequence similarity so that their biological importance may be further explored. Clustering tandem repeats requires a notion of pairwise distance, which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments: Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. Analyzing and comparing these functions is important because the choice of distance function lies at the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions using two cluster validation techniques, Average Cluster Density and Average Silhouette Width. Finally, we propose a multi-phase clustering method that produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: Human Chromosomes 3, 5, 10, and X, and C. elegans Chromosome III.
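
As a rough illustration of two of the distance functions and one of the validation measures named above, the following Python sketch uses the standard textbook definitions of Euclidean distance, Jensen-Shannon divergence, and silhouette width. The abstract does not give the paper's own formulations, so everything here is an assumption: in particular, that distances are computed between nucleotide frequency vectors of profile columns, and all function names (`column_frequencies`, `euclidean`, `jensen_shannon`, `average_silhouette_width`) are hypothetical.

```python
import math

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, gaps ignored


def column_frequencies(column):
    """Relative frequencies of A, C, G, T in one column of aligned repeat copies."""
    counts = [column.count(base) for base in ALPHABET]
    total = sum(counts)
    if total == 0:
        raise ValueError("column contains no nucleotides")
    return [c / total for c in counts]


def euclidean(p, q):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _kl(p, m):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def average_silhouette_width(dist, labels):
    """Mean silhouette width for a full pairwise distance matrix and cluster labels."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    n = len(labels)
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            widths.append(0.0)  # singleton cluster: silhouette taken as 0
            continue
        a = sum(own) / len(own)  # mean distance to the item's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n
```

Under these assumptions, two columns with no nucleotide in common are maximally distant under Jensen-Shannon divergence, and compact, well-separated clusters yield an average silhouette width close to 1.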



Japanese Society for Bioinformatics