## A Clustering Method for Molecular Sequences based on Pairwise Similarity

H. Matsuda (matsuda@ics.es.osaka-u.ac.jp)

T. Ishihara (tatuya-i@ics.es.osaka-u.ac.jp)

A. Hashimoto (hasimoto@ics.es.osaka-u.ac.jp)

Department of Informatics and Mathematical Science,

Graduate School of Engineering Science, Osaka University

1-3 Machikaneyama, Toyonaka, Osaka 560 Japan

**Abstract**

This paper presents a method for clustering a large and mixed set of
uncharacterized sequences provided by genome projects. As the measure of
the clustering, we use a fast approximation of sequence similarity
(FASTA score). However, in the case to
detect similarity between two sequences that are much diverged in
evolutionary process, FASTA sometimes underestimates the similarity
compared to the rigorous Smith-Waterman algorithm. Also the distance
derived from the similarity score may not be metric since the triangle
inequality may not hold when the sequences have multi-domain structure.
To cope with these problems, we introduce a new graph structure called
p-quasi complete graph for describing a cluster of sequences
with a confidence measure. We prove that a restricted version of the
p-quasi complete graph problem (given a positive integer k,
whether a graph contains a 0.5-quasi complete subgraph of which size
geq k or not) is NP-complete. Thus we present the outline of an
approximation algorithm for clustering a set of sequences into
subsets corresponding to p-quasi complete graphs. The effectiveness
of our method is demonstrated by the result of clustering
*Escherichia coli* protein sequences by our method.