Clustering of Database Sequences for Fast Homology Search Using Upper Bounds on Alignment Score

Masumi Itoh (
Tatsuya Akutsu (
Minoru Kanehisa (

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan


Homology data are among the most important information used to predict the functions of unknown proteins and thus fast and accurate methods are needed. In this paper, we propose a new approach for fast and accurate homology search using pre-computed all-against-all similarity scores in a target database. We previously developed a method for derivation of an upper bound of the Smith-Waterman score (SW-score) between a query and a homolog candidate sequence using the SW-score between the candidate and a sequence similar to the query. In this paper, by using this upper bound, we first cluster the sequences in the target database so that upper bounds of SW-scores for all the members in the clusters are less than a given value and select representative sequences for respective clusters. Then, the query sequence is searched against the representative sequences and the upper bounds of SW-scores for respective clusters are estimated. Only if the upper bound is higher than a given threshold, SW-alignments are computed for all the sequences in the cluster. We performed computational experiments to test efficiency of the proposed method for the KEGG/GENES database using the KEGG/SSDB. The results suggest that our method is efficient for redundant databases that include multiple closely related species.

[ Full-text PDF | Table of Contents ]

Japanese Society for Bioinformatics