A Graph-Based Clustering Method for a Large Set of Sequences Using a Graph Partitioning Algorithm

Hideya Kawaji [1] [2] (kawaji@ics.es.osaka-u.ac.jp)
Yosuke Yamaguchi [1] (y-yamagu@ics.es.osaka-u.ac.jp)
Hideo Matsuda [1] (matsuda@ics.es.osaka-u.ac.jp)
Akihiro Hashimoto [1] (hashimoto@ics.es.osaka-u.ac.jp)

[1] Department of Informatics and Mathematical Science, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan
[2] NTT Software Corporation, 223-1 Yamashita-cho, Naka-ku, Yokohama-shi, Kanagawa 231-8554, Japan


Abstract

A graph-based clustering method is proposed to cluster protein sequences into families, which automatically improves clusters of the conventional single linkage clustering method. Our approach formulates sequence clustering problem as a kind of graph partitioning problem in a weighted linkage graph, which vertices correspond to sequences, edges correspond to higher similarities than given threshold and are weighted by their similarities. The effectiveness of our method is shown in comparison with InterPro families in all mouse proteins in SWISS-PROT. The result clusters match to InterPro families much better than the single linkage clustering method. 77% of proteins in InterPro families are classified into appropriate clusters.

[ Full-text PDF | Table of Contents ]


Japanese Society for Bioinformatics