Clustering Molecular Sequences with Their Components

Sivasundaram Suharnan[1] (suharu-s@is.aist-nara.ac.jp)
Takeshi Itoh[2] (taitoh@lab.nig.ac.jp)
Hideo Matsuda[3] (matsuda@ics.es.osaka-u.ac.jp)
Hirotada Mori[4] (hmori@gtc.aist-nara.ac.jp)

[1] Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-01, Japan
[2] Center for Information Biology, National Institute of Genetics
1111 Yata, Mishima, Shizuoka 411, Japan
[3] Department of Informatics and Mathematical Science, Graduate School of Engineering Science, Osaka University
1-3 Machikaneyama, Toyonaka, Osaka 560, Japan
[4] Research and Education Center for Genetic Information, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-01, Japan


Abstract

Motivation: Several methods have recently been developed in the field of genetic information to classify protein sequences by their sequence similarity. These methods are essential for understanding the function of predicted open reading frames (ORFs) and their molecular evolutionary processes. However, many protein sequences consist of a number of independently evolved structural units (we refer to these units as components), and the combinatorial nature of the components makes the sequences difficult to classify.

Results: This paper presents a new method for classifying uncharacterized protein sequences. As the measure of sequence similarity, we use a similarity score computed by a method based on the Smith-Waterman local alignment algorithm. We describe how the method copes with sequences that have a multi-component structure. The method was applied to predicted ORFs in the Escherichia coli genome, and we discuss the algorithm and the experimental results.

Keywords: sequence classification, gene component, genome analysis
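
As an illustration of the kind of similarity score the abstract refers to, the following is a minimal sketch of a Smith-Waterman local alignment score. It is not the authors' implementation: the abstract does not state the scoring matrix or gap penalties used, so the function name `smith_waterman_score` and the `match`, `mismatch`, and `gap` values (a simple linear gap penalty) are assumptions chosen only for illustration.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Scoring is a hypothetical match/mismatch scheme with a linear gap
    penalty; a real protein comparison would normally use a substitution
    matrix and affine gap penalties.
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0                    # highest cell value seen so far
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share one region ("HELLO") but differ elsewhere.
shared = smith_waterman_score("XXXXHELLOYYYY", "ZZHELLOWW")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

Because the score is local, the two toy sequences score highly on their shared region even though their flanking regions differ. This is one way to see why sequences built from several components can be similar to otherwise unrelated groups of sequences.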



Japanese Society for Bioinformatics