A Novel Bioinformatic Strategy for Unveiling Hidden Genome Signatures of Eukaryotes: Self-Organizing Map of Oligonucleotide Frequency

Takashi Abe[1],[2],[3] (tajaabe@lab.nig.ac.jp)
Shigehiko Kanaya[3],[4],[5] (skanaya@gtc.aist-nara.ac.jp)
Makoto Kinouchi[3],[5],[6] (kinouchi@yz.yamagata-u.ac.jp)
Yuta Ichiba[1],[3] (yichiba@lab.nig.ac.jp)
Tokio Kozuki[2],[3] (kozuki@xanagen.com)
Toshimichi Ikemura[1],[3] (tikemura@lab.nig.ac.jp)

[1]Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411-8540, Japan
[2]Xanagen Inc., Sakado, Takatsu-ku, Kawasaki, Kanagawa-ken 213-0012, Japan
[3]ACT-JST (Japan Science and Technology Corp.)
[4]Department of Bioinformatics and Genomes, Graduate School of Information Science, Nara Institute of Science and Technology, Takayama, Ikoma, Nara-ken 630-0101, Japan
[5]CREST JST (Japan Science and Technology Corp.)
[6]Department of Bio-System Engineering, Faculty of Engineering, Yamagata University, Yonezawa, Yamagata-ken 992-8510, Japan


With the increasing number of available genome sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics across a wide variety of genomes. We used an unsupervised neural network algorithm, Kohonen's self-organizing map (SOM), to analyze di- and trinucleotide frequencies in 9 eukaryotic genomes of known sequence (a total of 1.2 Gb): S. cerevisiae, S. pombe, C. elegans, A. thaliana, D. melanogaster, Fugu, and rice, as well as P. falciparum chromosomes 2 and 3 and human chromosomes 14, 20, 21, and 22, which have been almost completely sequenced. The genomic sequences were divided into windows of different sizes, and each window was encoded as a 16-dimensional vector of relative dinucleotide frequencies and as a 64-dimensional vector of relative trinucleotide frequencies. We analyzed a total of 120,000 nonoverlapping 10-kb sequences, and overlapping 100-kb sequences taken with a moving step size of 10 kb, all derived from the 1.2 Gb of genomic sequence. The SOMs gave clear species-specific separation of most sequences. For most of the 120,000 10-kb sequences, the unsupervised algorithm recognized the species-specific characteristics (key combinations of oligonucleotide frequencies) that are signature representations of each genome. Because the classification power is very high, the SOMs can provide fundamental bioinformatic strategies for extracting a wide range of genomic information that could not otherwise be obtained.
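
The encoding and clustering steps described in the abstract can be illustrated with a minimal sketch. It is written under several assumptions and is not the authors' implementation: the function names (kmer_frequencies, windows, train_som, best_matching_unit) are hypothetical, windows containing bases other than A, C, G, and T are simply skipped, and the map is trained with a basic online Kohonen update with a Gaussian neighborhood, whereas the abstract does not specify the initialization, map size, or learning schedule actually used.

```python
import itertools
import math
import random


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k oligonucleotides (k=2 -> 16, k=3 -> 64).

    Assumption: k-mers containing anything other than A, C, G, T are skipped.
    """
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def windows(seq, size, step):
    """Yield fragments of length `size` starting every `step` bases.

    size == step gives nonoverlapping windows (e.g. 10 kb / 10 kb);
    size > step gives overlapping ones (e.g. 100 kb with a 10-kb step).
    """
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def best_matching_unit(weights, v):
    """Return (row, col) of the map unit closest to v in Euclidean distance."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, v))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(vectors, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM with online updates (illustrative only)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Random initial weights on the same scale as the frequency vectors.
    weights = [[[rng.random() / dim for _ in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs              # decays from 1 toward 0
        rate = 0.5 * frac                        # learning rate
        radius = max(rows, cols) / 2 * frac + 0.5  # neighborhood width
        order = list(vectors)
        rng.shuffle(order)
        for v in order:
            br, bc = best_matching_unit(weights, v)
            for r in range(rows):
                for c in range(cols):
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = rate * math.exp(-dist2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += h * (v[i] - w[i])
    return weights
```

In this sketch, each fragment from windows(genome, 10000, 10000) or windows(genome, 100000, 10000) would be passed to kmer_frequencies with k=2 or k=3, the resulting vectors given to train_som, and each fragment then assigned to a map position with best_matching_unit. Species labels play no part in training; they would only be used afterwards to inspect which map units each genome occupies.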


Japanese Society for Bioinformatics