Xin Chen (email@example.com)
Zhengchang Su (firstname.lastname@example.org)
Ying Xu (email@example.com)
Tao Jiang (firstname.lastname@example.org)
Department of Computer Science and Engineering, University of
California at Riverside, CA 92507, USA
Department of Biochemistry and Molecular Biology, University of Georgia at Athens, GA 30602, and Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
We computationally predict operons in the Synechococcus sp. WH8102 genome based on three types of genomic data: intergenic distances, COG gene functions and phylogenetic profiles. In the proposed method, we first estimate a log-likelihood distribution for each type of genomic data, and then fuse the information from these distributions with a perceptron to discriminate pairs of genes within operons (WO pairs) from those across transcription unit borders (TUB pairs). Computational experiments demonstrate that WO pairs tend to have shorter intergenic distances, a higher probability of being in the same COG functional category, and more similar phylogenetic profiles than TUB pairs, indicating that all three data types are informative for operon prediction. When the method is tested on 236 known operons of Escherichia coli K12, joint learning from the multiple types of genomic data achieves an overall accuracy of 83.8%, whereas the individual information sources yield accuracies of 80.4%, 74.4%, and 70.6%, respectively. We have applied this new approach, in conjunction with our previous comparative genome analysis-based approach, to predict 556 (putative) operons in WH8102. All predicted data are available for public use at http://www.cs.ucr.edu/~xin/operons.htm.
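The two-stage idea in the abstract (a log-likelihood score per data type, followed by a perceptron that fuses the scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the histogram binning, the pseudocount, the learning rate, the toy values, and all function and variable names below are assumptions chosen only for illustration.

```python
import math

def log_likelihood_ratio(value, wo_values, tub_values, edges, pseudocount=1.0):
    """Return log[P(bin | WO) / P(bin | TUB)] for the histogram bin holding `value`.

    `edges` are hypothetical bin boundaries; additive smoothing keeps the
    ratio finite when a bin is empty in one of the two training sets.
    """
    def bin_index(v):
        return sum(v >= e for e in edges)  # number of boundaries at or below v

    n_bins = len(edges) + 1
    target = bin_index(value)

    def prob(values):
        count = sum(1 for v in values if bin_index(v) == target)
        return (count + pseudocount) / (len(values) + pseudocount * n_bins)

    return math.log(prob(wo_values) / prob(tub_values))

def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Classic perceptron rule; labels are +1 (WO pair) or -1 (TUB pair)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy intergenic distances in base pairs (made-up values, not from the paper).
wo_dist = [5, 20, 40, 60]
tub_dist = [150, 300, 90, 500]
edges = [50, 100, 200]
short_score = log_likelihood_ratio(10, wo_dist, tub_dist, edges)
long_score = log_likelihood_ratio(400, wo_dist, tub_dist, edges)

# Toy score triples (distance, COG, phylogenetic profile), also made up.
samples = [(1.2, 0.8, 0.5), (0.9, 0.4, 0.7), (1.5, 0.2, 0.1),
           (-1.0, -0.3, -0.4), (-0.6, -0.9, 0.1), (-1.3, 0.2, -0.5)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

In this sketch a positive score means the observation is more typical of WO pairs than of TUB pairs, so a short intergenic distance receives a positive score and a long one a negative score. The perceptron then learns one weight per data type, which is one simple way to let a more discriminative source contribute more to the final decision.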