A Novel Index which Precisely Derives Protein Coding Regions from Cross-Species Genome Alignments

Hideki Noguchi[1] (hide@gsc.riken.go.jp)
Tetsushi Yada[2] (yada@ims.u-tokyo.ac.jp)
Yoshiyuki Sakaki[1],[2] (sakaki@gsc.riken.go.jp)

[1]Genomic Sciences Center, RIKEN, 1-7-22 Furo-cho, Tsurumi-ku, Yokohama 230-0045, Japan
[2]Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan


We introduce a novel index that precisely derives protein coding regions from cross-species genome alignments. The index is closely related to the frame recovery observed in alignments of coding sequences: if insertions or deletions of nucleotides cause frame shifts in coding regions, other in-dels that recover the reading frame are often observed in the vicinity. In contrast, such frame recoveries are not observed in other conserved regions. We prepared two gene models: a model that finds genes by using sequence similarity and intrinsic gene measures (basic model), and a model that finds genes by using the frame recovery index in addition to sequence similarity and intrinsic gene measures (frame recovery model). We evaluated the prediction accuracies of the two models, and our benchmark test revealed that the frame recovery model significantly improved prediction accuracy in comparison with the basic model.
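
The frame recovery idea can be illustrated with a minimal sketch. This is not the index as defined by the authors, whose exact formulation is not given in this abstract. It assumes a gapped pairwise alignment in which "-" marks a gap, and it counts frame shifts that are compensated by a later in-del within a fixed distance. The function names and the `window` parameter are hypothetical and chosen only for illustration.

```python
def gap_runs(seq_a, seq_b):
    """Return (start_column, signed_length) for each run of gaps.

    The length is positive for gaps in seq_a and negative for gaps in seq_b.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    prev_sign = 0
    for pos, (x, y) in enumerate(zip(seq_a, seq_b)):
        sign = (x == "-") - (y == "-")  # +1, -1, or 0 for this column
        if sign != 0 and sign == prev_sign:
            start, length = runs[-1]
            runs[-1] = (start, length + sign)  # extend the current run
        elif sign != 0:
            runs.append((pos, sign))  # open a new run
        prev_sign = sign
    return runs


def frame_recovery_counts(seq_a, seq_b, window=30):
    """Count frame shifts that are (recovered, not recovered).

    A shift counts as recovered when a later in-del restores the relative
    reading frame within `window` alignment columns (an assumed cutoff).
    """
    offset = 0          # relative reading-frame offset, modulo 3
    shift_start = None  # column where the current frame shift began
    recovered = 0
    unrecovered = 0
    for pos, signed_length in gap_runs(seq_a, seq_b):
        new_offset = (offset + signed_length) % 3
        if offset == 0 and new_offset != 0:
            shift_start = pos
        elif offset != 0 and new_offset == 0:
            if pos - shift_start <= window:
                recovered += 1
            else:
                unrecovered += 1
            shift_start = None
        offset = new_offset
    if offset != 0:
        unrecovered += 1  # the alignment ends while still out of frame
    return recovered, unrecovered


if __name__ == "__main__":
    a = "ATGGC-TAAAGG-TCTGA"
    b = "ATGGCCTAA-GGATCTGA"
    print(frame_recovery_counts(a, b))
```

Under these assumptions, a conserved coding region would be expected to show mostly recovered shifts, while a conserved non-coding region would show no such bias. In-dels whose length is a multiple of three leave the frame unchanged and are ignored by this sketch.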

