Fanfan Zeng (email@example.com)
Roland H. C. Yap (firstname.lastname@example.org)
Limsoon Wong (email@example.com)
School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543, Singapore
Laboratories for Information Technology, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore
Correct prediction of the translation initiation site (TIS) is an important issue in genomic research. We show that feature generation together with correlation-based feature selection can be used with a variety of machine learning algorithms to give highly accurate TIS prediction. Only a few features are needed, and the results achieve accuracy comparable to that of the best existing approaches. Our approach has the advantage that it does not require one to devise a special prediction method; rather, standard machine learning classifiers are shown to give very good performance on the selected features. The raw and generated features that we have found to be important are the following: positions -3 and -1 in the sequence; upstream k-grams for k = 3, 4, and 5; stop-codon frequency; downstream in-frame 3-gram; and the distance of the ATG to the beginning of the sequence. The best result, an overall accuracy of 90%, is obtained by selecting only seven features from this set. The same features, retrained with the use of a scanning model, achieve an overall accuracy of 94% on this dataset.
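To make the feature families listed above concrete, the following is a minimal sketch of how they might be computed for a single candidate ATG. It is illustrative only and is not the implementation used in this work. The function names `kgram_counts` and `tis_features`, the feature key names, the placeholder "N" for positions that fall outside the sequence, and the choice to count stop codons in-frame downstream of the ATG are all assumptions made for the example.

```python
from collections import Counter

STOP_CODONS = ("TAA", "TAG", "TGA")


def kgram_counts(segment, k, step=1):
    """Count k-grams in `segment`, advancing by `step` (step=3 reads codons in-frame)."""
    return Counter(segment[i:i + k] for i in range(0, len(segment) - k + 1, step))


def tis_features(seq, atg_pos):
    """Illustrative features for the candidate ATG starting at 0-based index `atg_pos`."""
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG at the given position")

    upstream = seq[:atg_pos]
    downstream = seq[atg_pos + 3:]

    features = {
        # Raw positional features: the A of ATG is +1, so -1 is the base just before it.
        "pos_-3": seq[atg_pos - 3] if atg_pos >= 3 else "N",
        "pos_-1": seq[atg_pos - 1] if atg_pos >= 1 else "N",
        # Distance of the ATG to the beginning of the sequence.
        "distance_to_start": atg_pos,
    }

    # Upstream k-gram counts for k = 3, 4, 5 (all reading frames).
    for k in (3, 4, 5):
        for gram, n in kgram_counts(upstream, k).items():
            features[f"up_{k}gram_{gram}"] = n

    # Downstream in-frame 3-grams (codons read in the frame set by the ATG).
    codons = kgram_counts(downstream, 3, step=3)
    for codon, n in codons.items():
        features[f"down_inframe_{codon}"] = n

    # Assumption: stop-codon frequency is taken as the in-frame downstream count.
    features["stop_codon_count"] = sum(codons[c] for c in STOP_CODONS)
    return features


example = tis_features("GCCACCATGGCTTAAGCTTGA", 6)
```

A dictionary of this kind would then be passed to a feature selection step and a standard classifier; which seven features are retained is determined by the selection procedure, not by this sketch.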