On Selecting Features from Splice Junctions: An Analysis Using Information Theoretic and Machine Learning Approaches

Christina L. Zheng[1] (czheng@sdsc.edu)
Virginia R. de Sa[2] (desa@cogsci.ucsd.edu)
Michael Gribskov[1] (gribskov@sdsc.edu)
T. Murlidharan Nair[1] (nair@sdsc.edu)

[1]San Diego Supercomputer Center
[2]Department of Cognitive Science, University of California, San Diego, 9500 Gilman Dr., La Jolla CA, 92093 USA


The computational recognition of precise splice junctions is a challenge in the analysis of newly sequenced genomes, because the distribution of sequence patterns in these regions is not always distinct. Our objective is to understand the sequence signatures at the splice junctions, not simply to create an artificial recognition system. For this purpose we combine a neural-network-based calliper randomization approach with an information-theoretic feature selection approach, in order to identify regions that harbor information content and to extract features relevant for the prediction of splice junctions. The analysis using the neural-network-based calliper randomization approach revealed regions important in the internal representation of the network model. The calliper approach captured both correlated and independently important features, whereas the feature selection approach captures features that are independently informative. The two methods can thus capture features with different properties, and a comparative analysis of their results helps to infer the kind of information present in a region.
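
As an illustration of what "independently informative" can mean for a sequence position, the following minimal sketch scores each position of a set of aligned sequences by its mutual information with a class label (junction versus non-junction). It is not the authors' procedure, which the abstract does not specify; the function name mutual_information_per_position, the toy sequences, and the labels are all hypothetical and chosen only to make the idea concrete.

```python
import math
from collections import Counter


def mutual_information_per_position(sequences, labels):
    """Return I(X_i; Y) in bits for every position i of equal-length sequences.

    Hypothetical helper for illustration only; each position is scored on its
    own, so correlations between positions are deliberately ignored.
    """
    n = len(sequences)
    length = len(sequences[0])
    label_counts = Counter(labels)
    scores = []
    for i in range(length):
        # Empirical joint counts of (base at position i, class label).
        joint = Counter((seq[i], lab) for seq, lab in zip(sequences, labels))
        base_counts = Counter(seq[i] for seq in sequences)
        mi = 0.0
        for (base, lab), count in joint.items():
            p_xy = count / n
            p_x = base_counts[base] / n
            p_y = label_counts[lab] / n
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
        scores.append(mi)
    return scores


# Toy data (invented): label 1 = junction, label 0 = non-junction.
sequences = ["AGGT", "CGGT", "AGCA", "CGTC"]
labels = [1, 1, 0, 0]
scores = mutual_information_per_position(sequences, labels)
```

In this toy set the first two positions carry no information about the label, while the last two separate the classes completely. A score of this kind evaluates each position in isolation, which is why a per-position criterion would miss features that are informative only in combination.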


Japanese Society for Bioinformatics