Kyushu Institute of Technology
We present a machine learning system for knowledge acquisition that produces hypotheses from positive and negative examples, and report some experiments on protein data using the PIR and GenBank databases. This learning system is developed with an algorithmic learning theory for decision trees over regular patterns, which we newly devised for this research. In the experiments on transmembrane domain identification, the system discovered very simple hypotheses with very high accuracy from a small number of positive and negative examples. These hypotheses show that negative motifs, namely, motifs of negative data, play a key role in such classification. In these experiments, we classified 20 symbols of amino acid residues into 3 categories according to the hydropathy indices due to Kyte and Doolittle. We call such transformation of symbols an indexing. We observed that the indexing by the hydropathy indices is important in making the learning algorithm efficient and accurate. This observation inspired us with a desire to discover such an indexing itself just by a learning algorithm. We succeeded in it by combining the above learning algorithm and the local search technique for finding good indexings. We also report some experiments on signal peptides.