Xijin Ge (email@example.com)
Shuichi Tsutsumi (firstname.lastname@example.org)
Hiroyuki Aburatani (email@example.com)
Shuichi Iwata (firstname.lastname@example.org)
Genome Science Division, Research Center for Advanced Science and Technology,
The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
Department of Quantum Engineering and Systems Science, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
In the search for new cancer subtypes by gene expression profiling, it is essential to avoid misclassifying samples of unknown subtypes as known ones. In this paper, we evaluated the false positive error rates of several classification algorithms through a `null test' by presenting classifiers a large collection of independent samples that do not belong to any of the tumor types in the training dataset. The benchmark dataset is available at www2.genome.rcast.u-tokyo.ac.jp/pm/. We found that k-nearest neighbor (KNN) and support vector machine (SVM) have very high false positive error rates when fewer genes ($<$100) are used in prediction. The error rate can be partially reduced by including more genes. On the other hand, prototype matching (PM) method has a much lower false positive error rate. Such robustness can be achieved without loss of sensitivity by introducing suitable measures of prediction confidence. We also proposed a cluster-and-select technique to select genes for classification. The nonparametric Kruskal-Wallis H test is employed to select genes differentially expressed in multiple tumor types. To reduce the redundancy, we then divided these genes into clusters with similar expression patterns and selected a given number of genes from each cluster. The reliability of the new algorithm is tested on three public datasets.