Gyan Bhanot (email@example.com)
Gabriela Alexe (firstname.lastname@example.org)
Arnold J. Levine (email@example.com)
Gustavo Stolovitzky (firstname.lastname@example.org)
Center for Systems Biology, Institute for Advanced Study, Princeton, New Jersey 08540, USA
Robert Wood Johnson School of Medicine and Dentistry, Cancer Institute of New Jersey, New Brunswick, New Jersey 08903, USA
IBM Computational Biology Center, IBM Research, Yorktown Heights, New York 10598, USA
A major challenge in cancer diagnosis from microarray data is the need for robust, accurate classification models that are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme, originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to two gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www.genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (Dalla-Favera's laboratory, Columbia University). Our pattern-based meta-classification technique achieves higher predictive accuracy than each of the individual classifiers, is robust against data perturbations, and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of diffuse large B-cell lymphoma (DLBCL) cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of follicular lymphoma (FL) cases the mRNA level of at most one of them is elevated.
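As an illustration of the general idea of a meta-classifier, the following minimal sketch combines the labels of several base classifiers by simple majority vote. It is not the procedure used in this work: the abstract does not specify how the individual classifiers are combined, so majority voting is an assumption, and the names `threshold_rule` and `majority_vote`, the cutoff of 1.0, and the toy expression values are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

Sample = Sequence[float]              # expression values for one case
Classifier = Callable[[Sample], str]  # maps a sample to a class label


def threshold_rule(gene_index: int, cutoff: float,
                   above: str = "DLBCL", below: str = "FL") -> Classifier:
    """Toy base classifier (hypothetical): label by one gene's level."""
    def classify(sample: Sample) -> str:
        return above if sample[gene_index] > cutoff else below
    return classify


def majority_vote(classifiers: Sequence[Classifier], sample: Sample) -> str:
    """Assumed combination rule: return the label most base classifiers chose.

    With two classes and an odd number of classifiers no tie can occur.
    """
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three toy base classifiers, one per gene position (illustrative only).
base = [threshold_rule(0, 1.0), threshold_rule(1, 1.0), threshold_rule(2, 1.0)]

label_a = majority_vote(base, [2.0, 2.0, 0.5])  # two of three vote DLBCL
label_b = majority_vote(base, [0.1, 2.0, 0.5])  # two of three vote FL
```

In practice the base classifiers would be trained models rather than fixed thresholds, and the votes could be weighted; the sketch only shows how several individual predictions reduce to a single label.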