Improving Gene Expression Cancer Molecular Pattern Discovery Using Nonnegative Principal Component Analysis

Xiaoxu Han (xiaoxu.han@emich.edu)

Department of Mathematics and Bioinformatics Program, Eastern Michigan University, Ypsilanti MI, 48197 USA


Abstract

Robust cancer molecular pattern identification from microarray data not only plays an essential role in modern clinical oncology, but also presents a challenge for statistical learning. Although principal component analysis (PCA) is a widely used feature selection algorithm in microarray analysis, its holistic mechanism prevents it from capturing the latent local data structure needed for subsequent cancer molecular pattern identification. In this study, we investigate the benefit of enforcing non-negativity constraints on PCA and propose a nonnegative principal component analysis (NPCA) based classification algorithm for cancer molecular pattern analysis of gene expression data. This novel algorithm classifies meta-samples of the input cancer data with support vector machines (SVM) or other classic supervised learning algorithms. The meta-samples are low-dimensional projections of the original cancer samples in a purely additive meta-gene subspace generated from the NPCA-induced nonnegative matrix factorization (NMF). We report leading classification results from the NPCA-SVM algorithm in cancer molecular pattern identification for five benchmark gene expression datasets under 100 trials of 50% hold-out cross-validation and under leave-one-out cross-validation. We demonstrate the superiority of the NPCA-SVM algorithm by direct comparison with seven classification algorithms (SVM, PCA-SVM, KPCA-SVM, NMF-SVM, LLE-SVM, PCA-LDA and k-NN) on the five cancer datasets in terms of classification rates, sensitivities and specificities. Our NPCA-SVM algorithm overcomes the over-fitting problem associated with SVM-based classification of gene expression data under a Gaussian kernel. As a more robust high-performance classifier, NPCA-SVM can be used to replace the general SVM and k-NN classifiers in cancer biomarker discovery to capture more meaningful oncogenes.
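The evaluation protocol named in the abstract (repeated 50% hold-out trials scored by classification rate, sensitivity and specificity) can be sketched as follows. This is a minimal illustration and not the paper's code: the function names `confusion_metrics`, `holdout_trials` and `nearest_centroid_fit_predict` are hypothetical, the random split is unstratified because the abstract does not say how the splits were drawn, and the nearest-centroid classifier is only a placeholder for NPCA-SVM.

```python
import random


def confusion_metrics(y_true, y_pred, positive=1):
    """Return (classification rate, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    rate = (tp + tn) / len(pairs)
    # Sensitivity and specificity are undefined when a class is absent.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return rate, sensitivity, specificity


def holdout_trials(samples, labels, fit_predict, n_trials=100, seed=0):
    """Run repeated 50% hold-out trials and return one metric tuple per trial.

    fit_predict(train_x, train_y, test_x) must return predicted labels
    for test_x. The split here is a plain random shuffle (an assumption).
    """
    rng = random.Random(seed)
    n = len(samples)
    results = []
    for _ in range(n_trials):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[: n // 2], idx[n // 2:]
        preds = fit_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in test],
        )
        results.append(confusion_metrics([labels[i] for i in test], preds))
    return results


def nearest_centroid_fit_predict(train_x, train_y, test_x):
    """Placeholder classifier (NOT NPCA-SVM): assign the nearest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in test_x]
```

Any classifier with the same `fit_predict` shape can be swapped in for the placeholder, so the same loop could score each of the compared methods on identical splits by reusing one `seed`.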



Japanese Society for Bioinformatics