Gul S. Dalgin (email@example.com)
Charles DeLisi (firstname.lastname@example.org)
Molecular Biology, Cell Biology and Biochemistry Program, Boston University, Boston, MA 02215, USA
Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
Bioinformatics Graduate Program, Boston University, Boston, MA 02215, USA
High-throughput gene expression profiling can identify sets of genes that are differentially expressed between phenotypes. Discovering marker genes is particularly important for diagnosing a cancer phenotype. However, the gene sets produced to date are too large to be economically viable as diagnostics. We use a hybrid decision tree-discriminant analysis to identify small sets of genes, i.e., single genes and gene pairs, that separate normal samples from tumor samples at different stages. Half of the samples are selected for training and used to form the probability distribution of expression values for each gene. The distributions for the tumor and normal phenotypes are then used to classify the test samples. The algorithm also identifies gene pairs by combining the probability distributions to construct a decision tree, which is used to determine the class of test samples. After a series of training and testing sessions, genes and gene pairs that classify all samples correctly are recorded. The method was applied to a breast cancer data set, and classifier genes that distinguish normal breast from different stages of breast tumor were identified. The genes were ranked according to the minimum Euclidean distance between their expression values in tumor and normal samples. The algorithm picked out known cancer-related genes and also found genes that were not identified as differentially expressed by a t-test with a 2-fold cut-off. Overall, the method generates candidate diagnostic genes and gene pairs for a specific disease phenotype, which can be pursued for further biological interpretation in cancer biology.
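The single-gene part of the procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the probability distributions, so a Gaussian fit per class is assumed here, and the function names, the number of rounds, and the tie-breaking rule are all hypothetical. The ranking function shows one possible reading of the "minimum Euclidean distance" criterion for a single gene, namely the smallest absolute difference between any tumor value and any normal value.

```python
import random
from statistics import NormalDist, mean, stdev


def fit_class_distribution(values, min_sigma=1e-6):
    """Fit a Gaussian to one gene's expression values in one class.

    The Gaussian form is an assumption; min_sigma avoids a zero-width fit.
    """
    return NormalDist(mean(values), max(stdev(values), min_sigma))


def classify(value, normal_dist, tumor_dist):
    """Assign the class whose fitted density is higher (ties go to normal)."""
    if tumor_dist.pdf(value) > normal_dist.pdf(value):
        return "tumor"
    return "normal"


def gene_is_perfect_classifier(normal_values, tumor_values, n_rounds=20, seed=0):
    """Return True if the gene classifies every held-out sample correctly
    in every random half/half training and testing round.

    Each class needs at least four samples so that each training half
    has the two values that stdev requires.
    """
    rng = random.Random(seed)
    for _ in range(n_rounds):
        normal, tumor = list(normal_values), list(tumor_values)
        rng.shuffle(normal)
        rng.shuffle(tumor)
        n_half, t_half = len(normal) // 2, len(tumor) // 2

        # Training: one distribution per phenotype from half the samples.
        normal_dist = fit_class_distribution(normal[:n_half])
        tumor_dist = fit_class_distribution(tumor[:t_half])

        # Testing: every held-out sample must receive its true label.
        for v in normal[n_half:]:
            if classify(v, normal_dist, tumor_dist) != "normal":
                return False
        for v in tumor[t_half:]:
            if classify(v, normal_dist, tumor_dist) != "tumor":
                return False
    return True


def min_class_distance(normal_values, tumor_values):
    """Smallest |tumor - normal| difference for one gene (assumed ranking score)."""
    return min(abs(t - n) for t in tumor_values for n in normal_values)
```

Under these assumptions, a gene would be recorded as a classifier only if `gene_is_perfect_classifier` returns True, and recorded genes could then be sorted by `min_class_distance` in descending order so that genes with the widest gap between phenotypes rank first. The gene-pair decision tree is not sketched because the abstract does not specify how the distributions are combined.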