See-Kiong Ng[1] (skng@i2r.a-star.edu.sg)
Soon-Heng Tan[1] (soonheng@i2r.a-star.edu.sg)
V.S. Sundararajan[1],[2] (sundar@i2r.a-star.edu.sg)
[1]Knowledge Discovery Department, Institute for Infocomm Research, 21
Heng Mui Keng Terrace, Singapore 119613
[2]School of Computing, National University of Singapore, Lower Kent
Ridge Road, Singapore 119260
As microarray technologies become routine in genome laboratories for
studying gene expression, it is not uncommon for multiple laboratories
to conduct experiments on identical or similar sets of genes for
different functional studies of these genes. Much of this data is
available to researchers for their own analysis, either through
collaborators or from online gene expression databases. It would be
useful to combine data from different microarray studies to
improve microarray data mining results.
We show that the functional classification of genes from microarray data
can be improved by combining gene expression data from multiple
microarray studies, even if the experimental focus or conditions of
the individual studies differ. However, blindly combining all
available datasets does not always improve the analysis results; it is
important to be selective about which datasets to include.
In our approach, we treat each dataset as a single feature and then
apply feature selection strategies to choose appropriate datasets for
training. Using a simple hill-climbing method, we show that gene
classification performance can be improved by whole-dataset feature
selection.
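
As a rough illustration of whole-dataset feature selection by
hill-climbing, the sketch below searches over subsets of datasets,
moving at each step to the neighbouring subset (one dataset added or
removed) with the best score and stopping when no neighbour improves.
It is a minimal sketch under stated assumptions, not the procedure used
in this work: the function name hill_climb_datasets, the dataset labels,
and the toy scoring function are all hypothetical. In practice the score
would be something like the cross-validated accuracy of a gene
classifier trained on the selected datasets.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def hill_climb_datasets(
    dataset_ids: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy hill-climbing over subsets of whole datasets.

    `score` is a caller-supplied (hypothetical) evaluation function that
    maps a non-empty subset of dataset ids to a number; higher is better.
    """
    all_ids = sorted(set(dataset_ids))
    current: FrozenSet[str] = frozenset()
    current_score = float("-inf")
    while True:
        # Neighbours differ from the current subset by exactly one
        # dataset (added or removed); the empty subset is skipped.
        neighbours = [current ^ {d} for d in all_ids]
        neighbours = [n for n in neighbours if n]
        if not neighbours:
            break
        scored = [(score(n), n) for n in neighbours]
        best_score, best = max(scored, key=lambda pair: pair[0])
        # Stop at a local optimum: no neighbour strictly improves.
        if best_score <= current_score:
            break
        current, current_score = best, best_score
    return current, current_score


# Toy stand-in for a real evaluation: each dataset shifts a baseline
# score by a fixed, made-up amount. Dataset "C" plays the role of a
# dataset that hurts classification when included.
TOY_GAIN = {"A": 0.10, "B": 0.05, "C": -0.04, "D": 0.02}


def toy_score(subset: FrozenSet[str]) -> float:
    return 0.60 + sum(TOY_GAIN[d] for d in subset)


selected, best = hill_climb_datasets(TOY_GAIN, toy_score)
```

With this toy score the search keeps the datasets that help and leaves
out the one that hurts, which mirrors the point that combining all
available datasets is not always the best choice. Hill-climbing only
guarantees a local optimum, so with a real classifier-based score the
result may depend on the starting subset.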