Paul B. Horton (horton-p@aist.go.jp)
Larisa Kiseleva (kiseleva-larisa@aist.go.jp)
Wataru Fujibuchi (fujibuchi-wataru@aist.go.jp)
Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
In this paper we present a fast algorithm and implementation
for computing the
Spearman rank correlation (SRC) between a query expression profile and each
expression profile in a database of profiles. The algorithm is linear
in the size of the profile database with a very small constant factor.
It is designed to efficiently handle multiple profile platforms and missing values.
We show that our specialized algorithm and C++
implementation can achieve an approximately 100-fold speed-up over a
reasonable baseline implementation using Perl hash tables.
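
As a minimal illustration of the quantity being computed, the sketch below evaluates the SRC between a query profile and one database profile, using only the genes measured in both. It is not the RaPiDS algorithm: it re-ranks the values for every pair and therefore costs O(n log n) per comparison. The function names `averageRanks` and `spearman`, and the encoding of missing values as NaN, are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Average ranks (1-based); tied values share the mean of their positions.
std::vector<double> averageRanks(const std::vector<double>& v) {
    const std::size_t n = v.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    std::vector<double> ranks(n);
    for (std::size_t i = 0; i < n;) {
        std::size_t j = i;
        while (j + 1 < n && v[order[j + 1]] == v[order[i]]) ++j;
        const double shared = 0.5 * static_cast<double>(i + j) + 1.0;  // mean of ranks i+1..j+1
        for (std::size_t k = i; k <= j; ++k) ranks[order[k]] = shared;
        i = j + 1;
    }
    return ranks;
}

// Spearman rank correlation over the genes present in both profiles.
// Missing values are encoded as NaN (an assumption of this sketch).
// Returns NaN if fewer than two shared genes remain or if either
// restricted profile is constant.
double spearman(const std::vector<double>& query,
                const std::vector<double>& target) {
    assert(query.size() == target.size());
    std::vector<double> q, t;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (!std::isnan(query[i]) && !std::isnan(target[i])) {
            q.push_back(query[i]);
            t.push_back(target[i]);
        }
    }
    const std::size_t n = q.size();
    if (n < 2) return std::nan("");
    const std::vector<double> rq = averageRanks(q);
    const std::vector<double> rt = averageRanks(t);
    const double mean = 0.5 * (static_cast<double>(n) + 1.0);  // mean rank is (n+1)/2
    double sqt = 0.0, sqq = 0.0, stt = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dq = rq[i] - mean;
        const double dt = rt[i] - mean;
        sqt += dq * dt;
        sqq += dq * dq;
        stt += dt * dt;
    }
    if (sqq == 0.0 || stt == 0.0) return std::nan("");
    return sqt / std::sqrt(sqq * stt);  // Pearson correlation of the ranks
}
```

Computing the Pearson correlation of average ranks, rather than using the sum-of-squared-differences shortcut, keeps the result correct when tied values are present.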
RaPiDS is designed for general similarity search rather than
classification, but to evaluate the usefulness of SRC as a similarity
measure we investigate how well the program performs as a classifier
of normal human cell types based on gene expression.
Specifically, we use the k nearest neighbor
classifier with a t statistic derived from SRC as the similarity
measure for profile pairs. We estimate the accuracy using a jackknife
test on the microarray data with manually checked cell type
annotation. Preliminary results suggest that the measure is useful (64%
accuracy on 1,685 profiles vs. 17.5% for the majority class
classifier) for profiles measured under similar conditions (same
laboratory and chip platform), but requires improvement when comparing
profiles from different experimental series.
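
The evaluation described above can be sketched as follows, under stated assumptions. The t statistic uses the standard form t = r * sqrt((n - 2) / (1 - r^2)) for a correlation r over n shared genes; whether RaPiDS uses exactly this form is an assumption. The jackknife test is read here as leave-one-out evaluation, and the names `tStatistic`, `knnPredict`, and `leaveOneOutAccuracy`, as well as the precomputed similarity matrix `sim`, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Standard t statistic for a correlation r computed over n paired values.
double tStatistic(double r, std::size_t n) {
    assert(n > 2);
    const double denom = 1.0 - r * r;
    if (denom <= 0.0) {  // |r| == 1: the statistic diverges
        const double inf = std::numeric_limits<double>::infinity();
        return r > 0.0 ? inf : -inf;
    }
    return r * std::sqrt((static_cast<double>(n) - 2.0) / denom);
}

// Predict the label of profile `held` by majority vote among the k other
// profiles with the highest similarity. sim[i][j] holds the similarity
// (for example the t statistic) between profiles i and j.
// Vote ties go to the lexicographically smallest label.
std::string knnPredict(const std::vector<std::vector<double> >& sim,
                       const std::vector<std::string>& labels,
                       std::size_t held, std::size_t k) {
    typedef std::pair<double, std::size_t> Neighbor;
    std::vector<Neighbor> neighbors;
    for (std::size_t j = 0; j < labels.size(); ++j)
        if (j != held) neighbors.push_back(Neighbor(sim[held][j], j));
    k = std::min(k, neighbors.size());
    std::partial_sort(neighbors.begin(),
                      neighbors.begin() + static_cast<std::ptrdiff_t>(k),
                      neighbors.end(),
                      [](const Neighbor& a, const Neighbor& b) {
                          return a.first > b.first;  // most similar first
                      });
    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[labels[neighbors[i].second]];
    std::string best;
    std::size_t bestCount = 0;
    for (std::map<std::string, std::size_t>::const_iterator it = votes.begin();
         it != votes.end(); ++it) {
        if (it->second > bestCount) {
            best = it->first;
            bestCount = it->second;
        }
    }
    return best;
}

// Fraction of profiles whose label is recovered when each profile in turn
// is held out and classified from the remaining ones.
double leaveOneOutAccuracy(const std::vector<std::vector<double> >& sim,
                           const std::vector<std::string>& labels,
                           std::size_t k) {
    assert(!labels.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < labels.size(); ++i)
        if (knnPredict(sim, labels, i, k) == labels[i]) ++correct;
    return static_cast<double>(correct) / static_cast<double>(labels.size());
}
```

The majority class baseline quoted above corresponds to always predicting the most frequent label, so any useful similarity measure should give a leave-one-out accuracy well above that fraction.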