Paul B. Horton (firstname.lastname@example.org)
Larisa Kiseleva (email@example.com)
Wataru Fujibuchi (firstname.lastname@example.org)
Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
In this paper we present a fast algorithm and implementation for computing the
Spearman rank correlation (SRC) between a query expression profile and each
expression profile in a database of profiles. The algorithm is linear
in the size of the profile database with a very small constant factor.
It is designed to efficiently handle multiple profile platforms and missing values.
We show that our specialized algorithm and C++
implementation can achieve an approximately 100-fold speed-up over a
reasonable baseline implementation using Perl hash tables.
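
As a point of reference for the computation described above, the following is a minimal C++ sketch of a straightforward SRC search. It is not the optimized algorithm presented in this paper: it re-ranks the shared genes for every profile pair, so it costs O(n log n) per pair. The function names (averageRanks, spearman, searchDatabase) and the encoding of missing values as NaN are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Assumption of this sketch: a missing measurement is stored as NaN.
const double kMissing = std::numeric_limits<double>::quiet_NaN();

// Returns 1-based ranks; tied values share the mean of their positions.
std::vector<double> averageRanks(const std::vector<double>& v) {
    std::vector<std::size_t> order(v.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
    std::vector<double> ranks(v.size());
    std::size_t i = 0;
    while (i < order.size()) {
        std::size_t j = i;  // extend j to the end of the current run of ties
        while (j + 1 < order.size() && v[order[j + 1]] == v[order[i]]) ++j;
        const double shared = (i + j) / 2.0 + 1.0;
        for (std::size_t k = i; k <= j; ++k) ranks[order[k]] = shared;
        i = j + 1;
    }
    return ranks;
}

// SRC over the genes measured in both profiles (index = gene).
// Returns NaN if the sizes differ, fewer than two genes are shared,
// or either rank vector has zero variance.
double spearman(const std::vector<double>& query,
                const std::vector<double>& profile) {
    if (query.size() != profile.size()) return kMissing;
    std::vector<double> x, y;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (!std::isnan(query[i]) && !std::isnan(profile[i])) {
            x.push_back(query[i]);
            y.push_back(profile[i]);
        }
    }
    const std::size_t n = x.size();
    if (n < 2) return kMissing;
    const std::vector<double> rx = averageRanks(x);
    const std::vector<double> ry = averageRanks(y);
    const double mean = (n + 1) / 2.0;  // mean rank, also valid with ties
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = rx[i] - mean;
        const double dy = ry[i] - mean;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0) return kMissing;
    return sxy / std::sqrt(sxx * syy);  // Pearson correlation of the ranks
}

// One SRC value per database profile: a single linear pass over the database.
std::vector<double> searchDatabase(
        const std::vector<double>& query,
        const std::vector<std::vector<double>>& database) {
    std::vector<double> scores;
    scores.reserve(database.size());
    for (const std::vector<double>& p : database)
        scores.push_back(spearman(query, p));
    return scores;
}
```

Because the set of genes shared with the query differs from profile to profile when values are missing or platforms differ, the ranks in this sketch must be recomputed for each pair; avoiding that repeated work is the kind of cost a specialized implementation can reduce.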
RaPiDS is designed for general similarity search rather than classification. Nevertheless, to assess the usefulness of SRC as a similarity measure, we evaluate the program as a classifier of normal human cell types based on gene expression. Specifically, we use a k-nearest-neighbor classifier with a t statistic derived from SRC as the similarity measure for profile pairs. We estimate accuracy using a jackknife test on microarray data with manually checked cell type annotation. Preliminary results suggest that the measure is useful (64% accuracy on 1,685 profiles vs. 17.5% for the majority-class classifier) for profiles measured under similar conditions (same laboratory and chip platform), but requires improvement when comparing profiles from different experimental series.
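
The classification step can be sketched as follows. The exact t statistic is not specified in the text above; this sketch assumes the standard transformation t = r * sqrt((n - 2) / (1 - r^2)), where r is the SRC and n is the number of genes shared by the two profiles. The names srcToT and knnClassify, and the alphabetical tie-breaking rule, are hypothetical choices made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (similarity score, cell type label) for one database profile.
using Scored = std::pair<double, std::string>;

// Assumed standard t statistic for a correlation r computed over n shared genes.
// Returns NaN when it is undefined (n < 3 or r is NaN).
double srcToT(double r, std::size_t n) {
    if (n < 3 || std::isnan(r))
        return std::numeric_limits<double>::quiet_NaN();
    const double denom = 1.0 - r * r;
    if (denom <= 0.0)  // perfect correlation: t is unbounded
        return r > 0.0 ? std::numeric_limits<double>::infinity()
                       : -std::numeric_limits<double>::infinity();
    return r * std::sqrt((n - 2) / denom);
}

// Majority vote among the k neighbors with the largest t statistic.
// Neighbors with an undefined (NaN) score are ignored; vote ties go to the
// alphabetically first label. Returns "" if no neighbor is usable.
std::string knnClassify(std::vector<Scored> neighbors, std::size_t k) {
    neighbors.erase(
        std::remove_if(neighbors.begin(), neighbors.end(),
                       [](const Scored& s) { return std::isnan(s.first); }),
        neighbors.end());
    k = std::min(k, neighbors.size());
    std::partial_sort(neighbors.begin(), neighbors.begin() + k, neighbors.end(),
                      [](const Scored& a, const Scored& b) {
                          return a.first > b.first;
                      });
    std::map<std::string, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[neighbors[i].second];
    std::string best;
    std::size_t bestCount = 0;
    for (const auto& entry : votes) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
        }
    }
    return best;
}
```

Converting r to t takes the number of shared genes into account, so a correlation computed over few genes counts for less than the same correlation computed over many. In a jackknife test, each profile would be classified in turn with that profile itself removed from the neighbor list.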