RaPiDS: An Algorithm for Rapid Expression Profile Database Search

Paul B. Horton (horton-p@aist.go.jp)
Larisa Kiseleva (kiseleva-larisa@aist.go.jp)
Wataru Fujibuchi (fujibuchi-wataru@aist.go.jp)

Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan


Abstract

In this paper we present a fast algorithm and implementation for computing the Spearman rank correlation (SRC) between a query expression profile and each expression profile in a database of profiles. The algorithm is linear in the size of the profile database with a very small constant factor. It is designed to efficiently handle multiple profile platforms and missing values. We show that our specialized algorithm and C++ implementation can achieve an approximately 100-fold speed-up over a reasonable baseline implementation using Perl hash tables.
RaPiDS is designed for general similarity search rather than classification -- but in order to attempt to classify the usefulness of SRC as a similarity measure we investigate the usefulness of this program as a classifier for classifying normal human cell types based on gene expression. Specifically we use the k nearest neighbor classifier with a t statistic derived from SRC as the similarity measure for profile pairs. We estimate the accuracy using a jackknife test on the microarray data with manually checked cell type annotation. Preliminary results suggest the measure is useful (64% accuracy on 1,685 profiles vs. the majority class classifier's 17.5%) for profiles measured under similar conditions (same laboratory and chip platform); but requires improvement when comparing profiles from different experimental series.

[ Full-text PDF | Table of Contents ]


Japanese Society for Bioinformatics