High-Throughput Identification, Database Storage and Analysis of SNPs in EST Sequences

Francisco Jose Useche [1][2] (useche@capsl.udel.edu)
Guang Gao [1][2] (ggao@capsl.udel.edu)
Mike Hanafey [3] (mike.hanafey@USA.dupont.com)
Antoni Rafalski [3] (j-antoni.rafalski@USA.dupont.com)

[1] Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA
[2] Department of Electrical and Computer Engineering, University of Delaware, Evans Hall, Newark, DE 19711
[3] DuPont Crop Genetics, Delaware Technology Park, 1 Innovation Way, Newark, DE 19711, USA


Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA variation and are the disease-causing mutations in many genes. Because of their abundance and low mutation rate within generations, they are regarded as the next generation of genetic markers, with uses in many important biological, genetic, pharmacological, and medical applications [13, 3, 19, 18, 16, 14]. Several strategies, both experimental and in-silico, exist for SNP discovery and mapping. Experimental SNP discovery consists of a number of laborious steps that make the process complex and expensive. In-silico discovery has been proposed as an alternative that takes advantage of large data sets that contain potential SNP information but were generated for other purposes and have not yet been used as a source of SNP information. However, to apply the in-silico method successfully to large data sets, the following challenges must be addressed. First, it is necessary to build an integrated SNP pipeline that handles the data processing steps smoothly from beginning (collecting sequence information) to end (SNPs in the database). Second, the parameters of the SNP detection tool have to be optimized to satisfy the specific goals of the project. Finally, SNP data cannot be fully used until the in-silico method is validated experimentally. In this paper we present the design and implementation of an in-silico SNP detection software pipeline that exploits the existence of large EST (expressed sequence tag) data sets and addresses the above challenges. First, the pipeline allows for smooth data transition between its components by implementing data interfaces that translate between the data formats of the tools used at each stage. Second, we optimized the PolyBayes parameters for SNP detection in maize ESTs.
Finally, we implemented a user interface that, together with the database structure we created, allows the scientist to perform preliminary analysis and basic statistics on the SNP data prior to experimental validation. The pipeline works with two different sequence assemblers (PHRAP [20] and CAT from DoubleTwist [21]). It uses a Bayesian engine for SNP detection (PolyBayes) and selects the relevant polymorphism information, which is then uploaded into a database. We detected 2439 SNPs and 822 insertions/deletions (INDELs) with a PolyBayes probability higher than 0.99 in the public set of 68,000 maize ESTs. The user interface allowed us to analyze the polymorphism information immediately after discovery in several ways, giving us insight into the distribution and significance of the newly acquired data.
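
The selection step described above, keeping only polymorphisms whose PolyBayes probability exceeds 0.99 and tallying SNPs and INDELs separately, can be illustrated with a minimal sketch. The `Candidate` record, its field names, and the sample values are hypothetical and chosen only for illustration; they do not reflect the actual PolyBayes output format or the database schema used in the pipeline.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate polymorphism (not the PolyBayes format)."""
    contig: str         # assembly (contig) in which the site was found
    position: int       # position of the site within the contig
    kind: str           # "SNP" or "INDEL"
    probability: float  # probability assigned by the detection engine


def select_candidates(candidates, threshold=0.99):
    """Keep candidates whose probability is strictly higher than the threshold."""
    return [c for c in candidates if c.probability > threshold]


def count_by_kind(candidates):
    """Tally the selected candidates by polymorphism type."""
    return Counter(c.kind for c in candidates)


# Made-up example data, used only to exercise the functions.
candidates = [
    Candidate("contig_1", 120, "SNP", 0.999),
    Candidate("contig_1", 342, "INDEL", 0.995),
    Candidate("contig_2", 57, "SNP", 0.90),   # below the threshold: dropped
    Candidate("contig_3", 88, "SNP", 0.99),   # equal to the threshold: dropped
]

selected = select_candidates(candidates)
counts = count_by_kind(selected)
print(counts["SNP"], counts["INDEL"])
```

The comparison is strict because the abstract reports sites with a probability "higher than 0.99"; a pipeline with a different goal would presumably adjust the `threshold` argument.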

Japanese Society for Bioinformatics