Jane Marie Lin (email@example.com)
Zhiping Weng (firstname.lastname@example.org)
Department of Biomedical Engineering, Boston University, Boston, MA
Program in Bioinformatics and Systems Biology, Boston University, Boston, MA, 02215, USA
DNA motifs, or cis-elements, are short nucleotide sequence patterns recognized by transcription factors (TFs). In promoters, TFs bind in complex combinations to regulate the expression of a downstream gene. The combinatorial space is large and difficult to manage, since vertebrates have thousands of transcription factors and more than 20,000 genes. We introduce a computer program called CAYCE (Combinatorial AnalYsis of Cis-Elements) that systematically detects statistically overrepresented DNA motif association rules independently of microarray information. CAYCE is an adaptation of the Apriori algorithm traditionally used for association rule mining, but offers three significant advancements: (1) it analyzes multiple occurrences of an item, corresponding to multiple TF binding sites; (2) it compares results with a biologically relevant background; and (3) it provides p-values for straightforward statistical interpretation. CAYCE can readily be applied to any item-set data where the investigator is interested in multiple occurrences of a single item, in overrepresentation of association rules compared with a background, or in both. Applying CAYCE to human promoters in 1% of the human genome, we find that motif clusters containing five repetitions of SP1 are the most statistically significant.
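The three extensions described above can be illustrated with a minimal Python sketch. This is not the CAYCE implementation: the function names (`contains`, `support`, `binomial_tail`, `mine`), the add-one smoothing of the background rate, and the choice of a binomial tail as the significance test are all assumptions made for illustration, since the abstract does not specify them. Promoters are represented as lists of motif names in which repeats are allowed, so an item set is a multiset such as `("SP1", "SP1")`.

```python
from collections import Counter
from math import comb


def contains(promoter_counts, itemset):
    """True if the promoter has at least as many sites of each motif as the itemset."""
    need = Counter(itemset)
    return all(promoter_counts[m] >= n for m, n in need.items())


def support(promoters, itemset):
    """Number of promoters (given as Counters) that contain the multiset."""
    return sum(contains(p, itemset) for p in promoters)


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); an assumed test, not necessarily CAYCE's."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def mine(foreground, background, min_support=2, max_size=3):
    """Level-wise (Apriori-style) search over motif multisets.

    Returns {itemset: (foreground_support, p_value)} for every multiset
    that reaches min_support in the foreground promoters.
    """
    fg = [Counter(p) for p in foreground]
    bg = [Counter(p) for p in background]
    motifs = sorted(set().union(*foreground))
    results = {}
    frontier = [()]
    for _ in range(max_size):
        next_frontier = []
        for base in frontier:
            for m in motifs:
                if base and m < base[-1]:
                    continue  # keep tuples sorted so each multiset is generated once
                cand = base + (m,)  # m may repeat: multiple sites of one motif
                k = support(fg, cand)
                if k < min_support:
                    continue  # Apriori pruning: no extension can be frequent
                # Background rate with add-one smoothing (an assumption)
                rate = (support(bg, cand) + 1) / (len(bg) + 2)
                results[cand] = (k, binomial_tail(k, len(fg), rate))
                next_frontier.append(cand)
        frontier = next_frontier
    return results
```

Calling `mine(foreground, background)` with two lists of motif lists returns, for each frequent multiset, its foreground support and a p-value relative to the background. Because a sorted multiset can only be frequent if its prefix is, extending frequent prefixes alone still enumerates every frequent multiset.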