Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
We recently introduced a biologically realistic and reliable significance analysis of the output of a popular class of motif finders. In this paper, we further improve our significance analysis by incorporating local base composition information. Relying on realistic simulation of biological data, as well as on FDR analysis applied to real data, we show that our method is significantly better than the increasingly popular practice of using the normal approximation to estimate the significance of a finder's output. Finally, we turn to leveraging our reliable significance analysis to improve the actual motif finding task. Specifically, by endowing a variant of the Gibbs Sampler with our improved significance analysis, we demonstrate that de novo finders can perform better than has been perceived. Notably, our new variant outperforms all the finders reviewed in a recently published comprehensive analysis of the Harbison genome-wide binding location data. Interestingly, many of these finders incorporate additional information such as nucleosome positioning and the significance of binding data.