RECOUNT: Expectation Maximization Based Error Correction Tool for Next Generation Sequencing Data
Edward Wijaya (firstname.lastname@example.org)
Martin Frith (email@example.com)
Yutaka Suzuki (firstname.lastname@example.org)
Paul Horton (email@example.com)
AIST, Computational Biology Research Center, 2-42 Aomi, Koutou-Ku, Tokyo 135-0064, Japan
Department of Medical Genome Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562, Japan
Next generation sequencing technologies enable rapid, large-scale production of sequence
data sets. Unfortunately, these technologies also have a non-negligible sequencing error rate,
which biases their outputs by introducing false reads and reducing the quantity of
the real reads. Although methods developed for SAGE data can reduce these
false counts to a considerable degree, until now they have not been
implemented in a scalable way. Recently, a program named FREC has been
developed to address this problem for next generation sequencing data.
In this paper, we introduce RECOUNT, our implementation of an
Expectation Maximization algorithm for tag count correction and
compare it to FREC. Using both the reference genome and simulated
data, we find that RECOUNT performs as well as or better than FREC, while
using much less memory (e.g., 5 GB vs. 75 GB).
Furthermore, we report the first analysis of tag count correction
with real data in the context of gene expression analysis. Our results
show that tag count correction not only increases the number of mappable
tags, but can make a real difference in the biological
interpretation of next generation sequencing data.
RECOUNT is an open-source C++ program available at http://seq.cbrc.jp/recount.
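
To make the idea of Expectation Maximization based tag count correction concrete, the following is a minimal Python sketch and not RECOUNT's actual C++ implementation. It assumes a simplified error model that the abstract does not specify: all tags have the same length, each base is misread independently with a uniform probability `error_rate`, and only tags at Hamming distance one are considered as possible sources of an erroneous read. The names `hamming1_neighbors` and `em_correct` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def hamming1_neighbors(tag, alphabet="ACGT"):
    """Yield every sequence that differs from `tag` at exactly one position."""
    for pos, base in enumerate(tag):
        for alt in alphabet:
            if alt != base:
                yield tag[:pos] + alt + tag[pos + 1:]


def em_correct(observed, error_rate=0.01, iterations=50):
    """Estimate true tag counts from observed counts (illustrative sketch).

    `observed` maps tag -> observed count. Assumed model (not from the paper):
    uniform per-base error rate, candidate sources limited to observed tags
    within Hamming distance one.
    """
    tags = list(observed)
    length = len(tags[0])
    # Probability that a tag is read without error, or with one specific error.
    p_same = (1 - error_rate) ** length
    p_one = (error_rate / 3) * (1 - error_rate) ** (length - 1)

    # For each observed tag, list the tags it could have originated from.
    sources = {
        t: [(t, p_same)]
        + [(n, p_one) for n in hamming1_neighbors(t) if n in observed]
        for t in tags
    }

    estimate = {t: float(c) for t, c in observed.items()}
    for _ in range(iterations):
        updated = defaultdict(float)
        for t, count in observed.items():
            # E-step: weight each candidate source by its current estimated
            # abundance times the probability of producing the read `t`.
            weights = [(s, estimate[s] * p) for s, p in sources[t]]
            total = sum(w for _, w in weights)
            if total == 0:
                updated[t] += count  # no support anywhere; keep reads in place
                continue
            # M-step contribution: reassign the observed reads proportionally.
            for s, w in weights:
                updated[s] += count * w / total
        estimate = {t: updated[t] for t in tags}
    return estimate


if __name__ == "__main__":
    counts = {"ACGT": 1000, "ACGA": 3, "TTTT": 500}
    for tag, value in em_correct(counts).items():
        print(tag, round(value, 2))
```

Under these assumptions, reads of a rare tag that sits one mismatch away from a highly abundant tag are gradually reassigned to the abundant tag, while a tag with no observed neighbors keeps its count and the total number of reads is preserved.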
Japanese Society for Bioinformatics