Localized Suffix Array and Its Application to Genome Mapping Problems for Paired-End Short Reads
Kouichi Kimura (firstname.lastname@example.org)
Asako Koike (email@example.com)
Central Research Laboratory, Hitachi Ltd,
1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan
We introduce a new data structure, a localized suffix array,
based on which occurrence information is dynamically represented as
the combination of global positional information and local
lexicographic order information in text search applications.
For the search of a pair of words within a given distance, many
candidate positions that share a coarse-grained global position can be
compactly represented in term of local lexicographic orders as in the
conventional suffix array, and they can be simultaneously examined for
violation of the distance constraint at the coarse-grained resolution.
Trade-off between the positional and lexicographical information is
progressively shifted towards finer positional resolution, and the
distance constraint is reexamined accordingly.
Thus the paired search can be efficiently performed even if there are
a large number of occurrences for each word.
The localized suffix array itself is in fact a reordering of bits
inside the conventional suffix array, and their memory requirements
are essentially the same.
We demonstrate an application to genome mapping problems for
paired-end short reads generated by new-generation DNA sequencers.
When paired reads are highly repetitive, it is time-consuming to
naïively calculate, sort, and compare all of the coordinates.
For a human genome re-sequencing data of 36 base pairs, more than 10
times speedups over the naïve method were observed in almost half
of the cases where the sums of redundancies (number of individual
occurrences) of paired reads were greater than 2,000.
Japanese Society for Bioinformatics