FOREST, a Browser for Huge DNA Sequences
R. Gras (gras@irisa.fr)
J. Nicolas (jnicolas@irisa.fr)
IRISA
Campus de Beaulieu 35042 Rennes cedex, France
Abstract
We present FOREST, a new tool that represents the content of a
large nucleic acid sequence (e.g. >100 KB) in a form suitable for
the biologist. More precisely, FOREST builds all subsequences repeated
in a sequence or a set of sequences. It not only locates
the occurrences of a given subsequence but also points to
subsequences that are interesting with respect to a given criterion. This
tool is based on two key ideas. The first idea is to build a
suffix-tree representation of a sequence and to associate with each node of this
tree a set of synthesized attributes, computed over the set of subsequences under
this node. This allows the biologist to "browse" the sequence while always keeping an
abstract view of what to expect in the section of the tree currently under investigation. The second idea is to summarize
the distribution of the information with boolean vectors associated with the sequence. These vectors can easily be displayed as a linear map of events, as is done in genetic mapping.
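To make the first idea concrete, here is a minimal sketch in Python. It is not FOREST's implementation: it swaps the suffix tree for a naive, depth-bounded suffix trie, which is simpler to write but far less efficient. The names Node, build_trie, annotate and find are hypothetical. So is the chosen attribute, the length of the longest repeated subsequence under a node, which stands in for whatever attributes the tool actually synthesizes.

```python
class Node:
    def __init__(self):
        self.children = {}       # base -> child Node
        self.positions = []      # start positions of the subsequence spelled by the path to this node
        self.longest_repeat = 0  # synthesized attribute, filled in by annotate()


def build_trie(seq, max_depth):
    """Naive depth-bounded suffix trie (an assumption; a real suffix tree is compressed)."""
    root = Node()
    for start in range(len(seq)):
        node = root
        for base in seq[start:start + max_depth]:
            node = node.children.setdefault(base, Node())
            node.positions.append(start)
    return root


def annotate(node, depth=0):
    """Bottom-up pass: store the length of the longest path through this node
    whose subsequence occurs at least twice."""
    node.longest_repeat = depth
    for child in node.children.values():
        # Occurrence counts never grow going down, so unrepeated branches can be skipped.
        if len(child.positions) >= 2:
            node.longest_repeat = max(node.longest_repeat, annotate(child, depth + 1))
    return node.longest_repeat


def find(root, word):
    """Follow word from the root; return the node reached, or None if word is absent."""
    node = root
    for base in word:
        node = node.children.get(base)
        if node is None:
            return None
    return node


seq = "GATTACAGATTA"
root = build_trie(seq, max_depth=6)
annotate(root)
print(find(root, "ATTA").positions)    # [1, 8]
print(root.longest_repeat)             # 5, for "GATTA"
print(find(root, "A").longest_repeat)  # 4, for "ATTA"
```

Because the attribute is stored on every node, a user standing at the node for "A" can see that the longest repeat below it has length 4 before descending further, which is the kind of summary view described above.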
Both representations allow various efficient operations on the sequence.
They provide a powerful capacity for filtering the data while reducing the set of elementary filtering operations to a minimal set of conceptual operations. This allows the biologist to easily investigate the most prominent features of the lexical structure of the sequences.
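The boolean-vector idea and its filtering operations can be illustrated with a small Python sketch. The abstract does not specify how the vectors are defined, so the sketch assumes one plausible reading: a vector has one entry per sequence position, set to true where the position is covered by an occurrence of a given subsequence. The function names occurrence_vector, both, either and render are hypothetical.

```python
def occurrence_vector(seq, word):
    """Boolean vector with True at every position covered by an occurrence of word
    (an assumed definition, chosen for illustration)."""
    vec = [False] * len(seq)
    start = seq.find(word)
    while start != -1:
        for i in range(start, start + len(word)):
            vec[i] = True
        start = seq.find(word, start + 1)  # step by 1 so overlapping occurrences are found
    return vec


def both(a, b):
    """Positions where both events hold (logical AND)."""
    return [x and y for x, y in zip(a, b)]


def either(a, b):
    """Positions where at least one event holds (logical OR)."""
    return [x or y for x, y in zip(a, b)]


def render(vec):
    """Display a vector as a linear map: '#' marks an event, '.' its absence."""
    return "".join("#" if x else "." for x in vec)


seq = "GATTACAGATTA"
atta = occurrence_vector(seq, "ATTA")
tac = occurrence_vector(seq, "TAC")
print(seq)                      # GATTACAGATTA
print(render(atta))             # .####...####
print(render(tac))              # ...###......
print(render(both(atta, tac)))  # ...##.......
```

Under this reading, a handful of elementary operations (AND, OR, and similar) is enough to express many filters, since each filter is a combination of vectors that can be drawn directly under the sequence.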