## MUSCA: An Algorithm for Constrained Alignment of Multiple Data Sequences

Laxmi Parida (parida@us.ibm.com)

Aris Floratos (floratos@us.ibm.com)

Isidore Rigoutsos (rigoutso@us.ibm.com)

Bioinformatics and Pattern Discovery Group

Computational Biology Center

IBM Thomas J. Watson Research Center

Yorktown Heights, NY10598, USA

### Abstract

Given a set of *N* sequences, the Multiple Sequence Alignment
problem is to align these *N* sequences,
possibly with gaps, that brings out the best *commonality*
of the *N* sequences.
MUSCA is a two-stage approach to the alignment problem by
identifying
two relatively simpler sub-problems whose solutions
are used to obtain the alignment of the sequences.
We first *discover motifs* in the *N* sequences and then
extract an appropriate subset of compatible motifs
to obtain a "good" alignment.
The motifs of interest to us are the
*irredundant* motifs which are
only polynomial in the input size.
In practice, however, the number is much smaller (sub-linear).
Notice that this step aids in a direct *N*-wise alignment, as
opposed to composing the alignments from lower order (say pairwise)
alignments and the solution is also
independent of the order of the input sequences; hence
the algorithm works very well while dealing with a large number
of sequences.
The second part of the problem that deals with obtaining a
good alignment is solved using a graph-theoretic approach
that computes an induced subgraph satisfying certain simple constraints.
We reduce a version of this problem to that of solving an instance
of a set covering problem, thus offer the best possible approximate
solution to the problem (provided P≠NP).
Our experimental results, while being preliminary, indicate that
this approach is efficient, particularly on large numbers of long
sequences, and, gives good alignments when tested on
biological data such as DNA and protein sequences.
We introduce the
the notion of an *alignment number* *K* (2≤K≤2),
a user-controlled parameter,
that lends a useful flexibility to the aligning program:
this additional requirement constrains the alignment to
have at least *K* sequences agree on a character, whenever possible,
in the alignment.
The usefulness of the alignment number is corroborated by the users
who view
this as a natural constraint while dealing with a large number of sequences.

*Japanese Society for Bioinformatics*