Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT

Kazutaka Katoh[1] (kkatoh@kuicr.kyoto-u.ac.jp)
Kei-ichi Kuma[1] (kuma@kuicr.kyoto-u.ac.jp)
Takashi Miyata[2],[3],[4] (miyata@brh.co.jp)
Hiroyuki Toh[5] (toh@bioreg.kyushu-u.ac.jp)

[1]Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
[2]Biohistory Research Hall, Takatsuki, Osaka 569-1125, Japan
[3]Department of Electrical Engineering and Bioscience, Science and Engineering, Waseda University, Tokyo 169-8555, Japan
[4]Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan
[5]Division of Bioinformatics, Research Center for Prevention of Infectious Diseases, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan


Abstract

In 2002, we developed and released MAFFT, a rapid multiple sequence alignment program designed to handle large numbers of sequences (up to ∼5,000) and long sequences (∼2,000 aa or ∼5,000 nt) in a reasonable time on a standard desktop PC. In terms of accuracy, however, the previous versions of MAFFT (v.4 and lower) were outperformed in several benchmark tests by ProbCons and TCoffee v.2, both of which were released in 2004. Here we report a recent extension of MAFFT that aims to improve accuracy at as little cost in calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i in this report). These options use a new objective function that combines the weighted sum-of-pairs (WSP) score with a COFFEE-like score derived from all pairwise alignments. We discuss the improvement in accuracy brought by this extension, mainly using two very recently released benchmark tests, BAliBASE v.3 (for protein alignments) and BRAliBASE (for RNA alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than that of the other methods released in 2004, although the differences among the most accurate methods (ProbCons, TCoffee v.2 and the new options of MAFFT) were small. The advantage in accuracy of [GL]-INS-i became greater for alignments consisting of ∼50-100 sequences. Exploiting this feature of MAFFT, we also examined another possible approach to improving accuracy: incorporating homolog information collected from databases. The [GL]-INS-i options are applicable to alignments of up to ∼200 sequences, but not to thousands of sequences, because of their time and space complexity.
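
The idea of an objective function that adds a pairwise-consistency term to a weighted sum-of-pairs score can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the MAFFT implementation: the match/mismatch/gap values, the `mix` coefficient, the uniform pair weights, and the representation of the pairwise "library" as sets of residue-index pairs are all hypothetical choices made for illustration.

```python
from itertools import combinations

GAP = "-"

def column_pair_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    # Toy substitution scheme (assumed values, not MAFFT's scoring matrix).
    if a == GAP and b == GAP:
        return 0.0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def wsp_score(msa, weights):
    # Weighted sum-of-pairs: per-pair column scores, scaled by a pair weight.
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        pair = sum(column_pair_score(a, b) for a, b in zip(msa[i], msa[j]))
        total += weights[(i, j)] * pair
    return total

def aligned_residue_pairs(row_a, row_b):
    # Residue-index pairs (position in sequence a, position in sequence b)
    # that share a column in two aligned rows.
    pairs = set()
    pos_a = pos_b = 0
    for a, b in zip(row_a, row_b):
        if a != GAP and b != GAP:
            pairs.add((pos_a, pos_b))
        pos_a += a != GAP
        pos_b += b != GAP
    return pairs

def consistency_score(msa, library):
    # COFFEE-like term: count residue pairs in the multiple alignment that
    # also appear in the library built from separate pairwise alignments.
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        total += len(aligned_residue_pairs(msa[i], msa[j]) & library[(i, j)])
    return total

def combined_objective(msa, weights, library, mix=1.0):
    # Hypothetical combination: WSP plus a scaled consistency term.
    return wsp_score(msa, weights) + mix * consistency_score(msa, library)
```

An iterative refinement procedure would repeatedly modify the alignment and keep a change only when a score of this kind increases; how MAFFT actually weights and combines the two terms is described in the full text, not here.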


Japanese Society for Bioinformatics