Very Fast Algorithms for Evaluating the Stability of ML and Bayesian Phylogenetic Trees from Sequence Data

Peter J. Waddell[1] (waddell@stat.sc.edu)
Hirohisa Kishino[2] (kishino@wheat.ab.a.u-tokyo.ac.jp)
Rissa Ota[3] (r.ota@massey.ac.nz)

[1]Department of Statistics and Department of Biological Sciences, University of South Carolina, Columbia, SC 29208, USA
[2]Graduate School of Agriculture and Life Sciences, University of Tokyo, 1-1-1 Yayoi Bunkyo-ku, Tokyo 113-8657, Japan
[3]Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand


Abstract

Evolutionary trees sit at the core of all realistic models describing a set of related sequences, including alignment, homology search, ancestral protein reconstruction and 2D/3D structural change. It is important to assess the stochastic error when estimating a tree, including under models that use the most realistic likelihood-based optimizations, yet computation times for these may be many days or weeks, which makes the bootstrap computationally prohibitive. Here we show that the extremely fast "resampling of estimated log likelihoods" (RELL) method behaves well under more general circumstances than previously examined. RELL approximates the bootstrap proportions (BP) of trees better than some bootstrap methods that rely on fast heuristics to search the tree space. The BIC approximation of the Bayesian posterior probability (BPP) of trees is made more accurate by including an additional term related to the determinant of the information matrix (which may also be obtained as a product of gradient or score vectors). Such estimates are shown to be very close to MCMC chain values. Our analysis of mammalian mitochondrial amino acid sequences suggests that when model breakdown occurs, as it typically does for sequences separated by more than a few million years, the BPP values are far too peaked and the real fluctuations in the likelihood of the data are many times larger than expected. Accordingly, several ways to incorporate the bootstrap and other types of direct resampling with MCMC procedures are outlined. Genes evolve by a process in which some sites follow a tree close to, but not identical with, the species tree. It is seen that under such a likelihood model BP and BPP estimates may still be reasonable estimates of the species tree. Since many of the methods studied are very fast computationally, there is no reason to ignore stochastic error even with the slowest ML or likelihood-based methods.
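
As an illustration only, the two fast estimators named in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, argument names and array layout are hypothetical, per-site log likelihoods are assumed to have been computed once for each candidate tree at its ML parameter values, and the form of the extra information-matrix term (minus half the log determinant of the per-site information matrix, as in a Laplace approximation) is an assumption, since the abstract does not give the exact expression.

```python
import numpy as np


def rell_proportions(site_lnl, n_reps=1000, seed=0):
    """RELL sketch: resample per-site log likelihoods instead of re-optimizing.

    site_lnl is an (n_trees, n_sites) array of per-site log likelihoods
    (hypothetical layout). Returns the fraction of replicates won by each tree.
    """
    site_lnl = np.asarray(site_lnl, dtype=float)
    n_trees, n_sites = site_lnl.shape
    rng = np.random.default_rng(seed)
    # Drawing n_sites sites with replacement is equivalent to drawing
    # multinomial site counts with equal probabilities.
    counts = rng.multinomial(n_sites, np.full(n_sites, 1.0 / n_sites), size=n_reps)
    # Total log likelihood of every tree in every replicate: (n_reps, n_trees).
    totals = counts @ site_lnl.T
    winners = totals.argmax(axis=1)
    return np.bincount(winners, minlength=n_trees) / n_reps


def bic_posterior(lnl, n_params, n_sites, log_det_info=None):
    """BIC-style approximation to tree posterior probabilities (flat tree prior).

    log_det_info, if given, holds one log determinant per tree of the per-site
    information matrix; its use here is an assumed Laplace-type correction.
    """
    lnl = np.asarray(lnl, dtype=float)
    k = np.asarray(n_params, dtype=float)
    score = lnl - 0.5 * k * np.log(n_sites)
    if log_det_info is not None:
        score = score - 0.5 * np.asarray(log_det_info, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(score - score.max())
    return weights / weights.sum()
```

Both functions only reuse quantities available from a single ML fit per tree, which is the source of the speed advantage over a full bootstrap or an MCMC run; neither addresses the model breakdown discussed in the abstract.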

Japanese Society for Bioinformatics