Combining multiple data sets in a likelihood analysis: Which models are the best?

Tal Pupko*, Dorothée Huchon, Ying Cao, Norihiro Okada, Masami Hasegawa

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

92 Scopus citations

Abstract

Until recently, phylogenetic analyses have been routinely based on homologous sequences of a single gene. Given the vast number of gene sequences now available, phylogenetic studies are now based on the analysis of multiple genes. Thus, it has become necessary to devise statistical methods to combine multiple molecular data sets. Here, we compare several models for combining different genes for the purpose of evaluating the likelihood of tree topologies. Three methods of branch length estimation were studied: assuming all genes have the same branch lengths (concatenate model), assuming that branch lengths are proportional among genes (proportional model), or assuming that each gene has a separate set of branch lengths (separate model). We also compared three models of among-site rate variation: the homogeneous model, a model that assumes one gamma parameter for all genes, and a model that assumes one gamma parameter for each gene. On the basis of two nuclear and one mitochondrial amino acid data sets, our results suggest that, depending on the data set chosen, either the separate model or the proportional model represents the most appropriate method for branch length analysis. For all the data sets examined, one gamma parameter for each gene represents the best model for among-site rate variation. Using these models, we analyzed alternative mammalian tree topologies, and we describe the effect of the assumed model on the maximum likelihood tree. We show that the choice of the model has an impact on the best phylogeny obtained.
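The nine combinations described in the abstract (three branch length models times three rate variation models) differ mainly in how many free parameters they fit. The following Python sketch counts those parameters for a fixed unrooted, fully bifurcating tree and shows how a penalized criterion such as AIC could weigh the extra parameters against the gain in likelihood. Several points are assumptions, not taken from the abstract: the use of AIC as the comparison criterion, the premise that the amino acid substitution matrix is fixed and contributes no free parameters, and every function and label name (`n_params`, `aic`, `shared_gamma`, `gene_gamma`), which are hypothetical.

```python
# Minimal sketch: free-parameter counts for the combined-analysis models.
# Assumptions (not from the abstract): unrooted bifurcating tree, a fixed
# empirical amino acid matrix with no free parameters, AIC as the criterion.

BRANCH_MODELS = ("concatenate", "proportional", "separate")
RATE_MODELS = ("homogeneous", "shared_gamma", "gene_gamma")


def n_branches(n_taxa: int) -> int:
    """Number of branches in an unrooted, fully bifurcating tree."""
    return 2 * n_taxa - 3


def n_params(branch_model: str, rate_model: str, n_taxa: int, n_genes: int) -> int:
    """Count free parameters for one branch/rate model combination."""
    b = n_branches(n_taxa)
    branch_params = {
        "concatenate": b,                  # one branch length set for all genes
        "proportional": b + n_genes - 1,   # one set plus a rate factor per extra gene
        "separate": b * n_genes,           # an independent set for every gene
    }[branch_model]
    rate_params = {
        "homogeneous": 0,                  # no among-site rate variation
        "shared_gamma": 1,                 # one gamma shape parameter for all genes
        "gene_gamma": n_genes,             # one gamma shape parameter per gene
    }[rate_model]
    return branch_params + rate_params


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion; lower values indicate a preferred model."""
    return -2.0 * log_likelihood + 2.0 * k


if __name__ == "__main__":
    # Illustrative sizes only; these are not the data sets of the study.
    for bm in BRANCH_MODELS:
        for rm in RATE_MODELS:
            print(bm, rm, n_params(bm, rm, n_taxa=10, n_genes=3))
```

Under these assumptions the separate model's parameter count grows with the product of genes and branches, whereas the proportional model adds only one parameter per additional gene. This may help explain why either model can come out ahead depending on the data set.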

Original language: English
Pages (from-to): 2294-2307
Number of pages: 14
Journal: Molecular Biology and Evolution
Volume: 19
Issue number: 12
State: Published - 1 Dec 2002
Externally published: Yes

Keywords

  • Combining data sets
  • Mammalia
  • Maximum likelihood
  • Molecular evolution
  • Phylogeny
