TY - JOUR
T1 - Combining multiple data sets in a likelihood analysis
T2 - Which models are the best?
AU - Pupko, Tal
AU - Huchon, Dorothée
AU - Cao, Ying
AU - Okada, Norihiro
AU - Hasegawa, Masami
PY - 2002/12/1
Y1 - 2002/12/1
AB - Until recently, phylogenetic analyses have been routinely based on homologous sequences of a single gene. Given the vast number of gene sequences now available, phylogenetic studies are now based on the analysis of multiple genes. Thus, it has become necessary to devise statistical methods to combine multiple molecular data sets. Here, we compare several models for combining different genes for the purpose of evaluating the likelihood of tree topologies. Three methods of branch length estimation were studied: assuming all genes have the same branch lengths (concatenate model), assuming that branch lengths are proportional among genes (proportional model), or assuming that each gene has a separate set of branch lengths (separate model). We also compared three models of among-site rate variation: the homogeneous model, a model that assumes one gamma parameter for all genes, and a model that assumes one gamma parameter for each gene. On the basis of two nuclear and one mitochondrial amino acid data sets, our results suggest that, depending on the data set chosen, either the separate model or the proportional model represents the most appropriate method for branch length analysis. For all the data sets examined, one gamma parameter for each gene represents the best model for among-site rate variation. Using these models we analyzed alternative mammalian tree topologies, and we describe the effect of the assumed model on the maximum likelihood tree. We show that the choice of the model has an impact on the best phylogeny obtained.
KW - Combining data sets
KW - Mammalia
KW - Maximum likelihood
KW - Molecular evolution
KW - Phylogeny
UR - http://www.scopus.com/inward/record.url?scp=0036900111&partnerID=8YFLogxK
U2 - 10.1093/oxfordjournals.molbev.a004053
DO - 10.1093/oxfordjournals.molbev.a004053
M3 - Article
C2 - 12446820
AN - SCOPUS:0036900111
SN - 0737-4038
VL - 19
SP - 2294
EP - 2307
JO - Molecular Biology and Evolution
JF - Molecular Biology and Evolution
IS - 12
ER -