This page discusses basic
concepts and methods of phylogenetic tree reconstruction based on
biological sequences. You can use this page to test what you already
know, and also as a summary for preparing for your exam. The Background
section discusses some basic concepts with some outside links. The
Course section contains material of the course itself, with links
Classification of organisms:
- Individuals identified as members of species
- Species grouped by similarity and/or common descent
- Order of descent established (phylogenetic tree)
- Identifying ancestors
- Using phylogenetic information for further studies (mechanism
of evolution, conservation, comparison, function of traits etc.)
Phenetics is the study of relationships based on similarity among
a group of organisms. Similarity or distance can be quantified.
Species are grouped according to similarity or distance measures.
Phenetic relationships are expressed by a tree-like network called
a phenogram.
Cladistics is the study of descent of organisms. Descent is estimated
by studying certain characters of common origin, modified through
evolution (shared derived characteristics). Degree of relatedness
of organisms and evolutionary pathways among organisms are quantified
by estimating the time needed for changes in character since divergence
of two taxa. Ancestor-descendant relationships are expressed by a
tree-like network called a cladogram.
More on cladistics at Berkeley's journey into the world of phylogenetic
Cladistics vs. Phenetics
The two philosophical approaches often result in the same tree.
When similarity directly reflects descent, a phenogram is similar
to a cladogram. To put it simply, perhaps too simply for many, a
cladogram reflects similarity weighted by the time since divergence.
Classical and Molecular Systematics
Classical systematics considers visible characters of organisms
to reconstruct phylogeny. Such characters might be the anatomy,
physiology or behaviour of organisms. However, most of such characters
are modified by natural selection, under varying selection pressure.
Changes of such selected characters might speed up or slow down
during certain periods, therefore time since divergence cannot be
precisely estimated in most cases.
Molecular systematics compares biological sequences (protein, nucleotide)
of organisms. Some biosequences, and many sites in most sequences
change regardless of selection pressure. Thus similarity among sequences
might provide less biased information on branching order. However,
there are a number of complications when inferring phylogenetic
species trees from biosequences.
Order and time of divergence are represented by phylogenetic
trees. Some phylogenetic trees resemble real trees with a trunk and
branches; others look like pathways. The two figures represent
the same four species. The tree with a trunk (or root) in Fig. A represents
the branching order among the four species, indicating that the lineage
to Cow split first from the lineage leading to Dog, Human and Monkey.
However, trees do not come with a root automatically; they must be
rooted by special techniques. Simply looking at an unrooted tree
(Fig. B), there is no way to tell where the root might be fitted.
(A) Rooted tree
(B) Unrooted tree
(C) Rooted tree
Nodes represent taxonomic units. External nodes represent
units directly compared (e.g. extant species), while internal nodes
are ancestral or hypothesized units.
Units might be species, subspecies, order, in fact, any kind of
taxonomic unit (OTU: Operational Taxonomic Unit).
Branches define the relationship among the OTUs.
Branch length reflects differences among OTUs in terms of time since
divergence, percent of differences etc. Please note that many trees
are unscaled, and branch lengths in unscaled trees are arbitrary.
A clade is a group consisting of all OTUs descended from a common ancestor.
Root: the branch leading to the common ancestor of all OTUs of the
tree. Please note, that most trees do not have a root, even if they
Topology is the branching pattern of the tree.
It should be noted that topology does not necessarily
represent true evolutionary history. When comparison is made according
to similarity of sequences, as in classical phenetics, what we
get is a hierarchical order of similarity among OTUs. Two taxa might
be similar not only because they have a common ancestor in the recent
past, but also because of similar selection pressure, or simply by chance.
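The terms defined above (external nodes, internal nodes, clade, topology) can be illustrated with a minimal sketch. It assumes a rooted tree written as nested Python tuples. Only "Cow splits first" comes from the text; the grouping of Dog, Human and Monkey, and the function names `leaves` and `clades`, are assumptions made for illustration.

```python
# Rooted tree as nested tuples; strings are external nodes (OTUs),
# tuples are internal nodes. The (Dog, (Human, Monkey)) grouping is
# an assumption; the text only states that Cow split first.
tree = ("Cow", ("Dog", ("Human", "Monkey")))


def leaves(node):
    """Return the set of OTU names found under a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """Return one leaf set per internal node, i.e. one per clade."""
    if isinstance(node, str):
        return []
    found = [frozenset(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found


all_clades = clades(tree)
for clade in all_clades:
    print(sorted(clade))
```

In this sketch Human and Monkey form a clade, while Cow and Dog do not, because no internal node has exactly those two OTUs as its descendants.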
Complications of DNA trees
Speciation is not a short event but a long-lasting
process, as illustrated by Fig. A. Although taxa A, B, C, and D are
clearly separated by time t4, divergence has no precise point in
time. In the case of taxa A, B and C, even the order of divergence cannot
be estimated precisely, and Figure B might be a true representation
of divergence. However, most phylogenetic analytical tools work
only on bifurcating trees; therefore a short additional branch is
sometimes added to the tree, as in Fig. C.
Convergent versus parallel evolution
Similar traits might evolve independently under similar selection
pressure in different lineages. Owls are related to nightjars (lower
right corner of owl), and condors to storks (lower right corner
of condor) according to DNA-DNA hybridization studies. However,
the beak shapes of owls and condors are more similar to those of unrelated
birds of prey than to those of their closest relatives.
Convergent evolution might obscure the distinction between homologies
and analogies in classical systematics. Although less often, similarity
in biosequences might also be a result of convergent evolution.
Consider the case of lysozyme, an enzyme expressed in saliva,
tears, milk or egg white in most higher vertebrates. Lysozyme defends
the organism against infection by degrading the cell walls of bacteria.
In some vertebrates (ruminants, colobine monkeys and the hoatzin),
lysozyme is also expressed in the stomach. Breaking the walls
of bacteria in the stomach is important to free nutrients assimilated
by the bacteria. The form of lysozyme expressed in saliva, however,
cannot function in the highly acidic stomach fluid. Amino acid changes
at critical sites of the protein resulted in a lysozyme that is active in
the gut. Such sequence changes happened in a similar direction,
independently in the three lineages. Thus similarity of the gut lysozyme
sequences of langur, cow and hoatzin reflects convergent evolution
and not common descent. Phylogenetic trees based on convergently evolved
sequences might therefore be misleading. Nevertheless, compared
to visible traits, such as the shape of beaks, there are only a
handful of examples of convergent evolution of biosequences.
For more information start with Zhang and Kumar (link from PubMed
to free article here).
Orthologs versus paralogs
Evolution is a process of modifying old goodies. In the case
of highly successful inventions (for example, transmembrane receptors),
duplication and modification of domains or complete genes is widespread.
For example, dopamine receptors D2 and D4 in human and mouse in the
figure above seem to be the result of a duplication back in time.
Human and mouse dopamine receptors
Tail of molecule enlarged
Consider gene A in the ancestor species in the figure below. Following
duplication and modification, variants A1 and A2 of gene A were fixed
in the ancestor. The ancestor species diverged into species X and
Z. The two variants A1 and A2 evolve independently in the two lineages
into A1x - A2x and A1z - A2z in species X and Z, respectively.
Paralogous genes are derived from duplication, such as A1 and A2.
Orthologous genes are derived from speciation, such as A1x - A1z
and A2x - A2z.
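The distinction can be sketched in a few lines of code. This is a minimal illustration, assuming each gene copy is labelled by its variant (A1 or A2) and its species (X or Z). The dictionary `genes` and the function `relationship` are hypothetical names. The sketch also assumes the usual convention that copies of different variants in different species (for example A1x and A2z) count as paralogs, because their last common ancestor is the duplication.

```python
from itertools import combinations

# Hypothetical labels for the four gene copies: (variant, species).
genes = {
    "A1x": ("A1", "X"),
    "A2x": ("A2", "X"),
    "A1z": ("A1", "Z"),
    "A2z": ("A2", "Z"),
}


def relationship(a, b):
    """Classify two gene copies by the event that separated them."""
    variant_a, _ = genes[a]
    variant_b, _ = genes[b]
    # Different variants trace back to the duplication (paralogs);
    # the same variant in two species traces back to speciation (orthologs).
    return "paralogs" if variant_a != variant_b else "orthologs"


pairs = {(a, b): relationship(a, b) for a, b in combinations(sorted(genes), 2)}
for (a, b), kind in pairs.items():
    print(a, b, kind)
```

Of the six possible pairs, only A1x - A1z and A2x - A2z come out as orthologs, which matches the examples given in the text.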
Genetic similarity among taxa should be estimated by comparing
Going back to human and mouse dopamine receptors, which pairs seem
to be orthologs and paralogs?
Species are considered groups of organisms that neither transfer genes to
nor receive genes from other species. Species-to-species gene exchange
(horizontal gene transfer), however, is an important evolutionary
process in bacteria and viruses. In eukaryotes, virus-mediated transmission
of genes can happen among lineages. Genes transferred horizontally
are called xenologs. However, horizontal gene transfer in eukaryotes
is probably so rare that it would not affect the structure of most
See articles on horizontal gene transfer written by Dr. Syvanen
Gene trees versus species trees
Speciation is a process rather than a short event, and there is
no clear-cut definition of when two groups of related organisms can
be considered subpopulations, ecotypes, subspecies or species.
Gene divergence is also a long-lasting process in which mutation
continuously increases polymorphism, while selection and random
drift eliminate some of the alleles.
Phylogenetic trees based on biosequence comparisons represent the time
and order of divergence of the sequences (gene tree). Sequence divergence
does not necessarily coincide with species divergence (species
tree).
Sequence divergence would normally precede species divergence in
regions under selection for polymorphism (Fig. A). For example, some
genes involved in the immune response are highly polymorphic in
humans, which is good for our health. However, polymorphism was
also beneficial, that is, positively selected for, in the common ancestor
of humans and chimps, resulting in high polymorphism. During speciation,
some of these variants were retained by both the human and chimp populations,
and the two species might share those alleles even today. Some of
the allelic variants were passed only to the human, and some
only to the chimp populations. Humans and chimps therefore differ
in such alleles; however, their divergence started long before speciation.
In some genes divergence starts after speciation. Such genes would
therefore underestimate the time since speciation (Fig. B). It is important
to use a number of genes to construct phylogenetic trees.
How to get sequences (right)
For some studies genes must be sequenced by the researcher. However,
once a study is published, the sequence is usually uploaded to a
public database, such as GenBank.
Sequences in a databank can be searched by their similarity (BLAST),
by their names (ENTREZ),
their source (taxonomy),
or other attributes.
As stated earlier, sequences to be compared should be homologous,
that is, they should be descendants of a common ancestor gene of
a common ancestor species. Now we should add, that sites in the
sequences should also be homologous. Identification of homologous
sites is not trivial. Some of the bases might be lost, some bases
might be inserted, and some might be substituted by others. Homologous
sites therefore should be identified by a process called alignment
before building a tree.
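To make the idea of alignment concrete, here is a minimal sketch of pairwise global alignment by dynamic programming (the Needleman-Wunsch approach). It is one common technique, not necessarily the one used in the course. The scoring values (match +1, mismatch -1, gap -1), the function name `global_align` and the example sequences are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # substitution or match
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = global_align("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(total)
```

Each column of the two output strings is a putative homologous site; a "-" marks a base assumed to be lost in one sequence or inserted in the other.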
Tree building algorithms
There are a good number of methods for constructing
phylogenetic trees from biosequences. Basically there are three major
approaches: distance, parsimony, and likelihood methods.
Distance methods consider the overall similarity of the
sequences. For example, the number of base differences across all sites
is counted pairwise between four species, A, B, C, and D. We can
write those 6 figures or distances into a matrix:
Operational taxonomic units
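As a minimal sketch of how such a distance matrix could be computed, the code below counts differing sites between every pair of aligned sequences. The four sequences are invented for illustration and are not data from the course; the names `aligned`, `differences` and `distances` are likewise hypothetical.

```python
from itertools import combinations

# Hypothetical aligned sequences for OTUs A-D (invented for illustration).
aligned = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACGAACGTTC",
    "D": "TCGAACCTTC",
}


def differences(s, t):
    """Count the sites at which two aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(s, t) if x != y)


# One distance per unordered pair of OTUs.
distances = {
    (p, q): differences(aligned[p], aligned[q])
    for p, q in combinations(sorted(aligned), 2)
}

# Print the full symmetric matrix with zeros on the diagonal.
names = sorted(aligned)
print("  " + " ".join(names))
for p in names:
    row = [0 if p == q else distances[tuple(sorted((p, q)))] for q in names]
    print(p + " " + " ".join(str(d) for d in row))
```

For n OTUs there are n(n-1)/2 unordered pairs, so four OTUs give 4·3/2 = 6 distances; the matrix is symmetric and only one triangle needs to be stored.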
The basic idea is, that sequences very different
According to Steel and Penny (pdf)
, and there are numerous ways to classify them.
molecular data (Nei and Kumar 2000). They can be
Review Article: Parsimony, Likelihood, and the Role of Models in
Molecular Phylogenetics
Mike Steel and David Penny
Biomathematics Research Centre, University of Canterbury, Christchurch,
New Zealand; and Institute of Molecular BioSciences, Massey University,
Palmerston North, New Zealand
Methods such as maximum parsimony (MP) are frequently
criticized as being statistically unsound and not being based on
any "model." On the other hand, advocates of MP claim that maximum
likelihood (ML) has some fundamental problems. Here, we explore
the connection between the different versions of MP and ML methods,
particularly in light of recent theoretical results. We describe
links between the two methods: for example, we describe how MP can
be regarded as an ML method when there is no common mechanism between
sites (such as might occur with morphological data and certain forms
of molecular data). In the process, we clarify certain historical
points of disagreement between proponents of the two methodologies,
including a discussion of several forms of the ML optimality criterion.
We also describe some additional results that shed light on how
much needs to be assumed about underlying models of sequence evolution
in order to successfully reconstruct evolutionary trees.