This page introduces
the basic concept of sequence alignment to the novice, using a very simple
example.
Sequences do not come neatly arranged one under the
other. They have to be aligned by identifying the homologous sites
in the two DNA sequences. The history of the sequences is not known; therefore,
homologous sites can be identified only with some probability.
As an example, think of a text with no spaces
between the words, just as there are no spaces between codons in DNA.
c o n s c i o u s o f t h e i m p o r t a n c e
Now hand the text over to two people who do not
understand English, and ask each of them to type it, and then to retype it from
the previous copy, again and again, 100 times. At the end, the
final copies of the two typists might be quite different, because
mistakes can be made at any time. How could a third person, who does
not speak English either, reconstruct the original text from the two final
versions?
Original    c o n s c i o u s o f t h e i m p o r t a n c e
Version I   c l n s s c u s o f g h i m p o t a n c e
Version II  c o n s c o o u s o f g h e i m p o r t a n c c
Mismatches    *     * * * * * * * * * * * * * * * * *
If the two versions are written one above the other,
there will be mismatches (*) at 18
sites. However, the lengths of the two versions are different, indicating
that besides misprints, some characters might have been omitted and
others inserted by either of the two typists. Unfortunately, there
is no way to tell whether at a given site a character was inserted
in Version II or deleted in Version I. Deletions and insertions
are therefore not distinguished, and are called gaps, or indels.
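The count of mismatches can be checked with a few lines of Python. This is only a minimal sketch: the two strings are the final copies from the example, and the function names `count_mismatches` and `mismatch_markers` are made up for this illustration.

```python
# The two final copies from the example, written without spaces.
version_1 = "clnsscusofghimpotance"
version_2 = "conscoousofgheimportancc"


def count_mismatches(a, b):
    """Count the sites that differ when a is written directly above b.

    Only the overlapping part is compared: zip() stops at the end of
    the shorter string.
    """
    return sum(1 for x, y in zip(a, b) if x != y)


def mismatch_markers(a, b):
    """Return a line with '*' under every mismatching site."""
    return "".join("*" if x != y else " " for x, y in zip(a, b))


print(version_1)
print(version_2)
print(mismatch_markers(version_1, version_2))
print(count_mismatches(version_1, version_2), "mismatches")
print(abs(len(version_1) - len(version_2)), "sites without a partner")
```

The script reports 18 mismatches in the 21 overlapping sites, and 3 sites at the end of Version II that have no partner in Version I.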
How many gaps would you add to the sequences? If
gaps can be added freely, all mismatches disappear:
Original    c o n s c i o u s o f t h e i m p o r t a n c e
Version I   c - l n s s c u s o - - - - f g h - i m p o - t a n c e
Version II  c o - n s - c - - o o u s o f g h e i m p o r t a n c - c
However, both sequences became much longer than the original text,
meaning that we have moved very far from the true text. If we knew the
probability of each type of mistake a typist can make, we could
make a better guess at what the original text might have been. What are the
common mistakes?
- First of all, you may want some data on which is more common:
omitting characters, or hitting the wrong key. If indels are less
probable than substitutions, then the second character in Version
I would be considered a substitution of o by l, and not
a deletion. For alignment, substitutions and indels should be weighted
differently.
(This is discussed under "gap and mismatch penalties".)
- At a given site there might be more than one substitution event
while the copies are being copied. The typist might
make a mistake several times at the same site, and finally arrive
back at the original character. With a large alphabet the chances of such events are very low.
However, if we have only four characters in the alphabet, as in
a DNA sequence, back mutation or parallel mutation is not unlikely.
(discussed later as "Jukes and Cantor's one-parameter
correction").
- It is far more frequent to hit a key near the one aimed at
than to hit a key far away. Substitution of O by L is more probable
than substitution of O by Y. In other words, the probabilities of substitutions are
not equal, for physical reasons.
(discussed later as "Kimura's 2-parameter correction").
- Some sites in the sequence change more readily than others.
Consider, for example, the three vowels in conscious.
The characters I, O and U are close to each other on the keyboard,
and it is quite easy to omit or mistype one of them.
Some models, not discussed here, deal with such hot and cold spots.
(If interested, see the section "Uneven spatial distribution
of substitutions" in Brian Golding's tutorial.)
- Some misprints do not change the meaning of the word. The word
"behaviour" means the same
even if mistyped as "behavior".
In protein-coding DNA sequences, substitutions at certain sites
(the third position of codons) do not necessarily affect the amino
acid sequence. Fixation of synonymous mutations is more probable
than that of nonsynonymous ones, because selection does
not act on the DNA, but on its product.
(If interested, see the section "Synonymous - nonsynonymous
substitutions" in Brian Golding's tutorial.)
- In most text editors it is possible to create a short program
which lets the typist hit just one special key to make the
computer type a whole word. This shortcut is handy for frequent
words. An American might code behavior,
color etc. as shorthands, while a
British person would code behaviour,
colour instead. Spelling does not
change the meaning; however, it changes the sequence of characters.
Even if the typist copied most of the text letter by letter,
using shorthands for the most frequent words would increase the
difference between the versions. In DNA, genes which are
transcribed frequently tend to use one kind of spelling. Species
might differ in spelling style, which affects the distance between
two DNA sequences.
(discussed later as "Codon usage")
- When sliding one version above the other we might find areas
of perfect matches, divided by areas of mismatches. When aligning
two strings it has to be decided what is more important: to find
areas of good matches and consider the rest as noise, or to
match the two strings character by character.
(discussed later as "Basic Local Alignment Search Tool")
- The most horrifying typos are the ones which do have some meaning.
If one of your typists speaks French, the word importance
would make as much sense to him or her as the mistyped "impotance",
and he or she might then stick to the new version. This type of "mutation
-> new function" is important when genes are under different
selection pressures in the different species we study.
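The first point in the list, weighting substitutions and indels differently, can be tried out with a small global alignment routine. The sketch below uses the Needleman-Wunsch dynamic programming method with a linear gap penalty. That method is brought in here only for illustration, since this page does not name an algorithm, and the score values (`match`, `mismatch`, `gap`) are arbitrary assumptions, not recommended settings.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (aligned_a, aligned_b, score).

    Every gap character costs `gap` (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        """Score for putting a[i-1] above b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell keeps the best of three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # (mis)match
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


version_1 = "clnsscusofghimpotance"
version_2 = "conscoousofgheimportancc"

# Expensive gaps: substitutions are preferred wherever possible.
for line in needleman_wunsch(version_1, version_2, gap=-2)[:2]:
    print(line)
# Free gaps: every mismatch can be traded for two gaps at no cost.
for line in needleman_wunsch(version_1, version_2, gap=0)[:2]:
    print(line)
```

With `gap=0` the result contains no mismatched columns at all, which is the "gaps added freely" situation shown earlier. Making gaps more expensive than mismatches pushes the alignment towards substitutions instead.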
Sounds complicated? It is complicated. Many factors can be considered
when matching two sequences. Computation time, however, greatly
limits the number of factors that can be considered.
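Two of the factors in the list, multiple hits at the same site and unequal substitution probabilities, are usually handled with simple distance corrections. This page does not give the formulas, so the sketch below uses the standard textbook forms of the Jukes-Cantor and Kimura two-parameter distances as an assumption. The function names are hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """One-parameter correction.

    p is the observed proportion of differing sites; the formula is
    only defined for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p_distance(transitions, transversions):
    """Two-parameter correction.

    transitions:   observed proportion of sites differing by a transition
                   (A<->G or C<->T)
    transversions: observed proportion of sites differing by a transversion
    """
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("too many differences for this correction")
    return -0.5 * math.log(a * math.sqrt(b))


# 30% observed differences correspond to more than 30% actual changes,
# because some changes are hidden by later changes at the same site.
print(jukes_cantor_distance(0.30))
print(kimura_2p_distance(0.20, 0.10))
```

For 30% observed differences the one-parameter distance is roughly 0.38. When one third of the differences are transitions and two thirds are transversions, the two formulas give the same value.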
The good news is that most programs deal with a number of corrections
in a user-friendly way. It is important to know about gap
penalties, because you will have to set them yourself. It is useful
to have an idea of the one- and two-parameter corrections, because
in some programs these can also be set manually. As far as more
sophisticated methods are concerned, there are shortcuts that are not too
difficult to understand. One way to go is to construct a matrix
containing all characters, and to give a weight to each possible change
from one character to another.
Such matrices are built into the majority of alignment programs,
and the novice can simply accept them. For example, by comparing
homologous proteins of many species it is easy to tell which amino
acids are replaced by which others with high probability during
evolution. From that information a substitution matrix can be created
and used in alignments. The idea is that changes which occur frequently
need less time on average than rare changes. Sites in two sequences
that differ by a frequent change may therefore be considered homologous. Protein-coding
DNA sequences can be translated into amino acid sequences,
and then the translations can be compared.
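A weight matrix of this kind can be sketched for the four DNA characters. The weights below are invented for illustration only: identical characters score highest, transitions (A<->G, C<->T), which are the more frequent changes, are penalised less than transversions, and gaps cost most. The matrices built into alignment programs are derived from real sequence comparisons instead.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_weight(x, y, match=1, transition=-1, transversion=-2):
    """Weight for placing character x above character y (hypothetical values)."""
    if x == y:
        return match
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return transition      # frequent change, small penalty
    return transversion        # rarer change, larger penalty


def build_matrix(alphabet="ACGT"):
    """Build the full weight matrix as a dictionary keyed by character pairs."""
    return {(x, y): substitution_weight(x, y)
            for x in alphabet for y in alphabet}


def score_alignment(a, b, matrix, gap=-3):
    """Sum the weights over all columns of two already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += matrix[(x, y)]
    return total


matrix = build_matrix()
print(score_alignment("ACGT", "ACAT", matrix))  # one transition (G/A)
print(score_alignment("ACGT", "ACCT", matrix))  # one transversion (G/C)
print(score_alignment("AC-T", "ACGT", matrix))  # one gap
```

With these weights the three example alignments score 2, 1 and 0, so the pair that differs by the frequent kind of change is the one most readily accepted as homologous.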
Summary: Nucleotide sequence alignment
is the process of matching homologous sites of DNA sequences. Mutations
have accumulated in each lineage since divergence from a common ancestor;
therefore two sequences might be quite different. Some evolutionary
changes are more frequent than others. Algorithms for alignment
take the probability of such changes into account. Homology of sites
can be accepted more readily when the difference can be explained
by a frequent change, such as a synonymous mutation.
Next: an example for gap penalties.