|
THIS PAGE: very basics
of sequence alignment.
Sequence alignment is the process of matching homologous
sites of DNA sequences. The history of sequences are not known,
therefore, homologous sites can be identified only with some probability.
Considere tow homologous sequences of equal length,
for example mtDNA from Homo sapiens neandertalensis and H. sapiens
sapiens as retrived by BLAST
search from GenBank. When it is reasonable to suppose that no insertions
or deletions have happened in any of the sequences since their divergence
from a common ancestor, alignment is easy, because any difference
between the sequences should be base substitutions. It is like having
the sequences on two rulers and sliding one ruler above the other,
untill the vast majority of letters match.
However, in most cases the occurance of insertions
or deletions cannot be excluded. Considere two identical sequences,
I. and II. in Table 1 bellow. Through time any of the sequences
can loose a single base (marked with X),
or any base might be substituted by mutation.
|
|
|
|
G
|
|
|
|
|
|
|
|
|
|
|
I.
|
A
|
T
|
C
|
C
|
G
|
T
|
C
|
G
|
T
|
T
|
|
|
|
|
÷
|
÷
|
÷
|
÷
|
÷
|
÷
|
÷
|
÷
|
÷
|
÷
|
|
|
|
II.
|
A
|
T
|
C
|
C
|
G
|
T
|
C
|
G
|
T
|
T
|
|
|
|
|
|
|
|
|
|
X
|
|
|
|
C
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 1.
Without information on the evolutionary history of
Sequence I' II', we simply have two sequences of unequal length
with 5 mismatches:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
I'.
|
A
|
T
|
G
|
C
|
G
|
T
|
C
|
G
|
T
|
T
|
m=10
|
|
|
÷
|
÷
|
|
÷
|
÷
|
|
|
|
|
|
|
|
|
II'.
|
A
|
T
|
C
|
C
|
G
|
C
|
G
|
T
|
C
|
-
|
n=9
|
|
|
|
|
*
|
|
|
*
|
*
|
*
|
*
|
|
|
|
Table 2
If there is no information on how many deletions
or insertions have happened since divergence, we should make some
decisions. First of all, we cannot be sure, whether the difference
in length was caused by an insertion in Sequence I, or by a deletion
from Sequence II. As insertions cannot be distinguished from deletions,
both of those events are considered gaps (or indels)
and are marked by "-".
The way the two sequences were written in Table 2
above minimised the number of gaps. Now, let us minimise the number
of mismatches by adding as many gaps, as necessary:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
I'.
|
A
|
T
|
G |
-
|
C
|
G
|
T
|
C
|
G
|
T
|
-
|
T |
|
|
÷
|
÷
|
|
|
÷
|
÷
|
|
÷
|
÷
|
÷
|
|
|
|
II'.
|
A
|
T
|
-
|
C
|
C
|
G
|
-
|
C
|
G
|
T
|
C
|
-
|
|
|
|
|
-
|
-
|
|
|
-
|
|
|
|
-
|
- |
Table 3
Addition of unlimited number gaps eliminates mismatches,
however, the number of gaps increased to 5. Because there is no
independent information on actually how many deletions versus substitutions
have happened since divergence, to find homologous sites an algorithm
should be worked out.
Finding homologous sites means to match sites of common
descent. It is almost like recreating the state of the original
sequence before divergence. As time has passed since divergence,
more and more mutations accumulated in the two sequences. Some of
such changes have higher probability than other types of mutations.
Occurance of events with low probability takes long time on average.
The trick of alignment therefore is to considere changes according
to their probabilities, as discussed using a hyperprimitive example
here.
Gap and mismatch penalties
The difference, or distance between two sequences
have been created by substitutions and gaps. These two types of
mutations can be given separate weight, or so called penalties:
D = kg + lm
where k is the number of gaps in the alignment, g
is the gap penalty, l is the number of mismatches and m is the mismatch
penalty. According to a number of studies, gaps are rare events,
therefore gap panelties are normally set higher than mismatch penalties.
It is also reasonable to assume, that the creation
of independent single gaps k-times is not equal to a single indel
event of a longer sequence.
D = Ski gi
+ lm
where ki is the number of gaps of length
i, and gi is the penalty of gaps of lenght
i. In most alignment algorithms, distinction between single gaps
and longer gaps is made in a simple way. One can set a penalty to
single gaps (gap opening) and a single gap length (gap extension)
penalty. Normally, gap opening penalty is set high, and gap extension
penalty is set lower.
Let us apply gap penalty g = 2, and mismatch penalty
m = 1 to our sequences.
|
Minimising gaps
|
|
Minimising mismatches
|
|
I'.
|
A
|
T
|
G
|
C
|
G
|
T
|
C
|
G
|
T
|
T
|
|
|
|
÷
|
÷
|
|
÷
|
÷
|
|
|
|
|
|
|
|
II'.
|
A
|
T
|
C
|
C
|
G
|
C
|
G
|
T
|
C
|
-
|
|
|
|
|
|
*
|
|
|
*
|
*
|
*
|
*
|
-
|
|
|
|
|
|
1
|
|
|
1
|
1
|
1
|
1
|
2
|
7
|
|
_
|
|
I'.
|
A
|
T
|
G
|
-
|
C
|
G
|
T
|
C
|
G
|
T
|
-
|
T
|
|
|
|
÷
|
÷
|
|
|
÷
|
÷
|
|
÷
|
÷
|
÷
|
|
|
|
|
II'.
|
A
|
T
|
-
|
C
|
C
|
G
|
-
|
C
|
G
|
T
|
C
|
-
|
|
|
|
|
|
-
|
-
|
|
|
-
|
|
|
|
-
|
-
|
|
|
|
|
|
2
|
2
|
|
|
2
|
|
|
|
2
|
2
|
10
|
|
Minimising gaps results in D = 7, while minising mismatches yields
D = 10. In theory, it is possible to add systematically gaps at
every position and then recalculate the sum of penalties. Doing
this many times, we would receive an alignment with minimal distance,
and the struggle is over. For example, adding a single gap at a
certain site might reduce the distance value to D = 4:
|
Optimal alignment
|
|
I'.
|
A
|
T
|
G
|
C
|
G
|
T
|
C
|
G
|
T
|
T
|
|
|
|
÷
|
÷
|
|
÷
|
÷
|
|
÷
|
÷
|
÷
|
|
|
|
II'.
|
A
|
T
|
C
|
C
|
G
|
- |
C
|
G
|
T
|
C
|
|
|
|
|
|
*
|
|
|
-
|
|
|
|
*
|
|
|
|
|
|
1
|
|
|
2
|
|
|
|
1
|
4
|
Real DNA sequences, however, are much longer and there are so many
possibilities to add gaps, that even a fast computer could not find
the optimal solution by such a simple algorithm. Most softwares,
such as Clustal, starts the alignment by dynamic programing.
Next: dynamic programing
For the novice: local alignment, introduction to BLAST. Here
|