|
This page: Here we show
a method of Sellers, which provides an algorithm to choose among
the possible alignments and to calculate the distance of two sequences.
For example, consider these two sequences:
| Site |
1.
|
2.
|
3.
|
4.
|
5.
|
6.
|
7.
|
8.
|
9.
|
10.
|
| Seq. 1 |
A
|
T
|
G
|
C
|
G
|
T
|
C
|
G
|
T
|
T
|
| Seq. 2 |
A
|
T
|
C
|
C
|
G
|
C
|
G
|
T
|
C
|
|
| mism. |
|
|
*
|
|
|
*
|
*
|
*
|
*
|
-
|
| ............. |
........ |
........ |
........ |
........ |
........ |
........ |
........ |
........ |
........ |
........ |
Sequence 1 and 2 differ in 5 sites in bases, and Seq.
1 is one base longer than Seq. 2. There are two types of events
during evolution which could result in such differences. Firstly,
fixed point mutations change bases in one or both sequences, secondly,
bases might get lost or inserted in any of the two sequences. It
is impossible to tell by comparing two sequencies, whether the difference
for example in Site 3 is caused by a base change or by insertion/deletion
in any of the two sequences.
Here we show a method of Sellers, which provides an
algorithm to choose among the possible alignments and to calculate
the distance of two sequences. The distance is measured here as
D = kg + lm,
where k is the number of gaps in the alignment, g
is the gap penalty, l is the number of mismatches and m is the mismatch
penalty.
Consider these two sequences:
ATCCGCGTC
Take an (n+1)(m+1) matrix (n and m refers to the
length of the sequences), and write the sequences on the edges with
a gap sign (-) before them:
Let us calculate the matrix element Di,j
as:
Di,j = min { Di-1,j
+ g; Di,j-1 + g;
Di-1,j-1 + { 0 if i=j else m
} } !
What does that mean? Each element of the matrix is
an end of the possible alignments and the score (Di,j)
belonging to this alignment has to be written into the cell. If
we want to align the i-th base of the first sequence (Ai)
and the j-th base of the second sequence (Bj),
there can be three cases:
1. Ai-1 was optimally
aligned with Bj, thus we need one more gap to align
Ai to this alignment, and the alignment score will be Di-1,j
+ 1 gap penalty:
..................................
Ai-1 Ai
..................................
Bi -
2. Ai was optimally
aligned with Bj-1, thus we need one more gap to align
Aj to this alignment, and the alignment score will
be Di,j-1 + 1 gap penalty:
..................................
Ai -
..................................
Bj-1 Bj
3. Ai-1 was optimally
aligned with Bj-1, so the Ai and Bj
elements get into the same position and we have to align them;
if Ai and Bj are the same bases, the alignment
score Di,j will be the same as the score Di-1,j-1,
and if they differ, the alignment will contain one more mismatch,
thus the score will be Di-1,j-1 + 1 mismatch penalty:
......................................
Ai
......................................
Bj
We must accept the case with the smallest score
and write the score to the proper cell of the matrix, and than fill
in the matrix row by row.
The distance of the two sequences will be the score in the right
bottom corner of the matrix, and the best alignment can be obtained
by the method called back-tracking.
Let's see our example:
- Let the gap penalty g = 2, and the mismatch penalty m = 1
!
- First we fill in the first row and the first column of the
matrix. The scores are unambiguous because in the first row
we can calculate Di,1 only from the element Di-1,1
(so Di,1 will be Di-1,1 + 2) and in the
first column D1,j only from the element D1,j-1
(so D1,j = D1,j-1 + 2):
-
If we reach the right-bottom corner of the matrix
we will get the best alignment score of the whole sequences.
Back-tracking means that we start from the right -bottom corner
and go along those elements from which the next alignment score
was computed until we reach the left-upper corner, and the path
will assign the best alignment (see the blue numbers):
The best alignment is:
A T G C G T C G T T
A T C C G - C G T C
Notes:
This method can be expanded to three or more sequences, but
the running time of the algorithm grows exponentially with the number
of sequences. There are some methods by which the algorithm can
be accelerated; that of Spouge reduces the running time by
computing only a fraction of the matrix: before the alignment procedure
we must determine a test value (t), which might be the maximum expected
alignment score of the sequences, and than only those matrix elements
are calculated which satisfy Di,j + |(n-i) - (m-j)|*g
=< t (the meaning of the signs see above), and if in a row we
reach this t-value, we needn't calculate the rest elements of the
row.
Note that if the distance of the two sequences is more than t, the
algorithm doesn't calculate the alignment score.
Note also that with this procedure a z-length gap has z-times more
weight than a 1-length gap, and this has no biological relevance.
(For example the deletion of three bases has usually larger probability
than the deletion of one or two bases, because in the former case
a there will not be frame-shift.) There are methods which attempt
to correct this problem.
|