SALVE FOUNDATION HOME  
BIOHOME - COURSES - LIBRARY - SCIENCE NEWS - GALLERY - LINKS - MAIL  
Phylogenetic trees and evolutionary comparison  
Page: DNA sequence alignment - gap panelties    
 
   
     
   

THIS PAGE: very basics of sequence alignment.

Sequence alignment is the process of matching homologous sites of DNA sequences. The history of sequences are not known, therefore, homologous sites can be identified only with some probability.

Considere tow homologous sequences of equal length, for example mtDNA from Homo sapiens neandertalensis and H. sapiens sapiens as retrived by BLAST search from GenBank. When it is reasonable to suppose that no insertions or deletions have happened in any of the sequences since their divergence from a common ancestor, alignment is easy, because any difference between the sequences should be base substitutions. It is like having the sequences on two rulers and sliding one ruler above the other, untill the vast majority of letters match.

 

However, in most cases the occurance of insertions or deletions cannot be excluded. Considere two identical sequences, I. and II. in Table 1 bellow. Through time any of the sequences can loose a single base (marked with X), or any base might be substituted by mutation.
G
I.
A
T
C
C
G
T
C
G
T
T
÷
÷
÷
÷
÷
÷
÷
÷
÷
÷
II.
A
T
C
C
G
T
C
G
T
T
X
C

Table 1.

Without information on the evolutionary history of Sequence I' II', we simply have two sequences of unequal length with 5 mismatches:

 
       
I'.
A
T
G
C
G
T
C
G
T
T
m=10
÷
÷
÷
÷
II'.
A
T
C
C
G
C
G
T
C
-
n=9
*
*
*
*
*


Table 2

If there is no information on how many deletions or insertions have happened since divergence, we should make some decisions. First of all, we cannot be sure, whether the difference in length was caused by an insertion in Sequence I, or by a deletion from Sequence II. As insertions cannot be distinguished from deletions, both of those events are considered gaps (or indels) and are marked by "-".

The way the two sequences were written in Table 2 above minimised the number of gaps. Now, let us minimise the number of mismatches by adding as many gaps, as necessary:
           
 
I'.
A
T
G
-
C
G
T
C
G
T
-
T
÷
÷
 
÷
÷
 
÷
÷
÷
 
II'.
A
T
-
C
C
G
-
C
G
T
C
-
-
-
-
-
-

Table 3

Addition of unlimited number gaps eliminates mismatches, however, the number of gaps increased to 5. Because there is no independent information on actually how many deletions versus substitutions have happened since divergence, to find homologous sites an algorithm should be worked out.

Finding homologous sites means to match sites of common descent. It is almost like recreating the state of the original sequence before divergence. As time has passed since divergence, more and more mutations accumulated in the two sequences. Some of such changes have higher probability than other types of mutations. Occurance of events with low probability takes long time on average. The trick of alignment therefore is to considere changes according to their probabilities, as discussed using a hyperprimitive example here.

 

Gap and mismatch penalties

The difference, or distance between two sequences have been created by substitutions and gaps. These two types of mutations can be given separate weight, or so called penalties:

D = kg + lm

where k is the number of gaps in the alignment, g is the gap penalty, l is the number of mismatches and m is the mismatch penalty. According to a number of studies, gaps are rare events, therefore gap panelties are normally set higher than mismatch penalties.

It is also reasonable to assume, that the creation of independent single gaps k-times is not equal to a single indel event of a longer sequence.

D = Ski gi + lm

where ki is the number of gaps of length i, and gi is the penalty of gaps of lenght i. In most alignment algorithms, distinction between single gaps and longer gaps is made in a simple way. One can set a penalty to single gaps (gap opening) and a single gap length (gap extension) penalty. Normally, gap opening penalty is set high, and gap extension penalty is set lower.

Let us apply gap penalty g = 2, and mismatch penalty m = 1 to our sequences.

 
Minimising gaps
Minimising mismatches
I'.
A
T
G
C
G
T
C
G
T
T
÷
÷
÷
÷
II'.
A
T
C
C
G
C
G
T
C
-
*
*
*
*
*
-
1
1
1
1
1
2
7
_
I'.
A
T
G
-
C
G
T
C
G
T
-
T
÷
÷
÷
÷
÷
÷
÷
II'.
A
T
-
C
C
G
-
C
G
T
C
-
-
-
-
-
-
2
2
2
2
2
10

Minimising gaps results in D = 7, while minising mismatches yields D = 10. In theory, it is possible to add systematically gaps at every position and then recalculate the sum of penalties. Doing this many times, we would receive an alignment with minimal distance, and the struggle is over. For example, adding a single gap at a certain site might reduce the distance value to D = 4:

Optimal alignment
I'.
A
T
G
C
G
T
C
G
T
T
÷
÷
÷
÷
÷
÷
÷
II'.
A
T
C
C
G
-
C
G
T
C
*
-
*
1
2
1
4

Real DNA sequences, however, are much longer and there are so many possibilities to add gaps, that even a fast computer could not find the optimal solution by such a simple algorithm. Most softwares, such as Clustal, starts the alignment by dynamic programing.

Next: dynamic programing

For the novice: local alignment, introduction to BLAST. Here

     
Page written by: Peter Kabai  
Edited by: Peter Kabai  
     
written: 2001-05-04, modified.: 2001-05-04