Phylogenetic trees and evolutionary comparison  
Page: DNA sequence alignment    
 
 
 
     
   

This page introduces the basic concept of sequence alignment to the novice, using a very simple example.

 

Sequences do not come neatly arranged one under the other. They must be aligned by identifying homologous sites in the two DNA sequences. Because the history of the sequences is not known, homologous sites can be identified only with some probability.

As an example, think of a text with no spaces between the words, just as there are no spaces between codons in DNA.

 
consciousoftheimportance
                                               
                                               

Now hand the text to two people who do not understand English, and ask each of them to type it, then retype it from the previous copy, again and again, 100 times. In the end the two final copies might be quite different, because a mistake can be made at any step. How could a third person, who does not speak English either, reconstruct the original text from the two final versions?

 
Original    consciousoftheimportance
Version I   clnsscusofghimpotance
Version II  conscoousofgheimportancc
Mismatches   *  *****************

If the two versions are written one above the other, there will be mismatches (*) at 18 sites. However, the two versions differ in length, which indicates that besides misprints, some characters might have been omitted and others inserted by either of the two typists. Unfortunately, there is no way to tell whether at a given site a new character was inserted in Version II or deleted in Version I. Deletions and insertions are therefore not distinguished, and are called gaps, or indels.
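The count of 18 can be checked with a few lines of Python. This is only a sketch: the function name `count_mismatches` is ours, and it compares the two versions position by position from the left, ignoring the extra characters at the end of the longer one.

```python
def count_mismatches(a: str, b: str) -> int:
    """Count differing characters over the overlapping part of two ungapped strings."""
    # zip() stops at the end of the shorter string, so the overhang is ignored.
    return sum(x != y for x, y in zip(a, b))


version_1 = "clnsscusofghimpotance"
version_2 = "conscoousofgheimportancc"

print(count_mismatches(version_1, version_2))  # 18
print(len(version_2) - len(version_1))         # 3 characters of length difference
```

Only positions 1, 3 and 4 agree. After the first lost or inserted character everything downstream is shifted, which is why a plain position-by-position comparison looks so bad.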

How many gaps would you add into the sequences? If gaps can be added freely, all mismatches disappear:

 
 
Original    consciousoftheimportance
Version I   c-lnsscuso----fgh-impo-tance
Version II  co-ns-c--oousofgheimportanc-c
 

However, both sequences became much longer than the original text, meaning that we have moved very far from the true text. If we knew the probability of each type of mistake a typist can make, we could make a better guess at what the original text might have been. What are the common mistakes?
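The cost of adding gaps freely can be made concrete by tallying the columns of the gap-filled versions. A minimal sketch: the helper name `column_counts` is ours, and the shorter row is assumed to end in a gap so that both rows have the same number of columns.

```python
from itertools import zip_longest


def column_counts(row1: str, row2: str, gap: str = "-"):
    """Tally (matches, mismatches, gap_columns) for two aligned rows."""
    matches = mismatches = gap_columns = 0
    # Assumption: pad the shorter row with trailing gaps.
    for x, y in zip_longest(row1, row2, fillvalue=gap):
        if x == gap or y == gap:
            gap_columns += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gap_columns


gapped_1 = "c-lnsscuso----fgh-impo-tance"
gapped_2 = "co-ns-c--oousofgheimportanc-c"

print(column_counts(gapped_1, gapped_2))  # (16, 0, 13)
```

The mismatches are gone, but 13 of the 29 columns now contain a gap, although the two versions differ in length by only three characters.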

  1. First of all, you may want some data on which is more common: omitting characters or hitting the wrong key. If indels are less probable than substitutions, then the second character in Version I would be considered a substitution of o by l, and not a deletion. For alignment, substitutions and indels should be weighted differently.
    (discussed later as "gap and mismatch penalties")
  2. At a given site there might be more than one substitution event during the repeated copying. A typist might make a mistake several times at the same site and finally arrive back at the original character. With a large alphabet the chances of such events are very low. However, if there are only four characters in the alphabet, as in a DNA sequence, back mutation or parallel mutation is not unlikely.
    (discussed later as "Jukes and Cantor's one-parameter correction")
  3. It is far more frequent to hit a key near the one aimed at than a key far away. Substitution of O by L is more probable than of O by Y. In other words, the probabilities of substitutions are not equal, for physical reasons.
    (discussed later as "Kimura's 2-parameter correction")
  4. Some sites in the sequence change more readily than others. Consider, for example, the three vowels in "conscious". The characters I, O and U are close to each other on the keyboard, and it is quite easy to omit or mistype one of them.
    Some models, not discussed here, deal with such hot and cold spots. (If interested, see the section "Uneven spatial distribution of substitutions" in Brian Golding's tutorial.)
  5. Some misprints do not change the meaning of the word. The word "behaviour" means the same even if mistyped as "behavior". In protein-coding DNA sequences, substitutions at certain sites (the third position of codons) do not necessarily affect the amino acid sequence. Fixation of synonymous mutations is more probable than that of nonsynonymous ones, because selection does not act on the DNA, but on its product.
    (If interested, see the section "Synonymous - nonsynonymous substitutions" in Brian Golding's tutorial.)
  6. In most text editors it is possible to create a short program that lets the typist hit just one special key to make the computer type a whole word. This shortcut is handy for frequent words. An American might code behavior, color etc. as shorthands, while a British person would code behaviour, colour instead. Spelling does not change the meaning, but it does change the sequence of characters. Even if the typist copied most of the text letter by letter, using shorthands for the most frequent words would increase the difference between the versions. In DNA, genes that are transcribed frequently tend to use one kind of spelling. Species might differ in spelling style, which affects the distance between two DNA sequences.
    (discussed later as "Codon usage")
  7. When sliding one version above the other, we might find areas of perfect matches separated by areas of mismatches. When aligning two strings, it must be decided what is more important: to find areas of good matches and consider the rest as noise, or to match the two strings character by character.
    (discussed later as "Basic Local Alignment Search Tools")
  8. The most horrifying typos are the ones that do have some meaning. If one of your typists speaks French, the word importance would make as much sense to him/her as the mistyped "impotance", and he/she might then stick to the new version. This type of "mutation -> new function" is important when genes are under different selection pressures in the different species we study.
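Point 1 above can be tried out in code. The sketch below uses the classic Needleman-Wunsch dynamic programming scheme for global (character by character) alignment, which is one standard way to apply gap and mismatch penalties; it is not necessarily what any particular alignment program does. The function name and the default penalty values (match +1, mismatch -1, gap -2) are arbitrary choices made for illustration.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


version_1 = "clnsscusofghimpotance"
version_2 = "conscoousofgheimportancc"

for gap_penalty in (-2, 0):
    total, row_1, row_2 = needleman_wunsch(version_1, version_2, gap=gap_penalty)
    print(f"gap penalty {gap_penalty}: score {total}")
    print(row_1)
    print(row_2)
```

With a gap penalty of 0, gaps cost nothing and the algorithm never needs to accept a mismatch; with a heavier gap penalty it accepts some substitutions instead. The two settings can be compared by running the loop.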

Sounds complicated? It is complicated. Many factors can be considered when matching two sequences. Computation time, however, greatly limits the number of factors that can be considered.

The good news is that most software packages handle a number of these corrections in a user-friendly way. It is important that you know about gap penalties, because you will have to set them yourself. It is useful to have some idea of the one-parameter and two-parameter corrections, because in some programs these can also be set manually. As for the more sophisticated methods, there are shortcuts that are not too difficult to understand. One approach is to construct a matrix containing all characters and to give a weight to each possible change from one character to another. Such matrices are built into the majority of alignment programs, and the novice can simply accept them. For example, by comparing homologous proteins of many species it is easy to tell which amino acids are likely to be replaced by which others during evolution. From that information a substitution matrix can be created and used in alignments. The idea is that changes which occur frequently need less time on average than rare changes. Sites in two sequences that differ by a frequent change might therefore be considered homologous. Protein-coding DNA sequences can be translated to amino acid sequences, and then the translations can be compared.
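To give a feeling for what the one-parameter and two-parameter corrections do, here is a minimal sketch using the standard textbook forms of the Jukes-Cantor and Kimura two-parameter distances, which are not derived on this page. In DNA the "nearby keys" are transitions (A<->G or C<->T), and the remaining changes are transversions. The function names and the two short example sequences are made up for illustration, and the code assumes the sequences are already aligned and contain only A, C, G, T and the gap character.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_proportions(seq1: str, seq2: str):
    """Return (P, Q): proportions of transitions and transversions
    among the aligned sites where neither sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == "-" or y == "-":
            continue  # gapped sites are skipped
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return transitions / compared, transversions / compared


def jukes_cantor_distance(p: float) -> float:
    """One-parameter correction: d = -3/4 * ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p_distance(P: float, Q: float) -> float:
    """Two-parameter correction: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Hypothetical aligned sequences: one transition (G->A) and one transversion (A->C).
seq_a = "ACGTACGTAC"
seq_b = "ACATACGTCC"

P, Q = substitution_proportions(seq_a, seq_b)
print(P + Q)                         # observed proportion of differing sites
print(jukes_cantor_distance(P + Q))  # corrected, assuming all changes equally likely
print(kimura_2p_distance(P, Q))      # corrected, transitions and transversions separate
```

Both corrected distances come out somewhat larger than the observed proportion of differences, because they allow for repeated and back substitutions at the same site.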

Summary: Nucleotide sequence alignment is the process of matching homologous sites of DNA sequences. Mutations have accumulated in each lineage since divergence from a common ancestor, so two sequences might be quite different. Some evolutionary changes are more frequent than others. Algorithms for alignment take the probability of such changes into account. Homology of sites can be accepted more readily when the difference can be explained by a frequent change, such as a synonymous mutation.

Next: an example for gap penalties.

     
Page written by: Peter Kabai  
Edited by: Peter Kabai  
     
Written: 2001-05-04, modified: 2001-05-04