Phylogenetic trees and evolutionary comparison  
Page: Finding nucleotide sequences in GenBank by BLAST    
 
 
 
     
   

THIS PAGE: describes how to use BLAST, for the novice. BLAST (Basic Local Alignment Search Tool) is used to find sequences similar to a sequence you already have.

A DNA sequence obtained from GenBank is a series of letters, such as this:

ttttcgtctg gggggtgtgc acgcgatagc attgcgagac gctggagccg gagcacccta tgtcgcagta tctgtctttg attcctgccc cattccatta tttatcgcac ctacgttcaa tattacaggc gagcatactt actaaagtgt gttaattaat taatgcttgt aggacataat aataacgact aaatgtctgc acagctgctt tccacacaga catcataaca aaaaatttcc accaaacccc ctttcctccc ccgcttctgg ccacagcact taaacacatc tctgccaaac cccaaaaaca aagaacccta acaccagcct aaccagactt

This chain of letters carries no information by itself. A given sequence in a given species, however, did not come out of the blue; it is a modification of an earlier sequence in a predecessor species. Finding similar sequences may therefore give a clue to the function, if any, of the query sequence.
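Before such a sequence is handed to any program, the spaces that group it into blocks of ten usually have to be removed. A minimal Python sketch, assuming the sequence was copied as plain text; the helper names `clean_sequence` and `base_composition` are made up for this illustration:

```python
from collections import Counter


def clean_sequence(raw: str) -> str:
    """Drop spaces, line breaks and digits; return the bases in lowercase."""
    return "".join(ch for ch in raw.lower() if ch.isalpha())


def base_composition(seq: str) -> dict:
    """Count how often each letter occurs in the cleaned sequence."""
    return dict(Counter(seq))


# First three blocks of the sequence shown above.
start = clean_sequence("ttttcgtctg gggggtgtgc acgcgatagc")
print(len(start), base_composition(start))
```

The counts say nothing about function, which is the point made above: the letters become informative only when compared with other sequences.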

Similarity searches can be done in GenBank with BLAST. Try a BLAST search yourself. On the NCBI main page, click BLAST in the deep blue navigation bar, which leads you to the BLAST homepage. Click Standard Nucleotide-Nucleotide search [blastn]. Copy the DNA sequence shown above and paste it into the Search window. Hit BLAST. In a few seconds you receive a request ID. Now hit FORMAT! and wait. The page refreshes itself, so in a few minutes you should have results.

As you did not change any formatting options, results are given in the default format. First, a graph shows the similarity of the retrieved sequences to the query sequence. Then you have a list of "Sequences producing significant alignments" with mysterious Score (bits) and E values.
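The two columns are linked by a simple relation: for a bit score S', a query of length m and a database of total length n, the expected number of hits arising by chance is roughly E = m * n * 2^(-S'). The sketch below only evaluates this formula. The function name and the example lengths are assumptions, and BLAST works with adjusted (effective) lengths, so the E values it reports will differ somewhat.

```python
def expected_chance_hits(bit_score: float, query_length: int,
                         database_length: int) -> float:
    """Approximate E value: m * n * 2 ** (-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up numbers: a 360-base query against a database of 10**9 bases.
for bits in (40, 60, 100):
    print(bits, expected_chance_hits(bits, 360, 10**9))
```

A higher bit score means a smaller E value, that is, a match less likely to have turned up by chance.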

The list starts with the sequence you submitted, now identified as:
gi|4927255|gb|AF142095.1|AF142095 Homo sapiens neanderthale...

The letter-number combinations are identifiers: the number after gi| is the GI number and the one after gb| is the GenBank accession number with its version. Then comes the name of the source. Open the accession number link in a new window. Here you find the whole description of the sequence. The sequence is a hypervariable region of mitochondrial DNA from a Neandertal caveman. When discussing the wombat sequence previously, we described the logic of sequence pages.
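The identifier at the start of a result line can be taken apart mechanically. The sketch below assumes the pipe-delimited layout seen in the line above (gi, GI number, gb, accession.version, locus name); the function name and the dictionary keys are made up for this illustration, and other databases may use a different layout.

```python
def parse_identifier(identifier: str) -> dict:
    """Split an identifier such as gi|4927255|gb|AF142095.1|AF142095."""
    fields = identifier.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("expected gi|number|database|accession.version|...")
    # The part after the dot is the version of the record.
    accession, _, version = fields[3].partition(".")
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": accession,
        "version": version,
        "locus": fields[4] if len(fields) > 4 else "",
    }


print(parse_identifier("gi|4927255|gb|AF142095.1|AF142095"))
```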

Now go back to BLAST results. Following the Neandertal sequence comes a bunch of similar sequences, all of them Homo sapiens mitochondrial DNA.

Similarity is defined as the extent to which nucleotide or protein sequences are related. BLAST first breaks the query sequence you submitted into short strings and matches those strings against all sequences in the database. The similarity between each such string in the query and the matching stretch of the other sequence is calculated. If the similarity exceeds a certain value, the match is lengthened in either direction and the similarity is recalculated.
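The break-match-lengthen idea can be sketched in a few lines of Python. This is a much simplified illustration, not the actual BLAST code: it uses exact word matches, extends without gaps, and stops extending once the score falls a fixed amount below the best seen. The function names, the word size of 4 and the scores (+1 match, -3 mismatch, drop-off 5) are assumptions chosen for the example.

```python
def find_seeds(query, subject, word_size=4):
    """Return (query_pos, subject_pos) pairs where a short word matches."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=4,
                match=1, mismatch=-3, drop_off=5):
    """Lengthen a seed in both directions; return (score, q_start, q_end)."""
    score = best = word_size * match
    right = left = 0
    # Extend to the right of the seed.
    k = 0
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        same = query[i + word_size + k] == subject[j + word_size + k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= drop_off:
            break
    # Extend to the left, starting from the best right-hand extension.
    score = best
    k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        same = query[i - 1 - k] == subject[j - 1 - k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= drop_off:
            break
    return best, i - left, i + word_size + right


query, subject = "ttacgtgcaa", "ccccacgtgcccc"
for qi, sj in find_seeds(query, subject):
    print((qi, sj), extend_seed(query, subject, qi, sj))
```

In this toy example all three seeds grow into the same stretch, `acgtgc`, which is the only part the two strings share.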

Similarity is scored by adding up substitution scores and gap penalties. New insertions and deletions are considered rare events during evolution, so opening a gap carries a high penalty (10-15). Insertion of a 3-base sequence as a single event is more probable than insertion of 3 single bases independently; therefore the cost of lengthening a gap is set lower (1-2). Different types of substitutions have different probabilities. Substitution scores are given by look-up tables.
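The effect of the two gap costs is easiest to see with numbers. The sketch below assumes one common convention, in which a gap of length L costs the opening penalty plus L times the length penalty, and it replaces the look-up table with a flat +1 for a match and -3 for a mismatch. These values and the function names are assumptions for illustration only.

```python
def gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap: opening penalty plus a smaller per-base penalty."""
    return gap_open + gap_extend * length


def score_alignment(row_a, row_b, match=1, mismatch=-3,
                    gap_open=10, gap_extend=1):
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # row in which a gap is currently open, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if gap_row != row:        # a new gap starts here
                score -= gap_open
                gap_row = row
            score -= gap_extend
        else:
            gap_row = None
            score += match if x == y else mismatch
    return score


print(gap_cost(3))        # one gap of three bases: 13
print(3 * gap_cost(1))    # three separate one-base gaps: 33
print(score_alignment("acgtttacgt", "acg---acgt"))
```

One 3-base gap costs 13 while three independent 1-base gaps cost 33, which mirrors the argument that a single insertion event is the more probable explanation.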

 

Examples for the meaning of LOCAL in BLAST

(1) Matching DNA segments to complete sequences.

For some studies only fragments of a complete gene are sequenced. It is possible to paste those fragments together and send them to BLAST, because BLAST matches short fragments of the query to all sequences of the database separately. So instead of matching the whole string at once (global alignment), partial similarities are checked first (local alignment).
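The difference between the two kinds of alignment shows up clearly in their scores. BLAST itself uses a faster heuristic; the sketch below swaps in the textbook dynamic-programming methods (Smith-Waterman for local, Needleman-Wunsch for global) purely to show what each one measures. The scores (+1 match, -1 mismatch, -2 per gap position) and the test strings are made up for the example.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of any pair of substrings (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of aligning both strings end to end (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[len(b)]


fragment = "acgtacgt"
complete = "ttttt" + fragment + "ttttt"
print(local_score(fragment, complete), global_score(fragment, complete))
```

The fragment is found intact inside the longer sequence, so its local score is the full 8, while the global score is dragged down to -12 by the unmatched flanks.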

For example, Roy sequenced 3' and 5' ends of a gene (cytochrome b) of a bird species (Andropadus curvirostris). Here are the two sequences:

(1) Andropadus curvirostris cytochrome b (cytb) gene, mitochondrial gene encoding mitochondrial protein, 3' end, partial cds.

ttcctatttg cctacgccat ccttcgatct atcccaaaca aacttggagg agtccttgcc ttagctgcct ccgtcctagt actatttctc attcccctgc tacacgtatc caaactacga tcaataacct tccgccccct gtcacaaatc ctattctgag ccctagtagc aaacctcctc atcctaacct gagtaggcag ccaaccagtt gaacacccct tcatcatcat cggtcaactc gcctccatct catacttcac aatcatccta atcctcttcc ccat

(2) Andropadus curvirostris cytochrome b (cytb) gene, mitochondrial gene encoding mitochondrial protein, 5' end, partial cds.

ggcatctgcc taattacaca gatcatcaca ggcttactgc tggccataca ctacacagca gacacaaacc tggcctttgc ttccgttgcc cacacatgcc gaaacgtcca attcgggtga ctaatccgca atctacatgc aaacggagcc tccttctttt tcatgtgcat ctacattcac attggccgag gaatctacta cggctcatac ttaaacaaag agacctgaaa catcggagtt gtccnnctcc taactctnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn

Suppose we want to learn quickly how long the missing part may be. For that, we can paste the two segments together, like this:

ttcctatttg cctacgccat ccttcgatct atcccaaaca aacttggagg agtccttgcc ttagctgcct ccgtcctagt actatttctc attcccctgc tacacgtatc caaactacga tcaataacct tccgccccct gtcacaaatc ctattctgag ccctagtagc aaacctcctc atcctaacct gagtaggcag ccaaccagtt gaacacccct tcatcatcat cggtcaactc gcctccatct catacttcac aatcatccta atcctcttcc ccat
ggcatctgcc taattacaca gatcatcaca ggcttactgc tggccataca ctacacagca gacacaaacc tggcctttgc ttccgttgcc cacacatgcc gaaacgtcca attcgggtga ctaatccgca atctacatgc aaacggagcc tccttctttt tcatgtgcat ctacattcac attggccgag gaatctacta cggctcatac ttaaacaaag agacctgaaa catcggagtt gtccnnctcc taactctnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn

 

and send it to BLAST.
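Pasting by hand works, but the same query can also be assembled in a few lines. The sketch below joins any number of fragments and writes them in FASTA format (a title line starting with ">" followed by the sequence), which the BLAST search window accepts. The function name, the title and the line width of 60 are assumptions.

```python
def to_fasta(title, fragments, width=60):
    """Join fragments into one sequence and format it as a FASTA record."""
    # Remove the spaces that group each fragment into blocks of ten.
    joined = "".join("".join(fragment.split()) for fragment in fragments)
    lines = [joined[i:i + width] for i in range(0, len(joined), width)]
    return ">" + title + "\n" + "\n".join(lines) + "\n"


# Shortened stand-ins for the 3' and 5' fragments shown above.
three_prime = "ttcctatttg cctacgccat"
five_prime = "ggcatctgcc taattacaca"
print(to_fasta("pasted_query", [three_prime, five_prime]))
```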

The resulting graph indicates that some sequences were matched to the 3' end (blue arrow), some to the 5' end (green arrow) and some to both ends (red arrow) of the query sequence. The grey area in the middle of the complete sequences indicates the part not sequenced by Roy.
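For a database sequence matched by both ends, the length of the grey area follows from the positions of the two matches on that sequence. A minimal sketch, assuming each match is given as a (start, end) pair of 1-based positions on the complete sequence; the function name and the coordinates are invented for the example and are not taken from a real BLAST result.

```python
def missing_length(hit_one, hit_two):
    """Number of bases lying between two matches on the same sequence."""
    first, second = sorted([hit_one, hit_two])  # order by start position
    gap = second[0] - first[1] - 1
    return max(gap, 0)  # overlapping or adjacent matches leave no gap


# Hypothetical positions: one end matches bases 1-300, the other 801-1100.
print(missing_length((801, 1100), (1, 300)))
```

With these invented positions the unsequenced middle part would be 500 bases long.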

 

(2) Matching informative strings interspersed with uninformative segments.

Between coding sequences, DNA may have strings of low complexity, such as repeats. BLAST is able to match such mixed sequences by local alignment. To demonstrate the power of BLAST search, I inserted cta-cta repeats in the middle of the Andropadus cytochrome b gene sequence. When this sequence was pasted into the BLAST search window, the original Andropadus sequence was found, together with other closely related cytb sequences.

ttcctatttg cctacgccat ccttcgatct atcccaaaca aacttggagg agtccttgcc ttagctgcct ccgtcctagt actatttctc attcccctgc tacacgtatc caaactacga tcaataacct tccgccccct
ctactactactactactactactactactactactactactactactactactactactactactactactactactactactactactactactactacta
gtcacaaatc ctattctgag ccctagtagc aaacctcctc atcctaacct gagtaggcag ccaaccagtt gaacacccct tcatcatcat cggtcaactc gcctccatct catacttcac aatcatccta atcctcttcc ccat
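A test sequence of this kind can be built, and its repeat located again, with a short script. The sketch below is only an illustration: the function names, the insertion position and the number of copies are assumptions, and the flanking strings are shortened stand-ins for the gene sequence.

```python
import re


def insert_repeats(seq, position, unit="cta", copies=10):
    """Insert a run of tandem repeats into seq at the given position."""
    return seq[:position] + unit * copies + seq[position:]


def longest_repeat_run(seq, unit="cta"):
    """Largest number of back-to-back copies of unit found in seq."""
    runs = re.finditer("(?:" + re.escape(unit) + ")+", seq)
    return max((len(m.group(0)) // len(unit) for m in runs), default=0)


left, right = "tccgccccct", "gtcacaaatc"
mixed = insert_repeats(left + right, len(left), unit="cta", copies=10)
print(mixed)
print(longest_repeat_run(mixed, "cta"))
```

The informative flanks stay unchanged on either side of the repeat, which is why a local search can still find them.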

 

     
Page written by: Peter Kabai  
Edited by: Peter Kabai  
written: 2001-05-04, modified: 2001-05-04