THIS PAGE: describes how to use BLAST
for the novice. BLAST (Basic Local Alignment Search Tool) is used
to find sequences similar to a sequence you have.
A DNA sequence obtained from GenBank is a series of letters, such
as this:
ttttcgtctg gggggtgtgc acgcgatagc attgcgagac
gctggagccg gagcacccta tgtcgcagta tctgtctttg attcctgccc cattccatta
tttatcgcac ctacgttcaa tattacaggc gagcatactt actaaagtgt gttaattaat
taatgcttgt aggacataat aataacgact aaatgtctgc acagctgctt tccacacaga
catcataaca aaaaatttcc accaaacccc ctttcctccc ccgcttctgg ccacagcact
taaacacatc tctgccaaac cccaaaaaca aagaacccta acaccagcct aaccagactt
This chain of letters contains no information in itself. A given
sequence in a given species, however, did not come out of the blue;
it is a modification of an earlier sequence in a predecessor
species. Finding similar sequences, therefore, might give a clue
to the function, if any, of the query sequence.
A similarity search can be done in GenBank with BLAST. Try a BLAST
search yourself. On the NCBI main
page, click BLAST in the deep blue navigation bar, which leads
you to the BLAST homepage.
Click Standard Nucleotide-Nucleotide search [blastn]. Copy the
sequence shown above and paste it into the Search window. Hit BLAST.
In a few seconds you receive a request ID. Now hit FORMAT! and wait.
The page will refresh itself, so in a few minutes you should have
results.
As you did not change any formatting options, results are given in the
default format. First, a graph is shown indicating the similarity of
the retrieved sequences to the query sequence. Then you have a
list of "Sequences producing significant alignments" with
mysterious Score (bits) and E values.
The list starts with the sequence you submitted, now identified
as:
gi|4927255|gb|AF142095.1|AF142095
Homo sapiens neanderthale...
The letter-number combinations are accession numbers, then comes
the name of the source. Open the accession number link in a new
window. Here you find the full description of the sequence. The
sequence is a hypervariable region of mitochondrial DNA from a Neandertal
caveman. When discussing the wombat sequence previously, we
described the logic of sequence pages.
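The identifier line shown above can also be taken apart by a short program. The sketch below assumes the old-style NCBI layout in which fields are separated by "|" and come as tag/value pairs ("gi" followed by a number, "gb" followed by accession.version), with a locus name at the end. The function name parse_ncbi_id is made up for this illustration.

```python
def parse_ncbi_id(identifier):
    """Split an identifier such as 'gi|4927255|gb|AF142095.1|AF142095'.

    Assumes the pipe-separated layout described in the text above;
    other identifier styles are not handled.
    """
    fields = identifier.strip().split("|")
    parsed = {}
    # Fields come in tag/value pairs: 'gi' + number, 'gb' + accession.
    for tag, value in zip(fields[0::2], fields[1::2]):
        parsed[tag] = value
    # The fifth field, if present, is the locus name.
    parsed["locus"] = fields[4] if len(fields) > 4 else ""
    # 'AF142095.1' is an accession number plus a version number.
    accession, _, version = parsed.get("gb", "").partition(".")
    parsed["accession"] = accession
    parsed["version"] = version
    return parsed


info = parse_ncbi_id("gi|4927255|gb|AF142095.1|AF142095")
print(info["accession"], info["version"])
```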
Now go back to BLAST results. Following the Neandertal sequence
comes a bunch of similar sequences, all of them Homo sapiens mitochondrial
DNA.
Similarity is defined as the extent to which nucleotide or protein
sequences are related. BLAST first breaks the query sequence you
submitted into short strings and matches those strings to similar
ones in all sequences of the database. The similarity of such strings
in the query to any other sequence is calculated. If the similarity
exceeds a certain value, the string is lengthened in either direction,
and the similarity is recalculated.
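The break-match-lengthen idea described above can be sketched in a few lines. This is a simplified illustration, not the actual BLAST program: it looks only for exact word matches, extends without gaps, and the word size (11), the match and mismatch scores (+1, -3) and the stopping rule (drop) are values assumed for the example. The function names are hypothetical.

```python
def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=11,
                match=1, mismatch=-3, drop=10):
    """Lengthen one seed in both directions, without gaps.

    Walking stops at a sequence end or when the running score has
    fallen 'drop' or more below the best score seen so far.
    """
    def walk(qi, sj, step):
        score = best = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score >= drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + word_size, j + word_size, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = word_size * match + left_score + right_score
    # Returns the score and the matched region of the query (end exclusive).
    return total, i - left_len, i + word_size + right_len


core = "ttcctatttgcctacgccat"
query = "aa" + core + "tt"
subject = "ccccc" + core + "ggggg"
qpos, spos = find_seeds(query, subject)[0]
print(extend_seed(query, subject, qpos, spos))
```

Indexing the words of the subject first means each query word is looked up once rather than compared against every position, which is the reason short strings are used as the starting point.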
Similarity is scored by adding up substitution scores and gap penalties.
New insertions and deletions are considered rare events during evolution,
so opening a gap carries a high penalty (10-15). Insertion of a 3-base
sequence as a single event is more probable than insertion of 3 single bases
independently; therefore the cost of lengthening an existing gap is set
lower (1-2). Different types of substitutions have different probabilities.
Substitution scores are given by look-up tables.
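To make the scoring rule concrete, here is a minimal sketch that scores an alignment that has already been made. The look-up table (+5 for a match, -4 for a mismatch) is a made-up example, and the convention that the first gap position costs gap_open and each further position costs gap_extend is an assumption; real programs differ in such details.

```python
BASES = "acgt"
# Hypothetical look-up table: +5 for identical bases, -4 otherwise.
TABLE = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}


def score_alignment(row_a, row_b, table=TABLE, gap_open=10, gap_extend=1):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row the currently open gap is in, if any
    for a, b in zip(row_a, row_b):
        current = "a" if a == "-" else "b" if b == "-" else None
        if current is None:
            score += table[(a, b)]
        elif current == gap_row:
            score -= gap_extend   # lengthening an open gap is cheap
        else:
            score -= gap_open     # opening a new gap is expensive
        gap_row = current
    return score


# One gap of 3 bases versus three separate gaps of 1 base each.
print(score_alignment("atttaaa", "a---aaa"))
print(score_alignment("atatata", "a-a-a-a"))
```

Both alignments contain four matching bases and three gap positions, yet the single 3-base gap scores higher than the three separate gaps, which is the behaviour the text describes.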
Examples for the meaning of LOCAL in BLAST
(1) Matching DNA segments to complete sequences.
For some studies only fragments of a complete gene are sequenced.
It is possible to paste those fragments together and send them to
BLAST, because BLAST matches short fragments of the query to all
sequences of the database separately. So instead of matching the
whole string (global alignment), partial similarities are
checked first (local alignment).
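The difference between global and local alignment can be seen in a small experiment. BLAST itself uses a faster approximate method; the sketch below instead uses the classic exact methods (Needleman-Wunsch for global, Smith-Waterman for local) purely for illustration. The scores (+1 match, -1 mismatch, -2 per gap position) are assumed values.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score when both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of any pair of substrings; a score never drops below 0."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


fragment = "ccaaactacga"                  # a short piece taken from the text
complete = "ttttt" + fragment + "ggggg"   # the same piece with extra flanks
print(global_alignment_score(fragment, complete))
print(local_alignment_score(fragment, complete))
```

The global score is dragged down by the unmatched flanks, while the local score reflects only the part that matches, which is why a fragment can still find its complete sequence.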
For example, Roy sequenced 3' and 5' ends of a gene (cytochrome
b) of a bird species (Andropadus curvirostris). Here are the two
sequences:
(1) Andropadus curvirostris cytochrome b (cytb) gene, mitochondrial
gene encoding mitochondrial protein, 3' end, partial cds.
ttcctatttg cctacgccat ccttcgatct atcccaaaca aacttggagg agtccttgcc
ttagctgcct ccgtcctagt actatttctc attcccctgc tacacgtatc caaactacga
tcaataacct tccgccccct gtcacaaatc ctattctgag ccctagtagc aaacctcctc
atcctaacct gagtaggcag ccaaccagtt gaacacccct tcatcatcat cggtcaactc
gcctccatct catacttcac aatcatccta atcctcttcc ccat
(2) Andropadus curvirostris cytochrome b (cytb) gene, mitochondrial
gene encoding mitochondrial protein, 5' end, partial cds.
ggcatctgcc taattacaca gatcatcaca ggcttactgc tggccataca ctacacagca
gacacaaacc tggcctttgc ttccgttgcc cacacatgcc gaaacgtcca attcgggtga
ctaatccgca atctacatgc aaacggagcc tccttctttt tcatgtgcat ctacattcac
attggccgag gaatctacta cggctcatac ttaaacaaag agacctgaaa catcggagtt
gtccnnctcc taactctnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
Suppose we want to learn quickly how long the missing part may
be. For that we can paste the two segments together, like this:
ttcctatttg cctacgccat ccttcgatct
atcccaaaca aacttggagg agtccttgcc ttagctgcct ccgtcctagt actatttctc
attcccctgc tacacgtatc caaactacga tcaataacct tccgccccct gtcacaaatc
ctattctgag ccctagtagc aaacctcctc atcctaacct gagtaggcag ccaaccagtt
gaacacccct tcatcatcat cggtcaactc gcctccatct catacttcac aatcatccta
atcctcttcc ccat
ggcatctgcc taattacaca gatcatcaca ggcttactgc tggccataca ctacacagca
gacacaaacc tggcctttgc ttccgttgcc cacacatgcc gaaacgtcca attcgggtga
ctaatccgca atctacatgc aaacggagcc tccttctttt tcatgtgcat ctacattcac
attggccgag gaatctacta cggctcatac ttaaacaaag agacctgaaa catcggagtt
gtccnnctcc taactctnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
and send it to BLAST.
The resulting graph indicates that some sequences were
matched to the 3' end (blue arrow), some to the 5' end (green
arrow) and some to both ends (red arrow) of the query sequence.
The grey area in the middle of complete sequences indicates
the fraction not sequenced by Roy.
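The pasting step in example (1) can also be done by a short program instead of by hand. The sketch below removes the spaces and line breaks of GenBank-style text and joins the fragments in the order given. For brevity only the first line of each of Roy's two fragments is used, and the function names are made up for this illustration.

```python
def clean(seq_text):
    """Keep only the sequence letters; spaces, digits and line breaks go."""
    return "".join(ch for ch in seq_text.lower() if ch.isalpha())


def paste_fragments(*fragments):
    """Join cleaned fragments into one query string, in the order given."""
    return "".join(clean(fragment) for fragment in fragments)


# First line of each fragment from the text (truncated for brevity).
three_prime = "ttcctatttg cctacgccat ccttcgatct atcccaaaca aacttggagg agtccttgcc"
five_prime = "ggcatctgcc taattacaca gatcatcaca ggcttactgc tggccataca ctacacagca"

query = paste_fragments(three_prime, five_prime)
print(len(query))
print(query)
```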
(2) Matching informative strings interspersed with uninformative
segments.
Between coding sequences, DNA might have strings of low complexity,
such as repeats. BLAST is able to match such mixed sequences
by local alignment. To demonstrate the power of BLAST search, I
inserted cta-cta repeats in the middle of the Andropadus cytochrome b
gene. When this sequence was pasted into the BLAST search window, the
original Andropadus sequence was found, together with other closely related
cytb sequences.
ttcctatttg cctacgccat ccttcgatct atcccaaaca aacttggagg agtccttgcc
ttagctgcct ccgtcctagt actatttctc attcccctgc tacacgtatc caaactacga
tcaataacct tccgccccct
ctactactactactactactactactactactactactactactactactactactactactactactactactactactactactactactactactacta
gtcacaaatc ctattctgag ccctagtagc aaacctcctc atcctaacct
gagtaggcag ccaaccagtt gaacacccct tcatcatcat cggtcaactc gcctccatct
catacttcac aatcatccta atcctcttcc ccat
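A test sequence of this kind can be built by a short program as well. The sketch below inserts a run of cta repeats at a chosen position and then locates the longest run of that repeat again, a crude way of spotting the uninformative part. The number of copies (30) and the short flanking pieces are assumptions chosen for the example, not the exact values used above.

```python
import re


def insert_repeat(seq, position, unit="cta", copies=30):
    """Return seq with 'copies' repeats of 'unit' inserted at 'position'."""
    return seq[:position] + unit * copies + seq[position:]


def longest_repeat_run(seq, unit="cta"):
    """Return (start, length) of the longest back-to-back run of 'unit'."""
    best = (0, 0)
    for found in re.finditer("(?:%s)+" % re.escape(unit), seq):
        length = found.end() - found.start()
        if length > best[1]:
            best = (found.start(), length)
    return best


# Twenty bases on either side of the insertion point in the text.
left = "tcaataaccttccgccccct"
right = "gtcacaaatcctattctgag"

modified = insert_repeat(left + right, len(left))
start, length = longest_repeat_run(modified)
print(start, length)
# The informative flanks are untouched on both sides of the repeat.
print(modified[:start] == left, modified[start + length:] == right)
```

Because the flanks stay intact, short strings taken from them still match the original gene, which is what local alignment relies on.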