I presume that you have already downloaded and installed Blast (if not, then click here) and downloaded two sets of proteins deduced from genomic sequences: one from the genome of E. coli K-12 and the other from the genome of either E. coli O157:H7 EDL933 or E. coli O157:H7 Sakai (if not, then click here). (If you don't know which strain to choose, click here.)
Blasting the proteins of one genome against the proteins of another proceeds in two steps. First, you let Blast analyze one set of proteins to create a database it can understand. Second, you run Blast to compare each protein of the OTHER set to that database. You'll make the database from the set of E. coli K-12 proteins, then run the set of proteins from your pathogenic strain against it.
1. Create a database of E. coli K-12 proteins
a. Open a DOS window (Run Command or Cmd)
b. Change to the directory where Blast and the FA files reside (CD \Blast)
c. Type the following command to format the database:
formatdb -ieck12.FA -pT -oT -nK12-Prot
- formatdb invokes the Blast accessory program to create the database
- -i tells the program that the path that follows leads to the input file. The file name eck12.FA is used only as an example. Use whatever name you gave the file of E. coli K12 protein sequences you downloaded.
- -pT tells the program "True, the file consists of protein sequences" (-pF would have been appropriate for DNA sequences)
- -oT tells the program "True, you should make an index of the identification numbers for the protein sequences" (the K12 file uses ID numbers like b0001). Frankly, I don't know what good the index does, but it's cheap.
- -n tells the program that the characters that follow should be used as the name of the database (you can name it anything you want, so long as you use 8 or fewer legal characters).
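As a rough illustration of how the flags above fit together, here is a minimal Python sketch that assembles the same formatdb command as a list of arguments. The function names are hypothetical, the file and database names are just the examples used above, and the flags are written with a space before each value (an assumption: formatdb is taken to accept "-i eck12.FA" as well as "-ieck12.FA").

```python
import subprocess


def build_formatdb_command(fasta_path, db_name, is_protein=True, index_ids=True):
    """Return the formatdb argument list described in the bullets above."""
    if len(db_name) > 8:
        # The tutorial limits database names to 8 or fewer characters.
        raise ValueError("database name longer than 8 characters: " + db_name)
    return [
        "formatdb",
        "-i", fasta_path,                  # input file of sequences
        "-p", "T" if is_protein else "F",  # T = protein, F = DNA
        "-o", "T" if index_ids else "F",   # T = index the sequence ID numbers
        "-n", db_name,                     # name of the database to create
    ]


def run_formatdb(fasta_path, db_name):
    """Run formatdb; assumes the program is in the current directory or on the PATH."""
    subprocess.run(build_formatdb_command(fasta_path, db_name), check=True)


if __name__ == "__main__":
    # Show the command without running it.
    print(" ".join(build_formatdb_command("eck12.FA", "K12-Prot")))
```

Calling run_formatdb("eck12.FA", "K12-Prot") from the Blast directory should be equivalent to typing the command by hand. The sketch adds nothing beyond the name-length check.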
formatdb may report warnings like these:
[NULL_Caption] WARNING: lcl|1445 has zero-length sequence
[NULL_Caption] WARNING: lcl|2827 has zero-length sequence
[NULL_Caption] WARNING: lcl|3800 has zero-length sequence
[NULL_Caption] WARNING: lcl|3973 has zero-length sequence
These messages mean that the protein sequences b1445, b2827, b3800, and b3973 don't have any amino acids, which is not very likely. TIGR evidently screwed up, but problems with four out of about four thousand proteins aren't going to hurt us much.
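If you want to see for yourself which entries are empty, a short script can scan the protein file for headers that are not followed by any sequence. The following is only a sketch: the function names are hypothetical, and it assumes the file is in plain FASTA format with the ID as the first word after the ">" on each header line.

```python
def find_empty_records(lines):
    """Return the IDs of FASTA records that contain no sequence characters."""
    empty_ids = []
    current_id = None
    residue_count = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record; report it if it was empty.
            if current_id is not None and residue_count == 0:
                empty_ids.append(current_id)
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            residue_count = 0
        elif current_id is not None:
            residue_count += len(line)
    # The last record has no following header, so check it here.
    if current_id is not None and residue_count == 0:
        empty_ids.append(current_id)
    return empty_ids


def find_empty_records_in_file(path):
    """Open a FASTA file and list the IDs of its empty records."""
    with open(path) as handle:
        return find_empty_records(handle)
```

For example, print(find_empty_records_in_file("eck12.FA")) would list the IDs of any empty entries, where eck12.FA stands for whatever name you gave the K-12 file.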