Scenario: Comparison of genomes to look for genes responsible for pathogenesis
Our Story
E. coli: The good, the bad, and the ugly.
You’ve probably heard of E. coli (the common abbreviation for this bacterium’s
full name, Escherichia coli) in one or both of two contexts. The first is
news reports about people getting very sick or even dying after eating undercooked
hamburger meat, non-pasteurized fruit juice, or alfalfa sprouts contaminated
with E. coli. The second place you might have encountered this bacterium
is in a molecular biology laboratory. E. coli was mentioned in a previous
problem set as a bacterium that might be used for producing a protein of
interest. In fact, E. coli cells can be found in most molecular biology laboratories,
where they are mostly used as factories for making recombinant DNA or proteins.
As an aside, when we isolate a piece of DNA and put it into
E. coli to make more of it, we call this “cloning.” This is because we are
making many identical copies of something (usually a gene). This process
is sometimes referred to as “molecular cloning” and should not be confused
with “organismal cloning”, as in cloning sheep or humans.
E. coli are also favorites for use in experiments
in student labs. So why are scientists and teachers exposing themselves
and their students to these deadly bacteria?
It turns out that there are several varieties or “strains” of E. coli.
They are all related, but can have different properties. We all have a great
many E. coli bacteria in our large intestines, or colons (hence, the name
“coli”). These cells are usually strains that are not only harmless in that
environment, but beneficial, providing us with some vitamins and helping
to prevent more harmful bacteria that we might eat from taking up residence
and causing disease. So E. coli are mostly good for us. Common laboratory
strains, such as E. coli strain K-12, are also harmless. We like to use these
strains because they’re easy to grow and because we have lots of experience
using them to make DNA and protein. So these bacteria are good, too. Click
here for a picture of E. coli cells that carry cloned genes allowing them to
produce light.
The E. coli that are responsible for the illnesses mentioned in the
news are most often a different strain, known as O157:H7. The name comes
from the particular varieties of two surface structures possessed by this
bacterium. This is akin to describing a criminal suspect as having short,
brown hair and a dragon tattoo on his right bicep. This strain is definitely
bad. (The “ugly” strains are probably those that just cause diarrhea. They
usually don’t kill people, but they sometimes make people wish they were
dead.)
What’s the difference?
So now that we know that E. coli can be good or bad, harmless or deadly,
the question we would like answered is, “What is it about the O157:H7 strain
that makes it harmful?” If we can learn this, we might be able to come up
with better diagnostic reagents for tracking these bacteria in our food.
We might also be able to devise a vaccine or drug that would target this
deadly bacterium without targeting the beneficial ones. One obvious suggestion
is that the O157 and H7 surface components are responsible for pathogenesis.
Alas, this is not the case, just as having short, brown hair and a tattoo
doesn’t dictate that someone will be a criminal. The problem of identifying
the components of a microbe that are responsible for its pathogenicity (known
as “virulence factors”) comes up often in the study of infectious diseases.
One general approach to answering this question is to compare a harmful
strain, or “pathogen,” to an innocuous strain, or “non-pathogen.” The proteins
possessed by the pathogen but absent from the non-pathogen may hold the
key to the virulence of the pathogen.
While sophisticated techniques may be used to identify the proteins
actually produced by a given bacterial strain (we’ll encounter such proteomic
analysis later on), you have all the information you need right now to identify
all the proteins E. coli may POTENTIALLY produce. The DNA sequences of the
entire genomes of a pathogenic E. coli (O157:H7) and a nonpathogenic strain
(K-12) have been determined, and from these sequences, one can predict with
a high degree of accuracy the full complement of functional genes and the
proteins they encode. Since most of the DNA in the two closely related
strains should be the same, most of the encoded proteins should be the same
as well. The proteins that are uniquely encoded by E. coli O157:H7 (and not
encoded by E. coli K-12) may be responsible for its virulence.
What to do about too much success?
Excited by the prospect of identifying the proteins unique to E. coli O157:H7,
understanding the basis for its pathogenesis, and winning the gratitude
of Burger Kings everywhere, you use a standard bioinformatics tool, BLAST,
to compare the set of proteins encoded by E. coli K-12 with the set encoded
by E. coli O157:H7. Unfortunately, the output you get from the program is
a file several tens of megabytes in size, much bigger than the entire body
of Shakespeare’s plays, and much less interesting reading. Surely the answer
you seek is in that huge file. How can you rework the output into something
you can comprehend?
Problem
Use BLAST to identify those proteins encoded by the pathogen
O157:H7 and not by the non-pathogenic laboratory strain K-12, and parse
the output into a usable form.
Tools
BLAST
This is a standard tool for comparing sequences that we'll be looking at
much more closely later. For now, the task is to install the program on your
own computer so that you can run whole genomes through it.
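As a rough sketch of what running the comparison might look like once BLAST is installed, the Python below builds and runs the two commands involved. It assumes the NCBI BLAST+ command-line programs (makeblastdb and blastp) are on your PATH; an older installation would use different program names. The file names (o157.faa, k12.faa, o157_vs_k12.tsv), the database name, and the E-value cutoff of 1e-5 are placeholders chosen for illustration, not values prescribed by this problem set.

```python
import subprocess
from typing import List


def makeblastdb_command(subject_fasta: str, db_name: str) -> List[str]:
    """Build the command that turns a protein FASTA file into a BLAST database."""
    return ["makeblastdb", "-in", subject_fasta, "-dbtype", "prot", "-out", db_name]


def blastp_command(query_fasta: str, db_name: str, out_path: str,
                   max_evalue: float = 1e-5) -> List[str]:
    """Build the command that compares every query protein against the database.

    "-outfmt 6" asks for tab-separated output, one line per hit, which is
    much easier to parse than the default report.
    """
    return ["blastp",
            "-query", query_fasta,
            "-db", db_name,
            "-evalue", str(max_evalue),
            "-outfmt", "6",
            "-out", out_path]


def run_comparison(query_fasta: str, subject_fasta: str, out_path: str) -> None:
    """Run both steps; raises CalledProcessError if either program fails."""
    db_name = "subject_db"  # hypothetical database name
    subprocess.run(makeblastdb_command(subject_fasta, db_name), check=True)
    subprocess.run(blastp_command(query_fasta, db_name, out_path), check=True)


# Example call (needs BLAST+ installed and both FASTA files present):
# run_comparison("o157.faa", "k12.faa", "o157_vs_k12.tsv")
```

Here the pathogen's proteins are the query and the K-12 proteins are the database, so any query protein with no good hit is a candidate for being unique to O157:H7.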
Parsing program
A parsing program goes through the output and picks out just the items
you're interested in, saving them in a convenient format. Writing parsing
programs is one of the most common activities of people who do bioinformatics.
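A minimal sketch of such a parsing program, in Python, might look like the following. It assumes the BLAST output was saved in tab-separated tabular form (what BLAST+ produces with -outfmt 6), where the first column is the query identifier and the eleventh is the E-value; other output formats need a different parser. The function names, the protein identifiers in the example, and the 1e-5 cutoff are all made up for illustration.

```python
from typing import Iterable, List, Set


def read_fasta_ids(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect the identifier (first word after '>') of every FASTA record."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def queries_with_hits(blast_lines: Iterable[str],
                      max_evalue: float = 1e-5) -> Set[str]:
    """Return the query IDs that have at least one hit at or below the cutoff."""
    hits = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, comment lines, and anything too short to be a hit.
        if len(fields) < 12 or fields[0].startswith("#"):
            continue
        query_id = fields[0]
        evalue = float(fields[10])  # column 11 of the tabular format
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def unique_to_query(fasta_lines: Iterable[str], blast_lines: Iterable[str],
                    max_evalue: float = 1e-5) -> List[str]:
    """Query proteins with no convincing match: all queries minus those with hits."""
    all_queries = read_fasta_ids(fasta_lines)
    matched = queries_with_hits(blast_lines, max_evalue)
    return sorted(all_queries - matched)


# Tiny invented example: three pathogen proteins, two lines of BLAST output.
example_fasta = [">o157_0001 protein A", "MKTAYIAK",
                 ">o157_0002 protein B", "MSLNVDQR",
                 ">o157_0003 protein C", "MGHHWELT"]
example_blast = [
    "o157_0001\tk12_0001\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t600",
    "o157_0002\tk12_0417\t28.0\t60\t40\t2\t5\t64\t10\t70\t0.5\t25.0",
]
print(unique_to_query(example_fasta, example_blast))
# prints ['o157_0002', 'o157_0003']
```

Because the functions accept any sequence of lines, real files can be passed in directly with open(...). Note that a protein with no hits at all never appears in the BLAST output, which is why the sketch starts from the full list of query identifiers rather than from the output alone.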
References
Perna NT et al (2001) Genome sequence of enterohaemorrhagic Escherichia
coli O157:H7. Nature 409:529-533.
Hayashi T et al (2001) Complete genome sequence of enterohemorrhagic
Escherichia coli O157:H7 and genomic comparison with a laboratory strain
K-12. DNA Research 8:11-22.