Finding ORFs
The patterns on the previous page can be fooled by stop codons for
another reading frame. They would match TTGCTGA, for instance,
seeing it as TTG (a start codon)
folowed by a base, C, followed by a stop
codon, TGA.
But TGA
is in a different frame: biologically that's a start codon(TTG)
followed by a codon for alanine (GCT), followed by a single
extra base that's part of the following codon (A). What we want
is an arbitrary number of codons between the start and stop codons, not
an arbitrary number of bases. Here's a pattern that does that:
(ATG|GTG|TTG)(...)*(TGA|TAG|TAA)
This one keeps the reading frames straight because it looks at three
bases at a time, which is just what a codon is.
SQ9: We'll see in a page or two
that the improved pattern still isn't quite right. Can you think
of a way in which it might go wrong?
We have one more hump to get over before we find ORFs. We want to
report the position of the ORF,
not the string of all the bases in the ORF. Perl has a way to do that:
the pos() function. You'll
usually find pos()in code
that looks something like this:
while ($sequence =~ /some pattern/g) {
... pos($sequence)...
}
Here the =~ operator tells
Perl to look in the $sequence variable to
find the pattern. The /g at the end of the pattern tells Perl to
remember the end of the pattern it matches, and start the search at
that place (instead of the beginning of $sequence) the next time
through the loop. Eventually we reach the end of $sequence, the pattern
fails, and the loop is finished. Inside the loop, pos($sequence) gives
the position where the search will resume: that is, the position of the
first character after the
match.
In our application we'll have something like this:
while ($sequence =~ /(ATG|GTG|TTG)(...)*(TGA|TAG|TAA)/g) {
... pos($sequence)...
}
Here's one way to use pos($sequence):
the end of the ORF is the position of the last character matched, and
that's one place before pos($sequence).
We can say:
my $orf_end = pos($sequence) - 1;
If we knew the length of the ORF we could calculate the start:
my $orf_start = $orf_end - $orf_length + 1;
Or more directly:
my $orf_start = pos($sequence) - $orf_length;
How can we tell the length of the ORF? If we add an extra set of
parentheses to the pattern /(ATG|GTG|TTG)(...)*(TGA|TAG|TAA)/g ,
making it /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g ,
we'll set the special variable $1 to the ORF that we've
matched. ($1
matches what's inside the first set of parentheses, $2 the second
set, and so forth. But what's the first set? It's the one with the
leftmost opening parenthesis.) So we can say
my $orf_length = length($1);
Now we can teach Perl how to calculate the information we need.
while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {
my $orf_length = length($1);
my $orf_end = pos($sequence) - 1;
my $orf_start = pos($sequence) - $orf_length;
# Save $orf_start, $orf_length, and $direction
}
The last thing to fill in is how to save it (indiceated by the comment
above).
SQ10: What do you think the
statement to save the information will look like?
Next page