Finding ORFs

The patterns on the previous page can be fooled by stop codons for another reading frame.  They would match TTGCTGA, for instance, seeing it as TTG (a start codon) folowed by a base, C, followed by a stop codon, TGA. But TGA is in a different frame: biologically that's a start codon(TTG) followed by a codon for alanine (GCT), followed by a single extra base that's part of the following codon (A).  What we want is an arbitrary number of codons between the start and stop codons, not an arbitrary number of bases. Here's a pattern that does that:

        (ATG|GTG|TTG)(...)*(TGA|TAG|TAA)

This one keeps the reading frames straight because it looks at three bases at a time, which is just what a codon is.

SQ9: We'll see in a page or two that the improved pattern still isn't quite right.  Can you think of a way in which it might go wrong?

We have one more hump to get over before we find ORFs. We want to report the position of the ORF, not the string of all the bases in the ORF. Perl has a way to do that: the pos() function. You'll usually find pos()in code that looks something like this:
   while ($sequence =~ /some pattern/g) {   
... pos($sequence)...
}
Here the =~ operator tells Perl to look in the $sequence variable to find the pattern. The /g at the end of the pattern tells Perl to remember the end of the pattern it matches, and start the search at that place (instead of the beginning of $sequence) the next time through the loop. Eventually we reach the end of $sequence, the pattern fails, and the loop is finished. Inside the loop, pos($sequence) gives the position where the search will resume: that is, the position of the first character after the match.

In our application we'll have something like this:
   while ($sequence =~ /(ATG|GTG|TTG)(...)*(TGA|TAG|TAA)/g) {
... pos($sequence)...

 Here's one way to use pos($sequence): the end of the ORF is the position of the last character matched, and that's one place before pos($sequence). We can say:
      my $orf_end = pos($sequence) - 1;
If we knew the length of the ORF we could calculate the start:
      my $orf_start = $orf_end - $orf_length + 1;

Or more directly:

my $orf_start = pos($sequence) - $orf_length;
How can we tell the length of the ORF?  If we add an extra set of parentheses to the pattern /(ATG|GTG|TTG)(...)*(TGA|TAG|TAA)/g , making it /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g , we'll set the special variable $1 to the ORF that we've matched. ($1 matches what's inside the first set of parentheses, $2 the second set, and so forth. But what's the first set? It's the one with the leftmost opening parenthesis.) So we can say
      my $orf_length = length($1);
Now we can teach Perl how to calculate the information we need.
   while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {   

my $orf_length = length($1);
my $orf_end = pos($sequence) - 1;
my $orf_start = pos($sequence) - $orf_length;

# Save $orf_start, $orf_length, and $direction

}
The last thing to fill in is how to save it (indiceated by the comment above).

SQ10: What do you think the statement to save the information will look like?

Next page