Minimal matching

On the last page, we complained that "(...)* will match any sequence of codons, including stop codons. And it matches as much as it can. So our pattern finds a start codon, the first one in the sequence, skips as many codons as possible, and stops at the last stop codon in the sequence." To change that behavior we can do one of two things: either replace (...)* with a pattern that doesn't match stop codons (and if you tried SQ11 you know that's complicated), or replace (...)* with a pattern that doesn't match as much as it can; in fact, with one that matches as little as it can.

SQ13: Why does that solve the problem?

To tell Perl to match as little as possible, add a question mark after the asterisk or plus sign. (See page 337 in Beginning Perl for Bioinformatics for more information.) In our case, we'll change (...)* to (...)*? so the whole pattern becomes /((ATG|GTG|TTG)(...)*?(TGA|TAG|TAA))/g.
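To see the difference on something small, here is a minimal sketch that runs both versions of the pattern on a toy sequence. The sequence and the variable names $dna, $greedy and $minimal are made up for illustration; they are not part of orf4.pl.

```perl
use strict;
use warnings;

# A made-up toy sequence (not course data): one start codon (ATG)
# followed by two in-frame stop codons (TGA, then TAA).
my $dna = 'ATGAAATGACCCTAAGG';

# Greedy: (...)* takes as many codons as it can, so the match
# runs to the LAST in-frame stop codon.
my ($greedy) = $dna =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/;

# Minimal: (...)*? takes as few codons as it can, so the match
# ends at the FIRST in-frame stop codon.
my ($minimal) = $dna =~ /((ATG|GTG|TTG)(...)*?(TGA|TAG|TAA))/;

print "greedy:  $greedy\n";    # greedy:  ATGAAATGACCCTAA
print "minimal: $minimal\n";   # minimal: ATGAAATGA
```

Both matches begin at the same start codon; only the place where they stop changes.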

If you make that change in orf4.pl and run it, you'll get lots of output:
1       2       c
35 772 d
802 1494 d
1534 1692 d
1707 1769 d
1776 1820 d
1831 1851 d
1857 1874 d
1884 1928 d
1932 1991 d
1994 2098 d
Too much output: we're picking up very short ORFs, like the one from 1831 to 1851, which is only 21 bases long (7 codons, counting the start and stop codons). At the beginning of the program we had these lines:

   my $threshold = 300;  # An ORF (open reading frame) with this many bases might
                         # be a gene.
So we should ignore ORFs that are shorter than $threshold.

SQ14: How would you do this?

See the next page for two solutions.