First Try with Real Data

The usual way to add information to an array is with the push function. To push a whole row at once, we enclose the row in square brackets, which creates a reference to an anonymous array holding that row:

      push @orfs_found, [$orf_start, $orf_end, $direction];
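As a small sketch of what this does: each element of @orfs_found ends up as a reference to one row, and we dereference it to get the fields back. The coordinates below are made up for illustration and are not real ORF positions.

```perl
use strict;
use warnings;

my @orfs_found;

# Illustrative values only, not real ORF coordinates.
my ($orf_start, $orf_end, $direction) = (35, 120, 'd');

# The square brackets create an anonymous array; push stores a reference to it.
push @orfs_found, [$orf_start, $orf_end, $direction];
push @orfs_found, [200, 340, 'c'];

# Each element is an array reference, so dereference it to read the row.
foreach my $orf (@orfs_found) {
    my ($start, $end, $dir) = @$orf;
    print join("\t", $start, $end, $dir), "\n";
}

# A single field can also be reached directly: row 0, column 1.
print "End of first ORF: $orfs_found[0][1]\n";
```

Storing rows this way keeps the start, end, and direction of each ORF together, so the list can later be sorted or printed one ORF at a time.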

Here's our final subroutine:

sub orfs_in_direction {
   my ($direction, $sequence) = @_;
   my @orfs_found;
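   # Find each stretch that runs from a start codon to a stop codon
   while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {

      # pos() gives the offset just past the end of the match
      my $orf_length = length($1);
      my $orf_end = pos($sequence) - 1;
      my $orf_start = pos($sequence) - $orf_length;

      # Store the ORF as one row: start, end, direction
      push @orfs_found, [$orf_start, $orf_end, $direction];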

   }

   return @orfs_found;
}
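The rest of orf4.pl is not reproduced here, so the following is only a sketch of how the subroutine might be called and its result printed. The short test sequence and the tab-separated output format are assumptions made for illustration. Note that pos() counts from 0, so the coordinates are offsets from the first base.

```perl
use strict;
use warnings;

sub orfs_in_direction {
    my ($direction, $sequence) = @_;
    my @orfs_found;
    while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {
        my $orf_length = length($1);
        my $orf_end    = pos($sequence) - 1;
        my $orf_start  = pos($sequence) - $orf_length;
        push @orfs_found, [$orf_start, $orf_end, $direction];
    }
    return @orfs_found;
}

# Made-up test sequence: ATG AAA TGA starting at offset 2.
my @orfs = orfs_in_direction('d', 'CCATGAAATGACC');

# Print one ORF per line: start, end, direction.
foreach my $orf (@orfs) {
    print join("\t", @$orf), "\n";
}
```

On this tiny input the subroutine reports a single ORF running from offset 2 to offset 10 in direction d.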

The program so far is orf4.pl. When we run it we get the following discouraging result:

      1       2       c
      35      3573469 d

The first line is from our reverse complement stub, and we can ignore it. The second line claims that there is a direct-frame ORF that takes up almost the entire genome, over 3.5 million bases long. Pretty unlikely!

What went wrong? (Answer to SQ9 coming up...) We're looking for a start codon followed by some number of codons for amino acids, ending with a stop codon. Looking at our pattern, we see this:

      To Match         Look For
      start codon      (ATG|GTG|TTG)
      coding codons    (...)*
      stop codon       (TGA|TAG|TAA)

But (...)* will match any sequence of codons, including stop codons. And because the * quantifier is greedy, it matches as many of them as it can. So our pattern finds the first start codon in the sequence, skips as many codons as possible, and stops at the last stop codon that is in the same reading frame as that start.
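The same behavior is easy to reproduce on a small scale. Here is a sketch using a made-up 19-base sequence that contains one start codon followed by two in-frame stop codons, TAG and then TAA:

```perl
use strict;
use warnings;

# Made-up sequence: ATG AAA TAG CCC TAA, padded with two bases on each side.
my $sequence = 'CCATGAAATAGCCCTAAGG';

my @matches;
while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {
    push @matches, $1;
}

# (...)* is greedy, so it swallows the in-frame TAG and the match
# runs on to the last in-frame stop codon, TAA.
print "$_\n" for @matches;    # prints ATGAAATAGCCCTAA
```

We would like the ORF to end at TAG, giving ATGAAATAG, but the pattern runs straight through it. On a whole genome the same thing happens across millions of bases.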

SQ11: (Not for the faint of heart.) Write a pattern that will match any codon except a stop codon. Explain why it works.

SQ12: Before going to the next page, think about a feature you could add to Perl to make solving this problem much easier.
