First Try with real data

The usual way to add information to an array is with the push statement. To push a whole row at once, we enclose the row in square brackets:
      push @orfs_found, [$orf_start, $orf_end, $direction];
Here's our final subroutine:
sub orfs_in_direction {
my ($direction, $sequence) = @_;
my @orfs_found;
while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {

my $orf_length = length($1);
my $orf_end = pos($sequence) - 1;
my $orf_start = pos($sequence) - $orf_length;

push @orfs_found, [$orf_start, $orf_end, $direction];

}

return @orfs_found;
}

The program so far is orf4.pl. When we run it we get the following discouraging result:
      1       2       c
35 3573469 d
The first line is from our reverse complement stub, and we can ignore it.  The second line claims that there is a direct-frame ORF that takes up almost the entire genome, over 3.5 million bases long.  Pretty unlikely!

What went wrong? (Answer to SQ9 coming up...) We're looking for a start codon followed by some number of codons for amino acids, ending with a stop codon. Looking at our pattern, we see this:
To Match
Look For
start codon
(ATG|GTG|TTG)
coding codons
(...)*
stop codon
(TGA|TAG|TAA)
But (...)* will match any sequence of codons, including stop codons. And it matches as much as it can. So our pattern finds a start codon, the first one in the sequence, skips as many codons as possible, and stops at the last stop codon in the sequence.

SQ11: (Not for the faint of heart.) Write a pattern that will match any codon except a stop codon. Explain why it works.

SQ12: Before going to the  next page, think about a feature you could add to Perl to make solving this problem much easier.