First Try with real data
The usual way to add information to an array is with the push
statement. To push a whole row at once, we enclose the row in square
brackets:
push @orfs_found, [$orf_start, $orf_end, $direction];
Here's our final subroutine:
sub orfs_in_direction {
my ($direction, $sequence) = @_;
my @orfs_found;
while ($sequence =~ /((ATG|GTG|TTG)(...)*(TGA|TAG|TAA))/g) {
my $orf_length = length($1);
my $orf_end = pos($sequence) - 1;
my $orf_start = pos($sequence) - $orf_length;
push @orfs_found, [$orf_start, $orf_end, $direction];
}
return @orfs_found;
}
The program so far is orf4.pl. When we run it we
get the following discouraging result:
1 2 c
35 3573469 d
The first line is from our reverse complement stub, and we can ignore
it. The second line claims that there is a direct-frame ORF that
takes up almost the entire genome, over 3.5 million bases long.
Pretty unlikely!
What went wrong? (Answer to SQ9
coming up...) We're looking for a start codon followed by some number of
codons for amino acids, ending with a stop codon. Looking at our
pattern, we see this:
To Match
|
Look For
|
start codon
|
(ATG|GTG|TTG) |
coding codons
|
(...)* |
stop codon
|
(TGA|TAG|TAA) |
But (...)*
will match any sequence of codons, including stop codons. And it matches
as much as it can. So our pattern finds a start codon, the first one in
the sequence, skips as many codons as possible, and stops at the last
stop codon in the sequence.
SQ11: (Not for the faint of
heart.) Write a pattern that will match any codon except a stop codon. Explain why it
works.
SQ12: Before going to the next page, think about a feature you could add
to Perl to make solving this problem much easier.