Concatenating Arrays
Here's the answer to the last question. You can put two (or more)
arrays together just by writing them inside of parentheses, separated by
commas:
@orfs = (@direct_orfs, @reverse_orfs);
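If you'd like to see the concatenation at work, here is a small sketch you can run by itself. The ORF entries are made-up sample data (each one is assumed to be a start, an end, and a direction, in the same shape as the stub used on this page); only the line that builds @orfs comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical sample data: each ORF is [start, end, direction].
my @direct_orfs  = ([10, 40, "d"], [55, 91, "d"]);
my @reverse_orfs = ([0, 1, "c"]);

# Both arrays are flattened into one list, in the order written.
my @orfs = (@direct_orfs, @reverse_orfs);

print scalar(@orfs), " ORFs in total\n";
foreach my $orf (@orfs) {
    my ($start, $end, $direction) = @$orf;
    print "$start-$end ($direction)\n";
}
```

The combined array holds three entries: the two direct ORFs first, then the reverse one.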
With that out of the way, we can concentrate on replacing one of our
stub statements. Since finding ORFs in the forward direction is likely
to be simpler than in the reverse direction, we'll concentrate on that:
sub find_orfs {
    my @direct_orfs;  # ORFs found in the genome as given
    my @reverse_orfs; # ORFs found in the reverse complement
    @direct_orfs = orfs_in_direction("d", $genome);
    @reverse_orfs = ([0, 1, "c"]);
    @orfs = (@direct_orfs, @reverse_orfs);
}

sub orfs_in_direction {
    my ($direction, $sequence) = @_;
    my @orfs_found;
    # Find some ORFs!
    return @orfs_found;
}
We're going to need Perl's pattern-matching features to find the ORFs;
in fact we'll need a little more than we currently know about.
Let's think about how to match an ORF. First we have a start codon,
then some coding codons, then a stop codon. Let's tackle that one
piece at a time. What's a start codon? Usually ATG, but possibly GTG or
TTG. How do we teach Perl to match a start codon? How do we say
"ATG or GTG or TTG"? We know how to ask for "A or G or T", so we could
use our current Perl skills to search for A or G or T followed by TG.
SQ7: What's a pattern to match a single
character, A or G or T, followed by the two characters TG? Here's the answer if you're stuck.
Stop codons are TGA or TAG or TAA. We can't use the same trick here,
because the three codons differ in more than one position. It's
very simple to tell a human to look for TGA or TAG or TAA; it ought to
be simple to tell Perl to do that. And in fact it is: we say TGA|TAG|TAA
-- in other words, we use the vertical bar | to mean "or".
So another way to match a start codon would be ATG|GTG|TTG.
To match a start codon followed immediately by a stop codon we'd write (ATG|GTG|TTG)(TGA|TAG|TAA).
We need the parentheses -- otherwise we'd have ATG|GTG|TTGTGA|TAG|TAA,
which matches ATG or GTG or TTGTGA or TAG or TAA.
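To see the difference the parentheses make, here is a minimal sketch that tries both patterns on two short test strings. The strings and the variable names are invented for illustration; the two patterns are the ones from the text.

```perl
use strict;
use warnings;

# With parentheses: a start codon immediately followed by a stop codon.
my $grouped   = qr/(ATG|GTG|TTG)(TGA|TAG|TAA)/;
# Without parentheses: any ONE of five alternatives, including TTGTGA.
my $ungrouped = qr/ATG|GTG|TTGTGA|TAG|TAA/;

# Made-up test strings: the first contains ATGTAA, the second only ATG.
foreach my $dna ("CCATGTAACC", "CCATGCC") {
    my $g = $dna =~ $grouped   ? "yes" : "no";
    my $u = $dna =~ $ungrouped ? "yes" : "no";
    print "$dna  grouped: $g  ungrouped: $u\n";
}
```

The second string has a start codon but no stop codon after it. The grouped pattern rejects it, while the ungrouped one accepts it because ATG alone is one of its alternatives.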
SQ8: Think about how to match an
open reading frame: a start codon followed by some number of codons for
amino acids, ending with a stop codon. Two possibilities are (ATG|GTG|TTG).*(TGA|TAG|TAA)
and (ATG|GTG|TTG).+(TGA|TAG|TAA).
They don't work. Why not?