You'll need to download the Synechocystis genome (Synecho.nt), if you haven't already.
It may help to compare the output of your program at various points with
Kasuza's list of ORFs for Synechocystis, in
the file Synecho-orfs.txt.
P7P.1 Wrapping around. If you look at the first line of genome
file you'll see
>6803.flat 3573470 bases
If you look at the last line of Kasuza's list of ORFs you'll see
3573271 3574242 d
So we've got an ORF that starts at position 3573271 out of 3573470, very
near the end of the genome, and stops at position 3574242 -- after
the end. How can that be?
The answer is that the genome is circular, so position 3573470 is followed
by position 1, and position 3574242 is the same as position 773.
Modify the program to take the circularity of the genome into account.
Hint: the expression $sequence
. $sequence stands for $sequence followed by an identical copy of
itself. Warning: Be sure not to report ORFs whose beginning
is after the end of the sequence.
P7P.2 Searching inside ORFs. Study questions 11 through 13 dealt
with the problem of finding false stop codons. A complementary problem is
ignoring legitimate start codons. Consider the sequence ATG CCA TGA:
a start codon, a codon for proline, and a stop codon. Way too short to take
any notice of. But hiding in there is a start codon for another frame: AT GCC ATG A. That
start codon could be followed by an open reading frame thousands of bases
long, but the program won't find it, because it will resume searching after
the last A.
We can change where the program searches by setting pos($sequence), which we can do by using it on the
left-hand side of an assignment
pos($sequence) = ... ;
P7P.2a: Replace the ... with an expression giving the right position to resume searching, and put the statement in the right place in the following code:
while ($sequence =~ /((ATG|GTG|TTG)(...)*?(TGA|TAG|TAA))/g) {
my $orf_length = length($1);
my $orf_end = pos($sequence) - 1;
my $orf_start = pos($sequence) - $orf_length;
next if $orf_length < $threshold;
push @orfs_found, [$orf_start, $orf_end, $direction];
}P7P.2b: Solving P7P.2a will find some out-of-frame start codons that had been hidden; it will also find some in-frame start codons that we don't really want. For instance, in ATG CCC ATG ... TAA there are two start codons. Assuming that the ... stands for a string three hundred or more bases long, the program would report something like this:
3261 4232 d
3267 4232 dHaving seen the first line, we probably don't need to see the second. Modify orfs_in_direction so that it reports no more than one ORF with a given endpoint. Hint: use a hash to record the endpoints we've seen already.
P7P.3: Reverse complements. The orf-ps.pl
program provides you with a reverse complement function. However, there is
a fly in the ointment. We'd like to have the positions reported relative
to the direct genome, but when orfs_in_direction calculates the starting
and ending position of an ORF, it does the calculation relative to the reversed
genome. After we call orfs_in_direction
to find reverse ORFs, an row in @reverse_orfs
with a starting position of 0 represents an ORF right at the end of the
genome in terms of direct positions.
Uncomment the two statements that are commented out in the following section
of ocde, and add statements to calculate the direct start and direct end
of the ORFs:
# @reverse_orfs = orfs_in_direction("c", reverse_complement($genome));
foreach my $info (@reverse_orfs) {
my ($reverse_start, $reverse_end, $direction) = @$info;
# @$info = ($direct_start, $direct_end, $direction);
}