Finding Open Reading Frames
Our job is to find open reading frames that might be genes. We'll do the
simplest possible search, for start codons followed in the same frame by
a stop codon.
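As a rough preview of what "a start codon followed in the same frame by a stop codon" means in code, here is a minimal sketch. It is not the program this tutorial builds piece by piece. The helper name find_simple_orfs is hypothetical, and the sketch assumes an uppercase sequence containing only A, C, G and T, with ATG as the only start codon and TAA, TAG and TGA as the stop codons. It looks at one strand only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the tutorial's program.
# Returns [start, end] pairs (1-based, inclusive) for every ATG that is
# followed, in the same reading frame, by a stop codon.
sub find_simple_orfs {
    my ($dna) = @_;
    my @hits;
    # The zero-width lookahead lets overlapping candidates be reported.
    # (?:[ACGT]{3})*? steps forward one codon at a time, so the stop codon
    # that ends the match is the first one in frame with the ATG.
    while ($dna =~ /(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))/g) {
        my $start = pos($dna) + 1;            # pos() is the 0-based offset of the ATG
        my $end   = $start + length($1) - 1;  # last base of the stop codon
        push @hits, [$start, $end];
    }
    return @hits;
}

# Usage: ATG AAA TAG sits at positions 3 through 11 of this toy string.
my @hits = find_simple_orfs("CCATGAAATAGCC");
print "$_->[0]..$_->[1]\n" for @hits;   # prints 3..11
```

Note that every ATG inside a longer ORF is also reported as the start of a shorter one ending at the same stop codon; a real search would have to decide which of those to keep.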
We're going to test this with the Synechocystis genome from the Kazusa
DNA Research Institute. To check our work we'll look at a list of known
good ORFs, also from Kazusa. We should expect some false positives: our
program will find some ORFs that Kazusa doesn't list, and sometimes
Kazusa will list a shorter ORF than ours, ending at the same place but
beginning later than ours.
Before you begin, download the Synechocystis genome in a file called Synecho.nt.
We'll write the program piece by piece, stopping from time to time for
study questions. Sometimes a study question will ask you to think about
how to use the Perl you already know to solve a problem; other times
we'll really need a new Perl feature, and the study question will
ask you to use your imagination to think of what that feature might be.
Here's our starting point, with three crucial variables filled in.
#!/usr/bin/perl -w
use strict;
########################### Variables #################################
my $threshold = 300;  # An ORF (open reading frame) with this many bases might
                      # be a gene.
my $genome;           # The genome we're investigating.
my @orfs;             # Lists the ORFs we find in the genome. Each row of
                      # @orfs is a triple
                      #
                      #     [$start, $end, $direction]
                      #
                      # where $start tells us the beginning of the ORF in
                      # the genome, $end tells us the last position, and
                      # $direction is either "d" (for direct), meaning the
                      # ORF came from the genome as given, or "c" (for
                      # complement), meaning the ORF came from the reverse
                      # complement of the genome.
############################# Files ###################################
########################## Main Program ###############################
######################## Subroutines ##################################
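The row layout that the comments describe for @orfs can be sketched with array references, as below. The coordinates are made up for illustration and do not come from any genome. The helper orf_length is hypothetical, and the sketch assumes that $end is inclusive, that $start is smaller than $end for both directions, and that an ORF counts as a candidate when its length is at least $threshold.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $threshold = 300;   # same meaning as in the skeleton

# Made-up rows, for illustration only. Each row is [$start, $end, $direction].
my @orfs = (
    [  101,  460, "d" ],   # 360 bases, from the genome as given
    [ 2000, 2089, "c" ],   # 90 bases, from the reverse complement
);

# Hypothetical helper: number of bases covered by one row of @orfs,
# assuming $end is the last position of the ORF (inclusive).
sub orf_length {
    my ($row) = @_;
    my ($start, $end) = @$row;
    return $end - $start + 1;
}

for my $row (@orfs) {
    my ($start, $end, $direction) = @$row;
    my $len     = orf_length($row);
    my $verdict = $len >= $threshold ? "candidate" : "too short";
    print "$start..$end ($direction): $len bases, $verdict\n";
}
```

Storing each ORF as one small array reference keeps the three values together, so a row can be pushed onto @orfs or unpacked with a single list assignment.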
SQ1: Read the comments, and decide what you would put in the Files and
Main Program sections. Then go to the next page and compare your outline
to the one there.