New Perl features in FindMotif
The FindMotif.pl program uses
position-specific scoring matrices to search for sequences similar to
a consensus sequence. Here we'll go over some new Perl features that the
program uses. You can download the program,
or look at a formatted version of it. To run the program you'll need 71NpNtsm.txt, the set of motif sequences (and also NostocChromosome.nt, which you should already have).
Hashes
Hashes are sometimes called associative arrays. Like arrays,
they associate a key or index with a value. Unlike arrays, the
key or index is not an integer but a string, a sequence of characters
in quotes. Here is the definition of a hash from the FindMotif.pl program:
my %bg_frequency = # nucleotide frequencies of overall sequences
("A" => 0.33264,
"T" => 0.33264,
"G" => 0.16736,
"C" => 0.16736);
This statement stores four key-value pairs in %bg_frequency. We use curly brackets to refer
to them: $bg_frequency{"A"} is 0.33264, $bg_frequency{"G"} is 0.16736. and so forth. We can work with hashes much
like we do with arrays. The following piece of code
foreach my $nucleotide ("A", "C", "G", "T") {
$log_p{$nucleotide} = log($bg_frequency{$nucleotide});
}
is quite similar to what we might do with arrays, if we had
decide to code nucleotides as numbers, using 0
for "A" , 1 for "C", 2 for "G" and 3 for
"T":
foreach my $nucleotide (0, 1, 2, 3) {
$log_p_array[$nucleotide] = log($bg_frequency_array[$nucleotide]);
}
Two differences are the square brackets, and the fact that using
arrays we need to remember the coding we used. Another difference
is that we'd likely have coded the array version a bit more compactly,
using just the starting and ending numbers:
foreach my $nucleotide (0 .. 3) {
$log_p_array[$nucleotide] = log($bg_frequency_array[$nucleotide]);
}
That's not possible with hashes -- we need to specify each key.
Study Question 1: How many
words are there in the dictionary between "CAT"
and "DOG"? Is that even a reasonable
question? How many sequences fall between "CAT" and "GGG"?
How often would an abbreviation like "CAT".."DOG" or "CAT".."GGG" be useful?
|
We can have two dimensional hashes just as we can have two-dimensional
arrays:
$dictionary{"Webster"}{"cat"} = "noun";
is a perfectly good statement. In fact, we can mix and match:
In the FindMotif.pl program, the expression $n{"A"}[23] stands for
the number of times we find the nucleotide adenosine at column 23 of the
motif sequences. Here's how that's calculated:
foreach my $position (0..$motif_length-1) {
my $nucleotide = substr($motif, $position, 1);
$n{$nucleotide}[$position] = $n{$nucleotide}[$position] + 1;
}
The inner two lines are much shorter than the equivalent:
my $nucleotide = substr($motif, $position, 1);
if ($nucleotide eq "A") {
$n{"A"}[$position] = $n{"A"}[$position] + 1;
}
elsif ($nucleotide eq "C") {
$n{"C"}[$position] = $n{"C"}[$position] + 1;
}
elsif ($nucleotide eq "G") {
$n{"G"}[$position] = $n{"G"}[$position] + 1;
}
elsif ($nucleotide eq "T") {
$n{"T"}[$position] = $n{"T"}[$position] + 1;
}
--which is the point of using hashes. If we used arrays and coded
bases numerically we'd be forced to be lengthy:
my $nucleotide = substr($motif, $position, 1);
if ($nucleotide eq "A") {
$n_array[0][$position] = $n_array[0][$position] + 1;
}
elsif ($nucleotide eq "C") {
$n_array[1][$position] = $n_array[1][$position] + 1;
}
elsif ($nucleotide eq "G") {
$n_array[2][$position] = $n_array[2][$position] + 1;
}
elsif ($nucleotide eq "T") {
$n_array[3][$position] = $n_array[3][$position] + 1;
}
Understanding Loops
Whether or not we use hashes, figuring out what's going on in a
loop can take some close analysis. Here's a piece of complicated code:
# Calculate %log_q
#
$N = @motif_sequences;
foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
$log_q{$nucleotide}[$position] =
log( ($n{$nucleotide}[$position] + $B * $bg_frequency{$nucleotide})
/ ($N + $B)
);
}
}
It might help to think about it in two steps: first the outer part,
which tells us where we're calculating, has a pattern something like
this:
foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
Do Something (with this nucleotide, at this position)
}
}
Without knowing what Do Something
is, we can still guess that we're calculating some quantity for each nucleotide,
at each position of the motif. With that in mind, we can look at
the formula itself:
$log_q{$nucleotide}[$position] =
log( ($n{$nucleotide}[$position] + $B * $bg_frequency{$nucleotide})
/ ($N + $B)
);
Here we see that we're calculating log_q
for a particular nucleotide and position.
To do the calculation we use information from n, which depends on the nucleotide and position;
from bg_frequency, which depends
only on the nucleotide, and is the same at all positions; and from B and N,
which don't vary at all: they're the same for every nucleotide and position.
Knowing those dependencies, we can think of the formula this way:
log_q = log((n + B * bg_frequency) / (N + B))
That's simpler for humans to read, and we can temporarily overlook
the computer details to concentrate on the math. For this formula
we have some hints from Monday: first, B
and bg_frequency are primarily there
to make sure the frequency doesn't become zero; so (n + B
* bg_frequency) / (N + B)
is an adjustment of the much simpler n/N , where n
is the number of times a particular nucleotide occurs, and N is the total number of sequences in which
it could occur. Ah! n/N is the frequency of the nucleotide, and if
we trust our sources that (n
+ B * bg_frequency) / (N + B)
is a reasonable adjustment of the frequency, we can think of it as being
more or less the same as the frequency. So we're left with something like:
log_q = log(the frequency of the nucleotide at this position)
And from Monday we also remember that we're using logs here just
to speed things us: we'd really like to multiply frequencies over all
positions. Since adding the logs of the frequencies is equivalent to doing
the multiplications, and adding is faster on computers than multiplying,
we store the log of the frequency instead of the frequency itself.
So now we can go back to our outer loop:
foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
Do Something (with this nucleotide, at this position)
}
}
and change the Do Something to reflect our understanding:
foreach my $nucleotide ("A", "C", "G", "T") {
foreach my $position (0..$motif_length) {
Calculate the frequency (of this nucleotide, at this position)
}
}
To make even more of a summary, we can write:
for each nucleotide {
for each position {
Calculate the frequency of this nucleotide, at this position
}
}
This (or something like it) was what the programmer had in mind when
the loop was written. It would be nice if that were all the computer
needed to know! But unfortunately we need to give it explicit directions
on just how log_q and n depend on the nucleotide and position, and
how bg_frequency depends on the
nucleotide.
Sorting
Here's another bit of complicated code:
sub sort_and_discard_excess_values {
@hit_info = reverse sort { $$a[0] <=> $$b[0] } @hit_info;
while (@hit_info > $number_to_save) { pop @hit_info }
}
What the programmer had in mind here was something like this:
sort @hit_info (by score);
throw away the low scoring hits
Let's take it on faith for the moment that the first statement sorts
the @hit_info array with the highest scoring
hits to the left, so that $hit_info[0]
contains the best hit found so far, $hit_info[1]
has the next best hit, and so forth. If that's true, we can tackle the
second part: throwing away the low scoring hits. Another way to look at
it is that we're keeping the high-scoring hits, and that $number_to_save tells us how many to keep.
So what we want to do is reduce the size of @hit_info
to $number_to_save by throwing away
the items at the end. There are various ways to do this. Here's
one that builds on what we know. The pop command
decreases the size of an array by exactly one; so if the size of @hit_info is larger than $number_to_save,
we can go one step in the right direction with
pop @hit_info. So
while (@hit_info > $number_to_save) { pop @hit_info }
keeps checking the size of @hit_info,
and popping one item off the array, until we've thrown away all but $number_to_save hits.
Now we can go back to the first part:
@hit_info = reverse sort { $$a[0] <=> $$b[0] } @hit_info;
There are four things going on here:
- We're sorting the @hit_info array
(sort);
- We're telling sort how to
sort the array ({ $$a[0] <=> $$b[0] }), which as we'll see sorts @hit_info so that the lowest scores come first;
- We're reversing the order of the result, so that the highest
scores come first; and finally
- We're putting the sorted and reversed result back into @hit_info.
This is a little more complicated than the usual sort, which sort by
dictionary order. For instance,
@a = sort ("the", "quick", "red", "fox");
print join(" ", @a), "\n"
prints
fox quick red the
That's O.K. for words, but doesn't work as well for numbers:
@a = sort (1, 5, 17, 18, 100, -4, -5);
print join(" ", @a), "\n"
prints
-4 -5 1 100 17 18 5
sorting the numbers in order as if they were words. The correct
numeric order is
-5 -4 1 5 17 18 100
Perl lets us specify numeric order with a special block of code:
@a = sort { $a <=> $b } (1, 5, 17, 18, 100, -4, -5);
print join(" ", @a), "\n"
Here $a and $b
are special identifiers -- they aren't really connected to the rest of the
program. The block of code should evaluate to -1 if
$a < $b, 0 if $a = $b, and 1 if $a > $b.
As it happens, <=> is a special Perl operator that does
just that, for numbers. The equivalent operator for strings is cmp, so
@a = sort ("the", "quick", "red", "fox");
and
@a = sort { $a cmp $b } ("the", "quick", "red", "fox");
do exactly the same thing.
Study question 2: What does the
following code do? (Notice the order of $a
and $b)
@a = sort { $b <=> $a } (1, 5, 17, 18, 100, -4, -5);
How about this?
@a = sort { $b cmp $a } ("the", "quick", "red", "fox");
|
Now that we know about , we could sort an array of numeric scores. If @hit_info had only numbers in it, we could sort it
like this:
sort { $a <=> $b } @hit_info
But that's not what we see. The program actually says:
sort { $$a[0] <=> $$b[0] } @hit_info;
To understand this, it helps to look at what goes into @hit_info. The crucial statement is in the store_hit_information routine:
sub store_hit_information {
my ($name, $score, $start, $subregion) = @_;
push @hit_info, [$score, $name, $start, $subregion];
...
}
Here we're actually creating a two-dimensional array, row by row. Each
time the routine is called it adds another row to @hit_info,
namely:
[$score, $name, $start, $subregion]
Each row stores the information for one hit. So $hit_info[0][0]
is the score of the first hit, $hit_info[0][1]
is the name, $hit_info[0][2] is the starting
position of the hit, and $hit_info[0][3] is
the hit itself, the subregion that corresponds to the consensus sequence.
Unfortunately Perl treats array like
[$score, $name, $start, $subregion]
differently from the usual arrays. We say
my @a = ($score, $name, $start, $subregion);
print $a[0];
but
my $a = [$score, $name, $start, $subregion];
print $$a[0];
The reasons for the difference are, as they say, beyond the scope of this
web page. But now we can make some sense out of the code block for the sort:
sort { $$a[0] <=> $$b[0] } @hit_info;
We are sorting the rows of @hit_info.
Each row contains one hit, and the key for the sort is item 0 of the row, which is the score of the hit:$$a[0] is the score
of one hit and $$b[0] is the score of the othert. Since the score
are numbers, we use the numeric comparison operator, <=>.
Finally, since our code block sorts the array smallest-first, we use reverse to change that to largest first:
@hit_info = reverse sort { $$a[0] <=> $$b[0] } @hit_info;