Reading Files and Pattern Matching

We start off with a short introduction to reading information from files. Then we talk about pattern matching, a way of selecting textual information based on patterns of characters in it. Finally, we discuss SequenceSearch.pl, a program that will help us decide the frequency of NtcA-consensus sequences in the genome of Nostoc. (See the biology notes.)

Reading files

We've seen how to print information.

print $line;

will print $line to the screen. But how do we read information?

Here's a program, PrintPresidents.pl, that reads from the file presidents.txt and prints out each line it reads.

#!/usr/bin/perl -w
use strict;

open INPUT, "presidents.txt" or die "Can't open presidents.txt";

while (my $line = <INPUT>) {
   print $line;
}

Try downloading the program and the presidents.txt file, and running the program. (To download presidents.txt, click on the link with the right mouse button, then pick Save Link as... or Save Link Target as... from the menu that pops up.)

You should see:

Franklin Roosevelt  1932, 1936, 1940, 1944
Harry Truman        1948
Dwight Eisenhower   1952, 1956
John Kennedy        1960
Lyndon Johnson      1964
Richard Nixon       1968, 1972
Gerald Ford         
Jimmy Carter        1976
Ronald Reagan       1980, 1984
George H. W. Bush   1988
William Clinton     1992
George W. Bush      2000

Each president's name is followed by the year(s) of his election(s), if any. Gerald Ford has no year of election listed since he was vice-president when Nixon resigned, and wasn't elected to the office.

Pattern matching

Relevant sections of Beginning Perl for Bioinformatics: pages 67-70, and 334-340.

One of Perl's claims to fame is pattern matching -- searching text for a word or phrase, or sometimes something more complicated. For instance, the program bush.pl searches for lines referring to either president Bush.

#!/usr/bin/perl -w
use strict;

open INPUT, "presidents.txt" or die "Can't open presidents.txt";

while (my $line = <INPUT>) {
   if ($line =~ /Bush/) {
      print $line;
   }
}

The line if ($line =~ /Bush/) { contains a test, $line =~ /Bush, which is true when $line matches /Bush/ -- that is, contains the word Bush somewhere within it.

SQ1: Change the program to print out the line for George W. Bush only.

SQ2: Change the program to print out the line for Lyndon Johnson.

Suppose we wanted to print out only presidents elected more than once. Looking at presidents.txt, we see that those presidents have a comma after their first election year; so any line with a comma belongs to a multiply-elected president. Here's comma.pl:

#!/usr/bin/perl -w
use strict;

open INPUT, "presidents.txt" or die "Can't open presidents.txt";

while (my $line = <INPUT>) {
   if ($line =~ /,/) {
      print $line;
   }
}

SQ3: Change the program to print out each president whose name contains a J.

Suppose we want to print out lines with presidents who have been elected in a year ending in 8. We can try just searching for /8/. That gives us the lines we want (Truman 1948, Nixon 1968, and the elder Bush 1988). But it also prints out the line for Ronald Reagan (elected in 1980 and 1984, but not in 1998). We need to look for 19 followed by a single character - we don't really care which - followed by 8.

How to tell Perl that we don't care what's between the 19 and the 8? We use a period (.), which Perl interprets as "don't care". So the following program, eight.pl, will do the trick:

#!/usr/bin/perl -w
use strict;

open INPUT, "presidents.txt" or die "Can't open presidents.txt";

while (my $line = <INPUT>) {
   if ($line =~ /19.8/) {
      print $line;
   }
}

We could print out the line for Franklin Roosevelt (and any other president elected three or more times, should there ever be one) with multicomma.pl:

#!/usr/bin/perl -w
use strict;

open INPUT, "presidents.txt" or die "Can't open presidents.txt";

while (my $line = <INPUT>) {
   if ($line =~ /,.....,/) {
      print $line;
   }
}

The pattern /,.....,/ looks for two commas five positions apart. To help us count, we could abbreviate that to /,.{5},/. The {5} following the period says to repeat it exactly 5 times. We could also have been less exact, and used /,.+,/ (two commas one or more positions apart) or /,.*,/ (two commas zero or more positions apart).

SQ4: Which presidents does /e.*n/ find? (Try it!) Why?

SequenceSearch

We'll be working with the Nostoc genome, which is contained in the file NostocChromosome.nt. You may want to download it now and continue reading; it will take five to ten minutes if you are using a modem at home. The first few lines of the file are:

>Nostoc PCC7120_chromosome
cctaggcgaacctttagcagtagcgacaaaagctaaatcacccctaagcccttctccctc
gataacttcaagtcctcctgatgttgtagtctcagttaaaacttccccataaggtgttac
tccctcttgaatcaaagctcctgtctcttgagctactcgcccatacaaaccgccattttt
taacgggaaagtatatggataagagagtatcctcaaacttatctcttgagtttctttatt
gtttcctgaacgtaaggcatttaaccctctttcaatgttatcgacataaaattgcgtcat
tttggagtctagtcctagctgaccaatttgttgctttaagtcaggtatttttgatgcttc
ctgtaaaagagtgttgacgtagctaatcttcaagtttatgaatccactacgaggatcaga
tcctggaagaggtgataatgggaagtttttagtgacatctcttaaacctgcgacgggatt
agaactcagcctagaaccagaactatattgaacagaatctaaaatttctctatcaaattt

The first line identifies the information that follows: the Nostoc genomic sequence, broken into lines of 60 bases each. About 100,000 lines of 60 bases each -- not something we want to search by hand!

For that we can use the SequenceSearch.pl program. This program searches for a pattern in the Nostoc genome: GTA.{8}TAC.{20,24}TA.{3}T. This is more or less familar from the presidental search we did above: .{8}, for instance, stands for a distance of 8 positions. The new wrinkle is .{20,24}, which means a distance of 20 to 24 positions inclusive.

The program searches for both exact matches, and for matches which differ in at most one base. The two routines it uses, fuzzy_pattern and match_positions, are not built in to Perl, but we don't need to know how they work, just what they do.

The expression fuzzy_pattern($pattern, $mismatches) stands for a pattern that matches whatever $pattern matches, except that bases may differ in at most $mismatches positions. So fuzzy_pattern($pattern, 0) gives us an exact match, and fuzzy_pattern($pattern, 1) gives a pattern where at most one base is a mismatch.

If you run SequenceSearch.pl, you'll see that it doesn't quite do what we want. Instead of counting matches, it prints out every match location. (It will pause for a long time as it finds the exact matches, then print them out, then pause again before printing out the inexact matches.) Since there are several thousand inexact matches, we again don't want to do the counting by hand.

The section of program that prints out the matches is this subroutine:

sub print_matches {
   my (@matches) = @_;
   foreach my $match (@matches) {
      print $match, "\n";
   }
}

SQ5: Change the print_matches subroutine to printout the number of matches instead of each match location.

Please take a good shot at solving SQ5. If you get stuck, here are some hints to look at.

SQ6: Make the change to SequenceSearch.pl on your PC. Now run the program again. What numbers do you get?

SQ7: After the lines

my $sequence =
    get_genome_sequence("NostocChromosome.nt");    # Sequence to search within

Add the following line to SequenceSearch.pl:

print "Length of sequence: ", length($sequence), "\n";

What is the probability of an exact match at any one point in the genome sequence? An inexact match?

SQ8: What is the probability that you encountered the consensus NtcA binding site by chance? (This is not a cut-and-dried question. Re-read section D in the biology notes for discussion of the issues involved.)