    print $line;

will print $line to the screen. But how do we read information?
Here's a program, PrintPresidents.pl, that reads from the file presidents.txt and prints out each line it reads.
    #!/usr/bin/perl -w
    use strict;

    open INPUT, "presidents.txt" or die "Can't open presidents.txt";
    while (my $line = <INPUT>) {
        print $line;
    }

Try downloading the program and the presidents.txt file, and running the program. (To download presidents.txt, click on the link with the right mouse button, then pick Save Link as... or Save Link Target as... from the menu that pops up.)
You should see:
    Franklin Roosevelt 1932, 1936, 1940, 1944
    Harry Truman 1948
    Dwight Eisenhower 1952, 1956
    John Kennedy 1960
    Lyndon Johnson 1964
    Richard Nixon 1968, 1972
    Gerald Ford
    Jimmy Carter 1976
    Ronald Reagan 1980, 1984
    George H. W. Bush 1988
    William Clinton 1992
    George W. Bush 2000

Each president's name is followed by the year(s) of his election(s), if any. Gerald Ford has no year of election listed since he was vice-president when Nixon resigned, and wasn't elected to the office.
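The read loop above can also be written with a lexical filehandle and the three-argument form of open, a style many Perl programmers prefer because the filehandle is an ordinary variable and the error message can include the reason the open failed ($!). The following is only a sketch: the subroutine name read_lines is made up for illustration, and the sample text stands in for presidents.txt so the example runs without any file on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read every line from an already-open filehandle and return them as a list.
# (read_lines is a hypothetical helper, not part of PrintPresidents.pl.)
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;            # remove the trailing newline
        push @lines, $line;
    }
    return @lines;
}

# Sample data standing in for presidents.txt (two lines from the output above).
my $text = "Franklin Roosevelt 1932, 1936, 1940, 1944\nHarry Truman 1948\n";

# Three-argument open; passing a reference to a scalar opens an in-memory file.
open(my $in, '<', \$text) or die "Can't open in-memory data: $!";
my @lines = read_lines($in);
close $in;

print "$_\n" for @lines;
```

To read the real file instead, the open line would become open(my $in, '<', "presidents.txt") or die "Can't open presidents.txt: $!";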
One of Perl's claims to fame is pattern matching -- searching text for a word or phrase, or sometimes something more complicated. For instance, the program bush.pl searches for lines referring to either president Bush.
    #!/usr/bin/perl -w
    use strict;

    open INPUT, "presidents.txt" or die "Can't open presidents.txt";
    while (my $line = <INPUT>) {
        if ($line =~ /Bush/) {
            print $line;
        }
    }

The line if ($line =~ /Bush/) { contains a test, $line =~ /Bush/, which is true when $line matches /Bush/ -- that is, contains the word Bush somewhere within it.
SQ1: Change the program to print out the line for George W. Bush only.
SQ2: Change the program to print out the line for Lyndon Johnson.
Suppose we wanted to print out only presidents elected more than once. Looking at presidents.txt, we see that those presidents have a comma after their first election year; so any line with a comma belongs to a multiply-elected president. Here's comma.pl:
    #!/usr/bin/perl -w
    use strict;

    open INPUT, "presidents.txt" or die "Can't open presidents.txt";
    while (my $line = <INPUT>) {
        if ($line =~ /,/) {
            print $line;
        }
    }

SQ3: Change the program to print out each president whose name contains a J.
Suppose we want to print out lines with presidents who have been elected in a year ending in 8. We can try just searching for /8/. That gives us the lines we want (Truman 1948, Nixon 1968, and the elder Bush 1988). But it also prints out the line for Ronald Reagan (elected in 1980 and 1984, neither of which ends in 8). We need to look for 19 followed by a single character - we don't really care which - followed by 8.
How to tell Perl that we don't care what's between the 19 and the 8? We use a period (.), which Perl interprets as "don't care". So the following program, eight.pl, will do the trick:
    #!/usr/bin/perl -w
    use strict;

    open INPUT, "presidents.txt" or die "Can't open presidents.txt";
    while (my $line = <INPUT>) {
        if ($line =~ /19.8/) {
            print $line;
        }
    }

We could print out the line for Franklin Roosevelt (and any other president elected three or more times, should there ever be one) with multicomma.pl:
    #!/usr/bin/perl -w
    use strict;

    open INPUT, "presidents.txt" or die "Can't open presidents.txt";
    while (my $line = <INPUT>) {
        if ($line =~ /,.....,/) {
            print $line;
        }
    }

The pattern /,.....,/ looks for two commas five positions apart. To help us count, we could abbreviate that to /,.{5},/. The {5} following the period says to repeat it exactly 5 times. We could also have been less exact, and used /,.+,/ (two commas one or more positions apart) or /,.*,/ (two commas zero or more positions apart).
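To see the three multi-comma patterns side by side without touching presidents.txt, here is a small sketch that tries each one against a few sample lines copied from the output shown earlier. The subroutine name lines_matching and the @sample array are made up for this illustration; qr/.../ simply stores a pattern in a variable so it can be passed around.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines that match the given pattern.
# (lines_matching is a hypothetical helper for this example only.)
sub lines_matching {
    my ($pattern, @lines) = @_;
    return grep { $_ =~ $pattern } @lines;
}

# Three sample lines: four elections, two elections, one election.
my @sample = (
    "Franklin Roosevelt 1932, 1936, 1940, 1944",
    "Dwight Eisenhower 1952, 1956",
    "John Kennedy 1960",
);

# Each entry pairs a printable label with the compiled pattern.
my @patterns = (
    [ '/,.{5},/', qr/,.{5},/ ],   # two commas exactly five characters apart
    [ '/,.+,/',   qr/,.+,/   ],   # two commas one or more characters apart
    [ '/,.*,/',   qr/,.*,/   ],   # two commas zero or more characters apart
);

foreach my $entry (@patterns) {
    my ($label, $pattern) = @$entry;
    my @hits = lines_matching($pattern, @sample);
    print "$label matched ", scalar(@hits), " line(s)\n";
    print "    $_\n" foreach @hits;
}
```

All three patterns need at least two commas somewhere in the line, so a line with a single comma is not reported by any of them.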
SQ4: Which presidents does /e.*n/ find? (Try it!) Why?
    >Nostoc PCC7120_chromosome
    cctaggcgaacctttagcagtagcgacaaaagctaaatcacccctaagcccttctccctc
    gataacttcaagtcctcctgatgttgtagtctcagttaaaacttccccataaggtgttac
    tccctcttgaatcaaagctcctgtctcttgagctactcgcccatacaaaccgccattttt
    taacgggaaagtatatggataagagagtatcctcaaacttatctcttgagtttctttatt
    gtttcctgaacgtaaggcatttaaccctctttcaatgttatcgacataaaattgcgtcat
    tttggagtctagtcctagctgaccaatttgttgctttaagtcaggtatttttgatgcttc
    ctgtaaaagagtgttgacgtagctaatcttcaagtttatgaatccactacgaggatcaga
    tcctggaagaggtgataatgggaagtttttagtgacatctcttaaacctgcgacgggatt
    agaactcagcctagaaccagaactatattgaacagaatctaaaatttctctatcaaattt

The first line identifies the information that follows: the Nostoc genomic sequence, broken into lines of 60 bases each. About 100,000 lines of 60 bases each -- not something we want to search by hand!
For that we can use the SequenceSearch.pl program. This program searches for a pattern in the Nostoc genome: GTA.{8}TAC.{20,24}TA.{3}T. This is more or less familiar from the presidential search we did above: .{8}, for instance, stands for a distance of 8 positions. The new wrinkle is .{20,24}, which means a distance of 20 to 24 positions inclusive.
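The range form {min,max} is easier to check on a shortened pattern. The sketch below uses a made-up miniature, GTA.{2,4}TAC, and made-up test strings in which the letter n is just filler between the two fixed parts; none of this comes from SequenceSearch.pl, and contains_site is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, made-up analogue of the genome pattern:
# GTA, then 2 to 4 of any character, then TAC.
my $pattern = qr/GTA.{2,4}TAC/;

# Return 1 if $sequence contains a match for $pattern, 0 otherwise.
sub contains_site {
    my ($sequence) = @_;
    return $sequence =~ $pattern ? 1 : 0;
}

# Filler lengths of 1, 2, 4 and 5: only 2 and 4 fall inside the range.
foreach my $sequence ("GTAnTAC", "GTAnnTAC", "GTAnnnnTAC", "GTAnnnnnTAC") {
    my $verdict = contains_site($sequence) ? "match" : "no match";
    print "$sequence: $verdict\n";
}
```

Both ends of the range are included, which is why a gap of exactly 2 or exactly 4 characters matches while a gap of 1 or 5 does not.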
The program searches both for exact matches and for matches that differ in at most one base. The two routines it uses, fuzzy_pattern and match_positions, are not built into Perl, but we don't need to know how they work, just what they do.
The expression fuzzy_pattern($pattern, $mismatches) stands for a pattern that matches whatever $pattern matches, except that bases may differ in at most $mismatches positions. So fuzzy_pattern($pattern, 0) gives us an exact match, and fuzzy_pattern($pattern, 1) gives a pattern where at most one base is a mismatch.
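To get a feel for what "differs in at most $mismatches positions" means, here is a deliberately simple sketch. It is not the fuzzy_pattern routine from SequenceSearch.pl, whose workings are not shown here. It handles only a fixed string of bases (no periods or {n} repeats), and the names count_mismatches and approximate_positions, as well as the test sequence, are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the positions at which two equal-length strings differ.
sub count_mismatches {
    my ($first, $second) = @_;
    die "Strings must have equal length\n"
        unless length($first) == length($second);
    my $count = 0;
    foreach my $i (0 .. length($first) - 1) {
        $count++ if substr($first, $i, 1) ne substr($second, $i, 1);
    }
    return $count;
}

# Return the 0-based start positions in $sequence where $motif occurs
# with at most $mismatches differing characters.
# Assumption: $motif is a literal string, not a pattern with wildcards.
sub approximate_positions {
    my ($sequence, $motif, $mismatches) = @_;
    my @positions;
    foreach my $start (0 .. length($sequence) - length($motif)) {
        my $window = substr($sequence, $start, length($motif));
        push @positions, $start
            if count_mismatches($window, $motif) <= $mismatches;
    }
    return @positions;
}

# Made-up sequence containing GTA once exactly and GTT once (one base off).
my $sequence = "ccGTAccGTTcc";
my @exact = approximate_positions($sequence, "GTA", 0);
my @fuzzy = approximate_positions($sequence, "GTA", 1);
print "Exact matches at: @exact\n";
print "Matches with at most one mismatch at: @fuzzy\n";
```

Allowing zero mismatches reproduces an ordinary exact search, and every exact match is also counted among the inexact ones, just as fuzzy_pattern($pattern, 1) matches whatever fuzzy_pattern($pattern, 0) does.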
If you run SequenceSearch.pl, you'll see that it doesn't quite do what we want. Instead of counting matches, it prints out every match location. (It will pause for a long time as it finds the exact matches, then print them out, then pause again before printing out the inexact matches.) Since there are several thousand inexact matches, we again don't want to do the counting by hand.
The section of program that prints out the matches is this subroutine:
    sub print_matches {
        my (@matches) = @_;
        foreach my $match (@matches) {
            print $match, "\n";
        }
    }

SQ5: Change the print_matches subroutine to print out the number of matches instead of each match location.
Please take a good shot at solving SQ5. If you get stuck, here are some hints to look at.
SQ6: Make the change to SequenceSearch.pl on your PC. Now run the program again. What numbers do you get?
SQ7: After the line

    my $sequence = get_genome_sequence("NostocChromosome.nt"); # Sequence to search within

add the following line to SequenceSearch.pl:

    print "Length of sequence: ", length($sequence), "\n";

What is the probability of an exact match at any one point in the genome sequence? An inexact match?
SQ8: What is the probability that you encountered the consensus NtcA binding site by chance? (This is not a cut-and-dried question. Re-read section D in the biology notes for discussion of the issues involved.)