Scenario 1, Problem Set 1P

Biol 591

Introduction to Bioinformatics

Fall 2003

All prior and future Study Questions are deemed members of a problem set. This makes them fair game on days in which we discuss problem sets (and also when I devise questions for exams).

PS1P-1. Discover! Download and run ps1p-1.pl, part of which is reproduced below. Fool around with it, change anything you can think of, until you understand what each of these lines does:

@array = ("AATT","ACGT","AGCT","ATAT","CATG","CCGG","CGCG","CTAG",
"GATC","GCGC","GGCC","GTAC","TATA","TCGA","TGCA","TTAA");
$scalar = @array;
print join(" ",@array), "\n";
print $scalar, "\n";

In particular:

1a. What does @array = (...) do?
1b. What does $scalar = @array do? (What gets assigned to $scalar?)
1c. What does join(" ", @array) do?

PS1P-2. Discover! Download and run ps1p-2.pl (and the accompanying data file ps1p-2-data.txt), part of which is reproduced below. Fool around with it, change anything you can think of, until you understand what each of these lines does:

   print "Hit any key to continue", $LF;
   <STDIN>;
   $x =~ s/T/U/g;
   $x =~ s/CAGGU.+U..CAG/CAG/g;

In particular:

2a. What does $LF do?
2b. What does <STDIN> do?
2c. What does $x =~ s/T/U/g; do?
2d. What does $x =~ s/CAGGU.+U..CAG/CAG/g; do?
2e. What is the significance of this program?

PS1P-3. The query descriptions that BlastParser prints are so long that they make the output ugly. And the last part of each description is useless -- they all say "Escherichia coli..." We'd like to modify the program so that "Escherichia coli" is suppressed.

3a. Which section of the program would you modify to get rid of "Escherichia coli"?
3b. Which variable would you put on the left of the =~ operator?
3c. Make the change.
3d. How could you eliminate both "Escherichia coli" and everything after it in the description?
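
As a starting point for experimenting, here is a minimal sketch of the substitution operator applied to a description string. The variable name $description and the sample text are assumptions made up for illustration; BlastParser's own variable and descriptions will differ.

```perl
use strict;
use warnings;

# Made-up description text; not taken from real BlastParser output.
my $description = "hypothetical protein yzzX Escherichia coli K12 MG1655";

# Remove only the phrase itself. The copy-then-substitute idiom
# leaves $description untouched.
(my $without_phrase = $description) =~ s/Escherichia coli//;

# Remove the phrase and everything after it: .* runs to the end of the line.
# The leading \s* also drops the space in front of the phrase.
(my $truncated = $description) =~ s/\s*Escherichia coli.*//;

print $without_phrase, "\n";
print $truncated, "\n";
```

Note that the first substitution leaves two spaces where the phrase used to be, which is one reason to think about what should surround the pattern.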

PS1P-4. Match the following regular expressions with the value they will produce when they act on the line of text (taken from a GenBank file):

LOCUS BACLEFB 3291 bp DNA linear BCT 12-OCT-1995

     Regular expression                          Value of $part or @part

1.   (my $part) = /.+(\d+) bp/;                  a.
2.   (my @part) = /.+(\d+)-(\w+)-(\d+)/          b.   1
3.   (my $part) = /\w+(\W+)/;                    c.   3291
4.   (my $part) = /.+\d+(.+)\s+/;                d.   2, Oct, 1995
5.   (my $part) = /^DNA\s+(W+)/;                 e.   bp DNA linear BCT

PS1P-5. Most of the lines that BlastParser prints summarize matches between the pathogenic O157:H7 and non-pathogenic K12 sets of proteins. The interesting queries, though, are the ones that don't find a match. How can you list only those? There are several ways. Here are two strategies:

Strategy 1: Define a flag (call it, perhaps, is_to_be_printed). Set the flag equal to 1 (or to $true if you initialize $true=1;) and then change it to 0 (or to $false if you initialize $false=0;) if a match is found. Then test the flag before printing.

Strategy 2: Note that all information concerning the query and the subject is stored in a single array. The size of the array (the number of values it contains) depends on how many matches (if any) were found. Therefore, you can test to see whether a match was found by looking at the size of the array. If the array exceeds a certain size (i.e. a match was found), then don't print it.
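
The two strategies can be compared side by side in a small sketch. Everything here is hypothetical: the record layout (a query name followed by zero or more subject names), the variable names, and the size threshold are invented for illustration and are not BlastParser's actual layout, so the sizes asked about in 5a still have to be worked out from the program itself.

```perl
use strict;
use warnings;

# Hypothetical records: one query name followed by zero or more subject names.
my @records = (
    [ "query_A", "subject_1" ],
    [ "query_B" ],
    [ "query_C", "subject_2", "subject_3" ],
);

# Assumed for this toy layout: 1 query field plus at least 1 subject field.
my $min_size_with_match = 2;

my (@unmatched_by_flag, @unmatched_by_size);

foreach my $record (@records) {
    my @info = @$record;

    # Strategy 1: assume the query is to be printed, clear the flag on any match.
    my $is_to_be_printed = 1;
    foreach my $subject (@info[1 .. $#info]) {
        $is_to_be_printed = 0;
    }
    push @unmatched_by_flag, $info[0] if $is_to_be_printed;

    # Strategy 2: an array evaluated in scalar context gives its size.
    push @unmatched_by_size, $info[0] if scalar(@info) < $min_size_with_match;
}

print "Flag test: @unmatched_by_flag\n";
print "Size test: @unmatched_by_size\n";
```

Both tests pick out the same query in this toy data. Strategy 2 needs no extra variable, but it depends on knowing exactly how the array is filled.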

Let's go with Strategy 2.

5a. The size of what array in BlastParser varies depending on whether a match is found? What size will that array have when there are zero matches? When there is one match?
5b. Which section of the program would you change to test for the size of the array? (For instance, the part labeled MAIN PROGRAM? The section following sub print_previous_query? The section following sub start_new_query? The section following sub record_subject? Somewhere else?)
5c. Make the change and run the program.

PS1P-6. Remember the four warning messages that occurred when installing the E. coli K12 database using FormatDB and running BlastAll? I went into the text file, and sure enough, the protein sequences were missing. This did not seem right, so I contacted TIGR/CMR asking what was going on. ... In brief, it turns out that these four proteins use a very rare amino acid (selenocysteine) that is not part of the conventional 20. TIGR's program choked on the sequences.

What could make a program choke? I decided to download one of the offending sequences and take a closer look at it. I chose fdnG, encoding formate dehydrogenase, alpha subunit. How is its amino acid sequence different from other amino acid sequences, in such a way that a program used to normal amino acid sequences would have a problem?

I scanned the amino acid sequence but learned nothing from my effort. Maybe I just missed the strange part. So, I wrote a quick program, and there it was. There WHAT was? Well, you can find out by:

6a. Download the sequence as a GenBank file (click here for instructions).
6b. Extract the amino acid sequence from the GenBank file to create a file in FastA format. You can do this by hand in Word (or similar). Or see 1P-7 below. You should end up with a file that looks like this:
>NP_415991: formate dehydrogenase-N, nitrate-inducible, alpha subunit
mkkvvtvcpycasgckinlvvdngkivraeaaqgktnqgtlclkgyygwdfindtqiltp
etc
6c. Write (or actually 90% steal) a program that reads in the FastA file you created and searches for a letter that is NOT amongst the conventional 20 amino acids, i.e. [acdefghiklmnpqrstvwy]. The program should look for lines in which there is an amino acid that does not match any of these conventional amino acids. It should print out the line and print out the offending symbol (which is used to represent the unusual amino acid selenocysteine).
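
The search described in 6c might be sketched as follows. The sequence lines are made up, and the letter x is only a stand-in for an unexpected symbol; it is not the symbol that turns up in the real fdnG sequence. A real version would read the lines from your FastA file instead of from a list inside the program.

```perl
use strict;
use warnings;

# Made-up FastA content for illustration only.
my @lines = (
    ">hypothetical_protein: example only",
    "mkkvvtvcpycasgckinlv",
    "aaqgktnxgtlclkgyygwd",
);

my @hits;
foreach my $line (@lines) {
    next if $line =~ /^>/;    # skip the FastA header line

    # [^...] matches any single character NOT listed inside the brackets.
    # The i modifier lets upper-case sequences pass the same test.
    while ($line =~ /([^acdefghiklmnpqrstvwy])/gi) {
        push @hits, $1;
        print "Line: $line\n";
        print "Unconventional symbol: $1\n";
    }
}
```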

PS1P-7. Modify BlastParser so that it takes as input any protein file in GenBank format and spits out an amino acid sequence in FastA format, as shown in 1P-6b.
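
For orientation, here is a stand-alone sketch of the conversion itself, not a modification of BlastParser. It relies on the GenBank flat-file layout, in which the sequence follows a line beginning with ORIGIN and ends at a line containing //. The record below is hand-made and abbreviated, and the variable names are assumptions. The sketch does not handle a DEFINITION that continues onto a second line, and it does not wrap the sequence at 60 letters per line.

```perl
use strict;
use warnings;

# Hand-made, abbreviated record in GenBank layout (illustration only).
my @genbank_lines = (
    "LOCUS       EXAMPLE_1     20 aa",
    "DEFINITION  hypothetical example protein.",
    "ORIGIN",
    "        1 mkkvvtvcpy casgckinlv",
    "//",
);

my ($name, $definition, $sequence, $in_origin) = ("", "", "", 0);

foreach my $line (@genbank_lines) {
    if    ($line =~ /^LOCUS\s+(\S+)/)     { $name = $1 }
    elsif ($line =~ /^DEFINITION\s+(.+)/) { $definition = $1 }
    elsif ($line =~ /^ORIGIN/)            { $in_origin = 1 }    # sequence starts on the next line
    elsif ($line =~ m{^//})               { $in_origin = 0 }    # end of record
    elsif ($in_origin) {
        # Drop the position numbers and the spaces between blocks of ten.
        (my $residues = $line) =~ s/[\d\s]//g;
        $sequence .= $residues;
    }
}

my $fasta = ">$name: $definition\n$sequence\n";
print $fasta;
```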

PS1P-8. Perl is not merely useful in bioinformatics. You can use it for a variety of useful purposes! (Will it dice those hard-to-cut onions?... never mind). Write a quick program to take a list of names (click here) and swap each first name with its last name. The next step would be to alphabetize them. You can do this in Perl, using something like:

my $LF = "\n";
# sort() with no comparison block orders strings alphabetically (by character code)
my @sorted_array = sort(@unsorted_array);
print join($LF, @sorted_array), $LF;

Or, if that's too confusing, you can do it in Word or Excel if you like.
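
Putting the swap and the sort together might look like the sketch below. It assumes each name is in the form "First Last" with exactly two words; the names themselves are placeholders, and the real list may be laid out differently (middle names, for instance, would need a different pattern).

```perl
use strict;
use warnings;

# Placeholder names, assumed to be in "First Last" form.
my @unsorted_array = ("Rosalind Franklin", "Barbara McClintock", "Frederick Sanger");

my @reversed;
foreach my $name (@unsorted_array) {
    # Capture the first word and the second word, then write them back swapped.
    (my $swapped = $name) =~ s/^(\S+)\s+(\S+)$/$2, $1/;
    push @reversed, $swapped;
}

my $LF = "\n";
my @sorted_array = sort(@reversed);
print join($LF, @sorted_array), $LF;
```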
