Biol 591
Scenario 1, Problem Set 1P
Fall 2003
All prior and future Study Questions are deemed members of a problem set. This makes them fair game on days in which we discuss problem sets (and also when I devise questions for exams).

PS1P-1. Discover! Download and run ps1p-1.pl, part of which is reproduced below. Fool around with it, change anything you can think of, until you understand what each of these lines does:
@array = ("AATT","ACGT","AGCT","ATAT","CATG","CCGG","CGCG","CTAG",
"GATC","GCGC","GGCC","GTAC","TATA","TCGA","TGCA","TTAA");
$scalar = @array;
print join(" ",@array), "\n";
print $scalar, "\n";
In particular:
1a. What does @array = (...) do?
1b. What does $scalar = @array do? (What gets assigned to $scalar?)
1c. What does join(" ", @array) do?

PS1P-2. Discover! Download and run ps1p-2.pl (and the accompanying data file ps1p-2-data.txt), part of which is reproduced below. Fool around with it, change anything you can think of, until you understand what each of these lines does:
print "Hit any key to continue", $LF;
<STDIN>;
$x =~ s/T/U/g;
$x =~ s/CAGGU.+U..CAG/CAG/g;
In particular:
2a. What does $LF do?
2b. What does <STDIN> do?
2c. What does $x =~ s/T/U/g; do?
2d. What does $x =~ s/CAGGU.+U..CAG/CAG/g; do?
2e. What is the significance of this program?

PS1P-3. The query descriptions that BlastParser prints are so long as to make the output ugly. And the last part of each description is useless: they all say "Escherichia coli..." We'd like to modify the program so that "Escherichia coli" is suppressed.
3a. Which section of the program would you modify to get rid of "Escherichia coli"?
3b. Which variable would you put on the left of the =~ operator?
3c. Make the change.
3d. How could you eliminate both "Escherichia coli" and everything after it in the description?

PS1P-4. Match the following regular expressions with the value they will produce when they act on the line of text (taken from a GenBank file):
LOCUS BACLEFB 3291 bp DNA linear BCT 12-OCT-1995
Regular expression                          Value of $part or @part
1. (my $part) = /.+(\d+) bp/;               a.
2. (my @part) = /.+(\d+)-(\w+)-(\d+)/       b. 1
3. (my $part) = /\w+(\W+)/;                 c. 3291
4. (my $part) = /.+\d+(.+)\s+/;             d. 2, Oct, 1995
5. (my $part) = /^DNA\s+(W+)/;              e. bp DNA linear BCT
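If the mechanics of these assignments are unfamiliar, the sketch below may help. It is only an illustration on a made-up line of text (the string in $line and all variable names are assumptions, not part of the problem), showing how parentheses in a pattern hand their captures to a scalar or an array, and how a greedy .+ changes what is captured.

```perl
use strict;
use warnings;

# Hypothetical line of text; NOT the GenBank line from the problem.
my $line = "sample A17 width 1250 px logged 07-JAN-2001";

# One pair of parentheses: the scalar inside (...) gets the first capture.
# The leading .+ is greedy, so it swallows everything except the last digit.
my ($greedy) = $line =~ /.+(\d+) px/;

# Without the leading .+ the whole number is captured.
my ($whole) = $line =~ /(\d+) px/;

# Several pairs of parentheses fill an array, in order.
my @date = $line =~ /(\d+)-(\w+)-(\d+)/;

# A pattern that fails to match leaves the variable undefined.
my ($missing) = $line =~ /^width\s+(\d+)/;

print "greedy: $greedy\n";
print "whole:  $whole\n";
print "date:   ", join(", ", @date), "\n";
print "missing is ", (defined $missing ? "defined" : "undefined"), "\n";
```

Try changing the patterns one character at a time and predicting the output before you run it.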

PS1P-5. Most of the lines that BlastParser prints summarize matches between the pathogenic O157:H7 and non-pathogenic K12 sets of proteins. The interesting queries, though, are the ones that don't find a match. How can you list only those? There are several ways. Here are two strategies:
Strategy 1: Define a flag (call it, perhaps, is_to_be_printed). Set the flag equal to 1 (or to $true if you initialize $true=1;) and then change it to 0 (or to $false if you initialize $false=0;) if a match is found. Then test the flag before printing.
Strategy 2: Note that all information concerning the query and the subject is stored in a single array. The size of the array (the number of values it contains) depends on how many matches (if any) were found. Therefore, you can test to see whether a match was found by looking at the size of the array. If the array exceeds a certain size (i.e. a match was found), then don't print it.
Let's go with Strategy 2.
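The general shape of Strategy 2 might look something like the sketch below. It is not taken from BlastParser: the record layout, the sub name has_no_match, and the threshold in $values_without_match are all assumptions made up for illustration, so you will still need to work out the real array and the real sizes yourself.

```perl
use strict;
use warnings;

# Hypothetical layout, NOT BlastParser's actual one: the first two values
# describe the query, and each match found adds one more value.
my $values_without_match = 2;

# Returns true when the record holds nothing beyond the query itself.
sub has_no_match {
    my @record = @_;
    return scalar(@record) <= $values_without_match;
}

# Made-up records standing in for what the parser collects.
my @records = (
    [ "query1", "description one" ],                # no match
    [ "query2", "description two", "subjectA" ],    # one match
);

# Print only the records whose array is too small to contain a match.
foreach my $record (@records) {
    print join(" ", @$record), "\n" if has_no_match(@$record);
}
```

The point is simply that an array used in scalar context gives its size, which can then be compared against a number.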
5a. The size of what array in BlastParser varies depending on whether a match is found? What size will that array have when there are zero matches? When there is one match?
5b. Which section of the program would you change to test for the size of the array? (For instance, the part labeled MAIN PROGRAM? The section following sub print_previous_query? The section following sub start_new_query? The section following sub record_subject? Somewhere else?)
5c. Make the change and run the program.

PS1P-6. Remember the four warning messages that occurred when installing the E. coli K12 database using FormatDB and running BlastAll? I went into the text file, and sure enough, the protein sequences were missing. This did not seem right, so I contacted TIGR/CMR asking what was going on. ... In brief, it turns out that these four proteins use a very rare amino acid (selenocysteine) that is not part of the conventional 20. TIGR's program choked on the sequences.
What could make a program choke? I decided to download one of the offending sequences and take a closer look at it. I chose fdnG, encoding formate dehydrogenase, alpha subunit. How is its amino acid sequence different from other amino acid sequences, in such a way that a program used to normal amino acid sequences would have a problem?
I scanned the amino acid sequence but learned nothing from my effort. Maybe I just missed the strange part. So, I wrote a quick program, and there it was. There WHAT was? Well, you can find out by:
6a. Download the sequence as a GenBank file (click here for instructions).
6b. Extract the amino acid sequence from the GenBank file to create a file in FastA format. You can do this by hand in Word (or similar). Or see 1P-7 below. You should end up with a file that looks like this:
>NP_415991: formate dehydrogenase-N, nitrate-inducible, alpha subunit
mkkvvtvcpycasgckinlvvdngkivraeaaqgktnqgtlclkgyygwdfindtqiltp
etc
6c. Write (or actually 90% steal) a program that reads in the FastA file you created and searches for a letter that is NOT amongst the conventional 20 amino acids, i.e. [acdefghiklmnpqrstvwy]. The program should look for lines in which there is an amino acid that does not match any of these conventional amino acids. It should print out the line and print out the offending symbol (which is used to represent the unusual amino acid selenocysteine).

PS1P-7. Modify BlastParser so that it takes as input any protein file in GenBank format and spits out an amino acid sequence in FastA format, as shown in 1P-6b.
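For the search described in 6c, one possible starting point is a negated character class. The sketch below is only that, a sketch: the sub name unconventional_symbols and the sample lines are made up (the letters x and z are invented oddities, not what you will find in fdnG), and reading the real FastA file is left to you.

```perl
use strict;
use warnings;

# Returns every symbol in a sequence line that is not one of the
# conventional 20 amino acids (case-insensitive; whitespace is ignored).
sub unconventional_symbols {
    my ($line) = @_;
    return $line =~ /([^acdefghiklmnpqrstvwy\s])/gi;
}

# Hypothetical stand-in for the FastA file contents.
my @lines = (
    ">demo: made-up header",
    "mkkvvtvcpycasgck",
    "acdxefgzhik",
);

foreach my $line (@lines) {
    next if $line =~ /^>/;    # skip the FastA header line
    my @odd = unconventional_symbols($line);
    print "$line\t", join(",", @odd), "\n" if @odd;
}
```

The caret inside the square brackets is what turns "any of these letters" into "anything except these letters".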

PS1P-8. Perl is not merely useful in bioinformatics. You can use it for a variety of useful purposes! (Will it dice those hard to cut onions?... never mind). Write a quick program to take a list of names (click here) and reverse last names with first names. The next step would be to alphabetize them. You can do this in Perl, using something like:
my $LF = "\n";
@sorted_array = sort(@unsorted_array);
print join($LF, @sorted_array), $LF;
Or, if that's too confusing, you can do it in Word or Excel if you like.
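Here is one way the reversing and sorting steps might fit together. It is a minimal sketch that assumes each name is written "First Last" on its own; the sub name last_name_first and the sample names are invented for illustration, and the real list may need different handling (middle initials, titles, and so on).

```perl
use strict;
use warnings;

# Turns "First Last" into "Last, First". Everything before the final
# word is treated as the first name(s).
sub last_name_first {
    my ($name) = @_;
    if ($name =~ /^\s*(.+?)\s+(\S+)\s*$/) {
        return "$2, $1";
    }
    return $name;    # single-word names are left alone
}

my $LF = "\n";

# Hypothetical sample; the real list comes from the linked file.
my @unsorted_array = ("Gregor Mendel", "Barbara McClintock", "Rosalind Franklin");

# Reverse each name, then alphabetize the reversed names.
my @reversed     = map { last_name_first($_) } @unsorted_array;
my @sorted_array = sort(@reversed);
print join($LF, @sorted_array), $LF;
```

Note that sort with no other arguments compares strings character by character, so uppercase letters sort before lowercase ones.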