Biol 591
Problem Set 8 - Use of Databases
Fall 2002
PS8.1: Write an outline that describes the strategy employed by the subroutine print-context within find-context.pl to locate the appropriate orfs for each hit in motif.hits. What assumption(s) does the routine make? (Since the program works, these assumptions are evidently correct.)
PS8.2: The program find-context.pl assumes that memory is not a limiting resource and proceeds to read into memory all the information from the data file it will ever need. This is often not a bad assumption, but it can become one as we want to consider more and more information. Suppose that you’re working on a PC with 64 Mbyte memory (call me old-fashioned, but that’s my PC), and, for safety reasons, you decide you don’t want to use more than 32 Mbytes of it on running this program.
PS8.2a. Calculate how much memory you’re using in storing all the information you read from 7120DB.dat. It's not always easy to tell how much memory Perl will use for its data. Try something like this: 20 bytes for each number, 25 bytes for each string, plus 1 byte for each character in the string, and 72 bytes for each array (where each row [ ... ] counts as an array). You may find that you’re still lacking one key piece of information needed to calculate this number. If so, then write a quick program (actually, alter find-context.pl) so that it gives you this piece of information.
PS8.2b. Suppose that you want to run the program using a similar database file that covers all human genes. Now how much memory do you need?
PS8.2c. Suppose that you want the program to give you not only the coordinates of the relevant genes but also the 3000 bp adjacent to each of the identified orfs. This would be important if you wanted to investigate the complex regulatory sequences governing the expression of the genes you identify. NOW how much memory do you need?
PS8.3: The program find-context.pl treats 7120DB.dat as a stream file, which is to say, it reads it sequentially from the beginning. Suppose that practical considerations (see previous problem) convince you that you can’t do it this way. Instead, you will access ONLY those records you need, when you need them, treating 7120DB.dat as a random access file. Outline the strategy or draw a flow chart (don’t actually write any code) of a program that would enable you to do this. It will consist of the following steps:
- Read 7120DB.dat sequentially, storing only the information you will need to find the orfs you need, reading it into an array. (What do you read? How do you store it?)
- Read motif.hits sequentially, considering one hit at a time. (What do you read? How do you store it?)
- For each hit, find the number of the record you want. To do this, go through the array you created until you find the orf you want, then consider the subscript of the array to be the number of the desired record. (Describe more precisely how you find the desired orf.)
- Calculate where in 7120DB.dat (how many bytes into the file) the desired record is. (Describe this calculation.)
- Read the number of bytes you want, starting with the byte number you calculated above. (You don't know any Perl statement that can do this, but assume one exists.)
PS8.4: Same problem as above, but use the following strategy instead:
- Read motif.hits sequentially, considering one hit at a time.
- For each hit, search through 7120DB.dat until you find the desired record.
PS8.4a. Presuming that you search 7120DB.dat sequentially, what is the average number of records you’ll have to go through to find some random coordinate read from motif.hits?
PS8.4b. Suppose that searching through this number of records leads to unacceptably long waits (it does on my machine, with my level of patience). Devise a search strategy that cuts the search time down enormously. [Hint: consider how you would find a name in the phone book – not by scanning from page 1!]
PS8.4c. There are (at least!) two such strategies. One benefits from assuming that motif.hits is sorted (it's not, but clearly we could sort it and print it out again). Another strategy works whether or not motif.hits is sorted. Which was yours? What would the other one look like?
PS8.5: Run find-context.pl using the input file test-motif.hits. It gives the error message:
Can't yet print context without right-hand ORF
PS8.5a. Why?
PS8.5b. Modify find-context.pl so that it works properly with test-motif.hits.
PS8.6: The program find-context.pl does not quite do everything we would want of it. It would be useful to have the output in a form that's easily read by Excel and in a form that makes it easy to sort by the direction of genes (something we're very interested in). Here’s an example of the desired output:
<--*all0606*PetC: cytochrome b6/*104*704137*704212*467*-->*alr0607*NirA: nitrite reduct*25.5
(replace * with tab). This places the motif (coordinates 704137 to 704212) upstream from all0606 (104 bases from the left end of the motif) and upstream from alr0607 (467 bp from the right end of the motif). The motif has a high score (25.5) and it's positioned the right way (upstream from at least one gene), so it may well be a functional NtcA binding site.
The output in this form can be read into Excel where it can be searched and displayed in a number of ways.
PS8.6a. Modify find-motif.pl so that it gives the desired output. This, if you can do it, is the ultimate solution of the problem posed by Scenario 8.
PS8.6b. Examine the output in Excel. What motifs look interesting? By “interesting” I mean that they are placed upstream from at least one gene of known function. To facilitate your search, sort the data so that hits between two upstream regions are at the top of the list, then those with one upstream region, then those with no upstream region (or the NtcA site is within a gene).
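As a rough illustration of the output format described above, here is a minimal Perl sketch that joins the eleven fields of the example line with tabs and counts how many flanking genes have the motif on their upstream side. The subroutine names and hash keys are hypothetical and do not come from find-context.pl. The rule used for "upstream" (a left-hand gene pointing `<--` or a right-hand gene pointing `-->`) is an assumption read off the example line.

```perl
use strict;
use warnings;

# Build one tab-delimited line from the fields seen in the example output.
# The hash keys are hypothetical names chosen for this sketch only.
sub format_hit {
    my (%h) = @_;
    return join("\t",
        $h{left_dir},    $h{left_orf},  $h{left_desc},  $h{left_dist},
        $h{motif_start}, $h{motif_end},
        $h{right_dist},  $h{right_dir}, $h{right_orf},  $h{right_desc},
        $h{score});
}

# Count the flanking genes that have the motif upstream of them
# (assumption: '<--' on the left or '-->' on the right means upstream).
sub upstream_count {
    my ($left_dir, $right_dir) = @_;
    my $n = 0;
    $n++ if $left_dir  eq '<--';
    $n++ if $right_dir eq '-->';
    return $n;
}

# Values copied from the example line in the problem.
my $line = format_hit(
    left_dir    => '<--',     left_orf   => 'all0606',
    left_desc   => 'PetC: cytochrome b6/',
    left_dist   => 104,
    motif_start => 704137,    motif_end  => 704212,
    right_dist  => 467,
    right_dir   => '-->',     right_orf  => 'alr0607',
    right_desc  => 'NirA: nitrite reduct',
    score       => 25.5,
);

# Putting the count in an extra leading column gives Excel a single key to sort on.
print upstream_count('<--', '-->'), "\t", $line, "\n";
```

The extra leading column is only one possible design. The same ordering could also be obtained in Excel by sorting on the two direction columns.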
PS8.7: Consider the unpack statements (related to input) and printf statements (related to output) you have run across. Never mind graphic input and output, which add another order of magnitude of complexity. How much easier would programming be if you could let someone else worry about input and output?
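As a reminder of the kind of bookkeeping the question refers to, here is a small Perl sketch of unpack on input and printf-style formatting on output. The record layout (three 8-character fields) and the subroutine names are made up for this example. They are not the layout of 7120DB.dat or the code in find-context.pl.

```perl
use strict;
use warnings;

# Split a fixed-width record into name, start, and end.
# Hypothetical layout: three fields of 8 characters each.
sub parse_record {
    my ($record) = @_;
    # "A8" takes 8 characters and strips trailing spaces.
    my ($name, $start, $end) = unpack("A8 A8 A8", $record);
    # Adding 0 turns the space-padded strings into plain numbers.
    return ($name, $start + 0, $end + 0);
}

# Format one aligned output row: name, start, end, and length in bp.
sub format_row {
    my ($name, $start, $end) = @_;
    return sprintf("%-10s %8d %8d %6d\n",
                   $name, $start, $end, $end - $start + 1);
}

# A made-up record, built here so the sketch is self-contained.
my $record = sprintf("%-8s%8d%8d", 'orfX', 1000, 1900);
print format_row(parse_record($record));
```

Every field width appears twice, once in the unpack template and once in the format string, and both must agree with the file. That duplication is the sort of detail the question asks you to weigh.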