This scenario continues the problem begun in Scenario 1. This time we have (in motif.hits) the output of FindMotif, the program that used a PSSM to scan the genome for likely NtcA binding sites. By themselves, the numbers don't do us much good. What we'd like to know is the following:
Is the binding site inside a gene (and so probably bogus) or between genes?
If between genes, is it upstream from one or more genes? If not, then again the site is unlikely to control transcription.
Which gene(s) is the site near?
The output desired is then the following:
<-- all0011 Atp synthase: 340 bp SITE 221 bp <-- all0012 hypothetical pro
111 2222222 3333333333333 444444 5555 666666 777 8888888 9999999999999999
Tab delimited file (so it can be read into Excel)
Field 1: <-- (if left orf reads leftwards)
         --> (if left orf reads rightwards)
         IN (if site is within gene)
Field 2: (If between genes) OrfName of left orf
         (If in gene) OrfName of orf
Field 3: (If between genes) OrfDescr of left orf
         (If in gene) OrfDescr of orf
Field 4: (If between genes) Distance from left orf to putative NtcA site
         (If in gene) Distance from left end of orf to site
Field 5: The word SITE, included just for readability
Field 6: (If between genes) Distance from right orf to putative NtcA site
         (If in gene) Distance from right end of orf to site
Field 7: <-- (if right orf reads leftwards)
         --> (if right orf reads rightwards)
         (if site is within gene)
Field 8: (If between genes) OrfName of right orf
         (If in gene) blank
Field 9: (If between genes) OrfDescr of right orf
         (If in gene) blank
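As a rough illustration of the nine-field layout, here is a minimal Python sketch that builds one tab-delimited output line. The names `Orf`, `arrow`, and `format_row` are hypothetical, and so are the coordinates in the usage example. The sketch assumes a site is given by its left and right coordinates, that a distance is measured from the nearest end of the orf to the nearest end of the site, and that Field 7 is left empty when the site lies within a gene; none of these details are fixed by the description above.

```python
from typing import NamedTuple, Optional


class Orf(NamedTuple):
    """Hypothetical container for the orf fields we need."""
    name: str        # OrfName
    left: int        # OrfLeft (low-order coordinate)
    right: int       # OrfRight (high-order coordinate)
    direction: str   # OrfDirection: 'd' (direct) or 'c' (complement)
    descr: str       # OrfDescr


def arrow(orf: Orf) -> str:
    """Direct orfs read rightwards, complement orfs read leftwards."""
    return "-->" if orf.direction == "d" else "<--"


def format_row(site_left: int, site_right: int,
               left_orf: Orf, right_orf: Optional[Orf] = None) -> str:
    """Build one output line; right_orf=None means the site is inside left_orf."""
    if right_orf is None:
        # Site within a gene: distances are to the two ends of that gene.
        # Assumption: fields 7 to 9 are all left empty in this case.
        fields = ["IN", left_orf.name, left_orf.descr,
                  f"{site_left - left_orf.left} bp", "SITE",
                  f"{left_orf.right - site_right} bp", "", "", ""]
    else:
        # Site between genes: distances are the gaps on either side.
        fields = [arrow(left_orf), left_orf.name, left_orf.descr,
                  f"{site_left - left_orf.right} bp", "SITE",
                  f"{right_orf.left - site_right} bp",
                  arrow(right_orf), right_orf.name, right_orf.descr]
    return "\t".join(fields)


# Usage with made-up coordinates chosen to reproduce the distances
# shown in the sample line (340 bp and 221 bp).
left = Orf("all0011", 1000, 2000, "c", "Atp synthase")
right = Orf("all0012", 2575, 3000, "c", "hypothetical protein")
print(format_row(2340, 2354, left, right))
```

Writing the fields with a tab between them is what lets the file be opened directly in Excel, one field per column.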
Field        | Bytes* | Length | Type
OrfName      | 4 + 11 | 15     | Char
OrfContig    | 4 + 5  | 9      | Char
OrfLeft      | 1 + 8  | 9      | Num
OrfRight     | 1 + 8  | 9      | Num
OrfDirection | 4 + 1  | 5      | Char
OrfAccession | 4 + 14 | 18     | Char
OrfPct       | 4 + 5  | 9      | Char
OrfEval      | 1 + 8  | 9      | Num
OrfDescr     | 4 + 50 | 54     | Char
Types        | 4 + 5  | 9      | Char
HitName1     | 4 + 11 | 15     | Char
HitRecord1   | 1 + 8  | 9      | Num
HitPosition1 | 1 + 8  | 9      | Num
HitEval1     | 1 + 8  | 9      | Num
HitName2     | 4 + 11 | 15     | Char
HitRecord2   | 1 + 8  | 9      | Num
HitPosition2 | 1 + 8  | 9      | Num
HitEval2     | 1 + 8  | 9      | Num
TOTAL        |        | 230    |
*Numeric data is preceded by one byte (letter S I think)
*Character data is preceded by four bytes giving the length of the string.
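The byte counts in the table are enough to work out where each field sits inside a 230-byte record. The Python sketch below does only that arithmetic and returns the raw payload bytes of the requested fields. The names `FIELDS`, `field_slices`, and `raw_fields` are hypothetical, and how the 8 payload bytes of a numeric field are encoded is not stated here, so the sketch deliberately leaves them undecoded.

```python
# (field name, kind, payload size), taken from the table.
# Kind 'C' = character data (4-byte prefix), 'N' = numeric data (1-byte prefix).
FIELDS = [
    ("OrfName", "C", 11), ("OrfContig", "C", 5),
    ("OrfLeft", "N", 8), ("OrfRight", "N", 8),
    ("OrfDirection", "C", 1), ("OrfAccession", "C", 14),
    ("OrfPct", "C", 5), ("OrfEval", "N", 8),
    ("OrfDescr", "C", 50), ("Types", "C", 5),
    ("HitName1", "C", 11), ("HitRecord1", "N", 8),
    ("HitPosition1", "N", 8), ("HitEval1", "N", 8),
    ("HitName2", "C", 11), ("HitRecord2", "N", 8),
    ("HitPosition2", "N", 8), ("HitEval2", "N", 8),
]
PREFIX = {"C": 4, "N": 1}


def field_slices():
    """Return ({field name: slice of its payload}, total record length)."""
    slices = {}
    offset = 0
    for name, kind, size in FIELDS:
        start = offset + PREFIX[kind]      # skip the prefix bytes
        slices[name] = slice(start, start + size)
        offset = start + size
    return slices, offset


SLICES, RECORD_LENGTH = field_slices()

WANTED = ("OrfName", "OrfLeft", "OrfRight", "OrfDirection", "OrfDescr")


def raw_fields(record: bytes, names=WANTED) -> dict:
    """Cut the payload bytes of the named fields out of one record."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} bytes, got {len(record)}")
    return {name: record[SLICES[name]] for name in names}


print(RECORD_LENGTH)        # total of the Length column
print(SLICES["OrfDescr"])   # where the description payload sits
```

Deriving the offsets from the table, rather than typing them in by hand, gives a cheap sanity check: the computed record length should agree with the TOTAL row.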
Only a few of these fields concern us. They are:
OrfName Always of the form /a(s|l|r)(l|r)\d{4}/
        2nd position: s=small, l=large, r=rna
        3rd position: l=left (complement), r=right (direct)
        Example: all0001
OrfLeft Low order coordinate for orf
OrfRight High order coordinate for orf
OrfDirection Either d (direct) or c (complement)
OrfDescr The whole point of this exercise, the description of what's known about the gene
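The OrfName pattern can be checked, and its second and third letters decoded, with a regular expression. This is a small Python sketch under the assumption that the pattern means the letter a, one of s, l, or r, one of l or r, and then exactly four digits; the function name `decode_orf_name` is hypothetical.

```python
import re

# Assumed reading of the OrfName pattern: a, size letter, direction letter, 4 digits.
ORF_NAME = re.compile(r"a([slr])([lr])(\d{4})")

SIZE = {"s": "small", "l": "large", "r": "rna"}
STRAND = {"l": "complement", "r": "direct"}


def decode_orf_name(name: str):
    """Return (size class, strand, number), or None if the name does not fit."""
    # The field is fixed-width, so strip any padding before matching.
    match = ORF_NAME.fullmatch(name.strip())
    if match is None:
        return None
    size, strand, number = match.groups()
    return SIZE[size], STRAND[strand], int(number)


print(decode_orf_name("all0001"))
```

Since the third letter of the name and the OrfDirection field both describe the strand, comparing the two is an inexpensive consistency check on each record.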
There are 6199 records; however, only the first 5435 are of genes in the chromosome (the rest are of genes in various small pieces of DNA that I did not include in NostocChromosome.nt).
The first 5435 records occur in increasing order of OrfLeft; after that the small pieces of DNA begin again from near the beginning.
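Because the chromosomal records are sorted by OrfLeft, the orfs flanking a site can be found with a binary search instead of a scan through all the records. The Python sketch below shows one way to do it. The function names are hypothetical, a site is reduced to a single coordinate, and the orfs are assumed not to overlap one another; all three are simplifications rather than anything stated above.

```python
from bisect import bisect_right

N_CHROMOSOMAL = 5435   # only the first 5435 records are chromosomal genes


def chromosomal(orfs, n=N_CHROMOSOMAL):
    """Keep the chromosomal records and precompute their sorted left ends."""
    chrom = orfs[:n]
    lefts = [orf[0] for orf in chrom]   # each orf is (left, right, name)
    return chrom, lefts


def locate(chrom, lefts, pos):
    """Classify a site at coordinate pos.

    Returns ("IN", orf, None) if pos falls inside an orf, otherwise
    ("BETWEEN", left_orf, right_orf), where either neighbour may be
    None at the ends of the list.
    """
    i = bisect_right(lefts, pos) - 1    # last orf starting at or before pos
    if i >= 0 and pos <= chrom[i][1]:
        return "IN", chrom[i], None
    left = chrom[i] if i >= 0 else None
    right = chrom[i + 1] if i + 1 < len(chrom) else None
    return "BETWEEN", left, right


# Usage with made-up records; the third one stands in for a non-chromosomal
# record whose coordinates start again from near the beginning.
records = [(100, 200, "orfA"), (300, 400, "orfB"), (50, 80, "plasmidOrf")]
chrom, lefts = chromosomal(records, n=2)
print(locate(chrom, lefts, 250))
print(locate(chrom, lefts, 350))
```

Slicing off the non-chromosomal records first matters: their coordinates restart near the beginning, so leaving them in would break the sorted order that the binary search relies on.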