Goal

This scenario continues the problem begun in Scenario 1. This time we have (in motif.hits) the output of FindMotif, the program that used a PSSM to scan the genome for likely NtcA binding sites. By themselves, the numbers in that output don’t do us much good. What we’d like to know is the following:

Is the binding site inside a gene (and so probably bogus) or between genes?
If between genes, is it upstream from one or more genes? If not, then again the site is unlikely to control transcription.
Which gene(s) is the site near?
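The upstream question can be sketched in a few lines. This is a minimal illustration, not part of the scenario itself: the function name `upstream_of` is hypothetical, and it assumes the OrfDirection codes from the file description (d = direct, reading rightwards; c = complement, reading leftwards). A site between two genes is upstream of the left gene only if that gene reads leftwards, and upstream of the right gene only if that gene reads rightwards.

```python
def upstream_of(left_dir, right_dir):
    """Return which flanking orfs an intergenic site lies upstream of.

    left_dir and right_dir are OrfDirection codes: 'd' (direct, reads
    rightwards) or 'c' (complement, reads leftwards).
    """
    targets = []
    if left_dir == "c":    # left orf reads leftwards: its start faces the site
        targets.append("left")
    if right_dir == "d":   # right orf reads rightwards: its start faces the site
        targets.append("right")
    return targets
```

An empty list means the site sits downstream of both neighbors, which is the case the second question flags as unlikely to control transcription.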

The output desired is then the following:

<-- all0011 Atp synthase: 340 bp SITE 221 bp <-- all0012 hypothetical pro

111 2222222 3333333333333 444444 5555 666666 777 8888888 9999999999999999

Tab delimited file (so it can be read into Excel)

Field 1: <-- (if left orf reads leftwards)

--> (if left orf reads rightwards)

IN (if site is within gene)

Field 2: (If between genes) OrfName of left orf

(If in gene) OrfName of orf

Field 3: (If between genes) OrfDescr of left orf

(If in gene) OrfDescr of orf

Field 4: (If between genes) Distance from left orf to putative NtcA site

(If in gene) Distance from left end of orf to site

Field 5: The word “SITE”, included just for readability

Field 6: (If between genes) Distance from right orf to putative NtcA site

(If in gene) Distance from right end of orf to site

Field 7: <-- (if right orf reads leftwards)

--> (if right orf reads rightwards)

(if site is within gene)

Field 8: (If between genes) orfName of right orf

(If in gene) blank

Field 9: (If between genes) OrfDescr of right orf

(If in gene) blank
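As a rough sketch of how the nine fields might be assembled, the helpers below build one tab-delimited line. The function names and the dictionary keys (`dir`, `name`, `descr`) are hypothetical. The colon after the left description and the "bp" suffix simply mirror the sample line above. Field 7 is left blank for the in-gene case, which is an assumption, since the description above does not say what goes there.

```python
ARROW = {"c": "<--", "d": "-->"}  # OrfDirection code -> arrow for Fields 1 and 7

def between_line(left, right, dist_left, dist_right):
    """Format a site lying between two orfs (each a dict: dir, name, descr)."""
    fields = [
        ARROW[left["dir"]], left["name"], left["descr"] + ":",  # colon as in the sample
        f"{dist_left} bp", "SITE", f"{dist_right} bp",
        ARROW[right["dir"]], right["name"], right["descr"],
    ]
    return "\t".join(fields)

def in_gene_line(orf, dist_left, dist_right):
    """Format a site lying inside one orf; Fields 7-9 are left blank (assumed)."""
    fields = [
        "IN", orf["name"], orf["descr"] + ":",
        f"{dist_left} bp", "SITE", f"{dist_right} bp",
        "", "", "",
    ]
    return "\t".join(fields)
```

Keeping all nine fields, even when blank, means every row has the same number of columns when the file is opened in Excel.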

Description of 7120db.dat

Field	Bytes*	Length	Type
OrfName	4 + 11	15	Char
OrfContig	4 + 5	9	Char
OrfLeft	1 + 8	9	Num
OrfRight	1 + 8	9	Num
OrfDirection	4 + 1	5	Char
OrfAccession	4 + 14	18	Char
OrfPct	4 + 5	9	Char
OrfEval	1 + 8	9	Num
OrfDescr	4 + 50	54	Char
Types	4 + 5	9	Char
HitName1	4 + 11	15	Char
HitRecord1	1 + 8	9	Num
HitPosition1	1 + 8	9	Num
HitEval1	1 + 8	9	Num
HitName2	4 + 11	15	Char
HitRecord2	1 + 8	9	Num
HitPosition2	1 + 8	9	Num
HitEval2	1 + 8	9	Num
TOTAL		230

*Numeric data is preceded by one byte (the letter S, I think).

*Character data is preceded by four bytes giving the length of the string.
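The table translates directly into byte offsets. The sketch below is only an illustration under stated assumptions: the names `FIELDS` and `split_record` are hypothetical, and it assumes each record is exactly the 230 bytes tabulated above. It strips the 4-byte or 1-byte prefix from each field and returns the raw payload bytes. It does not decode the numeric fields, because the encoding of their 8 payload bytes is not stated here.

```python
# (name, total length, type) copied from the table above
FIELDS = [
    ("OrfName", 15, "Char"), ("OrfContig", 9, "Char"),
    ("OrfLeft", 9, "Num"), ("OrfRight", 9, "Num"),
    ("OrfDirection", 5, "Char"), ("OrfAccession", 18, "Char"),
    ("OrfPct", 9, "Char"), ("OrfEval", 9, "Num"),
    ("OrfDescr", 54, "Char"), ("Types", 9, "Char"),
    ("HitName1", 15, "Char"), ("HitRecord1", 9, "Num"),
    ("HitPosition1", 9, "Num"), ("HitEval1", 9, "Num"),
    ("HitName2", 15, "Char"), ("HitRecord2", 9, "Num"),
    ("HitPosition2", 9, "Num"), ("HitEval2", 9, "Num"),
]
RECORD_LEN = sum(length for _, length, _ in FIELDS)  # matches the TOTAL row

def split_record(record):
    """Split one raw record (bytes) into {field name: payload bytes}."""
    payloads = {}
    pos = 0
    for name, length, kind in FIELDS:
        prefix = 4 if kind == "Char" else 1  # length word or marker byte
        payloads[name] = record[pos + prefix : pos + length]
        pos += length
    return payloads
```

If the file turns out to have a header, or the records are not packed back to back, the starting offset for each record would need adjusting before calling `split_record`.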

Only five of these fields concern us. They are:

OrfName always of the form /a(s|l|r)(l|r)\d{4}/
2nd position: s=small, l=large, r=rna
3rd position: l=left (complement), r=right (direct)
example: all0001
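The naming pattern can be checked with a regular expression. The following is a small sketch with a hypothetical function name (`describe_orf_name`). It assumes names may carry padding whitespace, since the field is wider than the name, and strips it before matching.

```python
import re

ORF_NAME = re.compile(r"a([slr])([lr])(\d{4})")
SIZE = {"s": "small", "l": "large", "r": "rna"}
STRAND = {"l": "left (complement)", "r": "right (direct)"}

def describe_orf_name(name):
    """Decode an OrfName into (size, strand, number), or None if malformed."""
    match = ORF_NAME.fullmatch(name.strip())  # strip assumed padding
    if match is None:
        return None
    return SIZE[match.group(1)], STRAND[match.group(2)], int(match.group(3))
```

For the example name, `describe_orf_name("all0001")` gives `("large", "left (complement)", 1)`.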

OrfLeft Low order coordinate for orf

OrfRight High order coordinate for orf

OrfDirection Either d (direct) or c (complement)

OrfDescr The whole point of this exercise, the description of what’s known about the gene

There are 6199 records; however, only the first 5435 are of genes in the chromosome (the rest are of genes in various small pieces of DNA that I did not include in NostocChromosome.nt).

The first 5435 records occur in increasing order of OrfLeft; after that, the records for the small pieces of DNA start over from coordinates near the beginning.
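Because the chromosomal records are sorted by OrfLeft, a binary search can find the genes flanking a site. The sketch below is one possible approach, not a prescribed one. The function name `locate` and the dictionary keys (`left`, `right`) are hypothetical, and a site is reduced to a single coordinate `pos`. It also assumes that orfs do not nest inside one another, and it ignores the fact that the chromosome is circular.

```python
from bisect import bisect_right

def locate(orfs, pos):
    """Classify a site coordinate against orfs sorted by their 'left' value.

    Returns (kind, left_orf, right_orf, dist_left, dist_right), where kind
    is "IN" or "BETWEEN". For "IN", the orf is in the left_orf slot.
    """
    lefts = [orf["left"] for orf in orfs]   # precompute once for many sites
    i = bisect_right(lefts, pos) - 1        # last orf with OrfLeft <= pos
    if i >= 0 and pos <= orfs[i]["right"]:
        orf = orfs[i]
        return ("IN", orf, None, pos - orf["left"], orf["right"] - pos)
    left = orfs[i] if i >= 0 else None
    right = orfs[i + 1] if i + 1 < len(orfs) else None
    dist_left = pos - left["right"] if left else None
    dist_right = right["left"] - pos if right else None
    return ("BETWEEN", left, right, dist_left, dist_right)
```

Only the first 5435 records should be passed in, since the sort order holds only for those. A site before the first gene or after the last one comes back with `None` for the missing neighbor.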