Goal

This scenario continues the problem begun in Scenario 1. This time we have (in motif.hits) the output of FindMotif, the program that used a PSSM to scan the genome for likely NtcA binding sites. On their own, the numbers in that file don’t do us much good. What we’d like to know is the following:

  1. Is the binding site inside a gene (and so probably bogus) or between genes?

  2. If between genes, is it upstream from one or more genes? If not, then again the site is unlikely to control transcription.

  3. Which gene(s) is the site near?
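Questions 1 and 2 come down to comparing coordinates and reading directions. Here is a minimal Python sketch, not a definitive implementation. It assumes each orf is a (left, right, direction) tuple using the direction codes d (direct) and c (complement) from 7120db.dat, and that a site is given as a single coordinate. The function names are hypothetical.

```python
def site_in_orf(site, orf):
    """True if the site coordinate lies within the orf's left..right span."""
    left, right, _direction = orf
    return left <= site <= right


def upstream_of(left_orf, right_orf):
    """Say which flanking orfs an intergenic site is upstream of.

    A complement ('c') orf on the left reads leftwards, so its start faces
    the site. A direct ('d') orf on the right reads rightwards, so its start
    faces the site as well. Either neighbor may be None (end of the list).
    """
    hits = []
    if left_orf is not None and left_orf[2] == "c":
        hits.append("left")
    if right_orf is not None and right_orf[2] == "d":
        hits.append("right")
    return hits
```

An empty list from upstream_of means the site sits downstream of both neighbors, which is the "unlikely to control transcription" case in question 2.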


The output desired is then the following:


<-- all0011 Atp synthase: 340 bp SITE 221 bp <-- all0012 hypothetical pro

111 2222222 3333333333333 444444 5555 666666 777 8888888 9999999999999999


Tab delimited file (so it can be read into Excel)


Field 1: <-- (if left orf reads leftwards)

--> (if left orf reads rightwards)

IN (if site is within gene)


Field 2: (If between genes) OrfName of left orf

(If in gene) OrfName of orf


Field 3: (If between genes) OrfDescr of left orf

(If in gene) OrfDescr of orf


Field 4: (If between genes) Distance from left orf to putative NtcA site

(If in gene) Distance from left end of orf to site


Field 5: The word “SITE”, included just for readability


Field 6: (If between genes) Distance from right orf to putative NtcA site

(If in gene) Distance from right end of orf to site


Field 7: <-- (if right orf reads leftwards)

--> (if right orf reads rightwards)

blank (if site is within gene)


Field 8: (If between genes) OrfName of right orf

(If in gene) blank


Field 9: (If between genes) OrfDescr of right orf

(If in gene) blank
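The nine fields can be assembled into one tab-delimited line as in the Python sketch below. The Orf tuple and the function names are hypothetical. Distances are plain coordinate differences, which is an assumption: whether to subtract one is a convention you need to settle. Field 7 is left empty for the in-gene case, which is also an assumption.

```python
from collections import namedtuple

# Hypothetical container for the five fields of 7120db.dat that matter here.
Orf = namedtuple("Orf", "name left right direction descr")


def arrow(orf):
    """'-->' for a direct (rightward) orf, '<--' for a complement (leftward) one."""
    return "-->" if orf.direction == "d" else "<--"


def format_between(left_orf, right_orf, site_left, site_right):
    """Output line for a site lying between two orfs."""
    fields = [
        arrow(left_orf), left_orf.name, left_orf.descr,
        f"{site_left - left_orf.right} bp",      # Field 4
        "SITE",                                  # Field 5
        f"{right_orf.left - site_right} bp",     # Field 6
        arrow(right_orf), right_orf.name, right_orf.descr,
    ]
    return "\t".join(fields)


def format_inside(orf, site_left, site_right):
    """Output line for a site lying within a single orf."""
    fields = [
        "IN", orf.name, orf.descr,
        f"{site_left - orf.left} bp",            # distance from left end of orf
        "SITE",
        f"{orf.right - site_right} bp",          # distance from right end of orf
        "", "", "",                              # Fields 7-9 left blank
    ]
    return "\t".join(fields)
```

Writing every line with nine tab-separated fields, even when the last three are empty, keeps the columns aligned when the file is opened in Excel.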



Description of 7120db.dat


Field          Bytes*    Length   Type
OrfName        4 + 11    15       Char
OrfContig      4 + 5     9        Char
OrfLeft        1 + 8     9        Num
OrfRight       1 + 8     9        Num
OrfDirection   4 + 1     5        Char
OrfAccession   4 + 14    18       Char
OrfPct         4 + 5     9        Char
OrfEval        1 + 8     9        Num
OrfDescr       4 + 50    54       Char
Types          4 + 5     9        Char
HitName1       4 + 11    15       Char
HitRecord1     1 + 8     9        Num
HitPosition1   1 + 8     9        Num
HitEval1       1 + 8     9        Num
HitName2       4 + 11    15       Char
HitRecord2     1 + 8     9        Num
HitPosition2   1 + 8     9        Num
HitEval2       1 + 8     9        Num
TOTAL                    230


*Numeric data is preceded by one byte (letter S I think)

*Character data is preceded by four bytes giving the length of the string.
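The table above fixes where each field sits inside a 230-byte record, and the offsets can be computed rather than counted by hand. The Python sketch below assumes the fields are packed back to back in table order, with no separators between records, and that character data is padded with NUL bytes or spaces. These are assumptions to check against the real file. Numeric fields are not decoded because their 8-byte encoding is not stated here. The names LAYOUT, field_offsets and char_field are hypothetical.

```python
# (field name, prefix bytes, data bytes, type), copied from the table above
LAYOUT = [
    ("OrfName", 4, 11, "Char"), ("OrfContig", 4, 5, "Char"),
    ("OrfLeft", 1, 8, "Num"), ("OrfRight", 1, 8, "Num"),
    ("OrfDirection", 4, 1, "Char"), ("OrfAccession", 4, 14, "Char"),
    ("OrfPct", 4, 5, "Char"), ("OrfEval", 1, 8, "Num"),
    ("OrfDescr", 4, 50, "Char"), ("Types", 4, 5, "Char"),
    ("HitName1", 4, 11, "Char"), ("HitRecord1", 1, 8, "Num"),
    ("HitPosition1", 1, 8, "Num"), ("HitEval1", 1, 8, "Num"),
    ("HitName2", 4, 11, "Char"), ("HitRecord2", 1, 8, "Num"),
    ("HitPosition2", 1, 8, "Num"), ("HitEval2", 1, 8, "Num"),
]

RECORD_LENGTH = sum(prefix + data for _, prefix, data, _ in LAYOUT)


def field_offsets(layout=LAYOUT):
    """Map field name -> (start, end) of its data bytes, skipping the prefix."""
    offsets = {}
    pos = 0
    for name, prefix, data, _type in layout:
        offsets[name] = (pos + prefix, pos + prefix + data)
        pos += prefix + data
    return offsets


def char_field(record, name):
    """Slice a character field out of one record and strip trailing padding."""
    start, end = field_offsets()[name]
    return record[start:end].decode("ascii", errors="replace").rstrip("\x00 ")
```

Under these assumptions OrfName occupies bytes 4 to 14 of a record and OrfDescr bytes 87 to 136, counting from zero.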


Only five of these fields concern us. They are:


OrfName always of the form /a(s|l|r)(l|r)\d{4}/
2nd position: s=small, l=large, r=rna
3rd position: l=left (complement), r=right (direct)
example: all0001
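The naming rule can be checked with a regular expression, where the four trailing digits are written \d{4}. A short Python sketch follows. The function name and the returned tuple are hypothetical conveniences.

```python
import re

# 'a', then size class, then strand, then a four-digit serial number
ORF_NAME = re.compile(r"a([slr])([lr])(\d{4})")

SIZE = {"s": "small", "l": "large", "r": "rna"}
STRAND = {"l": "complement", "r": "direct"}


def parse_orf_name(name):
    """Return (size, strand, serial) for a well-formed OrfName, else None."""
    match = ORF_NAME.fullmatch(name)
    if match is None:
        return None
    size, strand, serial = match.groups()
    return SIZE[size], STRAND[strand], int(serial)
```

For the example name, parse_orf_name("all0001") gives ("large", "complement", 1).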

OrfLeft Low order coordinate for orf

OrfRight High order coordinate for orf

OrfDirection Either d (direct) or c (complement)

OrfDescr The whole point of this exercise, the description of what’s known about the gene


There are 6199 records; however, only the first 5435 are of genes in the chromosome (the rest are of genes in various small pieces of DNA that I did not include in NostocChromosome.nt).


The first 5435 records occur in increasing order of OrfLeft; after that, the coordinates for the small pieces of DNA start again from near the beginning.
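Because the chromosomal records are sorted by OrfLeft, the orfs on either side of a site can be found by binary search instead of scanning all of them. The Python sketch below assumes the orfs are tuples whose first two items are OrfLeft and OrfRight, and that a site is a single coordinate. It also assumes no orf is nested inside a longer one, since only the nearest orf starting at or before the site is checked. The function name is hypothetical.

```python
from bisect import bisect_right

N_CHROMOSOME = 5435  # leading records that belong to the chromosome


def flanking_orfs(orfs, site):
    """Return (containing_orf, left_neighbor, right_neighbor) for a site.

    If the site lies inside an orf, that orf is returned and both neighbors
    are None. Otherwise containing_orf is None and the neighbors are the
    nearest orfs on each side (None at either end of the chromosome).
    """
    chrom = orfs[:N_CHROMOSOME]            # ignore the small pieces of DNA
    lefts = [orf[0] for orf in chrom]      # in practice, build this list once
    i = bisect_right(lefts, site)          # count of orfs with OrfLeft <= site
    before = chrom[i - 1] if i > 0 else None
    after = chrom[i] if i < len(chrom) else None
    if before is not None and before[1] >= site:
        return before, None, None
    return None, before, after
```

Slicing off everything past record 5435 matters: the later records restart at low coordinates, and a binary search over a list that is not sorted gives meaningless answers.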