...but wait a second. Perhaps those triplets are red herrings! They are correlated with the beginnings of genes, but do they determine the beginning? After all, capital letters don't ALWAYS indicate the beginning of a sentence.
Let's see if there is a similar indicator preceding genes, but for some variety, switch to the freshwater cyanobacterium Synechococcus PCC 7942 (abbreviation S7942B). First verify that genes of this organism start with the same triplets as the genes of ss120.
You might be able to find some sort of pattern in all those nucleotides (the human mind can find a pattern in anything), but certainly nothing jumps out as did the initial triplets.
Some statistical analysis might focus our attention on areas of interest. Suppose we built a table that looked like what you see to the right.
If there were particular biases for or against certain nucleotides at specific positions before genes, maybe the table would make them apparent. Let's find out. First, give the set of sequences a name, something like:
Not something you want to do when the weather's nice outside. Fortunately, BioBIKE can do this automatically, using a function called MAKE-PSSM-FROM. A PSSM (Position-Specific Scoring Matrix) is a table of the sort we imagined, except that frequencies instead of counts are given.
Find this function in the Bioinformatic-tools submenu of the STRING-SEQUENCES menu, giving you in the end something like the following:
You could click on the gray aligned-list argument box and type in the name upstream-sequences, but here's a more foolproof method: click on the gray box, then click on the VARIABLES menu, and finally click on upstream-sequences. That will transfer the name of the variable to the argument box with no possibility of a misspelling. Finally, execute the function. You will get a message indicating that you generated a two-dimensional table, as expected.
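If you are curious what goes into a table of this kind, here is a minimal sketch in Python (not BioBIKE, and not how MAKE-PSSM-FROM is actually implemented). It assumes the upstream sequences are all the same length and already lined up, counts each nucleotide at each position, and then divides by the number of sequences to turn counts into frequencies. The function names and the four toy sequences are made up for illustration; they are not real S7942B data.

```python
NUCLEOTIDES = "ACGT"


def count_table(aligned_seqs):
    """Count each nucleotide at each position of equal-length sequences.

    Returns a dict mapping nucleotide -> list of counts, one per position.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must all have the same length")

    table = {nt: [0] * length for nt in NUCLEOTIDES}
    for seq in aligned_seqs:
        for pos, nt in enumerate(seq.upper()):
            # Characters other than A, C, G, T (e.g. N) are simply skipped,
            # so a column containing them will sum to less than the total.
            if nt in table:
                table[nt][pos] += 1
    return table


def frequency_table(aligned_seqs):
    """Convert the count table to frequencies (count / number of sequences)."""
    counts = count_table(aligned_seqs)
    n = len(aligned_seqs)
    return {nt: [c / n for c in row] for nt, row in counts.items()}


# Hypothetical toy data, invented for this example only.
toy_upstream = ["AGGAGT", "AGGACT", "TGGAGA", "AGCAGT"]

counts = count_table(toy_upstream)
freqs = frequency_table(toy_upstream)

# One row per nucleotide, one column per position.
for nt in NUCLEOTIDES:
    print(nt, counts[nt], [round(f, 2) for f in freqs[nt]])
```

Each row of the printout corresponds to one nucleotide and each column to one position, which is the same layout as the table imagined earlier. A position where one nucleotide's frequency is far from what you see at other positions is the kind of thing worth a second look.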
PROBLEM 4:
What do you make of the results? Can you find any nucleotide frequencies at any position that stand out? Can you imagine any explanation for the pattern?
PROBLEM 5:
Examine lots of sequences before genes of S7942B. Now that you know what to look for, do you see by eye what your program detected as an aggregate? How (in principle) could you test whether the pattern is significant or your eye is just inventing it?