Repeated sequences in bacterial genomes
Ultra-high frequency oligonucleotides in cyanobacterial genomes (2014-)


Elhai J (2015). Highly Iterated Palindromic Sequences (HIPs) and Their Relationship to DNA Methyltransferases. Life 5:921-948.

Elhai J (2018). Superabundant HIP1 and other oligomers in cyanobacterial genomes: A mechanism for their gain and loss. 16th International Symposium on Phototrophic Prokaryotes, Aug 2018. pp.56-57.

   
Some short DNA sequences are grossly underrepresented, while others appear far more frequently than would be expected by chance. The winner amongst bacteria is the DNA sequence GCGATCGC, called HIP1 (for Highly Iterated Palindrome), which occurs more frequently than any other reported sequence of at least 8 nucleotides, often over 100 times more than predicted by chance (Figure 1). It is found in almost all cyanobacterial genomes except those of the closely related small marine picocyanobacteria (Group 7).
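To make "more than predicted by chance" concrete, here is a minimal sketch of one way to compare an observed oligomer count with a chance expectation. The null model is an assumption made for illustration (bases drawn independently at the genome's own base frequencies), and the function names are hypothetical; the cited papers may use a different expectation. Because HIP1 is a palindrome, counting on one strand counts each site once.

```python
from collections import Counter


def count_overlapping(genome: str, oligo: str) -> int:
    """Count occurrences of oligo in genome, allowing overlapping matches."""
    count = 0
    start = genome.find(oligo)
    while start != -1:
        count += 1
        start = genome.find(oligo, start + 1)
    return count


def expected_by_chance(genome: str, oligo: str) -> float:
    """Expected count under an assumed null model: independent bases drawn
    at the genome's own base frequencies."""
    n = len(genome)
    if n < len(oligo):
        return 0.0
    freqs = Counter(genome)
    p = 1.0
    for base in oligo:
        p *= freqs[base] / n  # a base absent from the genome gives p = 0
    return p * (n - len(oligo) + 1)  # number of possible start positions


def enrichment(genome: str, oligo: str) -> float:
    """Observed / expected ratio (infinite if the expectation is zero)."""
    expected = expected_by_chance(genome, oligo)
    observed = count_overlapping(genome, oligo)
    return observed / expected if expected else float("inf")


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genome data.
    toy = "AAGCGATCGCTTGCGATCGCAA"
    print(count_overlapping(toy, "GCGATCGC"), enrichment(toy, "GCGATCGC"))
```

On a real genome, the sequence would be read from a FASTA file and uppercased before counting.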

Apart from strains in Group 7, there are rare cyanobacteria that do not have highly abundant HIP1 sequences. Many of these have a different highly abundant sequence (e.g. strains b, d, and k in Figure 1). Some don't have any highly abundant sequence. Those in the latter class are all strains in obligate symbiotic relationships.

The genomes without HIP1 but with a different repeated oligomer are highly informative. All genomes that have a highly repeated oligomer possess at least one modifying enzyme that methylates a cytosine residue within that oligomer, whether it is GCGATCGC (HIP1) or one of the alternate HIPs: GGCGCC, rCCGGy, or GC[G/C]GC. The correlation is so striking that it must be telling us something important.
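The alternate HIPs include degenerate positions, so counting them needs pattern matching, not exact string search. The sketch below assumes the usual IUPAC reading of the notation (r = A or G, y = C or T, and [G/C] written as S); the dictionary and function names are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the motifs named in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",  # purine
    "Y": "[CT]",  # pyrimidine
    "S": "[GC]",  # strong (G or C)
}

# Motifs from the text, rewritten with IUPAC codes (assumed reading).
HIP_MOTIFS = {
    "HIP1": "GCGATCGC",
    "GGCGCC": "GGCGCC",
    "rCCGGy": "RCCGGY",
    "GC[G/C]GC": "GCSGC",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate an IUPAC motif into a regex; the lookahead lets
    overlapping sites be found."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def count_motif(genome: str, motif: str) -> int:
    """Count (possibly overlapping) matches of a degenerate motif."""
    pattern = motif_to_regex(motif)
    return sum(1 for _ in pattern.finditer(genome.upper()))


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genome data.
    toy = "ACCGGTGCCGGCGCGATCGC"
    for name, motif in HIP_MOTIFS.items():
        print(name, count_motif(toy, motif))
```

All four motifs are palindromic when their degenerate positions are taken into account, so scanning one strand is enough to find every site.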


Figure 1. Counts of HIP1 and other oligomers in genomes of cyanobacteria. The counts for HIP1 are shown in red, as well as the counts for the most numerous oligomer, if not HIP1. The letters and phylogenetic groups are explained in a tree given in Elhai (2015) and Elhai (2018).

The observations make sense if we postulate that all cyanobacteria (except Group 7) possess a G[Me]C-specific, methyl-directed mismatch repair system, i.e. a system that acts during DNA replication to determine which of the two strands is newly synthesized and thus the strand to repair if a mismatched nucleotide pair is detected. Elhai (2018) describes how a methyl-directed mismatch repair system has been shown to work in E. coli.

Figure 2 shows how this system could lead to high-frequency sites. In the left panel, TCGATCGT is shown methylated, because most cyanobacteria possess a CGATCG-specific methylase, but the methylated C does not participate in methyl-directed mismatch repair, because it is not part of a G[Me]C pair. If the 5' T is mutated to a G, a G[Me]C pair is created, which directs the degradation of the opposite strand and thereby makes the mutation to G permanent. Only a mutation to G at this position creates these conditions. In the same way, G[Me]C-dependent mismatch repair would bias mutation towards the creation of the other methylase recognition sites.
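The claim that only a mutation to G creates the required conditions can be checked by enumeration. This is a minimal sketch under two assumptions that go slightly beyond the text: "G[Me]C pair" is read as a G immediately 5' of the methylated C, and the methylated C is the one directly 3' of the mutated position. The names are hypothetical, and the sketch does not simulate repair itself.

```python
SITE = "TCGATCGT"        # starting sequence from the left panel of Figure 2
MUTATED_POSITION = 0     # the 5' T
METHYLATED_C_INDEX = 1   # assumption: the C directly 3' of the mutated base


def creates_gmec(seq: str, methylated_c_index: int) -> bool:
    """True if the methylated C is immediately preceded by a G,
    i.e. the sequence contains a G[Me]C pair (assumed reading)."""
    return (
        methylated_c_index > 0
        and seq[methylated_c_index] == "C"
        and seq[methylated_c_index - 1] == "G"
    )


def substitutions_creating_gmec(site: str, position: int,
                                methylated_c_index: int) -> dict:
    """For each possible substitution at `position`, report whether the
    mutant sequence gains a G[Me]C pair."""
    results = {}
    for base in "ACGT":
        if base == site[position]:
            continue  # not a mutation
        mutant = site[:position] + base + site[position + 1:]
        results[base] = creates_gmec(mutant, methylated_c_index)
    return results


if __name__ == "__main__":
    print(creates_gmec(SITE, METHYLATED_C_INDEX))
    print(substitutions_creating_gmec(SITE, MUTATED_POSITION, METHYLATED_C_INDEX))
```

Under these assumptions the unmutated site has no G[Me]C pair, and of the three possible substitutions at the 5' position only T to G produces one, giving GCGATCGT as the first step towards GCGATCGC.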

There is currently no biochemical evidence for or against the existence of G[Me]C-dependent mismatch repair in cyanobacteria. Demonstrating its existence would constitute strong support for the model.


Figure 2. Model to explain the high abundance of HIP sites. Blue DNA = old strand; red DNA = newly synthesized strand; green diamonds = methylation; * = mutation. The left panel shows the first step in creating GCGATCGC (HIP1) from a CGATCG site. The right panel shows the creation of an rCCGGy site.