Obtaining a nearly complete genomic sequence is an arduous task, but reaping its rewards requires much additional work. The insights to be gained from a genomic sequence come from analyzing which genes and which other sequences are present. The tour below is intended to give a taste of the kind of analysis that is done on genomic sequences. (Descriptions of many of the resources visited in this tour, and more besides, can be found in this list.)
Gene finding
You'll remember from the tour What is a Gene? that it is not a trivial matter to find the beginning of a gene by inspection. You can't merely scan for ATGs, for example. Gene-finding programs use a more sophisticated approach: they examine the nucleotide sequences of known genes from an organism and extract sequence tendencies that can then be used to predict whether a genome segment is part of a gene. The primary tool in this analysis is the hidden Markov model, which perhaps we'll have time to discuss.
Let's look at gene finding in action. Suppose you're considering a segment of the Drosophila genome, which you can find here. Where are the genes? To get a prediction:
SQ1. How many genes are predicted by GeneMark within this segment?
To generate a listing of the segment, go into BioBIKE and upload and display the sequence in the following way:
Copy the amino acid sequence of the larger of the two proteins predicted by GeneMark. Does it have any transmembrane regions? Kyte-Doolittle hydropathy plots provide a simple and easily understood way to predict such regions. The algorithm merely totals the hydrophobicity of the amino acids within a window that slides along the length of the amino acid sequence. Go to the Kyte-Doolittle web site (provided by Malcolm Campbell to accompany his excellent book, Discovering Genomics, Proteomics, and Bioinformatics). Paste in the amino acid sequence you copied earlier, set the window size to 19, and click Submit (you might also spend a moment looking at the background info link). Positive scores represent hydrophobic regions, negative scores hydrophilic regions (you can read more about the graph at the bottom of its page).
SQ5. How do you interpret the graph?
A program called DAS (Dense Alignment Surface method) is more sophisticated in predicting transmembrane regions: it compares candidate amino acid sequences to the amino acid frequencies of known membrane-spanning proteins. Go to DAS, paste in the amino acid sequence, and click Submit. (If DAS is having a problem showing graphical output, as it sometimes does, click here as a last resort.)
SQ6. How do you interpret the graph?
Several sites, typified by SMART, provide many different kinds of information about a protein sequence. Go to that site, choose Normal mode, paste in the amino acid sequence, check the Pfam domains and Signal Peptides boxes, and click Sequence SMART. The most interesting thing in the output is the gray box marked Pfam 7tm_1, indicating that a region of high similarity was found between the amino acid sequence and a known protein family. Click on the box and then click on the 7tm_1 link to find out about the protein family.
SQ7. What kind of protein is your amino acid sequence similar to?
Sequence Features
Enough about the coding sequence.
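Before moving on, here is the sliding-window calculation behind a Kyte-Doolittle plot, spelled out as a minimal Python sketch. It is not the code used by the Kyte-Doolittle web site. The function name `hydropathy_profile` and the toy sequence are hypothetical, chosen only for illustration. The sketch reports the window average rather than the total, which gives the same curve scaled by the window size. The hydropathy values are those of the published Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return a list of (center_position, mean_hydropathy) pairs.

    One pair is produced for every full window that fits in the sequence.
    Positions are 1-based and refer to the residue at the window's center.
    Raises KeyError if the sequence contains a non-standard residue.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        return []
    values = [KD[aa] for aa in seq]
    half = window // 2
    total = sum(values[:window])          # score of the first window
    profile = [(half + 1, total / window)]
    for start in range(1, len(seq) - window + 1):
        # Slide one residue to the right: add the new one, drop the old one.
        total += values[start + window - 1] - values[start - 1]
        profile.append((start + half + 1, total / window))
    return profile


# Hypothetical toy sequence: charged flanks around a run of 20 leucines.
toy = "DKDKDKDKDK" + "L" * 20 + "DKDKDKDKDK"
profile = hydropathy_profile(toy, window=19)
peak_pos, peak = max(profile, key=lambda pair: pair[1])
print(f"highest mean hydropathy {peak:.2f} at position {peak_pos}")
```

With a window of 19, a commonly cited rule of thumb is that a sustained average above roughly 1.6 suggests a possible membrane-spanning segment. Treat that cutoff as a heuristic, not a proof.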
What about the rest of the DNA? In particular, what part of it determines the mRNA transcript? This is important in two regards. First, the coding region must lie within the mRNA transcript, so determining the transcript is a check on the predicted amino acid sequence. Second, the region immediately upstream of the start of transcription is likely to be responsible for the regulation of transcription, an important feature of the gene. To find the transcript, go back to your friend, Blast.
SQ8. The first three red bars extend over most or all of the 13 kb sequence. What are these hits and why do they extend so far?
SQ9. The fourth and fifth hits have patchy regions of similarity. Why?