As microarrays have become more comprehensive and more reliable, the number of genes detected as differentially expressed in many studies has exceeded most researchers' capacity to interpret. It is often the case that crucial genes show relatively modest changes; furthermore many genes selected are poorly annotated. One approach to aid interpretation is to look for changes in many genes with a common function. To do this researchers leverage gene sets available from Gene Ontology or curated databases of pathways such as BioCarta. The idea is that genes in a gene group are carrying out some co-ordinated function; if many of these genes in a group are changed in a co-ordinated way between the conditions under study, then it is plausible that this function or pathway plays a major role in the biology of the difference between conditions.
Three kinds of approaches have been developed. The earliest and simplest takes the list of genes called differentially expressed by a gene-wise test such as the t-test and asks, using a cross-classification procedure, whether any of the functional groups under consideration is over-represented in that list. This kind of discrete (categorical) procedure is often an easy first step in analysis. However, Mootha et al. noted that biologically meaningful co-ordinated changes may not include many genes whose individual changes are large enough to achieve statistical significance after correction for multiple comparisons. Furthermore, procedures on continuous variables are generally more powerful than procedures on discretized variables. Hence a variety of procedures have been devised that combine continuous test values (such as t-scores) for individual genes. A third approach builds on the idea that the most biologically significant co-ordinated changes are those that run counter to the normal co-ordinated variation. This idea forms the basis of several procedures based on multivariate analysis of gene set co-expression.
The discrete approach classifies the genes into two sets: those that are significantly changed (or significantly changed in a specific direction) between conditions, and the rest. Genes are then cross-classified by membership in a particular gene set of interest, and the departure from independence is tested using a standard χ² test or the more accurate Fisher’s Exact Test. If many more genes are both differentially expressed and in the gene set than would be expected if the two classifications were independent, then the function underlying that gene set is probably important in the difference between the two conditions.
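The cross-classification can be sketched in a few lines of Python. This is a minimal illustration, assuming SciPy is available; the function name `overrepresentation_test` and the counts in the usage example are hypothetical, and the one-sided alternative is chosen here because the question is specifically about over-representation.

```python
from scipy.stats import fisher_exact


def overrepresentation_test(n_set_de, n_set_not_de, n_other_de, n_other_not_de):
    """One-sided Fisher's exact test on the 2x2 gene cross-classification.

    Rows: in the gene set / not in the gene set.
    Columns: differentially expressed (DE) / not DE.
    """
    table = [[n_set_de, n_set_not_de],
             [n_other_de, n_other_not_de]]
    # alternative="greater" asks whether DE genes are over-represented in the set
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value


# Hypothetical counts: 10,000 genes in total, 500 called DE,
# a gene set of 50 genes of which 12 are DE.
odds_ratio, p_value = overrepresentation_test(12, 38, 488, 9462)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")
```

In this hypothetical table, independence would predict about 2.5 DE genes in the set (50 × 500 / 10,000), so 12 is a clear excess. In practice the test is repeated for every gene set, so the resulting p-values need their own multiple-comparisons correction.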
The basic idea of the continuous approaches is still that the test statistics of genes within the functional grouping that is causing the change will tend to be more extreme than those of genes outside that gene set; however, we don't know a priori where to set a threshold. These approaches provide a statistic for whether the gene-level test statistics are crowded toward one extreme (or the other); they differ in what assumptions they make and in exactly how they assess crowding.
In 2003 researchers at the Broad Institute introduced the Gene Set Enrichment Analysis (GSEA) procedure. It uses the Kolmogorov-Smirnov (K-S) test of distribution equality to compare the distribution of test statistics (e.g. t-scores) for the genes in each gene set with the distribution for all other genes. A set of genes whose t-scores appear to be distributed differently from those of the remaining genes likely represents a function that has changed between conditions.
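The comparison described above can be illustrated with an ordinary two-sample K-S test. This is a sketch of the underlying idea only, assuming NumPy and SciPy; it is not the GSEA software itself, and the function name `ks_gene_set_test` and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_gene_set_test(t_scores, in_set):
    """Two-sample K-S test: t-scores inside the gene set vs. all other genes."""
    t_scores = np.asarray(t_scores, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    result = ks_2samp(t_scores[in_set], t_scores[~in_set])
    return result.statistic, result.pvalue


# Simulated example: 2,000 genes with null t-scores, 40 of which
# belong to a hypothetical gene set whose scores are shifted upward.
rng = np.random.default_rng(0)
t_scores = rng.normal(size=2000)
in_set = np.zeros(2000, dtype=bool)
in_set[:40] = True
t_scores[in_set] += 1.0

statistic, p_value = ks_gene_set_test(t_scores, in_set)
print(f"K-S statistic = {statistic:.3f}, p = {p_value:.3g}")
```

The K-S statistic is the largest gap between the two empirical cumulative distributions, so it responds to any difference in shape, not only to a shift in the mean.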
In 2005 a simpler approach appeared in the form of Parametric Analysis of Gene Set Enrichment (PAGE). This test assumes Normality and is more sensitive than GSEA when that assumption holds, because it is easier to detect that two Normal distributions have different means than to detect any one of the many possible changes of distribution; by the same token it is also more limited than GSEA. The PAGE test statistic is essentially a z-score: $Z = (m_G - m)\sqrt{n_G} / S_{all}$, where $m$ is the mean of all fold changes, $m_G$ is the mean of the fold changes of the genes in group G, $n_G$ is the number of genes in group G, and $S_{all}$ is the standard deviation of all fold changes. The factor $\sqrt{n_G}$ scales the difference by the standard error of a mean of $n_G$ values, so that Z can be referred to the standard Normal distribution.
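A z-score of this kind is straightforward to compute. The sketch below assumes NumPy and SciPy, and the function name `page_z_score` is hypothetical. It multiplies the difference in means by the square root of the gene-set size, i.e. it divides by the standard error of a mean of that many values; this scaling is an assumption made here so that the result can be compared with a standard Normal distribution.

```python
import numpy as np
from scipy.stats import norm


def page_z_score(fold_changes, in_set):
    """PAGE-style z-score for one gene set, with a two-sided Normal p-value."""
    fold_changes = np.asarray(fold_changes, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)

    m_all = fold_changes.mean()            # mean of all fold changes
    s_all = fold_changes.std(ddof=1)       # standard deviation of all fold changes
    m_g = fold_changes[in_set].mean()      # mean fold change within the gene set
    n_g = int(in_set.sum())                # number of genes in the gene set

    # Assumed scaling: standard error of a mean of n_g values is s_all / sqrt(n_g)
    z = (m_g - m_all) * np.sqrt(n_g) / s_all
    p_two_sided = 2.0 * norm.sf(abs(z))
    return z, p_two_sided


# Small hypothetical example: eight genes, the last four form the gene set.
z, p = page_z_score([0, 0, 0, 0, 2, 2, 2, 2],
                    [False, False, False, False, True, True, True, True])
print(f"z = {z:.3f}, p = {p:.3g}")
```

Because only a mean and a standard deviation are needed, the calculation is far cheaper than permutation-based alternatives, at the cost of relying on the Normal approximation.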
Multivariate statistical procedures that account for the covariation among the different measures are generally more powerful than those that do not. One such procedure is Hotelling’s T-squared method, which compares the sizes and directions of the differences in gene means between two conditions to the sizes and directions of normal variation within groups (think of homeostatic compensatory changes). In its standard two-sample form the test statistic is $T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^{\top} W^{-1} (\bar{x}_1 - \bar{x}_2)$, where $\bar{x}_1$ and $\bar{x}_2$ are the vectors of mean expression of the genes in the set under the two conditions, $n_1$ and $n_2$ are the numbers of samples in each condition, and $W$ is the pooled within-group sample covariance matrix. Directions of little variation within groups correspond to small eigenvalues of $W$, which in turn correspond to large eigenvalues of $W^{-1}$. Thus a difference in means between the two groups in a direction in which little normal variation occurs counts for more than a difference that could easily occur within groups. For example, if two genes are tightly correlated within samples from the same condition, but between conditions one gene changes up and the other changes down, that is a difference unlikely to arise by normal variation, even if the magnitudes of the individual changes are not extreme.
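The two-gene example can be made concrete with a short sketch of the standard two-sample Hotelling's T-squared statistic. It assumes NumPy and SciPy; the function name `hotelling_t2` and the toy data are hypothetical. The sketch also assumes there are more samples than genes in the set, since otherwise the pooled covariance matrix cannot be inverted.

```python
import numpy as np
from scipy.stats import f as f_dist


def hotelling_t2(x1, x2):
    """Two-sample Hotelling's T-squared; x1, x2 are (samples x genes) arrays."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, p = x1.shape
    n2 = x2.shape[0]

    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # Pooled within-group covariance matrix W (atleast_2d handles a single gene)
    c1 = np.atleast_2d(np.cov(x1, rowvar=False))
    c2 = np.atleast_2d(np.cov(x2, rowvar=False))
    w = ((n1 - 1) * c1 + (n2 - 1) * c2) / (n1 + n2 - 2)

    # T^2 = n1*n2/(n1+n2) * diff' W^{-1} diff, using solve() instead of inverting W
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(w, diff)

    # Under multivariate Normality a rescaled T^2 follows an F distribution
    df2 = n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    p_value = f_dist.sf(f_stat, p, df2)
    return t2, p_value


# Two genes that are tightly correlated within a condition (toy data).
base = np.array([[0, 0.0], [1, 1.1], [2, 1.9], [3, 3.1], [4, 3.9]])
same_direction = base + np.array([1.0, 1.0])       # both genes go up
opposite_direction = base + np.array([1.0, -1.0])  # one up, one down

t2_same, _ = hotelling_t2(base, same_direction)
t2_opposite, _ = hotelling_t2(base, opposite_direction)
print(f"T2 (same direction) = {t2_same:.2f}, T2 (opposite) = {t2_opposite:.2f}")
```

Both shifts have the same magnitude for each gene individually, yet the shift that runs against the within-group correlation yields the much larger statistic.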