Normalization is the attempt to compensate for systematic technical differences between chips, so as to see more clearly the systematic biological differences between samples. Biologists have long experience coping with technical variation between experimental conditions that is unrelated to the biological differences they seek. However, expression arrays can vary in even more ways than measures such as RT-PCR. In practice, methods that have worked well for RT-PCR and similar measures do not perform as well for microarray data, which shows systematic differences along many dimensions at once. No lab technician can reproduce exactly the same technical conditions in each assay of a long series, and researchers rarely have detailed records of the procedures, so they cannot compare all the relevant technical conditions of individual hybridizations.
The key assumption that enables us to normalize microarrays is that only a few genes are expressed at really different levels between samples (here 'a few' means up to a few hundred out of tens of thousands). Hence the expression levels of the majority of genes are essentially identical across samples. This assumption may not always hold, for example when comparing highly transformed cancer cells with normal cells. However, we need to assume something of this sort if we are to make comparisons at all, since the fluorescent intensity measures can easily be manipulated by twiddling settings on the photo-multiplier tube, and these measures are routinely affected by differing amounts of starting material from one sample to another and by differing efficiencies of label incorporation. A more specific assumption is that microarray measures should not be correlated with technical characteristics of probes, such as base content, dye bias, intensity effects, print-tip effects, Tm, or position in gene. We would hope that biological changes would be independent of the technical characteristics of expression probes.
Heraclitus said "You cannot step into the same river twice": no technician does things in exactly the same way twice. In many cases differences in the processing of two samples, particularly during cDNA synthesis, labelling and hybridization, systematically affect the signals on the arrays on which those samples' gene expression is measured. Some systematic processing differences between chips that frequently affect measures are:
In order to identify the real biological differences among samples, we attempt to compensate for the systematic technical differences in measurement. Although the aims of normalization are similar for all arrays, the issues and techniques used in normalizing expression arrays differ from those useful for other kinds of array-based assays. Traditionally, normalization has been done differently for one-color arrays (e.g. Affymetrix and Illumina) than for two-color arrays (e.g. most Agilent and Nimblegen arrays). With two-color arrays there is usually a 'within-array' normalization between the colors before an 'across-array' normalization. This guide introduces 'across-array' methods first, before discussing 'within-array' methods.
If there are no really stable genes across all samples, can we identify a set of genes stable across a particular set of samples? This question leads to an approach called 'invariant set normalization': the idea is to find empirically a set of genes which seem like the best candidates because their expression values maintain the same relation to each other across all samples. If such a group could be found, then the rest of the genes could be 'pinned' to their levels of expression. The implementation in dChip (Li & Wong, 2001) selects a reference chip and then uses pairwise comparison with that reference chip, because such an invariant subset cannot be identified across the full range of signal values, but only among the most abundant genes. This may not be a fault of the idea, but rather due to the wide range of signals obtained on Affymetrix chips for genes which are absent (up to signal intensities of 2000), in turn a reflection of the high levels of cross-hybridization of 25-mers. It may be that this idea can be useful for the other technologies, based on longer oligomers, on which non-expressed genes have uniformly low signals.
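As an illustration, the Python sketch below shows one selection pass of this idea, keeping probes whose rank barely changes between a chip and the chosen reference; the threshold and the single pass are simplifications of the iterative procedure in dChip, which also fits a smoothing curve through the selected probes.

    import numpy as np

    def rank_invariant_set(chip, reference, threshold=0.05):
        """Flag probes whose rank changes little between a chip and a reference chip.

        chip, reference: 1-D intensity vectors of equal length.
        threshold: maximum allowed rank difference, as a fraction of the number of
                   probes (an illustrative choice, not dChip's).
        Returns a boolean mask of the approximately rank-invariant probes.
        """
        n = len(chip)
        frac_rank_chip = np.argsort(np.argsort(chip)) / n
        frac_rank_ref = np.argsort(np.argsort(reference)) / n
        return np.abs(frac_rank_chip - frac_rank_ref) < threshold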
Most approaches to normalizing expression levels assume that the overall distribution of RNA levels does not change much between samples or across conditions. This seems reasonable for most laboratory treatments, although treatments affecting the transcription apparatus have large global effects, and malignant tumours often have dramatically different expression profiles. If most genes are unchanged, then the mean transcript levels should be the same for each condition. An even stronger version of this idea is that the distributions of gene abundances should be similar across samples. Statisticians call systematic errors that affect a large number of genes 'bias'. Keep in mind that normalization, like any form of data 'fiddling', adds noise (random error) to the expression measures. You never really identify the true source or nature of a systematic bias; rather you identify some feature that correlates with the systematic error. When you 'correct' for that feature, you add some error to those samples where the feature you have observed does not correspond well with the true underlying source of bias. Statisticians try to balance bias and noise; their rule of thumb is that it is better to slightly under-correct for systematic biases than to compensate fully.
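A minimal sketch of the weaker (mean-matching) version of this idea, assuming a genes-by-chips intensity matrix; the function name is mine, and a median or trimmed mean could be substituted for the mean:

    import numpy as np

    def scale_to_common_mean(X):
        """Rescale each chip (column of a genes x chips matrix) to the grand mean intensity.

        Assumes most genes are unchanged, so differences in chip means are technical.
        """
        chip_means = X.mean(axis=0)          # one mean per chip
        target = chip_means.mean()           # common target level
        return X * (target / chip_means)     # broadcasts across columns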
A key decision researchers must make, with consequences for normalization, is on what scale to analyse their data. It is common practice to transform to a logarithmic (usually base 2) scale. The principal motivation for this transformation is to make variation roughly comparable among measures which span several orders of magnitude. This often works as intended; however, such a transformation may actually increase the variation of the low-intensity probes relative to the rest. In particular, when a measure can be reported as zero, the logarithm is not defined. A simple remedy is to add a small constant to the measures before taking the logarithm. A more sophisticated approach is to use a non-linear variance-stabilizing transform; a simple such transform is f(x) = ln((x + √(x² + c²))/2), where c is the ratio of the constant portion of the variance to the rate of increase of variance with intensity.
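A sketch of that transform, with c supplied by the analyst (estimating it from replicate data is a separate problem):

    import numpy as np

    def glog(x, c):
        """Variance-stabilizing (generalized log) transform f(x) = ln((x + sqrt(x^2 + c^2)) / 2).

        Behaves like ln(x) for x much larger than c, and remains defined
        (and roughly linear) near zero, unlike the plain logarithm.
        """
        x = np.asarray(x, dtype=float)
        return np.log((x + np.sqrt(x**2 + c**2)) / 2)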
In practice there are often outliers at the top end, for example when a number of probes are saturated on one chip but not on the other. More consistent results are obtained by using a robust estimator, such as the median or a one-third trimmed mean, because these are less influenced by outliers. To do the latter, you compute the mean of the middle two-thirds of all probes in the red and in the green channel, and scale all probes to make those means equal. John Quackenbush suggested this originally, but TIGR now uses lowess – see below.
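A sketch of the trimmed-mean scaling, assuming two vectors of single-channel intensities from one two-color array; trimming one-sixth from each tail leaves the middle two-thirds:

    import numpy as np
    from scipy.stats import trim_mean

    def scale_red_to_green(red, green, tail=1/6):
        """Scale the red channel so its one-third trimmed mean matches the green channel's.

        tail is the fraction cut from each end, so tail=1/6 keeps the middle two-thirds
        and saturated probes get no weight in the scaling factor.
        """
        factor = trim_mean(green, tail) / trim_mean(red, tail)
        return np.asarray(red, dtype=float) * factor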
Most normalization algorithms, including lowess, can be applied either globally (to the entire data set) or locally (to specific subsets of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (sometimes referred to as a 'pen group', 'print-tip group' or 'sub-grid'). Local normalization has the advantage that it can help correct for systematic spatial variation in the array, including inconsistencies among the spotting pens used to make the array, variability in the slide surface, and local differences in hybridization conditions across the array. However, such a procedure may overfit the data, reducing accuracy, especially if the genes are not randomly spotted on the array; the approach assumes that the genes in any sub-grid should have average expression ratios of 1, and that several hundred probes are in each group. Another approach is to look for a smooth correction to uneven hybridization. The thinking behind this approach is that most spatial variation is caused by uneven fluid flow; flow is continuous, and hence the correction should be continuous as well. There is still no consensus about the best way to do local normalization.
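The sketch below illustrates the local strategy, using median centring within each print-tip group as a simple stand-in for the per-group lowess fit used in practice; the log-ratios and group labels are assumed inputs:

    import numpy as np

    def normalize_by_print_tip(log_ratios, tip_group):
        """Local normalization: centre log2(R/G) ratios within each print-tip group.

        log_ratios: 1-D array of log-ratios for one two-color array.
        tip_group:  per-probe label of the spotting pen (sub-grid) that printed it.
        Assumes each group's typical log-ratio should be zero (ratio of 1).
        """
        corrected = np.array(log_ratios, dtype=float)
        tip_group = np.asarray(tip_group)
        for g in np.unique(tip_group):
            mask = tip_group == g
            corrected[mask] -= np.median(corrected[mask])
        return corrected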
By 2003 statisticians were developing more complex normalizations. Some noticed pronounced differences in the loess curves fit to log-ratios in different regions of the same chip; they tried to fit separate loess curves to each set of probes produced by a common print tip of a robotically printed cDNA array. Others tried to fit two-dimensional loess surfaces over chips. Further complications included estimating a clone-order effect, and re-scaling variation within each print-tip group. In 2003, Benjamin Bolstad, one of Terry Speed's students, proposed cutting through all this complexity with a simple non-parametric normalization procedure, at least for one-color arrays. He proposed to shoe-horn the intensities of all probes on each chip into one standard distribution shape, which he determined by pooling all the individual chip distributions. In practice, the distribution of intensities from any high-quality chip will do. The algorithm maps every value on any one chip to the corresponding quantile of the standard distribution; hence the method is called quantile normalization. This simple 'between-chip' procedure worked as well as most of the more complex procedures then current, and certainly better than the regression method, which was then the manufacturer's default for Affymetrix chips. It was also made the default in the affy package of Bioconductor, which has become the most widely used suite of freeware tools for microarrays (see www.bioconductor.org). For all these reasons quantile normalization has become the normalization procedure that I see most often in papers.
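A minimal sketch of the quantile procedure on a genes-by-chips matrix, pooling the sorted columns into the reference distribution; ties are ignored here, whereas production implementations average them:

    import numpy as np

    def quantile_normalize(X):
        """Quantile normalization of a genes x chips intensity matrix.

        Each chip's values are replaced by the corresponding quantiles of a
        'standard' distribution obtained by averaging the sorted columns.
        """
        ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # rank of each probe within its chip
        reference = np.sort(X, axis=0).mean(axis=1)         # pooled standard distribution
        return reference[ranks]                             # map each value to its reference quantile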
In a formula, the transform is x′ = F⁻¹(G(x)), where G is the empirical distribution function of intensities on the chip being normalized and F is the reference (pooled) distribution function. While quantile normalization is a simple, fast, one-size-fits-all solution, it engenders some problems of its own. For example, the genes in the upper range of intensity are forced into the same distribution shape; such shoe-horning reduces biological differences as well as technical differences. A recent adjustment to the quantile procedure in the latest versions of the affy package fixes that problem. A second issue is more subtle. For reasons that are still not entirely clear, the errors in different sets of probes are highly correlated. For probes for genes that are in fact not expressed in the samples under study, these correlated errors comprise most of the variation among chips. When quantile normalization acts on these probes, the procedure preserves this apparent but entirely spurious correlation among low-intensity probes, and sometimes seems to amplify it. Hence sophisticated data mining methods that depend on subtle analysis of correlations may pick up spurious relationships. Finally, quantile normalization explicitly depends on the idea that the distribution of gene expression measures does not change across the samples. This assumption is unlikely to be true when testing treatments with severe effects on the transcription apparatus or when studying cancer samples with severe genomic aberrations.
Researchers have observed that changes in the sample preparation environment, such as a different technician, a new batch of arrays, or a new hybridization station, can make a significant difference in the measures. However, these kinds of data are often not available to the data analyst, and surely there are other factors, not tracked, which could make a difference. What if the analyst could infer these covariates from the data itself? This is the basis of two proposals, with very different algorithms.
Mike West proposed selecting control genes, which should not change among samples, and then doing a multivariate analysis of these controls to identify covariates that influence many gene expression measures.
(Leek & Storey, 2007) proposed using multivariate structural analysis of residuals to infer some sample covariates with significant effect on many gene expression measures. They first fit the design model and then perform a singular value decomposition (SVD) of the residual matrix.
A related proposal by Reimers (2010) also works from an SVD of residuals; however, it recovers the subspace corresponding to characteristic technical variation.
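A sketch of the first step these proposals share, fitting the design by least squares and taking an SVD of the residuals; deciding which components reflect technical rather than biological variation, the crux of both methods, is not shown:

    import numpy as np

    def residual_svd(Y, X, n_factors=2):
        """Infer candidate hidden covariates from the residuals of a design fit.

        Y: genes x samples expression matrix;  X: samples x covariates design matrix.
        Fits the design to every gene by least squares, then takes an SVD of the
        residual matrix; the leading right singular vectors are candidate covariates.
        """
        beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)       # fit design to each gene
        residuals = Y - (X @ beta).T                          # genes x samples residuals
        U, s, Vt = np.linalg.svd(residuals, full_matrices=False)
        return Vt[:n_factors]                                 # one candidate covariate per row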