An Opinionated Guide to Microarray Data Analysis

Quality Assessment

Checking data quality is tedious, but even the most carefully done experiments usually include one or two bad chips, and quality assessment (QA) often makes a big difference to the results of a study. If you've invested effort in generating data, it's painful to think that some of the data may not be good, and most array data analysis programs don't give you much reason to doubt the quality of your arrays. Why open a can of worms? Because if you dig up the dirt on your chips and find the problems up front, you may avoid digging a hole for your research by trying to interpret bad data.

Wet Lab Quality Checks

Many researchers check RNA quality and dye incorporation before hybridizing the samples onto arrays. Between the time that a sample is taken and the time the RNA is extracted and purified, enzymes in the cell rapidly degrade mRNA by cutting it into shorter pieces. Most of these shorter pieces will hybridize easily to several different probes, so the signals from many probes reflect the abundances of several transcripts, not just their targets. One way to detect degraded RNA is to examine the two most abundant types of RNA, the 18S and 28S ribosomal RNAs. If the ribosomal RNAs are mostly intact, they form two sharp peaks as the total RNA is run through a gel. This may also be done with a commercial tool such as the Agilent BioAnalyzer. A BioAnalyzer trace from good quality RNA is shown at left.

Since the signal from a probe depends on the amount of label in the molecules attached to that probe, it makes sense to check how well the label is incorporated in the sample. In practice the amount of label in different samples varies, especially for the red Cy5 dye. A commercial product to measure how much label is incorporated in the sample is the NanoDrop Probe. The amount of label may not be stable. Microarray technicians have observed that the Cy5 label seems to perform poorly in hot, humid summers. Researchers from Agilent confirmed that even moderate levels (5 ppb) of ozone can degrade Cy5 while not affecting Cy3. Such effects can dramatically change the ratios on two-color slides from winter to summer. Some labs near expressways have built 'clean rooms' where air is treated to prevent ozone and humidity from entering.

Spot Level Quality Checks

In the early days most microarrays were printed by robotic pipettes from 96-well plates containing cDNA clones. This process rarely worked perfectly; it was common to see spots that were badly formed or to see fluorescent material spread over a large area. This is not so much a problem with modern commercial arrays. If you are working with spotted arrays, you might want to look at the Appendix on spot-level control.

Elementary Data-Driven Quality Checks

Controls

If the sample RNA and the labelling pass the wet lab quality checks, then further information about the process of hybridization comes from the controls. There is no excuse for chips without a well-designed set of negative and positive control probes. Negative controls are probes designed for DNA sequences that should not be present in the sample. Positive controls are replicate probes for sequences that should occur; often these are in fact abundant. Both positive and negative controls should be distributed over the chip. Spike-in controls are probes that match transcripts that do not occur in the sample, but are added (in known amounts) to the samples before hybridization or, in some cases, before labelling. Most manufacturers have included a variety of negative controls; some include spike-in controls. Agilent includes some positive controls. Unfortunately Illumina's expression systems had poor controls up to 2010.

The signal from negative controls gives an idea of the background in all signals due to non-specific hybridization. Therefore in a good chip the negative controls should all report low signal, and this low value should be fairly uniform (i.e. it should not show any pronounced spatial pattern); however different negative control probes from different genes will typically have somewhat different means, because the probes have different intrinsic properties and thermodynamics. Generally you won't be able to estimate reliably the abundances of those genes whose signals are comparable to the signals from negative controls, even if the signals from those probes are above the local surrounding background (see Image Analysis).
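As a rough illustration of the last point, one could derive a detection cutoff from the negative controls and flag probes that do not clear it. This is only a sketch: the rule "mean plus two standard deviations", the function names, and the intensities are all assumptions made for the example, not a recommendation from any manufacturer.

```python
import statistics

def detection_threshold(negative_signals, n_sd=2.0):
    """Cutoff = mean + n_sd * SD of the negative-control signals (assumed rule)."""
    mean = statistics.mean(negative_signals)
    sd = statistics.stdev(negative_signals)
    return mean + n_sd * sd

def flag_undetected(probe_signals, threshold):
    """Map probe id -> True when its signal is not above the cutoff."""
    return {pid: s <= threshold for pid, s in probe_signals.items()}

# Made-up intensities, for illustration only.
negatives = [48.0, 52.0, 50.0, 47.0, 53.0]
probes = {"geneA": 51.0, "geneB": 400.0}

cutoff = detection_threshold(negatives)
undetected = flag_undetected(probes, cutoff)
```

Here geneA sits inside the spread of the negative controls and is flagged, while geneB is well clear of it. Because different negative-control probes have somewhat different means, a per-chip cutoff like this is crude; it is a screen, not an estimate of abundance.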

Positive controls give some idea of the dynamic range and of the spatial variation in hybridization. Probes for the same gene should show fairly uniform intensities across the chip. If the positive controls are very different from their average in some region, it is worth taking a closer look, and perhaps discarding all signals from that region. It is common to see spatial gradients in intensity, and sometimes in ratio. In the old days two-color slides were placed on lab counters during hybridization; lab counters are not precisely level, and sometimes slightly more of the sample is present at one end of the slide than the other. The differences are small and would seem not to matter, but the balance of processes is delicate, and the consequence of such small differences is that one end of the chip is brighter than the other. Because of saturation issues, sometimes the log ratios show a similar gradient.
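One simple way to put a number on such a gradient is to fit a straight line to the log intensities of replicate positive-control probes against their position along the slide. The sketch below assumes you know each replicate's position; the coordinates, intensities, and function name are invented for the example.

```python
import math

def intensity_gradient(positions, intensities):
    """Least-squares slope of log2(intensity) against position."""
    logs = [math.log2(v) for v in intensities]
    n = len(positions)
    mean_x = sum(positions) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in positions)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(positions, logs))
    return sxy / sxx

# Made-up replicates of one positive control, by position along the slide (mm).
x_mm = [0.0, 10.0, 20.0, 30.0, 40.0]
signal = [1000.0, 1100.0, 1250.0, 1400.0, 1600.0]

slope = intensity_gradient(x_mm, signal)             # log2 units per mm
fold_end_to_end = 2 ** (slope * (max(x_mm) - min(x_mm)))
```

A fold change from one end of the slide to the other that is well away from 1 suggests the kind of tilt described above. How large a gradient is tolerable depends on the experiment, so no cutoff is built in.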

Spike-in controls give some idea of the accuracy and linearity of the signals. Some transcripts are added in ratios of 3:1 or 10:1 to the two samples. Typically one sees that the ratios as reported are squashed, and sometimes that the low intensity spike-ins show different ratios than the majority of spike-ins. In the early days of microarrays it was fairly hard to control spike-in amounts precisely, and they rarely worked exactly as expected.
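The 'squashing' of ratios can be summarized by regressing the observed log ratios of the spike-ins on their expected log ratios: a slope of 1 means the ratios are reported faithfully, and a slope below 1 means they are compressed. The following is a minimal sketch with invented spike-in values; the function name is hypothetical.

```python
import math

def compression_slope(expected_ratios, observed_ratios):
    """Least-squares fit of observed log2 ratio on expected log2 ratio.

    Returns (slope, intercept); a slope below 1 indicates compressed ratios.
    """
    xs = [math.log2(r) for r in expected_ratios]
    ys = [math.log2(r) for r in observed_ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up spike-ins: nominal ratios of 1:10, 1:3, 1:1, 3:1 and 10:1.
expected = [1 / 10, 1 / 3, 1.0, 3.0, 10.0]
observed = [1 / 5, 1 / 2, 1.0, 2.0, 5.0]

slope, intercept = compression_slope(expected, observed)
```

Fitting the low-intensity spike-ins separately from the rest is a natural extension, since those are the ones that often behave differently.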

Recently NIST has put in considerable effort to standardize spike-in procedures and to ensure that a uniform subset of spike-ins is available on every array via their External RNA Controls Consortium. As of 2010 many manufacturers will start including these standard controls.

Statistical Quality Assessments: Variation in relation to technical variables

Data analysts often worry that differences in the measures they are analyzing reflect some artifact of the measurement process rather than true biological differences. This worry is often well-founded. To satisfy themselves that this isn't the case, statisticians like to plot their measures against known technical variables that they think might affect the measures. Traditionally these are variables such as the technician, the date, or the batch of reagents. With microarrays no one wants to plot such pictures for thousands of genes. A simpler approach is to consider each sample and plot the measures against technical characteristics of the probes.
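A numerical stand-in for such a plot is to sort the probes by a technical characteristic, cut them into equal-sized bins, and look at the median measure in each bin. The sketch below uses probe GC content as the characteristic; that choice, the function name, and the data are assumptions made for illustration.

```python
import statistics

def binned_medians(covariate, values, n_bins=4):
    """Median of `values` within equal-count bins of `covariate`.

    Returns a list of (median covariate, median value), one pair per bin.
    """
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    size = len(order) / n_bins
    result = []
    for b in range(n_bins):
        idx = order[round(b * size): round((b + 1) * size)]
        if idx:
            result.append((statistics.median(covariate[i] for i in idx),
                           statistics.median(values[i] for i in idx)))
    return result

# Made-up probe GC fractions and log ratios for one sample.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
log_ratio = [-0.4, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 0.5]

trend = binned_medians(gc, log_ratio)
```

If most genes are not changing, the bin medians should hover around zero; a steady drift across bins, as in this invented sample, points to a technical effect rather than biology.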

Residuals

Saturation and quenching differences

Thermodynamic differences

Differences in amplification

Cross-hybridization differences

Spatial variation over a chip

In data from poorly functioning hybridization stations one often observes uneven signal and high background around the inlet ports; it seems the turbulent fluid affects the hybridization reaction. One should discard signals from the affected regions, and if this uneven pattern extends for a long way it is better to discard the chip.
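To find affected regions without inspecting the image by eye, one can divide the chip into blocks and compare each block's median log signal (or log background) with the chip-wide value. This is a sketch only: the block size, the one-log2-unit tolerance, and the function names are arbitrary choices for the example.

```python
import statistics

def block_medians(spots, block_size):
    """spots: iterable of (row, col, value). Returns {(block_row, block_col): median}."""
    blocks = {}
    for row, col, value in spots:
        key = (row // block_size, col // block_size)
        blocks.setdefault(key, []).append(value)
    return {key: statistics.median(vals) for key, vals in blocks.items()}

def suspect_blocks(medians, max_log2_dev=1.0):
    """Blocks whose median differs from the median of all block medians
    by more than max_log2_dev (an assumed tolerance)."""
    overall = statistics.median(medians.values())
    return sorted(k for k, m in medians.items() if abs(m - overall) > max_log2_dev)

# Made-up 4 x 4 chip of log2 background values; the top-left corner is elevated.
spots = [(r, c, 9.0 if r < 2 and c < 2 else 7.0)
         for r in range(4) for c in range(4)]

medians = block_medians(spots, block_size=2)
suspects = suspect_blocks(medians)
```

If the suspect blocks cover only a corner near an inlet port, discarding those signals may be enough; if they cover much of the chip, discarding the chip is the safer choice.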

References

Effects of Atmospheric Ozone on Microarray Data Quality. Thomas L. Fare, Ernest M. Coffey, Hongyue Dai, Yudong D. He, Deborah A. Kessler, Kristopher A. Kilian, John E. Koch, Eric LeProust, Matthew J. Marton, Michael R. Meyer, Roland B. Stoughton, George Y. Tokiwa, and Yanqun Wang. Anal. Chem., ASAP Article 10.1021

Appendix: Quality Assessment of Individual Probes on Spotted Arrays

If you are using a custom spotted cDNA array you may want to filter your spots individually. If you are using printed or synthesized arrays from a major manufacturer, this section won't be relevant. Spot-level QC detects mostly printing problems rather than hybridization anomalies. Most image quantification programs flag spots that fail their internal QC measures; it is rarely a good idea to keep spots that have been flagged. You may want to do further QC of individual spots based on several other measures reported by the image processing program (GenePix and Quantarray give many). Commonly reported measures include spot area, foreground uniformity (standard deviation of the foreground), and background uniformity.

It is not practical to examine thousands of spots individually; an automated filtering procedure is needed. However, filtering criteria that are useful for one experiment are often too slack or too strict for the next; there are no rules about spot size or background that apply across the board to all chips under all circumstances. A sensible thing to do before filtering is to examine the distribution of the various measures across the chips for each new experiment. Then identify the ‘normal’ range for each of these variables, and decide what is unacceptable. Then discard (or down-weight) all those spots that fall outside the ‘normal’ range. This is best done in collaboration with a good core facility.

The printer often drops small amounts of probe elsewhere than intended. This becomes a problem if a spatter of probe for a highly expressed gene lands on a probe for a faint gene; then the signal from both channels reflects the abundant gene, rather than the gene that is annotated at that position. Another type of problem is spot formation: printers aim to deliver fairly round, evenly sized spots. When they fail, printed clones may flow into each other. So in practice it causes trouble to use data from extremely small or extremely large spots, or from those that are very irregular. Further measures you might use in batch filtering depend on the level of noise in the image, and the uniformity of the color ratios.
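One way to turn "examine the distribution, then pick a normal range" into a routine is to take the median of a measure and allow a few median absolute deviations on either side. The sketch below applies this to spot area; the multiplier of 3, the function names, and the areas are assumptions for the example, and the range should still be checked against a histogram for each new experiment.

```python
import statistics

def normal_range(values, k=3.0):
    """Median +/- k * MAD, a robust range that extreme spots barely influence."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med - k * mad, med + k * mad

def outside_range(values, low, high):
    """Indices of the spots whose measure falls outside [low, high]."""
    return [i for i, v in enumerate(values) if v < low or v > high]

# Made-up spot areas in pixels: one spatter (12) and one merged spot (240).
areas = [78, 80, 82, 79, 81, 80, 12, 240, 83, 77]

low, high = normal_range(areas)
flagged = outside_range(areas, low, high)
```

The median and MAD are used rather than the mean and standard deviation because the very spots being screened for would otherwise stretch the range that is supposed to exclude them.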

Figure 1. Section of cDNA image: some spots run into each other; these spots have excessively large areas.

The area criterion is the easiest to apply and understand. Spots whose size is only a few tens of pixels are much more likely to be scatterings of bright probe; extremely large spots are likely to be mingled with their neighbors.

Figure 2. A plot of one quality score as a function of diameter, for a grid where the intended diameter is 100 microns, and the inter–spot distance is 200 microns.

The uniformity criteria are perhaps the most complex, because there are so many options and no underlying explanations for variation. High foreground variability is obviously a problem: it makes it very hard to be confident about the real ratio. However, it is not clear from first principles what variation is acceptable, and different chips have very different distributions of the uniformity measures. Some programs give the red–green correlation, which is a direct measure of the replicability of the ratio measures; values less than 0.8 should be down-weighted or discarded.

Usually one has to decide based on indirect measures of uniformity. Most programs give both a mean and a median for a spot. If the spot has a reasonable distribution of pixels, the mean and median should be similar. If they are quite different, something strange is happening, such as a droplet. We accept spots if the mean μ and the median m* differ by less than 15%: |μ - m*| < 0.15(μ + m*)/2.

Many image quantification programs now give standard deviations for the foreground and background in both channels. A reasonable criterion is to accept a spot if the foreground is well above the background noise: μ_fg > μ_bg + 2σ_bg. Unfortunately this often fails for a majority of spots on some chips. At this point it is not clear whether these chips are really poor, or whether the criterion is too strict.

Some chips feature duplicate probes for each gene. We use a 15% criterion there also: accept if |μ1 - μ2| < 0.15(μ1 + μ2)/2. There may be some point in doing a spatial normalization before applying this quality control criterion.

It is simplest to set up criteria as filters, and to exclude spots that fail any quality criterion at a certain threshold. However, in practice few spots may pass all criteria, even with reasonable thresholds for each. Some groups use a composite score. Wang et al. (2002) construct quality measures q₁, q₂, q₃, and q₄, based on area, signal-to-noise, background level, and variability; they define a composite score q* = (q₁q₂q₃q₄)^(1/4), and reject a spot if q* < 0.8. The threshold of 0.8 is somewhat arbitrary, although spots in their arrays with q* ~ 0.5 have twice the random variation of those with q* > 0.8.

In principle, most quality measures are continuous, and while there are obvious outliers, there is no clear-cut threshold. A better procedure than filtering would be to down-weight probe signals in further analysis, based on quality score. This poses a practical problem for most people, since it is difficult to use weight information in packaged software, although it is easy to adapt hand-coded R routines to weighted signals.
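The criteria in this appendix are easy to write down as small functions. The sketch below covers the three accept rules, the composite score as a geometric mean, and a quality-weighted average as one possible form of down-weighting. It is written in Python rather than R purely for illustration; the function names and numbers are invented, the component scores q1 to q4 are taken as given inputs (assumed to lie between 0 and 1) rather than computed as in Wang et al., and intensities are assumed to be positive.

```python
def relative_difference(a, b):
    """|a - b| divided by the average of a and b (both assumed positive)."""
    return abs(a - b) / ((a + b) / 2)

def mean_median_ok(mean, median, tol=0.15):
    """Accept when the spot mean and median differ by less than tol."""
    return relative_difference(mean, median) < tol

def above_background(fg_mean, bg_mean, bg_sd, n_sd=2.0):
    """Accept when the foreground exceeds the background by n_sd background SDs."""
    return fg_mean > bg_mean + n_sd * bg_sd

def duplicates_agree(mu1, mu2, tol=0.15):
    """Accept when duplicate probes for a gene differ by less than tol."""
    return relative_difference(mu1, mu2) < tol

def composite_quality(q1, q2, q3, q4):
    """Geometric mean of four quality scores."""
    return (q1 * q2 * q3 * q4) ** 0.25

def weighted_mean(values, weights):
    """Average of values with each value weighted by its quality score."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Made-up examples.
ok_spot = mean_median_ok(1000.0, 900.0)            # differ by about 10.5%
droplet = mean_median_ok(1000.0, 700.0)            # differ by about 35%
bright = above_background(300.0, 200.0, 40.0)      # 300 > 200 + 2 * 40
q_good = composite_quality(0.9, 0.9, 0.9, 0.9)
q_poor = composite_quality(1.0, 1.0, 1.0, 0.2)     # one bad component drags q* down
combined = weighted_mean([1.0, 3.0], [0.9, 0.1])   # replicate log ratios, weights q*
```

Because the composite is a geometric mean, a single very poor component pulls q* below 0.8 even when the other three are perfect. The weighted average shows the alternative to rejection: the low-quality replicate still contributes, but only a little.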