The aims of this section are to prepare you for the mechanics of working with probe-level data, to give you a set of approaches to normalization (which may work some of the time), and to get you to think critically about what normalization is supposed to do and how you would tell whether any particular normalization is working well.
The first step in any data analysis is quality checking. Microarrays are very complex and delicately balanced measuring instruments. We have very limited access to diagnostic information about the process of preparation (although that is changing), but we do have very rich data, about which we may form reasonable expectations based on what we think should have happened during the preparation of the array.
A standard technique in statistics is to check the validity of a model by examining the residuals from the fit. We don't have a statistical model here, but we do have expectations, which we can formulate as a crude model and then plot residuals against. These residual plots are often quite informative about problems during the preparation process. See the Opinionated Guide.
The simplest kinds of covariation plots are ratio-intensity plots and spatial aberration plots. A ratio-intensity plot shows the ratio of each probe's signal on one chip to its average intensity across the samples, plotted against that average (usually on a log2-transformed scale). For samples all taken from one tissue type, most intensities are roughly constant, and so average probe intensity is a sensitive indicator of probe saturation and quenching. A spatial aberration plot lays the same ratios out over the physical extent of the chip.
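As a concrete illustration, here is a minimal sketch in R of both plots, assuming a matrix intensities of raw probe intensities (probes in rows, chips in columns) and vectors probe.row and probe.col giving each probe's physical position on the chip; these object names are placeholders, not part of any particular data set.

    ## Ratio-intensity (RI) plot: chip 1 against the probe-wise average of all chips
    logI <- log2(intensities)              # work on the log2 scale
    A <- rowMeans(logI)                    # average log2 intensity per probe
    M <- logI[, 1] - A                     # log2 ratio of chip 1 to the average
    plot(A, M, pch = ".",
         xlab = "Average log2 intensity", ylab = "log2 ratio (chip 1 / average)")
    abline(h = 0, col = "red")             # most probes should scatter around zero

    ## Spatial aberration plot: lay the same ratios out by physical position on the chip
    spatial <- matrix(NA, nrow = max(probe.row), ncol = max(probe.col))
    spatial[cbind(probe.row, probe.col)] <- M
    image(spatial, main = "Spatial distribution of log2 ratios, chip 1")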
There has been a great deal published over the past ten years about microarray normalization (to the point where a leading journal, Bioinformatics, will not accept any more papers on the subject), without a consensus forming as to what the appropriate normalization is. The problem is not merely statistical. There seems to be a great deal of non-random error in array data, depending on a great many factors, and so how to model this error is a matter of art rather than elegant science.
However, we can compare normalization methods on a variety of standard data sets where we have some idea of what a good normalization should do. The key idea is that, while we may not for some time be able to know the 'truth' about absolute gene expression levels, we can at least ask how accurate the relative measures of expression are. See the OGMDA web site for further information.
By 2003 statisticians were inventing very complex normalization procedures. Benjamin Bolstad, one of Terry Speed's students, proposed cutting through all the complexity with a simple non-parametric normalization procedure, at least for one-color arrays. He proposed to shoe-horn the intensities of all probes on each chip into one standard distribution shape, determined by pooling all the individual chip distributions. The algorithm maps every value on any one chip to the corresponding quantile of the standard distribution; hence the method is called quantile normalization. This simple 'between-chip' procedure worked as well as most of the more complex procedures then current, and certainly better than the regression method, which was then the manufacturer's default for Affymetrix chips. It was also made the default in the affy package of Bioconductor, which has become the most widely used suite of freeware tools for microarrays (see www.bioconductor.org). For all these reasons quantile normalization has become the most common normalization procedure.
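To make the idea concrete, here is a minimal sketch of quantile normalization for a matrix of intensities (probes in rows, chips in columns), pooling the chip distributions by averaging their sorted values; in practice you would more likely call an established implementation such as normalize.quantiles() in the Bioconductor preprocessCore package, which handles ties and missing values more carefully.

    ## Force every chip to share one reference distribution obtained by
    ## averaging the sorted intensities across chips.
    quantileNormalize <- function(x) {
      ranks <- apply(x, 2, rank, ties.method = "first")        # rank of each probe within its chip
      reference <- rowMeans(apply(x, 2, sort))                 # pooled (averaged) distribution
      normalized <- apply(ranks, 2, function(r) reference[r])  # map each value to its quantile
      dimnames(normalized) <- dimnames(x)
      normalized
    }

    ## Usage: columns of 'raw' are chips, rows are probes
    ## normalized <- quantileNormalize(raw)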
One of the biggest changes in thinking needed to analyze high-throughput data is the idea of correlated, or systematic, errors. The majority of technical variation in a microarray experiment can be represented by only a few principal components. Since the beginning, statisticians have assumed that errors in repeated measures are independent, and most procedures for high-density data have been built on classical procedures that embody, at least tacitly, the same assumption. We have only recently begun drawing on the deep traditions of statistics to systematically address the kinds of correlated errors that characterize high-throughput data.
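One quick way to see this in your own data is to ask how much of the variation in a log-expression matrix is captured by the first few principal components, and whether those components line up with technical factors such as processing batch. A minimal sketch, assuming a log2 expression matrix exprs.mat (probes in rows, chips in columns) and a vector batch of processing dates; both names are placeholders.

    ## Principal components of the chips (prcomp wants samples in rows)
    pca <- prcomp(t(exprs.mat))
    var.explained <- pca$sdev^2 / sum(pca$sdev^2)
    round(100 * var.explained[1:5], 1)    # percent of variance in the first five components

    ## If the leading components separate chips by batch rather than by biology,
    ## the errors are systematic, not independent.
    plot(pca$x[, 1], pca$x[, 2], col = as.integer(as.factor(batch)),
         xlab = "PC1", ylab = "PC2")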
One of the ground-breaking studies of this sort was Leek and Storey (2007). This paper showed how to perform a singular value decomposition of the data, after taking account of the design matrix, and thereby derive a set of inferred ('surrogate') covariates, which can then be included in a traditional analysis of covariance.
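A minimal sketch of how this looks in practice, assuming the Bioconductor sva package, a log-expression matrix edata (probes in rows, chips in columns), and a treatment factor group; the object names are placeholders and the models will need adjusting for your own design.

    library(sva)                     # Bioconductor package by Leek and Storey

    ## Full model (with the variable of interest) and null model (intercept only)
    mod  <- model.matrix(~ group)
    mod0 <- model.matrix(~ 1, data = data.frame(group))

    ## Estimate surrogate variables from the expression data
    svobj <- sva(edata, mod, mod0)
    svobj$n.sv                       # number of surrogate variables found

    ## Include the surrogate variables as covariates in the downstream model
    mod.sv <- cbind(mod, svobj$sv)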
Another approach to normalization proceeds by generalizing the LOESS procedure that Terry Speed introduced.
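The flavour of the approach can be seen in a minimal two-chip sketch: fit a smooth loess curve to the ratio-intensity plot and subtract it, so that the systematic intensity-dependent trend is removed. This assumes two vectors x and y of raw intensities for matched probes on two chips; it is a sketch of the idea, not the full procedure used for many chips or for print-tip groups.

    ## Intensity-dependent (loess) normalization of chip y against chip x
    M <- log2(y) - log2(x)           # log ratio between the two chips
    A <- (log2(y) + log2(x)) / 2     # average log intensity
    fit <- loess(M ~ A, span = 0.4)  # smooth trend of ratio against intensity
    M.norm <- M - fitted(fit)        # subtract the trend: normalized log ratios

    ## Corrected intensities on chip y, keeping chip x as the reference
    y.norm <- 2^(log2(x) + M.norm)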
Use the affy package to read in the Affymetrix data, but read in the other formats using read.table(). It will be convenient to use a loop and extract only the data you need from each table; the Agilent files in particular contain a lot of information you won't keep. You'll need the probe information (probe ID, row, and column), and it makes sense to extract that once. The Agilent files are a bit tricky because they contain single-quote (') and # characters, both of which R treats specially; try adding quote = "", comment.char = "" to your read.table() call. Use code along the following lines to read in the Illumina MAQC, Illumina HapMap, and Agilent files.
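A minimal sketch, with hypothetical file names and typical Agilent column headings (ProbeName, Row, Col, gMeanSignal) that you should check against your own files:

    ## Agilent: loop over the files, keeping the probe information once
    ## and only the signal column from each file
    agilent.files <- list.files("Agilent", pattern = "\\.txt$", full.names = TRUE)

    first <- read.table(agilent.files[1], header = TRUE, sep = "\t",
                        skip = 9,    # lines above the feature table; adjust if your files differ
                        quote = "", comment.char = "", stringsAsFactors = FALSE)
    probe.info <- first[, c("ProbeName", "Row", "Col")]

    agilent.signal <- sapply(agilent.files, function(f) {
      dat <- read.table(f, header = TRUE, sep = "\t", skip = 9,
                        quote = "", comment.char = "", stringsAsFactors = FALSE)
      dat$gMeanSignal                # check the name of the signal column you want to keep
    })

    ## Illumina exports are plain tab-delimited tables
    maqc   <- read.table("Illumina_MAQC.txt",   header = TRUE, sep = "\t",
                         quote = "", comment.char = "")
    hapmap <- read.table("Illumina_HapMap.txt", header = TRUE, sep = "\t",
                         quote = "", comment.char = "")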
When you plot the results, use the smoothScatter function from the geneplotter library, which is designed for scatter plots with very many points. You've now constructed (most of) a microarray pre-processing pipeline!
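For example, a density scatter plot of two chips against each other might look like this (a sketch, assuming columns 2 and 3 of the maqc table from the sketch above hold signal values; adjust the column indices to your data):

    library(geneplotter)             # provides smoothScatter()

    ## Density scatter plot of two chips' log2 intensities against each other
    smoothScatter(log2(maqc[, 2]), log2(maqc[, 3]),
                  xlab = "Chip 1 (log2 intensity)", ylab = "Chip 2 (log2 intensity)")
    abline(0, 1, col = "red")        # well-normalized chips should hug the diagonal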