Pattern-Finding |
Exploratory analysis aims to find patterns in the data that aren’t
predicted by the experimenter’s current knowledge or pre-conceptions. Some typical
goals are to identify groups of genes whose expression patterns across samples are closely
related, or to find unknown subgroups among samples. A useful first step in all
analyses is to identify outliers among samples – those that appear suspiciously
far from others in their group. To address these questions, researchers have
turned to methods such as cluster analysis, and principal components analysis,
although these have often been used inappropriately. |
The first widely publicized microarray studies aimed to find uncharacterised
genes that act at specific points during the cell cycle. Clustering is the natural first
step in doing this. Unfortunately many people got the impression that clustering is the 'right'
thing to do with microarray data; the confusion has been perpetuated, since many software
packages have catered to this impression. The proper way to analyze data is the way that
addresses the goal at which the study was aimed. Clustering is a useful exploratory technique
for suggesting resemblances among groups of genes, but it’s not a way of identifying the
differentially regulated genes in an experimental study. |
|
Clustering |
After that disclaimer, and after all preprocessing steps, there are two
main types of analysis: Unsupervised Analysis and Supervised Analysis.
An analysis is considered Supervised when class labels are known a priori and are
used in the analysis in such a way as to identify differentially expressed genes, or to develop
a class prediction model. Unsupervised Analysis methods do not use any class
or phenotypic variables external to the gene expression data;
that is to say, we wish to learn from the gene expression data only. Unsupervised
analyses are commonly used to determine whether there are previously unrecognized phenotypic
groups identifiable only from gene expression data. Suppose that we want to find groups of similar genes
or similar samples; how do we go about it? Clustering depends on the idea that differences
between gene expression profiles are like distances; however the user must make (somewhat
arbitrary) choices to compute a single measure of distance from many individual
differences. It is important to determine what distance measure is appropriate for the
data at hand before choosing a clustering algorithm, because the preprocessed
microarray values themselves are not used directly by the clustering algorithm. Instead,
these expression values are first converted to distances, or dissimilarities, and then clustering is applied.
For visualization purposes it may be helpful to see how the data from a gene expression microarray experiment
are stored as a matrix (see Figure 1). |
|
Figure 1. Gene Expression Matrix |
For the above figure (Figure 1) note that the Y–axis may be either genes or probe sets.
Also, realize that the first value (the square in the upper left–hand corner of the matrix) corresponds
to the expression of the first gene for the first sample. Moving one square to the right represents the first gene
for the second sample. The same convention applies when moving vertically within the matrix as well. |
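As a minimal illustration of this layout (the gene and sample names are hypothetical placeholders, not from the text), the matrix convention can be mimicked in Python with NumPy:

```python
import numpy as np

# Hypothetical expression matrix: rows = genes (or probe sets), columns = samples.
genes = ["gene_1", "gene_2", "gene_3"]
samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
expr = np.array([
    [5.1, 5.3, 8.2, 8.0],   # gene_1 across the four samples
    [2.0, 2.2, 2.1, 1.9],   # gene_2
    [7.4, 3.1, 7.2, 3.0],   # gene_3
])

# expr[0, 0] is the first gene in the first sample (upper left-hand corner);
# expr[0, 1] moves one square to the right: the first gene in the second sample.
print(expr[0, 0], expr[0, 1])
```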
Different procedures emphasize different types of similarities, and give different resulting clusters.
Four choices you have to make are: |
- what scale to use: original scale, log scale, or another transform,
- whether to use all genes or to make a selection of genes,
- what metric (proximity measure) to use to combine the scaled values of the selected genes, and
- what clustering algorithm to use.
|
Scale
|
It is not clear which scale is most appropriate for multivariate exploratory
techniques, such as clustering and PCA (see below). Differences measured on the linear scale
will be strongly influenced by the one hundred or so highly expressed genes, while being only moderately
affected by the hundreds of moderate–abundance genes. The thousands of low–abundance genes will
contribute little. Often the high-abundance genes are 'housekeeping' genes; these may or may not
be diagnostic for the kinds of differences being sought. A housekeeping gene is any gene
involved in basic functions needed for the sustenance of the cell; housekeeping genes are constitutively
expressed, that is to say they are always turned 'ON'. On the other hand, the log scale will
amplify the noise among genes with low expression levels. If low-abundance genes are included
(see below) then they should be down-weighted. In the author's opinion, the most useful measure
of a single gene difference is the difference between two samples, relative to that gene's
variability within experimental groups; this is like a t-score for difference between two
individuals. |
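As a rough sketch of this kind of scaling (an illustration with hypothetical group labels and a simple pooled-variance estimate, not the author's exact procedure):

```python
import numpy as np

def t_like_difference(x_a, x_b, group_values):
    """Difference between two samples for one gene, scaled by that gene's
    within-group variability (here a simple pooled variance across groups)."""
    # group_values: list of 1-D arrays, one per experimental group, for this gene
    pooled_var = np.mean([np.var(g, ddof=1) for g in group_values])
    return (x_a - x_b) / np.sqrt(pooled_var)

# Hypothetical example for a single gene measured in two groups of samples
group1 = np.array([5.0, 5.2, 4.9])
group2 = np.array([7.8, 8.1, 7.9])
print(t_like_difference(group1[0], group2[0], [group1, group2]))
```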
|
Gene Selection
|
It would be wise not to place much emphasis on genes whose values are
uncertain. These are usually those with low signals in relation to noise, or which fail
spot-level quality control. If the estimation software provides a measure of confidence in
each gene estimate, this can be used to weight the contribution to distance of that gene
overall. It's not wise to simply omit (that is, set to 0) distances which are not known
accurately, but it is wise to down-weight relative distances if several are
probably in error. A general rule is that genes whose signal falls
within the background noise range are probably contributing just noise to your
clustering (and any other global procedure); therefore one may discard them. |
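A minimal sketch of this kind of selection (the background-noise threshold, minimum sample count, and optional per-gene confidence weights are hypothetical illustrations, not prescribed values):

```python
import numpy as np

def select_and_weight_genes(expr, noise_level, confidence=None, min_samples_above=3):
    """expr: genes x samples matrix of preprocessed values.
    Keep genes whose signal rises above the background-noise range in at least
    `min_samples_above` samples; return the filtered matrix and per-gene weights."""
    above_noise = (expr > noise_level).sum(axis=1) >= min_samples_above
    kept = expr[above_noise]
    if confidence is None:
        weights = np.ones(kept.shape[0])
    else:
        # Down-weight genes whose estimates are less certain rather than dropping them.
        weights = confidence[above_noise]
    return kept, weights

# Hypothetical use: 1000 genes, 10 samples, background noise around 2.0 units
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=2.0, size=(1000, 10))
filtered, w = select_and_weight_genes(expr, noise_level=2.0)
print(filtered.shape, w.shape)
```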
|
Metrics (Proximity Measures)
|
Proximities, such as distances, dissimilarities, and similarities, are the
inputs to a clustering algorithm. In order for a distance d to be considered a
metric we require:
$d(i,j) \ge 0$, with $d(i,j) = 0$ if and only if $i = j$;
$d(i,j) = d(j,i)$ for all $i, j$ (symmetry); and
$d(i,j) \le d(i,k) + d(k,j)$ for all $i, j, k$ (the triangle inequality).
|
Most cluster programs give you a menu of distance measures (Euclidean and
Manhattan distances) and some relational measures (correlation, and sometimes relative
distance and mutual information). The most commonly used distance measure in many clustering algorithms is
Euclidean distance, that is, the straight-line distance; some may prefer to say the root of the sum of squares,
as in geometry. More specifically, the Euclidean distance between objects i and j measured on p variables is computed as follows: |
$d(i,j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^{2}}$ |
|
Some biologists prefer to standardize (when clustering genes) the gene expression vectors first,
then calculate d(i,j); more specifically: |
$x^{*}_{ik} = \dfrac{x_{ik} - \bar{x}_{i}}{S}$, where $S^{2}$ is the pooled variance, |
or: |
$x^{*}_{ik} = \dfrac{x_{ik} - \bar{x}_{i}}{R}$, where $R$ is the range. |
|
|
Manhattan distance is the sum of the absolute coordinate differences (like navigating the street grid in Manhattan);
some refer to it as the 'city block' distance. More specifically: |
$d(i,j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$ |
|
The aforementioned distance measures are all part of a general class known as
Minkowski distances; this class of distances is of the form: |
$d(i,j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{m} \right)^{1/m}$, |
where m = 2 gives the Euclidean distance and m = 1 the Manhattan distance. |
|
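As an illustration (a sketch not taken from the text, assuming a NumPy/SciPy environment and a small hypothetical data matrix), these distances can be computed with scipy.spatial.distance.pdist:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical matrix: 4 samples (rows) x 5 genes (columns); to cluster genes instead,
# transpose so that the objects being compared are the rows.
X = np.array([
    [5.1, 2.0, 7.4, 3.3, 6.0],
    [5.3, 2.2, 3.1, 3.1, 6.2],
    [8.2, 2.1, 7.2, 9.0, 1.1],
    [8.0, 1.9, 3.0, 8.8, 1.0],
])

euclid = squareform(pdist(X, metric="euclidean"))           # Minkowski with m = 2
manhattan = squareform(pdist(X, metric="cityblock"))        # Minkowski with m = 1
minkowski3 = squareform(pdist(X, metric="minkowski", p=3))  # general Minkowski, m = 3
print(euclid.round(2))
```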
Another type of proximity measure is a similarity measure. Similarities are used
when gene expression profiles represent comparative expression measures, so that Euclidean and other distance
measures are less meaningful. Therefore, we commonly use a proximity measure that scores the similarity
between genes. It is important to note that for similarities, the more alike objects i and j are,
the larger the measure; that is to say: |
$s(i,j) = s(j,i)$, and $s(i,j) \le s(i,i)$ for all $i$ and $j$. |
|
The correlation distance measure is actually
1-r, where r is the correlation coefficient. Probably a more useful version is 1 – |r|;
negative correlation is as informative as positive correlation. One common type of correlation
is Pearson's correlation, which must be transformed to meet the properties of a similarity measure. More
specifically, Pearson's correlation between genes i and j is calculated as follows: |
$r_{ij} = \dfrac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_{i})(x_{jk} - \bar{x}_{j})}{\sqrt{\sum_{k=1}^{p} (x_{ik} - \bar{x}_{i})^{2}}\,\sqrt{\sum_{k=1}^{p} (x_{jk} - \bar{x}_{j})^{2}}}$, where $\bar{x}_{i} = \frac{1}{p}\sum_{k=1}^{p} x_{ik}$. |
|
Typically the transformations are of the form: |
$s_{ij} = \dfrac{1 + \rho_{ij}}{2}$, where negative $\rho$'s are dissimilar, |
or: |
$s_{ij} = |\rho_{ij}|$, where negative $\rho$'s are similar. |
|
The choice of which transformation to use should be based on subject matter knowledge.
It is important to note that if the similarities have been calculated and one wishes to cluster the data,
similarities may be converted to dissimilarities as follows: |
$d(i,j) = 1 - s(i,j)$ |
|
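As a further illustrative sketch (again with hypothetical data, not from the text), correlation-based dissimilarities of both kinds can be computed directly from the expression matrix with NumPy:

```python
import numpy as np

# Hypothetical genes x samples matrix; rows are the gene expression vectors.
rng = np.random.default_rng(1)
expr = rng.normal(size=(6, 10))

r = np.corrcoef(expr)             # Pearson correlation between every pair of genes
d_signed = (1 - r) / 2            # negative correlations treated as dissimilar
d_unsigned = 1 - np.abs(r)        # negative correlations treated as similar
print(d_signed.round(2))
print(d_unsigned.round(2))
```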
The mutual information (MI)
is defined in terms of entropy: H = −Σ p(x) log2 p(x) for a discrete
distribution {p}. Then MI(g1,g2) = H(g1) + H(g2)
− H(g1,g2) for genes g1 and g2. This measure is robust – not
affected by outliers. However it is tedious to program, because it requires
adaptive binning to construct a meaningful discrete distribution. |
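A minimal sketch of this calculation (illustrative only; it uses simple equal-width binning via numpy.histogram2d rather than the adaptive binning mentioned above):

```python
import numpy as np

def mutual_information(g1, g2, bins=5):
    """MI(g1, g2) = H(g1) + H(g2) - H(g1, g2), estimated from a binned joint distribution."""
    joint, _, _ = np.histogram2d(g1, g2, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Hypothetical expression profiles for two genes across 50 samples
rng = np.random.default_rng(2)
g1 = rng.normal(size=50)
g2 = 0.8 * g1 + rng.normal(scale=0.5, size=50)
print(round(mutual_information(g1, g2), 3))
```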
By and large there are no theoretical reasons to pick one over the
other, since we don't really know what we mean by 'distance' between expression
profiles. The author prefers to use 'Manhattan' metrics for clustering samples
by similarity of gene expression levels, and to use a correlation measure to
cluster genes. Most of these measures are fairly sensitive to outliers, except
mutual information. Robust versions of these measures can easily be constructed
by a competent statistician, but are not available in most software packages.
However we do get different results depending on the measure we use, as shown
below for a study with 10 samples: two normal samples and two groups of tumor
samples. |
|
|
|
Clustering of the same data set using four different distance measures.
All genes were on a logarithmic scale, and only genes with an MAS 5 'Present' call in 8 out
of 10 samples were used (Affymetrix data). The four measures are listed in the
titles; 'relative' is |x-y|/|x+y|. |
|
Clustering Algorithms
|
To better understand clustering it may help to discuss filtering briefly.
Filtering is an attempt to exclude probe–sets or genes based on how much they vary, and
different types of filtering use different criteria. Some retain genes whose absolute deviation
exceeds a predefined threshold (the type of array is important to keep in mind
when setting this threshold). Some methods retain genes that have an expression value
above a predefined threshold. Others retain probe–sets declared 'Present' among a
certain proportion of Affymetrix GeneChips, say 70%, while still other methods
retain only well-characterized genes. |
Most biologists find hierarchical clustering more familiar, and other
algorithms somewhat magical. Some different types of hierarchical clustering are
agglomerative hierarchical clustering, divisive hierarchical clustering, and Ward's method.
With agglomerative hierarchical clustering each object starts as a cluster
of size one. Once the distances between all pairs of clusters are calculated, the two clusters
that are most similar (that is, the two that have the smallest distance) are merged, and the process
is repeated. The linkage approaches differ in how the distance between two clusters is measured:
the single–linkage approach, also thought of as nearest neighbor, uses the smallest pairwise
distance between members of the two clusters; the average–linkage approach averages all pairwise
distances; and the complete–linkage approach uses the maximum pairwise distance. Divisive hierarchical clustering
starts with all objects in one large cluster, then proceeds by splitting off the object that is
most dissimilar from the rest of its cluster. Ward's method seeks to minimize the within–cluster sum of squares at each
step; the union of every possible cluster pair is considered, and the two clusters whose fusion results in the minimum increase
in "information loss" (that is, the error sum of squares) are combined. One potentially misleading artifact of
hierarchical clustering is that, at the end, all objects are ultimately in one large cluster; it is possible that one, or some,
of the objects are totally unrelated, yet the algorithm will display them together.
Statisticians object to hierarchical clustering because it seems
(falsely) to imply descent; however this is a quibble: all of the common clustering methods
are based on models which don't really apply to microarray data. There is no reason to suspect
that genes (or samples) naturally cluster in a hierarchical way. That is, it may not be natural
to impose a nesting structure if the primary purpose is to identify naturally occurring groups in the data.
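As an illustrative sketch (not from the text; hypothetical data and SciPy's agglomerative implementation), the common linkage choices can be compared as follows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical samples x genes matrix; here we cluster the 6 samples.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(3, 20)), rng.normal(3, 1, size=(3, 20))])

d = pdist(X, metric="cityblock")               # Manhattan distances between samples
for method in ("single", "average", "complete", "ward"):
    # Ward's method is conventionally used with Euclidean distances
    dist = pdist(X, metric="euclidean") if method == "ward" else d
    Z = linkage(dist, method=method)            # agglomerative hierarchy
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```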
Another approach to clustering is to use partitioning methods. Partitioning methods partition
the set of n objects into a prespecified number of clusters; two such methods are K-means and Partitioning
Around Medoids (PAM). The K-means method requires that K, the number of clusters, be pre–specified. Once K
has been determined, initial clusters are defined, typically at random: either each object is randomly assigned to
one of K clusters, or K random objects are chosen as initial cluster centers. The specific
steps taken when using the K-means clustering method are as follows (a code sketch is given below). (ONE) Specify K and form an initial
random assignment of objects to clusters. (TWO) Calculate the center of each cluster. (THREE) For each object,
calculate the distance between that object and the K centers. (FOUR) Assign each object to the cluster
whose center is closest. The aforementioned steps are repeated until no assignments change. The PAM
method differs slightly: rather than using the means of the currently defined clusters, the cluster center is restricted
to be one of the observations within the cluster.
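A minimal sketch of the K-means steps above, assuming Euclidean distance and hypothetical two-dimensional data (an illustration, not a production implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means following the steps above: random initial assignment,
    recompute centers, reassign each object to its nearest center, repeat."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])            # (ONE) random initial assignment
    for _ in range(max_iter):
        centers = np.array([X[labels == j].mean(axis=0)      # (TWO) center of each cluster
                            if np.any(labels == j) else X[rng.integers(X.shape[0])]
                            for j in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (THREE)
        new_labels = dists.argmin(axis=1)                     # (FOUR) assign to nearest center
        if np.array_equal(new_labels, labels):                # stop when assignments stabilize
            break
        labels = new_labels
    return labels, centers

# Hypothetical data: two well-separated groups of samples
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(5, 2)), rng.normal(5, 1, size=(5, 2))])
labels, centers = kmeans(X, k=2)
print(labels)
```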
While K–means requires the raw data to be available, PAM needs only the distances/dissimilarities rather than the raw data. The specific
steps taken when using the PAM clustering method are as follows. (ONE) Specify K. (TWO) For a given cluster assignment C,
find, within each cluster, the observation that minimizes the total distance to all other points in that cluster (the medoid). (THREE) For the current set
of K medoids, minimize the total error by assigning each observation to the cluster with minimum distance (that is,
assign each observation to the closest medoid). The aforementioned steps are repeated until no assignments change or you exceed the maximum number
of iterations allowed by the software. It is important to note that, since K is a parameter decided on by the technician or statistician,
not all values of K will result in natural groupings. It is advisable to run the algorithm several times with
different values of K and select the K for which certain characteristics are optimal, or to retain the clustering that gives rise to the most
meaningful interpretation. There are some ways to check the appropriateness, or goodness of fit, of a given clustering.
One such check is to compare the size of each cluster against the distance to the nearest other cluster. One may also deem a cluster assignment
trustworthy if the inter–cluster distance is much larger than the size of the clusters. Yet another method
for estimating the appropriate number of clusters is to compute a homogeneity or a separation quantity for various
values of K and then look for a leveling off. The equations for the aforementioned quantities are as follows. For the homogeneity quantity,
the average distance between each observation and the center of the cluster to which it belongs is calculated by:
$H_{K} = \dfrac{1}{n} \sum_{i=1}^{n} d(x_{i}, c_{C(i)})$, where $c_{C(i)}$ is the center of the cluster to which observation $x_{i}$ is assigned. |
|
For the separation quantity, the weighted average distance between the clusters is calculated by:
$S_{K} = \dfrac{\sum_{k \ne l} n_{k}\, n_{l}\, d(c_{k}, c_{l})}{\sum_{k \ne l} n_{k}\, n_{l}}$, where $n_{k}$ is the size of cluster $k$ and $c_{k}$ its center. |
|
|
One may also use a silhouette, a composite index reflecting the compactness of, as well as the separation between, clusters.
Useful statistics resulting from this are the medoids (and their co-ordinates), and the diameters and separations of the clusters. |
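An illustrative sketch of these checks (assumptions: scikit-learn's K-means and silhouette_score, hypothetical data, and the homogeneity and separation definitions given by the formulas above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three groups of samples in two dimensions
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in (0, 4, 8)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    # Homogeneity: average distance from each observation to its cluster center
    homogeneity = np.mean(np.linalg.norm(X - centers[labels], axis=1))
    # Separation: size-weighted average distance between cluster centers
    sizes = np.bincount(labels, minlength=k)
    num = den = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            w = sizes[a] * sizes[b]
            num += w * np.linalg.norm(centers[a] - centers[b])
            den += w
    separation = num / den
    sil = silhouette_score(X, labels)
    print(k, round(homogeneity, 2), round(separation, 2), round(sil, 2))
```

Looking for the K at which homogeneity levels off, with a comparatively high silhouette, follows the informal criteria described above.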
Broadly speaking, the differences between clustering methods show up in how ambiguous cases are assigned; if you
have very many ambiguous cases you'll see great differences; if so, then maybe clustering
isn't appropriate anyway, because the data don't separate into groups naturally. The k-means
and SOM methods require the user to specify in advance the number of clusters to be constructed.
Of course you don’t know this ahead of time; most people end up trying out many values. A
criterion that some people use to decide how many clusters to use is to track
how much the intra-group variance drops with each addition of another cluster,
and then go back to the point where the rate of decrease really slows down. More
advanced methods allow clusters to overlap, as often happens in biology, or to
omit some outlying genes. To recap, a list of all commonalities between the various
clustering methods discussed is presented here: |
- Objects to be clustered must be selected;
- The variables to be used must be selected, and need to contain enough information
for accurate clustering;
- The researcher must decide whether or not to standardize the raw data;
- A dissimilarity measure must be chosen;
- A clustering technique needs to be ascertained;
- The number of clusters needs to be ascertained;
- One must interpret the meaning of the clusters in terms of research objectives.
|
|
Statistical significance of clusters by bootstrapping |
An important question, but rarely asked, is whether the clusters obtained
from a procedure depend on the overall pattern of gene expression, or on a few samples; they
could be very different if one or two samples are omitted. One approach to address this is the
Bootstrap: you re-cluster many times, each time re-sampling conditions or genes from your
original data, and then derive new clusters of genes or conditions. A variant is known as
Jack-knife analysis. Branches in a hierarchical cluster that are supported by a large fraction
of the re-sampled data sets are considered fairly reliable. A reasonable figure is 70%, but
this is arbitrary, like the often-quoted 5% level of significance. |
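A minimal sketch of this idea (an illustration under simplifying assumptions, not the exact published procedure): resample genes with replacement, re-cluster the samples each time, and record how often each pair of samples lands in the same cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cocluster_stability(X, n_clusters=3, n_boot=200, seed=0):
    """X: samples x genes. Returns the fraction of bootstrap replicates in which
    each pair of samples is assigned to the same cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))
    for _ in range(n_boot):
        genes = rng.integers(0, X.shape[1], size=X.shape[1])  # resample genes with replacement
        Z = linkage(pdist(X[:, genes]), method="average")
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        together += (labels[:, None] == labels[None, :])
    return together / n_boot

# Hypothetical data: 10 samples x 500 genes, with a (noisy) group in the last five samples
rng = np.random.default_rng(6)
X = rng.normal(size=(10, 500))
X[5:] += 1.5
print(cocluster_stability(X).round(2))
```

Pairs (or branches) supported in a large fraction of replicates, say the 70% figure mentioned above, would be treated as fairly reliable.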
With any exploratory technique, one should think about what technical
variable may underlie the groups discovered this way, before going to the lab to confirm
findings. The author finds that clustering most often identifies systematic differences in
collection procedures or experimental protocol. These are important but not biologically
significant. Even when the difference is biological, it may not be a discovery.
Most sets of breast cancer data segregate into ER+ and ER- in clustering, which
is re-assuring but hardly news. |
|
Principal Components and Multi-dimensional scaling |
Several other good multivariate techniques can help with exploratory
analysis. Many authors suggest principal components analysis (PCA) or singular value
decomposition to find coherent patterns of genes, or ‘metagenes’, that
discriminate groups. These techniques, which have a long history in the statistical arsenal, rely
on the idea that most variation in a data set can be explained by a smaller number of
transformed variables; they each form linear combinations of the data which represent most
of the variation, and in principle these approaches are well-suited for this purpose.
However this author believes these approaches are delicate enough that they are
not very often useful for deep exploratory work; often the largest coherent
component, such as the first principal component (PC), reflects mostly
systematic error in the data. In fact some researchers have seriously suggested
normalizing microarray data by removing the first PC. PCA is not terribly
robust to outliers, which are common. Like cluster analysis, the results of PCA
are sensitive to transforms, which are somewhat arbitrary. |
These multivariate approaches are
more useful for exploring relations among samples, and particularly for a
diagnostic look at samples before formal statistical tests. Multi-dimensional
scaling (MDS) makes the most useful graphical displays. Classical MDS is
identical to PCA for most data sets; however if you fix a dimension, a modern
iterative algorithm can place the points in an arrangement that is more
representative of the true distances than are the same number of principal
components. It’s worth getting the ‘strain’ parameter for the MDS fit: this
parameter measures the discrepancy between the distances computed by the
metric, and the distances represented in the picture. When this parameter is
much over 15%, the picture can be misleading. Often if you omit outlying
samples, or take a major subgroup, the picture is a more accurate
representation of the computed distances. |
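An illustrative sketch (assumptions: scikit-learn's SMACOF-based MDS as the iterative algorithm, Manhattan distances between samples, hypothetical data, and Kruskal's stress-1 as a stand-in for the 'strain' parameter described above):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

# Hypothetical samples x genes matrix (e.g., log-scale expression values)
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(m, 1, size=(5, 100)) for m in (0, 2, 6)])

# PCA view of the samples
pcs = PCA(n_components=2).fit_transform(X)

# Iterative MDS on Manhattan distances between samples
D = squareform(pdist(X, metric="cityblock"))
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

# Kruskal stress-1: discrepancy between input distances and plotted distances
D_hat = squareform(pdist(coords))
stress1 = np.sqrt(np.sum((D - D_hat) ** 2) / np.sum(D ** 2))
print(f"stress-1 = {stress1:.1%}  (much over 15% suggests a misleading picture)")
```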
When the data conform to expectation
the MDS plot shows well-defined groups. Figure ** shows an MDS plot of seven
groups in a comparative study; the control group is in black; three of the
groups are quite similar to controls and each other, while three others are
quite distinct from controls and each other. This experiment showed many genes
under very clear regulation. |
|
|
|
Things aren’t always so happy. The
following figure ** shows a group of experiments which might have been
misinterpreted. Replicate experiments comparing three treatments against three controls
were done on two different dates. Results were not consistent between
experiments, but the MDS plot shows that we can work with the day 2 data quite
confidently, and separately from day 1. The cluster analysis on the left
doesn’t show this nearly as clearly. |
|
|
|
Representing two batches of 3
treatments (T) and 3 controls (C), done on two different dates: 1T3 represents the 3rd
treated sample on day one. We can see that the day 1 chips cluster together,
and are displayed together by MDS. However the day 2 chips seem to fall into
two distinct clusters, which don’t divide neatly along T and C. The MDS plot
shows that the C samples on day 2 are in fact quite close, whereas the 3 T’s
are more disparate but all quite different from the C’s. |