Statistical Methods in Systems Biology - Analysis of Gene Sets
Current Methods
Draghici. This is the only overview paper for such methods. It is somewhat superficial and often inaccurate about details, but at least gives some perspective.
Kim et al. published a method based on Normal theory. Think about what assumptions are implied by using Normal theory. Are these assumptions roughly true for gene sets? How might the method be extended to cover the case where genes within a set move in opposite directions?
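To make the Normal-theory questions concrete, here is a minimal sketch of a set-level Z score in the style usually described for PAGE. It assumes the score compares the mean fold change of the set with the mean over all genes, scaled by the overall standard deviation and the square root of the set size. The function name and arguments are hypothetical, not taken from the paper.

```python
import numpy as np
from math import erfc, sqrt

def page_z_score(fold_changes, set_indices):
    """Normal-theory Z score for one gene set (illustrative sketch only)."""
    fold_changes = np.asarray(fold_changes, dtype=float)
    mu = fold_changes.mean()            # mean fold change over all genes
    sigma = fold_changes.std(ddof=0)    # SD of fold change over all genes
    members = fold_changes[np.asarray(set_indices)]
    m = members.size
    # Assumption: the set mean is approximately Normal with SD sigma / sqrt(m),
    # which treats genes in the set as independent draws.
    z = (members.mean() - mu) * sqrt(m) / sigma
    p = erfc(abs(z) / sqrt(2))          # two-sided Normal p-value
    return z, p

z, p = page_z_score([-1, 0, 1, 2, 3], [3, 4])
```

Note that the score depends only on the set mean, so up and down movements within a set cancel each other.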
Subramanian et al. published a revised version of their original method. What statistical theory inspires this method? How would you assess how good it is?
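As a starting point for the discussion, the running-sum statistic commonly associated with GSEA can be sketched as follows. This is a simplified illustration, assuming genes are ranked by a per-gene score, steps up at set members are weighted by |score| raised to a power, and steps down at non-members are equal. The names are hypothetical, and the permutation step used to assess significance is omitted.

```python
import numpy as np

def enrichment_score(scores, in_set, weight=1.0):
    """Signed maximum deviation of a weighted running sum (sketch only)."""
    scores = np.asarray(scores, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    order = np.argsort(-scores)               # rank genes from highest to lowest score
    s = scores[order]
    hit = in_set[order]
    # Step up at set members, weighted by |score|**weight (weight=0 gives equal steps).
    hit_w = np.where(hit, np.abs(s) ** weight, 0.0)
    step_up = hit_w / hit_w.sum()             # assumes the weights do not all vanish
    # Step down by an equal amount at every gene outside the set.
    miss = (~hit).astype(float)
    step_down = miss / miss.sum()             # assumes at least one gene outside the set
    running = np.cumsum(step_up - step_down)
    return running[np.argmax(np.abs(running))]

es = enrichment_score([5, 4, 3, 2, 1, 0.5], [True, True, False, False, False, False], weight=0)
```

With `weight=0` the statistic reduces to an unweighted comparison of two empirical distributions along the ranked list.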
Kemp et al. address a problem with GSEA and PAGE, both of which assume that all co-ordinated movements are in the same direction. What is the basic idea? How else might the idea be implemented?
Web Resources
Multivariate Approaches
Papers
Kong et al. introduced the idea of using the multivariate analogue of the t-test (Hotelling's T²) to identify significantly changed gene groups. This approach incorporates the covariance among the genes in a group. Questions:
1/ Why could this approach be more efficient than PAGE?
2/ What drawbacks does the approach have?
3/ How might you overcome some of the problems?
4/ How might you extend it to the situation where covariance changes also?
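For reference when working through these questions, the textbook two-sample Hotelling statistic can be sketched as below. This is the standard form with a pooled covariance matrix, not necessarily the exact variant used in the paper. The function name is hypothetical, and rows are assumed to be samples and columns the genes in one set.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 with pooled covariance (textbook sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, p = x.shape
    n2 = y.shape[0]
    d = x.mean(axis=0) - y.mean(axis=0)       # difference of mean vectors
    pooled = ((n1 - 1) * np.cov(x, rowvar=False)
              + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    pooled = np.atleast_2d(pooled)
    # The solve fails or is unstable unless there are more samples than genes.
    t2 = (n1 * n2 / (n1 + n2)) * (d @ np.linalg.solve(pooled, d))
    df2 = n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))   # refer to an F(p, df2) distribution
    return t2, f_stat, (p, df2)

t2, f_stat, dfs = hotelling_t2([[1.0], [2.0], [3.0]], [[2.0], [4.0], [6.0]])
```

With a single gene the statistic is the square of the ordinary pooled-variance t statistic, which is a useful sanity check.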
Tai et al. improved the Hotelling T² procedure by addressing the poor estimates obtained from the sample covariance matrix. Their method could be used in a variety of situations, including selection of changed gene sets.
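To see why the sample covariance is a problem with few arrays, the sketch below shrinks it toward its own diagonal. This is a generic shrinkage estimator used only for illustration, not the estimator from the paper. The function name and the fixed shrinkage intensity `lam` are assumptions.

```python
import numpy as np

def shrink_covariance(x, lam=0.2):
    """Convex combination of the sample covariance and its diagonal (sketch)."""
    s = np.atleast_2d(np.cov(np.asarray(x, dtype=float), rowvar=False))
    target = np.diag(np.diag(s))     # keep the variances, drop the covariances
    # lam = 0 returns the sample covariance, lam = 1 the diagonal target.
    return (1.0 - lam) * s + lam * target

# Two samples on three genes: the sample covariance has rank 1 and cannot be inverted.
x = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 7.0]])
sample_cov = np.cov(x, rowvar=False)
shrunk = shrink_covariance(x, lam=0.2)
```

In this example the shrunken matrix is invertible, provided every gene has non-zero variance, so it could stand in for the pooled covariance in a T²-type statistic.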
Gao describes a web application that will compute a score based on both the expression change and the connectivity. This is a fairly simple paper with one hard equation. If you discuss this paper, be prepared to explain the equation in detail.
Draghici constructs a composite score for gene groups combining a GSEA style score with a kind of correlation between connected components of the gene set. The approach is a bit of a hack and the paper is slick. What is he actually doing? Is it reasonable? Are there some other alternatives?
Ideker defines gene sets in terms of protein interactions starting from a specified gene of interest. He then uses discriminant analysis to separate classes. Is this definition of gene sets really what we want? Why does he do a Normal probability transform? Is there another way?