Normalization and statistical methods for crossplatform expression array analysis
A large volume of gene expression data exists in public repositories like the NCBI’s Gene Expression Omnibus (GEO) and the EBI’s ArrayExpress and a significant opportunity to re-use data in various combinations for novel in-silico analyses that would otherwise be too costly to perform or for which the equivalent sample numbers would be difficult to collects exists. For example, combining and re-analysing large numbers of data sets from the same cancer type would increase statistical power, while the effects of individual study-specific variability is weakened, which would result in more reliable gene expression signatures. Similarly, as the number of normal control samples associated with various cancer datasets are often limiting, datasets can be combined to establish a reliable baseline for accurate differential expression analysis. However, combining different microarray studies is hampered by the fact that different studies use different analysis techniques, microarray platforms and experimental protocols. We have developed and optimised a method which transforms gene expression measurements from continuous to discrete data points by grouping similarly expressed genes into quantiles on a per-sample basis. After cross mapping each probe on each chip to the gene it represents, thereby enabling us to integrate experiments based on genes they have in common across different platforms. We optimised the quantile discretization method on previously published prostate cancer datasets produced on two different array technologies and then applied it to a larger breast cancer dataset of 411 samples from 8 microarray platforms. Statistical analysis of the breast cancer datasets identified 1371 differentially expressed genes. Cluster, gene set enrichment and pathway analysis identified functional groups that were previously described in breast cancer and we also identified a novel module of genes encoding ribosomal proteins that have not been previously reported, but whose overall functions have been implicated in cancer development and progression. The former indicates that our integration method does not destroy the statistical signal in the original data, while the latter is strong evidence that the increased sample size increases the chances of finding novel gene expression signatures. Such signatures are also robust to inter-population variation, and show promise for translational applications like tumour grading, disease subtype classification, informing treatment selection and molecular prognostics.