Normalization is one of the most important procedures in genomics data analysis. A typical dataset contains more than one sample, and we are almost always interested in making comparisons between samples. Unfortunately, technical and biological sources of unwanted variation can cloud downstream results. Here we demonstrate with real data how this can happen, and then describe several existing solutions. The examples are based on microarray data, but the ideas apply to other data types as well.
To understand the downstream consequences of not normalizing we will use the spike-in experiment discussed in class.
library(SpikeIn)
data(SpikeIn95)
spms <- pm(SpikeIn95)
spd <- pData(SpikeIn95)
library(rafalib)
mypar(1,1)
shist(log2(spms[,2]),unit=0.1,type="n",xlab="log (base 2) intensity",main="Ten technical replicates")
# overlay smooth histograms of the first ten arrays
for(i in 1:10) shist(log2(spms[,i]),unit=0.1,col=i,add=TRUE,lwd=2,lty=i)
# pick arrays 9 and 10 and make an MA plot
i <- 10; j <- 9
siNames <- colnames(spd)
# keep the spike-ins with an expected fold change of 2 between these two arrays
siNames <- siNames[which(spd[i,]/spd[j,]==2)]
M <- log2(spms[,i]) - log2(spms[,j])
A <- (log2(spms[,i]) + log2(spms[,j]))/2
splot(A,M,ylim=c(-1.5,1.5))
spikeinIndex <- which(probeNames(SpikeIn95) %in% siNames)
points(A[spikeinIndex],M[spikeinIndex],bg=1,pch=21)
In this plot, the circled points correspond to probes for genes spiked in at different concentrations in the two samples. The rest of the points should be at 0 because, apart from the spiked-in genes, these are technical replicates. Notice that without normalization the non-spike-in points on the left side of the plot are as high as most of the spike-ins. If we were to rank genes by fold change, we would obtain many false positives. In the next units we will introduce three normalization procedures that have proven to work well in practice.
In the MA-plot above we see a non-linear bias in M that changes as a function of A. The general idea behind loess normalization is to estimate this bias and remove it. Because the bias is a curve of no obvious parametric form (it is not a line or a parabola or a sine function, etc.), we want to fit a curve to the data. Locally weighted regression (loess) provides one way to do this. Loess is inspired by Taylor's theorem, which in practice means that at any given point, if one looks at a small enough region around that point, the curve looks like a parabola; if one looks even closer, it looks like a straight line.
Loess takes advantage of this mathematical property of functions. For each point in the dataset, a region is defined that is considered small enough for the curve to be approximated by a line, and a line is then fit within that region using weighted least squares. The weights depend on the distance from the point of interest. The robust version of loess also down-weights points considered to be outliers.
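To make this concrete, here is a minimal sketch of a single local fit at a target point, assuming the standard tricube kernel. The function local_fit and its details are illustrative, not the exact algorithm implemented by stats::loess.
# illustrative sketch: fit a weighted least squares line in a window around x0
local_fit <- function(x, y, x0, span=0.3){
  d <- abs(x - x0)
  h <- quantile(d, span)              # window radius covering a span fraction of points
  w <- (pmax(0, 1 - (d/h)^3))^3       # tricube weights, zero outside the window
  fit <- lm(y ~ x, weights=w)         # local weighted least squares line
  predict(fit, newdata=data.frame(x=x0))
}
Repeating this fit at many target points and joining the fitted values traces out the loess curve.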
We now use loess to normalize the arrays showing a bias.
o <- order(A)
a <- A[o]; m <- M[o]
# subsample about 5,000 evenly spaced points to speed up the fit
ind <- round(seq(1,length(a),len=5000))
a <- a[ind]; m <- m[ind]
fit <- loess(m~a)
# show the fit
mypar(1,1)
plot(a,m,ylim=c(-1.5,1.5))
lines(a,fit$fitted,col=2,lwd=2)
# remove the bias by subtracting the fitted curve
bias <- predict(fit,newdata=data.frame(a=A))
nM <- M - bias
splot(A,M,ylim=c(-1.5,1.5))
points(A[spikeinIndex],M[spikeinIndex],bg=1,pch=21)
lines(a,fit$fitted,col=2,lwd=2)
splot(A,nM,ylim=c(-1.5,1.5))
points(A[spikeinIndex],nM[spikeinIndex],bg=1,pch=21)
abline(h=0,col=2,lwd=2)
Note that the bias is removed and now the highest fold changes are almost all spike-ins.
One limitation of loess normalization is that it depends on pairing samples. Such pairs arise naturally with two-color arrays, but for other platforms, such as Affymetrix, we do not have them. Quantile normalization offers a solution that is more generally applicable.
The smooth histogram plots demonstrate that different samples have different distributions, not just shifts in the median. This happens even when we look at data from replicated RNA. Quantile normalization forces all these distributions to be the same: it makes each quantile (not just the median) identical across samples. The algorithm, as implemented in Bioconductor, does the following for a matrix with rows representing genes and columns representing samples:

1. Sort the values within each column.
2. Compute the average of each row of the sorted matrix; these averages define a reference distribution.
3. For each column, replace each original value with the reference value corresponding to its rank within that column (ties receive the average of the corresponding reference values).
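For illustration, here is a minimal R sketch of these steps, assuming no missing values. The function quantile_normalize is ours; the preprocessCore implementation used below is more careful and much faster.
# illustrative sketch of quantile normalization (assumes no NAs)
quantile_normalize <- function(x){
  ranks <- apply(x, 2, rank, ties.method="average")   # within-column ranks
  ref <- rowMeans(apply(x, 2, sort))                  # reference distribution
  # map each rank back to the corresponding reference quantile
  apply(ranks, 2, function(r) approx(seq_along(ref), ref, xout=r)$y)
}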
Here we demonstrate how to use quantile normalization and how it corrects the bias.
library(preprocessCore)
nspms <- normalize.quantiles(spms)
# MA plot before normalization
M <- log2(spms[,i])-log2(spms[,j])
A <- (log2(spms[,i])+log2(spms[,j]))/2
splot(A,M,ylim=c(-1.5,1.5))
points(A[spikeinIndex],M[spikeinIndex],bg=1,pch=21)
# MA plot after quantile normalization
M <- log2(nspms[,i])-log2(nspms[,j])
A <- (log2(nspms[,i])+log2(nspms[,j]))/2
splot(A,M,ylim=c(-1.5,1.5))
points(A[spikeinIndex],M[spikeinIndex],bg=1,pch=21)
Note how quantile normalization also removes the bias but keeps the spiked-in genes different. The smooth histograms below show that the densities are now identical, as expected, since we forced this to be the case.
pms <- spms
mypar(1,1)
shist(log2(pms[,2]),unit=0.1,type="n",xlab="log (base 2) intensity",main="Five technical replicates")
for(i in 1:5) shist(log2(pms[,i]),unit=0.1,col=i,add=TRUE,lwd=2,lty=i)
# quantile normalize the first five arrays and plot the histograms again
qpms <- normalize.quantiles(pms[,1:5])
shist(log2(qpms[,2]),unit=0.1,type="n",xlab="log (base 2) intensity",main="")
for(i in 1:5) shist(log2(qpms[,i]),unit=0.1,col=i,add=TRUE,lwd=2,lty=i)
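We can also check directly that, up to how ties are handled, the columns now share exactly the same set of values:
# every column of the normalized matrix should have the same sorted values
qs <- apply(qpms, 2, sort)
max(abs(qs - qs[,1]))   # essentially zero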
In the background unit we learned that the background noise appears to be additive. However, the shifts that explain some of the need for normalization appear to be multiplicative. We have also observed a strong mean-variance relationship, which is in agreement with multiplicative error. Variance stabilizing normalization (vsn) motivates the need for normalization with a model that includes an additive background and multiplicative noise:
\[ Y_{ij}= \beta_i + \varepsilon_{ij} + A_i \theta_{j} 2^{\eta_{ij}} \]
The expression level we are interested in estimating is \(\theta_j\), which we assume changes across genes \(j\) and is the same across arrays \(i\). Here, \(\beta_i\) is an array-specific background level, which varies from probe to probe due to the additive noise \(\varepsilon_{ij}\). We refer to \(A_i\) as the gain and note that it changes from array to array, while \(\eta_{ij}\) represents multiplicative noise. Here is a Monte Carlo simulation demonstrating that by changing \(\beta\) and \(A\) we can generate non-linear biases like the ones we see in practice.
library(rafalib)
N <- 10000
e <- rexp(N,1/1000)       # true expression levels (the thetas)
b1 <- 24; b2 <- 20        # array-specific backgrounds (the betas)
A1 <- 1; A2 <- 1.25       # array-specific gains
sigma <- 1; eta <- 0.05   # SDs of the additive and multiplicative noise
y1 <- b1 + rnorm(N,0,sigma) + A1*e*2^rnorm(N,0,eta)
y2 <- b2 + rnorm(N,0,sigma) + A2*e*2^rnorm(N,0,eta)
mypar(1,1)
maplot(log2(y1),log2(y2),ylim=c(-1,1),curve.add=FALSE)
For this type of data, the variance depends on the mean. We seek a transformation that stabilizes the variance of the estimates of \(\theta\) after we subtract the additive background estimate and divide by the estimate of the gain.
ny1 <- (y1-b1)/A1   # subtract the background and divide by the gain
ny2 <- (y2-b2)/A2
mypar(1,2)
maplot(ny1,ny2,curve.add=FALSE,ylim=c(-500,500))
maplot(log2(2+ny1),log2(2+ny2),ylim=c(-2,2),xlim=c(0,15))
If we know how the variance depends on the mean, we can compute a variance stabilizing transform:
\[ Y \text{ with } \text{E}(Y)=\mu \text{ and } \text{var}(Y) = v(\mu): \text{ find } f \text{ such that } \text{var}\{f(Y)\} \text{ does not depend on } \mu \]

In the case of the model above, we can derive the following transformation:
\[ \text{arsinh}(y) = \log\left(y + \sqrt{y^2+1} \right) \]
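As a sketch of where this comes from: by the delta method, \(\text{var}\{f(Y)\} \approx f'(\mu)^2 v(\mu)\), so a variance stabilizing \(f\) satisfies

\[ f(\mu) \propto \int^{\mu} \frac{du}{\sqrt{v(u)}} \]

For the model above, after background correction the variance is approximately quadratic in the mean, \(v(\mu) = \sigma^2 + c^2 \mu^2\) with \(c\) a constant that depends on \(\eta\), and this integral is proportional to \(\text{arsinh}(c\mu/\sigma)\).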
The vsn package implements this approach. It estimates \(\beta\) and \(A\) by assuming that most genes don't change, i.e. that \(\theta_j\) does not depend on the array \(i\). The code below takes a bit of time to run.
library(vsn)
nspms <- exprs(vsn2(spms))
i <- 10; j <- 9
# MA plot before vsn
M <- log2(spms[,i])-log2(spms[,j])
A <- (log2(spms[,i])+log2(spms[,j]))/2
splot(A,M,ylim=c(-1.5,1.5))
points(A[spikeinIndex],M[spikeinIndex],bg=1,pch=21)
# vsn2 returns values already on a glog2 (generalized log) scale, so no log2 here
M <- nspms[,i] - nspms[,j]
A <- (nspms[,i] + nspms[,j])/2
splot(A,M,ylim=c(-1.5,1.5))
points(A[spikeinIndex],M[spikeinIndex],bg=1,pch=21)
We see that vsn corrects the bias in a way similar to loess and quantile normalization.