To illustrate how we can adjust for batch effects using statistcal methods, we will create a data example in which the outcome of interest is confounded with batch but not completely. We will also select a outcome for which we have an expectation of what genes should be differentially expressed. We make sex the outcome of interest, and expect genes on the Y chromosome to be differentially expressed.
You can download the data in binary format from the class web page and read them in (as I have done on my laptop), or read them directly over the network (uncomment the line with the url() command, and put a # in front of the first line). The latter will take longer.
We start by finding the genes on the Y chromosome.
library(Biobase)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(genefilter)
load("/Users/ingo/kogo/web/robj/GSE5859.rda")
# load(url("http://biostat.jhsph.edu/~iruczins/teaching/kogo/robj/GSE5859.rda"))
library(hgfocus.db) ## get the gene chromosome
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: org.Hs.eg.db
##
##
chr <- mget(featureNames(e),hgfocusCHRLOC)
chr <- sapply(chr,function(x){ tmp<-names(x[1]); ifelse(is.null(tmp),NA,paste0("chr",tmp))})
y <- colMeans(exprs(e)[which(chr=="chrY"),])
hist(y)
sex <- ifelse(y<4.5,"F","M")
Now we select samples so that sex and month of hybridization are confounded.
batch <- format(pData(e)$date,"%y%m")
ind <- which(batch%in%c("0506","0510"))
set.seed(1)
N <- 12; N1 <- 3; M <- 12; M1 <- 9
ind <- c(sample(which(batch=="0506" & sex=="F"),N1),
sample(which(batch=="0510" & sex=="F"),N-N1),
sample(which(batch=="0506" & sex=="M"),M1),
sample(which(batch=="0510" & sex=="M"),M-M1))
table(batch[ind],sex[ind])
##
## F M
## 0506 3 9
## 0510 9 3
To illustrate the confounding we will pick some genes to show in a heatmap plot. We pick all Y chromosome genes, some genes that we see correlate with batch, and then some randomly selected genes.
library(rafalib)
library(RColorBrewer)
set.seed(1)
tt <- genefilter::rowttests(exprs(e)[,ind],factor(batch[ind]))
ind1 <- which(chr=="chrY") ## real differences
ind2 <- setdiff(c(order(tt$dm)[1:25],order(-tt$dm)[1:25]),ind1)
ind0 <- setdiff(sample(seq(along=tt$dm),50),c(ind2,ind1))
geneindex <- c(ind2,ind0,ind1)
mat <- exprs(e)[geneindex,ind]
mat <- mat-rowMeans(mat)
mat[mat>3]<-3; mat[mat< -3]<- -3 # truncate to visualize
icolors <- rev(brewer.pal(11,"RdYlBu"))
mypar(1,1)
image(t(mat),xaxt="n",yaxt="n",col=icolors)
So what follows is like the analysis we would do in practice. We don’t know there is a batch and we are interested in finding genes that are different between males and females. We start by computing t-statistics and p-values comparing males and females. We use histograms to notice the problem introduced by the batch.
The batch effect adjustment methods are best described with the linear models so we start by writing down the linear more for this particular case:
dat <- exprs(e)[,ind]
X <- sex[ind] ## the covariate
Z <- batch[ind]
tt <- genefilter::rowttests(dat,factor(X))
HLIM <- c(0,1500)
mypar(1,2)
hist(tt$p[!chr%in%c("chrX","chrY")],nc=20,xlab="p-value",ylim=HLIM,main="")
hist(tt$p[chr%in%c("chrY")],nc=20,xlab="p-value",ylim=c(0,9),main="")
Here we show how to implement Combat.
library(sva)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:IRanges':
##
## collapse
## This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.
mod <- model.matrix(~X)
cleandat <- ComBat(dat,Z,mod)
## Found 2 batches
## Adjusting for 1 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
tt <- genefilter::rowttests(cleandat,factor(X))
mypar(1,2)
hist(tt$p[!chr%in%c("chrX","chrY")],nc=20,xlab="p-value",ylim=HLIM,main="")
hist(tt$p[chr%in%c("chrY")],nc=20,xlab="p-value",ylim=c(0,9),main="")
But what exactly is a batch?
times <- (pData(e)$date)
mypar(1,2)
o <- order(times)
plot(times[o],pch=21,bg=as.fumeric(batch)[o],ylab="date")
o <- order(times[ind])
plot(times[ind][o],pch=21,bg=as.fumeric(batch)[ind][o],ylab="date")
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
svafit <- sva(dat,mod)
## Number of significant surrogate variables is: 5
## Iteration (out of 5 ):1 2 3 4 5
svaX <- model.matrix(~X+svafit$sv)
lmfit <- lmFit(dat,svaX)
tt <- lmfit$coef[,2]*sqrt(lmfit$df.residual)/(2*lmfit$sigma)
mypar(1,2)
pval <- 2*(1-pt(abs(tt),lmfit$df.residual[1]))
hist(pval[!chr%in%c("chrX","chrY")],xlab="p-values",ylim=HLIM,main="")
hist(pval[chr%in%c("chrY")],nc=20,xlab="p-value",ylim=c(0,9),main="")
Decompose the data:
Batch <- lmfit$coef[geneindex,3:7]%*%t(svaX[,3:7])
Signal <- lmfit$coef[geneindex,1:2]%*%t(svaX[,1:2])
error <- mat-Signal-Batch
## demean for plot
Signal <- Signal-rowMeans(Signal)
mat <- mat-rowMeans(mat)
mypar(1,4,mar = c(2.75, 4.5, 2.6, 1.1))
image(t(mat),col=icolors,zlim=c(-5,5),xaxt="n",yaxt="n")
image(t(Signal),col=icolors,zlim=c(-5,5),xaxt="n",yaxt="n")
image(t(Batch),col=icolors,zlim=c(-5,5),xaxt="n",yaxt="n")
image(t(error),col=icolors,zlim=c(-5,5),xaxt="n",yaxt="n")