Detecting Significant Changes in Protein Abundance: An R guide


This guide shows how to use R for proteomics data analysis derived from mass spectrometry plattform iTRAQ (or TMT). It aims at data preprocessing, data normalization, and performing a two sample comparison using ordinary and moderated t-test statistics.

At first we provide R code for the analysis of one 8-plex iTRAQ experiment. That following we deal with statistical inference when multiple iTRAQ experiments are present.


Overview

Given one data set derived from an iTRAQ experiment, we only need to run the following three main functions to obtain a list of proteins where changes in protein abundance between cases and controls are calculated:

read.peptides

quantify.proteins

eb.fit

A detailed description is given in the following sections.


Requirements

The reader is expected to have basic R knowledge. The current release Bioconductor is version 3.1, which works with R version 3.2.1.

Install R from CRAN and the basic Bioconductor packages by typing the following code in your R command window

source("http://bioconductor.org/biocLite.R")
biocLite()

Now install additional Bioconductor packages. These packages are necessary for the proteomic data analysis. Make sure to install and update all dependencies when you are asked.

biocLite("limma")
biocLite("qvalue")

You can also check the Bioconductor installation page.


Inference using limma for one 8-plex iTRAQ experiment

Load the following packages by typing in your R command window

library(limma)
library(qvalue)

In a first step, we read raw data from a csv-file and store the data set in a data frame that is named dat. The commands dim and str give an overview of the dimensions and the structure of the data. This exemplary data set contains simulated data for one 8-plex iTRAQ experiment.

# read artificial iTRAQ data set
dat <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/eupa/example_iTRAQ.csv") 
dim(dat)
## [1] 10396    14
str(dat)
## 'data.frame':    10396 obs. of  14 variables:
##  $ Protein.Group.Accessions: int  4501867 4501867 4501885 4501885 4501885 4501885 4502027 4502027 4502027 4502027 ...
##  $ Protein.Descriptions    : Factor w/ 376 levels "10 kDa heat shock protein, mitochondrial [Homo sapiens]",..: 48 48 49 49 49 49 351 351 351 351 ...
##  $ Sequence                : Factor w/ 3560 levels "aAAIQTmSLDAER",..: 993 2580 450 452 457 458 66 1714 2075 2088 ...
##  $ Quan.Usage              : Factor w/ 1 level "Used": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Quan.Info               : Factor w/ 1 level "Unique": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Isolation.Interference  : int  30 30 30 30 30 30 30 30 30 30 ...
##  $ X113                    : num  4026041 528744 3644218 296407 1354282 ...
##  $ X114                    : num  4200342 490108 2632247 1790033 633951 ...
##  $ X115                    : num  312788 1839129 2146686 1253718 1231307 ...
##  $ X116                    : num  1385926 7682245 239161 2410432 4883382 ...
##  $ X117                    : num  103338 407685 1719507 2184130 360273 ...
##  $ X118                    : num  4604397 67507 5419868 191450 888141 ...
##  $ X119                    : num  1026928 1237453 317735 2400822 74120 ...
##  $ X121                    : num  142222 363690 113920 557290 768629 ...

To perform the analysis correctly the following columns are required to be included in the data set and should be named in the same way as in this exemplary data set. Each row corresponds to a peptide spectra

Protein.Group.Accessions protein group accession or a protein id for the spectra
Sequence name of peptide sequence
Quan.Usage indicator whether the spectra is used or not used
Quan.Info indicator whether the spectra is unique or redundant
Isolation.Interference % of isolation interference


The column Protein.Descriptions is optional and maybe helpful for deeper investigation after detecting significantly differentially expressed proteins.

For each row, the columns X113,...X121 contain measurements for a spectra (in this example for 8 channels representing an 8-plex iTRAQ experiment).


In the next step we define a vector of headlines of all channels as they are stored in the data set dat.

cha <- c("X113", "X114", "X115", "X116", "X117", "X118", "X119", "X121")
# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/eupa/source.functions.r")

In the R script source.functions.r there are four functions included: read.peptides, quantify.proteins, eb.fit, and eb.fit.mult. R code for all function is provided in the appendix.

The function read.peptides does data preprocessing and excludes spectra (rows in the data set) that contain missing values, a higher isolation interference than 30%, that are redundant or not used.

By calling the function quantify.proteins, we quantify all proteins that are represented by peptide spectra. Channel values for each protein are median polished log2 transformed values from the median value of all peptides belonging to each protein.

The function eb.fit (eb: empirical Bayes) performes the statistical analysis of interest: two sample t-tests. The output of this procedure is a data frame that especially contains for each protein its log2-fold-change, and ordinary as well as moderated p-values and q-values. The results are sorted by moderated p-values in increasing order.

The function eb.fit.mult is an extension of eb.fit when multiple experiments are present (see next section).

dat <- read.peptides(dat, cha) 
dim(dat) # 10396 out of 10396 peptides left (no missing values in this artificial data set)
## [1] 10396    14
# identify proteins from peptide spectra
dat <- quantify.proteins(dat, cha) # 371 proteins identified
# find "one-hit wonders"
dat.onehit <- subset(dat, dat$n.peptides == 1) 
dim(dat.onehit) # 64 proteins are identified by one peptide only
## [1] 64 10
# eliminate "one-hit wonders"
dat <- subset(dat, dat$n.peptides != 1)
dim(dat) # 307 proteins are identified by at least two peptides
## [1] 307  10
# boxplot: intensities of all eight channels after data preprocessing and normalization
par(mfrow=c(1,1), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
boxplot(dat[, 1:length(cha)],  ylim = c(-3, 3), main="Boxplot normalized Intensities")

# define treatment and control groups for two group comparison, assuming 4 cases and 4 controls
tr <- c("X113", "X114", "X115", "X116")
ct <- c("X117", "X118", "X119", "X121")

# define design according to syntax of limma package
design <- model.matrix(~factor(c(2,2,2,2,1,1,1,1)))
design
##   (Intercept) factor(c(2, 2, 2, 2, 1, 1, 1, 1))2
## 1           1                                  1
## 2           1                                  1
## 3           1                                  1
## 4           1                                  1
## 5           1                                  0
## 6           1                                  0
## 7           1                                  0
## 8           1                                  0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(c(2, 2, 2, 2, 1, 1, 1, 1))`
## [1] "contr.treatment"
colnames(design) <- c("Intercept", "Diff")
res.eb <- eb.fit(dat[, c(tr,ct)], design)
head(res.eb)
##            logFC     t.ord     t.mod       p.ord       p.mod    q.ord
## 4504111  0.93550  3.547207  3.565387 0.012112875 0.006761027 0.699554
## 4503481  0.69300  3.100886  2.958554 0.021092616 0.017199594 0.699554
## 4503609 -0.49675 -4.148599 -2.893145 0.006020499 0.019062442 0.699554
## 4504025  1.15875  2.528647  2.818942 0.044761396 0.021429507 0.699554
## 4502605  1.19975  2.440570  2.741301 0.050432631 0.024231000 0.699554
## 4504301  0.64875  2.800465  2.705035 0.031144845 0.025665360 0.699554
##             q.mod df.r    df.0      s2.0         s2    s2.post
## 4504111 0.7366396    6 2.41595 0.1341766 0.13910558 0.13769062
## 4503481 0.7366396    6 2.41595 0.1341766 0.09989067 0.10973305
## 4503609 0.7366396    6 2.41595 0.1341766 0.02867496 0.05896109
## 4504025 0.7366396    6 2.41595 0.1341766 0.41998446 0.33793816
## 4502605 0.7366396    6 2.41595 0.1341766 0.48331363 0.38308755
## 4504301 0.7366396    6 2.41595 0.1341766 0.10733079 0.11503735

The rownames of the generated data frame res.eb correspond to protein groups accessions numbers provided by the original raw data set. Each protein (row) has the following columns:

logFC estimate of the log2-fold-change corresponding to the effect size
t.ord vector of ordinary t-statistic
t.mod moderated t-statistic
p.ord ordinary p-value corresonding to the ordinary t-statistic
p.mod moderated p-value corresonding to the moderated t-statistic
q.ord ordinary q-value corresonding to the ordinary t-statistic
q.mod moderated q-value corresonding to the moderated t-statistic
df.r residual degrees of freedom assiciated with ordinary t-statistic and p-value
df.0 degrees of freedom associated with s2.0
s2.0 estimated prior value for the variance
s2 sample variance
s2.post posterior value for the variance


The protein with protein group accessions 4504111 listed at the top of the results chart has an estimated log2-effect size of 0.9355 between four cases and four controls. The moderated p-value of 0.006761 is considerably smaller that the ordinary p-value of 0.0121129. A moderated q-value of 0.7366396 indicates that this protein cannot be declared significant if the false discovery rate (FDR) should be controlled at 5%.


The following volcano plots visualize the shrinkage effect of the moderated p-values compared to the orinary by plotting the significance versus log2-fold-change on the y- and x-axes, respectively.

# volcano plots for ordinary and moderated p-values
rx <- c(-1, 1)*max(abs(res.eb$logFC))*1.1
ry <- c(0, ceiling(max(-log10(res.eb$p.ord), -log10(res.eb$p.mod))))

par(mfrow=c(1,2), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
par(las=1, xaxs="i", yaxs="i")

plot(res.eb$logFC, -log10(res.eb$p.ord), pch=21, bg="lightgrey", cex=0.9, 
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of ordinary p-values")

plot(res.eb$logFC, -log10(res.eb$p.mod), pch=21, bg="lightgrey", cex=0.9,
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,6,1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of moderated p-values")


Inference using limma for multiple experiments

# read three simulated iTRAQ experiments
dat1 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/eupa/example1_iTRAQ.csv") 
dat2 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/eupa/example2_iTRAQ.csv") 
dat3 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/eupa/example3_iTRAQ.csv") 

# data preoprocessing as described in the previous section for each experiment separately
cha <- c("X113", "X114", "X115", "X116", "X117", "X118", "X119", "X121")

dat1 <- read.peptides(dat1, cha) 
dat2 <- read.peptides(dat2, cha) 
dat3 <- read.peptides(dat3, cha) 

dat1 <- quantify.proteins(dat1, cha)
dat2 <- quantify.proteins(dat2, cha)
dat3 <- quantify.proteins(dat3, cha)

# eliminate "one-hit wonders"
dat1 <- subset(dat1, dat1$n.peptides != 1)
dat2 <- subset(dat2, dat2$n.peptides != 1)
dat3 <- subset(dat3, dat3$n.peptides != 1)


Now, we estimate the treatment effect by adjusting the statistical model for the three different experiments.

# limma, factorial design; the first 8 columns of ech data set contain the channel measurements
dat <- data.frame(dat1[, 1:8], dat2[, 1:8], dat3[, 1:8])
tr <- as.factor(rep(c(2,2,2,2,1,1,1,1), 3))
ex <- as.factor(c(rep(1,8), rep(2,8), rep(3,8)))
design <- model.matrix(~ ex + tr)
res.eb.mult <- eb.fit.mult(dat, design)
head(res.eb.mult)
##             logFC    t.ord    t.mod        p.ord        p.mod     q.ord
## 4502247 1.3410833 3.882061 4.052162 0.0009268081 0.0005138626 0.2125986
## 4506633 0.9284167 3.474735 3.591761 0.0023909071 0.0015850192 0.2125986
## 4503529 0.7955833 3.378840 3.468716 0.0029833756 0.0021348222 0.2125986
## 4505751 1.1865833 3.254283 3.401990 0.0039715179 0.0025069562 0.2125986
## 4502395 1.3559167 3.189580 3.346055 0.0046043906 0.0028670592 0.2125986
## 4505235 0.5120833 3.268263 3.233199 0.0038463699 0.0037531966 0.2125986
##             q.mod     df.0 df.r      s2.0        s2   s2.post
## 4502247 0.1418205 2.449922   20 0.1767337 0.7160411 0.6571874
## 4506633 0.1582555 2.449922   20 0.1767337 0.4283451 0.4008872
## 4503529 0.1582555 2.449922   20 0.1767337 0.3326499 0.3156350
## 4505751 0.1582555 2.449922   20 0.1767337 0.7976953 0.7299308
## 4502395 0.1582555 2.449922   20 0.1767337 1.0843020 0.9852606
## 4505235 0.1726403 2.449922   20 0.1767337 0.1472986 0.1505108
# volcano plots for ordinary and moderated p-values
rx <- c(-1, 1)*max(abs(res.eb$logFC))*1.1
ry <- c(0, ceiling(max(-log10(res.eb$p.ord), -log10(res.eb$p.mod))))

par(mfrow=c(1,2), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
par(las=1, xaxs="i", yaxs="i")

plot(res.eb.mult$logFC, -log10(res.eb.mult$p.ord), pch=21, bg="lightgrey", cex=0.9, 
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of ordinary p-values")

plot(res.eb.mult$logFC, -log10(res.eb.mult$p.mod), pch=21, bg="lightgrey", cex=0.9,
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,6,1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of moderated p-values")


Appendix: R code for source functions

read.peptides <- function(dat, cha){
  output <- NULL
  
  dat$Sequence <- as.character(dat$Sequence)
  dat$Protein.Group.Accessions <- as.character(dat$Protein.Group.Accessions)
  dat$Quan.Usage <- as.character(dat$Quan.Usage)
  dat$Quan.Info <- as.character(dat$Quan.Info)
  dat$Isolation.Interference <- as.numeric(as.character(dat$Isolation.Interference))
  
  dat <- subset(dat, Isolation.Interference<=30)  
  dat <- subset(dat, Quan.Usage=="Used")
  dat <- subset(dat, Protein.Group.Accessions!="")
  dat <- subset(dat, !apply(dat[cha], 1, f <- function(x) any(is.na(x))))  
}

quantify.proteins <- function(dat, cha){
  e.function <- function(x, seq) tapply(x, seq, median)
  output <- NULL
  
  dat$Sequence <- toupper(dat$Sequence) # Capital letters
  accessions <- as.character(unique(dat$Protein.Group.Accessions))
  n.proteins <- length(accessions)
  n.cha <- length(cha)
  
  for(k in 1:n.proteins){
    id <- accessions[k]
    sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
    sdat[cha] <- log2(sdat[cha])
    sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
    pdat <- sdat[, -1]
    n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
    temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])          
    n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)    
    if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
    pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
    output <- rbind(output, pdat)
  }
  output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
  output[,1:n.cha] <- sweep(output[,1:n.cha],2,apply(output[,1:n.cha],2,median))
  output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
  output[,1:n.cha] <- round(output[,1:n.cha],3)
  row.names(output) <- accessions
  output <- as.data.frame(output)
  return(output)
}

eb.fit <- function(dat, design){
  n <- dim(dat)[1]
  fit <- lmFit(dat, design)
  fit.eb <- eBayes(fit)
  logFC <- fit.eb$coefficients[, 2]
  df.r <- fit.eb$df.residual
  df.0 <- rep(fit.eb$df.prior, n)
  s2.0 <- rep(fit.eb$s2.prior, n)
  s2 <- (fit.eb$sigma)^2
  s2.post <- fit.eb$s2.post
  t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
  t.mod <- fit.eb$t[, 2]
  p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
  p.mod <- fit.eb$p.value[, 2]
  q.ord <- qvalue(p.ord)$q
  q.mod <- qvalue(p.mod)$q
  results.eb <- data.frame(logFC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
  results.eb <- results.eb[order(results.eb$p.mod), ]
  return(results.eb)
}

eb.fit.mult <- function(dat, design){
  n <- dim(dat)[1]
  fit <- lmFit(dat, design)
  fit.eb <- eBayes(fit)
  logFC <- fit.eb$coef[, "tr2"]
  df.0 <- rep(fit.eb$df.prior, n)
  df.r <- fit.eb$df.residual
  s2.0 <- rep(fit.eb$s2.prior, n)
  s2 <- (fit.eb$sigma)^2
  s2.post <- fit.eb$s2.post
  t.ord <- fit.eb$coef[, "tr2"]/fit.eb$sigma/fit.eb$stdev.unscaled[, "tr2"]
  t.mod <- fit.eb$t[, "tr2"]
  p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
  p.mod <- fit.eb$p.value[, "tr2"]
  q.ord <- qvalue(p.ord)$q
  q.mod <- qvalue(p.mod)$q
  results.eb.mult <- data.frame(logFC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.0, df.r, s2.0, s2, s2.post)
  results.eb.mult <- results.eb.mult[order(results.eb.mult$p.mod), ]
  return(results.eb.mult)
}