R guide: Analysis of Cardiovascular Proteomics Data


This guide shows how to use R for analyzing cardiovascular proteomics data derived from mass spectrometry plattforms TMT or iTRAQ. This analysis pipeline contains code for data preprocessing, data normalization, and performing a two sample comparison using ordinary and moderated t-test statistics.

We provide a straightforward analysis pipeline for 10-plex TMT experiments. Hereby, we start with R-code for one experiment followed by R-code for statistical inference when multiple experiments are present.


Overview

Given the data of one 10-plex TMT experiment, we need to run the following three main functions to obtain a list of proteins where changes in protein abundance between, e.g., treated and untreated cells are calculated. Details are provided in the following paragraphs:

read.peptides

quantify.proteins

eb.fit.


Requirements

The reader is expected to have basic R knowledge. For this analysis we used the (current) Bioconductor release (version 3.1) and R version 3.2.1.

Please install R from CRAN and the basic Bioconductor packages by typing the following code in your R command window:

source("http://bioconductor.org/biocLite.R")
biocLite()

In a next step, install additional Bioconductor packages that are necessary for the proteomic data analysis. Please make sure to install and update all dependencies when you are asked.

biocLite("limma")
biocLite("qvalue")

You can also check the Bioconductor installation page.


Analysis of one 10-plex TMT experiment

Load the following packages by typing in your R command window

library(limma)
library(qvalue)

In a first step, we read raw data from a csv-file and store the data set in a data frame that is named dat. The commands dim and str give an overview of the dimensions and the structure of the data. This exemplary data set contains simulated data for one 10-plex TMT experiment.

# read simulated TMT data set
dat <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment.csv") 
dim(dat)
## [1] 22624    16
str(dat)
## 'data.frame':    22624 obs. of  16 variables:
##  $ Protein.Group.Accessions: int  4506975 4506975 18079216 12707580 16579885 16579885 16579885 16579885 16579885 117938274 ...
##  $ Protein.Descriptions    : Factor w/ 3895 levels "1-acyl-sn-glycerol-3-phosphate acyltransferase delta [Homo sapiens]",..: 3195 3195 633 2259 155 155 155 155 155 2034 ...
##  $ Sequence                : Factor w/ 10980 levels "aAAAAAAAAAAAAAAGAGAGAk",..: 1 1 2 3 4 5 5 5 5 6 ...
##  $ Quan.Usage              : Factor w/ 1 level "Used": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Quan.Info               : Factor w/ 1 level "Unique": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Isolation.Interference  : int  25 38 75 31 22 53 8 10 11 69 ...
##  $ X126                    : num  6712160 1873633 593284 96868 4322936 ...
##  $ X127_N                  : num  221073 392450 4248749 531059 312364 ...
##  $ X127_C                  : num  400791 612276 1302342 731533 592202 ...
##  $ X128_N                  : num  580248 834695 5276243 226954 777629 ...
##  $ X128_C                  : num  2089265 1548747 432149 241949 886331 ...
##  $ X129_N                  : num  557290 355312 1917577 1688991 6639442 ...
##  $ X129_C                  : num  391106 20691 69547 1192171 1371124 ...
##  $ X130_N                  : num  1314488 433218 826219 653767 684538 ...
##  $ X130_C                  : num  1016075 2329839 3890083 208666 1504843 ...
##  $ X131                    : num  227170 442554 366487 803876 1755958 ...

To perform the analysis correctly the following columns are required to be included in the data set and should be named in the same way as in this exemplary data set. Each row corresponds to a peptide spectra

Protein.Group.Accessions protein group accession or a protein id for the spectra
Sequence name of peptide sequence
Quan.Usage indicator whether the spectra is used or not used
Quan.Info indicator whether the spectra is unique or redundant
Isolation.Interference % of isolation interference


The column Protein.Descriptions is optional and maybe helpful for deeper investigation after detecting significantly differentially expressed proteins.

For each row, the columns X126, X127_N, ..., X131 contain measurements for the 10 channels.


In the next step we define a vector of headlines of all channels as they are stored in the data set dat.

cha <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")
# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")

R code for all function is provided in the appendix.

The function read.peptides does data preprocessing and excludes spectra (rows in the data set) that contain missing values, a higher isolation interference than 30%, and that are redundant or not used. The exclusion of spectra with a higher isolation interference than 30% is a suggestion.

THe funcion quantify.proteins quantifies all proteins that are represented by peptide spectra. Channel values for each protein are median polished log2 transformed values from the median value of all peptides belonging to each protein.

The function eb.fit (eb: empirical Bayes) performes the statistical analysis of interest: two sample t-tests. The output of this procedure is a data frame that in particular contains for each protein its log2 (fold-change) and its ordinary as well as moderated p-values and q-values. The results are sorted by moderated p-values in increasing order.

dat <- read.peptides(dat, cha) 
dim(dat) # 19047 peptides
## [1] 19047    16
# identify proteins from peptide spectra 
# (takes aorund one minute with an average laptop)
dat <- quantify.proteins(dat, cha) 
dim(dat) # 3569 proteins identified
## [1] 3569   12
# find "one-hit wonders"
dat.onehit <- subset(dat, dat$n.peptides == 1) 
dim(dat.onehit) # 1821 proteins are identified by only one peptide
## [1] 1821   12
# eliminate "one-hit wonders"
dat <- subset(dat, dat$n.peptides != 1)
dim(dat) # 1748 proteins are identified by at least two peptides
## [1] 1748   12
# boxplot: intensities of all eight channels after data preprocessing and normalization
par(mfrow=c(1,1), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
boxplot(dat[, 1:length(cha)],  ylim = c(-3, 3), main="Boxplot normalized Intensities")

# define treatment (tr) and control (ct) groups for a two group comparison, assuming 5 cases and 5 controls
tr <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C")
ct <- c("X129_N", "X129_C", "X130_N", "X130_C", "X131")

# define design according to limma package framework
design <- model.matrix(~factor(c(2,2,2,2,2,1,1,1,1,1)))
design
##    (Intercept) factor(c(2, 2, 2, 2, 2, 1, 1, 1, 1, 1))2
## 1            1                                        1
## 2            1                                        1
## 3            1                                        1
## 4            1                                        1
## 5            1                                        1
## 6            1                                        0
## 7            1                                        0
## 8            1                                        0
## 9            1                                        0
## 10           1                                        0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(c(2, 2, 2, 2, 2, 1, 1, 1, 1, 1))`
## [1] "contr.treatment"
colnames(design) <- c("Intercept", "tr-cr")
res.eb <- eb.fit(dat[, c(tr,ct)], design)
head(res.eb)
##            log2FC     t.ord     t.mod       p.ord        p.mod     q.ord
## 189027129  2.3440  5.026608  5.133506 0.001018458 0.0002793448 0.5819295
## 82775371   1.4416  6.226266  4.676150 0.000252119 0.0005926906 0.3435027
## 66932916   1.4338  4.002641  3.742556 0.003935244 0.0029926666 0.8222439
## 28827795  -1.2990 -4.312766 -3.738402 0.002571040 0.0030149568 0.8222439
## 63252888   1.6682  3.676763  3.725913 0.006248238 0.0030830064 0.8222439
## 226246591  1.6096  3.659504  3.674562 0.006405803 0.0033798430 0.8222439
##               q.mod df.r     df.0      s2.0        s2   s2.post
## 189027129 0.4821453    8 3.553636 0.4707887 0.5436323 0.5212273
## 82775371  0.5114880    8 3.553636 0.4707887 0.1340214 0.2376034
## 66932916  0.7428239    8 3.553636 0.4707887 0.3207923 0.3669278
## 28827795  0.7428239    8 3.553636 0.4707887 0.2268018 0.3018466
## 63252888  0.7428239    8 3.553636 0.4707887 0.5146416 0.5011534
## 226246591 0.7428239    8 3.553636 0.4707887 0.4836504 0.4796945

The rownames of the generated data frame res.eb correspond to protein groups accessions numbers provided by the original raw data set. Each protein (row) has the following columns:

log2FC estimate of the log2-fold-change corresponding to the effect size
t.ord vector of ordinary t-statistic
t.mod moderated t-statistic
p.ord ordinary p-value corresonding to the ordinary t-statistic
p.mod moderated p-value corresonding to the moderated t-statistic
q.ord ordinary q-value corresonding to the ordinary t-statistic
q.mod moderated q-value corresonding to the moderated t-statistic
df.r residual degrees of freedom assiciated with ordinary t-statistic and p-value
df.0 degrees of freedom associated with s2.0
s2.0 estimated prior value for the variance
s2 sample variance
s2.post posterior value for the variance


The protein with protein group accessions 189027129 listed on top of the results chart has an estimated log2-effect size of 2.344 between five cases and five controls. The moderated p-value of 0.00027 is considerably smaller that the ordinary p-value of 0.001. A moderated q-value of 0.48215 indicates that this protein cannot be declared significant if the false discovery rate (FDR) should be controlled at 5%.


The following volcano plots visualize the shrinkage effect of the moderated p-values compared to the orinary by plotting the significance versus log2-fold-change on the y- and x-axes, respectively.

# volcano plots for ordinary and moderated p-values
rx <- c(-1, 1)*max(abs(res.eb$log2FC))*1.1
ry <- c(0, ceiling(max(-log10(res.eb$p.ord), -log10(res.eb$p.mod))))

par(mfrow=c(1,2), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
par(las=1, xaxs="i", yaxs="i")

plot(res.eb$log2FC, -log10(res.eb$p.ord), pch=21, bg="lightgrey", cex=0.9, 
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of ordinary p-values")

plot(res.eb$log2FC, -log10(res.eb$p.mod), pch=21, bg="lightgrey", cex=0.9,
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of moderated p-values")


Inference using limma for multiple experiments

In this section, we analyze three TMT experiments together. For simplicity, we assume that each of the three data sets is already preprocessed separately by applying read.peptides and quantify.proteins to the original data sets. We also sugguest to remove one-hit wonders. Each entry in each data set represents a simulated measurement for one protein (rows) in a specific channel (columns).


First, we additionally load the function eb.fit.mult which is an extension of eb.fit when multiple experiments are present.

# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.mult.r")


Next, we read the data of three TMT experiments that are stored in csv-files.

# read preprocessed data sets
dat1 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment_1.csv") 
dat2 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment_2.csv") 
dat3 <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment_3.csv") 

# set channel names
cha <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")


Now, we estimate the treatment effect by adjusting the statistical model for the three different experiments.

# limma, factorial design; the first 10 columns of ech data set contain the channel measurements
# Here, dat1, dat2, and dat3 only consist of 10 channel columns, but there might me addional information saved in the data sets
dat <- data.frame(dat1[, 1:10], dat2[, 1:10], dat3[, 1:10])
tr <- as.factor(rep(c(2,2,2,2,2,1,1,1,1,1), 3))
ex <- as.factor(c(rep(1,10), rep(2,10), rep(3,10)))
design <- model.matrix(~ ex + tr)
res.eb.mult <- eb.fit.mult(dat, design)
head(res.eb.mult)
##       log2FC    t.ord    t.mod        p.ord        p.mod        q.ord
## 52 0.3400148 8.819983 7.800588 2.706211e-09 1.108432e-08 2.703505e-06
## 54 0.3128688 7.240442 6.621382 1.089316e-07 2.583376e-07 5.283399e-05
## 48 0.3910481 6.785453 6.607751 3.350579e-07 2.681472e-07 7.642497e-05
## 75 0.3189281 7.087028 6.550356 1.586606e-07 3.137650e-07 5.283399e-05
## 66 0.4260741 6.589929 6.533841 5.471007e-07 3.282933e-07 7.642497e-05
## 37 0.4693814 6.381745 6.431789 9.264516e-07 4.344809e-07 9.255252e-05
##           q.mod     df.0 df.r       s2.0         s2    s2.post
## 52 1.107324e-05 3.776869   26 0.03561429 0.01114606 0.01424958
## 54 6.559300e-05 3.776869   26 0.03561429 0.01400411 0.01674512
## 48 6.559300e-05 3.776869   26 0.03561429 0.02490944 0.02626723
## 75 6.559300e-05 3.776869   26 0.03561429 0.01518863 0.01777940
## 66 6.559300e-05 3.776869   26 0.03561429 0.03135235 0.03189293
## 37 6.566976e-05 3.776869   26 0.03561429 0.04057273 0.03994381

Note, that there may be a warning message that NAs occure. This is due to missing values in the data sets (proteins that are not present in all three data sets) and does not effect the results for statistical inference.


Create volcano plots:

# volcano plots for ordinary and moderated p-values
rx <- c(-1, 1)*max(abs(res.eb.mult$log2FC), na.rm = TRUE)*1.1
ry <- c(0, ceiling(max(-log10(res.eb.mult$p.ord), -log10(res.eb.mult$p.mod), na.rm = TRUE)))

par(mfrow=c(1,2), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
par(las=1, xaxs="i", yaxs="i")

plot(res.eb.mult$log2FC, -log10(res.eb.mult$p.ord), pch=21, bg="lightgrey", cex=0.9, 
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of ordinary p-values")

plot(res.eb.mult$log2FC, -log10(res.eb.mult$p.mod), pch=21, bg="lightgrey", cex=0.9,
     xlim=rx, ylim=ry, xaxt="n",
     xlab="fold change", ylab="-log10  p-value")
abline(v=seq(-2,2,1), col="lightgray", lty="dotted")
abline(h=seq(0,ry[2],1), col="lightgray", lty="dotted")
axis(1, seq(-2,2,1), paste(c("1/4","1/2","1/1","2/1","4/1")))
title("volcano plot of moderated p-values")


Appendix: R code for source functions

read.peptides <- function(dat, cha){
  output <- NULL
  
  dat$Sequence <- as.character(dat$Sequence)
  dat$Protein.Group.Accessions <- as.character(dat$Protein.Group.Accessions)
  dat$Quan.Usage <- as.character(dat$Quan.Usage)
  dat$Quan.Info <- as.character(dat$Quan.Info)
  dat$Isolation.Interference <- as.numeric(as.character(dat$Isolation.Interference))
  
  dat <- subset(dat, Isolation.Interference<=30)  
  dat <- subset(dat, Quan.Usage=="Used")
  dat <- subset(dat, Protein.Group.Accessions!="")
  dat <- subset(dat, !apply(dat[cha], 1, f <- function(x) any(is.na(x))))  
}


quantify.proteins <- function(dat, cha){
  e.function <- function(x, seq) tapply(x, seq, median)
  output <- NULL
  
  dat$Sequence <- toupper(dat$Sequence) # Capital letters
  accessions <- as.character(unique(dat$Protein.Group.Accessions))
  n.proteins <- length(accessions)
  n.cha <- length(cha)
  
  for(k in 1:n.proteins){
    id <- accessions[k]
    sdat <- subset(dat, Protein.Group.Accessions==id)[c("Sequence", cha)]
    sdat[cha] <- log2(sdat[cha])
    sdat[cha] <- sdat[cha] - apply(sdat[cha], 1, median)
    pdat <- sdat[, -1]
    n.spectra <- ifelse(is.integer(dim(pdat)), nrow(pdat), 1)
    temp <- apply(sdat[,-1], 2, e.function,seq=sdat[, 1])          
    n.peptides <- ifelse(is.integer(dim(temp)), nrow(temp), 1)    
    if(is.integer(dim(pdat))) pdat <- apply(pdat, 2, median)
    pdat <- c(pdat, n.peptides=n.peptides, n.spectra=n.spectra)
    output <- rbind(output, pdat)
  }
  output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
  output[,1:n.cha] <- sweep(output[,1:n.cha],2,apply(output[,1:n.cha],2,median))
  output[,1:n.cha] <- sweep(output[,1:n.cha],1,apply(output[,1:n.cha],1,median))
  output[,1:n.cha] <- round(output[,1:n.cha],3)
  row.names(output) <- accessions
  output <- as.data.frame(output)
  return(output)
}


eb.fit <- function(dat, design){
  n <- dim(dat)[1]
  fit <- lmFit(dat, design)
  fit.eb <- eBayes(fit)
  log2FC <- fit.eb$coefficients[, 2]
  df.r <- fit.eb$df.residual
  df.0 <- rep(fit.eb$df.prior, n)
  s2.0 <- rep(fit.eb$s2.prior, n)
  s2 <- (fit.eb$sigma)^2
  s2.post <- fit.eb$s2.post
  t.ord <- fit.eb$coefficients[, 2]/fit.eb$sigma/fit.eb$stdev.unscaled[, 2]
  t.mod <- fit.eb$t[, 2]
  p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
  p.mod <- fit.eb$p.value[, 2]
  q.ord <- qvalue(p.ord)$q
  q.mod <- qvalue(p.mod)$q
  results.eb <- data.frame(log2FC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
  results.eb <- results.eb[order(results.eb$p.mod), ]
  return(results.eb)
}


eb.fit.mult <- function(dat, design){
  n <- dim(dat)[1]
  fit <- lmFit(dat, design)
  fit.eb <- eBayes(fit)
  log2FC <- fit.eb$coef[, "tr2"]
  df.0 <- rep(fit.eb$df.prior, n)
  df.r <- fit.eb$df.residual
  s2.0 <- rep(fit.eb$s2.prior, n)
  s2 <- (fit.eb$sigma)^2
  s2.post <- fit.eb$s2.post
  t.ord <- fit.eb$coef[, "tr2"]/fit.eb$sigma/fit.eb$stdev.unscaled[, "tr2"]
  t.mod <- fit.eb$t[, "tr2"]
  p.ord <- 2*pt(-abs(t.ord), fit.eb$df.residual)
  p.mod <- fit.eb$p.value[, "tr2"]
  q.ord <- q.mod <- rep(NA,n)
  ids <- which(!is.na(p.ord))
  k <- 0
  q1 <- qvalue(p.ord[!is.na(p.ord)])$q
  q2 <- qvalue(p.mod[!is.na(p.mod)])$q
  for(i in ids){ 
    k <- k+1
    q.ord[i] <- q1[k] 
    q.mod[i] <- q2[k]
  }
  results.eb.mult <- data.frame(log2FC, t.ord, t.mod, p.ord, p.mod, q.ord, q.mod, df.r, df.0, s2.0, s2, s2.post)
  results.eb.mult <- results.eb.mult[order(results.eb.mult$p.mod), ]
  return(results.eb.mult)
}