R guide: Analysis of Cardiovascular Proteomics Data


This guide shows how to use R for analyzing cardiovascular proteomics data derived from mass spectrometry plattforms TMT or iTRAQ. This analysis pipeline contains code for data preprocessing, data normalization, and performing a two sample comparison using ordinary and moderated t-test statistics.

We provide a straightforward analysis pipeline for 10-plex TMT experiments. Hereby, we start with R-code for one experiment followed by R-code for statistical inference when multiple experiments are present.


Overview

Given the data of one 10-plex TMT experiment, we need to run the following three main functions to obtain a list of proteins where changes in protein abundance between, e.g., treated and untreated cells are calculated. Details are provided in the following paragraphs:

read.peptides

quantify.proteins

eb.fit.


Requirements

The reader is expected to have basic R knowledge. For this analysis we used the (current) Bioconductor release (version 3.1) and R version 3.2.1.

Please install R from CRAN and the basic Bioconductor packages by typing the following code in your R command window:

source("http://bioconductor.org/biocLite.R")
biocLite()

In a next step, install additional Bioconductor packages that are necessary for the proteomic data analysis. Please make sure to install and update all dependencies when you are asked.

biocLite("limma")
biocLite("qvalue")

You can also check the Bioconductor installation page.


Analysis of one 10-plex TMT experiment

Load the following packages by typing in your R command window

library(limma)
library(qvalue)

In a first step, we read raw data from a csv-file and store the data set in a data frame that is named dat. The commands dim and str give an overview of the dimensions and the structure of the data. This exemplary data set contains simulated data for one 10-plex TMT experiment.

# read simulated TMT data set
dat <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment.csv") 
dim(dat)
## [1] 22624    16
str(dat)
## 'data.frame':    22624 obs. of  16 variables:
##  $ Protein.Group.Accessions: int  4506975 4506975 18079216 12707580 16579885 16579885 16579885 16579885 16579885 117938274 ...
##  $ Protein.Descriptions    : Factor w/ 3895 levels "1-acyl-sn-glycerol-3-phosphate acyltransferase delta [Homo sapiens]",..: 3195 3195 633 2259 155 155 155 155 155 2034 ...
##  $ Sequence                : Factor w/ 10980 levels "aAAAAAAAAAAAAAAGAGAGAk",..: 1 1 2 3 4 5 5 5 5 6 ...
##  $ Quan.Usage              : Factor w/ 1 level "Used": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Quan.Info               : Factor w/ 1 level "Unique": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Isolation.Interference  : int  25 38 75 31 22 53 8 10 11 69 ...
##  $ X126                    : num  6712160 1873633 593284 96868 4322936 ...
##  $ X127_N                  : num  221073 392450 4248749 531059 312364 ...
##  $ X127_C                  : num  400791 612276 1302342 731533 592202 ...
##  $ X128_N                  : num  580248 834695 5276243 226954 777629 ...
##  $ X128_C                  : num  2089265 1548747 432149 241949 886331 ...
##  $ X129_N                  : num  557290 355312 1917577 1688991 6639442 ...
##  $ X129_C                  : num  391106 20691 69547 1192171 1371124 ...
##  $ X130_N                  : num  1314488 433218 826219 653767 684538 ...
##  $ X130_C                  : num  1016075 2329839 3890083 208666 1504843 ...
##  $ X131                    : num  227170 442554 366487 803876 1755958 ...

To perform the analysis correctly the following columns are required to be included in the data set and should be named in the same way as in this exemplary data set. Each row corresponds to a peptide spectra

Protein.Group.Accessions protein group accession or a protein id for the spectra
Sequence name of peptide sequence
Quan.Usage indicator whether the spectra is used or not used
Quan.Info indicator whether the spectra is unique or redundant
Isolation.Interference % of isolation interference


The column Protein.Descriptions is optional and maybe helpful for deeper investigation after detecting significantly differentially expressed proteins.

For each row, the columns X126, X127_N, ..., X131 contain measurements for the 10 channels.


In the next step we define a vector of headlines of all channels as they are stored in the data set dat.

cha <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")
# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")

R code for all function is provided in the appendix.

The function read.peptides does data preprocessing and excludes spectra (rows in the data set) that contain missing values, a higher isolation interference than 30%, and that are redundant or not used. The exclusion of spectra with a higher isolation interference than 30% is a suggestion.

THe funcion quantify.proteins quantifies all proteins that are represented by peptide spectra. Channel values for each protein are median polished log2 transformed values from the median value of all peptides belonging to each protein.

The function eb.fit (eb: empirical Bayes) performes the statistical analysis of interest: two sample t-tests. The output of this procedure is a data frame that in particular contains for each protein its log2 (fold-change) and its ordinary as well as moderated p-values and q-values. The results are sorted by moderated p-values in increasing order.

dat <- read.peptides(dat, cha) 
dim(dat) # 19047 peptides
## [1] 19047    16
# identify proteins from peptide spectra 
# (takes aorund one minute with an average laptop)
dat <- quantify.proteins(dat, cha) 
dim(dat) # 3569 proteins identified
## [1] 3569   12
# find "one-hit wonders"
dat.onehit <- subset(dat, dat$n.peptides == 1) 
dim(dat.onehit) # 1821 proteins are identified by only one peptide
## [1] 1821   12
# eliminate "one-hit wonders"
dat <- subset(dat, dat$n.peptides != 1)
dim(dat) # 1748 proteins are identified by at least two peptides
## [1] 1748   12
# boxplot: intensities of all eight channels after data preprocessing and normalization
par(mfrow=c(1,1), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
boxplot(dat[, 1:length(cha)],  ylim = c(-3, 3), main="Boxplot normalized Intensities")