This guide shows how to use R for analyzing cardiovascular proteomics data derived from mass spectrometry plattforms TMT or iTRAQ. This analysis pipeline contains code for data preprocessing, data normalization, and performing a two sample comparison using ordinary and moderated t-test statistics.
We provide a straightforward analysis pipeline for 10-plex TMT experiments. Hereby, we start with R-code for one experiment followed by R-code for statistical inference when multiple experiments are present.
Given the data of one 10-plex TMT experiment, we need to run the following three main functions to obtain a list of proteins where changes in protein abundance between, e.g., treated and untreated cells are calculated. Details are provided in the following paragraphs:
read.peptides
quantify.proteins
eb.fit
.
The reader is expected to have basic R knowledge. For this analysis we used the (current) Bioconductor release (version 3.1) and R version 3.2.1.
Please install R from CRAN and the basic Bioconductor packages by typing the following code in your R command window:
source("http://bioconductor.org/biocLite.R")
biocLite()
In a next step, install additional Bioconductor packages that are necessary for the proteomic data analysis. Please make sure to install and update all dependencies when you are asked.
biocLite("limma")
biocLite("qvalue")
You can also check the Bioconductor installation page.
Load the following packages by typing in your R command window
library(limma)
library(qvalue)
In a first step, we read raw data from a csv-file and store the data set in a data frame that is named dat
. The commands dim
and str
give an overview of the dimensions and the structure of the data. This exemplary data set contains simulated data for one 10-plex TMT experiment.
# read simulated TMT data set
dat <- read.csv("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/TMT_experiment.csv")
dim(dat)
## [1] 22624 16
str(dat)
## 'data.frame': 22624 obs. of 16 variables:
## $ Protein.Group.Accessions: int 4506975 4506975 18079216 12707580 16579885 16579885 16579885 16579885 16579885 117938274 ...
## $ Protein.Descriptions : Factor w/ 3895 levels "1-acyl-sn-glycerol-3-phosphate acyltransferase delta [Homo sapiens]",..: 3195 3195 633 2259 155 155 155 155 155 2034 ...
## $ Sequence : Factor w/ 10980 levels "aAAAAAAAAAAAAAAGAGAGAk",..: 1 1 2 3 4 5 5 5 5 6 ...
## $ Quan.Usage : Factor w/ 1 level "Used": 1 1 1 1 1 1 1 1 1 1 ...
## $ Quan.Info : Factor w/ 1 level "Unique": 1 1 1 1 1 1 1 1 1 1 ...
## $ Isolation.Interference : int 25 38 75 31 22 53 8 10 11 69 ...
## $ X126 : num 6712160 1873633 593284 96868 4322936 ...
## $ X127_N : num 221073 392450 4248749 531059 312364 ...
## $ X127_C : num 400791 612276 1302342 731533 592202 ...
## $ X128_N : num 580248 834695 5276243 226954 777629 ...
## $ X128_C : num 2089265 1548747 432149 241949 886331 ...
## $ X129_N : num 557290 355312 1917577 1688991 6639442 ...
## $ X129_C : num 391106 20691 69547 1192171 1371124 ...
## $ X130_N : num 1314488 433218 826219 653767 684538 ...
## $ X130_C : num 1016075 2329839 3890083 208666 1504843 ...
## $ X131 : num 227170 442554 366487 803876 1755958 ...
To perform the analysis correctly the following columns are required to be included in the data set and should be named in the same way as in this exemplary data set. Each row corresponds to a peptide spectra
Protein.Group.Accessions |
protein group accession or a protein id for the spectra |
Sequence |
name of peptide sequence |
Quan.Usage |
indicator whether the spectra is used or not used |
Quan.Info |
indicator whether the spectra is unique or redundant |
Isolation.Interference |
% of isolation interference |
The column Protein.Descriptions
is optional and maybe helpful for deeper investigation after detecting significantly differentially expressed proteins.
For each row, the columns X126, X127_N, ..., X131
contain measurements for the 10 channels.
In the next step we define a vector of headlines of all channels as they are stored in the data set dat
.
cha <- c("X126", "X127_N", "X127_C", "X128_N", "X128_C", "X129_N", "X129_C", "X130_N", "X130_C", "X131")
# data preprocessing, load all functions
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/read.peptides.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/quantify.proteins.r")
source("http://www.biostat.jhsph.edu/~kkammers/software/CVproteomics/eb.fit.r")
R code for all function is provided in the appendix.
The function read.peptides
does data preprocessing and excludes spectra (rows in the data set) that contain missing values, a higher isolation interference than 30%, and that are redundant
or not used
. The exclusion of spectra with a higher isolation interference than 30% is a suggestion.
THe funcion quantify.proteins
quantifies all proteins that are represented by peptide spectra. Channel values for each protein are median polished log2
transformed values from the median value of all peptides belonging to each protein.
The function eb.fit
(eb: empirical Bayes) performes the statistical analysis of interest: two sample t-tests. The output of this procedure is a data frame that in particular contains for each protein its log2 (fold-change) and its ordinary as well as moderated p-values and q-values. The results are sorted by moderated p-values in increasing order.
dat <- read.peptides(dat, cha)
dim(dat) # 19047 peptides
## [1] 19047 16
# identify proteins from peptide spectra
# (takes aorund one minute with an average laptop)
dat <- quantify.proteins(dat, cha)
dim(dat) # 3569 proteins identified
## [1] 3569 12
# find "one-hit wonders"
dat.onehit <- subset(dat, dat$n.peptides == 1)
dim(dat.onehit) # 1821 proteins are identified by only one peptide
## [1] 1821 12
# eliminate "one-hit wonders"
dat <- subset(dat, dat$n.peptides != 1)
dim(dat) # 1748 proteins are identified by at least two peptides
## [1] 1748 12
# boxplot: intensities of all eight channels after data preprocessing and normalization
par(mfrow=c(1,1), font.lab=2, cex.lab=1.2, font.axis=2, cex.axis=1.2)
boxplot(dat[, 1:length(cha)], ylim = c(-3, 3), main="Boxplot normalized Intensities")