dPCA

dPCA: differential principal component analysis of ChIP-seq

[Introduction]

 

We propose Differential Principal Component Analysis (dPCA) for analyzing multiple ChIP-seq datasets to identify differential protein-DNA interactions between two biological conditions. dPCA integrates unsupervised pattern discovery, dimension reduction, and statistical inference into a single statistical framework. It uses a small number of principal components to concisely summarize the major multi-protein differential patterns between the two conditions. For each pattern, it detects and prioritizes differential genomic loci by comparing the between-condition differences with the within-condition variation among replicate samples. dPCA provides a new tool for efficiently analyzing large amounts of ChIP-seq data to study dynamic changes of gene regulation across different biological conditions.

 

dPCA is part of the CisGenome project. Currently, it can be run as a command line program. We will consider incorporating it into the CisGenome GUI in the future.

 

[Supporting information for the dPCA paper]

 

Supporting Figures: FigureWeb1.pdf

Additional Simulations: dPCA_TechReport_2.pdf

 

 

[News]

 

Several new functions have been released, including:

(1) dpca_peakcalls: a program that uses the CisGenome peak calling function to remove input genomic loci that are not bound by any protein in any dataset.

(2) -r option of dpca: allows one to compute the R^B statistic for identifying differential loci with significant absolute binding.

(3) -z option of dpca: allows one to use dPCA-Z to filter out differences without significant binding activities before dPCA.

See readme for details.

 

[Download]

 

Software: Windows, Linux, Mac OS

 

Example Data: The data below are normalized read count data used in the dPCA paper. You can run dpca directly using these data without using dpca_importdata.

 

MYC analysis (Example I): Ebox_data; Ebox_peakprob (peak calls for R^B, dPCA-Z)

Promoter analysis (Example II): Prom_data; Prom_peakprob (for R^B, dPCA-Z)

ASB analysis (Example III): ASB_data; ASB_peakprob (for R^B, dPCA-Z)

 

Example commands for analyzing these data are:

 

(1) Promoter analysis (dPCA-P)

> dpca -i Prom_data.txt -d /home/ -o Prom_output -t 1

 

(2) Promoter analysis (dPCA-Z)

> dpca -i Prom_data.txt -d /home/ -o PromZ_output -t 1 -z 1 -r Prom_peakprob.txt

 

(3) Promoter analysis (dPCA-P and compute R^B)

> dpca -i Prom_data.txt -d /home/ -o PromP_output -t 1 -r Prom_peakprob.txt

 

(4) MYC analysis (dPCA-P)

> dpca -i Ebox_data.txt -d /home/ -o Ebox_output -t 1 -cm 0

OR

> dpca -i Ebox_data.txt -d /home/ -o Ebox_output -t 1 -cm 0 -r Ebox_peakprob.txt

 

(5) ASB analysis (dPCA-P, paired sample)

> dpca -i ASB_data.txt -d /home/ -o ASB_output -t 1 -sm 1 -cm 0

OR

> dpca -i ASB_data.txt -d /home/ -o ASB_output -t 1 -sm 1 -cm 0 -r ASB_peakprob.txt
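
The example commands above all share one shape: an input data file after -i, a value after -d, an output name after -o, the setting -t 1, and then any analysis-specific flags. As a minimal sketch (not part of the dPCA distribution), the shell function below assembles such a command line and prints it so that it can be inspected before it is run. The function name dpca_cmd and its argument order are hypothetical; the flags are copied exactly as they appear in the examples above and are not interpreted here.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with dPCA): print a dpca command line
# in the same form as the example commands.
# Usage: dpca_cmd DATA_FILE D_VALUE OUTPUT_NAME [extra flags...]
dpca_cmd() {
    data=$1
    dval=$2
    out=$3
    shift 3
    # -i, -d, -o and "-t 1" are copied from the examples; see the readme
    # for what each option means.
    if [ $# -gt 0 ]; then
        echo "dpca -i $data -d $dval -o $out -t 1 $*"
    else
        echo "dpca -i $data -d $dval -o $out -t 1"
    fi
}

# Print (do not run) three of the example commands.
dpca_cmd Prom_data.txt /home/ Prom_output
dpca_cmd Prom_data.txt /home/ PromZ_output -z 1 -r Prom_peakprob.txt
dpca_cmd ASB_data.txt /home/ ASB_output -sm 1 -cm 0
```

Printing the command first makes it easy to compare against the examples; the printed line can then be pasted into the terminal unchanged.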

 

[Installation]

 

For Windows:

An executable program is provided. To run dPCA, click the Start menu of your Windows system (typically in the bottom left corner of your screen). Choose ‘Accessories > Run’, type ‘cmd’, and press Enter. A command window will open. In this window, change to the folder that contains dPCA, for example by typing:

 

> cd D:\Users\dPCA\

 

Now type:

> dpca_importdata

> dpca

> dpca_peakcalls

 

Each command should print its usage information, which indicates that dPCA is ready to use.

 

For Linux, Mac OS:

(dPCA is bundled with CisGenome, so you can install it by following the CisGenome installation procedure. dPCA is written in C. Before installation, make sure a C compiler such as gcc or g++ is installed on your computer.)

 

1. Unzip using ‘gzip -d *.gz’ (here * is the name of the file you have downloaded)

2. Untar using ‘tar xvf *.tar’

3. Enter the cisgenome folder.

4. Compile by typing ‘./makefile’.

5. Now enter the subfolder named ‘bin’ by typing ‘cd bin’.

6. Type ‘ls’. You will find three files named ‘dpca_importdata’, ‘dpca’, and ‘dpca_peakcalls’.

7. Now type

 

> dpca_importdata

> dpca

> dpca_peakcalls

 

If you installed CisGenome correctly, you will see usage information after you type each of these three commands.

 

8. You can now start to use dPCA.
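
The check in step 7 can also be done from a script. The sketch below is a convenience and not part of CisGenome: it uses the standard shell builtin `command -v` to report which of the three programs cannot be found on the current PATH. The function name missing_tools is hypothetical, and the check assumes that the ‘bin’ folder has been added to PATH; if it has not, run the programs from inside ‘bin’ as ‘./dpca’ and so on.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CisGenome): print the names of
# tools that cannot be found on PATH. With no arguments it checks the
# three dPCA programs.
missing_tools() {
    if [ $# -eq 0 ]; then
        set -- dpca_importdata dpca dpca_peakcalls
    fi
    for tool in "$@"; do
        # command -v exits non-zero when the name cannot be resolved
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools)
if [ -z "$missing" ]; then
    echo "All dPCA programs were found."
else
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on PATH:" $missing
fi
```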

 

[Readme]

 

To learn how to use dPCA, please read the following readme file.

 

        Readme

 

Examples and sample parameter files:

 

(1) Basic dPCA

(Note: the test data for STEP1 are toy examples that illustrate the data formats and the dpca_importdata function. We keep them small to avoid overloading our web server. The test data for STEP2 are a different dataset, the one used in our paper. In real applications, you should use the data generated by STEP1 as input for STEP2.)

 

STEP1: run dpca_importdata

Download the test data and regions here and run the command

> dpca_importdata sample_importdata_arg.txt

 

(another more complicated sample file sample_importdata2_arg.txt)

 

STEP2: run dpca

Download the data below and run the following command:

> dpca -i Ebox_data.txt -d /home/ -o Ebox_output -t 1

 

 

(2) dPCA-P + R^B

(Note: the test data for STEP1 and STEP2 are toy examples that illustrate the data formats. The test data for STEP3 are a different dataset, the one used in our paper. In real applications, you should use the data generated by STEP2 as input for STEP3.)

 

STEP1: run dpca_importdata

Download the test data and regions here and run the command

> dpca_importdata sample_importdata_arg.txt

 

STEP2: run peak calling

> dpca_peakcalls -i sample_peakcall_sampledescription.txt -p sample_peakcall_experimentdesign.txt -d /user/cisgenome/bin

 

Here, -d specifies the folder that contains the cisgenome and dpca executable files.

 

(Here are two more complicated sample parameter files: sample_peakcall_sampledescription2.txt and sample_peakcall_experimentdesign2.txt)

 

STEP3: run dpca

Download the data below and run the following command:

> dpca -i Ebox_data.txt -d /home/ -o Ebox_output -t 1 -cm 0 -r Ebox_peakprob.txt
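
The three steps above can be chained in one script. The following is a minimal sketch under stated assumptions, not a tested pipeline: the run helper, the dpca_p_rb_workflow function, and the DRY_RUN variable are hypothetical names introduced here, while the three commands are copied from the steps above. By default the script only prints the commands; setting DRY_RUN=0 would execute them and stop at the first failure. As the note above says, the sample files for STEP1 and STEP2 are toy examples and STEP3 uses a different dataset, so this only illustrates the order of the steps.

```shell
#!/bin/sh
# Hypothetical wrapper: print each command, and execute it only when
# DRY_RUN is set to 0. Execution stops at the first failing step.
run() {
    echo "> $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@" || exit 1
    fi
}

dpca_p_rb_workflow() {
    # STEP1: import data
    run dpca_importdata sample_importdata_arg.txt
    # STEP2: peak calling; -d is the folder that contains the cisgenome
    # and dpca executable files
    run dpca_peakcalls -i sample_peakcall_sampledescription.txt \
        -p sample_peakcall_experimentdesign.txt -d /user/cisgenome/bin
    # STEP3: dPCA-P with R^B
    run dpca -i Ebox_data.txt -d /home/ -o Ebox_output -t 1 -cm 0 -r Ebox_peakprob.txt
}

dpca_p_rb_workflow
```

In a real analysis, the file names in STEP3 would be replaced by the outputs of STEP1 and STEP2.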

 

 

(3) dPCA-Z

(Note: the test data for STEP1 and STEP2 are toy examples that illustrate the data formats. The test data for STEP3 are a different dataset, the one used in our paper. In real applications, you should use the data generated by STEP2 as input for STEP3.)

 

STEP1: run dpca_importdata

Download the test data and regions here and run the command

> dpca_importdata sample_importdata_arg.txt

 

STEP2: run peak calling

> dpca_peakcalls -i sample_peakcall_sampledescription.txt -p sample_peakcall_experimentdesign.txt -d /user/cisgenome/bin

 

Here, -d specifies the folder that contains the cisgenome and dpca executable files.

 

(Here are two more complicated sample parameter files: sample_peakcall_sampledescription2.txt and sample_peakcall_experimentdesign2.txt)

 

STEP3: run dpca

Download the data below and run the following command:

> dpca -i Ebox_data.txt -d /home/ -o Ebox_output -t 1 -cm 0 -z 1 -r Ebox_peakprob.txt

 

 

[Contact]

     

        Hongkai Ji [hji@jhsph.edu]