README: ================================= O. Index ================================= I. Overview II. Different modes for using dPCA 2.1. dPCA (basic, read counts extracted using dPCA) 2.2. dPCA (basic, read counts provided by users) 2.3. dPCA-P + R^B 2.4. dPCA-Z III. dpca_importdata: usage 3.1. prepare read alignment files 3.2. prepare a genomic region file 3.3. prepare a parameter file 3.4. run dpca_importdata 3.5. understand the output IV. dpca_peakcalls: usage 4.1. prepare read alignment files 4.2. prepare a data file for determining binding activities 4.3. prepare a data file for dPCA 4.4. prepare a parameter file describing ChIP-seq samples 4.5. prepare a parameter file describing ChIP-seq experimental design and peak calling parameters 4.6. run dpca_peakcalls 4.7. understand the output V. dpca: usage 5.1. prepare data 5.2. run dpca 5.3. understand the output ================================= I. Overview ================================= dPCA has three main programs. 1. dpca_importdata: this program converts read alignment data into a format that can be recognized by dpca. 2. dpca: this program runs the dPCA analysis. 3. dpca_peakcalls: this program is used to support dPCA-P, dPCA-Z and computation of the R^B statistic. It uses CisGenome to call peaks and then eliminates input genomic loci that are not bound by any protein in any dataset. For loci bound in at least one dataset, the program reports how likely each locus is bound in each dataset. This is measured by (1-false discovery rate) (i.e., 1-FDR) reported by the peak caller for the peak associated with the locus and dataset. The output files of dpca_peakcalls can then be used along with the -z and/or -r option in dpca to perform the dPCA-Z or dPCA-P analysis, or compute the R^B statistic. ================================== II. Different modes for using dPCA ================================== dPCA can be run in multiple modes which are summarized below. ################################## ### 2.1. dPCA (basic, read counts extracted using dPCA): ################################## This mode involves running two programs: dpca_importdata + dpca In this mode, dPCA takes aligned sequence reads and a list of genomic regions as input. You first run dpca_importdata to obtain normalized read counts for each genomic region. The counts will be exported to a tab-delimited text file which essentially gives you a matrix of read counts. In this file, each line corresponds to a genomic region, and each column corresponds to a sample. Using this file, you can then run dpca to perform dPCA analysis. To use dPCA in this mode, you only need to read instructions for dpca_importdata and dpca. Example: > dpca_importdata sample_importdata_arg.txt > dpca -i sample_data.txt -d /home/ -o sample_output -t 1 ################################## ### 2.2. dPCA (basic, read counts provided by users) ################################## This mode involves running one program: dpca If you already have normalized read counts for a list of genomic regions and a number of samples, and have saved those read counts into a tab-delimited text file, you may run dpca directly without using dpca_importdata. This is useful for users who want to use dPCA to analyze other types of data such as RNA-seq, DNA-methylation data, etc. For those users, once you have organized your data (e.g., gene expression levels of all genes in a number of RNA-seq samples) into a matrix, you can use dpca directly. Example: > dpca -i sample_data.txt -d /home/ -o sample_output -t 1 ################################## ### 2.3. dPCA-P + R^B ################################## This mode involves running three programs: dpca_importdata + dpca_peakcalls + dpca This mode allows you to incorporate absolute binding information into the analysis. Specifically, it allows you to filter out genomic loci not bound in any dataset before dPCA. It also computes the R^B statistic after dPCA so that you can more stringently identify differential loci by requiring them to have significant binding activities. Example: > dpca_importdata sample_importdata_arg.txt > dpca_peakcalls -i sample_peakcall_sampledescription.txt -p sample_peakcall_experimentdesign.txt -d /user/cisgenome/bin > dpca -i filtered_data.txt -d /home/ -o sample_output -t 1 -r filtered_peakprob.txt ################################## ### 2.4. dPCA-Z ################################## This mode involves running three programs: dpca_importdata + dpca_peakcalls + dpca This mode provides another way to incorporate the absolute binding information. It allows you to set differences not covered by binding peaks to zero before running dPCA. Example: > dpca_importdata sample_importdata_arg.txt > dpca_peakcalls -i sample_peakcall_sampledescription.txt -p sample_peakcall_experimentdesign.txt -d /user/cisgenome/bin > dpca -i filtered_data.txt -d /home/ -o sample_output -t 1 -z 1 -r filtered_peakprob.txt ================================== III. dpca_importdata: usage ================================== dPCA takes aligned sequence reads and a list of genomic regions as input. dpca_importdata converts these data to normalized read counts. Before using dpca_importdata, you will need to prepare three types of files: (1) read alignment files; (2) a file with genomic regions to be analyzed; (3) a parameter file. ################################## ### 3.1: read alignment files ### ################################## For each ChIP-seq sample, you need to prepare a file that contains read mapping information. The file is in the ALN format. Each line in the file corresponds to a read, and there are three columns: column1 = chromosome where the read is aligned; column2 = coordinate where the read is aligned; column3 = 'F' or '+': if the read is aligned to the forward strand of the genome assembly; 'R' or '-': if the read is aligned to the reverse complement strand of the genome. The three columns are tab-delimited. In other words, each line of the file should have the following format: chromosome[tab]position[tab]strand An example is shown below: chr1 359077 + chr1 376890 - chr2 589006 + If you have N samples, you will need to have N such files. ################################## ### 3.2: genomic region file ### ################################## You also need a file that contains the list of genomic regions to be analyzed. The file is in the COD format. Each line in the file specifies a genomic region. The file has 5 columns. column1 = a unique name or identifier of the region. column2 = chromosome column3 = start coordinate of the region column4 = end coordinate of the region column5 = strand. You can put '+' for all regions if you only use dpca. If you will also use other functions of CisGenome, such as extracting DNA sequences, then this column can be '+' or '-'. Columns are tab-delimited. In other words, each line of the file should have the following format: [region_name][tab][chromosome][tab][start][tab][end][tab][strand] An example is shown below: peak1 chr1 0 200 + peak2 chr3 5000 5200 - The coordinates are 0-indexed, i.e., the first position in a chromosome is indexed by 0, the second position = 1, and so on. ################################## ### 3.3: parameter file ### ################################## Finally, you also need a parameter file in the following format: [WorkingFolder]=H:\Projects\JiHK_Lab\dPCA\test [ProjectTitle]=Myctest [GenomicLoci]=H:\Projects\JiHK_Lab\PCAT\Promoter\hg18_prom_up2kdown2k_K562-Helas3data.txt [GroupNum]=2 [DatasetNum]=2 [SampleNum]=8 [Paired]=0 [BinSize]=50 [Samples] K562Myc_Rep1 1 1 150 G:\Projects\SeqClust_project\Myc\K562_Cmyc_Rep1.aln K562Myc_Rep2 1 1 150 G:\Projects\SeqClust_project\Myc\K562_Cmyc_Rep2.aln K562Input_Rep1 1 2 150 G:\Projects\SeqClust_project\Myc\K562_InputV2_Rep1.aln K562Input_Rep2 1 2 150 G:\Projects\SeqClust_project\Myc\K562_InputV2_Rep2.aln HelaMyc_Rep1 2 1 150 G:\Projects\SeqClust_project\Myc\UT_Helas3_CmycV2_Rep1.aln HelaMyc_Rep2 2 1 150 G:\Projects\SeqClust_project\Myc\UT_Helas3_CmycV2_Rep2.aln HelaInput_Rep1 2 2 150 G:\Projects\SeqClust_project\Myc\UT_Helas3_Input.aln HelaInput_Rep2 2 2 150 G:\Projects\SeqClust_project\Myc\UT_Hepg2_Input.aln The meaning of each field is explained below: [WorkingFolder]: the folder where you want to export final results and temporary files generated by dPCA. [ProjectTitle]: title of the project. The dPCA results will be saved to files whose names start with this title. [GenomicLoci]: the path of the genomic region file, i.e., the file that contains regions to be analyzed. [GroupNum]: how many biological conditions do you have. For dPCA, it is always 2. [DatasetNum]: how many datasets do you have. [SampleNum]: how many ChIP-seq samples do you have. Note: each dataset may have multiple replicate samples. [Paired]: are the samples paired (e.g. in ASB analysis) or not (e.g., differential chromatin analysis at motif sites). 1: paired. 0: not paired. [BinSize]: For computational efficiency considerations, genomes were divided into non-overlapping bins. Read count for each bin is collected and saved to hard drive. This makes the data processing memory efficient. This parameter specifies the bin size. Typically, we use 50 (bp). If you use 25, you will double the hard disk requirement. If you use 100, you will reduce the hard disk requirement by half, etc. [Samples]: Below this line, list all samples you have. Each line is a sample, and should have five columns. The five columns are tab-delimited. Their meanings are Sample_Name[tab]GroupID[tab]DatasetID[tab]Extension_Length[tab]File_Path_to_the_Alignment_File. GroupID: should be 1 or 2. This specifies the two biological conditions to be compared. DatasetID: If you have M datasets, then this ID takes values from 1 to M. It specifies which dataset this sample belongs to. Extension_Length: specifies how many base pairs you want to extend each read toward its 3' end. File_Path_to_the_Alignment_File: the file path to the read alignment file for this sample. ################################## ### 3.4: run dpca_importdata ### ################################## After you have prepared all three types of files described above, you can run dpca_importdata. Suppose your parameter file is named as "sample_importdata_arg.txt", then you can run importdata by typing: > dpca_importdata sample_importdata_arg.txt ################################## ### 3.5: understand the output ### ################################## dpca_importdata will count reads for each input genomic region, normalize the reads across samples, and then save the normalized read counts to an output file. The file is tab-delimited and has the following format: #groupnum=2 #datasetnum=26 #samplenum=102 #locinum=24376 #paired=0 #groupid chr start end strand 1 1 1 1 #datasetid chr start end strand 1 1 2 2 #locus chr start end strand K562_Ctcf_BR1 K562_Ctcf_BR2 K562_H3K4me1_BR1 K562_H3K4me1_BR2 1 chr1 12753 16753 + 60.12294 44.535088 65.924377 45.153957 2 chr1 5172 9172 + 42.800598 56.413612 20.468401 15.635988 The meaning of each field is explained below. #groupnum: the number of biological conditions. Always 2 for dPCA. #datasetnum: the number of datasets #samplenum: the number of ChIP-seq samples #locinum: the number of genomic loci #paired: whether the analysis should treat samples as paired (1) or not (0). #groupid: specifies which condition does a sample belong to. It takes value 1 or 2. #datasetid: specifies which dataset does a sample belong to. It takes values from 1 to M if there are M datasets. #locus: this line specifies the sample name for each sample. The remaining lines contain the normalized read counts. Each line is a genomic region. In each line, the first five columns are: region_name[tab]chromosome[tab]region_start[tab]region_end[tab]strand The remaining columns are read counts for each sample. ================================== IV. dpca_peakcalls: usage ================================== dpca_peakcalls allows you to incorporate the absolute binding information into the analysis. This function will first use CisGenome to call peaks in each dataset. It will then use the peak calls to filter out the user-provided genomic loci that are not bound by any protein in any dataset. For the remaining loci, it will report how likely each locus is bound in each dataset (measured by 1-FDR of the associated peak). The output of this program can then be used as the input for dpca to perform dPCA-P or dPCA-Z analysis, or to compute the R^B statistic. In order to use this function, you need to prepare the following files. (1) Read alignment files. These files will be used for peak calling. (2) A data file with genomic regions and the normalized read count data. This file will be used to assist the program to determine the absolute binding levels of each region. The file is usually generated by dpca_importdata. Please see section "3.5: understand the output" for dpca_importdata to learn about the file format. (3) A data file with genomic regions and the normalized read count data. This file contains data to be analyzed by dPCA. The file has the same format as (2). It can be the same file as (2), but it can also be a different file. In either case, the files in (2) and (3) should contain exactly the same genomic regions, however they can have different read count data. If (2) and (3) are different, then the data in (2) will be used to assist the program to determine the binding level of each locus in each dataset, and the data in (3) will be filtered using the peak status obtained from (2). After filtering, the data in (3) will be used for subsequent dPCA analysis. This design, which separates (2) and (3), is useful for certain applications such as the ASB analysis. To analyze ASB, all reads are used to determine binding status and only a subset of reads mapped to heterozygote SNPs are used for analyzing differences by dPCA. In that scenario, the read count data obtained from all reads will serve as (2), and the read count data from heterozygote SNPs will serve as (3). (4) A parameter file that describes ChIP-seq samples. (5) A parameter file that describes ChIP-seq experimental designs and peak calling parameters. ################################## ### 4.1: read alignment files ### ################################## For each ChIP-seq sample, you need to prepare a file that contains read mapping information. These are the same files used for dpca_importdata (see section 3.1). The files are in the ALN format. In each file, each line corresponds to a read, and there are three columns: column1 = chromosome where the read is aligned; column2 = coordinate where the read is aligned; column3 = 'F' or '+': if the read is aligned to the forward strand of the genome assembly; 'R' or '-': if the read is aligned to the reverse complement strand of the genome. The three columns are tab-delimited. In other words, each line of the file should have the following format: chromosome[tab]position[tab]strand An example is shown below: chr1 359077 + chr1 376890 - chr2 589006 + If you have N samples, you will need to have N such files. ###################################### ### 4.2: data file for determining ### ### binding activities ### ###################################### This file is usually generated by dpca_importdata. It contains genomic regions to be analyzed and the normalized read count data for each region. The read count data in this file will be used to assist the program to determine the binding level of each region in each dataset. The file is tab-delimited and has the following format: #groupnum=2 #datasetnum=26 #samplenum=102 #locinum=24376 #paired=0 #groupid chr start end strand 1 1 1 1 #datasetid chr start end strand 1 1 2 2 #locus chr start end strand K562_Ctcf_BR1 K562_Ctcf_BR2 K562_H3K4me1_BR1 K562_H3K4me1_BR2 1 chr1 12753 16753 + 60.12294 44.535088 65.924377 45.153957 2 chr1 5172 9172 + 42.800598 56.413612 20.468401 15.635988 The meaning of each field is explained below. #groupnum: the number of biological conditions. Usually, groupnum=2. For ASB analysis, groupnum=1 since two alleles are combined for peak calling. #datasetnum: the number of datasets #samplenum: the number of ChIP-seq samples #locinum: the number of genomic loci #paired: whether the analysis should treat samples as paired (1) or not (0). (Not used by dpca_peakcalls). #groupid: specifies which condition a sample belongs to. It takes value 1 or 2. #datasetid: specifies which dataset a sample belongs to. It takes values from 1 to M if there are M datasets. #locus: this line specifies the sample name for each sample. The remaining lines contain the normalized read counts. Each line is a genomic region. In each line, the first five columns are: region_name[tab]chromosome[tab]region_start[tab]region_end[tab]strand The remaining columns are normalized read counts for each ChIP-seq sample. ################################# ### 4.3: data file for dPCA ### ################################# This file has the same format as the file in 4.2. It contains genomic regions to be analyzed and the normalized read count data for each region. The read count data in this file will be used for dPCA. ################################## ### 4.4: parameter file to ### ### describe samples ### ################################## This file describes ChIP-seq samples used for peak calling and determining binding activities. It is similar to the parameter file used in dpca_importdata, but the file here has two additional lines: [Transform]=xxx [DataToBeFiltered]=xxx Below is an example of the file: [WorkingFolder]=H:\Projects\JiHK_Lab\dPCA\test [ProjectTitle]=Myctest [GenomicLoci]=H:\Projects\JiHK_Lab\PCAT\Promoter\Ebox_K562-Huvecdata.txt [Transform](0:identity;1:log2)=1 [DataToBeFiltered]=H:\Projects\JiHK_Lab\PCAT\Promoter\Ebox_K562-Huvecdata.txt [GroupNum]=2 [DatasetNum]=2 [SampleNum]=8 [Paired]=0 [BinSize]=50 [Samples] K562Myc_Rep1 1 1 150 G:\Projects\SeqClust_project\Myc\K562_Cmyc_Rep1.aln K562Myc_Rep2 1 1 150 G:\Projects\SeqClust_project\Myc\K562_Cmyc_Rep2.aln K562Input_Rep1 1 2 150 G:\Projects\SeqClust_project\Myc\K562_InputV2_Rep1.aln K562Input_Rep2 1 2 150 G:\Projects\SeqClust_project\Myc\K562_InputV2_Rep2.aln HelaMyc_Rep1 2 1 150 G:\Projects\SeqClust_project\Myc\UT_Helas3_CmycV2_Rep1.aln HelaMyc_Rep2 2 1 150 G:\Projects\SeqClust_project\Myc\UT_Helas3_CmycV2_Rep2.aln HelaInput_Rep1 2 2 150 G:\Projects\SeqClust_project\Myc\UT_Helas3_Input.aln HelaInput_Rep2 2 2 150 G:\Projects\SeqClust_project\Myc\UT_Hepg2_Input.aln The meaning of each field is explained below: [WorkingFolder]: the folder where you want to export final results and temporary files generated by the program. [ProjectTitle]: title of the project. The filtered data will be saved to files whose names start with this title. [GenomicLoci]: the full path of the file described in section 4.2, i.e., the data file that will be used to assist the program to determine binding activities. [Transform]: 0 (default) = identity; 1 = log2 transform. This parameter specifies whether you want to log2 tranform the read count data in the [GenomicLoci] file before computing the binding activities. [DataToBeFiltered]: the full path of the file described in section 4.3, i.e., the file that contains normalized read count data to be analyzed by dPCA. [GroupNum]: the number of biological conditions. Normally, it is 2. However, for ASB analysis, it is 1 since two alleles are combined for peak calling. [DatasetNum]: the number of datasets you have. [SampleNum]: the number of ChIP-seq samples you have. Note: each dataset may have multiple replicate samples. [Paired]: whether the samples are paired (e.g. in ASB analysis) or not (e.g., differential chromatin analysis at motif sites). 1: paired. 0: not paired. (Not used by dpca_peakcalls). [BinSize]: For computational efficiency considerations, genomes were divided into non-overlapping bins. Read count for each bin is collected and saved to hard drive. This makes the data processing memory efficient. This parameter specifies the bin size. Typically, we use 50 (bp). If you use 25, you will double the hard disk requirement. If you use 100, you will reduce the hard disk requirement by half, etc. [Samples]: Below this line, list all samples you have. Each line is a sample, and should have five columns. The five columns are tab-delimited. Their meanings are Sample_Name[tab]GroupID[tab]DatasetID[tab]Extension_Length[tab]File_Path_to_the_Alignment_File. GroupID: should be 1 or 2. This specifies the two biological conditions to be compared in dPCA. DatasetID: If you have M datasets, then this ID takes values from 1 to M. It specifies which dataset this sample belongs to. Extension_Length: how many base pairs do you want to extend each read toward its 3' end. File_Path_to_the_Alignment_File: the file path to the read alignment file for this sample. ################################## ### 4.5: parameter file to ### ### describe experimental ### ### designs and parameters ### ### for peak calling ### ################################## This file describes the experimental design of each ChIP-seq study (i.e., which samples are IP samples and which samples are controls in each ChIP-seq experiment). It also specifies the peak calling parameters. The file has the following format: [FDRCut]=0.1 [WinSize]=1 [Tcut]=2.0 [IPOnlyType](0:CombineCount;1:SepRep-Faster)=1 [IPOnlyUniqMax]=5 [ExportType](0:Binary;1:FDR;2:T;3:FC)=1 [SkipPeakCall](0:no,1:yes)=0 [OverlapRatio]=0.001 [Peaklist] K562Ctcf_BR 1 1 11 K562H3K4me1_BR 1 2 11 K562Control_BR 1 3 -1 K562DNase_DK 1 4 0 HuvecCtcf_BR 2 1 11 HuvecH3K4me1_BR 2 2 11 HuvecControl_BR 2 3 -1 HuvecDNase_DK 2 4 0 The meaning of each field is explained below. [FDRCut]: False discovery rate cutoff for peak calling. [WinSize]: Window size for peak calling. The genome is divided into 50bp bins. If the window size here is W, then a 50*(2W+1)bp window will be used to scan the genome to find peaks using the CisGenome peak caller. [Tcut]: The enrichment statistics computed by CisGenome will be standardized to have zero mean and unit standard deviation (SD). Tcut specifies how many SDs above the mean will be used as the cutoff for determining the peak boundaries. [IPOnlyType](0:CombineCount;1:SepRep-Faster): How to handle dataset that has only IP samples. 1: uses a faster algorithm to approximate the negative binomial distribution. 0: uses a slower algorithm to determine significance using the negative binomial distribution. [IPOnlyUniqMax]: Maximal number of reads allowed at a single genomic position. Sometimes, many reads can be aligned to the same position due to PCR artifacts. This parameter will limit the number of reads at each position. If a position has more than [IPOnlyUniqMax] reads, only [IPOnlyUniqMax] will be kept. In this way, the analysis will not be dominated by potential PCR artifacts. [ExportType](0:Binary;1:FDR;2:T;3:FC): Always use 1, which exports 1-FDR. 0, 2 or 3: only used by developers for other purposes, and users should not use them. [SkipPeakCall](0:no,1:yes): Usually 0, which means the program will call CisGenome to find peaks. If you have run dpca_peakcalls before and have the peak calls, you may use 1 to skip the peak calling. 1 is useful if you just want to adjust the parameters for determining whether your input region overlaps with a peak or not (e.g., adjust the overlap ratio parameters) and do not want to run peak calling again. [OverlapRatio]: The minimal overlap ratio between an input genomic locus and a peak in order to call the input genomic locus bound. Suppose the overlap between the locus and the peak is X bp, and the lengths of the locus and the peak are Y bp and Z bp respectively, then the overlap ratio between them is defined as X/max(Y,Z). If this ratio is bigger than [OverlapRatio], then the locus is labeled as bound. [Peaklist]: Each line following [Peaklist] specifies the experimental design of a ChIP-seq experiment. Each line has four columns: column1: name of the experiment column2: group id. This is the same as the GroupID in described in section 4.4, which specifies the two biological conditions to be compared. It should be 1 or 2. column3: dataset id. This is the same as the DatasetID described in section 4.4, which specifies the dataset that should be used as IP samples. column4: control id. This specifies the dataset that should be used as control samples for the experiment. If the id>0, the experiment will be a two-sample experiment that has both IP and control samples. CisGenome will use its two-sample peak caller to call peaks for this experiment. If the id=0, it means that no control sample is available for this experiment, and therefore the experiment will be analyzed by CisGenome using its one-sample peak caller. If the id<0, it means that the dataset in column 3 itself is a control dataset, and therefore the program will not call peaks for this experiment. In such a case, all peak calls will be set to 0 for all genomic loci. ################################## ### 4.6: run dpca_peakcalls ### ################################## After you have prepared all files described above, you can now run dpca_peakcalls. Suppose your parameter file for describing samples is "sample_peakcall_sampledescription.txt", and the parameter file for describing experimental designs and peak calling parameters is "sample_peakcall_experimentdesign.txt", then you can run dpca_peakcalls by typing: > dpca_peakcalls -i sample_peakcall_sampledescription.txt -p sample_peakcall_experimentdesign.txt -d /user/cisgenome/bin Here -d specifies the folder than contains the executable programs of the CisGenome/dPCA package. ################################## ### 4.7: understand the output ### ################################## dpca_peakcalls will create two files as its output. ################################## (1) [ProjectTitle]_filtered_data.txt ################################## This file contains loci bound in at least one dataset and the normalized read count data for these loci. The read count data are copied from the file described in section 4.3, which is also the file specified in [DataToBeFiltered] part of the sample description parameter file (see section 4.4). If you want to perform dPCA-P or dPCA-Z analysis, you should use this file as the input to run dpca. The file is tab-delimited and has the following format: #groupnum=2 #datasetnum=18 #samplenum=70 #locinum=22368 #paired=0 #groupid chr start end strand 1 1 1 1 #datasetid chr start end strand 1 1 2 2 #locus chr start end strand K562_Ctcf_BR1 K562_Ctcf_BR2 K562_H3K4me1_BR1 K562_H3K4me1_BR2 1 chr1 12753 16753 + 60.12294 44.535088 65.924377 45.153957 2 chr1 5172 9172 + 42.800598 56.413612 20.468401 15.635988 The meaning of each field is explained below. #groupnum: the number of biological conditions. #datasetnum: the number of datasets #samplenum: the number of ChIP-seq samples #locinum: the number of genomic loci #paired: whether the analysis should treat samples as paired (1) or not (0). #groupid: specifies which condition a sample belongs to. It takes value 1 or 2. #datasetid: specifies which dataset a sample belongs to. It takes values from 1 to M if there are M datasets. #locus: this line specifies the sample name for each sample. The remaining lines contain the normalized read counts. Each line is a genomic region. In each line, the first five columns are: region_name[tab]chromosome[tab]region_start[tab]region_end[tab]strand The remaining columns are read counts for each sample. ################################## (2) [ProjectTitle]_filtered_peakprob.txt ################################## This file contains loci bound in at least one dataset and information on how likely each locus is bound in each dataset (measured by 1-FDR). The FDR is determined by CisGenome in the peak calling process. If you want to perform dPCA-P or dPCA-Z or R^B analysis, you should use this file for the -r option when you run dpca. The file is tab-delimited and has the following format: #peaklistnum=18 #locinum=22368 #groupid chr start end strand 1 1 1 1 #datasetid chr start end strand 1 2 3 4 #controlid chr start end strand 11 11 11 11 #locus chr start end strand 1 2 3 4 0 chr1 12753 16753 + 0.999832 0.999914 0.999386 0.967655 1 chr1 5172 9172 + 0.999832 0.999721 0.968859 0.934139 The meaning of each field is explained below. #peaklistnum: the number of datasets #locinum: the number of genomic loci #groupid: specifies which condition a sample belongs to. Here it always takes value 1. This is because peak calling results from two biological conditions are combined. In other words, if the locus is bound in at least one condition in a dataset, then the locus is labeled as bound in the dataset. #datasetid: specifies which dataset a sample belongs to. It takes values from 1 to M if there are M datasets. #controlid: specifies which dataset is used as the control samples for the IP samples specified in the line above (i.e., the line starting with #datasetid). #locus: this line specifies the sample name for each sample. The remaining lines contain information on how likely each locus is bound in each dataset (measured by 1-FDR). Each line is a genomic region. In each line, the first five columns are: region_name[tab]chromosome[tab]region_start[tab]region_end[tab]strand The remaining columns are 1-FDR for each locus in each dataset. A large value means that the locus is more likely to be bound in that dataset. ================================== V. dpca: usage ================================== After you obtain the normalized read counts for each region, the program dpca runs the dPCA analysis. ################################## ### 5.1: prepare data ### ################################## dPCA takes the normalized read counts as input. If you run basic dPCA, the input data usually is the output file obtained from running dpca_importdata. If you use dPCA-P or dPCA-Z, the input data usually are the output files obtained from running dpca_peakcalls. If you have other types of data (e.g. RNA-seq), you may prepare your own normalized read count file. The file should have exactly the same format as the output file of dpca_importdata, which is described in section 3.5. In this case, you do not need to run dpca_importdata or dpca_peakcalls. ################################## ### 5.2: run dpca ### ################################## Now you can run dpca by typing the following command: (dPCA basic) > dpca -i C:\dPCA\input.txt -d D:\Work\ -o output -t 1 (dPCA-P) > dpca -i C:\dPCA\filtered_input_data.txt -d D:\Work\ -o output -t 1 -r C:\filtered_input_peakprob.txt (dPCA-Z) > dpca -i C:\dPCA\filtered_input_data.txt -d D:\Work\ -o output -t 1 -z 1 -r C:\dPCA\filtered_input_peakprob.txt The parameters are: -i The data file to be analyzed by dPCA. This is the normalized read count file generated by dpca_importdata, or dpca_peakcalls, or a read count file prepared by yourself. -d The path of the output folder. dPCA results will be saved to this folder. -o The title for the output file. A number of files starting with this title will be generated by dpca to contain the final results. -t transform: 0 (default) = identity; 1 = log2 transform. This parameter specifies whether you want to log2 tranform your read count data before dPCA analysis. Usually, when you use dpca_importdata to prepare your data, we recommend you to use log2 transform. -cm center columns of D to have zero mean: 0 = no; 1 (default) = yes. This parameter specifies whether you want to center each column of D to have zero mean. -cs standardize each column of D to have SD=1: 0 (default) = no; 1 = yes. This parameter specifies whether you want to standardize each column to have unitary standard deviation. -sm center x_gimk of each sample to have zero mean before computing D: 0 (default) = no; 1 = yes. This is usually useful for allele-specific analysis for removing reference mapping bias. -ss standardize each sample to have zero mean and SD=1 before computing D: 0 (default) = no; 1 = yes. This parameter specifies whether you want to standardize each sample so that x_gimk have mean = 0, SD = 1 before computing D. -snr signal-to-noise ratio cutoff: default = 5. This parameter specifies the signal-to-noise ratio (SNR) cutoff. dPCs above this SNR cutoff will be reported. -z whether or not to apply the dPCA-Z filter: 0 (default) = no; 1 = yes. If you use -z 1, you have to use the -r option as well. In the -r option, you need to specify a file that contains the information on how likely each genomic locus is bound in each dataset (e.g., a file that contains 1-FDR of the peak associated with each locus and dataset). This file is usually obtained from running dpca_peakcalls. -r full path for the file that contains the information on how likely each genomic locus is bound in each dataset. This file will be used to compute the R^B statistic or determine which differences should be set to zero in dPCA-Z. If you use -r without using -z, you will run dPCA-P + R^B. In that case, dpca will compute the R^B statistic for each differential locus and each dPC. If you use -r together with -z, you will run dPCA-Z. In that case, this file will be used to determine which d_{gm} should be set to zero to perform the dPCA-Z analysis. ################################## ### 5.3: understand the output ### ################################## dPCA will create a number of output files, including: 1. [Title]_proj.txt: estimated beta coefficients for each dPC. 2. [Title]_PC[j]_rank.txt: genomic loci ranked based on the j-th dPC. 3. [Title]_eigvec.txt: dPCs (eigenvectors). 4. [Title]_Dobs.txt: The D matrix, which contains the observed differences between the two conditions. This is the data analyzed by dPCA. 5. [Title]_Aobs.txt: The A matrix, which contains the observed absolute binding levels. 6. [Title]_Aproj.txt: The absolute binding projected to dPCs. The meanings of these files are explained below. ################################## 1. [Title]_proj.txt: estimated beta coefficients for each dPC. ################################## The file starts with several information lines: #rank: how many dPCs passed the SNR cutoff #s2: the estimated sigma^2 #locus: column names #eigenvalue: eigenvalue estimate for each dPC #percent_var: percentage of variance explained by each dPC in dPCA. #SNR: signal-to-noise ratio for each dPC. #sig_prc: estimate of pi_0. After these information lines, the remaining lines are the data for each genomic locus. Each line has multiple columns. Columns are tab-delimited and have the following meanings: column1=region name column2=chromosome column3=region start column4=region end column5=strand column6-9: dPC1 column6: beta_g1 column7: t-statistic column8: false discovery rate for testing differential ChIP-seq signals column9: R^B statistic (or NA if the -r option is not used) column10-13: dPC2 column10: beta_g2 column11: t-statistic column12: false discovery rate for testing differential ChIP-seq signals column13: R^B statistic (or NA if the -r option is not used) ... Users can open the file using Excel and freely sort the loci in different ways (e.g., t-statistic, R^B, etc.). ################################## 2. [Title]_PC[j]_rank.txt: ################################## Genomic loci ranked based on the n-th dPC. Each line is a genomic region. Regions are ranked from the most significant ones to the least significant ones. Each line has 9 columns. The columns are tab-delimited. Their meanings are: column1=region name column2=chromosome column3=region start column4=region end column5=strand column6=data projection scores to the dPC j, i.e., the coefficients beta_{gj} column7=t-statistic for testing significance column8=false discovery rate for testing differential ChIP-seq signals column9=R^B statistic (or NA if the -r option is not used) Users can open the file using Excel and freely sort the loci in different ways (e.g., t-statistic, R^B, etc.). ################################## 3. [Title]_eigvec.txt: ################################## Each column is a dPC, i.e., the eigenvector estimate v_j. ---------------------------------- 4. [Title]_Dobs.txt: The D matrix, which contains the observed differences between the two conditions. This is the data analyzed by dPCA. The file starts with several information lines: #s2: the estimated sigma^2 #locus: column names After these information lines, the remaining lines are the data for each genomic locus. Each line has multiple columns. Columns are tab-delimited and have the following meanings: column1=region name column2=chromosome column3=region start column4=region end column5=strand The remaining columns contain d_{gm} for each dataset. Each column corresponds to a dataset. ################################## 5. [Title]_Aobs.txt: The A matrix, which contains the observed binding levels. ################################## Each line corresponds to a input genomic locus. Columns are tab-delimited and have the following meanings: column1=region name column2=chromosome column3=region start column4=region end column5=strand The remaining columns are organized into groups of three, named: dataset[m]_1: bar{x}_{g1m}, i.e., mean binding level at locus g in condition 1 and dataset m. dataset[m]_2: bar{x}_{g2m}, i.e., mean binding level at locus g in condition 2 and dataset m. dataset[m]_A: bar{x}_{g1m}+bar{x}_{g2m}, i.e., absolute binding level at locus g in dataset m, ################################## 6. [Title]_Aproj.txt: The observed binding projected to dPCs (note: unlike R^B, these projections do not necessarily reflect the absolute binding level along each dPC). ################################## Each line corresponds to a input genomic locus. Columns are tab-delimited and have the following meanings: column1=region name column2=chromosome column3=region start column4=region end column5=strand The remaining columns are organized into groups of three, named: dPC[j]_1: alpha_{g1j} = (v_j)'*bar{x}_{g1}, i.e., mean binding level at locus g in condition 1 projected to dPC j. dPC[j]_2: alpha_{g2j} = (v_j)'*bar{x}_{g2}, i.e., mean binding level at locus g in condition 2 projected to dPC j. dPC[j]_A: alpha_{gj} = alpha_{g1j}+alpha_{g2j}.