User's Manual

User Manual

This page provides a detailed manual for running CisGenome as command line programs. If you are a new user of CisGenome, we recommend you to go through our Tutorial first for a quick start. Unlike the tutorial which is designed for GUI users, the manual here is designed for command mode users and contains more detailed usage information for various CisGenome functions.

User Manual for:

Running CisGenome (Core) as Command Line Programs (This file is a little outdated, and we are updating it now. See the common pipelines below for the latest usage information of key functions.)

The most commonly used analysis pipelines and sample parameter files (a simplified version of the user manual) are here:

General notes

1. Analyzing Affymetrix tiling array data

2. Analyzing other tiling array data

3. Analyzing one-sample ChIP-seq data

4. Analyzing two-sample ChIP-seq data (CisGenome v2 new & recommended)

5. Analyzing two-sample ChIP-seq data (CisGenome v1 old)

6. Get gene annotation and genome sequences

7. De novo motif discovery

8. Motif mapping and motif enrichment

9. File format conversion

10. Other utilities

Format and meaning of the key output files:

1. The peak list (*_all.cod) and signals (*.bar) from tiling array analysis

2. The peak list (*.cod) and signals (*.bar) from one-sample ChIP-seq analysis

3. The peak list (*.cod) and signals (*.bar) from CisGenome v1 two-sample ChIP-seq analysis

4. The peak list (*.cod) and signals (*.bar) from CisGenome v2 SeqPeak two-sample peak calling

Common pipelines

General notes:

To know the meaning of each parameter, please type the program name and press enter (e.g., type “tilemapv2_quantilenorm” and press Enter will give you usage information for tilemapv2_quantilenorm.

All bar files generated below can be visualized using CisGenome browser.

(*) means required steps, and you should run them according to the specified order here. The other functions are optional.

1. Analyzing Affymetrix tiling array data:

1.1* normalization: tilemapv2_importaffy sample_importaffy_arg.txt

1.2* peak detection: tilemapv2 sample_tilemap_arg.txt

Note: peak list will be saved to a [*].cod file. Signals for visualization will be exported to [*].bar files. [*]fc.bar are probe-level log2(IP/control) fold changes. [*]ma.bar are TileMap-MA or HMM statistics.

1.3 get information for a list of genomic regions:

tilemapv2_regioninfo -i sample_region.cod -o output.cod -d sample_regioninfo_arg.txt

1.4 convert bar file to txt file: affy_bar2txt -i sample_input.bar -o output.txt

1.5 convert txt file to bar file: tilemapv2_txt2bar -i sample_input.txt -d /workingfolder/ -o output.cgw

2. Analyzing other tiling array data (data in tab-delimited txt file):

2.1* normalization: tilemapv2_quantilenorm -i sample_input.txt -o output.txt -c 2 -l 1 -t l

2.2* convert normalized txt file to bar files: see tilemapv2_txt2bar in step 1.4

2.2* peak detection: use tilemapv2 as in step 1.5.

Other functions are used in the same way as analyzing Affymetrix data.

3. Analyzing one-sample ChIP-seq data:

3.1* sort the chromosome location of the input alignment file:

tablesorter_str sample_ChIP.aln

Note: the ALN file is the standard input for CisGenome ChIP-seq analysis. If your alignment file is generated by ELAND, you can use file_eland2aln to convert your file to ALN format.

3.2* convert the sorted file from txt format to bar format:

hts_aln2barv2 -i sample_ChIP.aln.sort -o sample_ChIP.bar

Note: the program will generate three bar files, ending with .bar, _F.bar and _R.bar respectively. They contain all reads, 5’ reads and 3’ reads respectively. In your later steps, only use .bar as you –i option. Do not use _F.bar or _R.bar since the program will automatically use them when needed.

3.3* compute FDR and choose cutoff

hts_windowsummaryv2 -i sample_ChIP.bar -g /data/genomes/human/hg18/chrlist.txt -l /data/genomes/human/hg18/chrlen.txt -w 100 -o summarys1.txt

3.4* first round of peak detection (you should set -c to be the read count level that satisfies your FDR requirement. FDR for each read count level is given in the last column of summarys1.txt obtained from the last step).

hts_peakdetectorv2 -i sample_ChIP.bar -d /workingfolder/ -o output -w 100 -s 25 -c 10 -br 1 -brl 30 -ssf 1

Note: peak list will be saved to output.cod file. Signals for visualization will be exported to [*].bar files. To generate the bar files, a W bp sliding window with step size S is used to scan the genome. When -s (the step size) is set to a small number, you will need VERY BIG memory to process the bar files. You may not have sufficient memory. If so, try to set -s to a bigger number, for example, use -s 100 instead of the default -s 25.

3.5 shift reads towards fragment center by L bp. L is the half of the median length of the peaks. Peak coordinates and peak length can be found in output.cod obtained from the last step (here let’s say L=35)

hts_alnshift2bar -i sample_ChIP.bar -s 35

Note: It will generate a file called sample_ChIP.bar_C.bar. In your later steps, you should still use sample_ChIP.bar as your -i option instead of using sample_ChIP.bar_C.bar. The program will automatically directed to _C.bar file when -z is set to 1.

3.6 recompute FDR and rechoose cutoff:

hts_windowsummaryv2 -i sample_ChIP.bar -g /data/genomes/human/hg18/chrlist.txt -l /data/genomes/human/hg18/chrlen.txt -w 100 -o output.txt -z 1

3.7 second round of peak detection

hts_peakdetectorv2 -i sample_ChIP.bar -d /workingfolder/ -o output -w 100 -s 25 -c 10 -br 1 -brl 30 -ssf 1 -z 1

4. Analyzing two-sample ChIP-seq data (CisGenome v2 new):

4.1* Convert the alignment obtained by eland or bowtie to ALN files:

file_eland2aln -i sample_input.eland.out -o input.aln

OR file_eland2aln_v2 -i elandv2_out.txt -o input.aln

OR file_bowtie2aln -i bowtie_out.txt -o input.aln

OR (if your alignment is in BED format), file_bed2aln -i input.bed -o input.aln

4.2* Prepare a tab-delimited text file that contains all DNA samples to be analyzed. An example file is here: sample_seqpeakfilelist.txt. The first column gives the full file path. The second column indicates the DNA sample type (1 = IP, 0 = control). The ALN files in this example is here: Myc_hg18_ChIP-seq_data (you need to unzip it using WinZip, WinRAR etc. before testing peak calling).

4.3* Run seqpeak peak calling:

seqpeak -i sample_filelist.txt -d . -o mypeak

Here -d specifies a folder to export results, -o is the output file title. There are a number of optional parameters. The meaning of the parameters are explained in the tutorial 2 section 2.2. You can also learn about them by typing seqpeak and then press “Enter”.

4.4 Interpret results based on this.

5. Analyzing two-sample ChIP-seq data (CisGenome v1 old):

5.1* sort the chromosome location of the input alignment file:

Run tablesorter_str for ChIP and control samples respectively (see 3.1)

5.2* convert the sorted file from txt format to bar format:

Run hts_aln2barv2 for ChIP and control samples respectively (see 3.2)

5.3* compute FDR

hts_windowsummaryv2_2sample -i sample_ChIP.bar -n sample_control.bar -g /data/genomes/human/hg18/chrlist.txt -l /data/genomes/human/hg18/chrlen.txt -w 100 -o summarys2.txt

Note: it will not only generate summarys2.txt, but also three files called summarys2.txt.fdr, summarys2.txt.pos.fdr and summarys2.txt.neg.fdr which contain FDR information. summarys2.txt.fdr will be used for -f option in the next step.

5.4* first round of peak detection (you should set -p according to the “dP0_hat” provided in the first line of summarys2.txt)

hts_peakdetectorv2_2sample -i sample_ChIP.bar -n sample_control.bar -d /workingfolder/ -o output -f summarys2.txt.fdr -c 0.1 -m 10 -w 100 -s 25 -p 0.5 -br 1 -brl 30 -ssf 1

5.5 shift reads towards fragment center by L bp.

Run hts_alnshift2bar for ChIP and control samples respectively (see 3.5)

5.6 recompute FDR:

hts_windowsummaryv2_2sample -i sample_ChIP.bar -n sample_control.bar -g /data/genomes/human/hg18/chrlist.txt -l /data/genomes/human/hg18/chrlen.txt -w 100 -o summarys2.txt -z 1

5.7 second round of peak detection

hts_peakdetectorv2_2sample -i sample_ChIP.bar -n sample_control.bar -d /workingfolder/ -o output -f summarys2.txt.fdr -c 0.1 -m 10 -w 100 -s 25 -p 0.5 -br 1 -brl 30 -ssf 1 -z 1

6. Get gene annotation and genome sequences:

6.1 Encode the gene annotation database (convert it from UCSC refFlat format to CisGenome recognizable format)

refflat_encode -d /data/genomes/human/hg18/annotation/refFlat.txt -o /data/genomes/human/b35_hg17/annotation/refFlat_sorted.txt -s human -n 24

6.2 annotate a peak list (or a list of regions) with the closest genes:

refgene_getnearestgene -d /data/genomes/human/hg18/annotation/refFlat_sorted.txt -dt 1 -s human -i sample_region.cod -o output.cod -r 0 -up 8000 -down 2000

6.3 annotate a peak list with k-closest neighboring genes:

reflocus_getneighborgenes -d /data/genomes/human/hg18/annotation/refLocus_sorted.txt -s human -i sample_region.cod -o output.cod -g 1000000 -up 3 -down 3

6.4 summarize physical distribution of peaks in relation to promoters, exons, introns, etc.

refgene_getlocationsummary -d /data/genomes/human/hg18/annotation/refFlat_sorted.txt -dt 1 -s human -i sample_region.cod -o output_gene.txt -r 0

6.5 get DNA sequences:

genome_getseq_c -d /data/genomes/human/hg18/ -s human -i sample_region.cod -o output.fa

6.6 get matched genomic control regions

refgene_getmatchedcontrol -d /data/genomes/human/hg18/annotation/refFlat_sorted.txt -dt 1 -s human -c /data/genomes/human/hg18/chrlen.txt -i sample_region.cod -o output_matchcontrol.cod -n 1 -l 1000 -nr 1

6.7 remove repeat-enriched regions from region list:

genome_getmaskedreg_c -d /data/genomes/human/hg18/ -r 0.5 -s human -i sample_region.cod -o output_masked.txt

6.8 get conservation summary for a region list:

genome_regioncssummary -gd /data/genomes/human/hg18/ -i sample_region.cod -o output_conservation.txt -c 40 -cd /data/genomes/human/hg18/conservation/phastcons/

7. De novo motif discovery:

7.1 Gibbs motif sampler: flexmodule_motif flexmodule_arg.txt

Note: The example searches for 10 motifs simultaneously. Before running, you also need a FASTA file to contain the DNA sequences (sample.fa) and a file to specify prior frequency of each motif (freqpriorquality10.txt).

7.2 A flexible module sampler: flexmodule_[*] (to be announced).

8. Motif mapping and motif enrichment:

8.1 Map motif matrix to a list of genomic regions:

motifmap_matrixscan_genome -m motif.mat -gd /data/mm6 -i sample_region.cod -o output_motifmap.txt -r 500 -b 3 -bt genome -bd /data/mm6/markovbg/S100000_W1000000 -bs 100000 -c 40 -cd /data/mm6/conservation/phastcons/

Note: motif.mat is the motif matrix. The matrix contains pseudocounts to specify the frequency of A,C,G,T respectively. All cells in the matrix must be bigger than 0. You can use a small number like 0.1 for a rare base and a big number, say 150, for a frequent base. “-c” specifies conservation cutoff. -c 40 means only sites with phastcons score >= 40 are retained. Phastcons score is linearly transformed from [0,1] to [0, 255] in CisGenome.

8.2 Map motif matrix to FASTA file:

motifmap_matrixscan -m motif.mat -d /workingfolder/ -i sample.fa -o output_motifmap.txt -r 500 -b 3

8.3 Map motif consensus to a list of genomic regions:

motifmap_consensusscan_genome -m motif.cons -gd /data/mm6 -i sample_region.cod -o output_motifmap.txt -mc 2 -md 1 -c 40 -cd /data/mm6/conservation/phastcons/

8.4 Map motif consensus to FASTA file:

motifmap_consensusscan -m motif.cons -d /workingfolder/ -i sample.fa -o output_motifmap.txt -mc 2 -md 1

8.5 Compare relative enrichment of a list of motif matrices

motifmap_matrixscan_genome_summary -mr motiflist.txt -gd /data/mm6 -i input.cod -n control.cod -o output_motifsenrichsummary.txt -b 3 -bt genome -bd /data/mm6/markovbg/S100000_W1000000/ -bs 100000 -c 40 -cd /data/mm6/conservation/phastcons/

8.6 Compare relative enrichment of a list of motif consensus

motifmap_consensusscan_genome_summary -mr consensuslist.txt -gd /data/mm6 -i input.cod -n control.cod -o output_motifsenrichsummary.txt -c 40 -cd /data/mm6/conservation/phastcons/

8.7 Examine motif enrichment as a function of peak rank

motifmap_matrixscan_genome_enrich -m motif.mat -gd /data/mm6 -i input.cod -n control.cod -s 100 -o output_motif_tierenrich.txt -r 500 -b 3 -bt genome -bd /data/mm6/markovbg/S100000_W1000000/ -bs 100000 -c 40 -cd /data/mm6/conservation/phastcons/

motifmap_consensusscan_genome_enrich -m motif.cons -gd /data/mm6 -i input.cod -n control.cod -s 100 -o output_motif_tierenrich.txt -mc 2 -md 1 -c 40 -cd /data/mm6/conservation/phastcons/

8.8 Extending motif sites

motifmap_getsitearound -i motifmap.txt -w 200 -d /data/genomes/mouse/mm6/ -s mouse -cn 1 -a 1 -o motif_e200.txt

8.9 Get average conservation for both motif sites and flanking positions

motifmap_getsitearoundcs -i motifmap.txt -o motif_cscurve.txt -l 12 -w 50 -s mouse -gd /data/mm6/ -cd /data/mm6/conservation/phastcons/

8.10 Get clustered motif sites

menu_motifmap_getcluster -i motifmap.txt -w 200 -o motif_clusteredsites.txt

9. File format conversion:

9.1 BED to COD (COD files are standard files recognized by CisGenome for manipulating a list of genomic regions)

file_bed2cod -i input.bed -o output.cod

9.2 COD to BED: file_cod2bed -i input.cod -o output.bed

9.3 BAR to TXT or WIG (BAR files are defined by Affymetrix. They are standard files for CisGenome visualization and ChIP-chip/ChIP-seq peak detection)

affy_bar2txt -i input.bar -o output.txt OR affy_bar2wig -i input.bar -o output.wig

9.4 FASTA soft-masked to hard-masked. In a soft-masked FA file, a,c,g,t are repeats, A,C,G,T are non-repeats. This function converts a,c,g,t in a soft-masked FA file to N to generate a hard-masked FA file. Some external motif discovery tools need hard-masked FA file as input. CisGenome motif discovery can use soft-maksed FA file directly and repeats are automatically skipped)

fasta_soft2hardmask -i input.fa -o output.fa

9.5 ELAND to ALN. Convert ELAND output to ALN file. ALN file is the standard file format recognized by CisGenome as the input for ChIP-seq analysis.

file_eland2aln -i sample_input.eland.out -o input.aln

OR file_eland2aln_v2 -i elandv2_out.txt -o input.aln

9.6 BOWTIE to ALN. Convert bowtie output to ALN file.

file_bowtie2aln -i bowtie_out.txt -o input.aln

9.7 BED to ALN. Convert a BED file to an ALN file.

file_bed2aln -i input.bed -o output.aln

10. Other utilities:

10.1 Count overlap among several lists of genomic regions (for Venn diagram)

region_overlap -n 3 -i peak1.cod peak2.cod peak3.cod -o OUTPUT_title

10.2 get information from a list of BAR files for a list of genomic regions.

tilemapv2_regioninfo -i sample_region.cod -o output.cod -d sample_regioninfo_arg.txt

Note: also the program is named as “tilemapv2_...”, it can be used to process any BAR files, e.g., BAR files obtained from ChIP-seq analyzes.

Format and meaning of the key output files

1. The peak list (*_all.cod) and signals (*.bar) from tiling array analysis

outputformat_tilemapv2.txt

2. The peak list (*.cod) and signals (*.bar) from one-sample ChIP-seq analysis

outputformat_onesamplechipseq.txt

3. The peak list (*.cod) and signals (*.bar) from two-sample ChIP-seq analysis

outputformat_twosamplechipseq.txt

4. The peak list (*.cod) and signals (*.bar) from CisGenome v2 SeqPeak two-sample peak calling

outputformat_seqpeak.txt