ChIP-seq data were collected from GEO, SRA, and ENCODE at UCSC[1]. ChIP-seq samples were grouped into experiments based on their descriptions in GEO, SRA and ENCODE. For the first build of the database, we collected only experiments that had both ChIP and control samples. Reads were mapped to the corresponding genomes using Bowtie[2], allowing up to two mismatches. Only uniquely mapped reads were kept for subsequent analyses.
For each ChIP-seq sample, a protein-DNA binding intensity profile was generated as follows. First, the sequence reads in each sample were extended toward the 3' end by L bp, where L was the DNA fragment length reported for the ChIP-seq experiment. Next, for each genomic position, the number of DNA fragments (i.e. the extended reads) covering the position was counted. The counts for a subset of genomic positions were then stored. The stored positions were evenly spaced at 35 bp intervals, consistent with the probe density of Affymetrix tiling arrays. For example, for each chromosome, we stored counts at positions 17, 52, 87, 122, etc. We chose to store counts at 35 bp resolution rather than at single base pair resolution to save storage space and to make data retrieval more efficient. The stored counts are called the protein-DNA binding intensity profile of the ChIP-seq sample. For simplicity, and for consistency with ChIP-chip data, the stored positions will be called probes hereinafter, and the stored counts will be called probe intensities.
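The profile construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used to build the database. It assumes that each read is given by its 5' position (1-based) and strand, and that "extension" yields a fragment of L bp starting at that 5' position; the function name `intensity_profile` and the constants are hypothetical.

```python
PROBE_SPACING = 35  # distance between stored positions, from the text
FIRST_PROBE = 17    # first stored position on each chromosome, from the text


def intensity_profile(reads, frag_len, chrom_len):
    """Count extended reads covering each probe on one chromosome.

    reads     -- iterable of (five_prime_pos, strand), 1-based, strand "+" or "-"
                 (an assumed input layout, for illustration only)
    frag_len  -- fragment length L reported for the experiment
    chrom_len -- chromosome length in bp
    Returns (probe_positions, counts).
    """
    probes = list(range(FIRST_PROBE, chrom_len + 1, PROBE_SPACING))
    counts = [0] * len(probes)
    for five_prime, strand in reads:
        # Extend the read toward its 3' end to obtain the fragment interval.
        if strand == "+":
            lo, hi = five_prime, five_prime + frag_len - 1
        else:
            lo, hi = five_prime - frag_len + 1, five_prime
        lo, hi = max(lo, 1), min(hi, chrom_len)
        if lo > hi:
            continue
        # Index of the first probe at or after lo (ceiling division).
        first_idx = max(0, -(-(lo - FIRST_PROBE) // PROBE_SPACING))
        # Index of the last probe at or before hi (floor division).
        last_idx = min((hi - FIRST_PROBE) // PROBE_SPACING, len(probes) - 1)
        for i in range(first_idx, last_idx + 1):
            counts[i] += 1
    return probes, counts
```

Only the probes that fall inside each fragment are updated, so the per-base coverage never has to be held in memory, which reflects the storage argument made above.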
For each ChIP-seq experiment, we also applied CisGenome[3] to generate a peak list (step size = 35; window size = 175 (i.e. 5*35) for transcription factors and 525 (i.e. 15*35) for histone modifications). Peaks with FDR below 10% were retained. During the peak detection procedure, a log2 fold change between ChIP and control was computed for each probe, after averaging intensities across biological and technical replicates. To compute the log2 fold change for a genomic position, the DNA fragment count at that position in each sample was first normalized by dividing it by the total read count of that sample. The result was then multiplied by N, the smallest total read count among all samples in the experiment. The normalized counts from ChIP samples and those from control samples were then averaged separately, and the log2 fold change was computed as log2 [ (1 + mean normalized ChIP fragment count) / (1 + mean normalized control fragment count) ].
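The normalization and fold change computation above can be written out as follows. This is a sketch under the stated definitions only; it is not CisGenome code, and the function name `log2_fold_change` and its argument layout (one list of probe counts per sample, plus one total read count per sample) are assumptions made for illustration.

```python
import math


def log2_fold_change(chip_counts, chip_totals, control_counts, control_totals):
    """Per-probe log2 fold change between ChIP and control samples.

    chip_counts, control_counts -- one list of probe counts per sample
    chip_totals, control_totals -- total read count of each sample
    """
    # N is the smallest total read count across all samples in the experiment.
    n = min(list(chip_totals) + list(control_totals))

    def mean_normalized(counts, totals):
        # Scale every sample to N reads, then average across replicates.
        scaled = [[c * n / t for c in sample] for sample, t in zip(counts, totals)]
        n_probes = len(scaled[0])
        return [sum(s[i] for s in scaled) / len(scaled) for i in range(n_probes)]

    chip_mean = mean_normalized(chip_counts, chip_totals)
    control_mean = mean_normalized(control_counts, control_totals)
    # The pseudocount of 1 keeps the ratio defined when a count is zero.
    return [math.log2((1 + a) / (1 + b)) for a, b in zip(chip_mean, control_mean)]
```

For instance, with two ChIP replicates of 1000 and 2000 reads and one control of 1000 reads, N is 1000, so the second ChIP replicate is scaled down by half before averaging.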
In summary, each ChIP-seq experiment in hmChIP is associated with one peak list, one log2 fold change profile, and one or more samples. Each sample in the experiment is associated with a protein-DNA binding intensity profile. These files are stored in the same formats as those used for ChIP-chip data.


[1] Kent, W.J., Sugnet, C.W., Furey, T.S., et al. (2002) The human genome browser at UCSC. Genome Res. 12, 996-1006.

[2] Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B, 57, 289-300.

[3] Ji, H., Jiang, H., Ma, W., et al. (2008) An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol. 26, 1293-1300.



    Li Chen

    Dept. of Biostatistics

    Johns Hopkins Univ