Data Query

Users can search for available experiments by providing names of transcription factors (TF) or histone modifications, cell type names, and/or a list of genomic regions (the regions should be in BED or COD file format). If a protein name is provided as the key word, the server will find all gene symbols and alternate gene names that contain a match to the name. The match does not need to be exact: any gene whose symbol or alternate name contains the key word will be treated as a match, and the corresponding experiments will be returned. For example, if one uses protein name = ER as the key word, one will get not only experiments for estrogen receptor (ER), but also experiments for GATA1, which has an alternate gene symbol ERYF1. If a cell type name is provided as the key word, the server will match the key word against the experiment descriptions in the database, and experiments whose description contains the key word will be returned.
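The key word matching described above can be sketched as follows. This is only an illustration, not the server's code: the records, the field names, the case-insensitive comparison, and the choice to combine a protein key word and a cell type key word with AND are all assumptions made for the example.

```python
# Hypothetical experiment records; the field names are illustrative and are
# not the actual hmChIP database schema.
EXPERIMENTS = [
    {"id": "exp1", "symbols": ["ESR1", "ER"],
     "description": "ER ChIP-seq in MCF-7 cells"},
    {"id": "exp2", "symbols": ["GATA1", "ERYF1"],
     "description": "GATA1 ChIP-seq in K562 cells"},
    {"id": "exp3", "symbols": ["MYC"],
     "description": "MYC ChIP-chip in HeLa cells"},
]


def search(protein=None, cell_type=None, experiments=EXPERIMENTS):
    """Return ids of experiments matching the given key words.

    A protein key word matches if it is a substring of any gene symbol or
    alternate name; a cell type key word matches if it is a substring of
    the description. With no key word, every experiment is returned.
    Case-insensitive matching is an assumption of this sketch.
    """
    hits = []
    for exp in experiments:
        if protein and not any(protein.upper() in s.upper()
                               for s in exp["symbols"]):
            continue
        if cell_type and cell_type.upper() not in exp["description"].upper():
            continue
        hits.append(exp["id"])
    return hits


print(search(protein="ER"))       # matches the ER and GATA1 (ERYF1) records
print(search(cell_type="K562"))   # matches on the description text
print(search())                   # no key word: all experiments
```

Because the match is by substring, a short key word such as ER can return experiments for unrelated proteins, which is the behaviour the GATA1 example describes.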
If no key word is provided by users, all experiments in the database will be returned. The description of each experiment will be displayed in a column called “Description”. This column provides information about the cellular context of each experiment. It can also help users find appropriate key words for querying the database by cell type. If a list of genomic regions is provided, the experiments in hmChIP will be ranked by the degree of overlap between the peak list of each experiment and the query region list. The degree of overlap is determined as follows.
First, for each peak list in the database, a list of random genomic control regions was generated. The control regions were chosen using the method described in Ji et al. (2006) to carefully match the distributional properties of the original peak list. In other words, the distances of the control regions to their closest transcription start sites (TSS) were required to have a distribution similar to the distribution of the distances between the real peaks and their closest TSS. The length distribution of control regions was also required to match the length distribution of the peak list. The control regions chosen in this way are called “matched genomic controls” since their properties match the properties of the original peak list.
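The idea of matching can be illustrated with a deliberately simplified sketch. It is not the procedure of Ji et al. (2006), whose details are not given here. It assumes a single chromosome, integer coordinates, and a sorted list of TSS positions, and it matches each peak one-to-one (same length, same distance from the region midpoint to the closest TSS), which is a stricter condition than matching the two distributions. All function names are hypothetical.

```python
import bisect
import random


def nearest_tss_distance(pos, tss_sorted):
    """Distance from pos to the closest entry of the sorted TSS list."""
    i = bisect.bisect_left(tss_sorted, pos)
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    return min(abs(pos - t) for t in candidates)


def matched_controls(peaks, tss_sorted, rng, max_tries=1000):
    """Draw one control region per peak (simplified, single chromosome).

    peaks: list of (start, end) tuples. Each control has the same length
    as its peak, and its midpoint lies at the same distance from the
    closest TSS. A peak for which no control is found within max_tries
    draws is skipped. Controls are not prevented from overlapping peaks.
    """
    controls = []
    for start, end in peaks:
        length = end - start
        target = nearest_tss_distance((start + end) // 2, tss_sorted)
        for _ in range(max_tries):
            anchor = rng.choice(tss_sorted)
            mid = anchor + rng.choice((-1, 1)) * target
            c_start = mid - length // 2
            # Accept only if the region is on the chromosome and the
            # closest TSS is really at the target distance.
            if c_start >= 0 and nearest_tss_distance(mid, tss_sorted) == target:
                controls.append((c_start, c_start + length))
                break
    return controls


# Illustrative values only.
tss = [1000, 5000, 20000]
peaks = [(1100, 1300), (4000, 4400)]
print(matched_controls(peaks, tss, random.Random(0)))
```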
Given a query region list, the number of query regions that overlap with each peak list in hmChIP will be computed. For peak list i, let t1i denote the total number of peaks in the peak list, and m1i denote the number of overlapping regions between the peak list and the query list. Let t0i denote the total number of regions in the corresponding control list, and m0i denote the number of overlapping regions between the control region list and the query list. The degree of overlap between the query region list and peak list i is defined as the ratio ri = (m1i/t1i) / (m0i/t0i). The ratio ri represents the fold enrichment of the observed overlap compared to the overlap expected by chance. For example, a fold enrichment of 2 means that the observed overlap is twice the overlap expected by chance, so about 50% of the observed overlap could have occurred by chance. To avoid zeros in the denominators, and to make the fold enrichment estimate more stable, a pseudo-count a=5 is added to both numerators and denominators in the computation; that is, ri is actually computed as [(m1i+a)/(t1i+a)] / [(m0i+a)/(t0i+a)]. Once the ratios ri are computed, they will be used to rank the experiments. Experiments with a larger ri will be listed first in the query results.
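The ratio and the ranking step can be written out as a short sketch. The formula and the pseudo-count a=5 come from the text above; the function names and the counts in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fold_enrichment(m1, t1, m0, t0, a=5):
    """Fold enrichment r = [(m1+a)/(t1+a)] / [(m0+a)/(t0+a)].

    m1, t1: overlapping peaks and total peaks in the peak list.
    m0, t0: overlapping regions and total regions in the control list.
    a: pseudo-count added to every numerator and denominator.
    """
    return ((m1 + a) / (t1 + a)) / ((m0 + a) / (t0 + a))


def rank_experiments(counts):
    """Sort experiments by decreasing fold enrichment.

    counts: dict mapping an experiment name to (m1, t1, m0, t0).
    Returns a list of (name, ratio) pairs, largest ratio first.
    """
    ratios = {name: fold_enrichment(*c) for name, c in counts.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up counts for three experiments.
example = {
    "expA": (95, 995, 45, 995),   # (100/1000) / (50/1000) = 2
    "expB": (45, 995, 45, 995),   # same overlap as the controls: ratio 1
    "expC": (0, 995, 0, 995),     # no overlap at all: the ratio is still defined
}
for name, ratio in rank_experiments(example):
    print(name, round(ratio, 3))
```

The last entry shows why the pseudo-count is useful: without it, a control list with no overlap would put a zero in the denominator.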
For each peak list, a Fisher exact test will also be carried out using the four numbers (m1i, t1i, m0i, t0i) to test whether the percentage of overlap with the peak list is the same as the percentage of overlap with the control list (i.e. whether the true means for m1i/t1i and m0i/t0i are the same). The p-values obtained from the tests will be adjusted for multiple testing using the Benjamini-Hochberg procedure. The adjusted p-values (FDR) will be reported together with the enrichment ratio ri.
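The test and the adjustment can be sketched with the Python standard library alone. Two points are assumptions of this sketch, since the text does not specify them: the four numbers are arranged as the 2x2 table [[m1i, t1i-m1i], [m0i, t0i-m0i]], and the test is two-sided. The function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def overlap_test(m1, t1, m0, t0):
    """P-value comparing m1/t1 with m0/t0 (table layout is an assumption)."""
    return fisher_exact_two_sided(m1, t1 - m1, m0, t0 - m0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts (m1, t1, m0, t0) for three peak lists.
counts = [(95, 995, 45, 995), (45, 995, 45, 995), (60, 500, 40, 500)]
pvals = [overlap_test(*c) for c in counts]
print(benjamini_hochberg(pvals))
```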




    Li Chen

    Dept. of Biostatistics

    Johns Hopkins Univ