Capstone Project

M.H.S. in Bioinformatics

Jichao Chen


        Genes have specific spatial and temporal expression patterns that allow them to carry out their normal functions. Gene regulation is achieved by the binding of transcription factors to their cognate binding sites (motifs) in gene promoters. Upon binding, transcription factors either activate or repress the expression of the corresponding genes. Each gene has multiple binding motifs in its promoter and is regulated by multiple transcription factors, while each transcription factor regulates the expression of multiple genes. It is the different combinations of binding motifs and transcription factors that produce the variety of expression patterns seen for different genes.

        Large-scale expression profiling data from microarray and in situ experiments make it straightforward to generate a list of presumably co-regulated genes, based either on similarity in spatial expression pattern (in situ) or on the way functionally related genes are mis-regulated in disease states (microarray). In many biological contexts, the most parsimonious explanation is that these genes share binding motifs for, and are therefore regulated by, at least one transcription factor. Identifying such common binding motifs may not only help us understand the complex process of gene regulation, but also lead to the identification of the transcription factor involved.

        Gene regulation is complex. A typical binding motif is short (6-10 bases) compared with the several kilobases of promoter sequence in which it sits. The binding motif for a given transcription factor is also degenerate, tolerating mismatches at various positions. As a result, the signal-to-noise ratio is low for most motifs. In vivo, transcription factors certainly use information from higher-order chromatin structure (which is mostly unavailable) as well as primary sequence to increase binding specificity. Nevertheless, given a relatively short list of candidate motifs, biologists can verify them experimentally (by chromatin IP or gel shift assay).
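
        To make the signal-to-noise problem concrete, the following Python sketch estimates how often a short, degenerate motif would match a promoter purely by chance. It is a back-of-the-envelope model and not part of any published method: it assumes independent, equally frequent bases (real promoters are not uniform), and the function names and parameters are hypothetical, introduced only for this illustration.

```python
from math import comb


def match_probability(k, max_mismatches):
    """Probability that a random k-mer matches a fixed motif of length k
    with at most max_mismatches mismatches.

    Assumption (illustrative only): bases are independent and each of
    A, C, G, T occurs with probability 1/4.
    """
    return sum(
        comb(k, i) * 0.75 ** i * 0.25 ** (k - i)
        for i in range(max_mismatches + 1)
    )


def expected_chance_hits(promoter_length, k, max_mismatches, both_strands=True):
    """Expected number of chance matches in one promoter.

    Counts every window of length k, on one or both strands, and
    multiplies by the per-window match probability.
    """
    windows = promoter_length - k + 1
    strands = 2 if both_strands else 1
    return strands * windows * match_probability(k, max_mismatches)


if __name__ == "__main__":
    # An 8-base motif with one allowed mismatch in a 2,000-base promoter.
    print(round(expected_chance_hits(2000, 8, 1), 2))
```

        Under these simplifying assumptions, an 8-base motif with one allowed mismatch is expected to match a 2,000-base promoter about 1.5 times by chance when both strands are scanned, which illustrates why a single occurrence in a single promoter carries little information.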

        Several programs are available for identifying common binding motifs (see references). This program ("Comtifinder") is designed to complement existing methods and has the following features. (1) It exhaustively searches both strands of the input sequences for all possible short sequences, allowing mismatches at any position, and then summarizes the common short sequences in a unique way that takes advantage of the fact that the real motif will be represented multiple times among these short sequences (see the method section for details). (2) Unlike existing programs, it increases the signal-to-noise ratio by specifying a high cut-off, so that more than 85% of the input sequences must contain the common motif. This cut-off can be met by carefully selecting the candidate sequences to be analyzed. (3) Although the on-line version limits the search space to motifs of 7 or 8 nucleotides, it can easily be scaled to motifs of arbitrary length and number of mismatches. Users can increase the search speed and specificity if they have a priori knowledge of the length or complexity (e.g. at least two types of nucleotides) of the motif.
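
        The exhaustive search and cut-off described in features (1) and (2) can be illustrated with a minimal Python sketch. This is not Comtifinder's implementation: the summarization step is omitted, and all function names, parameters, and defaults are hypothetical. It assumes upper-case DNA sequences over A, C, G, and T, enumerates every possible sequence of length k, and keeps those found (with a limited number of mismatches, on either strand) in more than 85% of the input sequences.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def contains(seq, candidate, max_mismatches):
    """True if candidate matches some window of seq, on either strand,
    with at most max_mismatches mismatches."""
    k = len(candidate)
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            if hamming(strand[i:i + k], candidate) <= max_mismatches:
                return True
    return False


def common_candidates(sequences, k, max_mismatches=1, min_fraction=0.85):
    """Map each length-k candidate to the number of input sequences that
    contain it, keeping only candidates present in MORE than
    min_fraction of the sequences (hypothetical parameter names)."""
    if not sequences:
        return {}
    hits = {}
    for letters in product("ACGT", repeat=k):  # all 4**k candidates
        candidate = "".join(letters)
        n = sum(contains(s, candidate, max_mismatches) for s in sequences)
        if n / len(sequences) > min_fraction:
            hits[candidate] = n
    return hits


if __name__ == "__main__":
    # Toy input: the third sequence carries the motif on the opposite strand.
    promoters = ["CCGATTACC", "TTGATTAGG", "GGTAATCGG"]
    print(sorted(common_candidates(promoters, k=5, max_mismatches=0)))
```

        Enumerating all 4**k candidates keeps the sketch simple but grows quickly with k, and a candidate and its reverse complement are always reported together because both strands are scanned. Restricting candidates by known length or complexity, as suggested above, would shrink this search space.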