Motif-blind computational discovery of cis-regulatory modules
Saurabh Sinha
University of Illinois, USA

Despite recent advances in experimental approaches to identifying transcriptional cis-regulatory modules (CRMs, "enhancers"), direct empirical discovery of CRMs for all genes in all cell types under the full range of environmental conditions is likely to remain an elusive goal. Effective, sensitive, and widely applicable methods for computational CRM discovery are thus a critically needed complement to empirical approaches. This is particularly true for CRMs regulating genes involved in less fully described processes, for which lack of knowledge of relevant transcription factor (TF) binding makes ChIP-based experimental approaches impractical. We are developing a flexible suite of methods for the identification and characterization of regulatory sequences when beginning with different degrees of knowledge of the underlying regulatory network.

Our most successful efforts to date center on "motif-blind" in silico CRM discovery methods that do not depend on knowledge or accurate prediction of TF binding sites and that are effective when only limited knowledge of existing CRMs is available. A sliding genomic window is scored for similarity to a training set consisting of a small number (as few as seven) of known CRMs that direct a common pattern of gene expression. Scoring is based on the statistics of short-word (k-mer) frequencies under any of several novel measures. Empirical testing of our predictions using this strategy has so far yielded a 95% success rate in correctly identifying sequences with CRM activity in transgenic reporter assays in both Drosophila and mouse (n=20). We are working to increase the accuracy of our methods with respect to successful prediction of patterns of regulated gene expression. We are also extending our CRM discovery methods to biomedically and/or agriculturally important, but experimentally less well-characterized, arthropod species such as Anopheles gambiae, Nasonia vitripennis, Tribolium castaneum and Apis mellifera.
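
The sliding-window scoring idea described above can be illustrated with a minimal sketch. It is not the scoring used in this work: the novel measures are not specified in this abstract, so cosine similarity between k-mer frequency vectors stands in for them as an assumption. The function names (kmer_profile, cosine, scan) and the default values for k, window and step are likewise hypothetical and chosen only for a toy example.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seqs, k):
    """Pool overlapping k-mer counts over all sequences and normalize
    them to relative frequencies (an empty dict if no k-mers exist)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency vectors.
    ASSUMPTION: a stand-in measure, not one of the authors' measures."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def scan(genome, training_crms, k=3, window=20, step=5):
    """Slide a fixed-size window along the genome and score each window
    against the pooled k-mer profile of the training CRMs.
    Returns a list of (window_start, score) tuples."""
    profile = kmer_profile(training_crms, k)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        win = genome[start:start + window]
        scores.append((start, cosine(kmer_profile([win], k), profile)))
    return scores
```

In practice the highest-scoring windows would be taken as candidate CRMs. Real windows would be hundreds of base pairs long, and a background model for k-mer frequencies would likely be needed; neither is modeled in this sketch.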

Overall, our methods represent a flexible, accurate way of identifying regulatory sequences in both insect and mammalian genomes and provide an important in silico adjunct to and extension of empirical CRM discovery approaches.