Despite recent advances in experimental approaches to identifying transcriptional cis-regulatory modules
(CRMs, "enhancers"), direct empirical discovery of CRMs for all genes in all cell types under the full
range of environmental conditions is likely to remain an elusive goal. Effective, sensitive,
and widely applicable methods for computational CRM discovery are thus a critically needed complement
to empirical approaches. This is particularly true for CRMs regulating genes involved in less fully
described processes, for which the lack of knowledge of relevant transcription factor (TF) binding makes
ChIP-based experimental approaches impractical. We are developing a flexible suite of methods for the
identification and characterization of regulatory sequences when beginning with different degrees of
knowledge of the underlying regulatory network.
Our most successful efforts to date center on "motif-blind" in silico CRM discovery methods
that do not depend on knowledge or accurate prediction of TF binding sites and that are effective when
only limited knowledge of existing CRMs is available. A sliding genomic window is scored for similarity to a
training set consisting of a small number (as few as seven) of known CRMs that direct a common pattern of
gene expression. Scoring is based on the statistics of short-word (k-mer) frequencies, using any of several novel measures.
Empirical testing of predictions made with this strategy has so far yielded a 95% success rate in correctly
identifying sequences with CRM activity in transgenic reporter assays in both Drosophila and mouse (n=20).
We are working to improve how accurately our methods predict patterns of
regulated gene expression. We are also extending our CRM discovery methods to biomedically and/or agriculturally
important, but experimentally less well-characterized, arthropod species such as Anopheles gambiae,
Nasonia vitripennis, Tribolium castaneum and Apis mellifera.
Overall, our methods represent a flexible, accurate way of identifying regulatory sequences in both insect and
mammalian genomes and provide an important in silico adjunct to, and extension of, empirical CRM discovery approaches.
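
The sliding-window k-mer scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring measure used here (a mean log-ratio of training versus background k-mer frequencies, with pseudocounts) is a simple stand-in for the unnamed "novel measures", and the function names, the background set, and the parameters k, window and step are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(m for m in kmers if set(m) <= set(ALPHABET))

def kmer_freqs(seqs, k, pseudocount=1.0):
    """Pooled k-mer frequencies over a set of sequences, smoothed with a pseudocount."""
    counts = Counter()
    for s in seqs:
        counts.update(kmer_counts(s, k))
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in all_kmers) + pseudocount * len(all_kmers)
    return {m: (counts[m] + pseudocount) / total for m in all_kmers}

def log_ratio_weights(training, background, k):
    """Hypothetical stand-in measure: log(training frequency / background frequency)."""
    f_train = kmer_freqs(training, k)
    f_bg = kmer_freqs(background, k)
    return {m: log(f_train[m] / f_bg[m]) for m in f_train}

def score_windows(genome, weights, k, window, step):
    """Slide a window along genome; return (start, mean k-mer weight) for each window."""
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        counts = kmer_counts(genome[start:start + window], k)
        n = sum(counts.values())
        score = sum(c * weights[m] for m, c in counts.items()) / n if n else 0.0
        scores.append((start, score))
    return scores

if __name__ == "__main__":
    # Toy data only: real training sets would be known CRM sequences.
    training = ["ACGACGACGACGACGACG"] * 3
    background = ["TTGGTTGGTTGGTTGGTTGG"] * 3
    genome = "TTGG" * 10 + "ACG" * 10 + "TTGG" * 10
    weights = log_ratio_weights(training, background, k=3)
    for start, score in score_windows(genome, weights, k=3, window=30, step=10):
        print(start, round(score, 3))
```

Windows whose k-mer composition resembles the training set receive higher scores; in practice the window size, step, k, and choice of background would all need tuning, and candidate CRMs would be taken from the top-scoring windows.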