Identification of regulatory signals using graph modularity analysis
Alexandre P. Francisco1, Sophie Schbath2, Arlindo L. Oliveira1 and Ana T. Freitas1
1INESC-ID / IST, Tech Univ of Lisbon, Portugal (aplf,aml,atf@inesc-id.pt)
2INRA, France (sophie.schbath@jouy.inra.fr)


In the last decade, after the completion of genome sequencing projects for various organisms, the study of gene regulation and gene expression mechanisms has required new computational approaches. Despite the remarkable success of these tools in some areas of application like gene finding, sequence alignment, etc, there are still problems for which no definitive methods have been developed. Notably, the accurate identification of biologically meaningful nucleotide sequences in cis-regulatory regions remains an open problem.

Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since their output is easier to interpret. Combinatorial methods have the potential to identify hard to detect motifs, i.e. both strong and weak signals, but they deluge the user with a large, possibly huge, number of motifs, consisting of hundreds or thousands variations of the motifs of interest.

In this work, we propose a method that can be used to process the output of combinatorial motif finders in order to find groups of motifs that represent variations of the same motif, thus reducing the outputs to manageable sizes. This output processing step is done by building a graph that represents the co-occurrences of motifs. This graph has one node for each motif found, and one edge between two motifs if they have significant occurrence overlap. The identification of motifs is performed by finding communities in the co-occurrences graph using graph clustering techniques. The motifs in each cluster are finally combined into a composed representation, and a position weight matrix (PWM) is generated. Note that our approach is also able to process and identify complex motifs, i.e. motifs that are built of two or more simple motifs, spaced by a number of bases that falls within a specific range.

We show that this innovative approach leads to a method that is as easy to use as a probabilistic motif finder, and as sensitive to low quorum motifs as a combinatorial motif finder. Furthermore, we derive an approach to evaluate the statistical significance of motif clusters, providing motif rankings, and we show how to combine the output of different motif finders, both combinatorial and probabilistic, leading to an innovative integration method which helps to integrate, analyze and compare the output of several motif finders.

Several tests have been preformed against several well known motif finders, using a set of recently published large-scale compendium of transcription factors, derived from diverse high-throughput experiments in several metazoan. The results show that the method is highly competitive with state of the art methods that use much more extensive information.