About MotifLab

MotifLab is a general workbench for analyzing regulatory sequence regions and discovering transcription factor binding sites and cis-regulatory modules. MotifLab can improve the performance of binding site predictions by allowing users to integrate several motif discovery tools as well as different types of data, for instance phylogenetic conservation, epigenetic marks, DNase HS sites, ChIP-Seq data, positional binding preferences of transcription factors, TF-TF interactions, TF expression and target gene expression.

With more than 40 data-processing operations available, users can create, import, export, transform and manipulate data objects such as DNA sequences, numeric or region-based data tracks, transcription factor binding motifs, regulatory modules, background models, text and numeric variables, and collections and partitions of data objects.

Various statistical analyses can be applied to evaluate or compare data objects and results can be output in many common text-based data formats or in HTML with images included. Interactive tools allow users to search for and highlight interesting features of datasets, for instance potential networks of binding sites for interacting transcription factors.

The program is written in Java and can be run in two ways. In interactive mode, a graphical user interface provides powerful data visualization capabilities. Alternatively, the analysis can be defined in a protocol script and executed by MotifLab in batch mode from a command-line interface. The latter option allows MotifLab to be incorporated as a component in larger data processing pipelines.

What can you do with MotifLab?

Import and export data

  • Automatically obtain sequence data from a list of gene identifiers or specified genomic intervals.
  • Annotate your sequences with data downloaded from internet resources such as the UCSC Genome Browser or DAS servers with the click of a button, or load your own annotation data from files.
  • Import and export data in a variety of supported formats, including FASTA, GFF, GTF, BED, WIG, PSP and CisML (for sequence data) and Transfac, Jaspar, TAMO, MEME minimal motif, XMS, INCLUSive and raw PSSM format (for motif data). Some data objects can even be output to formatted HTML with images.

Manipulate data objects

  • Transform numeric data objects with arithmetic operations (add, subtract, multiply, divide) or other mathematical functions such as logarithms and rounding functions.
  • Apply sliding window functions to numeric tracks to smooth data or to detect edges, peaks and valleys.
  • Mask portions of DNA sequences by changing the letter case or by replacing the bases with other letters such as N or X.
  • Combine information from several tracks into new tracks.
  • Most data objects can be edited manually within the GUI by altering their values in dialog boxes.
    You can also edit sequence annotation tracks by drawing directly into the sequence browser.
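
As an illustration of two of the operations listed above, the following Python sketch smooths a numeric track with a sliding-window mean and masks portions of a DNA sequence. It is not MotifLab's own implementation: the function names, the 0-based half-open region coordinates and the edge handling (windows are truncated at the track ends) are assumptions made for this example.

```python
from typing import List, Tuple


def sliding_mean(values: List[float], window: int) -> List[float]:
    """Smooth a numeric track with a centered moving average.

    `window` is expected to be odd; windows are truncated at the
    edges of the track so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        chunk = values[start:end]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def mask_regions(sequence: str, regions: List[Tuple[int, int]],
                 mode: str = "lower") -> str:
    """Mask 0-based, half-open [start, end) regions of a sequence.

    mode="lower" changes the letter case; any other mode replaces
    the bases with N.
    """
    bases = list(sequence)
    for start, end in regions:
        for i in range(max(0, start), min(len(bases), end)):
            bases[i] = bases[i].lower() if mode == "lower" else "N"
    return "".join(bases)


if __name__ == "__main__":
    print(sliding_mean([0, 0, 3, 0, 0], 3))
    print(mask_regions("ACGTACGT", [(2, 5)]))
    print(mask_regions("ACGTACGT", [(2, 5)], mode="N"))
```

A mean filter smooths a track; edge or peak detection would swap the averaging step for a different window function over the same loop.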

Discover binding motifs and binding sites

  • Import collections of motifs from databases like TRANSFAC and JASPAR with the click of a button, and scan your sequences to find potential matching sites.
  • Run external de novo motif discovery methods from within MotifLab to search for novel motifs.
  • Create positional priors tracks to guide motif discovery methods in their search, and use machine learning to train classifiers that can automatically generate such tracks based on information from various features.
  • Compare newly discovered motifs to existing motif libraries using different comparison measures.
  • Use biological knowledge and annotation data, such as conservation, DNase HS sites and ChIP-Seq data, to filter out likely false predictions.
  • Use ensemble methods to compare and combine predictions made by different algorithms.
  • Annotate your motifs with additional useful information, such as the names of transcription factors binding to the motif, which organisms and tissues these TFs are expressed in, and references to other factors that are known to interact with factors binding to the motif.

Discover cis-regulatory modules (composite motifs)

  • Search your sequences for recurring combinations of binding motifs that could represent cis-regulatory modules.
  • Alternatively, define your own modules manually and scan sequences to see if they contain these modules.
  • Impose constraints on the motifs appearing in a module, such as the order of the motifs, the distance between them or their relative orientation.
  • You can also generate a library of modules automatically based on information about known interaction partners for each motif (provided that your motifs have such annotations).
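
The constraints mentioned above (motif order, distance and relative orientation) can be sketched as a simple filter over predicted binding sites. The `Site` record, the function name and the coordinate convention (0-based, end-exclusive) are hypothetical and only serve this example; MotifLab's own module representation is not described in this text.

```python
from typing import List, NamedTuple, Tuple


class Site(NamedTuple):
    motif: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def find_module_pairs(sites: List[Site], first: str, second: str,
                      max_gap: int,
                      same_strand: bool = False) -> List[Tuple[Site, Site]]:
    """Find occurrences of a two-motif module.

    Constraints: `first` must lie upstream of `second` (order), the
    gap between them must be between 0 and `max_gap` bases (distance),
    and optionally both sites must share a strand (orientation).
    """
    firsts = [s for s in sites if s.motif == first]
    seconds = [s for s in sites if s.motif == second]
    matches = []
    for a in firsts:
        for b in seconds:
            gap = b.start - a.end
            if gap < 0 or gap > max_gap:
                continue
            if same_strand and a.strand != b.strand:
                continue
            matches.append((a, b))
    return matches


if __name__ == "__main__":
    sites = [
        Site("CREB", 10, 18, "+"),
        Site("SP1", 25, 35, "-"),
        Site("SP1", 100, 110, "+"),
    ]
    print(find_module_pairs(sites, "CREB", "SP1", max_gap=20))
```

The pairwise loop is quadratic in the number of sites, which is acceptable for a sketch; sorting sites by position would allow a faster sweep over long sequences.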

Analyze data

  • Count the number of times each motif occurs in your sequences and compare these numbers to expected frequencies to identify motifs that are significantly overrepresented in your dataset.
  • Compare one subset of your sequences to another to see if some motifs appear with higher frequency in one of the subsets.
  • Use linear regression to identify the motifs that correlate best with gene expression.
  • Find which motifs have the highest average conservation level across all their binding sites.
  • Analyze the positional distribution of binding sites to see if some motifs tend to appear at a certain distance relative to the transcription start site.
  • Compare different data tracks (or other objects) to each other to see how well they correlate.
  • Evaluate the ability of different features to be used as predictors of others.
  • Combine results from several different analyses into larger meta-analyses to simplify comparisons.
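
One common way to compare an observed motif count with an expected frequency is a one-sided binomial test, sketched below in Python. The text above does not say which statistic MotifLab uses, so the choice of test, the function names and the example numbers are assumptions made purely for illustration.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Probability of observing at least k successes in n trials
    when each trial succeeds independently with probability p."""
    return sum(
        math.comb(n, i) * p ** i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def overrepresentation_p_value(sequences_with_motif: int,
                               total_sequences: int,
                               expected_fraction: float) -> float:
    """P-value for a motif occurring in more sequences than expected.

    `expected_fraction` is the fraction of sequences expected to
    contain the motif by chance, for example estimated from a
    background set.
    """
    return binomial_upper_tail(sequences_with_motif,
                               total_sequences,
                               expected_fraction)


if __name__ == "__main__":
    # Hypothetical numbers: motif found in 30 of 100 sequences,
    # expected in 10% of sequences by chance.
    print(overrepresentation_p_value(30, 100, 0.10))
```

When many motifs are tested at once, the resulting p-values would normally be corrected for multiple testing before any motif is called significant.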

Visualize and explore data interactively

  • Sequences, along with any annotation data tracks and motif prediction tracks, are visualized in an internal sequence browser. Since all data is kept in memory at all times, the browser supports very fast navigation, with panning to show different parts of the sequences and zooming to any scale.
  • The visualization is highly customizable. Numeric tracks can be displayed as either graphs or heatmaps (color gradients) and region tracks can be displayed in "compact mode" with overlapping regions on top of each other or in "expanded mode" with overlapping regions beneath each other.
  • Tracks containing motif and module predictions have special status. At 1000% scale and higher, motif binding regions can be shown with superimposed "match logos" that convey information about both the appearance of the binding motif itself and also how well the motif matches the sequence in that region.
  • The visibility of individual motifs, modules, sequences or data tracks can be easily altered to show only the parts that you are most interested in focusing on at any moment.
  • Interactive tools can be used to search for motifs and highlight binding sites with selected properties, for instance motifs that match the consensus sequence "CACGTG", motifs that bind the transcription factor "CREB" or binding sites where the average conservation level within the site is greater than some dynamically chosen cutoff value.
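
The selection criteria in the last point can be expressed as small filters. The Python sketch below matches a consensus sequence written with IUPAC codes and keeps binding sites whose average conservation exceeds a cutoff. It only mimics the kind of query described above; the function names, the per-base conservation list and the site coordinates (0-based, end-exclusive) are assumptions for this example.

```python
import re
from typing import List, Pattern, Tuple

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> Pattern[str]:
    """Compile an IUPAC consensus such as CACGTG or CANNTG."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))


def sites_above_conservation(sites: List[Tuple[int, int]],
                             conservation: List[float],
                             cutoff: float) -> List[Tuple[int, int]]:
    """Keep sites whose mean per-base conservation exceeds the cutoff."""
    selected = []
    for start, end in sites:
        values = conservation[start:end]
        if values and sum(values) / len(values) > cutoff:
            selected.append((start, end))
    return selected


if __name__ == "__main__":
    pattern = consensus_to_regex("CACGTG")
    print([m.start() for m in pattern.finditer("TTCACGTGAACACGTG")])
    print(sites_above_conservation([(0, 2), (2, 4)],
                                   [0.9, 0.8, 0.1, 0.2], cutoff=0.5))
```

Changing the `cutoff` argument and re-running the filter corresponds to the dynamically chosen cutoff value mentioned above.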