Gibbs Sampler Demo

The "Gibbs Sampler Demo" is an interactive implementation of the popular Gibbs Sampler motif discovery method first proposed by Lawrence et al. [1].
It is a nice educational tool for understanding how motif discovery works in practice since it visualizes every step in the process.


The Gibbs Sampler motif discovery algorithm

The Gibbs Sampler algorithm is an example of an alignment-based approach to motif discovery. In contrast to, e.g., word-counting algorithms, which exhaustively enumerate every possible DNA word of a given length to find motifs that are significantly overrepresented in a sequence set, the Gibbs Sampler treats motif discovery as an optimization problem: it starts off with an initial model and then iteratively tries to improve this model using a stochastic hill-climbing approach.
  1. The algorithm starts off by selecting a random candidate binding site (TFBS) from each sequence and builds a matrix model based on those sites. This initial model is most likely very poor, since we would not expect the randomly selected sites to share a common motif.
    A background model is also created based on the composition of the DNA sequence outside of the selected TFBS sites.
  2. The algorithm selects one of the sequences and removes its site from the current model (i.e. it creates a new model based on only the TFBS from the other sequences). The background model is similarly updated.
  3. The new motif model is now used to score the selected sequence by calculating a motif match score (relative to the background) for each position in the sequence.
    Positions that are more similar to the current model will receive higher scores.
  4. The match scores are turned into a probability distribution and a new candidate TFBS for the sequence is selected at random according to this distribution.
  5. The newly selected TFBS is incorporated back into the model.
  6. Steps 2–5 are repeated until the model converges (or a predetermined number of times).
The sequence in step 2 can either be selected at random, or all the sequences can be processed systematically in a round-robin fashion. In the beginning, the sites sampled by the algorithm will most likely not be very similar to each other, and the resulting motif model will have high variability and therefore low information content. But if, at any point, the algorithm happens to hit upon the correct binding site in a sequence, the model will be slightly skewed towards that binding motif. This means that when the next sequence is scored, sites in that sequence which are similar to the previously selected sites will receive slightly higher scores and will therefore have a higher chance of being selected and incorporated into the model during the random sampling step, thus skewing the model even more towards sites that are similar to the ones already in the model. Hopefully, the algorithm will over time incorporate enough similar sites so that the model eventually converges to the correct motif.
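The steps above can be sketched in Python as follows. This is a minimal illustration and not MotifLab's implementation: the function names, the pseudocount of 1, the round-robin order, the fixed number of iterations and the uniform background assumed by information_content are all choices made for this example. Sequences are assumed to contain only the letters A, C, G and T.

```python
import math
import random

ALPHABET = "ACGT"

def count_matrix(sites, width, pseudocount=1.0):
    """Position-specific base frequencies built from the given sites."""
    columns = []
    for pos in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        columns.append({base: counts[base] / total for base in ALPHABET})
    return columns

def background_model(sequences, starts, width, skip, pseudocount=1.0):
    """Base frequencies outside the selected sites.
    The sequence at index `skip` counts entirely as background."""
    counts = {base: pseudocount for base in ALPHABET}
    for i, seq in enumerate(sequences):
        for pos, base in enumerate(seq):
            inside = i != skip and starts[i] <= pos < starts[i] + width
            if not inside:
                counts[base] += 1
    total = sum(counts.values())
    return {base: counts[base] / total for base in ALPHABET}

def information_content(model):
    """Total information content in bits, assuming a uniform background."""
    return sum(2 + sum(p * math.log2(p) for p in col.values()) for col in model)

def gibbs_sampler(sequences, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: select one random site per sequence.
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for step in range(iterations):
        held_out = step % len(sequences)  # round-robin choice (assumption)
        # Step 2: build the model from the sites of all other sequences.
        sites = [seq[s:s + width]
                 for i, (seq, s) in enumerate(zip(sequences, starts))
                 if i != held_out]
        model = count_matrix(sites, width)
        background = background_model(sequences, starts, width, held_out)
        # Step 3: score every position as a motif/background likelihood ratio.
        seq = sequences[held_out]
        weights = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for pos in range(width):
                base = seq[start + pos]
                ratio *= model[pos][base] / background[base]
            weights.append(ratio)
        # Step 4: sample a new site with probability proportional to its score.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 5 happens implicitly: the next iteration rebuilds the model
        # from the updated start positions.
    final_sites = [seq[s:s + width] for seq, s in zip(sequences, starts)]
    return starts, count_matrix(final_sites, width)
```

Calling gibbs_sampler(sequences, width=8) returns the sampled start position in each sequence together with the final frequency matrix. Printing information_content for the model at regular intervals is one way to watch the model sharpen as similar sites accumulate. Because the sampling is random, different seeds can end in different motifs.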

Gibbs Sampler Demo

You can start the "Gibbs Sampler Demo" by selecting it from MotifLab's "Tools" menu. This will bring up the dialog shown below.


Creating an example dataset for motif discovery

The Gibbs Sampler can be applied to search for binding sites in a DNA track for any set of sequences with a common motif. If you don't already have such a dataset, you can create an example sequence set with the supplied protocol.
If you have existing sequences, you should delete them before you start by selecting "Clear Data" ⇒ "Sequences and Features" from the "Data" menu. Then press the "protocol" button in the Gibbs Sampler Demo dialog to bring up the protocol, and press the "Execute" button in MotifLab's tool bar to run it. The protocol will first ask you to choose a length for the sequences (the default is 200 bp). Next, the protocol imports the TRANSFAC PUBLIC motif collection and asks you to choose one of the motifs to plant in the sequences. Remember to check the box in front of the motif you want before pressing "OK" in the motif collection dialog; just selecting the motif name is not enough. The protocol will then plant the selected motif in the middle of each sequence and return a DNA track and a region track with the target sites (TFBS). By default, the protocol will create 25 sequences and plant the exact same binding motif in all of the sequences, but it is possible to adjust the total number of sequences, the number of sequences with planted binding motifs and the variability of the binding motif by editing the protocol.
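For readers who want to experiment outside MotifLab, the idea behind such a dataset can be sketched in a few lines of Python. This is not the supplied protocol. The function name plant_motif, the uniformly random background and the use of a fixed motif string instead of a TRANSFAC matrix are assumptions made for this example. Only the defaults of 25 sequences and 200 bp are taken from the description above.

```python
import random

def plant_motif(motif, num_sequences=25, length=200, seed=0):
    """Return random DNA sequences with `motif` planted in the middle of each,
    together with the start position of the planted site."""
    rng = random.Random(seed)
    start = (length - len(motif)) // 2
    sequences = []
    for _ in range(num_sequences):
        # Uniformly random background sequence (assumption for this sketch).
        background = "".join(rng.choice("ACGT") for _ in range(length))
        # Overwrite the middle of the background with the motif.
        planted = background[:start] + motif + background[start + len(motif):]
        sequences.append(planted)
    return sequences, start

# Example with a hypothetical 8 bp motif string.
sequences, site_start = plant_motif("TGACGTCA")
```

Because every sequence carries the identical motif at the same position, this is an easy test case. Planting the motif in only some of the sequences, or mutating a few of its bases, makes the task harder.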

Tips and tricks




References

[1] Charles E Lawrence, Stephen F Altschul, Mark S Boguski, Jun S Liu, Andrew F Neuwald and John C Wootton (1993) "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment", Science 262(5131): 208-214