Inference and validation of a large Saccharomyces cerevisiae cis regulatory motif set
Matias Piipari
Wellcome Trust Sanger Institute, UK

Saccharomyces cerevisiae is a popular model organism for investigating eukaryotic cis-regulatory elements, given its compact genome and the breadth of resources and datasets available for its study. We conducted a computational regulatory motif discovery study using the NestedMICA algorithm, with the aim of inferring a near-complete core promoter motif dictionary of the S. cerevisiae genome (200 motifs from the upstream sequences of 2000 genes, totalling a megabase of sequence). We present an analysis of the motif set that identifies known regulatory motifs and motif families within it (81 of the 200 motifs were found to be close matches to known motifs), and we analyze the variation patterns of the motif matches. We also describe a novel gene expression-aided motif inference algorithm based on NestedMICA and demonstrate its performance on several S. cerevisiae gene expression datasets.

The computationally inferred motifs were also analyzed in the context of the Saccharomyces Genome Resequencing Project dataset, which provides a whole-genome multiple alignment of, and SNP calls for, over 80 Saccharomyces strains. Over two thirds of the motifs show SNP and insertion/deletion rates lower than those of the un-annotated non-coding regions of the S. cerevisiae genome, and a subset of 22 motifs shows less variation than protein-coding sequence.

Our work demonstrates the successful use of genome-scale NestedMICA motif inference in finding potential regulatory features of a eukaryotic genome. We provide strong evidence, drawn from independent lines of analysis, that our computationally inferred motifs are functional. The software developed for the study is publicly available. It includes iMotifs, an easy-to-use motif set visualization and analysis tool, and, as new additions to the NestedMICA suite, a novel gene expression-driven motif inference algorithm and a framework for building motif family models.