Miscellaneous datasets

Expected Motif Occurrence Frequencies

The following table contains files with expected occurrence frequencies for motifs under different conditions. These files can be imported into MotifLab as Motif Numeric Maps (in Plain format) and used for the "Background frequencies" argument of the Count Motif Occurrences analysis to estimate p-values for statistical overrepresentation of the motifs in your datasets. To get comparable results, use a background frequency file created with the same scanning algorithm and cutoff threshold that you used when predicting motif sites. For example, if you predicted TFBS in a sequence using the SimpleScanner algorithm with a 90% threshold, select one of the corresponding datasets in this table.

The expected occurrence frequencies were estimated empirically by scanning for motif occurrences in randomly generated DNA sequences (on both strands) and then dividing the total occurrence count of each motif by the maximum number of times that motif could theoretically occur in the sequence (roughly proportional to the length of the scanned sequence). Scanning and frequency estimation were performed on 200 randomly generated sequences of length 5000 bp. Motifs that did not occur at all in these random sequences (or occurred with a frequency less than 1E-6) were subjected to another scanning step using longer sequences (200 x 20000 bp) to obtain more accurate estimates for rare motifs. Motifs that still failed to appear in these longer random sequences were assigned a frequency of zero. The random DNA sequences were generated with different background DNA models, as stated in the table below. The process was repeated 3 times for each condition (i.e. each choice of parameters for DNA model, scanning algorithm and cutoff), and the final frequency table for each condition is the average over these 3 runs.
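The estimation procedure described above can be illustrated with a minimal Python sketch. It is not the code used to produce the files: it assumes a uniform background model (each base drawn with probability 0.25) and uses exact matching of a consensus string as a stand-in for a real scanner such as SimpleScanner with a score cutoff. The function names and the example motif are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_occurrences(seq, motif):
    """Count exact matches of motif in seq, including overlapping ones."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def estimate_frequency(motif, n_seqs=200, seq_len=5000, rng=None):
    """Estimate how often motif occurs per possible position, both strands.

    Assumption: uniform random background; exact string matching replaces
    a thresholded PWM scan.
    """
    rng = rng or random.Random()
    hits = 0
    max_sites = 0
    for _ in range(n_seqs):
        seq = "".join(rng.choices("ACGT", k=seq_len))
        # Scan the forward strand and the reverse-complement strand.
        hits += count_occurrences(seq, motif)
        hits += count_occurrences(reverse_complement(seq), motif)
        # A motif of length m fits at (seq_len - m + 1) positions per strand.
        max_sites += 2 * (seq_len - len(motif) + 1)
    return hits / max_sites


# Average over three independent runs, mirroring the description above.
runs = [estimate_frequency("TGACGTCA", rng=random.Random(seed)) for seed in range(3)]
print(sum(runs) / len(runs))
```

A motif that never appears would get an estimate of exactly zero here, which is why the procedure above rescans rare motifs on longer sequences (for instance by calling the function again with `seq_len=20000`).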
The motif frequency maps in the table below are not derived from artificially generated sequences but are instead based on genome-wide binding site predictions. The motif scanning program was first run on the full genome (both strands considered), and the predicted number of binding sites for each motif was then divided by the maximum number of positions at which a motif of that length could occur without overlapping the end of a chromosome or any segment containing N's. For example, in the sequence "nnnACGTGAGnnnTTATAGACnnnn" a motif of length 4 could occur 9 times on each strand (4 positions in ACGTGAG plus 5 in TTATAGAC) and thus at most 18 times in total.
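The denominator used for the genome-wide frequencies can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch assumes that each chromosome is passed in as a separate string so that chromosome ends are handled the same way as the ends of any other N-free segment.

```python
import re


def max_motif_positions(seq, motif_len, both_strands=True):
    """Count the positions where a motif of motif_len fits without touching an N."""
    total = 0
    # Split the sequence into maximal runs that contain no N (either case).
    for segment in re.split(r"[nN]+", seq):
        if len(segment) >= motif_len:
            total += len(segment) - motif_len + 1
    return 2 * total if both_strands else total


print(max_motif_positions("nnnACGTGAGnnnTTATAGACnnnn", 4))  # prints 18
```

Dividing the predicted number of sites for a motif by this value, summed over all chromosomes, gives the genome-wide occurrence frequency.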
MATCH profiles

The MATCH motif scanning algorithm (included in the TRANSFAC suite from BioBase) can use a different cutoff threshold for each individual motif rather than the same threshold for all motifs, which can potentially reduce the number of false predictions made for each motif. BioBase has experimentally determined different thresholds to minimize either the number of false positive predictions (minFP), the number of false negatives (minFN) or the sum of these (minSUM) and collected them into so-called "profiles". (A more in-depth explanation of the different profiles can be found here.)

These profiles are presented here in "Plain" format so that they can be imported into MotifLab as Motif Numeric Maps and used for the "Core threshold" and "Matrix threshold" arguments of the MATCH algorithm. Please note that about 30 of the motifs do not have assigned values in the profiles and will therefore use the default values, which have rather arbitrarily been set to 0.95 for Core threshold profiles and 0.98 for Matrix threshold profiles.
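The fallback to default values can be pictured with a small Python sketch. It only illustrates the lookup logic and is not MotifLab's or MATCH's implementation: the function name, the motif identifiers and the per-motif threshold values are invented for the example, and only the two defaults (0.95 and 0.98) come from the text above.

```python
DEFAULT_CORE_THRESHOLD = 0.95    # default stated above for Core threshold profiles
DEFAULT_MATRIX_THRESHOLD = 0.98  # default stated above for Matrix threshold profiles


def thresholds_for(motif_id, core_profile, matrix_profile):
    """Return (core, matrix) thresholds for a motif, using defaults when absent."""
    core = core_profile.get(motif_id, DEFAULT_CORE_THRESHOLD)
    matrix = matrix_profile.get(motif_id, DEFAULT_MATRIX_THRESHOLD)
    return core, matrix


# Hypothetical profile entries; the real values come from the profile files.
core_profile = {"M00001": 0.87}
matrix_profile = {"M00001": 0.91}

print(thresholds_for("M00001", core_profile, matrix_profile))  # motif with assigned values
print(thresholds_for("M99999", core_profile, matrix_profile))  # motif missing from the profile
```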