MotifLab User Manual

: benchmark, compare clusters to collection, compare collections, compare motif occurrences, compare motif track to numeric track, compare region datasets, compare region occurrences, count module occurrences, count motif occurrences, count region occurrences, evaluate prior, GC-content, motif collection statistics, motif position distribution, motif regression, motif similarity, numeric dataset distribution, numeric map correlation, numeric map distribution, region dataset coverage, single motif regression, collate

Name	Description
analysis	The specific analysis to perform.
analysis-specific parameters	Each specific analysis will have its own parameters that must be set. See the documentation for the individual analyses for further explanation of these parameters.

# Analyses the GC-content in the specified DNA track
Analysis1 = analyze GC-content {DNA track=DNA}

# Compares two numeric maps to see if the values for corresponding entries in the maps are correlated (here motif size is compared against motif information content)
Analysis2 = analyze numeric map correlation {First=Motif_size,Second=Motif_IC}

apply

The "apply" operation will apply a sliding window function to a Numeric Dataset to smooth the track or to find peaks, valleys or edges in the data. The operation goes through each position in the track in turn and defines a "window" region around each target position. The selected window function dictates how a new numeric value can be calculated based on the values of the positions within the current window, and the resulting value is assigned to the target position.

Applies to:	Numeric Dataset
Returns:	Numeric Dataset

Name	Description
window function	The window function defines how a new value is calculated from the values of all positions within the window. The "Uniform", "Bartlett" and "Gaussian" windows return different weighted averages of the values within the window. The "Sum" window returns the total sum of values within the window. The "Minimum" and "Maximum" windows return the minimum and maximum value within the window, respectively.
The "Shift" window returns the value of the most downstream position within the window if the "start" or "center" anchor is used, or the value of the most upstream position if the "end" anchor is used. This means that for a window of size N, the "start" anchor will shift the values in the track N-1 positions upstream, and the "end" anchor will shift the track N-1 positions downstream.
The "Valley" and "Valley2" windows can be used to detect valleys in the data track, that is, sections of the track with low values that are located in between sections with high values. The valley-score is based on the definition used in the paper: Ramsey & Shmulevich et al. (2010) "Genome-wide histone acetylation data improve prediction of mammalian transcription factor binding sites", Bioinformatics, 26(17) : 2071-2075. The sliding window is divided into three parts with different sizes (the left 40%, the central 20% and the right 40%), and the highest values within the left and right flanks are determined. The smaller of these two maximum scores is hereafter referred to as "minofmax". If the score in the center point is less than 90% of minofmax, the center point is considered to be a "valley-point" and is assigned a value greater than zero. If, however, the score in the center is within 10% of the minofmax value, it is not a valley-point and is assigned a value of zero.
The difference between the "Valley" and "Valley2" windows is that with "Valley" the value assigned to the valley-point is just the minofmax (the smaller of the two flank maxima), whereas with "Valley2" the valley-point is assigned a value reflecting the difference between the minofmax and the previous value (so the new value is proportional to the depth of the valley). It is advisable to use these windows in combination with the "center" anchor and an odd-numbered window size to avoid shifting the location of the valleys.
The "Peak" window can be used to detect narrow peaks in the data track. This window works in basically the same way as the Valley windows but with opposite results. If the value in a position is much higher than the "maxofmin" in the flanks, the current value is kept as is. If not, the position is assigned the value zero. This means that wide sections with similar values will be set to zero, but narrow peaks that have much higher values than the surrounding sequence will be retained. It is advisable to use this window in combination with the "center" anchor and an odd-numbered window size to avoid shifting the location of the peaks.
The "Edge" window can be used to detect sharp transitions in the data track. The returned value is simply the value of the last position in the window minus the value of the first position. This means that positive gradients in the original track will result in positive values in the new track and negative gradients will result in negative values.
size	The size of the sliding window
anchor	The anchor parameter controls how the window should be placed in relation to the target position. The "center" anchor will place the window so that the target position is in the center of the window. The "start" anchor will place the window so that the target position is in the first position of the window (most upstream position in a relative orientation). The "end" anchor will place the window so that the target position is in the last position of the window (most downstream position in a relative orientation).
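
The valley rule described for the "Valley" and "Valley2" windows can be sketched in Python as follows. This is only an illustration of the description above, not MotifLab's actual implementation. The function name `valley_window`, the rounding of the 40% flanks and the handling of positions near the ends of the track are assumptions.

```python
def valley_window(values, size, depth_scaled=False):
    """Sketch of the "Valley" (depth_scaled=False) and "Valley2" (True) windows.

    Assumes a "center" anchor and an odd window size. Positions whose window
    would extend past the ends of the track are left at zero (an assumption).
    """
    if size < 3 or size % 2 == 0:
        raise ValueError("this sketch expects an odd window size of at least 3")
    half = size // 2
    flank = int(size * 0.4)  # left and right 40% of the window (rounding is assumed)
    result = [0.0] * len(values)
    for pos in range(half, len(values) - half):
        window = values[pos - half : pos + half + 1]
        # "minofmax": the smaller of the highest values in the two flanks
        min_of_max = min(max(window[:flank]), max(window[-flank:]))
        center = values[pos]
        if center < 0.9 * min_of_max:  # valley-point
            result[pos] = (min_of_max - center) if depth_scaled else min_of_max
    return result


track = [5, 5, 5, 5, 1, 5, 5, 5, 5]
print(valley_window(track, 5))                     # "Valley"
print(valley_window(track, 5, depth_scaled=True))  # "Valley2"
```

In this toy track only the position holding the value 1 qualifies as a valley-point. "Valley" assigns it the minofmax and "Valley2" assigns it the depth of the valley.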

: position condition

:

# Smooths the track by replacing the value in each position with the (unweighted) average of the values within a 10 bp region centered on the position
apply Uniform window of size 10 with anchor at center to NumericTrack

# Shifts the track 5 bases upstream (not 6 bases!)
apply Shift window of size 6 with anchor at start to NumericTrack

# Removes (i.e. sets to zero) sections in the track that do not correspond to narrow peak regions
apply Peak window of size 41 with anchor at center to NumericTrack
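
The interplay between the anchor and the "Shift" window (and why a window of size 6 shifts the track by 5 positions) can be illustrated with a small Python sketch. It is not MotifLab code. The helper names, the fill value used when the source position falls outside the track, and the placement of the "center" window for even sizes are assumptions. The list index is assumed to increase in the downstream direction.

```python
def window_bounds(pos, size, anchor):
    """Return the (first, last) positions of the window placed around pos."""
    if anchor == "start":
        return pos, pos + size - 1
    if anchor == "end":
        return pos - size + 1, pos
    if anchor == "center":
        first = pos - size // 2  # placement for even sizes is an assumption
        return first, first + size - 1
    raise ValueError(f"unknown anchor: {anchor}")


def shift_window(values, size, anchor, fill=0.0):
    """Sketch of the "Shift" window function."""
    result = []
    for pos in range(len(values)):
        first, last = window_bounds(pos, size, anchor)
        # most upstream position for "end", most downstream position otherwise
        source = first if anchor == "end" else last
        result.append(values[source] if 0 <= source < len(values) else fill)
    return result


track = list(range(10))
print(shift_window(track, 6, "start"))  # values move 5 positions upstream
print(shift_window(track, 6, "end"))    # values move 5 positions downstream
```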

collate

The collate operation can be used to combine information from several different analyses (or Maps) by extracting columns of data from each analysis and putting them together in a larger table. A collated analysis is based around a fundamental data type (Motif, Module or Sequence) and contains rows for each of the data objects of that fundamental type. Only information from analyses and maps that have compatible fundamental types can be collated, and additional properties from the fundamental data objects themselves can also be included in the final table.

Applies to:	Analysis and Map
Returns:	Analysis

Name	Description
data type	The fundamental data type that the source analyses and maps contain information about (Motif, Module or Sequence)
source	The source analysis, map or fundamental data type from which a column of data should be extracted and inserted into the collated analysis. If the source is a map, the values from the map are used, but if the source is an analysis or a data type, the property (column) to be extracted must be specified explicitly. You can include as many sources as you want in the collated analysis to build a table with many columns.
property	If the corresponding source is an analysis or fundamental data type (not a Map), the column to be extracted from the analysis or property to use from the data type must be specified with this argument.
column name	A new name to be given to the column in the collated analysis. If no new column name is specified, the old name of the column/property from the source object is used. Note that each column in the collated analysis must have a unique name, so if you want to e.g. include the same column from two analyses of the same type, you must rename at least one of the columns.
optional title	A new title for the collated analysis which will be displayed as a header when outputting the analysis to e.g. HTML or when showing the analysis in a dialog. The default title is just 'Collated analysis'.

# Returns a new collated analysis with two columns containing respectively the GC-content of each sequence (extracted from Analysis1) and the length of each sequence (a property of the Sequence object)
Analysis2 = collate "GC-content" from Analysis1, "length" from Sequence

# Returns a collated analysis with three columns containing the 'total' column from Analysis1_count and also the 'total' column from Analysis2_count. This second total column is renamed to 'total2' to avoid conflict with the first column. The final column in the table contains values from the map 'ExpectedFrequency' (this column is renamed as 'Expected' in the collated table).
Analysis3 = collate "total" from Analysis1_count, "total" from Analysis2_count as "total2", ExpectedFrequency as "Expected"
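
Conceptually, collating amounts to joining columns on the identifier of the fundamental data object. The Python sketch below illustrates the idea with plain dictionaries. It is not how MotifLab stores analyses, and the function name `collate`, the motif identifiers and all values are made up for the illustration.

```python
def collate(keys, sources):
    """Build a table with one row per key and one column per source.

    keys:    identifiers of the fundamental objects (e.g. motif IDs)
    sources: list of (column_name, {identifier: value}) pairs
    """
    names = [name for name, _ in sources]
    if len(set(names)) != len(names):
        raise ValueError("column names must be unique; rename one of them")
    return {key: {name: column.get(key) for name, column in sources} for key in keys}


# Hypothetical columns extracted from two count analyses and one map
total1 = {"M00001": 3, "M00002": 5}
total2 = {"M00001": 4, "M00002": 1}
expected = {"M00001": 0.5, "M00002": 0.25}

table = collate(
    ["M00001", "M00002"],
    [("total", total1), ("total2", total2), ("Expected", expected)],
)
for motif, row in table.items():
    print(motif, row)
```

The check on column names mirrors the requirement that every column in the collated analysis has a unique name.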

See also: analyze

combine_numeric

Combines multiple Numeric Datasets into a single track, multiple numeric maps into one map or multiple variables into one variable. The value assigned to the target data object could either be based on the minimum value across all source data objects (inputs), the maximum value, the average value, the sum of values or the product of values. If the source objects are Numeric Datasets the tracks are combined position by position, i.e. the value of each position in the resulting target track will be either the minimum, maximum, average, sum or product of the values in that position across all the source datasets. If a condition is specified, only positions that satisfy the condition are combined, and positions that do not satisfy the condition are assigned the value of the first source dataset in the position. If the source objects are Numeric Maps, these will be combined across entries for the same key. E.g. for a Motif Numeric Map, the value in the target map for motif "M00001" will be based on the values for this motif in the source maps. The default values from each map will always be combined in the same way as the individual entries.

Applies to:	Numeric Dataset, Numeric Maps and Numeric Variable
Returns:	Numeric Dataset, Numeric Maps or Numeric Variable (The type of the returned object will depend on the source object)

Name	Description
function	This argument specifies how the "combined" value(s) assigned to the target data object should be calculated based on the values of the source data objects. Possible choices are "min","max","average","sum" or "product".

: position condition

:

# Returns a new track where the value in each position is the average of the values from the tracks X, Y and Z in that position
W = combine_numeric X,Y,Z using average

# Returns a new Motif Numeric Map where the value for each motif is the sum of values for the motif in the three source maps
MM = combine_numeric MotifMap1,MotifMap2,MotifMap3 using sum
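
The position-by-position combination of tracks, including the fallback to the first source dataset where a condition is not satisfied, can be sketched in Python as shown below. This is an illustration under simplifying assumptions (tracks are plain lists of equal length and the condition is a function of the position), not MotifLab's implementation.

```python
from math import prod
from statistics import mean

FUNCTIONS = {"min": min, "max": max, "average": mean, "sum": sum, "product": prod}


def combine_numeric(tracks, function, condition=None):
    """Combine several tracks position by position.

    condition: optional function taking a position and returning True or False
    """
    combine = FUNCTIONS[function]
    result = []
    for pos, values in enumerate(zip(*tracks)):
        if condition is None or condition(pos):
            result.append(combine(values))
        else:
            result.append(values[0])  # keep the value of the first source dataset
    return result


X = [1, 2, 3]
Y = [3, 2, 1]
Z = [2, 2, 8]
print(combine_numeric([X, Y, Z], "average"))
print(combine_numeric([X, Y, Z], "product", condition=lambda pos: pos != 1))
```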

See also: combine_regions

combine_regions

Combines regions from multiple Region Datasets into a single track. Each sequence in the resulting track will contain the union of regions found in all the source datasets for that sequence.

Applies to:	Region Dataset
Returns:	Region Dataset

: region condition

:

# Returns a new track where each sequence contains the union of regions found in X,Y,Z for that sequence
W = combine_regions X,Y,Z

See also: combine_numeric, merge

convert

This operation can be used to convert a Numeric Dataset into a Region Dataset or vice versa. When converting a Numeric Dataset into a Region Dataset, the regions will be based on stretches of the sequence that satisfy a given condition. Hence, if no conditions are specified the resulting track will contain no regions. The most natural way to convert a numeric track into regions would probably be to create regions based on stretches of the numeric track that have values greater than zero, so this condition will be set up by default in the operation dialog. The score of each region can be specified as an argument, and this can either be a constant value (the same for each region) or the score can be based on the minimum, maximum, average, median or sum of the values of a numeric track within the region (the track used for this score would naturally, but not necessarily, be the same as the source track). When converting a Region Dataset into a Numeric Dataset, positions that are not within any regions will be assigned the value 0, and positions that are within regions can be assigned a chosen value which can be either a constant value, the value from a selected numeric track at that position, the number of regions in the source track overlapping with that position, the highest score among all regions in the source track overlapping that position, the sum of the scores of all regions in the source track overlapping that position, or the length of the longest region in the source track overlapping that position.

Applies to:	Numeric Dataset and Region Dataset
Returns:	Region Dataset or Numeric Dataset (The type of the returned object will depend on the source object)

Name	Description
region score	Applicable when converting a Numeric Dataset into a Region Dataset. The argument selects which value to use for the score-property of the resulting regions. The score can either be specified as a literal numeric constant, a Numeric Variable or a Sequence Numeric Map. Alternatively, the score can be based on the minimum/maximum/average/median/sum of values of a Numeric Dataset within the region.
numeric value	Applicable when converting a Region Dataset into a Numeric Dataset. The argument defines which value positions that are within regions should receive in the resulting track (positions outside regions are given the value 0). The value could either be a literal numeric constant, a Numeric Variable or a Sequence Numeric Map (in which case all positions within a sequence will be given the same value), or it could be a Numeric Dataset (in which case the value in each position will be copied from that track). In addition to these options, four "special settings" can also be used. These are "region.count" (the value used is based on the number of regions in the source track that overlap the position), "region.highestscore" (the value used is the score of the highest scoring region in the source track among all those that overlap the position), "region.sumscore" (the value used is based on the total sum of scores for all regions in the source track that overlap the position), and "region.length" (the value used is based on the length of the longest region in the source track that overlaps the position).

# Returns a region track where each region corresponds to a conserved stretch of the sequence (i.e. conservation greater than zero). The score of each region is the sum total of conservation values for all positions within the region.
Conserved_regions = convert Conservation to region with region.score=sum Conservation where Conservation > 0

# Returns a numeric track where the value in each position equals the number of 'ChIP_seq_tags' regions overlapping that position
convert ChIP_seq_tags to numeric with value=region.count

# Returns a numeric track that has a value of 1.0 inside RepeatMasker regions of type 'AluJo' and a value of 0.0 everywhere else
convert RepeatMasker to numeric with value=1.0 where region's type equals "AluJo"
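
The two directions of conversion can be sketched in Python as follows. The first function turns stretches of a numeric track that satisfy a condition into regions scored by the sum of the values they cover. The second mirrors the "region.count" setting. Both are illustrations only. The function names and the dictionary representation of a region are assumptions.

```python
def numeric_to_regions(values, condition, score=sum):
    """Create one region per stretch of consecutive positions satisfying the condition."""
    regions, start = [], None
    for pos, value in enumerate(list(values) + [None]):  # sentinel closes a trailing region
        inside = value is not None and condition(value)
        if inside and start is None:
            start = pos
        elif not inside and start is not None:
            regions.append({"start": start, "end": pos - 1, "score": score(values[start:pos])})
            start = None
    return regions


def regions_to_count_track(regions, length):
    """Value in each position = number of regions overlapping that position."""
    track = [0] * length
    for region in regions:
        for pos in range(region["start"], region["end"] + 1):
            track[pos] += 1
    return track


conservation = [0, 0.5, 1.0, 0, 0, 0.25, 0.25]
print(numeric_to_regions(conservation, lambda v: v > 0))
print(regions_to_count_track([{"start": 0, "end": 2}, {"start": 2, "end": 3}], 5))
```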

See also: count

copy

The "copy" operation can be used to create an identical copy of an existing data object.

Applies to:	All data objects except Analysis objects and Output objects
Returns:	An object of the same type as the source object

# Creates a copy of the object 'X' and calls this new copy 'Y'.
Y = copy X

count

The "count" operation counts the number of regions that overlap with a sliding window along the sequence and returns a new numeric track containing the result for each position. For each position in the sequence, the operation places a window of the chosen size around that position and finds all the regions that either overlap or lie fully within this window. A value is calculated from these regions, either as a count of the number of regions or by summing up the scores of all these regions, and the resulting value is assigned to the position.

Applies to:	Region Dataset
Returns:	Numeric Dataset

Name	Description
count	This parameter specifies what kind of value to return for each position. If this parameter is "number" the resulting value will be a count of the number of regions falling within the sliding window, but if the parameter is "score" the resulting value will be the total sum of the scores for all regions falling within the sliding window. MotifLab v2 introduced the possibility of summing up values for other numeric region properties besides "score" and also a special counting function called "IC-content" that can be used for motif tracks. This function will sum up the information content of the corresponding motif matrix columns for all motif regions within the sliding window (note that only the positions that are actually covered by the window will be included in the IC sum).
window type	The window type determines the criteria for whether a given region in the track will be considered as "falling within the window" and will thus be included when calculating the count statistic. If the window type is "overlapping" all regions that overlap at least partially with the sliding window will be considered. However, if the window type is "within" only those regions that lie fully within the sliding window (i.e. are fully covered by the window) will be considered.
window size	The size of the sliding window. This can be specified as a constant number, a Numeric Variable or a Sequence Numeric Map (in which case a different window size will be used for each sequence).
anchor	The anchor parameter specifies how the sliding window should be placed relative to the target position. With "center", the window is placed so that the target position is in the center of the window. With "start", the target position is at the start of a window which extends downstream. With "end", the target position is at the end of a window which extends upstream.

: position condition

:

# Returns a new track where the value in each position reflects the number of TFBS regions overlapping a window of 20 bp centered at that position
countsTrack = count number of regions in TFBS overlapping window of size 20 with anchor at center
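
The difference between the "overlapping" and "within" window types can be illustrated with the Python sketch below. It assumes a "center" anchor, inclusive region coordinates and a particular placement of even-sized windows. None of this is taken from MotifLab's source code, and the function name and region tuples are made up for the example.

```python
def count_regions(regions, length, size, window_type="overlapping", what="number"):
    """Sliding-window count with a "center" anchor.

    regions: list of (start, end, score) tuples with inclusive coordinates
    what:    "number" counts regions, "score" sums their scores
    """
    track = []
    for pos in range(length):
        first = pos - size // 2
        last = first + size - 1
        total = 0
        for start, end, score in regions:
            if window_type == "within":
                hit = first <= start and end <= last  # fully covered by the window
            else:
                hit = start <= last and end >= first  # at least partial overlap
            if hit:
                total += 1 if what == "number" else score
        track.append(total)
    return track


tfbs = [(2, 4, 1.5), (4, 6, 2.0)]
print(count_regions(tfbs, 8, 3))
print(count_regions(tfbs, 8, 3, window_type="within"))
```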

crop_sequences

This operation (introduced in MotifLab v2.0) will either crop the ends of the current sequences by a specified number of bases in one or both directions, or crop the sequences so that they align with the edges of the first and last regions of a specified region track. It works similarly to the "Crop Sequences" tool, but unlike that tool this operation can also be applied to a subset of the sequences.

Applies to:	Sequence Collection
Returns:

Name	Description
amount	This parameter specifies the number of bases by which the sequences should be cropped. The value can be a constant number, a Numeric Variable or a Sequence Numeric Map (in the latter case, each sequence can be cropped by a different number of bases). If no direction is specified, the sequences will be cropped by this amount at both ends (so the sequences will end up `2*amount` bp shorter). Alternatively, the sequences can be cropped only at one end or by a different number of bases at the upstream and downstream ends.
use relative orientations	If relative orientations are used, bases will be removed from the sequences from the "upstream" or "downstream" end (or both) relative to the orientation of each individual sequence. If relative orientations are not used, all the sequences will be treated as if they were on the direct strand and bases will either be removed from the start of the sequence (smallest genomic coordinate) or from the end (greatest genomic coordinate).
regions	If this parameter specifies a region track, the sequences will be cropped so that the edges of the sequences align with the (first and last) regions of this track for each sequence. If a region extends across the edge of a sequence, the sequence will not be cropped at that end.

# Crops all sequences by 100 bp at both ends
crop_sequences by 100 bp

# Crops the sequences in the SeqCol1 collection by 100 bp from the upstream end
crop_sequences in SeqCol1 by 100 bp from upstream end

# Crops all sequences by 100 bp from the upstream end and 200 bp from the downstream end
crop_sequences by 100 bp from upstream end and by 200 bp from downstream end

# Crops all sequences by X bp from the 'direct end' of the sequence
crop_sequences by X bp from end

# Crops all sequences so that their edges align with the edges of the (first and last) regions in the 'DNaseHS' track
crop_sequences to DNaseHS
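
The effect of the "use relative orientations" setting on the genomic coordinates can be sketched in Python as follows. The function `crop` and its parameter names are hypothetical and only illustrate the coordinate arithmetic described above.

```python
def crop(start, end, strand, upstream=0, downstream=0, relative=True):
    """Return the new (start, end) genomic coordinates of a cropped sequence.

    Without relative orientations, "upstream" is taken to mean the start
    (smallest coordinate) and "downstream" the end (greatest coordinate).
    """
    if relative and strand == "reverse":
        # the upstream end of a reverse-strand sequence is its greatest coordinate
        return start + downstream, end - upstream
    return start + upstream, end - downstream


print(crop(1000, 2000, "direct", upstream=100))
print(crop(1000, 2000, "reverse", upstream=100))
print(crop(1000, 2000, "reverse", upstream=100, relative=False))
```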

decrease

The "decrease" operation is a subtraction operator which will decrease the value (or values) of a numeric data object by a specified amount. For Numeric Datasets the operation will be applied to all positions in the sequences, and for Numeric Maps it will be applied to all entries in the map (including the "default" value). The operation can also be applied to Region Datasets to decrease the value of numeric properties of regions or to remove strings from a "text" property. By default the operation will be applied to the "score" property of the regions unless a different property is specified.

Applies to:	Numeric Dataset, Numeric Map, Numeric Variable and Region Dataset
Returns:	Numeric Dataset, Numeric Map, Numeric Variable or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
property	Specifies which property of the data object to decrease. For Numeric Datasets, Numeric Maps and Numeric Variables this will simply be the "value", but for Region Datasets you can select which property of the regions to decrease or remove text from (the default is "score"). Note that user-defined region properties will not be shown in the drop-down list in the operation dialog, but it is still possible to type in the names of other properties.
amount	Specifies the amount by which the values in the source object should be decreased. If the amount is a literal number or a Numeric Variable, each value in the source object will be decreased by the same amount. If the source object is a Numeric Dataset and the "amount" is also a Numeric Dataset, the value of each position in the source will be decreased by the value in the same position in the "amount" dataset. If the source is a Numeric Dataset and the "amount" is a Sequence Numeric Map, the values of all positions in each sequence are decreased by the value for that sequence in the map (so each sequence is potentially decreased by a different value). If both the source and "amount" are Numeric Maps of the same type, the entries in the source map will be decreased by the corresponding values in the "amount" map. If the source is a Region Dataset and the "amount" is a Numeric Dataset, the region property can be decreased by a value derived from the values in the Numeric Dataset which are covered by the region. This value could be e.g. the minimum, maximum, average, median or sum of all values within the region, or it could be the value of a single position at the center, start or end of the region. Here "start" and "end" refer to the lowest and highest genomic coordinates of the region with respect to the direct strand. The "relativeStart" and "relativeEnd" positions refer to the most upstream and most downstream position in the region when viewed relative to the orientation of the sequence, and the "regionStart" and "regionEnd" values refer to the most upstream and most downstream positions relative to the orientation of the region itself. If the region is a "motif" region, it is also possible to use "weighted average" or "weighted sum" where the values in each position are weighted by the information content of the motif in the corresponding position.
If the operation is applied to a "text" property of a Region Dataset, the "amount" should either be a literal string enclosed in double quotes (as in the last example below) or a Text Variable. The given text entries will be removed from the existing text property if present.

# Decreases the value of data object X by 10. If X is a Numeric Dataset/Numeric Map/Region Dataset, all positions/entries/regions in the data object will be decreased by 10.
decrease X by 10

# Returns a new track containing the difference of the two tracks in each position
newNumericTrack = decrease Track1 by Track2

# Decreases the values for the entries in the first map by the corresponding entries in the second map
decrease Map1 by Map2

# Decreases the current 'score' property of each region in the track with the average value of the NumericTrack within the region
decrease RegionTrack[score] by average NumericTrack

# Removes the two strings 'one' and 'three' from the text-property 'numbers' of every region in the Region Dataset.
decrease RegionTrack[numbers] by "one,three"

See also: increase, multiply, divide, set

delete

The "delete" operation can be used to delete data objects that are no longer needed. Its primary use is within protocol scripts to free up memory resources. The operation can be applied to multiple target objects at once.

Applies to:	All data objects except Motifs and Modules
Returns:	Nothing

# Deletes the three data objects X, Y and Z
delete X,Y,Z

difference

The "difference" operation will compare one data object to another object of the same type and return a new data object highlighting the differences between the two objects.

If the two objects are Numeric Datasets, the result will simply be a new numeric track containing the difference between the two tracks in each position (Track1 minus Track2).
Likewise, if the objects are Numeric Maps, the result will be a new Numeric Map where each entry is the difference between the value in the first map and the second map (the default values will also be compared).
If the objects are Region Datasets, the result will be a new Region Dataset containing regions that are present in one but not both of the tracks ("exclusive OR"). In MotifLab versions 1.x, regions that are present in the first track but not the second are assigned to the REVERSE strand in the resulting track ("removed"), whereas regions occurring in the second track but not the first are assigned to the DIRECT strand ("added"). In MotifLab versions 2.0+, the original strand orientations of the regions will be kept as they were, but each region will have an added property called "onlyIn" whose value will be the name of the only track of the two that contained that region. Unless otherwise specified, two regions will be considered identical only if every single property of the two regions matches. In MotifLab 2.0 it is possible to specify that two regions should be treated as the same if they have the same location, orientation and type ("compare only location and type" option) or if they have the same standard properties, which also include score and sequence but not necessarily other user-defined properties ("compare only standard properties" option).
If the objects are DNA Sequence Datasets, the result will be returned as a Numeric Dataset where positions that are different between the two tracks have a value of 1.0 and positions with the same base in the two sequences have a value of 0.
If the objects are two Collections, hereafter referred to as X and Y, the result will be returned as a Partition of the same member type (i.e. if two Motif Collections are compared the result will be a Motif Partition). For MotifLab versions 1.x, this partition will contain three clusters. Entries that are members of both collections will be assigned to a cluster called "Present_in_both". Entries that are members of collection X but not of Y will be assigned to the cluster "Not_in_Y", and members of Y that are not in X will be assigned to "Not_in_X". For MotifLab versions 2.0+, the resulting partition will contain four clusters. Entries that are members of both collections will be assigned to a cluster called "Present_in_both". Entries that are members of collection X but not of Y will be assigned to the cluster "Only_in_X", and members of Y that are not in X will be assigned to "Only_in_Y". Entries that are not in either set will be assigned to the cluster "Present_in_neither".
If the objects are Text Maps, the result will be a new Text Map where entries that have the same values in the two maps will be set to an empty value and entries that have different values in the two maps will be set to "value1 <> value2". The default values will be compared in the same way.

Applies to:	Feature Dataset, Map and Collection
Returns:	Feature Dataset, Map or Partition (The type of the returned object will depend on the source object)

Name	Description
other	The other data object that the source should be compared to. This must be of the same type as the source.

diff = difference between NumericTrack1 and NumericTrack2
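
For two Collections, the partition produced in MotifLab versions 2.0+ can be sketched in Python as shown below. This only mirrors the description of the four clusters. The function name is hypothetical, and the assumption that the "X" and "Y" in the cluster names are replaced by the names of the collections is not confirmed by this manual.

```python
def difference_collections(x, y, all_entries, name_x="X", name_y="Y"):
    """Assign every entry to one of four clusters, depending on its membership in x and y."""
    clusters = {
        "Present_in_both": [],
        f"Only_in_{name_x}": [],
        f"Only_in_{name_y}": [],
        "Present_in_neither": [],
    }
    for entry in all_entries:
        if entry in x and entry in y:
            clusters["Present_in_both"].append(entry)
        elif entry in x:
            clusters[f"Only_in_{name_x}"].append(entry)
        elif entry in y:
            clusters[f"Only_in_{name_y}"].append(entry)
        else:
            clusters["Present_in_neither"].append(entry)
    return clusters


# Hypothetical motif collections
partition = difference_collections({"M1", "M2"}, {"M2", "M3"}, ["M1", "M2", "M3", "M4"])
for cluster, members in partition.items():
    print(cluster, members)
```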

See also: compare collections, compare region datasets, numeric map correlation, motif similarity, benchmark, decrease

discriminate

The "discriminate" operation takes a regular positional priors track as input and turns it into a "discriminative prior" track which takes into account the priors values of potential motifs in a set of positive sequences (expected to contain binding sites for the target TF) compared with a set of negative sequences (not expected to contain binding sites for this TF). For a given k-mer sequence, the discriminative prior score is defined as the ratio between the sum of the prior scores for all the occurrences of this k-mer in the positive set and the sum of the prior scores for the occurrences of the same k-mer in both the positive and negative sets.

See the following references for more information:

Narlikar L, Gordân R and Hartemink AJ (2007) "A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast", PLoS Computational Biology 3(11):e215

Gordân R and Hartemink A (2008) "Using DNA duplex stability information for transcription factor binding site discovery", Pacific Symposium on Biocomputing 2008:453-464

Applies to:	Numeric Dataset
Returns:	Numeric Dataset

Name	Description
positive set	A set of sequences that are assumed to contain the binding motif for a common target transcription factor.
negative set	A set of "background" sequences that are not assumed to contain binding motifs for the target transcription factor. The "discriminate" operation will derive a positional priors track which attempts to discriminate between the positive sequences and the negative sequences.
DNA track	A DNA sequence track which will be used as the basis to discriminate between the positive and negative sequences based on the occurrence of different DNA k-mers in these two sets and their positional priors.
word size	The word size parameter specifies the size of the k-mer words to consider. (This size should optimally correspond to the expected motif length.)
strand	This parameter controls which DNA strand(s) to consider when enumerating the DNA k-mers. Valid options are "direct" (use genomic direct strand), "relative" (determine the strand to consider from the orientation of each individual sequence) and "both".
anchor	The anchor parameter controls how to select the corresponding priors value for a k-mer in a sequence. Valid options are "start" which will use the "genomic start" of the k-mer region (i.e. the value from the position with the smallest genomic coordinate within the region) or "relative start" which will use the value from the most upstream position within the k-mer region (relative to the orientation of the sequence).

: position condition

:

# Converts the track 'RegularPriorsTrack' into a discriminative priors track based on occurrences of all 8-mers from the given DNA track (both strands considered) and their previous prior values in the set TargetSequences (positive) versus BackgroundSequences (negative)
discriminativePrior = discriminate RegularPriorsTrack in TargetSequences from BackgroundSequences based on words of size 8 in DNA on both strands with anchor at relative start

distance

The "distance" operation will return a new Numeric Dataset where the value at each position in the track is determined by its distance from a selected anchor point. The anchor point can be a fixed (or relative) coordinate position, a property of the sequence (such as the upstream or downstream end of the sequence or the TSS of the associated gene), or the anchor point can be the nearest region in a selected Region Dataset.

Returns:	Numeric Dataset

Name	Description
direction	The direction setting, which can take on the values "upstream", "downstream" or unspecified (which is the default and means the same as "both directions"), will determine if the value in each position will be positive or negative. The default is that a position is assigned the (positive) value reflecting the number of bases between that position and the anchor position. If the "upstream" direction is specified, positions upstream of the anchor will be assigned positive values whereas positions downstream of the anchor will be assigned negative values (corresponding to calculating the distance "anchor-X" for each position X). However, if the anchor is a Region Dataset, the value will be based on the (positive) distance to the closest upstream region in the track. If the "downstream" direction is specified, positions downstream of the anchor will be assigned positive values whereas positions upstream of the anchor will be assigned negative values (corresponding to calculating the distance "X-anchor" for each position X). However, if the anchor is a Region Dataset, the value will be based on the (positive) distance to the closest downstream region in the track.
anchor point	Decides what type of anchor to use. The anchor point can be the name of a Region Dataset (in which case the value in each position will be based on the distance to the closest region in this dataset), or it can be a literal number/Numeric Variable/Sequence Numeric Map specifying a position within the sequence (in which case the value in each position in the new track will be based on its distance from that position). In addition, four special values are recognized for referring to commonly used positions: "transcription start site", "transcription end site", "sequence upstream end" (first position in sequence when viewed in relative orientation) and "sequence downstream end" (last position in sequence when viewed in relative orientation).
anchor setting	If the anchor point is a literal number/Numeric Variable/Sequence Numeric Map, this setting controls how this number should be interpreted relative to the sequence itself (or the chromosome). For example, an anchor point with the value "10" will be interpreted as the 11th position in the sequence (in relative orientation) if the "sequence upstream end" setting is used (the first position is position 0), or if the "chromosome start" setting is used, the value "10" will be interpreted as the 10th position in the chromosome that the sequence resides on.

# Returns a track where the value in each position is the number of bp between that position and the position 187942 (in genomic coordinates)
distanceToPosition = distance from 187942 relative to chromosome start

# Returns a track where the value in each position is the number of bp between that position and the transcription start site of the sequence
distanceToTSS = distance from 0 relative to transcription start site

# Same as the previous example
distanceToTSS = distance from transcription start site

# Same as the previous example but the positions upstream of the TSS have positive values and the downstream positions have negative values
distanceToTSS = distance upstream from transcription start site

# Same as the previous example but the positions upstream of the TSS have negative values and the downstream positions have positive values
distanceToTSS = distance downstream from transcription start site

# Returns a track where the value in each position reflects the distance to the nearest DNase peak region
distanceToDNase = distance from DNase_peak
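
The sign conventions for a fixed anchor position can be illustrated with a small Python sketch. This is not MotifLab code: the function name distance_track is hypothetical, positions are simple 0-based indices, and the sequence is assumed to lie on the direct strand so that "upstream" means a smaller coordinate.

```python
def distance_track(length, anchor, direction=None):
    """Return the distance from each position 0..length-1 to `anchor`.

    Assumes a direct-strand sequence ("upstream" = smaller coordinate).
    direction=None         -> unsigned distance (both directions)
    direction="upstream"   -> anchor - X (positive upstream of the anchor)
    direction="downstream" -> X - anchor (positive downstream of the anchor)
    """
    if direction is None:
        return [abs(x - anchor) for x in range(length)]
    if direction == "upstream":
        return [anchor - x for x in range(length)]
    if direction == "downstream":
        return [x - anchor for x in range(length)]
    raise ValueError(f"unknown direction: {direction!r}")
```

The sketch covers only numeric anchors. For a Region Dataset anchor the manual describes a different rule (the positive distance to the closest region in the chosen direction), which is not modelled here.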

divide

The "divide" operation is a division operator which will divide the value (or values) of a numeric data object by a specified amount. For Numeric Datasets the operation will be applied to all positions in the sequences, and for Numeric Maps it will be applied to all entries in the map (including the "default" value). The operation can also be applied to Region Datasets to divide the value of numeric properties of regions or to remove strings from a "text" property. By default the operation will be applied to the "score" property of the regions unless a different property is specified. Note that if the "amount" argument (divisor) has a value of 0 for an entry, the division will not be carried out but the original value will be retained for that entry.

Applies to:	Numeric Dataset, Numeric Map, Numeric Variable and Region Dataset
Returns:	Numeric Dataset, Numeric Map, Numeric Variable or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
property	Specifies which property of the data object to divide. For Numeric Datasets, Numeric Maps and Numeric Variables this will simply be the "value", but for Region Datasets you can select which property of the regions to divide or remove text from (the default is "score"). Note that user-defined region properties will not be shown in the drop-down list in the operation dialog, but it is still possible to type in the name of other properties.
amount	Specifies the amount by which the values in the source object should be divided. If the amount is a literal number or a Numeric Variable, each value in the source object will be divided by the same amount. If the source object is a Numeric Dataset and the "amount" is also a Numeric Dataset, the value of each position in the source will be divided by the value in the same position in the "amount" dataset. If the source is a Numeric Dataset and the "amount" is a Sequence Numeric Map, the values of all positions in each sequence are divided by the value for that sequence in the map (so each sequence is potentially divided by a different value). If both the source and "amount" are Numeric Maps of the same type, the entries in the source map will be divided by the corresponding values in the "amount" map. If the source is a Region Dataset and the "amount" is a Numeric Dataset, the region property can be divided by a value derived from the values in the Numeric Dataset which are covered by the region. This value could be e.g. the minimum, maximum, average, median or sum of all values within the region or it could be the value of a single position at the center, start or end of the region ("start" and "end" refer to the lowest and highest genomic coordinates of the region with respect to the direct strand. The "relativeStart" and "relativeEnd" positions refer to the most upstream and most downstream position in the region when viewed relative to the orientation of the sequence, and the "regionStart" and "regionEnd" values refer to the most upstream and most downstream positions relative to the orientation of the region itself). If the region is a "motif" region, it is also possible to use "weighted average" or "weighted sum" where the values in each position are weighted by the information content of the motif in the corresponding position. If the operation is applied to a "text" property of a Region Dataset, the "amount" should either be a literal string enclosed in double quotes or a Text Variable. The given text entries will be removed from the existing text property if present.

# Divides the value of data object X by 10. If X is a Numeric Dataset/Numeric Map/Region Dataset, all positions/entries/regions in the data object will be divided by 10.
divide X by 10

# Returns a new track containing the quotient of the two tracks in each position
newNumericTrack = divide Track1 by Track2

# Divides the values for the entries in the first map by the corresponding entries in the second map
divide Map1 by Map2

# Divides the current 'score' property of each region in the track by the average value of the NumericTrack within the region
divide RegionTrack[score] by average NumericTrack

# Removes the two strings 'one' and 'three' from the text-property 'numbers' of every region in the Region Dataset.
divide RegionTrack[numbers] by "one,three"

See also: increase, decrease, multiply, set
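
The rule that a divisor of 0 leaves the original value unchanged can be sketched in Python as follows. This is an illustration only: the function name divide_track is hypothetical and a track is modelled as a plain list of numbers.

```python
def divide_track(source, amount):
    """Divide each value in `source` by `amount`.

    `amount` is either a single number (applied to every position) or a
    list with one divisor per position. Where the divisor is 0 the
    division is skipped and the original value is retained.
    """
    if isinstance(amount, (int, float)):
        divisors = [amount] * len(source)
    else:
        if len(amount) != len(source):
            raise ValueError("source and amount must have the same length")
        divisors = list(amount)
    return [v if d == 0 else v / d for v, d in zip(source, divisors)]
```

For example, dividing the track [6, 8, 5] by the track [3, 0, 2] gives [2.0, 8, 2.5]: the middle position keeps its original value because its divisor is 0.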

drop_sequences

This operation can be used to completely delete a set of sequences that are no longer needed in subsequent analyses. The operation will delete the specified Sequence Collection and all the Sequences within that collection. Any other data or references related to these sequences in other Collections, Partitions, Maps or Feature Datasets will also be deleted.

Applies to:	Sequence Collection
Returns:

# Deletes the sequence collection 'Downregulated' along with all the sequences therein
drop_sequences Downregulated

ensemblePrediction

The "ensemblePrediction" operation takes motif/binding site predictions generated by several different motif discovery programs as input and returns "consensus motifs". The operation will return both a Motif Collection containing the consensus motifs and a Region Dataset containing the locations of these motifs in the sequences (binding sites). The actual motif prediction will be performed by an external program, and users can select which ensemble prediction method they would like to use from a list of installed programs. To configure additional ensemble prediction methods, go to the "Configure" menu in MotifLab and select "External Programs...".

Applies to:	Region Dataset
Returns:	both Motif Collection and Region Dataset

Name	Description
method	The particular ensemble prediction program to use. Each individual method can have additional parameter settings. See the documentation for the method on how to set these parameters.
motif prefix	The returned motifs will be assigned names starting with this prefix and followed by an incremental counter. For example, if the motif prefix is set to "MF" the returned motifs will be named "MF00001", "MF00002", "MF00003" etc. This argument is optional and the prefix will default to the name of the ensemble prediction method if not specified.
DNA-track	Although the source inputs to the ensemblePrediction operation are Region Datasets, the DNA sequence might also be needed to properly set the sequence-property of the returned binding sites.

[TFBS, Motifs] = ensemblePrediction on Sites1,Sites2,Sites3 with EMD {...}

See also: motifScanning, moduleDiscovery, moduleScanning, ensemblePrediction

execute

The "execute" operation allows MotifLab to run an external data processing program. MotifLab can pass on any data that the program requires and create new data objects based on the results output by the program. This operation can thus extend the data processing capabilities of MotifLab beyond the operations already provided. In order to run a program with this operation, the interface of the program must be described in XML-formatted configuration files. Ready-made configuration files for some programs are already available from the MotifLab web site (under "Tools") or in the "External programs repository" found under "External Programs" in MotifLab's "Configure" menu.

Applies to:	Program dependent
Returns:	Program dependent

# Runs a program called 'CreateBackgroundModel' with the given program-specific parameters and returns a single data object output by this program.
bgmodel = execute CreateBackgroundModel {Sequence=DNA,Strand="Relative",Order=1}

See also: motifDiscovery, motifScanning, moduleDiscovery, moduleScanning, ensemblePrediction

extend

Extends the size of regions in a Region Dataset in one or both directions. The regions can be extended by a fixed number of bases, or they can be extended (one base at a time) as long as a given condition is satisfied. Note that the regions will never be extended past the edge of the associated sequence.

Applies to:	Region Dataset
Returns:	Region Dataset

Name	Description
direction	The extension to the region can be made in both directions (default) or just a single direction ("upstream" or "downstream"). It is also possible to extend the region in both directions independently. The directions are relative to the orientation of the associated sequence (not the orientation of the region to be extended).
amount	The number of bases to extend the region by. This could either be a fixed number (specified as a literal constant, Numeric Variable or Numeric Map), or the amount can be decided by a condition (in which case the region is extended in a direction as long as the condition is satisfied).

# Extends the regions in the DNaseHS track by 10 bp in either direction
extend DNaseHS by 10

# Extends the regions in the DNaseHS track by X bp in the upstream direction and Y bp in the downstream direction
extend DNaseHS upstream by X, downstream by Y

# Extends the regions in the DNaseHS track in either direction as long as they do not overlap with RepeatMasker regions
extend DNaseHS while not inside RepeatMasker

# Extends the regions in the DNaseHS track in either direction until they reach a position where the Conservation track has a value of 0
extend DNaseHS until Conservation = 0
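
The two extension modes described above (a fixed amount, or one base at a time while a condition holds) can be sketched in Python. This is a simplified illustration, not MotifLab's implementation: the function names are hypothetical, regions are given as inclusive 0-based (start, end) positions within a direct-strand sequence, and only the "while" form of the condition is modelled.

```python
def extend_region(start, end, seq_length, upstream=0, downstream=0):
    """Extend a region by fixed amounts, never past the sequence edges."""
    new_start = max(0, start - upstream)
    new_end = min(seq_length - 1, end + downstream)
    return new_start, new_end


def extend_while(start, end, seq_length, condition):
    """Extend a region one base at a time in both directions.

    `condition(pos)` is a hypothetical predicate; the region grows into
    the neighbouring position only while the predicate holds there.
    """
    while start > 0 and condition(start - 1):
        start -= 1
    while end < seq_length - 1 and condition(end + 1):
        end += 1
    return start, end
```

With a sequence of length 20, extend_region(5, 9, 20, 10, 10) returns (0, 19), since the region is clipped at both sequence edges.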

extend_sequences

This operation (introduced in MotifLab v2.0) extends the current sequences by a number of bases in one or both directions. It works similarly to the "Extend Sequences" tool, but unlike that tool the operation can also be applied to a subset of the sequences. Note that the extend_sequences operation cannot be used if the sequences have associated feature annotation tracks (since MotifLab will not fill in the missing data).

Applies to:	Sequence Collection
Returns:

Name	Description
amount	This parameter specifies the number of bases that the sequences should be extended. The value can be a constant number, a Numeric Variable or a Sequence Numeric Map (in the latter case, each sequence can be extended by a different number of bases). If no direction is specified, the sequences will be extended by this amount in both directions (so the sequences will end up 2*amount bp longer). Alternatively, the sequences can be extended only in one direction or by a different number of bases in the upstream and downstream direction.
use relative orientations	If relative orientations are used, new bases will be added to the sequences in the "upstream" or "downstream" direction (or both) relative to the orientation of each individual sequence. If relative orientations are not used, all the sequences will be treated as if they were on the direct strand and the new bases will either be added "before the start" of the sequences (extending "upstream" of the smallest genomic coordinates) or "after the end" of the sequences (extending "downstream" of the greatest genomic coordinates).

# Extends all sequences by 100 bp in both directions
extend_sequences by 100 bp

# Extends the sequences in the SeqCol1 collection by 100 bp upstream
extend_sequences in SeqCol1 by 100 bp upstream

# Extends all sequences 100 bp upstream and 200 bp downstream
extend_sequences by 100 bp upstream and by 200 bp downstream

# Extends all sequences by X bp after the end coordinate on the direct strand
extend_sequences by X bp after end
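
The effect of the "use relative orientations" setting on the resulting genomic coordinates can be sketched in Python. The function name extend_sequence and its parameters are hypothetical, and the sketch ignores chromosome boundaries.

```python
def extend_sequence(start, end, strand, upstream=0, downstream=0,
                    use_relative=True):
    """Return the new genomic (start, end) of an extended sequence.

    `strand` is "direct" or "reverse". With use_relative=True, upstream
    and downstream follow the orientation of the sequence. Otherwise the
    sequence is treated as direct: `upstream` bases are added before the
    start and `downstream` bases after the end.
    """
    if use_relative and strand == "reverse":
        # On the reverse strand, upstream lies at the greater coordinates.
        return start - downstream, end + upstream
    return start - upstream, end + downstream
```

For a reverse-strand sequence spanning 1000 to 1999, extending 100 bp upstream and 200 bp downstream gives (800, 2099) with relative orientations and (900, 2199) without.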

extract

The "extract" operation will extract a value or property (and sometimes also new derived values) from an existing data object and return this information as a new data object. The value or property to be extracted must be registered as an "exported property" in the source object, and different types of data objects will export different properties. For example, it is possible to extract the value of a single entry in a Numeric Map as a Numeric Variable, or extract the "top X" entries in the map as a collection. Analysis objects often export results as Numeric Maps and Numeric Variables.

Applies to:	Any data object that exports properties
Returns:	The type of the returned data object will depend on the property that is extracted

Name	Description
property	The name of the property which should be extracted from the source data. Different types of data will allow different properties to be extracted. Below are listed some of the properties that can be extracted from standard data types (the data type of the extracted object is noted in parentheses). Note that other data objects may export other properties as well. Text enclosed in angle brackets should be replaced by a suitable value or name of a data object as noted in the description of the property. For example, a property listed as <motif> could represent the name of a motif (without the brackets).

Collections (all types)

"size" (Numeric Variable)
Returns the size of the collection
"random <X>" (Collection)
Returns a random subset of the collection consisting of X entries. X should be a numeric value which can be given as a literal numeric constant or a Numeric Variable.
"random <X>%" (Collection)
Returns a random subset of the collection consisting of X % of the entries in the original collection. X should be a value between 0 and 100 and can be given as a literal numeric constant or a Numeric Variable.

Sequence Collections

"sequence:<property>" (Sequence Map or Sequence Numeric Map)
Extracts the value of the specified property from each sequence in the collection and returns the values in a Map. The user should declare the operation to return either a Sequence Numeric Map or a general Sequence Map depending on the type of the property.

Motif Collections

"motif:<property>" (Motif Map or Motif Numeric Map)
Extracts the value of the specified property from each motif in the collection and returns the values in a Map. The user should declare the operation to return either a Motif Numeric Map or a general Motif Map depending on the type of the property.

The extraction commands below derive new motifs based on the motifs in the collection. They may take a second argument "name_suffix=<suffix>" separated from the extraction property by a semicolon (see examples below). If a suffix is provided, the newly created motifs will be named after the original motifs but with the specified suffix added to the ends of their names. If no suffix is provided, the new motifs will have the same names as the originals and thus replace them.

"combine:<motif>,<distance>" (Motif Collection containing newly created motifs)
Creates a new collection of motifs by concatenating the motifs in the original collection with the named motif. If the numeric distance is greater than 0, the two motifs will be separated by a gap of that many bases, filled in with N's.
"complement" (Motif Collection containing newly created motifs)
Creates a new collection of motifs that are the reverse complements of the motifs in the original collection (same as "reverse" below).
"flank:<left>,<right>" (Motif Collection containing newly created motifs)
Creates a new collection of motifs in which the motifs from the original collection are expanded by adding flanking sequences before and after the original motif. The flanking sequences may contain IUPAC degenerate base symbols.
"inverse" (Motif Collection containing newly created motifs)
Creates a new collection of motifs where each motif is an inverted mirror image version of the motif from the original collection. (The sequence is reversed but not reverse complemented)
"reverse" (Motif Collection containing newly created motifs)
Creates a new collection of motifs that are the reverse complements of the motifs in the original collection (same as "complement" above).
"round" (Motif Collection containing newly created motifs)
Creates a new collection of motifs where the values in the matrix of the original motifs are rounded to the nearest integer values.
"shuffle" (Motif Collection containing newly created motifs)
Creates a new collection of motifs where the positions in the binding sequence of the original motifs are randomly shuffled (i.e. the columns in the matrices are swapped around)
"trim flanks:<IC-cutoff>" (Motif Collection containing newly created motifs)
Creates a new collection of motifs by removing bases from the beginning and end of the original motifs if the information content of these bases is lower than the specified cutoff (value between 0 and 2)
"trim:<left>,<right>" (Motif Collection containing newly created motifs)
Creates a new collection of motifs by removing the first <left> bases and the last <right> bases from the original motifs

Module Collections

"module:<property>" (Module Map or Module Numeric Map)
Extracts the value of the specified property from each module in the collection and returns the values in a Map. The user should declare the operation to return either a Module Numeric Map or a general Module Map depending on the type of the property.

Partitions

<cluster> (Collection)
If the property is the name of a cluster in the partition, the extract operation will return a collection containing the members of the cluster.
"size" (Numeric Variable)
Returns the number of entries (motifs/modules/sequences) that have been assigned to a cluster.
"number of clusters" (Numeric Variable)
Returns the number of clusters in the partition
"cluster names" (Text Variable)
Returns a Text Variable containing the names of all the clusters in the partition (one name on each line)
"cluster sizes" (Text Variable)
Returns a Text Variable containing the names of all the clusters in the partition and their respective sizes in two columns separated by TAB.

Numeric Maps

<entry> (Numeric Variable)
If the property is the name of an entry in the map (motif/module/sequence), the extract operation will return a variable containing the value for that entry.
"_DEFAULT_" (Numeric Variable)
Returns the default value which is used for entries that have no explicitly assigned value in the map.
"top value" (Numeric Variable)
Returns the highest value in the map (which could be the default value if some entries are unassigned)
"top value in <subset>" (Numeric Variable)
Returns the highest value among the entries in the given subset collection (which could be the default value if some entries are unassigned)
"top:<X>" (Collection)
Returns a collection containing the X entries which have the highest values in the map. X should be a literal numeric constant or the name of a Numeric Variable.
"top:<X>%" (Collection)
Returns a collection containing the X % entries which have the highest values in the map. X should be a value between 0 and 100 and can be given as a literal numeric constant or a Numeric Variable.
"top:<X> in <subset>" (Collection)
Returns a collection containing the X entries which have the highest values in the map among those in the given subset collection. X should be a literal numeric constant or the name of a Numeric Variable.
"top:<X>% in <subset>" (Collection)
Returns a collection containing the X % entries which have the highest values in the map among those in the given subset collection. X should be a value between 0 and 100 and can be given as a literal numeric constant or a Numeric Variable.
"bottom value" (Numeric Variable)
Returns the lowest value in the map (which could be the default value if some entries are unassigned)
"bottom value in <subset>" (Numeric Variable)
Returns the lowest value among the entries in the given subset collection (which could be the default value if some entries are unassigned)
"bottom:<X>" (Collection)
Returns a collection containing the X entries which have the lowest values in the map. X should be a literal numeric constant or the name of a Numeric Variable.
"bottom:<X>%" (Collection)
Returns a collection containing the X % entries which have the lowest values in the map. X should be a value between 0 and 100 and can be given as a literal numeric constant or a Numeric Variable.
"bottom:<X> in <subset>" (Collection)
Returns a collection containing the X entries which have the lowest values in the map among those in the given subset collection. X should be a literal numeric constant or the name of a Numeric Variable.
"bottom:<X>% in <subset>" (Collection)
Returns a collection containing the X % entries which have the lowest values in the map among those in the given subset collection. X should be a value between 0 and 100 and can be given as a literal numeric constant or a Numeric Variable.
"rank ascending" (Numeric Map)
Returns a new Numeric Map where each entry is assigned a new value based on the ascending rank order of the values in the original map. This means that the entry which has the lowest value in the original map will be assigned the value "1" in the new map and the entry with the second lowest value will be assigned the value "2" in the new map, and so on. Tied entries (that have the same value in the original map) will be assigned the same rank value in the new map, and the next rank value will then be set to one more than the number of entries that have lower values. For instance, if the values in the map are "13,13,24,32,32,32,58" the corresponding ranks will be "1,1,3,4,4,4,7".
"rank descending" (Numeric Map)
Returns a new Numeric Map where each entry is assigned a new value based on the descending rank order of the values in the original map. This means that the entry which has the highest value in the original map will be assigned the value "1" in the new map and the entry with the second highest value will be assigned the value "2" in the new map, and so on.
"assigned entries" (Collection)
Returns a collection of all the entries that have specifically assigned values in the map.
"unassigned entries" (Collection)
Returns a collection of all the entries that do not have specifically assigned values in the map but rather rely on the default value.
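
The tie handling in the "rank ascending" example ("13,13,24,32,32,32,58" gives "1,1,3,4,4,4,7") corresponds to what is often called competition ranking. A minimal Python sketch follows; the function name rank_map is hypothetical, a Numeric Map is modelled as a dict, and applying the same tie rule to descending ranks is an assumption since the manual only gives a worked example for the ascending case.

```python
def rank_map(values, descending=False):
    """Rank the entries of a dict mapping entry name -> numeric value.

    Tied entries share a rank. The rank of an entry is one more than
    the number of entries that come strictly before it in the ordering.
    """
    ranks = {}
    for name, value in values.items():
        if descending:
            ahead = sum(1 for other in values.values() if other > value)
        else:
            ahead = sum(1 for other in values.values() if other < value)
        ranks[name] = ahead + 1
    return ranks
```

The quadratic loop keeps the rule explicit. For large maps, sorting once and assigning ranks in a single pass would give the same result.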

Expression Profile

"column:<name>" (Sequence Numeric Map)
Returns the contents of the given column in the profile as a Sequence Numeric Map. The column can either be specified by index number (starting at 1 for the first column) or by header name (if the columns have specifically assigned names).
"subprofile:<columns>" (Expression Profile)
Returns a new Expression Profile object consisting of a subset of the columns in the original profile. The subprofile can be declared as a comma-separated list of columns or as a range of columns (or a combination of both). Column ranges can be specified by listing the first and last column in the range separated by either a hyphen or a colon (e.g. "firstCol-lastCol" or "firstCol:lastCol"). The columns can either be specified by index number (starting at 1 for the first column) or by header names (if the columns have specifically assigned names), and the columns will be added to the new profile in the order listed. If the first column in a range is greater than the last column, their order will be reversed: for example, if the range "7-4" is given, the columns 7, 6, 5 and 4 will be added to the new profile in that order.
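
How a "subprofile" column specification expands into an ordered list of columns can be sketched in Python. The function name expand_columns is hypothetical, and the sketch handles index numbers only (header names are not supported here).

```python
import re


def expand_columns(spec):
    """Expand a specification such as "1,3:5,7-4" into column indices.

    Ranges may use a hyphen or a colon. A range whose first column is
    greater than its last column is expanded in descending order.
    """
    columns = []
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)\s*[-:]\s*(\d+)", part)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            step = 1 if first <= last else -1
            columns.extend(range(first, last + step, step))
        else:
            columns.append(int(part))
    return columns
```

For example, expand_columns("1,3:5,7-4") returns [1, 3, 4, 5, 7, 6, 5, 4].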

Background Model

"GC-content" (Numeric Variable)
Returns the GC-content of the background model as a fraction between 0 and 1.

Region Dataset

"types" (Text Variable)
Returns a text variable listing all the different region types encountered in this dataset
"start" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the "genomic start" position of the original regions (i.e. the position within the original region that has the lowest genomic coordinate).
"end" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the "genomic end" position of the original regions (i.e. the position within the original region that has the greatest genomic coordinate).
"relative start" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the "relative start" position of the original regions (most upstream position). For regions in sequences from the direct strand this means the position within the original region that has the lowest genomic coordinate, and for regions in sequences from the reverse strand this means the position within the original region that has the greatest genomic coordinate.
"relative end" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the "relative end" position of the original regions (most downstream position). For regions in sequences from the direct strand this means the position within the original region that has the greatest genomic coordinate, and for regions in sequences from the reverse strand this means the position within the original region that has the lowest genomic coordinate.
"region start" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the start of the original region (relative to its own orientation). For regions with "direct" orientation this means the position within the original region that has the lowest genomic coordinate, and for regions with "reverse" orientation this means the position within the original region that has the greatest genomic coordinate.
"region end" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the end of the original region (relative to its own orientation). For regions with "direct" orientation this means the position within the original region that has the greatest genomic coordinate, and for regions with "reverse" orientation this means the position within the original region that has the lowest genomic coordinate.
"center" (Region Dataset)
Returns a new Region Dataset wherein each region in the original track has been reduced to a region of size 1 bp (all other properties are kept). The new reduced regions are positioned at the center of the original regions.
"TFBS" (Region Dataset)
If the target Region Dataset is a module track with modules consisting of composite motifs (clusters of TFBSs), the extract operation will return a new motif track containing the constituent TFBSs of the modules.
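
The differences between the seven single-position properties ("start", "end", "relative start", "relative end", "region start", "region end" and "center") can be summarized in a small Python sketch. The function name anchor_position is hypothetical, regions are given by inclusive genomic coordinates with start <= end, and rounding the center down for even-sized regions is an assumption.

```python
def anchor_position(start, end, anchor,
                    sequence_strand="direct", region_strand="direct"):
    """Return the genomic position that a region is reduced to."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "relative start":  # most upstream position in the sequence
        return start if sequence_strand == "direct" else end
    if anchor == "relative end":    # most downstream position in the sequence
        return end if sequence_strand == "direct" else start
    if anchor == "region start":    # relative to the region's own orientation
        return start if region_strand == "direct" else end
    if anchor == "region end":
        return end if region_strand == "direct" else start
    if anchor == "center":
        return (start + end) // 2   # assumption: round down
    raise ValueError(f"unknown anchor: {anchor!r}")
```

For a region spanning 100 to 109 in a reverse-strand sequence, "start" gives 100 whereas "relative start" gives 109.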

Motif

"Alternatives" (Motif Collection)
Returns a motif collection containing all motifs that are annotated as alternative to this motif.
"Interactions" (Motif Collection)
Returns a motif collection containing all motifs that are annotated as interacting with this motif.
"combine motif:<motif>,<distance>" (Motif)
Creates a new motif by concatenating the original with the named motif. If the numeric distance is greater than 0, the two motifs will be separated by a gap of that many bases, filled in with N's.
"complement motif" (Motif)
Creates a new motif which is the reverse complement of the original (same as "reverse motif" below).
"flank motif:<left>,<right>" (Motif)
Creates a new motif by adding flanking sequences before and after the original motif. The flanking sequences may contain IUPAC degenerate base symbols.
"inverse motif" (Motif)
Creates a new motif which is the inverted mirror image of the original motif. (The sequence is reversed but not reverse complemented)
"reverse motif" (Motif)
Creates a new motif which is the reverse complement of the original (same as "complement motif" above).
"round motif" (Motif)
Creates a new motif where the values in the matrix of the original motif are rounded to the nearest integer values.
"shuffle motif" (Motif)
Creates a new motif where the positions in the binding sequence of the original motif are randomly shuffled (i.e. the columns in the matrix are swapped around)
"trim motif flanks:<IC-cutoff>" (Motif)
Creates a new motif by removing bases from the beginning and end of the original motif if the information content of these bases is lower than the specified cutoff (value between 0 and 2)
"trim motif:<left>,<right>" (Motif)
Creates a new motif by removing the first <left> bases and the last <right> bases from the original motif.
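
The column-based motif transformations above can be illustrated with a short Python sketch. This is not MotifLab's own code. It assumes a motif is stored as a list of columns holding A, C, G, T frequencies, and that the information content of a column is 2 + sum(p * log2(p)) bits with no pseudocounts or background correction, which may differ from the exact computation MotifLab uses. All function names are hypothetical.

```python
import math

# Column order is assumed to be A, C, G, T; complementing swaps A<->T and C<->G.
COMPLEMENT_INDEX = [3, 2, 1, 0]


def column_ic(column):
    """Information content (bits, between 0 and 2) of one frequency column."""
    total = sum(column)
    probs = [value / total for value in column]
    return 2.0 + sum(p * math.log2(p) for p in probs if p > 0)


def trim_flanks(matrix, ic_cutoff):
    """Drop leading and trailing columns whose IC is below the cutoff."""
    start, end = 0, len(matrix)
    while start < end and column_ic(matrix[start]) < ic_cutoff:
        start += 1
    while end > start and column_ic(matrix[end - 1]) < ic_cutoff:
        end -= 1
    return matrix[start:end]


def inverse(matrix):
    """Reverse the column order without complementing ("inverse motif")."""
    return matrix[::-1]


def reverse_complement(matrix):
    """Reverse the column order and complement each column ("reverse motif")."""
    return [[column[i] for i in COMPLEMENT_INDEX] for column in matrix[::-1]]
```

Under these assumptions, a motif whose first and last columns are uniform (IC of 0 bits) loses exactly those two columns when trimmed with a cutoff of 0.7, while interior columns are never removed even if their IC is low.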

data type	The data type for the extracted property must be specified. Note that if the selected data type does not match the correct type of the property, an error will occur.

# Extracts the value for the motif 'M00003' from the Motif Numeric Map and returns it as a Numeric Variable
x = extract "M00003" from MotifNumericMap1 as Numeric Variable

# Returns the highest value found in MotifNumericMap1
x = extract "top value" from MotifNumericMap1 as Numeric Variable

# Returns a new motif collection containing the 10% of motifs that have the highest values in the numeric map
motifCol = extract "top:10%" from MotifNumericMap1 as Motif Collection

# Returns a new Motif Numeric Map where the entries in the source map are ranked by ascending value
map1 = extract "rank ascending" from MotifNumericMap1 as Motif Numeric Map

# Returns the number of sequences currently known by the system (i.e. the size of the AllSequences collection)
size = extract "size" from AllSequences as Numeric Variable

# Returns a collection with 20 randomly selected sequences (or fewer if there are fewer than 20 sequences)
random20 = extract "random 20" from AllSequences as Sequence Collection

# Selects half of the current sequences at random and returns these as a new collection
randomHalf = extract "random 50%" from AllSequences as Sequence Collection

# Returns the GC-content of each sequence (as found by the 'GC-content' analysis) in a Sequence Numeric Map
GC = extract "GC-content" from GC_analysis as Sequence Numeric Map

# Extracts the total number of times each motif occurs in the sequences from an analysis object and returns this result as a Motif Numeric Map
motif_counts = extract "total" from CountMotifOccurrencesAnalysis1 as Motif Numeric Map

# Returns a Motif Numeric Map with the size of each motif in the JASPAR motif collection
motif_sizes = extract "motif:size" from JASPAR as Motif Numeric Map

# Returns a Sequence Map with the genome build of each sequence in the AllSequences collection
genomes = extract "sequence:genome" from AllSequences as Sequence Map

# Creates a new Motif based on the motif M00008 by adding the sequence CAG to the front of the motif and the (degenerate) sequence NTSG to the end
M0008_flanked = extract "flank motif: CAG,NTSG" from M00008 as Motif

# Creates a new double 'module' Motif by concatenating the motif sequence from M00008 with the motif sequence from M00051 with a 5 bp spacer between them
double_motif = extract "combine motif: M00051,5" from M00008 as Motif

# Trims the flanks of motifs in the JASPAR motif collection as long as the information content of the flanking bases is less than 0.7. Note that this will update the original motifs
JASPAR_trimmed = extract "trim flanks: 0.7" from JASPAR as Motif Collection

# Creates a new collection of motifs based on the JASPAR collection but where the positions (matrix columns) in the original motifs have been randomly reshuffled. The names of the new motifs are based on the originals but with an added '_shuffled' suffix
JASPAR_shuffled = extract "shuffle; name_suffix=_shuffled" from JASPAR as Motif Collection

: Data

filter

Removes regions that satisfy a given condition from a Region Dataset. If no condition is specified, all the regions in the dataset will be removed.

Applies to:	Region Dataset
Returns:	Region Dataset

: region condition


# Removes binding site regions in the TFBS track that are not very conserved (more specifically, removes regions where the average value of the Numeric Dataset 'Conservation' across the positions within the region is less than 0.2)
filter TFBS where region's average Conservation < 0.2

# Removes regions in the CpG_island track that overlap with regions in the RepeatMasker track
filter CpG_island where region overlaps RepeatMasker

# Removes binding site regions in the TFBS track that do not overlap with any DNase_peaks regions
filter TFBS where not region overlaps DNase_peaks

# Removes binding site regions in the TFBS track for motifs that are members of the collection 'MotifCollection1'
filter TFBS where region's type is in MotifCollection1

increase

The "increase" operation is an addition operator which will increase the value (or values) of a numeric data object by a specified amount. For Numeric Datasets the operation will be applied to all positions in the sequences, and for Numeric Maps it will be applied to all entries in the map (including the "default" value). The operation can also be applied to Region Datasets to increase the value of numeric properties of regions or to append new strings to a "text" property. By default the operation will be applied to the "score" property of the regions unless a different property is specified.

Applies to:	Numeric Dataset, Numeric Map, Numeric Variable and Region Dataset
Returns:	Numeric Dataset, Numeric Map, Numeric Variable or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
property	Specifies which property of the data object to increase. For Numeric Datasets, Numeric Maps and Numeric Variables this will simply be the "value", but for Region Datasets you can select which property of the regions to increase or append to (the default is "score"). Note that user-defined region properties will not be shown in the drop-down list in the operation dialog, but it is still possible to type in the name of other properties.
amount	Specifies the amount by which the values in the source object should be increased. If the amount is a literal number or a Numeric Variable, each value in the source object will be increased by the same amount. If the source object is a Numeric Dataset and the "amount" is also a Numeric Dataset, the value of each position in the source will be increased by the value in the same position in the "amount" dataset. If the source is a Numeric Dataset and the "amount" is a Sequence Numeric Map, the values of all positions in each sequence are increased by the value for that sequence in the map (so each sequence is potentially increased by a different value). If both the source and "amount" are Numeric Maps of the same type, the entries in the source map will be increased by the corresponding values in the "amount" map. If the source is a Region Dataset and the "amount" is a Numeric Dataset, the region property can be increased by a value derived from the values in the Numeric Dataset which are covered by the region. This value could be e.g. the minimum, maximum, average, median or sum of all values within the region, or it could be the value of a single position at the center, start or end of the region. Here "start" and "end" refer to the lowest and highest genomic coordinates of the region with respect to the direct strand; "relativeStart" and "relativeEnd" refer to the most upstream and most downstream positions in the region when viewed relative to the orientation of the sequence; and "regionStart" and "regionEnd" refer to the most upstream and most downstream positions relative to the orientation of the region itself. If the region is a "motif" region, it is also possible to use "weighted average" or "weighted sum" where the values in each position are weighted by the information content of the motif in the corresponding position. If the operation is applied to a "text" property of a Region Dataset, the "amount" should either be a literal string enclosed in double quotes or a Text Variable. The new text entries will be appended to the existing text property (or a new property will be created). Note that if a text property contains multiple comma-separated entries, the property is treated as a "string set" and strings which are already present in the set will not be appended again.
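
The six single-position anchors described for the "amount" parameter differ only in which orientation they are measured against. The following Python sketch shows one way to read those definitions. It is an illustration rather than MotifLab code: the function name is hypothetical, and it assumes that both the sequence orientation and the region orientation are given as "direct" or "reverse" with respect to the genome.

```python
def anchor_position(anchor, start, end, sequence_orientation, region_orientation):
    """Return the genomic coordinate selected by a single-position anchor.

    start and end are the lowest and highest genomic coordinates of the
    region (inclusive). Orientations are "direct" or "reverse" (assumed).
    """
    # On the direct strand the most upstream position is the lowest
    # coordinate; on the reverse strand it is the highest coordinate.
    upstream = {"direct": start, "reverse": end}
    downstream = {"direct": end, "reverse": start}
    positions = {
        "start": start,
        "end": end,
        "relativeStart": upstream[sequence_orientation],
        "relativeEnd": downstream[sequence_orientation],
        "regionStart": upstream[region_orientation],
        "regionEnd": downstream[region_orientation],
    }
    return positions[anchor]
```

For a region spanning genomic positions 100 to 120 in a sequence with reverse orientation, this reading gives "start" = 100 but "relativeStart" = 120, since the most upstream position of a reverse-strand sequence is the one with the highest coordinate.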

# Increases the value of data object X by 10. If X is a Numeric Dataset/Numeric Map/Region Dataset, all positions/entries/regions in the data object will be increased by 10.
increase X by 10

# Returns a new track containing the sum of the two tracks in each position
newNumericTrack = increase Track1 by Track2

# Increases the values for the entries in the first map by the corresponding entries in the second map
increase Map1 by Map2

# Increases the current 'score' property of each region in the track with the average value of the NumericTrack within the region
increase RegionTrack[score] by average NumericTrack

# Appends the three strings 'one', 'two' and 'three' to the text-property 'numbers' of every region in the Region Dataset. If a region already contains one of these strings, that string will not be appended a second time.
increase RegionTrack[numbers] by "one,two,three"
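
The "string set" behaviour of text properties in the last example can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that existing entries keep their order and that new entries are added at the end, which the manual does not specify.

```python
def append_to_string_set(existing, amount):
    """Append comma-separated entries from `amount` that are not already present."""
    entries = [entry for entry in existing.split(",") if entry]
    for entry in amount.split(","):
        # Entries already in the set are skipped, so nothing is duplicated.
        if entry and entry not in entries:
            entries.append(entry)
    return ",".join(entries)


# A region whose 'numbers' property is "two,four" gains only 'one' and 'three'.
print(append_to_string_set("two,four", "one,two,three"))
```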

: decrease, multiply, divide, set

interpolate

The "interpolate" operation can be used to fill in "missing values" in a Numeric Dataset that only contains (non-zero) values for a few discrete positions. For example, if the values in the track are based on a tiling-array experiment that only returns one value for each consecutive X bp region in the sequence, and only the first position in each region is assigned the value whereas the next X-1 positions are set to 0, values for the remaining positions can be filled in by interpolation. The default behaviour of the operation is to interpolate between discrete, consecutive non-zero positions in the sequence (which assumes that no position should be zero). However, it is also possible to specify a maximum distance between the non-zero positions, so that interpolation will not be performed when the distance between two consecutive non-zero positions exceeds this limit. If the distance between the discrete positions that are supposed to have legitimate values is fixed and known, it is possible to specify this as a parameter. The operation will then locate the first non-zero position in the sequence and assume that the next positions to interpolate between occur periodically after this position. This means that zero-valued anchor positions will also be allowed.

Applies to:	Numeric Dataset
Returns:	Numeric Dataset

Name	Description
method	Specifies which interpolation method to use when interpolating between the values in two positions. The currently implemented methods are "zero order hold" (missing values are filled in by repeating the last encountered value in new positions) and "linear interpolation" (missing values are filled in along a straight line between the values of the two consecutive anchor positions).
period	If specified, the operation will interpolate between consecutive anchor positions in the numeric track where the first anchor position will be the first non-zero position in the track and the next anchor positions are assumed to occur periodically after this (the other anchors can have zero values). If a period is not specified, the operation will just interpolate between non-zero positions in the track (which can occur at variable distances).
max distance	If the period parameter is not set, the operation will normally interpolate between consecutive non-zero positions in the track that can occur with any distance between them. However, if a "max distance" is specified, interpolation will only be performed between two non-zero positions if the distance between them does not exceed the specified limit. (Note that it is not possible to specify both a "period" and a "max distance" parameter)

: position condition


# Interpolates between consecutive non-zero positions in Track1 using linear interpolation
interpolate Track1 using "linear interpolation"

# Finds the first position with a non-zero value in Track1 and repeats this value in the next 24 positions immediately following. Then finds the (possibly zero) value in the subsequent position and repeats this for another 24 positions, etc.
interpolate Track1 using "zero order hold" with period 25
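
The interplay between the "method", "period" and "max distance" parameters can be sketched in Python for a single sequence. This is an illustrative reading of the descriptions above, not MotifLab's implementation. The function name is hypothetical, distance is taken to be the difference between two anchor positions, and positions after the last anchor are left untouched (both are assumptions).

```python
def interpolate(values, method="linear interpolation", period=None, max_distance=None):
    """Fill in values between anchor positions of one sequence."""
    if period is not None and max_distance is not None:
        raise ValueError("'period' and 'max distance' cannot both be specified")
    result = list(values)
    nonzero = [i for i, value in enumerate(values) if value != 0]
    if not nonzero:
        return result
    if period is not None:
        # Anchors occur periodically after the first non-zero position,
        # so later anchors are allowed to hold zero values.
        anchors = list(range(nonzero[0], len(values), period))
    else:
        anchors = nonzero
    for left, right in zip(anchors, anchors[1:]):
        gap = right - left
        if max_distance is not None and gap > max_distance:
            continue  # too far apart: leave the positions in between as they are
        for pos in range(left + 1, right):
            if method == "zero order hold":
                result[pos] = values[left]
            else:
                fraction = (pos - left) / gap
                result[pos] = values[left] + fraction * (values[right] - values[left])
    return result
```

With a period of 3, the track 0, 5, 0, 0, 0, 0, 0, 7, 0 has anchors at the second, fifth and eighth positions. Zero order hold repeats the 5 twice and then repeats the zero value found at the fifth position, which shows why periodic anchors may legitimately be zero.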

: apply

mask

Masks bases in a DNA sequence by replacing the letters in the sequence with either upper- or lowercase versions of the original letter, a new specified letter or random bases sampled from a background model.

Applies to:	DNA Sequence Dataset
Returns:	DNA Sequence Dataset

Name	Description
mask type	Controls how the original letters in the sequence should be substituted:

lowercase letters: Change the case of the bases to be masked to lowercase
uppercase letters: Change the case of the bases to be masked to uppercase
specific letter: Replace the bases to be masked with the letter specified
random bases: Replace the bases to be masked with new base letters that are sampled randomly from the given background model.
sequence property: Given a Region Dataset where the regions have a text property called "sequence" with the same length as the region, the parts of the DNA sequence overlapping these regions will have their current bases replaced with these sequence strings (v2.0+).

strand	Specifies which strand should be masked, either the "direct" strand or the "relative" strand (for backwards compatibility the words "sequence" and "gene" can be used synonymously with "relative" in protocols). This setting is not important when masking with upper/lowercase or with a non-base letter. However, if the masking is done with regular base letters (either with a specific letter or by sampling from a background model), this parameter controls which strand is assigned the chosen letter. The argument is optional and defaults to the "relative" strand if not specified (i.e. the same orientation as the sequence).

: position condition


# Masks out repeat regions
mask DNA with "N" where inside RepeatMasker

# Sets the DNA bases to lowercase inside coding regions
mask DNA with lowercase where inside CCDS

# Creates a new DNA sequence by sampling bases according to the background model 'Uniform'
newDNA = mask DNA on relative strand with Uniform
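
As a rough illustration of the three simplest mask types, the following Python sketch masks the positions covered by a list of regions. It is not MotifLab code: the function name is hypothetical, regions are assumed to be given as 0-based inclusive (start, end) pairs, and the strand setting and the "random bases" and "sequence property" mask types are left out.

```python
def mask(sequence, regions, mask_type, letter="N"):
    """Mask the positions of `sequence` that fall inside any of `regions`."""
    bases = list(sequence)
    for start, end in regions:
        for pos in range(start, end + 1):
            if mask_type == "lowercase letters":
                bases[pos] = bases[pos].lower()
            elif mask_type == "uppercase letters":
                bases[pos] = bases[pos].upper()
            elif mask_type == "specific letter":
                bases[pos] = letter
            else:
                raise ValueError("unsupported mask type: " + mask_type)
    return "".join(bases)


print(mask("ACGTACGT", [(2, 4)], "specific letter", "N"))
print(mask("ACGTACGT", [(2, 4)], "lowercase letters"))
```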

: Background Model

merge

Merges regions within each sequence that are located closer than a specified distance apart from each other. The operation can merge overlapping regions, but also regions that are separated by gaps (in which case the resulting region will cover the full span of the merged regions, including the gaps). If the merged regions have the same type, the resulting region will also have this type, else the region is assigned the type "merged". If the merged regions have the same orientation, the resulting region will also have this orientation, else the region is assigned the orientation "undetermined". The resulting region will be assigned the score of the highest scoring region among those merged.

Applies to:	Region Dataset
Returns:	Region Dataset

Name	Description
distance	Regions that lie closer than this distance from each other in the sequence will be merged. "Closer than 0" means that only regions overlapping with each other will be merged. "Closer than 1" will also merge regions that are located immediately adjacent to each other (with no gaps between). "Closer than 2" will also merge regions that are separated by a gap of 1 bp.
mode	Possible values: "any" or "similar". This setting is not currently used.

: region condition


# Merges ChIP_Seq regions that overlap in at least one base position
merge ChIP_Seq closer than 0
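
The meaning of "closer than N" and the rules for the type, orientation and score of a merged region can be sketched in Python. This is an illustrative reading of the description above rather than MotifLab's implementation. The function name and the dictionary layout of a region are hypothetical, and coordinates are assumed to be inclusive.

```python
def merge_regions(regions, distance=0):
    """Merge regions of one sequence that lie closer than `distance` apart."""
    merged = []
    for region in sorted(regions, key=lambda r: r["start"]):
        if merged:
            last = merged[-1]
            # Number of bases between the two regions: 0 if they are
            # immediately adjacent, negative if they overlap.
            gap = region["start"] - last["end"] - 1
            if gap < distance:
                last["end"] = max(last["end"], region["end"])
                if last["type"] != region["type"]:
                    last["type"] = "merged"
                if last["orientation"] != region["orientation"]:
                    last["orientation"] = "undetermined"
                last["score"] = max(last["score"], region["score"])
                continue
        merged.append(dict(region))  # copy so the input is not modified
    return merged
```

Given three regions at 1-10, 11-20 and 22-30, a distance of 0 merges nothing, a distance of 1 merges the first two (they are adjacent), and a distance of 2 merges all three because the remaining gap is 1 bp.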

: combine_regions

moduleDiscovery

The "moduleDiscovery" operation can be used to perform 'de novo' module discovery in a set of sequences, meaning that it can search for possible modules (combinations of binding motifs) that are present in the sequences without having prior knowledge about what the modules look like. The operation will return both a Module Collection containing the discovered modules and a Region Dataset containing the locations of these modules in the sequences. The actual module discovery will be performed by an external program, and users can select which module discovery method they would like to use from a list of installed programs. To configure additional module discovery methods, go to the "Configure" menu in MotifLab and select "External Programs...". Note that this operation can be applied to both DNA Sequence Datasets and Region Datasets, but which type of source data to use will depend on the chosen module discovery method.

Applies to:	Region Dataset and DNA Sequence Dataset
Returns:	both Module Collection and Region Dataset

Name	Description
method	The particular module discovery program to use. Each individual method can have additional parameter settings. See the documentation for the method on how to set these parameters.
module prefix	The discovered modules will be assigned names starting with this prefix and followed by an incremental counter. For example, if the module prefix is set to "MD" the discovered modules will be named "MD00001", "MD00002", "MD00003" etc. This argument is optional and the prefix will default to the name of the module discovery method if not specified.

[Module_sites, Modules] = moduleDiscovery on TFBS with ModuleSearcher {...}

: moduleScanning, motifDiscovery, motifScanning, ensemblePrediction

moduleScanning

The "moduleScanning" operation can be used to search DNA sequences for matches to a set of predefined modules. The operation will return a Region Dataset containing the locations of these modules in the sequences. The actual module scanning will be performed by an external program, and users can select which module scanning method they would like to use from a list of installed programs. To configure additional module scanning methods, go to the "Configure" menu in MotifLab and select "External Programs...". Note that this operation can be applied to both DNA Sequence Datasets and Region Datasets, but which type of source data to use will depend on the chosen module scanning method.

Applies to:	Region Dataset and DNA Sequence Dataset
Returns:	Region Dataset

Name	Description
method	The particular module scanning program to use. Each individual method can have additional parameter settings. See the documentation for the method on how to set these parameters.

ModuleSites = moduleScanning on TFBS with SimpleModuleScanner {...}

: moduleDiscovery, motifScanning, motifDiscovery, ensemblePrediction

motifDiscovery

The "motifDiscovery" operation can be used to perform 'de novo' motif discovery in a set of sequences, meaning that it can search for possible binding motifs that are present in all or several of the sequences without having prior knowledge about what the motifs look like. The operation will return both a Motif Collection containing the discovered motifs and a Region Dataset containing the locations of these motifs in the sequences (binding sites). The actual motif discovery will be performed by an external program, and users can select which motif discovery method they would like to use from a list of installed programs. To configure additional motif discovery methods, go to the "Configure" menu in MotifLab and select "External Programs...".

Applies to:	DNA Sequence Dataset
Returns:	both Motif Collection and Region Dataset

Name	Description
method	The particular motif discovery program to use. Each individual method can have additional parameter settings. See the documentation for the method on how to set these parameters.
motif prefix	The discovered motifs will be assigned names starting with this prefix and followed by an incremental counter. For example, if the motif prefix is set to "MF" the discovered motifs will be named "MF00001", "MF00002", "MF00003" etc. This argument is optional and the prefix will default to the name of the motif discovery method if not specified.

[TFBS, Motifs] = motifDiscovery on DNA with MEME {...}

: motifScanning, moduleDiscovery, moduleScanning, ensemblePrediction

motifScanning

The "motifScanning" operation can be used to search DNA sequences for matches to a set of known motifs. The operation will return a Region Dataset containing the locations of these motifs in the sequences (binding sites). The actual motif scanning will be performed by an external program, and users can select which motif scanning method they would like to use from a list of installed programs. To configure additional motif scanning methods, go to the "Configure" menu in MotifLab and select "External Programs...".

Applies to:	DNA Sequence Dataset
Returns:	Region Dataset

Name	Description
method	The particular motif scanning program to use. Each individual method can have additional parameter settings. See the documentation for the method on how to set these parameters.

TFBS = motifScanning on DNA with MATCH {...}

: search, motifDiscovery, moduleDiscovery, moduleScanning, ensemblePrediction

multiply

The "multiply" operation is a multiplication operator which will multiply the value (or values) of a numeric data object by a specified amount. For Numeric Datasets the operation will be applied to all positions in the sequences, and for Numeric Maps it will be applied to all entries in the map (including the "default" value). The operation can also be applied to Region Datasets to multiply the value of numeric properties of regions or to append new strings to a "text" property. By default the operation will be applied to the "score" property of the regions unless a different property is specified.

Applies to:	Numeric Dataset, Numeric Map, Numeric Variable and Region Dataset
Returns:	Numeric Dataset, Numeric Map, Numeric Variable or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
property	Specifies which property of the data object to multiply. For Numeric Datasets, Numeric Maps and Numeric Variables this will simply be the "value", but for Region Datasets you can select which property of the regions to multiply or append to (the default is "score"). Note that user-defined region properties will not be shown in the drop-down list in the operation dialog, but it is still possible to type in the name of other properties.
amount	Specifies the amount by which the values in the source object should be multiplied. If the amount is a literal number or a Numeric Variable, each value in the source object will be multiplied by the same amount. If the source object is a Numeric Dataset and the "amount" is also a Numeric Dataset, the value of each position in the source will be multiplied by the value in the same position in the "amount" dataset. If the source is a Numeric Dataset and the "amount" is a Sequence Numeric Map, the values of all positions in each sequence are multiplied by the value for that sequence in the map (so each sequence is potentially multiplied by a different value). If both the source and "amount" are Numeric Maps of the same type, the entries in the source map will be multiplied by the corresponding values in the "amount" map. If the source is a Region Dataset and the "amount" is a Numeric Dataset, the region property can be multiplied by a value derived from the values in the Numeric Dataset which are covered by the region. This value could be e.g. the minimum, maximum, average, median or sum of all values within the region, or it could be the value of a single position at the center, start or end of the region. Here "start" and "end" refer to the lowest and highest genomic coordinates of the region with respect to the direct strand; "relativeStart" and "relativeEnd" refer to the most upstream and most downstream positions in the region when viewed relative to the orientation of the sequence; and "regionStart" and "regionEnd" refer to the most upstream and most downstream positions relative to the orientation of the region itself. If the region is a "motif" region, it is also possible to use "weighted average" or "weighted sum" where the values in each position are weighted by the information content of the motif in the corresponding position. If the operation is applied to a "text" property of a Region Dataset, the "amount" should either be a literal string enclosed in double quotes or a Text Variable. The new text entries will be appended to the existing text property (or a new property will be created). Note that if a text property contains multiple comma-separated entries, the property is treated as a "string set" and strings which are already present in the set will not be appended again.

# Multiplies the value of data object X by 10. If X is a Numeric Dataset/Numeric Map/Region Dataset, all positions/entries/regions in the data object will be multiplied by 10.
multiply X by 10

# Returns a new track containing the product of the two tracks in each position
newNumericTrack = multiply Track1 by Track2

# Multiplies the values for the entries in the first map by the corresponding entries in the second map
multiply Map1 by Map2

# Multiplies the current 'score' property of each region in the track with the average value of the NumericTrack within the region
multiply RegionTrack[score] by average NumericTrack

# Appends the three strings 'one', 'two' and 'three' to the text-property 'numbers' of every region in the Region Dataset. If a region already contains one of these strings, that string will not be appended a second time.
multiply RegionTrack[numbers] by "one,two,three"

: increase, decrease, divide, set

new

Creates a new data object according to given specifications. Exactly how to define an object depends on the type, and most types of objects can be constructed in several different ways. For instance, a background model can be created by explicitly specifying its order and all the oligo-frequencies, or it can be generated automatically based on a DNA sequence.

Two modes of creation are supported by all types of object, namely creating an object based on a file (in an applicable data format) and creating an object based on a Text Variable or Output Data object containing a similarly formatted file. The syntax for these two modes is almost identical:

    NewObject = new <data type> (File:"path/to/file",  format=<formatname> {format arguments} )
    NewObject = new <data type> (Input:DataObjectName, format=<formatname> {format arguments} )

The data object referred to by the "Input:" mode must be either a Text Variable or an Output Data object. The format arguments (enclosed in braces) are optional, and default values for that format will be used if they are left out. The format name is also optional, and the default format for the data type will be used if it is left out (with default format arguments).

Returns:	Determined by the specified data type

Name	Description
data type	The data type of the object to be created
initialization parameters	This parameter controls how to create and initialize the data object. The parameter settings available are specific to each data type. If the parameter is left out, a default instantiation of the data type will be returned (usually an "empty" object).

# Creates a new DNA Sequence Dataset with a value of 'A' for every position
NewTrack = new DNA Sequence Dataset('A')

# Creates a new DNA Sequence Dataset based on a background model called 'Uniform'
NewTrack = new DNA Sequence Dataset(Uniform)

# Creates a new empty Region Dataset containing no regions
NewTrack = new Region Dataset()

# Creates a new Numeric Dataset with a value of 2.0 for every position
NewTrack = new Numeric Dataset(2.0)

# Creates a new Numeric Dataset based on a preconfigured data track called 'Conservation'
NewTrack = new Numeric Dataset(DataTrack:Conservation)

# Creates a new Region Dataset with regions read from a file in the default format for Region Datasets (i.e. GFF)
NewTrack = new Region Dataset(File:"cpg_islands.gff")

# Creates a new Region Dataset with regions read from file in BED format
NewTrack = new Region Dataset(File:"cpg_islands.bed", format=BED)

# Creates a new Region Dataset based on a Text Variable containing GFF-formatted data
NewTrack = new Region Dataset(Input:TextVariable1, format=GFF {Position="Genomic",Orientation="Relative"})

: Text Variable, Output Data

normalize

The normalize operation rescales the numeric values of a data object from one range to another. The operation can currently only be applied to Numeric Datasets and Region Datasets (for the latter it will be applied to the score-property), but it will be updated in the future so that it will also work on Numeric Maps. It has two different modes of normalization: "normalize sum to one" or "normalize to range". The first mode will scale all the values so that the sum total equals 1.0 (and the values thus form a probability distribution), while the second mode will scale the values from one range ("old range") to another ("new range").

Applies to:	Numeric Dataset and Region Dataset
Returns:	Numeric Dataset or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
mode	One of the following:

"Normalize sequence sum to 1" : Normalizes the values in a sequence by dividing each value by the sum total of all the values in the sequence
"Normalize values to range" : Normalizes the values in the sequence by transforming each value X according to the formula: ((X-OldMin)/(OldMax-OldMin))*(NewMax-NewMin)+NewMin

range	If the "normalize values to range" mode is used, the old range to scale from (usually the full range of the values in the source object) and the new range to scale to must be specified with four parameters:

"old min" : The smallest value in the range to scale from
"old max" : The largest value in the range to scale from
"new min" : The smallest value in the range to scale to
"new max" : The largest value in the range to scale to

Note that the "old min" and "old max" values do not necessarily have to correspond to the actual min and max values in the source object, although that will normally be the case. The four values can be specified as numeric constants, with Sequence Numeric Maps (in which case the range will be different for each sequence), with Numeric Variables or with the six "special values": sequence.min, sequence.max, dataset.min, dataset.max, collection.min, collection.max. The "sequence.min" and "sequence.max" values are respectively the smallest and largest values within each sequence (so if these are used the range could be different for each sequence), the "dataset.min" and "dataset.max" values are the smallest and largest values in the whole dataset, and if the operation is limited to a sequence collection, the "collection.min" and "collection.max" values will be the smallest and largest values among the sequences in the collection.

# Normalizes the values in the PositionalPriors track so that the values form a probability distribution
normalize PositionalPriors sequence sum to one

# Normalizes all the values in the Conservation track so that the previously smallest value within each sequence is set to 0 and the previously largest value is scaled to 100
normalize Conservation from range [sequence.min,sequence.max] to range [0,100]

# 'Inverts' the value range of the Conservation track by setting the previously smallest value in each sequence to the new largest value and vice versa
normalize Conservation from range [sequence.min,sequence.max] to range [sequence.max,sequence.min]
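
The two normalization modes can be written out as a short Python sketch that operates on the values of a single sequence. The range formula is taken directly from the "mode" description above. The function names and the error handling for degenerate input are assumptions and do not necessarily reflect how MotifLab behaves.

```python
def normalize_to_range(values, old_min, old_max, new_min, new_max):
    """Apply ((X-OldMin)/(OldMax-OldMin))*(NewMax-NewMin)+NewMin to each value."""
    if old_max == old_min:
        raise ValueError("old range must not be empty")
    return [
        ((x - old_min) / (old_max - old_min)) * (new_max - new_min) + new_min
        for x in values
    ]


def normalize_sum_to_one(values):
    """Divide each value by the sum total so that the values sum to 1.0."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [x / total for x in values]


track = [2, 4, 6]
# Scale from the observed range [2, 6] to [0, 100].
print(normalize_to_range(track, min(track), max(track), 0, 100))
# 'Invert' the range by swapping the new min and max.
print(normalize_to_range(track, min(track), max(track), max(track), min(track)))
```

Passing a "new min" that is larger than the "new max" is what makes the inversion in the last protocol example work: the smallest old value maps to the largest new value and vice versa.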

output

Outputs data items to text documents in selected data formats. The document will be wrapped in a so-called "Output Data" object and the contents of this can be saved to files. If MotifLab is run in CLI-mode (without the GUI), all Output Data objects that are created during the execution of a protocol script will automatically be saved to files after completion of the protocol (the filename will be the name of the data object with a suffix determined by the data format used). If no target Output data object is specified for the output operation, a new Output object will be created automatically and assigned a default name consisting of the prefix "Output" followed by an incremental number. If a target Output data object is specified and it already exists, the output will be appended to that object if possible. If it is not possible to append more text to this data object (because it is formatted in a data format that does not allow additional text to be appended, such as HTML-formats), the operation will end with an error. When Feature Datasets are output, the sequences will be output in the order they are currently sorted.

Graphics output
Output saved in Excel and HTML formats may include graphics, such as charts and motif logos. In Excel, these graphics will always be embedded in the file itself, but in HTML format you have a few options on how to output motif and module logos. Please refer to the HTML format documentation for more information.

Direct output
Version 2.0+ of MotifLab allows the output operation to be used to output literal text strings directly to output objects (but only within protocols). The format is: <Output object> = output "some text string...". The text string enclosed in double quotes can contain references to data objects of the form "{dataobject}" and the value of the referenced object will then be included in the output as explained in the documentation for the Template and TemplateHTML data formats (the same formatting-options for referenced objects are also available). The text string can also contain TABs, newlines, double quotes and backslashes if these are properly escaped as \t, \n, \" and \\ respectively.

Applies to:	Any data object (except Priors Generators and other Output data objects)
Returns:	Output Data

Name	Description
format	The data format to use for the output. The type of the data object will determine which formats are available.
output parameters	Each data format can have additional format specific parameter settings. See the documentation for the data format for more information.

# Outputs the DNA Sequence Dataset called 'DNA' in FASTA format (with default settings) to a new output object provided with a default name
output DNA in FASTA format

# Outputs regions from the TFBS track in GFF-format (with genomic rather than relative coordinates) to an Output data document called 'Sites' (or appends the output to this document if it already exists)
Sites = output TFBS in GFF format {Position="Genomic"}

# Outputs motifs in the motifcollection in a format with 4 rows and N columns (with columns separated by commas)
output MotifCollection3 in RawPSSM format {Orientation="Horizontal",Delimiter="Comma"}

# Outputs the string 'Hello world! The value of X is' followed by the actual value of the data object named X (v2.0+)
output "Hello world! The value of X is {X}!"
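
The escape and reference handling for direct output can be pictured with the following Python sketch. It is an illustration only: the function name and the dictionary of objects are hypothetical, the formatting options for referenced objects are not modeled, and leaving unknown references untouched is an assumption.

```python
import re

ESCAPES = {"t": "\t", "n": "\n", '"': '"', "\\": "\\"}


def expand(text, objects):
    """Resolve backslash escapes and {name} references in a direct-output string."""
    # Escapes are handled in one left-to-right pass so that an escaped
    # backslash followed by "n" stays a backslash plus the letter n.
    unescaped = re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(0)), text)
    # Assumption: references to unknown objects are left as they are.
    return re.sub(r"\{(\w+)\}", lambda m: str(objects.get(m.group(1), m.group(0))), unescaped)


message = expand(r"Hello world! The value of X is {X}!", {"X": 42})
```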

: Output Data, FASTA, GFF, EvidenceGFF, BED, WIG

physical

The "physical" operation estimates different physical properties of the DNA double helix based on local sequence composition and returns a Numeric Dataset containing a value for the selected property for each position. For each position in the sequence, the value of the physical property is estimated by examining the nucleotide composition within a window region around that position. Depending on the selected property, the resulting value is either derived directly from the base (or oligo) frequencies or it is estimated by summing up values based on a smaller sliding window (2 or 3 bases long) within the larger window region.

Applies to:	DNA Sequence Dataset
Returns:	Numeric Dataset

Name	Description
property	This parameter selects which physical property to estimate values for. Currently available properties are: AT-content, AT-skew, B-DNA twist, bendability, DNA bending-stiffness, DNA denaturation, duplex disrupt energy, duplex free energy, frequency (further specified by the "oligo" parameter below), GC-content, GC-skew, nucleosome position preference, propeller twist, protein-DNA twist, protein-induced deformability, stacking energy, Z-DNA stabilizing energy.
oligo	If the "property" parameter is "frequency", the "oligo" parameter specifies which oligomer pattern to determine the frequency for. The parameter value should be a string of letters (normally A,C,G,T or even N). E.g. if the "oligo" parameter is "A", the operation will calculate the frequency of "A" bases within the window and if the oligo is "CAG" the operation will calculate the local occurrence frequency of the oligomer "CAG".
window size	The size of the sliding window. This can be specified as a constant number, a Numeric Variable or a Sequence Numeric Map (in which case a different window size will be used for each sequence).
anchor	The anchor parameter specifies how the sliding window should be placed relative to the target position. "center": the window is placed so that the target position is in the center of the window. "start": the window is placed so that the target position is at the start of a window which extends downstream. "end": the window is placed so that the target position is at the end of a window which extends upstream.

: position condition

# Returns a new track where the value in each position reflects the local frequency of the dinucleotide 'AC' within a 50 bp region centered at that position
AC_frequency = physical property "frequency:AC" derived from DNA using window of size 50 with anchor at center
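
How the window anchor and the "frequency" property interact can be sketched in Python as shown below. This is an illustration under stated assumptions, not MotifLab's code: the function names are hypothetical, the placement of an even-sized centered window is a guess, "downstream" is taken to mean increasing index, and the frequency is assumed to be the number of oligo occurrences divided by the number of possible start positions in the window.

```python
def window_bounds(pos, size, anchor, seq_len):
    """Return the [start, end) window for a target position, clipped to the sequence."""
    if anchor == "start":
        start = pos
    elif anchor == "end":
        start = pos - size + 1
    elif anchor == "center":
        start = pos - size // 2      # assumption: how an even-sized window is centered
    else:
        raise ValueError(f"unknown anchor: {anchor}")
    return max(start, 0), min(start + size, seq_len)


def oligo_frequency(dna, oligo, size, anchor="center"):
    """Fraction of possible start positions in each window that begin the oligo."""
    dna, oligo = dna.upper(), oligo.upper()
    track = []
    for pos in range(len(dna)):
        start, end = window_bounds(pos, size, anchor, len(dna))
        slots = end - start - len(oligo) + 1          # start positions that fit
        hits = sum(dna.startswith(oligo, i) for i in range(start, start + max(slots, 0)))
        track.append(hits / slots if slots > 0 else 0.0)
    return track


track = oligo_frequency("ACACGT", "AC", 4, anchor="start")
```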

plant

The "plant" operation can be used to create artificial benchmark datasets with known TFBS regions to test the performance of motif or module discovery methods. The operation will take a DNA sequence (which can be real or artificial) as input, insert new motif sites at random locations in the sequence and return the updated DNA sequence along with a Region Dataset containing the planted sites. Either a single motif or module or a collection of up to five different (non-overlapping) motifs can be planted in each sequence according to specifications.

Applies to:	DNA Sequence Dataset
Returns:	both DNA Sequence Dataset and Region Dataset

Name	Description
motif or module	This parameter determines which single motif or module to plant in the DNA sequences. Alternatively, a Motif Collection containing up to 5 motifs can be selected here (these will then be planted independently of each other but non-overlapping).
Plant probability	A number between 0 and 1.0 specifying the probability that a motif/module will be planted in each sequence. The default value of 1.0 will plant one instance of the motif/module in each sequence whereas a value of e.g. 0.5 will only plant the motif/module in about half of the sequences.
Force plant	Sometimes it can be difficult to find a good location to plant the motif/module which conforms with the specified settings. If MotifLab is not able to find a good spot for the motif after several attempts, it will normally give up and skip implanting a motif site in that sequence. However, if the "force plant" parameter is set and all else has failed, MotifLab will instead select a position at random and insert the site there even if that might violate some of the other specified settings (such as the positional priors).
Min match	A number between 0 and 1.0 specifying the minimum percentage match to the motif required for an implanted TFBS sequence. Lower values mean that more degenerate motif instances can be implanted.
Max match	A number between 0 and 1.0 specifying the maximum percentage match to the motif required for an implanted TFBS sequence. This is usually 1.0 which means that a perfect motif match is allowed. If the value is set lower than 1.0, the motif instances will be forced to be degenerate. When a motif is to be inserted in a sequence, the DNA base to insert in each position is sampled according to the motif model (frequency matrix) so that the match score between the sampled TFBS sequence and the motif is between "min match" and "max match". Note, however, that it can be very difficult or even impossible to find a TFBS sequence with a match score between these bounds, so these two parameters are only used as guidelines.
Reverse probability	A value between 0 and 1.0 specifying the probability that a motif/module will be implanted on the reverse strand of the sequence. If the value is set to 0 all the motifs will be inserted on the direct strand, if the value is 1.0 all the motifs will be inserted on the reverse strand and if the value is 0.5 (default) the motif/module will be inserted on the direct strand in about half of the sequences and on the reverse strand in the others.
Use same pattern	Normally, the TFBS sequence to be implanted is sampled anew from the motif model for each sequence so that there can be some variation between the binding sites. However, if the "use same pattern" parameter is set, the TFBS sequence will only be sampled once and the exact same TFBS pattern will then be planted in all the sequences.
Positional prior	If this parameter is not set (default), MotifLab will select the location to implant the motif/module in the sequence uniformly at random. However, if a Numeric Dataset is provided as "positional priors", this track will be used as a frequency distribution from which to select the motif location.
Use for prior	If a "positional prior" track is selected, this parameter specifies how to make use of the priors track. The default setting "sum" implies that the probability of planting the motif at a given location in the sequence is determined by the sum of values for all positions in the priors track that fall within the TFBS site. If "relativeStartValue" is selected, the probability of planting the motif is determined by the priors value at the relative start of the TFBS region (value in most upstream position). If "startValue" is selected, the probability of planting the motif is determined by the priors value at the genomic start of the TFBS (value in the position with lowest genomic coordinate). If "every positive" is selected, the motif will be inserted at every position that has a positive value in the priors track. This can be useful if you want to manually specify where to plant the motifs. However, if the track contains several positive values for a sequence, the user is responsible for making sure that the positive values in the track are not located too close to each other, since that can lead to overlapping TFBS (where the newly planted TFBS will destroy the binding sequence of any previously planted TFBS at the same site).

# Plants motif sites for M00014 in the DNA track of about 80% of the sequences. The operation returns a new DNA track called 'plantedDNA' which contains these new motif sites and also a track called 'plantedTFBS' containing regions for the planted sites. The motif binding sequence (randomly sampled from the motif) will be the same for every TFBS instance
[plantedDNA,plantedTFBS] = plant M00014 in DNA {Plant probability=0.8,Use same pattern=true}
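
The core of the planting procedure (sampling a site from the motif's frequency matrix, choosing a strand and overwriting the sequence at a random location) can be sketched in Python as follows. This is a simplified illustration, not MotifLab's algorithm: all names are hypothetical, the "min match"/"max match" bounds and positional priors are not modeled, and it assumes that a planted site overwrites the existing bases so that the sequence length is preserved.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sample_site(matrix, rng):
    """Sample one binding-site sequence, column by column, from a frequency matrix."""
    return "".join(rng.choices("ACGT", weights=column)[0] for column in matrix)


def plant(dna, matrix, rng, plant_probability=1.0, reverse_probability=0.5):
    """Return (new_dna, region); region is None when nothing was planted."""
    if len(matrix) > len(dna) or rng.random() >= plant_probability:
        return dna, None
    site = sample_site(matrix, rng)
    strand = "-" if rng.random() < reverse_probability else "+"
    if strand == "-":
        site = site.translate(COMPLEMENT)[::-1]        # reverse complement
    start = rng.randrange(len(dna) - len(site) + 1)    # uniform location (no priors)
    new_dna = dna[:start] + site + dna[start + len(site):]
    return new_dna, (start, start + len(site) - 1, strand)


# Columns hold the frequencies of A, C, G and T; this motif always yields "ACG".
motif = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
planted_dna, planted_region = plant("T" * 20, motif, random.Random(0), reverse_probability=0.0)
```

Sampling a new site inside `plant` for every sequence corresponds to the default behaviour; sampling once outside the loop over sequences would mimic the "use same pattern" setting.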

predict

The predict operation can make use of trained Priors Generator objects to derive "positional priors" tracks where the value of each position in the track can be interpreted as a prior probability of observing a specific feature at that position. The feature which is predicted is already set in the Priors Generator and all tracks that the Priors Generator needs in order to predict the target feature must also be available in order to use the operation. These inputs are not explicitly declared but must have the same names and types as the original tracks used when training the Priors Generator. For example, if a Priors Generator was trained to predict the locations of transcription factor binding sites on the basis of three tracks named "Conservation", "DNaseHS" and "ChipSeq" respectively, the same three tracks must also be available in order to use the predict operation with this Priors Generator.

Returns:	Numeric Dataset

Name	Description
priors generator	The name of the Priors Generator object to use to predict the target feature.

# Uses the PriorsGenerator1 object to derive a new positional priors track based on a set of feature tracks
TFBS_prior = predict with PriorsGenerator1

: Priors Generator

prompt

The "prompt" operation can be used in protocol scripts to provide users with some control and allow them to interactively select new values for different data objects during the execution of the protocol. When a "prompt" command is encountered in the protocol, a dialog box will appear and ask the user to select a value for the data object. Note that the target data object must already exist (the prompt operation can not be used to create new data objects) but the object can be "empty". The current value of the data object will be used as the default value, and this value will be displayed to the user who can decide to keep the data object as it is or select a new value for it.

Applies to:	Any data object that can be edited (not Sequences, Analysis objects, Output objects or Priors Generators)
Returns:	The same object

Name	Description
message	An optional message which will be displayed to the user in the popup dialog.

constraints

This optional parameter was introduced in v2.0 to allow the prompt to constrain the values that can be selected for Numeric and Text Variables. The values can be limited to a specific set by explicitly listing the allowed values within curly braces, i.e. {value1,value2,...,valueN}.

If the target data object is a Text Variable and the braces contain a single entry which is the name of a Text Variable, the available options will be taken from this Text Variable (with each line representing a selectable value). If the braces are empty, the user is allowed to enter any single value for the Text Variable (but not a multi-line value).

If the target data object is a Numeric Variable, the list of allowed values can include references to other Numeric Variables as well as literal numbers. For Numeric Variables the values can also be constrained to be within a certain range by specifying the minimum and maximum values within brackets: [minimum:maximum]. An optional step argument can also be added: [minimum:maximum:step]. E.g. the range [0:30:5] will limit the allowed values to 0, 5, 10, 15, 20, 25 or 30. Instead of numbers, stars (*) can be used to denote that the range should be unlimited in one of the directions, e.g. the range [5:*] means that the value must be at least 5 but there is no maximum limit. It is also possible to use references to Numeric Variables instead of literal numbers, e.g. the range [1:Limit] means that the value must be at least one and at most equal to the current value of the Numeric Variable named 'Limit'.
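
The range syntax can be made concrete with a small Python sketch that parses a range constraint and checks whether a value satisfies it. The function names are hypothetical and the sketch is not taken from MotifLab; in particular, it assumes that steps are counted from the minimum value.

```python
import math


def parse_range(spec, variables=None):
    """Parse "[min:max]" or "[min:max:step]" into (minimum, maximum, step)."""
    variables = variables or {}

    def number(token, unlimited):
        token = token.strip()
        if token == "*":
            return unlimited
        # A token is either the name of a Numeric Variable or a literal number.
        return float(variables[token]) if token in variables else float(token)

    inner = spec[spec.index("[") + 1:spec.rindex("]")]   # ignores a trailing widget letter
    parts = inner.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"malformed range: {spec}")
    step = number(parts[2], None) if len(parts) == 3 else None
    return number(parts[0], -math.inf), number(parts[1], math.inf), step


def is_allowed(value, spec, variables=None):
    """Check a candidate value against a range constraint."""
    minimum, maximum, step = parse_range(spec, variables)
    if not minimum <= value <= maximum:
        return False
    if step is None or math.isinf(minimum):
        return True
    # Assumption: steps are counted from the minimum.
    steps = (value - minimum) / step
    return math.isclose(steps, round(steps), abs_tol=1e-9)


allowed = [v for v in range(0, 31) if is_allowed(v, "[0:30:5]")]
```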

Constrained values are usually presented in the prompt dialog using a drop-down menu (for value sets) or a spinner (for numeric ranges), but it is possible to suggest that a different type of GUI widget should be used by adding a single letter after the closing brace/bracket. Available options are:

M : Drop-down menu
L : List
R : Spinner (only for numeric ranges)
S : Slider (only for numeric ranges)
T : Textbox (for entering a single value for Text Variables)

# Displays a dialog which allows the user to select a new value for the Cutoff data object
prompt for Cutoff "Enter a threshold value"

# Displays a dialog which allows the user to select a new value between 1 and 100 for the Cutoff data object using a slider
prompt for Cutoff "Enter a threshold value" [1:100]S

# Displays a dialog which allows the user to select between the two values 'absolute' or 'relative' for TextVariable1 using a drop-down menu
prompt for TextVariable1 "Select scoring function" {"absolute","relative"}M

prune

The "prune" operation can be used to remove duplicate regions from a Region Dataset. These duplicates can either be regions that are exactly identical to another region in the same track or they can be overlapping regions for motifs that are considered to be similar to each other (and hence duplicate predictions of the same TF binding site). The operation searches for groups of duplicate overlapping regions and removes all but one of the regions in each group.

Applies to:	Region Dataset
Returns:	Region Dataset

Name	Description

remove

This parameter specifies which "similar" regions to prune from the dataset.

duplicates: Removes regions that are identical copies of another region in the track so that only one copy of each region remains
similar: Removes regions that have the same type and location (including orientation) as another region in the track. Only the region with the highest score is retained (v2.0.-2).
palindromes: Searches for pairs of palindromic motif occurrences (two overlapping regions for the same motif on opposite strands) and removes one of these.
alternatives: Searches for overlapping regions for motifs that are considered to be 'alternatives' of each other (either according to annotations in the motifs themselves or based on clustering in a Motif Partition) and prunes the overlapping regions so that only one remains. Note that two alternative overlapping regions are only considered to be duplicates of each other if they occur at an optimal alignment relative to each other.

keep

This parameter applies when the "remove" parameter is either "palindromes" or "alternatives" and dictates which of the regions to retain and which to remove.

For "palindromes" this parameter can have one of the following values:

top scoring: Keep the region with the highest score and remove the other
direct strand: Keep the region that is located on the direct strand (genomic orientation) and remove the one on the other strand
relative strand: Keep the region that has the same orientation as the sequence and remove the region on the opposite strand

For "alternatives" this parameter can have one of the following values:

top scoring: Keep the region with the highest score and remove the others
highest IC: Keep the region corresponding to the motif with the highest information content
first sorted name: Keep the region with the motif name which occurs first in the list when the names of the motifs for the overlapping regions are sorted alphabetically

partition

If the "remove" mode is "alternatives" this parameter can be used to specify a Motif Partition that decides which motifs are to be considered alternatives of each other (i.e. motifs in the same cluster). If this parameter is left unspecified, the definition of alternative motifs will be taken from annotations in the motif data objects themselves (as seen in the "Alternatives" tab in Motif dialogs).

# Removes identical duplicate regions in the TFBS track so that only one copy of each region remains
prune TFBS remove "duplicates"

# Finds pairs of palindromic motif occurrences in the TFBS track (where two occurrences of the same motif are found overlapping each other but on different strands) and removes the one that has the lowest score of each pair
prune TFBS remove "palindromes" keep "top scoring"

# Finds overlapping regions in the TFBS track for motifs that are annotated as alternatives of each other and removes all the duplicates so that only the region corresponding to the motif with the highest information content remains
prune TFBS remove "alternatives" keep "highest IC"

# Finds overlapping regions in the TFBS track for motifs that are in the same cluster in the Motif Partition named 'AlternativePartition' and removes all the duplicates so that only the region with the highest score remains
prune TFBS remove "alternatives" from AlternativePartition keep "top scoring"
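
The "duplicates" and "palindromes" modes can be sketched in Python as shown below. This is a simplified illustration with hypothetical names, not MotifLab's implementation: palindromic pairs are assumed to occupy exactly the same location (the real operation works on overlapping regions), only the "top scoring" keep rule is shown, and ties are broken arbitrarily in favor of the "+" strand.

```python
from collections import namedtuple

Region = namedtuple("Region", "type start end strand score")


def remove_duplicates(regions):
    """Keep the first copy of each region that is identical in every field."""
    seen, kept = set(), []
    for region in regions:
        if region not in seen:
            seen.add(region)
            kept.append(region)
    return kept


def loses_to(region, other):
    """True if `other` is the palindromic partner of `region` and outranks it."""
    same_site = (region.type, region.start, region.end) == (other.type, other.start, other.end)
    if not same_site or region.strand == other.strand:
        return False
    # Ties are broken in favor of the "+" strand (an arbitrary choice for this sketch).
    return (other.score, other.strand == "+") > (region.score, region.strand == "+")


def remove_palindromes(regions):
    """Keep only the top scoring region of each palindromic pair."""
    return [r for r in regions if not any(loses_to(r, o) for o in regions)]


a = Region("M1", 10, 15, "+", 0.9)
b = Region("M1", 10, 15, "-", 0.7)
c = Region("M2", 30, 37, "+", 0.5)
pruned = remove_palindromes(remove_duplicates([a, a, b, c]))
```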

rank

The "rank" operation will return a new Numeric Map where the values correspond to the rank order of the entries in another Numeric Map, a similar numeric column from an Analysis, or internal numeric properties of data objects. The rank order can also be based on a weighted combination of several such properties. In that case, each property is first ranked on its own and the rank-values are multiplied by the weight for that property (if specified). The ranks are then summed up across all properties and a final rank order is derived from these values (in ascending order). (Note that the entries are not ranked first by the first value, and then by the second value to break ties etc.) Entries that have the same value will receive the same rank. For example, a map with entries "A=3,B=5,C=2,D=13" will be ranked (ascending) as "A=2,B=3,C=1,D=4", and a map with entries "A=3,B=3,C=2,D=13" will be ranked (ascending) as "A=2,B=2,C=1,D=4" (Note that D is still ranked as number 4 and "rank 3" has been skipped).

Applies to:	Numeric Map, Analysis and Data Type (Motif, Module or Sequence)
Returns:	Numeric Map

Name	Description
sort direction	"ascending" (default) or "descending". Controls how the values for the property should be sorted before determining the rank.
property	The "property" argument specifies which values to use from the source object. If the source is a Numeric Map, this argument is not applicable since only the Map values can be used. If the source is an Analysis object, this argument should specify which numeric column to use from the Analysis. If the source is the type of a data object ("Motif","Module" or "Sequence"), this argument should specify which internal numeric property to use.
value	Specifies how each property should be weighted if there is more than one. If no weights are specified, each property will be weighted equally (the default weight is 1.0). Note that a higher weight will punish the property, since the ranks for that property will be multiplied by the weight and lower values are considered better. More important properties should therefore be given lower weights than less important properties.

# Returns a Motif Numeric Map where the value for each motif corresponds to its rank when the map is sorted in ascending order
rank Motif_IC_map

# Returns a Motif Numeric Map where the motifs are ranked by descending size
rank descending "size" from Motif

# Returns a Motif Numeric Map where the motifs are ranked by a weighted combination of three properties, the p-value for a motif overrepresentation analysis, the average value of a Numeric track inside the motif region across all binding sites and the kurtosis of the motif position distribution across all sequences. The first property is considered more important for the final rank than the last two.
rank ascending "p-value" from MotifOccurrenceAnalysis, descending "average" from CompareMotifsToNumericTrackAnalysis with weight=2.0, descending "Kurtosis" from MotifPositionDistributionAnalysis with weight=2.0
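
The ranking rule described for this operation (tied entries share a rank, the following ranks are skipped, and weighted ranks are summed before a final ranking) can be sketched in Python. The function names and the motif names in the usage example are hypothetical; the sketch follows the description in this section rather than MotifLab's source code.

```python
def rank(values, descending=False):
    """Tied entries share a rank and the ranks that follow a tie are skipped."""
    ordered = sorted(values.values(), reverse=descending)
    # list.index returns the first position of a value, so tied entries share it.
    return {key: ordered.index(value) + 1 for key, value in values.items()}


def combined_rank(properties):
    """properties: list of (values, descending, weight) tuples over the same keys."""
    totals = {}
    for values, descending, weight in properties:
        for key, r in rank(values, descending).items():
            totals[key] = totals.get(key, 0.0) + weight * r
    return rank(totals)          # final order: ascending weighted rank sum


p_values = {"m1": 0.01, "m2": 0.05, "m3": 0.2}
sizes = {"m1": 8, "m2": 12, "m3": 10}
final = combined_rank([(p_values, False, 1.0), (sizes, True, 2.0)])
```

In the usage example the weighted rank sums are 7, 4 and 7 for m1, m2 and m3, so m2 is ranked first and the other two motifs share rank 2.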

replace

The "replace" operation (v2.0) replaces portions of text in a Text Variable or a textual property of a Region Dataset. The basic mode of this operation will search the body of text for a specified search term (which can be in the form of a regular expression) and replace all instances matching this search term with a given replacement text (which can contain backreferences to capture groups in the search expression).

The operation can also be used to search Text Variables for instances of macro names and replace these with their corresponding definitions ("replace macro") or to add new lines to the beginning or end of a Text Variable ("replace beginning/end").

Applies to:	Text Variable and Region Dataset
Returns:	Text Variable or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
search expression	This parameter specifies the text expression(s) to search for in the source object. Any matching instances of this expression will be replaced with the text provided by the "replacement expression" parameter. The parameter can either be a simple literal search term or it can take the form of a regular expression defining a more complex search pattern. Regular expressions should follow the syntax used by the JAVA programming language as described below. The search expression can also be provided in the form of a Text Variable or Map.

If the search expression is a Text Variable, the "replacement expression" parameter must also be a Text Variable with the same number of lines. The two Text Variables then function somewhat like maps and portions of text that match the expression at line n in the first Text Variable will be replaced by the corresponding replacement expression at line n in the second Text Variable (or if you have a single Text Variable with two columns you can use this same Text Variable as both search and replacement expression and MotifLab will automatically use the first column as search expression and the second as replacement). Using Text Variables to specify search/replacement expressions allows you to search for multiple expressions at the same time. If the source data object is also a Text Variable each line of the text will be transformed by applying every search expression in turn, but if the source is a Region Dataset only the first matching search expression will be used for each region.

If the search expression is a Map, the replacement expression parameter need not be defined as this will be based on the corresponding map values. If the source data object is a Text Variable, each line of the text will be transformed by replacing every instance of each key from the map with its corresponding map value (the entries will be processed in random order). If the source object is a Region Dataset, the current value of the specified property will be used as a key to retrieve the corresponding value from the map and the property value will then be replaced with the new value from the map.

Regular expressions
Some commonly used regular expression rules include:

A vertical bar "`|`" can be used to separate alternative matching expressions
Parentheses can be used to group characters together
A dot "`.`" matches any single character
A "`+`" plus sign after a character or group means that this character/group should match one or more times
A "`*`" star sign after a character or group means that this character/group should match zero or more times
A "`?`" sign after a character or group means that this character/group should match zero or one times (i.e. it is optional)
Two numbers in braces "`{n,m}`" directly behind a character or group means that the character/group should match between n and m times
A character class can be defined by listing characters in brackets and will match any single character in the class. E.g. the class "`[abc]`" will match either `a`, `b` or `c`
You can negate a character class by placing a "`^`" directly after the first bracket. E.g. the character class "`[^abc]`" will match any single character besides `a`, `b` and `c`
The special character class denoted by "`\d`" will match a single digit character and the complementary class "`\D`" will match a single non-digit character
The special character class denoted by "`\w`" will match a single "word" character (digit, letter or underscore) and the complementary class "`\W`" will match a single non-word character
The special character class denoted by "`\s`" will match a single whitespace character and the complementary class "`\S`" will match a single non-whitespace character

The following characters have special meaning within regular expressions and must be escaped with a backslash in front if you want to refer to them in a literal sense: `\.*+[]{}()?^$|`
For more information about the syntax of regular expressions in JAVA, consult a tutorial on JAVA regular expressions.

Examples:
The expression "`cat|dogs?`" will match either cat, dog or dogs.
The expression "`b[aie]ts`" will match either bats, bits or bets.
The expression "`Go{2,5}gle`" will match either Google, Gooogle, Goooogle or Gooooogle.
The expression "`M.\d+`" will match any word beginning with "M" followed by any character and then a succession of digits.
The expression "`Hip(hop)+opotamus`" will match words beginning with "Hip", followed by "hop" repeated one or more times and then ending with "opotamus".
replacement expression	This parameter defines the text that should replace matching instances of the search expression in the source object. If the search expression is in the form of a regular expression containing "capture groups" (groups of characters within parentheses), the replacement expression can contain backreferences to these capture groups of the form "`$n`" where n is the number of a capture group. Example: Consider the search expression "`([A-Z])\$(\S+?)(_\w+)?`" that will match an uppercase letter followed by a dollar sign, a number of non-whitespace characters and optionally ending with a suffix consisting of an underscore followed by numbers or letters. This expression contains three capture groups: the first capturing the single uppercase letter at the beginning, the second capturing the middle part following the dollar sign and the last capturing the suffix starting with the underscore. If the replacement expression is given as "`$2:$1`", all matching instances of the search expression will be replaced by a new text consisting of the middle part of the matching text followed by a colon and then the beginning uppercase letter. Hence, the match "`V$VMYB_01`" will be replaced with "`VMYB:V`" and the match "`F$ABF_C`" will be replaced with "`ABF:F`".
property	If the source data object is a Region Dataset, this parameter specifies the textual property of regions that will be affected by the operation. If left unspecified, it defaults to the "type" property.

: region condition

# Replaces all instances of “cats” in TextVariable1 with “dogs” and stores the result in TextVariable2
TextVariable2 = replace "cats" with "dogs" in TextVariable1

# For RepeatMasker regions whose type property matches “Alu” followed by a suffix, the operation will place the suffix at the beginning instead
replace "Alu(.+)" with "$1Alu" in RepeatMasker property "type"

# Goes through every RepeatMasker region and looks up its type property in the NameMap map. Then it replaces the type of the region with the corresponding value from the map
replace NameMap in RepeatMasker property "type"

# Replaces all instances of recognized macro names in TextVariable1 with their corresponding macro definitions
replace macros in TextVariable1

# Adds a line of text to the beginning of TextVariable1. (It does not actually replace any existing text)
replace beginning with "new header text" in TextVariable1

# Adds a line of text to the end of TextVariable1. (It does not actually replace any existing text)
replace end with "new footer text" in TextVariable1
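
The basic search-and-replace mode and the map-based mode can be tried out with the Python sketch below. Note the assumptions: Python's `re` module stands in for the JAVA regular expression engine that MotifLab uses (the simple patterns shown here behave the same in both), the helper names are hypothetical, and leaving a property value unchanged when it has no entry in the map is a guess.

```python
import re


def java_to_python_replacement(replacement):
    """Rewrite JAVA-style $n backreferences into the form used by Python's re module."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def replace_all(text, search, replacement):
    """Replace every match of the search expression in a body of text."""
    return re.sub(search, java_to_python_replacement(replacement), text)


def replace_with_map(value, mapping):
    """Region Dataset mode: look the whole property value up in a map."""
    return mapping.get(value, value)   # assumption: unmapped values are kept as they are


text_result = replace_all("I like cats", "cats", "dogs")
type_result = replace_all("AluSx", "Alu(.+)", "$1Alu")
mapped = replace_with_map("AluSx", {"AluSx": "SINE"})
```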

score

The "score" operation uses a basic motif scanning algorithm to compare a single motif model (or a collection of motifs) against a DNA sequence, but rather than returning a track containing matching regions, the operation returns a numeric track with the motif match score for each position. If the operation is used with a collection of motifs rather than a single motif, all the motifs in the collection will be scanned against the DNA sequence and the highest match score obtained for each position will be returned.

Applies to:	DNA Sequence Dataset
Returns:	Numeric Dataset

Name	Description
motif	A single motif or collection of motifs to scan against the DNA sequence
normalization mode	This parameter can either be "absolute" which means that the unnormalized match scores are returned or "relative" which will return match values between 0 (worst match) and 1 (best match) based on the lowest and highest achievable match scores according to the motif model.
score mode	If this parameter is set to "raw" the match score will be calculated by summing up the relative frequency values from the motif matrix for the matching base in each position. However, if the score mode is set to "log-likelihood" the frequency of the matching base according to the motif matrix will be compared (using log-likelihood) against the expected frequency of that base according to a chosen (zero order) background model.
strand	This parameter controls which strand of the DNA sequence to scan. Valid options are "direct" (genomic direct strand), "reverse" (genomic reverse strand), "relative" (strand corresponding to the orientation of the sequence), "opposite" (strand opposite of the sequence orientation) and "both". If "both" strands are considered, the motif will be matched against the sequence in both orientations at every position and the resulting score for each position will be based on the highest scoring orientation.
background	If the "score mode" parameter is set to "log-likelihood" the frequency of a base in a position according to the motif model will be compared against the expected frequency according to the chosen background model. Note that only a zero-order model is used, even if the chosen background model might be of higher order. If no explicit background model is specified, a uniform model will be used.

: position condition

# Scans the motif M00014 against the DNA sequence on the relative strand and returns a match score value between 0 and 1.0 for each position based on the motif's match against the sequence
scoreTrack = score DNA with M00014 using relative raw scores on relative strand
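
The combination of the "raw" score mode and "relative" normalization can be illustrated with a short Python sketch. This is not MotifLab's own implementation: the function names, the dictionary-based matrix layout and the linear rescaling between the lowest and highest achievable scores are assumptions made for illustration, and only a single strand is scanned.

```python
def raw_score(matrix, site):
    """Sum the relative frequency of the observed base in each motif column."""
    return sum(column[base] for column, base in zip(matrix, site))


def relative_score(matrix, site):
    """Rescale the raw score to [0, 1] using the lowest and highest achievable scores."""
    lowest = sum(min(column.values()) for column in matrix)
    highest = sum(max(column.values()) for column in matrix)
    if highest == lowest:  # flat motif: every site scores the same
        return 1.0
    return (raw_score(matrix, site) - lowest) / (highest - lowest)


def score_track(matrix, dna, relative=True):
    """Return one score per position where the whole motif fits in the sequence."""
    width = len(matrix)
    scorer = relative_score if relative else raw_score
    return [scorer(matrix, dna[i:i + width]) for i in range(len(dna) - width + 1)]


# Hypothetical two-column motif that prefers "AT" (not a real MotifLab motif)
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
track = score_track(motif, "CCATG")
```

In this sketch the position where "AT" starts receives the best relative score, while windows made up only of the least frequent bases receive the worst.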

: motifScanning

search

This operation can be used to search DNA sequences for occurrences of a given DNA sequence pattern (or multiple patterns), specified as either regular expressions (in JAVA syntax) or as IUPAC consensus patterns. The search pattern can be a literal string enclosed in double quotes or the name of a Text Variable, single Motif or a Motif Collection (without quotes). When searching for a Motif or Motif Collection, the operation will search for the "consensus sequence" representation of the Motif (or all the motifs in the collection). The operation can also be used to search for occurrences of tandem or inverted repeats (two identical DNA patterns that occur close to each other in the DNA sequence). Constraints can be placed on the size of the two halfsites and the size of the gap between them.

Applies to:	DNA Sequence Dataset
Returns:	Region Dataset

Name	Description
search expression	The DNA expression to search for. This can either be a literal expression enclosed in double quotes or the name of a Text Variable, a Motif or a Motif Collection. A literal expression can be a regular expression (in Java syntax), an IUPAC consensus sequence or just a plain DNA sequence. If the expression is the name of a Motif or Motif Collection, the search expression used will be the IUPAC consensus string for the motif (or for all motifs in the collection). If the expression is the name of a Text Variable, the search operation will search for all expressions listed in the Text Variable in turn (multiple expressions can be specified with one on each line).
repeat type	The type of repeat to search for (when searching for repeats). This can be either "direct" (tandem repeats on the same strand) or "inverted" (the two halfsites should have opposite orientations).
halfsite size	The size range to consider for the two halfsites when searching for tandem or inverted repeats. Specified as [min size, max size].
gap size	The size range to consider for the gap between the two halfsites when searching for tandem or inverted repeats. Specified as [min size, max size].
report	Controls how to define and return a "match" region when searching for tandem or inverted repeats. The allowed values for this setting are "halfsites" and "full". If "report full" is chosen, each match to a tandem or inverted repeat will return a single region covering both halfsites as well as the gap between them. If "report halfsites" is chosen, each match to a tandem or inverted repeat will return two regions (one for each halfsite).
strand orientation	Specifies which strand(s) of the sequence to search for the pattern in. This could be either "both strands", "direct strand" (relative to genomic orientation), "reverse strand" (relative to genomic orientation), "relative strand" (the same orientation as the sequence), or "opposite strand" (the strand opposite to the orientation of the sequence). The strand orientation setting is only applicable when searching for given expressions, not when searching for tandem/inverted repeats.
mismatches	The maximum number of positions that are allowed to deviate from the search pattern before a sequence region is no longer considered a match. Mismatches are not allowed for regular expression search patterns, only for constant expressions. For example: if no mismatches are allowed, the search pattern "CAG" will match only "CAG". With one mismatch allowed, "CAG" will match "CAG" but also "AAG", "GAG", "TAG", "CCG", "CGG", "CTG", "CAA", "CAC" and "CAT".
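
The effect of the "mismatches" setting can be sketched in Python as a sliding comparison that counts deviating positions. The function name and the return format are hypothetical, and the sketch only handles plain DNA patterns on a single strand (no IUPAC codes or regular expressions).

```python
def search_with_mismatches(dna, pattern, max_mismatches=0):
    """Return (start, matched_text) for every window within the mismatch limit."""
    hits = []
    for start in range(len(dna) - len(pattern) + 1):
        window = dna[start:start + len(pattern)]
        # Count the positions where the window deviates from the pattern
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window))
    return hits


exact = search_with_mismatches("TTCAGTTCAT", "CAG")
relaxed = search_with_mismatches("TTCAGTTCAT", "CAG", max_mismatches=1)
```

With no mismatches allowed only the "CAG" site is found, whereas allowing one mismatch also reports the "CAT" site further downstream.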

# Searches the DNA sequence for the pattern 'CACGTG' on both strands and returns a track containing the matching regions
search DNA for "CACGTG"

# Searches the DNA sequence for the patterns 'CAC' and 'GTG' separated by 2 to 3 arbitrary nucleotides on the direct strand
search DNA on direct strand for "CACn{2,3}GTG"

# Searches the DNA sequences for matches to all motifs in the collection. Each reported binding site is allowed to deviate from the motif consensus in up to two positions
search DNA for TRANSFAC_Public with 2 mismatches

# Searches the DNA sequences for inverted repeats separated by 1 to 5 bases. Each halfsite should be between 4 and 7 bases long and the returned track will contain these halfsites
search DNA for inverted repeats {halfsite=[4,7], gap=[1,5], report=halfsites}
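
As an illustration of the repeat search and the "report" setting, the following Python sketch finds inverted repeats by brute force. It assumes that an inverted repeat means the second halfsite is the reverse complement of the first; the function names and the coordinate format (0-based, inclusive) are hypothetical and MotifLab's actual search may work differently.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a plain A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(dna, halfsite=(4, 7), gap=(1, 5), report="full"):
    """Return (start, end) regions for every inverted repeat within the size limits."""
    regions = []
    for size in range(halfsite[0], halfsite[1] + 1):
        for spacer in range(gap[0], gap[1] + 1):
            total = 2 * size + spacer
            for start in range(len(dna) - total + 1):
                left = dna[start:start + size]
                right_start = start + size + spacer
                right = dna[right_start:right_start + size]
                if right != reverse_complement(left):
                    continue
                if report == "full":
                    # One region covering both halfsites and the gap
                    regions.append((start, start + total - 1))
                else:
                    # Two regions, one for each halfsite
                    regions.append((start, start + size - 1))
                    regions.append((right_start, right_start + size - 1))
    return regions


sequence = "TTGACCAAAGGTCTT"  # GACC + 3 bp gap + GGTC (reverse complement of GACC)
full = find_inverted_repeats(sequence, halfsite=(4, 4), gap=(3, 3))
halves = find_inverted_repeats(sequence, halfsite=(4, 4), gap=(3, 3), report="halfsites")
```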

: motifScanning

set

The "set" operation is an assignment operator which can be used to set the value of a numeric data object to a new specified value. For Numeric Datasets the operation will be applied to all positions in the sequences, and for Numeric Maps it will be applied to all entries in the map (including the "default" value). The operation can also be applied to Region Datasets to set the value of numeric properties or text properties. By default the operation will be applied to the "score" property of the regions unless a different property is specified.

Applies to:	Numeric Dataset, Numeric Map, Numeric Variable and Region Dataset
Returns:	Numeric Dataset, Numeric Map, Numeric Variable or Region Dataset (The type of the returned object will depend on the source object)

Name	Description
property	Specifies which property of the data object to assign the value to. For Numeric Datasets, Numeric Maps and Numeric Variables this will simply be the "value", but for Region Datasets you can select which property of the regions to assign to (the default is "score"). Note that user-defined region properties will not be shown in the drop-down list in the operation dialog, but it is still possible to type in the name of other properties.
value	Specifies the new value for the assignment. If the value is a literal number or a Numeric Variable, each potential value in the source object will be set to this same value. If the source object is a Numeric Dataset and the "value" is also a Numeric Dataset, the value of each position in the source will be set to the value in the same position in the "value" dataset. If the source is a Numeric Dataset and the "value" is a Sequence Numeric Map, the values of all positions in each sequence are set to the value for that sequence in the map (so each sequence is potentially assigned different values). If both the source and "value" are Numeric Maps of the same type, the entries in the source map will be set to the corresponding values in the "value" map. If the source is a Region Dataset and the "value" is a Numeric Dataset, the region property can be set to a value derived from the values in the Numeric Dataset which are covered by the region. This value could be e.g. the minimum, maximum, average, median or sum of all values within the region or it could be the value of a single position at the center, start or end of the region. Here "start" and "end" refer to the lowest and highest genomic coordinates of the region with respect to the direct strand. The "relativeStart" and "relativeEnd" positions refer to the most upstream and most downstream position in the region when viewed relative to the orientation of the sequence, and the "regionStart" and "regionEnd" values refer to the most upstream and most downstream positions relative to the orientation of the region itself. If the region is a "motif" region, it is also possible to use "weighted average" or "weighted sum" where the values in each position are weighted by the information content of the motif in the corresponding position. If the operation is applied to a "text" property of a Region Dataset, the "value" should either be a literal string enclosed in double quotes or a Text Variable. The text property will then be set to the provided text (multiple values will be comma separated).

# Assigns the data object X the value 10. If X is a Numeric Dataset/Numeric Map/Region Dataset, all positions/entries/regions in the data object will be set to 10.
set X to 10

# Returns a new track where each position contains the highest value in that position between the two tracks
newNumericTrack = set Track1 to Track2 where Track2 > Track1

# Copies the contents of Map2 into Map1
set Map1 to Map2

# Sets the 'score' property of each region in the track to the average value of the NumericTrack within the region
set RegionTrack[score] to average NumericTrack

# Assigns the value 'one,two,three' to the text-property 'numbers' for all regions in the track
set RegionTrack[numbers] to "one,two,three"
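
How a region property can be derived from the values of a Numeric Dataset covered by the region is sketched below in Python. The dictionary-based region representation, the function names and the use of sequence-relative, inclusive coordinates are assumptions made for this illustration only.

```python
from statistics import mean, median

# Summary functions named after the options described for the "value" parameter
FUNCTIONS = {"minimum": min, "maximum": max, "average": mean, "median": median, "sum": sum}


def region_value(track, start, end, function="average"):
    """Summarize the track values covered by a region (inclusive coordinates)."""
    return FUNCTIONS[function](track[start:end + 1])


def set_region_property(regions, track, prop="score", function="average"):
    """Assign the summarized track value to the chosen property of every region."""
    for region in regions:
        region[prop] = region_value(track, region["start"], region["end"], function)
    return regions


numeric_track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
region_track = [{"start": 1, "end": 3}, {"start": 4, "end": 5}]
set_region_property(region_track, numeric_track)  # like: set RegionTrack[score] to average NumericTrack
```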

: increase, decrease, multiply, divide

split_sequences

This operation (introduced in MotifLab v2) can take an existing set of sequences and derive a new set of sequences based on subsegments of the originals. The original sequences can be kept together with the new sequences or optionally be deleted. The subsegments on which to base the new sequences are taken from the locations of regions in a specified region track. Each region in this track will give rise to one new sequence, so if two regions are overlapping they will result in two overlapping sequences. The new sequences will have names of the form "XXX_n" where XXX is the name of the original sequence and n is an incremental number starting at 1 for each original sequence.

Note that the new sequences are not allowed to extend beyond the edges of the original sequences even if the regions they are based on do that.
For example, if you have a sequence spanning the [-1000,+200] region around the TSS of a gene which is 2000bp long (thus extending 1800bp further downstream of the original sequence) and you use split_sequences to create new sequences based on the gene annotation track, the new sequence location will be the intersection of the old sequence and the gene region, meaning the new sequence will correspond to the 200bp region starting at the gene TSS and extending downstream to the end of the original sequence. The gene region is kept at its original length, however, and is allowed to extend past the edge also in the new sequence.

The operation will return a sequence partition object where each newly created sequence is assigned to a cluster named after the original sequence it was based on. Old sequences not created by split_sequences will not be assigned to any cluster in the partition.

The result of applying split_sequences is usually some form of cropping of the original sequences (and also all associated feature tracks) so in some ways it is similar to the crop_sequences operation. The difference between this operation and crop_sequences is that the latter only modifies the original sequences whereas split_sequences creates new sequences. If your sequences contain exactly one region each, the result of the two operations will be (almost) the same. However, if you have a sequence containing two regions, crop_sequences will crop the original sequence so that it begins at the start of the first region and ends at the end of the second region, whereas split_sequences will create two new sequences where each is cropped to match one of the regions in the original sequence.

The split_sequences operation is the only exception to the rule that new sequences cannot be created after feature datasets have been added. The reason for having this rule is that the feature tracks would normally be undefined within the new sequences. However, since the sequences created by split_sequences are based on subsegments of existing sequences, all the necessary feature data for the new sequences will already be present.

Applies to:	Sequence Collection
Returns:	Sequence Partition

Name	Description
Region track	This argument specifies the track whose regions the new sequences shall be based on
Delete original sequences	If this flag is set, the set of newly created sequences will completely replace the old sequence set. If not set, the original sequences will be kept and the new sequences will just be added to the current set.

# Creates a new set of sequences based on gene regions from the EnsemblGenes annotation track. The original sequences are kept, so the new sequences are added to the current set
SequencePartition1 = split_sequences based on EnsemblGenes

# Creates a totally new set of sequences where each sequence corresponds to a single binding site in the TFBS track. The original sequences are discarded.
SequencePartition2 = split_sequences based on TFBS. Delete original sequences
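
The naming and clipping rules for the new sequences can be sketched in Python as follows. The function name, the tuple-based representation and the choice to skip regions that lie entirely outside the original sequence are assumptions for illustration, not a description of MotifLab's internals.

```python
def split_sequence(name, seq_start, seq_end, regions):
    """Derive new sequences from regions, clipped to the original sequence.

    Coordinates are genomic and inclusive. Returns the new sequences and a
    partition mapping the original sequence name to the new sequence names.
    """
    new_sequences = []
    for region_start, region_end in regions:
        # New sequences may not extend beyond the edges of the original sequence
        start = max(seq_start, region_start)
        end = min(seq_end, region_end)
        if start > end:
            continue  # assumption: regions entirely outside the sequence are skipped
        new_name = f"{name}_{len(new_sequences) + 1}"
        new_sequences.append((new_name, start, end))
    partition = {name: [new_name for new_name, _, _ in new_sequences]}
    return new_sequences, partition


# Sequence covering [-1000,+200] around a TSS at genomic position 10001,
# and a 2000 bp gene region starting at the TSS
created, clusters = split_sequence("GeneA", 9001, 10200, [(10001, 12000)])
overlapping, _ = split_sequence("GeneB", 1, 100, [(10, 50), (40, 80)])
```

In the first call the new sequence is the 200 bp intersection of the original sequence and the gene region. In the second call the two overlapping regions give rise to two overlapping sequences.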

statistic

Calculates a statistic for each sequence in a dataset and returns a Sequence Numeric Map containing the results for each sequence.

Applies to:	DNA Sequence Dataset, Numeric Dataset and Region Dataset
Returns:	Sequence Numeric Map

Name

Description

function

The statistic function to be calculated. The types of statistics available depend on the input data track.

DNA Sequence Dataset statistics

GC-content: The GC-content of the sequence
A-count: The number of 'A' bases in the sequence
C-count: The number of 'C' bases in the sequence
G-count: The number of 'G' bases in the sequence
T-count: The number of 'T' bases in the sequence
A-frequency: The number of 'A' bases in the sequence divided by the total number of positions (matching the current condition)
C-frequency: The number of 'C' bases in the sequence divided by the total number of positions (matching the current condition)
G-frequency: The number of 'G' bases in the sequence divided by the total number of positions (matching the current condition)
T-frequency: The number of 'T' bases in the sequence divided by the total number of positions (matching the current condition)
Unknown-count: The number of unknown bases in the sequence (not A, C, G or T)
Unknown-frequency: The number of unknown bases in the sequence (not A, C, G or T) divided by the total number of positions (matching the current condition)
base count: the total number of bases in the sequence (matching the current condition)

Numeric Dataset statistics

Minimum value: the lowest value in the sequence
Maximum value: the highest value in the sequence
Extreme value: the value with the largest magnitude (absolute value) in the sequence (added in MotifLab v2).
Average value: the average value across all positions in the sequence
Sum values: the total sum of all values across all positions in the sequence
base count: the total number of bases in the sequence (matching the current condition)

Region Dataset statistics

Minimum score: The score of the region that has the lowest score in the sequence
Maximum score: The score of the region that has the highest score in the sequence
Extreme score: The score of the region that has the highest absolute value in the sequence (added in MotifLab v2).
Average score: The average score across all regions in the sequence
Sum scores: The total sum of scores across all regions in the sequence
Region count: The number of regions in the sequence
Region base count: The number of bases in the sequence that are covered by regions.
Note that this number can be smaller than the sum of the region lengths if regions are overlapping, since each base position is only counted once.

strand orientation

Specifies which strand the statistic should be applied to for DNA Sequence Datasets. This could be either "direct strand" (relative to genomic orientation), "reverse strand" (relative to genomic orientation), "relative strand" (the same orientation as the sequence), or "opposite strand" (the strand opposite to the orientation of the sequence).

: position condition, region condition or subset condition

# Returns the highest value within the 'Conservation' track for each sequence
statistic "maximum value" in Conservation

# Returns the number of regions in each sequence of the track
statistic "region count" in TFBS

# Returns the number of bases within known repeat regions in each sequence
statistic "region base count" in RepeatMasker

# Returns the GC-content inside the CpG-islands in each sequence
statistic "GC-content" in DNA where inside CpG-islands
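
Two of the statistics described above are easy to misread, so a small Python sketch may help: "Region base count" counts each covered position only once, and "Extreme value" returns the value with the largest magnitude while keeping its sign. The function names and the inclusive (start, end) region tuples are hypothetical.

```python
def region_base_count(regions):
    """Count positions covered by at least one region (inclusive coordinates)."""
    covered = set()
    for start, end in regions:
        covered.update(range(start, end + 1))
    return len(covered)


def extreme_value(values):
    """Return the value with the largest absolute value, sign preserved."""
    return max(values, key=abs)


regions = [(0, 4), (3, 7)]  # two overlapping regions of 5 bases each
covered_bases = region_base_count(regions)
summed_lengths = sum(end - start + 1 for start, end in regions)
extreme = extreme_value([-5.0, 3.0, 4.0])
```

Here the summed region lengths exceed the region base count because the two regions share two positions.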

threshold

Assigns all numeric values in a data object that are equal to or above a specified cutoff threshold a new value and those below the cutoff a different value. For Numeric Datasets the operation will be applied to every position in all sequences, for Region Datasets the operation will be applied to the score-property of every region, and for Numeric Maps and Expression Profiles the operation will be applied to every value in the Map/Profile.

Applies to:	Numeric Dataset, Region Dataset, Numeric Map and Expression Profile
Returns:	Numeric Dataset, Region Dataset, Numeric Map or Expression Profile (The type of the returned object will depend on the source object)

Name	Description
cutoff	The cutoff threshold which will divide all numeric values in the source object into two groups: those equal to or above the cutoff and those below the cutoff. The value can be specified as a numeric constant, a Numeric Variable or Sequence Numeric Map (if applicable). It is also possible to specify a relative threshold value by appending a percentage sign after the cutoff (number or data object name). In this case the cutoff value should be between 0 and 100 (if a relative cutoff is outside this range it will be set to either 0 or 100). For example, the cutoff "50%" (or equivalently "50%C") will use a cutoff which is halfway between the smallest and largest values found in the source object (considering only sequences from the specified sub-collection). By appending "%D" instead of "%C" the value range used to derive the relative cutoff is based on all sequences (not just the specified collection). For feature datasets, it is possible to use different relative thresholds for each sequence by appending "%S". E.g. the cutoff "50%S" will for each sequence use a cutoff which is halfway between the smallest and largest values found in that sequence (unlike "%C" or "%D" which will use the smallest and largest values among sequences in the chosen collection or all the sequences respectively). Note: The "%C" operator was added in version 2.0 of MotifLab and at the same time the behaviour of the "%" operator was changed to be equal to "%C" rather than "%D". This is intuitive, since if the threshold operation is applied to a (sub)collection of sequences, only the sequences within that collection would be considered when the relative cutoff-threshold is calculated. (If no particular collection is specified, the "%C" and "%D" operators will behave the same).
above	All entries in the source data object that have values above or equal to the specified cutoff threshold will be assigned a new value which is specified with this parameter. This value can be a constant number, a Numeric Variable, a compatible Numeric Map or one of the six "special values": `sequence.min, sequence.max, dataset.min, dataset.max, collection.min, collection.max`. The "sequence.min" and "sequence.max" values are respectively the smallest and largest values within each sequence (applicable to Numeric Datasets and Region Datasets), the "dataset.min" and "dataset.max" values are the smallest and largest values found in the source data object, and if the operation is limited to a collection, the "collection.min" and "collection.max" values will be the smallest and largest data values in the source object among all the members in the collection.
below	Similar to the "above" parameter. Entries that have values below the specified cutoff will be assigned the "below" value.

# Sets all positions in the Conservation track that have values above or equal to 0.3 to the new value 1.0 and those below 0.3 to the new value 0.0
threshold Conservation with cutoff=0.3 above=1 below=0

# Finds the value which is halfway between the smallest and largest values of the Conservation track within each sequence and sets those positions that have value equal to or above this halfway value to 1.0 and those below to 0.0
threshold Conservation with cutoff=50%S above=1 below=0

# Sets all positions in the Conservation track that have values above or equal to 0.3 to the highest value in the entire dataset and those with value below 0.3 are set to the smallest value in the entire dataset
threshold Conservation with cutoff=0.3 above=dataset.max below=dataset.min

# Version 2.0 of MotifLab introduced a more natural command syntax
threshold Conservation with cutoff=0.3 set values above cutoff to dataset.max and values below cutoff to dataset.min
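
The way a relative cutoff is resolved and applied can be sketched in Python as shown below. The manual only states that "50%" lies halfway between the smallest and largest values, so the linear interpolation used for other percentages is an assumption, as are the function names. Applying the function to each sequence separately would correspond to the "%S" variant.

```python
def resolve_cutoff(values, cutoff):
    """Turn an absolute number or a relative 'N%' string into a numeric cutoff."""
    if isinstance(cutoff, str) and cutoff.endswith("%"):
        # Relative cutoffs outside 0-100 are clamped to that range
        percent = min(100.0, max(0.0, float(cutoff[:-1])))
        lowest, highest = min(values), max(values)
        return lowest + (highest - lowest) * percent / 100.0
    return float(cutoff)


def threshold(values, cutoff, above, below):
    """Assign 'above' to values at or above the cutoff and 'below' to the rest."""
    limit = resolve_cutoff(values, cutoff)
    return [above if value >= limit else below for value in values]


absolute = threshold([0.1, 0.3, 0.9], 0.3, above=1, below=0)
relative = threshold([0.0, 0.2, 0.5, 1.0], "50%", above=1, below=0)

# Per-sequence relative cutoff, analogous to "50%S"
conservation = {"seq1": [0.0, 1.0], "seq2": [10.0, 20.0, 30.0]}
per_sequence = {name: threshold(vals, "50%", 1, 0) for name, vals in conservation.items()}
```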

transform

Transforms each numeric value in a data object according to a selected mathematical function. For Region Datasets the transform will be applied to the 'score' properties of the regions unless a different numeric property is specified. A few special transforms that target Region Datasets may also modify non-numeric values ("reverse" and "type-replace"). Note that values which can not be transformed for some reason will just be skipped (e.g. when taking the logarithm of negative values or dividing by zero). Usually, a warning message will be provided in the log when this occurs.

Applies to:	Numeric Dataset, Numeric Map, Numeric Variable and Region Dataset
Returns:	Numeric Dataset, Numeric Map, Numeric Variable or Region Dataset (The type of the returned object will depend on the source object)

Name

Description

function

Decides which mathematical transform function to apply to each numeric value 'X' in the original data object. The available functions are:

"absolute" : returns the absolute value of X
"ceil" : returns the smallest integer number equal to or larger than X
"cubic-root" : returns the cubic root of X
"floor" : returns the largest integer number equal to or smaller than X
"gaussian" : returns a random real number drawn from a Gaussian distribution with mean 0.0 and standard deviation 1.0
"log" : returns the natural logarithm of X
"logX" : returns the logarithm of X using a log-function with the base specified by the argument
"logit" : returns the "logit" of X, i.e. log(X/(1-X)) (natural logarithm)
"modulo" : returns the value X%argument
"odds" : returns the "odds" of the original value, i.e. X/(1-X)
"power" : returns the value X raised to the power of the argument
"random" : returns a random real number between zero (inclusive) and the argument value (exclusive)
"reciprocal" : returns the reciprocal of X, i.e. 1/X
"reverse" : This transform can be applied to Region Datasets to reverse the orientation of the regions (including their 'sequence' properties).
"round" : returns the integer value closest to X
"sigmoid" : returns the sigmoid function applied to X, i.e. 1/(1+e^(-X))
"signum" : returns 0 if X is 0, +1 if X is positive, and -1 if X is negative
"square-root" : returns the square root of the original value
"type-replace" : This transform can be applied to Region Datasets to replace the value of the 'type' property of regions (see further details below).
"wave" : returns the value of the function cos(2*pi*X/argument). If this transform is applied to a numeric dataset where the values inside each sequence are increasing/decreasing linearly (for instance if the track is derived with the 'distance' operation), the result will be a regular cosine wave.

argument

Some of the transform functions may require an additional 'argument' to be specified.

"logX" : the argument is the base of the logarithm. E.g. an argument of "2" will return log2(X)
"power" : the argument is the power to which the original value X should be raised. E.g. an argument of "3" will return X^3
"random" : the argument is the maximum value for the returned random number (exclusive). E.g. an argument of "10" will return random numbers in the range [0,10)
"modulo" : the argument is the modulus. E.g. an argument of "10" will return X%10 for each original value X.
"wave" : the argument will determine the "width" of the wave.
"type-replace" : the argument should be a Text Variable where each line is of the format "oldexpression=>newtype". The transform operation will go through each region in the Region Dataset and if the type of a region matches the "oldexpression" (which can be in the form of a regular expression), the type will be replaced with "newtype".

# Returns a numeric track where positions with a conservation value lower than 0.5 are set to 0 and those with a value equal to or higher than 0.5 are set to 1
transform Conservation with round

# Returns a Motif Numeric Map where each entry (including the default value) is assigned a random value in the range [0,2). Note that only entries that had specifically assigned values in the original map will be transformed (the rest will default to the same new default value)
transform MotifNumericMap with random(2)
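
A few of the transform functions, together with the rule that untransformable values are skipped, can be sketched in Python as follows. The function table and names are hypothetical, only a subset of the transforms is covered, and the sketch assumes that a skipped value simply keeps its original value.

```python
import math

# A subset of the transforms; each takes the value X and an optional argument
TRANSFORMS = {
    "absolute": lambda x, arg: abs(x),
    "log": lambda x, arg: math.log(x),
    "logX": lambda x, arg: math.log(x, arg),
    "logit": lambda x, arg: math.log(x / (1 - x)),
    "odds": lambda x, arg: x / (1 - x),
    "power": lambda x, arg: x ** arg,
    "reciprocal": lambda x, arg: 1 / x,
    "sigmoid": lambda x, arg: 1 / (1 + math.exp(-x)),
    "signum": lambda x, arg: (x > 0) - (x < 0),
    "wave": lambda x, arg: math.cos(2 * math.pi * x / arg),
}


def transform(values, function, argument=None):
    """Apply the named transform to every value, skipping values that fail."""
    result = []
    for x in values:
        try:
            result.append(TRANSFORMS[function](x, argument))
        except (ValueError, ZeroDivisionError):
            # Assumption: a value that cannot be transformed is left unchanged
            result.append(x)
    return result


logs = transform([-1.0, 1.0], "log")         # the negative value is skipped
inverses = transform([4.0, 0.0], "reciprocal")  # division by zero is skipped
signs = transform([-2.0, 0.0, 3.0], "signum")
```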

: distance

Protocols

A "protocol" is a document which describes a list of operations to be executed in order (including specifications of their parameters, conditions and constraints). Protocols can be used to document the steps you perform during an analysis session, and they can describe workflows that can be automatically executed by MotifLab. If you like, you can specify exactly which sequences to perform the analyses on in the protocol itself, and the protocol will then always perform the analysis on these sequences. However, if the sequences are not explicitly specified, the protocol will just describe a generic analysis workflow which can be applied to any set of sequences (as long as any additional data needed by the protocol is available for the organism and genome build you apply the analysis to).

Creating a protocol

Protocols can either be written manually in the protocol editor (or an external text editor) or they can be made with MotifLab's record functionality which will automatically register all the operations you perform to the protocol.

To create a new protocol, press the "New Protocol" button in the toolbar or go to the "File" menu and select "New Protocol" from there. The protocol editor (described below) will then display the new protocol. You can also open a previously saved protocol by pressing the "Open Protocol" button in the toolbar or selecting "Open Protocol" under the "File" menu.

To activate the "record mode", simply press the round red record button in the toolbar (or select "Record" under the "Protocol" menu). Any operations you perform after activating record mode will be registered in the protocol. Note that the recorded protocol commands will be inserted at the location of the cursor in the editor and not appended to the end (unless the cursor is at the end of the protocol). This means that you can also use record mode to insert new commands anywhere in the protocol by first placing the cursor at a line and then performing a new operation. Press the stop button in the toolbar to deactivate record mode (or select "Stop" under the "Protocol" menu).

Executing a protocol

You can execute a protocol by pressing the "Execute" (play) button in the toolbar or selecting "Execute" from the "Protocol" menu. MotifLab will then go through all the operations that are described in the protocol. If the protocol contains operations that apply to sequences and no sequences are defined in the protocol itself, the protocol will be applied to the sequences that are currently known to MotifLab. If no sequences are known, MotifLab will display the Sequence Dialog and prompt the user to specify which sequences to perform the protocol on.

It is also possible to execute just a subset of the commands listed in the protocol. To execute a number of consecutive lines, select the lines that you want to run by marking the text in the protocol editor (you need not select the full line to include it; it is enough that part of a line is selected). Then go to the "Protocol" menu and select "Execute Current Selection". You can also execute only the line where the cursor is currently located by selecting "Execute Current Line" from the "Protocol" menu (NB: this might not work properly in version 1.000 due to a bug), or by holding down the CONTROL key while pressing ENTER inside the protocol editor (if you hold down the SHIFT key at the same time you will suppress any dialogs that might pop up to display the results of the operation). To stop the execution of a protocol before it is finished, just press the "Stop" button in the toolbar.

The protocol language

The standard protocol language employed by MotifLab was designed to be close to natural language so that it should be possible for a human user to read and understand a protocol script without being an experienced programmer. However, the protocol language also has a few constraints in order to make it easily processable by MotifLab. First, each line in the protocol can only contain one command and each command can not span more than one line. Second, the first word of a command (after the assignment operator "=") must be the name of an operation. Apart from that, each operation decides for itself how the command should be expressed. However, most operations rely on a command syntax which follows this general format:

   [target = ] <operation name> [arguments clause] [condition clause(s)]

The target clause at the start of the line states a name for a new data object that is created by the operation. For many operations this target clause is optional and the target will then be the same as the source object. E.g. in the first example command line below, the value of X is increased by 10, since X is both the source object and the (implicit) target. The command in the second line, however, will create a new data object named Y which has a value equal to X+10, but the value of X itself will not be changed. Some operations return more than one data object and in such cases the target is specified as a vector with multiple comma-separated names enclosed in square brackets (as can be seen for the "plant" operation in the last example below).

The arguments clause specifies values for the different arguments used by the operation. This will almost always include the "source" data object that the operation should be applied to, but different operations may also require additional arguments to be specified. For example, when performing the "increase" operation on a Region Dataset, the operation also requires two additional arguments to be specified: one which tells the operation which property of the regions to increase the values of and another which tells the operation how much the current value of this property should be increased. Some operations have rather many arguments (or even a variable number of arguments) and these operations often rely on "argument maps" to specify values for some or all of their arguments in a more compact form. An argument map is simply a comma-separated list of "argumentName=argumentValue" pairs enclosed in curly braces. The last two example commands below make use of argument maps (operations "motifScanning" and "plant").

The condition clause is always optional but can be used to limit the application of the operation. Depending on the type of condition, this clause will either begin with "where" or "in collection".

In the following example commands the operation name is shown in red, the arguments clause in green, the condition clause in blue and the target clause in pink.

increase X by 10
Y = increase X by 10
multiply Conservation by 2 in collection UpregulatedSequences
filter BindingSites where region's average Conservation < 0.7
TFBS = motifScanning in DNA with SimpleScanner {Motif Collection=JasparCore,Threshold=95}
[ SequencesWithPlantedMotifs , PlantedSites ] = plant M00001 in DNA {Plant probability=0.8}

Comments
Lines in a protocol that start with a # sign will be treated as comments and ignored during execution. Note that all comments must be on their own lines since it is not possible to add comments at the end of other command lines.

Temporary data objects
Sometimes it will be necessary for a protocol script to create temporary data objects that are used for e.g. intermediate calculation steps but are not really interesting for the user after the execution of the protocol has ended. Such data objects can be given names starting with an underscore to mark them as temporary. Temporary data objects will not be displayed in any data panels or in the Visualization panel and they will be deleted immediately after the protocol execution ends.

Flow control

Protocol scripts in MotifLab are designed to be conceptually simple, where each line in the protocol from the first to the last should be executed once and only once in succession. The protocol language and the commands to perform various operations are inspired by the paradigm of declarative programming, whereby a programmer describes what they want to achieve rather than micromanaging exactly how to produce the desired outcome. For example, if an operation is applied to a data object that naturally contains subentries, MotifLab will implicitly perform the operation on each of these subentries in turn, as long as all imposed conditions hold true for the entry. Because of this (and also because MotifLab does not support constructs such as data arrays or reference variables), there is little need for the protocol language to include flow-control statements such as loops and conditional blocks.

Nevertheless, from version 2.0 onwards, MotifLab does support limited flow-control in the form of conditional "if-then-else" statements.
The basic syntax for a conditional statement block is:

  if <condition>  
      ....
      ....
  end if

You can have alternative "else if" condition blocks after the first "if", and the first block whose condition is satisfied will then be executed. An optional default "else" block will only be executed if none of the conditions for any of the previous "if" or "else if" blocks were satisfied. It is allowed to nest "if-else" statements to arbitrary levels.

  if <condition>  
      ....
  else if <condition>  
      ....
  else if <condition>  
      ....
  else
      ....
  end if

So far it is only possible in the condition expression to compare a single data object to another data object or literal value (textual or numeric).
However, multiple conditions can be connected with boolean operators "AND" and "OR" to create more complex compound conditions.

Conditions allowed in flow-control statements

Operand1	Comparator	Operand2	Condition holds true when...
Text Variable	equals	Text	the text value of operand1 is exactly identical to operand2
Text Variable	=	Text	the value of operand1, when viewed as a set of strings, is the same as the set of strings in operand2
Text Variable	<=	Text	the set of strings in operand1 is the same as or a subset of the set of strings in operand2
Text Variable	<	Text	the set of strings in operand1 is a strict subset of the strings in operand2
Text Variable	>=	Text	the set of strings in operand1 is the same as or a superset of the set of strings in operand2
Text Variable	>	Text	the set of strings in operand1 is a strict superset of the strings in operand2
Text Variable	<>	Text	the set of strings in operand1 is not the same as the set of strings in operand2 (but they can still overlap)
Numeric Variable	=	Numeric	the numeric value of operand1 is the same as that of operand2
Numeric Variable	<=	Numeric	the numeric value of operand1 is less than or equal to the value of operand2
Numeric Variable	<	Numeric	the numeric value of operand1 is strictly less than the value of operand2
Numeric Variable	>=	Numeric	the numeric value of operand1 is greater than or equal to the value of operand2
Numeric Variable	>	Numeric	the numeric value of operand1 is strictly greater than the value of operand2
Numeric Variable	<>	Numeric	the numeric value of operand1 is different from the value of operand2
Collection	=	Collection	the entries in operand1 are the same as the entries in operand2
Collection	<=	Collection	the entries in operand1 are the same as or a subset of the entries in operand2
Collection	<	Collection	the entries in operand1 are a strict subset of the entries in operand2
Collection	>=	Collection	the entries in operand1 are the same as or a superset of the entries in operand2
Collection	>	Collection	the entries in operand1 are a strict superset of the entries in operand2
Collection	<>	Collection	the two collections are not the same (but they can still overlap)
Collection	overlaps	Collection	the two collections have at least one entry in common
Data	=	Data	the two data objects have the "same" value
Data	<>	Data	the two data objects do not have the "same" value

When Operand2 is "Text" the operand can either be a Text Variable, a Text Map (in which case only the default value is considered), a Collection or a literal text enclosed in double quotes. When the "equals" comparator is used, the two bodies of text must represent identical documents, but for the other comparators the bodies of texts are considered as "sets of strings" and the order of the strings is not important. For example, if T1 is "apples,oranges" and T2 is "oranges,apples" then "T1 equals T2" is false but "T1 = T2" is true.
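The difference between "equals" and the set-based comparators can be illustrated with a small Python sketch. This is not MotifLab code; it only models the semantics described above, and it assumes (as in the "apples,oranges" example) that the individual strings of a text value are separated by commas. The function names are hypothetical.

```python
def as_string_set(text):
    """Split a text value into a set of trimmed strings.
    Assumption: strings are comma-separated, as in the manual's example."""
    return {item.strip() for item in text.split(",") if item.strip()}


def compare_text(operand1, comparator, operand2):
    """Evaluate a text condition in the spirit of the table above."""
    if comparator == "equals":
        # The two texts must be exactly identical, including order.
        return operand1 == operand2
    first, second = as_string_set(operand1), as_string_set(operand2)
    checks = {
        "=": first == second,   # same set of strings
        "<=": first <= second,  # same as or subset
        "<": first < second,    # strict subset
        ">=": first >= second,  # same as or superset
        ">": first > second,    # strict superset
        "<>": first != second,  # not the same set (may still overlap)
    }
    return checks[comparator]


T1 = "apples,oranges"
T2 = "oranges,apples"
print(compare_text(T1, "equals", T2))  # False
print(compare_text(T1, "=", T2))       # True
```

The same set operators apply to the Collection comparators if the entries of each collection are treated as set members.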

When Operand2 is "Numeric" the operand can either be a Numeric Variable, a Numeric Map (in which case only the default value is considered), or a literal number.

For data objects that are not Text Variables, Numeric Variables, or Collections, only the two comparators "=" and "<>" are available to determine if the objects represent the same value or not (the definition of representing the "same" value depends on the data type).

Example: When the protocol below is executed, the user will be asked interactively for which motif scanner to use to predict binding sites (via the prompt command). Depending on the choice of algorithm, which can be either "MATCH" or "SimpleScanner", only one of the two motifScanning commands will be performed and return a BindingSites track.

DNA = new DNA Sequence Dataset(DataTrack:DNA)
Jaspar_Core = new Motif Collection(Collection:Jaspar Core)
Cutoff = new Numeric Variable(0.9)

Algorithm = new Text Variable("MATCH")
prompt Algorithm "Please select which motif scanning algorithm to use" {"MATCH","SimpleScanner"}

if Algorithm = "MATCH"
   BindingSites = motifScanning in DNA with MATCH {Motif collection=Jaspar_Core,Matrix threshold=Cutoff}
else if Algorithm = "SimpleScanner"
   BindingSites = motifScanning in DNA with SimpleScanner {Motif Collection=Jaspar_Core,Threshold=Cutoff}
else
   !message("Unknown motif scanning algorithm: {Algorithm}")=ERROR
end if

The protocol editor

The protocol editor can be found under the "Protocol" tab in the main panel.

The protocol editor consists of three panels. On the top is a blue header panel which displays the name of the protocol. New protocols are given default names like "Protocol-1", "Protocol-2" etc., but you can change the name by saving the protocol to a file (by going to the "File" menu and selecting "Save" or "Save As..."). The protocol will then be given the same name as the file that you saved it to (minus the file-suffix). A protocol which has not been saved yet (or has been changed since it was last saved) will have an asterisk after the protocol name in the header. It is possible to have multiple protocols open at the same time in MotifLab, and you can then switch between them via a drop-down menu which is available by pressing the down-arrow button on the right side of the header (or by going to the "Protocol" menu and selecting "Change Protocol"). Only the protocol which is currently displayed in the protocol editor will be "active", however.

The main part of the protocol editor is the editor panel itself. Here the currently selected protocol is displayed and can be edited. Each operation command must be written out on a single line in the protocol in order for MotifLab to understand it correctly. (Word wrapping functionality for long lines will hopefully be included in a future version of MotifLab.)
The protocol editor can use colors to highlight keywords of different types in the protocol. According to the default color scheme, the names of operations are colored RED, names of specific data objects are colored BLUE, general data types are in ORANGE (as are names of analyses and names of general data formats for input and output), names of external programs are in GREEN, literal numeric constants are in PINK and literal text constants (in double quotes) are in GREEN, display settings are in CYAN and comments are in GRAY. If you don't like these default colors you can change them by selecting "Options..." from the "Configure" menu and going to the "Protocol Editor" tab in the Options dialog that pops up.

The editor panel has a gray margin area to the left which displays line numbers in front of each protocol line and sometimes also small icons in front of these line numbers. These icons have the following interpretations:

	This protocol line contains an error. Point the mouse at the icon to display the error message.
	MotifLab is currently executing the operation on this line.
	The operation on this line was successfully executed.
	The protocol execution was stopped by the user on this line.
	The execution of the protocol was aborted at this line due to an error. Point the mouse at the icon to display the error message.

At the bottom of the protocol editor is the status panel with three boxes followed by a status message line. The first box contains a "status light" which can be colored green, yellow or red (or black when there are no protocols). A green light means that the protocol does not contain any errors as far as MotifLab can tell, and it should therefore be possible to execute it. A red light means that the protocol contains errors which make it impossible for MotifLab to parse it correctly. The number of errors detected in the protocol will be displayed in the status message line, and the lines that contain these errors should also be marked with error icons in the margin. (To see what is wrong with a line, point the mouse at the error icon to see the error message.) If you try to execute a protocol containing errors, MotifLab will refuse and display an error message. A yellow status light indicates that MotifLab has yet to determine whether the protocol contains any errors or not. This color is usually displayed when you start typing into the protocol. MotifLab will then wait until you have stopped typing before it checks the protocol for errors and changes the light to either green or red.

The second box on the status line (after the status light) displays the coordinates of the cursor in the format "line:column", and the third box shows if the editor is currently in "insert mode" (INS) or "overwrite mode" (OVR). If the editor is in "insert mode", newly typed characters will be inserted at the position of the cursor and any text that follows the cursor will be pushed forward. If the editor is in "overwrite mode", however, any character currently under the cursor will be replaced by a newly typed character. You can toggle between the two modes by pressing the INSERT key on your keyboard (if you have one).

Display settings

When MotifLab's "record mode" is activated to log a user's actions in a protocol, only the operations that the user executes are recorded. Other activities the user performs, such as changing the color or height of a data track, are not recorded. However, it is possible to include such visual cues in the protocol as well, by manually entering display setting statements. A display setting statement starts with a dollar sign '$' (or alternatively an exclamation mark '!') at the beginning of the protocol line and is immediately followed by the name of the setting to be specified.
The general format is:

$setting(target)=value

Note that both the setting and the value are normally case-insensitive but the target is case-sensitive. The difference between using a dollar sign in front of the statement and an exclamation mark is that when the dollar sign is used, the system will check that the target data objects exist and have the correct type. If an exclamation mark is used instead, the system will not perform any checks but just make a record of the setting for future reference. Hence, using the exclamation mark allows you to set display settings for data objects that have not been created yet.
A table describing all recognized display settings is provided below. The target argument specifies which data object(s) the setting should be applied to. A target can for instance be the name of a feature track, a sequence, a motif or a module depending on the display setting. A comma-separated list of targets can be specified instead of just a single target, and if the setting applies to sequences, motifs or modules, names of collections of such objects can also be used. Alternatively, instead of naming specific targets, a single wildcard (*) can be used to refer to all data objects of the applicable type. For settings that target "region types", a list of types can be provided, or a special wildcard that targets all region types found in a given Region Dataset, like this: "datasetname:*". Note that some settings do not have specific targets, in which case the target argument should be left blank.
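To make the anatomy of a display setting statement concrete, the following Python sketch splits a statement into its prefix, setting name, targets and value. It is only an illustration of the general format and not MotifLab's actual parser; the function name and the returned dictionary keys are hypothetical, and the naive comma split does not handle commas inside quoted text.

```python
import re

# prefix ($ or !), setting name, target list in parentheses, optional "=value"
STATEMENT = re.compile(r"^([$!])(\w+)\((.*?)\)(?:=(.*))?$")


def parse_display_setting(line):
    """Split a display setting statement into its parts (illustrative only)."""
    match = STATEMENT.match(line.strip())
    if match is None:
        raise ValueError("not a display setting statement: " + line)
    prefix, setting, target, value = match.groups()
    return {
        # "$" asks for the targets to be checked, "!" just records the setting
        "checked": prefix == "$",
        # setting names are normally case-insensitive
        "setting": setting.lower(),
        # targets are case-sensitive, so they are kept as written
        "targets": [t.strip() for t in target.split(",") if t.strip()],
        # None when the statement has no "=value" part
        "value": value,
    }


print(parse_display_setting("$height(Conservation,RepeatMasker)=26"))
print(parse_display_setting("!hideRegion(RepeatMasker:*)"))
```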

The allowed values for each display setting are also specified in the following table. Some settings require the value to be a specific keyword (such as for the "graphtype" setting), while others require a numeric (usually integer) value or a boolean value (which can be specified as either TRUE/YES/ON, or FALSE/NO/OFF). The special color value can be entered as either a comma-separated triplet of numeric RGB-values in the range 0 to 255 (e.g.: "255,0,0" for RED or "255,255,0" for YELLOW), as a 6 digit hexadecimal number preceded by # (e.g. "#FF0000" for RED or "#FFFF00" for YELLOW) or using one of the following color-keywords: BLACK, BLUE, CYAN, DARK BROWN, GRAY, GREEN, LIGHT BLUE, LIGHT BROWN, LIGHT GRAY, LIGHT GREEN, MAGENTA, ORANGE, PINK, RED, VIOLET, WHITE or YELLOW. MotifLab v2.0 also allows the color to be specified with a colon-separated triplet of numeric HSB-values in the range 0.0-1.0.
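The accepted color notations can be summarized with a short Python sketch. It is not taken from MotifLab; it merely converts each notation described above to an RGB triplet. Only the two keywords whose RGB values are stated in this manual (RED and YELLOW) are included, and the HSB conversion uses Python's `colorsys` module, whose rounding may differ slightly from MotifLab's own conversion.

```python
import colorsys

# Only keywords whose RGB values are given in the text above; the RGB values
# of the remaining keywords are not stated in this manual.
KEYWORDS = {"RED": (255, 0, 0), "YELLOW": (255, 255, 0)}


def parse_color(value):
    """Convert a color value to an (R, G, B) tuple of integers in 0-255."""
    text = value.strip()
    if text.upper() in KEYWORDS:
        return KEYWORDS[text.upper()]
    if text.startswith("#") and len(text) == 7:
        # 6-digit hexadecimal number preceded by '#'
        return tuple(int(text[i:i + 2], 16) for i in (1, 3, 5))
    if ":" in text:
        # colon-separated HSB triplet, each component in the range 0.0-1.0
        hue, saturation, brightness = (float(p) for p in text.split(":"))
        if not all(0.0 <= c <= 1.0 for c in (hue, saturation, brightness)):
            raise ValueError("HSB components must be in the range 0.0-1.0")
        rgb = colorsys.hsv_to_rgb(hue, saturation, brightness)
        return tuple(round(c * 255) for c in rgb)
    # comma-separated RGB triplet, each component in the range 0-255
    parts = [int(p) for p in text.split(",")]
    if len(parts) != 3 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("RGB values must be three integers in the range 0-255")
    return tuple(parts)


print(parse_color("255,0,0"))      # (255, 0, 0)
print(parse_color("#FFFF00"))      # (255, 255, 0)
print(parse_color("0.0:1.0:1.0"))  # (255, 0, 0)
```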

In MotifLab v2.0, some of the fonts used (for instance to draw base letters in DNA tracks or tick labels in graphs) can also be changed. Fonts are specified as a comma-separated triplet defining the font name, size and style. The font name can either be one of the five logical fonts ("Serif", "SansSerif", "Monospaced", "Dialog" or "DialogInput") or the name of any font installed on the user's computer. The size is an integer between 3 and 200 (recommended range between 8 and 30), and the style can be chosen among the following options: "plain", "bold", "italic" or "bolditalic".
E.g.: $setting("system.dnaFont")=Serif,12,bold.

Setting	Target	Value	Description
visible	tracks	boolean	Sets the visibility of the specified tracks
show	tracks		Shows the specified tracks. This is an abbreviation of: visible(x)=TRUE
hide	tracks		Hides the specified tracks. This is an abbreviation of: visible(x)=FALSE
sequenceVisible	sequences	boolean	Sets the visibility of the specified sequences
showSequence	sequences		Shows the specified sequences. This is an abbreviation of: sequenceVisible(x)=TRUE
hideSequence	sequences		Hides the specified sequences. This is an abbreviation of: sequenceVisible(x)=FALSE
regionVisible	region types	boolean	Sets the visibility of regions of the specified types
showRegion	region types		Shows regions of the specified types. This is an abbreviation of: regionVisible(x)=TRUE
hideRegion	region types		Hides regions of the specified types. This is an abbreviation of: regionVisible(x)=FALSE
motifVisible	motifs	boolean	Sets the visibility of the specified motifs
showMotif	motifs		Shows the specified motifs. This is an abbreviation of: motifVisible(x)=TRUE
hideMotif	motifs		Hides the specified motifs. This is an abbreviation of: motifVisible(x)=FALSE
moduleVisible	modules	boolean	Sets the visibility of the specified modules
showModule	modules		Shows the specified modules. This is an abbreviation of: moduleVisible(x)=TRUE
hideModule	modules		Hides the specified modules. This is an abbreviation of: moduleVisible(x)=FALSE
graph graphtype	numeric tracks	graph filled graph line graph outlined graph gradient heatmap one-color heatmap two-color heatmap rainbow heatmap	Specifies which type of graph to use for the track
multicolor	region tracks	boolean	Specifies whether to draw all Regions in a Region track using the same color (FALSE) or color the regions according to type (TRUE)
gradient gradientfill	region tracks	boolean integer off vertical horizontal	Specifies whether to draw boxes for Regions using a flat color fill or gradient fill. The keyword vertical, integer value 1 or boolean value TRUE, will set the fill to "vertical gradient fill". The keyword horizontal or integer value 2 will set the fill to "horizontal gradient fill". Any other value will turn off gradient fill and just use plain flat colors.
showScore	region tracks	boolean	Specifies whether to visualize the score of Regions by drawing the height of the Region boxes proportional to the score value
showOrientation showStrand	region tracks	boolean	Specifies whether to visualize the orientation of Regions by drawing regions with same orientation as the sequence above the baseline and regions with opposite orientation below the baseline
color foreground fgColor	tracks	color	Sets the foreground color for the specified tracks
background bgColor	tracks	color	Sets the background color for the specified tracks
secondary secondaryColor	tracks	color	Sets the secondary color for the specified tracks
baseline baselineColor	tracks	color	Sets the color of the baseline for the specified tracks
label labelColor	sequences	color	Sets the color of the labels for the specified sequences
canvas canvasColor		color	Sets the background color of the visualization panel
regionColor	region types	color	Sets the color for regions of the specified types
motifColor	motifs	color	Sets the color for the specified motifs
moduleColor	modules	color	Sets the color for the specified modules
moduleFillColor		None Type color	Specifies the color to use for the intra-module background when drawing module regions. The value can either be a color or one of the two special values: `None` (do not color the background) or `Type` (color the background according to module type)
moduleOutlineColor		None Type color	Specifies the color to use for the border when drawing module regions. The value can either be a color or one of the two special values: `None` (do not draw a border) or `Type` (color the border according to module type)
expanded	region tracks	boolean	Sets the expansion mode of the specified Region tracks. TRUE=expanded, FALSE=contracted
expand	region tracks		Expands the specified Region tracks. This is an abbreviation of: expanded(x)=TRUE
contract	region tracks		Contracts the specified Region tracks. This is an abbreviation of: expanded(x)=FALSE
height trackHeight	tracks	integer	Sets the track heights for the specified tracks
scale	sequences	value% ToFit	Sets the zoom level for the sequences to the specified percentage value (note that the number must be followed by a percent sign) or adjusts the zoom level so that the entire sequence is visible if the special ToFit keyword is specified
orientation	sequences	Direct Reverse Relative Opposite	Shows the sequences according to the given strand orientation. The keywords Direct and Reverse will show the sequences on the genomic direct or reverse strand respectively. If the keyword Relative is specified, the sequences will be shown relative to the individual orientation of each sequence (and Opposite will show the opposite strand).
margin		integer	Specifies the margin distance between sequences
order	tracks		Sets the order of the tracks according to the given list
sort	mode	Ascending Descending	The sort command was added to MotifLab in version 2.0 and will sort the sequences in the given direction according to the specified sort mode. The direction should be either "ascending" or "descending" (abbreviated "asc" and "desc"). The mode parameter specifies how the sequences should be sorted. Valid modes are: Sequence name; Sequence length; Region count: <Region Dataset>; Visible region count: <Region Dataset>; Region coverage: <Region Dataset>; Visible region coverage: <Region Dataset>; Region scores sum: <Region Dataset>; Visible region scores sum: <Region Dataset>; Numeric map: <Sequence Numeric Map>; Numeric track sum: <Numeric Dataset>; GC-content: <DNA Sequence Dataset>; Mark; Location. Some of these modes require an additional data object of a certain type to be supplied as a parameter. In this case the mode should be followed by a colon and the name of the data object. It is also possible to first group the sequences together in clusters (specified by a Sequence Partition) and then sort the sequences within each cluster. To use this grouping option simply type ", group by: <Sequence Partition>" after the mode parameter. Note the comma which separates the mode from the grouping option. Alternatively, it is possible to drop the "group by:" string and just type a comma followed by the name of the Sequence Partition.
updates		boolean	Turns on or off visualization updates. When updates are turned off, the Visualization panel will not be updated when e.g. new sequences are added. This setting should be used with caution, but it can be useful if you have a protocol script where many sequences are added one by one. This will tend to be inefficient since the Visualization panel is updated every time a new sequence is added. However, if you add an $updates()=OFF line before adding sequences and remember to turn updates on again with an $updates()=ON line afterwards (followed by a $refresh() line), adding all the sequences will be much more efficient. (But do remember to turn updates on again, otherwise it may cause trouble later!).
refresh			This command is required to refresh the screen in order to update the graphics properly if normal graphics updates have been turned off with $updates()=OFF.
setting			This is a general command which can be used to change any display setting as long as you know the correct name of the setting. For instance, to change the height of a track named "Conservation" to 20 you can use the command `$setting("Conservation.trackHeight")=20`, or to change its foreground color to red you can use `$setting(Conservation.foregroundColor)=RED`. (Enclosing the name of the setting in quotes is optional.) However, it is not recommended to change the values of settings in this way unless you have no other choice, since (1) the graphics might not be properly updated in response to your command (at the very least you should follow such a statement with `$refresh()`) and (2) there are no safety checks in place, so unless you know exactly what you are doing you can create serious problems by unintentionally altering some important setting (for instance by setting the height of a track to a color instead of a number).
import	"filename"		Imports a set of display settings from the file with the given name. The file should be a text-file where each line is in the format: `<settingname> = <value>`.
display	data object		If this command is executed in the GUI client, a popup dialog will be shown displaying the contents of the data object (this does not apply to feature tracks). Added in MotifLab v2.0.

Examples of display setting statements:

$visible(*)=YES   # Shows all current feature tracks in the visualization panel
$hideMotif(*)   # Hides all motifs so that their TFBS are not shown within motif tracks in the visualization panel
$showMotif(MotifCollection1,M00001,M00002,M00004)   # Shows all motifs in the collection and 3 more
$height(Conservation,RepeatMasker)=26   # Sets the height of these two tracks to 26
$margin()=10   # Sets vertical distance between sequences to 10 pixels
$color(Conservation)=RED   # Sets the color of the Conservation track to red
$color(RepeatMasker)=#0000FF   # Sets the color of the RepeatMasker track to blue
$color(CCDS)=0,255,0   # Sets the color of the CCDS track to green
$expanded(TFBS)=False   # Turns off expanded mode for the TFBS track
$order(DNA,CCDS,Conservation,RepeatMasker,TFBS)   # Changes the order of the given tracks
$moduleFillColor()=Type   # Specifies that all modules should be colored according to their type
$moduleOutlineColor()=BLACK   # Sets outline color of all modules to black
$hideRegion(RepeatMasker:*)   # Hides all regions found in the RepeatMasker track
$showRegion(AluSx,LTR2B)   # Shows regions of the AluSx and LTR2B (repeat) types

Display setting statements can also be used to perform a few other tasks in a protocol that are not necessarily connected to visualization.
The following table contains a few such useful statements:

Setting	Target	Value	Description
saveOutput	Output Data	"filename"	This statement can be used to save a single Output Data object (created with the output operation) to the specified file. Note that if the protocol is executed through the command-line interface of MotifLab (not the graphical user interface), all Output Data objects that have been created by the protocol and still exist after execution finishes are automatically saved to file (the filename will be the same as the name of the Output Data object and the file-suffix will be determined by the Data Format used). Hence, this statement is only useful if you either run the protocol in the GUI and want to automatically save the output rather than having to select "Save As..." from the "File" menu afterwards, or if your protocol script creates a lot of very large Output Data objects and you want to use the delete operation along the way to free up memory (i.e. after creating an Output Data object with the output operation, you immediately save this Output Data to file and then delete it before outputting any other data objects).
saveSession	"filename"		Saves the current session to the specified file
restoreSession	"filename"		Restores a session from the specified file
clear	"Clear All Data" "Clear Feature Data" "Clear Sequence Data" "Clear Modules Data" "Clear Motifs and Modules Data" "Clear Other Data" "Clear Cache"		Deletes all data objects of the specified types (or clears the cache).
log	"text string"		Outputs the given text string to the log. The text can contain references to data objects as described under the "direct output" section of the output-operation.
message	"text string"	PLAIN INFORMATION ERROR WARNING QUESTION (the value is optional)	(MotifLab v2+) Presents a message to the user. If MotifLab is run with a graphical interface, the message will be presented in a popup dialog, and the user must click "OK" to close the dialog and continue. If a message type is specified, the dialog will be fitted with an icon reflecting the message type. If MotifLab is not run with a graphical interface, this command behaves similarly to "log". The message text can contain references to data objects as described under the "direct output" section of the output-operation.
dump	"display setting name"		Outputs the value of the given display setting to the log. If this statement is used without providing any display settings (just "$dump()"), MotifLab will list the names and values of all currently registered display settings. This command is mainly used for debugging (and to snoop around in MotifLab's internal lookup tables).
macro	macro name	text string	Adds a new macro definition (MotifLab v2+). This setting behaves a bit differently depending on whether the macro is defined with `$macro(NAME)=DEFINITION` or `!macro(NAME)=DEFINITION`. If "macro" is preceded by a dollar sign, the given macro definition will be treated as a default value that will be used for the macro unless a value for that macro has already been defined somewhere else (e.g. in the macro-editor of the GUI-client or by using the "-macro" option with the CLI-client). If an exclamation mark is used instead of the dollar sign, the macro will always be assigned the given definition.
option	"setting name"	boolean numeric value text string color	This command can be used to set MotifLab options that are usually configured in the Options dialog. The recognized settings are listed below (with valid values in parentheses behind each option). `maxConcurrentDownloads` (integer) `concurrentThreadCount` (integer) `networkTimeout` (integer) `maxSequenceLength` (integer) `autocorrectSequenceNames` (boolean) `useFeatureDataCache` (boolean) `useGeneIDMappingCache` (boolean) `promptBeforeDiscard` (boolean) `skipPositionZero` (boolean) `notificationsMinimumLevel` (integer) `autoSaveSessionOnExit` ("Never, Always, Ask") `antialiasMode` ("ON, OFF, DEFAULT, GASP, LCD_HRGB, LCD_HBGR, LCD_VRGB, LCD_VBGR") `SequenceWindowSize` (integer) `scaleSequenceLabelsToFit` (boolean) `sequenceLabelFixedWidth` (integer) `mainpanelBackground` (color) `numericTrackSamplingCutoff` (integer) `numericTrackSamplingNumber` (integer) `numericTrackDisplayValue` (integer) `Javascript` ("None, New File, Shared File, Embed, Link") `CSS` ("None, New File, Shared File, Embed, Link") `stylesheet` ("None, New File, Shared File, Embed, Link") `ProtocolEditor_fontName` (text string) `ProtocolEditor_fontSize` (integer) `ProtocolEditor_antialias` (boolean) `ProtocolColor:Data objects` (color) `ProtocolColor:Data types` (color) `ProtocolColor:Data formats` (color) `ProtocolColor:Operations` (color) `ProtocolColor:Analyses` (color) `ProtocolColor:Programs` (color) `ProtocolColor:Numbers` (color) `ProtocolColor:Text strings` (color) `ProtocolColor:Display settings` (color) `ProtocolColor:Comments` (color)
pause		integer	(MotifLab v2+) This command will simply instruct MotifLab to pause and wait the specified number of milliseconds before continuing. It can be used within protocols to make simple timed animations. E.g. the command "$pause()=3000" will wait for 3 seconds. The words "wait" and "sleep" are synonymous with "pause".

Macros

The possibility of defining macros to use in protocols was introduced in MotifLab v2. Macros are named entities that can be referenced in protocol scripts, and right before a protocol is executed, all occurrences of macros will be substituted with their respective definitions. This makes it possible to rewrite parts of a protocol on-the-fly.

Macros can either be defined in the GUI's macro editor, which can be found by selecting "Macro Editor..." from the "Protocol" menu, or with the command-line argument -macro <name> "<definition>" in the CLI-client. Macros can also be defined within a protocol itself using a display setting command, like so:

   !macro(name)=definition

The difference between using an exclamation mark versus a dollar sign for the macro command is that the exclamation mark will always assign the new definition to the macro when the command is executed, but if you precede the command with a dollar sign the macro will only be assigned the new definition if it is not already defined through other means (GUI macro editor or CLI-option).
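The precedence rule can be expressed as a few lines of Python. This is a sketch of the described behavior rather than MotifLab's implementation; the function name is hypothetical, and the dictionary stands in for macros that were already defined through the GUI macro editor or the CLI option.

```python
def define_macro(macros, prefix, name, definition):
    """Apply a macro command found in a protocol.

    macros: dict of macros defined so far (assumed here to come from the
            GUI macro editor or the CLI "-macro" option)
    prefix: "!" always assigns; "$" only supplies a default value
    """
    if prefix == "!" or name not in macros:
        macros[name] = definition
    return macros


predefined = {"VALUE": "7"}  # e.g. supplied on the command line

# "$macro(VALUE)=942" is only a default, so the existing definition is kept
print(define_macro(dict(predefined), "$", "VALUE", "942"))  # {'VALUE': '7'}

# "!macro(VALUE)=942" always assigns the new definition
print(define_macro(dict(predefined), "!", "VALUE", "942"))  # {'VALUE': '942'}
```

A practical consequence is that `$macro` is suited for supplying defaults that a user can override from outside the protocol, whereas `!macro` fixes the definition inside the protocol itself.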

There are no restrictions on the name of a macro except that it cannot contain a closing parenthesis. However, it is advisable to keep the names simple and only use letters and underscores. Also, since every instance of the macro name anywhere in the protocol will eventually be replaced by its definition, you should make sure that the name is unique enough to not cause any off-target substitutions (for example if the macro name is a substring of some other word used in the protocol).

There are two different kinds of macros in MotifLab, simple macros and list macros.

Simple macro
A simple macro will just replace every occurrence of the macro name in a protocol with the corresponding definition.
For example, the following protocol contains a macro named VALUE with the definition "942".

!macro(VALUE)=942
X = new Numeric Variable(VALUE)

This will result in the following protocol

X = new Numeric Variable(942)

List macro
A list macro is defined by enclosing the macro definition in brackets. Inside the brackets you can list multiple comma-separated values. If a line in a protocol contains a list macro, MotifLab will expand that line into multiple lines with each line using the next value in the list as its macro definition. For example, the following simple protocol contains a list macro named INDEX with four listed values.

!macro(INDEX)=[1,2,3,4]
X_INDEX = new Numeric Variable(INDEX)

The second line in the protocol contains the macro name and will therefore be expanded into four repeated lines with each line using the next value in the list for the macro. The resulting protocol will thus look like this:

  X_1 = new Numeric Variable(1)
  X_2 = new Numeric Variable(2)
  X_3 = new Numeric Variable(3)
  X_4 = new Numeric Variable(4)

List macros can contain any kind of values, not just numbers, but using list macros to append incremental numeric suffixes to data objects and thus creating a kind of "array" of related data objects is a common scenario. For this reason, it is also possible to use the short-hand notation "[1:4]" as a list macro definition instead of listing all the numbers "1,2,3,4" explicitly. In this case, MotifLab will automatically create the list by iterating through all the numbers starting from the first value (before the colon) up to and including the second (after the colon). If the last value is smaller than the first, the numbers will appear in reverse order (e.g. the list "[7:3]" will expand to "7,6,5,4,3"). Since it is most common to start at the value 1 and go upwards, you can even drop the first value if you want in this case. Hence, the simple list macro "[10]" will expand into 10 elements numbered from 1 to 10.
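
The list notations described above can be sketched in Python as follows. The function names "parse_list_macro" and "expand_line" are hypothetical, and the sketch assumes that a colon only appears in numeric range definitions; MotifLab's own parser may treat edge cases differently.

```python
def parse_list_macro(definition: str) -> list[str]:
    """Turn a bracketed list macro definition into its list of values."""
    body = definition.strip()[1:-1]  # drop the enclosing brackets
    if ":" in body:  # range shorthand, e.g. [1:4] or [7:3]
        first, last = (int(part) for part in body.split(":"))
        step = 1 if last >= first else -1
        return [str(n) for n in range(first, last + step, step)]
    if "," not in body and body.strip().isdigit():  # single number, e.g. [10] means 1 to 10
        return [str(n) for n in range(1, int(body) + 1)]
    return [value.strip() for value in body.split(",")]


def expand_line(line: str, name: str, definition: str) -> list[str]:
    """Expand one protocol line into one line per list value (unchanged if the name is absent)."""
    if name not in line:
        return [line]
    return [line.replace(name, value) for value in parse_list_macro(definition)]


for expanded in expand_line("X_INDEX = new Numeric Variable(INDEX)", "INDEX", "[1:4]"):
    print(expanded)
```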

Note that it is possible to nest macros so that the definition of one macro contains the name of a second macro. Every time MotifLab expands a macro into one or more lines, it will check those lines over again for the presence of additional macro names and continue to expand macros until no more macros can be found. (For this reason you should avoid circular macros at all costs, since they will cause MotifLab to hang.) If a line in the protocol contains more than one macro, these will be expanded in a left-to-right order.

If you have a protocol containing macros, it is possible to preview the resulting expanded protocol by selecting "Expand Macros" from the "Protocol" menu in the GUI. This will expand all recognized macros in the protocol and show the result in a new protocol file (having the same name as the original protocol but suffixed with "-[macro expanded]").

Analyses

WARNING: When you perform analyses on sequences, motifs or modules, the resulting analysis object will store the names/identifiers of these data objects but not necessarily other information about them. When you view an analysis or output it, MotifLab may dynamically add more information about the sequences/motifs/modules to the output based on their current values, but if you have modified or replaced these objects after doing the analysis, these properties may not reflect the actual values that the objects had when the analysis was performed!

This usually only applies to individual sequences/motifs/modules and not to collections and partitions (which will normally be copied by the analysis). For instance, if you perform a "Count Motif Occurrences" with a Motif Collection containing motif M00143 and you later change the matrix values of this motif, the motif logo shown when viewing the results of the analysis will not reflect the actual motif that was used when performing the analysis.

benchmark

This analysis can be applied to: Region Dataset

The benchmark analysis can be used to evaluate the performance of motif discovery programs by comparing tracks with predicted TF binding sites (or other predicted regions) returned by these programs against a track containing the "correct" answer (e.g. all known TFBS in the sequences). The analysis calculates several common performance metrics (statistics), including sensitivity, specificity, positive predictive value, F-measure and Matthews correlation coefficient, as described below.

Some of the metrics (viz. sensitivity, PPV, PC, ASP and F-measure) can be evaluated at both a "nucleotide level" and "site level", whereas the remaining metrics are only defined at the "nucleotide level". The formulas for all metrics are based on four parameters that count the number of true positive instances (TP), false positives (FP), true negatives (TN) and false negatives (FN) respectively. At the "nucleotide level", a true positive is a nucleotide position that is correctly predicted as being part of a binding site (both the prediction track and answer track have regions that overlap with this nucleotide). A false positive is a nucleotide position that is within a region in the prediction track but not in the answer track (the nucleotide is wrongly predicted to be within a TFBS). A true negative is a nucleotide that is correctly predicted to not be within a TFBS but rather being part of the background sequence (it is outside regions in both the prediction and answer tracks). A false negative is a nucleotide that is predicted to be part of the background sequence when it is actually within a true TFBS (it is outside of regions in the prediction track but inside a region in the answer track).
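
The nucleotide-level definitions can be sketched in Python as shown below. This is only an illustration under stated assumptions: regions are given as (start, end) pairs with both ends inclusive, positions are numbered from 0 within each sequence, and the function names are hypothetical. The last lines also show the counts being summed over several sequences before any statistics are calculated.

```python
def covered_positions(regions):
    """Return the set of positions covered by (start, end) regions, both ends inclusive."""
    covered = set()
    for start, end in regions:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_counts(predicted, answer, sequence_length):
    """Count TP, FP, TN and FN nucleotides for a single sequence."""
    pred = covered_positions(predicted)
    ans = covered_positions(answer)
    tp = len(pred & ans)   # inside regions in both tracks
    fp = len(pred - ans)   # predicted but not in the answer
    fn = len(ans - pred)   # in the answer but not predicted
    tn = sequence_length - tp - fp - fn  # outside regions in both tracks
    return tp, fp, tn, fn


# Two hypothetical sequences, each given as (predicted regions, answer regions, length).
sequences = [
    ([(8, 13)], [(5, 10)], 20),
    ([], [(0, 4)], 10),
]
per_sequence = [nucleotide_counts(p, a, n) for p, a, n in sequences]
totals = tuple(sum(column) for column in zip(*per_sequence))
print(totals)  # order: TP, FP, TN, FN
```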

At the "site level", a region in the answer track that is overlapped by a region in the prediction track is counted as a true positive, a region in the answer track that is not overlapped by a predicted region is called a false negative and a predicted region that does not overlap with a region in the answer track is called a false positive (true negatives are not counted at the "site level"). The minimum amount of overlap between the answer region and predicted region that is required in order to call it a true positive can be specified as a parameter to the analysis.
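
A minimal Python sketch of the site-level counting is given below. It rests on assumptions that are not spelled out above: regions are (start, end) pairs with inclusive ends, the overlap fraction is measured against a single predicted region at a time, and a predicted region counts as a false positive only if it shares no position with any answer region. The function names and the default overlap value are hypothetical.

```python
def overlap_length(a, b):
    """Number of positions shared by two (start, end) regions with inclusive ends."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def site_counts(predicted, answer, min_overlap=0.25):
    """Count TP, FP and FN sites (true negatives are not defined at the site level)."""
    tp = fn = 0
    for target in answer:
        target_length = target[1] - target[0] + 1
        hit = any(overlap_length(target, p) / target_length >= min_overlap for p in predicted)
        if hit:
            tp += 1
        else:
            fn += 1
    # Assumption: a prediction is a false positive only if it touches no answer region at all.
    fp = sum(1 for p in predicted if all(overlap_length(p, t) == 0 for t in answer))
    return tp, fp, fn


answer = [(10, 19), (40, 49)]
predicted = [(15, 24), (60, 65), (48, 52)]
print(site_counts(predicted, answer))
```

In this example the prediction (48, 52) covers only 20% of the second answer region, so that answer region becomes a false negative, while the prediction itself is counted as neither a true positive nor a false positive under the assumptions above.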

If the benchmark analysis is based on several sequences, the TP/FP/TN/FN parameters will be counted for each sequence and then summed up to produce a total for the whole dataset before calculating the statistics below.

Metric	Description	Definition
Sensitivity (Sn)	Fraction of target regions that was correctly predicted. This metric is also called "recall".	TP/(TP+FN)
Specificity (Sp)	Fraction of background that was correctly predicted	TN/(TN+FP)
Positive predictive value (PPV)	Fraction of predicted regions that correctly correspond to true target regions. This metric is also called "precision".	TP/(TP+FP)
Negative predictive value (NPV)	Fraction of predicted background that correctly corresponds to true background	TN/(TN+FN)
Performance coefficient (PC)	The ratio between the intersection and union of the answer and prediction tracks. This metric is also called "Jaccard index".	TP/(TP+FP+FN)
Average site performance (ASP)	The arithmetic mean of sensitivity and PPV	1/2*(TP/(TP+FN)+TP/(TP+FP))
F-measure (F)	The harmonic mean of sensitivity and PPV	2TP/(2TP+FP+FN)
Accuracy (Acc)	The fraction of nucleotides in the sequences that were correctly classified (as either true target regions or true background)	(TP+TN)/(TP+TN+FP+FN)
Correlation coefficient (CC)	The correlation between the regions in the prediction track and the target regions in the answer track	(TP*TN-FP*FN)/sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))

All of these metrics, except CC, have a range between 0 (worst score) and 1.0 (best score). The CC metric has a range from -1.0 to 1.0, where a score of 1.0 means that the prediction and answer tracks are equal (at least in terms of overlapping regions), and a score of -1.0 means that the prediction track is exactly the opposite of the answer track (all true regions were predicted as background and all true background nucleotides were predicted as being within TFBS). A score of 0 means that there is no correlation between the prediction track and the answer track (such a result would be expected if the predictions were based on random guessing).
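
The definitions in the table can be written out as a short Python function. The function name "benchmark_metrics" is hypothetical, and the sketch assumes that none of the denominators are zero (MotifLab's handling of such cases is not described here).

```python
import math


def benchmark_metrics(tp, fp, tn, fn):
    """Calculate the benchmark statistics from the four summed counts."""
    sn = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "Sn": sn,
        "Sp": tn / (tn + fp),
        "PPV": ppv,
        "NPV": tn / (tn + fn),
        "PC": tp / (tp + fp + fn),
        "ASP": (sn + ppv) / 2,
        "F": 2 * tp / (2 * tp + fp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "CC": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)),
    }


metrics = benchmark_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```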

Some motif discovery methods are based on stochastic algorithms and may produce different results if run several times on the same dataset. For such methods it would be useful to report the average results (with standard deviation) across multiple runs. The benchmark analysis allows the results for multiple prediction tracks for the same method to be combined into a single average statistic. In order to do this, the "Aggregate" parameter flag must be set (see below) and the tracks must be given names in the format "methodname_number", i.e. the name of the track (which is often the name of the method) must be suffixed by an underscore followed by a number (which need not be incremental). For example, if you have run a method based on Gibbs sampling five times and the TFBS prediction tracks returned by this program are given the names "Gibbs_1", "Gibbs_2", "Gibbs_3", "Gibbs_4" and "Gibbs_5", the benchmark analysis will take the average score for each metric across these five tracks and present the results as a method called "Gibbs". Standard deviations are shown as error bars in the bar plot (in current versions of MotifLab the standard deviations are not reported as numbers).
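
The grouping by name suffix can be sketched in Python as follows. The function name "aggregate_scores" and the example scores are hypothetical, and the use of the sample standard deviation is an assumption (this text does not say which estimator MotifLab uses for the error bars).

```python
import re
from collections import defaultdict
from statistics import mean, stdev

# A track name qualifies for aggregation if it ends with an underscore followed by digits.
SUFFIX = re.compile(r"^(.+)_(\d+)$")


def aggregate_scores(track_scores):
    """Group scores by method name and return (mean, standard deviation) per method."""
    groups = defaultdict(list)
    for track, score in track_scores.items():
        match = SUFFIX.match(track)
        method = match.group(1) if match else track
        groups[method].append(score)
    return {
        method: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for method, scores in groups.items()
    }


sensitivity = {"Gibbs_1": 0.5, "Gibbs_2": 0.7, "Gibbs_3": 0.6, "MEME": 0.4}
aggregated = aggregate_scores(sensitivity)
print(aggregated)
```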

The analysis compares the answer track to all other Region Datasets known to MotifLab, but only results for Region Datasets that are currently visible in the GUI will be included when the Analysis object is examined or output to HTML or RawData formats. The order of the tracks in the output is based on their order in the Features Panel, and the colors used for the tracks in the bar chart are based on the current colors of the tracks. If MotifLab is run without the GUI in command-line mode, the visibility of the tracks can be set in the protocol with the "$show(trackname)" and "$hide(trackname)" display setting statements. The colors of the tracks can be set with "$color(trackname)=color" and the order of the tracks can be set with "$order(track1,track2,...,trackN)". These commands can also be used for aggregated tracks if the dollar sign is exchanged for an exclamation mark. E.g. to set the color for the aggregated "Gibbs" track based on the five tracks mentioned above, a command like "!color(Gibbs)=RED" could be used. It is also possible to specify the colors to use for the different performance metrics by using commands of the form "$setting("systemColor.Sensitivity")=RED". The standard colors for these metrics are defined in the startup script for MotifLab (go to the "Configure" menu and select "Edit Startup Script" to see how each one can be changed).

Name	Description
Answer	This parameter specifies which region track should be used as the "correct answer" that the prediction tracks should be compared against. Note that the prediction tracks to compare with the "answer" track are not specified. Rather, all other Region Datasets are automatically evaluated against the answer (but only the results for currently visible tracks are included in the output at any given time).
Groups	This optional parameter can either specify a Sequence Collection or a Sequence Partition. If a Sequence Collection is specified, the analysis will be restricted to sequences contained in this collection. If a partition is provided, separate benchmark analyses will be performed for each individual cluster in addition to a combined analysis based on all sequences in the partition. If the parameter is left undefined, the benchmark analysis will be based on all sequences.
Aggregate	If this parameter flag is set, the benchmark analysis will group tracks together if they start with the same name prefix and end with a suffix consisting of an underscore followed by an integer number (i.e. three tracks named "xxx_1", "xxx_2" and "xxx_3" will be grouped together as "xxx"). For each such group, the analysis will return a single combined score for each performance metric by taking the average value of the scores obtained for each individual track in the group.
Site overlap	A number (which should be greater than 0 and smaller than or equal to 1.0) which specifies the minimum fraction of a target region (in the "Answer" track) that is required to be overlapped by a prediction in order to call that prediction a "true positive" (TP) on the "site-level".

See also: Region Dataset, Sequence Partition, compare region datasets

binding sequence occurrences

This analysis can be applied to: Region Dataset

This analysis is somewhat similar to the Count Motif Occurrences analysis, except that instead of just reporting the number of sites found for each motif (based on region type), the counts are further subdivided based on the "sequence" property of the motif site. This means that for each motif the analysis reports the number of sites found for each unique binding sequence. For example, if a motif with consensus "CAsGTG" occurs a total of 7 times, the analysis could report that it occurs 4 times with the specific binding sequence "CACGTG" and 3 times with the binding sequence "CAGGTG". For each combination of motif and specific binding sequence, the analysis reports how many occurrences there are in total of that combination, the number of sequences that contain this combination and also a match score for this combination. The match score is a relative score between 0 and 100 that reflects how well the specific binding sequence matches the motif. The best matching binding sequence (the one which gives the highest score according to the binding matrix) is given a score of 100 and the worst possible match is given a score of 0.
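
The counting and the relative match score can be illustrated with the following Python sketch. The matrix values are invented for the "CAsGTG" example, and the linear rescaling between the worst and best possible matrix scores is an assumption; only the end points (0 and 100) are stated above, so MotifLab's actual formula may differ.

```python
from collections import Counter

# Hypothetical position frequency matrix for a motif with consensus "CAsGTG" (one dict per position).
MATRIX = [
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.6, "G": 0.4, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]


def match_score(sequence, matrix):
    """Rescale the raw matrix score to 0-100 (assumed to be linear between worst and best match)."""
    raw = sum(column[base] for column, base in zip(matrix, sequence))
    best = sum(max(column.values()) for column in matrix)
    worst = sum(min(column.values()) for column in matrix)
    return 100.0 * (raw - worst) / (best - worst)


# Binding sequences of the seven sites from the example above.
sites = ["CACGTG"] * 4 + ["CAGGTG"] * 3
counts = Counter(sites)
for sequence, count in counts.items():
    print(sequence, count, round(match_score(sequence, MATRIX), 1))
```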

Name	Description
Motif track	This parameter specifies the motif track containing the binding sites that will be counted. The motif regions should have defined "sequence" properties that specify the actual binding sequence of each site.
Motifs	This parameter specifies the motifs for which binding sequences will be counted.
Sequences	If a Sequence Collection is selected for this optional parameter, the analysis will only count binding sites within sequences from this collection. If left empty, all sequences will be considered.
Within regions	If a Region dataset is selected for this optional parameter, only motif sites that are located fully within the regions in this dataset will be counted. If left empty, all motif sites in the sequences will be considered.

See also: count motif occurrences

compare clusters to collection

No documentation currently available.

compare collections

This analysis can be applied to: Collection

This analysis compares two collection objects (of the same type) to see if they have any entries in common. The analysis reports the number of entries that are present in both collections, in one of the collections but not the other and also the number of entries that are not present in either of the two collections (but are present in a "total" collection). The analysis also calculates p-values which reflect the probability that the two collections should have at least the observed number of entries in common (or at most this number of entries in common) assuming the entries for the two collections had been randomly sampled from a larger collection (called "total").

Name	Description
First	This parameter specifies the first of the two collections.
Second	This parameter specifies the second of the two collections.
Total	This optional parameter specifies a larger collection that is used when calculating p-values. The "total" collection should include all entries from the two collections above (first and second) and perhaps other entries as well. If left unspecified it will default to a collection containing all known data objects of the relevant type (e.g. if the two collections are motif collections, the "total" collection will default to a collection containing all known motifs).
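
The two p-values can be illustrated with a hypergeometric calculation in Python. This is a sketch under an assumption: the text does not name the distribution used, but random sampling from a larger "total" collection without replacement corresponds to the hypergeometric distribution. The function and parameter names are hypothetical.

```python
from math import comb


def overlap_p_values(total, first, second, shared):
    """Return (P[overlap >= shared], P[overlap <= shared]) for two random collections.

    total:  number of entries in the "total" collection
    first:  size of the first collection
    second: size of the second collection
    shared: observed number of entries present in both collections
    """
    def probability(k):
        # Probability that exactly k of the `second` sampled entries fall in the first collection.
        return comb(first, k) * comb(total - first, second - k) / comb(total, second)

    smallest = max(0, first + second - total)  # smallest possible overlap
    largest = min(first, second)               # largest possible overlap
    p_at_least = sum(probability(k) for k in range(shared, largest + 1))
    p_at_most = sum(probability(k) for k in range(smallest, shared + 1))
    return p_at_least, p_at_most


p_at_least, p_at_most = overlap_p_values(total=10, first=4, second=3, shared=2)
print(p_at_least, p_at_most)
```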

compare motif occurrences

This analysis can be applied to: Region Dataset

This analysis will count the number of times each type of motif occurs in one set of sequences (target set) and compare this to the number of times the motifs occur in a second set (control set). Statistical tests (either a binomial test or a hypergeometric test) will assess whether some motifs occur significantly more often in one of these sets compared to the other. The output contains counts of how many times each motif type occurs in the target set and control set respectively and also p-values for the target and control sets. These p-values reflect the probability of encountering the observed number of hits (or higher) given an expected number of hits based on each motif's frequency in the opposite set. E.g. if a specific motif occurs N times in the target set and M times in the control set, the reported "target p-value" will be the p-value of observing N or more motif hits in a dataset of the same size as the target set based on an expected motif frequency given by M divided by the size of the control dataset (or more accurately the maximum number of times a motif of that size could occur within such a dataset). Motifs that are significantly overrepresented in the target set are marked in red in the output, whereas those that are significantly overrepresented in the control set are marked in green. Motifs that occur in both sets but are not significantly overrepresented in either set are marked in yellow.
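
The "target p-value" described above can be sketched as a binomial tail probability in Python. The function names are hypothetical, and "slots" is used here as a made-up term for the maximum number of times a motif could occur within a dataset. The direct summation is only meant for small illustrative numbers.

```python
from math import comb


def binomial_tail(hits, trials, frequency):
    """P(X >= hits) for X following a binomial distribution with the given trials and frequency."""
    return sum(
        comb(trials, k) * frequency**k * (1 - frequency) ** (trials - k)
        for k in range(hits, trials + 1)
    )


def target_p_value(target_hits, target_slots, control_hits, control_slots):
    """P-value of seeing at least target_hits, with the expected frequency taken from the control set."""
    expected_frequency = control_hits / control_slots
    return binomial_tail(target_hits, target_slots, expected_frequency)


# 3 hits in 10 possible positions in the target set versus 2 hits in 20 possible positions in the control set.
p = target_p_value(target_hits=3, target_slots=10, control_hits=2, control_slots=20)
print(p)
```

The "control p-value" would be obtained in the same way with the roles of the two sets swapped.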

Name	Description
Motif track	This parameter specifies the motif track containing the binding sites that will be counted.
Motifs	This parameter specifies the motifs which will be considered in the analysis.
Target set	This parameter specifies the first set of sequences. The motif occurrences in this set will be compared against those in the "control set" below.
Control set	This parameter specifies the second set of sequences. The motif occurrences in this set will be compared against those in the "target set" above.
Within regions	If a Region dataset is selected for this optional parameter, only motif sites that are located fully within the regions in this dataset will be counted. If left empty, all motif sites in the sequences will be considered.
Statistical test	This parameter specifies the statistical test to use for assessing whether a particular motif is significantly overrepresented in one sequence set compared to the other. The options are: Binomial: This test counts the number of times each motif occurs in each sequence set and calculates occurrence frequencies based on these counts divided by the maximum number of possible occurrences in each set. Given an observed motif count (for either target or control set), the binomial test considers the probability of encountering at least this many motif hits given an expected frequency based on the observed frequency in the opposite sequence set. Hypergeometric: This test only considers the number of sequences in the target and control set that contain each motif and not the actual number of occurrences. Let us say the target set has N sequences and n of these contain the motif. The control set on the other hand has M sequences with m containing the motif. So we have a total of N+M sequences of which n+m contain the motif. The hypergeometric test assesses the probability that n or more sequences should contain the motif if we were to pick N sequences at random from the total set (for comparing the target set to the control).
Significance threshold	The (uncorrected) p-value threshold below which motifs are considered to be significantly overrepresented in a sequence set. Significant p-values below the (corrected) threshold are marked with either red color (for the target set) or green color (for the control set) when output.
Bonferroni correction	This parameter can be used to apply automatic Bonferroni correction to the significance threshold above to account for multiple hypothesis testing. Bonferroni correction is a straightforward way of correcting the threshold by simply dividing the threshold value by the number of hypotheses (i.e. motifs). The threshold can either be divided by the number of motifs tested ("All motifs") or the number of different motifs actually encountered in the motif track ("Present motifs"). If other forms of correction are required, this can be achieved by selecting "None" for this parameter (to turn off automatic correction) and correcting the threshold manually instead. For example, Bonferroni correction assumes that all hypotheses (motifs) are independent of each other, which will usually not be the case since motif collections tend to contain many similar motifs. Setting the Bonferroni correction to "All motifs" in this case would lead to an overly strict threshold. A better option then would be to e.g. cluster all similar motifs together (using for example a Motif Partition), count the number of motif clusters and set the significance threshold to the uncorrected value divided by the number of motif clusters.
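
The effect of the three Bonferroni settings on the significance threshold can be sketched in Python. The function name "corrected_threshold" and the example numbers are hypothetical; the sketch only restates the division described in the table.

```python
def corrected_threshold(threshold, motifs_tested, motifs_present, mode="All motifs"):
    """Return the p-value threshold after the selected Bonferroni correction."""
    if mode == "All motifs":
        return threshold / motifs_tested
    if mode == "Present motifs":
        return threshold / motifs_present
    if mode == "None":
        return threshold
    raise ValueError(f"unknown correction mode: {mode}")


# 200 motifs were tested, but only 50 of them actually occur in the motif track.
print(corrected_threshold(0.05, 200, 50, "All motifs"))
print(corrected_threshold(0.05, 200, 50, "Present motifs"))
# Manual alternative: select "None" and divide by the number of motif clusters (here 20).
print(corrected_threshold(0.05 / 20, 200, 50, "None"))
```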

See also: count motif occurrences, compare region occurrences

compare motif track to numeric track

This analysis can be applied to: Region Dataset and Numeric Dataset

This analysis will compare a motif track against a numeric track and examine the numeric values found within each motif site. For each type of motif, the location of all binding sites (TFBS) for this motif are found. Next, different statistics are calculated based on the values that the chosen numeric track has within these TFBSs, including the smallest (minimum) value in the track within all TFBSs for each motif, the largest (maximum) value, the sum of all values within positions covered by TFBSs and the average value of the numeric track within the TFBSs (found by taking the sum and dividing by the total number of positions within the TFBSs). In addition, the analysis will also count the number of TFBSs for each motif where the average value of the numeric track within the TFBS (found by summing up the values within the TFBS and dividing by the length of the TFBS) is greater than (or equal to) some selected threshold.

Name	Description
Motif track	This parameter specifies the motif track containing the binding sites that will be considered in the analysis.
Motifs	This parameter specifies the motifs whose binding sites will be considered in the analysis.
Sequences	If a Sequence Collection is selected for this optional parameter, the analysis will only consider binding sites within sequences from this collection. If left empty, all sequences will be included.
Numeric track	This parameter specifies the numeric track that the motif track will be compared against
Threshold	One of the statistics reported by the analysis (called "count above threshold") will be based on the number of TFBSs for each motif where the average value of the numeric track within the TFBSs is greater than (or equal to) the threshold value specified here. For example, if this threshold is set to 0.8 and "Conservation" is selected for the numeric track, the analysis will report the number of TFBSs for each motif that have an average conservation score above (or equal to) 0.8.
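
The statistics for a single motif can be sketched in Python as follows. The function name and the example track are hypothetical. The sketch assumes that sites are (start, end) pairs with inclusive ends and that a position covered by two overlapping sites contributes twice to the sum; the text above does not say how MotifLab treats overlapping sites.

```python
def motif_track_statistics(sites, track, threshold):
    """Summarize the numeric track values found within the binding sites of one motif."""
    values = []
    count_above = 0
    for start, end in sites:
        site_values = track[start:end + 1]
        values.extend(site_values)
        # Per-site average: sum of the values within the site divided by the site length.
        if sum(site_values) / len(site_values) >= threshold:
            count_above += 1
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "count above threshold": count_above,
    }


conservation = [0.0, 1.0, 0.75, 0.5, 0.0, 0.25, 0.5, 0.0]
stats = motif_track_statistics(sites=[(1, 3), (5, 6)], track=conservation, threshold=0.75)
print(stats)
```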

compare region datasets

This analysis can be applied to: Region Dataset

This analysis compares two region datasets and calculates several ("nucleotide level") statistics based on their overlap, namely sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), performance coefficient (PC), average site performance (ASP), F-measure (F), accuracy (Acc) and Matthews correlation coefficient (CC). See the "benchmark" analysis for a detailed description of these statistics. The formulas for all these statistics are based on four parameters that count the number of true positive nucleotides (TP), false positives (FP), true negatives (TN) and false negatives (FN) respectively. A true positive is a nucleotide position that is inside a region in both of the two Region Datasets. A false positive is a nucleotide position that is within a region in the first dataset but not in the second dataset. A true negative is a nucleotide that is outside regions in both datasets. A false negative is a nucleotide that is outside a region in the first dataset but inside a region in the second dataset. The analysis will also show a pie chart illustrating how much overlap there is between regions in the two datasets (fraction of nucleotides within regions in both sets), as well as the fraction of positions within regions that are unique to either the first or the second dataset and finally the fraction of nucleotides that are outside regions in both datasets ("background").

Name	Description
First	This parameter specifies the first of the two Region Datasets to be compared
Second	This parameter specifies the second of the two Region Datasets to be compared
Sequences	If a sequence collection is specified for this optional parameter, the analysis will be limited to sequences in this collection

See also: benchmark

compare region occurrences

This analysis can be applied to: Region Dataset

This analysis will count the number of times each type of region occurs in one set of sequences (target set) and compare this to the number of times the regions occur in a second set (control set). A hypergeometric test will assess whether some regions occur significantly more often in one of these sets compared to the other. The output contains counts of how many times each region type occurs in the target set and control set respectively and also p-values for the target and control sets. Regions that are significantly overrepresented in the target set are marked in red in the output, whereas those that are significantly overrepresented in the control set are marked in green. Regions that occur in both sets but are not significantly overrepresented in either set are marked in yellow.

Name	Description
Region track	This parameter specifies the track containing the regions that will be counted.
Target set	This parameter specifies the first set of sequences. The region occurrences in this set will be compared against those in the "control set" below.
Control set	This parameter specifies the second set of sequences. The region occurrences in this set will be compared against those in the "target set" above.
Statistical test	This parameter specifies the statistical test to use for assessing whether a particular region type is significantly overrepresented in one sequence set compared to the other. So far there is only one option: Hypergeometric: This test only considers the number of sequences in the target and control set that contain each region and not the actual number of occurrences. Let us say the target set has N sequences and n of these contain a particular region type. The control set on the other hand has M sequences with m containing the region. So we have a total of N+M sequences of which n+m contain the region. The hypergeometric test assesses the probability that n or more sequences should contain the region if we were to pick N sequences at random from the total set (for example, the target set).
Significance threshold	The (uncorrected) p-value threshold below which region types are considered to be significantly overrepresented in a sequence set. Significant p-values below the (corrected) threshold are marked with either red color (for the target set) or green color (for the control set) when output.
Bonferroni correction	This parameter can be used to apply automatic Bonferroni correction to the significance threshold above to account for multiple hypothesis testing. Bonferroni correction is a straightforward way of correcting the threshold by simply dividing the threshold value by the number of hypotheses (i.e. region types) encountered in the datasets ("Present regions"). Selecting "None" for this parameter will turn off Bonferroni correction.

See also: count region occurrences, compare motif occurrences

count module occurrences

This analysis can be applied to: Region Dataset

This analysis counts the number of times each module occurs in a given module track (i.e. the number of sites for each module), and reports the total count for each module and also the number of sequences that contain each module.

Name	Description
Module track	This parameter specifies the module track containing the module sites that will be counted.
Module	This parameter specifies the modules for which occurrences will be counted.
Sequences	If a Sequence Collection is selected for this optional parameter, the analysis will only count module sites within sequences from this collection. If left empty, all sequences will be considered.
Within regions	If a Region dataset is selected for this optional parameter, only module occurrences that are located fully within the regions in this dataset will be counted. If left empty, all module occurrences in the sequences will be considered.

See also: count motif occurrences, count region occurrences

count motif occurrences

This analysis can be applied to: Region Dataset

This analysis counts the number of times each motif occurs in a given motif track (i.e. the number of binding sites for each motif), and reports the total count for each motif and also the number of sequences that contain each motif. If a Motif Numeric Map containing expected frequencies for each motif is specified (number of motif sites expected per position in the sequence), a p-value representing the probability of encountering at least as many motif instances as observed in the sequences will be reported and the statistical significance of motif overrepresentation will be assessed by a binomial test.

Name	Description
Motif track	This parameter specifies the motif track containing the binding sites that will be counted.
Motifs	This parameter specifies the motifs for which binding sites will be counted.
Sequences	If a Sequence Collection is selected for this optional parameter, the analysis will only count binding sites within sequences from this collection. If left empty, all sequences will be considered.
Within regions	If a Region dataset is selected for this optional parameter, only motif sites that are located fully within the regions in this dataset will be counted. If left empty, all motif sites in the sequences will be considered.
Background frequencies	If a Motif Numeric Map containing expected frequencies for each motif is specified for this optional parameter (number of motif sites expected per position in the sequence), a p-value representing the probability of encountering at least as many motif instances as observed in the sequences will be reported. The statistical significance of motif overrepresentation will be assessed by a binomial test and compared against a specified significance threshold (possibly corrected for multiple hypothesis testing). P-values of motifs that are significantly overrepresented (below the corrected p-value threshold) will be marked with a light red color when output. For motifs that have an expected frequency of 0.0, the binomial test cannot be used to assess significance. Such motifs will be assigned a default p-value of 0.0 and be marked with a saturated red color in the output.
Significance threshold	The (uncorrected) p-value threshold below which motifs are considered to be significantly overrepresented. Significant p-values below the (corrected) threshold are marked with red background colors when output.
Bonferroni correction	This parameter can be used to apply automatic Bonferroni correction to the significance threshold above to account for multiple hypothesis testing. Bonferroni correction is a straightforward way of correcting the threshold by simply dividing the threshold value by the number of hypotheses (i.e. motifs). The threshold can either be divided by the number of motifs tested ("All motifs") or the number of different motifs actually encountered in the motif track ("Present motifs"). If other forms of correction are required, this can be achieved by selecting "None" for this parameter (to turn off automatic correction) and correcting the threshold manually instead. For example, Bonferroni correction assumes that all hypotheses (motifs) are independent of each other, which will usually not be the case since motif collections tend to contain many similar motifs. Setting the Bonferroni correction to "All motifs" in this case would lead to an overly strict threshold. A better option then would be to e.g. cluster all similar motifs together (using for example a Motif Partition), count the number of motif clusters and set the significance threshold to the uncorrected value divided by the number of motif clusters.

: count module occurrences, count region occurrences, compare motif occurrences, Motif Numeric Map

count region occurrences

This analysis can be applied to: Region Dataset

This analysis counts the number of times each region type occurs in a given region track and reports the total count for each region type and also the number of sequences that contain each region type. For example, for a track containing repeat regions, the analysis will first determine which types of repeat regions are present in the sequences (e.g. different types of "Alu" repeats, SINEs, LINEs and simple repeats etc.) and then count the number of times each such repeat type occurs.

Name	Description
Region track	This parameter specifies the region track containing the regions that will be counted.
Sequences	If a Sequence Collection is selected for this optional parameter, the analysis will only count regions within sequences from this collection. If left empty, all sequences will be considered.
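
The two numbers reported per region type (total occurrences and number of sequences containing the type) can be sketched as follows. This is an illustration under assumptions: regions are represented as hypothetical `(sequence name, region type)` pairs and the repeat names are invented.

```python
from collections import Counter

def count_region_types(regions):
    """Return (total count per type, number of sequences containing each type)."""
    total = Counter()
    sequences_with_type = {}
    for sequence_name, region_type in regions:
        total[region_type] += 1
        sequences_with_type.setdefault(region_type, set()).add(sequence_name)
    support = {t: len(names) for t, names in sequences_with_type.items()}
    return total, support

# Hypothetical repeat track with two region types in three sequences
regions = [("seq1", "AluY"), ("seq1", "AluY"), ("seq2", "AluY"),
           ("seq2", "L1"), ("seq3", "L1")]
total, support = count_region_types(regions)
print(dict(total), support)
```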

: count motif occurrences, count module occurrences

evaluate prior

This analysis can be applied to: Region Dataset

One of the key functionalities of MotifLab (and its predecessor PriorsEditor) is the ability to create numeric tracks that can be used as positional priors to guide motif discovery programs by assigning higher scores to positions that are considered more likely to harbour transcription factor binding sites. Such priors tracks can be created manually step-by-step by using different operations to combine information from multiple feature tracks or they can be generated automatically with PriorsGenerators that have been trained to discover the relationship between binding site occurrences and other genomic features. In either case, it will be useful to evaluate the potential of positional priors tracks generated in a certain way by comparing such a priors track against a region track containing known binding regions, to see if the track generated with this particular approach indeed has higher values inside these regions compared to outside. The "evaluate prior" analysis will do just this.

The analysis has two different modes of operation depending on whether or not the optional "Priors track" parameter has been specified. If no "Priors track" has been selected, the analysis will be run in "general mode". However, if a "Priors track" has been selected, this particular track will be analyzed in more detail in "specific mode".

General mode
In general mode, all available numeric tracks and region tracks will be compared to the given target track and evaluated. For each track, a ROC-curve will be generated reflecting its potential for discriminating positions within regions from background positions based on the track's score at each position. Also, the area under the curve (AUC) will be calculated for the ROC-curve. The ROC-curve for a track is generated in the following way: First, all the positions within the track are sorted in descending order according to the value at each position, so that the positions with the highest values come first. Then, starting at (0,0) in the graph and going through the sorted positions one by one, the ROC-curve moves one step up if the next sorted position is within a region in the target track and one step to the right if the next sorted position is outside of any regions. After all positions have been covered, the ROC-curve should end up at coordinate (1,1). (Note that the ROC-graph has been normalized so that the x-axis represents the fractional number of positions that lie outside of regions and the y-axis represents the fractional number of positions that lie within regions). Hence, if a certain priors track tends to have higher values within regions of the target track compared to outside, the graph will tend to move more upwards at the beginning and then to the right at the end, resulting in a larger area under the curve. On the other hand, if a track tends to have higher values outside of the target regions, the graph will move to the right at the beginning and then more upwards towards the end, resulting in a smaller area under the curve. Higher AUC values thus mean that the priors track tends to have higher values inside of target regions. If all positions inside of regions have higher prior values than the background (so a clear separation between regions and background can be made based on the priors values), the ROC-curve will move from (0,0) to (0,1) and then to (1,1) which gives a perfect AUC-score of 1.0.
If a priors track tends to give equally high values to positions inside and outside of regions (so the positions inside and outside are about uniformly distributed when sorted by numeric value), the ROC-curve will tend to move in a straight diagonal line from (0,0) to (1,1) resulting in an AUC-score of 0.5. In this case, the numeric priors track shows no ability to discriminate between regions and background. ROC-curves for region tracks are calculated in a similar fashion by treating positions within regions as having a numerical value of 1.0 and positions outside regions as having the value 0.0.
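
The curve construction described above can be sketched in Python. This is an illustration only, not MotifLab's implementation: `roc_curve` and `auc` are hypothetical names, positions are visited from the highest value to the lowest (which is what makes a good prior climb first), tied values get no special treatment, and both classes are assumed to be non-empty.

```python
def roc_curve(values, in_region):
    """values: prior score per position; in_region: True where the position
    lies inside a target region. Returns normalized points from (0,0) to (1,1)."""
    n_inside = sum(in_region)
    n_outside = len(in_region) - n_inside
    # Visit positions from highest to lowest score
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    right = up = 0
    points = [(0.0, 0.0)]
    for i in order:
        if in_region[i]:
            up += 1      # position inside a region: one step up
        else:
            right += 1   # background position: one step to the right
        points.append((right / n_outside, up / n_inside))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Perfect separation: both region positions score higher than the background
perfect = auc(roc_curve([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))
# Interleaved scores give an intermediate value
mixed = auc(roc_curve([4, 3, 2, 1], [True, False, True, False]))
print(perfect, mixed)
```

A region track can be passed through the same functions by using 1.0 for positions inside its regions and 0.0 elsewhere.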

Note that even though ROC-curves and AUC-scores are calculated for all available numeric and region tracks, only the tracks that are currently visible in the GUI will be included in the graph whenever the analysis is displayed in a dialog or output using the "output" operation. Hence, if you only wish to include a few selected tracks in the graph, you can hide the tracks you don't want to include. Also, the color used for the ROC-curve of each track will be the same as the currently selected display color for that track. For analyses performed outside the GUI (running in CLI-mode from a protocol script), display setting statements can be used to hide tracks and set the colors for each track.

Specific mode
If a specific numeric track has been selected for the "Priors track" parameter, the analysis will be performed in "specific mode" which gives a more in-depth and detailed analysis of the potential of using the selected track as positional priors. First, the ROC-curve and area under the curve (AUC) are calculated for the priors track in the same way as if the analysis had been performed in "general mode". Second, a "precision-recall" graph is calculated that shows the maximum "precision" (positive predictive value) that can be achieved for different recall (sensitivity) levels.

The analysis will also produce additional graphs showing how the scores for several different nucleotide-level performance statistics will vary depending on a chosen cutoff threshold for the selected priors track. For a given threshold level, all positions where the value of the priors track is higher than or equal to this threshold (or strictly higher depending on the 'threshold' parameter) are considered "positive" positions and all positions with values below the threshold are considered as "negative". Positive positions that are within target regions are further classified as "true positives" (TP) and those outside are classified as "false positives" (FP). Conversely, negative positions inside target regions are classified as "false negatives" (FN) and those outside as "true negatives" (TN). These four parameters (TP/FP/FN/TN) serve as basis for calculating several nucleotide-level statistics that are described in detail in the manual entry for the benchmark analysis. For each nucleotide-level statistic, such as e.g. sensitivity, the threshold will be varied from the lowest numeric value in the priors track to the highest value (in increments of 1/100 of the range) and the graph will show the performance that can be achieved according to that statistic for each threshold level. For example, when evaluating "Conservation" as a priors track for predicting TFBS, the sensitivity value (y-axis) at threshold=0.65 (x-axis) reflects the fraction of TFBS positions that are correctly predicted if we assume that all positions that have a Conservation value of 0.65 or higher reside within TFBSs. The analysis will also determine two "optimal thresholds". 
The first is the threshold value which gives the best trade-off between sensitivity and specificity (that is, the threshold which results in the highest arithmetic mean of the sensitivity and specificity scores), and the second is the threshold which results in the highest possible score for the accuracy statistic.
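
The threshold sweep and the two "optimal thresholds" can be sketched as below. The sketch assumes the standard definitions sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) and accuracy = (TP+TN)/total (the authoritative definitions are in the benchmark entry); the function names and the data are hypothetical.

```python
def confusion(values, in_region, threshold, strict=False):
    """Count TP, FP, FN, TN for one threshold.
    strict=False mimics "Above or equal", strict=True mimics "Strictly above"."""
    tp = fp = fn = tn = 0
    for value, inside in zip(values, in_region):
        positive = value > threshold if strict else value >= threshold
        if positive and inside:
            tp += 1
        elif positive:
            fp += 1
        elif inside:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def sweep(values, in_region, steps=100, strict=False):
    """Vary the threshold from the lowest to the highest value in 1/steps increments."""
    lo, hi = min(values), max(values)
    rows = []
    for s in range(steps + 1):
        t = lo + (hi - lo) * s / steps
        tp, fp, fn, tn = confusion(values, in_region, t, strict)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        accuracy = (tp + tn) / len(values)
        rows.append((t, sensitivity, specificity, accuracy))
    return rows

values = [0.1, 0.2, 0.7, 0.9]            # hypothetical priors track
in_region = [False, False, True, True]   # hypothetical target track
rows = sweep(values, in_region)
best_tradeoff = max(rows, key=lambda r: (r[1] + r[2]) / 2)  # mean of sens. and spec.
best_accuracy = max(rows, key=lambda r: r[3])
print(best_tradeoff, best_accuracy)
```

When several thresholds tie for the best score, `max` returns the lowest of them; how MotifLab breaks such ties is not stated in this manual.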

Name	Description
Target track	This parameter should specify a Region Dataset containing known instances of the regions that are predicted by the positional priors track(s) being evaluated (i.e. for evaluating positional priors to predict TF binding sites, the track should contain TF binding sites). The track should preferably be complete and representative for the given region type.
Priors track	This optional parameter can specify a particular positional priors track to analyze in "specific mode". If no track is selected here, the analysis will be done in "general mode".
Sequences	If a sequence collection is specified for this optional parameter, the analysis will be limited to sequences from this collection.
Threshold	The threshold parameter selects which comparison operator to use for classifying positions as either "positives" or "negatives" when the analysis is performed in "specific mode". The setting "Above or equal" will classify all positions that have priors values equal to or above the current threshold level as positive (and those below as negative) whereas the setting "Strictly above" will classify positions with values above the current threshold level as positive (and those with values equal to or below as negative).

: benchmark, numeric dataset distribution

GC-content

This analysis can be applied to: DNA Sequence Dataset

This analysis calculates the GC-content (percentage) in a given DNA track for every sequence and possibly also additional statistics for a group or groups of sequences (such as the minimum, maximum, average and median GC-content for the sequences in the group).

Name	Description
DNA track	This parameter specifies which DNA track to calculate GC-content for.
Groups	This optional parameter can either specify a Sequence Collection or a Sequence Partition. If a Sequence Collection is specified, the analysis will be restricted to sequences contained in this collection and it will also calculate the minimum, maximum and average GC-content (with standard deviation) for the sequences in the group, along with the median value and 1st and 3rd quartiles. If a Sequence Partition is specified, the statistics mentioned above will be calculated separately for each cluster of sequences in the partition. If the parameter is left undefined, the GC-content for every sequence will be reported, but no other statistics will be given. To calculate GC-statistics based on all sequences, make sure to select the "AllSequences" collection.
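
As a rough illustration of what is reported, the Python sketch below computes the GC percentage per sequence and the group statistics for a collection and for a partition. It makes several assumptions that this manual does not confirm: every base (including any 'N') counts towards the sequence length, the sample standard deviation is used, and quartiles follow the default method of Python's `statistics.quantiles`. All names and sequences are hypothetical.

```python
from statistics import mean, median, stdev, quantiles

def gc_percent(sequence):
    """Percentage of G and C bases in a sequence (case-insensitive)."""
    s = sequence.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)

def group_statistics(values):
    """Summary statistics for a group of at least two sequences."""
    q1, _, q3 = quantiles(values, n=4)
    return {"min": min(values), "max": max(values), "average": mean(values),
            "stdev": stdev(values), "median": median(values), "q1": q1, "q3": q3}

dna = {"s1": "GGCC", "s2": "ATGC", "s3": "ATAT", "s4": "GCGA"}
per_sequence = {name: gc_percent(seq) for name, seq in dna.items()}

# A collection gives one set of statistics ...
collection = ["s1", "s2", "s3", "s4"]
stats = group_statistics([per_sequence[n] for n in collection])

# ... a partition gives one set per cluster
partition = {"high": ["s1", "s4"], "low": ["s2", "s3"]}
per_cluster = {cluster: group_statistics([per_sequence[n] for n in names])
               for cluster, names in partition.items()}
print(per_sequence, stats, per_cluster)
```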

: Sequence Collection, Sequence Partition

motif collection statistics

This analysis can be applied to: Motif Collection

Calculates statistics related to motif size, IC-content and GC-content for the motifs in a given collection. The analysis reports the minimum, maximum, average, standard deviation, median and 1st and 3rd quartiles for these three motif properties and also shows histograms of their distributions.

Name	Description
Motif Collection	The Motif Collection to apply the analysis to

motif position distribution

This analysis can be applied to: Region Dataset

This analysis will analyze the positional distribution of each motif in a motif track. It can be used to assess whether motifs of certain types are uniformly distributed within sequences or if they tend to be located in the same location relative to a selected alignment anchor position across sequences (for example if some motifs tend to occur at the same distance relative to the transcription start site in several different sequences). To perform the analysis, the sequences are first aligned according to the selected anchor. Next, for each motif type the binding sites (TFBS) for this motif are located and a distribution is calculated based on the distance between the center of each TFBS and the alignment anchor. Different statistics can be calculated based on this distribution, but so far the only statistics reported are the standard deviation and kurtosis. In addition to these two statistics, graphical histograms can be created which show the distribution of the binding sites for each motif type.

Name	Description
Motif track	This parameter specifies the motif track containing the binding sites that will be considered in the analysis.
Motifs	This parameter specifies the motifs whose binding sites will be considered in the analysis.
Sequences	If a Sequence Collection is selected for this optional parameter, the analysis will only consider binding sites within sequences from this collection. If left empty, all sequences will be included.
Alignment anchor	This parameter specifies an alignment anchor for each sequence which will serve as the reference point when estimating the relative position of each motif site (TFBS). This setting is only important if the sequences have different lengths or if the relative position of TSS/TES varies between sequences. If all sequences have the same length, the upstream/downstream/center anchors will all give the same result, and if in addition the relative position of TSS/TES is the same, all anchors will give the same results. Note that sequences are always aligned according to their relative orientation. TSS: The sequences will be aligned at the Transcription Start Site (TSS). TES: The sequences will be aligned at the Transcription End Site (TES). Upstream: The sequences will be aligned at their upstream end. Downstream: The sequences will be aligned at their downstream end. Center: The sequences will be aligned at their center position.
Include histograms	If this option is selected, histograms reflecting the positional distribution of the binding sites for each motif are computed and the data for these histograms are stored in the analysis data object. Note that even if this option is not selected, MotifLab will attempt to generate histograms on-the-fly when displaying the analysis object in the GUI. However, in order to include such histograms in output documents (e.g. HTML or Excel) or include histograms in collated analyses, this option must be selected.
Motif anchor	When calculating the standard deviation and kurtosis of the positional distribution, the distance between the motif and the alignment anchor is always measured from the center of the motif site (TFBS). However, when creating the histogram, the motif anchor parameter can be used to specify how to select the target bin(s) in the histogram in relation to the location of a motif site (TFBS). Upstream: The TFBS is assigned to the bin covering the most upstream position in the site. Center: The TFBS is assigned to the bin covering the center position in the site. Downstream: The TFBS is assigned to the bin covering the most downstream position in the site. Span: The TFBS is assigned to all bins overlapping with the span of the site.
Support	If this option is selected, each bin will only be counted once for each sequence and the histogram will reflect the number of sequences that have a binding site for that bin, not the total number of binding sites that are assigned to the bin.
Bins	An integer number which specifies how many bins to divide the sequence range into for the histograms
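
The interplay between the "Motif anchor", "Support" and "Bins" parameters, and the two reported statistics, can be sketched in Python. This is a sketch under stated assumptions, not MotifLab's code: sites are given in aligned coordinates (0-based, inclusive, with `start` as the upstream end), the standard deviation and kurtosis are the population versions (kurtosis as the fourth central moment divided by the squared variance), and all function names are hypothetical.

```python
def std_and_kurtosis(distances):
    """Population standard deviation and (non-excess) kurtosis of site-to-anchor distances."""
    n = len(distances)
    mean = sum(distances) / n
    m2 = sum((d - mean) ** 2 for d in distances) / n
    m4 = sum((d - mean) ** 4 for d in distances) / n
    if m2 == 0:
        return 0.0, float("nan")  # all sites lie at the same distance
    return m2 ** 0.5, m4 / m2 ** 2

def bins_for_site(start, end, span_length, n_bins, motif_anchor="Center"):
    """Histogram bins that a site covering positions start..end is assigned to."""
    def bin_of(pos):
        return min(n_bins - 1, max(0, pos * n_bins // span_length))
    if motif_anchor == "Upstream":
        return [bin_of(start)]
    if motif_anchor == "Downstream":
        return [bin_of(end)]
    if motif_anchor == "Center":
        return [bin_of((start + end) // 2)]
    return list(range(bin_of(start), bin_of(end) + 1))  # "Span"

def histogram(sites, span_length, n_bins, motif_anchor="Center", support=False):
    """sites: (sequence_name, start, end) tuples for one motif type."""
    counts = [0] * n_bins
    seen = set()
    for seq, start, end in sites:
        for b in bins_for_site(start, end, span_length, n_bins, motif_anchor):
            if support and (seq, b) in seen:
                continue  # with Support, a bin counts at most once per sequence
            seen.add((seq, b))
            counts[b] += 1
    return counts

sites = [("s1", 0, 4), ("s1", 5, 9), ("s2", 3, 7)]
all_sites = histogram(sites, span_length=100, n_bins=10)
by_support = histogram(sites, span_length=100, n_bins=10, support=True)
print(all_sites, by_support, std_and_kurtosis([-2, -1, 0, 1, 2]))
```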

motif regression

This analysis can be applied to: Region Dataset

Name	Description
Motif track	This parameter specifies a motif track
Motifs	This parameter specifies which motifs to consider in the analysis.
Sequence values	If this parameter flag is set, the benchmark analysis will group tracks
Sequences	If a sequence collection is selected for this optional parameter, the analysis will be limited to include only sequences from this collection.
Skip non-regulated	This parameter allows
Normalize

: single motif regression

motif similarity

This analysis can be applied to: Motif

This analysis will compare a single selected motif against a collection of motifs using all motif similarity metrics that are known to MotifLab (which currently include "Average Log-Likelihood Ratio", "Chi-squared", "Kullback-Leibler Divergence", "Pearson's Correlation", "Pearson's Correlation (weighted)" and "Sum of Squared Distances"). The analysis will report the raw score values for these metrics.

Name	Description
Target motif	The target motif that the other motifs should be compared against
Motifs	The collection of motifs to compare against the target motif

numeric dataset distribution

This analysis can be applied to: Numeric Dataset

This analysis will calculate distribution statistics for a Numeric Dataset; namely the number of bases in the track, the minimum and maximum values, the average and standard deviation and 1st, 2nd (median) and 3rd quartiles. If an optional Region Dataset is specified, the analysis will calculate separate statistics based on values inside regions in this dataset versus values outside these regions. A histogram will also be created showing the distribution of the values. In this histogram, the 1st/2nd/3rd quartiles plus the min and max values will be shown in a box-and-whiskers plot, whereas the average value will be shown as a diamond with standard deviation illustrated by arrows.

Name	Description
Numeric dataset	This parameter specifies the Numeric Dataset that will be analyzed.
Region dataset	If this optional parameter specifies a Region Dataset, the distribution statistics will be calculated separately for positions inside regions in this dataset and positions outside regions. If the parameter is left undefined, only one set of statistics will be calculated based on all values in the track.
Sequences	If this optional parameter is specified, the analysis will be limited to the sequences in this sequence collection. If left undefined, all sequences will be included in the analysis.
Normalize	The graphical histogram generated by this analysis will show what fraction of the bases in the track have values falling within the value range of each histogram bin. If the Region dataset parameter above is defined, selecting this "normalize" parameter will normalize the histograms for the "inside regions" distribution and "outside regions" distribution independently of each other (so that each distribution sums to 100%) while showing them at the same scale in the plot. If the normalize parameter is not selected, the two distributions will be scaled so that they together sum to 100%. If one of the two distributions is based on very few bases compared to the other, the histogram for that distribution can appear relatively small (low in height) compared to the other when both are plotted at the same scale. The normalization parameter will only affect the appearance (relative heights) of the histograms and not the distribution statistics.
Bins	An integer number specifying the number of bins to divide the value range into for the histogram
Cumulative histogram	If this parameter is selected, the histogram(s) will show the cumulative distribution(s) where each bin reflects the fraction of bases that have values equal to or lower than the (upper) value for that bin.
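
The effect of the "Normalize" and "Cumulative histogram" parameters can be illustrated with a short Python sketch. The function names and the values are hypothetical, and equal-width bins over a known value range are assumed.

```python
from itertools import accumulate

def histogram_counts(values, lo, hi, n_bins):
    """Count values per equal-width bin over the range lo..hi (hi > lo)."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        counts[min(n_bins - 1, int((v - lo) / width))] += 1
    return counts

def as_fractions(inside, outside, normalize):
    """Scale two count histograms to fractions of bases."""
    if normalize:
        # Each distribution sums to 1.0 (100%) on its own
        return ([c / sum(inside) for c in inside],
                [c / sum(outside) for c in outside])
    # Otherwise the two distributions sum to 1.0 together
    total = sum(inside) + sum(outside)
    return [c / total for c in inside], [c / total for c in outside]

def cumulative(fractions):
    """Fraction of bases with values up to the upper edge of each bin."""
    return list(accumulate(fractions))

inside = histogram_counts([0.8, 0.9], 0.0, 1.0, 2)
outside = histogram_counts([0.1, 0.1, 0.2, 0.3, 0.4, 0.6, 0.1, 0.2], 0.0, 1.0, 2)
normalized = as_fractions(inside, outside, normalize=True)
joint = as_fractions(inside, outside, normalize=False)
print(normalized, joint, cumulative(normalized[1]))
```

With only 2 of the 10 bases inside regions, the joint scaling makes the "inside" histogram much lower than the normalized one, which is the effect described for the Normalize parameter.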

: numeric map distribution

numeric map correlation

This analysis can be applied to: Numeric Map

This analysis compares two Numeric Maps to determine if the values for corresponding entries are correlated (i.e. if entries that have relatively high values in the first map also have relatively high values in the second map, etc). The analysis calculates and reports two correlation statistics, namely "Pearson's correlation" and "Spearman's (rank) correlation".

Name	Description
First	This parameter specifies the first of the two maps to compare against each other
Second	This parameter specifies the second of the two maps to compare against each other. Note that this must be of the same type as the first map.
Collection	If this optional parameter is specified, the correlation analysis will be limited to entries in this collection (which must be of the same basic type as the two maps). If left unspecified, all the entries in the map will be considered.
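
For readers who want to check the two statistics by hand, the sketch below computes them in Python using the textbook definitions (Spearman's correlation as Pearson's correlation of the ranks, with tied values sharing their average rank). The maps are represented as plain dictionaries with made-up values; default values of Numeric Maps and MotifLab's own tie handling are not modelled.

```python
def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

def paired_values(first, second, collection=None):
    """Values for entries present in both maps, optionally limited to a collection."""
    keys = [k for k in first if k in second and (collection is None or k in collection)]
    return [first[k] for k in keys], [second[k] for k in keys]

# Hypothetical maps of motif size and information content
motif_size = {"M1": 6, "M2": 8, "M3": 10, "M4": 12}
motif_ic = {"M1": 7.0, "M2": 9.5, "M3": 11.0, "M4": 20.0}
xs, ys = paired_values(motif_size, motif_ic)
print(pearson(xs, ys), spearman(xs, ys))
```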

numeric map distribution

This analysis can be applied to: Numeric Map

This analysis will calculate distribution statistics for the values in a Numeric Map; namely the number of entries, the minimum and maximum values, the average and standard deviation and 1st, 2nd (median) and 3rd quartiles. If an optional Partition is specified, the analysis will calculate separate distribution statistics for each cluster in the Partition. A histogram will also be created showing the distribution of the values. In this histogram, the 1st/2nd/3rd quartiles plus the min and max values will be shown in a box-and-whiskers plot, whereas the average value will be shown as a diamond with standard deviation illustrated by arrows.

Name	Description
Numeric Map	This parameter specifies the Numeric Map that will be analyzed.
Group	This optional parameter can either specify a Collection or a Partition of the same type as the Numeric Map. If a Collection is specified, the distribution statistics will only be based on entries from that collection. If a Partition is specified, the analysis will calculate separate distribution statistics for each cluster in the Partition. If left undefined, the distribution will be based on all entries in the Numeric Map. (Note that entries with defaulting values will always be included).
Normalize	The graphical histogram generated by this analysis will show what fraction of the entries in the Numeric Map have values falling within the value range of each histogram bin. If a Partition is selected for the "Group" parameter above, selecting this "normalize" parameter will normalize the histograms for each cluster in the Partition independently of each other (so that each distribution sums to 100%) while showing them at the same scale in the plot. If the normalize parameter is not selected, the distributions will be scaled so that they together sum to 100%. If some of the clusters have very few entries compared to others, the histograms for those distributions can appear relatively small (low in height) compared to the others when all are plotted at the same (unnormalized) scale. The normalization parameter will only affect the appearance (relative heights) of the histograms and not the distribution statistics.
Bins	An integer number specifying the number of bins to divide the value range into for the histogram

: numeric dataset distribution

region dataset coverage

This analysis can be applied to: Region Dataset

This analysis looks at the coverage of regions in a Region Dataset and calculates the fraction of each sequence that is covered by regions (in terms of nucleotides). It can also calculate min/max/average/median coverage statistics for a single group of sequences (specified as a Sequence Collection) or several groups of sequences (specified as clusters in a Sequence Partition).

Name	Description
Region dataset	This parameter specifies the region track that should be analyzed
Groups	This optional parameter can either specify a Sequence Collection or a Sequence Partition. If a Sequence Collection is specified, the analysis will be restricted to sequences contained in this collection and it will also calculate the minimum, maximum and average region coverage (with standard deviation) for the sequences in the group, along with the median value and 1st and 3rd quartiles. If a Sequence Partition is specified, the statistics mentioned above will be calculated separately for each cluster of sequences in the partition. If the parameter is left undefined, the region coverage for every sequence will be reported, but no other statistics will be given. To calculate coverage statistics based on all sequences, make sure to select the "AllSequences" collection for this parameter.
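
Because coverage is measured in nucleotides, overlapping regions must not be counted twice. A minimal Python sketch of the per-sequence calculation is shown below; the function name is hypothetical and regions are assumed to be given as 0-based, inclusive `(start, end)` positions within the sequence.

```python
def coverage_fraction(regions, sequence_length):
    """Fraction of the sequence covered by at least one region."""
    covered = 0
    last_end = -1  # last position already counted as covered
    for start, end in sorted(regions):
        start = max(start, last_end + 1, 0)   # skip bases counted before
        end = min(end, sequence_length - 1)   # clip to the sequence
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / sequence_length

# Two overlapping regions (0-9 and 5-14) plus a separate one (50-59)
fraction = coverage_fraction([(0, 9), (5, 14), (50, 59)], 100)
print(fraction)
```

Group statistics (minimum, maximum, average and so on) would then be computed from these per-sequence fractions.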

: GC-content

single motif regression

No documentation currently available.

Tools

MotifLab's graphical interface includes several tools that can be used to explore, analyse or manipulate data in an interactive manner. All tools can be found under the "Tools" menu in the main menu bar and some also have buttons in the tool bar.

Actions performed with these interactive tools can not be recorded in protocols and can therefore not be repeated automatically (although some tools, like Crop/Extend Sequence, have analogous operations).

Mouse tools

Selection tool

The Selection Tool can be used to select subsegments ("selection windows") of your sequences and limit the application of some operations to positions within these windows or regions overlapping the windows. To define a selection window, point the mouse at either the start or end of the window within the sequence, press the mouse button and drag the mouse to the other end of the window. Selection windows will be shown as transparent yellow overlays on the sequences. You can define several selection windows by holding down the ALT-key while dragging the mouse (overlapping selection windows will be merged). You can also subtract from the current selections by holding down the SHIFT-key. If you click anywhere within a sequence without holding down either ALT or SHIFT, the current selection windows will be discarded. If you point the mouse at a sequence and press the A-key, the whole sequence will be selected. If you press the I-key, the selection windows in that sequence will be inverted. If you hold down the ALT-key while pressing either A or I, this functionality will be applied to all sequences.

If you have defined at least one selection window and chosen to perform an operation such as e.g. "filter", an additional check box may be shown in the operation dialog which reads "Apply operation only within selected windows" (followed by a specification of the sequence coordinates for these windows). If this option is selected (which it is by default) the application of the operation will be limited to the currently selected segments of the sequences. (See selection windows conditions.)

In MotifLab 2, it is possible to copy the DNA sequence from the selected window(s) of the currently focused sequence to the clipboard by pressing CONTROL+C. The DNA is taken from the topmost DNA Sequence Dataset found in the Features Panel, and if you have selected multiple segments, they will be copied to the clipboard as separate lines.

Move tool

If your sequences span a larger region than can currently fit into the sequence visualization window, you can use the Move Tool to pan the viewport to bring other parts of the sequence into view. Just press the mouse button anywhere inside a track to grab hold of the sequence and drag the mouse to move the sequence viewport left or right. Alternatively, you can use the left and right arrow keys on the keyboard to move the sequence viewport.

Zoom tool

The Zoom Tool can be used to change the visualization scale of a single sequence. Click anywhere inside a track to zoom in at that position or hold down the SHIFT key while clicking to zoom out. (Alternatively, you can zoom out by pressing the middle mouse button, if you have one). You can zoom in on a selected region of the sequence by pressing the mouse button at one end of the region and dragging the mouse to define the region. When you release the mouse button, the viewport will be adjusted to zoom in on the region you selected. You can also zoom in/out at a position by holding down the CONTROL key and using the wheel on your mouse to change the scale. This latter option also works with the Selection Tool, Move Tool and Draw Tool.

Draw tool

The Draw Tool can be used to edit any feature datasets by drawing directly into a track with the mouse.

DNA Sequence Datasets
There are two ways to edit DNA Sequence Datasets with the Draw Tool. The first way is to click on a base in the track with the mouse and then use the keys on the keyboard to type in a new sequence which will overwrite the old one. The new bases are inserted left-to-right in the orientation the sequence is currently shown in. A white cursor is drawn around the base currently being edited (this is easier to see if you zoom in). By default, new bases are entered in uppercase letters, but you can also enter lowercase letters by holding down the SHIFT key. To stop editing, press the ENTER key or move the mouse pointer outside of the track (so be careful not to accidentally move the mouse while editing since this could abort the edit prematurely).

The other way to edit DNA Sequence Datasets is to press the mouse button on a base in the track and, while holding the button down, move the mouse up or down to change the base at that position. (Moving the mouse will cycle through the bases A, C, G and T). Moving the mouse sideways will move the cursor so that you can edit other positions as well.
Hold down the SHIFT key to enter lowercase letters instead of the default uppercase letters or hold down CONTROL to insert the non-base letter 'N'. The edit is stopped when you release the mouse button.

Numeric Datasets
To edit a numeric track just press and hold down the mouse button inside the track and move the mouse to draw the new contents. Release the mouse button to finish editing. Moving the mouse outside the track (above or below) while drawing will normally insert the current maximum or minimum value of the track at that position, but you can increase the current numerical range by holding down the SHIFT button while drawing outside the track.

Region Datasets
To add a new region to a region datatrack, press the mouse button on either end of where you want the new region to be and drag the mouse to define the span of the region. If you hold down the CONTROL key when releasing the mouse button, a dialog will appear immediately afterwards to allow you to specify additional properties of the region, such as type, strand orientation and score. You can also edit these properties afterwards by double-clicking on any region with the Selection Tool. (In MotifLab version 1 you must hold down the CONTROL key while double-clicking).
To remove a single region, point at it with the Selection Tool and press the DELETE key on the keyboard, or right-click and select "Delete Region" from the context menu.

Data Browsers

Motif Browser

Documentation in preparation

The Motif Browser tool is also presented in Video Tutorial #3 (part 1).

Module Browser

Documentation in preparation

Sequence Browser

Documentation in preparation

Interactive Analysis Tools

Positional Distribution Viewer

The Positional Distribution Viewer tool can be used to visualize the relative placement of regions across multiple sequences with histograms.
Up to six histograms can be overlaid on top of each other to compare different features.

The region track on which to base the histogram is selected with the drop-down menu on top of the dialog. It is also possible to only consider a subset of the sequences by selecting a Sequence Collection in the second drop-down menu. Only regions that are currently visible in the chosen region track (and sequence subset) will be counted in the histogram, so the tool can be used in combination with other tools, such as the Motif Browser, to select subtypes of features to be shown. If the "automatic refresh" option is activated (button at the bottom of the dialog), the histogram will be updated automatically whenever the visibility status of regions is changed. If this option is turned off, users must manually press the "Refresh" button in the dialog to update the histogram.

To add a new histogram, simply press one of the six histogram selection buttons to activate it and then change the settings in the dialog and/or update the visibility of regions in the GUI to calculate a new histogram. The histogram is displayed in the color shown on the corresponding button. Although up to six histograms can be shown at the same time, only one of the histograms — the active histogram — is actually updated in response to changes in region visibility.

Press one of the six histogram selection buttons to activate a histogram and give it "focus". The color of the histogram is reflected on the button. The focused histogram can be updated dynamically to reflect the distribution of the currently visible regions (unfocused ones will not be updated until they are given focus once more). The number on the button of the focused histogram will be shown in white while the unfocused histograms have labels in black. If you press the button of a focused histogram it will be hidden and also lose focus (and the button will no longer be shown in color). Press it once more to show it again. A focused histogram will be updated if the visualization is updated (the number of visible regions potentially changes) or if any of the settings are changed. The Y-scale is normalized so that the height of each histogram bar represents the fractional number of regions falling into that bin relative to the total number of regions in that track.

Sequence alignment mode

In order to derive a histogram, all the sequences are first aligned with each other to find the length of the total sequence span. This span is then divided into the specified number of bins. For example, if the total span is 3000 bp and the number of bins is set to 50, each bin will cover 60 bp. When the alignment mode is Upstream, Downstream or Center, the total span equals the length of the longest sequence. If the alignment mode is TSS (or TES), the length of the total span equals the longest segment upstream of TSS plus the longest segment downstream of TSS (these can belong to two different sequences).
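
The span and bin-width calculation described above can be sketched in a few lines of Python. This is only an illustration, not MotifLab's own code: the function names and the representation of each sequence as a (length, bases upstream of TSS) pair are assumptions made for the example.

```python
def total_span(sequences, mode):
    """Length of the aligned span for a list of sequences.

    sequences: list of (length, upstream) tuples, where 'upstream' is the
    number of bases located upstream of the TSS (hypothetical layout).
    """
    if mode in ("Upstream", "Downstream", "Center"):
        # The span equals the length of the longest sequence.
        return max(length for length, _ in sequences)
    if mode == "TSS":
        # Longest upstream segment plus longest downstream segment;
        # the two maxima may come from different sequences.
        longest_upstream = max(upstream for _, upstream in sequences)
        longest_downstream = max(length - upstream for length, upstream in sequences)
        return longest_upstream + longest_downstream
    raise ValueError("unknown alignment mode: " + mode)


def bin_width(span, number_of_bins):
    """Number of base pairs covered by each histogram bin."""
    return span / number_of_bins
```

With a total span of 3000 bp and 50 bins, bin_width(3000, 50) gives 60, matching the example in the text.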

If all the sequences have the same length (and the same relative placement of TSS or TES) the sequence alignment mode makes no difference. Note that unlike most other settings, the alignment mode cannot be changed without invalidating all histograms (inactive histograms will be deleted). So while it is possible to overlay histograms based on different tracks and sequence subsets, or that have different settings for number of bins, alignment anchor and support, it is not possible to overlay histograms that have different sequence alignments.

Bin assignment anchor

When the length of a region spans several consecutive bins, the anchor setting controls which bin(s) the region is assigned to.
This setting has four available options:

Upstream : The region is assigned to the bin spanning the upstream edge of the region
Downstream : The region is assigned to the bin spanning the downstream edge
Center : The region is assigned to the bin spanning the center position of the region
Span : All bins that fully or partially overlap with the region are incremented
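
As a rough sketch of the four anchor options, the following Python function returns the bin indices that a single region would contribute to. It assumes 0-based, inclusive coordinates within the aligned span, that "start" is the upstream edge of the region, and integer bin widths; the function name is hypothetical.

```python
def bins_for_region(start, end, bin_width, anchor):
    """Return the list of bin indices incremented for one region.

    start, end: 0-based inclusive positions in the aligned span,
    with 'start' assumed to be the upstream edge.
    """
    first_bin = start // bin_width
    last_bin = end // bin_width
    if anchor == "Upstream":
        return [first_bin]
    if anchor == "Downstream":
        return [last_bin]
    if anchor == "Center":
        return [((start + end) // 2) // bin_width]
    if anchor == "Span":
        # Every bin that fully or partially overlaps the region.
        return list(range(first_bin, last_bin + 1))
    raise ValueError("unknown anchor: " + anchor)
```

For a region covering positions 50 to 130 with 60 bp bins, the Upstream, Center and Downstream anchors select bins 0, 1 and 2 respectively, while Span increments all three.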

The figure below illustrates how the same three regions will be assigned to different bins depending on the anchor setting.

Support mode

When the support option is enabled, the histogram will be based on the sequence support for each bin, i.e. whether or not a sequence has regions that will be assigned to that bin or not. This amounts to merging overlapping regions in a sequence before counting, so each sequence is only counted once for each bin no matter how many regions overlap that bin.
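
The difference between counting regions and counting sequence support can be illustrated with a small Python sketch. The data layout (a dictionary mapping sequence names to lists of region coordinates) and the use of the "Span" anchor are assumptions for this example only.

```python
def support_histogram(regions_by_sequence, bin_width, number_of_bins):
    """Count, for each bin, how many sequences have at least one region in it.

    regions_by_sequence: dict mapping a sequence name to a list of
    (start, end) tuples in aligned, 0-based inclusive coordinates.
    """
    counts = [0] * number_of_bins
    for regions in regions_by_sequence.values():
        bins_hit = set()  # a set ensures each bin is counted once per sequence
        for start, end in regions:
            bins_hit.update(range(start // bin_width, end // bin_width + 1))
        for b in bins_hit:
            if 0 <= b < number_of_bins:
                counts[b] += 1
    return counts
```

Two overlapping regions in the same sequence that fall in the same bin therefore add one to that bin rather than two.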

The Positional Distribution Viewer is also presented in Video Tutorial #3 (part 1).

Region Visualization Filters

MotifLab has a very sophisticated sequence and track visualization system, and one of its major strengths is its ability to dynamically highlight regions of interest either with the use of different colors or by hiding uninteresting regions altogether. An important role in this system is played by Region Visualization Filters that can inspect all the regions in a track and dynamically propose new colors for each individual region (overriding the default colors based on region type). Filters can also decide whether a particular region should be drawn at all. MotifLab keeps a list of all active visualization filters and new filters are added to the end of this list when they are activated. When deciding which color to use for drawing a region, MotifLab queries each filter in order and will use the first non-default color proposed by a filter. When deciding whether to actually draw a region or not, all filters must agree that the region should be visible. If at least one filter insists that the region should be hidden it will not be drawn. Note that general region visibility based on region type is determined before any filters are processed and thus takes precedence. Hence, if you have selected in the Motifs Panel that motif "M00023" should be hidden, it will not be drawn in a track even if all the active filters say that it should be. (So filters can hide a region that is currently visible but not show a region that is hidden.)
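
The color and visibility rules described above can be summarized with a short Python sketch. The RegionFilter class and both function names are hypothetical and do not correspond to MotifLab's actual plugin interface; a filter that has no opinion on a region's color is modeled here as returning None.

```python
class RegionFilter:
    """Hypothetical filter: proposes a color and votes on visibility."""

    def __init__(self, color_fn=None, visible_fn=None):
        self.color_fn = color_fn or (lambda region: None)      # None = no proposal
        self.visible_fn = visible_fn or (lambda region: True)  # visible by default


def resolve_color(region, filters, default_color):
    """Use the first non-default color proposed by a filter, in list order."""
    for f in filters:
        proposed = f.color_fn(region)
        if proposed is not None and proposed != default_color:
            return proposed
    return default_color


def should_draw(region, filters, type_is_visible):
    """Type visibility takes precedence; then every filter must agree."""
    if not type_is_visible:
        return False
    return all(f.visible_fn(region) for f in filters)
```

Because the first non-default proposal wins, the order in which filters were activated matters for coloring, whereas for visibility a single dissenting filter is enough to hide a region regardless of order.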

MotifLab comes bundled with two interactive tools that utilize the region visualization filtering functionality (Motif Score Filter and Interactions Viewer), but additional filtering tools are available as plugins.

Motif Score Filter / Region Score Filter

The Region Score Filter tool can be used to highlight regions in a track that score above (or below) a dynamically selected cutoff value.

The filter can only be applied to one region track at a time and the target track is selected from the drop-down menu in the upper-left corner of the tool dialog as shown above. The central component in this dialog is the slider that is used to set the score cutoff value. The actual value of the cutoff is displayed in front of this slider. All regions in the target track whose score satisfies the condition set forth by the comparison operator (button behind the slider) are classified as matching regions and the rest are classified as non-matching. Pressing the "Condition operator" button will toggle between the two conditions "above or equal to" ( >= ) and "below or equal to" ( <= ). The "Options" button brings up a menu where you can select how to visualize matching and non-matching regions respectively. The three available options are:

Show matching regions as normal but hide all non-matching regions
Show matching regions as normal but display all non-matching regions in a light gray color
Show matching regions in one color (green) and non-matching regions in a different color (red)

The colors used for matching and non-matching regions with the third option can be changed by clicking on the colored icons in the upper-right corner of the dialog. (They can also be changed through the following display settings : "system.filter.green", "system.filter.red" and "system.filter.lightGray").

Unless otherwise specified, the "score" of each region will simply be based on its regular score property, but it is possible to select a different property from the drop-down menu in the lower-left corner, for instance region length. It is also possible to base the score on a comparison with a numeric track which is selected with the second drop-down menu on the bottom (this menu will not be shown if the selected score property is "score" or "length"). The following score properties are supported:

Score : The normal score property of the region will be used
Length : The score will be the length of the region
Minimum value : The score will be based on the smallest value of the numeric track within the sequence segment covered by the region
Maximum value : The score will be based on the largest value of the numeric track within the sequence segment covered by the region
Average value : The score will be based on the average value of the numeric track within the sequence segment covered by the region
Median value : The score will be based on the median value of the numeric track within the sequence segment covered by the region
Sum value : The score will be based on the sum of values over all positions in the numeric track covered by the region
Center value : The score will be based on the value of the numeric track at the position in the middle of the region
Start value : The score will be based on the value of the numeric track at the first position in the region (direct strand)
End value : The score will be based on the value of the numeric track at the last position in the region (direct strand)
Relative start value : The score will be based on the numeric track value at the first position in the region (relative to the orientation of the sequence)
Relative end value : The score will be based on the numeric track value at the last position in the region (relative to the orientation of the sequence)
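
To make the score properties concrete, the sketch below derives a score for one region and compares it against a cutoff. It is an illustration under several assumptions: regions are dictionaries with 0-based inclusive "start" and "end" positions, the numeric track is a plain list indexed by position, the "center" of an even-length region is taken as the left of the two middle positions, and all names are hypothetical.

```python
from statistics import mean, median


def region_score(region, prop, track=None, sequence_strand=1):
    """Compute the value that is compared against the cutoff for one region."""
    start, end = region["start"], region["end"]
    if prop == "Score":
        return region["score"]
    if prop == "Length":
        return end - start + 1
    values = track[start:end + 1]  # track values covered by the region
    # On a reverse-strand sequence the relative start is the last direct position.
    relative = values[::-1] if sequence_strand == -1 else values
    summaries = {
        "Minimum value": min(values),
        "Maximum value": max(values),
        "Average value": mean(values),
        "Median value": median(values),
        "Sum value": sum(values),
        "Center value": values[(len(values) - 1) // 2],
        "Start value": values[0],
        "End value": values[-1],
        "Relative start value": relative[0],
        "Relative end value": relative[-1],
    }
    return summaries[prop]


def is_matching(score, cutoff, operator=">="):
    """Apply the condition operator selected in the dialog."""
    return score >= cutoff if operator == ">=" else score <= cutoff
```

Regions for which is_matching returns True would then be shown as normal (or in the "matching" color), and the rest hidden, grayed out or shown in the "non-matching" color, depending on the selected option.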

The Region Score Filter tool is also presented in Video Tutorial #3 (part 2).

Note: This tool was originally called "Motif Score Filter" in version 1.0 of MotifLab and could then only be used with motif tracks. In version 2.0 the tool was updated so that it could be applied to any region track and the name of the tool was consequently changed to "Region Score Filter".

Interactions Viewer

The "futility theorem" proposed by Wasserman and Sandelin (2004, "Applied bioinformatics for the identification of regulatory elements", Nat Rev Genet, 5:276-287) states that the majority of TF binding sites predicted by motif scanning procedures are likely to be false positives. They are just sites where the DNA sequence happens to bear similarity to some known binding motif, but this motif similarity alone is not enough to make it a functional binding site that plays a biological role. However, if you discover multiple binding motifs in close proximity, and these motifs are associated with transcription factors that are known to physically interact with each other, the likelihood that all of the sites in the cluster are functional will increase.

The Interactions Viewer is a tool that can highlight such clusters of binding sites for TFs that are known to interact. It has two distinct modes of operation: "Single site" and "Motif types".

Single site mode
In this mode, the user selects a single region in a motif track by clicking on it, and the tool will then highlight other regions nearby associated with motifs for transcription factors that are known to interact with the TF for the region that the user selected, based on the annotated interactions property of the motifs. The region the user selected will be colored black and all other regions – within a specified distance – that can interact with the black region will be colored red. Non-interacting regions will either be hidden or grayed out, depending on the chosen tool settings. It is possible to "cast a bigger net" to capture even more interacting regions by increasing the tool's "level" setting. Level 0 consists of only the region that the user selected (black), while level 1 captures the regions (red color) that can interact with the level 0 region. Each higher level consists of the regions that can potentially interact with any of the regions from the level beneath, so level 2 regions are those that can interact with level 1, level 3 regions are those that can interact with any region from level 2, and so on.

The figure below shows an interaction network with 4 levels. The user has selected the TATA site in the middle (level 0, black). This TATA motif is known to interact with the PAX2, PAX4 and CDXA motifs that surround it (level 1, red), and these motifs in turn can interact with PBX, TBP and EN1 at level 2 (orange). Also shown are motifs at level 3 (yellow) and level 4 (green). The remaining gray regions are not part of the interactions network.

The colors used for the different levels of the interactions network are:

    Level 0: Black
    Level 1: Red
    Level 2: Orange
    Level 3: Yellow
    Level 4: Green
    Level 5: Cyan
    Level 6: Light blue
    Level 7: Dark blue
    Level 8: Violet
    Level 9: Dark gray
    Level 10 (and above): Gray

These colors will only be used if the "Color by interaction level" option is selected. If this option is not selected, the regions will be shown in their original colors (but non-interacting regions will still be either grayed out or hidden).

When creating the interactions network, the tool works outwards from the single region the user selected. For each new level, only regions that are within a certain distance from the regions in the previous level will be considered. The minimum and maximum distances that define the allowed distance range can either be specified with constant numbers, Numeric Variables or Motif Numeric Maps. With the Numeric Map option, it is possible to define individual distance ranges tailored to each motif type. To also consider overlapping regions, the minimum distance must be set to a negative value.
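
The level-by-level construction can be viewed as a breadth-first expansion from the selected region. The Python sketch below is an approximation written for illustration: the region and interaction data structures are hypothetical, distance limits are constant numbers only, and the distance between two regions is assumed to be the number of positions between them (negative when they overlap), which may differ from how MotifLab measures it.

```python
LEVEL_COLORS = ["black", "red", "orange", "yellow", "green", "cyan",
                "light blue", "dark blue", "violet", "dark gray"]


def level_color(level):
    """Color for an interaction level; level 10 and above are gray."""
    return LEVEL_COLORS[level] if level < len(LEVEL_COLORS) else "gray"


def gap(a, b):
    """Positions between two regions (assumed distance measure)."""
    return max(a["start"], b["start"]) - min(a["end"], b["end"]) - 1


def interaction_levels(regions, selected, interacts, min_dist, max_dist, max_level):
    """Map region index -> level, expanding outwards from the selected region.

    interacts: dict mapping a motif name to the set of motif names it is
    annotated to interact with (assumed to list both directions).
    """
    levels = {selected: 0}
    frontier = [selected]
    for level in range(1, max_level + 1):
        next_frontier = []
        for i in frontier:
            partners = interacts.get(regions[i]["motif"], set())
            for j, candidate in enumerate(regions):
                if j in levels or candidate["motif"] not in partners:
                    continue
                if min_dist <= gap(regions[i], candidate) <= max_dist:
                    levels[j] = level
                    next_frontier.append(j)
        frontier = next_frontier
    return levels
```

Regions that never receive a level are the ones that would be grayed out or hidden. Setting min_dist to a negative number lets overlapping regions pass the distance test, in line with the note above.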

Interactively clicking on different regions in the track to see if they could potentially be part of local interaction networks can be exciting, but it can also be tedious if you want to check all the regions in a track. For this reason, it is possible to cycle through all the regions in a track, either manually or automatically. To start cycling, first click on a region in the track to start from, and then click on either the "<" or ">" buttons in the tool dialog to jump to the previous or next region respectively. If you click the "(Cycle) Start" button, the tool will automatically advance the selected region after a short time delay. You can then sit back and watch an animation of potential interaction networks in the track. To stop the automatic cycling, either click inside the track or on one of the "<" or ">" buttons in the dialog.

Motif types mode
In this mode, the user selects a group of one or more motifs, and binding sites for these motifs will be shown in black (all instances in all sequences). Other motif sites associated with TFs that are known to interact with the TFs from the selected group will be shown in red. All other regions will be hidden. This mode will only show one level of interactions, but it can do so for multiple target motifs and it does not consider distance constraints.

The Interactions Viewer is also presented in Video Tutorial #3 (part 2).

Sequence Tools

Sort Sequences

The Sort Sequences tool can be used to reorder the sequences in the Visualization Panel with respect to a chosen sort criterion.

Sort property	Effect
Sequence name	Sequences are sorted in natural order according to their names
Sequence length	Sequences are sorted by their lengths
Location	Sequences are sorted first by chromosome and then by position within the chromosome
(Visible) Region count^†	Sequences are sorted according to the number of regions within each sequence with respect to a selected Region Dataset
(Visible) Region coverage^†	Sequences are sorted according to the number of bases covered by regions within each sequence with respect to a selected Region Dataset
(Visible) Region scores sum^†	Sequences are sorted according to the sum of region scores over all regions within each sequence with respect to a selected Region Dataset
Numeric map	Sequences are sorted according to their values in a selected Sequence Numeric Map
Numeric track sum	Sequences are sorted according to the sum of values within each sequence with respect to a selected Numeric Dataset
GC-content	Sequences are sorted according to their GC-content with respect to a selected DNA Sequence Dataset
Mark	This will place all marked sequences before unmarked ones when sorting in descending order

^† These sort modes can optionally consider all regions within a track or only those regions that are currently visible in the Visualization Panel.

The sorting algorithm is stable, so if you first sort by a secondary property (e.g. sequence name) and then by a primary property (e.g. Numeric Map value) the sequences that have the same primary property value (map value) will be sorted internally by the secondary property (name).
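
The effect of a stable sort can be demonstrated with Python's built-in sorted function, which is also stable. The sequence names and map values below are made up for the example.

```python
sequences = [
    {"name": "geneC", "value": 2.0},
    {"name": "geneA", "value": 2.0},
    {"name": "geneD", "value": 1.0},
    {"name": "geneB", "value": 1.0},
]

# First sort by the secondary property (name) ...
by_name = sorted(sequences, key=lambda s: s["name"])
# ... then by the primary property (map value). Ties keep their name order.
by_value = sorted(by_name, key=lambda s: s["value"])

ordered_names = [s["name"] for s in by_value]
```

Here ordered_names ends up as geneB, geneD, geneA, geneC: the sequences are ordered by value, and those sharing the same value remain in alphabetical order from the first sort.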

Group by Sequence Partition

This option will group sequences together into clusters according to a selected Sequence Partition. The sequences are first sorted by the name of their cluster and within each cluster the sequences are sorted by the chosen sort property.

Note that sorting options are also available from the context menu when right-clicking on a dataset in the Features Panel, and it is possible to sort sequences within a protocol script using the "sort(mode)=asc|desc" display setting command.

Crop Sequences

The Crop Sequences tool can be used to make sequences shorter by removing a number of bases from one or both ends of a sequence.
All existing Feature Datasets will also be updated to conform to the new length.

Cropping can be performed in two different ways:

Removing a specific number of bases

This mode allows you to specify the exact number of bases to remove from the start and the end of the sequences respectively. Note that which side of the sequence is considered the start or end depends on the strand orientation of each sequence but also the selected orientation mode setting. With the "relative orientation" setting, the start of the sequence is the upstream end when oriented according to the origin strand of the sequence. With the "direct orientation" setting, the start will always be the side with the smallest genomic coordinate and the end the side with the largest genomic coordinate.
It is possible to crop a different number of bases from each individual sequence by specifying the number for each sequence in a Sequence Numeric Map and then selecting this map as the argument in the dialog rather than entering a constant number.

Cropping to regions

Provided with a region track, the sequences can be cropped so that the new start of the sequence corresponds with the start of the first region in the selected track and the new end of the sequence corresponds with the end of the last region. In other words, each sequence will be cropped so that it covers all regions present in the track but without additional flanking positions outside. Sequences that contain no regions at all will be left untouched rather than cropping them to 0 bp.
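
The two cropping modes can be sketched as follows. This is an illustrative Python example rather than MotifLab's implementation: it assumes inclusive genomic coordinates, a strand value of +1 or -1, regions given as (start, end) pairs in genomic coordinates, and hypothetical function names.

```python
def crop_by_count(start, end, strand, crop_start, crop_end, orientation="relative"):
    """Remove a number of bases from each end and return the new (start, end)."""
    if orientation == "relative" and strand == -1:
        # On the reverse strand the upstream "start" of the sequence is the
        # side with the largest genomic coordinate, so the amounts swap sides.
        crop_start, crop_end = crop_end, crop_start
    return start + crop_start, end - crop_end


def crop_to_regions(start, end, regions):
    """Crop a sequence so that it just covers all regions in the track."""
    if not regions:
        return start, end  # sequences without regions are left untouched
    new_start = min(region_start for region_start, _ in regions)
    new_end = max(region_end for _, region_end in regions)
    return new_start, new_end
```

For a reverse-strand sequence spanning 1000 to 1999, removing 100 bases from the start and 50 from the end gives 1050 to 1899 with relative orientation, but 1100 to 1949 with direct orientation.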

Cropping sequences can also be performed with the operation: crop_sequences.

Extend Sequences

The Extend Sequences tool can be used to make sequences longer by adding new base positions to one or both sides of a sequence.

The tool takes two numeric arguments specifying the number of bases to add to the start of each sequence and the end of each sequence respectively. Note that which side of the sequence is considered the start or end depends on the strand orientation of each sequence but also the selected orientation mode setting. With the "relative orientation" setting, the start of the sequence is the upstream end when oriented according to the origin strand of the sequence. With the "direct orientation" setting, the start will always be the side with the smallest genomic coordinate and the end the side with the largest genomic coordinate.
It is possible to add a different number of bases to each individual sequence by specifying the number for each sequence in a Sequence Numeric Map and then selecting this map as the argument in the dialog rather than entering a constant number.

MotifLab is not able to extend existing Feature Datasets associated with the sequences (since it does not necessarily know the values for these datasets outside the current range and by design refuses to fill in with blanks). Because of this, sequences can only be extended as long as no Feature Datasets are present.

Extending sequences can also be performed with the operation: extend_sequences.

Other Tools

Update Motif Properties

Documentation in preparation

Configuring MotifLab

General options

The general configuration options for MotifLab can be edited in the Options dialog, which can be accessed under "Options..." in the "Configure" menu in the main menu bar. The dialog organizes the options under different tabs.

General

Concurrent Computational Threads

The number of concurrent computational threads can be increased to allow MotifLab to take advantage of parallel processing on computers that have multiple cores (which most computers have these days). Note that this functionality is not fully utilized in the current version of MotifLab (but is used by e.g. the SimpleScanner motif scanning program).

Maximum Concurrent Downloads

This setting specifies how many concurrent download requests MotifLab can make to the same server. If this value is set to 1, MotifLab will always wait for any requested file to be completely downloaded before making a new request to the same server. If a value X higher than 1 is specified, MotifLab will have a pool of X connections open simultaneously. Each connection can make a new request to the server as soon as the previous file requested on that connection has been completely downloaded. Allowing more concurrent downloads will normally result in faster download times, but it will also put more strain on the servers (which could potentially result in users being banned from connecting to particular servers).

Network Timeout

The network timeout setting specifies the amount of time (in milliseconds) that a server contacted by MotifLab has to respond before a "network timeout" error will be reported.

Maximum Sequence Length

MotifLab has primarily been designed to perform operations and analyses on multiple, short sequence segments rather than very long (e.g. genome-wide) sequences. The Maximum Sequence Length setting can be used to safeguard against accidentally specifying overly long sequences in the Sequence Dialog (which could for instance happen if a user types a digit or two too many for the end coordinate of a sequence compared to the start coordinate), as this could result in the system being bogged down while attempting to download an excessive amount of data for this sequence region.

TSS at position

In the bioinformatics community it is common to refer to the first base in a gene sequence (the TSS) as position "1" and the second base as "2" (and so on), while the first position upstream of the gene is referred to as "-1". A number line with gene-relative coordinates thus goes directly from the negative numbers (for positions upstream of the gene) to the positive numbers (for positions inside the gene), hence skipping the zero-position. The TSS at position setting can be used to specify whether this particular convention of going directly from -1 to +1 should be followed (by selecting "TSS at position +1") or whether the zero-position should be included (by selecting "TSS at position +0") as with a regular number line.
Note that this setting is used by the Sequence Visualizer (and its ruler and tooltips) but is not necessarily respected by other parts of the system, such as Data Formats that output Region Datasets using TSS-relative coordinates. However, these formats usually have their own parameters that can be used to specify if the +0 position should be skipped.
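
The difference between the two conventions amounts to a one-position shift for the TSS and everything downstream of it. A minimal Python sketch, assuming genomic positions and a strand value of +1 or -1 (the function name is hypothetical):

```python
def tss_relative(position, tss, strand=1, skip_zero=True):
    """Convert a genomic position to a TSS-relative coordinate.

    skip_zero=True corresponds to "TSS at position +1",
    skip_zero=False to "TSS at position +0".
    """
    offset = (position - tss) * strand  # negative upstream, zero at the TSS
    if skip_zero and offset >= 0:
        return offset + 1  # go directly from -1 to +1
    return offset
```

With the TSS at genomic position 100 on the direct strand, positions 99, 100 and 101 map to -1, +1 and +2 when the zero-position is skipped, and to -1, 0 and +1 otherwise.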

Autocorrect Sequence Names

This option was introduced in MotifLab version 2.0. MotifLab requires the names of sequences to only consist of letters, numbers and underscores. However, some sequence identifiers, for instance in yeast, can contain other characters as well (hyphens in particular). If the "Autocorrect sequence names" option is selected, MotifLab will automatically convert illegal sequence names to legal sequence names (usually by replacing illegal characters with underscores) whenever data is read from files.
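
The usual correction, replacing each illegal character with an underscore, can be sketched with a regular expression. This is an approximation for illustration: it assumes "letters" means the ASCII letters A-Z and a-z, and it does not cover any special cases that MotifLab may handle differently.

```python
import re


def autocorrect_sequence_name(name):
    """Replace every character that is not a letter, digit or underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)
```

A yeast identifier such as "YAL001C-A" would become "YAL001C_A", while names that are already legal are returned unchanged.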

Ask Before Discarding Data

If this option is selected, MotifLab will display a popup dialog whenever one of the "Clear Data" functions is selected from the "Data" menu. The dialog will ask the user to confirm that they really would like to delete the data objects and allow them the chance to change their mind. Also, if a user closes a protocol in the protocol editor or closes an output panel containing content that has not been saved, MotifLab will ask the user if they would like to save the document before closing it. Note that data that is deleted with the "delete" operation or by selecting a data object and pressing the DELETE key will not be affected by this setting.

Save Session On Exit

This option was introduced in MotifLab version 2.0 and can be set to either "Always", "Ask" or "Never". When set to "Always", MotifLab will always save the current session (to an internal file) when the program is exited and restore this session automatically the next time MotifLab is started. If set to "Ask", MotifLab will display a popup dialog when the program is about to exit which allows the user to choose whether or not to save the current session (and restore it next time). Setting this option to "Never" will disable the auto-save/restore functionality.

Visualization

The "Visualization" tab contains options for configuring how sequences and feature data tracks are displayed in the main visualization panel.

Sequence Window Size

This option will set the width of the sequence windows (displayed tracks), and this can be useful to adjust if you have a computer screen which is either smaller or larger than assumed by the default setup.

Sequence Label Size

This option will set the width of the sequence labels displayed in front of the sequence windows (data tracks). This width can either be set to a fixed size (in pixels) or to a size which is determined by the system based on the size of the label for the sequence with the longest name. The latter option is enabled by checking the scale to fit box (recommended). If a fixed label size (not "scaled to fit") is used and the length of a sequence label is larger than the specified size, the label will be drawn on top of the sequence window (thus obscuring the data tracks displayed underneath).

Antialias text and Antialias motif logos

These two options can be used to turn on or off anti-aliasing on motif logos and text displayed in the sequence window (such as sequence labels, numbers and labels in the ruler and coordinates in the info-panels). Enabling anti-aliasing allows fonts to be rendered more smoothly, which makes the graphics both more aesthetically appealing and easier to read (especially for small font sizes). However, on older computer systems anti-aliasing can carry a performance penalty, which is why these two options were included to turn it off.

Background color

This setting can be used to change the color of the background in the visualization window. The button named "Color" will be displayed in the currently chosen background color, and clicking this button will display a pop-up dialog which allows the user to select a different background color. Clicking the "Reset" button will revert the background color to the default setting (which will be a gray color).

Cache

The "Cache" options tab allows users to turn on or off caching functionality and also clear all the contents of the caches.
Obtaining feature data from external servers could potentially take a long time, which is why MotifLab has options to locally cache data that has been downloaded. Whenever a user requests to obtain data for a certain feature, MotifLab will first check if all or part of the requested data is already available from the local cache, and it will only make connections to external servers in order to obtain data that cannot be found in the cache. Also, when users rely on different types of gene identifiers to define which sequence regions to work on, MotifLab will have to contact an external service (usually BioMart) to resolve these gene IDs and determine the genomic coordinates of these sequences. The mapping between gene identifiers and gene locations can also be cached so that this information is readily available for sequences that have been analyzed before. When caching is turned off, no new information will be stored in the cache and MotifLab will not make use of any data that might be present in the cache from before. Turning off caching will not destroy any data presently in the cache, however, so reenabling caching will give access to all data that was previously cached.

Protocol Editor

The "Protocol Editor" options panel can be used to set the colors used for coloring keywords in protocols. The panel contains several colored buttons, including e.g. "Operations", "Data objects", "Data Formats" and "Numbers", that refer to different classes of keywords. The color of each button reflects the color currently used for keywords of that class. To change the color for a particular class, simply press the corresponding button and select a new color from a pop-up dialog.

HTML

The "HTML" panel contains a few options relating to output files in HTML format produced by MotifLab's output operation. HTML documents rely on Cascading Style Sheets (CSS) to define a style for the document (affecting e.g. fonts and colors) and also use JavaScript to enable certain interactive functionality, such as sorting tables by clicking on a column header. The HTML options specify how the required style information and JavaScript code should be made available to the HTML documents.

New File	For each HTML document in MotifLab that is saved to file, new separate file(s) will be created to hold the style information (or JavaScript code) and these files will be referenced by name from within the HTML-file. A new CSS-style (or JavaScript) file is saved to the same directory as the HTML-file and the name of the file will be the same as the HTML-file except that the suffix is changed to ".css" (or ".js") rather than ".html". Using a new style- or javascript-file for each HTML document allows different documents in the same directory to have different styles and functionalities.
Shared File	When this option is selected, the style sheet or JavaScript code is output to a file with a fixed name ("motiflab_style.css" and "motiflab_script.js"), and all HTML documents that are created will contain references to these files (which are assumed to reside in the same directory). When a new HTML-file is saved to a directory which does not contain these files, they will be created. However, subsequent HTML documents saved in the same directory will simply rely on the same files. This means that you can easily change the style or JavaScript functionality for all the HTML-files residing in the same directory simply by editing or replacing the "motiflab_style.css" and "motiflab_script.js" files.
Embed	For each HTML document created, the code for the CSS style or JavaScript required will be included in the HTML document itself. This means that the HTML documents will be self-contained since they do not rely on other external files (at least if both CSS and JavaScript are embedded).
Link	When this option is selected, HTML documents will link to stylesheets and JavaScript files residing on the MotifLab web server. This means that no new stylesheet or JavaScript files are created locally when HTML documents are saved to file, but access to the MotifLab web server would be required in order to display these HTML files properly in a web browser (they can be displayed with default style and no JavaScript functionality, however). Note that linking does not work for CSS style sheets prior to version 2.0 of MotifLab due to a bug.
None	This setting specifies that no CSS style information (or JavaScript) should be associated with the HTML document. This would mean that a default style should be used or that functionality requiring JavaScript should be disabled.

The Stylesheet option is used to select which CSS stylesheet to use for HTML documents. Users can choose between a few predefined styles installed with MotifLab or select a homemade CSS-file. To use a predefined style, the name of the style should be typed in brackets in the stylesheet box (e.g. "[green]"). As of MotifLab v2.0, only two predefined styles are available: default and green.

Configuring external programs

XML configuration files for external programs

In order to use external programs within MotifLab, their interfaces must be explained to MotifLab through special configuration files written in an XML-format which is explained below. If you want to check out more examples you can have a look at the configuration files for various supported programs available from the external programs page.

Example:
The box below shows the XML-code required to configure a simple program called "randomfilter.exe" which takes three arguments: the name of a GFF-formatted input-file, a number between 0 and 1.0, and the name of an output-file. The program would read the GFF-file line by line and, with the given probability, write each line to the new output-file. The command to execute this program from a CLI-shell would be "randomfilter.exe -i <inputfile> -p <probability> -o <outputfile>".

  <?xml version="1.0" encoding="UTF-8"?>

  <program name="RandomFilter" class="Filter">

	 <service type="local" location="C:\bioinformatics\randomfilter.exe" />

	 <parameter type="regular" name="Region Track" class="RegionDataset" required="yes">   
	   <dataformat name="GFF" />
	   <argument type="valued option" switch="-i"/>
	 </parameter>

	 <parameter type="regular" name="Probability" class="Double" required="yes">
	   <min>0</min>
	   <max>1.0</max>
	   <default>0.5</default>
	   <argument type="valued option" switch="-p"/>
	 </parameter>

	 <parameter type="result" name="Result" class="RegionDataset" required="yes">
	   <dataformat name="GFF" />
	   <argument type="valued option" switch="-o"/>
	 </parameter>

  </program>

After the compulsory XML-header in the first line follows a <program> element which contains the actual description of the program and its interface.
The <program> element has two arguments: a name and a class. The name argument is just a name selected to refer to the program.
The class argument tells MotifLab what kind of program this is. Five special classes are recognized, and these can also place specific requirements on the configuration file. These are "MotifDiscovery", "MotifScanning", "ModuleDiscovery", "ModuleScanning" and "EnsemblePrediction". Programs from these five classes are executed with corresponding operations in MotifLab; e.g. "MotifScanning" programs are executed with the "motifScanning" operation, etc. For programs that do not fall within the special classes, the class argument is merely descriptive and can be set to any value. For example, since the "randomfilter" program above is used to filter data, it is given the arbitrary value "Filter" for the class argument. Programs that are not one of the special classes can be run with the "execute" operation.

A third and optional argument to <program> is cygwin which can take on the values "yes" or "no" (default is "no"). This argument can be used to signal that the program is originally a UNIX/LINUX program and needs Cygwin to be installed in order to run under WINDOWS operating systems. If cygwin is set to "yes" some filepaths might be converted to UNIX-style as necessary.
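
As a small illustration, a UNIX-style build of the "RandomFilter" program that needs Cygwin on WINDOWS could be declared as sketched below. Only the cygwin argument is new compared to the example above; leaving out the location in the <service> element is just a choice made for this sketch (the user is then asked for the location when the file is installed).

   <program name="RandomFilter" class="Filter" cygwin="yes">
      <service type="local" />
      <!-- parameters as in the RandomFilter example above -->
   </program>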

The <program> element further contains other elements that describe various properties of the program, including information about where the program is located, how to execute it and descriptions of the input and output parameters of the program.

Program properties

The <program> element can contain an optional <properties> element which describes various properties of the program, including names of the authors, a short description of the program itself, contact information, websites and citations. These properties are displayed in the HELP-page for the program (which is shown for instance when the user double-clicks on a program in the External Programs Dialog), and they are mostly useful if one wants to share an XML-configuration file with other users that are not familiar with the program. The <properties> element can also contain a <license> element with a license agreement that the user must accept in order to use the program and a <register> element containing a web address where the user can be directed in order to register their use of the program. HTML-code can be used in the text of these elements as long as the angle brackets used around HTML-elements are escaped (for example, to use italics, "<i>" must be escaped as "&lt;i&gt;").

  <properties>
      <author>Timothy L. Bailey and Charles Elkan</author>
      <citation>
      	Timothy L. Bailey and Charles Elkan (1994)
      	"Fitting a mixture model by expectation maximization to discover motifs in biopolymers",
      	&lt;i&gt;Proc 2nd Int Conf on Intelligent Systems for Molecular Biology&lt;/i&gt;,
      	(28-36), AAAI Press, 1994
      </citation>
      <contact>donotreply@somewhere.org</contact>
      <homepage>http://meme.sdsc.edu</homepage>
      <description>
           MEME searches for novel motifs in DNA (and protein) sequences
           using an expectation maximization strategy
      </description>   
      <register>http://www.server.org/software/register.cgi</register>
      <license>
           In order to use this program you must agree not to use it for commercial purposes
      </license>  
  </properties>

Service type and location

The <service> element describes the program's location and how it should be accessed. The current version of MotifLab only supports use of programs that are installed locally on the user's computer (type="local"), but future versions might also support the use of web services. (The special setting type="bundled" is used for programs that come shipped with the installation of MotifLab.) If the location of the executable program is known, it can be provided as an argument to the <service> element, as seen in the example for the "RandomFilter" program at the top of this page. If the location of the program is not stated in the XML-file, the user must specify the location when the XML-file is installed in MotifLab. If a precompiled executable of the program can be obtained from an external source such as a web server, the location of this source can be provided inside the <service> element using optional <source> elements. The version and os arguments just provide a description for the program source, but the url argument must point to a single file that can be downloaded and "installed" locally by MotifLab. The downloaded file must be executable and usable "as is" since MotifLab is not capable of performing any special installation steps that the program might require. The only processing MotifLab can do is to unzip a program contained within a ZIP file. In this case the argument compression="ZIP" must be set (as shown for the second source below) and the location of the executable file within the ZIP archive must be specified with the targetInZIP argument.

Version 2.0 of MotifLab introduced the <require> element which can be used to inform the user that this program or configuration file has certain system requirements, for instance that a certain version of JAVA must be installed or that this particular configuration file is only meant to be used with version X of the program in question. These requirements will be shown to the user when the program is configured in MotifLab. A special requirement is "MotifLab version X" which says that this configuration file will only work with a certain version of MotifLab (or more recent versions) since the configuration relies on functionality that is not present in earlier versions. If such a requirement is specified, a user will not be able to configure the program unless the required MotifLab version is used.

   <service type="local">
     <source version="3.1" os="Windows"          
             url="http://homes.esat.kuleuven.be/~thijs/download/windows/MotifScanner.exe" />
     <source version="3.1" os="Windows (mirror)" 
             url="http://tare.medisin.ntnu.no/priorseditor/tools/windows/MotifScanner.zip"   
             compression="ZIP" targetInZIP="bin/MotifScanner.exe" />
     <source version="3.2" os="Linux"            
             url="http://homes.esat.kuleuven.be/~thijs/download/linux_3.2/MotifScanner" />
     <source version="3.2" os="Linux x86-64"
             url="http://homes.esat.kuleuven.be/~thijs/download/linux_x86-64/MotifScanner" />
     <source version="3.2" os="Mac OS X"
             url="http://homes.esat.kuleuven.be/~thijs/download/macosx_ppc/MotifScanner" />
     <require>MotifLab version 2.0</require>
     <require>Java version 1.7</require>
   </service>

Describing the program's interface

The description of the program's command-line interface mostly consists of a list of <parameter> elements, each describing an input or output parameter of the program.

   <parameter type="regular" name="Positional priors" class="NumericDataset"
              required="no" hidden="no">   
       <description>
          A positional priors track (Note: sum of priors for all positions must not exceed 1.0!) 
       </description>   
       <argument type="valued option" switch="-psp"/>
       <dataformat name="PSP">
           <setting name="Orientation" class="String">Direct</setting>
           <setting name="Motif width" class="Integer">8</setting>
       </dataformat>
   </parameter>

Each parameter has a type argument which can be either "source", "result" or "regular". Source parameters refer to existing data objects that are passed on to the external program for processing and result parameters refer to results output by the external program that are read back and converted into new data objects by MotifLab. The five special classes of external programs (motif/module discovery/scanning and ensemble programs) have specific requirements on the number and roles of source and result parameters. For example, motif scanning programs must have exactly one source parameter representing the DNA Sequence track and one result parameter (which must be called "Result") referring to the Region Dataset (motif track) returned by the motif scanning program. Motif discovery programs on the other hand must have two result parameters, which must be called "Result" and "Motifs" respectively, which refer to the motif track and motif collection objects returned by the motif discovery program. Programs can have additional parameter settings besides the input and output parameters which can be used to modify the behaviour of the program. These are then specified as "regular" parameters. Note that "source" parameters are only used by the five special program classes, and other classes should use "regular" parameters also when referring to any data passed on to the external program.
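
The skeleton below is a rough sketch of how the source and result parameters of a motif scanning program might be laid out according to the rules above. The program name, the parameter name "Sequence" and the switches are made up for the illustration, and the regular parameters that a real scanning program would need (for instance the motifs to scan with) are left out.

   <program name="MyScanner" class="MotifScanning">
      <service type="local" />

      <!-- the single source parameter: the DNA Sequence track to scan -->
      <parameter type="source" name="Sequence" class="DNASequenceDataset" required="yes">
         <dataformat name="FASTA" />
         <argument type="valued option" switch="-s" />
      </parameter>

      <!-- the single result parameter, which must be called "Result" -->
      <parameter type="result" name="Result" class="RegionDataset">
         <dataformat name="GFF" />
         <argument type="valued option" switch="-o" />
      </parameter>
   </program>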

In addition to a type, a parameter must also have a name argument, which is used to refer to the parameter and is also the name displayed in GUI dialogs. Finally, a parameter must have a class argument which specifies the type of data the parameter holds. The class argument can refer to one of the four "basic types" String, Integer, Double and Boolean (for backwards compatibility Double can also be referred to as Float) or to one of MotifLab's own data types. Such data types must then be written without spaces and in camel case (where each "word" begins with a capital letter), such as for instance RegionDataset, MotifCollection and SequenceNumericMap. For Numeric Maps it is also possible to append a plus-sign to the class name (e.g. "MotifNumericMap+"). This is then taken to mean that a Numeric Variable or literal numeric constant can be chosen by the user instead of a Map when selecting a value for the parameter.

Parameters can have additional optional arguments such as: required, advanced, hidden and skipIfDefault which can be set to either "yes" or "no". Required parameters must be assigned values, and MotifLab will not allow a user to execute a program before he or she has chosen values for all required parameters (non-required parameters can be left blank and rely on defaulting values). Advanced parameters will not be shown in the GUI unless the user explicitly selects to display them by pressing a "+" button. If a program has many parameters, this option can be used to show only the most important parameters and hide the less frequently used parameters (which usually rely on defaulting values anyway) in order to make the visual presentation of the program's settings more tidy. Although not required, it is recommended that all advanced parameters be listed after the non-advanced parameters. Hidden parameters do not show up in GUI dialogs at all, and the user can not change the value of a hidden parameter directly. Hidden parameters can, however, be used to pass default settings to programs and they can also be indirectly updated in a preconfigured way in response to user selections. Arguments that have the skipIfDefault setting on will not be included on the command line if the parameter has the default value (which can be no value for non-required parameters). Unless these optional arguments are specified their default settings will be required=yes, hidden=no, advanced=no, skipIfDefault=yes.
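
To illustrate how these optional arguments can be combined, the sketch below declares one advanced parameter and one hidden parameter. The parameter names, switches and default values are invented for the example and do not belong to any particular program.

   <!-- only shown after the user presses the "+" button;
        left off the command line while it still has its default value -->
   <parameter type="regular" name="Max iterations" class="Integer"
              required="no" advanced="yes" skipIfDefault="yes">
      <default>100</default>
      <argument type="valued option" switch="-n" />
   </parameter>

   <!-- never shown in the GUI; always passes the fixed default to the program -->
   <parameter type="regular" name="Output format" class="String"
              required="no" hidden="yes" skipIfDefault="no">
      <default>gff</default>
      <argument type="valued option" switch="-f" />
   </parameter>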

<parameter> elements can contain other elements, for instance an optional <description> of the parameter which can be displayed to the user in a GUI dialog (HTML-code can be used if angle brackets are escaped as explained above).
The <argument> element inside the parameter is required and describes how the parameter is passed to the program. The argument can specify a switch which will precede the parameter on the command line. Programs that rely on switches usually allow the parameters to be listed in any order on the command line since the switches can be used to identify the parameters. On the other hand, for programs that do not rely on switches, the parameters must be listed in a specific order to correctly interpret the command line. The argument element must specify a type which can be either "valued option", "flag", "explicit", "implicit", "STDOUT" or "STDIN". (The "explicit" type was introduced in MotifLab v2.0.)
Valued option parameters are those that pass some kind of value along to the program. Basic values, such as numbers, simple text strings or Booleans will be output directly on the command line. More complex data objects, on the other hand, will be written to temporary files (in specified file formats) and the name of the file will be referenced on the command line instead. The filename will normally just be some random (but unique) name chosen by the system. However, when it is necessary to use a particular filename, the argument type can be set to explicit rather than valued option and the filename can then be explicitly specified (similar to the last "implicit" parameter in the example below). Flag parameters are used for boolean settings. If the option related to a flag-parameter is selected, the parameter's switch will be output to the command line. If the option is not selected, the parameter will not show up on the command line at all. An implicit parameter will be tied to a specific value which is fixed and already known in advance. The value of this parameter will thus not depend on any current settings selected by the user. Implicit parameters can for example be used to refer to an output-file created by the external program when the name of that file is always the same and not chosen by the user. Some programs will read their input data from STDIN rather than a regular file and/or write output to STDOUT instead of a regular file. The special type values "STDIN" and "STDOUT" can be used to signal that a parameter relies on these standard streams rather than regular files. These types can thus be considered as special cases of implicit parameters. Note that a program can only refer to one STDIN and one STDOUT parameter per command element (explained below).
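
The fragment below sketches two of the less common argument types. It assumes that an "explicit" argument takes its filename from a filename argument in the same way as an "implicit" one does; the filename "input.fasta" and the parameter names are made up for the illustration.

   <!-- the sequences are always written to "input.fasta"
        instead of a randomly named temporary file -->
   <parameter type="regular" name="DNA" class="DNASequenceDataset" required="yes">
      <dataformat name="FASTA" />
      <argument type="explicit" filename="input.fasta" />
   </parameter>

   <!-- the program prints its GFF output to STDOUT rather than to a file -->
   <parameter type="result" name="Result" class="RegionDataset">
      <dataformat name="GFF" />
      <argument type="STDOUT" />
   </parameter>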

Example:
The following configuration file is for a program called "scan.exe" which requires a DNA file (in FASTA format) as its first input argument. It is also possible to specify two additional optional arguments, one which specifies a background model (preceded by the "-b" switch) and one which tells the program to scan the reverse strand rather than the direct strand ("-r" switch). The program then outputs its results to a GFF-file called "output.gff" (this name is hardcoded in the program and is not possible to change).
The command to execute this program from a CLI-shell would then be "scan.exe <fastafile> [-b <background>] [-r]"

  <program name="Scan" class="scanning">
         <service type="local" location="C:\bioinformatics\scan.exe" />

         <parameter type="regular" name="DNA" class="DNASequenceDataset" required="yes">   
           <dataformat name="FASTA" />
           <argument type="valued option"/>
         </parameter>

         <parameter type="regular" name="Background" class="BackgroundModel" required="no">
           <dataformat name="PriorityBackground" />
           <argument type="valued option" switch="-b" switchseparator=" " />
         </parameter>

         <parameter type="regular" name="Scan reverse strand" class="Boolean" required="no">
           <argument type="flag" switch="-r"/>
         </parameter>

         <parameter type="result" name="Result" class="RegionDataset">
           <dataformat name="GFF"/>
           <argument type="implicit" filename="output.gff"/>
         </parameter>

  </program>

Note that the configuration file specifies four parameters but the command line only has three parameters. This is because the last "result" parameter which captures the output from the program refers to a file which is implicit rather than being explicitly mentioned on the command line. When the command line to run this program is created, the parameters will be included in the order they are listed in the configuration. Because of this, the parameter referring to the FASTA file, which the program expects to be the first argument on the command line, must also be the first parameter in the configuration (a later section will describe a different way to construct the command line which foregoes this requirement). The first parameter (called "DNA") refers to a DNA Sequence Dataset object selected by the user. Since this parameter has the "valued option" argument-type, the selected data object will be output to a file (in the FASTA-format specified by the <dataformat> element) and the filename will be included on the command line. (If the class of the parameter had been either Integer, Double, Boolean, String or Numeric Variable its value would have been included directly on the command line).
The second parameter ("Background") is not required and will only be included on the command line if the user has explicitly selected a Background Model for this parameter. In this case, the Background Model object will be written to a file in "PriorityBackground" format and the filename will be added to the command line after the parameter's specified switch, which in this case is "-b". The optional switchseparator specifies a string used to separate the switch from the parameter's value (in this case the name of the background file) on the command line. The switchseparator defaults to a single space, but it is also possible to specify other separators, for example a colon or an equals sign (in which case the parameter would appear on the command line as "-b:somefilename.bg" or "-b=somefilename.bg").
The third parameter ("Scan reverse strand") refers to a Boolean setting (these are usually displayed as checkboxes in the GUI). Since the argument-type in this case is set to "flag", the switch specified for this parameter ("-r") will only be added to the command line if the Boolean value is TRUE.
The fourth and final parameter ("Result") is a result-type parameter, which means that MotifLab expects to read some file that has been produced by the external program and use the information therein to create a new data object, which in this case should be a Region Dataset. As specified, the file should be in GFF-format. Also, since the argument-type of this parameter is set to be "implicit" the name of this output file is not referenced on the command line. Rather, the filename ("output.gff") is given directly with the filename argument of the <argument> element.
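
As a small illustration of the switchseparator argument discussed above, the "Background" parameter could be rewritten as sketched below for a program that expects an equals sign between the switch and its value. The filename mentioned in the comment is just a placeholder for the temporary file that MotifLab would create.

   <parameter type="regular" name="Background" class="BackgroundModel" required="no">
      <dataformat name="PriorityBackground" />
      <!-- would appear on the command line as: -b=somefilename.bg -->
      <argument type="valued option" switch="-b" switchseparator="=" />
   </parameter>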

Restricting values of simple parameters

Simple parameters such as Integers, Doubles, Strings and Booleans can be given default settings with a <default> element inside the parameter, as can be seen in the example on top of the page for the second parameter (Probability). For number parameters the allowed range can also be specified by providing <min> and <max> elements (although this is not checked in the current version of MotifLab). String parameters can normally take on any value, but they can also be restricted to a limited set of options:

   <parameter class="String" name="Size" type="regular" >   
     <option>Small</option>
     <option>Medium</option>
     <option>Large</option>
   </parameter>

The options are presented to the user who chooses among the allowed values. The value used for the parameter is normally the text between <option> and </option> (here Small, Medium or Large) but it is also possible to specify that a different value should be used. In the example below, the value "S" is used if the user selects "Small", "M" is used instead of "Medium" and "L" instead of "Large".

   <parameter class="String" name="Size" type="regular" >   
     <option value="S">Small</option>
     <option value="M">Medium</option>
     <option value="L">Large</option>
   </parameter>

Specifying the data format for complex parameters

Complex parameters (not simple numbers, Strings and Booleans) are passed to external programs via temporary files. In order to output these parameters to files, the data format to use must be specified with a <dataformat> element inside the parameter. The name of the format must be given and the format might also require specification of additional format-specific <setting> elements. Each setting has a name and a class (similar to the class of parameters as described above). Since the data format settings used by an external program are normally decided in advance and hence fixed, the values for the settings are usually constant values written between the <setting> and </setting> tags. However, it is also possible to dynamically set a value using a link to another previously defined parameter (of the same class). For example, the "PSP" data format below specifies values for four settings (if the PSP format had other settings these would take on default values). The first three settings have fixed values, whereas the last setting "Motif width", which is an integer number, takes its value from another parameter called "Motif Size" (which should be an integer-class parameter that has been defined earlier in the XML-file). Please consult the Data Formats section of the user manual for detailed descriptions of each particular data format and their settings.

   <parameter type="regular" name="Positional priors" class="NumericDataset"   
              required="no" hidden="no">   
       <dataformat name="PSP">
           <setting name="Orientation" class="String">Direct</setting>
           <setting name="Normalize" class="String">Max 1</setting>
           <setting name="Include width" class="Boolean">true</setting>
           <setting name="Motif width" class="Integer" link="Motif Size" />
       </dataformat>
   </parameter>

Setting up the command line

The command line used to execute the external program can be defined in two different ways. One way is to explicitly specify the command-line, using the <command> element as described below. This method is the most powerful. However, programs that have very straightforward interfaces can do without the command-element.
If no <command> element is specified, the command-line is built up by writing out the name of the executable program followed by all the parameters in the order that they appear in the XML-file. The values of "simple" parameter types, like numbers and strings, are written directly to the command-line whereas complex types (such as large datasets) are written to temporary files and the filename is written to the command line. If a parameter has an associated switch then the switch is written out before the parameter itself. If the parameter is a boolean "flag", only the switch is output (or not, depending on the boolean value of the parameter). "Implicit" arguments are not written to the command line, however. Implicit arguments can be used when the value for a parameter is always the same, for instance if the external program always writes its output to a file named "output.txt" which is not referenced on the command line. Arguments that are implicit should specify the (already known) filename instead of a switch (unless they link to other parameters).

If the "RandomFilter" program described at the top of this page is executed, and the user has chosen a region dataset to use for the first parameter and a value of e.g. "0.45" to use for the second parameter, the resulting command-line that is executed will look like this:

   C:\bioinformatics\randomfilter.exe -i <tempfile_1> -p 0.45 -o <tempfile_2>

Before executing the command, however, the region dataset the user selected for the first regular parameter is output (in GFF-format) to a temporary file named tempfile_1. The third parameter also refers to a region dataset, but since this is a "result" parameter only the name of the file (randomly chosen for the occasion) is passed to the external program on the command-line. The external program is expected to write its output to this file (in GFF-format as specified in the XML-file) whose contents will later be read back by MotifLab after the program execution has finished.

The command element

If the program requires a more complex command-line than just the name of the program followed by the parameters in the order specified, the command-line can be specified explicitly with a <command> element. For instance, if the RandomFilter program above was not a standalone executable, but rather a perl script, we might have to specify the command-line like this:

   <command>perl %PROGRAM {Region Track} {Probability} {Result}</command>

Here, %PROGRAM is a special string which refers to the program itself (this was implicit when we didn't use the command-element). Other special strings that can be used include %APPDIR which refers to the directory where the program resides, and %WORKDIR which is the "working directory" used when executing the command. Parameters are referred to on the command line by placing the name of the parameter in braces. The command-line will be parsed and these braces will be replaced by the actual value of the parameter (or a filename for complex parameters), possibly preceded by a switch if one is specified.
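
Since parameters are referenced by name, a <command> element can also be used simply to change the order in which they appear. The sketch below assumes a hypothetical variant of RandomFilter that wants the probability before the input file; the expanded command line in the comment is only indicative and reuses the values from the earlier RandomFilter example.

   <!-- would be expanded to something like:
        C:\bioinformatics\randomfilter.exe -p 0.45 -i <tempfile_1> -o <tempfile_2> -->
   <command>%PROGRAM {Probability} {Region Track} {Result}</command>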

It is possible to specify multiple commands that should be executed in succession. This can be useful for instance if there is a need to perform any pre- or post-processing steps before or after running the program itself (for instance to convert output in a non-standard format produced by the program to GFF which can be read by MotifLab). There are two ways to specify multiple commands. The simplest way is to just include multiple commands in the same <command> element and separate those commands with a semicolon. Since some programs or operating systems might use semicolons for other purposes on the command line (for example to separate multiple paths in a JAVA classpath), it is possible to specify alternative characters (or even strings) to separate the commands via a separator argument to the command element. For example, the line <command separator="#"> uses the # sign to separate commands rather than the default semicolon.
The second way to specify multiple commands is to include a list of <command> elements. Note that in order to use this option, this list must be enclosed in an outer <commands> element to signal that the commands belong together.

   <commands>
      <command> first command... </command>   
      <command> second command... </command>   
      <command> third command... </command>   
   </commands>
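
The first approach (several commands inside a single <command> element) could look like either of the two alternatives sketched below. The converter script, the JAR files and all filenames are hypothetical and only serve to show where the separators go.

   <!-- default separator (semicolon): run the program, then a converter script -->
   <command>%PROGRAM {DNA}; perl convert.pl raw.txt output.gff</command>

   <!-- alternative separator, used here because the classpath itself contains a semicolon -->
   <command separator="#">java -cp scan.jar;lib.jar Scan {DNA} # perl convert.pl raw.txt output.gff</command>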

An XML-configuration file should preferably be designed to be usable irrespective of which operating system the program will eventually run on. However, references to specific files within a command line might be tricky since different operating systems have different ways of representing file paths. Also, some operating systems might need to escape filenames containing spaces by enclosing them in quotes. MotifLab performs the necessary conversions automatically for temporary files and the %PROGRAM special string, but if you want to refer directly to other files within the command line, you might have to explicitly state that this part of the string refers to a file and should be processed accordingly. There are two ways to inform MotifLab that you want to refer to a file, and both work by enclosing the filename in "special quotes". The first uses "dollar-brace" style, like this: ${filepath}$ , and the other uses "dollar-quote-brace" style, like so: $'{filepath}'$ . (Note that the closing delimiter is the mirror image of the opening delimiter.) The difference between these is really only apparent for programs that run on WINDOWS using CYGWIN. With the first style, WINDOWS-paths are converted to CYGWIN Unix-style paths and enclosed in quotes if they contain spaces. The latter style does not convert the paths but will enclose them in quotes if they contain spaces. Use the latter style to refer to programs that should be executed and the first style for other file references. For an example of usage of the latter style you can have a look at the XML-configuration file for Weeder.
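
The sketch below shows where the two quoting styles would typically be placed. The paths, the "-m" switch and the converter program are invented for the illustration (and the hardcoded WINDOWS paths are of course not portable).

   <commands>
      <!-- a data file: converted to a UNIX-style path under CYGWIN, quoted if it contains spaces -->
      <command>%PROGRAM {DNA} -m ${C:\My Data\matrices.txt}$</command>
      <!-- a program to execute: path left unconverted, quoted if it contains spaces -->
      <command>$'{C:\Program Files\tools\convert.exe}'$ raw.txt output.gff</command>
   </commands>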

Sometimes, different operating systems can have a totally different command line syntax for the same program. To cope with such cases, you can specify a different command element for each operating system and use the os argument of the command to tell MotifLab which operating system the command pertains to, like so <command os="windows">. The "windows" string can be used to refer to all versions of windows, but for other operating systems the OS-string should match the (case-insensitive) String that will be returned by a call to the JAVA method System.getProperty("os.name"). The os argument also applies to the <commands> element used to group together multiple commands, so you can have different command groups for each OS. If no OS is specified for a command, it will apply to all operating systems and act as the default if no other more specific commands apply (e.g. if a configuration file contains two command elements, one with os="windows" and one with no OS argument, systems running windows will use the windows-specific command and all other systems will use the other command).
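
A configuration with OS-specific commands might be sketched as follows. The parameter names refer to the "Scan" example earlier on this page, and the use of "sh" to launch the program is just an assumption made for the illustration.

   <!-- used on all versions of WINDOWS -->
   <command os="windows">%PROGRAM {DNA} {Background} {Scan reverse strand}</command>

   <!-- used when System.getProperty("os.name") returns "Linux" -->
   <command os="Linux">sh %PROGRAM {DNA} {Background} {Scan reverse strand}</command>

   <!-- default for all other operating systems -->
   <command>%PROGRAM {DNA} {Background} {Scan reverse strand}</command>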

There may be cases when programs behave so differently on different operating systems that a simple rephrasing of the command line to execute the program is not sufficient to make the configuration compatible with multiple systems. It could be, for example, that a program has substantially altered functionality depending on the OS, uses different parameters or relies on other data formats for input and output. In such cases the system element can be used to group together elements that apply to different operating systems. A system element should be a direct child of the program element and can contain command, parameter, report and temporary elements that are specific to an operating system. Just like the command element, each system should also have an os argument which specifies which operating system it applies to (and a system element without such argument applies to all operating systems for which no other more specific system element is found).
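
As a rough sketch, a program whose output switch differs between operating systems could be configured with two system elements like this. The switches "-o" and "-out" are made up, and the example only shows the general layout of the elements.

   <program name="Scan" class="scanning">
      <service type="local" />

      <!-- applies to WINDOWS only -->
      <system os="windows">
         <parameter type="regular" name="DNA" class="DNASequenceDataset" required="yes">
            <dataformat name="FASTA" />
            <argument type="valued option" />
         </parameter>
         <parameter type="result" name="Result" class="RegionDataset">
            <dataformat name="GFF" />
            <argument type="valued option" switch="-o" />
         </parameter>
      </system>

      <!-- applies to every operating system without a more specific system element -->
      <system>
         <parameter type="regular" name="DNA" class="DNASequenceDataset" required="yes">
            <dataformat name="FASTA" />
            <argument type="valued option" />
         </parameter>
         <parameter type="result" name="Result" class="RegionDataset">
            <dataformat name="GFF" />
            <argument type="valued option" switch="-out" />
         </parameter>
      </system>
   </program>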

Linking to other parameters

It is possible for a parameter to take on the same value as another parameter by "linking" to this other parameter. This is accomplished by specifying a link argument containing the name of the target parameter. Note that parameters can only link to other parameters that have already been defined earlier in the XML-file and they can only link to parameters of the same class. Parameters (except result parameters) that link to others should be "hidden", since their values should not be explicitly set by the user (only indirectly via the parameter being linked to). Settings for data formats can also link to other parameters (but not other settings) as explained above, and this is the only way a user can (indirectly) change values for data format settings (since information about data formats used for passing parameters is not usually revealed to the user).
For example, motif discovery programs require two result-type parameters to be defined called "Result" and "Motifs" which will hold respectively the binding sites and motifs discovered by the program. Each of these parameters is processed individually by MotifLab since the data produced for each parameter could potentially be output to different files by the program (the MotifSampler program for example outputs one GFF-file containing the predicted binding sites and one file containing the motif PWMs). However, many programs output all their results to a single file, and this will require both of these parameters to reference the same file (and usually this also means that a new program-specific parser has to be included in MotifLab). The code below shows these two result parameters defined for a hypothetical motif discovery program ("ProgramX") which allows the name of the single output file to be given on the command line using the switch "-o <outputfile>". The parameter defined first ("Result") references the file on the command line directly by using a "valued option" argument. The second parameter ("Motifs"), however, references the same file by linking to the first parameter (and declaring itself an "implicit" argument). It would be possible to use two different data formats for parsing the results file, one for each parameter. However, the solution below uses the same data format ("ProgramXFormat") for parsing both the binding sites and the motifs in the same file. Instead, the data format-specific setting "Parse" (which can here have the value "Sites" or "Motifs") is used to tell the ProgramXFormat which parts of the information in the file it should concentrate on and also what data it should return to MotifLab.

   <parameter type="result" name="Result" class="RegionDataset"> 
       <argument type="valued option" switch="-o" />
       <dataformat name="ProgramXFormat">
           <setting name="Parse" class="String">Sites</setting>
       </dataformat>
   </parameter>

   <parameter type="result" name="Motifs" class="MotifCollection" link="Result">   
       <argument type="implicit" />
       <dataformat name="ProgramXFormat">
           <setting name="Parse" class="String">Motifs</setting>
       </dataformat>
   </parameter>

Parameters that link to other parameters will either reference the same atomic value as the target parameter (for the basic types Integer, Double, Boolean and String) or reference the same file as the target parameter (for all other complex data types). However, sometimes it could be necessary for a complex-type parameter to reference the same data object as another parameter but to have this object output to a different file in a different format. This can be accomplished by declaring the parameter to be a softlink rather than a regular link. For example, the motif scanning program FIMO can make use of positional priors and therefore has a parameter called "Positional priors" allowing the user to select a Numeric Dataset. This parameter is output in PSP format, but the FIMO program also requires a second auxiliary file based on the same data which should contain binned priors values. Both of these files must be specified on the command line. By using a hidden parameter called "Binned priors" which softlinks to the "Positional priors" parameter, a second file in a different format can thus be created from the same dataset that the user selected for the "Positional priors" parameter.

Conditions (MotifLab v2.0+)

Sometimes a program can have parameters that are only applicable under certain circumstances, which often depend on the settings of other parameters. For example, if the user has selected a value for an optional parameter, a second parameter might have to be specified also, but this second parameter is not required if the first parameter is unspecified. Hence, for the sake of displaying a tidy user-interface dialog for the program, this second parameter should only be shown to the user after a value has been selected for the first parameter. Such context-specific responses to selections in the dialog can be achieved with conditions. A condition is a child-element of a parameter which is set to monitor a parameter and perform certain actions when the value of this parameter is updated. These actions could include showing or hiding other parameters or setting the value of other parameters.

Example:
Below is an example with an optional parameter called "Background" which has an associated condition monitoring it. When the user selects a value for this parameter, the condition checks if this value is specified (a background model has been selected) or not (the value is left blank). If a background model was selected, a second parameter called "Other" will be shown in the dialog. If not, the "Other" parameter will be hidden.

   <parameter type="regular" name="Background" class="BackgroundModel" required="no" > 
       <condition if="selected" then="Other:show" else="Other:hide" />
   </parameter>

Each condition must have an if-attribute which specifies a condition that must be met in order to perform an action. (Alternatively, an ifNot-attribute can be used instead to specify that the action should be performed if the condition is not satisfied). If the if-condition is met (or the ifNot-condition is not met), the then-attribute specifies the action to perform. An optional else-attribute can be used to specify an action that should be performed instead of the then-action if the if-condition is not met.

If-attribute:
The if-attribute can have one of the following values:

selected
value=<allowed values>
type=<allowed types>
updated

If the if-attribute is set to "selected", the condition will be satisfied if the monitored parameter has a specific selected value (not left blank).
For Boolean parameters this condition is met if the value is TRUE and not FALSE.

If the condition is based on the "value" of the parameter, the condition will be met if this value equals one of the listed values (multiple values can be separated with vertical bars, e.g. "value=1|2|3"). Note that the value that is used is the value of the parameter as it appears in the GUI dialog and not the value of any selected data object. Hence, if the user has selected a Numeric Variable called "X" (with a value of 54) for the parameter, the value that is checked is "X" and not 54. This condition is thus mostly useful for checking the value of String-type parameters.

If the condition is based on the "type" of the parameter, the condition will be met if the data type of the selected value equals one of the listed types (multiple types can be separated with vertical bars). This could, for example, be used to check if the value for an Integer-type parameter was specified with a literal integer ("type=Integer") or with a Numeric Variable ("type=NumericVariable").

The "updated" condition is always met as long as the user has made selections or updates for this parameter in the dialog (even if the selected value is the same as before).

The condition of the if-attribute will usually refer to the value of the enclosing parameter. However, it is possible to specify that the condition should monitor a different parameter instead by specifying the optional monitor="<parameterName>"-attribute (see example below).
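
The four kinds of if-attribute values can be summarized in a small sketch. The Python function below is a hypothetical illustration of the rules described above, not MotifLab code; its name and arguments (the text displayed in the dialog, the name of the data type of the selection, and a flag telling whether the user has just updated the parameter) are assumptions made for the example. An ifNot-attribute would simply negate the returned result.

```python
def condition_met(if_value, displayed_value, value_type, updated=True):
    """Evaluate an if-attribute against the state of the monitored parameter.

    displayed_value: the value as it appears in the dialog (None or "" if blank)
    value_type:      name of the data type of the selection, e.g. "Integer"
    updated:         True if the user has just made a selection or update
    """
    if if_value == "selected":
        if value_type == "Boolean":
            # Boolean parameters satisfy "selected" only when TRUE
            return str(displayed_value).lower() == "true"
        return displayed_value not in (None, "")
    if if_value == "updated":
        return updated
    key, separator, allowed = if_value.partition("=")
    if not separator or key not in ("value", "type"):
        raise ValueError("unrecognized if-attribute: " + if_value)
    options = allowed.split("|")  # multiple options are separated by vertical bars
    if key == "value":
        return displayed_value in options
    return value_type in options


# A Numeric Variable named "X" (holding 54) was selected for the parameter
print(condition_met("value=54", "X", "NumericVariable"))
print(condition_met("value=X", "X", "NumericVariable"))
print(condition_met("type=Integer|NumericVariable", "X", "NumericVariable"))
```

The first call is not satisfied because the value that is checked is the name "X" shown in the dialog and not the number 54, whereas the second and third calls are satisfied.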

Then- and Else-attributes:
These attributes specify an action to perform when the if-condition is met or not met respectively. Recognized values are:

show
hide
setValue=<somevalue>
setToValueOf=<parameter>

The "show" and "hide" actions will show or hide the parameter in the dialog, whereas the "setValue" and "setToValueOf" will set the value of the parameter to either a specific value or to the value of another named parameter (note that the latter two should only be employed to set values for hidden parameters that the user has no control over anyway to avoid indeterminate behaviour).
The specified action will normally be applied to the enclosing parameter, but it is possible to apply the action to a different parameter instead by prefixing the action with the name of that parameter followed by a colon (e.g. "OtherParameter:hide" or "OtherParameter:setValue=7").

Note that it is only possible to specify a single action to perform when the condition is met (or not). However, if it is desirable to perform several actions one can always include multiple conditions for a parameter.
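
As an illustration of the action syntax, the hypothetical Python helper below splits the value of a then- or else-attribute into the parameter it targets, the action name and an optional argument. It is a sketch based only on the description above. In particular, treating a colon that appears after a recognized action name as part of the value (rather than as a parameter prefix) is an assumption of this example.

```python
KNOWN_ACTIONS = ("show", "hide", "setValue", "setToValueOf")


def parse_action(action, enclosing_parameter):
    """Split a then-/else-attribute into (target parameter, action, argument)."""
    target = enclosing_parameter
    head, colon, rest = action.partition(":")
    # Text before the first colon names another parameter, unless it is itself
    # a recognized action (assumption: "setValue=a:b" sets the value "a:b").
    if colon and head.partition("=")[0] not in KNOWN_ACTIONS:
        target, action = head, rest
    name, _, argument = action.partition("=")
    if name not in KNOWN_ACTIONS:
        raise ValueError("unrecognized action: " + name)
    return target, name, argument or None


print(parse_action("Other:show", "Background"))
print(parse_action("hide", "Other"))
print(parse_action("OtherParameter:setValue=7", "Background"))
```

An action without a prefix is applied to the enclosing parameter, so "hide" inside the "Other" parameter targets "Other" itself.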

Example 2:
This example is equivalent to the example above and shows an alternative way to accomplish the same effect from a different perspective. In the above example, the condition was associated with the "Background" parameter which monitored itself. Depending on the value of this parameter the actions to be performed, as specified by the then- and else-attributes, were applied to a second parameter named "Other" by prefixing the value of the then- and else-attributes with "Other:". In the example below, the condition is instead associated with the "Other" parameter, but the condition is set to monitor the value of the "Background" parameter by setting the monitor="Background" attribute of the condition. Since the actions to be performed when the condition is met (or not) are to be applied to the enclosing parameter ("Other"), the prefix was dropped from the then- and else-attributes.

   <parameter type="regular" name="Other" class="..." required="no" > 
       <condition monitor="Background" if="selected" then="show" else="hide" />
   </parameter>

Reports (MotifLab v2.0+)

Often a program that writes its regular results to files will output additional information during execution to either STDOUT or STDERR (or both) to inform the user of the program's progress and report on any errors that have been encountered. The standard way to handle such output by MotifLab is to display each line in the GUI's status bar at the bottom of the screen. Version 2.0 of MotifLab, however, introduced the <report> element which can recognize specific expressions and display them either in the status bar, the log panel or an error dialog. If the program outputs information about how far it has come in its execution in the form of a percentage number or ratio, this information can also be captured and used to set the progress bar in the GUI.

  <reports>
      <report expression="" target="status" />   
      <report expression="WARNING:.+" target="log" />   
      <report expression=".*?next.*" target="log" />   
      <report expression="ERROR:.+" target="error" />   
      <report expression=".+?:(\d+)%.*" target="progress" />   
  </reports>

Each <report> element has one required expression argument and two optional arguments target and output. The expression argument specifies a regular expression that MotifLab should look for in the output. If a line sent to STDOUT or STDERR by the program matches a specified expression, that line will be sent to the designated target which can be either "status" (line is displayed in the status bar), "log" (line is displayed in the log-panel), "error" (line is displayed in an error dialog and the execution of the program is stopped) or "progress". The "progress" target has some special requirements on the regular expression, namely that it must include either one or two capture groups, i.e. expressions enclosed in parentheses that match a number, such as "(\d+)" in the last example above. If only one capture group is specified, this should match a (percentage) number between 0 and 100 which will be used directly to set the progress in the progressbar. If two such capture groups are specified, the first group should capture a number reporting how many subtasks have been completed so far and the second group should capture a number reporting the total number of such subtasks (e.g. "processing sequence 23 of 60"). The ratio between the first and second number will then be used to set the progressbar. Note that the specified regular expression must match a whole line in the output by the program and not just a substring. This means that it could be wise to start the expression with ".*?" and end it with ".*" to be sure that the whole line is matched. An empty expression is considered as a wildcard and will match any output.
Hence, in the example above, the first report line will display every line of output produced by the program in the status bar, lines starting with the word "WARNING:" or containing the word "next" will be displayed in the log-panel (note that it is possible to specify multiple reports for the same target), and if the program ever outputs a line starting with "ERROR:", MotifLab will end the execution of the program and report this line in an error dialog. The last report statement will search for lines containing any text followed by a colon and an integer number suffixed by a % sign. This integer number will then be used to set the value of the progressbar.
Normally, the line that is matched by the given expression will be displayed to the user. However, it is also possible to state that a different text should be displayed with an optional output argument. For example, the statement "<report expression=".*?next.*" output="still working..." />" will display the text "still working..." in the status bar (which is the default target) every time a line containing the word "next" is output by the running program. So far, the output-text cannot contain references to the matched expression, but hopefully this will be supported in future versions of MotifLab.
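
The matching rules for reports can be tried out with the following sketch. It is a hypothetical Python re-creation of the behaviour described above and not MotifLab's own code: MotifLab uses Java regular expressions, whereas this example uses Python's re module (the expressions shown here behave the same in both), and it assumes that every report matching a line is applied to that line.

```python
import re

# The (expression, target) pairs from the <reports> example above
REPORTS = [
    ("", "status"),
    (r"WARNING:.+", "log"),
    (r".*?next.*", "log"),
    (r"ERROR:.+", "error"),
    (r".+?:(\d+)%.*", "progress"),
]


def dispatch(line, reports=REPORTS):
    """Return a list of (target, payload) pairs for one line of program output."""
    results = []
    for expression, target in reports:
        # An empty expression is a wildcard; otherwise the WHOLE line must match
        match = re.fullmatch(expression or r".*", line)
        if match is None:
            continue
        if target != "progress":
            results.append((target, line))
            continue
        numbers = [float(group) for group in match.groups()]
        if len(numbers) == 1:
            results.append((target, numbers[0] / 100.0))        # percentage
        elif len(numbers) == 2:
            results.append((target, numbers[0] / numbers[1]))   # done / total
    return results


print(dispatch("WARNING: low memory"))
print(dispatch("step:40% done"))
print(dispatch("processing sequence 23 of 60",
               [(r".*?sequence (\d+) of (\d+).*", "progress")]))
```

Here the progress is returned as a fraction between 0 and 1. A real implementation would additionally stop the program when the "error" target is hit.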

Cleaning up

If a program creates any additional files or directories during its execution (besides the temporary files created to pass complex parameters), it is prudent to specify these so that MotifLab can perform the necessary clean up after the execution has finished. The <temporary> element is used to specify the names of these temporary files (or directories). The special strings %WORKDIR and %APPDIR explained above can prefix the filenames if necessary.

  <program>
      ...
      ...
      ...
      <temporary filename="tempfile1" />   
      <temporary filename="%WORKDIR/tempfile2" />
  </program>

Configuring data tracks and sources

Datatracks XML configuration file

Documentation is in preparation...

Data Formats

Data formats define ways to formally describe the information contained in a data track or other data object and thus allow this information to be written to files and shared between computational tools. MotifLab supports many of the standard bioinformatics data formats that are relevant to regulatory sequence analysis, including e.g. FASTA, GFF and BED for feature data tracks and TRANSFAC or JASPAR formats for describing motif models. Data objects can be output to a selected data format with the output operation. This operation will create a textual representation of the data according to the specified format and store this text in special Output Data objects (shown as separate tabbed panels in MotifLab).
The contents of such Output Data objects can then be saved to file. Most data formats can be used for both output and input, meaning that information that has been exported in a specific format can be read back by MotifLab at a later time and used to reconstruct the original data objects. However, a few data formats can only be used for either input or output. For example, MotifLab is able to import data from the compressed binary formats BigBED, BigWIG and 2bit, but is currently not capable of exporting data in these formats. Conversely, information about sequences or motifs can be presented in aesthetically pleasing tables in various HTML-based formats, but MotifLab cannot parse this information back again to reconstruct the original sequences or motifs.

Complete versus lossy data formats

Data objects usually have a set of recognized standard properties depending on their type. For example, all sequence objects have a genomic location and strand orientation and motifs have names and PWM models (or IUPAC models). All standard data formats that apply to sequences thus have ways to represent the location and strand of a sequence, and data formats used to describe motifs include descriptions of the name and PWM model. However, in addition to such standard properties, data objects in MotifLab often have non-standard or user-defined properties that are not necessarily supported by standard data formats. Hence, if a data object that contains non-standard properties is exported in a standard data format, these non-standard properties will usually be ignored in the output. Consequently, it will not be possible for MotifLab to fully reconstruct the original data object when reading the information back again with such a data format. Below, we use the term complete when referring to data formats that always support the full set of both standard and non-standard properties, and thus allow data objects to be completely reconstructed from files. Users will never risk losing information if these formats are used. Conversely, lossy data formats do not save all the necessary information required to fully reconstruct the original data object, and these data formats should then be used with some caution. Potentially complete data formats do not save all the information by default, but can be considered complete if necessary precautions are taken.

Below is an incomplete classification of some of the data formats supported by MotifLab

Data Type	Complete	Potentially complete	Lossy
DNA Datasets	FASTA
Numeric Datasets	PRIORITY	PSP², WIG³, BedGraph³
Region Datasets		GFF⁴, BED¹,⁵, EvidenceGFF¹,⁵, Region_Properties¹	GTF
Sequences	Location⁷	Sequence_Properties¹, BED¹, Properties¹
Motifs	MotifLabMotif	Motif_Properties¹, Properties¹	TRANSFAC, Jaspar, MEME_Minimal_Motif, INCLUSive_Motif_Model, RawPSSM, XMS, HTML_MotifTable, HTML_Matrix, BindingSequences
Modules	MotifLabModule	Module_Properties¹, Properties¹
Collections	All applicable formats
Partitions	All applicable formats
Maps	All applicable formats
Background models	All applicable formats⁶

¹ These formats can specify which properties should be included. Hence, in order to make them complete, all properties must be specified.
² The PSP format can be considered complete only if the "motif width" parameter is set to 0.
³ MotifLab allows sequences to overlap with other sequences but still be treated as completely separate with respect to the contents of associated feature tracks. For example, if you have two separate sequences A and B that have the exact same genomic location and add e.g. a conservation track, the conservation track will initially be the same for the two sequences. Later, however, the conservation track can be manipulated with operations or edited with the draw tool so that the track has different contents for sequence A and B. Whereas the data formats PRIORITY and PSP will save the track information from a sequence-centric perspective (representing the information as a list of values for each sequence without any consideration to where the sequence is located), the WIG and BedGraph data formats take on a genome-centric perspective and make a note of the genomic position that each value in the track is associated with (without considering which sequence it belongs to). Hence, when importing back information stored in WIG or BedGraph formats, information pertaining to one sequence can overwrite another sequence if they overlap. However, if none of the sequences overlap with each other, these formats can also be considered complete.
⁴ GFF is only complete for module tracks if the "include module motifs" parameter is selected. GFF currently does not support tracks with other linked regions.
⁵ EvidenceGFF and BED formats are not complete when used with module tracks or linked-region tracks.
⁶ The "INCLUSive_Background_Model" is the only background model data format that fully supports meta-data, but such meta-data is not fully supported by MotifLab.
⁷ Provided that the full "10-field" format or a complete custom-format is used.
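
The difference between the sequence-centric and genome-centric perspectives for numeric tracks can be demonstrated with a toy example. The Python functions below are hypothetical stand-ins for the export and import steps and do not read or write any real file format. The example assumes that values are imported in the order the sequences were written, so that the sequence written last wins for overlapping positions.

```python
def export_sequence_centric(tracks):
    """PRIORITY/PSP perspective: one list of values per sequence name."""
    return {name: list(values) for name, (_, _, values) in tracks.items()}


def export_genome_centric(tracks):
    """WIG/BedGraph perspective: one value per genomic position, no sequence names."""
    positions = {}
    for chrom, start, values in tracks.values():
        for offset, value in enumerate(values):
            # A later sequence covering the same position overwrites the value
            positions[(chrom, start + offset)] = value
    return positions


def import_genome_centric(positions, locations):
    """Rebuild one track per sequence from the values stored per genomic position."""
    return {name: [positions[(chrom, start + i)] for i in range(length)]
            for name, (chrom, start, length) in locations.items()}


# Sequences A and B have the same location but different track contents
tracks = {"A": ("chr1", 100, [1.0, 1.0, 1.0]),
          "B": ("chr1", 100, [2.0, 2.0, 2.0])}
locations = {name: (chrom, start, len(values))
             for name, (chrom, start, values) in tracks.items()}

print(export_sequence_centric(tracks))
print(import_genome_centric(export_genome_centric(tracks), locations))
```

The sequence-centric export keeps the two tracks apart, whereas after a genome-centric round trip sequence A ends up with the values of sequence B.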

Default data formats

All data types have an associated default format which is the format used for that data type when no other is specified (e.g. when executing the command "output DataObject" without a following "in format XXX" argument). The default format is also used when importing data objects specified with data injection.

Data type	Default format
DNA Datasets	FASTA
Numeric Datasets	PRIORITY
Region Dataset	GFF
Sequences and Sequence Collections	Plain
Motifs and Motif Collections	MotifLabMotif
Modules and Module Collections	MotifLabModule
Partitions	Plain
Maps	MapFormat
Background models	INCLUSive_Background_Model
Expression Profiles	ExpressionProfile

Feature Dataset formats

FASTA

Applies to:

DNA Sequence Dataset

The output for a sequence in FASTA format consists of a header-line followed by one or more lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol at the start of the line. The word following the ">" symbol is the identifier of the sequence, and this may be followed by additional descriptive text. The sequence data can be split across multiple lines for improved readability, and the sequences will be sorted in the output according to the current sort order.

Example of sequence data in FASTA format:

>ENSG00000035403
GTAGTCGCTGCACAGTCTGTCTCTTCGCCGGTTCCCGGCC
CCGTGGATCCTACTTCTCTGTCGCCCGCGGTTCGCCGCCC
>ENSG00000100345
GCAGATCACCGCGGTTCCTGGGCAGGGCACGGAAGGCTAA
GCAAGGCTGACCTGCTGCAGCTCCCGCCTCGTGCGCTCGC
>ENSG00000107796
AACACCACCCAGTGTGGAGCAGCCCAGCCAAGCACTGTCA
GGGTAAGTGGCGCCAGGCCAAGGATGTGACTTATAGATTC

The header can contain other information in addition to the name of the sequence if the fields are separated by vertical bars. The fields are in order: sequence name, sequence location, strand orientation and organism/genome build. MotifLab version 2.0 can also recognize a fifth field specifying the gene name and location (position of TSS and TES). All the extra fields are optional, but the order is important, so if you want to include information about the strand, you must also include the sequence location field preceding it.

Example:

>ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18
>ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18|VCL:75427878-75549916

The sequence name must not contain spaces or characters other than letters, numbers or underscores. If the name contains spaces, only the first part of the name will be used. If the name contains other illegal characters, an error will be reported. The location must be given as "chromosome:start-end" (where the "chr" prefix for the chromosome is optional). For the orientation, strings starting with "direct", "+" or "1" are interpreted as the direct strand whereas strings starting with "reverse" or "-" are interpreted as the reverse strand (other strings will just default to direct strand). The "organism/genome build" field should be specified as two values separated by a colon, where the first value is an integer taxonomy identifier (or known organism name) and the second value is the genome build. Optionally, the genome build can be stated alone and the system will then try to infer the organism. The fifth "gene location" field introduced in MotifLab v2.0 has the form "gene name:TSS-TES".
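
For readers who want to produce or check such headers in their own scripts, the rules above can be sketched as a small parser. The Python function below is a hypothetical illustration and not the parser used by MotifLab. The dictionary keys are invented for this example, and no attempt is made to infer the organism when the genome build is stated alone.

```python
import re


def parse_fasta_header(header):
    """Parse '>name|chromosome:start-end|strand|organism:build|gene:TSS-TES'."""
    fields = header.lstrip(">").strip().split("|")
    name = fields[0].split()[0]  # only the first part of a name with spaces is used
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError("illegal character in sequence name: " + name)
    info = {"name": name}
    if len(fields) > 1:
        match = re.fullmatch(r"(?:chr)?(\w+):(\d+)-(\d+)", fields[1].strip())
        if match is None:
            raise ValueError("location must be given as chromosome:start-end")
        info["chromosome"] = match.group(1)
        info["start"], info["end"] = int(match.group(2)), int(match.group(3))
    if len(fields) > 2:
        strand = fields[2].strip().lower()
        # Anything not recognized as reverse defaults to the direct strand
        info["strand"] = "reverse" if strand.startswith(("reverse", "-")) else "direct"
    if len(fields) > 3:
        organism, colon, build = fields[3].strip().partition(":")
        if colon:
            info["organism"], info["build"] = organism, build
        else:
            info["build"] = organism  # the genome build was stated alone
    if len(fields) > 4:
        gene, _, span = fields[4].strip().rpartition(":")
        tss, _, tes = span.partition("-")
        info["gene"] = {"name": gene, "tss": int(tss), "tes": int(tes)}
    return info


header = ">ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18|VCL:75427878-75549916"
print(parse_fasta_header(header))
```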

Name	Description
Strand orientation	This parameter controls which strand to output for each sequence. Valid options are "Direct" (output sequence from genomic direct strand), "Reverse" (output sequence from genomic reverse strand) and "Relative" (output sequence data relative to the orientation of the sequence, i.e. use the same strand as the one the sequence originates from).
Header	Specifies what information to include in the header (after the > sign). The default is to output only the name of the sequence, but additional fields (separated by vertical bars) can also be output, such as the genomic location of the sequence, the strand orientation of the sequence and the genomic build of the sequence.
Column width	The number of sequence bases to output on each line. If the length of the sequence is longer than the specified column width, the sequence data will be split across multiple lines. A common value is 80, but the special value of 0 can be used to specify that the whole sequence should be output on one single line.
Extra space	If selected, an extra empty line will be added after the sequence data for each sequence (and before the header of the next sequence) to separate the sequences visually. Note that some external programs might not be able to parse FASTA files correctly if extra lines are added.

See also: output, DNA Sequence Dataset

2bit

Applies to:

DNA Sequence Dataset

The 2bit format is a binary format for efficiently storing (multiple) DNA sequences in a compact randomly-accessible format (up to 4Gb). MotifLab is currently able to import DNA track data from 2bit files, but is not able to output tracks in 2bit format. More information about the 2bit format and how to create 2bit files can be found here and here. 2bit formatted files are often used to store entire genomes, and in this case it is possible to extract DNA sequences for any segment as long as the location is known. This is in contrast with e.g. FASTA-formatted files, where you can only import DNA sequences if they have the same name and length as your current sequence objects.

Name	Description
Keep masks	If selected, lowercase letters in the DNA sequence will be kept as is ("masked"). If not selected, all bases will be in uppercase. NOTE: The current implementation of the 2bit format in MotifLab is very inefficient when this option is selected, so it is not recommended to use it.

See also: output, FASTA, DNA Sequence Dataset

WIG

Applies to:

Numeric Dataset

The WIG (wiggle) format is designed for display of dense continuous data such as probability scores. Further description of the WIG format can be found here and here but is also repeated below.
A WIG file consists of one or more blocks where each block starts with a declaration line and is followed by lines defining data elements. There are two main formatting options: fixedStep and variableStep, and each block can have different formatting as described in the block's declaration line. Note that while MotifLab is capable of reading blocks in both of these formats, it will only produce output in variableStep format (with span=1).

variableStep

variableStep format is designed for data with irregular intervals between data points, and is the more commonly used format. It begins with a declaration line, followed by two columns containing chromosome positions and data values.
The declaration line begins with the word "variableStep" and is followed by space-separated key-value pairs:

chrom (required) - name of chromosome
span (optional, defaults to 1) - the number of bases that each data value should cover

The span allows data to be compressed as follows:

Without span:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

With span:

variableStep chrom=chr2 span=5
300701 12.5

Both of these examples will display a value of 12.5 at position 300701-300705 on chromosome 2.

fixedStep

fixedStep format is designed for data with regular intervals between data points and is the more compact of the two wiggle formats. It begins with a declaration line, followed by a single column of data values.

The declaration line begins with the word "fixedStep" and is followed by space-separated key-value pairs:

chrom (required) - name of chromosome
start (required) - start point for the data values
step (required) - distance between data values
span (optional, defaults to 1) - the number of bases that each data value should cover

Without span:

fixedStep chrom=chr3 start=400601 step=100
11
22
33

Displays the values 11, 22, 33 as single-base features, on chromosome 3 at positions 400601, 400701 and 400801 respectively.

With span:

fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

Displays the values 11, 22, 33 as 5-base features, on chromosome 3 at positions 400601-400605, 400701-400705 and 400801-400805 respectively.

Data values

Wiggle element data values can be integer or real, positive or negative. Chromosome positions are 1-relative, i.e. the first base is 1. Only positions specified have data; unspecified positions will be empty.
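
To make the two block types concrete, the following sketch expands WIG blocks into one value per genomic position. It is a minimal Python illustration based only on the description above (declaration lines with key-value pairs followed by data lines) and is not the parser used by MotifLab. The function name is hypothetical and other line types that may occur in WIG files are not handled.

```python
def parse_wig(lines):
    """Expand WIG blocks into a dict mapping (chromosome, position) -> value."""
    values = {}
    mode = chrom = None
    span = step = position = 0
    for raw in lines:
        tokens = raw.split()
        if not tokens:
            continue  # skip empty lines
        if tokens[0] in ("variableStep", "fixedStep"):
            # Declaration line: remember the settings for the following data lines
            mode = tokens[0]
            settings = dict(token.split("=", 1) for token in tokens[1:])
            chrom = settings["chrom"]
            span = int(settings.get("span", 1))
            if mode == "fixedStep":
                position = int(settings["start"])
                step = int(settings["step"])
            continue
        if mode == "variableStep":
            start, value = int(tokens[0]), float(tokens[1])
        elif mode == "fixedStep":
            start, value = position, float(tokens[0])
            position += step
        else:
            raise ValueError("data line found before a declaration line")
        for offset in range(span):  # each value covers 'span' bases
            values[(chrom, start + offset)] = value
    return values


variable = parse_wig(["variableStep chrom=chr2 span=5", "300701 12.5"])
fixed = parse_wig(["fixedStep chrom=chr3 start=400601 step=100 span=5", "11", "22", "33"])
print(sorted(variable))
print(fixed[("chr3", 400705)])
```

The variableStep example yields the value 12.5 for positions 300701 to 300705 on chromosome 2, and in the fixedStep example position 400705 on chromosome 3 is the last base covered by the value 22.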

See also: output, Numeric Dataset

BigWig

Applies to:

Numeric Dataset

The BigWig format is used to represent dense, continuous numeric data in an indexed binary format. BigWig is the most compact and efficient way to represent and access very large numeric datasets, including datasets covering full genomes. MotifLab is currently able to import numeric track data from BigWig files, but is not able to output tracks in BigWig format. More information about the BigWig format and how to create BigWig files can be found here.

See also: output, WIG, Numeric Dataset

BedGraph

Applies to:

Numeric Dataset

The BedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. Further description of the BedGraph format can be found here. This track type is similar to the wiggle (WIG) format and 4-column BED format.
Each line in BedGraph format contains four columns where the first three define a chromosomal region (similar to the first three columns of the BED format) and the last column specifies a numeric value that applies to all the positions within that region.

Example:

chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
chr19 49302600 49302900 -0.50
chr19 49302900 49303200 -0.25
chr19 49303200 49303500 0.0
chr19 49303500 49303800 0.25
chr19 49303800 49304100 0.50

Name	Description
Add CHR prefix	If selected, the prefix "chr" will be added before the chromosome number (e.g. chromosome 12 will be output as "chr12" rather than just "12").
Coordinate system	Selects whether the coordinates are in the standard BED-coordinate system (with the chromosome starting at position 0 and end-coordinates being exclusive) or in the format used by e.g. GFF, where the chromosome starts at position 1 and both start- and end-coordinates are inclusive.
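
The effect of the two parameters can be illustrated with a short sketch. The Python functions below are hypothetical and only mimic the behaviour described in the table. They assume that a region is held internally with 1-based inclusive coordinates and that the four columns are separated by TABs.

```python
def inclusive_to_bed(start, end):
    """Convert 1-based inclusive coordinates to 0-based, end-exclusive BED coordinates."""
    return start - 1, end


def bed_to_inclusive(start, end):
    """Convert 0-based, end-exclusive BED coordinates to 1-based inclusive coordinates."""
    return start + 1, end


def bedgraph_line(chromosome, start, end, value, bed_coordinates=True, add_chr_prefix=True):
    """Format one BedGraph line for a region given in 1-based inclusive coordinates."""
    if add_chr_prefix and not chromosome.startswith("chr"):
        chromosome = "chr" + chromosome
    if bed_coordinates:
        start, end = inclusive_to_bed(start, end)
    return "\t".join([chromosome, str(start), str(end), str(value)])


# A 300 bp region covering positions 49302001 to 49302300 on chromosome 19
print(bedgraph_line("19", 49302001, 49302300, -1.0))
print(bedgraph_line("19", 49302001, 49302300, -1.0, bed_coordinates=False, add_chr_prefix=False))
```

Only the start coordinate differs between the two systems. The end coordinate stays the same because an exclusive 0-based end equals an inclusive 1-based end.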

See also: output, WIG, BED, Numeric Dataset

PRIORITY

Applies to:

Numeric Dataset

The PRIORITY format for numeric tracks was originally used by the PRIORITY motif discovery program to describe tracks to use for positional priors. The format is inspired by the FASTA format, and each sequence starts with a header line containing the sequence name preceded by a greater-than sign (">"). The next line after the header lists values for all the positions in the sequence separated by commas. (However, MotifLab also allows the values to be separated by either spaces or TABs).

Example:

>ENSG00000035403
0.118,0.188,0.839,0.887,0.91,0.898,0.903,0.873,0.0,0.002,0.003,0.001,0.0,0.994,0.996
>ENSG00000100345
0.998,0.999,0.998,0.997,0.997,0.998,0.998,0.982,0.994,1.0,1.0,1.0,1.0,1.0,1.0
>ENSG00000107796
0.444,0.519,0.999,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.992,0.997,0.994,0.975,0.396

Name	Description
Orientation	This parameter dictates the order in which to list the values for each sequence. The default setting is "Relative" and the valid settings are: Direct : List values according to the direct genomic strand. I.e. start with the value in the position with the smallest genomic coordinate, followed by the value in the next position with the second smallest genomic coordinate, etc. End with the value in the position with the highest genomic coordinate. Reverse : List values according to the reverse genomic strand. I.e. start with the value in the position with the highest genomic coordinate, followed by the value in the next position with the second highest genomic coordinate, etc. End with the value in the position with the smallest genomic coordinate. Relative : List the values in the order relative to the strand orientation of the sequence itself. I.e. use the "direct" ordering for sequences residing on the direct strand and the "reverse" ordering for sequences on the reverse strand. Opposite : List the values in the order opposite of the strand orientation of the sequence itself. I.e. use the "direct" ordering for sequences residing on the reverse strand and the "reverse" ordering for sequences on the direct strand.
Separator	The separator to use between the data values. The default is "Comma", but other valid options are "Space" and "TAB".
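The four "Orientation" settings reduce to a single decision: whether or not to reverse a list that is stored in order of increasing genomic coordinate. The sketch below illustrates that logic only. The function name `order_values` and the "+"/"-" strand encoding are assumptions made for this example.

```python
def order_values(values, sequence_strand, orientation="Relative"):
    """Return the values in the order they would be listed in the output.

    values          -- list ordered by increasing genomic coordinate
    sequence_strand -- "+" for the direct strand, "-" for the reverse strand
                       (hypothetical encoding used only in this sketch)
    orientation     -- "Direct", "Reverse", "Relative" or "Opposite"
    """
    on_direct = sequence_strand == "+"
    if orientation == "Direct":
        reverse = False
    elif orientation == "Reverse":
        reverse = True
    elif orientation == "Relative":
        reverse = not on_direct   # follow the strand of the sequence
    elif orientation == "Opposite":
        reverse = on_direct       # go against the strand of the sequence
    else:
        raise ValueError(f"unknown orientation: {orientation}")
    return list(reversed(values)) if reverse else list(values)
```

For a sequence on the reverse strand, `order_values([1, 2, 3], "-", "Relative")` lists the values from the highest genomic coordinate to the lowest.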

: PSP, FASTA, output, Numeric Dataset

PSP

Applies to:

The PSP format ("position-specific prior") for numeric tracks is used by programs in the MEME suite to describe tracks to use for positional priors. The format is similar to the PRIORITY and FASTA formats, and each sequence starts with a header line containing the sequence name preceded by a greater-than sign (">"). The sequence name is followed by a specification of the motif width (W). The next line after the header lists values for all the positions in the sequence separated by spaces. Since the original purpose of the PSP-format was to provide a value for each position reflecting the (prior) probability that a motif of width W could start in that position, the last W-1 positions in each sequence should have the value 0 (since no motifs of width W could start there). In fact, MotifLab will automatically output the value "0.0" for the last W-1 positions, thereby possibly overwriting any previous non-zero values for these positions! The values in a PSP file should preferably be between 0 and 1 and the values in all positions should sum to no more than 1.0 (however, these requirements from the original PSP specification are not enforced by MotifLab).

Example:

>ENSG00000035403 4
0.118 0.188 0.839 0.887 0.91 0.898 0.903 0.873 0.0 0.002 0.003 0.001 0.0 0.0 0.0
>ENSG00000100345 4
0.998 0.999 0.998 0.997 0.997 0.998 0.998 0.982 0.994 1.0 1.0 1.0 0.0 0.0 0.0
>ENSG00000107796 4
0.444 0.519 0.999 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.992 0.997 0.0 0.0 0.0

Name	Description
Orientation	This parameter dictates the order in which to list the values for each sequence. The default setting is "Relative" and the valid settings are: Direct : List values according to the direct genomic strand. I.e. start with the value in the position with the smallest genomic coordinate, followed by the value in the next position with the second smallest genomic coordinate, etc. End with the value in the position with the highest genomic coordinate. Reverse : List values according to the reverse genomic strand. I.e. start with the value in the position with the highest genomic coordinate, followed by the value in the next position with the second highest genomic coordinate, etc. End with the value in the position with the smallest genomic coordinate. Relative : List the values in the order relative to the strand orientation of the sequence itself. I.e. use the "direct" ordering for sequences residing on the direct strand and the "reverse" ordering for sequences on the reverse strand. Opposite : List the values in the order opposite of the strand orientation of the sequence itself. I.e. use the "direct" ordering for sequences residing on the reverse strand and the "reverse" ordering for sequences on the direct strand.
Motif width	This value (hereafter called W) will be output after the sequence name in the header for each sequence, and the last W-1 values for each sequence will be set to 0.0 (as required by the PSP format).
Include width	If selected (default), the motif width parameter will be output after the sequence name in the header. Note that the last W-1 values for each sequence will still be set to 0.0 even if the motif width is not included in the header.
Normalize	The original PSP specification requires that all the values lie between 0 and 1 and that the sum of values for each sequence is no greater than 1.0. The "Normalize" parameter can be used to normalize all the values so that the output conforms to these requirements. The default setting is to not perform any normalization, but it is also possible to normalize the values by dividing the value in each position by the largest value for that sequence ("Max 1") or by dividing the value in each position by the sum across all positions in that sequence ("Sum to 1").
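The interplay between the "Motif width", "Include width" and "Normalize" parameters can be sketched as follows. This is an illustration rather than MotifLab's implementation: the function name `psp_record` is hypothetical, and zeroing the last W-1 values before normalizing is an assumption of this sketch (the manual does not state the order).

```python
def psp_record(name, values, width, include_width=True, normalize=None):
    """Format one sequence as a two-line PSP record (header + values).

    normalize -- None, "Max 1" or "Sum to 1" (setting names from the table)
    """
    vals = [float(v) for v in values]

    # Zero the last W-1 positions: no motif of width W can start there.
    tail = min(max(width - 1, 0), len(vals))
    for i in range(len(vals) - tail, len(vals)):
        vals[i] = 0.0

    # Assumption: normalization is applied after the tail has been zeroed.
    if normalize == "Max 1":
        top = max(vals, default=0.0)
        if top > 0:
            vals = [v / top for v in vals]
    elif normalize == "Sum to 1":
        total = sum(vals)
        if total > 0:
            vals = [v / total for v in vals]
    elif normalize is not None:
        raise ValueError(f"unknown normalization: {normalize}")

    header = f">{name} {width}" if include_width else f">{name}"
    return header + "\n" + " ".join(str(v) for v in vals)


record = psp_record("seq1", [2.0, 1.0, 1.0, 5.0], width=2, normalize="Sum to 1")
```

In this example the last value (5.0) is replaced by 0.0 because W-1 = 1, and the remaining values are then divided by their sum.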

: PRIORITY, FASTA, output, normalize, Numeric Dataset

GFF

Applies to:

The General Feature Format (GFF) is one of the most popular formats for exchanging information about region-based features. The official GFF specification can be found here. Briefly, the format lists one region per line, and each line consists of 8 (or optionally 9) fields separated by TAB.

The fields are in order:

The name of the sequence
The source of the feature
The feature type
The start coordinate of the region
The end coordinate of the region
A score value for the region
The orientation of the region. This can be "+" or "-" (or "." if orientation is unspecified)
The reading frame. The value of this field is either 0, 1 or 2 (or "." if the frame does not apply)
Additional attributes. This optional field consists of a list of attributes separated by semicolon. Each attribute has a key (or "tag") followed by value for the attribute (separated by an equals sign).
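To make the field order concrete, here is a minimal sketch that splits one GFF line into the nine fields listed above. It is not MotifLab code; the names `GffRecord` and `parse_gff_line` are hypothetical, and treating a score of "." as missing is an assumption based on common GFF practice.

```python
from typing import Dict, NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gff_line(line):
    """Split one TAB-separated GFF line into its 8 (or 9) fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (8, 9):
        raise ValueError(f"expected 8 or 9 fields, got {len(fields)}")

    # Optional 9th field: semicolon-separated key=value attributes.
    attributes = {}
    if len(fields) == 9:
        for item in fields[8].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition("=")
                attributes[key.strip()] = value.strip()

    # Assumption: "." marks a missing score.
    score = None if fields[5] == "." else float(fields[5])
    return GffRecord(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), score, fields[6], fields[7], attributes)


rec = parse_gff_line(
    "ENSG00000120948\tBindingSites\tM00378\t483\t494\t5.963\t-\t.\tgene=X;note=demo"
)
```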

NOTE:
When importing regions from a GFF-file, the sequence name in the first column must correspond to the name of an existing sequence in MotifLab, and the region will then be added to that sequence. If the first column contains a chromosome name, it will only be added to a sequence if there is a sequence that is actually named after the chromosome; it is not enough that the sequence covers the chromosomal segment that the region from the GFF-file falls within. When the first column contains chromosome names, it is suggested instead to use the GTF format (or convert the file to BED format).

Sequences output in GFF format are output according to the currently selected sorting order of the sequences, but within each sequence the user can specify whether to sort the regions by position, score or type. The start and end positions of each region (fields 4 and 5) can be output as either genomic coordinates or as positions relative to the start of the sequence by setting the "Position" option to either "Genomic" or "Relative". If the "Relative" setting is chosen, the "Relative-offset" and "Orientation" settings will also apply. The "Relative-offset" setting specifies the coordinate of the first position in the sequence. This will normally be 1 but can be set to other values if needed (for instance 0). The "Orientation" setting specifies which orientation to use to determine the relative region coordinates. For example, if a 100 bp long sequence on the direct strand has a binding site region from position 80 to 90, the start and end coordinates will be [80,90] if the "Direct" strand orientation is selected or [10,20] if the "Reverse" orientation is selected. If the "Orientation" is set to "From Sequence" the strand orientation will be selected based on the orientation of the sequence itself, so that sequences on the direct strand will be output in direct orientation and those on the reverse strand will be output in reverse orientation. If the "Opposite" strand orientation is selected, the orientation will be the opposite of the orientation of the sequence.

If the standard GFF format is not adequate, the "Format" setting can be used to specify an alternative output format. The alternative format is specified by a string consisting of a mix of literal characters and special field codes surrounded by braces (e.g. {START} ). For each region, the field codes in the format string (if recognized) will be replaced by the corresponding value of the field as it applies to the target region before the string is output. Some recognized fields are: SEQUENCENAME, FEATURE, SOURCE, START, END, SCORE, STRAND and TYPE (note the capitalization). TABs can be represented with the escape character \t.

For example, the following output format:

Binding site for {TYPE} at {START}-{END} with score={SCORE} in sequence {SEQUENCENAME}

will produce output that looks like this

Binding site for M00378 at 483-494 with score=5.963 in sequence ENSG00000120948
Binding site for M00253 at 3-10 with score=3.801 in sequence ENSG00000116741
Binding site for M00313 at 8-15 with score=5.697 in sequence ENSG00000116741
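The substitution of field codes in a format string can be imitated in a few lines. The sketch below is illustrative only: `apply_format` is a hypothetical name, the region is represented as a plain dictionary, and leaving unrecognized codes untouched is an assumption of this example.

```python
import re


def apply_format(fmt, region):
    """Replace {FIELD} codes in fmt with values from the region dict."""
    # A literal backslash followed by "t" in the format string becomes a TAB.
    fmt = fmt.replace("\\t", "\t")

    def substitute(match):
        code = match.group(1)
        # Assumption: codes that are not recognized are left as they are.
        return str(region[code]) if code in region else match.group(0)

    return re.sub(r"\{([^{}]+)\}", substitute, fmt)


region = {
    "TYPE": "M00378",
    "START": 483,
    "END": 494,
    "SCORE": 5.963,
    "SEQUENCENAME": "ENSG00000120948",
}
line = apply_format(
    "Binding site for {TYPE} at {START}-{END} with score={SCORE} "
    "in sequence {SEQUENCENAME}",
    region,
)
```

With the region above, `line` reproduces the first line of the example output.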

Name	Description
Position	Specifies whether the coordinate positions [start-end] for a region should be given relative to the start of the chromosome ("Genomic"), relative to the upstream start of the sequence ("Relative") or relative to the transcription start site associated with the sequence ("TSS-Relative")
Relative-offset	If the "Position" setting is set to "Relative", this offset-value specifies what position the first base in the sequence should start at (common choices are 0 or 1). If the "Position" setting is "TSS-Relative", a value of 0 here will place the TSS at position +0 whereas any other value will place the TSS at position +1 (and the coordinate-system will then skip 0 and go directly from -1 to +1)
Orientation	Orientation of relative coordinates (only applicable if the "Position" setting is "Relative").
Sort1,Sort2,Sort3	These three parameters control how to sort the regions in the output. The regions will be sorted first by "Sort1", then by "Sort2" (for regions with similar values for the "Sort1" property) and finally by "Sort3". Valid choices for these parameters are "Position", "Type" and "Score" and the three parameters should preferably have different values. The default choice is to first sort by "Position", then "Type" and finally "Score".
Include module motifs	If selected, the constituent single TF binding sites making up a cis-regulatory module will also be included for each module region. Hence, if a module consists of three TFBS, the module region will be output first on one line followed by three lines containing each of the TFBS regions. The third column in the output will have the value "module" for the module regions and "motif" for the individual TFBS. Also, a "module_identifier" is output on each line that can be used to group together a module entry with its corresponding motif (TFBS) entries.
Skip header lines	This (hidden) parameter can be used to specify a number of lines that should be skipped at the start of the file (default is 0). These lines are assumed to contain comments or other information that does not conform to the standard GFF format and would therefore result in parsing errors if treated as regular input.
Format	The "Format" parameter allows you to specify a different format to use rather than the standard GFF fields. In addition to literal text, the format string can contain field-codes surrounded by braces, e.g. `{TYPE}`. These field codes will be replaced by the corresponding property value of the region. Standard recognized field codes include: SEQUENCENAME, START, END, TYPE, SCORE, STRAND and ATTRIBUTES. Other field codes can be used to refer to user-defined properties. Tabs can be inserted using `\t` and extra newlines can be inserted with `\n`. Example: Use the following format string to output a comma-separated list with the type of the region plus start and end coordinates in the sequence: `{TYPE},{START},{END}`
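The "Sort1", "Sort2" and "Sort3" parameters in the table above amount to a multi-key sort. The sketch below shows the idea with regions represented as dictionaries; the function name `sort_regions`, the dictionary keys and the ascending order for every key are assumptions of this example (the manual does not state the sort direction).

```python
def sort_regions(regions, sort1="Position", sort2="Type", sort3="Score"):
    """Sort regions by up to three properties, in order of priority."""
    key_functions = {
        "Position": lambda r: r["start"],
        "Type": lambda r: r["type"],
        "Score": lambda r: r["score"],
    }
    order = [key_functions[sort1], key_functions[sort2], key_functions[sort3]]
    # Tuples compare element by element, so ties on the first key are
    # broken by the second key, and then by the third.
    return sorted(regions, key=lambda r: tuple(k(r) for k in order))


regions = [
    {"start": 8, "type": "M00313", "score": 5.697},
    {"start": 3, "type": "M00253", "score": 3.801},
    {"start": 3, "type": "M00100", "score": 9.0},
]
by_default = sort_regions(regions)
by_score = sort_regions(regions, "Score", "Position", "Type")
```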

: output, EvidenceGFF, Region Dataset

GTF

Applies to:

The Gene Transfer Format (GTF) is a refined version of the GFF format. More information can be found here and here. The GTF format is rather restricted in MotifLab. The first field ("sequence name") is set to the chromosome ID. The attributes field has two mandatory attributes: gene_id and transcript_id which are set to the name of the sequence and the type of the region.

: output, GFF, Region Dataset

EvidenceGFF

Applies to:

The EvidenceGFF format is an extension of the popular GFF format for region based features. The format allows the user to specify a list of additional properties that will be output alongside the standard GFF fields for each region. The additional properties can be output either in semicolon-separated "key=value" format as part of the normal "attributes" field in the standard GFF format or as additional fields separated by TAB (which will then extend the standard GFF format). Which format to use can be selected with the "Evidence format" setting.
The additional properties to output are specified as a string in the "Evidence" setting. This setting should be a list of comma-separated fields in "key=value" format. (Alternatively, the list can be separated by semicolons instead of commas and colons can be used instead of "=" to separate the name of the key from its value).
The "key" can either refer to a known feature dataset or be one of the special keywords region, motif, module, sequence or text.
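The structure of the "Evidence" string can be illustrated with a small parser. This is a sketch under the rules stated above (items separated by commas or semicolons, key and value separated by "=" or ":"); the function name `parse_evidence` is hypothetical and the sketch does not interpret the values.

```python
import re


def parse_evidence(spec):
    """Split an Evidence string into a list of (key, value) pairs."""
    pairs = []
    for item in re.split(r"[,;]", spec):
        item = item.strip()
        if not item:
            continue
        # The key ends at the first "=" or ":"; the rest is the value.
        match = re.match(r"([^=:]+)[=:](.*)", item)
        if match is None:
            raise ValueError(f"not a key=value pair: {item!r}")
        pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs


pairs = parse_evidence(
    "motif=short name,Conservation=average,Repeats=is overlapping,TFBS=list within 30"
)
```

Each key would then be checked against the special keywords and the names of known datasets to decide how its value should be interpreted.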

The proper format of the "value" will depend on the type of the key as described in the table below:

If the key is the special keyword "region" the "value" can refer to any property associated with the region.
Some common region properties are:

type: Will output the type of the region
score: Will output the score value associated with the region
orientation: Will output the orientation of the region: 1 (direct), -1 (reverse) or 0 (undetermined).
In versions 1.05+ the property orientationsymbol or orientationstring will return a plus-symbol (+) for regions in the direct orientation, a minus-symbol (-) for regions in the reverse orientation and a dot (.) for regions with undetermined orientation.
sequence: Will output the DNA sequence spanned by the region (this property is usually only defined for regions in motif tracks)

If the key is the special keyword "motif" the following formats for "value" are recognized:

ID: Will output the name of the motif (usually just an identifier)
short name: Will output a short name for the motif (but usually more descriptive than the ID)
long name: Will output a longer name for the motif
consensus: Will output the consensus binding sequence of the motif
classification: Will output the classification of the motif (based on the type of binding factor)
factors: Will output a list of transcription factors that bind to this motif

In MotifLab version 2 (starting with v2.0.-3), the "value" can refer to any standard or user-defined motif property. Note that the "motif" keyword is only applicable for motif tracks where each region refers to a TFBS with an associated motif. If the region is not associated with a motif, the special code "N/A" will be output.

If the key is the special keyword "module" the "value" can be any standard or user-defined module property.

This feature was added in MotifLab v2.0.-3. Note that the "module" keyword is only applicable for module tracks where each region refers to a known cis-regulatory module type. If the region is not associated with a module, the special code "N/A" will be output.

If the key is the special keyword "sequence" the following formats for "value" are recognized:
(requires MotifLab version 1.05+)

name: Will output the name of the sequence
gene name (or genename): Will output the name of the gene associated with the sequence (if specified)
species (or organism): Will output the common name of the organism the sequence originates from
latin species (or latin organism): Will output the latin name of the organism the sequence originates from
taxonomy: Will output the species taxonomy identifier of the organism the sequence originates from (E.g. for human sequences this will be "9606")
build: Will output the genome build that the sequence originates from
start: Will output the genomic coordinate for the start of the sequence
end: Will output the genomic coordinate for the end of the sequence
chromosome: Will output the chromosome that the sequence resides on
chr: Same as "chromosome" above but with an added "chr" prefix.
orientation: Outputs a plus sign (+) if the sequence is from the direct strand, a minus sign (-) if the sequence is from the reverse strand or a dot (.) if the sequence orientation is unknown.

In MotifLab version 2 (starting with v2.0.-3), the "value" can refer to any standard or user-defined sequence property.

If the key is the special keyword "text" then the corresponding value will be output verbatim.
E.g. the evidence code "text=BindingSite" will output "BindingSite" for every region.

If the key is the name of a DNA Sequence Dataset the following formats for "value" are recognized:

direct: Will output the DNA sequence spanned by the region. The DNA sequence will be from the direct strand.
reverse: Will output the DNA sequence spanned by the region. The DNA sequence will be from the reverse strand.
relative: Will output the DNA sequence spanned by the region. The DNA sequence will be from the strand relative to the orientation of the corresponding Sequence.

If the key is the name of a Numeric Dataset the following formats for "value" are recognized:

minimum (or min): Will output the smallest value in the interval spanned by the region
maximum (or max): Will output the largest value in the interval spanned by the region
average (or avg): Will output the average value in the interval spanned by the region
weighted average (or weighted avg): Will output the weighted average value in the interval spanned by the region. This value is only defined if the region refers to a binding site for a motif and will default to the normal unweighted average for all other regions. The value of the numeric track in each position is weighted by the normalized information content of the corresponding aligned PWM column of the Motif
median: Will output the median value in the interval spanned by the region
sum: Will output the sum of the values in the interval spanned by the region
weighted sum: The weighted sum of the values in the interval spanned by the region. This value is only defined if the region refers to a binding site for a motif and will default to the normal unweighted sum for all other regions. The value of the numeric track in each position is weighted by the normalized information content of the corresponding aligned PWM column of the Motif
startValue: Will output the value of the numeric track corresponding to the start position of the region (the smallest genomic coordinate)
endValue: Will output the value of the numeric track corresponding to the end position of the region (the largest genomic coordinate)
relativeStartValue: Will output the value of the numeric track corresponding to the position within the region located furthest upstream when viewed relative to the orientation of the Sequence
relativeEndValue: Will output the value of the numeric track corresponding to the position within the region located furthest downstream when viewed relative to the orientation of the Sequence
regionStartValue: Will output the value of the numeric track corresponding to the position within the region located furthest upstream when viewed relative to the orientation of the region
regionEndValue: Will output the value of the numeric track corresponding to the position within the region located furthest downstream when viewed relative to the orientation of the region
centerValue: Will output the value of the numeric track corresponding to the position at the center of the region
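Several of the unweighted values listed above are simple summaries of the numeric track over the interval spanned by the region. The sketch below illustrates a subset of them. The function name `summarize`, the list-based track representation and the "+"/"-" strand encoding are assumptions, and the weighted variants and `centerValue` are deliberately left out because their exact definitions are not spelled out here.

```python
import statistics


def summarize(track, seq_start, start, end, how, region_strand="+"):
    """Summarize track values over the region [start, end] (inclusive).

    track     -- list of values; track[0] is the value at genomic
                 coordinate seq_start (hypothetical representation)
    how       -- one of the value names from the list above
    """
    window = track[start - seq_start: end - seq_start + 1]
    if not window:
        raise ValueError("region does not overlap the track")
    if how in ("minimum", "min"):
        return min(window)
    if how in ("maximum", "max"):
        return max(window)
    if how in ("average", "avg"):
        return statistics.mean(window)
    if how == "median":
        return statistics.median(window)
    if how == "sum":
        return sum(window)
    if how == "startValue":   # smallest genomic coordinate
        return window[0]
    if how == "endValue":     # largest genomic coordinate
        return window[-1]
    if how == "regionStartValue":  # upstream end relative to the region
        return window[-1] if region_strand == "-" else window[0]
    if how == "regionEndValue":    # downstream end relative to the region
        return window[0] if region_strand == "-" else window[-1]
    raise ValueError(f"unsupported value: {how}")


track = [0.0, 1.0, 4.0, 2.0, 3.0, 9.0]   # covers coordinates 100..105
average = summarize(track, 100, 101, 104, "average")
```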

If the key is the name of a Region Dataset (hereafter called the "target dataset") the "value" should be in the following format:
<operator> [qualifiers] <condition> [range] [additional]

The qualifiers field is optional but can contain a space-separated list of keywords as defined below.
The range field is only required when the condition is "within".
The additional field can be added when the operator is "list". Allowed values for this field are described below in connection with the list-operator.

Based on the condition, range and qualifiers, a set of target regions will be obtained from the target dataset.
The following conditions will determine which target regions are included in this set:

overlapping: The set will include those regions from the target dataset that overlap with the region being currently output by EvidenceGFF
inside: The set will include those regions from the target dataset that are fully inside the region being currently output by EvidenceGFF
covering: The set will include those regions from the target dataset that fully cover the region being currently output by EvidenceGFF
within [range]: The set will include those regions from the target dataset that overlap with an interval extending range bases on either side of the region being currently output by EvidenceGFF. The range can be specified as a numeric literal or with a Numeric Variable or Numeric Map.
present: This set will include only those regions from the target dataset that are identical in every way to the region being currently output by EvidenceGFF. This condition is really only useful in statements like "filteredRegions=is present" which will be true if the region being output is also present in the track named "filteredRegions"

The resulting set can be further filtered by requiring the target regions to have additional qualifications:

non-overlapping: Only target regions that do not overlap with the region being currently output by EvidenceGFF will be kept. (This qualifier is really only useful in conjunction with the "within" condition).
interacting (or "interaction partner"): Only target regions that bind transcription factors known to interact with factors bound by the region being currently output by EvidenceGFF will be kept. (This qualifier is only useful when both the current region and the target region represent TFBS)

After the set of target regions have been obtained based on the condition and filtered based on the selected qualifiers, the choice of operator will determine the final output. EvidenceGFF recognizes the following operators:

is: The final output will be a boolean value (YES/NO, TRUE/FALSE or similar) reflecting whether the set of target regions is non-empty (i.e. whether any target regions met the specified criteria).
count: The final output will be a numeric value reflecting the size of the set of target regions.
list: The final output will be a comma-separated list of type names for the target regions in the set.
As described above, an [additional] field may be appended having one of the following values: "with scores", "with distances" or "with scores and distances". When the list of target regions is output "with scores", the score of each target region is written out in parentheses behind the type name of the target region. If the list is output "with distances", the shortest distance from the target region to the region being currently output by EvidenceGFF is written out in brackets [] behind the type name of the target. If the two regions overlap, a distance of -1 will be output.
As of MotifLab v2.0.-3, the value of this field can also be "with [motif | module] <propertyname>" which will output the value of the specified region property within parentheses. The property name can be prefixed with either motif or module to signal that the name instead refers to a property of the motif or module associated with the region.
percentage (or percent). (Requires version 1.05+): This operator can only be used in combination with the 'overlapping' condition (i.e. "percentage overlapping") and will output the largest fraction of overlap that the currently output region has with any of the target regions.
As of MotifLab v2.0.-3 it is also possible to use "percentage all overlapping" to output a comma-separated list with percentage overlap for every overlapping target region. Note that the order in which these percentages are listed is the same as the order of the regions output with the corresponding "list overlapping" statement.
distance to <qualifier>: The final output will be a numeric value reflecting the distance to the closest qualified target region or the special value "N/A" if no qualified regions could be found. The required qualifier can be "any" (or "closest") which will just output the distance to the closest target region, "interacting" (or "interaction partner") which will output the distance to the nearest region representing a known interaction partner (assuming both regions are motif sites), or it can be the name of a Collection or Text Variable which will output the distance to the nearest region whose type is a member of the Collection or Text Variable. The qualifier "non-overlapping" can also be added to ignore overlapping target regions.

Note that if the "target dataset" is the same as the region dataset being currently output in EvidenceGFF format, the current region being output will never be included in the set of target regions described here.
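The way a condition, an optional range and a qualifier select a set of target regions, and the way an operator then turns that set into output, can be sketched as follows. This is only an illustration of the "overlapping", "inside", "covering" and "within" conditions, the "non-overlapping" qualifier, and the "is", "count" and "list" operators. The function names, the tuple representation `(start, end, type)` with inclusive coordinates, and the omission of the self-exclusion rule noted above are all assumptions of this example.

```python
def _overlaps(s1, e1, s2, e2):
    """True if the inclusive intervals [s1, e1] and [s2, e2] share a base."""
    return s2 <= e1 and e2 >= s1


def select_targets(current, targets, condition, rng=0, non_overlapping=False):
    """Return the target regions that satisfy the condition."""
    cs, ce = current[0], current[1]
    chosen = []
    for target in targets:
        ts, te = target[0], target[1]
        if condition == "overlapping":
            keep = _overlaps(cs, ce, ts, te)
        elif condition == "inside":      # target fully inside current
            keep = cs <= ts and te <= ce
        elif condition == "covering":    # target fully covers current
            keep = ts <= cs and ce <= te
        elif condition == "within":      # extend current by rng on both sides
            keep = _overlaps(cs - rng, ce + rng, ts, te)
        else:
            raise ValueError(f"unsupported condition: {condition}")
        if keep and non_overlapping and _overlaps(cs, ce, ts, te):
            keep = False
        if keep:
            chosen.append(target)
    return chosen


def evaluate(operator, chosen):
    """Turn the selected set into the final output value."""
    if operator == "is":
        return len(chosen) > 0
    if operator == "count":
        return len(chosen)
    if operator == "list":
        return ",".join(t[2] for t in chosen)
    raise ValueError(f"unsupported operator: {operator}")


current = (100, 110, "M1")
targets = [(95, 105, "A"), (102, 108, "B"), (90, 120, "C"),
           (115, 125, "D"), (200, 210, "E")]
nearby = evaluate("list", select_targets(current, targets, "within", 10,
                                         non_overlapping=True))
```

In this example only region "D" lies within 10 bp of the current region without overlapping it.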

Examples: (keys are assumed to be referring to known Region Datasets)

DNaseHS=is overlapping
Will output YES or NO depending on whether the current region being output overlaps with any regions in the DNaseHS track.

ChIP_Seq_tags=count covering
Will output the number of ChIP_Seq_tags that are completely covering the current region being output (so that the current region is fully inside the tag region)

TFBS=list non-overlapping interacting within 20 with scores and distances
Will list the type names of TFBS regions that are overlapping an interval extending 20 bp on either side of the current region but not overlapping with the current region itself. The target regions must be associated with motifs that are known to interact with the motif associated with the current region. The score of the target region will be output in parentheses after its type name and this will be followed by the distance between the target region and the current region in brackets.

For example, the following "Evidence" format:

motif=short name,Conservation=average,Repeats=is overlapping,TFBS=list within 30

will add 4 new fields to the GFF format. The first new field will contain a short name of the motif associated with the region being output. The second field will contain the average value of the "Conservation" track within the interval spanned by the region. The third field will contain a YES or NO value depending on whether or not the region overlaps with a region in the "Repeats" track, and the fourth and last field will contain a list of type names for regions in the "TFBS" track that are within 30 bp of the current region. The output could look something like this:

NTNG1 BindingSites M00378 48 59 5.963 - . V$PAX4_03 0.109 No
RPRM BindingSites M00253 296 303 3.801 + . V$CAP_01 0.235 Yes M00313
RPRM BindingSites M00313 301 308 5.697 + . V$GEN_INI2 0.0 Yes M00253,M00315
...

Name	Description
Position	Specifies whether the coordinate positions [start-end] for a region should be given relative to the start of the chromosome ("Genomic"), relative to the upstream start of the sequence ("Relative") or relative to the transcription start site associated with the sequence ("TSS-Relative")
Relative-offset	If the "Position" setting is set to "Relative", this offset-value specifies what position the first base in the sequence should start at (common choices are 0 or 1). If the "Position" setting is "TSS-Relative", a value of 0 here will place the TSS at position +0 whereas any other value will place the TSS at position +1 (and the coordinate-system will then skip 0 and go directly from -1 to +1)
Orientation	Orientation of relative coordinates (only applicable if the "Position" setting is "Relative").
Sort1,Sort2,Sort3	These three parameters control how to sort the regions in the output. The regions will be sorted first by "Sort1", then by "Sort2" (for regions with similar values for the "Sort1" property) and finally by "Sort3". Valid choices for these parameters are "Position", "Type" and "Score" and the three parameters should preferably have different values. The default choice is to first sort by "Position", then "Type" and finally "Score".
Include header	If selected, a single header line (starting with #) will be output at the beginning of the output-document. The header contains a specification of all the fields included in the output.
Skip standard fields	If selected, the standard GFF fields will not be output, only the evidence fields.
Boolean format	This parameter specifies how boolean values should be formatted in the output. Either as "Yes" versus "No" (alternatively "Y" versus "N"), "True" versus "False" (alternatively "T" versus "F") or "1" versus "0".
Evidence format	Specifies how the "evidence" should be output for each region. Options are to output each evidence value in a column of its own or to output all evidences in a single column in key=value pairs (separated by semicolons).
Evidence	The "evidence" parameter should be a comma-separated list of key=value pairs specifying additional information that should be output for each region. See above for a complete description of recognized evidence codes.

: output, GFF, Region Dataset

BED

Applies to:

The BED format consists of one line with TAB separated fields per region in a Region Dataset. The first three fields are required but additional fields can also be specified. MotifLab assumes that files are in a BED-6 format, but it is also possible to use other non-standard formats. The fields of the default BED-6 format are in order:

Chromosome: The name of the chromosome or scaffold.
(Chromosome names can be given with or without the "chr" prefix)
Chromosome start: Start position of the region in chromosomal coordinates.
Chromosome end: End position of the region in chromosomal coordinates.
Name: The type name of the region.
Score: The score for the region.
Strand: The strand orientation of the region.
This can be either "+" (direct strand) or "-" (reverse strand), or even "." if the orientation is undetermined.

Note that the default coordinate system employed by the BED format defines the first position of a chromosome to be position 0 (rather than position 1 which is commonly used by other formats), and the end-coordinate of a region is exclusive (i.e. the coordinate is actually the first position after the region).
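The difference between the two coordinate systems comes down to a shift of the start coordinate. The sketch below converts between the BED convention (first position 0, exclusive end) and the convention used by e.g. GFF (first position 1, inclusive end); the function names are hypothetical.

```python
def bed_to_gff(start, end):
    """Convert 0-based, end-exclusive coordinates to 1-based, inclusive."""
    return start + 1, end


def gff_to_bed(start, end):
    """Convert 1-based, inclusive coordinates to 0-based, end-exclusive."""
    return start - 1, end


# A region covering the first ten bases of a chromosome:
gff_coords = bed_to_gff(0, 10)
bed_coords = gff_to_bed(1, 10)
```

Only the start coordinate changes, because the exclusive end in BED has the same numeric value as the inclusive end in the 1-based system. The region length is `end - start` in BED coordinates and `end - start + 1` in 1-based inclusive coordinates.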

Example:

chr10 75427001 75427008 M00101 4.9968641726528125 -
chr10 75427002 75427007 M00028 4.686097666202365 +
chr10 75427002 75427007 M00029 4.486802949517 +
chr10 75427003 75427014 M00472 8.447923342601406 -
chr17 8474690 8474701 M00073 7.7850394311299675 +
chr17 8474710 8474718 M00428 6.149151076269675 +
chr17 8474719 8474730 M00507 8.998892822877837 -

Name	Description
Add CHR prefix	If selected, the prefix "chr" will be added before the chromosome number (e.g. chromosome 12 will be output as "chr12" rather than just "12").
Start position	Selects the position of the first base in a chromosome. The default is 0 for BED, but it is usually 1 for other data formats so MotifLab allows the start position to be specified as either 0 or 1.
Coordinate system	Selects whether the coordinates in the BED-file are in the standard BED-coordinate system (with the chromosome starting at position 0 and end-coordinates being exclusive) or in the format used by e.g. GFF, where the chromosome starts at position 1 and both start- and end-coordinates are inclusive. This parameter replaces the "Start position" parameter that was used in earlier MotifLab versions.
Format	This optional parameter can be used to explicitly define the contents of each line in the BED-file if a non-standard format is used. The `format` should be defined as a comma-separated list of column names. The default format assumes the BED-file contains the following six columns: "`CHROMOSOME, START, END, TYPE, SCORE, STRAND`". A line is allowed to contain fewer columns than the format specifies, in which case the missing columns are simply ignored. If the file contains additional columns that the user wants to import, the names of these columns must be included in the `format` specification. For example, if the file has an additional column containing the property "GeneID" after the STRAND column, the `format` parameter should be set to "`CHROMOSOME, START, END, TYPE, SCORE, STRAND, GeneID`". It is also possible to skip columns by replacing the column name with an asterisk (`*`). For example, if a non-standard BED-file contains the columns "CHROMOSOME, START, END, TYPE, SCORE, GeneID", and a user wants to import all of these properties except for the SCORE property, the `format` parameter can be set to "`CHROMOSOME, START, END, TYPE, *, GeneID`" (with SCORE replaced by `*` in the format definition).
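The behaviour of the "Format" parameter (named columns, skipped columns marked with an asterisk, and lines that are shorter than the format) can be sketched as below. The function name `parse_bed_line` is hypothetical and all values are kept as strings for simplicity; this is not MotifLab's parser.

```python
def parse_bed_line(line, fmt="CHROMOSOME, START, END, TYPE, SCORE, STRAND"):
    """Map the TAB-separated columns of one BED line to column names."""
    names = [name.strip() for name in fmt.split(",")]
    values = line.rstrip("\n").split("\t")
    record = {}
    # zip() stops at the shorter list, so missing trailing columns
    # are simply ignored.
    for name, value in zip(names, values):
        if name == "*":
            continue  # an asterisk means "skip this column"
        record[name] = value
    return record


record = parse_bed_line(
    "chr10\t75427001\t75427008\tM00101\t4.99\tENSG1",
    "CHROMOSOME, START, END, TYPE, *, GeneID",
)
```

Here the fifth column is skipped and the sixth is stored under the user-defined name "GeneID".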

: output, Region Dataset

BigBed

Applies to:

The BigBed format can be used to represent region track data in an indexed binary format based on the BED format. BigBed is the most compact and efficient way to represent and access very large region datasets, including datasets covering full genomes. MotifLab is currently able to import track data from BigBed files, but is not able to output tracks in BigBed format. More information about the BigBed format and how to create BigBed files can be found here.

Name Description

Custom fields This optional parameter can be used to declare additional fields in the BigBed file. A BigBed file is required to contain at least three fields which are in order: CHROMOSOME, START and END. If a line contains more than three fields, the next fields must be TYPE, SCORE and STRAND (+/-/.). If a line contains more than these six fields, the rest will be regarded as custom fields. MotifLab can read these custom fields and add their values to the region as user-defined properties, but to do that the fields must be identified by supplying a comma-separated list of property names. For example, if the Custom fields parameter is set to "count,gene", each entry in the BigBed file is expected to have (at least) 8 fields where the 7th field is named "count" and the 8th field is named "gene". If the name of a custom field is set to "*" it will be ignored. Thus, if the Custom fields parameter is set to "*,gene", the value in the 7th field will be ignored but the value in the 8th field will be added to the region as a user-defined property named "gene".
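
As a rough illustration of how the Custom fields parameter names the 7th and later fields, the following Python sketch assumes that an entry has already been decoded from the binary file into a list of strings. The function name `name_bigbed_fields` and the sample values are hypothetical.

```python
STANDARD_FIELDS = ["CHROMOSOME", "START", "END", "TYPE", "SCORE", "STRAND"]


def name_bigbed_fields(fields, custom_fields=""):
    """Pair decoded field values with standard and custom field names."""
    custom = [name.strip() for name in custom_fields.split(",") if name.strip()]
    names = STANDARD_FIELDS + custom
    # Custom fields named "*" are ignored, as are unnamed trailing fields.
    return {n: v for n, v in zip(names, fields) if n != "*"}


entry = ["chr1", "10", "20", "peak", "5", "+", "42", "TP53"]
print(name_bigbed_fields(entry, "*,gene"))
```

With the setting "*,gene" the 7th value ("42") is discarded and the 8th value is stored under the property name "gene".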

: output, BED, Region Dataset

Region_Properties

Applies to:

The Region_Properties data format allows users much freedom in customizing their own format for Region Datasets by specifying which properties of the regions they want to include in the output. The data format will either output all the regions from one sequence on the same line or output only one region per line. In the first case, the line will start with a chosen description of the sequence followed by descriptions of all the regions in that sequence. In the second case, each line will start with the sequence description followed by one region, and the sequence description will thus be repeated for every region at every line.

Example of regions output (one on each line) with the Sequence format string "{sequencename}" and Region format "{type} ({motif:short name})\t{sequence:chromosome string}:{genomic start}-{genomic end} [{orientation string}] => {sequence}"

ENSG00000035403	M00428 (V$E2F1_Q3)	chr10:75427729-75427736 [Direct] => TTTGGCGG
ENSG00000035403	M00048 (F$ADR1_01)	chr10:75427746-75427751 [Direct] => TGGGGC
ENSG00000035403	M00028 (I$HSF_01)	chr10:75427761-75427765 [Direct] => CGAAA
ENSG00000100345	M00344 (P$RAV1_02)	chr22:35113793-35113804 [Direct] => CTCACCTGAACC
ENSG00000100345	M00428 (V$E2F1_Q3)	chr22:35113815-35113822 [Reverse] => GTTCCCGG
ENSG00000100345	M00497 (V$STAT3_02)	chr22:35113817-35113824 [Reverse] => CTGTTCCC
ENSG00000100345	M00029 (F$HSF_01)	chr22:35113818-35113822 [Direct] => GGAAC
ENSG00000173531	M00482 (V$PITX2_Q2)	chr3:49701607-49701617 [Direct] => TGTCATCCCAG
ENSG00000173531	M00500 (V$STAT6_02)	chr3:49701617-49701624 [Reverse] => ACCTTCCC
ENSG00000173531	M00048 (F$ADR1_01)	chr3:49701652-49701657 [Direct] => AGGGGT
ENSG00000173531	M00378 (V$PAX4_03)	chr3:49701653-49701664 [Reverse] => TACCTCCACCCC
ENSG00000173531	M00048 (F$ADR1_01)	chr3:49701657-49701662 [Direct] => TGGAGG
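
The way a format string is turned into output text can be approximated by a simple placeholder expansion. The Python sketch below is an assumption about the general mechanism, not MotifLab's actual code: the function name `expand` and the dictionary of property values are hypothetical, and the case-insensitive matching of standard property names is left out.

```python
import re

# Escape codes that may appear in a format string
ESCAPES = {"\\t": "\t", "\\n": "\n", "\\s": " "}


def expand(format_string, properties):
    """Replace {property} codes with values and translate escape codes."""
    text = format_string.strip()  # leading/trailing whitespace is ignored
    for escape, char in ESCAPES.items():
        text = text.replace(escape, char)
    # Unknown property codes are left untouched in this sketch.
    return re.sub(r"\{([^{}]+)\}",
                  lambda m: str(properties.get(m.group(1), m.group(0))),
                  text)


region = {
    "type": "M00428",
    "motif:short name": "V$E2F1_Q3",
    "sequence:chromosome string": "chr10",
    "genomic start": 75427729,
    "genomic end": 75427736,
    "orientation string": "Direct",
    "sequence": "TTTGGCGG",
}
region_format = (r"{type} ({motif:short name})\t{sequence:chromosome string}:"
                 r"{genomic start}-{genomic end} [{orientation string}] => {sequence}")
print(expand(region_format, region))
```

Whitespace is stripped before the escape codes are translated, so a leading `\s` survives as a real space.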

Name	Description
Layout	This parameter controls the general layout of the output. The two available choices are "one sequence per line" (which will output selected information about the sequence followed by selected information about every region within that sequence) and "one region per line" (which will output information on only one region per line, preceded by information about its parent sequence).
Sequence format	This parameter specifies the information to output for each sequence. In addition to literal text, the format string can contain property codes surrounded by braces, e.g. `{CHROMOSOME}`. These property codes will be replaced by the corresponding property values of the sequence in the output. Some standard property codes include SEQUENCENAME, START, END and STRAND. See the documentation for the "Sequence_Properties" data format for a comprehensive list of standard sequence properties. Note that names of standard properties are case-insensitive but the names of user-defined properties are case-sensitive. Use the escape character `\t` to insert a tab and `\n` to insert a line break. If you leave the field empty it will take on the default value, but you can set it to `*` (single asterisk) to signal that the field should not be output at all. Note that leading or trailing whitespace in the format string will be ignored, but you can use the escape character `\s` to represent spaces instead. Tip: If you don't want to output sequence properties at the beginning of the line but rather mix these in between the other region properties, set the sequence format to `*` (empty) and use property codes prefixed with "sequence:" in the Region format parameter.
Region format	This parameter specifies the information to output for each region. In addition to literal text, the format string can contain property codes surrounded by braces as explained for the Sequence format parameter above. Standard region properties include: type; score; orientation (or "strand": the absolute orientation of the region); relative orientation (the orientation of the region relative to the parent sequence: +1 if they are the same or -1 if they are opposite); orientation sign; orientation string; relative orientation sign; relative orientation string; sequence (the sequence of DNA bases associated with the region); start; end; relative start; relative end; genomic start; genomic end; size; chromosome (or just "chr" for short: the chromosome number or letter(s)); chromosome string (or "chr string": same as above but prefixed with 'chr'); TSS-relative start; TSS-relative end; TES-relative start; TES-relative end. In addition to these region properties, you can also include properties of the parent sequence by prefixing the property name with "sequence:". For example, to output the chromosome of the sequence you can use the property code `{sequence:chr}`. If the region represents a motif site (the type of the region is the name of a motif), you can also refer to properties of this motif by prefixing the property name with "motif:". For example, to output the information content of the motif associated with the region, use the code `{motif:IC-content}`. Similarly, if the region is a module site, you can output module properties by prefixing with "module:". See the other data formats "Sequence_Properties", "Motif_Properties" and "Module_Properties" for more information on standard properties of these data types. Use the escape character `\t` to insert a tab and `\n` to insert a line break in the format string. 
If you leave the field empty it will take on the default value, but you can set it to `*` (single asterisk) to signal that the field should not be output at all. Note that leading or trailing whitespace in the format string will be ignored, but you can use the escape character `\s` to represent spaces instead. A note on coordinate systems: The "start" and "end" properties will output the start (and end) position of the region relative to the start of the parent sequence on the direct strand. The relative start and end properties output these positions relative to the beginning of the parent sequence, which in this case will be strand dependent. The genomic start and end properties output positions relative to the start of the chromosome (as long as the genomic location of the parent sequence is known). The TSS/TES-relative properties will output the start and end relative to the position of the TSS/TES of the gene associated with the parent sequence. All of these coordinate systems start at position 0 except for genomic coordinates which start at 1. If you want these positions to start at 1 instead (or 0 for genomic coordinates) you can explicitly add [0] or [1] after the property name, e.g.: `{relative start[1]}`, `{genomic start[0]}` or `{TSS-relative start[1]}`. 1-indexed coordinates relative to TSS/TES work a little differently from regular coordinate systems since the TSS/TES will be placed at +1 but the immediate upstream position will be called -1. This, in effect, will skip the 0-position: ..., -3, -2, -1, +1 [TSS], +2, +3, ... This choice will affect the positive coordinates of regions downstream of TSS/TES (or regions spanning TSS/TES) but not the negative coordinates of upstream regions.
Sequence delimiter	The delimiter text that separates the sequence information from the region information. The default is a TAB (`\t`). If you leave the field empty it will take on the default value, but you can set it to `*` (single asterisk) to signal that the delimiter should be empty. Note that leading or trailing whitespace in the string will be ignored, but you can use the escape character `\s` to represent spaces instead.
Region delimiter	The delimiter text that separates the information of different regions. This only applies when multiple regions are output to the same line. The default is a TAB (`\t`). If you leave the field empty it will take on the default value, but you can set it to `*` (single asterisk) to signal that the delimiter should be empty. Note that leading or trailing whitespace in the string will be ignored, but you can use the escape character `\s` to represent spaces instead.
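
The effect of the optional `[0]` and `[1]` suffixes on coordinates, as described for the Region format parameter, can be illustrated with a small Python sketch. The function names `shift_base` and `tss_relative` are hypothetical and the code only mirrors the rules stated in the table above.

```python
def shift_base(position, native_base, wanted_base):
    """Convert a position between 0-based and 1-based numbering."""
    return position + (wanted_base - native_base)


def tss_relative(offset, one_indexed=False):
    """offset is the 0-indexed distance from the TSS (negative = upstream).

    With 1-indexing the TSS itself becomes +1 while upstream positions keep
    their negative values, so position 0 is skipped.
    """
    if one_indexed and offset >= 0:
        return offset + 1
    return offset


# {genomic start[0]}: genomic coordinates are natively 1-based
print(shift_base(75427729, native_base=1, wanted_base=0))
# {TSS-relative start[1]} for positions around the TSS
print([tss_relative(o, one_indexed=True) for o in range(-3, 3)])
```

The second line of output lists -3, -2, -1, 1, 2, 3, which is the numbering with the skipped 0-position.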

: output, Sequence_Properties, Motif_Properties, Module_Properties

Motif formats (and module formats)

MotifLabMotif

Applies to:

Motif and MotifCollection

The MotifLabMotif format is the default format for motifs used by MotifLab, and it is currently the only format that will include information about all the properties related to a motif (and not just the identifier and matrix). The format is basically a direct extension of the INCLUSive Motif Model format but with additional #-fields describing both standard and user-defined motif properties such as the name of the transcription factor (#Short and #Long), the transcription factor class (#Class), binding factors (#Factors), the organisms the TFs are expressed in (#Organisms), motifs for known interacting factors (#Interactions) and alternative motif models for the same TFs (#Alternatives). A file in MotifLabMotif format must start with a header line reading "#MotifLabMotif" which serves to identify the format.

Example:

#MotifLabMotif (inspired by INCLUSive Motif Model v1.0)
#
#ID = M00002
#Short = V$E47_01
#Long = E47 (E2A immunoglobulin enhancer binding factor)
#W = 15
#Class = 1.2.1.0
#Factors = E47
#Organisms = human (Homo sapiens)
#Interactions = M00001,M00002,M00058,M00065,M00066,M00068,MA0048,MA0081,M00454,MA0092
#Alternatives = M00065,M00066,M00071,M00222,MA0091
#Transfac class = C0010
4.0	4.0	3.0	0.0
2.0	5.0	4.0	0.0
3.0	2.0	4.0	2.0
2.0	0.0	9.0	0.0
0.0	11.0	0.0	0.0
11.0	0.0	0.0	0.0
0.0	0.0	11.0	0.0
1.0	2.0	8.0	0.0
0.0	0.0	0.0	11.0
0.0	0.0	11.0	0.0
0.0	0.0	4.0	7.0
1.0	4.0	3.0	3.0
1.0	6.0	2.0	2.0
1.0	4.0	4.0	2.0
1.0	4.0	2.0	3.0

#ID = M00001
#Short = V$MYOD_01
#Long = MyoD (myoblast determination gene product)
#W = 12
#Class = 1.2.2.0
#Factors = MyoD,MyoD (376 AA),MyoD (275 AA)
#Organisms = chick (Gallus gallus),rat (Rattus norvegicus),human (Homo sapiens)
#Interactions = M00001,M00002,M00004,M00006,M00222,M00223,M00225,M00231,M00232
#Alternatives = M00184
#Transfac class = C0010
1.0	2.0	2.0	0.0
2.0	1.0	2.0	0.0
3.0	0.0	1.0	1.0
0.0	5.0	0.0	0.0
5.0	0.0	0.0	0.0
0.0	0.0	4.0	1.0
0.0	1.0	4.0	0.0
0.0	0.0	0.0	5.0
0.0	0.0	5.0	0.0
0.0	1.0	2.0	2.0
0.0	2.0	0.0	3.0
1.0	0.0	3.0	1.0
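
A file laid out like the example above can be read with a short line-based parser. The Python sketch below is a simplified illustration under the assumption that every motif starts with an "#ID" line and that matrix rows are whitespace-separated; the function name `parse_motiflab_motifs` is hypothetical and the code is not taken from MotifLab.

```python
def parse_motiflab_motifs(text):
    """Return a list of dicts with the #-fields and matrix of each motif."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#MotifLabMotif"):
        raise ValueError("missing #MotifLabMotif header line")
    motifs, current = [], None
    for line in lines[1:]:
        line = line.strip()
        if line.startswith("#ID"):
            # An #ID line introduces a new motif
            current = {"ID": line.split("=", 1)[1].strip(), "matrix": []}
            motifs.append(current)
        elif current is None or not line or line == "#":
            continue  # blank lines and bare comment lines
        elif line.startswith("#"):
            key, _, value = line[1:].partition("=")
            current[key.strip()] = value.strip()
        else:
            # One matrix row with the values for A, C, G and T
            current["matrix"].append([float(x) for x in line.split()])
    return motifs


sample = ("#MotifLabMotif (inspired by INCLUSive Motif Model v1.0)\n#\n"
          "#ID = M00001\n#Short = V$MYOD_01\n#W = 2\n"
          "1.0\t2.0\t2.0\t0.0\n2.0\t1.0\t2.0\t0.0\n\n"
          "#ID = M00002\n#W = 1\n4.0\t4.0\t3.0\t0.0\n")
for motif in parse_motiflab_motifs(sample):
    print(motif["ID"], len(motif["matrix"]))
```

All field values are kept as plain strings in this sketch; list-valued fields such as #Factors would still need to be split on commas.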

Name	Description
Include non-standard fields	If selected, information about non-standard, user-defined motif properties will also be included.
Include derived fields	If selected, motif properties that are derived from other properties, such as GC-content, IC-content and IUPAC consensus string (all derived from the matrix representation), will also be included in the output.
Include color info	If selected, information about the current colors used for the motifs in MotifLab will also be included in the output. When a file containing color information is imported into MotifLab, the motifs will be assigned their specified colors. If an imported motif file does not contain color information, the motifs will be assigned arbitrary colors.

: INCLUSive_Motif_Model, output

MotifLabModule

Applies to:

Module and Module Collection

The MotifLabModule format is the default format for modules used by MotifLab, and it is a variation of the MotifLabMotif format. A file in MotifLabModule format must start with a header line reading "#MotifLabModule" which serves to identify the format, and this is followed by a description of the modules (and optionally also the single motifs involved in these modules).

Each new module is introduced with the line:

#ModuleID = <unique identifier>

This is followed by a list of the motifs involved in the module:

Motifs = <comma-separated list of "module motif" names>

Note that the "module motif" names in the mentioned list are not single motif identifiers referencing Motif objects, but rather descriptive motif names that are internal to the module (the "module motif" names must be unique within the module). Each such "module motif" can be represented by multiple single motif objects, as described by lines in the following format:

Motif(<module motif>) = <list of Motif identifiers>

Additional constraints regarding the motifs within the module can also be specified, for example the maximum length of the module:

MaxLength = <maximum number of base pairs the module can span>

The "Ordered" field specifies whether the motifs in the module must appear in the order they are listed in the "Motifs = " line or whether they can appear in any order.

Ordered = <true|false>

The module motifs might also have specific orientations relative to each other.

Orientation(<module motif>) = <Direct|Reverse>

Constraints can also be placed on the distance between pairs of consecutive motifs in ordered modules.

Distance(<module motif 1>,<module motif 2>) = [<min distance>, <max distance>]

Example:

#MotifLabModule
#
#ModuleID = MOD0001
Motifs = STAT,GATA
Ordered = false
MaxLength = 200
Motif(STAT) = M00225,M00492,M00493,M00494,M00500,M00496
Motif(GATA) = M00351,M00350,M00076,M00203,M00077,M00075,M00347,M00346

#ModuleID = MOD0002
Motifs = SRY,AP1
Ordered = true
MaxLength = 200
Motif(SRY) = M00160,M00148
Motif(AP1) = M00041,M00172,M00039,M00517,M00040,M00113,M00114,M00174,M00115
Orientation(SRY) = Direct
Orientation(AP1) = Reverse

#ModuleID = MOD0003
Motifs = STAT,ER,MYC
Ordered = true
MaxLength = 200
Motif(STAT) = M00225,M00492,M00493,M00494,M00500,M00496,M00497,M00498
Motif(ER) = M00191
Motif(MYC) = M00055,M00322,M00006,M00005,M00007
Orientation(STAT) = Direct
Orientation(ER) = Reverse
Distance(STAT,ER) = [5,10]
Distance(ER,MYC) = [0,16]
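
The line types described above (Motifs, Motif(...), MaxLength, Ordered, Orientation(...) and Distance(...)) lend themselves to a simple key/value parser. The following Python sketch is an illustration only: the function name `parse_motiflab_modules` and the dictionary layout are hypothetical, and any single-motif descriptions appended to the file are simply skipped.

```python
import re


def parse_motiflab_modules(text):
    """Return one dict per module found in MotifLabModule-formatted text."""
    modules, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#ModuleID"):
            current = {"ID": line.split("=", 1)[1].strip(), "motifs": [],
                       "members": {}, "orientation": {}, "distance": {}}
            modules.append(current)
            continue
        if current is None or not line or line.startswith("#"):
            continue  # header, comments and blank lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        call = re.fullmatch(r"(\w+)\((.*)\)", key)  # e.g. Motif(STAT)
        if key == "Motifs":
            current["motifs"] = value.split(",")
        elif key == "Ordered":
            current["ordered"] = value.lower() == "true"
        elif key == "MaxLength":
            current["max_length"] = int(value)
        elif call and call.group(1) == "Motif":
            current["members"][call.group(2)] = value.split(",")
        elif call and call.group(1) == "Orientation":
            current["orientation"][call.group(2)] = value
        elif call and call.group(1) == "Distance":
            first, second = [m.strip() for m in call.group(2).split(",")]
            low, high = [int(x) for x in value.strip("[]").split(",")]
            current["distance"][(first, second)] = (low, high)
    return modules


sample = ("#MotifLabModule\n#\n#ModuleID = MOD0003\nMotifs = STAT,ER,MYC\n"
          "Ordered = true\nMaxLength = 200\nMotif(ER) = M00191\n"
          "Orientation(STAT) = Direct\nDistance(STAT,ER) = [5,10]\n")
print(parse_motiflab_modules(sample))
```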

Name	Description
Include single motifs	The single motifs listed in the "Motif(x)=..." lines must reference Motif objects that are already known to MotifLab. It is possible to include descriptions for the motifs involved in the modules in a MotifLabModule file, so that the file will contain everything that is needed to restore the modules on import. The motif descriptions will be appended after the module descriptions at the end of the file in MotifLabMotif format.
Include module color info	If selected, information about the current colors used for the modules in MotifLab will also be included in the output. When a file containing color information is imported into MotifLab, the modules will be assigned their specified colors. If an imported module file does not contain color information, the modules will be assigned arbitrary colors.

: INCLUSive_Motif_Model, output

INCLUSive_Motif_Model

Applies to:

The following description of the INCLUSive_Motif_Model format is taken directly from the MotifSuite web site.

The file must start with a comment line which identifies the format (#INCLUSive Motif Model v1.0).
Next follows the PWM description of a first motif, starting with some comment lines. The first comment line describes a unique motif identifier (#ID). The second comment line shows a motif score (#Score) which can be a score that is computed from the PWM or any other score that reflects the importance of the motif being described. The following two lines give the PWM length (#W) and a consensus description (#Consensus) of the motif. A consensus description is derived from the information available in the PWM; it is a string-based sequence representation of the motif in IUPAC code symbols (A,C,G,T,n,s,w) that describes the most likely nucleotide(s) on each position in the motif (n = any of A,C,G,T, s = C or G, w = A or T. Note that MotifLab can use additional IUPAC codes as well).

The comment lines are immediately followed by the values that make up the PWM (matrix) : each line describes the tab-separated probabilities (Pr) for nucleotide A, C, G and T on a given position in the motif. The number of lines must equal the length of the motif (#W). The probabilities described in a PWM can be frequencies (normalized values between 0 and 1 and the sum of a row equals 1), or they can be represented as counts (values can be higher than 1 and zeros are also common).
Note: decimal numbers in a PWM must be described using a DOT (not a comma) e.g. 0.54 (not 0,54).

Pr(A,1)	Pr(C,1)	Pr(G,1)	Pr(T,1)
Pr(A,2)	Pr(C,2)	Pr(G,2)	Pr(T,2)
...
Pr(A,W)	Pr(C,W)	Pr(G,W)	Pr(T,W)

The motif description ends with a blank line return. The second and following motifs are described in exactly the same way, each time separated from each other by a blank line. The end of the file is recognized by the last blank line return. Note that there is no explicit numbering of the motifs in the file.

Example:

#INCLUSive Motif Model v1.0
#
#ID = M00001
#W = 12
#Consensus = srACAGGTGkyG
1.0	2.0	2.0	0.0
2.0	1.0	2.0	0.0
3.0	0.0	1.0	1.0
0.0	5.0	0.0	0.0
5.0	0.0	0.0	0.0
0.0	0.0	4.0	1.0
0.0	1.0	4.0	0.0
0.0	0.0	0.0	5.0
0.0	0.0	5.0	0.0
0.0	1.0	2.0	2.0
0.0	2.0	0.0	3.0
1.0	0.0	3.0	1.0

#ID = M00002
#W = 10
#Consensus = GGGGCGGGGT
2.0	1.0	6.0	2.0
3.0	1.0	6.0	1.0
0.0	0.0	11.0	0.0
0.0	0.0	11.0	0.0
0.0	8.0	2.0	1.0
3.0	0.0	6.0	2.0
0.0	1.0	7.0	3.0
1.0	0.0	8.0	2.0
1.0	2.0	7.0	1.0
3.0	2.0	0.0	6.0

: output

RawPSSM

Applies to:

RawPSSM will output motifs in a FASTA-inspired format where the entry for each motif starts with a header consisting of the motif identifier preceded by a greater-than sign (">"), and this header is followed by the matrix representation for the motif output as either a 4xN or Nx4 matrix (depending on the chosen orientation).

Example (in "Horizontal" orientation):

>M00001
2	3	0	0	0	3	0	1	1	3
1	1	0	0	8	0	1	0	2	2
6	6	11	11	2	6	7	8	7	0
2	1	0	0	1	2	3	2	1	6
>M00002
1	2	3	0	5	0	0	0	0	0	0	1
2	1	0	5	0	0	1	0	0	1	2	0
2	2	1	0	0	4	4	0	5	2	0	3
0	0	1	0	0	1	0	5	0	2	3	1

Name	Description
Format	If the "Default" format is selected, the matrix will be output exactly as it is represented in each motif (which can be either a count matrix, a frequency matrix or a log-odds matrix). However, if the "Frequencies" format is selected, all matrices will be converted to frequency matrices before being output.
Orientation	The orientation can either be "Vertical" or "Horizontal". If a "Vertical" orientation is selected, the matrix will consist of four columns corresponding to each of the bases A, C, G and T and it will have N rows (where N is the length of the motif). If a "Horizontal" orientation is selected, the matrix will consist of four rows corresponding to each of the bases and the matrix will have N columns (one for each position).
Delimiter	Specifies the character used to separate the columns in the matrix. The default delimiter is "Tab", but other choices are "Space", "Comma" and "Semicolon".
Header	Specifies which information to include in the header for each motif. The possible options are: include only the motif ID ("ID"), include both the motif ID and motif name (short name) separated with a space ("ID Name") or include both the motif ID and motif name separated with a hyphen ("ID-Name").
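
The interplay of the Format, Orientation and Delimiter parameters can be sketched as follows. This Python example is a hypothetical illustration (the function name `write_raw_pssm` is not part of MotifLab) and it assumes that the matrix is a count matrix in which no position sums to zero, so that frequencies can be obtained by dividing each value by the total for its position.

```python
DELIMITERS = {"Tab": "\t", "Space": " ", "Comma": ",", "Semicolon": ";"}


def write_raw_pssm(motif_id, rows, orientation="Vertical", delimiter="Tab",
                   frequencies=False):
    """rows is an N x 4 list with one [A, C, G, T] row per motif position."""
    if frequencies:
        # Normalize each position so that its four values sum to 1
        rows = [[v / sum(row) for v in row] for row in rows]
    if orientation == "Horizontal":
        # Transpose N x 4 into 4 x N: one row per base, one column per position
        rows = [list(column) for column in zip(*rows)]
    sep = DELIMITERS[delimiter]
    lines = [">" + motif_id]
    lines += [sep.join(format(v, "g") for v in row) for row in rows]
    return "\n".join(lines)


counts = [[1, 2, 2, 0], [2, 1, 2, 0], [3, 0, 1, 1]]
print(write_raw_pssm("M00002", counts, orientation="Horizontal",
                     delimiter="Space"))
```

The three positions used here are the first three columns of M00002 in the example above, so the printed rows start with the same values.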

: output

TRANSFAC

Applies to:

In the TRANSFAC motif format each line starts with a field code consisting of two characters and this is usually followed by a value for the field. The double slash code "//" is used to separate different motifs from each other in the file, and a double X (XX) is used to separate different fields. Some fields that can have multiple values can be repeated on consecutive lines in the file. The following field codes are recognized by MotifLab:

AC : This field will map to the motif identifier.
ID : This field will map to the "short name" of the motif.
NA : This field will output a cleaned up version of the "short name" of the motif (stripped of prefixes and suffixes) but is not used for input.
DE : This field will map to the "long name" of the motif.
BF : This field can contain the name of a TF binding to the motif and the organism in which this happens. This field is used for input to populate the "binding factors" and "organisms" properties of the motif, but it is not used for output.
P0 : This code marks the start of the matrix field.

The matrix representation of the motif follows immediately after the "P0" code. Each matrix line has six columns where the first column is the position in the matrix, the next four columns contain matrix values for A, C, G and T respectively and the last column contains an IUPAC consensus symbol for that position.

Example:

VV TRANSFAC MATRIX TABLE
XX
//
AC M00001
XX
ID V$MYOD_01
XX
NA MYOD
XX
DE MyoD (myoblast determination gene product)
XX
P0 A C G T
01 1 2 2 0 S
02 2 1 2 0 R
03 3 0 1 1 A
04 0 5 0 0 C
05 5 0 0 0 A
06 0 0 4 1 G
07 0 1 4 0 G
08 0 0 0 5 T
09 0 0 5 0 G
10 0 1 2 2 K
11 0 2 0 3 Y
12 1 0 3 1 G
XX
//
AC M00002
XX
ID V$E47_01
XX
NA E47
XX
DE E47 (E2A immunoglobulin enhancer binding factor, also known as Transcription factor 3 (TCF3))
XX
P0 A C G T
01 4 4 3 0 V
02 2 5 4 0 S
03 3 2 4 2 N
04 2 0 9 0 G
05 0 11 0 0 C
06 11 0 0 0 A
07 0 0 11 0 G
08 1 2 8 0 G
09 0 0 0 11 T
10 0 0 11 0 G
11 0 0 4 7 K
12 1 4 3 3 N
13 1 6 2 2 C
14 1 4 4 2 N
15 1 4 2 3 N
XX
//
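
Reading the field codes described above can be sketched with a small state machine. The Python code below is a minimal illustration, assuming whitespace-separated fields as in the example; the function name `parse_transfac` is hypothetical and the BF field is left out because its exact layout is not described here.

```python
def parse_transfac(text):
    """Return a list of dicts with AC, ID, DE, matrix and consensus."""
    motifs, current, in_matrix = [], {}, False
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue
        code = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if code == "//":
            # Motif separator: store the motif collected so far
            if "AC" in current:
                motifs.append(current)
            current, in_matrix = {}, False
        elif code == "XX":
            in_matrix = False  # field separator also ends the matrix
        elif code == "P0":
            current["matrix"], current["consensus"] = [], ""
            in_matrix = True
        elif in_matrix and code.isdigit():
            columns = value.split()
            current["matrix"].append([float(x) for x in columns[:4]])
            if len(columns) > 4:
                current["consensus"] += columns[4]
        elif code in ("AC", "ID", "DE"):
            current[code] = value
    if "AC" in current:  # file without a final "//"
        motifs.append(current)
    return motifs


sample = ("VV TRANSFAC MATRIX TABLE\nXX\n//\nAC M00001\nXX\nID V$MYOD_01\nXX\n"
          "P0 A C G T\n01 1 2 2 0 S\n02 2 1 2 0 R\nXX\n//\n")
print(parse_transfac(sample))
```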

: output

Jaspar

Applies to:

The motif format used by the JASPAR database is a FASTA-inspired format where the entry for each motif starts with a header consisting of the motif identifier preceded by a greater-than sign (">"), and this header is followed by a 4xN matrix representation of the motif where each row is enclosed in brackets and the row is preceded by the corresponding base letter.

Example:

>M00001
A [1 2 3 0 5 0 0 0 0 0 0 1 ]
C [2 1 0 5 0 0 1 0 0 1 2 0 ]
G [2 2 1 0 0 4 4 0 5 2 0 3 ]
T [0 0 1 0 0 1 0 5 0 2 3 1 ]
>M00002
A [ 4 2 3 2 0 11 0 1 0 0 0 1 1 1 1 ]
C [ 4 5 2 0 11 0 0 2 0 0 0 4 6 4 4 ]
G [ 3 4 4 9 0 0 11 8 0 11 4 3 2 4 2 ]
T [ 0 0 2 0 0 0 0 0 11 0 7 3 2 2 3 ]

Name	Description
Format	If the "Default" format is selected, the matrix will be output exactly as it is represented in each motif (which can be either a count matrix, a frequency matrix or a log-odds matrix). However, if the "Frequencies" format is selected, all matrices will be converted to frequency matrices before being output.
Header	Specifies which information to include in the header for each motif. The possible options are: include only the motif ID ("ID") or include both the motif ID and motif name (short name) separated with a space ("ID Name").
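
A file in this bracketed layout can be read with a regular expression per matrix row. The Python sketch below is an illustration only; the function name `parse_jaspar` is hypothetical and only the motif ID (the first word of the header) is kept.

```python
import re


def parse_jaspar(text):
    """Return {motif ID: {base: list of values}} for bracketed matrices."""
    motifs, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = {}
            motifs[line[1:].split()[0]] = current  # ID is the first word
        elif current is not None and line:
            row = re.match(r"([ACGT])\s*\[(.*)\]", line)
            if row:
                current[row.group(1)] = [float(x) for x in row.group(2).split()]
    return motifs


sample = ">M00001 MYOD\nA [1 2 3 ]\nC [2 1 0 ]\nG [2 2 1 ]\nT [ 0 0 1 ]\n"
print(parse_jaspar(sample))
```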

: output

XMS

Applies to:

XMS is an XML-based format for specifying motifs and collections of motifs used by NestedMICA.

Example:

: output

MEME_Minimal_Motif

Applies to:

The MEME_Minimal_Motif format is primarily used by programs from the MEME Suite. The original format specification can be found here.

The format contains the following sections:

Version (required)
Alphabet (recommended)
Strands (optional)
Background frequencies (recommended)
Motifs (required)

For each motif in the motifs section there are the sub-sections:

Motif name (required)
Motif letter-probability matrix (recommended*)
Motif log-odds matrix (optional*)
Motif URL (optional)

*Note that at least one of the two starred sections is required. MotifLab can read motifs in both "letter-probability" and "log-odds" formats (the data read from the file is stored directly in the motif matrix) but will only output motifs in "letter-probability" format.

A file in MEME Minimal Motif format must start with the MEME version line which looks like this:

MEME version <version number>

This line is required to identify the file as a MEME Minimal Motif file. MotifLab will always output "4" as the version number.

The alphabet line specifies what alphabet to expect the motifs to be in. For DNA motifs this line will be

ALPHABET= ACGT

The strands line indicates if motifs were created from sites on both the given and the reverse complement strands of the DNA sequences.

strands: <which strands>

The <which strands> can be replaced with "+" to indicate only the given strand and "+ -" to indicate both strands. MotifLab will always output "+ -".

The background frequencies describe how prevalent each letter of the motif alphabet was in the source sequences which were used to create the motifs. Programs in the MEME Suite use this background to convert between motif letter-probability matrices and log-odds matrices. For DNA alphabets the format is as follows:

Background letter frequencies
A <A-frequency> C <C-frequency> G <G-frequency> T <T-frequency>

The four frequencies should sum to 1.0. MotifLab will always output uniform background frequencies (0.25 in each case).

A motif name line indicates the start of a new motif and designates an identifier for it which must be unique to the file. It also allows for an (optional) alternate name which does not have to be unique.

MOTIF <identifier> <alternate name>

The letter probability matrix is a table of probabilities where the rows are positions in the motif and the columns are letters in the alphabet. The columns are ordered alphabetically so for DNA the first column is A, the second is C, the third is G and the last is T. As each row contains the probability of each letter in the alphabet the probabilities in the row must sum to 1. If this section is not specified then the log-odds matrix must be specified.

letter-probability matrix: alength= <alphabet length> w= <motif length> nsites= <source sites> E= <source E-value>
... (letter-probability matrix goes here) ...

All the "key= value" pairs after the "letter-probability matrix:" text are optional. The "alength= alphabet length" and "w= motif length" can be derived from the matrix if they are not specified, provided there is an empty line following the letter probability matrix. The "nsites= source sites" will default to 20 if it is not provided and the "E= source E-value" will default to zero. The source sites is used to apply pseudocounts to the motif and the source E-value is used for filtering the motifs input to some MEME Suite programs.

Example:

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.303 C 0.183 G 0.209 T 0.306

MOTIF crp
letter-probability matrix: alength= 4 w= 19 nsites= 17 E= 4.1e-009
0.000000 0.176471 0.000000 0.823529
0.000000 0.058824 0.647059 0.294118
0.000000 0.058824 0.000000 0.941176
0.176471 0.000000 0.764706 0.058824
0.823529 0.058824 0.000000 0.117647
0.294118 0.176471 0.176471 0.352941
0.294118 0.352941 0.235294 0.117647
0.117647 0.235294 0.352941 0.294118
0.529412 0.000000 0.176471 0.294118
0.058824 0.235294 0.588235 0.117647
0.176471 0.235294 0.294118 0.294118
0.000000 0.058824 0.117647 0.823529
0.058824 0.882353 0.000000 0.058824
0.764706 0.000000 0.176471 0.058824
0.058824 0.882353 0.000000 0.058824
0.823529 0.058824 0.058824 0.058824
0.176471 0.411765 0.058824 0.352941
0.411765 0.000000 0.000000 0.588235
0.352941 0.058824 0.000000 0.588235

MOTIF lexA
letter-probability matrix: alength= 4 w= 18 nsites= 14 E= 3.2e-035
0.214286 0.000000 0.000000 0.785714
0.857143 0.000000 0.071429 0.071429
0.000000 1.000000 0.000000 0.000000
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 1.000000 0.000000
0.000000 0.000000 0.000000 1.000000
0.857143 0.000000 0.071429 0.071429
0.000000 0.071429 0.000000 0.928571
0.857143 0.000000 0.071429 0.071429
0.142857 0.000000 0.000000 0.857143
0.571429 0.071429 0.214286 0.142857
0.285714 0.285714 0.000000 0.428571
1.000000 0.000000 0.000000 0.000000
0.285714 0.214286 0.000000 0.500000
0.428571 0.500000 0.000000 0.071429
0.000000 1.000000 0.000000 0.000000
1.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.785714 0.214286
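
The fixed choices that MotifLab is described as making on output (version "4", strands "+ -" and uniform background frequencies) can be combined into a small writer. The Python sketch below is an assumption about what such output could look like, not MotifLab's own code: the function name `write_meme` is hypothetical and the optional nsites and E values are simply omitted.

```python
def write_meme(motifs):
    """motifs maps an identifier to a list of [A, C, G, T] probability rows."""
    out = ["MEME version 4", "",
           "ALPHABET= ACGT", "",
           "strands: + -", "",
           "Background letter frequencies",
           "A 0.25 C 0.25 G 0.25 T 0.25", ""]
    for identifier, rows in motifs.items():
        out.append("MOTIF " + identifier)
        out.append("letter-probability matrix: alength= 4 w= %d" % len(rows))
        out += [" ".join("%.6f" % p for p in row) for row in rows]
        out.append("")  # an empty line marks the end of the matrix
    return "\n".join(out)


text = write_meme({"crp": [[0.0, 0.176471, 0.0, 0.823529],
                           [0.0, 0.058824, 0.647059, 0.294118]]})
print(text)
```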

: output

Motif_Properties

Applies to:

The Motif_Properties format will output a table with one motif on each row and with columns containing information about different motif properties chosen by the user. In MotifLab v2.0+ this format can also be used to import motif collections.

Example of properties output with the format string "ID,Short name,Classification,Consensus" (with header)

#ID	Short name	Classification	Consensus
M00006	V$MEF2_01	4.4.1.1	CTCTAAAAATAACyCy
M00005	V$AP4_01	1.3.1.4	wGAryCAGCTGyGGnCnk
M00008	V$SP1_01	2.3.1.0	GGGGCGGGGT
M00007	V$ELK1_01	3.5.2.0	nAAACmGGAAGTnCGT
M00002	V$E47_01	1.2.1.0	vsnGCAGGTGknCnn
M00001	V$MYOD_01	1.2.2.0	srACAGGTGkyG
M00004	V$CMYB_01	3.5.1.1	nCnrnnGrCnGTTGGkGG
M00003	V$VMYB_01	3.5.1.1	AATAACGGnA

Name	Description
Format	This parameter specifies which motif properties to include in the output as an ordered, comma-separated list of property names. Note that standard properties are case-insensitive, but user-defined properties are case-sensitive! Standard motif properties include: ID, Short name, Long name, Consensus, Size, IC-content, GC-content, Factors, Classification, Class name, Quality, Part, Alternatives, Interactions, Organisms, Expression and Matrix. When importing motifs from an existing file, the `format` parameter should list the properties in the order they appear in the file. If the file contains more properties than you want to import, you are allowed to specify only a subset. For example, if the file has the following five properties ID, Short name, Consensus, Quality and Organisms (in that order), and you only want to import the first three, you can just set the format parameter to "`ID, Short name, Consensus`" and the remaining properties will simply be ignored. You can also skip properties in the middle of the list by replacing the property name with an asterisk. To import only the properties ID, Consensus and Organisms from the mentioned file (i.e. only columns 1, 3 and 5), set the `format` parameter to "`ID,*,Consensus,*,Organisms`". If the file contains a "header" in the first line (starting with # and describing all the properties), you can use this header rather than naming the properties explicitly by setting the `format` parameter to a single asterisk (`*`). Note that the ID property must always be included or MotifLab will complain.
Separator	This parameter specifies how to separate the motif properties on each line in the output. The character chosen as the separator will replace the commas in the format string above. Valid options for the separator are "TAB" (default), "Comma", "Semicolon", "Colon" and "Vertical bar".
List-separator	Some motif properties may be represented by lists (e.g. binding factors and alternative motif models). The "List-separator" parameter specifies how to separate the entries in such lists. Valid options are "Comma" (default), "TAB", "Semicolon", "Colon" and "Vertical bar". Note that this separator should be different from the separator used to separate the motif properties that are output.
Header	If this option is selected, a header line starting with # will be included at the beginning of the output. The header line describes which motif properties are included.
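
The import rules for the `format` parameter (ordered names, `*` to skip a column, a single `*` to use the header line) can be sketched as follows. The Python function `read_motif_properties` is a hypothetical illustration, not MotifLab's importer; it keeps all values as strings and does not split list-valued properties.

```python
def read_motif_properties(text, fmt="*", separator="\t"):
    """Return one dict per motif row, keyed by the names given in fmt."""
    lines = [line for line in text.splitlines() if line.strip()]
    if fmt.strip() == "*":
        # Take the property names from the header line (starts with #)
        names = lines[0].lstrip("#").split(separator)
    else:
        names = fmt.split(",")
    names = [name.strip() for name in names]
    if "ID" not in (name.upper() for name in names):
        raise ValueError("the ID property must always be included")
    motifs = []
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        values = line.split(separator)
        # Columns named "*" are skipped; unnamed trailing columns are ignored
        motifs.append({n: v for n, v in zip(names, values) if n != "*"})
    return motifs


table = ("#ID\tShort name\tConsensus\tQuality\tOrganisms\n"
         "M00001\tV$MYOD_01\tsrACAGGTGkyG\t3\thuman\n")
print(read_motif_properties(table, "ID,*,Consensus,*,Organisms"))
```

The Quality and Organisms values in the sample table are made up for the example.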

: output, HTML_MotifTable, Module_Properties, Sequence_Properties, Properties

Module_Properties

Applies to:

Module and Module Collection

The Module_Properties format will output a table with one module on each row and with columns containing information about different module properties chosen by the user.

Example of properties output with the format string "ID, Size, Max IC, Motifs" (with header)

#ID	Size	Max IC	Motifs
MOD0040	3	43.5204389330021	NFKAPPAB65,SP1,EGR2
MOD0078	3	54.12787872722569	STAT,TITF1,STAT3
MOD0081	3	39.159961609322714	P300,STAT,IRF7
MOD0069	6	95.6314848710252	AP1FJ,STAT,AP1,CREBP1CJUN,STAT1,STAT3
MOD0070	4	76.28839485773265	STAT,IRF1,STAT1,ISRE
MOD0101	2	42.324578249877625	NRSE,MTATA
MOD0105	2	38.59871626791364	PAX5,MTATA

Name	Description
Format	This parameter specifies which module properties to include in the output as an ordered, comma-separated list of property names. Note that standard properties are case-insensitive, but user-defined properties are case-sensitive! Standard module properties include: ID, Cardinality (or: Size), Consensus, Motifs, Ordered, Oriented, Max length, Min IC and Max IC.
Separator	This parameter specifies how to separate the module properties on each line in the output. The character chosen as the separator will replace the commas in the format string above. Valid options for the separator are "TAB" (default), "Comma", "Semicolon", "Colon" and "Vertical bar".
List-separator	Some module properties may be represented by lists (e.g. GO-terms). The "List-separator" parameter specifies how to separate the entries in such lists. Valid options are "Comma" (default), "TAB", "Semicolon", "Colon" and "Vertical bar". Note that this separator should be different from the separator used to separate the module properties that are output.
Header	If this option is selected, a header line starting with # will be included at the beginning of the output. The header line describes which module properties are included.

: output, Motif_Properties, Sequence_Properties, Properties, HTML_ModuleTable

HTML_MotifTable

Applies to:

The HTML_MotifTable format will output an HTML-formatted table with one motif on each row and with columns containing information about different motif properties chosen by the user. It is possible to include graphical sequence logos for the motifs and also specify alternative motif properties to be shown as tooltips when the user points at a cell in the table.

Name	Description
Format	This parameter specifies which motif properties to include in the output as an ordered, comma-separated list of property names. Note that standard properties are case-insensitive, but user-defined properties are case-sensitive! Standard motif properties include: ID, Short name, Long name, Consensus, Size, IC-content, GC-content, Factors, Classification, Class name, Classpath, Quality, Part, Alternatives, Interactions, Organisms, Expression, Matrix and Logo. The name of a property can be followed by the name of another property in parentheses. This second property will be displayed as a tooltip when the mouse is over the table cell with the first property. For example, the format string "`ID,Short Name(Long Name)`" will create a table with two columns where the first column contains the motif ID and the second the motif's short name. If the user points the mouse at the short name of a motif, its long name will be displayed in a tooltip. (Note that this might not work within MotifLab but should work if the HTML document is saved to a file and displayed in an external web browser.)
Sequence logo height	The height of the sequence logos (if logos are included).
Sequence logo width	The maximum width of sequence logos. If this is specified, logos will be scaled to fit within this limit when necessary. A value of 0 means no limit.
Sequence logos	Specifies whether the image files for motif sequence logos should be made specifically for this output data object and given file names reflecting this fact ("New images") or if the image files should be named after the motifs themselves (e.g. "M00134.gif") and allowed to be shared by other HTML-formatted output data objects that also contain sequence logos for the same motifs ("Shared images").
multiline	Some motif properties may be represented by lists (e.g. binding factors and alternative motif models). If the "multiline" option is selected, these lists will be broken across multiple lines in a table cell. If the "multiline" option is not selected, these properties will be output as comma-separated lists.
Headline	An optional headline which will be displayed above the table.

: output, Motif_Properties

HTML_ModuleTable

Applies to:

Module and Module Collection

The HTML_ModuleTable format will output an HTML-formatted table with one module on each row and with columns containing information about different module properties chosen by the user. It is possible to include graphical logos for the modules and also specify alternative module properties to be shown as tooltips when the user points at a cell in the table.

Name	Description
Format	This parameter specifies which module properties to include in the output as an ordered, comma-separated list of property names. Note that standard properties are case-insensitive, but user-defined properties are case-sensitive! Standard module properties include: ID, Cardinality, Motifs, Ordered, Oriented, Max length, Max IC, Min IC and Logo. The name of a property can be followed by the name of another property in parentheses. This second property will be displayed as a tooltip when the mouse is over the table cell with the first property. For example, the format string "`ID,Logo(Max IC)`" will create a table with two columns where the first column contains the module ID and the second a graphical module logo. If the user points the mouse at the logo for a module, its maximum IC will be displayed in a tooltip. (Note that this might not work within MotifLab but should work if the HTML document is saved to a file and displayed in an external web browser.)
Logo max width	The maximum width of module logos. If this is specified, logos will be scaled to fit within this limit when necessary. A value of 0 means no limit.
Logos	Specifies whether the image files for module logos should be made specifically for this output data object and given file names reflecting this fact ("New images") or if the image files should be named after the modules themselves (e.g. "MOD0003.gif") and allowed to be shared by other HTML-formatted output data objects that also contain logos for the same modules ("Shared images"). A third option is to output logos using textual rather than graphical representations ("Text").
multiline	Some module properties may be represented by lists. If the "multiline" option is selected, these lists will be broken across multiple lines in a table cell. If the "multiline" option is not selected, these properties will be output as comma-separated lists.
Headline	An optional headline which will be displayed above the table.

: output

BindingSequences

Applies to:

This format can be used to create motifs based on lists of individual binding sequences provided in a FASTA-like format. The definition of each motif should begin with a header consisting of a greater-than sign followed by the motif ID, e.g. ">M0001". The motif ID should begin with a letter and only consist of letters and numbers. A name ("short name") for the motif can be provided after the motif ID following any non-word character (such as a space or a hyphen). The header should be followed by a set of binding sequences for the motif (one sequence on each line). Note that all the binding sequences for the same motif must have equal lengths, and they can only consist of the letters A,C,G or T (or U can be used instead of T). However, rather than specifying a list of binding sequences, it is possible to state a single consensus motif which is then allowed to include IUPAC symbols for degenerate bases.

When the format is used for output, it will output a header-line for each motif followed by a list of all the binding sequences associated with that motif. If the motif has no annotated binding sequences, it can either output an IUPAC consensus sequence or, optionally, a set of randomly generated binding sequences that taken together will approximate the base frequencies of the motif binding matrix to a given precision.
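
The input rules described above can be sketched as a small Python parser. This is only an illustration, not MotifLab's own reader: the function name `parse_binding_sequences` is hypothetical, lines starting with # are assumed to be comments, and a header of the form `>ID-number` is assumed to be a separate per-sequence header that belongs to the motif `ID`.

```python
import re

# ">ID", optionally "-<number>" (separate headers), optionally a name
# that follows any non-word character such as a space or a hyphen.
HEADER = re.compile(r"^>([A-Za-z][A-Za-z0-9]*)(?:-(\d+))?(?:\W+(.*))?$")


def parse_binding_sequences(text):
    """Return {motif ID: {"name": short name or None, "sites": [sequences]}}."""
    motifs = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' lines carry no data
        header = HEADER.match(line)
        if header:
            motif_id, name = header.group(1), header.group(3)
            current = motifs.setdefault(motif_id, {"name": None, "sites": []})
            if name:
                current["name"] = name
            continue
        if line.startswith(">"):
            raise ValueError("malformed header: " + line)
        if current is None:
            raise ValueError("sequence found before the first header: " + line)
        # U may be used instead of T; case is normalised for comparison.
        current["sites"].append(line.upper().replace("U", "T"))
    for motif_id, motif in motifs.items():
        sites = motif["sites"]
        if len({len(site) for site in sites}) > 1:
            raise ValueError(motif_id + ": binding sequences differ in length")
        # Degenerate IUPAC symbols are only accepted in a single consensus.
        if len(sites) > 1 and any(set(site) - set("ACGT") for site in sites):
            raise ValueError(motif_id + ": only A, C, G, T (or U) are allowed")
    return motifs
```

A motif with exactly one sequence is treated as a consensus and may therefore contain degenerate symbols, while lists of sequences are checked for equal length and plain bases.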

Example:

>Motif1 E-box
CACGTG
CAgGTG
CACGTG
CACGTG
CcCGTG
CACGaG
CACGTG
>Motif2
nrATGAyvTA
>Motif3 Unknown
AGCTACT
AGCTAGT
GGCTAGT
AGCTAGT
aGCTACT
AGCTAGT
AGCTAGG
#Motif with separate headers for each binding sequence
>Motif4-1
AGCTACT
>Motif4-2
AGCTAGT
>Motif4-3
GGCTAGT
>Motif4-4

Name	Description
Binding sequence property	This parameter specifies the name of the user-defined motif property that contains (or will contain) the binding sequences as a comma-separated list. If no property is specified when importing motifs from file, the binding sequences will not be stored together with the motif. If no property is specified when outputting motifs, the consensus sequence (or a set of randomly generated sequences) will be output rather than the actual binding sequences.
Header	Specifies which information to include in the header for each motif. The possible options are: include only the motif ID ("ID") or include both the motif ID and motif name (short name) separated with a space ("ID Name").
Separate headers	If this option is selected, each binding sequence will be preceded by a separate header on the previous line. These headers will be similar to the regular headers except that the motif ID will be immediately followed by a hyphen and an incremental number.
When missing	This parameter controls what to output when a motif does not have any annotated binding sequences associated with it. The option "Output consensus" will output a single IUPAC consensus sequence based on the frequency matrix of the motif. The letter for each position is chosen by the first of the following rules that applies: a single base letter (A,C,G,T) is used if the frequency of that base in the position is at least 50% and also double the frequency of any other base. A double-degenerate letter (m,r,w,s,y,k) is used if the combined frequencies of two bases are at least 75%. A triple-degenerate letter (b,d,h,v) is used if one of the bases has a frequency of zero. If none of the previous rules apply, the wildcard letter 'n' is used. The option "Generate random" will output a set of randomly generated binding sequences that taken together will recreate the frequencies of a motif's binding matrix (to a given precision), so if you output a set of motifs to file using this option and then import them back again, their binding matrices will be exactly or approximately the same. If the binding matrix of a motif is a count matrix (i.e. it only contains integer numbers), the number of binding sequences output will be equal to the sum of base counts for one position (all positions should sum to the same value!). The numbers from the matrix will be directly related to the binding sequences output, so if the matrix has a count of 3 for base 'A' in position 5, exactly 3 of the binding sequences output will have an 'A' in the fifth position. Thus, the original count matrix can in theory be exactly recreated from these binding sequences; however, MotifLab will always generate frequency matrices when importing motifs in BindingSequences format.
If the binding matrix of a motif is not a (consistent) count matrix, the matrix will first be converted into a count matrix by multiplying the normalized frequency values with a given factor (which is a power of 10) and rounding to the nearest integer (with some corrections if necessary). The binding sequences will then be generated as for a regular count matrix.
Random sequences precision	This advanced parameter controls the number of sequences that will be randomly generated if the "When missing" parameter is set to "Generate random" and the binding matrix of a motif is a frequency matrix (or log-transformed matrix) rather than a count matrix. The value N (between 1 and 6) of this parameter is directly related to the number of sequences output (=10^(N-1)) and hence also the number of decimals (N-1) used for the frequencies if a frequency matrix is recreated from these binding sequences.
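
The consensus rules listed for the "Output consensus" option can be written out as a short Python sketch. It is an interpretation rather than MotifLab's implementation: the names `consensus_letter` and `consensus` are hypothetical, "double the frequency" is read as "at least double", the two most frequent bases are assumed to be the pair tested for the 75% rule, and the letter-to-base assignments follow the standard IUPAC codes.

```python
# Standard IUPAC codes for two bases.
DOUBLE = {
    frozenset("AC"): "m", frozenset("AG"): "r", frozenset("AT"): "w",
    frozenset("CG"): "s", frozenset("CT"): "y", frozenset("GT"): "k",
}
# Standard IUPAC codes for three bases, keyed by the base that is absent.
TRIPLE = {"A": "b", "C": "d", "G": "h", "T": "v"}


def consensus_letter(freqs):
    """freqs maps each of 'A', 'C', 'G', 'T' to its frequency in one position."""
    ranked = sorted("ACGT", key=lambda base: freqs[base], reverse=True)
    top, second, lowest = ranked[0], ranked[1], ranked[3]
    # Rule 1: one dominant base.
    if freqs[top] >= 0.5 and freqs[top] >= 2 * freqs[second]:
        return top
    # Rule 2: two bases that together cover at least 75%.
    if freqs[top] + freqs[second] >= 0.75:
        return DOUBLE[frozenset((top, second))]
    # Rule 3: exactly the three remaining bases occur.
    if freqs[lowest] == 0:
        return TRIPLE[lowest]
    # Rule 4: no rule applied.
    return "n"


def consensus(matrix):
    """matrix is a list of per-position frequency dicts."""
    return "".join(consensus_letter(column) for column in matrix)


example = consensus([
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.5, "C": 0.4, "G": 0.05, "T": 0.05},
    {"A": 0.4, "C": 0.3, "G": 0.3, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
])
```

Because the rules are tested in order, a position where two bases share all the probability is reported with a double-degenerate letter even though two bases have zero frequency.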

: output

Background formats

PriorityBackground

Applies to:

Background Model

The following description of the PriorityBackground format is taken directly from the PRIORITY user manual.

The background model's order (k) may be any integer between 0 and 5. For a k-th order model the file must contain exactly 4 + 4^2 + 4^3 + ... + 4^(k+1) real numbers between 0 and 1, with one number on each row.

For example for a 3rd order model the numbers represent:

P(A) P(C) P(G) P(T)
P(A|A) P(C|A) P(G|A) P(T|A) P(A|C) .... P(T|T)
P(A|AA) P(C|AA) ..... P(T|TT) ...... P(T|TT)
P(A|AAA) P(A|AAC) ....... P(A|TTT) ........ P(T|TTT)

In this case the file must contain 4+16+64+256=340 numbers.

IMPORTANT!!! Notice that every group of 4 consecutive numbers must add up to 1 to form a probability distribution.
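
The two constraints stated above (the total number of values for a k-th order model, and every group of four consecutive values summing to 1) are easy to check programmatically. The Python sketch below is an illustration under the assumption that the file has already been read into a list of floats; the function names and the tolerance are hypothetical and not part of PRIORITY or MotifLab.

```python
def expected_count(order):
    """Number of values in a k-th order file: 4 + 4^2 + ... + 4^(k+1)."""
    return sum(4 ** (i + 1) for i in range(order + 1))


def infer_order(number_of_values):
    """Find the model order (0 to 5) that matches the number of values."""
    for order in range(6):
        if expected_count(order) == number_of_values:
            return order
    raise ValueError("no order between 0 and 5 gives %d values" % number_of_values)


def invalid_groups(values, tolerance=1e-6):
    """Return the 0-based indices of groups of four that do not sum to 1."""
    bad = []
    for start in range(0, len(values), 4):
        if abs(sum(values[start:start + 4]) - 1.0) > tolerance:
            bad.append(start // 4)
    return bad
```

For a 3rd order model `expected_count(3)` gives 4 + 16 + 64 + 256 = 340, in agreement with the figure quoted above.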

: output

MEME_Background

Applies to:

Background Model

The following description of the MEME_Background format is taken directly from the MEME Suite web site.

The format for n-order Markov background models is as follows.

The file must contain one line for each combination of 1, 2, ..., n+1 letters in the alphabet. The DNA alphabet is ACGT.

Each line must contain the letter combination followed by the letter combination's frequency (probability). All other lines in the file are ignored, including comment lines starting with '#'.

For example, a 0-order Markov model file might contain:

# tuple frequency_non_coding
a 0.324
c 0.176
g 0.176
t 0.324

A 1st-order Markov model file might contain:

# tuple frequency_non_coding
a 0.324
c 0.176
g 0.176
t 0.324
# tuple frequency_non_coding
aa 0.119
ac 0.052
ag 0.056
at 0.097
ca 0.058
cc 0.033
cg 0.028
ct 0.056
ga 0.056
gc 0.035
gg 0.033
gt 0.052
ta 0.091
tc 0.056
tg 0.058
tt 0.119
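
A reader for this format only needs to keep the lines that consist of a letter tuple followed by a number. The Python sketch below is a hedged illustration, not the MEME or MotifLab parser; the function names are hypothetical, and the model order is inferred from the longest tuple on the assumption (consistent with the two examples) that an n-order file lists tuples of up to n+1 letters.

```python
def parse_meme_background(text):
    """Return {tuple: frequency}, ignoring comments and unrecognised lines."""
    frequencies = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or line.lstrip().startswith("#"):
            continue  # all other lines are ignored
        letters, value = parts[0].lower(), parts[1]
        if not set(letters) <= set("acgt"):
            continue
        try:
            frequencies[letters] = float(value)
        except ValueError:
            continue
    return frequencies


def model_order(frequencies):
    """Order inferred from the longest tuple (length n+1 for an n-order model)."""
    return max(len(letters) for letters in frequencies) - 1


def is_complete(frequencies):
    """True if every tuple of length 1 .. order+1 is present."""
    order = model_order(frequencies)
    expected = sum(4 ** length for length in range(1, order + 2))
    return len(frequencies) == expected
```

For the 0-order example this yields four entries and order 0; the 1st-order example yields 4 + 16 = 20 entries and order 1.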

: output

INCLUSive_Background_Model

Applies to:

Background Model

The following description of the INCLUSive_Background_Model format is taken directly from the MotifSuite web site.

The file starts with a comment line which identifies the format (#INCLUSive Background Model v1.0).
Next follows a description of the order of the background model (#Order) and two informational fields describing respectively a genome identifier (#Organism) and the path referring to the sequences data where the model is extracted from (#Sequences).

The single nucleotide frequencies for A,C,G,T are described by 4 tab separated values (between 0 and 1) on the line following #snf. They represent the probability (Pr) to find the respective nucleotide in the sequence dataset where the background is modelled for, independent of the position of this nucleotide in the sequences.

#snf
Pr(A)	Pr(C)	Pr(G)	Pr(T)

The section following #oligo describes the probability of all possible combinations of the nucleotides A,C,G,T of length equal to the background model order (also called an oligonucleotide) in the sequence dataset where the background is modelled for. The total number of oligonucleotides are printed on separate lines and equals 4 powered to the background model order (e.g. 16 for a second order model). The section starts with the oligonucleotide consisting of all A, followed by oligonucleotides where each next position in the oligonucleotide A is repeatedly replaced by respectively C,G,T. Below example is for a second order background model.

#oligo
Pr(AA)
Pr(AC)
Pr(AG)
Pr(AT)
Pr(CA)
Pr(CC)
Pr(CG)
Pr(CT)
Pr(GA)
Pr(GC)
Pr(GG)
Pr(GT)
Pr(TA)
Pr(TC)
Pr(TG)
Pr(TT)

The higher-order background model is described in the section following #transition matrix. Each line in this section describes the tab separated probabilities (Pr) of finding nucleotide A respectively C, G and T given a set of preceding nucleotides of length equal to the background model order. The total number of lines equals 4 powered to the background model order. The preceding oligonucleotide for the first line consists of all A, and in next lines A is repeatedly replaced by respectively C,G,T on each next position in the oligonucleotide. Below example is for a second order background model.

Example

#INCLUSive Background Model v1.0
#
#Order = 1
#Organism = Human
#Sequences =
#
#snf
0.257	0.2534	0.2465	0.2432
#oligo frequency
0.257
0.2534
0.2465
0.2432
#transition matrix
0.3121	0.1944	0.2751	0.2184
0.2751	0.3014	0.1547	0.2688
0.24	0.2718	0.2943	0.1939
0.197	0.2469	0.2637	0.2924
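
The section layout described above (#Order, #snf, #oligo and #transition matrix) can be read with a simple state machine. The following Python sketch is an assumption-laden illustration rather than the MotifSuite or MotifLab parser: the function names are hypothetical, values are split on any whitespace, and the consistency check uses an arbitrary tolerance.

```python
def parse_inclusive(text):
    """Collect the order and the three numeric sections of the file."""
    model = {"order": None, "snf": [], "oligo": [], "transition": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            tag = line[1:].strip().lower()
            if tag.startswith("order"):
                model["order"] = int(tag.split("=")[1])
            elif tag.startswith("snf"):
                section = "snf"
            elif tag.startswith("oligo"):
                section = "oligo"
            elif tag.startswith("transition"):
                section = "transition"
            continue  # other '#' lines (format id, organism, ...) are skipped
        numbers = [float(field) for field in line.split()]
        if section == "snf":
            model["snf"] = numbers
        elif section == "oligo":
            model["oligo"].extend(numbers)
        elif section == "transition":
            model["transition"].append(numbers)
    return model


def check_model(model, tolerance=1e-3):
    """Return a list of problems; an empty list means the model looks consistent."""
    problems = []
    expected = 4 ** model["order"]
    if len(model["oligo"]) != expected:
        problems.append("expected %d oligo values" % expected)
    if len(model["transition"]) != expected:
        problems.append("expected %d transition rows" % expected)
    for index, row in enumerate(model["transition"]):
        if len(row) != 4 or abs(sum(row) - 1.0) > tolerance:
            problems.append("transition row %d is not a distribution" % index)
    return problems
```

Applied to the first-order example, the parser finds four oligo values and four transition rows, each of which sums to 1.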

: output

Other formats

MapFormat

Applies to:

Map

The MapFormat can be used to output (and read back) entries in Map objects as a list of "key-value" pairs (where the key is the name of the motif/module/sequence depending on the type of map).

Name	Description
Entry separator	This parameter specifies which character to use to separate the key-value entries from each other in the output. The default is to output one entry on each line, but other means of separation are also possible. Valid values for this parameter are "Newline", "Space", "Tab", "Comma", "Semicolon" and "Colon". Note that the separator chosen here must be different from the "Key-value separator" selected below.
Key-value separator	This parameter specifies which character to use to separate the key from the value for each entry. The default separation character is the equals sign (=), but other options are also available. Valid values for this parameter are "Newline", "Space", "Tab", "Comma", "Semicolon" and "Colon". Note that the separator chosen here must be different from the "Entry separator" selected above.
Include entries	A collection which specifies which entries to include in the output. If no collection is specified (or an incompatible collection is chosen) all the entries will be output.
Include default	If this option is selected, an entry for the default value will be included in the output. This will usually have the key "`_DEFAULT_`" (or no key at all).
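
The reason the two separators must differ becomes clear when the format is written and read back. The Python sketch below illustrates the idea with literal separator characters; the function names `write_map` and `read_map` are hypothetical and the code is not taken from MotifLab.

```python
def write_map(entries, entry_separator="\n", key_value_separator="="):
    """Serialise a dict as key-value pairs joined by the entry separator."""
    if entry_separator == key_value_separator:
        raise ValueError("the two separators must be different")
    return entry_separator.join(
        "%s%s%s" % (key, key_value_separator, value) for key, value in entries.items()
    )


def read_map(text, entry_separator="\n", key_value_separator="="):
    """Read the pairs back; values are returned as strings."""
    if entry_separator == key_value_separator:
        raise ValueError("the two separators must be different")
    result = {}
    for entry in text.split(entry_separator):
        if not entry.strip():
            continue
        key, value = entry.split(key_value_separator, 1)
        result[key] = value
    return result


entries = {"M0001": "0.5", "M0002": "1.2", "_DEFAULT_": "0"}
round_trip = read_map(write_map(entries))
```

If both separators were the same character, the reader could not tell where a key ends and the next entry begins.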

: MapExpression, output, Numeric Maps

MapExpression

Applies to:

Map

The MapExpression format can be used to read and write data for Map objects where the format to use is explicitly defined by the user in the form of a (regular) expression. The format outputs one entry on each line.

When outputting a Map in MapExpression format, the expression parameter should be a string which contains two special field codes: {KEY} and {VALUE} (the braces must be included and the letters must be in uppercase). The KEY field will be replaced by the name (identifier) of the data object in the map (motif, module or sequence) and the VALUE field will be replaced by the corresponding value for this data object. For example, the expression "{KEY}={VALUE}" will output the name of the data object and the value separated by an equals-sign, whereas the expression "ENTRY\t{VALUE}\t{KEY}" will output three TAB-separated columns on each row where the first column always has the text "ENTRY", the second column is the value and the last column is the name of the data object.

When importing a file in MapExpression format, the expression should be a regular expression string (formatted according to Java regex rules) containing at least two "capturing groups" enclosed in parentheses. The two capturing groups should match the data name (key) and value respectively. The integer parameters "Key group" and "Value group" are used to tell MotifLab which of the groups are associated with each of these fields. For example, if the entries in the file correspond to the (output) expression "{KEY}={VALUE}", then the input expression could be "(\S+?)=(\S+)" with the value of "Key group" set to 1 and "Value group" set to 2. If the file is in the format "ENTRY\t{VALUE}\t{KEY}", then the input expression "ENTRY\t(\S+)\t(\S+)" can be used with "Key group" set to 2 and "Value group" set to 1 (since the key now occurs after the value in each line). It is possible to use more than two capturing groups, and the "Key group" and "Value group" parameters must then be adjusted accordingly.

Note that double quotes should preferably be avoided in the expression string since this can lead to parsing problems in the current version of MotifLab.
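
The two directions of the MapExpression format can be tried out with a few lines of Python. MotifLab itself uses Java regular expressions; the two patterns from the examples above happen to behave the same way in Python's `re` module, which is what this sketch relies on. The function names are hypothetical, and matching anywhere in the line (rather than requiring the whole line to match) is an assumption.

```python
import re


def format_entry(expression, key, value):
    """Output direction: substitute the {KEY} and {VALUE} field codes."""
    return expression.replace("{KEY}", str(key)).replace("{VALUE}", str(value))


def parse_entries(lines, expression, key_group, value_group):
    """Input direction: pick key and value from the given capturing groups."""
    pattern = re.compile(expression)
    result = {}
    for line in lines:
        match = pattern.search(line)
        if match:  # lines that do not match are skipped
            result[match.group(key_group)] = match.group(value_group)
    return result


# Output expression "{KEY}={VALUE}" and the matching input expression.
line = format_entry("{KEY}={VALUE}", "M0001", 0.5)
parsed = parse_entries([line], r"(\S+?)=(\S+)", key_group=1, value_group=2)

# Three TAB-separated columns where the value comes before the key.
swapped = parse_entries(["ENTRY\t0.5\tM0001"], r"ENTRY\t(\S+)\t(\S+)",
                        key_group=2, value_group=1)
```

Both calls produce the same mapping from `M0001` to the string `0.5`; only the group numbers differ.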

Name	Description
Expression	A string which defines the format to use for output or input. When outputting data in MapExpression format, this expression string should contain the two special field codes: `{KEY}` and `{VALUE}`, and these will be replaced by key-value data from the map. When reading input in MapExpression format, the expression string should be a regular expression containing at least two "capturing groups" that can capture the key and value information respectively from each line in the input. The parameters "Key group" and "Value group" below specify which capturing groups should be used to capture these two properties.
Key group	An integer number that specifies which capturing group captures the "key" from the input. This parameter is only applicable when reading input in the MapExpression format.
Value group	An integer number that specifies which capturing group captures the "value" from the input. This parameter is only applicable when reading input in the MapExpression format.
Include entries	A collection which specifies which entries to include in the output. If no collection is specified (or an incompatible collection is chosen) all the entries will be output.
Include default	If this option is selected, an entry for the default value will be included in the output. This will usually have the key "`_DEFAULT_`" (or no key at all).

: MapFormat, output, Numeric Maps

ExcelMap

Applies to:

Map

The ExcelMap data format can be used to output (and read back) entries in Map objects to and from Excel files where one column contains the keys (the names of the motifs/modules/sequences depending on the type of map) and another column contains the corresponding values. When importing data from an Excel file, only entries in the key column that correspond to known data objects of the relevant type will be processed, which means that lines containing headers and other information will be skipped. The ExcelMap data format was introduced in MotifLab version 2.0.

Name	Description
Key column	This parameter specifies the column that contains the keys (names of motifs, modules or sequences). The column is specified as an integer number starting at 1 for the first (leftmost) column in the Excel file (first sheet).
Value column	This parameter specifies the column that contains the corresponding map values for the motifs/modules/sequences. The column is specified as an integer number starting at 1 for the first (leftmost) column in the Excel file (first sheet).
Include entries	A collection which specifies which entries to include in the output. If no collection is specified (or an incompatible collection is chosen) all the entries will be output.
Include default	If this option is selected, an entry for the default value will be included in the output. This will usually have the key "`_DEFAULT_`" (or no key at all).

: MapFormat, MapExpression, ExcelProfile, output, Numeric Maps

ExpressionProfile

Applies to:

Expression Profile

The ExpressionProfile format can be used to output Expression Profile data to plain text files and also import expression data from such files. Data for each sequence is output on a separate line with the sequence name at the beginning of the line followed by expression values for different conditions. The character which separates the sequence name from the expression values can be specified as a parameter, as can the character which separates the different expression values from each other (this would normally be the same as the character separating the sequence name from the expression values but it does not have to be).

Name	Description
Sequence name delimiter	This parameter specifies which character to use to separate the sequence names (first column) from the expression data values (the rest of the columns). Valid values for this parameter are "Space", "Tab", "Comma", "Semicolon", "Equals" and "Colon".
Condition delimiter	This parameter specifies which character to use to separate the expression data columns from each other (the different conditions). Valid values for this parameter are "Space", "Tab", "Comma", "Semicolon" and "Colon".
Header	Specifies whether a descriptive header should be included as the first line in the output. The parameter value "None" will not output any header. The value "Column names" will output a header containing the name of each column (separated by the same characters as the data on the following lines). The value "#Column names" will output the same header as the previous setting, except that the line will be preceded by a # sign (which will usually be interpreted as a comment line).

: ExcelProfile, output, Expression Profile

ExcelProfile

Applies to:

Expression Profile

The ExcelProfile format can be used to output Expression Profile data to Excel files and also import expression data from such files. When data is output to files, each sequence will be output on a separate line with the sequence name in the first column and data for all the different conditions in subsequent columns. When data is imported from files, it is possible to specify which column contains the sequence names and also which columns contain the expression data to be included in the profile. This data format was introduced in MotifLab v2.0.

Name	Description
Key column	This input parameter specifies the column that contains the sequence names. The column is specified as an integer number starting at 1 for the first (leftmost) column in the Excel file (first sheet).
Value columns	This optional input parameter specifies which columns contain the expression data that should be included in the expression profile. This can be specified as a comma-separated list of column indices (starting at 1 for the leftmost column in the Excel file). It is also possible to specify ranges of columns using a hyphen. For example, if this parameter is set to "2,3,5-7,9" the profile will import data from the columns 2, 3, 5, 6, 7 and 9 (note that column 6 is in the range 5-7). If this parameter is left blank, all columns with valid numeric values to the right of the key column (containing the sequence names) will be included.
Include entries	A sequence collection which specifies which sequences to include in the output. If no collection is specified, all sequences will be output.
Header	If this parameter is selected when outputting data, a header containing the names of the condition columns will be output as the first row in the Excel file. If selected when importing data, the values found in the first row of the Excel file will be used as condition headers.
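
The syntax of the "Value columns" parameter (comma-separated indices with optional hyphenated ranges) can be expanded as in the following Python sketch. The function name `parse_columns` is hypothetical, and returning an empty list for a blank value is only a convention chosen here to signal "use all numeric columns to the right of the key column".

```python
def parse_columns(specification):
    """Expand e.g. "2,3,5-7,9" into the 1-based column indices it denotes."""
    columns = []
    for part in specification.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "5-7" includes both end points.
            first, last = (int(bound) for bound in part.split("-", 1))
            columns.extend(range(first, last + 1))
        else:
            columns.append(int(part))
    return columns


selected = parse_columns("2,3,5-7,9")
```

For the example value from the table, `selected` holds the columns 2, 3, 5, 6, 7 and 9.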

: output, Expression Profile

HTML

Applies to:

The "HTML" format can be used to output results from an Analysis object in a format suitable for viewing by humans. The format can output nicely formatted documents with headlines and descriptive text as well as tables and graphical images. The actual format of the output will depend on the specific analysis type, but will often include tables listing results obtained for each motif, module or sequence.

Name	Description
analysis-specific	Each type of analysis can have its own output parameters for the HTML format, but common parameters include which entries to include in the output (often specified as a Collection) and how to sort the output.
Logos	Analyses related to motifs or modules usually allow you to include a graphical representation of each motif or module in the output. This is called a "logo", and you have a few options regarding how this will be handled.
None : Do not include any representation of the motif/module in the output.
Text : For motifs, this will be a textual (non-graphical) representation of the consensus motif (e.g. "CAsyTG"). For modules it will be a list of transcription factors that are included in the module, with perhaps some other constraints.
Shared images : Each motif/module logo will be saved to a separate image file named after the motif/module (e.g. "MA0207"). This allows multiple HTML files to share the same motif/module logo files if you have performed and output several analyses based on the same motifs/modules. (Note, however, that this will cause trouble if you output multiple analyses that use different motifs/modules but these happen to have the same names.)
New images : Each motif/module logo will be saved to a separate image file named after the HTML-file followed by a number. This ensures that the image files are always correct, even if the analyses were based on different motifs/modules that happened to have the same names.
Embedded images : This option was introduced in MotifLab v2 and allows you to embed the logo images in the HTML file itself rather than saving each logo to a separate image file. Note, however, that if you choose this option, the logo images will not be viewable from inside of MotifLab (they will just look like broken image links), but it will still work if you save the output to a file and open it in an external web browser.

: output, Analysis

Excel

Applies to:

The "Excel" format can be used to output results from an Analysis to an Excel file. The actual format of the output will depend on the specific analysis type, but will often include tables listing results obtained for each motif, module or sequence. This data format was introduced in MotifLab v2.0.

Name	Description
analysis-specific	Each type of analysis can have its own output parameters for the Excel format, but common parameters include which entries to include in the output (often specified as a Collection) and how to sort the output.

: output, Analysis

RawData

Applies to: