SQL protocol

The SQL protocol can be used to retrieve data from an SQL database, but it is currently only supported for Region Datasets. All the relevant information must be contained in a single SQL-table with one row for each region. At minimum the SQL-table must have three columns containing information about the chromosome, start- and end-coordinates of the regions, since this information is necessary to select and return the regions that lie within the current sequences. However, the database can also supply additional information for each region as well.

The table above specifies which columns the data source should request from the SQL-table and how the information in these columns should be mapped to region-properties in MotifLab. Each row in this table should describe one region-property.

Property
The "Property" column specifies the name of the region-property. MotifLab recognizes six standard properties which are listed below, but additional user-defined properties are allowed as well.
Type
The "Type" column specifies what type of value the property has. The four possible types are: Text, Integer, Double (a real-valued number) or Boolean (TRUE or FALSE). The six standard properties have specific types which are shown in parenthesis behind each property in the list above. For user-defined properties, the user should select and appropriate type (if in doubt, use "Text").

Database column
This specifies the name of the column in the SQL-database from which the property can be obtained. This column-name is frequently, but not necessarily, the same as the property-name. For example, the value for the start-property could be found in the database table-column named "start" or it could be found in a differently named column, e.g. "startPosition".

Value
The "Value" column can be used to specify that the property-value for the regions should be set to an explicit constant value rather than being obtained from the SQL database. For example, let us say that an SQL-table contains information about DNasHS-regions in a genome but that this table only contains three columns for the required properties chromosome, start and end. Since no information can be obtained for the standard type- and strand-properties, these will be set to the default values "unknown" and "undetermined" respectively. However, rather than relying on these default values, the user can explicitly state which values to use instead, such as e.g. "DNaseHS" for the type-property and "DIRECT" for the strand-orientation. Note that you can specify either the "Database column" or "Value" for a property but not both!

Transform
Sometimes a value obtained from the SQL-database can not be used directly by MotifLab but has to be converted first in a transform step in order to be interpreted correctly. Currently, MotifLab recognizes the following transforms:
Map transform Map transforms apply to all properties of type Text and can be used to convert specific values found in the database to other values for use in MotifLab (or vice versa). The transform is specified as a comma-separated list of key=value pairs where the key in each pair denotes the value of the property in MotifLab and the value-part denotes the corresponding value as found in the database.

For example, consider a text property named "RepeatType" which has the following transform "Transposon=LINE,Transposon=SINE,Transposon=LTR,Tandem Repeat=Minisatellite,Tandem Repeat=Microsatellite,Tandem Repeat=Short Tandem Repeat". This transform assumes that the database column corresponding to the RepeatType property can (at least) take on the following six values: LINE, SINE, LTR, Minisatellite, Minisatellite and Short Tandem Repeat. When the datasource encounters a value in the database which can be found on the right-hand side of a key=value pair in the list, it will assign the region the corresponding left-hand side value for the property. Note that the same key can be repeated multiple times in the list in order to map a larger set of values in the database onto a smaller set of values in MotifLab. In this case, for example, the six different types of repeats mentioned in the database will be grouped into two different types ("Transposon" or "Tandem Repeat") for the RepeatType property of the regions.

Values encountered in the database that are not mentioned in the transform will not be converted but used as is.



For the "strand" (orientation) property, MotifLab recognizes the values "direct", "1" or anything starting with "+" as referring to the direct strand, whereas "reverse" or anything starting with "-" would be taken as a reference to the reverse strand. If a database uses a different scheme for denoting strand orientation, e.g. "D" for direct strand and "R" for reverse strand, the map transform can be used to convert these to values recognized by MotifLab, like so: "direct=D,reverse=R".



Although chromosomes in MotifLab are often presented on the form "chr12", they are represented internally using only the chromosome number (or letter) without the "chr" prefix. If a database uses "chr" prefixes for chromosomes, these values must be converted with a map transform on the form "1=chr1,2=chr2,3=chr3, ...etc./" (remember that the left-hand side of each pair is the value used in MotifLab and the right-hand side is the corresponding value in the database). Since this is a very common scenario, MotifLab has a special short-hand notation for this called the "chr-prefix" transform (see below).
Chr-prefix This transform applies only to the "chromosome" property.
MotifLab assumes that the chromosome column in the SQL-table refers to e.g. "chromosome 1" simply as "1" and "chromosome X" as "X", etc. However, some databases might include a "chr"-prefix for all chromosomes (e.g. "chr1" and "chrX"). If the database uses a "chr"-prefix, MotifLab must be made aware of this in advance in order to process the data correctly. This can be done by simply typing any value that is not a valid map transform into the Transform-column of the "chromosome" property. Any such non-empty transform-value will be interpreted to mean that the database uses "chr"-prefixes.
Coordinate offset This transform applies to the "start" and "end" properties.
In the genomic coordinate system used by MotifLab, the first nucleotide on a chromosome is said to correspond to position 1 (so-called "1-indexed" coordinates) and the location of a sequence region is referred to as "chrN:start-end" where the first nucleotide in the region is at position "start" and the last nucleotide is at position "end". Hence, a region corresponding to the first 8 nucleotides of chromosome 2 is referred to as "chr2:1-8". This way of referring to sequence regions is used by the GFF dataformat and many other dataformats. However, some dataformats, such as e.g. the popular BED-format, rely on different coordinate systems. According to the BED-format (which is "0-indexed"), the first nucleotide of a chromosome is said to be at position 0 and the end-coordinate refers to the first position after the region rather than the last position within the region (i.e. the start-coordinate is inclusive but the end-coordinate is exclusive). The first 8 nucleotides of chromosome 2 would hence be referred to as "chr2:0-8" in BED-format. If the coordinates in your database are not in an "inclusive 1-indexed" format, they have to be converted to the coordinate system used by MotifLab using a "coordinate offset" transform. This can be accomplished simply by typing a positive or negative number into the Transform-column of the "start" and/or "end" properties. This value will then be added to the value which is obtained from the database. If your database uses BED-coordinates you can convert these simply by entering the offset-value "1" into the Transform-column for the "start" property, and this will convert the 0-indexed coordinate to an 1-indexed coordinate (note that the end-coordinate does not have to be transformed since an "exclusive 0-indexed" coordinate would be the same as an "inclusive 1-indexed" coordinate).