UniRule is a format describing rules used by the UniProt Knowledgebase (UniProtKB) automated annotation projects.
It defines annotation that can be propagated to confirmed rule matches (UniProtKB entries). It also
includes cases and specific conditions to restrict propagation of the annotation
to suitable subsets of member sequences (e.g. to a taxonomic range, to a metabolic pathway, or only if a
certain feature exists). These conditional elements are defined using:
The Case statement (a block starting with 'case' and ending with 'end case',
which can contain 'else case' and/or 'else'). This statement can be used
for any type of annotation.
The Condition term, which is only used for the annotation of Feature lines.
The rules can be displayed in a user-friendly Web View, which consists of the following three main sections
and associated sub-sections.
This field indicates the accession number of the rule. It can be in the form MF_xxxxx for
the HAMAP families or in the form PRUxxxxx for the ProRule database.
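As an illustration of the two accession forms, the following Python sketch classifies a rule accession by its origin. It assumes that each "x" stands for exactly one digit, which the forms above suggest but do not state; the function name rule_source is hypothetical.

```python
import re

# Assumption: each "x" in MF_xxxxx / PRUxxxxx stands for a single digit.
ACCESSION_PATTERNS = {
    "HAMAP": re.compile(r"MF_\d{5}"),
    "ProRule": re.compile(r"PRU\d{5}"),
}


def rule_source(accession):
    """Return 'HAMAP' or 'ProRule' for a rule accession, or None if neither form fits."""
    for source, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(accession):
            return source
    return None


print(rule_source("MF_01322"))
print(rule_source("PRU00153"))
```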
Dates
This field is composed of two lines. The first line indicates the rule creation date; the second corresponds to the
last rule revision date.
Data Class
The possible values for this field are:
Protein indicates that the rule is based
on a profile or metamotif that covers the complete protein sequence. In this case the rule
enables the complete annotation of a UniProtKB entry. Rules of this data class will be referred to hereafter as
Protein rules.
Domain indicates that the rule is based on a motif (profile or pattern) that detects a
domain. The propagated annotation will only concern this domain. Rules of this data class will be referred to hereafter
as Domain rules.
Site indicates that the rule is based on a motif that detects a site. The
propagated annotation will only concern this site. Rules of this data class will be referred to hereafter
as Site rules.
Predictors
The Predictors line(s) indicate the motif identifier(s) used to trigger the application of the
rule. The trigger can be one of:
A HAMAP profile derived from the seed alignment of representative
members of the Protein rule.
In this case the format is:
HAMAP; the profile identifier; [a link to the match score list]; [a link to
the seed alignment]
(e.g. HAMAP; MF_01322; [distribution of match scores in UniProtKB]; [seed alignment for MF_01322])
Clicking on the distribution of match scores in UniProtKB displays the score distribution
of matches in the UniProt Knowledgebase. An explanation of the display appears on the right-hand
side of the window.
Clicking on the seed alignment for MF_xxxxx displays the alignment used to calculate
the profile.
In most cases the application of a Protein rule is triggered by only one profile. However,
sometimes there are two profiles: one for bacteria (and plastids, if applicable) and
another for archaea. In these cases, the profile identifier suffix is '_B' for
bacteria (and plastids, if applicable) and '_A' for archaea
(e.g. MF_00563).
A PROSITE pattern or profile. In this case one of two formats is used:
PROSITE; the PROSITE motif accession number; [a link to the match score list]
(e.g. PROSITE; PS51238; [distribution of match scores in UniProtKB])
This format concerns the ABC transporter subfamilies of HAMAP.
PROSITE; the PROSITE motif accession number; the PROSITE motif entry name
(e.g. PROSITE; PS50292; PEROXIDASE_3)
This format concerns the ProRule database.
A PROSITE metamotif. In this case the format is:
Metamotif; -; the metamotif itself
(e.g. Metamotif; -; PS50021=7,91=PS50021)
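The Predictors formats above all follow the same "type; identifier; further fields" layout, so they can be read with a simple split on semicolons. The sketch below is a minimal illustration under that assumption; the Predictor class and parse_predictor function are hypothetical names, not part of any UniProt tool.

```python
from dataclasses import dataclass

# The three trigger types named in the text.
KNOWN_KINDS = ("HAMAP", "PROSITE", "Metamotif")


@dataclass
class Predictor:
    kind: str        # "HAMAP", "PROSITE" or "Metamotif"
    identifier: str  # profile or motif identifier, "-" for a metamotif
    extras: list     # remaining fields (links, entry name, or the metamotif itself)


def parse_predictor(line):
    """Split a Predictors line into its semicolon-separated fields."""
    fields = [part.strip() for part in line.split(";") if part.strip()]
    if len(fields) < 2:
        raise ValueError(f"not a Predictors line: {line!r}")
    kind, identifier, *extras = fields
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unknown predictor type: {kind!r}")
    return Predictor(kind, identifier, extras)


print(parse_predictor("PROSITE; PS50292; PEROXIDASE_3"))
print(parse_predictor("Metamotif; -; PS50021=7,91=PS50021"))
```

For a metamotif the identifier field is "-" and the motif itself is the first element of extras.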
Name and function
These fields are optional for Protein rules and mandatory for Domain and Site rules. They provide, respectively, the name and the function of the protein,
domain or site.
Propagated annotation
This section contains annotation that can be propagated to rule members.
The name and the content of this section depend on the type of rule.
For Protein rules it corresponds to:
An Identifier: the mnemonic code for the protein name
A Description of the protein
The common Gene Name of the protein, when it exists
For Domain and Site rules this is an optional field. It then contains only the part of the description which is common to
all rule members, preceded by a plus (+).
Indicates which other rule(s) must be applied to completely annotate the protein or the domain.
Two main cases can be distinguished:
Triggering of Domain and/or Site rules:
This concerns rules to annotate a Protein containing domain(s) and/or site(s). It also
concerns any rule aiming to annotate a Domain which contains Site(s).
In both cases the format is:
PROSITE identifier1; identifier2; number of expected hits;
trigger=accession number of the rule to be triggered
(e.g. PROSITE PS50035; PLD; 1; trigger=PRU00153;)
Triggering of other rule(s) to annotate features such
as Transmembrane, coiled coil (...):
This concerns annotation of either a protein or a domain. In this
case the format is:
General feature name; -; number of expected hits; trigger=yes
(e.g. General Transmembrane; -; 6-10; trigger=yes;)
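Both trigger formats above have four semicolon-separated fields, with the number of expected hits given either as a single number or as a range. The following sketch parses the two example lines under that assumption; parse_trigger and the dictionary keys it returns are hypothetical names chosen for illustration.

```python
def parse_trigger(line):
    """Parse a trigger line into its source, identifiers, expected hits and target."""
    fields = [part.strip() for part in line.split(";") if part.strip()]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    head, second_id, hits, trigger = fields

    # The first field holds the source keyword and the first identifier.
    source, first_id = head.split(None, 1)

    # Expected hits: a single number ("1") or a range ("6-10").
    low, _, high = hits.partition("-")
    hit_range = (int(low), int(high) if high else int(low))

    key, _, target = trigger.partition("=")
    if key != "trigger":
        raise ValueError(f"missing trigger field: {line!r}")

    return {
        "source": source,
        "identifier1": first_id,
        "identifier2": second_id,
        "expected_hits": hit_range,
        "trigger": target,  # a rule accession, or "yes" for general features
    }


print(parse_trigger("PROSITE PS50035; PLD; 1; trigger=PRU00153;"))
print(parse_trigger("General Transmembrane; -; 6-10; trigger=yes;"))
```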
Gene Ontology
This section contains cross-references to the Gene Ontology database (GO,
http://www.geneontology.org/). Only terms from the "Biological process" and "Molecular
function" ontologies are indicated.
Template feature line(s): defines the template for all the subsequent Feature lines.
The format is:
From: template name
where template name must be one of the following values:
Identifiers (ID and AC) of a sequence in the seed alignment if the trigger is a
HAMAP profile (e.g. From: ACP_ECOLI (P02901))
The unique identifier of the motif or metamotif if the trigger is from PROSITE
(e.g. From: PS50234).
Conditions may be used in feature lines. They usually correspond to pattern constraints, or to the
presence of a specific amino acid.
e.g.
Key       From  To  Description    Condition
DISULFID  60    80  By similarity  C-x*-C
The Optional label can be used to indicate the presence of a feature which is not mandatory in
the matched sequences.
e.g.
Key                 From  To   Description          Condition
BINDING (Optional)  153   153  ATP (By similarity)  [RQ]
Multiple FT lines that should be applied either all together or not at all are grouped
within an "FTGroup", which requires all the sites in the group to be present together.
e.g.
Key       From  To   Description                          Condition  FTGroup
ACT_SITE  42    42   Charge relay system (By similarity)  H          1
ACT_SITE  91    91   Charge relay system (By similarity)  D          1
ACT_SITE  186   186  Charge relay system (By similarity)  S          1
This group can then be referenced by case statements in any other annotation section to be propagated.
For instance:
case <FTGroup:1>
Protein name + (EC 3.4.21.-)
end case
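The all-or-nothing behaviour of an FTGroup can be sketched in Python as follows. This is a simplified illustration, not the UniProt implementation: it assumes the feature positions have already been mapped onto the matched sequence, it handles only single-residue conditions such as H or [RQ], and the function names and dictionary layout are hypothetical.

```python
def condition_holds(sequence, feature):
    """Check a single-residue condition such as 'H' or '[RQ]' at a 1-based position."""
    position = feature["position"]
    if not 1 <= position <= len(sequence):
        return False
    allowed = feature["condition"].strip("[]")
    return sequence[position - 1] in allowed


def satisfied_groups(sequence, features):
    """Return the FTGroup numbers whose member features all satisfy their conditions."""
    by_group = {}
    for feature in features:
        by_group.setdefault(feature["group"], []).append(feature)
    return {
        group
        for group, members in by_group.items()
        if all(condition_holds(sequence, member) for member in members)
    }


# The charge relay system example from the table above.
CATALYTIC_TRIAD = [
    {"key": "ACT_SITE", "position": 42, "condition": "H", "group": 1},
    {"key": "ACT_SITE", "position": 91, "condition": "D", "group": 1},
    {"key": "ACT_SITE", "position": 186, "condition": "S", "group": 1},
]
```

A statement such as case <FTGroup:1> would then correspond to testing whether 1 is in the set returned by satisfied_groups for the matched sequence.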
Size range: For Protein rules, the minimal and maximal sizes of
proteins matching the rule are listed. For Domain and Site rules, this line contains
the size range of the complete domains annotated in UniProtKB.
Related UniRules: Lists identifiers of rules that are known to be similar
in sequence, and which may produce cross-matches. These are particularly useful when two different
rules exist for a short and long version of the same protein (as occurs sometimes in Protein rules).
Long proteins will match both profiles; under these circumstances the longer family supersedes the
shorter family (e.g. MF_00344
supersedes MF_00345).
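The superseding behaviour described above can be sketched as a small filter over the set of rules a protein matches. The sketch assumes that supersession is recorded as a mapping from the longer-family rule to the shorter-family rule; that representation and the name resolve_cross_matches are hypothetical.

```python
# Assumption: each key is a longer-family rule that supersedes the
# shorter-family rule given as its value (example pair taken from the text).
SUPERSEDES = {"MF_00344": "MF_00345"}


def resolve_cross_matches(matched_rules, supersedes=SUPERSEDES):
    """Drop any matched rule that is superseded by another matched rule."""
    matched = set(matched_rules)
    superseded = {
        shorter for longer, shorter in supersedes.items() if longer in matched
    }
    return matched - superseded


# A long protein matching both profiles keeps only the longer family.
print(resolve_cross_matches(["MF_00344", "MF_00345"]))
```

A protein that matches only the shorter family keeps that match, since nothing supersedes it.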
Template: For Protein rules only, lists the accession numbers of the
entries from which the rule's annotation was inferred. The template entries are usually characterized.
"Template: None" indicates that there are no characterization papers on any of the proteins that belong
to that family. This is the case for UPFs (Uncharacterized Protein Family), for example.
Scope: This section indicates the kingdoms covered by the rule.
Fusion: For Protein rules only, indicates if at least one rule member
has been found fused to another protein/domain at its N- or C-terminus. Fusion may be to another
protein or to a known/unknown domain.
Duplicate: For Protein rules only, lists the 5-letter codes of the complete proteomes in which more than one
protein matches the rule.
Plasmid encoded: For Protein rules only, indicates the 5-letter code of the organism in which
the protein is encoded on a plasmid.
Repeats: For Domain and Site rules only, indicates the
expected number (single number, a range, or unlimited) of repetitions of a domain or site in rule matches.
Topology: For Domain and Site rules only, specifies the subcellular location(s) in which a Domain or Site may occur.
Example: Optional for Protein rules and mandatory for Domain
and Site rules. One or more example entries targeted by the rule are indicated.
Comments on the rule: This optional section contains
additional useful information including: 5-letter codes of
organisms with possible wrong starts, divergent paralogs, proteins that are excluded from alignment
due to anomalies, etc.