# MC-Sym


MC-Sym (Macromolecular Conformations by SYMbolic programming) is an RNA 3-D structure modeling system.

MC-Sym generates a set of 3-D structure models from a syntactic definition of the modeled RNA molecule. Such a definition includes sequence constitution and nucleotide-nucleotide interaction types. MC-Sym creates a conformational search space from this input into which conformations are built and validated according to a set of constraints.

Current version is 4.1.3.


## Abbreviations

Abbreviations used in this article:

| Abbreviation | Meaning |
|---|---|
| base | nucleobase |
| bp | base pair |
| bs | base stacking |
| CSP | constraint satisfaction problem |
| HB | hydrogen bond |
| HTM | homogeneous transformation matrix |
| HTMD | homogeneous transformation matrix distance metric |
| RMSD | root mean square deviation |

## Overview

The goal of MC-Sym is to build RNA 3-D structures. The modeling approach focuses on nucleotide-nucleotide interaction: base pairing (bp) and base stacking (bs). MC-Sym explores a database that contains bp and bs extracted from high-resolution X-ray crystallographic RNA structures to assemble conformations of an RNA from a description of its bp and bs. This conformational search is implemented as a Constraint Satisfaction Problem (CSP).

## Modeling Engine

The modeling engine builds 3-D structure models of an RNA molecule by exploring a conformational space. This space is defined in terms of nucleotide-nucleotide interactions, namely bp, bs, and nucleotide adjacency in the sequence. A nucleotide-nucleotide interaction is encoded as an HTM that expresses the spatial relationship between the bases of the two nucleotides. Such HTMs are extracted from high-resolution X-ray crystallographic structures and stored in a local database. This HTM database, called the *relation database*, constitutes the conformational space. Any of these HTMs can be applied to the coordinates of a nucleotide to express the encoded interaction while building a 3-D structure model. The database is indexed by the types of the interacting nucleotides and by the symbolic description of the interaction. For example, if a modeling task is to build a structure containing a G-G outward stacking interaction, the appropriate HTMs are selected from the database to compose a specific domain. Exploration of this domain by the modeling engine positions two G nucleobases in 3-D space in several conformations, each expressing an outward stacking interaction.

### Relation

A *Relation* is defined as an HTM that expresses a specific interaction between two bases. The relation entity contains the HTM and the symbolic annotation that describes the interaction. This annotation falls into three categories: base pairing, base stacking, and base adjacency.

#### Base Pairing

Base pairing is defined as two bases linked together by HBs. The HBs are formed between exocyclic hydrogen donor groups (mainly NH and NH₂) and acceptor groups (mainly CO and N). The classical Watson-Crick G-C and A-U bps are well known: they are the cement of the A-form double helix of RNA. These two canonical bps are formed by three and two hydrogen bonds, respectively. However, several other bps exist, involving three, two, or a single HB between different donor and acceptor groups. Different nomenclatures exist to name these so-called non-canonical bps. In MC-Sym, the *LW+* nomenclature is used (Lemieux and Major 2002). This nomenclature is a refinement of the *LW* nomenclature (Leontis and Westhof 2001), where a bp is identified by contact edges. A contact edge is a region of the base where the HB(s) are formed. Three edges characterize any base: the Watson-Crick edge (*W*), the Hoogsteen edge (*H*), and the Sugar edge (*S*). In the *LW+* nomenclature, each *LW* edge is subdivided into faces: *w*, *h*, and *s* (see Figure 1). Thus, a specific bp is named in *LW+* by giving, for each of the two bases, the *LW* edge and the inner *LW+* face in contact. For example, a canonical G-C Watson-Crick bp is named *Ww/Ww*, while a wobble G-U is named *Ww/Ws*. Bifurcated HBs oscillating between two edges have their own face: *Bh* for a bifurcated HB between the *W* and *H* edges, and *Bs* between the *W* and *S* edges.

Two additional geometrical features are symbolically annotated in bps: the cis/trans relative orientation of the glycosidic bonds, and the parallel/antiparallel relative orientation of the base planes.

The cis/trans relative orientation of the glycosidic bonds distinguishes two categories of bps (see Figure 2a). A bp is cis if the glycosidic bonds, oriented from the base to the C1' of the ribose, point in the same direction; it is trans if they point in opposite directions. All Watson-Crick bps in an A-form helix are cis. The parallel/antiparallel relative orientation of the base planes also splits bp annotation into two categories, but from a different point of view (see Figure 2b). Let $\vec{n}$ be the normal to the plane of a base, oriented in such a way that all bases in an A-form helix have their $\vec{n}$ pointing in the same direction: toward the 3' end of the strand (see Nucleobase normal vector). Then, a bp is parallel if the normals of the two bases point in the same direction; it is antiparallel if they point in opposite directions.

Two other base pairing nomenclatures are supported by MC-Sym: the Saenger nomenclature (Saenger 1984) and the Gautheret nomenclature. The Saenger nomenclature labels specific bps with Roman numerals from I to XXVIII. The Gautheret nomenclature applies to bps with a single hydrogen bond and labels them with Arabic numerals from 29 to 137.

#### Base Stacking

Base stacking is characterized by two bases that are stacked on top of each other in a roughly parallel fashion. This spatial arrangement is induced by London dispersion between the pyrimidine and imidazole rings of bases. The RNA Ontology Consortium has recently integrated a new bs nomenclature proposed by Major and Thibault. This nomenclature is based on the enumeration of the four possible relative arrangements of the normal vectors of two stacked bases (see Nucleotide stacking nomenclature).

If bases *A* and *B* are stacked, then *B* is either above or below *A* with respect to the normal vector of *A*. Additionally, the normal vectors of *A* and *B* can point either in the same direction or in opposite directions. Combining both features, four bs arrangements are enumerated: *A* and *B* stack upward (`A >> B`), downward (`A << B`), inward (`A >< B`), or outward (`A <> B`).
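The enumeration can be sketched in a few lines of Python. This is an illustration only, not MC-Sym's annotation code: the function name `stacking_type`, the use of dot products to decide "above" and "same direction", and the assignment of the two opposite-normal cases to inward and outward (read off the arrow symbols) are all assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def stacking_type(center_a, normal_a, center_b, normal_b):
    """Classify the stacking of B on A from base centers and normal vectors."""
    # B is "above" A when the vector from A to B points along A's normal.
    a_to_b = tuple(b - a for a, b in zip(center_a, center_b))
    b_above_a = dot(a_to_b, normal_a) > 0
    same_direction = dot(normal_a, normal_b) > 0
    if same_direction:
        return "upward" if b_above_a else "downward"   # A >> B / A << B
    # Opposite normals: assumed to face each other (inward) when B is above A.
    return "inward" if b_above_a else "outward"        # A >< B / A <> B
```

A real annotator would also have to decide whether the two bases are stacked at all (distance and overlap criteria), which this sketch takes for granted.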

#### Base Adjacency

Base adjacency describes two bases that are adjacent, that is, consecutive in the sequence. Two adjacent bases are labeled either `adjacent_5p` or `adjacent_3p` to distinguish the strand orientation. Base adjacency is maintained internally by the modeling engine and is implicit in the nucleotide numbering.

#### Base-Base HTM

Base pairing, stacking, and adjacency describe a base-base interaction symbolically. To express a specific interaction between bases *A* and *B* numerically in a model, the relative 3-D arrangement of the bases is encoded as a relative frame transformation in 4×4 homogeneous coordinates: ${}^{A}M_{B}$. This HTM represents the rigid transformation (a combination of translation and rotation) that expresses the coordinate frame aligned on *B* (right subscript) in the coordinate frame aligned on *A* (left superscript). A base's coordinate frame has its origin on the terminal nitrogen (N9 for purines, N1 for pyrimidines) and its XY plane aligned with the plane of the base (see Figure 3).

Using such a relative HTM to express the interaction of *A* with *B* makes it possible to apply it independently of the global position of *A*. Given $V_{A}$, the coordinates of base *A* in the global frame, the global coordinates of base *B* ($V_{B}$) are easily obtained using ${}^{A}M_{B}$.
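One way to write this step, consistent with the frame definitions used in this section, is to compose the global frame of *A* with the relative HTM and then place the atoms of *B*. The column-vector convention and the symbol $V_{B}^{\mathrm{local}}$ (the reference coordinates of base *B* in its own frame) are assumptions made for this sketch:

```latex
{}^{O}M_{B} = {}^{O}M_{A}\,{}^{A}M_{B},
\qquad
V_{B} = {}^{O}M_{B}\,V_{B}^{\mathrm{local}}
```

Here ${}^{O}M_{A}$ is the local frame of base *A* expressed in the global frame *O*, that is, the frame on which the coordinates $V_{A}$ are aligned.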

A specific interaction between two bases, *A* and *B*, identified in a given 3-D structure is encoded as the relative HTM ${}^{A}M_{B}$, computed by the following equation, where ${}^{O}M_{A}$ and ${}^{O}M_{B}$ express the local frames of bases *A* and *B*, respectively, in the global frame *O* of the whole 3-D structure:

$$
{}^{A}M_{B} = \left({}^{O}M_{A}\right)^{-1}\,{}^{O}M_{B}
$$
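The extraction of a relative HTM from two global frames, and its re-application, can be sketched with plain nested lists. This is a minimal sketch and not MC-Sym code: the helper names are hypothetical, the column-vector convention is assumed, and the two frames are made-up values rather than frames measured on a real structure.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid_inverse(m):
    """Invert a rigid HTM: transpose the rotation, counter-rotate the translation."""
    rot_t = [[m[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][k] * m[k][3] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_htm(o_m_a, o_m_b):
    """Frame of B expressed in the frame of A: inverse(O_M_A) * O_M_B."""
    return matmul(rigid_inverse(o_m_a), o_m_b)

def translation(x, y, z):
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def rotation_z_90():
    """Rotation of +90 degrees about Z, with no translation."""
    return [[0.0, -1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

# Two made-up global frames (not taken from any real structure).
o_m_a = matmul(translation(1.0, 2.0, 3.0), rotation_z_90())
o_m_b = translation(1.0, 2.0, 6.4)

a_m_b = relative_htm(o_m_a, o_m_b)
# Re-applying the relative HTM from A's global frame recovers B's global frame.
rebuilt = matmul(o_m_a, a_m_b)
```

Because `a_m_b` no longer depends on where *A* sits globally, the same matrix can be reused to place *B* next to any other copy of *A*.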

### Relation Database

The previous section defined a *relation* between two bases as an entity containing a symbolic annotation of the interaction (base pairing, stacking, or adjacency) and a relative HTM that expresses the transformation positioning the second base with respect to the first. Such relation entities are extracted from high-resolution X-ray crystallographic structures and stored in a database: the relation database. This database defines the maximal search space of the modeling engine. Extracting a relation between bases *A* and *B* from a 3-D structure takes two steps: 1) annotate the interaction between *A* and *B* (using an embedded MC-Annotate) and 2) compute ${}^{A}M_{B}$ using the equation above. Because a relation is directed by definition, the relation between *B* and *A* must also be stored. Hence, two relations are extracted from the reference 3-D structure for each pair of interacting bases.

The content of the current version of the relation database (4.1.0) was created from the RNA X-ray crystallographic structures deposited in the public RCSB PDB database before May 2005 with a resolution of 3 Å or better. A total of 366034 relations were extracted, of which 313902 were retained after similarity filtering at 0.1 Å. This filter eliminates the neighborhood of each relation, where two relations are *neighbors* if two conditions are met: 1) both ordered base pairs have matching A-C-G-U types and 2) the HTMD between both HTMs is lower than the filter threshold (0.1 Å). The filter algorithm always eliminates the largest neighborhood first.
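One possible reading of this filter is a greedy loop that repeatedly keeps the relation with the largest remaining neighborhood and discards its neighbors. The sketch below follows that reading; it is not the actual MC-Sym filter. Keeping the center of each neighborhood as its representative is an assumption, scalars stand in for HTMs, and an absolute difference stands in for the HTMD, whose definition is not given here.

```python
def filter_neighbourhoods(relations, distance, threshold):
    """Greedy filter: keep one representative per neighbourhood, largest first.

    relations: list of (base_types, value) tuples; distance acts on values.
    """
    remaining = set(range(len(relations)))

    def neighbours(i):
        types_i, value_i = relations[i]
        return {j for j in remaining
                if j != i
                and relations[j][0] == types_i
                and distance(value_i, relations[j][1]) < threshold}

    kept = []
    while remaining:
        # Pick the relation whose neighbourhood is currently the largest
        # (ties go to the lowest index so the result is deterministic).
        centre = max(sorted(remaining), key=lambda i: len(neighbours(i)))
        kept.append(relations[centre])
        remaining -= neighbours(centre) | {centre}
    return kept

# Made-up data: scalars replace HTMs, abs() replaces the HTMD.
relations = [
    (("G", "A"), 0.00),
    (("G", "A"), 0.05),
    (("G", "A"), 0.12),
    (("G", "A"), 0.50),
    (("A", "G"), 0.05),
]
kept = filter_neighbourhoods(relations, lambda x, y: abs(x - y), 0.1)
```

The neighborhoods are recomputed at every step, which is quadratic per iteration; a filter over several hundred thousand relations would need a spatial index instead.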

Each relation item in the database is indexed by the ordered pair of base types, *i.e.* a pair of types drawn from A, C, G, and U, and by a list of symbols describing the base pairing, stacking, or adjacency that annotates this interaction.
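A toy version of such an index can be written as a dictionary keyed by the ordered pair of base types. The class `RelationIndex`, its methods, and the annotation symbols stored below are hypothetical and only illustrate the indexing idea; the real database format is not described here.

```python
from collections import defaultdict

class RelationIndex:
    """Toy index: ordered base-type pair -> list of (symbols, htm) entries."""

    def __init__(self):
        self._by_types = defaultdict(list)

    def add(self, type_a, type_b, symbols, htm):
        self._by_types[(type_a, type_b)].append((frozenset(symbols), htm))

    def query(self, type_a, type_b, required=(), forbidden=()):
        """Return HTMs annotated with all `required` and no `forbidden` symbols."""
        required, forbidden = set(required), set(forbidden)
        return [htm for symbols, htm in self._by_types.get((type_a, type_b), [])
                if required <= symbols and not (forbidden & symbols)]

index = RelationIndex()
# HTMs are replaced by string labels here; the annotations are made up.
index.add("G", "A", {"S/H", "antiparallel", "pairing"}, "htm-1")
index.add("G", "A", {"upward", "stack", "adjacent_5p"}, "htm-2")
index.add("A", "G", {"H/S", "antiparallel", "pairing"}, "htm-3")

pairs = index.query("G", "A", required={"S/H", "antiparallel"})
unstacked = index.query("G", "A", forbidden={"stack"})
```

Because the key is an ordered pair, the G-to-A and A-to-G directions of the same interaction are stored as separate entries, matching the directed nature of a relation.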

### Search Space

The modeling engine explores a search space to build conformations of an RNA. This search space is composed of relations, as described in the Relation section. These relations are used to position the bases of the modeled RNA in 3-D space relative to one another, expressing the RNA base-base interactions. The maximal search space is defined by the content of the relation database (see the Relation Database section). In theory, the database could be explored in its entirety each time a base is to be positioned; however, the computational cost would be prohibitive. Moreover, a specific modeling task always targets specific base-base interactions. Thus, the search space must be restricted to suit each modeling task.

A search space is specified by enumerating each relation that composes the structure to model. This is achieved by representing the structure as a graph of relations. In this directed graph, each vertex $v_{k}$ represents a nucleotide *k* by its ID (chain and sequence number) and type (A, C, G, or U). An arc $a_{A,B}$ is drawn from vertex $v_{A}$ to $v_{B}$ if nucleotides *A* and *B* have a relation. A symbolic expression that describes the type of relation between both nucleotides is associated with each arc. This symbolic expression is parsed by MC-Sym into a query to the relation database, and the matching relations compose the domain of the associated arc. When all arcs have been processed in this way, the search space is completely defined, with a specific domain associated with each arc.

Here, the user's task is to input the graph of relations with the symbolic expression associated with each arc. The `sequence` command is used to specify the vertices of the graph: nucleotide types and numbering. The `relation` command is used to enumerate the arcs of the graph with their associated symbolic expressions. Below is an example of search space specification.

#### GNRA stem-loop: a search space specification example

Here is an example of user input and the associated search space for a small GNRA stem-loop of sequence 5'-GGGGAGACCC-3'. The structure to model is composed of a three-bp standard helix capped with a GNRA tetraloop of sequence GAGA. The GNRA tetraloop is a classical RNA motif where the *N*, *R*, and *A* bases are stacked and the first *G* and last *A* form an *S/H* bp. Figure 4a shows an input example that specifies a search space for this particular structure (see MC-Sym Command Reference for the input syntax). Figure 4b illustrates the graph of relations representing the input.

The first command, `sequence`, sets up the actual sequence of nucleotides that composes the structure and defines a numbering: from X1 to X10. These are the 10 vertices of the graph of relations that describes this search space: from $v_{1}$ to $v_{10}$.

The second command, `relation`, enumerates all relations between the bases. Each relation is specified by the bases' IDs, the symbolic expression that describes the relation, and a database sampling size. Only one direction of each relation is necessary in the input; the reversed relation is automatically added to the search space. For example, the item `X4 X7 { S/H && antiparallel } 10` queries the database for relations between a G and an A that form an *S/H* antiparallel bp. 7974 relations match this query, but only 10 are retained for this domain. A sampling algorithm ensures that any sample of a relation set maximizes the coverage of the space spanned by the complete set. Sampled domains are stored in a cache file, from which they can be restored later, saving the computational cost of running the sampling algorithm again. This set of 10 relations is associated with arc $a_{4,7}$. Arc $a_{7,4}$ is implicitly added to the graph of relations with expression `X7 X4 { H/S && antiparallel } 10`.

In this example, base adjacency was not specified in the input. Base adjacency is always implicit in the consecutiveness of nucleotides in the sequence and is maintained internally. For example, the item `X5:X7 { upward } 8` only restricts base stacking between bases X5 and X6, and between bases X6 and X7. However, adjacency is silently added to the database queries, since the bases are consecutive. The same applies to item `X4 X5 { !stack } 75`, where the query is parsed to `!stack && adjacent_5p` (any relation between adjacent bases that are not stacked). Figure 4c lists the queries that are parsed by MC-Sym and sent to the database.
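The two implicit rules described in this example, the automatic reverse arc and the silent addition of adjacency, can be mimicked in a short Python sketch. It is not MC-Sym's parser: expressions are handled as plain strings, the helpers `reverse_expression` and `add_relation` are hypothetical, and using `adjacent_3p` for the reverse direction of an adjacent pair is an assumption.

```python
import re

def reverse_expression(expression):
    """Swap the two edges of every 'X/Y' pairing token, e.g. 'S/H' -> 'H/S'.

    Direction-dependent stacking symbols are not handled in this sketch.
    """
    return re.sub(r"(\w+)/(\w+)", r"\2/\1", expression)

def add_relation(graph, a, b, expression, sample_size):
    """Store arc a->b and its implicit reverse b->a; a and b are sequence positions."""
    forward, backward = expression, reverse_expression(expression)
    if b == a + 1:
        # Consecutive nucleotides: adjacency is appended to both queries.
        # 'adjacent_5p' for a->b follows the text; 'adjacent_3p' for b->a is assumed.
        forward += " && adjacent_5p"
        backward += " && adjacent_3p"
    graph[(a, b)] = (forward, sample_size)
    graph[(b, a)] = (backward, sample_size)

graph = {}
add_relation(graph, 4, 7, "S/H && antiparallel", 10)
add_relation(graph, 4, 5, "!stack", 75)
```

After these two calls, `graph[(7, 4)]` holds the reversed pairing query and `graph[(4, 5)]` holds the `!stack` query extended with adjacency.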

#### Spanning tree

Exploration of the search space is defined as a CSP, where the variables are the arcs of the graph of relations and the domain of each variable is the associated relation set obtained from the database. Constraints are specified by the user and are discussed below. Solving this CSP means instantiating all variables in such a way that all constraints are satisfied. Here, assigning a value to a variable means selecting a relation from the domain associated with the arc and applying its HTM to position the base at the end of the arc in the frame of the base at the beginning of the arc. In the GNRA stem-loop example presented above, arc $a_{4,7}$ is a variable and its domain is a set of 10 relations. Assigning a relation to this variable positions base X7 in the frame of base X4, expressing the encoded *S/H* antiparallel bp.

The arcs contained in any spanning tree of the graph of relations are sufficient to position all the bases in the graph. The root of the spanning tree provides the global frame of the structure. Starting from the root, a traversal of the tree positions all bases using the relations in the domains of the traversed arcs. A depth-first traversal with backtracking then explores all possible assignments to the tree's arcs. If a complete assignment satisfies all constraints, then a valid conformation is built.

More than one spanning tree can cover a given graph of relations. Hence, the selection of a spanning tree and of its root is part of the user input: the `backtrack` command lists the paths from the tree root to the leaves. Figure 5 shows a selected rooted spanning tree for the GNRA stem-loop example presented in Figure 4, and how it is input as an MC-Sym command.

A spanning tree over a graph of relations contains a subset of its arcs. This subset is almost never complete, because the graph can contain cycles. In the GNRA stem-loop example above, the selected spanning tree does not contain arcs $a_{1,2}$, $a_{3,4}$, $a_{4,5}$, and $a_{8,9}$. Three of those overlooked arcs express stacking in the stem, and the remaining one closes the tetraloop. This structural information must clearly be taken into account, which is currently one of the greatest challenges of MC-Sym. The current solution is to integrate constraints and restraints (see below) that address these structural features.

Once a rooted spanning tree is selected, the search space is complete and exploration can begin. Assigning a relation to each arc variable of the tree, in the order scheduled by the `backtrack` command and with each relation selected from the associated arc's domain, positions all bases relative to one another. In the GNRA stem-loop example, assignment proceeds as follows:

- From the frame of base X1, which acts as the global frame, base X10 is positioned by selecting a relation from the domain of arc $a_{1,10}$.
- From the X10 frame, X9 is positioned by selecting a relation from the domain of $a_{10,9}$.
- From the X9 frame, X2 is positioned by selecting a relation from the domain of $a_{9,2}$.
- From the X2 frame, X3 is positioned by selecting a relation from the domain of $a_{2,3}$.
- From the X3 frame, X8 is positioned by selecting a relation from the domain of $a_{3,8}$.
- From the X8 frame, X7 is positioned by selecting a relation from the domain of $a_{8,7}$.
- From the X7 frame, X4 is positioned by selecting a relation from the domain of $a_{7,4}$.
- Again from the X7 frame, X6 is positioned by selecting a relation from the domain of $a_{7,6}$.
- Finally, from the X6 frame, X5 is positioned by selecting a relation from the domain of $a_{6,5}$.
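The traversal above, combined with backtracking, can be sketched as a recursive generator. This is a simplified illustration, not the MC-Sym engine: one-dimensional offsets stand in for relation HTMs, the domains are made up, and the `is_valid` callback is a hypothetical placeholder for user constraints.

```python
def backtrack(arcs, domains, root, is_valid):
    """Yield every complete placement that satisfies `is_valid`.

    arcs: tree arcs (parent, child) in traversal order.
    domains: arc -> list of candidate offsets (stand-ins for relation HTMs).
    is_valid: constraint checked on each partial placement.
    """
    placed = {root: 0.0}  # the root frame is the global frame

    def explore(depth):
        if depth == len(arcs):
            yield dict(placed)
            return
        parent, child = arcs[depth]
        for offset in domains[(parent, child)]:
            placed[child] = placed[parent] + offset  # position child from parent
            if is_valid(placed):
                yield from explore(depth + 1)
            del placed[child]  # undo before trying the next candidate

    yield from explore(0)

# Traversal order of the GNRA stem-loop example, rooted at X1.
arcs = [(1, 10), (10, 9), (9, 2), (2, 3), (3, 8), (8, 7), (7, 4), (7, 6), (6, 5)]
# Made-up one-dimensional domains: a single candidate on most arcs.
domains = {arc: [1.0] for arc in arcs}
domains[(7, 4)] = [1.0, -1.0]
domains[(7, 6)] = [2.0]

def no_overlap(placed):
    """Toy constraint: no two bases may share the same position."""
    return len(set(placed.values())) == len(placed)

solutions = list(backtrack(arcs, domains, root=1, is_valid=no_overlap))
```

In this toy run, the candidate `-1.0` on arc (7, 4) would put X4 on top of X8, so the constraint rejects it and only one placement is produced. Checking constraints on partial placements is what lets backtracking prune whole subtrees early.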

### Backbone Construction

As presented above, instantiating the arc variables of a selected rooted spanning tree of the graph of relations that represents the search space positions all bases relative to one another. However, the conformation is incomplete: the backbone is missing. The modeling engine has built-in functionality to build a numerical approximation of a backbone conformation that fits the conformation of the bases.

The backbone construction sub-process is able to build a ribose conformation given the position of a base and two phosphate groups. This ribose conformation is built in constant time by estimating the geometrical features of the ribose (pseudorotation angle and glycosidic torsion). The quality metric of a ribose conformation is defined as the RMSD between the lengths of the implicit bonds that link the ribose to both phosphates (C5'-O5' and C3'-O3' bonds) and their standard values.
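The quality metric can be written out as a small function. This is a sketch under stated assumptions: the function name `ribose_quality` is hypothetical, and the default reference lengths (about 1.44 Å and 1.43 Å) are approximate placeholder values rather than the ones used by MC-Sym.

```python
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ribose_quality(c5, o5, c3, o3, std_c5_o5=1.44, std_c3_o3=1.43):
    """RMSD (in Å) between the two closing bond lengths and their reference values.

    The default reference lengths are approximate, assumed values.
    """
    deviations = [distance(c5, o5) - std_c5_o5, distance(c3, o3) - std_c3_o3]
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

# Made-up coordinates: the C5'-O5' bond is stretched by 0.1 Å, C3'-O3' is ideal.
score = ribose_quality((0, 0, 0), (1.54, 0, 0), (0, 0, 0), (0, 1.43, 0))
```

A score near zero means that both implicit bonds close at their reference lengths; larger values indicate a ribose that does not bridge the base and the two phosphates well.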

Before a ribose conformation can be built, the phosphate groups must also be positioned along with the bases. Phosphate groups (PO₄) are considered rigid bodies, as are bases. Thus, they are positioned along with the bases using relative HTMs. More precisely, when a base is positioned from another and both bases are adjacent, a phosphate group is positioned in between, in the same frame. The relative HTM that positions a phosphate group between two adjacent bases is associated with the relation entity that represents this base adjacency. Hence, all base adjacency relations are composed of a pair of HTMs: one to position the second base in the frame of the first, and the other to position the phosphate group in between, still in the frame of the first base.

In this way, phosphate groups are positioned along with the surrounding adjacent bases. However, there is no guarantee that a given spanning tree selected for exploration of the search space will cover all adjacent arcs. Consequently, there will be a missing phosphate group for each adjacent arc of the graph of relations that is not covered by the selected spanning tree. In the GNRA example above, four adjacent arcs are missing from the spanning tree illustrated in Figure 5: $a_{1,2}$, $a_{3,4}$, $a_{4,5}$, and $a_{8,9}$.

A missing phosphate group is positioned by selecting, from the unused domain of the overlooked adjacent arc, the relation that is closest to the implicit relation currently expressed by the arc's bases in the built conformation. This closeness between two relations is measured using the HTMD between their base-base HTMs. Once the closest relation from the unused domain is found, its base-phosphate HTM is applied to position the phosphate group.

This method of positioning missing phosphate groups implies that relation domains must have been specified in the search space, even if these domains are not part of the spanning tree. In the GNRA stem-loop example above, all relations were specified in the search space. Hence, for example, the phosphate group between bases X4 and X5 is positioned by selecting the closest relation among the 35 in the domain of arc $a_{4,5}$.
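Both steps, finding the adjacency arcs left out of the spanning tree and picking the closest relation in an unused domain, can be sketched as follows. This is an illustration only: the function names and the dictionary layout of a relation are hypothetical, scalars stand in for HTMs, and an absolute difference stands in for the HTMD.

```python
def uncovered_adjacent_arcs(sequence_length, tree_arcs):
    """Adjacency arcs (k, k+1) that the spanning tree uses in neither direction."""
    used = {frozenset(arc) for arc in tree_arcs}
    return [(k, k + 1) for k in range(1, sequence_length)
            if frozenset((k, k + 1)) not in used]

def closest_relation(domain, observed, distance):
    """Pick the domain entry nearest to the relation observed in the built model."""
    return min(domain, key=lambda relation: distance(relation["base_htm"], observed))

# Spanning tree of the GNRA stem-loop example.
tree_arcs = [(1, 10), (10, 9), (9, 2), (2, 3), (3, 8), (8, 7), (7, 4), (7, 6), (6, 5)]
missing = uncovered_adjacent_arcs(10, tree_arcs)

# Scalars stand in for HTMs and abs() for the HTMD; all values are made up.
domain_4_5 = [
    {"base_htm": 0.2, "phosphate_htm": "P-a"},
    {"base_htm": 0.9, "phosphate_htm": "P-b"},
]
chosen = closest_relation(domain_4_5, observed=0.8, distance=lambda x, y: abs(x - y))
```

For the example tree, `missing` lists the same four arcs named in the text, and the phosphate HTM of `chosen` is the one that would be applied between X4 and X5.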

### Constraints and Restraints

With bases and phosphate groups positioned and riboses built in between, the RNA conformation is complete. Thus, exhaustive exploration of the search space by backtracking builds all possible conformations of the modeled RNA. However, those conformations must be validated. In the context of the CSP, a set of constraints must be satisfied for a full variable instantiation to be accepted as a solution. In the modeling engine, different types of constraints can be applied by the user. See the Constraint section in the command reference.

### Modeling by fragment

To be continued ...

### Probabilistic Backtracking

To be continued ...

## Command synopsis

Usage: `mcsym [-hV] [-v level] [-j n] [-D database] [-C cache] [file]`

Options:

- `-h`: print this help
- `-V`: print the software version info
- `-v`: increase verbose level by 1
- `-j n`: allow n jobs at once
- `-D database`: use this database file instead of the `MCSYM_DB` environment variable
- `-C cache`: use this cache file instead of the `MCSYM_CACHE` environment variable

Arguments:

- `file`: input script file (starts mcsym in interactive mode if not specified)

Environment variables:

- `MCSYM_DB`: filename with full pathname (from root '/' directory) of the database file.
- `MCSYM_CACHE`: filename with full pathname (from root '/' directory) of the cache file. Default: `$HOME/.mcsym/mcsymcache-<version>.bin.gz`, where `<version>` is the database version number.

## Command Reference

All commands available in MC-Sym are listed in the MC-Sym Command Reference.

## Questions, bug report, etc.

Send all requests to adbit@iro.umontreal.ca.

## References

- Lemieux S., Major F. (2002) RNA canonical and non-canonical base pairing types: a recognition method and complete repertoire. *Nucleic Acids Res.* **30** 4250-63.
- Leontis N.B., Westhof E. (2001) Geometric nomenclature and classification of RNA base pairs. *RNA* **7** 499-512.
- Saenger W. (1984) Principles of nucleic acids structure. Springer, New York, NY, USA.

--caféinophil 11:06, 17 October 2006 (EDT)