# MC-Sym

MC-Sym (Macromolecular Conformations by SYMbolic programming) is an RNA 3-D structure modeling system.

MC-Sym generates a set of 3-D structure models from a syntactic definition of the modeled RNA molecule. Such a definition includes sequence constitution and nucleotide-nucleotide interaction types. MC-Sym creates a conformational search space from this input into which conformations are built and validated according to a set of constraints.

Current version is 4.1.3.

## Abbreviations

 base nucleobase bp base pair bs base stacking CSP constraint satisfaction problem HB hydrogen bond HTM homogeneous transformation matrix HTMD homogeneous transformation matrix distance metric RMSD root mean square deviation

## Overview

The goal of MC-Sym is to build RNA 3-D structures. The modeling approach focuses on nucleotide-nucleotide interaction: base pairing (bp) and base stacking (bs). MC-Sym explores a database that contains bp and bs extracted from high-resolution X-ray crystallographic RNA structures to assemble conformations of an RNA from a description of its bp and bs. This conformational search is implemented as a Constraint Satisfaction Problem (CSP).

## Modeling Engine

The modeling engine is designed to build 3-D structure models of an RNA molecule from the exploration of a conformational space. The conformational space is defined in terms of nucleotide-nucleotide interactions; namely bp, bs and nucleotide adjacency in the sequence. A nucleotide-nucleotide interaction is encoded as an HTM that expresses the spatial interaction between the nucleotides' base. Such HTMs are extracted from high resolution X-ray crystallographic structures and stored in a local database. This HTM database, called the relation database, constitute the conformational space. Any of these HTMs can be applied to the coordinates of a nucleotide to express the encoded nucleotide-nucleotide interaction in the building of a 3-D structure model. The HTM database is indexed by the types of the interacting nucleotides and the symbolic description of the interaction. Thus, for example, if a given modeling task is to build a structure containing a G-G outward stacking interaction, the appropriate HTMs are selected from the database to compose a specific domain. Exploration of this domain by the modeling engine will position two G nucleobases in 3-D space in several conformations, each expressing an outward stacking interaction.

### Relation

A Relation is defined as an HTM that expresses a specific interaction between two bases. The relation entity contains the HTM and the symbolic annotation that describes the interaction. This annotation falls in three categories: base pairing, base stacking, and base adjacency.

#### Base Pairing

Base pairing is defined as two bases linked together by HBs. The HBs are formed between exocyclic hydrogen donor groups (mainly NH and NH2) and acceptor groups (mainly CO and N). The classical Watson-Crick G-C and A-U bps are well-known: they are the cement of the A-from double-helix of RNA. These two canonical bps are formed by respectively three and two hydrogen bonds. However, there exists several others bps involving three, two, or a single HB between different donor and acceptor groups. Different nomenclatures exist to name these so-called non-canonical bps. In MC-Sym, the LW+ nomenclature is used (Lemieux and Major 2002). This nomenclature is a refinement of the LW (Leontis and Westhof 2001), where a bp is identified by contact edges. A contact edge is a region of the base where the HB(s) are formed. Three edges characterize any base: the Watson-Crick edge (W), the Hoogsteen edge (H) and the Sugar edge (S). In the LW+ nomenclature, each LW edge is subdivided in faces: w, h, and s (see Figure 1). Thus, a specific bp is named in LW+ by naming the LW edge and the inner LW+ face in contact for the two bases. As examples, a canonical G-C Watson-Crick bp is named Ww/Ww, while a wobble G-U is named Ww/Ws. Bifurcated HBs oscillating between two edges have their specific face: Bh for a bifurcated HB between the W and H edges, and Bs bewteen the W and S edges.

Figure 1: The LW+ base pairing nomenclature. The two canonical Watson-Crick bps and their HBs (dashed lines) are shown (top: G=C; bottom: A-U). The LW nomenclature is schematized by engulfing shaded triangles, where the arcs represent the W, H, and S edges on the four types of bases. In the two bps shown here, the W edges are in contact. Notches along the edges delimit the LW+ faces. The major groove of the A-RNA double helix is toward the H edges, whereas the minor groove is toward the S edges.

Two additional geometrical features are symbolically annotated in bps: the cis/trans relative orientation of glycosidic bonds, and the parallel/antiparallel relative orientation of the bases' plane.

The cis/trans relative orientation of glycosidic distinguishes two categories of bps (see Figure 2a). A bp is cis if the glycosidic bonds oriented from the base to the C1' of the ribose are in the same direction; else the bp is trans if they are in opposite directions. All Watson-Crick bps in A-form helix are cis. The parallel/antiparallel relative orientation of the bases' plane also split bps annotation in two categories, but from a different point of view (see Figure 2b). Let $\vec{\sigma}\,\!$ be normal to the plane of a base and oriented in such a way that all bases in A-form helix have their $\vec{\sigma}\,\!$ oriented in the same direction: toward the 3' end of the strand (see Nucleobase normal vector). Then, a bp is parallel if the bases' $\vec{\sigma}\,\!$ are oriented in the same direction; else the bp is antiparallel if the $\vec{\sigma}\,\!$ are oriented in opposite direction.

Figure 2: Other base pairing annotations. (a) The cis/trans relative orientation of glycosidic bonds. (b) The parallel/antiparallel relative orientation of the bases' plane.

Two other base pairing nomenclatures are supported by MC-Sym: the Saenger nomenclature (Saenger 1984) and the Gautheret nomenclature. The Saenger nomenclature labels specific bps by roman number from I to XXVIII. The Gautheret nomenclature applies on single hydrogen bonded bps and labels by arabic number from 29 to 137.

#### Base Stacking

Base stacking is characterized by two bases that are stacked on top of each other in a roughly parallel fashion. This spatial arrangment is induced by London dispersion between the pyrimidine and imidazole rings of bases. The RNA Ontology Consortium as recently integrated a new bs nomenclature proposed by Major and Thibault. This nomenclature is based on the enumeration of the four possible relative arrangment of the $\vec{\sigma}\,\!$ of two stacked bases (see Nucleotide stacking nomenclature).

If bases A and B are stacked, then either B is above or below A, with respect to the $\vec{\sigma}\,\!$ of A. Additionally, both $\vec{\sigma}\,\!$ of A and B can be either in the same or opposite direction. Combining both features, four bs arrangment are enumerated: A and B stacks upward (A >> B), downward (A << B), inward (A >< B) or upward (A <> B).

Base adjacency describes two bases that are adjacent, or consecutive in the sequence. Two adjacent bases are labeled either adjacent_5p or adjacent_3p, to distinguish the strand orientation. Base adjacency is maintained internally by the modeling engine, and is implicit to nucleotide numbering.

#### Base-Base HTM

Base pairing, stacking and adjacency describes a base-base interaction symbolically. To express a specific interaction between bases A and B numerically in a model, the relative 3-D arrangment of the bases is encoded as a relative frame transformation in 4X4 homogeneous coordinates: AMB. This HTM represents the rigid transformation (combination of translation and rotation) that expresses the coordinate frame aligned on B (right subscript) in the coordinate frame aligned on A (left superscript). A base's coordinate frame has its origin on the terminal nitrogen (N9 for purines, N1 for pyrimidines) and its XY plane aligned with the plane of the base (see Figure 3).

Figure 3: Base coordinate frame. Hydrogen atoms are not shown. (a) Frame for a purine (here an adenine). The origin is is on the N9 atom, and the XY plane is aligned on the plane described by atoms N9, C4 and C8, which are shared accross all purines. (b) Frame for a pyrimidine (here an cytosine). The origin is is on the N1 atom, and the XY plane is aligned on the plane described by atoms N1, C2 and C6, which are shared accross all pyrimidines.

Using such a relative HTM to express A interaction with B allows to apply it independently of the global position of A. Given VA the coordinates of base A in the global frame, the global coordinates of base B (VA) are easly obtained using AMB:

$V_B = {}^{A}M_{B} \cdot V_A \,\!$

A specific interaction between two bases, A and B, identified in a given 3-D structure, is encoded as the relative HTM AMB that is computed by the following equation, where OMA and OMB are expressing the local frames of bases A and B, respectively, in the global frame O of the whole 3-D structure.

${}^{A}M_{B} = {}^{O}M_{A}^{-1} \cdot {}^{O}M_{B} \,\!$

### Relation Database

The previous section defined a relation between two bases as an entity containing a symbolic annotation of the interaction (base pairing, stacking or adjacency) and a relative HTM that expresses the transformation positioning the second base with respect to the first. Such relation entities are extracted from high-resolution X-ray crystallographic structures and stored in a database: the relation database. This database defines the maximal search space of the modeling engine. Extraction of a relation between bases A and B from a 3-D structure is two-steps: 1) annotate the interaction between A and B (using an embedded MC-Annotate) and 2) compute {}^{A}M_{B} using the equation above. Because a relation is directed by definition, the relation between B and A must also be stored. Hence, two relations are extracted for each pair of interacting bases from the reference 3-D structure.

The content of the current version of the relation database (4.1.0) was created from the RNA X-ray crystallographic structures deposited in the public database RCSB PDB before may 2005 and with resolution of 3 Å or better. A total of 366034 relations were extracted, from which 313902 where retained after a similutude filtering of 0.1 Å. This filter is eliminating the neighboorhood of each relation, where two relations are neighbor if two conditions are met: 1) both base ordered pairs have matching A-C-G-U types and 2) the HTMD between both HTMs is lower than the filter threshold (0.1 Å). The filter algorithm is always eliminating the largest neighboorhood first.

Each relation item in the database is indexed by the ordered pair of base types, i.e. a pair $(a,b) \in \lbrace A,C,G,U \rbrace \times \lbrace A,C,G,U \rbrace \,\!$, and a list of symbols describing the base pairing, stacking or adjacency that annotate this interaction.

### Search Space

The modeling engine explores a search space to build conformations of an RNA. This search space is composed of relations, as described in Section 3.1. These relations are used to position the modeled RNA's bases in 3-D space relatively to one another, expressing the RNA base-base interactions. The maximal search space is defined by the relation database content (see Section 3.2). In theory, the database could be explored in its wholeness each time a base is to be positioned; however the computational cost would be prohibitive. Moreover, a specific modeling tasks is always looking at specific base-base interactions. Thus, the search space must be restricted to suit each specific modeling task, indeed.

A search space is specified by enumerating each relations that composes the structure to model. This is achieved by representing the structure as a graph of relation. In this directed graph, vertices vk represents each nucleotides k by their ID (chain and sequence number) and type (A, C, G or U). An arc aA,B is drawn from vertex vA to vA if nucleotides A and B have a relation. A symbolic expression that describes the type of relation between both nucleotides is associated to each arc. This symbolic expression will be parsed by MC-Sym into a query to the database of relation, where the matching relations will compose a domain for the associated arc. When all arcs are thus processed, the search space is completely defined with specific domains associated to each arc.

Here, the user task is to input the graph of relation with symbolic expression associated to each arc. Command sequence is used to specify the vertices of the graph: nucleotides type and numbering. Command relation is used to enumerated the arcs of the graph with associated symbolic expression. Below is an example of search space specification.

#### GNRA stem-loop: a search space specification example

Here is an example of user input and associated search space for a small GNRA stem-loop of sequence 5'-GGGGAGACCC-3'. The structure to model is composed of a three-bps standard helix capped with a GNRA tetraloop of sequence GAGA. The GNRA tetraloop is a classical RNA motif where the N, R, and A bases are stacked and the first G and last A form a S/H bp. Figure 4a shows an input example that specify a search space for this particular structure (see MC-Sym Command Reference for the input syntax). Figure 4b illustrates the graph of relations representing the input.

The first command, sequence, set up the actual sequence of nucleotides that composes the structure and defines a numering: from X1 to X10. These are the 10 vertices of the graph of relation that describes this search space: from v1 to v10.

The second command, relation, enumerates all relations between the bases. Each relation is specified by the bases' ID, the symbolic expression that describes the relation and a database sampling size. Only one direction of each relation is necessary in the input, the reversed relation is automatically added to the search space. For example, the item X4 X7 { S/H && antiparallel } 10 queries the database for relations between a G and an A that are S/H and antiparallel bp. 7974 relations matches this query, but only 10 are retained for this domain. A sampling algorithm ensures that any sampling of a relation set maximizes the space coverage of the complete set. Sampled domains are stored in a cache file, where they can be restored later, saving the computational overhead cost of running the sampling algorithm again. This set of 10 relations is associated to arc a4,7. Arc a7,4 is implicitly added to the graph of relation with expression X7 X4 { H/S && antiparallel } 10.

In this example, base adjacency was not specified in the input. Base adjacency is always implicit to the consecutiveness of nucleotides in the sequence, and is maintained internally. For example, the item X5:X7 { upward } 8 is only restricting base stacking between bases X5 and X6, and bases X6 and X7. However, adjacency is silently added to the queries to the database, since the bases are consecutives. The same applies to item X4 X5 { !stack } 75, where the query is parsed to !stack && adjacent_5p (any relation between adjacent bases that are not stacked). Figure 4c lists the queries that are parsed by MC-Sym and sent to the database.

Figure 4: Search space for a GNRA stem-loop. (a) Input to MC-Sym. (b) Representation of the input as a graph or relation. (c) Queries associated to each arc parsed by MC-Sym and sent to the relation database, with their sampling size.

#### Spanning tree

Exploration of the search space is defined as a CSP, where the variables are the arcs of the graph of relations and their domain the associated relation set obtained from the database. Contraints are specified by the user, they will be discussed below. Solving this CSP means instantiating all variables in such a way that all constraints are satisfied. Here, assigning a value to a variable means selecting a relation from the domain associated to the addressed arc and applying its HTM to position the base at the end of the arc in the frame of the base at the beginning of the arc. In the GNRA stem-loop example presented above, arc a4,7 is a variable; its domain a set of 10 relations. Assigning a relation to this variable will position base X7 in the frame of base X4, expressing the encoded S/H antiparallel bp.

The arcs contained in any spanning tree of the graph of relation are sufficient to position all the bases in the graph. The root of the spanning tree will act as the global structure. Starting from the root, traversal of the tree will position all bases using the relations in the domain of the traversed arcs. Then, a depth-first traversal such as backtracking will explores all possible assignation to the tree's arcs. If a complete assignation satisfy to all constraints, then a valid conformation is built.

More than one spanning tree can cover a given graph of relation. Hence, selection of a spanning tree and of its root is part of the user input, using the command backtrack, by listing paths from the tree root to the leaves. Figure 5 shows a selected rooted spanning tree for the GNRA stem-loop example presented in Figure 4, and how it is input as a MC-Sym command.

Figure 5: Selection of a rooted spanning tree for a GNRA stem-loop. (a) Input to MC-Sym. (b) Representation of the spanning tree. Blue circles are vertices, the open one is the root. Arcs are arrows; they are numbered as scheduled in the command.

A spanning tree over a graph of relation contains a subset of its arcs. This subset is almost never complete, because the graph can contain cycles. In the GNRA stem-loop example above, the selected spanning tree is not containing arcs a1,2, a3,4, a4,5, and a8,9. Three of those overlooked arcs are expressing staking in the stem part, and the remaining other is closing the tetraloop. Thus, it is clear that these structural information must be taken into account. This is currently one of the greatest challenge of MC-Sym. The current solution is to integrate constraints and restraints (see below) that will address these structural features.

Once a rooted spanning tree is selected, the search space is complete and exploration can begins. Assignation of a relation to each arc variable of the tree, in the order scheduled by the backtrack command and where each relation is selected from the associated arc's domain, position all bases from one another. In the GNRA stem-loop example, assignation proceeds as follow:

1. From the frame of base X1, which act as the global frame, base X10 is positioned by selecting a relation from the domain of arc a1,10.
2. From X10 frame, X9 is positioned by selecting a relation from a10,9 domain.
3. From X9 frame, X2 is positioned by selecting a relation from a9,2 domain.
4. From X2 frame, X3 is positioned by selecting a relation from a2,3 domain.
5. From X3 frame, X8 is positioned by selecting a relation from a3,8 domain.
6. From X8 frame, X7 is positioned by selecting a relation from a8,7 domain.
7. From X7 frame, X4 is positioned by selecting a relation from a7,4 domain.
8. Again from X7 frame, X6 is positioned by selecting a relation from a7,6 domain.
9. Finally, from X6 frame, X5 is positioned by selecting a relation from a6,5 domain.

### Backbone Construction

As it was presented above, instanciation of the arc variables of a selected rooted spanning tree of the graph or relation that represents the search space position all bases from one another. However, the conformation is incomplete: backbone is missing. The modeling engine has builtin fonctionnalities to build a numerical approximation of a backbone conformation that fits the bases conformation.

The backbone construction sub-process is able to build a ribose conformation given the position of a base and two phosphate groups. This ribose conformation is built in constant time by estimating the ribose geometrical features (pseudorotation angle and glycosidic torsion). The quality metric of a ribose conformation is defined by the RMSD between the length of the implicit bonds that link the ribose to both phosphate (C5'-O5' and C3'-O3' bonds) and their standard value.

Before a ribose conformation can be built, phosphates group must also be positioned with the bases. Phosphate groups (PO4) are considered rigid bodies; as are bases. Thus, they are positioned along the bases using relative HTMs. More precisely, when a base is positioned from another and both bases are adjacent, a phosphate groups is postioned in between, in the same frame. The relative HTM that positions a phosphate group between two adjacent bases is associated to the relation entity that represents this base adjacency. Hence, all base adjacency relations are composed of a pair of HTMs: one to position the second base in the frame of the first, and the other to position the phosphate group in betweem, still in the frame of the first base.

In this way, phosphate groups are positioned along with the surrounding adjacent bases. However, there is no guarantee that a given spanning tree selected for exploration of the search space will cover all adjacent arcs. Consequently, they will be a missing phosphate group for each adjacent arc of the graph of relation that is not covered by the selected spanning tree. In the GNRA example above, four adjacent arcs are missing from the spanning tree illustrated in Figure 5: a1,2, a3,4, a4,5, and a8,9.

A missing phosphate group is positioned by selecting the relation from the unused domain of the overlooked adjacent arc that is closest to the implicit relation currently expressed by the arc' bases in the built conformation. This closedness between two relation is mesured using the HTMD between their base-base HTM. Once the closest relation from the unused domain is found, its base-phosphate HTM is applied to position the phosphate group.

This missing phosphate group positioning method implies that relation domains must have been specified in the search space, even if these domains are not part of the spanning tree. In the GNRA stem-loop example above, every relations where specified in the search space. Hence, for example, the phosphate group between bases X4 and X5 is positioned by selecting the closest relation among the 35 in the domain of arc a4,5.

### Constraints and Restraints

With bases and phosphate groups positioned and with riboses built in between, the RNA conformation is complete. Thus, exhaustive exploration of the search space by backtracking will build all possible conformations for the modeled RNA. However, those conformations must be validated. In the context of the CSP, a set of constraints are to be statisfied for a full variable instanciation to be elected as a solution. In the modeling engine, different types of constraint can be applied by the user. See the Constraint Section in the command reference.

### Modeling by fragment

To be continued ...

### Probabilistic Backtracking

To be continued ...

## Command synopsis

Usage: mcsym[-hV] [-v level] [-j n] [-D database] [-C cache] [file]

-- Options --
-h            print this help
-V            print the software version info
-v            increase verbose level by 1
-j  n         Allow n jobs at once
-D  database  Use this database file instead of MCSYM_DB environment variable
-C  cache     Use this cache file instead of MCSYM_CACHE environment variable

-- Arguments --
file          Input script file (starts mcsym in interactive mode if not specified)

-- Environment variable --
MCSYM_DB      Filename with full pathname (from root '/' directory) of database file.
MCSYM_CACHE   Filename with full pathname (from root '/' directory) of cache file.
Default: \$HOME/.mcsym/mcsymcache-<version>.bin.gz, where <version> is the
database version number.


## Command Reference

All commands available in MC-Sym are listed in the MC-Sym Command Reference.

## Questions, bug report, etc.

Send all requests to adbit@iro.umontreal.ca.

## References

1. Lemieux S., Major F. (2002) RNA canonical and non-canonical base pairing types: a recognition method and complete repertoire. Nucleic Acids Res. 30 4250-63.
2. Leontis N.B., Westhof E. (2001) Geometric nomenclature and classification of RNA base pairs. RNA 7 499-512.
3. Saenger W. (1984) Principles of nucleic acids structure. Springer, New York, NY, USA.

--caféinophil 11:06, 17 October 2006 (EDT)