MC-SYM FAQ

     Welcome to the MC-Sym FAQ. It's goal is to give modelers hints about editing MC-Sym scripts to produce high quality RNA 3-D structures. Each section focuses on a particular aspect of modeling with MC-Sym, such that modelers can find easily the answers to their questions. Here are the topics covered:
tRNA model on crystal
MC-Sym model (blue) superimposed on X-ray crystal structure (gold; PDB file 1EVV). The RMSD of the superposition is 4.0 Angstroms.

References & Links:

  • Parisien M, Major F. Nature. (2008) 452:51-55. (pubmed) (Nature)
  • The MC-Sym web page
  • Dr. Francois Major's email: major at iro.umontreal.ca
  • Marc Parisien's email: parisien at iro.umontreal.ca



MY FIRST RNA 3-D STRUCTURE

     Let's look at the whole process of modeling an RNA 3-D structure from sequence. Consider the sequence fragment 72-105 from Haloarcula Marismortui's 23S rRNA (PDB file 1JJ2 [1]), with nucleotide A96 mutated into U:
5'-CCAUGGGGAGCCGCACGGAGGCGAUGAACCAUGG-3'
      |    |    |    |    |    |    |
     75   80   85   90   95  100  105

Step 1. Obtain a secondary structure.

     The first step in the modeling process is to obtain a secondary structure. Classical secondary structures, featuring only canonical base pairs (the cWW G=C, A=U and G=U) are difficult to model in 3-D, because of the degrees of freedom an unpaired nucleotide has. The more base pairs a secondary structure has, the easier it will be to project it in 3-D. Using MC-Fold [2][link], we obtain several sub-optimal secondary structures:

      >hairpin
      CCAUGGGGAGCCGCACGGAGGCGAUGAACCAUGG
   1) (((((((((((((((....)))))))))))))))  [view]  [edit]  [submit]
   2) ((((((((((((((((..))))))))))))))))  [view]  [edit]  [submit]
   3) ((((((((((((((.(...)))))))))))))))  [view]  [edit]  [submit]
   4) ((((((((((((((...))))))...))))))))  [view]  [edit]  [submit]
   5) ((((((((((((((...)))))))...)))))))  [view]  [edit]  [submit]
   6) ((((((((((((((...)))))...)))))))))  [view]  [edit]  [submit]
   7) ((((((((((((((...))))))))...))))))  [view]  [edit]  [submit]
   8) ((((((((((((((...)))))..).))))))))  [view]  [edit]  [submit]
   9) ((((((((((((((......))))))))))))))  [view]  [edit]  [submit]

     Because MC-Fold's energetical contributions for non-canonical base pairs are approximated, and that base triples and many non-contiguous nucleobase stackings are not even taken into account, the solution structure (PDB file 1JJ2) is ranked #6. This secondary structure has the correct base pairing registry to allow for a K-turn motif [1]. The confidence on a secondary structure can be heightened by low-resolution structure probing methods like SHAPE [3] or enzymatic and chemical probing (see for instance [4]). Keep MC-Fold's result web page opened for further use.

Step 2. Model by homology known motifs.

     Here, our assymetrical internal loop is most likely a K-turn motif. What we'll do is scan the PDB for this motif using a motif descriptor, and get actual 3-D instances of that motif. Then, by homology modeling, we can adapt the sequence in the 3-D instances to suit the sequence we are modeling. Consider the following MC-Search [5][link] script:

//================================================
// MC-Search script to search for K-turns
// This motif is composed of two strands; A and B
//================================================
sequence( RNA A1 NGGAN )
sequence( RNA B1 NGAAGAAN )
relation(
  A1 B8 { pairing }
  B1 A5 { pairing }
)

     Running MC-Search on that script will bring you to a web page which will display an index of files in your working directory. Working directories, identified by unique 10-digit keys, are used to store results from the MC-Pipeline suite of programs. All 3-D instances which correspond to the motif descriptor are put in the file results.pdb. The search should have returned only one 3-D instance.

     Modeling by homology for nucleic acids necessitate two operations: 1) mutating nucleobases and 2) mutating base pairs. These operations can be performed via the file commands.html. In the "sequence mutation commands" window, copy-and-paste this text:

mutate( nucleobase B4 U )
mutate( basepair A1 B8 G C )
mutate( basepair A5 B1 G C )

     The nucleotides addressed by the mutations are simply called in the same manner as in the MC-Search script; for example, B4 is the fourth nucleotide of the B strand. Now, the results.pdb file should feature all your mutations. Please take note of your working directory key, as it will be used later.

Step 3. Obtain an MC-Sym script.

     From Step. 1, click on the edit button of row #6. This will bring you to a web page of options to the automatic MC-Sym script generator. Leave all options to their default values. Click Download and save your MC-Sym script on your computer.

     Often, the MC-Sym script will have to be edited. Here, we will modify it to use our K-turn motif. Make MC-Sym load our K-turn by adding a library command, like:

k_turn = library(
    pdb( "../BINuPWoBcT/results.pdb" )
    #1:#5, #6:#13 <- A6:A10, A22:A29
    rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

     Where we instruct MC-Sym to get our K-turn in our previous MC-Search session with directory key BINuPWoBcT (your directory key may be different, thus your script will need to use your key instead).

     Furthermore, the backtrack section has to be updated to now use our K-turn fragment instead of the more basic NCMs. The final MC-Sym script is here.

Step 4. Obtain tertiary structures.

     Go to the MC-Sym web page [6][link], and submit your MC-Sym script. You will then again be brought to a working directory. Structures should start appearing, suffixed by -0001.pdb.gz, as you refresh your working directory page. Let MC-Sym complete the structure generation; in 30 minutes or 1000 structures, whichever comes first.

Step 5. Process tertiary structures.

     Once the structures are generated, there are several post-processing steps that can be applied to the structures. These steps can be found in the commands.html file of your MC-Sym working directory. In the "Analysis" section, check the Relieve and the P-Score options, then click "Execute". Relieving the 3-D structures will remove any steric clashes and will correct the positions of the backbone atoms. The P-Score is an unpublished pseudo-potential energy function which measures the A-RNA likelyness of the structures.


References.

  1. Klein et al. EMBO J. 2001 20:4214-21.
  2. Parisien & Major. Nature. 2008 452:51-5.
  3. Merino et al. J Am Chem Soc. 2005 127:4223-31.
  4. Cléry et al. Nucleic Acids Res. 2007 35:1868-84.
  5. Olivier et al. Mol Cell Biol. 2005 25:4752-66.
  6. Major et al. Science. 1991 253:1255-60.




EXAMPLES FROM THE LITTERATURE


Article PubMed
Lemieux S, Chartrand P, Cedergren R, Major F.
   Modeling active RNA structures using the intersection of conformational space: application to the lead-activated ribozyme.
   RNA. 1998 4:739-49.
Parisien M, Major F.
   The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data.
   Nature. 2008 452:51-5.
McGraw AP, Mokdad A, Major F, Bevilacqua PC, Babitzke P.
   Molecular basis of TRAP-5'SL RNA interaction in the Bacillus subtilis trp operon transcription attenuation mechanism.
   RNA. 2008 Epub.




RNA DECOYS DATABASE

This section hosts the RNA Decoys Database. It's purpose is to offer the RNA modeling community 1) various MC-Sym scripts tackling disparate modeling challenges, 2) 3-D decoys with various base pair suites to challenge force-fields. All 3-D structures have been refined to at least a G RMS less than 100 Kcal/mol/A (so they may still need further refining).

Reference
PDB file 2JXV
2JXV

   This structure in an hairpin featuring a 2_3 internal loop. All possible base pairing hypotheses, which saturates the shortest strand, have been modeled. Final models incorporate many variants of this internal loop. This example shows how to model internal loops, and how to incorporate them in final models.
   Best       Model       RMSD       INF       SIM   
Reference 2JXV --- --- ---
RMSD 4230 1.36 0.966 1.408
INF 0045 1.39 0.977 1.423
SIM 4230 1.36 0.966 1.408
PDB file 1D4R
1D4R

   This structure is a duplex featuring suites of non-canonical base pairs. Models feature many different suites of base pair types.
   Best       Model       RMSD       INF       SIM   
Reference 1D4R --- --- ---
RMSD 0643 2.70 0.907 2.977
INF 1139 5.20 0.967 5.378
SIM 4474 2.77 0.935 2.962
PDB file 1SA9
1SA9

   This structure is a duplex featuring a tandem of G=A base pairs in the sheared configuration. However, models may feature the tandem in other base pair configurations.
   Best       Model       RMSD       INF       SIM   
Reference 1SA9 --- --- ---
RMSD 3772 0.99 1.000 0.990
INF 3772 0.99 1.000 0.990
SIM 3772 0.99 1.000 0.990
PDB file 1MIS
1MIS

   This structure is a duplex featuring a tandem of G=A base pairs in the imino configuration. However, models may feature the tandem in other base pair configurations.
   Best       Model       RMSD       INF       SIM   
Reference 1MIS --- --- ---
RMSD 1233 0.99 0.953 1.039
INF 1233 0.99 0.953 1.039
SIM 1233 0.99 0.953 1.039
PDB file 354D
354D

   This structure is a duplex featuring suites of non-canonical base pairs. Models feature many different suites of base pair types.
   Best       Model       RMSD       INF       SIM   
Reference 354D --- --- ---
RMSD 3345 0.95 0.950 1.000
INF 1035 0.96 1.000 0.960
SIM 1035 0.96 1.000 0.960
PDB file 283D
283D

   This structure is a duplex featuring suites of non-canonical base pairs. Models feature many different suites of base pair types.
   Best       Model       RMSD       INF       SIM   
Reference 283D --- --- ---
RMSD 0305 1.01 0.966 1.046
INF 2839 1.03 1.000 1.030
SIM 2839 1.03 1.000 1.030
PDB file 430D
430D

   This structure is a hairpin featuring suites of non-canonical base pairs. Models feature many different suites of base pair types.
   Best       Model       RMSD       INF       SIM   
Reference 430D --- --- ---
RMSD 6841 1.18 0.893 1.321
INF 4794 1.44 0.935 1.540
SIM 6841 1.18 0.893 1.321
PDB file 1NBR
1NBR

   This structure is a hairpin featuring only canonical base pairs. Models feature many different suites of base pair types.
   Best       Model       RMSD       INF       SIM   
Reference 1NBR --- --- ---
RMSD 6139 1.83 0.921 1.987
INF 6915 2.75 0.984 2.794
SIM 6139 1.83 0.921 1.987
PDB file 2TPK
2TPK

   This structure is a pseudoknot featuring only canonical base pairs. Loop 3 has not been modeled. Boxed base pairs are those forming a coaxial stacking.
   Best       Model       RMSD       INF       SIM   
Reference 2TPK --- --- ---
RMSD 1926 1.36 0.874 1.556
INF 0333 2.70 0.947 2.851
SIM 1926 1.36 0.874 1.556
PDB file 2Q1R
2Q1R

   This structure is a duplex with two isolated non-canonical base pairs.
   Best       Model       RMSD       INF       SIM   
Reference 2Q1R --- --- ---
RMSD 5119 1.04 0.972 1.067
INF 4888 1.05 1.000 1.050
SIM 4888 1.05 1.000 1.050




WHY DON'T I OBTAIN MODELS
Modeling RNA structures with MC-Sym is still a business of the initiated. The most common problem for not obtaining models comes from a bad mixture of run parameters in the MC-Sym script. Let's go through one example in details, in which MC-Sym had not produced any models despite running for a few minutes:

Step 1. Confirm that MC-Sym is running.

We use the Condor distributed job queing system. Condor may not start executing MC-Sym right away for multiple reasons, and this could be an explanation why MC-Sym doesn't produce models. To confirm that MC-Sym is running, open the file `condor.log` found in your working directory. In it you should have the confirmation that the job has been submitted and that it is executing:
000 (7355.000.000) 11/25 00:57:11 Job submitted from host: <10.0.0.156:10054>
...
001 (7355.000.000) 11/25 00:57:14 Job executing on host: <10.0.0.233:11011>
...
An MC-Sym script that contains an error will make MC-Sym abort, and this will be logged in this `condor.log` file too.

Step 2. Locate the `Build Order` section.

Now that we know that MC-Sym is running, we will look at the output file of MC-Sym. Open the file `mcsym.out` found in your working directory. In it, locate the `Build order` section; it should look similar to:
> Build order (4.27595e+09 potent solutions):
[0  ]: Place[ncm_01] {10}
[1  ]: Merge[ncm_02]+{[A7|R:RIB],[A10|R:RIB]} {1}
[2  ]: Merge[ncm_03]+{[A6|R:RIB],[A11|R:RIB]} {1}
...
This section is MC-Sym's step-by-step plan to build the structure. When MC-Sym successfully reaches the last step it will output a model. However, many reasons can prevent MC-Sym from attaining a particular step.

Step 3. Locate the last `Reached` section.

Now, locate the last line in the file, to see up to where MC-Sym was able to go:
Reached level 0 Place[ncm_01]
Here, we can see that MC-Sym was able to reach level 0, but not level 1: the problem lies there in the merging of ncm_02 on ncm_01. From the build order section we can see that level 1 is the merging of ncm_02.

Step 4. Inspect the `Building FG` section.

Locate the `Building FG "ncm_02"` section in the same file:
===== Building FG "ncm_02" =====

o            Reading 21 PDB files into LibraryFG "ncm_02"

o 1 models added into library "ncm_02" (read 53).
Here, we see that only one instance has been loaded in ncm_02's building library, so it may not merge well with any of ncm_01's library.

Step 5. Edit your MC-Sym script.

To paliate to this problem, edit your MC-Sym script and change:
ncm_02 = library(
	pdb( "..." ) ...
	rmsd( 1.0 sidechain && !( pse || lp || hydrogen ) ) )
for a library with more instances, say:
ncm_02 = library(
	pdb( "..." ) ...
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )
By reducing the similarity criteria for this library, MC-Sym should then load more instances, hence a greater chance to find an instance which will merge with one of ncm_01's library!

Step 6. Cancel your current job and re-submit.

In the `commands.html` file in your working directory, click on the `Cancel` button; this will cancel your current doomed MC-Sym run. Re-submit your newly edited script. When re-submitting your script, please re-use your working directory key!




MODELING LONG HAIRPIN LOOPS
Short hairpin loops abund in the PDB so we can use them as building blocks. Furthermore, specific sequences adopt specific hairpin folds, consider the GNRA or the UNCG tetra-loops. In modeling such loops, a modeler wishes to use the actual folds found in resolved structures. MC-Sym can readily exploit such structural knowledge by the use of an NCM. However, in modeling, one has to deal with the accumulated data. Hence, long hairpin loops are also subject to modeling.
tRNA ASP anticodon loop Consider fragment 26-44 of Saccharomyces Cerevisiae tRNA(asp) transcripts [1]. Although this sequence is canonical, whereas actual tRNAs feature modified nucleotides and additional base pairs [2], the whole transcript has been shown to adopt the clover-leaf shape [3].
Figure 1. tRNA(asp) fragment 26 to 44, featuring a long hairpin loop.

Step 1: Identify the fragments to use.

Fragments available in MC-Sym are 2-, 3- and 4-mers. These N-mers can be merged to others (nucleotide 36), or can be merged to NCMs (nucleotides 31 and 39). As shown in Figure 1. we will model this loop using fragments 01, 02 and 03, which are 3-, 4- and 3-mers, respectively. Notice that fragments 01 and 03 do not overlap. Notice also that fragment 01 overlaps one nucleotide, 31, with the NCM I (30,31,39,40). Fragment 02 overlaps one nucleotide, 39, also with the same NCM.

Step 2: Load the fragments.

Make MC-Sym load these three fragments. Add these library commands in the "Cycles" section:

lnk_01 = library(
	pdb( "MCSYM-DB/ss3/CUU/*.pdb.gz" ) #1:#3 <- A31:A33
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

lnk_02 = library(
	pdb( "MCSYM-DB/ss4/CGCG/*.pdb.gz" ) #1:#4 <- A36:A39
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

lnk_03 = library(
	pdb( "MCSYM-DB/ss3/GUC/*.pdb.gz" ) #1:#3 <- A34:A36
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

Notice the ss3 and ss4 keywords; they invoke the use of the 3- and 4-mers, respectively.

Step 3: Use the fragments.

Make MC-Sym use the fragments. Add the merge commands in the "Backtrack" section, after having placed the NCMs of the stem:

//----- stem -----
        ....      

//----- loop -----
merge( lnk_01 2.0 )
merge( lnk_02 2.0 )
merge( lnk_03 2.0 )

To merge a fragment, at least one of the nucleotides of the fragment must have been placed previously. Here, we can merge the fragment lnk_01 because nucleotide 31 has been previously placed (in the previous "stem" subsection). Then we merge fragment lnk_02, wich will merge on nucleotide 39. Finally, we merge lnk_03, via nucleotide 36.

References:

  1. Merino et al. J Am Chem Soc. 2005 127:4223-31.
  2. Durant & Davis. J Mol Biol. 1999 285:115-31.
  3. Perret et al. Biochimie. 1990 72:735-43.

MODELING INTERNAL LOOPS
Internal loops are integral parts of RNA 3-D structures. Here, we distinguish two types of internal loops; the symetric and the assymetric loops. Often, symetric interior loops can be closed by base pair mismatches, thus can be modeled using NCMs. On the other hand, assymetric loops may force some nucleotides to be either unpaired, form base triples and/or stack inside the helix. There are too few instances of interior loops in the PDB, specially if the assymetry is large, to be used as is as building blocks. Therefore, we must rely on a simple procedure to build any loop of any sequence.
HIV-1 TAR hairpin Consider the HIV-1 TAR RNA hairpin [1]. This hairpin features a three-nucleotide assymetric interior loop. The loop is used as a recognition site for the TAT protein [2]. We will build this interior loop by using two single-stranded RNA stretches; 01 and 02, as shown in Figure 2.
Figure 2. HIV-1 TAR RNA hairpin, featuring an interior loop.

Step 1: Identify the fragments to use.

Fragments available in MC-Sym are 2-, 3- and 4-mers. These N-mers can be merged to others, or can be merged to NCMs (nucleotides 6, 23 and 24). As shown in Figure 2. we will model this interior loop using fragments 01 and 02, which are 2- and 4-mers, respectively. Notice that fragment 01 bridges two NCMs: the NCM VI (5,6,24,25) with the NCM V (10,11,22,23), whereas fragment 02 does NOT bridge these two NCMs. We choose to bridge by the side that uses the less nucleotides; here with fragment 01.

Step 2: Load the fragments.

Make MC-Sym load these three fragments. Add these library commands in the "Cycles" section:


lnk_01 = library(
	pdb( "MCSYM-DB/ss2/CU/*.pdb.gz" ) #1:#2 <- A23:A24
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

lnk_02 = library(
	pdb( "MCSYM-DB/ss4/AUCU/*.pdb.gz" ) #1:#4 <- A6:A9
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

Notice the ss2 and ss4 keywords; they invoke the use of the 2- and 4-mers, respectively.

Step 3: Use the fragments.

Make MC-Sym use the fragments. Add the merge commands in the "Backtrack" section, after having placed the NCMs of the upper stem:

//----- upper stem -----
           ...

//----- interior loop -----
merge( lnk_01 2.0 )
merge( ncm_06 2.0 )
merge( lnk_02 2.0 )

//----- lower stem -----
           ...

To merge a fragment, at least one of the nucleotides of the fragment must have been placed previously. Here, we can merge the fragment lnk_01 because nucleotide 23 has been previously placed (in the previous "upper stem" subsection). Then we place NCM 06, a part of the lower stem. Finally, to complete the interior loop, we merge fragment lnk_02, wich will merge on nucleotide 6.

References:

  1. Aboul-ela et al. Nucleic Acids Res. 1996 24:3974-81.
  2. Aboul-ela et al. J Mol Biol. 1995 253:313-32.


MODELING MULTI-BRANCHED LOOPS
Multi-branched loops are the result of local RNA folding events; hairpins form, then meet at multi-branched loops. They have been the subject of numerous studies, notably from the Westhof group [1] and from the Turner and Mathews groups [2,3].
Hammerhead-like 1NYI Consider this RNA structure, similar to the hammerhead ribozyme [4]. This RNA has the particular feature to be able to cleave itself [5,6]. We will build this multi-branched loop by using three single-stranded RNA stretches; 01, 02 and 03, as shown in Figure 3.
Figure 3. Hammerhead-like ribozyme, featuring a multi-branched loop.

Step 1: Identify the fragments to use.

Fragments available in MC-Sym are 2-, 3- and 4-mers. These N-mers can be merged to others, or can be merged to NCMs (nucleotides 6, 25, 26, 37 and 38). As shown in Figure 3. we will model this multi-branched loop using fragments 01, 02 and 03, which are 2-, 2- and 4-mers, respectively. Notice that fragment 01 bridges two NCMs: the NCM VII (10,11,24,25) with the NCM XII (26,27,36,37). Furthermore, fragment 02 bridges also two NCMs: the NCM XII (26,27,36,37) and the NCM XIII (5,6,38,39). Fragment 03 does NOT bridge any NCMs, but completes the multi-branched loop.

Step 2: Load the fragments.

Make MC-Sym load these three fragments. Add these library commands in the "Cycles" section:

lnk_01 = library(
	pdb( "MCSYM-DB/ss2/AA/*.pdb.gz" ) #1:#2 <- A25:A26
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

lnk_02 = library(
	pdb( "MCSYM-DB/ss2/UC/*.pdb.gz" ) #1:#2 <- A37:A38
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

lnk_03 = library(
	pdb( "MCSYM-DB/ss4/CUGA/*.pdb.gz" ) #1:#4 <- A6:A9
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

Notice the ss2 and ss4 keywords; they invoke the use of the 2- and 4-mers, respectively.

Step 3: Use the fragments.

Make MC-Sym use the fragments. Add the merge commands in the "Backtrack" section. We start the modeling with the multi-branched loop, then we extend the stems from there:

//----- multi-brached loop -----
ncm_07
merge( lnk_01 2.0 )
merge( ncm_12 2.0 )
merge( lnk_02 2.0 )
merge( ncm_13 2.0 )
merge( lnk_03 2.0 )

//----- stem_1 -----
           ...

//----- stem_2 -----
           ...

//----- stem_3 -----
           ...

To merge a fragment, at least one of the nucleotides of the fragment must have been placed previously. Here, we can merge the fragment lnk_01 because nucleotide 25 has been previously placed (in the first building instruction "ncm_07"). Then we place NCM 12, a part of the second stem. Then we merge lnk_02, and add to it the NCM 13, a part of the third stem. Finally, we add fragment 03, to complete the multi-branched loop. The rest of the stems are built from this loop toward their extremities.

References:

  1. Lescoute & Westhof. RNA. 2006 12:83-93.
  2. Tyagi & Mathews. RNA. 2007 13:939-51.
  3. Diamond et al. Biochemistry. 2001 406971-81.
  4. Dunham et al. J Mol Biol. 2003 332:327-36.
  5. Kruger et al. Cell. 1982 31:147-57.
  6. Guerrier-Takada et al. Cell. 1983 35:849-57.


MODELING BRIDGING LOOPS
Because of the exponential nature of the backtracking algorithm used in MC-Sym, it is best to wait to the last minute to add bridging loops, i.e. loops that bridges parts of the modeled RNA structure. For example, nucleotides 122 to 126 in the P4-P6 domain of group I intron [1] could be modeled last, after modeling all of it's helices. If multiple bridging loops are modeled, build each one in a separate MC-Sym script, in a sequential fashion, as to add the next loop on partial structures containing the previously modeled ones.

Here is a simple MC-Sym script generator which generates the essential script parts to model a bridging loop. Simply enter the chain ID, and the first and last nucleotides of the loop. The first and last nucleotides must already be modeled (from a previous MC-Sym run, or from a partial model in a PDB file).
 Bridging Loop Generator        
      Chain ID:
      First nt:
      Last nt:

      
Here is an MC-Sym script which uses the script parts of the generator: loopin.mcc.

References:

  1. Cate et al. Science. 1996 273:1678-85.


MODELING BASE TRIPLES
Base triples occur in RNA structures, therefore it can be useful to include them in the 3-D models.
RNA thermometer 2GIO Consider this RNA structure, which acts as a thermometer [1]. It features a base triple which we'll model using implicit relations (as opposed to explicitely building the base triple using relations), as shown in Figure 4.
Figure 4. RNA thermometer, featuring a base triple.

Step 1: Build the triple as an internal loop.

Add the single-stranded fragments 01 and 02, as if modeling an internal loop (described here).

Step 2: Enforce the pairings to form the triple.

Force MC-Sym to build structures such that the base triple is present. Add these pairing commands in the "implicit_relation" section (right after the "sequence" section):

implicit_relation(
  // these constraints will define the base triple
  A7 A23 { pairing }
  A8 A23 { pairing }
)
Here, we make sure that the nucleotides (A7,A23) and (A8,A23) are paired, using the "pairing" keyword. However, it is possible to further specify the type of pairings, see modeling base pair types.

References:

  1. Chowdhury et al. Embo J. 2006 25:2487-97.


MODELING WITH DISTANCE CONSTRAINTS
Distance constraints in modeling are welcomed as they reduce (considerably) the accessible search space, hence the time to explore it. Distance constraints come from two major sources; 1) the inferred constraints and 2) the experimentally derived.

Inferred distance constraints are cheap as they imply no further experimental procedures. Consider the GNRA tetraloop. Find it's receptor [1,2] in the structure you're modeling, and you'll have an additional distance constraint. Comparative sequence analysis can also unveil additional inter-helical or inter-domain base pairs, which can in turn serve as distance constraints [3].

Experimentally derived distance constraints can be obtained from high resolution methods, like NMR or X-Ray, but also from mid- and low-resolution methods like cryo-EM, SAXS, and multiplexed hydroxyl radical cleavage analysis (MOHCA) [4].

Consider the data amassed by Das et al. using MOHCA [4], which can be readily used in MC-Sym as distance constraints. The hydroxyl cleavage traces are between sugars less than 25 Angstroms apart. In MC-Sym, since the distance constraints are hard (i.e. cannot be violated) we can extend this range to, say, 30 Angstroms. Hence, for each pair of nucleotides picked up by the method we can add a constraint in the MC-Sym script to reflect the proximity of the pair (see Table S2 of [4]):

distance( A109:C4'  A182:C4'  4.0  30.0 ) // P4_P5 / P5A
distance( A113:C4'  A197:C4'  4.0  30.0 ) // P4_P5 / P5A
distance( A114:C4'  A196:C4'  4.0  30.0 ) // P4_P5 / P5A
distance( A115:C4'  A195:C4'  4.0  30.0 ) // P4_P5 / P5A
distance( A121:C4'  A193:C4'  4.0  30.0 ) // P4_P5 / P5A
The distance constraints can be put right after the "backtrack" section. The distance keyword is used to specify a pair of atoms, here the C4' atoms of A109 and A182, in a range of minimum and maximum distances, here between 4.0 and 30.0 Angstroms.

Distance constraints can be also used for nucleobase stacking:

distance( A25:PSY  A26:PSY  0.0  5.0 )
Since the PSY pseudo-atoms are found in the middle of the rings closest to the glycosydic bond of nucleobases. Typical stacked nucleobases have PSY-PSY distances within 5.0 Angstroms.

Distance constraints can be also used for nucleobase pairing; typical C1'-C1' distances for W/W pairs are:

  • 11 Angstroms, for purine + pyrimidine
  • 9 Angstroms, for pyrimidine + pyrimidine
  • 14 Angstroms, for purine + purine
Allow a few Angstroms more for the error in modeling.

Finally, distance constraints can steer MC-Sym in large loop modeling or serve as gards for unmodeled parts of an RNA. Indeed, for two nucleotides i and i+k the 99th percentile distance D between their C1' atoms goes like:

// D = 4.4 * k + 5.8
distance( B40:C1'  B43:C1'  0.0   19.0 )
which are distance constraints to paliate for the absence of modeling of nucleotides B41 and B42 (we then force B43 to be not too far from B40).

References:

  1. Jaeger et al. J Mol Biol. 1994 236:1271-6.
  2. Cate et al. Science. 1996 273:1678-85.
  3. Massire et al. J Mol Biol. 1998 279:773-93.
  4. Das et al. PNAS. 2008 105:4144-9.


MODELING BASE PAIR TYPES
Knowledge about a specific base pair type can be exploited in MC-Sym by forcing it to produce models that contain the base pair type wanted. NCMs are by default flanked by base pairs, but they do not disclose their base pair types. We let the merging process in MC-Sym to decide which base pair type actually allows for the merging of consecutive NCMs, as the base pair types are not always obvious. Base pair types can be specified in MC-Sym under the "relation" or the "implicit_relation" sections, just following the "sequence" section, and using either the Saenger [1], the Leontis/Westhof [2], or the Lemieux/Major [3] notations. For example, one could use:

implicit_relation(

  // the G=A base pair flanking a GNRA tetraloop
  // in the Leontis/Westhof style
  A8 A23 { S/H && antiparallel && trans }

  // the canonical C=G Watson-Crick base pair
  // in the Saenger style
  A7 A24 { XIX }

  // the most underspecified pair
  A6 A25 { pairing }
)

  • The relative glycosidic bonds orientation in a base pair can be either cis or trans
  • The Saenger notation describes base pairs with at least two hydrogen bonds.
    Valid pairs are I to XXVIII (table), but notably:
    • XIX: the C=G Watson-Crick
    • XX: the A=U Watson-Crick
    • XXVIII: the G=U Watson-Crick

  • The faces implicated in a base pair can be one of:
    W S H Ws Ww Wh Sw Ss Hw Hh C8 Bs Bh O2P O2'
  • The parallel or antiparallel describes the relative strands orientation.

References:

  1. Saenger. Principles of nucleic acid structure. 1984.
  2. Leontis & Westhof. RNA. 2001 7:499-512.
  3. Lemieux & Major. Nucleic Acids Res. 2002 30:4250-63.


MODELING COAXIAL STACKING
Coaxial stacking can be found between stems linked by assymetrical bulges, pseudoknots, or short single-stranded stretches, to form a longer helical segment. The coaxial stacking of two (or more) stems participates in the stability of the final 3-D fold. Coaxial stacking can be predicted from sequence [1,2], or observed within certain NMR spectra [3]. In MC-Sym, coaxial stacking can be enforced using a distance constraint between the concerned nucleobases. For example, to force a coaxial stacking between stems 2 and 3 in the structure of Figure 3. simply add a distance constraint between nucleotides 25 and 26 after the "backtrack" section:
distance( A25:PSY  A26:PSY  0.0  5.0 )
Here, we use the PSY pseudo-atoms, as they are found in the middle of the rings closest to the glycosydic bond of nucleobases. Typical stacked nucleobases have PSY-PSY distances within 5.0 Angstroms.

References:

  1. Parisien & Major. Nature. 2008 452:51-5.
  2. Mathews et al. PNAS. 2004 101:7287-92.
  3. Wu & Feigon. PNAS. 2007 104:6655-60.


MODELING A PSEUDOKNOT
A pseudoknot form when a hairpin head makes contact with a sequence part downstream [1].
pseudoknot 2TPK Consider this pseudoknot, found in gene 32 messenger RNA of bacteriophage T2 [2], and shown in Figure 4. Here, loops 1 and 2 will be modeled using 2-mers. Furthermore, we will impose a coaxial stacking between nucleotides 13 and 14. Loop 3 will not be modeled, as it is underspecified (but could potentially form base triples with stem 1 [1,2]). In MC-Sym, there is nothing particular about modeling a pseudoknot.
Figure 4. Pseudoknotted RNA structure.

References:

  1. Aalberts & Hodas. Nucleic Acids Res. 2005 33:2210-4.
  2. Holland et al. RNA. 1999 5:257-71.


MODELING WITH RELATIONS
In MC-Sym, relations are binary operators which position a nucleobase with respect to another.
triple 2JXV Consider this base triple, found in let-7 miRNA:lin-41 mRNA complex from C. elegans [1], and shown in Figure 5.
Figure 5. A base triple, modeled using explicit relations.

Step 1: Declare the relations.

Indicate to MC-Sym the type of relation to employ when positioning a nucleobase with respect to another. Types of relations are adjacent_5p (and _3p), stack and pairing (see also modeling base pair types). All relations are declared in a "relation" section, right after the "sequence" section. For our example:
relation(
  A9  A10 { adjacent_5p } 75%
  A10 A23 { pairing }     75%
  A9  A25 { pairing }     75%
  A25 A24 { pairing }     75%
)

The logical and (&&) and not (!) operators can be used in relations, for example:
relation(
  A9  A10 { adjacent_5p && !stack } 25%
)
to express an adjacency, but an unstacked one.

A standard A-RNA double-helix conformation is available for modeling. This is useful to model a Watson-Crick base pair or a stacking relation using only one conformation. For instance, the CCA tail in tRNAs, as well as a long-range base pair A15 A48, can be modeled as:
relation(
  A74:A76 { file( "helixA_RNA" ) stack } 1
  A15 A48 { file( "helixA_RNA" )  W/W  } 1
)

Step 2: Use the relations.

Instruct MC-Sym to build, in the "backtrack" section, the base triple in the order shown in Figure 5. that is, nucleobase A9 will place nucleobase A25 (declared previously as a pairing), then A25 will place A24 (also as a pairing). Then will place A10, adjacently to A9. Finally, A23 will be paired to A10. Here, we did not use the stack relation between A9 and A10, or between A23 and A24, because it is not obvious that these are stacked. For our example:
structure = backtrack(
    // the triple
    ( A9 A25 A24 )
    // next stage
    ( A9 A10 )
    // the other base pair
    ( A10 A23 )
)
Notice that we did not specify the relation between A23 and A24; they are adjacent by default. Furthermore, since we do not explicitely place A24 with respect to A23, we do not need to declare this adjacency.

References:

  1. Cevec et al. Nucleic Acids Res. 2008 36:2330-7.


MODELING CLASSIC SECONDARY STRUCTURES
It is possible to model classic secondary structures with MC-Sym. However, classic secondary structures often feature large (interior) loops, such that many nucleotides are now unpaired. This has for consequence to increase the conformational search space, since these unpaired nucleotides have many more degrees of freedom. Furthermore, classic secondary structures are not in the spirit of the MC-Fold | MC-Sym pipeline, because RNA structures must be, indeed, structured!
rat 28S 430D Consider this RNA structure, from the rat 28S rRNA [1], and shown in Figure 6. This structure features a Sarcin/Ricin motif which is composed of many non-canonical base pairs [2,3].
Figure 6. Rat 28S rRNA, featuring the Sarcin/Ricin motif. a) Classic secondary structure. b) Enhanced secondary structure.
One can go on and model the classic secondary structure, using the techniques described here. However, it is almost improbable that MC-Sym generates the enhanced structure, with all the non-canonical base pairs, from modeling with single-stranded stretches. To model a structure such as Figure 6A, we suggest other alternative programs, like FARNA [4], which takes into account non-canonical base pairs, RNA2D3D [5], and iFoldRNA [6].

Classic secondary structures can be further analyzed by MC-Fold for it to produce enhanced secondary structures. These now could then be modeled in 3-D more easily using MC-Sym.

References:

  1. Correll et al. PNAS. 1998 95:13436-41.
  2. Szewczak et al. PNAS. 1993 90:9581-5.
  3. Wimberly et al. Biochemistry 1993 32:1078-87.
  4. Das & Baker. PNAS. 2007 104:14664-9.
  5. Martinez et al. J Biomol Struct Dyn. 2008 25:669-84.
  6. Ding et al. RNA. 2008 14:1164-73.


MODELING WITH NCM SETS
Several NCM sets are available to the modeler. These sets can be used in MC-Sym's library command.

  • *_t: theoretical set, only for 2_2 NCMs with A=U/C=G base pairs. For example:
    ncm_01 = library( pdb( "MCSYM-DB/2_2/UCGA/*_t.pdb.gz" ) ... )
    
    It's purpose is to provide only one 3-D instance for that NCM, reducing the conformational search space. Useful to model large RNAs, or to force MC-Sym to sample other parts of the RNA structure. One can use the R20 sets instead (see below); which would also add the G=U base pair in a strict set with few, but "high quality", NCMs. Two problems arise when using the _t sets; 1) consecutive pyrimidines won't be stacked, at least on the 5' strand. 2) base pairs won't be propelled.


  • *_1: observed set, found in the PDB with the actual sequence. For example:
    ncm_01 = library( pdb( "MCSYM-DB/4/GGAA/*_1.pdb.gz" ) ... )
    
    It's purpose is to provide 3-D instances for that NCM that can be found in the PDB with the actual modeled sequence. Useful to model parts of an RNA structure which are known to adopt particular folds, like the GNRA-tetraloops, or base pair steps with proper helical parameters [1,2,3].


  • *_x: artificial set, built from backbone templates. For example:
    ncm_01 = library( pdb( "MCSYM-DB/4_2/GUUCGC/*_x.pdb.gz" ) ... )
    
    It's purpose is to provide 3-D instances for that NCM that were built from a database of backbone templates onto which the flanking base pairs were fitted, and any other nucleobases added. Useful to model any NCMs which cannot be found in the PDB. As more nucleotides are implicated in the NCM, the number of backbone templates diminishes rapidly. Hence, a user should opt for other modeling options, like modeling internal loops. The artificial set construction is described in details in [4].


  • *_e: energy optimized artificial set, built from backbone templates. For example:
    ncm_01 = library( pdb( "MCSYM-DB/2_2/GAGA/*_e.pdb.gz" ) ... )
    
    This set is a subset of *_x, in which each NCM has been annotated, and only the ones with best energies are kept, for each suite of base pairs. This set is only available for 2_2 NCMs. This set is useful to sample many base pair types, but with few instances only (to keep the search space at minimum).


  • *_R*: 2_2 NCMs that come from X-Ray only. For example:
    ncm_01 = library( pdb( "MCSYM-DB/2_2/GAGA/*_R*_1.pdb.gz" ) ... )
    
    This set is a subset of *_1, in which each NCM has been solved with X-Ray. This set is only available for 2_2 NCMs.


  • *_Rr*: 2_2 NCMs that come from X-Ray only, with a resolution better or equal to r. For example:
    ncm_01 = library( pdb( "MCSYM-DB/2_2/GAGA/*_R20*_1.pdb.gz" ) ... )
    
    This set is a subset of *_R, in which each NCM has been solved with X-Ray, at a resolution better or equal to r. This set is only available for 2_2 NCMs. Here, R20 are those NCMs from resolutions better or equal to 2.0A. Available sets are R10, R15, R20, R25, R30 and R35, for resolutions better or equal to, respectively, 1.0, 1.5, 2.0, 2.5, 3.0 and 3.5 Angstroms. Hence, by using R35, you also use those lower than 3.5A, that is R30 downto R10. Theoretical NCMs featuring stacks of Watson-Crick base pairs could be replaced with the R20 sets, for example. Don't expect to find your favorite 2_2 NCM in the R10 set though (i.e. solved at a resolution better than 1.0 Angstrom)!


  • *: a combination of all other sets. For example:
    ncm_01 = library( pdb( "MCSYM-DB/4/UACG/*.pdb.gz" ) ... )
    
    However, MC-Sym will load in that order: 1) the theoretical set (if any), 2) the observed set (if any), and 3) the artificial set.

References:

  1. Hunter. J Mol Biol. 1993 230:1025-54.
  2. Hunter. J Mol Biol. 1998 280:407-20.
  3. Hunter & Lu. J Mol Biol. 1997 265:603-19.
  4. Parisien & Major. Nature. 2008 452:51-55. Supp Inf


MODELING WITH FRAGMENTS
In MC-Sym it is possible to model using fragments of RNA 3-D structures. The NCMs are already an example of modeling with fragments. Furthermore, user-built fragments can also be used. For example, consider a base triple built from a previous modeling session in the working directory key IXYuDpUiAB. You can import this fragment in the new modeling session using the library command:
triple = library(
	pdb( "../IXYuDpUiAB/*.pdb.gz" ) #1:#2, #3:#5 <- A9:A10, A23:A25
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )

One can also load structures from different working directories, like:
triple = library(
	pdb( "../{WorkDir1,WorkDir2,...}/*.pdb.gz" ) #1:#2, #3:#5 <- A9:A10, A23:A25
	rmsd( 0.1 sidechain && !( pse || lp || hydrogen ) ) )
As long as all fragments can map to the desired nucleotide numbering scheme. This is useful when one wants to embed in a stem an internal loop which features different base pair patterns, each pattern modeled individually in their own working directory.


MODELING A TRNA
The transfer RNA molecule is often used as a benchmark for 3-D structure prediction [1]. Here, we use the Yeast tRNA-PHE (PDB code 1EVV) [2] as the 3-D target. Within 24h of computation you should obtain a 3-D model at 4 Angstroms from the crystal structure.
tRNA model on crystal
  • The MC-Sym script
  • The canonized tRNA 3-D structure, without modified nucleotides
  • Superimposed MC-Sym solution (model 2) on crystal (model 1)
Figure 7. MC-Sym model (blue) superimposed on X-ray crystal structure (gold; PDB file 1EVV). The RMSD of the superposition is 4.0 Angstroms.

References:

  1. Major et al. PNAS. 1993 90:9408-12.
  2. Jovine et al. J Mol Biol. 2000 301:401-14.


SAMPLING AN RNA
It is possible to sample an existing RNA 3-D structure using MC-Sym. The trick is to sample each part individually, then assemble the whole molecule from it's sampled parts. This is not quite equivalent to MD or LD, because no energy is implied here, and we make use of "nodes" in the sampled structures (if the sampled parts do not overlap). Here we sample a recently published tRNA, PDB file 2K4C [1].
tRNA sampling
  • The MC-Sym script which samples the acceptor stem
  • The MC-Sym script which samples the anticodon stem/loop
  • The MC-Sym script which samples the T stem/loop
  • The MC-Sym script which samples the D stem/loop
  • The MC-Sym script which assembles the whole tRNA using previous samples
Figure 8. Superimposed MC-Sym models.

References:

  1. Grishaev et al. J Biomol Nmr. 2008 42:99-109.


MODELING LARGE RNAS
Modeling large RNAs is possible using MC-Sym. It all boils down to the number and quality (resolution) of distance constraints; the lesser the number/quality, the greater the number and diversity of possible solutions, and hence the time to explore the search space. Basically, you should:
  1. Model each stem/hairpin separately.
    • Generate many MC-Sym models (change model_limit = 9999).
    • use the SCORE + FILTER in the commands.html file,
      once all models are generated.

  2. Model the whole structure using previously built stems/hairpins.
    Consider the Modeling with fragments section.
We suggest that you have mastered all aspects of modeling using MC-Sym before tackling such a task.

WARNING If you do not have distance constraints between your stems/hairpins then the number of possible 3-D structures is as large as there are grains of sand on a beach!



MODELING WITH NMR RESTRAINTS
NMR methods produce distance constraints which in turn can be used in MC-Sym for it to generate "starting" structures. There are two ways to exploit the NMR data; the first way is to declare distance constraints in the MC-Sym script (see the Modeling with distance constraints section). However, MC-Sym will treat these distance constraints as "hard", i.e. cannot be violated. Therefore, these distance constraints must be relaxed for MC-Sym to generate models.

The other way to exploit the NMR data is to evaluate the fitness of each MC-Sym model with respect to the NMR restraints. Basically, you should:

  1. Generate many MC-Sym models (change model_limit = 9999).
  2. Use the "NMR" section in the "commands.html" file,
    once all models are generated.
Each MC-Sym model will then be given an NMR violation energy; the lower the better. The energy is 0 if the distance falls between Dmin and Dmax, is linear if the distance violation is less than 1 Angstrom, and is quadratic elsewise. Use "ANALYZE" in the "commands.html" file to be informed of the many legitimate base pair types for each base pair steps.

Supported format #1:

(example)
    15  GUA H1             19  GUA H1             3.500    6.500  
    19  GUA H1             13  URA H3             2.600    5.000  
    20  ADE H2             12  GUA H1             2.600    5.000  
    20  ADE H2             13  URA H3             1.800    3.600  
    20  ADE H2             19  GUA H1             1.800    3.600  

123456789012345678901234567890123456789012345678901234567890123456
         1         2         3         4         5         6

Supported format #2:

(example)
assign (residue 1 and name H2') (residue 1 and name H3')  2.60  0.78  0.78
assign (residue 1 and name H1') (residue 1 and name H3')  3.20  0.96  0.96
assign (residue 1 and name H3') (residue 1 and name H8)   3.20  0.96  0.96
assign (residue 1 and name H2') (residue 1 and name H8)   4.00  1.20  1.20
assign (residue 1 and name H1') (residue 1 and name H8)   2.90  1.10  1.10


* Other formats may be supported by explicit demand to Dr. Francois Major.



ADDING FRAGMENTS TO A PARTIAL STRUCTURE
In MC-Sym it is possible to add fragments to a partial structure. TODO.


USING MC-SYM LOCALLY

Installation.

A modeler can download and install MC-Sym locally. However, basic knowledge of UNIX commands are required. Here are the steps:
  1. Download MC-Sym from here.
  2. Install MC-Sym as prescribed.
  3. Goto the pipeline page here.
  4. Click on the "snapshot" button of the "What is MC-Sym" section.
  5. Download your tarball "MCSYM-DB.tar.gz" on your computer.
  6. Unzip and untar the tarball:
    gunzip MCSYM-DB.tar.gz; tar xvf MCSYM-DB.tar
Attention: we suggest to install the tarball in MC-Sym's folder, i.e. where the MC-Sym executable is found. For example, if you launch MC-Sym like this: "~/RNA/MC-Sym/mcsym", then install the tarball in "~/RNA/MC-Sym/". To use the MC-Sym database from a specific modeling folder, say "~/RNA/Modeling/Hammerhead/" install a UNIX symbolic link in that folder:

cd ~/RNA/Modeling/Hammerhead/
ln -s MCSYM-DB ~/RNA/MC-Sym/MCSYM-DB/

such that when MC-Sym encounters a "library" command, like:

ncm_01 = library(
      pdb( "MCSYM-DB/4/GGCC/*.pdb.gz" )  ...

then MC-Sym will be able to find it's databases through the symbolic link. To use MC-Sym:

cd ~/RNA/Modeling/Hammerhead/
~/RNA/MC-Sym/mcsym  ./Hammerhead.mcc

Updating your database.

There will come a time when MC-Sym will produce an error like:
Failed to expand pattern 'MCSYM-DB/4/UACG/*.pdb.gz': No such file or directory
This is because the NCM database is built on-the-fly on our web server, hence, when you take a snapshot of our database you will not have those NCMs built subsequently. Since your local version is not updated automatically you'll have to update your database through these steps:
  1. Make a "fake" MC-Sym script with all the "library" commands which fails. For example:
    // these are my missing blocks:
    library( pdb( "MCSYM-DB/4/UACG/*.pdb.gz" ) )
    library( pdb( "MCSYM-DB/4/UCCG/*.pdb.gz" ) )
    library( pdb( "MCSYM-DB/4/UGCG/*.pdb.gz" ) )
    library( pdb( "MCSYM-DB/4/UUCG/*.pdb.gz" ) )
    
  2. Submit this "fake" script to MC-Sym here; wait until the job is queued.
  3. Refresh your database; repeat steps 3 to 6 of the previous "Installation" section.

To go local or not?

Although running MC-Sym locally has it's advantages, using our web server has also theirs:
  • Automatic updates of the NCM database
  • Scoring of the 3-D models with an RNA-tailored force-field
  • Refinement of the backbones
  • Distributed computing facillity
  • Analysis of base pair types