INTRODUCTION

RaptorXFM protein folder is a fragment-free protein folding software package for protein structure prediction, especially for discovering the structures of proteins with new folds.

We represent the backbone angles continuously, so sampling takes place in a continuous protein structure space. This enables us to sample and discover the rarely occurring super-secondary structure motifs that are most likely to appear in "new fold" proteins.

RaptorXFM uses probabilistic graphical models to capture the complex relationship among backbone angles, sequence profile and predicted secondary structure. This lets it estimate the probability distribution of backbone angles more accurately, and sample protein conformations better and more efficiently, than other models (see the REFERENCES).

Further, RaptorXFM uses only backbone and C-beta atoms in the sampling process and does not work with the remaining heavy side-chain atoms, which substantially reduces the computational cost.

RaptorXFM can generate high-quality decoys for the test proteins, including the CASP8 free-modeling targets. In our experience, RaptorXFM protein folder does well on mainly-alpha proteins shorter than about 150 amino acids and on small beta proteins with fewer than 90 amino acids. For more details, please see our ISMB2010 poster and the REFERENCES.

INSTALLATION

Unpack the RaptorXFM.tar.gz file and you will find four sub-directories: bin, config, DATA and SEQ. The bin directory holds the executables that generate the necessary preliminary data files from a given sequence and that sample decoys (ProteinFolder*); the config and DATA directories contain the configuration files and model files that ProteinFolder uses to generate decoys; the SEQ directory is for user-specific data.

RaptorXFM protein folder needs to run PSI-BLAST and PSIPRED to generate the preliminary data files. By default, PSI-BLAST uses the NR database, which can be downloaded from ftp://ftp.ncbi.nih.gov/blast/db/.

Please modify the file generateFolderData.sh by setting the following parameters in the TODO section:

# TODO: please specify the directory where you have installed BLAST and PSIPRED
export PSIPRED_HOME=~/PSIPRED
export BLASTDB=~/NR/nr
export BLAST_HOME=~/BLAST

Alternatively, set them with the following commands in bash:

PSIPRED_HOME=~/PSIPRED; export PSIPRED_HOME
BLASTDB=~/NR/nr; export BLASTDB
BLAST_HOME=~/BLAST; export BLAST_HOME
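
Before running the pipeline, it can help to confirm that these variables point at real locations. The following is a minimal sketch, not part of the RaptorXFM package: check_raptorx_env is a hypothetical helper, and it assumes that BLASTDB is a database path prefix (as in ~/NR/nr), so only its parent directory is checked.

```shell
# Hypothetical helper (not shipped with RaptorXFM): verify the variables
# named in the TODO section of generateFolderData.sh.
check_raptorx_env() {
    status=0
    for var in PSIPRED_HOME BLAST_HOME; do
        # Read the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$var does not point to a directory: $value" >&2
            status=1
        fi
    done
    # Assumption: BLASTDB is a path prefix such as ~/NR/nr, so only the
    # parent directory is checked here.
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set" >&2
        status=1
    elif [ ! -d "$(dirname "$BLASTDB")" ]; then
        echo "parent directory of BLASTDB not found: $BLASTDB" >&2
        status=1
    fi
    return $status
}
```

Calling check_raptorx_env returns a non-zero status and prints one message per problem if anything is missing.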

RUNNING RaptorXFM PROTEIN FOLDER

RaptorXFM protein folder works in two steps.

Step 1. Preparation

RaptorXFM ProteinFolder accepts a sequence file in FASTA format as input. Given a sequence file named *.seq, copy it to the SEQ directory, then run generateFolderData.sh.
For example, if the sequence file is 1enhA.seq:

./generateFolderData.sh 1enhA 1

The first parameter is the target protein name. The second parameter selects the model used for sampling: 0 selects the CRFs model; 1 selects the CNFs model that requires the secondary structure (SS) predicted by PSIPRED; 2 selects the CNFs model that requires no predicted SS information. Here CRFs stands for Conditional Random Fields, a probabilistic graphical model that models the relationship between backbone angles and sequence profile linearly. CNFs stands for Conditional Neural Fields, another probabilistic graphical model, which integrates a neural network with CRFs and therefore models the relationship between backbone angles and sequence profile nonlinearly.

The above command generates the preliminary data files under the SEQ directory. If the chosen model is CRFs (0), you will see the following files in the SEQ/INPUT_CRF directory: 1enhA.chk, 1enhA.horiz, 1enhA.newss, 1enhA.pretag, 1enhA.psp, 1enhA.seq, 1enhA.ss, 1enhA.ss2. If the chosen model is CNFs with predicted SS (1), the files are written to the SEQ/INPUT_CNF directory. If the chosen model is CNFs without predicted SS (2), the files are written to the SEQ/INPUTnoSS directory. Please note that the CRFs model also uses the predicted SS information.
Sample files for 1enhA are included in the package: you can find 1enhA.* in the SEQ directory and in its sub-directories such as Merge_cnf, INPUT_CRF, INPUT_CNF and INPUTnoSS.
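
The mapping from model code to output directory can be sketched as a small shell helper, for instance to check that Step 1 finished before starting Step 2. Both functions below are hypothetical and not part of the package. The extension list is the one given above for the CRFs model only; whether the other two directories hold exactly the same set of files is not stated here, so the check is limited to SEQ/INPUT_CRF.

```shell
# Hypothetical helper: map the second argument of generateFolderData.sh
# to the directory that receives the preliminary data files
# (paths as used for the input settings in the CONFIGURATIONS section).
input_dir_for_model() {
    case "$1" in
        0) echo "SEQ/INPUT_CRF" ;;
        1) echo "SEQ/INPUT_CNF" ;;
        2) echo "SEQ/INPUTnoSS" ;;
        *) echo "unknown model code: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the CRFs input files that are missing for a
# target, using the extensions listed for the CRFs model (0).
missing_crf_files() {
    target=$1
    dir=$(input_dir_for_model 0)
    for ext in chk horiz newss pretag psp seq ss ss2; do
        [ -f "$dir/$target.$ext" ] || echo "$dir/$target.$ext"
    done
}
```

Run from the package root, missing_crf_files 1enhA prints nothing when all eight files are present.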

Step 2. Sampling

When the preliminary data files are ready, you can run ProteinFolder or ProteinFolder_noSS in the bin directory to sample the decoys. We have provided two sample shell scripts: runFolder.sh, which samples using the simulated annealing method, and runFolder_RE.sh, which samples using the replica exchange method.

Example:

./runFolder.sh 1enhA CNFo2 Decoy_1enhA

The first parameter is the target protein name. The second parameter selects the model to be used for sampling. There are 5 options: CRFo1 for the 1st order CRFs model; CRFo2 for the 2nd order CRFs model; CNFo1 for the 1st order CNFs model; CNFo2 for the 2nd order CNFs model; and noSS for a specially trained CNFs model for use when no secondary structure information is available. The first 4 options all require the secondary structure (SS) predicted by PSIPRED. The 3rd parameter is the name of the directory where the new decoys are to be stored.

The parameters used by runFolder_RE.sh are the same as runFolder.sh.

The above command creates a new directory Decoy_1enhA, in which the protein folder run by runFolder.sh generates 2 decoys. If you use runFolder_RE.sh, you will normally find 6 to 12 decoys in this sub-directory. In rare cases, for the most difficult sequences, a run of runFolder_RE.sh may generate as few as 2-4 decoys; for easy sequences, it may generate up to 18-20 decoys. Please note that runFolder_RE.sh takes more time to finish than runFolder.sh, although the average sampling time per decoy is similar for both methods.

As you can see in runFolder.sh, the -f option of the protein folder specifies the configuration file to use. We have provided 5 configuration files in the DATA directory: proteinFolderCRFo1.conf for the CRFo1 option, proteinFolderCRFo2.conf for the CRFo2 option, proteinFolderCNFo1.conf for the CNFo1 option, proteinFolderCNFo2.conf for the CNFo2 option, and proteinFolder_noSS.conf for the noSS option. The results in the references were obtained with the SS option. However, the noSS option is worth trying when the confidence of the secondary structure predicted by PSIPRED is very low.
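
The correspondence between the model option and its configuration file can be written as a small lookup, which may be convenient in a wrapper script of your own. This is only a sketch: conf_for_option is a hypothetical helper, not something shipped in the package, and it returns just the file name without assuming a directory.

```shell
# Hypothetical helper: print the configuration file name that matches
# the second argument of runFolder.sh / runFolder_RE.sh.
conf_for_option() {
    case "$1" in
        CRFo1|CRFo2|CNFo1|CNFo2) echo "proteinFolder$1.conf" ;;
        noSS) echo "proteinFolder_noSS.conf" ;;
        *) echo "unknown model option: $1" >&2; return 1 ;;
    esac
}
```

For example, conf_for_option CNFo2 prints proteinFolderCNFo2.conf, and an unrecognised option returns a non-zero status.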

CONFIGURATIONS

In this section, we will explain the parameters in the configuration file and option file.

1. proteinFolder*.conf

This is the main configuration file for ProteinFolder (also for ProteinFolder_noSS) with the -f option. It contains a few sections as follows:

1) Settings for CRF/CNF models

crfModelDir is the location of the model files. We set it to the DATA directory, i.e. "./DATA".

crfOptionFile is the name of the option file located under crfModelDir. Each proteinFolder*.conf has its own option*.txt. For example, it is set to optionCRFo1.txt for the CRFo1 model in proteinFolderCRFo1.conf and to option_noSS.txt for the noSS model in proteinFolder_noSS.conf, and similarly for the other models.
crfFeatureTemplateFile is the feature template file. It is fixed to feature_template.txt.2. Please do not change it.
thetaTauClusterFile is the cluster file for the theta-tau pairs. It is set to the Kent-Frag100-param.txt file under the config directory. Please do not change it.

2) Settings for the input of the protein folder, i.e. the directory containing the preliminary data files generated by generateFolderData.sh, including the sequence file, the secondary structure files, and the *.pretag files. We set it to SEQ/INPUT_CRF for the CRFs models, SEQ/INPUT_CNF for the CNFs models with predicted SS, and SEQ/INPUTnoSS for noSS. The parameters in this section include SeqPath, SSPath, and PretagPath. They are assigned the same value in the examples.

3) Weights used for the energy function.
As indicated in the references, the simple energy function has 3 terms: DOPE, BMK, and ESP. In our experiments, we used the following weights:

ENERGY_WEIGHT_DOPE_BETA=1
ENERGY_WEIGHT_BMK=0.4
ENERGY_WEIGHT_ESP_NEW=100

You can also sample without using the energy function. To do so, set ENERGY_WEIGHT_RADIUS and comment out the three weights above:

ENERGY_WEIGHT_RADIUS=1

#ENERGY_WEIGHT_DOPE_BETA=1
#ENERGY_WEIGHT_BMK=0.4
#ENERGY_WEIGHT_ESP_NEW=100

When sampling without the energy function, the protein folder tries to minimize the radius of gyration of the target protein. This option only works with the simulated annealing method, so it cannot be used with runFolder_RE.sh.
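
If you prefer not to edit a configuration file by hand, the switch can be scripted. The sketch below is an illustration under stated assumptions, not a tool from the package: make_radius_only_conf is a hypothetical helper, and it assumes the configuration file consists of KEY=VALUE lines in which a leading # marks a comment, as in the excerpts above. It writes a new file and leaves the original untouched.

```shell
# Hypothetical helper: write an energy-free copy of a proteinFolder*.conf.
# Usage: make_radius_only_conf <source.conf> <destination.conf>
make_radius_only_conf() {
    src=$1
    dst=$2
    # Comment out the three energy weights and drop any existing
    # ENERGY_WEIGHT_RADIUS line (commented or not) ...
    sed -e 's/^ENERGY_WEIGHT_DOPE_BETA=/#&/' \
        -e 's/^ENERGY_WEIGHT_BMK=/#&/' \
        -e 's/^ENERGY_WEIGHT_ESP_NEW=/#&/' \
        -e '/^#*ENERGY_WEIGHT_RADIUS=/d' "$src" > "$dst"
    # ... then enable the radius-of-gyration term.
    echo "ENERGY_WEIGHT_RADIUS=1" >> "$dst"
}
```

The resulting file can then be passed to the protein folder with the -f option in a simulated annealing run.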

4) Setting for controlling the sampling process

MAXIMUM_ITERATION: When sampling with the simulated annealing method, this is the number of iterations used to reach a decoy. Please note that each run of ProteinFolder generates 2 decoys. When sampling with the replica exchange method, this is the number of iterations sampled for each replica. We use 20 replicas, which yields 20 potential decoys in the end. We output those decoys whose energy is within 15% of the lowest energy obtained among the 20 decoys. Normally this gives 6-12 decoys per run. MAXIMUM_ITERATION is set to 24000 in our experiments.
START_TEMPERATION and END_TEMPERATION are the start and stop temperatures of the SA process, or the highest and lowest temperatures of the replicas. They are set to 50000 and 10 respectively in our experiments.
ANNEAL_STEPS is the number of iterations to perform before exchanging replicas between neighboring temperatures in the replica exchange method. Suggested values are 240, 300, or 400.
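
To get a feel for how these settings interact, the arithmetic can be sketched in shell. Both functions are hypothetical illustrations and do not reproduce the ProteinFolder source. exchange_rounds assumes one exchange attempt after every ANNEAL_STEPS iterations. filter_by_energy assumes that "within 15%" means a decoy is kept when its energy exceeds the lowest energy by no more than 15% of the absolute value of that lowest energy; the program may define the cutoff differently.

```shell
# Hypothetical: number of replica exchange rounds in one run, assuming
# one exchange attempt after every ANNEAL_STEPS iterations.
exchange_rounds() {
    max_iter=$1
    anneal_steps=$2
    echo $(( max_iter / anneal_steps ))
}

# Hypothetical: read one energy per line on stdin and print those within
# 15% of the lowest one (our reading of the output rule, see lead-in).
filter_by_energy() {
    awk '
        { e[NR] = $1 + 0; if (NR == 1 || e[NR] < min) min = e[NR] }
        END {
            tol = 0.15 * (min < 0 ? -min : min)
            for (i = 1; i <= NR; i++)
                if (e[i] - min <= tol) print e[i]
        }'
}

# With MAXIMUM_ITERATION=24000 and the suggested ANNEAL_STEPS values:
for steps in 240 300 400; do
    echo "ANNEAL_STEPS=$steps: $(exchange_rounds 24000 "$steps") exchange rounds"
done
```

Under these assumptions the suggested values correspond to 100, 80 and 60 exchange rounds per run.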

We used some experimental parameters to adjust the chance of sampling on different types of secondary structure. The following parameters only take effect when the SS option is chosen. We suggest you keep them unchanged.
STARTING_WEIGHTED_SIMULATION_STEP=7200
SIMULATION_WEIGHT_ALPHA=1
SIMULATION_WEIGHT_BETA=5
SIMULATION_WEIGHT_COIL=3

5) System reserved settings. Please do not change the following settings.

fragLen=5
libLen=100
interval=2
fragLibFile=./DATA/lib_100_z_5.txt

2. option*.txt
As mentioned above, the option*.txt contains the parameters for the models. Please refer to the reference for the details about the models.

Here, model=1 means the CRFs model and model=2 means the CNFs model; order=1 means first-order CRFs/CNFs, order=2 means second-order CRFs/CNFs, and order=4 means second-order CNFs with 2 layers of neurons, which is used for noSS only.

Each combination of model and order requires its own model file (model_file) and label mapping file (cnf_label_file).

The package includes 5 model files: modelCRFo1.dat (used when model=1 and order=1), modelCRFo2.dat (used when model=1 and order=2), modelCNFo1.dat (for model=2 and order=1), modelCNFo2.dat (for model=2 and order=2), and modelCNFnoSS.dat (for model=2 and order=4).

The label file (cnf_label_file) should be set to cnf_label_SS_1.dat for the 1st order CNFs model, and to cnf_label_SS_2.dat for the 2nd order CNFs model. Please note that no label mapping file is needed for the CRFs models.
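
The model/order combinations described above can be summarised as a lookup, for example to sanity-check an option*.txt before a run. This is a sketch only: model_file_for and label_file_for are hypothetical helpers, not part of the package. The label file for model=2 with order=4 is not stated in this document, so label_file_for deliberately reports that case as unknown.

```shell
# Hypothetical helper: model file for a given model/order pair.
model_file_for() {
    case "$1/$2" in
        1/1) echo "modelCRFo1.dat" ;;
        1/2) echo "modelCRFo2.dat" ;;
        2/1) echo "modelCNFo1.dat" ;;
        2/2) echo "modelCNFo2.dat" ;;
        2/4) echo "modelCNFnoSS.dat" ;;
        *) echo "unsupported model/order: $1/$2" >&2; return 1 ;;
    esac
}

# Hypothetical helper: label mapping file for a given model/order pair.
# Prints an empty line for CRFs, which need no label mapping file.
label_file_for() {
    case "$1/$2" in
        1/1|1/2) echo "" ;;
        2/1) echo "cnf_label_SS_1.dat" ;;
        2/2) echo "cnf_label_SS_2.dat" ;;
        # Not documented here (e.g. model=2, order=4): report as unknown.
        *) echo "label file not documented for model/order: $1/$2" >&2; return 1 ;;
    esac
}
```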

REFERENCES

1. Fragment-free Approach to Protein Folding Using Conditional Neural Fields, Feng Zhao, Jian Peng and Jinbo Xu, The Eighteenth Annual International Conference on Intelligent Systems for Molecular Biology (ISMB2010), Bioinformatics. 2010 June 15; 26(12):i310-i317; doi:10.1093/bioinformatics/btq193

2. A probabilistic and continuous model of protein conformational space for template-free modeling, Feng Zhao, Jian Peng, Joe Debartolo, Karl F. Freed, Tobin R. Sosnick, Jinbo Xu. Journal of Computational Biology. June 2010, 17(6): 783-798. doi:10.1089/cmb.2009.0235.

3. Discriminative Learning for Protein Conformation Sampling, Feng Zhao, ShuaiCheng Li, Beckett W. Sterner and Jinbo Xu, PROTEINS: Structure, Function and Bioinformatics, 2008;73(1):228-240.