INTRODUCTION
RaptorXFM protein folder is a fragment-free protein folding software
package for protein structure prediction, especially for predicting
the structures of proteins with new folds.
We represent the backbone angles continuously, in a continuous protein structure space. This enables us to sample and discover rarely occurring super-secondary structure motifs, which are most likely to appear in "new fold" proteins.
RaptorXFM uses probabilistic graphical models to capture the relationship between backbone angles, sequence profile and predicted secondary structure, so that it can estimate the probability distribution of backbone angles more accurately and sample protein conformations better and more efficiently than other models (see the REFERENCES for the comparisons).
Further, RaptorXFM uses only backbone and C-beta atoms in the sampling process, without modeling the heavy side-chain atoms, which dramatically reduces the computational cost.
RaptorXFM can generate high-quality decoys on test proteins, including
the CASP8 free-modeling targets. In our experience, RaptorXFM protein
folder does well on mainly-alpha proteins with sequence lengths under
about 150 amino acids, and on small beta proteins with fewer than 90
amino acids. For more details, please see our poster at ISMB2010 and
the REFERENCES.
INSTALLATION
Unpack the RaptorXFM.tar.gz file and you will find four
sub-directories: bin, config, DATA and SEQ. The bin directory holds
the executables that generate the necessary preliminary data files
from a given sequence and that sample decoys (ProteinFolder*); the
config and DATA directories contain the configuration files and model
files that ProteinFolder uses to generate decoys; the SEQ directory is
for user-specific data.
RaptorXFM protein folder needs to run PSI-BLAST and PSIPRED to
generate the preliminary data files. By default, PSI-BLAST uses the NR
database, which can be downloaded from ftp://ftp.ncbi.nih.gov/blast/db/.
Please modify the file generateFolderData.sh by setting the
following parameters in the TODO section:
# TODO: please specify the directory where you have installed BLAST and PSIPRED
export PSIPRED_HOME=~/PSIPRED
export BLASTDB=~/NR/nr
export BLAST_HOME=~/BLAST
Or use the following commands in bash:
PSIPRED_HOME=~/PSIPRED; export PSIPRED_HOME
BLASTDB=~/NR/nr; export BLASTDB
BLAST_HOME=~/BLAST; export BLAST_HOME
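Before a long run it can help to confirm that the three settings point at real locations. The following is a minimal sketch, not part of the RaptorXFM package: the function names check_dir and check_folder_env are hypothetical, and treating BLASTDB as a database prefix whose parent directory must exist is an assumption based on the ~/NR/nr example above.

```shell
# Hypothetical helper (not shipped with RaptorXFM): verify the settings
# that generateFolderData.sh relies on.
check_dir() {  # usage: check_dir LABEL PATH
    if [ -z "$2" ] || [ ! -d "$2" ]; then
        echo "$1 is unset or not a directory: '$2'" >&2
        return 1
    fi
}

check_folder_env() {
    env_status=0
    # PSIPRED_HOME and BLAST_HOME are installation directories.
    check_dir PSIPRED_HOME "${PSIPRED_HOME:-}" || env_status=1
    check_dir BLAST_HOME "${BLAST_HOME:-}" || env_status=1
    # BLASTDB is assumed to be a database prefix such as ~/NR/nr, so only
    # the directory that holds it is checked.
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is unset" >&2
        env_status=1
    else
        check_dir "BLASTDB directory" "$(dirname "$BLASTDB")" || env_status=1
    fi
    return "$env_status"
}
```

After exporting the variables, calling check_folder_env returns 0 when all three look usable and prints one message per problem otherwise.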
RUN RaptorXFM protein folder
RaptorXFM protein folder works in two steps.
Step 1. Preparation
RaptorXFM ProteinFolder accepts a FASTA-format sequence file as input.
Given a sequence file named *.seq, copy the file to the SEQ directory,
then run generateFolderData.sh.
Here is an example (suppose the sequence file is 1enhA.seq):
./generateFolderData.sh 1enhA 1
The first parameter is the target protein name. The second parameter
selects the model used for sampling: 0 selects the CRFs model; 1
selects the CNFs model that requires the secondary structure (SS)
predicted by PSIPRED; 2 selects the CNFs model that requires no
predicted SS information. Here CRFs stands for Conditional Random
Fields, a probabilistic graphical model that models the relationship
between backbone angles and sequence profile linearly; CNFs stands for
Conditional Neural Fields, another probabilistic graphical model, which
integrates neural networks with CRFs and models the relationship
between backbone angles and sequence profile nonlinearly.
The above command generates the preliminary data files in the SEQ
directory. If the chosen model is CRFs (0), you will see the following
files in the SEQ/INPUT_CRF directory: 1enhA.chk, 1enhA.horiz,
1enhA.newss, 1enhA.pretag, 1enhA.psp, 1enhA.seq, 1enhA.ss, 1enhA.ss2.
If the chosen model is CNF with predicted SS (1), the files appear in
the SEQ/INPUT_CNF directory. If the chosen model is CNF without
predicted SS (2), the files appear in the SEQ/INPUTnoSS directory.
Please note that the CRF model also uses the predicted SS information.
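The mapping from model code to output directory, and a check that the preparation step produced its files, can be sketched as follows. Both function names are hypothetical. The list of extensions is the one given above for the CRFs model; assuming that the CNF directories receive the same set of files is a guess.

```shell
# Map the second argument of generateFolderData.sh to the directory that
# receives the preliminary data files.
input_dir_for_model() {  # usage: input_dir_for_model CODE
    case "$1" in
        0) echo "SEQ/INPUT_CRF" ;;   # CRFs model
        1) echo "SEQ/INPUT_CNF" ;;   # CNFs model with predicted SS
        2) echo "SEQ/INPUTnoSS" ;;   # CNFs model without predicted SS
        *) echo "unknown model code: $1" >&2; return 1 ;;
    esac
}

# Print the path of every expected file that is missing for a target.
# No output means that all eight files are present.
missing_inputs() {  # usage: missing_inputs DIR TARGET
    for ext in chk horiz newss pretag psp seq ss ss2; do
        [ -f "$1/$2.$ext" ] || echo "$1/$2.$ext"
    done
}
```

For example, missing_inputs "$(input_dir_for_model 0)" 1enhA lists whatever is still absent after running generateFolderData.sh with the CRFs model.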
We have enclosed sample files for 1enhA in the package: you can find
1enhA.* in the SEQ directory and in its sub-directories, such as
Merge_cnf, INPUT_CRF, INPUT_CNF and INPUTnoSS.
Step 2. Sampling
When the preliminary data files are ready, you can run ProteinFolder
or ProteinFolder_noSS in the bin directory to sample decoys. We have
provided two sample shell scripts: runFolder.sh for sampling with the
Simulated Annealing method, and runFolder_RE.sh for sampling with the
Replica Exchange method.
Example:
./runFolder.sh 1enhA CNFo2 Decoy_1enhA
The first parameter is the target protein name. The second parameter
selects the model used for sampling. There are 5 options: CRFo1 for
the 1st order CRFs model; CRFo2 for the 2nd order CRFs model; CNFo1
for the 1st order CNFs model; CNFo2 for the 2nd order CNFs model; and
noSS for a specially trained CNF model to use when no secondary
structure information is available. The first 4 options all require
the secondary structure (SS) predicted by PSIPRED. The third parameter
is the name of the directory where the new decoys are stored.
runFolder_RE.sh takes the same parameters as runFolder.sh.
The above command creates a new directory, Decoy_1enhA, and the
protein folder invoked by runFolder.sh generates 2 decoys in it. If
you use runFolder_RE.sh, you will normally find 6 to 12 decoys in this
directory; in rare cases, for the most difficult sequences, a run may
generate as few as 2-4 decoys, and for easy sequences it may generate
up to 18-20 decoys. Please note that runFolder_RE.sh takes more time
to finish than runFolder.sh, although the average sampling time per
decoy is similar for both methods.
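Since one simulated annealing run yields 2 decoys, a larger decoy set needs repeated runs. The sketch below only prints the commands rather than executing them. The function name print_sa_runs and the Decoy_<target>_<n> directory names are hypothetical; separate directories are used because this README does not say whether repeated runs into one directory overwrite earlier decoys.

```shell
# Print (do not execute) the commands for COUNT independent simulated
# annealing runs, each writing to its own output directory.
print_sa_runs() {  # usage: print_sa_runs TARGET MODEL COUNT
    i=1
    while [ "$i" -le "$3" ]; do
        echo "./runFolder.sh ${1} ${2} Decoy_${1}_${i}"
        i=$((i + 1))
    done
}

print_sa_runs 1enhA CNFo2 3
# prints:
# ./runFolder.sh 1enhA CNFo2 Decoy_1enhA_1
# ./runFolder.sh 1enhA CNFo2 Decoy_1enhA_2
# ./runFolder.sh 1enhA CNFo2 Decoy_1enhA_3
```

Piping the output to sh would execute the runs one after another.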
As you can see in runFolder.sh, the -f option of protein folder names
the configuration file it uses. We have provided 5 configuration files
in the DATA directory: proteinFolderCRFo1.conf for the CRFo1 option,
proteinFolderCRFo2.conf for the CRFo2 option, proteinFolderCNFo1.conf
for the CNFo1 option, proteinFolderCNFo2.conf for the CNFo2 option,
and proteinFolder_noSS.conf for the noSS option. The results in the
references were obtained with the models that use predicted SS.
However, the noSS option is worth trying when the confidence of the
secondary structure predicted by PSIPRED is very low.
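The correspondence between the model option and the configuration file can be written down as a small lookup. This is an illustrative sketch only: the function names are hypothetical, and it returns bare file names without assuming where the scripts look for them.

```shell
# Return the configuration file name that belongs to a model option.
config_for_model() {  # usage: config_for_model OPTION
    case "$1" in
        CRFo1|CRFo2|CNFo1|CNFo2) echo "proteinFolder${1}.conf" ;;
        noSS) echo "proteinFolder_noSS.conf" ;;
        *) echo "unknown model option: $1" >&2; return 1 ;;
    esac
}

# Succeed if the option needs the SS predicted by PSIPRED
# (every option except noSS).
needs_predicted_ss() {  # usage: needs_predicted_ss OPTION
    [ "$1" != "noSS" ]
}
```

For instance, config_for_model CNFo2 prints proteinFolderCNFo2.conf, and needs_predicted_ss noSS fails.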
CONFIGURATIONS
In this section, we will explain the parameters in the
configuration file and option file.
1. proteinFolder*.conf
This is the main configuration file, passed to ProteinFolder (and to
ProteinFolder_noSS) with the -f option. It contains the following
sections:
1) Settings for CRF/CNF models
crfModelDir is the location of the model files. We set it to the DATA
directory, i.e. "./DATA".
crfOptionFile is the name of the option file located under
crfModelDir. Each proteinFolder*.conf has its own option*.txt. For
example, it is set to optionCRFo1.txt for the CRFo1 model in
proteinFolderCRFo1.conf, to option_noSS.txt for the noSS model in
proteinFolder_noSS.conf, and so on for the other models.
crfFeatureTemplateFile is the feature template file. It is fixed to
feature_template.txt.2. Please do not change it.
thetaTauClusterFile is the cluster file for the theta-tau pairs. It is
set to the Kent-Frag100-param.txt file under the config directory.
Please do not change it.
2) Settings for the input of protein folder
These settings give the directory containing the preliminary data
files, including the sequence file, the secondary structure files, and
the *.pretag files generated by generateFolderData.sh. We set it to
SEQ/INPUT_CRF for the CRFs models, SEQ/INPUT_CNF for the CNFs models
with predicted SS, and SEQ/INPUTnoSS for noSS. The parameters in this
section are SeqPath, SSPath, and PretagPath. They are assigned the
same value in the examples.
3) Weights used for the energy function
As indicated in the references, the simple energy function has 3
terms: DOPE, BMK, and ESP. In our experiments, we used the following
weights:
ENERGY_WEIGHT_DOPE_BETA=1
ENERGY_WEIGHT_BMK=0.4
ENERGY_WEIGHT_ESP_NEW=100
You can also sample without the energy function by setting:
ENERGY_WEIGHT_RADIUS=1
#ENERGY_WEIGHT_DOPE_BETA=1
#ENERGY_WEIGHT_BMK=0.4
#ENERGY_WEIGHT_ESP_NEW=100
When sampling without the energy function, protein folder tries to minimize the radius of gyration of the target protein. This option works only with the Simulated Annealing method, so runFolder_RE.sh is not suitable.
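Rather than editing a shipped configuration file by hand, the switch can be applied to a copy. The following is a rough sketch under stated assumptions: the function name make_radius_only_conf is hypothetical, and the file is assumed to hold one KEY=VALUE setting per line with '#' marking a disabled line, as in the excerpts above.

```shell
# Write a copy of a proteinFolder*.conf in which the three energy terms
# are commented out and ENERGY_WEIGHT_RADIUS=1 is active.
make_radius_only_conf() {  # usage: make_radius_only_conf IN OUT
    sed -E \
        -e 's/^(ENERGY_WEIGHT_(DOPE_BETA|BMK|ESP_NEW)=)/#\1/' \
        -e 's/^#?[[:space:]]*ENERGY_WEIGHT_RADIUS=.*/ENERGY_WEIGHT_RADIUS=1/' \
        "$1" > "$2"
    # Append the radius weight if the input did not mention it at all.
    grep -q '^ENERGY_WEIGHT_RADIUS=' "$2" || echo 'ENERGY_WEIGHT_RADIUS=1' >> "$2"
}
```

The resulting file would then be named by the -f option in a copy of runFolder.sh, since this mode applies to Simulated Annealing only.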
4) Settings for controlling the sampling process
MAXIMUM_ITERATION: With the simulated annealing method, this is the
number of iterations used to reach a decoy. Please note that each run
of ProteinFolder generates 2 decoys. With the replica exchange method,
this is the number of iterations sampled for each replica. We use 20
replicas, which yields 20 potential decoys at the end, and we output
the decoys whose energy is within 15% of the lowest energy obtained
among the 20 decoys. Normally this gives 6-12 decoys for each run.
MAXIMUM_ITERATION is set to 24000 in our experiments.
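The 15% selection rule can be illustrated on a list of energies. This is only a sketch of the idea, not the code inside ProteinFolder: the function name select_low_energy and the "NAME ENERGY" input format are hypothetical, and reading "within 15%" as E - Emin <= 0.15 * |Emin| is an assumption.

```shell
# Read one "NAME ENERGY" pair per line and print the names whose energy
# lies within FRACTION (default 0.15) of the lowest energy.
select_low_energy() {  # usage: select_low_energy [FRACTION] < energies.txt
    awk -v frac="${1:-0.15}" '
        { name[NR] = $1; energy[NR] = $2 + 0
          if (NR == 1 || energy[NR] < min) min = energy[NR] }
        END {
            # Allowed gap above the minimum (assumed interpretation).
            span = frac * (min < 0 ? -min : min)
            for (i = 1; i <= NR; i++)
                if (energy[i] - min <= span) print name[i]
        }'
}
```

With the input lines "d1 -100", "d2 -90", "d3 -80" and "d4 -86", the cutoff is -85, so d1, d2 and d4 are kept and d3 is dropped.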
START_TEMPERATION and END_TEMPERATION are the start and end
temperatures of the SA process, or the highest and lowest temperatures
of the replicas. They are set to 50000 and 10 respectively in our
experiments.
ANNEAL_STEPS sets how many iterations to perform before exchanging
replicas between neighboring temperatures in the replica exchange
method. Suggested values are 240, 300, or 400.
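To see what the suggested ANNEAL_STEPS values mean in practice, the arithmetic below counts the exchange rounds per replica for MAXIMUM_ITERATION=24000. It assumes one exchange attempt after every ANNEAL_STEPS iterations, which is our reading of the description above rather than something verified against the program; the function name exchange_rounds is hypothetical.

```shell
# Number of exchange rounds per replica, assuming one exchange attempt
# after every ANNEAL_STEPS iterations (integer division).
exchange_rounds() {  # usage: exchange_rounds MAXIMUM_ITERATION ANNEAL_STEPS
    echo $(( $1 / $2 ))
}

for steps in 240 300 400; do
    echo "ANNEAL_STEPS=$steps -> $(exchange_rounds 24000 "$steps") exchange rounds"
done
# prints:
# ANNEAL_STEPS=240 -> 100 exchange rounds
# ANNEAL_STEPS=300 -> 80 exchange rounds
# ANNEAL_STEPS=400 -> 60 exchange rounds
```

Under this reading, a smaller ANNEAL_STEPS gives more frequent exchanges within the same total number of iterations.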
We use some experimental parameters to adjust the chance of sampling
the different types of secondary structure. The following parameters
take effect only when a model that uses predicted SS is chosen. We
suggest you keep them unchanged.
STARTING_WEIGHTED_SIMULATION_STEP=7200
SIMULATION_WEIGHT_ALPHA=1
SIMULATION_WEIGHT_BETA=5
SIMULATION_WEIGHT_COIL=3
5) System reserved settings. Please do not change the following settings.
fragLen=5
libLen=100
interval=2
fragLibFile=./DATA/lib_100_z_5.txt
2. option*.txt
As mentioned above, option*.txt contains the parameters for the
models. Please refer to the references for details about the models.
Here, model=1 means the CRFs model and model=2 means the CNFs model; order=1 means first-order CRFs/CNFs, order=2 means second-order CRFs/CNFs, and order=4 means second-order CNFs with two layers of neurons, which is used for noSS only.
Each combination of model and order requires its own model file (model_file) and label mapping file (cnf_label_file).
We have enclosed in the package 5 model files: modelCRFo1.dat (used when model=1 and order=1), modelCRFo2.dat (used when model=1 and order=2), modelCNFo1.dat (for model=2 and order=1), modelCNFo2.dat (for model=2 and order=2), and modelCNFnoSS.dat (for model=2 and order=4).
The label file (cnf_label_file) should be set to cnf_label_SS_1.dat
for the 1st order CNFs model, and to cnf_label_SS_2.dat for the 2nd
order CNFs model. Please note that no label mapping file is needed for
the CRFs models.
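The model/order combinations listed above can be summarized as a lookup, which may be handy for checking an edited option file. The function name files_for_option is hypothetical. A "-" in the second column marks a label file that is either not needed (CRFs) or not specified in this README (the noSS model, order=4).

```shell
# Print "MODEL_FILE LABEL_FILE" for a model/order pair from option*.txt.
files_for_option() {  # usage: files_for_option MODEL ORDER
    case "$1/$2" in
        1/1) echo "modelCRFo1.dat -" ;;
        1/2) echo "modelCRFo2.dat -" ;;
        2/1) echo "modelCNFo1.dat cnf_label_SS_1.dat" ;;
        2/2) echo "modelCNFo2.dat cnf_label_SS_2.dat" ;;
        2/4) echo "modelCNFnoSS.dat -" ;;   # label file not stated here
        *) echo "unsupported model/order: $1/$2" >&2; return 1 ;;
    esac
}
```

For example, files_for_option 2 2 prints the model file and label file for the second-order CNFs model.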
REFERENCES
1. Fragment-free Approach to Protein Folding Using Conditional Neural
Fields, Feng Zhao, Jian Peng and Jinbo Xu, The Eighteenth Annual
International Conference on Intelligent Systems for Molecular Biology
(ISMB2010), Bioinformatics. 2010 June 15; 26(12):i310-i317;
doi:10.1093/bioinformatics/btq193
2. A probabilistic and continuous model of protein conformational
space for template-free modeling, Feng Zhao, Jian Peng, Joe Debartolo,
Karl F. Freed, Tobin R. Sosnick, Jinbo Xu. Journal of Computational
Biology. June 2010, 17(6): 783-798. doi:10.1089/cmb.2009.0235.
3. Discriminative Learning for Protein Conformation Sampling, Feng
Zhao, ShuaiCheng Li, Beckett W. Sterner and Jinbo Xu, PROTEINS:
Structure, Function and Bioinformatics, 2008;73(1):228-240.