
WWW.C-CHEM.ORG FULL PAPER

B-Factor Profile Prediction for RNA Flexibility Using Support Vector Machines

Ivantha Guruge,[a] Ghazaleh Taherzadeh,[a] Jian Zhan,[a] Yaoqi Zhou,*[a] and Yuedong Yang*[a,b]

Determining the flexibility of structured biomolecules is important for understanding their biological functions. One quantitative measurement of flexibility is the atomic Debye-Waller factor or temperature B-factor. Most existing studies are limited to temperature B-factors of proteins and their prediction. Only one method attempted to predict temperature B-factors of ribosomal RNA. Here, we developed and compared machine-learning techniques in prediction of temperature B-factors of RNAs. The best model based on Support Vector Machines yields a Pearson’s correlation coefficient of 0.51 for fivefold cross validation and 0.50 for the independent test. Analysis of the performance indicates that the model has the best performance on rRNAs, tRNAs, and protein-bound RNAs, for long chains in particular. The server is available at http://sparks-lab.org/server/RNAflex. © 2017 Wiley Periodicals, Inc.

DOI: 10.1002/jcc.25124

[a] I. Guruge, G. Taherzadeh, J. Zhan, Y. Zhou, Y. Yang
School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
E-mail: yaoqi.zhou@griffith.edu.au
[b] Y. Yang
School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510275, China
E-mail: yangyd25@mail.sysu.edu.cn
Contract grant sponsor: National Health and Medical Research Council (Australia); Contract grant numbers: 1059775 and 1083450; Contract grant sponsor: Australian Research Council’s Linkage Infrastructure, Equipment and Facilities funding scheme (project number LE150100161, to Y.Z.); Contract grant sponsor: National Natural Science Foundation of China; Contract grant numbers: 61772566 and U1611261 (to Y.Y.)
Journal of Computational Chemistry 2018, 39, 407–411

Introduction

Three-dimensional structures of proteins and RNA determined by X-ray crystallography represent the average positions of atoms. Thermal fluctuation around the average positions can be measured by the temperature B-factor, or Debye-Waller factor.[1] Atoms with high B-factor values are in general more flexible. Structural flexibility and dynamic motions are essential for protein catalysis and allostery[2] and for secondary structure formation and the folding of RNA catalysts as well as protein–RNA recognition.[3] Temperature B-factors have been used in many applications including analysis of protein active sites,[4] protein disorder regions,[5] and protein thermal stability.[6]

The importance of temperature B-factors has led to the development of methods for sequence- and structure-based analysis and prediction. If a protein structure is known, molecular dynamics (MD) simulations have been used to correlate root-mean-squared fluctuations and temperature B-factors.[7] However, MD simulations are too time consuming for large-scale studies. As a result, various simple models using normal-mode analysis,[8] graph theory,[9] and mean-field theory[10,11] have been developed. Normal-mode analysis, for example, achieved correlation coefficients between experimental B-factors and computational calculations of about 0.6.[12–14] Interestingly, a similar correlation can be achieved using a weighted contact number.[15]

For sequence-based approaches, various machine-learning models such as support vector machines (SVM),[16,18,19] random forest (RF),[17] and neural networks[20] were used. These sequence-based methods with real-value prediction achieved a correlation coefficient between predicted and actual B-factors of about 0.5.

Unlike the multiple methods developed for protein temperature B-factors, there is only one study that predicted ribosomal RNA B-factor profiles based on their sequence and structure information.[21] The correlation coefficient between predicted and actual temperature B-factors is 0.39 for the best sequence-based method and 0.48 for the best structure-based method. Only a small dataset was used (13 crystal structures of ribosomal 50S subunits), without the normalization necessary to make structural flexibility consistent across structures that are crystallized under different conditions and environments.[22] More importantly, there is no web-based server available for the academic community.

The objective of this work is to establish a method for predicting temperature B-factors based on RNA sequence information only. Developing a sequence-based method for B-factor prediction is important because the vast majority of the genome codes for RNAs rather than proteins, and more and more of these non-coding RNAs are found to be functional.[23] Compared to proteins, RNAs are more difficult to fold into unique three-dimensional structures (i.e., they are inherently more flexible) because the differences between the four bases are small (all hydrophilic with similar sizes). In fact, only a few hundred non-redundant RNA structures have been determined at high resolution. The large majority of RNAs without known structures makes it imperative to develop sequence-based methods.

In this study, we have built a non-redundant dataset of 142 RNA structures that are randomly separated into training (108 RNAs) and test (34 RNAs) sets. We examined single-sequence and sequence-profile-based features and several machine-learning techniques (SVM, Artificial Neural Networks [ANN], and RF) for their ability to predict temperature B-factors. The best model was based on SVM and produced a Pearson’s correlation coefficient (PCC) of 0.51 for the fivefold cross validation and 0.50 for the independent test set.

Methods

Data sets

We obtained 1093 RNA structures deposited in the Protein Data Bank with resolution <3.0 Å and RNA chains longer than 32 bases (March 2015). To avoid over training, we removed redundant chains with the program cd-hit-est using the lowest allowed sequence-identity cut-off of 80%.[24] A total of 142 RNA chains were obtained. We randomly divided these chains into roughly 75% for training and cross validation (108 chains, 26,764 bases) and 25% for test (34 chains, 5989 bases). The B-factor for each nucleotide is represented by atom type C1.

B-factor normalizations

B-factors are not consistently measured across different structures because different refinement procedures and different temperatures are used for structure determination.[22] Thus, B-factor normalization is necessary so that the relative flexibility for a given RNA sequence is used for method development.[25] Here, the data were normalized according to the method of Smith et al.[26] In this method, outliers were first detected and removed using a median-based approach. Next, the mean and standard deviation of B-factors in an RNA structure were calculated. The normalized B-factors for a given RNA structure are the deviations of the raw B-factors from the mean divided by the standard deviation. The normalized B-factor profile for the training set falls roughly between −3.00 and 4.00.

Single-sequence input features

A simple method to represent the RNA sequence is to use vector-based orthogonal codes[27] in which A, U, G, and C are represented by four-dimensional vectors of (1000), (0100), (0010), and (0001), respectively.

Evolution-based sequence profile

Evolution-based sequence profiles have been found useful in predicting protein temperature B-factors (e.g., Ref. 20). This is because sequence conservation in regions with different flexibility has different patterns. In a previous study, we obtained evolution-based sequence profiles[28] by querying the RNA sequences against an RNA sequence library using BLASTN[29] with E-value < 0.001 and a maximum of 50,000 homologous sequences.[30] The probability of base j (j = A, T/U, G, C) at a given position i in the multiple alignment of homologous sequences, $P_{i,j}$, was calculated as $P_{i,j} = -\log\left[ N_{i,j} / \sum_{j} N_{i,j} \right]$, where $N_{i,j}$ is the number of observed base type j at position i. To avoid zero values, a small number correction $s(b_i)$ was used in $N_{i,j}$ based on the normalized expected average occurrences for the types (native base and other types). $s(b_i)$ was set to 0.3 for the other base type $b_i$ and 9.0 for the query base type. The obtained sequence profiles were normalized to a range of (–1, 1) before being used for training and test.[28]

Window-based features

To predict the B-factor of a given RNA base, we also input the sequence or sequence-profile information of neighboring RNA bases. The size of the sequence window was optimized during training and cross validation.

Figure 1. Pearson’s correlation coefficient (PCC) between predicted and actual temperature B-factors as a function of window sizes for fivefold cross validation on the train set using SVM models. Single-sequence and sequence-profile-based models are shown in dashed and solid lines, respectively.

Support vector machine

We used LibSVM[31] to build the predictive SVM models based on the radial basis function (RBF) kernel. Support Vector Regression (SVR)[32] was used to predict the real value of B-factors. The two parameters of RBF (gamma and C) were optimized by a grid search to find the model that produced the best performance for fivefold cross validation. The optimal values for the gamma and C parameters were 0.03125 and 1, respectively.

Artificial neural network

ANNs are made of interconnected multi-layer units to learn non-linear relations between input and desired output. Deep learning or structured learning refers to ANNs with three or more hidden layers. We used a rectified linear unit as the activation function except that tanh was utilized for the activation in the output layer. The Adam stochastic gradient descent algorithm was used for weight optimization. We also examined the effect of different numbers of hidden layers (1 to 4), numbers of neurons (between 100 and 800), numbers of epochs (between 10 and 100), and the size of mini-batches within an epoch, that is, the number of samples over which we average to find the updates to weights/biases needed to descend the gradient (between 50 and 300). A grid search was implemented to find the best parameters above. The final model used 1 hidden layer, 600 neurons, 50 epochs, and 50 bases per mini-batch. The learning rate was set to 0.001. Here, we used the ANN implemented in the Keras software package, which is a high-level neural networks API.[33]

Random Forest

RFs or random decision forests[34,35] are ensemble learning methods for classification, regression and other tasks. RF creates a multitude of decision trees when training the model and outputs the classes (classification) or mean prediction (regression) of the individual trees. Each individual tree is trained using a subset of the train set and is evaluated on the test set. The final prediction output is then calculated by combining the output of all trees. The RF model parameters are optimized by setting the minimum number of samples at a leaf node, the number of trees in the forest, the maximum number of features in each tree and a function to measure the quality of split. A grid search was implemented to find the optimized values for each parameter. The best model was trained using a minimum leaf size of 1 and 1000 trees with the maximum number of features in each tree set to “Auto.” The RF was implemented using the Scikit-Learn machine learning library.[36]
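The grid search mentioned for the RF parameters can be illustrated generically. The following is only a minimal sketch in plain Python, not the authors' code: the candidate values in `rf_grid` and the function `placeholder_score` are hypothetical (the text reports only the selected setting), and in practice the scoring function would train a model with the given parameters and return its mean fivefold cross-validated PCC.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively score every parameter combination and keep the best one.

    param_grid maps a parameter name to its list of candidate values;
    score_fn takes a dict of parameters and returns a number (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the first combination encountered
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values: the text reports only the selected setting.
rf_grid = {
    "min_samples_leaf": [1, 5, 10],
    "n_estimators": [100, 500, 1000],
}


def placeholder_score(params):
    # Stand-in for "mean fivefold cross-validated PCC of a model trained with
    # these parameters". It is a made-up function, not a result from the paper.
    return (-abs(params["min_samples_leaf"] - 5)
            - abs(params["n_estimators"] - 500) / 100)


best_params, best_score = grid_search(rf_grid, placeholder_score)
print(best_params, best_score)
```

The search itself is independent of the model: swapping `placeholder_score` for a function that fits an SVR, ANN, or RF and returns its cross-validated PCC would give the kind of selection procedure described for each model.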

Figure 2. Density plots of actual normalized temperature B-factors against predicted temperature B-factor for the single-sequence-based (a) and profile-based (b) SVM models for the test set. [Color figure can be viewed at wileyonlinelibrary.com]

Cross-validation test

We performed the fivefold cross validation test on the training set. Here, the dataset was randomly divided into five folds with similar number of chains. Each fold was selected as the test set while the other folds were used as the training set. This process was repeated five times so that every fold was tested once.
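The chain-level fivefold split described in this section can be sketched as follows. This is an illustrative sketch only: the function names, the random seed, and the chain identifiers are assumptions, and only the count of 108 training chains comes from the text. Splitting by chain rather than by base keeps all bases of one RNA in the same fold.

```python
import random


def chain_folds(chain_ids, k=5, seed=0):
    """Shuffle the chains and deal them into k folds of near-equal size."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def cross_validation_splits(chain_ids, k=5, seed=0):
    """Yield (train_chains, held_out_chains) so each fold is held out once."""
    folds = chain_folds(chain_ids, k, seed)
    for i, held_out in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, held_out


# 108 training chains as in the text; the identifiers themselves are made up.
chains = [f"chain_{n:03d}" for n in range(108)]
for train, held_out in cross_validation_splits(chains):
    print(len(train), len(held_out))
```

With 108 chains, three folds hold 22 chains and two hold 21, which matches the "similar number of chains" per fold described above.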

Independent test

Once the best model was identified, the trained model was tested with our independent test set, which was not used when training the model. Similar performance in fivefold cross validation and independent test would indicate robustness of the trained model.

Performance evaluation criteria

The overall performance of the SVM model is assessed by the PCC between predicted and actual values of temperature B-factors.[37]

Results

Figure 1 displays PCC as a function of the window size based on an SVM model and fivefold cross validation on the train set. There are two curves: one is based on single-sequence features only and the other is based on the evolution-derived sequence profile. PCC increases quickly as the window size increases, and the change is small once the window size is greater than 20. We found that the best window sizes are 20 and 30 for SVM models based on single sequence and sequence profiles, respectively.

Table 1 compares the performances of SVM models based on single sequence and sequence profile. The performances in fivefold cross validation and independent test are similar for both single-sequence and sequence-profile-based techniques, suggesting the robustness of the model in its application to unseen data. The profile-based method performs significantly better than the single-sequence-based method: the PCC value increases from about 0.45 to about 0.5. This highlights that sequence conservation plays a significant role in chain flexibility.

Table 1. Performance using different features and models.

Input/Model     Fivefold PCC    Test set PCC
Sequence/SVM    0.4467          0.4640
PSSM/SVM        0.5176          0.5028
PSSM/RF         0.4908          0.5036
PSSM/DNN        0.4791          0.4439
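For reference, the PCC used as the evaluation criterion and reported in Table 1 follows the standard definition of Pearson's correlation. The sketch below is a plain-Python illustration with a hypothetical function name and toy numbers; it is not the evaluation script used in the study.

```python
import math


def pearson_cc(predicted, actual):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Sums of products of deviations from the means.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_a)


# Toy values only, chosen so the result is easy to verify by hand.
print(round(pearson_cc([1, 2, 3, 4], [1, 3, 2, 4]), 6))  # 0.8
```

Because the PCC is invariant to shifting and rescaling either sequence, it compares the shape of the predicted and actual normalized B-factor profiles rather than their absolute values.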


In addition to SVM models, we also used Neural Networks and RFs based on sequence profiles as input. We also found a window size of 30 to be the optimal window size for NN and RF models. Results for fivefold cross validation and independent tests are shown in Table 1. The performances of the three models are similar, with SVM having the best performance.

Figure 2 compares predicted to actual normalized temperature B-factors given by single-sequence-based and profile-based SVM models for all bases in the test set. Based on the spread of the distribution around the overall regression line, it is clear that the profile-based method is substantially more accurate than the single-sequence-based method.

To further understand the method performance for different types of RNAs, we separated RNAs into rRNA (39 chains), tRNA (46 chains), riboswitches (10 chains) and others (47 chains). Here, we have combined cross-validated and test results so that we have a large dataset to analyze. We can do this because our method has similar performance in cross validation and independent test. We found that the overall PCC values are 0.58 for rRNA, 0.48 for tRNA, 0.05 for riboswitches, and 0.13 for others. The low accuracy for riboswitches is somewhat expected because they have more than one conformation, whereas temperature B-factors are fluctuations around a single conformation. We can also classify RNAs into those bound with proteins (93 chains) or not bound with proteins (49 chains). We found that overall PCC values are 0.56 for protein-bound RNAs and 0.11 for protein-free RNAs.

Figure 3 further shows the PCC values for individual chains as a function of RNA chain length. Longer chains have more accurately predicted temperature B-factors. The structures of all long chains are rRNA or tRNA complexed with proteins. For short chains, RNAs complexed with proteins are in general predicted more accurately than protein-free RNAs. Most difficult-to-predict RNAs are in the other categories and not bound with proteins. Examples of those poorly predicted other RNAs are 1f1t (ligand-bound RNA aptamer), 1xjr (short virus RNA element), and 3p22 (short virus RNA element) with PCC of less than 0.01.

Figure 3. Pearson’s correlation coefficients for different types of RNA chains as a function of chain length. Because of the large size difference, the X-axis is shown in a logarithmic scale. [Color figure can be viewed at wileyonlinelibrary.com]

Figure 4 demonstrates one example of highly accurate prediction for a long chain. This is a 23S rRNA of the thermophilic bacterium Thermus thermophilus. The peaks and valleys of thermal fluctuations were reproduced quite accurately with PCC = 0.73.

Figure 4. Comparison of predicted versus actual normalized B-factor profile of 23S rRNA of Thermus thermophilus (Chain RNA in PDB 4lnt). Dashed lines with a node represent the actual temperature B-factor and the solid line represents the predicted temperature B-factor. The PCC value between predicted and experimental B-factors is 0.73.

Discussion

In this study, we have developed a method that predicts temperature B-factors of RNAs. Using SVM models and sequence profiles, we achieved PCC values of 0.52 for fivefold cross validation and 0.50 for the independent test set. Similar performance in cross validation and test sets and similar performance to other machine learning models (Table 1) confirm the robustness of the performance obtained. The performance is also similar to the published performance for a structure-based method in a prior work,[21] which is limited to rRNA. Analysis of the performance of our sequence-based method indicates that it provides reasonable accuracy in predicting temperature B-factors of rRNA, tRNA and protein-bound RNAs, for long chains in particular.

Predicting B-factor profiles is challenging because of intrinsic noise in the experimental data. B-factors depend on the environment, such as the temperatures, the stages of measurements and the refinement methods.[38,39] The measured B-factor profiles should reflect the authentic fluctuation and static, dynamic and lattice disorders.[16] Radivojac et al. found a high correlation between homologous proteins (average PCC of 0.8),[5] which Yuan et al. believed to be the upper limit of predicting B-factor profiles.[16] All existing sequence-based techniques for predicting protein temperature B-factors yielded PCC around 0.5, regardless of the type of method used for prediction.[16–18] The similar performance achieved for RNA temperature B-factors in this study suggests that better features are required to further improve the overall performance of B-factor prediction. We have attempted to incorporate predicted secondary structures.[40] No significant improvement was observed.


Acknowledgments

We also gratefully acknowledge the support of the Griffith University eResearch Services Team and the use of the High Performance Computing Cluster “Gowonda.” This research/project has also been undertaken with the aid of the research cloud resources provided by the Queensland Cyber Infrastructure Foundation (QCIF).

Keywords: RNA flexibility · temperature B-factor · support vectors regression

How to cite this article: I. Guruge, G. Taherzadeh, J. Zhan, Y. Zhou, Y. Yang. J. Comput. Chem. 2018, 39, 407–411. DOI: 10.1002/jcc.25124

[1] P. Debye, Ann. Phys. 1913, 348, 49.
[2] R. M. Daniel, R. V. Dunn, J. L. Finney, J. C. Smith, Annu. Rev. Biophys. Biomol. Struct. 2003, 32, 69.
[3] S. D. Levene, In eLS; Wiley, 2002. doi: 10.1038/npg.els.0003125
[4] O. Carugo, P. Argos, Proteins 1998, 31, 201.
[5] P. Radivojac, Z. Obradovic, D. K. Smith, G. Zhu, S. Vucetic, C. J. Brown, J. D. Lawson, A. K. Dunker, Protein Sci. 2004, 13, 71.
[6] S. Parthasarathy, M. R. Murthy, Protein Eng. 2000, 13, 9.
[7] J. L. Klepeis, K. Lindorff-Larsen, R. O. Dror, D. E. Shaw, Curr. Opin. Struct. Biol. 2009, 19, 120.
[8] I. Bahar, A. R. Atilgan, B. Erman, Fold. Des. 1997, 2, 173.
[9] D. J. Jacobs, A. J. Rader, L. A. Kuhn, M. F. Thorpe, Proteins Struct. Funct. Genet. 2001, 44, 150.
[10] C. Micheletti, J. R. Banavar, A. Maritan, Phys. Rev. Lett. 2001, 87, 088102.
[11] B. P. Pandey, C. Zhang, X. Z. Yuan, J. Zi, Y. Q. Zhou, Protein Sci. 2005, 14, 1772.
[12] D. A. Kondrashov, A. W. Van Wynsberghe, R. M. Bannen, Q. Cui, G. N. Phillips, Structure 2007, 15, 637.
[13] T. Z. Sen, Y. P. Feng, J. V. Garcia, A. Kloczkowski, R. L. Jernigan, J. Chem. Theory Comput. 2006, 2, 696.
[14] D. Riccardi, Q. Cui, G. N. Phillips, Biophys. J. 2009, 96, 2548.
[15] C. P. Lin, S. W. Huang, Y. L. Lai, S. C. Yen, C. H. Shih, C. H. Lu, C. C. Huang, J. K. Hwang, Proteins Struct. Funct. Bioinform. 2008, 72, 929.
[16] Z. Yuan, T. L. Bailey, R. D. Teasdale, Proteins 2005, 58, 905.
[17] X.-Y. Pan, H.-B. Shen, Protein Pept. Lett. 2009, 16, 1447.
[18] Y. Pan, F. Lv, F. Tian, X. Luo, X. Kong, Y. Li, Q. Yang, Mol. Inform. 2010, 29, 195.
[19] A. G. de Brevern, A. Bornot, P. Craveur, C. Etchebest, J. C. Gelly, Nucleic Acids Res. 2012, 40, W317.
[20] A. Yaseen, M. Nijim, B. Williams, L. Qian, M. Li, J. Wang, Y. Li, BMC Bioinform. 2016, 17, 281.
[21] F. Tian, C. Zhang, X. Fan, X. Yang, X. Wang, H. Liang, Mol. Inform. 2010, 29, 707.
[22] D. Tronrud, J. Appl. Crystallogr. 1996, 29, 100.
[23] C. P. Ponting, P. L. Oliver, W. Reik, Cell 2009, 136, 629.
[24] W. Li, A. Godzik, Bioinformatics (Oxford, England) 2006, 22, 1658.
[25] P. A. Karplus, G. E. Schulz, Naturwissenschaften 1985, 72, 212.
[26] D. K. Smith, P. Radivojac, Z. Obradovic, A. K. Dunker, G. Zhu, Protein Sci. 2003, 12, 1060.
[27] J. Jonsson, T. Norberg, L. Carlsson, C. Gustafsson, S. Wold, Nucleic Acids Res. 1993, 21, 733.
[28] Y. Yang, X. Li, H. Zhao, J. Zhan, J. Wang, Y. Zhou, RNA (New York, NY) 2017, 23, 14.
[29] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Nucleic Acids Res. 1997, 25, 3389.
[30] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, J. Mol. Biol. 1990, 215, 403.
[31] C.-C. Chang, C. J. Lin, ACM Trans. Intell. Syst. Technol. 2011, 2, 27.
[32] D. Basak, S. Pal, D. C. Patranabis, Neural Inf. Process. Lett. Rev. 2007, 11, 203.
[33] F. Chollet, GitHub: GitHub Repository; 2015. Available at: https://github.com/fchollet/keras
[34] T. K. Ho, IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832.
[35] T. K. Ho, Proceedings of the Third International Conference on Document Analysis and Recognition, 1995; pp. 278–282.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Mach. Learn. Res. 2011, 12, 2825.
[37] K. Pearson, Proc. R. Soc. Lond. 1895, 58, 240.
[38] H. Frauenfelder, G. A. Petsko, D. Tsernoglou, Nature 1979, 280, 558.
[39] D. Ringe, G. A. Petsko, Prog. Biophys. Mol. Biol. 1985, 45, 197.
[40] R. Lorenz, S. H. Bernhart, C. Honer Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, I. L. Hofacker, Algorithms Mol. Biol. 2011, 6, 26.

Received: 11 September 2017
Accepted: 7 November 2017
Published online on 21 November 2017
