Protein solvent accessibility prediction systems
Ritta Shaheen1, Hani Amasha2, Majd Aljamali3
1Department of Biomedical Engineering, Faculty of Mechanical and Electrical Engineering, Damascus University, Damascus, Syria.
2Department of Biomedical Engineering, FMEE, Damascus University and Faculty of Informatics and Communication Engineering, Arab International University, Damascus, Syria
3Faculty of Pharmacology, Damascus University, Damascus, Syria
To cite this article:
Ritta Shaheen, Hani Amasha, Majd Aljamali. Protein Solvent Accessibility Prediction Systems. American Journal of Biomedical and Life Sciences. Special Issue: Spectral Imaging for Medical Diagnosis "Modern Tool for Molecular Imaging". Vol. 3, No. 2-3, 2015, pp. 21-24. doi: 10.11648/j.ajbls.s.2015030203.14
Abstract: Background: Prediction of protein solvent accessibility, also called accessible surface area (ASA) prediction, is an important step in predicting tertiary structure directly from one-dimensional sequences. Traditionally, predicting solvent accessibility is regarded as either a two-state (exposed or buried) or three-state (exposed, intermediate or buried) classification problem. However, the states of solvent accessibility are not well defined in real protein structures. Thus, a number of methods have been developed to predict the ASA directly from information such as amino acid composition. Results: In this study we use physicochemical properties of amino acids, such as hydrophobicity, for ASA prediction based on amino acid composition. We propose a systematic method for identifying residue groups with respect to protein solvent accessibility. The hydrophobicity values of the amino acids are used to generate features. Finally, an adaptive neuro-fuzzy inference system (ANFIS) is adopted to construct an ASA predictor. Experimental results demonstrate that the features produced by the proposed selection process are informative for ASA prediction. Conclusion: Experimental results on a widely used benchmark show that the proposed method performs well among several existing packages that predict ASA from the amino acid sequence only. The program and data are available from the authors upon request.
Keywords: Protein Structure, Protein Solvent Accessibility, Accessible Surface Area, Structure Prediction, Adaptive Neuro Fuzzy Inference, Hydrophobicity
Predicting protein tertiary structures directly from one-dimensional sequences remains a challenging problem (1). Studies of solvent accessibility have shown that protein folding is driven toward maximal compactness by the solvent aversion of some residues (2). Solvent accessibility is therefore considered a crucial factor in protein folding, and prediction of protein solvent accessibility, also called accessible surface area (ASA) prediction, is an important step in tertiary structure prediction (3). Traditionally, predicting solvent accessibility is regarded as either a two-state (exposed or buried) or three-state (exposed, intermediate or buried) classification problem. Various machine learning methods have been adopted, including neural networks (4) (5) (6) (7) (8) (9) (10) (11), Bayesian statistics (12), logistic functions (13), information theory (14) (15) (16) and support vector machines (SVMs) (17) (18) (19). Among these methods, neural networks were the first technique used to predict protein solvent accessibility and are still extensively adopted in recent work. SVMs have also proved effective for ASA prediction. Several features have been used to train these methods, such as local residue composition (4) (5), probability profiles (20) and the position-specific scoring matrix (PSSM) (21). Ahmad et al. developed a method, RVP-net, to predict the real values of relative solvent accessibility (RSA) (22). RVP-net used the local amino acid composition to train a neural network and yielded an accuracy of 74.1%. Yuan and Huang (23) also used the local amino acid composition and adopted support vector regression (SVR), the regression version of SVM, to achieve an accuracy of 74%. Wang et al. (24) proposed a real-value ASA predictor with an accuracy of 78% by combining the amino acid composition with multiple linear regression. Table 1 summarizes the recent developments in predicting ASA.
Neural networks and SVRs have been extensively adopted and have outperformed other machine learning methods. This study proposes a systematic process for predicting ASA, in which ANFIS is used to construct an ASA predictor. The present method is compared with three existing ASA predictors.
|Work||Regression tool||Description of features||Q (%)|
|Ahmad et al., 2003||NN (neural network)||Amino acid composition||74.1|
|Yuan and Huang, 2004||SVR (support vector regression)||Amino acid composition||74|
|Wang et al., 2005||MLR (multiple linear regression)||Amino acid composition, PSSM and sequence length||78|
This study uses two independent datasets: the first for training the ASA predictor, and the second, a small dataset (RS126), for evaluating it.
2.1. TRAIN Dataset
This dataset contains all proteins in the Protein Data Bank (PDB) that are at least 30 amino acids long and have no chain breaks. The set consists of 1180 sequences corresponding to 282,303 amino acids.
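The selection criteria above can be sketched in Python as follows. This is only an illustration, not the procedure used by the authors: the `Chain` representation, the function names, and the rule that a gap in residue numbering counts as a chain break are all assumptions.

```python
from typing import List, Tuple

# Hypothetical representation: (residue number, one-letter amino acid code)
Chain = List[Tuple[int, str]]

MIN_LENGTH = 30  # minimum chain length stated in the text


def has_chain_break(chain: Chain) -> bool:
    """Assumed heuristic: a jump in residue numbering marks a chain break."""
    numbers = [number for number, _ in chain]
    return any(b - a != 1 for a, b in zip(numbers, numbers[1:]))


def keep_chain(chain: Chain) -> bool:
    """Keep chains with at least MIN_LENGTH residues and no numbering gaps."""
    return len(chain) >= MIN_LENGTH and not has_chain_break(chain)
```

A real pipeline would more likely detect breaks from the coordinate records themselves; the numbering check is used here only because it is easy to state.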
2.2. Evaluating Dataset (RS126)
This is one of the oldest datasets created for evaluating secondary structure prediction schemes. It contains 126 proteins that share no more than 25% sequence identity over a length of at least 80 residues.
3. Practical Study
The solvent accessibility problem can be treated as a pattern recognition problem, in which an artificial neural network is trained to identify the solvent accessibility state of each amino acid in the protein sequence.
In this study we use the adaptive neuro-fuzzy inference system (ANFIS) available in the MATLAB R2011a Fuzzy Logic Toolbox, with one input layer and one output layer.
We applied a sliding window of 15 consecutive amino acids (an odd number, so that the window has a central residue) as the input to the network to predict the solvent accessibility of the residue in the middle of the window; this brings the influence of neighboring residues into the prediction. Each amino acid in the input window is encoded with its hydrophobicity value, as given in Table 2.
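The window encoding can be illustrated with a short Python sketch. Since the hydrophobicity values of Table 2 are not reproduced here, the sketch assumes the Kyte-Doolittle hydropathy scale; it also assumes that positions beyond the ends of the chain are padded with 0.0. The function name `encode_windows` is hypothetical.

```python
from typing import List

# Assumption: Kyte-Doolittle hydropathy values stand in for the paper's Table 2.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

WINDOW = 15      # window size stated in the text
PAD_VALUE = 0.0  # assumption: padding value for positions outside the chain


def encode_windows(sequence: str, window: int = WINDOW) -> List[List[float]]:
    """Return one hydrophobicity window per residue, centred on that residue."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so that it has a centre")
    half = window // 2
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    padded = [PAD_VALUE] * half + values + [PAD_VALUE] * half
    return [padded[i:i + window] for i in range(len(sequence))]
```

Each row of the result has 15 values, and the value at index 7 of a row belongs to the residue whose state is being predicted.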
The output layer has two units, each corresponding to one solvent accessibility state. The states are encoded in binary to build the target matrix of the network (giving the solvent accessibility state of each amino acid in the input matrix) as follows: 1 0 for a buried residue and 0 1 for an exposed residue.
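As a small illustration of this target encoding, and of how two real-valued outputs can be mapped back to a state by taking the larger one, consider the following Python sketch. The labels `"b"` and `"e"`, the function names, and the tie-breaking rule are assumptions made for the example.

```python
from typing import List, Sequence

BURIED = "b"   # hypothetical label for a buried residue
EXPOSED = "e"  # hypothetical label for an exposed residue


def encode_target(state: str) -> List[int]:
    """Encode a state as in the text: 1 0 for buried, 0 1 for exposed."""
    if state == BURIED:
        return [1, 0]
    if state == EXPOSED:
        return [0, 1]
    raise ValueError(f"unknown state: {state!r}")


def decode_output(outputs: Sequence[float]) -> str:
    """Pick the state whose output unit is larger.

    Assumption: a tie is resolved in favour of the buried state.
    """
    buried_score, exposed_score = outputs
    return BURIED if buried_score >= exposed_score else EXPOSED
```

For example, `decode_output([0.3, 0.8])` selects the exposed state because the second unit has the higher value.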
Using these input and output matrices, we created the ANFIS network shown in Fig. 1.
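To give a rough idea of what such a network computes at prediction time, the following Python sketch evaluates a first-order Sugeno fuzzy system, which is the kind of model that ANFIS tunes. It is not the MATLAB implementation used in this study: the class `SugenoRule`, the function `sugeno_output`, the choice of Gaussian membership functions, and every parameter value are assumptions made for illustration, and the training step is not shown.

```python
import math
from typing import List, Sequence


class SugenoRule:
    """One fuzzy rule: a Gaussian membership per input and a linear consequent."""

    def __init__(self, centers: Sequence[float], sigmas: Sequence[float],
                 coeffs: Sequence[float], bias: float) -> None:
        self.centers = list(centers)
        self.sigmas = list(sigmas)
        self.coeffs = list(coeffs)
        self.bias = bias

    def firing_strength(self, x: Sequence[float]) -> float:
        # Product of the Gaussian memberships over all inputs.
        strength = 1.0
        for xi, c, s in zip(x, self.centers, self.sigmas):
            strength *= math.exp(-0.5 * ((xi - c) / s) ** 2)
        return strength

    def consequent(self, x: Sequence[float]) -> float:
        # Linear function of the inputs (first-order Sugeno consequent).
        return sum(a * xi for a, xi in zip(self.coeffs, x)) + self.bias


def sugeno_output(rules: List[SugenoRule], x: Sequence[float]) -> float:
    """Weighted average of rule consequents, weighted by firing strength."""
    strengths = [rule.firing_strength(x) for rule in rules]
    total = sum(strengths)
    if total == 0.0:
        # All memberships underflowed: fall back to an unweighted mean.
        return sum(rule.consequent(x) for rule in rules) / len(rules)
    weighted = sum(w * rule.consequent(x) for w, rule in zip(strengths, rules))
    return weighted / total
```

In the setting described here, `x` would be one 15-value window, and one such system would be evaluated for each of the two output units.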
4. System Specifications
Table 3 lists the ANFIS specifications.
|Epochs||Error Tolerance||Optim. Method||Sub. Clustering||Unit|
|Reject Ratio||Accept Ratio||Squash Factor||Range of Influence|
This section presents the results of the system. Comparison with other systems is based on the accuracy Q of each system, which is calculated according to the following equation:
Q = (Pb + Pe) / N × 100
where Pb and Pe are the numbers of correctly predicted amino acids of the buried and exposed solvent accessibility classes, respectively, and N is the total number of amino acids.
The overall accuracy of solvent accessibility prediction is Q = 70.9%, with an accuracy of Qa = 74.4% for buried residues and Qb = 66.92% for exposed residues.
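The accuracy measure can be computed with a few lines of Python. The sketch below is illustrative only: the labels `"b"` and `"e"` and the function name `accuracy_report` are hypothetical, and the per-class accuracies are assumed to be the fraction of residues of each true class that were predicted correctly, since the text does not define them explicitly.

```python
from typing import Dict, Sequence


def accuracy_report(actual: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Return overall and per-class accuracy in percent for labels 'b' and 'e'."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(actual)
    # Correctly predicted buried (Pb) and exposed (Pe) residues.
    pb = sum(1 for a, p in zip(actual, predicted) if a == p == "b")
    pe = sum(1 for a, p in zip(actual, predicted) if a == p == "e")
    n_buried = sum(1 for a in actual if a == "b")
    n_exposed = sum(1 for a in actual if a == "e")
    nan = float("nan")
    return {
        "Q": 100.0 * (pb + pe) / n,
        # Assumed definition: correct predictions divided by true class size.
        "Q_buried": 100.0 * pb / n_buried if n_buried else nan,
        "Q_exposed": 100.0 * pe / n_exposed if n_exposed else nan,
    }
```

Calling `accuracy_report(list("bbbbee"), list("bbbeeb"))` counts three correct buried and one correct exposed residue out of six.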
In this paper we developed a system that predicts solvent accessibility relying solely on the amino acid sequence of the protein chain, without any additional information. The training dataset was selected and encoded using hydrophobicity values and then used to train the ANFIS system. The system consists of two units, each predicting one solvent accessibility state; the higher of the two ANFIS outputs is taken as the final output, which is the solvent accessibility of the amino acid located in the middle of the input window. The accuracy of the system reached approximately 71% (Q = 70.9%), which is a good accuracy for a method that uses the sequence alone. The following table compares the prediction accuracy obtained in this research with that of other systems that depend only on the amino acid sequence as input.
|Research||Regression tool||Description of features||Accuracy (%)|
|Ahmad et al.||Neural Network||Amino acid sequence||74.1|
|Yuan and Huang||Support Vector Machine||Amino acid sequence||74|
|Wang et al.||Multiple Linear Regression||Amino acid sequence||78|
|Suggested System||Adaptive Neuro Fuzzy Inference System||Amino acid sequence||70.9|
The authors would like to thank Damascus University for its support.