Modelling of Normal Boiling Points of Hydroxyl Compounds by Radial Basis Networks
Liangjie Jin, Peng Bai
School of Chemical Engineering and Technology, Tianjin University, Tianjin, PR China
To cite this article:
Liangjie Jin, Peng Bai. Modelling of Normal Boiling Points of Hydroxyl Compounds by Radial Basis Networks. Modern Chemistry. Vol. 4, No. 2, 2016, pp. 24-29. doi: 10.11648/j.mc.20160402.12
Received: February 18, 2016; Accepted: March 11, 2016; Published: May 4, 2016
Abstract: Radial basis networks (RBN) were applied to link molecular descriptors to the boiling points of 168 hydroxyl compounds. The total database was randomly divided into a training set (134), a validation set (17) and a testing set (17). Each compound in its lowest-energy conformation was numerically characterized with E-dragon software. Then 8 molecular descriptors were selected to develop the RBN model. Simulated with the final optimum RBN model [8-35(64)-1], the root mean square errors (RMSE) for the training, validation and testing sets were 5.55, 4.28 and 5.33, and the correlation coefficients were R=0.994 (training), 0.994 (validation) and 0.993 (testing). The final RBN model was compared with the multiple linear regression approach and gave more satisfactory results.
Keywords: Radial Basis Networks, Normal Boiling Point, Hydroxyl Compounds, QSPR Model
The normal boiling point (NBP) can be defined as the temperature at which a pure saturated liquid has a vapor pressure of 760 mm Hg. The NBP can be used to estimate many key physical and physicochemical properties, such as critical temperature, enthalpy of vaporization and vapor pressure [1-2]. Accurate knowledge of the NBP is therefore very important for the chemical industry. Direct measurement of the normal boiling point of organic compounds may be costly, laborious, and even dangerous to the researcher or the environment if the compound has hazardous properties. Therefore, it is essential to develop reliable methods for estimating the NBP of compounds.
The NBP of compounds is an indicator of the strength of the intermolecular forces which bind them together. The stronger the intermolecular forces, the more tightly packed the atoms and, therefore, the higher the NBP. The NBP is directly correlated with the chemical structure of a molecule. The classical structure-based approach to predicting the NBP is the group contribution method [3-4], in which each molecule is considered to be made of fundamental groups, each one giving a constant increment to the value of the NBP for a compound [5-6]. The method is applicable only to compounds for which all group contributions have been established. Another well-known solution is the quantitative structure-property relationship (QSPR) approach [7-9]. In this approach, a QSPR model is built by developing a correlation between the NBP and a variety of molecular features.
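The group contribution idea described above can be illustrated with a minimal Python sketch. The base value and the group increments below are hypothetical placeholders chosen only for illustration; they are not taken from this paper or from any published contribution table, and the function name `estimate_nbp` is likewise an assumption.

```python
# Hypothetical base value and group increments (in K). These numbers are
# illustrative placeholders only, not values from the paper or the literature.
BASE_K = 200.0
GROUP_INCREMENT_K = {"CH3": 23.0, "CH2": 23.0, "OH": 93.0}


def estimate_nbp(group_counts):
    """Add one constant increment per group occurrence to a base value."""
    missing = [g for g in group_counts if g not in GROUP_INCREMENT_K]
    if missing:
        # Mirrors the limitation noted in the text: the method only works
        # when every group contribution has been established.
        raise KeyError(f"no contribution established for groups: {missing}")
    return BASE_K + sum(GROUP_INCREMENT_K[g] * n for g, n in group_counts.items())


# Ethanol decomposed as CH3 + CH2 + OH.
ethanol = {"CH3": 1, "CH2": 1, "OH": 1}
nbp_ethanol = estimate_nbp(ethanol)
```

The `KeyError` branch shows why the approach fails for a compound containing a group without an established contribution, which is the gap that QSPR models aim to fill.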
In recent years, neural networks have become an important modeling technique in the field of QSPR. Their advantage lies in their inherent ability to incorporate nonlinear and cross-product terms into the model. Besides, they do not require the mathematical form of the function to be known in advance. Q. F. Li et al. used radial basis function neural networks to link the molecular structures with the boiling points of 106 compounds. The final 10-parameter model showed satisfactory prediction results. Simulated with the final model, the correlation coefficients for the training, validation and testing sets were 0.998, 0.998 and 0.991, respectively. Gharageizi et al. optimized a three-layer feed-forward artificial neural network (ANN) with 44 molecular descriptors to predict the NBP of a very large database. The final model gave R²=0.943, with an RMS error of 22°C for a training set of 14216 compounds, 21°C for a validation set of 1776 compounds and 21°C for a test set of 1776 compounds. The results indicated that the ANN would be a promising strategy to predict the NBP of pure chemicals.
In this paper, the QSPR method was applied to predict the boiling points of 168 hydroxyl compounds at standard pressure using radial basis networks. A large number of molecular descriptors were calculated from the chemical structure and were used to describe the structure of hydroxyl compounds. Some of these descriptors were selected and quantitatively related to the boiling points of 168 hydroxyl compounds by using radial basis networks. The results obtained were validated and tested.
2. Database and Mathematical Methods
The experimental data set of the normal boiling points of 168 compounds containing the group "-OH" was taken from the literature, namely a handbook of boiling points drawn from the primary chemical literature. From these 168 hydroxyl compounds, 134 compounds were randomly chosen as the training set to generate the structure of the neural networks, 17 as the validation set for the generated neural networks and 17 as the testing set.
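A random 134/17/17 partition of this kind can be sketched in a few lines of Python. The function name `split_dataset` and the fixed seed are assumptions made here for reproducibility; the paper does not state how the random selection was implemented.

```python
import random


def split_dataset(n_total=168, n_train=134, n_val=17, seed=0):
    """Randomly partition compound indices into training, validation and testing sets."""
    indices = list(range(n_total))
    rng = random.Random(seed)  # fixed seed is an assumption, used only for reproducibility
    rng.shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remaining compounds
    return train, val, test


train_idx, val_idx, test_idx = split_dataset()
```

Each compound index ends up in exactly one of the three sets, so the testing set stays untouched during model development.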
2.2. Determination of Molecular Descriptors
Molecular descriptors are numerical characteristics associated with the chemical structures of compounds. Optimized molecular structures are a prerequisite for the calculation of molecular descriptors. In this paper, the molecular structure of each compound was sketched with the drawing capabilities of Materials Studio. These chemical structures were first energy-minimized with the COMPASS molecular mechanics method and then subjected to the AM1 semi-empirical quantum chemical method for final geometry optimization. The optimized molecular structures were loaded into E-dragon software, which calculates molecular descriptors free of charge [14-15]. E-dragon software is capable of calculating 1666 descriptors from 22 diverse blocks, including Constitutional descriptors, Topological descriptors, Walk and path counts, Connectivity indices, Information indices, 2D-autocorrelation indices, Edge adjacency indices, Burden eigenvalue descriptors, Topological charge indices, Eigenvalue-based indices, Randic molecular profiles, Geometrical descriptors, RDF descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, Functional groups, Atom-centered fragments, Charge descriptors, and Molecular properties.
From the large number of calculated molecular descriptors, a pre-selection was performed to remove information-poor descriptors by a series of objective criteria. Descriptors matching any of the following criteria were eliminated: (1) descriptors that were not available for all structures; (2) descriptors whose values were constant for all structures; (3) descriptors whose one-parameter correlation with the boiling points had an R² value lower than 0.1. In addition, among collinear descriptors, whose pair-correlation coefficient was greater than 0.98, the one having the highest R² value with the boiling points was retained while the rest were discarded.
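The pre-selection rules above can be sketched with NumPy as follows. This is an illustrative reading, not the authors' implementation: the function name `preselect`, the use of the absolute value of the pair-correlation coefficient, the greedy ordering used to resolve collinear groups, and the synthetic demonstration data are all assumptions.

```python
import numpy as np


def preselect(X, y, r2_min=0.1, collinear_max=0.98):
    """Return the column indices of X that survive the pre-selection rules."""
    n_cols = X.shape[1]
    r2 = np.zeros(n_cols)
    keep = []
    for j in range(n_cols):
        col = X[:, j]
        if np.isnan(col).any():        # (1) not available for all structures
            continue
        if col.max() == col.min():     # (2) constant for all structures
            continue
        r2[j] = np.corrcoef(col, y)[0, 1] ** 2
        if r2[j] < r2_min:             # (3) one-parameter R^2 below 0.1
            continue
        keep.append(j)
    # Collinearity filter: visit descriptors from highest to lowest R^2 and
    # drop any whose |r| with an already retained descriptor exceeds the limit.
    retained = []
    for j in sorted(keep, key=lambda k: -r2[k]):
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= collinear_max
               for k in retained):
            retained.append(j)
    return sorted(retained)


# Synthetic demonstration data (not from the paper).
rng = np.random.default_rng(0)
y = np.linspace(300.0, 500.0, 50)
noise = rng.normal(size=50)
X = np.column_stack([
    y + 5.0 * noise,                            # informative descriptor
    2.0 * (y + 5.0 * noise) + 1.0,              # collinear with column 0
    np.ones(50),                                # constant descriptor
    np.where(np.arange(50) == 3, np.nan, y),    # missing for one structure
    rng.normal(size=50),                        # unrelated noise
])
selected = preselect(X, y)
```

In the demonstration, only one of the two collinear columns can survive, and the constant and incomplete columns are always removed.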
The next step was to generate an optimal subset of descriptors for the QSPR model. Sequential forward selection was used for descriptor selection. All the remaining descriptors were listed in decreasing order of their one-parameter R². Starting from the top descriptor, the other descriptors were added sequentially. At each step, the probability of the F-value was evaluated to decide whether a descriptor was entered or removed. If the probability of the F-value was below 0.05, the variable was entered, and if the probability of the F-value was above 0.10, the variable was removed. The process was repeated until the average absolute relative deviation (AARD) was less than a threshold value of 1%. The mathematical definition of AARD is presented as follows:
$$\mathrm{AARD} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{\mathrm{pred}_i-\mathrm{exp}_i}{\mathrm{exp}_i}\right|$$
where $\mathrm{pred}_i$ and $\mathrm{exp}_i$ stand for the value predicted by the model and the corresponding experimental value for compound $i$, respectively, and $N$ is the number of compounds.
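As a rough illustration of the selection loop, the sketch below computes AARD in its usual percentage form and adds descriptors in decreasing one-parameter R² order until the AARD of a fit drops below the threshold. It is a simplification under stated assumptions: the F-test entry and removal step is omitted, an ordinary least-squares fit stands in for whatever model the authors evaluated at each step, and the names `aard` and `forward_select` and the synthetic data are hypothetical.

```python
import numpy as np


def aard(pred, exp):
    """Average absolute relative deviation, in percent."""
    pred = np.asarray(pred, dtype=float)
    exp = np.asarray(exp, dtype=float)
    return 100.0 * float(np.mean(np.abs((pred - exp) / exp)))


def forward_select(X, y, threshold=1.0):
    """Add descriptors in decreasing one-parameter R^2 order until AARD < threshold.

    Simplified: the F-test entry/removal criterion is NOT implemented here.
    """
    r2 = [np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])]
    order = sorted(range(X.shape[1]), key=lambda j: -r2[j])
    selected = []
    for j in order:
        selected.append(j)
        # Least-squares fit with an intercept (an assumed stand-in model).
        A = np.column_stack([np.ones(len(y)), X[:, selected]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        if aard(A @ coef, y) < threshold:
            break
    return selected


# Synthetic demonstration data (not from the paper): y depends on columns 0 and 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 400.0 + 40.0 * X[:, 0] + 30.0 * X[:, 1]
selected = forward_select(X, y)
example_aard = aard([101.0, 198.0], [100.0, 200.0])
```

In the small `example_aard` case, both predictions deviate by 1% from their experimental values, so the average is 1%.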
2.3. RBN Model Development
The selected molecular descriptors were introduced to the radial basis network (RBN) for the final model development. The RBN model is a three-layer feed-forward network in which a hidden layer connects the input and output layers. The input layer consists of the selected descriptors and the values of the output layer are the target values. The hidden layer neurons have a Gaussian activation function, which determines the excitation level of each neuron according to how close the input data lie to the center of that neuron’s activation function. In this study, the RBN was designed with the artificial neural network toolbox in MATLAB. In the MATLAB subroutine, generating the RBN model involved determining the optimum number of hidden-layer neurons n and the Gaussian function parameter spread that predict the target with minimum error. The number of neurons n greatly influences the performance of radial basis neural networks. If the number is too low, the networks may not produce a proper estimate of the data. On the other hand, if too many hidden layer units are used, the networks tend to overfit the training data. The larger the spread, the smoother the function approximation will be. Too large a spread means that a lot of neurons will be required to fit a fast-changing function. Too small a spread means that many neurons will be required to fit a smooth function, and the networks may not generalize well. The generalized network configuration can be represented by [P-n(spread)-a2], where P is the input and a2 is the target. The optimal values of n and spread were obtained by minimizing the objective function. The root mean square error (RMSE) between the outputs of the RBN and the experimental data was set as the objective function. The mathematical definition of RMSE is presented as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{pred}_i-\mathrm{exp}_i\right)^2}$$
where $\mathrm{pred}_i$ and $\mathrm{exp}_i$ stand for the value predicted by the model and the corresponding experimental value in the literature for compound $i$, respectively, and $N$ is the number of compounds in the set. The problem of "over-fitting" may occur during neural network training. Over-fitting means that the networks have memorized the training examples but have not learned to generalize to new situations. In this situation, the error on the training set is driven to a very small value, but when new data are presented to the networks the error is large. To prevent over-fitting, the RMSE values of the training set and the validation set were monitored simultaneously during the training phase. If the RMSE value was still decreasing on the training set but began to increase on the validation set, the RBN model had begun to over-fit. The optimum values of n and spread were selected at the minimum RMSE on the validation set. Such a simple method was found sufficient for the considered problem. The test set was not used in the model development, but was applied to assess the predictive capability of the model.
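The selection of n and spread by minimum validation RMSE can be sketched in Python with NumPy. This is not the MATLAB routine used in the study: hidden-layer centers are drawn at random from the training points as a simplification, the output weights are solved by linear least squares, and the scaling 0.8326/spread assumes the convention that a neuron's output falls to 0.5 at a distance equal to spread. All function names and the synthetic one-dimensional data are hypothetical.

```python
import numpy as np


def rbf_design(X, centers, spread):
    """Gaussian hidden-layer outputs; 0.8326/spread gives 0.5 at distance == spread."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(0.8326 * d / spread) ** 2)


def fit_rbn(X_train, y_train, n, spread, seed=0):
    """Pick n random training points as centers and solve the linear output layer."""
    rng = np.random.default_rng(seed)
    centers = X_train[rng.choice(len(X_train), size=n, replace=False)]
    H = np.column_stack([rbf_design(X_train, centers, spread), np.ones(len(X_train))])
    w, *_ = np.linalg.lstsq(H, y_train, rcond=None)
    return centers, w


def predict_rbn(model, X, spread):
    centers, w = model
    H = np.column_stack([rbf_design(X, centers, spread), np.ones(len(X))])
    return H @ w


def rmse(pred, exp):
    """Root mean square error between predicted and experimental values."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(exp)) ** 2)))


def select_model(X_tr, y_tr, X_val, y_val, n_grid, spread_grid):
    """Grid search: keep the (n, spread) pair with the lowest validation RMSE."""
    best = None
    for n in n_grid:
        for spread in spread_grid:
            model = fit_rbn(X_tr, y_tr, n, spread)
            err = rmse(predict_rbn(model, X_val, spread), y_val)
            if best is None or err < best[0]:
                best = (err, n, spread)
    return best


# Synthetic demonstration data (not from the paper).
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
val_mask = np.arange(50) % 5 == 2            # every fifth point held out for validation
X_tr, y_tr = x[~val_mask].reshape(-1, 1), y[~val_mask]
X_val, y_val = x[val_mask].reshape(-1, 1), y[val_mask]
n_grid, spread_grid = (2, 5, 10), (0.1, 0.5, 2.0)
best_rmse, best_n, best_spread = select_model(X_tr, y_tr, X_val, y_val, n_grid, spread_grid)
```

Because the choice is driven by the validation set rather than the training set, a configuration that merely memorizes the training points is not favored. A separate test set would still be needed to assess predictive capability.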
3. Results and Discussions
The selected descriptors, together with their definitions and blocks, are shown in Table 1. The eight descriptors belong to 6 descriptor blocks. The Constitutional descriptor here represents information related to the electronic and topological state of the atoms in the molecule. The Topological descriptors depict the topological information of the molecule from different aspects. The GETAWAY descriptor (GEometry, Topology, and Atom-Weights AssemblY) matches the 3D molecular geometry provided by the molecular influence matrix, and the atom relatedness given by molecular topology, with chemical information by using atomic weightings [18-19]. WHIM descriptors are calculated in a straightforward manner from the (x, y, z)-coordinates of a molecule, with a weighting scheme related to the atomic charge distribution. The Molecular property descriptor here quantifies the hydrophilic properties of the molecules caused by the group "-OH". The 3D-MoRSE descriptor reflects the 3D molecular structure based on electron diffraction.
| Descriptor | Definition | Block |
|---|---|---|
| Ss | Sum of Kier-Hall electrotopological states | Constitutional |
| GNar | Narumi geometric topological index | Topological |
| PW4 | Path/walk 4 - Randic shape index | Topological |
| R1p | R autocorrelation of lag 1 / weighted by atomic polarizabilities | GETAWAY |
| E1s | 1st component accessibility directional WHIM index / weighted by atomic electrotopological states | WHIM |
| MAXDN | Maximal electrotopological negative variation | Topological |
| Hy | Hydrophilic factor | Molecular properties |
| Mor13u | 3D-MoRSE - signal 13 / unweighted | 3D-MoRSE |
The best RBN model was generated with n=35 hidden neurons and the Gaussian function parameter spread=64. The optimum network configuration can be represented by [8-35-1]. The RMSE values for the training, validation and testing sets are 5.55, 4.28 and 5.33, respectively, indicating the accuracy of the RBN model. Table 2 reports the experimental and predicted data for each compound, as well as the relative error (A%D). The relative errors in Table 2 show that a large portion of the investigated NBP values were successfully predicted: 90.5% of the investigated values were predicted with a relative error in the range 0-2%, and 7.7% in the range 2-3%. Only 3 compounds (1.8%) have a relative error greater than 3%. The results indicate that the RBN model has an acceptable predictive capability.
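The error-range breakdown reported above can be reproduced from any list of predicted and experimental values with a short NumPy routine. The function name `error_distribution`, the treatment of the range boundaries and the four demonstration values are assumptions; they are not data from Table 2.

```python
import numpy as np


def error_distribution(pred, exp, edges=(2.0, 3.0)):
    """Percentage of compounds whose relative error falls in each range."""
    pred = np.asarray(pred, dtype=float)
    exp = np.asarray(exp, dtype=float)
    rel = 100.0 * np.abs(pred - exp) / np.abs(exp)   # relative error in percent
    low, high = edges
    n = len(rel)
    return {
        "0-2%": 100.0 * float(np.sum(rel <= low)) / n,
        "2-3%": 100.0 * float(np.sum((rel > low) & (rel <= high))) / n,
        ">3%": 100.0 * float(np.sum(rel > high)) / n,
    }


# Hypothetical demonstration values (not from Table 2).
exp_demo = [400.0, 400.0, 400.0, 400.0]
pred_demo = [404.0, 410.0, 420.0, 400.0]   # relative errors: 1%, 2.5%, 5%, 0%
shares = error_distribution(pred_demo, exp_demo)
```

For the 168 compounds of this study, the reported shares of 90.5%, 7.7% and 1.8% are consistent with counts of 152, 13 and 3 compounds, respectively.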
The scatter plots of the predicted boiling points versus the experimental data, which provide a quick indication of the accuracy of the RBN model, are reported in Fig. 1. With the RBN model, the regression for the training set is NBP_Exp = (5.19±4.25) + (0.990±0.009) × NBP_Pre (R=0.994, n=134), for the validation set NBP_Exp = (1.49±2.79) + (0.995±0.029) × NBP_Pre (R=0.994, n=17), and for the testing set NBP_Exp = (-5.15±13.48) + (1.008±0.030) × NBP_Pre (R=0.993, n=17).
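A regression of experimental against predicted values of the form given above can be obtained with a straight-line least-squares fit. The sketch below returns only the intercept, slope and correlation coefficient; it does not compute the ± uncertainties, whose definition is not stated in the text. The function name `parity_fit` and the demonstration values are hypothetical.

```python
import numpy as np


def parity_fit(pred, exp):
    """Fit exp = intercept + slope * pred and return (intercept, slope, R)."""
    pred = np.asarray(pred, dtype=float)
    exp = np.asarray(exp, dtype=float)
    slope, intercept = np.polyfit(pred, exp, 1)   # degree-1 polynomial fit
    r = float(np.corrcoef(pred, exp)[0, 1])       # Pearson correlation coefficient
    return float(intercept), float(slope), r


# Hypothetical demonstration values lying exactly on a line (not from the paper).
pred_demo = np.array([300.0, 350.0, 400.0, 450.0])
exp_demo = 2.0 + 0.99 * pred_demo
intercept, slope, r = parity_fit(pred_demo, exp_demo)
```

A perfect model would give an intercept of 0 and a slope of 1, which is the yardstick used when comparing models by this kind of fit.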
A multiple linear regression approach was also employed to describe the relation between NBP and their molecular descriptors. By using these descriptors selected above, the best multiple linear regression function can be obtained as follows:
With the multiple linear regression model, the regression for the training set is NBP_Exp = (7.24±5.63) + (0.984±0.012) × NBP_Pre (R=0.990, n=134), and for the test set (combined with the validation set) it is NBP_Exp = (43.31±8.05) + (0.993±0.018) × NBP_Pre (R=0.993, n=34). Compared with the multiple linear regression method, the calculated slope and intercept of the RBN model are closer to the ideal values of 1 and 0, respectively. The RBN model therefore gives more satisfactory predictions.
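For comparison, a multiple linear regression on a descriptor matrix can be fitted by ordinary least squares, as in the hedged sketch below. The coefficients in the demonstration are invented for the example and are unrelated to the regression function obtained in this study; the names `fit_mlr` and `predict_mlr` are hypothetical.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept; coef[0] is the intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Evaluate the fitted linear function on a descriptor matrix."""
    return coef[0] + X @ coef[1:]


# Synthetic demonstration data with invented coefficients (not from the paper).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 2))
y_demo = 100.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
coef = fit_mlr(X_demo, y_demo)
pred_demo = predict_mlr(coef, X_demo)
```

Unlike the RBN, this model contains no nonlinear or cross-product terms, which is one plausible reason for the difference in fit quality discussed above.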
| No. | Compounds | Exp. NBP | Pred. NBP | A%D |
*: compounds in the validation set; **: compounds in the testing set; the other compounds belong to the training set.
In this paper, a radial basis network was developed to predict the normal boiling points of hydroxyl compounds. The structural features of all compounds in the dataset were numerically characterized by a large number of molecular descriptors. These descriptors were analyzed carefully, and eight important descriptors were finally retained through a series of selection steps. The results revealed that these 8 molecular descriptors can be used to construct the RBN model for the prediction of the boiling points. The three-layer radial basis network model can be represented by [8-35-1]. With the same 8 descriptors, a multiple linear regression approach was also applied and compared with the RBN model. The results showed that the RBN model could provide more accurate predicted boiling points. In summary, the results of the current study indicate that radial basis networks are a promising strategy for QSPR modeling.