Comparative Study of Backpropagation Algorithms in Forecasting Volatility of Crude Oil Price in Nigeria

This paper explores the application of artificial neural networks to volatility forecasting. A recurrent neural network has been integrated into a GARCH model to form a hybrid model called the GARCH-Neural model. The emphasis of the research is to investigate the performance of variants of the Backpropagation algorithm in training the proposed GARCH-Neural model. First, EGARCH (3, 3) was identified in this paper as the most preferred model describing crude oil price volatility in Nigeria. Similarly, the Levenberg-Marquardt (LM) training algorithm was found to be the fastest to converge and also to provide the most accurate predictions of the volatility when compared to the other training techniques.


Introduction
Crude oil is considered to be an important export commodity in Nigeria because of its contribution to the economy of the country. It was first discovered in 1956, with the first producing well coming on stream in 1958. Before that time, the country depended mainly on the export of agricultural commodities comprising groundnuts, cocoa beans, palm oil, cotton and rubber. Palm oil was the leading export from 1946 to 1958, followed by cocoa beans, while groundnut/groundnut oil ranked third. From a production level of 1.9 million barrels per day in 1958, crude oil exports rose to 2.35 million barrels per day in the early 2000s. However, production fluctuated between 2.35 and 2.40 million barrels per day between 2011 and 2015, which was far below the OPEC quota, owing to socio-political instability in the oil-producing areas of the country. In terms of its contribution to total revenue, receipts from oil, which constituted 26.3 per cent of the federally collected revenue in 1970, rose to 82.1 per cent in 1974 and 83.0 per cent in 2008, largely on account of a rise in crude oil prices on the international market.
Over the last two years, global oil prices have been dropping, and bearing in mind that Nigeria is an import-dependent economy, this development is worrisome. Our review of current oil exports also reveals a southward trend due to significant oil theft and lower global demand. Indeed, the NNPC (Nigerian National Petroleum Corporation) puts the total value of revenue lost to oil theft at $11bn in 2013.
More importantly, crude oil has for the last three decades been the major source of revenue, energy and foreign exchange for the Nigerian economy. In 2000, oil and gas accounted for about 98% of export earnings and about 83% of federal government revenue [1].
The term volatility has been given different definitions by different scholars across disciplines. In relation to the crude oil price, volatility is the variation in the worth of a variable, especially price [2]. Volatility is the measure of the tendency of the oil price to rise or fall sharply within a period of time, such as a day, a month or a year [3]. [4] defines volatility as the standard deviation in a given period. She notes that volatility has a negative and significant impact on economic growth instantly, while the impact of oil price changes is delayed until after a year. She concludes that it is volatility/change in crude oil prices, rather than the oil price level, that has a significant influence on economic growth. In a nutshell, volatility is a measurement of the fluctuations (i.e., rise and fall) of the price of a commodity, for example the oil price, over a period of time.
Artificial neural networks (ANNs) are known to have the capability to learn the complex approximate relationships between the inputs and the outputs of a system and are not restricted by the size and complexity of the system [5]. ANNs learn these approximate relationships on the basis of actual inputs and outputs; therefore, they are generally more accurate than relationships based on assumptions. ANNs have been used for volatility forecasting in several papers, for example [6][7][8][9][10]. Despite their popularity in applications to financial variables, ANNs have not been well utilized in the Nigerian financial market. Similarly, multi-layer feedforward artificial neural networks trained with Backpropagation algorithms have been used in several studies, for example [11][12][13][14]. Since the Backpropagation algorithm has been successfully applied to train neural networks, this work aims to investigate the training performance of some variants of the Backpropagation algorithm in training the proposed model for forecasting volatility.

Volatility Models
Let P_t be the stock price at time t. Then r_t = 100(ln P_t − ln P_{t−1}) denotes the continuously compounded daily return of the underlying asset at time t.
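As a minimal sketch (Python with NumPy assumed; not part of the paper), the return series can be computed from a price series as:

```python
import numpy as np

def log_returns(prices):
    """Continuously compounded returns: r_t = 100 * (ln P_t - ln P_{t-1})."""
    p = np.asarray(prices, dtype=float)
    return 100.0 * np.diff(np.log(p))

# Example with a short hypothetical price series
r = log_returns([100.0, 102.0, 101.0])
```

The series of returns is one element shorter than the price series, since each return needs the previous price.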
The most widely used model for estimating volatility is the ARCH (Autoregressive Conditional Heteroscedasticity) model developed by [15], as also contained in [33]. Since the development of the original ARCH model, a lot of research has been carried out on extensions of this model, among which is the GARCH model [16]. The GARCH (1, 1) model is defined as

r_t = μ + ε_t,  ε_t = σ_t z_t,  σ²_t = ω + α ε²_{t−1} + β σ²_{t−1},

where ω, α and β are non-negative parameters to be estimated, z_t is a sequence of independently and identically distributed (i.i.d.) random variables with zero mean and unit variance, and ε_t is a serially uncorrelated sequence with zero mean whose conditional variance σ²_t may be nonstationary. The GARCH model reduces the number of parameters necessary when information in the lag(s) of the conditional variance is considered in addition to the lagged squared-error terms, but it is not able to account for asymmetric behavior of the returns.
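For concreteness, a sketch of the GARCH(1,1) conditional-variance recursion (Python/NumPy assumed; the parameter values are illustrative, not estimates from this paper):

```python
import numpy as np

def garch11_variance(eps, omega, alpha, beta):
    """sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}.
    The recursion starts at the unconditional variance omega / (1 - alpha - beta),
    which requires alpha + beta < 1 (covariance stationarity)."""
    eps = np.asarray(eps, dtype=float)
    sigma2 = np.empty(len(eps))
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(1, len(eps)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

s2 = garch11_variance([1.0, 2.0, 0.0], omega=0.1, alpha=0.1, beta=0.8)
```

Note that a large shock (eps = 2.0) raises the next period's variance regardless of its sign, which is exactly the symmetry the asymmetric extensions below are designed to relax.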
Because of this weakness of the GARCH model, a number of extensions of the GARCH (p, q) model have been developed to explicitly account for skewness or asymmetry. Popular asymmetric volatility models include the exponential GARCH (EGARCH) model, the GJR-GARCH model [17], and the asymmetric power ARCH (APARCH) model.
The GJR-GARCH (p, q) model was introduced by [9] to allow for asymmetric effects. The model is given as:

σ²_t = ω + Σ_{i=1}^{q} (α_i + γ_i I_{t−i}) ε²_{t−i} + Σ_{j=1}^{p} β_j σ²_{t−j},  where I_{t−i} = 1 if ε_{t−i} < 0 and 0 otherwise.

In the GJR-GARCH model, good news, ε_{t−i} > 0, and bad news, ε_{t−i} < 0, have differential effects on the conditional variance; good news has an impact of α_i, while bad news has an impact of α_i + γ_i. If γ_i > 0, bad news increases volatility, and there is a leverage effect for the i-th order; if γ_i ≠ 0, the news impact is asymmetric [17]. The exponential GARCH (EGARCH) model advanced by [18] is the earliest extension of the GARCH model that incorporates asymmetric effects in returns from speculative prices. The EGARCH model is defined as follows:

ln(σ²_t) = ω + Σ_{j=1}^{p} β_j ln(σ²_{t−j}) + Σ_{i=1}^{q} [ α_i |ε_{t−i}/σ_{t−i}| + γ_i (ε_{t−i}/σ_{t−i}) ],

where ω, α_i, β_j and γ_i are constant parameters. The EGARCH (p, q) model, unlike the GARCH (p, q) model, specifies the conditional variance as an exponential function, thereby removing the need for restrictions on the parameters to ensure a positive conditional variance. The asymmetric effect of past shocks is captured by the γ coefficient, which is usually negative; that is, ceteris paribus, positive shocks generate less volatility than negative shocks [19]. The leverage effect can be tested if γ < 0; if γ ≠ 0, the news impact is asymmetric.
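The EGARCH asymmetry can be illustrated with a small sketch (Python/NumPy assumed; an EGARCH(1,1) written in terms of the standardized shocks z_t = ε_t/σ_t, with illustrative parameter values):

```python
import numpy as np

def egarch11_variance(z, omega, alpha, gamma, beta):
    """ln(sigma2_t) = omega + beta * ln(sigma2_{t-1})
                      + alpha * (|z_{t-1}| - E|z|) + gamma * z_{t-1},
    where E|z| = sqrt(2/pi) for standard normal shocks. No positivity
    restrictions are needed because the variance is exp(.) of this recursion."""
    e_abs = np.sqrt(2.0 / np.pi)
    logs2 = np.empty(len(z))
    logs2[0] = omega / (1.0 - beta)          # unconditional log-variance
    for t in range(1, len(z)):
        logs2[t] = (omega + beta * logs2[t - 1]
                    + alpha * (abs(z[t - 1]) - e_abs) + gamma * z[t - 1])
    return np.exp(logs2)

# With gamma < 0, a negative shock raises next-period variance more than
# a positive shock of the same size (the leverage effect).
v_neg = egarch11_variance([-1.0, 0.0], omega=0.0, alpha=0.1, gamma=-0.1, beta=0.9)
v_pos = egarch11_variance([+1.0, 0.0], omega=0.0, alpha=0.1, gamma=-0.1, beta=0.9)
```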
The asymmetric power ARCH (APARCH) model of [20] also allows for asymmetric effects of shocks on the conditional volatility. Unlike other GARCH models, in the APARCH model the power parameter δ of the standard deviation is estimated rather than imposed, and optional γ parameters are added to capture asymmetry of up to order r. The APARCH (p, q) model is given as:

σ^δ_t = ω + Σ_{i=1}^{q} α_i (|ε_{t−i}| − γ_i ε_{t−i})^δ + Σ_{j=1}^{p} β_j σ^δ_{t−j},

where δ > 0, |γ_i| ≤ 1 for i = 1, …, r, γ_i = 0 for all i > r, and r ≤ p. If γ ≠ 0, the news impact is asymmetric. The introduction and estimation of the power term in the APARCH model is an attempt to account for the true distribution underlying volatility [33]. The idea behind the introduction of a power term arose from the fact that the assumption of normality in modeling financial data, which restricts δ to either 1 or 2, is often unrealistic due to significant skewness and kurtosis [19]. Allowing δ to take the form of a free parameter to be estimated removes this arbitrary restriction.

Recurrent Neural Networks
Neural networks can be classified into static and dynamic categories [21]. Static networks have no feedback elements and contain no delays; the output is calculated directly from the input through feed forward connections. In dynamic networks, the output depends not only on the current input to the network, but also on the current or previous inputs, outputs, or states of the network. These dynamic networks may be recurrent networks with feedback connections or feed forward networks with embedded tapped delay lines (or a hybrid of the two). For static networks, the standard back propagation algorithm [22] can be used to compute the gradient of the error function with respect to the network weights, which is needed for gradient-based training algorithms. For dynamic networks, a more complex dynamic gradient calculation must be performed. Although they can be trained using the same gradient-based algorithms that are used for static networks, the performance of the algorithms on dynamic networks can be quite different, and the gradient must be computed in a more complex way [23]. Dynamic networks are generally more powerful than static networks (although somewhat more difficult to train). Because dynamic networks have memory, they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as prediction in financial markets [24], phase detection in power systems [25] and many more dynamic network applications in [26]. The Non-linear Auto-Regressive with Exogenous inputs (NARX) network is a recurrent dynamic network, with feedback connections enclosing several layers of the network. The NARX model is based on the linear ARX model, which is commonly used in time-series modeling.
The defining equation for the NARX model is as follows:

y(t) = f( y(t−1), y(t−2), …, y(t−n_y), u(t−1), u(t−2), …, u(t−n_u) ),   (6)

where the next value of the dependent output signal y(t) is regressed on previous values of the output signal and previous values of an independent (exogenous) input signal u(t).
A diagram showing the implementation of the NARX model using a feed forward neural network to approximate the function f in (6) is given in figure 1. Two different architectures have been proposed to train a NARX network [27]. The first is the parallel architecture, shown in figure 2, where the output of the neural network is fed back to the input of the feed forward neural network as part of the standard NARX architecture. In contrast, in the series-parallel architecture shown in figure 3, the true output of the volatility (not the output of the identifier) is fed to the neural network model, as it is available during training. This has two advantages [27]. The first is the more accurate value presented as input to the neural network. The second is the absence of a feedback loop in the network, which enables the use of static back propagation for training instead of the computationally expensive dynamic back propagation required for the parallel architecture. Also, assuming the output error tends asymptotically to a small value, so that the network output approximates the true output, the series-parallel model may be replaced by a parallel model without serious consequences.
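The series-parallel setup amounts to building an open-loop training set in which the true lagged outputs, rather than fed-back predictions, appear as inputs. A sketch (Python/NumPy assumed; lag orders and names are illustrative):

```python
import numpy as np

def narx_design(y, u, ny=2, nu=2):
    """Series-parallel NARX training set: each row contains the true lagged
    outputs y(t-1), ..., y(t-ny) and exogenous inputs u(t-1), ..., u(t-nu);
    the target is y(t). Because no network output is fed back, the mapping
    can be trained with static back propagation."""
    y = np.asarray(y, dtype=float)
    u = np.asarray(u, dtype=float)
    start = max(ny, nu)
    X, T = [], []
    for t in range(start, len(y)):
        row = [y[t - i] for i in range(1, ny + 1)]
        row += [u[t - j] for j in range(1, nu + 1)]
        X.append(row)
        T.append(y[t])
    return np.array(X), np.array(T)

X, T = narx_design([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

Any static feed forward network can then be fitted to (X, T); closing the loop afterwards recovers the parallel architecture for multi-step forecasting.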

GARCH-Neural Model
In this research, we propose a hybrid model for forecasting the volatility of the crude oil price in Nigeria. Initially, a preferred GARCH model is identified, upon which the hybrid model is built. For this reason, the optimum lags for the GARCH model are estimated using the AIC and BIC indices.
The underlying concept of the hybrid model is that there are some explanatory factors other than historical prices that affect the future volatility of the crude oil price in the market. The model is therefore estimated with a number of variables that include squared prices, and returns and their squares, in addition to the historical prices of the crude oil. Since the volatility estimated from the preferred GARCH model is available during training, it is used as the target output of the network. Figures 4 and 5 show the flowchart and schematic diagram of the system process, respectively.
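A sketch of the input/target construction described above (Python/NumPy assumed; the alignment convention and variable names are assumptions for illustration, not the paper's exact implementation):

```python
import numpy as np

def hybrid_features(prices, garch_vol):
    """Hybrid-model inputs: price, squared price, return and squared return;
    the target is the volatility estimated by the preferred GARCH model,
    which is available during training."""
    p = np.asarray(prices, dtype=float)
    r = 100.0 * np.diff(np.log(p))                 # returns align with p[1:]
    X = np.column_stack([p[1:], p[1:] ** 2, r, r ** 2])
    T = np.asarray(garch_vol, dtype=float)[1:]     # GARCH volatility as target
    return X, T

X, T = hybrid_features([100.0, 102.0, 101.0], [0.5, 0.6, 0.7])
```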

Training of the GARCH-Neural Model
Neural networks are fitted to the data by learning algorithms during a training process. In this research, supervised learning algorithms were employed. These learning algorithms are characterized by the use of a given output that is compared to the predicted output and by the adaptation of all parameters according to this comparison. The parameters of a neural network are its weights. All weights are usually initialized with random values drawn from a standard normal distribution. During an iterative training process, the following steps are repeated: 1) the neural network calculates an output ŷ_i for given inputs x_i and current weights; if the training process is not yet completed, the predicted output ŷ_i will differ from the observed output y_i; 2) the error for each input is computed and the weights are adjusted so as to reduce the performance index below.
The performance index computed over the whole training set is

V = (1/2) Σ_{i=1}^{N} e_i²,   (7)

where N is the size of the training dataset, y_i and ŷ_i are the target and predicted values of the output of the neural network when the i-th input is presented, and e_i = y_i − ŷ_i is the error for the i-th input. The performance index V in (7) is a function of the weights and biases, w = [w_1 w_2 … w_n]ᵀ, and can be written as V(w). The performance of the neural network can be improved by modifying w until the desired level of the performance index V(w) is achieved. This is achieved by minimizing V(w) with respect to w, and the gradient required for this is given by

∇V(w) = Jᵀ(w) e(w),   (8)

where J(w) is the Jacobian matrix of the errors with respect to the weights and biases and e(w) is the vector of errors for all the inputs. The gradient in (8) is determined using back propagation, which involves performing computations backward through the network. The process stops if a pre-specified criterion is fulfilled, i.e. if the values of the gradient are smaller than a given threshold. This gradient is then used by different algorithms to update the weights of the network. These algorithms differ in the way they use the gradient to update the weights of the network and are known as the variants of the back propagation algorithm.
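As a numerical illustration of (7) and (8) (Python/NumPy assumed), take a toy model that is linear in its weights, ŷ = Xw, so the Jacobian of the error e = t − Xw is simply J = −X; this is a sketch, not the paper's network:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # three inputs, two weights
t = np.array([1.0, 2.0, 3.0])                        # target outputs
w = np.zeros(2)                                      # current weights

e = t - X @ w                 # error for every input
V = 0.5 * np.sum(e ** 2)      # performance index (7)
J = -X                        # Jacobian d e_i / d w_j of the linear model
grad = J.T @ e                # gradient (8): J^T(w) e(w)
```

For a real multi-layer network J is assembled by the backward pass, but the update rules below use only V, the gradient, and (for LM) J itself.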
Gradient descent and its variants are discussed below.

Gradient Descent algorithm (GD): the network weights and biases w are modified in the direction that reduces the performance function in (7) most rapidly, i.e. the negative of the gradient of the performance function [28]. The updated weights and biases in this algorithm are given by

w_{k+1} = w_k − α_k ∇V_k,

where w_k is the vector of the current weights and biases, ∇V_k is the current gradient of the performance function and α_k is the learning rate.

Scaled Conjugate Gradient Descent algorithm (SCGD): the gradient descent algorithm updates the weights and biases along the steepest descent direction but is usually associated with a poor convergence rate compared to the Conjugate Gradient Descent algorithms, which generally converge faster [29]. In the Conjugate Gradient Descent algorithms, a search is made along the conjugate gradient direction to determine the step size that minimizes the performance function along that line. This time-consuming line search is required during every iteration of the weight update. However, the Scaled Conjugate Gradient Descent algorithm does not require the computationally expensive line search while retaining the advantage of the Conjugate Gradient Descent algorithms [29]. The step size in the conjugate direction is in this case determined using the Levenberg-Marquardt approach. The algorithm starts in the direction of the steepest descent, given by the negative of the gradient,

p_0 = −∇V_0.   (12)

The updated weights and biases are then given by

w_{k+1} = w_k + α_k p_k,

where α_k is the step size determined by the Levenberg-Marquardt algorithm [30]. The next search direction, conjugate to the previous search directions, is determined by combining the new steepest descent direction with the previous search direction,

p_k = −∇V_k + β_k p_{k−1},

where the value of β_k is given in [29].
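The steepest-descent update can be sketched on a toy linear model (Python/NumPy assumed; the learning rate and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(X, t, lr=0.1, iters=500):
    """w_{k+1} = w_k - lr * grad V(w_k) for the toy model yhat = X @ w,
    where grad V = -X^T (t - X @ w) for V = 0.5 * sum of squared errors."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        e = t - X @ w
        w = w - lr * (-X.T @ e)
    return w

w = gradient_descent(np.eye(2), np.array([2.0, 3.0]))
```

Even on this trivially well-conditioned problem, hundreds of small steps are needed, which previews why GD converges slowly in the computational results below.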
Levenberg-Marquardt algorithm (LM): since the performance index in (7) is a sum of squares of nonlinear functions, numerical optimization techniques for nonlinear least squares can be used to minimize this cost function. The Levenberg-Marquardt algorithm, an approximation to Newton's method, is said to be more efficient than other methods for convergence of the Backpropagation algorithm when training a moderate-sized feedforward neural network [30]. As the cost function is a sum of squares of nonlinear functions, the Hessian matrix required for updating the weights and biases need not be calculated exactly and can be approximated as

H = Jᵀ(w) J(w).

The updated weights and biases are given by

w_{k+1} = w_k − [Jᵀ(w) J(w) + μI]⁻¹ Jᵀ(w) e(w),

where μ is a scalar and I is the identity matrix.

Automated Bayesian Regularization (BR): regularization, as a means of improving network generalization, is used within the Levenberg-Marquardt algorithm. Regularization involves a modification of the performance function, which is normally the sum of the squares of the errors and is modified to include a term consisting of the sum of squares of the network weights and biases. The modified performance function is given by

F = β·SSE + α·SSW,   (19)

where SSE = Σ_{i=1}^{N} e_i² and SSW = Σ_{j=1}^{n} w_j², with n the total number of weights and biases w_j in the network. The performance index in (19) forces the weights and biases to be small, which produces a smoother network response and avoids overfitting. The values of α and β are determined in an automated manner using Bayesian regularization [31], [32].
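A sketch of one damped LM update (Python/NumPy assumed; again on a toy linear model where J = −X, so a single step with small μ lands essentially at the least-squares solution):

```python
import numpy as np

def lm_step(X, t, w, mu=1e-3):
    """w_new = w - (J^T J + mu * I)^(-1) J^T e, with J the Jacobian of the
    error e = t - X @ w; for this linear toy model J = -X."""
    e = t - X @ w
    J = -X
    H = J.T @ J + mu * np.eye(len(w))   # damped approximation to the Hessian
    return w - np.linalg.solve(H, J.T @ e)

w1 = lm_step(np.eye(2), np.array([2.0, 3.0]), np.zeros(2))
```

Contrast this with the gradient-descent sketch above: the curvature information in JᵀJ lets LM solve the same problem in one step, which is the mechanism behind its fast convergence in the results below.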

Computational Results
The data for this research were obtained from the Central Bank of Nigeria website www.cbn.gov.ng and have also been used in [33]. They cover the monthly price of crude oil from January 1982 to February 2016. Figure 6 presents the plot of both the crude oil price and its returns. Engle's ARCH test also examines autocorrelation of the squared returns. A glance at table 1 shows that the mean of the return series of the Nigerian crude oil price in the period under investigation is -0.030761 and its standard deviation is 8.79191; comparing the two, it is clear that this time series experienced a high level of volatility during the period. The Jarque-Bera test indicates a non-normal distribution of this time series. Besides, the kurtosis statistic also indicates that the distribution of the series is fat-tailed. From the Ljung-Box statistics (with twenty lags), the null hypothesis of no serial correlation between the terms of the time series is rejected [33]. Having confirmed from table 1 that the return series of the crude oil price has ARCH effects, it is possible to estimate its conditional heteroscedasticity using GARCH models. Various combinations of (p, q) parameters ranging from (1, 1) to (3, 3) of the GARCH models were calibrated on historical return data, and the best-fitted ones were chosen from each group based on certain performance measures, as shown in table 2 [33]. By comparing the information criteria of the different types of GARCH models, it can easily be seen that the EGARCH (3, 3) model has the lowest Akaike and Schwarz information criteria; it is therefore the best model for explaining the behavioral pattern of volatility in the crude oil price and is chosen for the construction of the hybrid model.
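The model selection above rests on the Akaike and Schwarz criteria; a sketch of their computation from a fitted model's log-likelihood (Python assumed; the log-likelihood values below are hypothetical, not the paper's estimates):

```python
import math

def aic_bic(loglik, k, n):
    """AIC = -2 ln L + 2k; BIC (Schwarz) = -2 ln L + k ln n,
    with k estimated parameters and n observations; lower is preferred."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    return aic, bic

# Hypothetical comparison of two candidate models on a return series
aic_a, bic_a = aic_bic(loglik=-1450.0, k=4, n=409)   # e.g. a GARCH(1,1)
aic_b, bic_b = aic_bic(loglik=-1438.0, k=10, n=409)  # e.g. an EGARCH(3,3)
```

Both criteria penalize extra parameters, BIC more heavily for large n, so a richer model such as EGARCH (3, 3) is preferred only when its likelihood gain outweighs the penalty.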
The neural network proposed in section 4 was trained using the training algorithms described in section 5. The dataset analyzed earlier in this section was used as the training set. Table 2 summarizes the results of training the proposed network using the four training algorithms discussed above. Each entry in the table represents 100 different trials, with random weights taken for each trial to rule out the weight sensitivity of the performance of the different training algorithms. The network was trained in each case until the value of the performance index in (7) was 0.0001 or less. The Gradient Descent algorithm was generally very slow to converge to the required value of the performance index. The average time required to train the network using the Levenberg-Marquardt algorithm was the smallest, whereas the Scaled Conjugate Gradient Descent algorithm took the longest. The training algorithm employing Bayesian Regularization continuously modifies its performance function and hence takes more time than the Levenberg-Marquardt algorithm, but this time is still far less than that of the Scaled Conjugate Gradient Descent method. From Table 2, it can be seen that the Levenberg-Marquardt algorithm is the fastest of all the training algorithms considered in this work for training a neural network to forecast the volatility. Since the training times required by the different training algorithms have been compared, the conclusion drawn from the results for offline training may also be extended to online training. Therefore, it can be assumed that a similar trend in the training times required by the different training algorithms will be exhibited during online training of the proposed model for continuous updating of the offline-trained model. The neural networks trained using the training algorithms listed in table 2 were then tested. The datasets used for testing the networks were points not included in the training set.
Twenty sets of data were considered. As the Gradient Descent algorithm was too slow to converge to the desired value of the performance index, the neural networks trained using the remaining three training algorithms listed in table 2 were tested using these datasets. The results are shown in figures 7, 8 and 9 for Scaled Conjugate Gradient, Levenberg-Marquardt and Bayesian Regularization, respectively. The average absolute error is the least for the neural network trained using the LM method.

Conclusion
A GARCH-Neural model has been proposed to forecast the volatility of the crude oil price in Nigeria. A GARCH model was first identified, upon which the hybrid model was built; EGARCH (3, 3) was identified as the preferred model for forecasting the volatility of the crude oil price in Nigeria. Variants of the Backpropagation algorithm were used to train the proposed model. Investigations into the training performance of the different algorithms show that the Levenberg-Marquardt (LM) algorithm is the fastest to converge. It was also established that the LM algorithm gives the most accurate predictions in comparison to the target values of the volatility estimated earlier by the preferred GARCH model.