Time Series Modeling and Forecasting of Somaliland Consumer Price Index: A Comparison of ARIMA and Regression with ARIMA Errors

In recent years, Consumer Price Index (CPI) prediction has attracted the attention of many researchers because the CPI is a widely used measure of macroeconomic performance. It is an important index for measuring the rate of inflation or deflation in the prices of commodities. In this paper, the Autoregressive Integrated Moving Average (ARIMA) model and regression with ARIMA errors, where the covariate is time, were compared for forecasting the Somaliland Consumer Price Index using monthly time series data from January 2013 to April 2020. Both models were fitted and used to produce forecasts. The Akaike Information Criterion (AIC), Corrected Akaike Information Criterion (AICc), Bayesian Information Criterion (BIC) and other accuracy measures were used to assess each model's predictive ability. The results show that ARIMA (0, 1, 3) is the most suitable model for predicting CPI in Somaliland. Furthermore, the diagnostic tests show that the model is reliable and appropriate for forecasting Somaliland CPI data. The results indicate that CPI in Somaliland is likely to continue on an upward trend in the coming year. The study recommends that policymakers adopt tight monetary and fiscal policy measures to address Somaliland's inflation.


Introduction
The consumer price index (CPI) is the most common economic indicator and measures the changes in prices of a group of goods over time. It therefore measures shifts in the purchasing power of money [1,2]. Costa defined CPI as a weighted aggregate index which is calculated and published monthly [3]. In Somaliland, the Ministry of National Planning and Development (MoNPD) compiles and publishes the CPI every month in direct collaboration with the Central Bank. Since 2013, the CPI has been the most often used measure of inflation in Somaliland. The Somaliland Annual Headline Inflation is estimated at 6.3 percent for the year ending April 2020, compared to the 5.6 percent reported for the year ending March 2020 [4]. In particular, there was a consistent monthly increase in prices of items after November 2016. The CPI was highest in April 2020. This indicates higher inflation over the last three years in Somaliland.
The CPI is one of the most important variables for analyzing macroeconomic data in Somaliland. Because inflation is measured directly from CPI data, the main objective of the Somaliland monetary authorities, such as the Central Bank, is to fight inflation and maintain stable prices. The negative effects of inflation are well understood: it can contribute to a decline in the national currency's purchasing power, leading to deteriorating socioeconomic conditions and living standards [5]. However, recognizing the factors that will determine its development in the near future is a major challenge for policymakers and for domestic and foreign investors [6]. Such information would allow the Central Bank to predict future macroeconomic development and respond appropriately to economic shocks [6].
CPI data are time series data because they are ordered in time. The advantage of this data type is that it can be forecasted. Time series forecasting is prediction that uses the pattern of relationships between the variables over time [7,8]. The CPI time series has an internal dynamic system that regulates itself, such that the fluctuation of the series follows a particular order [9]. The Autoregressive Integrated Moving Average (ARIMA) model is one of the methods most widely used for forecasting from historical data [10,11,12]. Box and Jenkins developed this model, and it is known as the Box-Jenkins time series method [13]. They strongly suggest that ARIMA process forecasts be made using the difference equation method, as this is the simplest approach.
A considerable number of studies have analyzed CPI data. For instance, Subhani and Panjawani studied the monthly response of government bonds to CPI announcements [2]. Adam, Awujola and Alumgudu used the ARIMA model to study CPI in Nigeria [14]. Hamid and Dhakar analyzed seasonality in the monthly US CPI data from January 1913 to December 2003 [15]. Similarly, Zhang, Che, Xu and Xu used an ARMA model to analyze, model and forecast CPI time series data in China from January 1995 to May 2008 [16]. Kharimah, Usman, Widiarti and Elfaki found that ARIMA (1, 1, 0), compared with ARIMA (0, 1, 1) and ARIMA (1, 1, 1), was the best and most accurate model for forecasting CPI in Bandar Lampung, Indonesia [17]. In this study, the univariate Box-Jenkins technique and regression with ARIMA errors are used to analyze and predict future values of Somaliland monthly CPI data.

Data Description
The data used in this study are secondary data obtained from the website of the Central Statistics Department (CSD) of the MoNPD (www.somalilandcsd.org). Monthly CPI data from January 2013 to April 2020 were used to forecast Somaliland CPI using R software (version 3.6), in particular the forecast package. Summary statistics were computed, and the distribution of the series was examined using the skewness and kurtosis coefficients to check for typical stylized features of the data.

Box-Jenkins (ARIMA) Models
The Box-Jenkins (ARIMA) model is the most general class of time series prediction models in theory and was first popularized by Box and Jenkins [13].
The ARIMA($p$, $d$, $q$) model ignores independent variables and assumes that the prior values of the series plus previous error terms provide the information needed for forecasting. The integers $p$, $d$ and $q$ refer to the Autoregressive (AR), Integrated (I) and Moving Average (MA) parts of the model, respectively. In certain cases, the models are applied to data showing signs of non-stationarity, which can be made stationary by transformations such as detrending and taking logarithms. The model takes historical data into account and breaks it down into an AR process, where there is a memory of past events; an integrated process that accounts for stationarity, making the series easier to predict; and an MA of forecast errors, so that the longer the historical data, the more reliable the predictions will be. The ARIMA models only apply to a stationary data series, where the mean, variance and autocorrelation function remain constant over time.
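An AR process expresses the response variable as a function of its own past values. A $p$th-order AR process is given by:

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$

where $y_t$ is the stationary response variable being forecasted at time $t$, $\phi_1, \phi_2, \ldots, \phi_p$ are the parameters to be estimated and $\varepsilon_t$ is the error term at time $t$ with mean zero and a constant variance. Using the backshift operator $B$, where $B y_t = y_{t-1}$, we can define the AR($p$) process as:

$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)\, y_t = \varepsilon_t$$

The MA process of order $q$, MA($q$), can be written in the form:

$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$

where $q$ is the number of lags in the moving average and $\theta_1, \theta_2, \ldots, \theta_q$ are the parameters to be estimated. If the $d$th difference of $y_t$ is an ARMA($p$, $q$) process, then $y_t$ is said to be ARIMA($p$, $d$, $q$). This is generally written as:

$$(1 - \phi_1 B - \cdots - \phi_p B^p)(1 - B)^d\, y_t = (1 + \theta_1 B + \cdots + \theta_q B^q)\, \varepsilon_t$$

A first-differenced CPI series is of the form:

$$\Delta CPI_t = CPI_t - CPI_{t-1}$$

Therefore the ARIMA($p$, 1, $q$) model may be stated as:

$$\Delta CPI_t = \beta + \sum_{i=1}^{p} \phi_i\, \Delta CPI_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t$$

where $\Delta CPI_t$ is the differenced CPI series of first order, and $\phi$, $\beta$ and $\theta$ are the coefficients to be estimated. Before the model is applied to a time series, the series has to be stationary. In case of non-stationarity, successive differences are taken until the sequence is stationary. In practice, the order of differencing is rarely greater than two. The Box-Jenkins time series approach seeks to identify the most suitable ARIMA($p$, $d$, $q$) model and use it for forecasting purposes. It utilizes a six-phase iterative scheme: a priori determination of the order of differencing, $d$ (or selection of an appropriate transformation); a priori determination of the orders of the AR and MA processes, $p$ and $q$; model identification; estimation of the model parameters ($\phi$, $\beta$ and $\theta$); and the remaining phases, model validation and forecasting, which are described in the sections that follow.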
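As a worked illustration of the notation, the ARIMA (0, 1, 3) specification with drift that is selected later in the paper can be expanded as follows. This is a sketch under two assumptions made here for illustration: the MA terms enter with plus signs (the convention used by R's forecast package), and $\beta$ denotes the drift constant.

```latex
(1 - B)\, CPI_t = \beta + (1 + \theta_1 B + \theta_2 B^2 + \theta_3 B^3)\, \varepsilon_t
\quad\Longrightarrow\quad
CPI_t = CPI_{t-1} + \beta + \varepsilon_t + \theta_1 \varepsilon_{t-1}
        + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3}
```

Read this way, each month's CPI equals the previous month's value plus a constant increment and a weighted combination of the current and three most recent shocks.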

Testing Autocorrelation
One of the assumptions of time series regression is that the errors are independent. Error terms correlated over time are said to be serially correlated or autocorrelated. Serial correlation of the disturbances with Ordinary Least Squares (OLS) estimation can have the following effects: the estimated regression coefficients are still unbiased but no longer have the minimum variance property (they are inefficient); the OLS estimate $s^2$ (MSE) could underestimate the true error variance; the true standard error of the estimate could be underestimated by $se\{b_k\}$; and statistical inferences using $t$ and $F$ tests are no longer valid.
So, it is important to test for the existence of serial correlation. In general, there are two ways of detecting autocorrelation. The first is the informal way, which is done through graphs (plotting residuals against time) and is therefore called the graphical method. The second is through formal tests for autocorrelation, such as the Durbin-Watson test. The Durbin-Watson test is used to test the hypothesis $H_0: \rho = 0$ against $H_1: \rho > 0$. The Durbin-Watson statistic is:

$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

where $e_t$ is the residual at time $t$ and $n$ is the number of observations. The value of the Durbin-Watson test statistic is compared with the relevant critical value. If the test statistic is less than the lower critical value, we reject the null hypothesis and conclude that there is autocorrelation, i.e. if $D < d_L$, reject $H_0: \rho = 0$ and accept $H_1: \rho > 0$.
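The Durbin-Watson statistic described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R routine used in the study, and the residual values are made up for the example. Values near 2 suggest no first-order autocorrelation, values near 0 suggest positive autocorrelation, and values near 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Hypothetical residuals (not the CPI regression residuals).
trending = [1.0, 2.0, 3.0, 4.0]         # neighbours are similar: positive autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0]    # neighbours flip sign: negative autocorrelation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

The decision still requires the tabulated bounds $d_L$ and $d_U$ for the given sample size and number of regressors, which this sketch does not include.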

Testing Stationarity
To model time series data, we examine the data structure to obtain some preliminary information about the stationarity of the series, that is, whether a trend or seasonal pattern exists. A time series is said to be stationary if both its mean and variance are constant over time. A time series plot of the data is suggested to decide whether any differencing is required before formal tests are conducted. If the data are non-stationary, we perform a Box-Cox transformation or take the first (or higher) order difference of the series, which may result in a stationary time series. The number of times the data are differenced is indicated by the parameter $d$ in the ARIMA($p$, $d$, $q$) model. An Augmented Dickey-Fuller (ADF) test is then used to evaluate the stationarity of the resulting series.
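Differencing itself is a simple operation, and a short sketch may make the role of $d$ concrete. The function names and the example values below are hypothetical, chosen only for illustration.

```python
def difference(series, d=1):
    """Apply first differencing d times; each pass shortens the series by one."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


def undifference(first_value, diffs):
    """Rebuild the levels from a starting value and the first differences."""
    levels = [first_value]
    for step in diffs:
        levels.append(levels[-1] + step)
    return levels


# Hypothetical index values with an accelerating upward trend.
levels = [100.0, 102.0, 105.0, 109.0, 114.0]
first_diff = difference(levels, d=1)    # trend in the level becomes a trend in the changes
second_diff = difference(levels, d=2)   # a second pass leaves a constant series
rebuilt = undifference(levels[0], first_diff)
print(first_diff, second_diff, rebuilt)
```

Because differencing can be inverted given the last observed level, forecasts made on the differenced scale can be accumulated back to the original CPI scale.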

Augmented Dickey-Fuller Test
The ADF test for a unit root is a powerful test used to check whether a time series is stationary or not. The ADF test procedure is similar to the Dickey-Fuller test except that it is applied to the model below. A random walk with trend and drift is defined as follows:

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t \qquad (11)$$

where $\alpha$ is a constant, $\beta$ is the parameter on a time trend and $p$ is the lag order of the AR process. Substituting $\alpha = 0$, $\beta = 0$ into eq. (11) corresponds to the random walk model, and using $\beta = 0$ corresponds to the random walk model with drift.
The unit root test is carried out under the null hypothesis $\gamma = 0$ against the alternative hypothesis $\gamma < 0$. The test statistic, the $\tau$ value, is given by:

$$\tau = \frac{\hat{\gamma}}{SE(\hat{\gamma})}$$

To put it another way, the null hypothesis is that the data have a unit root, while the alternative hypothesis is that the data do not have a unit root. The value of the test statistic is compared with the relevant critical value for the Dickey-Fuller test. If the test statistic is smaller (more negative) than the critical value, we reject the null hypothesis and deduce that there is no unit root. The ADF test does not test for stationarity directly, but indirectly through the presence (or absence) of a unit root. Using the usual threshold of 5%, differencing is needed if the p-value exceeds 0.05.
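The test statistic can be sketched for the simplest case of the regression above: a constant, no time trend and no lagged difference terms, i.e. $\Delta y_t = \alpha + \gamma y_{t-1} + \varepsilon_t$. This is a simplified illustration in plain Python, not the `adf.test` routine used in the study, and the two simulated series are hypothetical.

```python
import math
import random


def dickey_fuller_tau(series):
    """tau = gamma_hat / SE(gamma_hat) from an OLS fit of dy_t on a constant and y_{t-1}."""
    x = series[:-1]                                            # y_{t-1}
    z = [series[t] - series[t - 1] for t in range(1, len(series))]  # dy_t
    n = len(z)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    gamma = sxz / sxx
    alpha = z_bar - gamma * x_bar
    ssr = sum((zi - alpha - gamma * xi) ** 2 for xi, zi in zip(x, z))
    se_gamma = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / se_gamma


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]   # stationary by construction

walk = []                                           # random walk: has a unit root
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)

tau_noise = dickey_fuller_tau(noise)
tau_walk = dickey_fuller_tau(walk)
print(tau_noise, tau_walk)
```

The statistic must be compared with Dickey-Fuller critical values (about -2.9 at the 5% level for this specification and sample size), not with ordinary $t$ tables. The stationary series yields a strongly negative $\tau$, while the random walk typically does not.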

Correlograms
In addition to graphical stationarity checking, formal test schemes are implemented using autocorrelation function (ACF) and partial autocorrelation function (PACF). The correlograms examine the data from the time series by plotting the ACF and PACF to try and obtain the data's functional form.
The ACF reflects the degree of persistence over successive lags of a variable; it is the correlation between two values of the same variable, $y_t$ and $y_{t+k}$, that are $k$ periods apart. The sample ACF can be written mathematically as follows:

$$r_k = \frac{\sum_{t=1}^{n-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$

The PACF measures the degree of association between two values of the series that is not explained by their mutual correlations with the intervening values. The sample PACF at lag $k$, $\hat{\phi}_{kk}$, is the estimated coefficient on $y_{t-k}$ in an autoregression of order $k$:

$$y_t = \phi_{k1} y_{t-1} + \phi_{k2} y_{t-2} + \cdots + \phi_{kk} y_{t-k} + e_t$$

The ACF will be used to decide the order of the MA process, while the PACF will determine the order of the AR process. The key defining characteristics of the theoretical ACFs and PACFs for stationary processes are given in Table 1. If the original or differenced time series turns out to be non-stationary, suitable transformations will be made to achieve stationarity; we then proceed to the next step, where initial values of the model orders are identified.
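The sample ACF and PACF can be computed directly from their definitions. The sketch below uses the Durbin-Levinson recursion to obtain the PACF from the ACF, which is one common approach and is assumed here for illustration; it is not necessarily the method used by the software in the study. The input series is a toy example.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def sample_pacf(acf):
    """Partial autocorrelations from the ACF via the Durbin-Levinson recursion."""
    pacf = [1.0]   # lag 0 by convention
    prev = []      # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(acf)):
        num = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


toy = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy series, not CPI data
acf_values = sample_acf(toy, max_lag=2)
pacf_values = sample_pacf(acf_values)
print(acf_values, pacf_values)
```

At lag 1 the partial and ordinary autocorrelations coincide; they differ from lag 2 onward, where the PACF removes the effect of the intervening lags.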

Model Identification and Estimation
The Box-Jenkins approach is applied by observing the ACF and PACF of the time series. The ACF and PACF are therefore at the heart of how to classify the ARIMA model. There are three rules for the identification of an ARIMA($p$, $d$, $q$) model: if the ACF cuts off after lag $q$ and the PACF dies down, we recognize an MA($q$) process, resulting in an ARIMA(0, $d$, $q$) model; if the ACF dies down and the PACF cuts off after lag $p$, we recognize an AR($p$) process, resulting in an ARIMA($p$, $d$, 0) model; if both the ACF and PACF die down, the ARIMA model is mixed and differencing is necessary.
Once the model order has been specified (i.e., the values of $p$, $d$ and $q$), the parameters $\phi$, $\beta$ and $\theta$ need to be estimated. In fitting the ARIMA model, the concept of parsimony is important: the model should have the fewest possible parameters and still be able to explain the sequence ($p$ and $q$ should be 2 or less). The more parameters there are, the more noise can be introduced into the model, and hence the greater the variance. The following methods are applied: maximum likelihood estimation (MLE), the Akaike Information Criterion (AIC), the Corrected Akaike Information Criterion (AICc) and the Bayesian Information Criterion (BIC).

Maximum Likelihood Estimation (MLE)
The ARIMA model will be estimated using the maximum likelihood estimation (MLE) technique. This method obtains the parameter values which maximize the likelihood of obtaining the data we observed. For ARIMA models, MLE is quite similar to least-squares estimation. The likelihood function under a standard Gaussian assumption is:

$$L(\phi, \theta, \sigma^2) = (2\pi\sigma^2)^{-T/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{t=1}^{T} \varepsilon_t^2 \right)$$

where $t = 1, 2, \ldots, T$ indexes the observations of the time series, and $\sigma^2$ and $\varepsilon_t$ are the constant variance and the error terms respectively. The log likelihood is the logarithm of the probability of the observed data under the fitted model. We select the model with the maximum log likelihood.
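To make the likelihood concrete, the following sketch evaluates the Gaussian log likelihood for a given set of residuals. It is a simplification: the residuals are treated as given, whereas ARIMA software computes them from the parameters and often uses an exact likelihood. The residual values are hypothetical.

```python
import math


def gaussian_loglik(residuals, sigma2=None):
    """Log of (2*pi*sigma2)^(-T/2) * exp(-sum(e^2) / (2*sigma2)).

    If sigma2 is omitted, its maximum likelihood estimate sum(e^2)/T is used.
    """
    T = len(residuals)
    sse = sum(e * e for e in residuals)
    if sigma2 is None:
        sigma2 = sse / T
    return -0.5 * T * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)


resid = [1.0, -1.0, 1.0, -1.0]            # hypothetical residuals; sum of squares is 4
ll_at_mle = gaussian_loglik(resid)        # uses sigma2 = 4 / 4 = 1
ll_too_small = gaussian_loglik(resid, sigma2=0.5)
ll_too_large = gaussian_loglik(resid, sigma2=2.0)
print(ll_at_mle, ll_too_small, ll_too_large)
```

Moving the variance away from its maximum likelihood value in either direction lowers the log likelihood, which is the property the estimation relies on.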

Information Criteria
The AIC is useful in deciding the order of an ARIMA model. It is used for the comparison of competing models that fit the same series. It can be defined as follows:

$$AIC = -2\log(L) + 2p$$

where $L$ is the likelihood of the data and $p$ is the number of fitted model parameters (including the residual variance). The original AIC applies a linear penalty to the number of free parameters, while the AICc introduces a second term that accounts for the sample size $n$, making it more appropriate for small samples. The AICc (corrected for small sample bias) is given by:

$$AICc = AIC + \frac{2p(p+1)}{n - p - 1}$$

The Bayesian Information Criterion (BIC) can be defined as:

$$BIC = -2\log(L) + p\log(n)$$

In general, the BIC penalizes free parameters more strongly than the AIC, although this depends on the size of $n$ and the relative magnitude of $n$ and $p$. Candidate models are obtained by minimizing the AIC, AICc or BIC and maximizing the log likelihood. Our choice is to use the AICc and to choose the parsimonious model with the smallest AICc and the greatest log likelihood.
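The three criteria are simple functions of the log likelihood, the parameter count and the sample size, as the sketch below shows. The log likelihood value is invented for illustration, and the parameter count and sample size are assumptions: five parameters would correspond to three MA coefficients, a drift term and the residual variance, and 87 assumes that one of the 88 monthly observations is lost to first differencing.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """Return (AIC, AICc, BIC) for a fitted model."""
    aic = -2.0 * loglik + 2.0 * n_params
    aicc = aic + 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    bic = -2.0 * loglik + n_params * math.log(n_obs)
    return aic, aicc, bic


# Hypothetical inputs: the log likelihood is made up, not taken from the paper's tables.
aic, aicc, bic = information_criteria(loglik=-100.0, n_params=5, n_obs=87)
print(aic, aicc, bic)
```

With these inputs the BIC is the largest of the three, because $\log(87)$ exceeds 2 and each parameter is therefore penalized more heavily than under the AIC.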

Model Validation and Forecasting
An estimated model is considered suitable if it reproduces the historical behavior of the series and its innovations are white noise. Historical behavior will be checked using the ACF and PACF of the estimated series, selecting the model that best describes the temporal dependency in the CPI series, i.e. the model(s) whose residuals do not display significant lags. White-noise innovations, as well as overfitting, will be checked through a series of diagnostic tests based on the estimated residuals. The Ljung-Box test can also be used to verify whether the autocorrelations of a time series differ from zero. If the test fails to reject the null hypothesis, the residuals can be treated as independent and uncorrelated; otherwise, serial correlation persists in the sequence and the model needs adjustment.
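A good feature of the ARIMA class is its forecasting power. Gujarati attributed ARIMA's popularity to its success in prediction [18]. To forecast future values of the time series, we use this equation:

$$\hat{y}_{T+h|T} = E(y_{T+h} \mid y_T, y_{T-1}, \ldots, y_1)$$

where $y_t$, $t = 1, \ldots, T$, is the observed past up to time $T$ and $\hat{y}_{T+h|T}$ is the forecast made $h$ steps ahead. Thus, supposing that forecast errors are normally distributed, a $(1-\alpha)100\%$ prediction interval for the future value $y_{T+h}$ can be created as follows:

$$\hat{y}_{T+h|T} \pm z_{\alpha/2}\, \hat{\sigma}_h$$

where $\hat{\sigma}_h$ is the estimated standard deviation of the $h$-step forecast error and $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.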
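For an ARIMA(0, 1, q) model with drift, point forecasts and prediction intervals can be computed by hand, which the sketch below illustrates. All numbers are hypothetical placeholders rather than the estimates reported in the paper, the MA terms are assumed to enter with plus signs, and the function name is invented for this example.

```python
import math


def forecast_arima_01q(last_level, last_residuals, theta, drift, sigma2, horizon, z=1.96):
    """Point forecasts with normal prediction intervals for ARIMA(0,1,q) with drift.

    last_residuals holds the most recent residuals, newest last (at least q of them).
    Returns a list of (point, lower, upper) tuples for h = 1, ..., horizon.
    """
    q = len(theta)
    if len(last_residuals) < q:
        raise ValueError("need at least q residuals")
    recent = list(last_residuals[-q:]) if q else []
    results = []
    level = last_level
    psi_sq_sum = 0.0
    for h in range(1, horizon + 1):
        # Expected change: drift plus MA terms that still involve known residuals.
        ma_part = sum(theta[j - 1] * recent[q - 1 + h - j] for j in range(h, q + 1))
        level += drift + ma_part
        # psi_{h-1} for the level series is 1 plus the sum of the first h-1 thetas.
        psi = 1.0 + sum(theta[: min(h - 1, q)])
        psi_sq_sum += psi ** 2
        half_width = z * math.sqrt(sigma2 * psi_sq_sum)
        results.append((level, level - half_width, level + half_width))
    return results


# Hypothetical inputs for illustration only.
forecasts = forecast_arima_01q(
    last_level=100.0,
    last_residuals=[0.5, -0.2, 0.1],
    theta=[0.4, -0.1, 0.3],
    drift=0.9,
    sigma2=1.0,
    horizon=3,
)
for point, lower, upper in forecasts:
    print(round(point, 2), round(lower, 2), round(upper, 2))
```

Beyond $q$ steps ahead, the MA terms drop out and the point forecast grows by the drift each period, while the interval keeps widening because the forecast error variance accumulates.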

The Ljung-Box Test
Ljung and Box developed the standard portmanteau test of the hypothesis that the data are a realization of strong white noise [19]. The test computes the statistic

$$Q(m) = n(n+2) \sum_{k=1}^{m} \frac{r_k^2}{n-k}$$

and rejects the strong white noise hypothesis if $Q(m)$ is larger than the $(1-\alpha)$ quantile of the chi-square distribution with $m$ degrees of freedom. Here $n$ is the sample size, $r_k$ is the sample autocorrelation at lag $k$ and $m$ is the lag order, which must be specified. This is a one-tailed (one-sided) test, so the calculated p-value should be compared with the entire significance level ($\alpha$). In practice, $m$ is selected by the analyst, and if the p-value is smaller than $\alpha$ the white noise hypothesis is rejected. The best fitting model(s) will then go through different residual and normality checks, and only suitable model(s) will be chosen for the purpose of forecasting.
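The Ljung-Box statistic and its p-value can be sketched without external libraries. To keep the example self-contained, the chi-square tail probability below uses a closed form that is exact only for an even number of degrees of freedom, which is a restriction of this sketch and not of the test. The input series is a toy example.

```python
import math


def ljung_box_q(series, m):
    """Q(m) = n(n+2) * sum_{k=1..m} r_k^2 / (n - k)."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, m + 1):
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


toy = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy series, not model residuals
q_stat = ljung_box_q(toy, m=2)
p_value = chi2_sf_even(q_stat, df=2)
print(q_stat, p_value)
```

When the test is applied to residuals from a fitted ARMA model, software commonly reduces the degrees of freedom by the number of estimated ARMA coefficients (the `fitdf` argument of R's `Box.test`), which this sketch leaves to the caller through `df`.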

Results and Discussion
Descriptive statistics were computed for the Somaliland CPI data. Table 2 shows the descriptive statistics of the CPI series. The total number of observations was 88; the highest reported CPI was 183.56, recorded in April 2020, and the lowest reported CPI was 106.27, recorded in January 2013. The skewness is 0.29, implying that the CPI series is positively skewed and non-symmetric. The estimated kurtosis is -1.44, indicating that the distribution of the CPI series is flatter than a normal distribution and thus not normally distributed.

Figure 1 shows a time series plot of the Somaliland CPI series from January 2013 to April 2020. The figure clearly exhibits an upward trend in the monthly CPI. In more detail, the CPI increased gradually from 2013 to 2016. From April 2016 it decreased until October 2016, and after that it began to rise sharply. Over the last three years, the CPI rose to as much as 40 percent above its level in previous years. For this reason, the mean does not appear to be constant and therefore the series is not stationary. In addition, the series shows no seasonal variation, so the data have no seasonal component.

Before we test stationarity, an important step is to test the assumption of independent errors. From Table 3, we observe that the Durbin-Watson statistic is 0.0580 with a p-value of approximately zero. We conclude that the CPI data have positive autocorrelation.

Table 3. Durbin-Watson test: DW = 0.058047, p-value < 2.2e-16; alternative hypothesis: true autocorrelation is greater than 0.

Next, we proceed to test stationarity formally, by checking for the presence or absence of a unit root. Table 4 shows the unit root test for determining the stationarity of the series. The Dickey-Fuller statistic is -2.2425 with a p-value of 0.4764. We therefore fail to reject the null hypothesis of a unit root and treat the CPI series as a non-stationary time series. Figures 2 and 3 show the ACF and PACF of the CPI series.
The ACF coefficients start at a high value and decline slowly as the lag increases, indicating that the series is non-stationary. The spikes in the ACF plot that cross above the cut-off line suggest that the current level of CPI is significantly autocorrelated with its lagged values. The corresponding PACF plot has a significant spike only at lag 1 and then cuts off, which means that the autocorrelations at lag 2 and above are solely due to the propagation of the autocorrelation at lag 1. The non-stationarity is of order one, since only the first-lag bar is considerably higher than the critical limit, i.e. the first lag of the PACF plot is above the significance line. This implies non-stationarity and suggests differencing of the first order as the remedy.
Since the Somaliland CPI data are non-stationary, they must be expressed in first differences to become stationary. The series was transformed by taking the first differences of its values so as to attain stationarity in the first moment. The equation representing the transformation is given by:

$$\Delta CPI_t = CPI_t - CPI_{t-1}$$

where $CPI_t$ represents the monthly values of the CPI series. The time series plot of the differenced series is presented in Figure 4. The mean appears to be constant over time, i.e. there is no trend.

Table 5 shows the unit root test for determining the stationarity of the first differences. The Dickey-Fuller statistic is -6.3802 with a p-value of less than 0.01, so we reject the null hypothesis that the first-difference series is not stationary. This means we treat the CPI first differences as a stationary time series.

Table 5. Augmented Dickey-Fuller test (data: diff): Dickey-Fuller = -6.3802, Lag order = 0, p-value = 0.01; alternative hypothesis: stationary. Warning message from adf.test(diff, k = 0): p-value smaller than printed p-value.

Figures 5 and 6 show the ACF and PACF of the first difference of the CPI series. The spikes at lags 0, 1, 3, 13, 15 and 16 are beyond the significance line, so there is autocorrelation at these lags (the lag-0 autocorrelation equals one by definition). The bars at lags 5 and 6 are around zero, which is consistent with the CPI first difference being stationary.

Utilizing Box-Jenkins Methodology
Since the differenced CPI series is treated as stationary, it is safe to identify an ARIMA model and estimate its parameters for the CPI series. Table 6 shows the results obtained from ARIMA models with different orders. A model with the greatest log-likelihood and the lowest AIC, AICc and BIC is preferred over the other models. Theil's U lies between 0 and 1; the nearer it is to zero, the better the forecasting technique [20]. The study considers only the log-likelihood, AIC, AICc, BIC, Theil's U, ME, MAE, RMSE and MAPE as the criteria for forecasting CPI in Somaliland, and on this basis ARIMA (0, 1, 3) is chosen. This model is the only one that meets the above-mentioned conditions and the parsimony principle, which favors the fewest possible parameters in the model.

After selecting the model, its parameters or coefficients need to be estimated. As shown in Table 7, the coefficients of the MA (1), MA (3) and drift components are positive and statistically significant at the 1% level of significance, while the coefficient of the MA (2) component is negative and not statistically significant at the 1%, 5% or 10% levels of significance. This indicates that unobserved CPI shocks have a positive impact on current CPI in Somaliland. Such shocks can include, but are not limited to, shocks from monetary policy and favorable political outcomes. The results suggest that an increase of 1 percent in these shocks would lead to an increase in CPI of around 0.42%, 0.34% and 0.90% respectively, and hence to higher inflation.

Before model validation, we look carefully at the stability of the chosen model. Figure 7 presents the inverse roots of the MA characteristic polynomial of the ARIMA (0, 1, 3) model. As conventionally expected, the ARIMA model is stable, as the inverse roots of the characteristic polynomial lie inside the unit circle (that is, the MA polynomial is invertible).
This illustrates that our model is reliable and most appropriate for forecasting CPI in Somaliland over the period under study.
As shown in Figure 8, the ACF plot of the residuals from the ARIMA (0, 1, 3) model shows that all autocorrelations are within the threshold limits, indicating that the residuals behave like white noise. In Table 8, a Ljung-Box test returns a large p-value, also suggesting that the residuals are white noise. After identifying, estimating and validating the model, the final step, and the purpose of this study, is to forecast future values. Table 9 and Figure 9 (with a projected range from May 2020 to April 2021) both show that CPI is expected to continue growing sharply in Somaliland over the next year.
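The forecast accuracy measures used above to compare candidate models (ME, MAE, RMSE, MAPE and Theil's U) can be sketched as follows. The actual and forecast values are hypothetical. For Theil's U, the bounded "U1" form is assumed here because the text states that the statistic lies between 0 and 1; other definitions of Theil's U exist and may give different values.

```python
import math


def accuracy_measures(actual, forecast):
    """Return a dict with ME, MAE, RMSE, MAPE (in percent) and Theil's U1."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    root_mean_sq_actual = math.sqrt(sum(a * a for a in actual) / n)
    root_mean_sq_forecast = math.sqrt(sum(f * f for f in forecast) / n)
    theil_u1 = rmse / (root_mean_sq_actual + root_mean_sq_forecast)
    return {"ME": me, "MAE": mae, "RMSE": rmse, "MAPE": mape, "TheilU1": theil_u1}


# Hypothetical values, not taken from the paper's tables.
actual_values = [100.0, 110.0, 120.0]
forecast_values = [102.0, 108.0, 123.0]
measures = accuracy_measures(actual_values, forecast_values)
print(measures)
```

ME can be close to zero even for a poor model because positive and negative errors cancel, which is why it is read alongside MAE and RMSE rather than on its own.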

Utilizing Regression with ARIMA Errors
In this section, we model the CPI data using regression with ARIMA errors, as it is a potential model for autocorrelated time series data. Table 10 presents the output obtained from different regression with ARIMA errors models. As mentioned before, a model with the lowest AIC, AICc, BIC, Theil's U, ME, MAE, RMSE and MAPE and the greatest log-likelihood will be selected. Priority is given to the forecasting accuracy measures, and on this basis regression with ARIMA (2, 0, 3) errors is selected.

After the model selection, we estimate the parameters of the chosen model. Table 11 shows that the coefficients of the Time, AR (1), AR (2), MA (3) and Intercept components are statistically significant at the 5% level of significance, while the coefficients of the MA (1) and MA (2) components are negative and not statistically significant at the 1%, 5% or 10% levels of significance. The results show that an increase of one unit in time (one month) is associated with an increase in CPI of about 0.94.

The next step is to look at the stability of the selected model. Figure 10 presents the inverse roots of the AR and MA characteristic polynomials of the regression with ARIMA (2, 0, 3) errors model. The model is stable, since the corresponding inverse roots of the characteristic polynomials lie inside the unit circle. This demonstrates that our model is stable and suitable for predicting CPI in Somaliland over the period under consideration.

Before forecasting future values, we should validate the selected model. As shown in Figure 11, the ACF plot of the residuals from the regression with ARIMA (2, 0, 3) errors model shows that all autocorrelations are within the dashed lines, suggesting that the residuals behave like white noise. Also, in Table 12, a Ljung-Box test returns a large p-value, suggesting that the residuals are not significantly different from white noise.
The last step is to forecast Somaliland monthly CPI in the short term (12 months ahead). As shown clearly in Table 13 and Figure 12 (with a projected range from May 2020 to April 2021), the CPI is likely to continue rising sharply in Somaliland over the next twelve months.

Comparative Analysis
In this comparative study, the ARIMA (0, 1, 3) and regression with ARIMA (2, 0, 3) errors models compete closely with each other. However, we recommend ARIMA (0, 1, 3) as the preferred model, since it captures the stochastic variation in the data better than the other model, i.e. it has lower information criteria, smaller prediction errors and fewer model parameters.

Conclusion
ARIMA and regression with ARIMA errors models were used to investigate Somaliland's monthly CPI from January 2013 to April 2020, applying the Box-Jenkins methodology and regression analysis. The study mainly aimed to forecast the monthly CPI in Somaliland for the period May 2020 to April 2021 and selected the best fitting model based on how well the model captures the stochastic variation in the data. The ARIMA (0, 1, 3) model was found to be a reasonable and acceptable model for forecasting Somaliland's CPI over the next twelve months. Overall, CPI in Somaliland is projected to follow an upward trend over the forecast period. Based on the findings of the study, policymakers in Somaliland should pursue more prudent monetary policies to combat the increase in inflation reflected in the forecasts. In this respect, fiscal and monetary authorities are advised to take tight economic policy measures to tackle the inflation threat in Somaliland.