Forecasting of Tomatoes Wholesale Prices of Nairobi in Kenya: Time Series Analysis Using Sarima Model

Price forecasting is especially important for vegetable crops because they are highly perishable and seasonal; forecasts are often used to make better-informed decisions and to manage price risk. This is achievable only if an appropriate model with high predictive accuracy is used. In this paper, a Seasonal Autoregressive Integrated Moving Average (SARIMA) model is developed to forecast the price of tomatoes using monthly average prices from January 2003 to December 2016, obtained from the agribusiness department of the Ministry of Agriculture, Livestock and Fisheries (MALF). SARIMA (2, 1, 1) (1, 0, 1)12 was identified as the best model, being the model with the lowest Akaike Information Criterion. The parameters were then estimated by the Maximum Likelihood Estimation method. The time series of tomato prices for wholesale markets in Nairobi is taken as the national average. The predictive ability measures RMSE = 32.063, MAPE = 125.251 and MAE = 22.3 showed that the model was appropriate for forecasting the price of tomatoes in Nairobi County, Kenya.


Introduction
In Kenya, the agriculture sector is the mainstay of the economy, contributing about 30% of GDP and accounting for about 80% of employment. The total domestic value of the horticulture sector in 2012 amounted to Ksh 217 billion, occupying an area of 662,835 ha with a total production of 12.6 million tons. Compared to 2011, the total value, area and production increased by 6%, 9% and 38% respectively [13]. Vegetables contributed about 38% of the domestic value of horticulture, with 287,000 ha under production yielding 5.3 million tons valued at Ksh 91.3 billion. Production increased by 13%, while value fell slightly, by 4%, from 2011 levels. The increase in production was occasioned by favourable weather conditions that resulted in high yields, which in turn reduced the value of vegetables. In particular, prices dropped for commodities such as cabbages and tomatoes, reducing the overall value for the year.
According to [13], tomato (Lycopersicon esculentum Mill.) is among the promising commodities for horticultural expansion and development in Kenya. It accounts for about 14% of total vegetable produce and 6.72% of total horticultural crops [12,13]. Tomato is grown either in the open field or under greenhouse technology. Open-field production accounts for about 95% and greenhouse technology for about 5% of total tomato production [13].
The aim of this paper is to analyze the price fluctuations of tomatoes in Nairobi County, Kenya, using a SARIMA model as the analysis tool. Vegetable price analysis is used to formulate price stabilization policy and to increase production. Accurate public information for farmers and market stakeholders such as middlemen can inform policy makers and forecasters and help reduce price variance in other markets. Applying SARIMA as the analysis tool can give early warning of future fluctuations in tomato prices.

Review of Previous Studies
According to [15], wholesale prices for vegetables are characterized by large seasonal variations that differ in degree and timing. Because of these price fluctuations, vegetable producers often incur large losses; the adaptation of production to seasons, market research and technological development should therefore all be improved. Assis et al. [14] compared four univariate time series models: exponential smoothing, autoregressive integrated moving average (ARIMA), generalized autoregressive conditional heteroscedasticity (GARCH) and the mixed ARMA and GARCH models. Dieng [16] investigated the performance of parametric models in forecasting selected vegetable prices in Senegal and recommended the parametric ARIMA model over the nonparametric models.
The efficiency of ARIMA and GARCH models was compared for modelling and forecasting the spot prices of gram in India [23], and SARIMA models were used to forecast the prices of tomatoes in selected Indian states [19]. Gathondu [11] fitted four models to wholesale prices of major vegetables (tomatoes, potatoes, cabbages, kales and onions) for markets in Nairobi, Mombasa, Kisumu, Eldoret and Nakuru in Kenya: Autoregressive Moving Average (ARMA), Vector Autoregressive (VAR), Generalized Autoregressive Conditional Heteroscedasticity (GARCH) and the mixed model of ARMA and GARCH. The study found ARIMA (3, 1, 2) to be the best-fitting model for tomatoes, but the model failed to capture seasonal variability. Dragan et al. [18] analyzed the changes and future tendencies of the price of tomatoes with descriptive statistics and found that ARIMA models were suitable for price forecasting. Sampson et al. [17] argued that, among the seasonal decomposition models of forecasting, Seasonal Autoregressive Integrated Moving Average (SARIMA) models could enable producers who adopt them to achieve better market prices. In another study, [20] applied SARIMA models to forecast the prices of tomatoes in Turkey and found the SARIMA (1, 0, 0) (1, 1, 1) 12 model to be the most suitable. They reported that the highest seasonally adjusted tomato prices were in October. Boateng et al. [19] found that predictability increased with seasonal ARIMA (SARIMA). They noted wide fluctuations in tomato prices across months, with prices sometimes rising to 10 times their level during peak harvest periods. This implies that if farmers plan their area under tomatoes, sowing dates and sales around the prices forecasted by the ARIMA models, earnings may increase at least three to four times, with 90% forecast accuracy.
According to [21], accurate prediction of agricultural prices helps to guide the circulation of agricultural products and agricultural production correctly, and to bring supply and demand in agriculture into equilibrium.

Data Overview
The wholesale price data were gathered from the agribusiness department of the Ministry of Agriculture, Livestock and Fisheries (MALF), where they had been collected by extension officers in the various wholesale markets. The data were available as weekly prices covering the period from 2003 to 2018, from which monthly average prices were computed. In this study, the average wholesale price for markets in Nairobi County is taken as the national average. The time series is measured in Kenya shillings per kilogram (Ksh/Kg) and ranges from January 2004 to December 2018.
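The step from weekly to monthly average prices can be sketched as follows. This is a minimal illustration in plain Python; the function name `monthly_average` and the price values are hypothetical and are not taken from the MALF data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(weekly_prices):
    """Average weekly (date, price) observations within each calendar month."""
    buckets = defaultdict(list)
    for day, price in weekly_prices:
        buckets[(day.year, day.month)].append(price)
    # Sort by (year, month) so the result is in time order.
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical weekly wholesale prices in Ksh/Kg (illustrative values only).
weekly = [
    (date(2003, 1, 6), 30.0),
    (date(2003, 1, 13), 34.0),
    (date(2003, 1, 20), 32.0),
    (date(2003, 1, 27), 36.0),
    (date(2003, 2, 3), 40.0),
    (date(2003, 2, 10), 44.0),
]
monthly = monthly_average(weekly)
print(monthly)
```

Each month is keyed by a (year, month) pair, so months with four or five weekly observations are handled in the same way.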

ARIMA Model
A generalization of ARMA models that incorporates a wide class of non-stationary time series is obtained by introducing differencing into the model. The simplest example of a non-stationary process that reduces to a stationary one after differencing is the random walk. A process $\{y_t\}$ is said to follow an integrated ARMA model, denoted by ARIMA (p, d, q), if $\nabla^d y_t = (1 - B)^d y_t$ is ARMA (p, q), where $B$ is the backshift operator, $B y_t = y_{t-1}$. The model is written as Equation (1): $\phi(B)(1 - B)^d y_t = \theta(B)\varepsilon_t$, where $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$, $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ and $\varepsilon_t$ is white noise with mean zero and variance $\sigma^2$. The ARIMA methodology is carried out in three stages: identification, estimation and diagnostic checking. The parameters of the ARIMA model tentatively selected at the identification stage are estimated at the estimation stage, and the adequacy of the selected model is tested at the diagnostic checking stage. If the model is found to be inadequate, the three stages are repeated until a satisfactory ARIMA model is selected for the time series under consideration. An excellent discussion of various aspects of this approach is given in Box and Jenkins [3]. Most standard software packages, such as SAS and R, contain packages and procedures for fitting ARIMA models.
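The random walk example can be illustrated with a short sketch. It simulates a random walk from Gaussian shocks and shows that first differencing recovers the shocks, which form a stationary series. The helper `difference` and the simulated values are illustrative assumptions, not part of the original analysis.

```python
import random


def difference(series, lag=1):
    """Return the lag-differenced series y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


random.seed(0)
shocks = [random.gauss(0.0, 1.0) for _ in range(200)]

# Random walk: each value is the previous value plus a new shock.
walk = []
level = 0.0
for e in shocks:
    level += e
    walk.append(level)

# First differencing (d = 1) gives back the white-noise shocks.
diffed = difference(walk)
print(len(walk), len(diffed))
```

The same helper with `lag=12` gives the seasonal difference of a monthly series.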

SARIMA Models
SARIMA models are an adaptation of autoregressive integrated moving average (ARIMA) models designed specifically to fit seasonal time series. That is, their construction takes into account the underlying seasonal nature of the series to be modelled. Seasonality in a time series refers to a regular pattern of changes that repeats over S time periods, where S is the number of time periods until the pattern repeats. For monthly data, S = 12. In a seasonal ARIMA model, seasonal AR and MA terms predict $y_t$ using data values and errors at lags that are multiples of S (the span of the seasonality). The seasonal ARIMA model incorporates non-seasonal and seasonal factors in a multiplicative model and is denoted ARIMA (p, d, q) × (P, D, Q)S, where p = non-seasonal AR order, d = non-seasonal differencing, q = non-seasonal MA order, P = seasonal AR order, D = seasonal differencing, Q = seasonal MA order, and S = time span of the repeating seasonal pattern.
Without differencing operations, the model can be written as $\Phi(B^S)\phi(B)(y_t - \mu) = \Theta(B^S)\theta(B)\varepsilon_t$, where $\mu$ is the mean of the series and $\varepsilon_t$ is white noise. The non-seasonal components are AR: $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ and MA: $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$. The seasonal components are Seasonal AR: $\Phi(B^S) = 1 - \Phi_1 B^S - \dots - \Phi_P B^{PS}$ and Seasonal MA: $\Theta(B^S) = 1 - \Theta_1 B^S - \dots - \Theta_Q B^{QS}$.
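The multiplicative structure can be made concrete by multiplying the non-seasonal and seasonal lag polynomials. The sketch below assumes the sign convention $1 - \phi_1 B - \dots$ for the AR polynomials; the coefficient values and the helper names `poly_mul` and `lag_poly` are hypothetical.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists indexed by lag."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def lag_poly(coefs, s=1):
    """Build 1 - c1*B^s - c2*B^(2s) - ... as a coefficient list."""
    poly = [0.0] * (len(coefs) * s + 1)
    poly[0] = 1.0
    for k, c in enumerate(coefs, start=1):
        poly[k * s] = -c
    return poly


phi, Phi = 0.5, 0.8  # illustrative AR(1) and seasonal AR(1) coefficients
ar = poly_mul(lag_poly([phi]), lag_poly([Phi], s=12))

# Non-zero terms of (1 - phi*B)(1 - Phi*B^12), keyed by lag.
print({lag: c for lag, c in enumerate(ar) if c != 0.0})
```

The product contains a cross term at lag 13 with coefficient phi times Phi, which is why a multiplicative seasonal model links observations 13 months apart without needing a separate parameter.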

Model Identification in SARIMA
The first step in applying the model is to identify an appropriate order for the ARIMA (p, d, q) model, that is, to select the AR order p, the MA order q and the order of integration d. The order d is determined by whether the series is an I(1) or an I(0) process. Specifying the model and selecting p and q involves plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF), or correlogram, of the variable at different lag lengths. The joint significance of the autocorrelation coefficients is measured by the Box-Pierce Q statistic and the Ljung-Box (LB) statistic. The Box-Pierce Q statistic is defined as $Q = n \sum_{k=1}^{m} \hat{\rho}_k^2$, where n is the sample size, m is the lag length and $\hat{\rho}_k$ is the sample autocorrelation at lag k. The Ljung-Box (LB) statistic is defined by $LB = n(n + 2) \sum_{k=1}^{m} \hat{\rho}_k^2 / (n - k)$, with n, m and $\hat{\rho}_k$ as above. Under the null hypothesis of no autocorrelation, both statistics are asymptotically chi-square distributed with m degrees of freedom. The candidate SARIMA model that best fits the data under consideration is then determined. Because the SARIMA model is appropriate for stationary time series, the data under consideration must, after any differencing, satisfy the condition of stationarity, that is, the mean, variance and autocorrelation are constant over time.
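The two portmanteau statistics can be computed directly from the sample autocorrelations. The following sketch uses only the standard formulas; the function names and the small worked series are illustrative assumptions.

```python
def acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def box_pierce(series, m):
    """Q = n * sum of squared autocorrelations up to lag m."""
    n = len(series)
    return n * sum(r * r for r in acf(series, m))


def ljung_box(series, m):
    """LB = n(n+2) * sum of rho_k^2 / (n - k) up to lag m."""
    n = len(series)
    rho = acf(series, m)
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(rho, start=1))


# Tiny worked example: the lag-1 autocorrelation of [1, 2, 3, 4] is 0.25.
toy = [1.0, 2.0, 3.0, 4.0]
print(acf(toy, 1), box_pierce(toy, 1), ljung_box(toy, 1))
```

Because each term is scaled by (n + 2)/(n - k), the LB statistic is always at least as large as Q; the two agree closely when n is large relative to m.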

Parameter Estimation SARIMA
To estimate SARIMA models, the maximum likelihood (ML) method is used. Under the assumption of independent and identically distributed standardized innovations $z_t$, the log-likelihood (LL) function of $\{y_t(\boldsymbol{\theta})\}$ for a sample of $T$ observations is given by $L(\boldsymbol{\theta} \mid y) = \ln f(y_T, y_{T-1}, \dots, y_1 \mid \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is the vector of the parameters that have to be estimated for the conditional mean, conditional variance and density function, and $z_t$ is a sequence of independent and identically distributed random variables with mean zero and variance one. The ML approach requires the specification of a particular distribution for a sample of $T$ observations $y_t$. Let $f(y_T, y_{T-1}, \dots, y_1 \mid \boldsymbol{\theta})$ denote the probability density of the sample given the vector of unknown parameters $\boldsymbol{\theta}$. Following Box and Jenkins, the notation $L(\boldsymbol{\theta} \mid y)$ is the most appropriate for the likelihood. Setting the derivatives with respect to each of the unknown parameters of the vector $\boldsymbol{\theta}$ to zero, using vector notation and suppressing $y$, the result becomes $\partial L(\boldsymbol{\theta}) / \partial \boldsymbol{\theta} = 0$. As a rule, the likelihood equations are non-linear.
Therefore, the ML estimates must be found in the course of an iterative procedure. This is true for the exact likelihood function of every Gaussian ARMA (p, q) process.
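The need for an iterative search can be illustrated on the simplest non-linear case, an MA(1) process. The sketch below is an assumption-laden simplification: it maximizes a conditional Gaussian log-likelihood (pre-sample shock fixed at zero) rather than the exact likelihood, uses the convention $y_t = \varepsilon_t - \theta \varepsilon_{t-1}$, and works on simulated data. The function names are hypothetical.

```python
import math
import random


def css_loglik(theta, y):
    """Conditional Gaussian log-likelihood of y_t = e_t - theta * e_{t-1}."""
    prev = 0.0  # pre-sample shock fixed at zero (the conditioning assumption)
    sse = 0.0
    for obs in y:
        prev = obs + theta * prev  # recover e_t from the MA(1) recursion
        sse += prev * prev
    n = len(y)
    sigma2 = sse / n  # innovation variance concentrated out of the likelihood
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def maximize(f, lo, hi, grid=41, tol=1e-6):
    """Coarse grid to bracket the maximum, then golden-section iterations."""
    step = (hi - lo) / (grid - 1)
    best = max((lo + i * step for i in range(grid)), key=f)
    a, b = max(lo, best - step), min(hi, best + step)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while b - a > tol:
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2.0


random.seed(1)
true_theta = 0.6  # illustrative value used only to simulate data
shocks = [random.gauss(0.0, 1.0) for _ in range(2001)]
y = [shocks[t] - true_theta * shocks[t - 1] for t in range(1, 2001)]

theta_hat = maximize(lambda th: css_loglik(th, y), -0.95, 0.95)
print(round(theta_hat, 3))
```

Statistical packages typically use more elaborate optimizers and the exact likelihood, but the principle is the same: the estimate is reached by repeated evaluation of the likelihood, not by a closed-form solution.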

Forecasting with the SARIMA Model
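Forecasting is the process of making statements about events whose actual outcomes have not yet been observed. It is an important application of time series analysis. Once the model has passed all the diagnostic tests, it is adequate for forecasting, which is the last step in the Box-Jenkins model building approach. For instance, for the Seasonal ARIMA (0, 1, 1) (1, 0, 1) 12 model, written as $y_t = y_{t-1} + \Phi y_{t-12} - \Phi y_{t-13} + \varepsilon_t - \theta \varepsilon_{t-1} - \Theta \varepsilon_{t-12} + \theta\Theta \varepsilon_{t-13}$, the forecasts follow Cryer and Chan [24]. The one-step-ahead forecast from origin t is given by $\hat{y}_{t+1} = y_t + \Phi y_{t-11} - \Phi y_{t-12} - \theta \varepsilon_t - \Theta \varepsilon_{t-11} + \theta\Theta \varepsilon_{t-12}$. The next step is $\hat{y}_{t+2} = \hat{y}_{t+1} + \Phi y_{t-10} - \Phi y_{t-11} - \Theta \varepsilon_{t-10} + \theta\Theta \varepsilon_{t-11}$, and so on. The noise terms $\varepsilon_{t-12}, \varepsilon_{t-11}, \dots, \varepsilon_t$ (as residuals) enter into the forecasts for lead times $l = 1, 2, \dots, 13$, but for $l > 13$ the autoregressive part of the model takes over: $\hat{y}_{t+l} = \hat{y}_{t+l-1} + \Phi \hat{y}_{t+l-12} - \Phi \hat{y}_{t+l-13}$ for $l > 13$ (15).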
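The forecast recursion for the Seasonal ARIMA (0, 1, 1) (1, 0, 1) 12 example can be sketched as below. It assumes the sign convention in which the MA terms enter with a minus sign, sets future shocks to their expectation of zero, and uses hypothetical parameter values and input data; the function name `sarima_forecast` is illustrative.

```python
def sarima_forecast(y, resid, theta, Theta, Phi, steps):
    """Recursive forecasts for ARIMA(0,1,1)x(1,0,1)_12; needs len(y) >= 13."""
    n = len(y)
    path = list(y)  # observed values, extended with forecasts as we go
    shocks = list(resid) + [0.0] * steps  # future shocks have expectation zero
    for h in range(steps):
        t = n + h
        value = (
            path[t - 1]
            + Phi * path[t - 12]
            - Phi * path[t - 13]
            - theta * shocks[t - 1]
            - Theta * shocks[t - 12]
            + theta * Theta * shocks[t - 13]
        )
        path.append(value)
    return path[n:]


# Hypothetical inputs: a short price history and its in-sample residuals.
history = [40.0 + (t % 12) + 0.1 * t for t in range(36)]
residuals = [0.5 if t % 2 == 0 else -0.5 for t in range(36)]
forecasts = sarima_forecast(history, residuals, theta=0.4, Theta=0.3, Phi=0.5, steps=20)
print([round(f, 2) for f in forecasts[:3]])
```

For lead times above 13 every shock index in the recursion points at a future shock, so only the autoregressive terms remain, which matches Equation (15).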

Forecasting Performance
The accuracy of each model can be checked to determine how it performs in terms of in-sample forecasts. For out-of-sample forecasting, some of the observations are left out during model building. The accuracy of the models can be compared using forecast measures such as the mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE) and mean absolute percentage error (MAPE): $MAE = \frac{1}{T}\sum_{t=1}^{T} |y_t - \hat{y}_t|$, $MSE = \frac{1}{T}\sum_{t=1}^{T} (y_t - \hat{y}_t)^2$, $RMSE = \sqrt{MSE}$ and $MAPE = \frac{100}{T}\sum_{t=1}^{T} |(y_t - \hat{y}_t)/y_t|$, where $y_t$ is the actual observation, $\hat{y}_t$ is the fitted or forecast value and $T$ is the sample size. If the forecast is perfect, then MAE = MSE = RMSE = 0. The smaller the value, the better the prediction; the larger the value, the poorer the predictive power of the model.
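These accuracy measures are straightforward to compute. The sketch below uses the standard definitions; the function name and the three price pairs are made up for illustration and are not the paper's data.

```python
import math


def forecast_errors(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (in percent) for paired observations."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual value, so actual prices must be non-zero.
    mape = 100.0 / n * sum(abs(e / a) for e, a in zip(errors, actual))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical actual and forecast prices in Ksh/Kg.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
metrics = forecast_errors(actual, predicted)
print(metrics)
```

MAE and RMSE are in the units of the series, whereas MAPE is a percentage and is therefore sensitive to months in which the actual price is very low.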

Empirical Results
The data used in this study are average monthly prices of tomatoes in Nairobi County from 2003 to 2016. The wholesale price data were gathered from the agribusiness department of the Ministry of Agriculture, Livestock and Fisheries (MALF), where they had been collected by extension officers in the various wholesale markets. Figure 1 shows the presence of trend and seasonality in the time series of tomato prices in Nairobi County. Table 1 shows the monthly descriptive statistics of tomato prices. The average monthly price was highest in December and lowest in October.
As observed from Figure 4, tomato prices do not show a significant trend, which suggests that the series is stationary. Indeed, the null hypothesis was rejected in the Augmented Dickey-Fuller (ADF) test, which was performed to determine whether or not the series is stationary. This shows that the series does not have a unit root, which means it is stationary (Table 2). The ACF and PACF values of the seasonally differenced series are presented in Figure 6. The seasonal spikes in the ACF and PACF (at lags 12, 24, …) are observed to cut off after 1 seasonal lag once the seasonal difference of the series is taken. This indicates a seasonal model with AR (1) and MA (1) terms. Therefore, setting the seasonal part (P, D, Q) of the model to (1, 1, 1) can be considered one of the best of the alternative choices. For the non-seasonal part of the model (p, d, q), the cut-off of the PACF after 1 lag indicates that adding an AR (1) term may be appropriate (see Figure 5). On the other hand, although the ACF values also cut off after 1 lag, they are observed to increase again after a certain lag. There is therefore no clear indication for the MA term in the non-seasonal part of the model. In this case, two alternatives are considered for the non-seasonal part of the model: one in which the MA term is not added, (1, 0, 0), and one in which the MA term is added (1, 1,).
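The logic of the unit root test can be sketched with a simplified Dickey-Fuller regression. This is not the full ADF test used in the study: it omits the lagged difference terms, relies on the asymptotic 5% critical value of -2.86 for a regression with a constant, and runs on simulated data. The function name is hypothetical.

```python
import math
import random


def dickey_fuller_t(y):
    """t-statistic of rho in dy_t = alpha + rho * y_{t-1} + e_t (no lag terms)."""
    x = y[:-1]
    d = [y[t] - y[t - 1] for t in range(1, len(y))]
    n = len(d)
    x_bar, d_bar = sum(x) / n, sum(d) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    rho = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d)) / sxx
    alpha = d_bar - rho * x_bar
    rss = sum((di - alpha - rho * xi) ** 2 for xi, di in zip(x, d))
    se = math.sqrt(rss / (n - 2) / sxx)  # OLS standard error of rho
    return rho / se


CRITICAL_5PCT = -2.86  # asymptotic Dickey-Fuller value, regression with constant

random.seed(42)
series = [50.0]
for _ in range(299):
    # Stationary AR(1) around a mean of 50 (simulated, not the MALF data).
    series.append(50.0 + 0.3 * (series[-1] - 50.0) + random.gauss(0.0, 5.0))

stat = dickey_fuller_t(series)
print("reject unit root:", stat < CRITICAL_5PCT)
```

A test statistic below the critical value rejects the null hypothesis of a unit root, which is the outcome the text reports for the tomato price series.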
Nine possible alternative models, differing in the seasonal part of the model, were analyzed in order to select the SARIMA model to be used to forecast the prices of tomatoes. The models are compared according to the Akaike Information Criterion (AIC) and the Schwarz Bayesian Criterion (SBC). The model selected should have the smallest AIC and SBC values (Wang and Lim, 2005). The SARIMA (2, 1, 1) x (1, 0, 1) 12 model has comparatively lower AIC and SBC values (Table 3). Therefore, this model was selected as the most suitable, or best-fitting, model among the candidate models. For this model, the autoregressive and seasonal parameters were estimated (Table 4). Although the constant term in the estimated SARIMA model is not significant at any conventional level, the autoregressive and seasonal parameters are significant at the 1% level.
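Model selection by information criteria can be sketched as follows, using the standard definitions AIC = -2 ln L + 2k and SBC = -2 ln L + k ln n. The log-likelihood values and parameter counts below are invented for illustration and are not the values in Table 3; only the sample size of 168 monthly observations (January 2003 to December 2016) follows from the text.

```python
import math


def aic(loglik, k):
    """Akaike Information Criterion for a model with k estimated parameters."""
    return -2.0 * loglik + 2.0 * k


def sbc(loglik, k, n):
    """Schwarz Bayesian Criterion; the penalty grows with the sample size n."""
    return -2.0 * loglik + k * math.log(n)


N_OBS = 168  # monthly observations, January 2003 to December 2016

# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "model_a": (-700.0, 3),
    "model_b": (-690.0, 6),
    "model_c": (-698.0, 4),
}

scores = {name: (aic(ll, k), sbc(ll, k, N_OBS)) for name, (ll, k) in candidates.items()}
best_aic = min(scores, key=lambda name: scores[name][0])
best_sbc = min(scores, key=lambda name: scores[name][1])
print(best_aic, best_sbc)
```

The two criteria can disagree, because SBC penalizes additional parameters more heavily than AIC once n exceeds about 7; when they agree, as in this toy example, the choice is unambiguous.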
After the parameters of this model had been estimated, further analysis was carried out on the selected SARIMA (2, 1, 1) x (1, 0, 1) 12 model to check whether its residuals are independent. The autocorrelations and partial autocorrelations up to 36 lags were computed and their significance was tested using the Ljung-Box test. It is evident from Figure 6 that the residual autocorrelations of the SARIMA (2, 1, 1) x (1, 0, 1) 12 model lie within the upper and lower confidence limits. Panel (b) shows the p-values of the Ljung-Box statistics. Given the high p-values associated with the statistics, the null hypothesis of independence in the residual series cannot be rejected. The results indicate that none of these correlations is significantly different from zero at the 95% confidence level. This shows that the selected SARIMA (2, 1, 1) x (1, 0, 1) 12 model is an appropriate model for forecasting monthly tomato prices.
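The check against the confidence limits can be sketched by comparing each residual autocorrelation with the approximate 95% bounds of plus or minus 1.96 divided by the square root of n. The residual series below are simulated stand-ins, since the model residuals themselves are not reproduced in the text; the function names are illustrative.

```python
import math
import random


def acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def lags_outside_bounds(residuals, max_lag=36):
    """Return the 95% bound and the lags whose autocorrelation exceeds it."""
    bound = 1.96 / math.sqrt(len(residuals))
    rho = acf(residuals, max_lag)
    return bound, [k for k, r in enumerate(rho, start=1) if abs(r) > bound]


random.seed(7)
noise = [random.gauss(0.0, 1.0) for _ in range(168)]  # stand-in for well-behaved residuals
trending = [float(t) for t in range(168)]  # stand-in for badly behaved residuals

bound, noise_flagged = lags_outside_bounds(noise)
_, trend_flagged = lags_outside_bounds(trending)
print(round(bound, 4), len(noise_flagged), len(trend_flagged))
```

With 36 lags, one or two autocorrelations just outside the bounds can occur by chance even for white noise, which is why the individual bounds are usually read together with the joint Ljung-Box test.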
The residuals were checked to determine whether they follow a white noise process. This was done by examining the residual Q-Q plot and normality test plots shown in Figure 7. The Q-Q plot is reasonably straight, which supports normality. The histogram shows a bell-shaped distribution, and the normality test p-value of 0.0639237 exceeds 0.05, so normality of the residuals is not rejected.
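A normality test of this kind can be sketched with the Jarque-Bera statistic, which combines sample skewness and kurtosis. The text does not say which normality test produced the reported p-value, so the choice of Jarque-Bera here is an assumption, and the input values are artificial.

```python
import math


def jarque_bera(values):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # For 2 degrees of freedom the chi-square survival function is exp(-x/2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Artificial two-point sample: symmetric (skewness 0) but with kurtosis 1,
# far from the normal value of 3, so normality should be rejected.
two_point = [-1.0, 1.0] * 50
jb_stat, p_val = jarque_bera(two_point)
print(round(jb_stat, 3), p_val < 0.05)
```

A p-value above 0.05, as reported in the text, means the test does not reject normality of the residuals at the 5% level.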
In addition, the ACF plot of the residuals in Figure 5 shows that, for the first 20 lags, all sample autocorrelations fall inside the 95% confidence bounds, indicating that the residuals appear random. The forecasting evaluation statistics in Table 4 reveal that the SARIMA model is appropriate for forecasting tomato prices in Kenya. The selected SARIMA (2, 1, 1) x (1, 0, 1) 12 model was used to forecast the mean monthly real tomato prices from January 2011 to December 2011 using the observed data for the period January 2003 to December 2016. The predicted prices were compared with the observed prices (Table 5). The predicted real tomato prices are close to the observed prices, except for the months of March and April. This result indicates that the model provides an acceptable fit for predicting tomato prices.
After obtaining satisfactory forecasting results over a short period, the selected SARIMA (2, 1, 1) x (1, 0, 1) 12 model was employed to forecast tomato prices over a longer period. Table 5 displays a forecast of the monthly real tomato prices in Nairobi County, Kenya, for 2019. The forecasted tomato prices are compared with the observed prices in Figure 8. As is evident from Figure 8, the SARIMA (2, 1, 1) x (1, 0, 1) 12 model is able to capture the price trend, and the forecast series tracks the actual series quite well during the period. The accuracy of the model is assessed using the MAPE. The MAPE is 125.251%, meaning that the forecasts are off by about 125% on average. The estimation error is considered to be at an acceptable level given the extraordinary factors involved. Changes in market entry conditions in exporting countries immediately affect tomato exports and prices (entry price, request for active ingredient, etc.).
As shown in Figure 8, the fluctuation structure of the prices predicted for the following three years is similar to that of previous years, but it is more stationary. This result indicates that, under normal conditions, important changes will not occur in tomato prices until the end of 2014.

Conclusion
The results obtained from this study show that the prices of tomatoes in Nairobi County have not trended either upward or downward; in other words, the series is stationary. The forecasts from the SARIMA (2, 1, 1) (1, 0, 1) 12 model, which was chosen to determine the course of prices over the next 2 years, show that no significant changes will occur by the end of 2019. Better price forecasting methods that take seasonality into consideration need to be developed to forecast tomato prices accurately. When there are large price fluctuations, disseminating accurate price forecasts among stakeholders helps them make informed decisions relating to land in production, marketing, trade and storage. Adoption of cooperative and contract farming, which shifts price risk from farmers to large retailers, can stabilize farmer income. If forecasted prices are too low, farmers can use the forecast price information to hedge their positions by storing tomatoes in cold storage, selling them in other markets where prices are higher, or processing them into tomato paste, tomato sauce and ketchup.