Generalized Regression Control Chart for Monitoring Crop Production

Nigeria has recently focused on agriculture as a way to diversify its economy. Crop production, a proxy for agricultural output, is therefore important, and so is monitoring crop production (output) across the states of Nigeria. In this study, the generalized regression control chart was used rather than the conventional control chart. The conventional control chart does not take into account the factors that affect crop production, whereas the generalized regression control chart accounts for a factor (independent variable) that affects crop production (dependent variable). The Gaussian (normal) regression control chart is a special case of the generalized regression control chart. The possibility of using Weibull regression and other non-normal models was considered. In this research, the Gaussian distribution was used as the underlying distribution because it fitted the crop production data. The cost of seed/seedlings was selected from a set of independent variables because it was the most significant of them. The data were collected from a secondary source, the National Bureau of Statistics (NBS). All 36 states in Nigeria and the Federal Capital Territory (FCT) were included in the study. The generalized regression control chart showed that crop production in Nigeria is not in control, which was traced to an assignable cause of variation in FCT, Abuja. This implies that FCT, Abuja produced below the lower control limit of crop production, despite its relative cost of seed/seedlings.


Introduction
In industry, the quality of the expected product and the actual goods manufactured should be the same, but sometimes some variations are found, thus producing deviations that are random or assignable. Statistical quality control and Six Sigma are ways of controlling the quality of products by reducing such deviations from standard.
Shewhart [1] was the first author to propose control charts and since then a lot of charts have been established in monitoring and controlling different production processes. A conventional Shewhart control chart is plotted with the mean of process observations at different points with a pair of control limits. In developing a Shewhart control chart, one of the important assumptions is that the distribution function of the underlying process data is normal and the other assumption is that process data are independently distributed.
Statistical quality control (SQC) was defined by Montgomery [2] as a technique for analysing the process, setting standards, comparing performance, verifying and studying deviations, seeking and implementing solutions, and analysing the process again after the changes, in pursuit of the best performance of machinery and/or persons. Control charts are statistical process control tools used to monitor and control a process. The process is said to be in control if all the plotted points fall within the upper and lower control limits.
Alwan and Roberts [3] showed that about 85% of a sample of 235 control charts displayed incorrect control limits, and Karaoglan and Bayhan [4] mentioned that more than half of these displacements were due to violation of the independence assumption. This suggests that the remaining displacements could be due to violation of the normality assumption. So, for a conventional control chart to display correct control limits, the process data must be normally distributed. The conventional control chart also does not take into account factors that may affect the variable being controlled, whereas the generalized regression control chart considers a factor that can affect the variable of interest. For this reason, the conventional control chart is not appropriate for modelling a variable such as crop production, which depends not just on time but also on other factors. Crop production might even increase or decrease with time, with varying mean and variance. Hence the need for a generalized regression control chart for such data. Regression takes care of the factor that affects the response variable, so combining a regression model with the conventional control chart gives a regression control chart.
The regression control chart was first published in 1955, in a book titled "Statistics: a new approach" by Wallis and Roberts [5]. Mandel [6] popularised it and, in 1969, applied it to monitor and control the man-hours spent dispatching mail in post offices, regressed on the pieces of mail handled [7]. Mandel [7] mentioned that the regression control chart has proved useful for a variety of postal management problems and offers possibilities for more widespread applications in government, business, industry and agriculture.
A statistical model is a description of the probability distribution of random variables that can be assumed to represent a real-world phenomenon [8]. A linear regression model describes the relationship between a covariate x and a continuous response variable Y [8]. One important assumption of the linear regression model is that the response variable (Y) and the error term are normally distributed.
Some examples of the application of the regression control chart to autocorrelated processes were given by Karaoglan and Bayhan [4]. The regression control chart considers the factor that can affect the dependent variable but assumes the normal distribution by default. The generalized regression control chart, however, allows any suitable distribution, of which the Gaussian (normal) distribution is a special case.
This paper therefore focuses on applying the generalized regression control chart as a means of setting standards for controlling and monitoring crop production among the states in Nigeria, to guard against over- or under-production. This combination of the conventional control chart and the generalized regression model improves on the work of Mandel [9] by selecting the independent variable that accounts for most of the variation in the dependent variable, and by opening the ground for generalized regression control charts (Weibull regression, Gamma regression, Rayleigh regression, Exponential regression and so on).
The remaining part of this paper is organized as follows. Section two covers control charts. In section three, the generalized regression control chart is presented, together with parameter estimation and the establishment of the control limits. Section four consists of the application of the generalized regression control chart to the regression of crop production on cost of seed/seedlings. Section five contains the concluding remarks.

Conventional Control Chart
The control chart was invented by Walter A. Shewhart while working for Bell Labs in the 1920s. What makes the control chart such a useful tool is that it reveals the amount of variation over time, enabling the user to observe patterns for interpretation and to discover changes in the process. Grant and Leavenworth [10] gave an example of the use of the Shewhart chart as a tool for analysing the tolerance of a rheostat. In addition, conducting a control chart analysis before a Six Sigma calculation allows the calculation to reveal the true inherent process capability [11], and Woodall et al. [12] stated that statistical quality control is a collection of tools that are essential in quality improvement activities.
An example of the conventional control chart is depicted in Figure 1. The average characteristic ($\bar{x}$) is plotted against time. This conventional control chart is useful if a large variation is not suspected to be caused by another variable. If the characteristic variable is affected by another variable, then the conventional control chart will not be appropriate; hence the need for a generalized regression control chart, of which the Gaussian regression control chart is a special case.
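As a minimal sketch of the conventional chart described above, the following Python snippet computes a centre line and a pair of limits parallel to it, then flags points outside them. The readings are hypothetical, the helper names are illustrative, and the use of the sample standard deviation with a multiplier of 2 is an assumption made for this example.

```python
from statistics import mean, stdev

def shewhart_limits(values, k=2.0):
    """Return (LCL, CL, UCL) as the mean minus/plus k sample standard deviations."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return centre - k * spread, centre, centre + k * spread

def out_of_control(values, k=2.0):
    """Indices of points that fall outside the control limits."""
    lcl, _, ucl = shewhart_limits(values, k)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Hypothetical readings; the last one is deliberately extreme.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 14.5]
lcl, cl, ucl = shewhart_limits(readings)
flagged = out_of_control(readings)
print(f"LCL={lcl:.3f} CL={cl:.3f} UCL={ucl:.3f} flagged={flagged}")
```

Because the limits are horizontal, this chart cannot tell whether a high or low reading is explained by another variable, which is the gap the regression control chart is meant to close.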

Regression Control Chart
The conventional control chart uses a line of average performance with control limits parallel to this central line. The upper control limit, lower control limit and central line are all parallel to the horizontal axis, implying that a single average is being controlled [9]. Mandel [9] stated that the regression control chart has the following elements, which distinguish it from the conventional control chart.
1. It is a model that controls a varying average rather than a constant average. The central line is the regression line.
2. The control limits are parallel to the regression line rather than to the horizontal axis. The scatter plot is very useful here. Three lines are drawn on the scatter plot: the central line (line of best fit), the upper control limit and the lower control limit. The three lines are expected to slant upward or downward.
3. The computation for the construction of the regression control chart is time-consuming compared with the conventional control chart, but modern high-speed computers remove this difficulty. The standard deviation of the regression control chart is the standard error of estimate of the regression line, that is, the standard deviation estimate based on the deviations of the observed values about the regression line. It is quite different from the standard error of a predicted value of the dependent variable.
4. The regression control chart is appropriate for a number of applications to which the conventional control chart does not readily apply. It provides a basis for measuring gains or losses in the response variable, for predicting and forecasting the response variable, and for scheduling the covariate resources.
The schematic representations of the conventional control chart and the regression control chart are shown in Figures 1 and 2 respectively. The two charts look alike but are different: the conventional control chart is univariate, while the regression control chart is bivariate. The two figures are replicas of those in [9].

Data Description
The data collected initially are panel data consisting of 37 cross-sections (the states in Nigeria including FCT, Abuja) and 10 periods (10 years, from 2006 to 2015). The average of the 10 years was computed for each cross-section, reducing the panel data to cross-sectional data. The data were collected from the administrative records and publications of the National Bureau of Statistics (NBS), through its two data collection infrastructures: the National Integrated Survey of Household (NISH) and the National Integrated Survey of Establishment (NISE). The NISH Master Sample was constructed from the frame of enumeration areas (EAs) of the 2006 Housing and Population Census by the National Population Commission (NpopC). The household listings of the EAs were stratified into farming and non-farming households, and the sample was drawn from the farming households by randomization (see [13], [14]). The data collected are crop production (Y), total area cultivated ($X_1$), fertilizer usage ($X_2$), rural employment in crop production ($X_3$) and cost of seed/seedlings ($X_4$). A generalized regression line is fitted to these historical data, and limits are established around the regression line.
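The reduction from panel data to cross-sectional data described above amounts to averaging each state's yearly values. A minimal Python sketch follows; the row layout `(state, year, value)`, the function name and the numbers are hypothetical and are not NBS figures.

```python
from collections import defaultdict
from statistics import mean

def average_by_state(panel_rows):
    """Collapse (state, year, value) rows into one mean value per state."""
    grouped = defaultdict(list)
    for state, _year, value in panel_rows:
        grouped[state].append(value)
    return {state: mean(values) for state, values in grouped.items()}

# Hypothetical panel: two states observed over three years.
rows = [
    ("State A", 2006, 100.0), ("State A", 2007, 110.0), ("State A", 2008, 120.0),
    ("State B", 2006, 50.0), ("State B", 2007, 70.0), ("State B", 2008, 60.0),
]
cross_section = average_by_state(rows)
print(cross_section)
```

Applied to 37 states over 10 years, the same step turns 370 panel observations into 37 cross-sectional points, one per state.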

Generalized Regression Control Chart
The generalized regression control chart has all the attributes of the regression control chart of Mandel [9]. The difference between these two charts is the difference between the ordinary regression model and the generalized regression model. The former assumes normality of the response variable and the error term, while the latter allows other distributions as well. So, the regression control chart is a special case of the generalized regression control chart. In the ordinary regression control chart of Mandel [9], it is assumed that the values of the response variable, y, are linearly related to the values of the covariate, x. For each specific x value, it is assumed that the y values are normally and independently distributed, with a mean value estimated from the regression line and with a standard error that is independent of the values of x and is estimated from the deviations of the actual observations, Y, from the values Ŷ estimated from the regression line. The generalized regression control chart also assumes that the y values are independently distributed with a mean value estimated from the regression line, but they are not necessarily normally distributed. A good example is the beta regression control chart (BRCC) of Bayer et al. [15].

Linear Model
In statistics, a multiple linear regression model describes the relationship between a continuous response variable, Y, and a set of covariates, X. This model is defined as
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + e_i \tag{1}$$
If k = 1 in equation (1), we have the simple linear regression model given by
$$Y_i = \beta_0 + \beta_1 x_i + e_i \tag{2}$$
where $\beta_0$ is the intercept term, $\beta_1$ is the regression coefficient for variable X and $e_i$ is the error term. Assume that the error terms are random, independent and normally distributed with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2$. Note that the variance is independent of x. The error term, $e_i$, in equations (1) and (2) is written explicitly. It is also possible to write the model in equation (2) without explicitly specifying the error term:
$$E(Y_i \mid x_i) = \mu_i = \beta_0 + \beta_1 x_i \tag{3}$$
The model in equation (3) specifies the expected value of Y conditional on x. Equation (3) does not specify how the values of Y vary around the expected value $E(Y_i \mid x_i)$. By defining $\mathrm{Var}(Y_i) = \sigma^2$, we obtain a model equivalent to the one specified in equation (2).
The linear model in (3) is transformed to a generalized linear model by letting
$$g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_i \tag{4}$$
where g(.) is the link function, which is a real-valued, monotonic and differentiable function, and the term $\eta_i$ is the linear predictor. Canterle and Bayer [16] presented several possible choices for link functions, such as logit, probit, log-log, complementary log-log, Cauchy, and also parametric links. Here $\mu_i$ is the expected value of $Y_i$, $\eta_i$ is a linear combination of the predictors, and g(.) defines the relationship between $\mu_i$ and $\eta_i$. Since g(.) is monotonic, the relationship between $\mu_i$ and $\eta_i$ is monotonic as well. Thus, the inverse of g(.) gives
$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\beta_0 + \beta_1 x_i) \tag{5}$$
which is an alternative to the linear model. The linear model is a special case of the generalized linear model, obtained when $g(\mu_i) = \mu_i$. If there is more than one independent variable, then equation (4) becomes
$$g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} \tag{6}$$
For equations (4) and (6) to hold, some assumptions must be satisfied by $Y_i$ in the model. The distribution of $Y_i$ must belong to the exponential family, the $Y_i$ must be mutually independent, and they must have expected value $E(Y_i) = \mu_i$. The exponential family has a probability density function given by
$$f(y_i; \theta_i, \phi) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right\} \tag{7}$$
where $\theta_i$ and $\phi$ are location and scale parameters respectively, and $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are known functions. Since $Y_i$ has a distribution in the exponential family, its mean and variance are given by
$$E(Y_i) = \mu_i = b'(\theta_i) \tag{8}$$
$$\mathrm{Var}(Y_i) = b''(\theta_i)\,a(\phi) \tag{9}$$
where $b'(\cdot)$ and $b''(\cdot)$ in equations (8) and (9) are the first and second derivatives of $b(\cdot)$ with respect to $\theta_i$; equation (9) can be further rewritten as equation (10).
As mentioned earlier, the second aspect of the generalization is that, instead of modelling the mean $\mu_i$ directly, we use a one-to-one, continuous and differentiable transformation $g(\mu_i)$ given as
$$\eta_i = g(\mu_i) \tag{11}$$
The function $g(\mu_i)$ is called the link function. It is further assumed that the transformed mean follows a linear model, so that equations (4) and (6), equated to (11), are written in matrix form as
$$\eta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} \tag{12}$$
Since the link function is one-to-one, we can invert equation (12) to obtain equation (5), making $\mu_i$ the subject of the formula. It should be noted that the response variable $Y_i$ is not transformed; rather, its expected value $\mu_i$ is.
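To make the role of the link function concrete, the following Python sketch pairs a few common links with their inverses and recovers the mean from a linear predictor, that is, mu = g^{-1}(beta0 + beta1 * x). The dictionary `LINKS`, the helper `mean_from_predictor` and the coefficient values are illustrative assumptions, not part of the original analysis.

```python
import math

# Each entry maps a link name to the pair (g, g_inverse).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
}

def mean_from_predictor(link_name, beta0, beta1, x):
    """Invert the link: mu = g^{-1}(beta0 + beta1 * x)."""
    _, inverse = LINKS[link_name]
    return inverse(beta0 + beta1 * x)

# With the identity link the generalized model reduces to the linear model.
print(mean_from_predictor("identity", 1.0, 2.0, 3.0))
# With the log link the same linear predictor is mapped to a positive mean.
print(mean_from_predictor("log", 1.0, 2.0, 3.0))
```

The sketch also shows why only the mean is transformed: the response values themselves never pass through g, only the linear predictor is mapped back to the mean scale.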

Gaussian Regression Model
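Recall from equation (1) that, if Y follows a normal distribution with mean $\mu$ and variance $\sigma^2$, then its pdf is given by
$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\} \tag{13}$$
The same procedure used here for the Gaussian (normal) distribution can be applied to every other distribution belonging to the exponential family.
Equation (13) can be rewritten as
$$f(y; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2}\right\} \tag{14}$$
Take the log of (14) to have
$$\log f(y; \mu, \sigma^2) = \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \tag{15}$$
Take the exponential of (15) to obtain the desired exponential-family form, given by
$$f(y; \mu, \sigma^2) = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right\} \tag{16}$$
By comparing equation (16) with (7), we have that $\theta = \mu$, $b(\theta) = \theta^2/2$, $a(\phi) = \sigma^2$ and $c(y, \phi) = -\dfrac{y^2}{2\sigma^2} - \dfrac{1}{2}\log(2\pi\sigma^2)$.
Mean and Variance of Y
Recall from equation (8) that $E(Y) = b'(\theta)$. Since $b(\theta) = \theta^2/2$, we have $b'(\theta) = \theta$, so that
$$E(Y) = \theta = \mu \tag{17}$$
Also, recall from equation (9) that $\mathrm{Var}(Y) = b''(\theta)\,a(\phi)$. Since $b''(\theta) = 1$ and $a(\phi) = \sigma^2$,
$$\mathrm{Var}(Y) = \sigma^2 \tag{18}$$
Equations (17) and (18) are the mean and variance of Y respectively, where Y is normally distributed.
The link function of the normal distribution is the identity link, given by
$$g(\mu_i) = \mu_i \tag{19}$$
where $g(\mu_i)$ is the link function and $\mu_i$ is the mean.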
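As a quick numerical check of the exponential-family form of the Gaussian density, the sketch below evaluates the ordinary normal pdf and the form exp{[y*theta - b(theta)]/a(phi) + c(y, phi)} side by side. It assumes the standard Gaussian components theta = mu, b(theta) = theta^2/2, a(phi) = sigma^2 and c(y, phi) = -y^2/(2 sigma^2) - (1/2) log(2 pi sigma^2); the function names and test values are illustrative.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Ordinary form of the normal density with mean mu and variance sigma2."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def exp_family_pdf(y, theta, phi):
    """Exponential-family form with the Gaussian choices of a, b and c."""
    b = theta ** 2 / 2                      # b(theta); b'(theta) = theta = mean
    a = phi                                 # a(phi) = sigma^2; b''(theta) * a = variance
    c = -y ** 2 / (2 * phi) - 0.5 * math.log(2 * math.pi * phi)
    return math.exp((y * theta - b) / a + c)

for y in (-1.5, 0.0, 2.0, 7.3):
    direct = normal_pdf(y, mu=2.0, sigma2=4.0)
    family = exp_family_pdf(y, theta=2.0, phi=4.0)
    print(f"y={y:5.1f} direct={direct:.6f} family={family:.6f}")
```

The two columns agree to floating-point precision, which is what the term-by-term comparison of the two forms of the density asserts.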

Maximum Likelihood Estimation for Normal Regression Parameters
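From the pdf in equation (16), with $\mu_i = \beta_0 + \beta_1 x_i$, the log-likelihood is given as
$$\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \tag{20}$$
Differentiating equation (20) with respect to $\beta_0$ and $\beta_1$, setting the derivatives to zero and solving the resulting equations gives
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \tag{25}$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{26}$$
Equations (25) and (26) are the unbiased estimates of $\beta_0$ and $\beta_1$ respectively. This process can be used to derive the parameter estimates for other members of the exponential family. However, where the differentiation is difficult or the solution is not in closed form, we can use the expression given by [8] to obtain the first derivative of the log-likelihood function of the exponential family defined in equation (7) in terms of $\beta$ as
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n}\frac{(y_i - \mu_i)}{\mathrm{Var}(Y_i)}\,x_{ij}\left(\frac{\partial \mu_i}{\partial \eta_i}\right) \tag{27}$$
The regression parameters of other distributions that are members of the exponential family can also be derived using (27). This sets the pace for the generalized regression control chart. Other examples could be the Gamma regression control chart, Weibull regression control chart, Rayleigh regression control chart and Exponential regression control chart.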
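For the Gaussian case, maximizing the log-likelihood leads to the familiar closed-form estimates of the intercept and slope of a simple linear regression. The sketch below computes them directly; the function name `fit_line` and the small data set are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean

def fit_line(x, y):
    """Closed-form Gaussian ML (least squares) estimates: (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical covariate and response values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = fit_line(x, y)
print(f"intercept={b0:.4f} slope={b1:.4f}")
```

For other members of the exponential family there is generally no closed form, and the score equations are solved iteratively instead.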

Establishing the Regression Control Chart
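Using the generalized regression line $\hat{y}_i$ obtained by the maximum likelihood method as the central line, and twice the standard error of estimate (i.e., $2S_e$), the control limits, set at 2 standard deviations above and below the generalized regression line, are given by
$$UCL = \hat{y}_i + 2\sigma \tag{28}$$
$$LCL = \hat{y}_i - 2\sigma \tag{29}$$
The value of $\sigma$ is unknown but is estimated with $S_e$. Either $2\sigma$ or $3\sigma$ limits may be used in equations (28) and (29); two-sigma limits are used in this study.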
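The construction just described can be sketched in Python for the Gaussian case: fit the line, estimate the standard error of estimate from the residuals, and classify each point against the central line plus or minus k standard errors. The data are hypothetical, the function name is illustrative, and the n - 2 divisor for the standard error of estimate is the usual simple-regression convention, assumed here.

```python
import math
from statistics import mean

def regression_control_chart(x, y, k=2.0):
    """Fit y on x by least squares and classify points against CL +/- k*S_e."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    # Standard error of estimate: scatter of the observations about the line.
    s_e = math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2))
    statuses = []
    for yi, fi in zip(y, fitted):
        if yi < fi - k * s_e:
            statuses.append("below LCL")
        elif yi > fi + k * s_e:
            statuses.append("above UCL")
        else:
            statuses.append("in control")
    return intercept, slope, s_e, statuses

# Hypothetical cost and production values; the fifth unit under-produces.
cost = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
production = [3, 5, 7, 9, 5, 13, 15, 17, 19, 21]
intercept, slope, s_e, statuses = regression_control_chart(cost, production)
for c, p, s in zip(cost, production, statuses):
    print(c, p, s)
```

Unlike the conventional chart, the limits here move with the covariate, so a unit is judged against what its own input level predicts rather than against a single overall average.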

Measuring Progress
The difference between expected and actual crop production ($\hat{y}_i - y_i$) can be plotted against time to measure progress.
Table 1 contains longitudinal data with 37 cross-sections (states) and 10 time periods, giving 370 data points. The yearly average for each state is used to construct the regression control chart. The heat map displayed in Figure 3 shows that cost has a high correlation with production. The green colour indicates a very high value, while red indicates a very low value, so the gradation from green to red shows how the values fall from highest to lowest. A close look at the heat map shows that, compared with the other variables, the states with green cells for production also have green cells for cost, and the states with red cells for production also have red cells for cost.
Table 2 shows the summary statistics of the data collected for the analysis. One of these independent variables will be used to construct the regression control chart. The variable that contributes most to the variation in the dependent variable is selected. This can be determined from the multiple linear regression model.
Table 3 shows that the skewness of the dependent variable (crop production) is 0.466 and the kurtosis is -0.715, which suggests that the variable may be non-Gaussian. The histogram, QQ plot and boxplot point the same way (see also Figure 4). It is therefore necessary to subject the data to a confirmatory test; if normality is rejected, the Gaussian regression model cannot be relied upon and another member of the generalized regression family is appropriate.
The result of the confirmatory test in Table 4 shows that the Gaussian distribution adequately fits the data. This implies that the data are normally distributed, contrary to the exploratory data analysis, which earlier suggested that the data might not follow a Gaussian distribution. If the data were not Gaussian, then other non-Gaussian distributions such as Gamma, Weibull, Rayleigh, Exponential and so on would be used.
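Skewness and kurtosis of the kind used in the exploratory analysis can be computed from central moments. The sketch below uses the simple moment-based (population) definitions and reports excess kurtosis, so a normal sample would give values near zero for both. This choice of formula is an assumption: statistical packages often apply small-sample adjustments, so their output can differ slightly. The data are hypothetical.

```python
from statistics import mean

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Hypothetical right-skewed sample.
sample = [1.0, 1.0, 1.0, 2.0, 10.0]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness={skew:.3f} excess kurtosis={kurt:.3f}")
```

Values some distance from zero are only a hint of non-normality, which is why a formal goodness-of-fit test is still needed before choosing the underlying distribution.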

Figure 5. Relationship between Crop Production and the Independent Variables.
In the multiple linear regression model of equation (32), y is crop production, $x_1$ is area, $x_2$ is employment, $x_3$ is fertilizer and $x_4$ is cost. Table 5 shows that $x_4$, the cost of seed/seedlings, is the most significant independent variable. Thus, $x_4$ will be used to control the variability in crop production (y). See also the scatter plots in Figure 5 for a pictorial explanation.
Table 6 shows the least squares parameter estimates of the simple linear regression model. The table shows that both the intercept and the slope are significant. From Table 6, the simple linear regression model is given by
$$\hat{y} = 2045 + 0.5737\,x_4 \tag{33}$$
where 2045 is the intercept, meaning that the value of crop production when the cost of seed/seedlings is zero is 2,045 thousand tons, and 0.5737 is the slope of the regression model, implying that for each unit increase in the cost of seed/seedlings, crop production increases by 0.5737 thousand tons (573.7 tons). The analysis shows that 85.99% of the variation in crop production can be explained by the variation in the cost of seed/seedlings. Thus, there is a significant linear relationship between crop production and cost of seed/seedlings. Note that equation (32) cannot be used for the regression control chart because the chart is a two-dimensional plot, containing only a dependent variable on the vertical axis and an independent variable on the horizontal axis. So, equation (33) is appropriate.
Figure 6 shows that many points fall outside the limits of the conventional control chart, which implies that the inputs are not the same across states. Most of the states spent different amounts on seed/seedlings, which is not captured by the conventional control chart. Here $CL = \bar{y}$, $LCL = \bar{y} - 2Se_y$ and $UCL = \bar{y} + 2Se_y$, where $Se_y$ is the standard error of y.
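The fitted simple regression reported in the text, with intercept 2045 and slope 0.5737 (production in thousand tons), can be evaluated directly. The sketch below does this for a hypothetical cost value of 1000 units; the cost unit is whatever unit was used in fitting the model and is not assumed here, and the function name is illustrative.

```python
INTERCEPT = 2045.0  # thousand tons, as reported for the fitted model
SLOPE = 0.5737      # thousand tons per unit of seed/seedling cost, as reported

def expected_production(cost):
    """Expected crop production (thousand tons) on the fitted regression line."""
    return INTERCEPT + SLOPE * cost

baseline = expected_production(0.0)        # equals the intercept
at_1000 = expected_production(1000.0)      # hypothetical cost of 1000 units
per_unit = expected_production(1001.0) - at_1000
print(baseline, round(at_1000, 1), round(per_unit, 4))
```

A state whose observed production lies well below the value returned for its cost is a candidate for investigation once the control limits are drawn around this line.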

Regression Control Chart
To establish a regression control chart in this study, agricultural data were obtained from [13] through the two data collection infrastructures, the National Integrated Survey of Household (NISH) and the National Integrated Survey of Establishment (NISE), first collected at the time of the 2006 census. The data are displayed in Table 1. Since a two-dimensional plot involves only two variables, cost of seed/seedlings is selected from among the independent variables because of its contribution to, and relationship with, the dependent variable, crop production. The first step is to use the data in Table 1 to plot production against cost on a scatter diagram, which is shown in Figure 7. This scatter diagram is needed primarily to check the linearity of the relationship and to detect atypical points. It should be noted that crop production depends on many factors other than cost of seed/seedlings, such as area cultivated, fertilizer consumption, employment in crop farming and so on. Some of these factors vary and some are stable, but among the ones used in this study, cost of seed/seedlings has the most variability and explains the variation in production better than the other variables. The points that depart from the linear pattern are not easily detected on the scatter diagram. These points can be due to assignable causes of variation.
So, a good way to detect these points is through the regression control chart. It should be noted that each point is traceable to a state of the federation, so a defaulting state can easily be detected and controlled.
The following values were computed using the data in Table 1, considering only production and cost. With these values, the regression control chart is straightforward to construct. In this case, the centre line (CL) is the regression line in equation (33). The lower and upper control limits are CL-2S_e and CL+2S_e respectively. The narrower the limits, the higher the risk of false alarms; in this study, two-sigma limits are used. The control chart depicted in Figure 8 can be used in a variety of applications. Its first use is to maintain control over the performance of crop production in each state in Nigeria on a continuous basis. For instance, if the cost of seed/seedlings is 9 billion naira, what crop production would justify that cost for such a state of the federation? Is there a gain or a loss in productivity, and is this performance acceptable? If the crop produced at this cost falls outside the control limits, then the performance is not acceptable and can be attributed to assignable causes of variation; if the point is within the control limits, then it is an acceptable performance and such variation can be attributed to chance. When performance is not acceptable, it is the duty of the management or people in authority to decide whether or not to investigate the cause of the variation.
It should be noted that, in this study, points that fall below the lower control limit signify under-production, meaning that the cost of seed/seedlings was not justified. On the other hand, points that fall above the upper control limit signify very high performance. This high performance should also be investigated, for the following reasons. Firstly, other states can learn from such states how to achieve such high performance. Secondly, it could be regarded as over-production if the demand is lower than the supply, or if there are no good storage facilities or good markets for the export of the excess production. The regression control chart is shown in Figure 8.

Measuring Progress
The difference between the predicted $\hat{y}$ and the observed y offers a way to simplify the regression control chart and to obtain additional information from it. These differences can be plotted against time, just like the conventional control chart. In this case, runs and trends can easily be observed, which avoids analysing results from a cluttered scatter diagram. The cumulative production gain or loss can also easily be determined and tested for significance using a t-test, where $y$, $\hat{y}$, $x_4$ and $\bar{x}_4$ are the values of the observed production for points in control, and their corresponding predicted y, cost, and average cost respectively. Table 7 contains 36 data points because 1 data point is below the lower control limit (LCL). This point is FCT-Abuja, and it is deleted from the table.
The formula for the t-test above can be approximated by the one in equation (31) if m and n are close. The results obtained from the two formulas are equal when rounded to 4 decimal places. Thus, equation (31) is a good approximation for the t-test. The calculated value of t can be compared with 2 or with the critical value of t from tables at n-2 degrees of freedom. Alternatively, a p-value can be obtained in R. The call dt(-0.2418, 35) returns 0.38438, which is the density of the t distribution at the test statistic; the two-sided p-value, 2 * pt(-0.2418, 35), is approximately 0.81. Since the p-value is greater than the level of significance (α = 0.05), we cannot reject the null hypothesis, and we conclude that the cumulative net gain is not statistically different from zero. This shows that there is a gain, because actual productivity is greater than expected productivity, but the gain is not significant and could be due to chance. Had the t-test been significant, a new control chart would have been drawn based on the current year's data, showing the new performance level.
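The distinction between the density of the t distribution (R's `dt`) and its tail probability (R's `pt`) matters when reading the test result. The sketch below evaluates both for the reported statistic t = -0.2418 with 35 degrees of freedom, using only the Python standard library. The function names and the use of Simpson's rule for the integral are choices made for this illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution (the quantity R's dt returns)."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t) via Simpson's rule on [0, |t|] and symmetry about zero."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

t_stat, df = -0.2418, 35
density = t_pdf(t_stat, df)
two_sided_p = 2 * t_cdf(-abs(t_stat), df)
print(f"density={density:.5f} two-sided p={two_sided_p:.3f}")
```

Both numbers are far above 0.05 here, so the decision not to reject the null hypothesis is unaffected, but only the tail probability is a p-value in the usual sense.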

Concluding Remarks
The generalized regression control chart is a combination of the generalized regression model and the control chart. The regression line is the central line, and the approach is applicable to linear and non-linear models as well as to generalized regression models, depending on the shape of the data under consideration. The crop production data used in this work appeared to be non-Gaussian from the histogram and boxplot, but the confirmatory test showed that a Gaussian distribution fits well, so it was not necessary to consider other distributions.
Based on the result of the analysis, we conclude that there is a significant relationship between crop production and the independent variable (cost of seed/seedlings). The result shows that, among the four independent variables, cost of seed/seedlings is the most significant. The regression line was fitted and the regression control chart was constructed using the regression line as the central line (CL) and CL±2Se as the control limits.
The process monitored by the regression control chart is out of control as a result of one point just below the lower control limit. This point is FCT, Abuja. This shows that crop production in FCT, Abuja does not measure up to the cost incurred on seed/seedlings. To adjust the control chart for subsequent monitoring of crop production, this out-of-control point (FCT-Abuja) was deleted from the table, since it is due to an assignable cause of variation. The cumulative gain or loss table developed can be used to determine whether the regression line and control chart limits need revision. The model can take in the data for each production year and raise an alarm at every deviation (variation) in production at the end of that year.
Major stakeholders and policy makers should make a conscious effort to use the available statistical models to monitor expected crop production in Nigeria. Cost of seed/seedlings is a very important factor to consider when measuring the level of crop production at any point.