Estimating Total Energy Demand from Incomplete Data Using Non-parametric Analysis

The validity and usefulness of empirical data requires that the data analyst ascertains the cleanliness of the collected data before any statistical analysis commence. In this study, petroleum demand data for a period of 24 hours was collected from 1515 households in 10 clusters. The primary sampling units were stratified into three economic classes of which 50% were drawn from low class, 28% from medium class and 22% from high class. 63.6% of the questionnaires were completed whereas incomplete data was computed using multivariate imputation by chained equation with the aid of auxiliary information from past survey. The proportion of missing data and its pattern was ascertained. The study assumed that missing data was at random. Nonparametric methods namely Nadaraya Watson, Local Polynomial and a design estimator Horvitz Thompson were fitted to aid in the estimation of the total demand for petroleum which has no close substitute. The performance of the three estimators were compared and the study found that the Local Polynomial approach appeared to be more efficient and competitive with low bias. Local polynomial estimator took care of the boundary bias better as compared to Nadaraya Watson and Horvitz Thompson estimators. The results were used to estimate the long time gaps in petroleum demand in Nairobi county, Kenya.


Introduction
Data is a long lived asset used in many unforeseen ways including data mining, mass customization and optimization. Data need to be of high quality so that decisions can be made on the basis of its reliability and validity. Quality data is an accurate representation of the part of the "real world" that it models and should be fit for the purpose of which it is designed for [1]. This entails that data ready for analysis should be accurate, precise, complete, current, non-redundant, portable and credible. Poor quality data is costly to diagnose and repair with most costs being hidden and hard to quantify. More specifically, incomplete data leads to extra resources in correcting, dealing with reported errors and reworking. The consequences are adverse across disciplines from loss of revenue through customer dissatisfaction in business, lowered employee morale in human resource among others. It is thus imperative that to improve on the quality of data, appropriate tools and techniques need to be developed [2]. Most data quality problems emanate from poorly defined concepts, incomplete questionnaires, inaccurate data entry and checking processes [3]. To mitigate these problems, a robust data quality plan is eminent and should include every definition of a record, its timeliness, completeness, accuracy and how it will be monitored. With complete data statistical inference can provide a transitional framework to generate insight to inform decision. Thus identifying any data gaps before other factors are considered is imperative.
In the energy sector, to facilitate the installation of different energy storage mechanisms including pump dispensers, pipe work at service stations, consumer installations and highly inflammable gas requiring huge safety and environmental implications data driven informed decisions help to avoid calamities as evidenced with oil spill over [4]. In most developing countries, analysis of energy demand has concentrated on innovative ways of switching from the use of non-renewable energy to renewable energy like wind substituting diesel as a source of power for pumping, converting diesel powered process machinery to electrically powered machines, briquetting agricultural residuals to replace kerosene as cooking fuel in urban areas, alcohol for transport vehicles to replace diesel is another thought among others. Thus with discovery of oil in some developing countries and depletion of oil in others, modeling energy demand has become essential due to its adverse consequences in terms of environmental and its overall cost implication [5].

Incomplete Data due to Missing Values
Missing values can either be omitted at random (MAR) or omitted not at random (MNAR). Omitted at Random implies that the propensity for a data point to be missing is not related to the omitted data, but it is related to some of the observed data. This has nothing to do with the missing values but has to do with the values of some other variable. On the other hand, for MNAR, the probability of a value missing varies for reasons that may be unknown. To distinguish between MNAR and MAR it is imperative to assess the nature of data and its quality in terms of completeness, the use of descriptive statistics such as frequency polygons as well as the scatter plots and more advanced methods including the use of estimators such as Nadaraya-Watson, Horvitz Thompson and Local Polynomial are some of the most promising methods of solving data gaps [6]. For empirical data analysis, where the samples are identical and independently distributed, non-parametric estimation procedure with a bandwidth parameter can be used as a modeling technique for the missing values. The starting point is identifying a sample of paired observations , , = 1,2, … , from a population , of size N. This enables one to find an estimator for = of a missing population [6].

Multiple Imputation for Missing Data
In survey, missing values as a result of non-response can be addressed through multiple imputation (MI) methods. In singular imputation methods, a measure of central tendency is used to input the missing values. Multiple imputation narrows uncertainty about missing values by creating several different imputation options of which several forms of the same data set are created, which are then combined to make the most optimal values. When used correctly MI can reduce bias, improve validity, increase precision and the results are robust and resistant to outliers [7]. Missing data point can be estimated by the average values of the parameter estimate obtained as point estimate based on the standard errors. This involve combining prior information about a parameter of interest with new evidence from a sample or through resampling with an appropriate probability distribution [7]. A naive or unprincipled imputation method may create more problems than it solves, distorting estimates, standard errors and hypothesis tests [8].

The Chained Equation Approach to Multiple Imputation
Multivariate imputation by chained equations (MICE) operates under the assumption that missing data is at Random (MAR), which means that the probability that a value is missing depends only on the observed values [9]. Many of the initially developed multiple imputation procedures assume a large joint model for all the variables, such as a joint normal distribution of which for large datasets, with hundreds of variables of varying types, this is rarely appropriate [7]. Multivariate imputation by chained equations (MICE) is a flexible approach in which a series of regression models are run whereby each variable with missing data is modeled conditional upon the other variables in the data. This means that each variable can be modeled according to its distribution whereby binary variables are modeled using logistic regression, continuous variables modeled using linear regression, multinomial logit model for categorical variables and a Poisson model for count variables. In MICE procedure, all variables that are to be used in subsequent analyses, whether or not they have missing data may be predictive of the missing values and are included in the imputation process.
One key point is to include the variables that are likely to satisfy the MAR assumption. Beyond that, the specific issues that often come up when selecting variables include: creating an imputation model that is more general than the analysis model; imputing variables at the item level vis a vis the summary level and imputing variables that reflect raw scores vis a vis standardized scores.

Nonparametric Analysis
Non-parametric methods estimate the distance between a point and its neighbors and the estimation depends heavily on the bandwidth and its span. These methods consider correlation between available auxiliary data-set and the missing response variable as missing at random. The general nonparametric regression models are either of fixed or of random design, such that if data points , , , . … , have been collected, the relationship is modeled as Where, is the predictor variable also known as the regressor, is the response variable and is the unknown regression function with observation error of which = 0 and = . The fixed design model is concerned with controlled non-stochastic repressor variable, this implies that the regressors are controlled by the researcher and are simply assumed to be measured without error. In fixed design, for any given observation , , ∈ ℜ " and is an independent random variable with = . Random design models are used in observational studies and are common in non-experimental science. The observed predictor variables are independent and identically distributed # random variables such that Where * , is the joint density of , and * = * + is the marginal probability function of . The approximation of the function is through smoothing method of the form , = ∑ . / / 0 , where . / are the weights. The common smoothing methods include the kernel, the nearest neighbor, the orthogonal series and the spline method. This study concentrated on kernel techniques which are linear estimators in that the value of the estimator at any point is the weighted sum of the responses. The weight function is defined as: With the function 1 supported on 8−1,19 that has a maximum at zero. The shape of the weights satisfies the moment conditions Where K is symmetric about zero. The kernel with finite support can be rescaled to have support on 8−1,19. Kernels, which have infinite support on the entire line, result to estimators with global bias difficulties. The smoothing parameter ℎ determines the size of the weights. Small ℎ leads to wigglier (rougher) estimators while larger ℎ leads to a more averaging (horizontal) estimator.

Nadaraya and Watson Estimator
Assume the observation of some variable have been taken times for some utility at times A ⋯ , A . Let , be decomposed into two parts, • the regression function which represents the true underlying change curve following the economic and physical potential and D the errors which may not depend on time. These errors not only stand for observational error but also for economic random variation due to seasonal and other exogenous factors. We assume that 0 ≤ A ≤ A ≤ ⋯ ≤ A ≤ 1 where A , A , … , A is the explanatory variable analogous to , , … , for ease of notation. A kernel estimator , A F for A F can be written as: where . , are the weights given by The weights do not depend on M N and therefore , A F is a linear estimator which can be expressed as a minimiser of the locally weighted least squares Where the squared part of the right hand side of (4) represents the polynomial part and the other part represents the local constant. While the sum ranges from 1 to , only those lying in the interval A F − ℎ, A F + ℎ contribute to , A F . This leads to For simplicity, we define (5)  and * A = % * A, # W 5W

Local Polynomial Estimator
The idea of local polynomial estimators has been widely studied by [10], [11] and [12]. Suppose that Locally the regression function m can be approximated by For z in a neighborhood of x. By using Taylors expansion, m (z) can be modeled locally by a simple polynomial model. This suggests using a locally weighted polynomial regression of the form Where K (.) denotes the kernel function and h is a bandwidth as presented in (3). Denote by P U / ^= 0, … , _ the minimizer of the equation (7). The above exposition suggests that the estimator for Where the whole curve ,` . is obtained by computing the local polynomial regression with x varying in an appropriate estimation domain. With p=1, the estimator , F is termed a local linear regression smoother or a local linear fit. This estimator can be explicitly expressed as Local polynomial fitting is an attractive method both from the theoretical and practical point of view [11]. The method adapts to various types of designs including random, fixed design, highly clustered and nearly uniform design. There is an absence of boundary effect, the bias at the boundary stays automatically of the same order as in the interior, without use of specific boundary kernels. The asymptotic minimax efficiency is100% among all linear estimators and only a small loss has to be tolerated beyond this class [11].

The Horvitz-thompson Estimator of Total
Given a finite population of N individuals, and we are interested in some trait that they have. Let X g denote the value of the trait for individual j We don't get to see all these X g ij we only sample n < N of them. With this sample of n individuals, we may be interested in obtaining an estimate of the total T = ∑ X g m g0 or the mean If the probability of individual j being included in the sample is π g the Horvitz-Thompson estimator of the total is given as: This estimator gives each observation a weight which is the inverse of its probability of inclusion and under unequal probability sampling, the Horvitz-Thompson estimator (HT estimator) is an unbiased estimator of the population total [13]. This estimator can also be defined as: (12) where w j is the design weight of the jth element. The HT estimator of the population mean can be expressed as Where N q = ∑ w g g∈j is the estimated population size. With the same functional form, we can rescale the weights so that ∑ w g g∈j = n and rewrite (13) as y | } rs = 1 n G w ‚ g y g g∈j The design based Horvitz -Thompson estimator of population total will thus be given by

Model Based Estimator of Population Total
The potential disadvantage of estimators motivated by super population model is inefficiency under model misspecifications [14]. This could be avoided by replacing the parametric specification by a non-parametric specification in which . is smooth function of j x and (.) V is smooth and strictly positive [4]. To enhance efficiency, the model based estimator of population total using Nadaraya Watson is given by Â‰ a = ∑ Š ∉OE + ∑ / ∈OE (15) Where Š ' is as defined in (5). The model based estimator of total using local polynomial is given by is defined in (9). The model based estimators are highly efficient when Š and =K / L are correctly specified but biased and even inconsistent if the model is wrong. In surveys involving clusters with equal number of second stage units, equal probability of selection is used [13]. In many surveys however, the second stage samples are not equal. This research presents a case where the Secondary Sampling Units, SSU are not necessarily equal.

Smoothing Parameter Selection
In nonparametric kernel estimation, the smoothing parameter effectively controls the model complexity. When ℎ → ∞, local modeling becomes a global modeling, when ℎ = 0 the estimate essentially interpolates the data and the modeling bias will be small. Since the bias is proportional to ℎ and the variance proportional to 2 , the bandwidth has to be taken neither too large nor too small so as not to increase the bias and variance of the estimates [5]. The problem can be solved theoretically by choosing a bandwidth that balances the trade-off between the bias and the variance components, since the consistency of the estimator is based on the sum of the bias and variance.
The positive value ℎ that minimizes any of the selection criteria namely AIC, BIC and ••' ' is selected as an optimal smoothing parameter. In this study, the smoothing parameter adopted was the Improved Akaike Information Criterion ••' ' , derived from the classical ••' for linear models under the likelihood setting: −2 X # "" " • " ℎ""# + 2 ; -"* - where, • is chosen particularly to be the form of the bias corrected version of the ••' . A • is the trace of the smoothing matrix regarded as the nonparametric version of degrees of freedom, called the effective number of parameters [5]. When KA • L = −2"" 1 − A 2 ⁄ , then (16) becomes the generalized cross validation criterion.

Results and Discussion
In this study, the population of interest comprised of all the households within Nairobi County, Kenya, which was projected at 1,551,06. The sampling frame was constructed using the Kenya Population Census of 2009 which had 10,323 enumeration areas (EAs). Petroleum demand data in kilogram of oil equivalent (koe) for a period of 24 hours was collected from 1515 households which came from 10 clusters. A total of 963 households comprising of 63.6 per cent had complete interviews, 43 households comprising of 3.3 percent gave partial answers while 457 households (30.2 per cent) of the households could not be accessed. The other data source of auxiliary population characteristics from the clusters in question were obtained from the records of previous survey [18][19][20].
The total cost of the survey was computed as, Where is the primary sampling units andthe additional households [15]. With their unit costs being C 1 and C 2 respectively. Thus the average number of households in each primary sampling unit was computed as: P is the proportion obtained from a pilot survey conducted before the main survey [15].
In this study, C 1 was estimated to be KES 50,000 and C 2 was 2KES. The value of P was given by 84/100, the proportion of persons who used petroleum products during the pilot survey. Thus, -¢Qš was found to be 70 households in each Primary sampling unit (PSU). The number of PSUs used was computed to be 9.97 approximately 10 enumeration areas. Thus a total of (70X9.97) household's interviews were needed to qualify for this survey. To minimize non-response, the sample was inflated by 10% leading to a total sample size at 757 households in 10 PSUs. Random numbers were used to pick 288 clusters from a sampling frame obtained from the National Sample Survey and Evaluation Programme (NASSEP IV) consisting of 10,323 clusters in Nairobi County. Primary sampling units were stratified into low class representing 50 percent, middle economic class representing 28.8 percent and high economic class representing 22 percent. This was done using Probability Proportion Sampling (PPS scheme) taking care of the clustering effect design, see Table  1. In this study, incomplete data was found to be due to non-response as a result of language barrier, failure to identify sampled respondent and loss of completed questionnaires by some enumerators. An evaluation of proportion of population with missing data and its pattern provided a diagnosing mechanism for a reasonable method of imputing missing values. Figure 1 depicts the proportion of population missing data and its missing pattern. Age and income of household head had the most missing values with 33 percent of the expected respondents not giving complete information. This implied that we would lose 67 percent of all the petroleum demand information collected within the 1515 households if missed data was not corrected through an imputation process. Missing information was imputed using auxiliary information based on the head of the households' age, income, education status and the household population. This was done using the multivariate imputation by chained equations (MICE) under the assumption that missing data was at random. The observed values and the imputed values for the 10 clusters are presented in Table 2. In estimating the total demand in the population, the model based estimator (15) which uses Nadaraya-Watson, the model based estimator (16) which uses local polynomial, and the HT estimator equation (14) were compared. The bandwidth obtained from AICC was 6.6.
For the Nadaraya-Watson Estimator, the bias concentrated between 165koe and 428koe. This estimate showed sparseness of data between the clusters with population of 450 to 550 persons, which was not also equally spaced.   The mean square error MSE seems to be in clusters lying between 400 and 550 persons with more than 99% of the demand ranging between 365 and 625. Figure 4 shows the boundary effect of the Nadaraya-Watson estimator with some inconsistency at the boundary and not clearly showing the ten clusters. Overall, Nadaraya-Watson indicated an estimated total demand of 3,061,485.90koe. For local polynomial model based estimator, the mean squared error reported was 0.10417 koe with the estimate ranging between 101.03 to 940.28 koe. The bias of local polynomial estimation was between -0.00004 to 0.00005 koe with the bulk of its bias lying around the zero mark. The boundary bias captured all the ten clusters unlike the Nadaraya-Watson estimator as shown in Figure 5. The Local polynomial estimator was consistent at the boundary with an estimated total demand of 3,061,485.90koe.

Figure 5. Local Polynomial boundary bias.
For the Horvitz Thompson estimator, the average bias ranged -0.02734 to 0.06427 as indicated in Figure 6 while Figure 7 illustrates the boundary bias of Horvitz Thompson Estimator Horvitz Thompson estimator. Of the three estimators, Local polynomial had the lowest mean squared error as compared to Horvitz Thompson Estimator which reported the highest MSE followed by Nadaraya-Watson respectively. Horvitz Thompson estimator bias ranged from -0.02734 to 0.06427koe. This information should come in section. Table 3 reports the total estimated demand for the three classes namely low class, middle class and high class as estimated by the Local Polynomial estimator.

Conclusion
From this study, Local Polynomial Estimator performed better with less bias at the boundary as compared to the Nadaraya-Watson and the Horvitz Thompson Estimators. This study found that the bulk of petroleum demand in the county (49.85%) was consumed by the low class while the high class consumed less (22.09%). The remaining demand of 28.06% was attributed to the middle class.
The bias of local polynomial estimation was between -0.00004 to 0.00005 koe with the bulk of its bias lying around the zero mark. The boundary bias captured all the ten clusters unlike the Nadaraya-Watson estimator. The Local polynomial estimator was consistent at the boundary with an estimated total demand of 3,061,485.90koe.