Assessment and Selection of Competing Models for Count Data: An Application to Early Childhood Caries

Abstract: Count data arise in a wide range of disciplines. Poisson, negative binomial (NB), zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) are some of the regression models proposed for data with a count response. All of these models are potential candidates for modeling count data, but there is no established rule for choosing the one that will perform better than the others. This study aimed to assess these four count models at various degrees of zero inflation. Datasets were simulated from a ZIP distribution under different levels of zero inflation (0%, 2%, 5%, 10%, 15%, 20%, 30% and 40%). Poisson and NB were observed to estimate regression coefficients well when the proportion of zeros was below 15%. The two zero-inflated models (ZIM) performed well at higher degrees of zero inflation: beyond 15% for ZIP and 20% for ZINB. Exploratory examination of the caries data revealed zero inflation of 3.23%, below the 15% threshold. Early childhood caries (ECC) data from 3-6 year old children who visited Lady Northey Dental Clinic were therefore analyzed with Poisson and NB. The Akaike information criterion (AIC) was used to compare the competing models, both under simulation and with real data. Poisson yielded lower AIC values than the other three models at lower zero-inflation rates. ZIP had the lowest AIC value at the 15%, 20%, 30% and 40% levels of zero inflation. The NB model had the lowest AIC value when the real data were analyzed. Education level of the father (primary school completed), chewing gum several times a week, eating jam several times a day, drinking juice every day, drinking soda every day and eating sweets several times a week were found to be significant factors associated with ECC.


Introduction
Count regression models have been employed over time to model count data and have found wide application in the real world [6]. The index dmft, denoting the number of decayed (d), missing (m) or filled (f) teeth (t) due to dental caries, is used to record the presence of cavities among children with primary dentition. It is an event count, that is, the number of times an event occurs, for instance the number of teeth affected by cavities for each subject. An event count is a random variable that is nonnegative and discrete. Count regression models include the Poisson, NB, ZIP and ZINB models, among others [1]. An assumption of the Poisson distribution is that the mean and variance are equal. When the variance exceeds the mean (overdispersion), this assumption is violated, which motivates models such as the NB that allow for Poisson heterogeneity [4] [7] [9].
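The equidispersion assumption and its relaxation can be stated compactly. In the sketch below, $\mu$ is the conditional mean and $\theta$ is a dispersion parameter. The quadratic-variance (NB2) form is an assumption made here because it is the parameterization used by glm.nb() in MASS; the text itself does not state one.

```latex
\text{Poisson: } \mathrm{E}(Y) = \mathrm{Var}(Y) = \mu,
\qquad
\text{NB: } \mathrm{E}(Y) = \mu, \quad \mathrm{Var}(Y) = \mu + \frac{\mu^{2}}{\theta}.
```

As $\theta \to \infty$ the NB variance approaches the Poisson variance, so the Poisson model can be viewed as a limiting case of the NB model.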
Zero-inflated data are also common in dental caries research. In such situations, some subjects show no caries by chance while others may never experience dental caries [5]. In zero-inflated models (ZIM), which are two-part in nature, zeros arise from two population groups: subjects who never exhibit the study characteristic and therefore generate structural zeros, and subjects who yield sampling zeros with some probability during the study [17] [19]. ZIM allow us to model both presence and abundance simultaneously. Logistic regression models the first part, while count regression models such as the Poisson and NB model the second part [10] [13]. Artificial data have been produced on several occasions to mirror important features of data expected in the real world [2]. Simulations were therefore done to inform practitioners about what level of excess zeros warrants the use of more sophisticated models such as ZIP and ZINB. One merit of a simulation study is the ability to generate many datasets within seconds in order to evaluate the stability of decisions [11], which may not be feasible with real respondents. Medical modeling and simulation is known to assist in several areas of the medical profession, such as disease modeling, training and treatment [12] [18]. Dental caries, being a health problem, has posed challenges to dental practitioners and administrators trying to model real data and forecast future trends of caries.
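The two sources of zeros can be made explicit for the ZIP case. The following is the standard ZIP probability mass function, written as a sketch in notation assumed here: $p$ is the probability of a structural zero and $\mu$ is the mean of the Poisson count component.

```latex
P(Y = y) =
\begin{cases}
p + (1 - p)\, e^{-\mu}, & y = 0, \\[4pt]
(1 - p)\, \dfrac{e^{-\mu} \mu^{y}}{y!}, & y = 1, 2, \dots
\end{cases}
```

Under this form $\mathrm{E}(Y) = (1 - p)\mu$ and $\mathrm{Var}(Y) = (1 - p)\mu(1 + p\mu)$, so any $p > 0$ induces overdispersion relative to the Poisson model. Setting $p = 0$ recovers the ordinary Poisson distribution.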
The dmft counts are in most cases characterized by inflated zeros, making modeling of such data more complex. This property violates the assumptions of the classical Poisson model, which is the simplest and most popular count regression model. Although several models capable of handling count data are available, the advantages of one over its competitors have not been conclusively discussed in the existing literature [8] [14] [15] [16]. Assessment of these competing models and their traits still calls for research. The main objective of this article is to assess the performance of potential competing count models under various levels of zero inflation. It focuses on comparing the Poisson, NB, ZIP and ZINB models in modeling count data. The choice of suitable model(s) for the analysis of the real data at hand is discussed. The validity of one or more models under simulation guides their application to the caries data in order to determine the main risk factors for caries among 3-6 year old children attending Lady Northey Dental Clinic.
One purpose of modeling count data is to enable prediction of the effects that changes have on a system. Modeling makes inference possible, unlike a study limited to exploratory analysis. Several studies have exhibited overdispersion and a preponderance of zeros [17]. The real data considered in this study have a count response variable, the dmft count, which requires choosing an appropriate count regression model based on the degree of excess zeros as well as overdispersion. Dental practitioners require more information about the best count regression model to employ for this and future case studies in order to plan for treatment.
Simulation modeling plays a key role in validating models for prediction [14]. Comparison of model outputs under specified input conditions is achieved through simulation analysis. This guards against a model failing to meet specifications and avoids over- or under-utilization of regression models. Simulation provides a better understanding of a system similar to the caries study at hand by developing mathematical models and observing how they operate under different levels of zero inflation. In addition to simulation, a goodness-of-fit statistic such as AIC is necessary, because it is not easy to determine the appropriateness of ZIP and ZINB: the structural zeros they account for are latent and cannot be observed directly.
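To make the role of AIC concrete, the following is a minimal Python sketch (the study itself used R) of how an AIC value is obtained from a maximized log-likelihood and how competing models are ranked by it. The function names are hypothetical helpers introduced for illustration, and the AIC values in the usage note are made up, not results from this study.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y given fitted means mu (same length)."""
    return sum(
        yi * math.log(mi) - mi - math.lgamma(yi + 1)
        for yi, mi in zip(y, mu)
    )


def aic(loglik, n_params):
    """AIC = 2k - 2 log L, where k is the number of estimated parameters."""
    return 2 * n_params - 2 * loglik


def best_by_aic(aic_by_model):
    """Return the name of the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)
```

For example, `best_by_aic({"Poisson": 120.0, "NB": 118.5})` returns `"NB"`. The penalty term 2k means that NB, ZIP and ZINB, which estimate more parameters than Poisson, are preferred only when the gain in log-likelihood outweighs the extra parameters.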
Factors contributing to ECC should be identified among young children in order to equip trainees and dental specialists with better skills for solving problems and making decisions. This study will be beneficial in developing new treatments and preventing the progression of dental caries. Dental clinics can also benefit, as the information derived from it can be used to investigate the best way to treat caries patients without compromising patient expectations.

Methods
Significant developments in count models have taken place in demography, actuarial science and biostatistics. These models share the features of generalized linear models. The main interest is to investigate the role of the regressors, which is achieved by regression modeling of the event count. The response variable, dmft, is restricted to be a nonnegative integer whose conditional mean is linked to a vector of regressors through the log link. In this chapter, both simulated and real data have been used for regression.
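The log link mentioned above can be written out as follows. This is a sketch in standard generalized linear model notation; the symbols $\mathbf{x}_i$ (covariate vector of subject $i$) and $\boldsymbol{\beta}$ (coefficient vector) are assumed here for illustration.

```latex
\log \mathrm{E}(Y_i \mid \mathbf{x}_i) = \mathbf{x}_i^{\top} \boldsymbol{\beta}
\quad \Longleftrightarrow \quad
\mu_i = \exp\!\left( \mathbf{x}_i^{\top} \boldsymbol{\beta} \right).
```

Because the mean is an exponential function of the linear predictor, a one-unit increase in covariate $x_j$ multiplies the expected count by $e^{\beta_j}$, holding the other covariates fixed.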

Simulation
Simulation has been defined as "the process of creating and experimenting with a computerized mathematical model of a physical system" [15]. Simulations enable researchers to check the performance of a statistical procedure on ideal data. In this study, simulations were used to generate datasets with pre-specified properties, and the parameter estimates resulting from regression were compared with the specified parameters. Several methods may exist for analyzing count data, and the suitability of such methods can be determined using simulations [16].
Two classical count regression models, together with ZIP and ZINB and their estimation technique, have been discussed so far. Data were simulated from a ZIP distribution. A count response variable Y and two different types of covariates were simulated. The experiment was repeated 500 times, each time on a sample of size n = 1000 with two explanatory variables, age and sex, in the count component. The structural zero component assumed simple inflation, that is, it contained no regressors.
Age is a continuous variable and was assumed to follow a normal distribution with mean 5 and standard deviation 0.7, that is, $\text{age} \sim N(5,\, 0.7^{2})$. It was simulated as age = rnorm(1000, 5, 0.7). Sex is a binary categorical variable, and the function rbinom was used to generate it with a single trial per subject (n = 1) and success probability p = 0.4.
The underlying interest was to see how the four count models discussed earlier perform for different proportions of zeros. Y was generated from a Poisson distribution with different percentages of excess zeros. These values include low proportions of zeros so that the merit of all four models can be assessed, and so that the point at which ZIM should be employed can be determined. The proportions of zeros considered in this study are 0%, 2%, 5%, 10%, 15%, 20%, 30% and 40%. R software was used for the analysis; the Poisson and NB models were fitted using the glm() function in the stats package and the glm.nb() function in the MASS package, respectively.
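The data-generating process described above can be sketched in a few lines. The study itself used R (rnorm, rbinom); the version below is an illustrative Python translation using only the standard library. The function names and the coefficient values passed in `beta` are hypothetical, since the true coefficient values used in the study are not stated in the text.

```python
import math
import random


def rpois(mu, rng):
    """Draw one Poisson(mu) variate by Knuth's method (adequate for small mu)."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_zip(n, p_zero, beta, rng):
    """Simulate n rows (age, sex, y) with y from a ZIP distribution.

    p_zero : probability of a structural zero (no regressors, simple inflation)
    beta   : (intercept, age coefficient, sex coefficient), hypothetical values
    """
    rows = []
    for _ in range(n):
        age = rng.gauss(5.0, 0.7)             # age ~ N(5, 0.7^2)
        sex = 1 if rng.random() < 0.4 else 0  # Bernoulli(0.4)
        mu = math.exp(beta[0] + beta[1] * age + beta[2] * sex)  # log link
        y = 0 if rng.random() < p_zero else rpois(mu, rng)
        rows.append((age, sex, y))
    return rows
```

A single replicate at 30% zero inflation would then be `simulate_zip(1000, 0.30, (0.5, 0.2, 0.3), random.Random(1))`; repeating the call 500 times for each zero-inflation level mirrors the design described above. Note that the observed share of zeros slightly exceeds `p_zero`, because the Poisson component also produces sampling zeros.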
To validate the simulation process, performance measures such as bias and mean square error (MSE) were used to compare the simulation results with the prespecified parameters, as described by Beaujean [3].
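These performance measures can be computed from the replicate estimates of a single coefficient as in the following Python sketch (the study used R). The function name and the numbers in the usage note are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def bias_mse_rmse(estimates, true_value):
    """Return (bias, MSE, RMSE) of replicate estimates against the true value."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - true_value                        # average error
    mse = sum((e - true_value) ** 2 for e in estimates) / n  # mean squared error
    return bias, mse, math.sqrt(mse)
```

For instance, with the made-up estimates `[0.18, 0.22, 0.20, 0.24]` and a true value of 0.20, the bias is 0.01 and the MSE is 0.0006. A positive bias indicates overestimation of the coefficient and a negative bias indicates underestimation, while the RMSE combines bias and sampling variability in a single figure on the scale of the coefficient.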

Data
The data used in this study were collected at Lady Northey Dental Clinic, situated along State House Avenue in Westlands constituency, Nairobi City County. The sampling frame consisted of patients between three and six years of age whose parents or guardians accompanied them and agreed to be interviewed. The data were collected from September to November 2014. Only the 83 observations with complete values for every variable were used.

Study Variables
The outcome variable is the number of decayed (d), missing (m) or filled (f) teeth (t) due to dental caries. Predictor variables included age, gender, highest education level of the father, highest education level of the mother, employment status of the father, feeding habits (biscuits, gum, jam, juice, soda, sweets and tea with sugar), brushing frequency and use of fluoridated toothpaste.

Results and Discussion
Regression coefficient estimates from the Poisson, NB, ZIP and ZINB models fitted to the simulated data were recorded alongside their respective bias and root mean square error (RMSE), as shown in Table 1. AIC values from all the fitted models are recorded in Table 2. The study aimed to evaluate the four count regression models as the proportion of zeros in the data was varied. In what follows, $\beta_1$ denotes the regression coefficient of age (x1) and $\beta_2$ that of sex.
Table 1 shows that Poisson estimated both regression coefficients relatively well at 0%, 2% and 20% zeros, and $\beta_1$ also at 15% and 40%. The model underestimated $\beta_2$ when p was fixed at 0.1 and $\beta_1$ when p was fixed at 0.3. It overestimated $\beta_1$ at 5% and 10% and $\beta_2$ at 5%, 15%, 30% and 40%.
The NB model underestimated $\beta_1$ at 30% zeros, while overestimating the same coefficient at 5% and 40% and $\beta_2$ at 5%, 15%, 30% and 40%. However, NB performed well in estimating both $\beta_1$ and $\beta_2$ at 0%, 2%, 10% and 20%, as well as $\beta_1$ at 15%.
The ZIP model estimated $\beta_1$ approximately well at 2%, 10%, 15%, 20% and 40% and $\beta_2$ at 10% and 20%, while overestimating $\beta_1$ at 0% and 5% and $\beta_2$ at 0%, 5%, 15%, 30% and 40%. Only $\beta_1$ was underestimated by the ZIP model, when p was set at 0.3.
ZINB approximated $\beta_1$ well at 0%, 2%, 10%, 15% and 20% zeros and $\beta_2$ at 0%, 2%, 10% and 20%. It underestimated $\beta_1$ at 30% and overestimated it at 5%, and it overestimated $\beta_2$ at 5%, 15%, 30% and 40%.
The Poisson model yielded the lowest AIC value at 0%, 2%, 5% and 10% zeros, as can be seen in Table 2. ZIP outperformed the other three models at 15%, 20%, 30% and 40% zeros with the lowest AIC values, followed by ZINB.
The zero-inflation proportions 0, 0.1 and 0.4 were estimated approximately well by both the ZIP and ZINB models. Both models overestimated p at 2% and 5% and underestimated it at 15%. The 20% and 30% proportions were estimated more accurately by ZIP than by ZINB, as can be seen in Table 3. For ZIP, the RMSE increased with p up to p = 15%; for ZINB, it increased with p up to p = 20%.
The proportion of zeros in the caries data was only 3.23%. In other words, only four children had a dmft count of zero. This proportion is below the 15% threshold that the simulation results recommend for application of ZIM. The NB model had the lowest AIC value and hence fitted the caries data better than Poisson. The following covariate levels were significant at the 5% level under NB regression: education level of the father (primary school completed), chewing gum several times a week, eating jam several times a day, drinking juice every day, drinking soda every day and eating sweets several times a week. These explanatory variables are marked with * against their p-values in Table 4.

Conclusion
The ZIM can be employed when the proportion of zeros p exceeds 15%; otherwise, the two classical count regression models apply. The NB model fitted the caries data well.
The following covariate levels are the main risk factors associated with caries among children attending Lady Northey Dental Clinic: education level of the father (primary school completed), chewing gum several times a week, eating jam several times a day, drinking juice every day, drinking soda every day and eating sweets several times a week.
For the real data, the simulation study ruled out the more complicated ZIM and favored the Poisson and NB models. Classical count regression models should therefore not be overlooked, as datasets have distinct properties. The ZIM should be employed when the percentage of zeros in the data is high, at zero inflation of 15% and above. This is evidenced in the simulation study, where the ZIP and ZINB models gave lower AIC values at the 15% level of zero inflation and beyond. NB should be used to fit the caries data at hand, as it gave a lower AIC value than the Poisson model. The main risk factors identified from regression with the caries data should be considered in planning for prevention and treatment of caries among the children attending Lady Northey Dental Clinic. Further study should be done to determine the effect of each category of the respective factors responsible for dental caries.
Tu, X. (2012). Modeling count outcomes from HIV risk reduction interventions: a comparison of competing statistical models for count responses. AIDS Research and Treatment, 2012.