Bayesian Finite Mixture Negative Binomial Model for Over-dispersed Count Data with Application to DMFT Index Data

DMFT index data are important in oral health studies, but establishing a viable statistical model for them is difficult when the data are over-dispersed. Over-dispersion caused by unobserved heterogeneity poses a problem when fitting the more common count models, and failure to account for such heterogeneity can undermine the validity of the empirical results. Given the limitations of other count data models in accounting for over-dispersion due to heterogeneity in DMFT index data, this paper formulates an alternative model that captures that heterogeneity: the Bayesian finite mixture negative binomial regression model. The model was first applied to simulated over-dispersed count data to determine the number of negative binomial components to be mixed, and then applied to DMFT index data. The 3-component Bayesian finite mixture negative binomial (BFMNB-3) regression model is useful because the data were collected from a heterogeneous population. Simulation results show that the 3-component Bayesian finite mixture of NB regression model converges and is adequate for modelling the over-dispersed simulated count data. Applied to DMFT index data, BFMNB-3 captured the heterogeneity in the data and identified all the treatments together, mouthwash with 0.2% sodium fluoride, and oral hygiene as the best methods for preventing tooth decay in children in Belo Horizonte (Brazil) aged seven years. The Bayesian negative binomial (BNB) model, because of the heterogeneity present in the methods, identified only all the treatments together and mouthwash with 0.2% sodium fluoride, although these two were not the only significant methods. The results therefore show the superiority of BFMNB-3 over the BNB model.
R statistical software was used to accomplish the objectives of this paper.


Introduction
Count data are encountered in many areas of research, including the social sciences, transport, economics and health. Examples include the number of accidents in a specified period of time, the number of epileptic seizures in a week, the number of insurance claims paid by an insurance company in a year, the number of domestic violence incidents, and the number of defective items in a batch of manufactured items. Count data take different forms: counts with an excess number of zeros, counts with large observations, and counts without zeros. Many standard models have been developed for count data, including Poisson regression, negative binomial, zero-inflated Poisson, Conway-Maxwell-Poisson and double Poisson models [1]; the choice of model depends on the presence of excess zeros and dispersion in the data [2].
In the recent past the negative binomial and Poisson distributions have been the most commonly used probability models in the statistical analysis of count data [3]. Poisson regression is popular for modelling equi-dispersed count data and has been used in a number of applications involving data with no overdispersion [4], but its underlying assumption of equidispersion limits its use in many real-world applications where over- and under-dispersed count data are encountered [2]. Overdispersion and underdispersion, the former arising mainly from an excess of zeros, can lead to inconsistent standard errors of parameter estimates when the Poisson model is used [5][6]. In the negative binomial distribution, the distribution's parameter is itself treated as a random variable, and variation in this parameter can account for variance in the data that is higher than the mean; it therefore serves as a good alternative for handling overdispersed count data [7]. The zero-inflated double Poisson model could be a viable alternative for the joint modelling of excess zeros and overdispersion (or underdispersion).
Over- and under-dispersed count data are conveniently modelled by the Conway-Maxwell-Poisson regression model [8,9] and the double Poisson regression model [10][11], which have been found to be very flexible in handling overdispersed count data [11][12][13][14]. Discrete counts have also been modelled in a Bayesian framework in WinBUGS [15], but parameter estimation remains complex and difficult. Although the Conway-Maxwell-Poisson distribution could be a feasible alternative for modelling over- and under-dispersed count data, it requires a lot of computation for parameter estimation [16] and performs worse than the double Poisson when the sample mean is high, for all types of dispersion [17]. The major challenge with the double Poisson distribution is that results are not exact, since the normalizing constant has no closed-form solution [4]. This study proposes a Bayesian finite mixture model for over-dispersed count data because, given the posterior distribution of the unknown parameters, Bayesian methods provide valid inference without relying on asymptotic normality, which is important when the sample size is small. In the study, BFMNB-k was formulated, its performance was assessed by fitting it to over-dispersed simulated count data, and finally the BFMNB-3 model was applied to DMFT index data.

Literature Review
Bayesian analysis represents prior uncertainty about model parameters with a probability distribution and updates that prior uncertainty with the current data to produce a posterior probability distribution for the parameters that carries less uncertainty. In Bayesian analysis, model parameters are considered random quantities, whereas the data, having already been observed, are considered fixed quantities. The Bayesian approach provides a fairly explicit solution to common problems of statistical inference, to new problems of high-dimensional data analysis arising from the emergence of high-dimensional data sets, and to complex real-life decision problems [18].
Models for count data discussed in this paper
Poisson Regression Model
The Poisson model has been used as the basic model for count data [19]. It models equi-dispersed count data, that is, $E(y \mid x) = \mathrm{Var}(y \mid x) = \mu(x)$, but it fails when the data are over-dispersed. The model is represented as
$$P(Y = y \mid x) = \frac{e^{-\mu(x)}\,\mu(x)^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$
In real-life situations count data exhibit overdispersion, and the restrictive Poisson assumption of equal mean and variance fails because of heterogeneity (differences between individuals) and contagion (dependence between the occurrence of events) [17].
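To model overdispersed count data, the Poisson regression can be modified [4] such that $y \mid x, \varepsilon \sim \text{Poisson}(\mu(x)\,\varepsilon)$, where $\varepsilon$ is a nonnegative multiplicative random-effect term that models individual heterogeneity. Taking the total expectation gives $E(y \mid x) = \mu(x)\,E(\varepsilon)$. Whenever $\varepsilon$ has positive variance, the variance of $y$ is greater than its mean, so the model can now be used for overdispersed count data.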
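The step from the multiplicative random effect to overdispersion can be made explicit with the laws of total expectation and total variance. The derivation below is a sketch that assumes $y \mid x, \varepsilon \sim \text{Poisson}(\mu(x)\,\varepsilon)$ with $\varepsilon$ independent of $x$; the normalisation $E(\varepsilon) = 1$ in the last line is a common convention assumed here, not something stated in the text.

```latex
\begin{align*}
E(y \mid x) &= E_{\varepsilon}\!\left[E(y \mid x, \varepsilon)\right]
             = \mu(x)\,E(\varepsilon), \\
\mathrm{Var}(y \mid x) &= E_{\varepsilon}\!\left[\mathrm{Var}(y \mid x, \varepsilon)\right]
             + \mathrm{Var}_{\varepsilon}\!\left[E(y \mid x, \varepsilon)\right] \\
            &= \mu(x)\,E(\varepsilon) + \mu(x)^{2}\,\mathrm{Var}(\varepsilon), \\
\mathrm{Var}(y \mid x) - E(y \mid x) &= \mu(x)^{2}\,\mathrm{Var}(\varepsilon) \;\ge\; 0, \\
\text{and with } E(\varepsilon) = 1: \quad
\mathrm{Var}(y \mid x) &= \mu(x) + \mu(x)^{2}\,\mathrm{Var}(\varepsilon).
\end{align*}
```

The excess of the variance over the mean is therefore driven entirely by the spread of the random effect and grows with the square of the mean.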
Negative Binomial
The negative binomial model has been considered to outperform the Poisson regression model for overdispersed count data [5]. It is obtained by placing a gamma prior on the nonnegative multiplicative random-effect term in the Poisson regression model. Prior distributions are placed on the regression parameters, where $\pi(\cdot)$ denotes a prior distribution, that is, $\pi(\beta) = N(0, \sigma^{2} = n)$, where $n$ is the sample size.
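The gamma-Poisson construction of the negative binomial can be checked numerically. The following is a minimal Python sketch (the paper itself reports using R), assuming a gamma random effect with mean 1 and variance 1/phi, which gives the variance mu + mu^2/phi. The function names, the mean mu = 3 and the dispersion phi = 0.5 are illustrative choices, not values from the paper.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gamma_poisson(mu, phi, rng):
    """Draw y | eps ~ Poisson(mu * eps) with eps ~ Gamma(shape=phi, scale=1/phi)."""
    eps = rng.gammavariate(phi, 1.0 / phi)  # E(eps) = 1, Var(eps) = 1/phi
    return sample_poisson(mu * eps, rng)


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
mu, phi = 3.0, 0.5         # hypothetical mean and dispersion
draws = [sample_gamma_poisson(mu, phi, rng) for _ in range(20000)]
sample_mean, sample_var = mean_and_variance(draws)
print(f"mean={sample_mean:.2f}  variance={sample_var:.2f}  "
      f"theoretical variance={mu + mu ** 2 / phi:.2f}")
```

The sample variance should land well above the sample mean, which is the overdispersion that a plain Poisson model cannot reproduce.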
The prior distribution and the likelihood function define the posterior distribution of the regression parameters (Bayes' theorem). Samples from the posterior distribution are obtained in PROC MCMC.
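To illustrate how prior and likelihood combine into a posterior that can be sampled, the following Python sketch runs a random-walk Metropolis sampler for a negative binomial regression. It is a stand-in for illustration only and not the PROC MCMC procedure named above. Assumptions made here: a log link, a fixed dispersion `phi`, independent normal priors with variance equal to the sample size, and a small made-up data set (`counts`).

```python
import math
import random


def log_nb_pmf(y, mu, phi):
    """Log NB probability with mean mu and dispersion phi (variance mu + mu^2/phi)."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (phi + mu))
            + y * math.log(mu / (phi + mu)))


def log_posterior(beta, rows, counts, phi, prior_var):
    """Unnormalised log posterior: NB log-likelihood plus N(0, prior_var) log-priors."""
    log_prior = sum(-0.5 * b * b / prior_var for b in beta)
    log_lik = 0.0
    for x, y in zip(rows, counts):
        mu = math.exp(sum(b * xj for b, xj in zip(beta, x)))  # log link (assumed)
        log_lik += log_nb_pmf(y, mu, phi)
    return log_prior + log_lik


def metropolis(rows, counts, phi, n_iter=5000, step=0.2, seed=1):
    """Random-walk Metropolis draws of the regression coefficients."""
    rng = random.Random(seed)
    beta = [0.0] * len(rows[0])
    prior_var = float(len(counts))  # prior variance set to the sample size n
    current = log_posterior(beta, rows, counts, phi, prior_var)
    draws = []
    for _ in range(n_iter):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, rows, counts, phi, prior_var)
        # Accept with probability min(1, posterior ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
        draws.append(list(beta))
    return draws


# Hypothetical toy counts with an intercept-only design, for illustration only.
counts = [0, 1, 0, 3, 7, 2, 0, 5, 1, 12, 0, 2, 4, 1, 0, 9, 3, 0, 1, 6]
rows = [[1.0] for _ in counts]
draws = metropolis(rows, counts, phi=1.0)
kept = draws[1000:]  # discard burn-in
posterior_mean = sum(d[0] for d in kept) / len(kept)
print(f"posterior mean of intercept: {posterior_mean:.3f}")
```

With an intercept-only design and a diffuse prior, the posterior mean of the intercept should sit close to the log of the sample mean, which gives a quick sanity check on the sampler.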
To assess the goodness of fit

Methodology
Bayesian Finite Mixture Negative Binomial Regression Model [20]
Because of heterogeneity (differences between individuals) and contagion (dependence between the occurrence of events), count data in real life are usually overdispersed (the variance is greater than the mean).
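The random vector $y = (y_1, \ldots, y_n)'$ is said to arise from a finite mixture distribution if the probability density function $f(y \mid \Theta)$ has the form
$$f(y \mid \Theta) = \sum_{j=1}^{k} w_j\, f_j(y \mid \theta_j),$$
where $\Theta$ is the vector of all unknown parameters and $w = (w_1, \ldots, w_k)'$ are the mixing proportions, whose elements are restricted to be positive and to sum to unity.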
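A single density $f_j(\cdot \mid \theta_j)$ is the component distribution for component $j$, and all components are assumed to arise from the same distributional family. The marginal distribution of the mixture is obtained by weighting the component densities by their mixing proportions and summing over the components.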
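For negative binomial components, the mixture density is often written with each component in its mean-dispersion form. The block below is a sketch under that assumption: the component means $\mu_{ij}$, dispersions $\phi_j$, regression vectors $\beta_j$ and the log link are standard choices introduced here for illustration, and the paper's own parametrisation may differ.

```latex
\begin{align*}
f(y_i \mid x_i, \Theta) &= \sum_{j=1}^{k} w_j\,
    \mathrm{NB}\!\left(y_i \mid \mu_{ij}, \phi_j\right),
    \qquad w_j > 0, \quad \sum_{j=1}^{k} w_j = 1, \\
\mathrm{NB}(y \mid \mu, \phi) &=
    \frac{\Gamma(y + \phi)}{\Gamma(\phi)\, y!}
    \left(\frac{\phi}{\phi + \mu}\right)^{\phi}
    \left(\frac{\mu}{\phi + \mu}\right)^{y},
    \qquad y = 0, 1, 2, \ldots, \\
\log \mu_{ij} &= x_i'\beta_j .
\end{align*}
```

Under this form each component has variance $\mu_{ij} + \mu_{ij}^{2}/\phi_j$, and setting $k = 3$ would correspond to the BFMNB-3 model.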

Data Description
The DMFT index data come from the Belo Horizonte Caries Prevention (BELCAP) study. The data were collected from children in Belo Horizonte (Brazil) who were aged 7 years at the start of the study. To determine which method was best for preventing tooth decay, six treatments were randomized to six separate treatment groups. Only the eight deciduous molars were considered, so the lowest value was 0 and the highest was 8. The main reasons for using this data set are that the data are of relatively good quality, have been used for various study purposes, and show heterogeneity across several different sub-populations. The data contain two DMFT counts, End (number of decayed, missing or filled teeth at the end of the study) and Begin (number of decayed, missing or filled teeth at the beginning of the study), and the following covariates: Gender (male and female), Ethnic (with levels brown, white and black), and Treatment (with levels control: control group; educ: oral health education; enrich: enrichment of the school diet with rice bran; rinse: mouthwash with 0.2% sodium fluoride (NaF) solution; hygiene: oral hygiene; and all: all four methods together).

Empirical Results
i) Results from simulation
A sample of size N = 1000 was generated and used to select the most viable Bayesian finite mixture negative binomial model. The number of mixture components was found to be 3, since the model converged only when 3 components were mixed. 2 (1966.592) Therefore, the model in which 3 negative binomial components were mixed is the best fit to the data, and the research concludes that with 3 components the model provides the best fit to overdispersed data.
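The kind of data used in this simulation step can be sketched in Python as follows. The paper does not report the parameters it used to generate the N = 1000 counts, so the weights, component means and dispersions below are hypothetical, and the gamma-Poisson sampler is one convenient way (assumed here) of drawing negative binomial variates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_nb_mixture(n, weights, means, dispersions, seed=0):
    """Simulate n counts from a finite mixture of negative binomial components."""
    rng = random.Random(seed)
    counts, labels = [], []
    for _ in range(n):
        # Pick the latent component, then draw an NB count as a gamma-Poisson mixture.
        j = rng.choices(range(len(weights)), weights=weights)[0]
        eps = rng.gammavariate(dispersions[j], 1.0 / dispersions[j])
        counts.append(sample_poisson(means[j] * eps, rng))
        labels.append(j)
    return counts, labels


# Hypothetical 3-component settings; none of these values come from the paper.
counts, labels = simulate_nb_mixture(
    n=1000,
    weights=[0.5, 0.3, 0.2],
    means=[1.0, 4.0, 12.0],
    dispersions=[1.0, 2.0, 3.0],
)
mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
zero_share = counts.count(0) / len(counts)
print(f"mean={mean:.2f}  variance={variance:.2f}  "
      f"variance/mean={variance / mean:.2f}  zeros={zero_share:.1%}")
```

A variance-to-mean ratio well above 1 and a sizeable share of zeros are the two features such simulated data are expected to show.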
The output is also visualized in Figure 1. To check for overdispersion in the simulated count data, Figure 2 shows skewness to the right, which indicates over-dispersion, and a high proportion of zeros, which is consistent with over-dispersion in the data. The summary statistics [M (SD) = 3.15 (5.66)] show a mean of 3.15 and a variance of 32.0356, so the variance is greater than the mean, a characteristic of overdispersed data.
To check for overdispersion in the DMFT index data, Table 3 shows that the hypothesis $H_0: u = 0$ of no overdispersion was rejected: the estimate u = 0.1044174 is not equal to zero, so there is over-dispersion in the data. The Z-score test had a t-probability of 2e−16, which is less than 0.0005; the Z test also indicates that the data are negative binomial and therefore suggests that real over-dispersion exists in the data. The overdispersion is also visible in Figure 3, where the graph is skewed to the right.
Figure 5 shows, by trace plot, the convergence of the model fitted to DMFT index data. The model converges after 2000 iterations, and the marginal posterior distributions are bell-shaped and unimodal, close to a normal distribution, although there is evidence of dispersion on the right-hand side of the plot.
Table 5 shows that the parameters for all methods together, mouthwash with 0.2% sodium fluoride, and oral hygiene had Bayes factors of 18.69, 0.56 and 0.25 respectively, all greater than the standard value 0.1, so all these parameters were significant. Therefore all the treatments together, mouthwash with 0.2% sodium fluoride, and oral hygiene were the best methods for preventing tooth decay in children in Belo Horizonte (Brazil) aged seven years.
Table 5 also reports a BIC of 2885.572, which measures the goodness of fit of BFMNB-3 to the DMFT index data. This value is lower than the BIC for BFMNB-2 (2898.932), so BFMNB-3 fits the data best.
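The BIC comparison can be written out as a short Python sketch. It assumes the usual definition BIC = k ln(n) - 2 ln(L); the paper does not state how its BIC values were computed. The helper `bic` is illustrative, and only the two BIC values quoted in the text are taken from the paper.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# BIC values quoted in the text for the two candidate models.
reported = {"BFMNB-2": 2898.932, "BFMNB-3": 2885.572}
best = min(reported, key=reported.get)
delta = reported["BFMNB-2"] - reported["BFMNB-3"]
print(f"preferred model: {best}  (BIC difference {delta:.3f})")
```

Because the penalty term grows with the number of parameters, a lower BIC for the 3-component model means its better fit outweighs the extra parameters it needs.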
From Figure 6, the posterior distributions for all the variables are almost the same. This is due to the assumption of conjugate priors for the model parameters, and it is also clear from the prior density, which is the same for all the variables.
ii) Applying the Bayesian Negative Binomial (BNB) Model to DMFT index data
When BNB was applied to DMFT index data, all the treatments together and mouthwash with 0.2% sodium fluoride were identified as the best methods for preventing tooth decay in children in Belo Horizonte (Brazil) aged seven. This is explained by the low proportions inside the ROPE in Table 6: the closer the proportion is to zero the better, the null hypothesis should be rejected, and the parameters are therefore significant. The same results are seen in Figure 7, where the light blue color marks the most significant parameters. A study of the Bayesian negative binomial for differential expression with confounding factors [22] concluded that BNB is capable of handling and tracking complex experiments involving multiple factors and multivariate dependence structure. Despite this performance, the results here show that BNB is less capable of capturing heterogeneity and uncertainty in the variables under study. BFMNB-3 therefore outperforms BNB, since BFMNB-3 was able to identify the 3 components as the best methods for preventing tooth decay, and the research concludes that BFMNB-3 is the most viable model for analyzing the DMFT index data.

Conclusion
The main objective of this paper was to formulate BFMNB-3 and apply it to DMFT index data. The study found BFMNB-3 to be the best model for overdispersed data with heterogeneous sub-populations; the model can also handle uncertainty in the DMFT index data, and it has the lower BIC (2885.572). The plots (Figures 3 and 5) showed that the model can be used to better reveal the source of the dispersion observed in DMFT index data, where treatment was found to be the source of over-dispersion. Because the model captures heterogeneity, it identified all the treatments together, mouthwash with 0.2% sodium fluoride, and oral hygiene as the best methods for preventing tooth decay in children in Belo Horizonte (Brazil) aged seven years. The BNB model, because of the heterogeneity present in the methods, identified only all the treatments together and mouthwash with 0.2% sodium fluoride, and these two were not the only best methods. The results therefore show the superiority of BFMNB-3 over the BNB model.