Non-linear Approximations of Shape and Location Parameters in the Poisson Inverse Gaussian Model in Analysis of Infectious Count Data

Statistical models create a basis for analysis of infectious disesase count. These data sets exhibit unique characteristics such as low counts, delayed reporting, underreporting amoung others. The tendency to model these counts using linear models with their simplicity is common with most research. Further, the assumption of a fixed dispersion in modeling infectious disease counts is quite high. Prediction relating to infectious disease counts have been based on the Poisson model framework. The extension of the Poisson models such NB and PIG distributions have gained popularity over the recent past in modeling count responses showing over dispersion relative to the Poisson distribution. In this study we propose non-linear models for these data sets, modeling the mean and dispersion parameters as additive terms. Negative Binomial (NB) and Poisson Inverse Gaussian (PIG) glm models with a fixed and a varying dispersion parameter and compare them with NB GAM and PIG GAM with both mean and dispersion modeled as additive terms. The model are fitted to over dispersed infectious counts, Salmonella Hadar data set. Residual plots are constructed to explore the quality of fits and analysis goodness of fit is carried out to access the best fitting model. The study results reveal better performance of the PIG models on both the linear and non linear model platforms. Further, modelling both the mean and dispersion proved better as compared to models assuming the dispersion as a constant.


Introduction
Count data is encountered on daily basis and dealings. The data exhibits unique characteristics such as over-dispersion, under-dispersion, incompleteness, presence of excess zeros among others. More understanding of such data and extraction of important information about it needs some statistical analysis or modelling. Various count data may posses different characteristics and therefore cannot be used with particular count data models. This necessitates the need for a systematic way in choosing of the best model that best describes the data in that, one should test whether the models assumptions are met rather than just picking a model naively. The nature of these data has led to development of various types of statistical models that are of great use in the statistical analysis of this type of data. The models and results vary according to the strength of the distributional assumptions made [4]. Most of these methods have now found their way into major statistical packages, which has greatly encouraged their application in variety of contexts.
Count data are most commonly modeled using regression based on the Poisson regression model [3], this regression has considerable limitation. Poisson distribution has one parameter which limits the variance from varying independently from the mean. Despite the fact that these models have been applied extensively in modelling count events [36], this assumption that the mean and variance of the data are equivalent is rather too restrictive and seldom does occur in observational data. This is a significant drawback since count data are often over-or underdispersed relative to the Poisson variance. More so, when the observed data involves excessive zero counts, over dispersion arises hence may result in underestimation of the variance of the estimated parameter, resulting in wrong deductions [31].
Over-dispersion in regard to the Poisson model can be modeled by introduction of an additional parameter. For instance, in the negative binomial (NB) distribution and Generalized Poisson (GP) distributions [6], the models enable independent modelling of both, mean and variance by the incorporation of an additional parameter. They enable additional variation within the data to be accounted for by adding a randomly distributed error term, that is based on the Gamma distribution.
Further developments have been in progress with the view of coming up with a model that best analyses count data with their unique characteristics. Extension of the Poisson models such as the Conway-Maxwell-Poisson (COM-Poisson) distribution and its generalized regression model (GLM) have been proved to be exceptionally flexible in modelling count data [15,14,32,13] whereas, the Double-Poisson (DP) and its generalized regression model; DP GLM performed better for count data with high mean scenarios independent of the type of dispersion [41].
The Sichel (SI) distribution is a model suitable for modelling count data that is highly dispersed [40]. This distribution is a compound form of Poisson distribution, that combines the Poisson distribution with generalized inverse Gamma distribution [35]. Recent research on mixed Poisson distributions for analyzing long-tailed count data stated that the SI model provides satisfactory inferences for so many cases compared to other models [16]. However, though the model seems adequate in modelling count data it suffers estimation problem. To overcome this challenge, reparameterization of the model was introduced, [35]. The special form of the SI distribution; the Poisson inverse Gaussian (PIG) in which the shape parameter in the SI model is set to -0.5 provides a better model for analysis of count data with slightly longer tails and excess kurtosis [39]. The PIG model has tractable nature in that its likelihood function is easily obtainable and has a closed form, indicating estimation of parameters as quite simple and less time consuming.
In the recent past, regression models extending modelling to dispersion, other shape parameters and to distributions beyond the exponential family have been so popular [10]. However, though much interest is still focused on modelling the mean, these models allow the flexibility of modelling shape parameter as a function of covariates. The PIG model parametrized in terms of the mean and dispersion parameter is available as response distribution in already existing regression software [34]. A study conducted on analysis of clinical trials where a secondary outcome was the number of falls that participants experienced while undergoing a drug treatment or usual care (control group: [18]), estimates of the treatment effect on the mean were found to be sensitive to the specification of the dispersion model. The sensitivity of the mean model to the dispersion model is of particular concern in the context of clinical trials, since statistical analysis plans do not generally specify modelling the dispersion parameter. A regression model using alternative parametrization of the PIG distribution, where the shape parameter is orthogonal to the mean should be considered [21]. But this parametrization of and leads to model estimates that are robust to misspecification of the dispersion model. Misspecification of the model may result in realisation of biased estimates, that may in turn lead to erroneous and misleading conclusions. Regression models require that relationship between the variables (response and explanatory) conform to a particular functional form. Omission of important explanatory variables, failure to account for any non-linear components or critical interaction terms, or making measurement errors may lead to misspecification and accordingly bias the parameter estimators of one or more of the predictors in the regression model [5]. The study propose non-linear parameterization of the PIG model in location and dispersion parameters.

Literature Review
Counts are non negative integers which represents occurrences of event (s) within a fixed period of time. In numerous scientific and economic contexts the response variable is usually a count which needs to be analyzed in terms of a set of covariates. Common features off these counts is over dispersion or underdipersion with regard to the Poisson assumption. This arises when the variance of the counts data exceeds (overdispersion) or falls short of (underdispersion) the mean.
Regression models for count data a) The Poisson Regression model The Poisson model is a benchmark model for count data in much same way as the normal linear model is a bench mark for continuous data [30,4,36]. This count model distribution is focused on the number of outcomes of a given event. The model is derived from the Poisson probabilty mass function; where =count response =rate parameter or predicted count =time or area in which counts enter the model. When is considered as applying to individual counts without considering size or the time, = 1, where as, when > 1 it is known as an exposure and usually modeled as an offset. Estimation of this model is based on log-likelihood parametarization of the Poisson probability distribution, which is focussed at establishing parameter values making the data most likely. The exponential family form is: where symbolize the predicted count. The deviance function associated with this equation is applied when the model is estimated as a Generalized Linear Model The mean response is given as , ′. = . / + . , +. . . +. 12 , 12 which is a function of linear predictor variables , , . . . , , 12 . The function (, , .) relates the mean response to the predictor variables , and the regression coefficients . . Hence, if ∼ *( ) then, the mean(response) function is given as = 34*(, .) . Therefore, the Poisson regression mean response function can be expressed as; = . / + . , +. . . +. 12 and its likelihood function is; The log likelihood function is given by; MLEs of . / , . , . . . , . 12 , can be realized through numerical maximization procedures e.g, through the reweighted least square approach. The response function and fitted values ̂= exp(, , A) for the regression function = (, , .) = exp(, ′, .) can be obtained from the parameter estimates of .. The Poisson regression has advantages in that it takes into account the non-negative and discrete data hence, allows for drafting of conclusions on the probability of the occurrence of an event. The model may likewise be used as an option to Cox model in survival analysis, when the risk rates are around steady, amid the perception time frame and given that the danger associated with the occurrence is minimal [1]. However, despite its extensive use the Poisson model seems rather too restrictive due to the assumption that, the mean and variance of the data are equal. This is rarely the case with observation data since most count data in real life are either over-or underdispersed. This violation of the Poisson necessitates the need for other models like, the Negative Binomial (NB) regression model and the Generalized Poisson (GP) model that are able to capture these characteristics of the data. b

) The Generalized Poisson (GP) Regression Model
The explicit presumptions of the Poisson regression models is that the force of Poisson process is a deterministic capacity of the covariates and the occurrences happen arbitrarily over some time. When handling count data characterised by underor over-dispersion; where the sample variance is smaller (or larger) compared to the sample mean, the Poisson model leads to biased estimates of the parameters [7]. For over-dispersed data sets, the GP regression models provide a better alternative for the Poisson regression model [25,28,23,24,26]. The GP and NB regression models have been applied in order to handle over-dispersed count data in place of the Poisson model [25,12]. For instance, where the type of dispersion exhibited by the data set is already known to be overdispersed then, one can either use the GP or NB regression models to model such data, otherwise, for an unknown type of dispersion the decision ought to be the GP model as it is more adaptable. The General Poisson regression can be expressed as; For S = 0 the above model reduces to a Poisson model implying that the GP model is general form of the standard Poisson model. A comparison of the two models in terms of their capability to model over-dispersed count data showed that the GP regression out do the Poisson regression comparing their log-likelihood values [11]. Despite the fact that these models are able to capture overdispersion in count data, the models may not be adequate in handling data characterised by too many zeros.
Other regression models for handling count data c) The Double Poisson (DP) Model and its extensions.
The Double Poisson regression model is a model within the frameworks of the double exponential family proposed by [9]. The distribution is derived from a mixture of two Poisson distributions, V( ) and V(C), i.e, where, W =dispersion parameter X( , W) =normalizing constant. The exact values of this constant are dependent on the values of and W. The density function of the model is expressed as, and the probability mass function is given by; with mean and standard deviation (the exact density F,Y (C)) given as; The normalizing constant X( , W) is approximately equal to 1 and can be obtained as; This constant ensures that the density adds up to unity. This double poiusson model takes into account overdispersion ( bR W > 0) and underdispersion (cℎ3O W < 0) and reduces to a Poisson distribution whenever W = 1 .
The estimates of the parameters and W can be gotten through maximum likelihood approach. Due to the fact that the model has capability of handling over-and underdispersed count data, it has been applied in various research and several extensions developed.
Though this models seems attractable in its ability to handle over-and underdispersed counts, it is still limited in that the normalizing constant has no closed form hence leading to non exact results [36,22]. The incorporation of this constant increases the non-linearity which results in difficulties in achieving the MLEs.
Despite having some attractable characteristics and being more robust it has limitations in its applicability as a basis for GLMs since the parameters and R lack a clear centering parameter [15]. Actually, while almost equals the mean for R close to 1, this differs at a greater extent from the expected value for small values of R. When the data is over-scattered, R would be required to be small and along these lines a COM-Poisson dependent on the original formulation would be extremely hard to decipher and use for the over-scattered count data. To overcome this challenge introduced a reparametrization of to = /R was introduced in the original distribution to provide a clear centering parameter [15].
e) The Sichel (SI) Distribution The Sichel (SI) distribution, which is a compound Poisson model was introduced by [33]. The distribution combines the Poisson distribution with the generalized inverse Gaussian distribution. This resultant combination is in particular very useful in modelling count data that show overdispersion in respect to the Poisson model and has produced satisfactory results in many instances where other count models were inadequate. The distribution contains three parameters 0 < W < 1, > 0 and −∞ < s < ∞ and can be expressed as; where, =dependent variable =expected value of the observations =scale parameter s =shape parameter S T = 2T + 2 (P ) 2 and The fucntion X y ( ) is referred to as the modified Bessel function expressed as; The central moments of the SI distribution are given as; +(9) = , QNR(9) = T 62 (s + 1)/P + 1/P T − 17 As → ∞ and s > 0, the distribution converges to a NB distribution. For, s = −0.5 , the distribution reduces to a Poisson-inverse Gaussian with an expected value and variance + T [39]. The likelihood function of the PIG model is effectively realistic and has a closed form, showing estimation of parameters very straightforward and nearly takes no time [27]. Various research in the fields of medicine and transport have demonstrated that PIG provides a better fits in modeling count data with longer tails and excess Kurtosis [39]. Recently, an orthogonal parametrization of this dispersion model was proposed and its performance tested on a clinical trial data [20].

Methodology a) The Poisson Inverse Gaussian (PIG) Regression Model
The PIG regression model is a special form of the Sichel distribution proposed by [8]. In this model, the shape parameter in the Sichel distribution is set to be a constant, that is, s = − T . Therefore, the model is characterized by two parameters in contrast with the SI model. This model is more tractable and very useful when handling data exhibiting longer tails compared to those of a NB model [8,35]. A variety of Poisson mixture distributions that can be used to handle count data have been proposed, [29]. The distributions are characterized by the two parameters and where the expected value and the variance are given as +(9) = and QNR(9) = (1 + ) . Under these parametrization the probabilty denstity function function is expressed as; where represents the moddel offset then we can have the model as;

∼ V…&( , )
The PIG distribution has been presented in various literature texts in different parametrization. It was first introduced with parameters S and with S > 0 and 0 < < 1 , [33]. Under this parametrization the distribution's expected value is expressed as + (9) (19).
The aspect of the PIG model based on orthogonal parametrization of the mean and shape parameter was investigated and the model appllied on data from clinical trial for the number of falls that patients experienced while undergoing a drug treatments [20]. The PIG distribution is represented as a respeonse distribution with; B( ) = 4′.; ℎ( ) = '′' The study assumes an orthogonal parametrization proposed and introduce general additive models for both the mean and the dispersion parameters in the PIG model assuming; B( ) = 4′.; ℎ(S) = '′s (21) b) Model Formulation Let 9 be a response variable from the exponential family distributions and assuming 9 is parameterized by W then the General additive model (GAM) for the distribution parameters can be expressed as; B(W) = S + ∑ " " " (4 " ) + • where S is a constant term, " ′OE are unspecified smooth functions of the covariates 4 " , • = 1,2, . . . , -, B(⋅) is a smooth monotonic link function which is known and • ∼ -(0, T ) is error term [17,37]. The model above can be represented otherwise using basis expansions for each smoother with its corresponding penalty and estimation carried out by penalized regression approaches with appropriate degree of smoothness for the " ′OE estimated from the data via marginal likelihood maximization or cross validation approaches [37]. Under this approach a smooth function " can be represented by a set of ˜ spline basis functions A " (4) hence; where . " is the smoothing coefficient associated with the • š function. . • is estimated by maximizing the penalized loglikelihood; where ℓ ⋅ is the log-likelihood function, " is the smoothing parameter for the • ℎ function " and _ " is known matrix of coefficients [38].
In this study we seek to develop GAMs, for both the mean and dispersion parameters under the orthogonal parametrization, that is, 9 ∼ V…&( , S). We assume a mean and dispersion model of the form; B( ) = . / + ∑ " " " (4 " ) + • (24) ℎ( ) = s / + ∑ " " OE " (4 " ) (25) where, B( ) = log( ) and ℎ( ) = log( ) are the monotonic link functions 4 ′OE are observed data variables • = log is the offset term for an observation time c) Model specification A parameter-observational model for infectious diseases takes the following functional form; 34* oe 0 S 2 (26) where the model parameters is the mean infections per week and oe and S 2 are the endemic and is epidemic components respectively [19]. The endemic component oe is considered to be parameter driven. Due to the fact that most infectious disease data exhibit seasonality, •bB oe is modeled as the sum of S harmonic waves having different frequencies and an intercept term as; where, OE=number of harmonics ' Ÿ TŸ} 1 are fourier frequencies with ˜ as the base frequency. The epidemic component S 2 is an observational driven process through S . For values of S between 0 and 1, the model depicts occasional epidemic outbreaks which a branching process with immigration. The process is ergodic in situations where S e 1 and has an exponential increase for values of S 1. However, in cases where S 0 , the model reduces to a parameter driven formulation without an epidemic incidences.
The model serves a link for infectious disease counts mean and the explanatory variables. The dispersion parameter is modeled as a function of the explanatory variables to explain extra variation in the data. The study modelled mean infection model as a function of additive terms: •bB oe 0 2 (28) where, oe is the endemic additive term given as: and 2 is the epidemic component considered to be observational driven. Further the dispersion in the data was modelled as additive terms of the both the endemic and epidemic components under the Poisson inverse model as: •bB oe 0 2 (30) where parameters are as earlier defined. d) Description of the Data (Salmonella Hadar Data) The Salmonella Hadar data set contained 295 observations of the disease recorded over a period of six years (2001)(2002)(2003)(2004)(2005)(2006) in Germany. The distribution of the responses, in figure 1, depicts a highly peaked data with a long tail and skewed to the right with a kurtosis and skewness coefficients of 7.3 and 1.86 respectively. The ratio of variance to the mean was 3.45 an indication of presence of over-dispersion in the data [2]. A Poisson glm model fiited on the data recorded a dispersion of 1.96. This showed that the data was over-dispersed relative to the Poisson distribution.

Results and Discussions
a) Linear models Generalized linear models (GLM) were fitted for both data sets under the Poisson, Negative Binomial and PIG distributions and model performance examined. Goodness of fit of the models was examined based on values of Akaike Information Criterion (AIC), global deviance and Bayesian Information Criterion (BIC). Model (s) considered to have best fit for the data were those that recorded the smallest values of these measures. Table 1 shows the summary results for the Poisson, Negative Binomial (NBI) and Poisson Inverse Gaussian (PIG) linear model fits with both fixed and varying dispersion parameters for the study data. From these summary results it is evident that the NB and PIG models had similar estimates. For instance, the past observations of the disease had a positive relationship with the current infection frequency. Further, the PIG model had better fits for both the fixed and varying dispersion parameters. The models with a varying dispersion parameter showed better fits for the data compared to those with a fixed dispersion. This is evident from the small values of AIC, SBC and Global deviance values. This is an indication that the Inverse Gaussian part in the PIG model has a better flexibility to handle over dispersed data common in infectious disease compared to the Gamma part present in NB models. Efficiency of the PIG model is facilitated by the varying dispersion parameters in the model that help to capture variability within the data. b) Non-linear model fits General additive models were fitted for Poisson, NB and PIG models. General additive term models for the mean assuming the dispersion was fixed were fitted to the data under the models and results compared. Further both the mean and dispersion were modeled as smooth additive terms and model performance investigated. Table 2, shows the summary results for the AIC, SBC and global deviance for the fitted models. Evidently, from the observed summary results the PIG GAM model produced a better fitted model for the study data. This is indicated by the small values of AIC (1199.047, 1209.564) recorded for the model compared to the NB model AICs (1212.634, 1211.622). Modeling the mean and the dispersion parameters as additive terms increased the model performance further by far. This is clear indication that modeling the mean parameter alone may result in model misspecification, [20]. A comparison of the AIC of the additive models and their linear counterparts showed that the additive models had better fits compared to the glm models previously fitted. This implies that, non-linear additive models are best suited to fit overdispersed count data as they produced better fits compared to the linear models. c) Residual Analysis For any fitted model, true residuals R have standard normal distribution irrespective of the model distribution. The randomized quantile residual summary for the additive models are shown in Table 3. The normalized quantile residuals for the fitted PIG models behave well. Their means are nearly zero, variance nearly one, kurtosis is nearly 3 and the values fall inside the acceptance region, except for model skewness. The model residuals shows some right skewness. The residual distributions for NB the models suggests a leptokurtic behavior which is slightly skewed to the right. The residuals from the PIG model with additive terms for the mean parameter are leptokurtic and skewed to the right. The figures 2 and 3 show the residual worm plots for the additive models with fixed dispersion and an additive dispersion model respectively.  The residual worm plots show that the model fits the data fairly well. However, the plots for the additive models where the dispersion parameter was assumed to be fixed indicate the model did not fit the data very well as shown by some points lying outside the confidence bounds. An in depth insight to the residual worm plots for both the fixed dispersion models and additive dispersion reveal some misfits of the penalized curves fitted to the residual points for given data. This serves as an explanation to the disparities seen from the summary of randomized quantile residuals for the models. The misfits observed for these models were as follows: (0.5, 23.5) and (74.5, 98.5) for the skewness and kurtosis of the residuals, (98.5, 122.5) for kurtosis and (245.5, 270.5) for the variance of the residuals in the eleventh interval for the NB model with a fixed dispersion parameter. The PIG model with fixed dispersion had misfits in the intervals; (0.5, 23.5), (68.5, 90.5) and (226.5, 249.5) for the skewness coefficient and (0.5, 23.5) for the kurtosis. Though the additive dispersion models showed improved model fits of the residual worm plots, an in depth analysis of these plots had evidence of misfits in some intervals. The NB additive model had misfits in: (74.5, 98.5) for skewness coefficient, (0.5, 24.5), (74.5, 98.5) and (172.5, 196.5) for peak coefficients. The PIG additive model recorded misfits at the intervals, (0.5, 24.5) and (74.5, 98.5) for the skewness and kurtosis of the model residuals.

Conclusions
The study applied non linear models for both the mean and dispersion parameters of the PIG distribution to overdispersed infectious disease count data. General additive models of the PIG and NB distribution were fitted to infectious disease counts. The linear model fits for counts indicated that PIG glm models with varying dispersion parameter had better performance in fitting the data as opposed to the other fits. More so, PIG GAM overall provided better model results compared to NB GAM models and the linear forms of the two models. This could be attributed to the flexibility of the Inverse Gaussian part of present in this model. Further, the findings showed that models having both the mean and dispersion parameters as varying terms had better performance as compared to models with a fixed dispersion paramater. This is an indication that modeling the mean parameter in a distribution assuming the dispersion may have resulted in the poor performance of these models. From the study results the PIG and NB models had almost similar estimates for the smooth terms and linear coefficients implying that these models can be applied alternatively depending on the data structures. This study proposes the PIG GAM models as a better distribution for modelling overdispersed infectious disease count.