Poisson Inverse Gaussian (PIG) Model for Infectious Disease Count Data
Vincent Moshi Ouma^{1}^{, *}, Samuel Musili Mwalili^{2}, Anthony Wanjoya Kiberia^{2}
^{1}Applied Statistics, Department of Statistics and Actuarial Sciences, College of Pure and Applied Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
^{2}Department of Statistics and Actuarial Sciences, College of Pure and Applied Sciences, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
Email address:
To cite this article:
Vincent Moshi Ouma, Samuel Musili Mwalili, Anthony Wanjoya Kiberia. Poisson Inverse Gaussian (PIG) Model for Infectious Disease Count Data. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 5, 2016, pp. 326-333. doi: 10.11648/j.ajtas.20160505.22
Received: September 9, 2016; Accepted: September 21, 2016; Published: October 10, 2016
Abstract: Traditionally, statistical models provide a general basis for analysis of infectious disease count data with its unique characteristics such as low disease counts, underreporting, reporting delays, seasonality, past outbreaks and lack of a number of susceptible. Through this approach, statistical models have provided a popular means of estimating safety performance of various health elements. Predictions relating to infectious disease outbreaks by use of statistical models have been based on Poisson modeling framework and Negative Binomial (NB) modeling framework in the case of overdispersion within the count data. Recent studies have proved that the Poisson- Inverse Gaussian (PIG) model can be used to analyze count data that is highly overdispersed which cannot be effectively analyzed by the traditional Negative Binomial model. A PIG model with fixed/varying dispersion parameters is fitted to two infectious disease datasets and its performance in terms of goodness-of-fit and future outbreak predictions of infectious disease is compared to that of the traditional NB model.
Keywords: Mixed Models, Poisson-Inverse Gaussian Distribution, Negative Binomial Distribution, Infectious Disease
1. Introduction
An everlasting fight for humans to cab the virulent of various viruses has been ongoing since time in memorial. This has been done by use of statistical models. Since data from infectious disease are count data, most frequently researchers and scholars alike have tended to favor the use of Poisson distribution as the base distribution for the counts and in addition, zero inflation to counter the excessive zeros that are prominent in most infectious disease data. The NB distribution, a mixture of Poisson and Gamma distributions, has been applied to account for overdispersion usually encountered in most infectious diseases count data [6]. But with the restrictive nature of the Poisson models (equal dispersion), and the NB models less flexibility approach in handling highly dispersed data, they pose a great challenge in modeling infectious disease count data with the corresponding characteristics. A few extensions within the model to allow for higher dispersion is needed. A few studies suggest the PIG model as an alternative to the NB model for modeling count data especially those with longer tails and larger kurtosis [11] [19]. Further extensions of these in recent past have led to the development of mixed models that have the capabilities of handling effectively count data with unique characteristics common to infectious diseases [21].
This paper will involve the use of a mixed Poisson- Inverse Gaussian distribution in modeling count data from infectious diseases by using a parameter -and observation-driven model for infectious disease and compare its performance to that of the traditional Negative Binomial distribution.
2. Poisson Distribution
The basic of all regression models for count data is the Poisson regression model [3] expressed as
(1)
Where
Assuming an equal dispersion
(2)
3. Negative Binomial Distribution
For overdispersed count data, the Poisson model can be modified as
Where
(3)
Is the non-negative (e.g. gamma, lognormal etc) multiplicative random-effect term that assists in modeling heterogeneity [17].
Incorporating the use of the law of total expectation and the law of total variance it is evident that
(4)
and
(5)
This suggests that for we obtain a regression model for overdispersed counts [20].
Even though the Poisson models have been extensively applied to model count events [17], the assumption that the mean and the variance of the count data are equal is rather too restrictive and rarely does occur in observational data. Furthermore, when observed data involves excessive zero counts, overdispersion arises hence may lead to underestimating the variance of the estimated parameter, leading to the wrong conclusion [13]. Due to overdispersion NB model was presented as a good alternative to the Poisson model. This is because the NB model allows the extra variation within the data to be captured by adding a randomly distributed error term, that is based on the Gamma distribution, that is, the structure of the NB regression model is constructed by allowing have a prior gamma;
(6)
Where
Thus the NB distribution is parameterized by and an inverse parameter (reciprocal of r) as
(7)
Hence, we have and .
Since the inception of the NB model, it has been widely favored to model count data from infectious disease due to the flexibility and less strictness on the gamma part [8] [9] [12] [13]. In the recent past, other statistical regression models for analyzing count data exhibiting overdispersion and excess zero counts have been proposed and show potential alternatives for the traditionally NB regression model. They include.
4. Review of Some Count Mixed Models
4.1. The Zero-Inflated and the Hurdle Models
The Zero-inflated and Hurdle models are both known to handle count data with excessive zeros in the observed data. The Zero-inflated model assumes that the zero observations have two distinct different origins, that is, structural and sampling (chance). On the other hand, Hurdle models assume all zero data are from structural source with the nonzero data having sampling origin with either truncated Poisson or truncated NB distributions [15].
For a further review of Zero-inflated and Hurdle models, check Hu et al [10].
4.2. Lognormal-Poisson Regression Model
Agresti [1] constructs a lognormal-Poisson regression model by inputting a lognormal prior on as
(8)
Where and
Hence we have
(9)
(10)
4.3. The Log-normal and Gamma Mixed NB (LGNB) Regression Model
Since Bayesian analysis of counts is limited in that, their lack an efficient inference as to the conjugate prior for the regression coefficient β is unknown under the Poisson and NB likelihoods [17] and also the conjugate prior of the NB dispersion parameter is unknown. The Log-normal and gamma mixed NB regression model addresses these issues by inputting a lognormal prior as multiplicative random effect term on and a gamma prior on the dispersion parameter r.
Let
(11)
and
(12)
Then the LGBN model is expressed as
(13)
Where
(14)
(15)
(16)
(17)
and
ψ can be modeled as
(18)
where and , , , , , and are gamma hyperparameters.
4.4. The Negative Binomial-Lindley (NB-L) Model
Even though models mentioned above are considered to provide a better fit than the traditional NB model, many of the models show a deficiency in analyzing datasets with a large number of zeros and are highly skewed [19]. Geedipally [5] introduced the Negative Binomial-Lindley (NB-L) model in modeling count data characterized by a large number of zeros.
The NB-L distribution is a mixture of NB and Lindley’s distributions resulting into a mixed distribution with a thick tail and useful for data containing a large number of zeros. Zamani and Ismail [18], gives the pmf of the NB-L distribution as
(19)
Where
r is the shape parameter of NB-L distribution
in combination with shape parameter r dictates the mean and variance of the NB-L distribution.
The mean of the NB-L (r,) is given as;
(20)
Lord et al. [14] demonstrated that the NB-L model provides a better statistical performance than the ZINB and it is more theoretically sound. However, the only disadvantage of the NB-L model is that its likelihood function does not have a closed form hence the parameter estimation based on McMC chain requires intensive computation time [19].
4.5. The Sichel (SI) Model
Hauer [7] in his research points out that although the gamma assumption for the is adequate for various datasets, it is not a proof that are indeed gamma distributed. He notes that there is a need to explore other mixed-Poisson models. Zou et al. [21] introduced the Sichel (SI) distribution for modeling highly dispersed count data. The SI distribution is a compound of Poisson distribution that mixes Poisson distribution with generalized inverse Gaussian distribution.
The Sichel distribution SI (μ, , ν) has the following structure;
(21)
Where y, μ, , ν are response variable, mean, scale parameter, shape parameter respectively and
(22)
(23)
is the Bessel function.
For and , the SI distribution reduces to a NB distribution.
5. Poisson Inverse Gaussian (PIG) Distribution
The PIG distribution is a special case of the SI distribution in which the shape parameter is set to be -0.5. Thus it is characterized by only two parameters. Its likelihood function is easily obtainable and has a closed form, indicating the estimation of parameters as quite simple and almost takes no time [16].
As demonstrated by Zha et al. [19], the overall PIG distribution denoted as PIG (), is given by
(24)
Where
is the Bessel function of third kind
The mean and variance of the PIG distribution are
(25)
(26)
6. Methodology
The functional form of the model is given by
(27)
where is the mean number of counts per week, endemic component, and endemic component.
The form of the dispersion parameters is nonlinear since it provides more flexibility to capture the variance of the data [21].
(28)
Where and are estimated coefficients.
The endemic component is defined as
(29)
Where s is the number of harmonics to include andis the Fourier frequencies i.e. where p is the base frequencies (week in our case).
The epidemic component is considered as observations driven process through parameter λ. For λ= 0, the model reduces to a parameter-driven formulation with no epidemic incidence and for 0 <λ < 1, the model displays occasional epidemic outbreaks.
6.1. Estimation of Model Parameters
An advantage of the proposed model is that its framework is easily estimated by Maximum likelihood method [8]. A further advantage is that the likelihood function of the PIG distribution has a closed form hence, enabling the regression parameter to be easily obtained through MLE method [19]. As proposed by Dean et al. [4], the probability generating function (PGF) for PIG (μ, ) is given by;
(30)
Calculating recursively the probabilities, we have;
(31)
(32)
For y=2, 3, …..
For a random sample of observations , the log likelihood function can be derived from
(33)
6.2. Data Description
Two datasets used in the analysis are weekly salmonella Hadar observed cases in Germany from the years 2001 to 2006 and weekly Salmonella Agona observed cases in the UK from the years 1990 to 1995. The two dataset are provided within the Surveillance package in R.
The statistics involved in describing the data is skewness, kurtosis, and variance to mean ratio.
6.2.1. Salmonella Agona Data
The Salmonella Agona data set included 312 observed cases of the disease within the UK over a period of six years from 1990 to 1995. The distribution of the data is as shown in figure 1. From the distribution, it is evident that the data has a long tail and rightly skewed with a skewness coefficient of 1.68 indicating high skewness [2]. The variance over mean ratio is 2.436. This information suggests that the data is overdispersed. 16% out of the 312 cases reported zero counts.
6.2.2. Salmonella Hadar Data
The Salmonella Hadar data set included 295 observed cases of the disease within Germany over a period of six years from 2001 to 2006. The distribution of the data is as shown in figure 2. From the distribution, it is evident that the data has a long tail and rightly skewed with a skewness coefficient of 1.85 indicating high skewness. The variance over mean ratio is 3.45. This information suggests that the data is overdispersed. 14% out of the 295 cases reported zero counts.
7. Results and Discussion
7.1. Goodness of Fit
The goodness-of-fit test was based on the values of the global deviance, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). Small values of this meant that the respective model provided the best fit for the data.
Table 1 and Table 2 gives a summary estimation results for the NB model and PIG model using the Salmonella Agona data and Salmonella Hadar data, respectively.
NB | PIG | |||
Parameters | Fixed Dispersion Parameter | Varying Dispersion Parameter | Fixed Dispersion Parameter | Varying Dispersion Parameter |
Nu (ν) | 0.7369 (0.1038) | 0.7275 (0.1100) | 0.7295 (0.1052) | 0.7282 (0.1120) |
Beta (β) | -0.0003 (0.0005) | -0.0003 (0.0005) | -0.0002 (0.0005) | -0.0003 (0.0005) |
Gama (γ) | -0.4284 (0.0686) | -0.4106 (0.0684) | -0.4383 (0.0697) | -0.4167 (0.0695) |
Delta (δ) | -0.1558 (0.0618) | -0.1527 (0.0614) | -0.1656 (0.0623) | -0.1619 (0.0617) |
Lambda (λ) | 0.0853 (0.0160) | 0.0892 (0.0162) | 0.0845 (0.0161) | 0.0890 (0.0163) |
ln α; ln τ | -1.6941 (0.2516) | - | -1.6141 (0.2661) | - |
| - | 2.5715 (1.0164) | - | 3.3879 (1.2586) |
| - | -0.5682 (0.2302) | - | -0.6109 (0.2796) |
Global Deviance | 1237.203 | 1232.262 | 1235.81 | 1231.302 |
AIC: | 1249.203 | 1246.262 | 1247.81 | 1245.302 |
BIC: | 1271.641 | 1272.441 | 1270.249 | 1271.481 |
NB | PIG | |||
Parameters | Fixed Dispersion Parameter | Varying Dispersion Parameter | Fixed Dispersion Parameter | Varying Dispersion Parameter |
Nu (ν) | 1.0138 (0.0950) | 0.9804 (0.1001) | 1.0277 (0.0976) | 0.9987 (0.1010) |
Beta (β) | -0.0016 (0.0005) | -0.0014 (0.0005) | -0.0016 (0.0005) | -0.0015 (0.0005) |
Gama (γ) | -0.1426 (0.0643) | -0.1409 (0.0639) | -0.1481 (0.0654) | -0.1449 (0.0650) |
Delta (δ) | -0.4807 (0.0702) | -0.4974 (0.0724) | -0.4889 (0.0710) | -0.5039 (0.0726) |
Lambda (λ) | 0.0862 (0.0127) | 0.0875 (0.0126) | 0.0839 (0.0124) | 0.0853 (0.0123) |
ln α; ln τ | -1.500 (0.194) | - | -1.4009 (0.2092) | - |
| - | 0.4871 (0.7636) | - | 0.5007 (0.7211) |
| - | -0.1782 (0.1729) | - | -0.1647 (0.1633) |
Global Deviance | 1240.889 | 1239.903 | 1237.772 | 1236.804 |
AIC: | 1252.889 | 1253.903 | 1249.772 | 1250.804 |
BIC: | 1274.991 | 1279.688 | 1271.874 | 1276.589 |
Two conclusions can be obtained from the summary statistics within the Table 1 and 2. Foremost, both NB model and PIG model provided similar estimates. For instance, the Salmonella Agona data, both models showed that past observed cases of Salmonella Agona positively associated with the current infection frequency. This is as expected since the disease is infectious, hence exposure to it enhances its migration or transmission. Secondly, the PIG model showed better statistical fit for both datasets than the NB model when varying dispersion parameters were considered. This is the same case even for fixed dispersion parameter for both models. Models with varying dispersion parameters provided better statistical fit for the two datasets as compared to the models with fixed dispersion parameters. This is evidence that the Inverse Gaussian part of the PIG model is more flexible than the Gamma distribution in NB model in handling overdispersed datasets that are common for infectious diseases. This is enhanced by varying the dispersion parameter.
7.2. Residuals Analysis
True residuals , for any fitted model have a standard normal distribution if the fitted model is correct, irrespective of the model distribution. Table 3 and Table 4 give the summary of the randomized quantile residuals for the NB and PIG fitted models for salmonella hadar and salmonella agona datasets, respectively.
Mean | Variance | Skewness | Kurtosis | |
NB | -0.007 | 1.02 | 0.15 | 3.00 |
PIG | -0.009 | 1.05 | -0.14 | 3.45 |
Mean | Variance | Skewness | Kurtosis | |
NB | 0.005 | 0.94 | -0.004 | 3.88 |
PIG | 0.01 | 0.95 | 0.23 | 3.59 |
The residuals summary of both the NB and PIG models for the salmonella hadar data behave well (their mean is nearly zero, variance nearly one, coefficient of skewness near zero, and the residuals values fall inside the "acceptance" region) except for the kurtosis for both the fitted models and skewness for PIG model. The residuals distribution kurtosis for both models is leptokurtic and the PIG model residuals distribution is rightly skewed.
For the salmonella agona data, the residuals summary of the PIG model suggests that the distribution of the residuals is leptokurtic and that from the NB suggest that the distribution is slightly skewed to the right.
Figure 3 and 4provide residual worm plots for both NB and PIG models for the salmonella hadar data respectively, while Figure 5 and 6 provide residual worm plots for both NB and PIG models for the salmonella agona data respectively.
Generally, the four residuals worm plots from the fitted NB and PIG models for both the salmonella agona and salmonella hadar data, indicate that both the NB model and PIG model fits the data well. A much deeper insight into the worm plots reveals some misfits of the cubic polynomial curve fitted to the residuals points for both the models considering the respective datasets. This explains the disparity shown from the summary of the randomized quantile residuals from both the fitted models for both the datasets. For the salmonella hadar data, the misfits are as follows; (-0.5, 3.5) and (5.5, 21.5) for the constant coefficient, (3.5, 5.5) for the linear coefficient, (-0.5, 3.5) and (5.5, 21.5) for the quadratic coefficient, and (1.5, 5.5) for the cubic coefficient of the cubic polynomial curve fitted to the residuals points. For the salmonella agona data, the misfits are as follows; (-0.5, 3.5) for the constant coefficient, (1.5, 5.5) for the linear coefficient, (-0.5, 3.5) and (5.5, 21.5) for the quadratic coefficient, and (1.5, 5.5) for the cubic coefficient of the cubic polynomial curve fitted to the residuals points.
7.3. Comparison of Prediction Performance
The analysis was done in two step approach. First, 80% of the entire datasets respectively were randomly selected from each dataset for model estimation. Based on this, a regression model was fitted. The fitted model was used for prediction purposes for the remaining 20% of the data sets respectively. A note to consider is that the model fitted for prediction performance was based on models with varying dispersion parameters since previously they provided a good statistical fit to the respective datasets.
Comparison of prediction accuracy for both the fitted NB model and PIG model was based on the Mean Absolute deviance (MAD). A better prediction is indicated by small values of MAD.
Table 5 gives a summary of modeling results of both the NB model and the PIG model.
NB | PIG | |||
Parameters | Salmonella Agona | Salmonella Hadar | Salmonella Agona | Salmonella Hadar |
Nu (ν) | 0.8939 (0.1314) | 1.4244 (0.1251) | 0.9068 (0.1384) | 1.4240 (0.1257) |
Beta (β) | -0.0015 (0.0007) | -0.0039 (0.0007) | -0.0016 (0.0008) | -0.0039 (0.0007) |
Gama (γ) | -0.4751 (0.0785) | -0.2422 (0.0671) | -0.4781 (0.0794) | -0.2416 (0.0680) |
Delta (δ) | -0.2409 (0.0734) | -0.4763 (0.0731) | -0.2510 (0.0737) | -0.4757 (0.0734) |
Lambda (λ) | 0.0731 (0.0178) | 0.0335 (0.0163) | 0.0730 (0.0180) | 0.0348 (0.0163) |
| 3.3511 (1.0893) | 0.4049 (0.9189) | 6.4360 (1.4522) | 0.3838 (0.8641) |
| -0.6519 (0.2609) | -0.2253 (0.2324) | -0.7790 (0.3377) | -0.1893 (0.2158) |
AIC: | 982.2296 | 1034.436 | 981.1035 | 1031.829 |
MAD | 1.2548 | 1.3604 | 1.2736 | 1.3721 |
The MAD values for both the NB model and PIG model are considerably almost the same, which therefore suggested that the PIG model can do as well as the NB model when it comes to prediction performance.
8. Conclusions and Recommendations
The aim of this study was to apply the PIG distribution in analyzing infectious disease count data. From the study, the PIG model provided better statistical fit than the traditional NB model. The statistical fit was further improved by varying the dispersion parameter of the respective models. These results can be attributed to the flexibility of the inverse Gaussian part of the PIG distribution as compared to the gamma part of the NB model. Hence, it can be said that PIG model can be an alternative model for analyzing infectious disease count data that exhibit overdispersion.
The applicability of the PIG model over NB model for analyzing infectious disease count data is not fully investigated considering that the datasets used in this study had small proportions of zero counts. The salmonella agona dataset had16% while salmonella hadar had 14%. Furthermore, the datasets were not that highly dispersed to better investigate the flexibility of the inverse Gaussian part of the PIG model. To cement the claim that PIG distribution can be a suitable alternative distribution to NB distribution in handling infectious disease count data, a simulation study should be performed.
References