Modeling Zero Inflation and Over-Dispersion in Domestic Package Insurance Claims Portfolio: A Case of Madison Insurance Company-Kenya

The standard Poisson distribution is widely used for regression modeling of count data outcomes. However, it is suitable only for equi-dispersed counts, because it does not account for the over-dispersion and excess zeros found in many data sets, including insurance claims data. The objective of this study is to model domestic package insurance claim frequency using zero-inflated and hurdle models, since insurance portfolios are characterized by the non-occurrence of claims over a given time interval. This non-occurrence of claims usually leads to the zero-inflation and over-dispersion associated with insurance claims data. The study evaluates the performance of the Poisson, Zero-Inflated Poisson (ZIP) and Hurdle Poisson (HP) models to determine which of them best fits the domestic package insurance claims data. The selected model is then used to estimate and predict claim counts and to assess the heterogeneity of claim occurrence. The Hosmer-Lemeshow test is used to assess the suitability of the fitted model for capturing the zero-inflation and over-dispersion in the data. Pearson and deviance residuals are used to detect outliers and to examine the distribution of the residuals. The data consist of the weekly number of claims on the domestic package insurance policy of Madison Insurance Ltd, Kenya, from 2014 to 2018 (261 weeks).


Introduction
The insurance product is distinctive in nature in that its quality can only be judged when something goes wrong. For this reason, the way in which a claim is handled has important market repercussions for an insurer. A claim is a request by the insured to be indemnified by the insurer following a financial loss associated with the occurrence of the insured peril.
Non-life insurance companies (insurers) calculate probable income within a given period by offsetting premiums receivable against claims payable, hence the need for them to strike a balance between the two [5]. The accurate estimation of premium expenses is thus a vital task for all insurance stakeholders. It has traditionally been accomplished by decomposing the overall claim expenses into claim frequency and claim amount, which are the two key risk drivers in any insurance business [5].
This study concerns itself with modeling claim frequency (the number of claims), an indispensable component of premium determination, which is a vital yet difficult undertaking for insurers. Traditionally this was achieved with the classical Poisson regression model, in which the different rating factors enter through regression coefficients. However, this methodology does not provide accurate results, since insurance claims count data typically contain an excess number of zeros for a particular time interval, which the classical Poisson regression model does not cater for.
This excess number of zero claims can be attributed to policyholders deliberately not reporting small claims to the insurance company, so as to avoid paying deductibles and to retain the no-claim bonuses that reduce the premium payable in the forthcoming period, i.e. year [3]. In this regard, claims count data are modeled as a mixture of a degenerate distribution at 0, which captures the excess zero claims, and a count distribution for the remaining claims.
Given the presence of zero claims in the claims data, this study compares the Poisson, Zero-Inflated and Hurdle models to determine an appropriate statistical distribution that models the excess zeros correctly. The over-dispersion in the data, which violates the equi-dispersion property of Poisson regression, is also examined. Over-dispersion arises from the randomness of claim occurrence, since an insurer cannot determine precisely the number and amount of claims that will occur in the following period.
To determine the best-fitting statistical count data model for the domestic package insurance claims data, this research applies the following steps: i. Selecting two subsets of variables for each model, one for the rating factors and the other for the occurrence of zero-inflation. ii. Testing hypotheses on the occurrence of zero-inflation and over-dispersion in the domestic package insurance claims. iii. Specifying the model selection criteria used to select the statistical distribution that best models the claims data. iv. Estimating the parameters and calculating goodness-of-fit measures. The study uses the term zero-inflation to emphasize that the probability mass at count zero exceeds what the standard count distributions allow for. If inadequately modeled, zero-inflation can invalidate the results of the data analysis and so endanger the reliability of the scientific inferences. Insurance portfolios are characterized by the non-occurrence of claims over a given time interval, which makes zero-inflation and over-dispersion common phenomena in insurance portfolios.

Domestic Package Insurance Claims
Domestic package insurance is an insurance package that covers accidental loss or damage to a residential home (a private dwelling used for domestic purposes only) [3]. It covers loss attributed to any of the contents (household goods) of the residential home and to personal effects, which may belong to the owner of the home or to a third party associated with the owner and residing in the home [3]. The policy can be extended to cover expenses arising from the death or sickness of the homeowner's domestic servant while on duty, as defined in the Work Injury Benefits Act (WIBA) [3].
Domestic package insurance claims are formal requests by the insured to an insurer for indemnification upon the occurrence of the insured peril. Upon verification and approval of the claim by the insurer, payments are made to the insured to cover the loss attributed to the occurrence of the insured peril. The policy has an excess, which is the first amount of each loss that the insured must bear for every claim made; when the loss is equal to or less than the excess, the insured bears the whole loss.
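The effect of the excess on what the insurer pays can be illustrated with a minimal sketch. The function name and the monetary amounts below are hypothetical and are not taken from the study's data; the sketch only encodes the rule that the insured bears the first part of every loss.

```python
def insurer_payment(loss: float, excess: float) -> float:
    """Amount paid by the insurer for one claim after the insured bears the excess."""
    if loss < 0 or excess < 0:
        raise ValueError("loss and excess must be non-negative")
    # The insured bears the first `excess` of every loss; nothing is paid
    # when the loss is equal to or less than the excess.
    return max(loss - excess, 0.0)


# Hypothetical amounts, for illustration only.
print(insurer_payment(50_000, 10_000))  # loss above the excess
print(insurer_payment(8_000, 10_000))   # loss below the excess
```

Losses at or below the excess lead to no payment, which is one reason small losses may never appear as claims in the insurer's records.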
The domestic package insurance policy covers six insurance perils, namely fire, burglary, theft, explosion, riot and strike, and floods. Domestic package insurance claims occur randomly, hence the insurer cannot predict the next occurrence or its magnitude over a given period. These claims thus have special characteristics that must be considered when choosing a distribution to fit the data. These are: i. The exclusion of some rating factors justifies including a heterogeneity component in the regression model, and some rating factors are a greater cause of domestic package insurance claims than others. ii. The high number of zero counts motivates the fitting of zero-inflated and hurdle models, and the quality of the fit can be explained by the behavior of the insured, which changes once a claim has been reported in the year. iii. The randomness of claim occurrence and the associated assumption of independence of claim occurrence. This matters because of the inverted business cycle, in which premiums are received before any costs are paid.

Statement of the Problem
Estimating the expected claim frequency and severity of an insurance policy enables insurers to make decisions on asset pricing, allocation and claim payment with high accuracy, since the insurance business is characterized by an inverted product cycle. This calls for optimal claim frequency modeling, and the purpose of this study is to provide a theoretical framework for exploring and deriving such a model. Mixed random effect models are thought to provide the best fit for insurance data, as they can handle correlation, over-dispersion and zero-inflation in the data.
A study by Asmussen and Albrecher (2010) showcases this by using the additive and multiplicative random effect Poisson model with the Gamma and Inverse Gaussian distributions to model risk premiums [14]. Wolny and Dominiak (2013) propose a mixed Poisson regression with spatial random effects which handles zero-inflation and over-dispersion effects [1]. The present study therefore seeks to explore the zero-inflated and hurdle regression models in the statistical literature that can adequately fit the claims data and then be used to estimate and predict the expected domestic package insurance claim liability. This is extended to help determine the incidence of heterogeneity in the occurrence of the claims.

General Objective
The general objective of the study is to model zero inflation and over-dispersion in domestic package insurance claims using the Zero-Inflated and Hurdle regression models.

Specific Objectives
The specific objectives are: i. To model the zero-inflation and over-dispersion in domestic package insurance claims data using the zero-inflated and hurdle models. ii. To determine the zero-inflated count model that best fits the domestic package insurance claims data and use it to estimate and predict the associated number of claims. iii. To perform diagnostics of the zero-inflated regression models on the domestic package insurance claims data.

Significance of the Study
Insurance plays a major role in a country's economic development, and this is only possible if insurance companies are profitable. Profitability in turn depends on proper claim management, since claims make up the main cash outflow for insurance companies. With proper claim modeling techniques, insurance companies can estimate and predict future claims and, in the long run, predict their profitability, since most insurance products are characterized by an inverted business cycle in which premiums are paid before any claims are settled, usually for a period of one year. Forecasts of profitability in the insurance industry can thus be used by policy makers to determine the contribution of insurance to economic growth and towards achieving the Kenya Vision 2030.

Scope of the Study
The study focuses on using the Zero-Inflated/Hurdle regression modeling techniques to model the domestic package insurance claims data for Madison Insurance Ltd, Kenya. The claims experience shall consist of detailed information on the type of domestic package insurance and the corresponding number of claims.

Introduction
This chapter reviews past studies on insurance claim modeling in order to identify the relevant theories and the empirical evidence that substantiate this research.

Empirical Literature Review
In the modeling of insurance claims, Shevchenko (2010) defined the insurance loss function as a multiplicative function of claim frequency and average claim severity [14]. Alicja & Dominiak (2013) extended this by applying the parametric regression to claims frequency modeling [1]. Yulia et al. (2013) estimated the insurance claim cost for insurance claims data using the Zero Adjusted Gamma and Inverse Gaussian Regression Models with an application to Malaysian Motor Insurance Claims [7].
Kwame & Agbodah (2014) studied probability modeling and simulation of Insurance Claims in Ghana [6]. Vytaras & Andreas (2015) modeled severity and tail risk of Norwegian Fire Insurance Claims using the Pareto, Log normal Pareto and the folded C distributions [8]. Evgenii & Elena (2017) also modeled insurance claims using the Generalized Hurdle and Gamma Distributions, an application to the Russian motor own damage insurance data [12].
Sakthivel & Rajitha (2017) studied the zero-inflated and hurdle models in comparison to the Artificial Neural Network in modeling insurance claims [13]. Joseph & Christophe (2011) reviewed the zero inflated count models in modeling annual trends in incidences of some occupational allergic diseases in France as an extension of using mixed random effects methodologies to model other count data sets other than insurance claims [9].
Marjan (2019) considered the problem of modeling violation of claims with excess zeros in a liability insurance portfolio [11]. Yogita & Kamalja (2017) summarized several dispersed and zero-inflated count data distributions used to handle dispersion in count data [2]. Cheng (2018) studied Hurdle models for modeling general insurance claims data [17]. Lu Yang & Edward (2016) extended the literature on multivariate frequency-severity regression modeling of claim counts by introducing a copula to model the dependence of claims [4].

Introduction
This chapter discusses the Zero-Inflated and Hurdle Distribution count data models used in the modeling of zero-inflation and over-dispersion effect in claim frequency data.

Data
The data for the study comprised the weekly number of claims filed under the domestic package insurance policy over 261 weeks, from Jan 2014 to Dec 2018, for Madison Insurance Company. This was secondary data obtained from Madison Insurance Company, Ltd.

Statistical Modeling
The study uses the term modeling to refer to the process of identifying a theoretical distribution that fits the domestic package claims data reasonably well. This involved fitting zero-inflated and dispersed statistical distributions to the domestic package insurance claims data and determining which of them fits best. The Poisson, Zero-Inflated Poisson and Hurdle Poisson distributions were fit to the data.
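The Poisson distribution is the classical distribution for modeling count data and its probability mass function is given as

$$P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{1}$$

with mean equal to variance, given as $E(Y) = \mathrm{Var}(Y) = \lambda$.
The Zero-Inflated Poisson distribution is given as

$$P(Y = y) = \begin{cases} \pi + (1 - \pi)e^{-\lambda}, & y = 0 \\ (1 - \pi)\dfrac{e^{-\lambda}\lambda^{y}}{y!}, & y = 1, 2, \ldots \end{cases} \tag{2}$$

with mean and variance respectively given as $\lambda(1 - \pi)$ and $\lambda(1 - \pi)(1 + \lambda\pi)$.
The Hurdle Poisson distribution is given as

$$P(Y = y) = \begin{cases} \pi, & y = 0 \\ (1 - \pi)\dfrac{e^{-\lambda}\lambda^{y}}{y!\,(1 - e^{-\lambda})}, & y = 1, 2, \ldots \end{cases} \tag{3}$$

where $\pi$ here denotes the probability of a zero count.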
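The three distributions named above can be written out as a short sketch using their standard textbook forms. The symbols `lam` (Poisson rate), `pi` (zero-inflation probability) and `pi_zero` (hurdle probability of a zero), as well as the parameter values, are illustrative assumptions and not estimates from the study. The sketch checks numerically that the Zero-Inflated Poisson mean and variance agree with the closed forms $\lambda(1-\pi)$ and $\lambda(1-\pi)(1+\lambda\pi)$.

```python
import math


def poisson_pmf(y: int, lam: float) -> float:
    """Poisson probability P(Y = y), computed on the log scale for stability."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def zip_pmf(y: int, lam: float, pi: float) -> float:
    """Zero-Inflated Poisson: extra mass `pi` at zero, Poisson otherwise."""
    if y == 0:
        return pi + (1 - pi) * math.exp(-lam)
    return (1 - pi) * poisson_pmf(y, lam)


def hurdle_pmf(y: int, lam: float, pi_zero: float) -> float:
    """Hurdle Poisson: P(Y = 0) = pi_zero, zero-truncated Poisson for y > 0."""
    if y == 0:
        return pi_zero
    return (1 - pi_zero) * poisson_pmf(y, lam) / (1 - math.exp(-lam))


def mean_and_variance(pmf, *params, upper: int = 200):
    """Moments by direct summation over 0..upper-1 (the tail is negligible here)."""
    probs = [pmf(y, *params) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    second_moment = sum(y * y * p for y, p in enumerate(probs))
    return mean, second_moment - mean ** 2


lam, pi = 2.5, 0.3  # illustrative values, not fitted to any data
zip_mean, zip_var = mean_and_variance(zip_pmf, lam, pi)
print(zip_mean, lam * (1 - pi))                  # numerical vs closed-form mean
print(zip_var, lam * (1 - pi) * (1 + lam * pi))  # numerical vs closed-form variance
```

Because the variance exceeds the mean whenever `pi` is positive, the Zero-Inflated Poisson can accommodate both excess zeros and over-dispersion with a single extra parameter.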

Parameter Estimation
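The parameter estimates were obtained by maximum likelihood. Letting $n_0$ be the number of zeros in the sample $y_1, \ldots, y_n$, the likelihood function of the Zero-Inflated Poisson distribution is given as

$$L(\pi, \lambda) = \left[\pi + (1 - \pi)e^{-\lambda}\right]^{n_0} \prod_{y_i > 0} (1 - \pi)\frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \tag{4}$$

where $n$ is the sample size. The values of $\pi$ and $\lambda$ which maximize the likelihood function are obtained by setting the partial derivatives of the log-likelihood to zero, for which

$$\hat{\pi} = \frac{n_0/n - e^{-\hat{\lambda}}}{1 - e^{-\hat{\lambda}}} \tag{5}$$

and

$$\hat{\lambda} = \left(\frac{\sum_{i=1}^{n} y_i}{n - n_0}\right)\left(1 - e^{-\hat{\lambda}}\right) \tag{6}$$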
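Without covariates, the likelihood equations of the Zero-Inflated Poisson distribution have no closed-form solution for the rate, but the rate satisfies the fixed-point relation "rate = (mean of the positive counts) x (1 - exp(-rate))", which can be iterated. The following is a minimal sketch under that assumption; the function name `fit_zip` and the sample are hypothetical, and the study's own models additionally include covariates, which this sketch does not handle.

```python
import math


def fit_zip(counts, tol=1e-10, max_iter=1000):
    """Maximum likelihood estimates (lam, pi) of a covariate-free ZIP model."""
    n = len(counts)
    n0 = sum(1 for y in counts if y == 0)
    if n0 == n:
        raise ValueError("all counts are zero; the rate is not identifiable")
    mean_positive = sum(counts) / (n - n0)
    if mean_positive <= 1:
        raise ValueError("positive counts are all 1; the fixed point degenerates to 0")

    # Fixed-point iteration: lam = mean_positive * (1 - exp(-lam)).
    lam = mean_positive
    for _ in range(max_iter):
        updated = mean_positive * (1 - math.exp(-lam))
        converged = abs(updated - lam) < tol
        lam = updated
        if converged:
            break

    # At the optimum the fitted zero probability equals the observed share n0 / n.
    # A negative pi would signal fewer zeros than a plain Poisson expects.
    pi = (n0 / n - math.exp(-lam)) / (1 - math.exp(-lam))
    return lam, pi


# Hypothetical sample with many zeros (not the study's data).
sample = [0] * 10 + [1, 2, 3, 4, 2, 3, 1, 5, 2, 3]
lam_hat, pi_hat = fit_zip(sample)
print(lam_hat, pi_hat)
```

A useful check on any such fit is that the fitted mean, (1 - pi) x lam, reproduces the sample mean, and that the fitted probability of zero reproduces the observed proportion of zeros.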

Model Selection
The study used the Akaike Information Criterion (AIC) as the model selection measure to select the model that best fits the claims data. The model with the smallest value of the information criterion is deemed to give the best fit to the domestic package insurance claims data. Letting $k$ be the number of model parameters and $L$ the maximized value of the likelihood function, the AIC is given as

$$\mathrm{AIC} = 2k - 2\ln L \tag{7}$$
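The selection rule can be sketched in a few lines. The log-likelihood values and parameter counts below are hypothetical placeholders, not the values reported in the study; the parameter counts simply assume an intercept plus six covariates per model component.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) pairs, for illustration only.
fits = {
    "Poisson": (-612.4, 7),
    "ZIP": (-548.9, 14),
    "Hurdle Poisson": (-551.2, 14),
}

scores = {name: aic(loglik, k) for name, (loglik, k) in fits.items()}
best_model = min(scores, key=scores.get)  # smallest AIC wins
print(scores)
print(best_model)
```

The penalty term 2k means that a model with more parameters is preferred only when its log-likelihood improves by more than the number of parameters added.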

Model Diagnostics
To evaluate the goodness of fit of the fitted distributions, the study used the Pearson Chi-Square, Pearson & Residual Deviance, Hosmer-Lemeshow and the Cameron Trivedi test statistics.

Introduction
This chapter presents the results of fitting the zero-inflated and hurdle models to claim frequency data, a case of Madison Insurance.

Descriptive Data Analysis
The exploratory data analysis gives a preliminary account of the findings of the study and is summarized in Table 1. A total of 664 claims on domestic package insurance for the period Jan 2014 to Dec 2018 were used for the study, and they are visualized in Figure 1. The minimum, median, mean, maximum, 1st quartile and 3rd quartile number of claims made are 0, 1, 3, 14, 0 and 4 respectively. Skewness, kurtosis and variance are 1.2937, 0.8882 and 10.0644 respectively. The mean number of claims is higher than the median, which implies that most of the weekly counts lie to the left of the mean and that the extreme claim frequency values lie to its right. The skewness is greater than zero, so the claims frequency data follow a right-skewed distribution. The kurtosis is less than 3, which, if it is measured on the raw rather than the excess scale, indicates a platykurtic rather than a leptokurtic shape. The standard deviation of 3.1724 (the square root of the variance) is of the same order as the mean number of claims, which implies that the weekly counts vary considerably relative to their average.
The variance of the claims data was found to be higher than the mean, which violates the equi-dispersion property of the Poisson distribution and indicates over-dispersion in the data. The hanging rootogram in Figure 2 was used to examine the dispersion in the claims data at different counts.
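The comparison of the variance with the mean, and of the observed share of zeros with what a Poisson distribution would predict, can be sketched as follows. The weekly counts are hypothetical and serve only to show the calculation; they are not the Madison Insurance data.

```python
import math
import statistics

# Hypothetical weekly claim counts, for illustration only.
weekly_claims = [0, 0, 0, 1, 0, 2, 5, 0, 3, 0, 1, 8, 0, 4, 0, 0, 2, 6, 0, 1]

mean = statistics.mean(weekly_claims)
variance = statistics.variance(weekly_claims)  # sample variance (n - 1 denominator)

# Equi-dispersion means variance / mean is close to 1; values above 1
# point to over-dispersion.
dispersion_index = variance / mean

# Under a Poisson distribution with the same mean, P(Y = 0) = exp(-mean).
observed_zero_share = weekly_claims.count(0) / len(weekly_claims)
poisson_zero_prob = math.exp(-mean)

print(mean, variance, dispersion_index)
print(observed_zero_share, poisson_zero_prob)
```

An observed share of zeros well above `exp(-mean)` is an informal sign of zero-inflation, and a dispersion index above 1 is an informal sign of over-dispersion; formal tests are still needed to confirm either.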

Fitted Model Coefficients
To model the insurance claims, the Poisson, Hurdle Poisson and Zero-Inflated Poisson models were fit to the data by regressing the number of domestic package insurance claims made on the covariates. The regression covariates included Fire, Burglary, Theft, Explosion, Riots & Strikes and Floods.
The parameters of the fitted models were estimated by maximum likelihood and are given in Table 2. The estimated mean value of the claims data had a 17.69% positive influence on the number of claims made. All the rating factors had a significant effect on the number of claims made.
For the Zero-Inflated Poisson with Binomial logit-link model (ZIP-B), a unit increase in the number of claims recorded due to Fire, Burglary, Theft, Explosion, Riots and Floods was associated with a respective change of -33.58, -33.51, -13.11, -36.95, -17.72 and -16.33 units in the zero-inflation component. Since these coefficients are negative and act on the logit scale, each corresponds to a decrease in the log-odds of an excess zero. In this case, a unit increase in the estimated mean number of claims was associated with a 13.86 unit increase on the same scale.
For the truncated Hurdle Model (HP-T) used to model the non-zero claims, Fire, Burglary, Theft, Explosion, Riots and Floods had a 19.05%, 19.32%, 22.76%, 27.40%, 18.73% and 20.97% respective chance for the insured to make additional non-zero claims on top of the already positive number of claims made. The estimated mean value of the claims data had a 25.10% positive influence on the number of claims made. All the rating factors had a significant effect on the number of claims made for these log-link hurdle coefficients.
For the Zero-Hurdle Poisson model (HP-B), a unit increase in the number of claims recorded due to Fire, Burglary, Theft, Explosion, Riots and Floods had a respective 2.2951, 1.3395, 0.8053, 1.2841, 1.2850 and 2.0840 unit increase in the number of zero claims made.
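Coefficients from a log-link count component and from a logit-link zero component are usually easier to read after exponentiation, as rate ratios and odds ratios respectively. The sketch below shows the conversion. It assumes, as a labeled assumption and not as something stated in the study, that the 19.05% reported for Fire in the HP-T component corresponds to a log-link coefficient of 0.1905; the value 2.2951 is the Fire coefficient reported for the HP-B component.

```python
import math


def rate_ratio(log_link_coef: float) -> float:
    """Multiplicative change in the expected count per unit increase in a covariate."""
    return math.exp(log_link_coef)


def odds_ratio(logit_coef: float) -> float:
    """Multiplicative change in the odds per unit increase in a covariate."""
    return math.exp(logit_coef)


# Assumption: 19.05% for Fire (HP-T) read as a log-link coefficient of 0.1905.
fire_count_coef = 0.1905
# Fire coefficient reported for the binomial (logit-link) hurdle component (HP-B).
fire_zero_coef = 2.2951

print(round(rate_ratio(fire_count_coef), 4))
print(round(odds_ratio(fire_zero_coef), 2))
```

A rate ratio above 1 means the expected count rises with the covariate, while an odds ratio above 1 means the odds of the outcome modeled by the binomial component rise; which outcome that is (a zero or a non-zero count) depends on how the fitting software parameterizes the component.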

Results Discussion
This study focused on modeling the number of insurance claims under the domestic package insurance policy sold by Madison Insurance Company Ltd, Kenya, with the aim of estimating the model parameters so as to aid insurance decision making. The Deviance and Pearson residuals, the residual probability plots, and the AIC and log-likelihood model selection measures were used to aid the data exploration. Table 3 gives a summary of the Deviance and Pearson residuals of the models fitted to the insurance claims frequency data. The median Deviance and Pearson residuals of the fitted models are close to zero and the residuals are to some extent symmetrical, which implies that the fitted models were not biased in modeling the claims data, i.e. they neither overestimated nor underestimated the model parameters.

Residual Probability Plots
As a measure of the goodness of fit of the distributions fitted to the claims frequency data, quantile-quantile plots were used to give a visual representation of the residuals, as given in Figure 3. A careful observation of the quantile-quantile plots of the data samples and the fitted models revealed that, compared to the Poisson model, the Hurdle Poisson and the Zero-Inflated Poisson models fitted the data well, with only a few outliers above and below the 45-degree reference line. This indicated the need to use zero-inflated and dispersed models for claims frequency data sets that are characterized by many zeros.

Test for Dispersion
The Cameron & Trivedi (CT-1990) test was used to test for dispersion in the data. The test gave a dispersion value of 1.140364, which is greater than the reference value of 1 (equi-dispersion) for the Poisson distribution. This confirmed that the dispersion in the data is greater than the Poisson distribution can handle, hence the need to fit extensions of the Poisson distribution to the insurance claims data.
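One common regression-based form of the Cameron and Trivedi test can be sketched as follows. It tests the null hypothesis Var(Y) = mu against Var(Y) = (1 + alpha) mu by averaging the auxiliary quantity ((y - mu)^2 - y) / mu over the sample. Whether the study used exactly this variant is an assumption; the counts are hypothetical, and the fitted means come from an intercept-only Poisson model (the sample mean) rather than from the study's regression.

```python
import math
import statistics


def cameron_trivedi(counts, fitted_means):
    """Return (dispersion estimate, z statistic) for H0: Var(Y) = mu."""
    n = len(counts)
    # Auxiliary variable whose mean is alpha under Var(Y) = (1 + alpha) * mu.
    aux = [((y - mu) ** 2 - y) / mu for y, mu in zip(counts, fitted_means)]
    alpha = statistics.mean(aux)
    z_stat = math.sqrt(n) * alpha / statistics.stdev(aux)
    return 1 + alpha, z_stat


# Hypothetical weekly claim counts, for illustration only.
weekly_claims = [0, 0, 0, 1, 0, 2, 5, 0, 3, 0, 1, 8, 0, 4, 0, 0, 2, 6, 0, 1]

# Intercept-only Poisson fit: every fitted mean equals the sample mean.
mu_hat = statistics.mean(weekly_claims)
dispersion, z_stat = cameron_trivedi(weekly_claims, [mu_hat] * len(weekly_claims))
print(dispersion, z_stat)
```

A dispersion estimate above 1 together with a large positive z statistic points to over-dispersion relative to the Poisson model; the z statistic is compared with the standard normal distribution.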

Model Selection
Model selection was based on the information criterion. Table 4 gives the AIC values of the fitted models and the associated log-likelihoods. The Zero-Inflated Poisson gave a better fit to the data, as it had the lowest AIC among the fitted models. This was confirmed by the Hosmer-Lemeshow test, which supported the appropriateness of the Zero-Inflated Poisson for modeling insurance claims frequency data, as in Table 5.

Claims Frequency Prediction
Since the Zero-Inflated Poisson gave a better fit to the data, it was used to predict the future number of claims, as given in Figure 4. Claims were predicted for a period of 24 weeks (8 months). The maximum and minimum predicted claims were seven (7) and zero (0) respectively. Hence at any given time the insurer would expect a maximum of seven (7) claims and a minimum of zero (0) claims. The expected mean number of claims was found to be two (2), with a standard deviation of 2.2161. This implies that in any particular time interval the insurer (Madison Insurance Company Ltd.) would expect about two claims on average under the domestic package insurance policy, which enables the insurer to set aside enough reserves to cater for the anticipated claims.
In order to determine the association between the observed and the predicted number of domestic package insurance claims, the Pearson Chi-Square test was used. Table 6 gives a summary of the Pearson Chi-Square test. The observed and the expected number of domestic package insurance claims were found to be statistically significantly associated. This was evident from the Pearson Chi-Square p-value, which was close to zero (p-value ≈ 0), implying the presence of association between the observed and expected number of claims. This suggests that the observed and the expected claims came from the same distribution, which supports the appropriateness of the Zero-Inflated Poisson for the modeling and prediction of insurance claims.

Introduction
This chapter is the final stage of the study; it gives conclusions to the findings and recommendations for future research.

Conclusion
Modeling is vital for the fair pricing of insurance products in the insurance industry. It helps determine the amount of premium to be charged and the appropriate policy exclusions (inclusions) for any given insurance policy. The study applied the Poisson, Hurdle Poisson and Zero-Inflated Poisson models to domestic package insurance claims frequency data, in which a total of 664 claims from Madison Insurance for the period Jan 2014 to Dec 2018 were used.
The study concludes that count data with indistinguishable sources of excess zeros should be analyzed with zero-inflated regression models. The reason is that these models give a better account of the heterogeneity of claim occurrence, zero-inflation and dispersion associated with such data sets. This study proposes the Zero-Inflated Poisson model as one of the zero-inflated models that can be used to model count data outcomes with excess zeros, such as the number of insurance claims.

Recommendations
Claim count modeling is one of the important steps in the insurance rate making process. More work is needed on modeling zero-inflated count data, such as insurance data sets, by extending these modeling techniques to other types of models for zero-inflated and dispersed counts. These would include the Generalized Quasi Poisson and the Discrete Weibull models, which may give a better fit of the model parameter estimates. This study applied the Poisson, Zero-Inflated Poisson and Hurdle Poisson models to the number of domestic package insurance claims. To improve on this application, other techniques such as bootstrapping can be considered in future research, with an application to longitudinal data with indistinguishable sources of excess zeros.