Posterior Predictive Checks for the Generalized Pareto Distribution Based on a Dirichlet Process Prior

Extreme value modelling is widely applied in situations where accurate assessment of the behavior of a process at high levels is needed. The inherent scarcity of extreme value data, the natural objective of predicting future extreme values of a process, and the regularity assumptions required by the likelihood and probability weighted moments methods of parameter estimation within the frequentist framework make it imperative for a practitioner to consider Bayesian methodology when modelling extremes. Within the Bayesian paradigm, a widely used tool for assessing the fit of a model is the posterior predictive check (PPC), which compares the posterior predictive distribution of future observations to the historical data. Posterior predictive inference involves the prediction of unobserved variables in light of observed data. This paper considers posterior predictive checks for assessing model fit for the generalized Pareto model based on a Dirichlet process prior. The posterior predictive distribution for the Dirichlet process based model is derived. Threshold selection is done by minimizing the negative differential entropy of the Dirichlet distribution. Predictions are drawn from the Bayesian posterior distribution by Markov chain Monte Carlo simulation (Metropolis-Hastings algorithm). Two graphical measures of discrepancy between the predicted observations and the observed values commonly applied in practical extreme value modelling are considered: the cumulative distribution function and quantile plots. To support these, the Nash-Sutcliffe coefficient of model efficiency, a numerical measure that evaluates the error in the predicted observations relative to the natural variation in the observed values, is used. Finite sample performance of the proposed procedure is illustrated through simulated data.
The results of the study suggest that posterior predictive checks are reasonable diagnostic tools for assessing the fit of the generalized Pareto distribution. In addition, the posterior predictive quantile plot seems to be more informative than the probability plot. Most interestingly, selecting the threshold by minimizing the negative differential entropy of a Dirichlet process has the added advantage of allowing the analyst to estimate the concentration parameter from the model, rather than specifying it as a measure of his/her belief in the proposed model as a prior guess for the unknown distribution that generated the observations. Lastly, the results of application to real life data show that the distribution of the annual maximum inflows into the Okavango River at Mohembo, Botswana, can be adequately described by the generalized Pareto distribution.


Introduction
Modelling extreme values is widely applied in situations, like civil engineering design, where accurate assessment of the behavior of a process at high levels is required. In the statistics literature, most of the work on extreme value analysis and modelling is based on the likelihood or frequentist approach. However, there are several reasons why a statistician might prefer Bayesian methodology to the frequentist approach when modelling extreme values: (i) extreme value data are inherently scarce, making it natural for the statistics practitioner to consider including other available prior information in the analysis; (ii) the main objective of an extreme value analysis is to estimate the probability of future events reaching extreme levels and to predict high quantiles, and one approach to prediction is through the predictive distribution, which is naturally Bayesian; and (iii) the Bayesian methodology does not require the regularity assumptions demanded by the maximum likelihood and probability weighted moments methods of estimation. It is for these reasons that the present paper focuses on using posterior predictive checks for assessing model fit for the generalized Pareto distribution, a family of continuous probability distributions that is popularly used to model the tails of distributions of extreme values. The reader is referred to Coles and Powell [7] for a comprehensive review of and new developments in the utility of a Bayesian approach to the inference of extremes.
In applications, there are two widely adopted approaches to modelling extreme values of a time series of independent observations: the block maxima method, in which the data are subdivided into sequences of observations of size $n$ each, for some large $n$, resulting in a series of block maxima, which are commonly modelled using the generalized extreme value (GEV) distribution; and the peaks over threshold (POT) approach, in which all exceedances over a specified high threshold are modelled. In this paper, our main interest is in predicting high quantiles, and, as a result, we consider the POT approach, in which all observations in the series $x_1, x_2, \ldots, x_n$ which exceed a high threshold, say $t$, are modelled. The POT methods are based on fitting a stochastic model to either the exceedances or the peaks over a threshold [9]. We discuss the fit of the generalized Pareto distribution (GPD), the most popular model when fitting data using the POT method [9,23,35]. The advantage of the POT method in modelling extremes is that more observations are incorporated into the analysis [28].
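As a minimal illustration of the POT idea described above (hypothetical helper name; not code from this study), the exceedances over a threshold can be extracted and shifted to excesses as follows:

```python
import numpy as np

def peaks_over_threshold(x, t):
    """Return the excesses y_i = x_i - t for all observations exceeding t."""
    x = np.asarray(x, dtype=float)
    return x[x > t] - t

# Illustrative series (simulated; not the paper's data)
rng = np.random.default_rng(0)
series = rng.exponential(scale=10.0, size=1000)
excesses = peaks_over_threshold(series, t=30.0)  # only values above 30, shifted down by 30
```

A GPD is then fitted to `excesses` rather than to the full series.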
There are many model assessment diagnostics in the literature, but no overall goodness of fit test exists and the scientist should specify his preference [21]. One of the main Bayesian methods for assessing the fit of parametric models involves embedding the parametric family into a larger family called the extended model [4,6,24]. This method is more useful if the larger family is infinite-dimensional, so that the extended model becomes nonparametric [37]. The Dirichlet process of Ferguson is one of the many infinite-dimensional models used in Bayesian inference for generating random distribution functions [13]. It is widely used in Bayesian inference as a prior on a set of probability distributions of random variables, enabling the statistician to consider certain nonparametric problems from a Bayesian approach.
To assess the fit of the model mentioned above, we consider diagnostics based on predictions generated by the model. Predictions are drawn from the Bayesian posterior predictive distribution by Markov chain Monte Carlo simulation (Metropolis-Hastings algorithm). Comparing the posterior predictive distribution to the observed values is generally referred to as a posterior predictive check [14,17,19,32,33]. One of the advantages of this approach is that the posterior predictive distribution reflects both the parametric uncertainty (via the prior specification) and the sampling uncertainty (via the sampling distribution of the future observation). If the model fits the data well, then its predictions should resemble the data; large discrepancies between the observed data and the predicted observations indicate poor fit. Posterior predictive checks for assessing statistical models have been gaining currency as one of several existing techniques for Bayesian evaluation of model goodness of fit. For example, Mmino et al. [27] have discussed a general statistical procedure for checking the goodness-of-fit of an admixture model to genomic data based on posterior predictive checks. The authors find that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies. Another study investigates the sensitivity of posterior predictive checks to prior specification for item response theory models and finds that the PPC hit rates appear to be influenced by prior specification, but only in certain circumstances, and that the effect of prior informativeness is tied to the choice of discrepancy measure [1].
Fawcett and Walshaw [13] demonstrate predictive inference for return levels of wind speed and sea-surge extremes and recommend the posterior predictive return level as the most convenient and useful representation of a return level, one which provides practitioners with a point summary capturing estimation uncertainty [13]. A large scale simulation study reveals the superiority of the predictive return level over the other posterior summaries in many cases of practical interest [14]. Despite the utility of posterior predictive checks for assessment of model-data fit, the literature shows that not much work has been done in the area of Bayesian inference for extreme values. The present study seeks to address this gap by considering posterior predictive checks for assessing model fit for the generalized Pareto model based on a Dirichlet process prior.
In this paper graphical PPCs and a numerical measure of discrepancy between predicted observations and observed values are both considered. Two graphical diagnostics are used: the cumulative distribution function of the predicted distribution (PD) overlaid on the estimated GPD and, a Q-Q plot of the predicted quantiles against the observed values. The numerical measure of discrepancy is given by the Nash-Sutcliffe coefficient of model efficiency, which measures the error in the predicted observations relative to the natural variation in the observed values [29].
The paper is organized as follows. Section 2 presents the generalized Pareto model. Section 3 provides derivations of the Dirichlet process, the negative differential entropy of the Dirichlet distribution, a statement of some important results for the negative differential entropy of the Dirichlet, and a description of the method of threshold selection. Derivation of the posterior predictive distribution and model assessment tools are the subject of section 4. Section 5 presents an application on the modelling of annual maximum water inflows into the Okavango River at Mohembo, Botswana, during the period 1975-2003. Section 6 contains conclusions on the study.

The Generalized Pareto Model
Let $X_1, X_2, \ldots, X_n$ be a series of $n$ independent random variables with a common distribution function $F$. Extreme events are defined by identifying a high threshold, say $t$, and considering the $k$ observations that exceed $t$ in a sample of $n$ observations of the process. Threshold selection remains an active area of research in extreme value theory. De Waal proposed choosing the threshold through minimization of the entropy of the Dirichlet process when applying the POT method [10]. We follow this approach.
To formulate the model, consider the asymptotic behavior of the series of sample maxima $M_n = \max(X_1, X_2, \ldots, X_n)$. Under broad conditions, the distribution of $M_n$, suitably normalized, converges to the generalized extreme value (GEV) distribution

$$G(x) = \exp\left\{ -\left[ 1 + \gamma \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\gamma} \right\}, \qquad 1 + \gamma (x - \mu)/\sigma > 0. \qquad (1)$$

The parameters $\mu$ and $\sigma$ are location and scale parameters, while $\gamma$ is a shape parameter which determines the weight of the tail of $G$, and hence of $F$. The current study is based on the generalized Pareto distribution, which has an interpretation as a limiting distribution similar to the GEV family [9]. In fact, the GPD considered here belongs to the domain of attraction in which the shape parameter in (1) is positive, i.e. $\gamma > 0$. To implement the POT method, denote by $Y = X - t$ the excess over the threshold, and by $F(y \mid t)$ the distribution of the excesses, given that $t$ is exceeded. Pickands has shown that the conditional distribution of threshold exceedances follows the generalized Pareto distribution, with distribution function (df) given by

$$G(y \mid t) = 1 - \left( 1 + \frac{\gamma y}{\sigma_t} \right)^{-1/\gamma}, \qquad y > 0, \qquad (2)$$

where $\sigma_t$ and $\gamma$ are the GEV parameters of model (1), with $\mu = 0$. If we let $x_0 \leq \infty$ denote the upper endpoint of $F$, and define $F(y \mid t)$ as above, Pickands showed that the GPD is a good approximation of $F(y \mid t)$, for some fixed $\gamma$ and $\sigma_t$, if and only if $F$ is in the domain of attraction of one of the three extreme value limit laws [31]. Inference based on model (2) is generally superior since it applies to more data [8]. Parameters of model (2) are estimated via MCMC simulation using the Metropolis-Hastings algorithm.
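The estimation step just mentioned can be sketched as follows. This is an illustrative random-walk Metropolis sampler for the GPD parameters $(\sigma_t, \gamma)$ on the log scale, under assumed flat priors on the original scale; the function names, starting values, and step size are assumptions for the sketch, not the exact sampler used in the study:

```python
import numpy as np

def gpd_loglik(y, sigma, gamma):
    """GPD log-likelihood for excesses y, assuming gamma > 0 (Frechet-Pareto case)."""
    if sigma <= 0 or gamma <= 0:
        return -np.inf
    z = 1.0 + gamma * y / sigma          # always > 0 here since y, sigma, gamma > 0
    return -len(y) * np.log(sigma) - (1.0 / gamma + 1.0) * np.sum(np.log(z))

def mh_gpd(y, n_iter=5000, step=0.1, seed=1):
    """Random-walk Metropolis on (log sigma, log gamma).  Flat priors on (sigma, gamma)
    contribute a Jacobian term log(sigma * gamma) to the log-parameter target."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    theta = np.log([y.mean(), 0.5])      # crude starting values
    log_target = lambda lt: gpd_loglik(y, *np.exp(lt)) + lt.sum()
    cur = log_target(theta)
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal(2)
        new = log_target(prop)
        if np.log(rng.uniform()) < new - cur:   # Metropolis accept/reject
            theta, cur = prop, new
        draws[i] = np.exp(theta)                # store on the (sigma, gamma) scale
    return draws
```

Sampling on the log scale keeps both parameters positive without explicit boundary handling.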

The Dirichlet Process
Partition the sample space according to the largest $k$ ordered observations exceeding a threshold $t$, and let $P_1, P_2, \ldots, P_{k+1}$ denote the probabilities of the resulting cells. Clearly, $P_1, P_2, \ldots, P_{k+1}$ are random variables, and have jointly the $k$-variate Dirichlet distribution with parameters proportional to the increments of a prior guess $G(x \mid t)$ over the cells. That is, if this Dirichlet distribution is taken as a prior for $F(x \mid t)$, then, a priori, $F(x \mid t)$ is said to be described through a Dirichlet process with parameters $(\beta, G(x \mid t))$. The parameter $\beta$ is called the concentration parameter and measures the analyst's belief in $G(x \mid t)$ as a prior for $F(x \mid t)$. It has been shown that in order for the negative differential entropy to reach its lower bound for a given $k$, $\beta$ has to be proportional to $k$, say $\beta = \alpha(k+1)$ for some real $\alpha > 0$. The choice of $\alpha$, also called the concentration parameter, for a given $k$ is obtained by minimizing the negative differential entropy of the Dirichlet distribution. From the above setup it follows that, if we assume a priori that a certain threshold model, say the generalized Pareto with df $G(x \mid t)$, best fits the data, then by Bayes' theorem, a posteriori, $F(x \mid t)$ is described through a Dirichlet process whose updated parameters combine $F_k(x \mid t)$, the empirical distribution function of the random variable $X$, and $G(x \mid t)$, the distribution function of the proposed model.

The Negative Differential Entropy of the Dirichlet Distribution
Shannon's seminal discovery of a quantitative measure for entropy came in connection with communication theory, and very soon his concept of information had a pervasive influence on several other disciplines [22]. These disciplines include statistics. According to Shannon, a measure of the entropy (uncertainty) of a probability distribution reflects the amount of ignorance in our state of knowledge. Conversely, a measure of the negative entropy (lack of uncertainty) of a probability distribution reflects the amount of information in our state of knowledge. Though originally defined for a discrete random variable, Shannon's entropy can be extended to the case of a continuous random variable. Then the entropy is referred to as differential entropy, and its negative as negative differential entropy. We concentrate on the latter.
Suppose the random variables $P_1, P_2, \ldots, P_{k+1}$ have jointly the $k$-variate Dirichlet distribution with parameters $\alpha_1, \alpha_2, \ldots, \alpha_{k+1}$ and density

$$f(p_1, \ldots, p_k) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k+1} \Gamma(\alpha_i)} \prod_{i=1}^{k+1} p_i^{\alpha_i - 1}, \qquad \alpha_0 = \sum_{i=1}^{k+1} \alpha_i, \quad p_{k+1} = 1 - \sum_{i=1}^{k} p_i. \qquad (3)$$

Then, Honkela showed that the negative differential entropy (NDE), defined as the negative of the differential entropy of (3), is given by

$$\mathrm{NDE} = -\log B(\boldsymbol{\alpha}) - \big(\alpha_0 - (k+1)\big)\,\psi(\alpha_0) + \sum_{i=1}^{k+1} (\alpha_i - 1)\,\psi(\alpha_i), \qquad (5)$$

where $B(\boldsymbol{\alpha}) = \prod_{i=1}^{k+1} \Gamma(\alpha_i)/\Gamma(\alpha_0)$ and $\psi$ is the digamma function.
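The NDE of a Dirichlet distribution can be computed numerically from its parameter vector; a small sketch (illustrative function name) using SciPy's log-gamma and digamma functions:

```python
import numpy as np
from scipy.special import gammaln, psi

def dirichlet_nde(alpha):
    """Negative differential entropy of Dirichlet(alpha_1, ..., alpha_K)."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    a0 = alpha.sum()
    log_B = gammaln(alpha).sum() - gammaln(a0)       # log of the multivariate beta function
    entropy = log_B + (a0 - K) * psi(a0) - np.sum((alpha - 1.0) * psi(alpha))
    return -entropy
```

For the uniform case (all parameters equal to 1 with two cells) the density is uniform on [0, 1], so the differential entropy is 0 and the NDE is 0; concentrating the parameters increases the NDE.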

Some Results for the Negative Differential Entropy of the Dirichlet Distribution
We have established the following results for the NDE of the Dirichlet distribution. The proofs have been left out for lack of space.
i. The NDE is a convex function of k.
ii. The NDE reaches a minimum, for a given k, when the Dirichlet parameters are all equal.
iii. For the NDE to reach its lower bound for a given k, β has to be proportional to k, say $\beta = \alpha(k+1)$ for some real $\alpha > 0$.

Threshold Selection
The results stated in section 3.2 are useful in selecting the threshold when applying the POT method. The threshold is selected by finding the number of observations, $k$, above the threshold $t$ that minimizes the negative differential entropy of the Dirichlet distribution. In practice, plot the relative NDE against $k$ and select the optimal $k$ as that which corresponds to the minimum of the relative NDE. The relative NDE is obtained by dividing the NDE in (5) by its lower bound. The parameters of the Dirichlet distribution, and (by implication) $\alpha$ or $\beta$ in (5), are estimated by MCMC simulation.
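The scan over candidate values of $k$ can be sketched as follows. The function names and the exponential prior guess in the test are illustrative assumptions, and the profile uses the raw NDE of (5) rather than the relative NDE, since the lower bound's closed form is not reproduced in this section; the minimizing $k$ plays the same role:

```python
import numpy as np
from scipy.special import gammaln, psi

def dirichlet_nde(alpha):
    """Negative differential entropy of a Dirichlet with parameter vector alpha (eq. (5))."""
    alpha = np.asarray(alpha, dtype=float)
    a0, K = alpha.sum(), alpha.size
    h = (gammaln(alpha).sum() - gammaln(a0)
         + (a0 - K) * psi(a0) - np.sum((alpha - 1.0) * psi(alpha)))
    return -h

def nde_profile(x, g_cdf, alpha, k_range):
    """NDE as a function of k, with beta = alpha*(k+1) as in the text.  The Dirichlet
    cell parameters are beta times the increments of the prior guess G over the top-k
    order statistics; g_cdf(y, t) should return G(y | t)."""
    x = np.sort(np.asarray(x, dtype=float))
    profile = {}
    for k in k_range:
        t = x[-(k + 1)]                  # threshold = (k+1)-th largest observation
        cells = np.diff(np.concatenate(([0.0], g_cdf(x[-k:], t), [1.0])))
        profile[k] = dirichlet_nde(alpha * (k + 1) * cells)
    return profile
```

The selected number of exceedances is then the minimizer, e.g. `min(profile, key=profile.get)`, and the threshold is the corresponding order statistic.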
Remark 1: Mazzuchi applied the Dirichlet process to introduce a Bayesian technique for assessing goodness of fit of models [28]. De Waal and Beirlant showed that the relative negative differential entropy of the Dirichlet process could possibly be used as a tool for model selection [11]. That is, given two competing models, Model 1 and Model 2, their relative NDEs can be compared, with the model attaining the smaller relative NDE being preferred.

Selection of Concentration Parameter
Once the threshold has been selected, we can determine the concentration parameter α . The choice of α for a given k is obtained by minimizing the negative differential entropy of the Dirichlet distribution. This leads to various selected values of α corresponding to various minimum values of the negative differential entropy for different values of k . Thus, to find the optimal value of α , plot the selected α 's against k as in Figure 4.

The Predictive Distribution
Remark 2: When $k = 1$, the Dirichlet density (3) reduces to the pdf of the beta distribution. Therefore, from the Dirichlet process, we have that, marginally, $F(x \mid t)$ follows a beta distribution, as in (4).
From theory, the predictive distribution of a future observation $x_f$ is given by the posterior expectation of $F(x_f \mid t)$. If we let $F(x \mid t) = u$, then we have that

$$P(X_f \leq x \mid t, \text{data}) = \int_0^1 u \, h(u \mid t, \text{data}) \, du, \qquad (7)$$

where $h(u \mid t, \text{data})$ denotes the posterior density of $F(x \mid t)$. Equation (7) is evaluated by MCMC simulation. To assess the fit of the model, we visually compare the predicted distribution with the observed values by a Q-Q plot of the predicted quantiles versus the observed values. If the model fits the data well, then its predictions should resemble the data, and we would expect the points to approximately follow a straight line; large discrepancies between the observed data and the predicted observations indicate poor fit.
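A sketch of how posterior predictive quantiles can be generated for the Q-Q comparison, assuming posterior draws of $(\sigma_t, \gamma)$ are available (e.g. from a Metropolis-Hastings run); the function names and plotting positions are illustrative assumptions:

```python
import numpy as np

def gpd_rvs(sigma, gamma, u):
    """Inverse-cdf draw of a GPD excess: y = sigma/gamma * ((1-u)^(-gamma) - 1)."""
    return sigma / gamma * ((1.0 - u) ** (-gamma) - 1.0)

def predictive_qq(draws, y_obs, seed=2):
    """For each posterior draw (sigma, gamma), simulate one future excess; the sorted
    simulated excesses approximate the posterior predictive distribution, whose
    quantiles are matched against the sorted observed excesses."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=len(draws))
    y_pred = np.sort(gpd_rvs(draws[:, 0], draws[:, 1], u))
    probs = (np.arange(1, len(y_obs) + 1) - 0.5) / len(y_obs)   # plotting positions
    pred_q = np.quantile(y_pred, probs)
    return np.sort(np.asarray(y_obs, dtype=float)), pred_q      # plot pred_q vs observed
```

Points near the diagonal in a scatter of `pred_q` against the sorted observations indicate good predictive fit.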

Model Assessment
To complement the Q-Q plot of the observed data and the predicted observations, we use a numerical measure of goodness of fit, the Nash-Sutcliffe model efficiency coefficient (due to [29]). The coefficient of efficiency measures the error in the predicted observations relative to the natural variation in the observed values, and is given by

$$NSE = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2},$$

where $O_i$ are the observed values, $P_i$ the predicted observations, and $\bar{O}$ the mean of the observed values; a value close to 1 indicates close agreement between predictions and observations.
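The coefficient can be computed directly from its definition; a minimal sketch (illustrative function name):

```python
import numpy as np

def nash_sutcliffe(observed, predicted):
    """NSE = 1 - sum((O_i - P_i)^2) / sum((O_i - mean(O))^2).
    Equals 1 for perfect prediction, 0 when predicting the observed mean."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((o - p) ** 2) / np.sum((o - o.mean()) ** 2)
```

Negative values indicate that the model predicts worse than simply using the observed mean.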

Application
In civil engineering design, extreme river discharges with very large return periods can be estimated by fitting various probability models to available observations. The need for a predictive hydrological model of the Okavango Delta was recognized long ago, and several modelling efforts have been carried out [2,18]; in particular, a statistical model has been presented to predict the extent and geographical area of the annual maximum flooding of the Okavango Delta from observed inflow and local precipitation. In this paper, we model the annual maximum water inflows into the Okavango River at Mohembo, Botswana, during the period 1975-2003, using the Dirichlet process based generalized Pareto model. The data are given in Figure 1. To check whether the distribution that generated the inflow data is in the Frechet-Pareto domain of attraction, we graph a Pareto quantile plot of the annual maximum inflows, given in Figure 2. The Pareto quantile plot in Figure 2 is approximately linear, but bends down at the very largest observations. This indicates a weaker behavior of the ultimate tail of the annual maximum inflows distribution. We can therefore reasonably conclude that the underlying distribution of the annual maximum inflows of the Okavango River at Mohembo belongs in the Frechet-Pareto class, and proceed to fit a generalized Pareto distribution. From equations (2) and (9), we obtain the posterior density of the GPD parameters, and we use the Metropolis-Hastings algorithm with proposal density q taken as lognormal. To select the threshold $t$ by minimizing the negative differential entropy of the Dirichlet process, we need estimates of the parameters of the Dirichlet process. This implies that we also need an estimate of the concentration parameter, $\beta = \alpha(k+1)$, or equivalently, of $\alpha$. The choice of $\alpha$ for a given $k$ is obtained by minimizing the negative differential entropy of the Dirichlet distribution.
This leads to various selected values of $\alpha$ corresponding to various minimum values of the negative differential entropy for different values of $k$. Thus, to find the optimal value of $\alpha$, plot the selected $\alpha$'s against $k$ as in Figure 4, and select as optimal the $\alpha$ corresponding to a region within which the estimates are stable. From Figure 4, we take $\hat{\alpha} = 0.26$, ignoring the first part of the graph below $k = 16$, where the estimates are quite unstable, and the last part of the graph after $k = 25$, where the threshold $t$ would be too low. Too low a threshold is likely to violate the asymptotic basis of the model, leading to bias; too high a threshold will generate few excesses with which the model can be estimated, leading to high variance [8]. For this value of $\alpha$, Figure 5 shows that the minimum value of the relative negative differential entropy is reached when $k = 16$, corresponding to a selected threshold of $t = 523.33$ m³/s. We therefore fit the generalized Pareto distribution to the $k = 16$ largest observations. The empirical distribution function, $F_k$, the proposed GPD model and the predictive (PD) distribution functions are given in Figure 7. Figure 7 shows that the GPD gives a good fit to the data at the very largest observations, but a poor fit at the lower largest observations. The posterior predictive distribution gives a reasonably good overall fit to the extreme data above $t$. However, a Q-Q plot of the predictive quantiles versus the observed values, hereafter referred to as a posterior predictive quantile plot, is more informative; see Figure 8. From Figure 8, both the predictive distribution (PD) quantiles and the estimated generalized Pareto distribution quantiles are consistent with the observed annual maximum inflow data for the Okavango River. Notice, however, that the estimated GPD quantiles and the predicted quantiles differ at the very largest observations.
This weaker behavior of the ultimate tail of the annual maximum inflows distribution was also observed in the Pareto quantile plot (Figure 2). Overall, the posterior predictive quantile plot shows good agreement between the predicted and observed values, implying that the generalized Pareto distribution fits the annual maximum inflow data reasonably well. The Nash-Sutcliffe coefficient of model efficiency for this dataset is found to be $NSE = 0.9952$ (very close to 1), indicating that the model produces predictions that closely resemble the observed data. This value of a numerical measure of discrepancy between the predicted and observed values further supports the evidence given graphically by the posterior predictive quantile plot. Thus, we can conclude that the Dirichlet process based generalized Pareto model fits the annual maximum inflow data very well.

Conclusions
The results from this investigation show that posterior predictive checks seem to be reasonable diagnostic tools for assessing the fit of the generalized Pareto distribution. In particular, the posterior predictive quantile plot seems to be more informative in assessing the fit of an extreme value model than the probability plot. Selecting a threshold by minimizing the negative differential entropy of a Dirichlet process has the added advantage of allowing the analyst to estimate the concentration parameter from the model, rather than specifying it as a measure of his or her belief in the proposed model as a prior guess for the unknown distribution that generated the observations. Lastly, the results of the analysis of the annual maximum inflows data show that the distribution of the annual maximum inflows into the Okavango River at Mohembo, Botswana, can be adequately described by the generalized Pareto distribution.