A Simple Conditional Approach for Generating Spatial Correlated Binary Data
Renhao Jin, Tao Liu, Fang Yan, Jie Zhu
School of Information, Beijing Wuzi University, Beijing, China
Renhao Jin, Tao Liu, Fang Yan, Jie Zhu. A Simple Conditional Approach for Generating Spatial Correlated Binary Data. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 4, 2015, pp. 305-311. doi: 10.11648/j.ajtas.20150404.21
Abstract: Generating a spatial random field in which the observations are binary random variables with a particular covariance function may be impossible, because there are restrictions on the parameters of Bernoulli variables. This paper develops a conditional method, based on the spatial generalized linear mixed model (GLMM), for generating spatially correlated binary data. The variograms of the simulated data are similar to the variograms of the corresponding latent Gaussian random field. However, a closed form for the spatial correlation of the simulated data is not available.
Keywords: Spatial Binary Data, Generalized Linear Mixed Model, Variogram
The main goal of this paper is to offer a method for generating spatially correlated binary data, called the conditional method, which is based on the spatial generalized linear mixed model (GLMM). Simulating spatial data is very important in theoretical research, because the worth of a spatial statistical method can be established convincingly only if the method proves satisfactory in the long run. In many cases, the assessment of spatial models is based mainly on simulated data. This paper focuses only on spatially correlated binary data, which are encountered in many applications ranging from epidemiology to forestry. Infectious disease data often have spatially clustered observations. In forestry, binary responses such as the presence or absence of some disease are often observed.
Generating a spatial random field is not a simple task unless it is a Gaussian random field (GRF). In particular, generating a random field in which the observations are binary random variables with a particular covariance function may be impossible, because there are restrictions on the parameters of Bernoulli variables. What can be done is to generate random deviates whose marginal moments (mean and variance) "behave like" those of binary variables (Schabenberger and Gotway, 2005, Chapter 7). Schabenberger and Gotway (2005) suggested the convolution representation method for generating spatially correlated binary data. However, their method can only simulate second-order stationary data, i.e., data with constant mean and constant variance for all observations.
Several authors have proposed methods for generating correlated binary data. These methods were studied, and attempts were made to extend them to spatially correlated binary data. However, the majority of them have limitations when it comes to generating spatially correlated binary data with non-constant mean. For example, Lunn and Davies (1998) presented a method for generating correlated binary variables with a very simple correlation structure; it is suitable for exchangeable correlation structures and is easily extended to autoregressive or stationary M-dependent structures. However, it is impossible to extend their method to general spatial correlation structures, and it only generates binary data with constant means.
Park et al. (1996) developed a method for generating correlated binary data based on generating correlated Poisson random variables, which are then recoded as zero or one. Their approach relies on the property that any Poisson random variable can be expressed as a convolution of several other independent Poisson random variables; the binary variables obtain the desired correlations by sharing common independent Poisson variables. Their method allows unequal means but only positive correlations, and thus might be extended to generate spatially correlated binary data. Park et al. (1996) did discuss some restrictions of their method. Firstly, for Bernoulli data there is a natural restriction on the correlation coefficient $\rho_{ij}$ between two binary variates $Z_i$ and $Z_j$ with means $p_i$ and $p_j$. Note that $E(Z_i Z_j) = P(Z_i = 1, Z_j = 1) \le \min(p_i, p_j)$. Therefore $\rho_{ij} \le \{\min(p_i, p_j) - p_i p_j\} / \sqrt{p_i q_i p_j q_j}$, where $q_i = 1 - p_i$ and $q_j = 1 - p_j$. So $\rho_{ij}$ is not free on $[-1, 1]$ but is constrained by $p_i$ and $p_j$. Based on this natural restriction, if $p_i$ varies a lot across observations, all the $\rho_{ij}$ will be bounded well below 1. A spatial correlation structure that satisfies this restriction is then difficult to find, because spatial correlations should decrease from 1 to 0 as distances increase. Park et al. (1996) did not spell out the restrictions of their method in full, but they gave three conditions under which the method succeeds in generating correlated binary data as desired. However, for spatially correlated binary data, even assuming a constant mean, these three conditions are still not easy to satisfy in a simulation algorithm.
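The natural restriction on the correlation of two Bernoulli variables can be checked numerically. The following is a minimal Python sketch, not taken from Park et al. (1996); the function name `max_binary_correlation` and the example means are illustrative assumptions.

```python
import math

def max_binary_correlation(p_i: float, p_j: float) -> float:
    """Upper bound on corr(Z_i, Z_j) for Bernoulli variables with means p_i and p_j."""
    q_i, q_j = 1.0 - p_i, 1.0 - p_j
    # E[Z_i Z_j] = P(Z_i = 1, Z_j = 1) cannot exceed min(p_i, p_j),
    # which caps the covariance at min(p_i, p_j) - p_i * p_j.
    return (min(p_i, p_j) - p_i * p_j) / math.sqrt(p_i * q_i * p_j * q_j)

# Equal means: the bound is 1, so any positive correlation is attainable.
equal_means = max_binary_correlation(0.3, 0.3)
# Unequal means (hypothetical values): the bound drops well below 1.
unequal_means = max_binary_correlation(0.1, 0.5)
print(equal_means, unequal_means)
```

With equal means the bound equals 1, whereas for means 0.1 and 0.5 it is only one third, which illustrates why a correlation function decaying smoothly from 1 is hard to reconcile with strongly varying means.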
In this paper, a conditional method based on the spatial GLMM is developed for generating spatially correlated binary variables; it does not have the shortcomings of the methods above. The conditional approach presented here is similar to the simulation method in Crainiceanu, Diggle and Rowlingson (2008).
2.1. Spatial GLMM
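To explain the conditional method for generating spatial binary data, the spatial GLMM is first described in detail. In the spatial GLMM, the spatial data $Z(s)$ are assumed to be conditionally independent given an underlying, smooth spatial process $S(s)$. Given $S(s)$, $Z(s)$ has a Bernoulli distribution given by

$$E[Z(s) \mid S(s)] = \mu(s) = \frac{\exp\{x(s)^\top \beta + S(s)\}}{1 + \exp\{x(s)^\top \beta + S(s)\}}, \qquad \mathrm{Var}[Z \mid S] = \sigma^2 V_\mu.$$

Here $S(s)$ is a Gaussian random field with mean 0 and covariance function $\sigma_S^2 \rho_S(s_i - s_j)$. Thus, the assumption of conditional independence defers treatment of spatial autocorrelation to the $S(s)$ process. $V_\mu$ is a diagonal matrix with $\mu(s_i)\{1 - \mu(s_i)\}$ on the diagonal, and $\sigma^2$ is the parameter for modeling over-dispersion in the data. In theory a conditional model has a marginal formulation, but closed-form marginal expressions for $E[Z(s)]$ and $\mathrm{Var}[Z(s)]$ are unavailable.

The marginal mean of $Z(s)$ for this model is

$$E[Z(s)] = E_S\{E[Z(s) \mid S(s)]\} = \int_{-\infty}^{\infty} \frac{\exp\{x(s)^\top \beta + t\}}{1 + \exp\{x(s)^\top \beta + t\}} f(t)\, dt. \tag{1}$$

Here $f$ is the probability density function of $S(s)$, so $f$ is a Gaussian density with mean 0 and variance $\sigma_S^2$, i.e. $N(0, \sigma_S^2)$. It is difficult to obtain a theoretical expression for $E[Z(s)]$, but its numerical value can be easily calculated using Riemann summation. For a continuous function $g$ on $[a, b]$, $\int_a^b g(x)\, dx$ always exists and can be computed by Riemann summation as

$$\int_a^b g(x)\, dx = \lim_{\max_i \Delta x_i \to 0} \sum_{i=1}^{n} g(x_i^*)\, \Delta x_i$$

for any choice of $x_i^*$ in $[x_{i-1}, x_i]$ with $a = x_0 < x_1 < \dots < x_n = b$, and $\Delta x_i = x_i - x_{i-1}$. In practice the infinite range in (1) is truncated to a finite interval outside which $f$ is negligible.

The variance of $Z(s)$ and the covariance function between $Z(s_i)$ and $Z(s_j)$ are as follows:

$$\mathrm{Var}[Z(s)] = E[Z(s)]\,\{1 - E[Z(s)]\}, \tag{2}$$

$$\mathrm{Cov}[Z(s_i), Z(s_j)] = E[Z(s_i) Z(s_j)] - E[Z(s_i)]\, E[Z(s_j)], \tag{3}$$

$$\mathrm{Corr}[Z(s_i), Z(s_j)] = \frac{\mathrm{Cov}[Z(s_i), Z(s_j)]}{\sqrt{\mathrm{Var}[Z(s_i)]\, \mathrm{Var}[Z(s_j)]}}, \tag{4}$$

where $E[Z(s_i) Z(s_j)] = E_S[\mu(s_i)\mu(s_j)]$ is a double integral of $\mu(s_i)\mu(s_j)$ against the bivariate Gaussian density of $(S(s_i), S(s_j))$.

The numerical value of (2) can be calculated from the numerical value of (1). The numerical value of $E[Z(s_i) Z(s_j)]$ can also be calculated by a Riemann summation, and thus the numerical values of (3) and (4) can be obtained. However, closed-form expressions for the mean and covariance of $Z(s)$ are not available for binary data generated by this conditional method.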
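The Riemann-sum evaluation of the marginal mean and variance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' SAS code: the linear predictor is taken as a constant `eta`, the latent variance is 1, the integral is truncated at eight standard deviations, and the names `marginal_mean`, `half_width` and `n` are hypothetical.

```python
import math

def normal_pdf(t: float, sd: float) -> float:
    """Density of N(0, sd^2) at t."""
    return math.exp(-0.5 * (t / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def expit(x: float) -> float:
    """Inverse logit, exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))

def marginal_mean(eta: float, sd: float = 1.0, half_width: float = 8.0, n: int = 4000) -> float:
    """Midpoint Riemann sum for E[Z] = integral of expit(eta + t) f(t) dt."""
    a, b = -half_width * sd, half_width * sd  # truncated integration range (assumption)
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        t = a + (i + 0.5) * dx                # midpoint of the i-th subinterval
        total += expit(eta + t) * normal_pdf(t, sd) * dx
    return total

p_zero = marginal_mean(0.0)         # symmetric case, should be very close to 0.5
p_neg = marginal_mean(-1.0)         # linear predictor of -1 with unit latent variance
var_neg = p_neg * (1.0 - p_neg)     # marginal variance of a binary variable
print(p_zero, p_neg, var_neg)
```

The marginal mean for a linear predictor of -1 comes out larger than exp(-1)/(1+exp(-1)), because averaging over the latent field pulls the mean toward 0.5. The same summation idea extends to the bivariate integral needed for the covariance, at the cost of a double loop.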
2.2. Algorithm of Conditional Method
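Based on the definition of the conditional GLMM above, the algorithm below generates spatially correlated binary data by the conditional method:
1. Generate $S(s_i)$, $i = 1, \dots, n$,
2. Obtain $\eta(s_i)$ by $\eta(s_i) = x(s_i)^\top \beta + S(s_i)$,
3. Obtain $\mu(s_i)$ by $\mu(s_i) = \exp\{\eta(s_i)\} / [1 + \exp\{\eta(s_i)\}]$,
4. Generate $Z(s_i)$ using a random number generator from $\mathrm{Bernoulli}(\mu(s_i))$.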
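The four steps can be sketched in Python with NumPy. This is an illustrative sketch under several assumptions rather than the authors' implementation: the latent field is drawn by Cholesky factorization of an exponential covariance matrix (the paper uses the SAS SIM2D procedure), the linear predictor is a constant `beta0 = -1` instead of the paper's uniform-random covariate, the grid origin is placed at 0, and all function and parameter names are hypothetical.

```python
import numpy as np

def simulate_conditional_binary(coords, beta0=-1.0, sill=1.0, practical_range=20.0, seed=0):
    """Draw a latent Gaussian field and conditionally independent Bernoulli data."""
    rng = np.random.default_rng(seed)
    n = len(coords)
    # Pairwise Euclidean distances between sites.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exponential covariance; phi is chosen so that the variogram reaches
    # 95% of the sill at the practical range (1 - exp(-h/phi) = 0.95).
    phi = practical_range / np.log(20.0)
    cov = sill * np.exp(-dist / phi)
    # Step 1: latent field S via Cholesky factor (small jitter for stability).
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))
    s = chol @ rng.standard_normal(n)
    # Step 2: linear predictor (constant fixed effect, an assumption).
    eta = beta0 + s
    # Step 3: conditional mean through the inverse logit.
    mu = 1.0 / (1.0 + np.exp(-eta))
    # Step 4: Bernoulli draws given the conditional mean.
    z = rng.binomial(1, mu)
    return s, mu, z

# 10 x 10 regular grid with spacing 4 (origin at 0 is an assumption).
axis = np.arange(10) * 4.0
xx, yy = np.meshgrid(axis, axis)
coords = np.column_stack([xx.ravel(), yy.ravel()])
s, mu, z = simulate_conditional_binary(coords)
print(z.reshape(10, 10))
```

Cholesky factorization is adequate for 100 sites; for much larger grids a spectral or sequential simulation method would usually be preferred.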
The algorithm above for simulating GLMM data is new, but it is very similar to that of Crainiceanu, Diggle and Rowlingson (2008). In the simulation part of their paper, they simulated binomial data and, in place of steps 1 and 2 of this algorithm, used a random-effects vector with a design matrix instead of a Gaussian random field $S(s)$.
2.3. Description of the Simulation Study
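Spatial binary data with sample size 100 were generated on a regular grid. The grid has 10 points in each direction (a square of side 36), with intervals of 4 in both directions. The maximum distance between the data points was 50.91, and half of this was 25.46. $S(s)$ was a zero-mean intrinsically stationary Gaussian process whose variogram was continuous at the origin. The Gaussian, exponential and spherical variograms were considered. The Gaussian and exponential variograms belong to the Matérn class of variogram functions, which with no nugget is given, in one common parameterization, by

$$\gamma(h) = c \left\{ 1 - \frac{1}{2^{\nu - 1}\, \Gamma(\nu)} \left( \frac{2\sqrt{\nu}\, h}{\phi} \right)^{\nu} K_\nu\!\left( \frac{2\sqrt{\nu}\, h}{\phi} \right) \right\}, \quad h > 0,$$

where $c$ is the sill, $\phi > 0$ is a range parameter, $\nu > 0$ is the smoothness parameter and $K_\nu$ is the modified Bessel function of the second kind of order $\nu$.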
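The smoothness of the process increases with $\nu$, and among the most commonly used parametric variogram models are the Gaussian ($\nu \to \infty$), Whittle ($\nu = 1$) and exponential ($\nu = 0.5$). The spherical variogram with sill $c$ and range $a$, given by

$$\gamma(h) = \begin{cases} c \left\{ \dfrac{3h}{2a} - \dfrac{1}{2} \left( \dfrac{h}{a} \right)^{3} \right\}, & 0 \le h \le a, \\ c, & h > a, \end{cases}$$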
is also commonly used. A nugget effect can be incorporated by adding a constant. Figure 1 gives an illustration. The spherical model attains its sill at a finite range, but the Matérn models approach their sill only asymptotically, and thus their practical ranges are defined as the distance at which 95% of the sill is attained.
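The three variogram models used in the simulation study can be written as short Python functions. This is a sketch under assumed parameterizations: the exponential model is taken as 1 - exp(-h/phi) and the Gaussian model as 1 - exp(-(h/rho)^2), which may differ from the authors' parameterization by a rescaling of the range parameter; the function names are illustrative.

```python
import math

def exponential_variogram(h: float, sill: float, phi: float) -> float:
    """Exponential model (Matern with smoothness 0.5), no nugget."""
    return sill * (1.0 - math.exp(-h / phi))

def gaussian_variogram(h: float, sill: float, rho: float) -> float:
    """Gaussian model (limit of the Matern class for large smoothness), no nugget."""
    return sill * (1.0 - math.exp(-((h / rho) ** 2)))

def spherical_variogram(h: float, sill: float, a: float) -> float:
    """Spherical model: reaches the sill exactly at the range a."""
    if h >= a:
        return sill
    r = h / a
    return sill * (1.5 * r - 0.5 * r ** 3)

# Range parameters giving a practical range of 20 (95% of the sill at h = 20).
phi = 20.0 / math.log(20.0)              # exponential: exp(-20/phi) = 0.05
rho = 20.0 / math.sqrt(math.log(20.0))   # Gaussian: exp(-(20/rho)^2) = 0.05
exp_at_20 = exponential_variogram(20.0, 1.0, phi)
gau_at_20 = gaussian_variogram(20.0, 1.0, rho)
sph_at_range = spherical_variogram(24.65, 1.0, 24.65)
print(exp_at_20, gau_at_20, sph_at_range)
```

The exponential and Gaussian models return 0.95 at distance 20 and only approach 1 for larger distances, while the spherical model returns exactly the sill once the distance reaches its range.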
Figure 1. Variograms for Gaussian, Whittle, exponential and spherical models with nugget , sill and practical range 40 indicated by the vertical line. The horizontal line denotes 95% of the sill.
The sill of $S(s)$ was 1 with nugget 0, and its practical range was 20 for each of the three variograms. The spherical variogram attains its sill at the range, and a range of 24.65 corresponds to a practical range of 20. The practical ranges of the three variograms were therefore close to one half of the maximum distance between the data points, and the range of the spherical variogram was less than this half-distance. In the equation for the conditional mean $\mu(s)$, the fixed-effect term was built from a random number drawn from a uniform distribution. This choice was made so that $S(s)$ is an important part of the model, since exp(-1)/(1+exp(-1)) = 0.27, and so that the mean of the generated $Z(s)$ is around 0.3. Once generated, the uniform random numbers were kept the same for all simulations. Data were simulated by the conditional method using SAS software (SAS® 9.2, SAS Institute Inc., Cary, N.C.). The spatial $S(s)$ in the conditional method were generated by the SAS SIM2D Procedure.
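The distances quoted above can be reproduced with a short calculation. The sketch below assumes a 10 by 10 grid with spacing 4 (implied by the sample size of 100) and solves for the spherical range by bisection; the helper names are illustrative and not part of the original study.

```python
import math

# Grid geometry: 10 points per side with spacing 4 spans a square of side 36.
n_side, spacing = 10, 4.0
extent = (n_side - 1) * spacing
max_distance = extent * math.sqrt(2.0)   # diagonal of the square
half_max_distance = max_distance / 2.0

def spherical_fraction(r: float) -> float:
    """Fraction of the sill reached by the spherical variogram at h = r * range."""
    return 1.5 * r - 0.5 * r ** 3

def solve_fraction(target: float = 0.95, tol: float = 1e-12) -> float:
    """Bisection on [0, 1] for the ratio h / range at which the target fraction is reached."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spherical_fraction(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

ratio = solve_fraction()
spherical_range = 20.0 / ratio           # range giving a practical range of 20
print(round(max_distance, 2), round(half_max_distance, 2), round(spherical_range, 2))
```

Bisection is used here only because the spherical fraction is monotone on [0, 1]; any root finder would serve equally well.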
In this section, spatial binary data were simulated by the procedure described in the Method section. The binary data were generated on a regular grid; the fixed-effect part of the model was the same for all simulations, but the variogram of $S(s)$ varied between simulations. Three spatial binary datasets of $Z(s)$ were generated conditionally, with the variogram of $S(s)$ being Gaussian, exponential and spherical, respectively.
A typical realized dataset from one simulation of a Gaussian random field and the corresponding spatial binary data generated by the conditional method are shown in Figure 2. From the plots, it can be seen that the spatial patterns in the generated binary data are similar to the spatial patterns in the corresponding latent Gaussian random field. Recall the conditional method procedure in the Method section, where a large value of $S(s)$ may lead to a large $\mu(s)$, the mean of $Z(s)$, and thus $Z(s)$ is likely to be 1. Comparing the spatial patterns in $Z(s)$ generated by different variogram types, little difference was found between the binary data generated by the exponential and spherical variograms. However, the spatial binary data generated by the Gaussian variogram had a different spatial pattern from the data generated by the other variogram types, being smoother. The reason can be found in the corresponding realizations of the Gaussian random fields. As shown in (a), (c), (e) of Figure 2, the Gaussian random field with the Gaussian variogram is smoother than the other two.
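One way to make such comparisons quantitative is to compute an empirical semivariogram for both the latent field and the binary data. The sketch below uses the classical (Matheron) estimator in Python with NumPy; the choice of estimator, the binning scheme and the function name are assumptions, since the paper does not state how empirical variograms were computed.

```python
import numpy as np

def empirical_variogram(coords, values, bin_edges):
    """Classical estimator: half the mean squared difference within each distance bin."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    sqdiff = (values[:, None] - values[None, :]) ** 2
    upper = np.triu_indices(len(values), k=1)   # each pair of sites counted once
    d, g = dist[upper], sqdiff[upper]
    gamma = np.full(len(bin_edges) - 1, np.nan)  # NaN marks empty bins
    for k in range(len(bin_edges) - 1):
        in_bin = (d > bin_edges[k]) & (d <= bin_edges[k + 1])
        if in_bin.any():
            gamma[k] = 0.5 * g[in_bin].mean()
    return gamma

# Tiny deterministic check: four sites on a line with alternating binary values.
demo_coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
demo_values = np.array([0.0, 1.0, 0.0, 1.0])
demo_gamma = empirical_variogram(demo_coords, demo_values, np.array([0.5, 1.5, 2.5, 3.5]))
print(demo_gamma)
```

Applied to a simulated latent field and to the binary data generated from it, the two resulting curves can be compared in shape; for binary data the sill is bounded by the Bernoulli variance, so only the shapes, not the levels, are directly comparable.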
In summary, the conditional method algorithm in this paper can generate spatially correlated binary data such that the variograms of the simulated data are similar to the variograms of the corresponding latent Gaussian random field. However, the theoretical variogram of the binary data thus generated is still unavailable. Further work is needed to find good approximations to the correlation function of the data generated by the conditional method.
Figure 2. The Gaussian random fields with Gaussian, exponential and spherical variograms were generated on the regular grid with intervals of 4 in both directions and are shown in (a), (c), (e) respectively. Plots (b), (d), (f) show the corresponding spatial binary data generated by the conditional method.
This paper is funded by the National Natural Science Fund project "Logistics distribution of artificial order picking random process model analysis and research" (Project number: 71371033); by the Intelligent Logistics System Beijing Key Laboratory (No. BZ0211); by the Scientific-Research Bases, Science & Technology Innovation Platform, Modern Logistics Information and Control Technology Research project (Project number: PXM2015_014214_000001); and by the 2014 University Cultivation Fund Project "Research on congestion model and algorithm of picking system in distribution center" (0541502703).