Generating Spatial Correlated Binary Data Through a Copulas Method
Renhao Jin, Sha Wang, Fang Yan, Jie Zhu
School of Information, Beijing Wuzi University, Beijing, China
Email address
To cite this article:
Renhao Jin, Sha Wang, Fang Yan, Jie Zhu. Generating Spatial Correlated Binary Data Through a Copulas Method. Science Research. Vol. 3, No. 4, 2015, pp. 206-212. doi: 10.11648/j.sr.20150304.18
Abstract: Simulating spatial correlated binary data is very important on many cases, but it is not easily to accomplish, as there are restrictions on the parameters of Bernoulli variables. This paper develops a copulas method to generate spatial correlated binary data. The spatial binary data generated by this method has an inverse spatial pattern comparing with the latent Gaussian random field data, however they have similar empirical variograms, although the closed form for the spatial correlation is not available specifically.
Keywords: Spatial Binary Data, Copulas, Simulation, Variogram
1. Introduction
The main goals of this paper are to offer a method to generate spatially correlated binary data through a copulas method. In probability theory and statistics, a copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform. Copulas are used to describe the dependence between random variables. They are named for their resemblance to grammatical copulas in linguistics. Sklar's Theorem states that any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula which describes the dependence structure between the variables. Copulas are popular in high-dimensional statistical applications as they allow one to easily model and estimate the distribution of random vectors by estimating marginal and copulas separately. There are many parametric copula families available, which usually have parameters that control the strength of dependence.
Simulating spatial data is very important on many cases. The absence of replication in most spatial data sets requires repeated observation of a phenomenon to obtain empirical estimates of mean, variation, and covariation. In this paper, the authors only focus on spatially correlated binary data, which are encountered in many applications ranging from epidemiology to forestry. Infectious disease data often have spatially clustered observations. In forestry binary responses, for example, the presence or absence of some disease is often observed.
Several authors have proposed different methods for generating correlated binary data. A study of their methods was performed and it was tried to extend their methods to spatially correlated binary data. However, the majority of these methods have limitations with respect to generating spatially correlated binary data with non-constant mean. For example, Lunn and Davies (1998) showed a method of generating correlated binary variables with a very simple correlation structure, which is suitable for generating variables with correlation structures which are exchangeable, and is easily extended to cater for correlation structures which are autoregressive or stationary M-dependent. However it is impossible to extend their method to general spatial correlation structures and also their method only generates binary data with constant means.
Park et al. (1996) developed a method for generating spatial binary data based on generating correlated Poisson random variables which are then recoded as zero or one. Al-osh and Lee (2001) introduced a simpler approach than that of Park et.al (1996) for generating non-negatively correlated binary data. Their proposed method only uses properties of binary random variates that eliminates the need for the intermediate step of using correlated Poisson variates as in Park et.al (1996). The key idea lies in the fact that any Bernoulli random variable can be expressed as a convolution of other independent Bernoulli random variables and that correlation among the binary observations can arise as a result of their sharing some common elements that induce such correlation. The algorithms of Al-osh and Lee (2001) are almost the same as Park et al. (1996), and their demonstrations and results on simulating 3-dimensional binary vectors are similar, and only differ in whether the convolution is of Poisson variables or Bernoulli variables. In comparing these two methods’ efficiency, Al-osh and Lee’s method is expected to be more efficient since it generates binary variables directly without any intermediate step as in Park et al.'s method.
However, Al-osh and Lee (2001) did not discuss the restrictions of their method very much, and just stated that their algorithm should work for most practical cases in generating a vector of binary variates with nonnegative correlation structure. Their method is not as powerful as claimed. The restrictions on the possible combinations of mean and correlation structure required in their algorithm are such that no single simulation method can handle moderate to large sample sizes easily together with the restrictions. Unlike Park et al. (1996) and Al-osh and Lee (2001), Qaqish (2003) introduced a family of multivariate binary distributions with a certain conditional linear property. This family is particularly useful for efficient and easy simulation of correlated binary variables with a given marginal mean vector and correlation matrix . His method can be used to generate spatially correlated binary data with non-constant mean, but this method also had restrictions. Qaqish (2003) stated a Lemma giving restrictions on , for which his method is available. For certain patterned correlation matrices, such as exchangeable, AR(1), and MA(1) correlations, the algebraic inverse forms of their correlation matrices are available and there are simple rules to decide beforehand on the parameters in to satisfy the Lemma. However, for many other correlation matrices, such as spatial correlation matrices, it is difficult to obtain the algebraic inverse form of the correlation matrix. An example for which it works is a binary process regularly spaced on a 1-dimensional transect with exponential correlation. Since the exponential correlation now is actually AR (1) correlation and it is easy to obtain the inverse of AR (1) correlation matrices, this can be simulated by the method of Qaqish (2003). However for a binary process regularly spaced on a 2-dimensional grid, no simple rules exist for the algebraic inverse of their correlation matrices even for exponential correlation. For a general , Qaqish (2003) then suggested trying permutations of and computing numerical inverse of the permutated and checking those that satisified the conditions of the Lemma, but he noted that this was not a practical approach for actual simulation work as even for a small sample size 50, the number of possible permutations is a huge figure . For sample size 100, 10000 permutations for each of several with spatial correlation matrices were done here and none of the 10000 permutations met the conditions of the Lemma.
In this paper, a method based from copulas for generating spatially correlated binary variables are developed that do not have the shortcomings of the methods above. This copulas method is simple but are totally new and not found elsewhere.
2. Methods
2.1. Generating Spatial Binary Data Through Copulas
Copulas method is a simulation method, and it is easy to understand and manipulate. This method is wildly used to mathematical experiments, and its procedure is explained in this section.
Assume that random variables are K-variate normally distributed and the cumulative distribution function of each is ,. The copulas method first transforms to by , and now is uniform distributed in . Then random variables are then generated as required based on .
Here spatially correlated binary data are generated based on . Let be spatially correlated, and let . For with arbitrary mean and variance, is always uniformly distributed in . For simplicity it is assumed . The spatially correlated binary data are generated as .
To generate spatially correlated binary data , with and , the procedure is as follows:
Step1. Generate spatially correlated , for all and .
Step2. Obtain by for all .
Step3. Generate by for all . Now has and . There is no closed form relationship between and , but are spatially correlated based on the spatial correlation between . The nature of the correlation is investigated here through the relationship between the processes.
For an isotropic second-order stationary spatial process, the variogram function is expected to increase as increases. At the range , the achieve its sill , i.e. . If the variogram achieves the sill only asymptotically, then the practical range is defined as the lag distance at which the variogram achieves of the sill. For the spatial process , if its variogram achieves the sill at , for arbitrary if for , . By the copulas algorithm, it is clear the corresponding and also has . But for arbitrary if , then , and the corresponding and also has . It can be concluded therefore, that the process has the same range as . But when the variogram of achieves the sill only asymptotically, then for arbitrary. As the increase, will be close to but still bigger than . By the copulas algorithm, for the corresponding and , , for arbitrary also. So the also achieves its sill only asymptotically. However, and have different scales, and there is no closed form for the relationship of the covariance functions between and . They will also have different practical ranges, but their practical ranges are close, as shown in the simulations section of this paper.
2.2. Description of the Simulation Study
Firstly spatially correlated normal data with sample size 100 on a regular grid were generated. The grid chosen was on with intervals of 4 in both directions. The maximum distance between the data points was 50.91 and a half of this was 25.46. Gaussian, exponential and spherical variograms of were generated. For each variogram type, a sill of 1, nuggets of 0, 1/3 and 2/3, and a practical range of 20 were considered.
Gaussian and exponential variograms are from Matérn class of variogram functions with no nugget is given by
The smoothness of the process increases with and among the most commonly used parametric variogram models are the Gaussian (), Whittle () and exponential (). The spherical variogram given by
is also commonly used. A nugget effect can be incorporated by adding a constant. Figure 1 gives an illustration. The spherical model attains its sill, but the Matérn models achieve their sill only asymptotically and thus their practical ranges are defined as where 95% of the sill is attained.
Figure 1. Variograms for Gaussian, Whittle, exponential and spherical models with nugget , sill and practical range 40 indicated by the vertical line. The horizontal line denotes 95% of the sill.
Secondly spatial binary data with non-constant mean were generated from by the copulas method introduced in this paper. The definition of is defined as
,
,
where is a random number from a uniform distribution on . Thus the mean of the generated was around 0.27, since exp(-1)/(1+exp(-1))= 0.27.
Data were simulated in this paper using SAS software (SAS® 9.2, SAS Institute Inc., Cary, N.C.). The spatial in the conditional method were generated by the SAS SIM2D Procedure.
3. Results
3.1. Simulations of the Data
Here a sample of 100 spatially correlated normal data were first generated on the regular grid with intervals of 4 in both directions. The Gaussian, exponential and spherical correlation types were considered for . For each variogram type, a sill of 1, nuggets of 0, 1/3 and 2/3, and a practical range of 20 were considered. The were transformed to uniformly distributed data by the transformation , where was the distribution function of . i.e. the Gaussian. A sample of size 100 spatial binary data were then generated by the simple transformation, . In this section, the results of the analysis of the spatial binary data generated by copulas method are shown below.
One simulation of a realized dataset of Gaussian random field data and the corresponding spatial binary data generated by the copulas method are shown in Figure2. Here the sill and nugget of the variogram of the Gaussian random field were chosen to be 1 and 0 respectively. From the plots, it can be seen that the spatial patterns in the generated binary data were different from the spatial patterns in the corresponding Gaussian random field. The were transformed to uniformly distributed data by the transformation , but the were determined without regard to and so it is expected that if U is large Z will be zero and U small Z will be 1. Thus the spatial patterns are the ‘inverse’ of each other. However, they should have similar variograms. Comparing the spatial patterns in generated by different variogram type, little difference was found between the binary data generated by exponential and spherical variograms. However, the spatial binary data generated by Gaussian variogram had a different spatial pattern from the data by the other variogram types. The reason can be found from their corresponding realizations of Gaussian random fields. As shown in (a), (c), (e) of Figure 2, the Gaussian random field with Gaussian variogram had a different spatial pattern from the other two, while the other two had similar spatial patterns.
(a) A plot of V(s) with Gaussian variogram
(b) The plot of generated spatial correlated binary data where V(s) has a Gaussian variogram
(c) A plot of V(s) with an exponential variogram
(d) The plot of generated spatial correlated binary data where V(s) has an exponential variogram
(e) A plot of V(s) with spherical variogram
(f) The plot of generated spatial correlated binary data where V(s) has spherical variogram
Figure 2. The Gaussian random field data with Gaussian, exponential and spherical variograms generated on the grid with intervals of 4 in both directions and shown in (a), (c), (e) respectively. In all plots, the size of circles are proportional to their numeric values. The (b), (d), (f) were the corresponding generated spatial binary data by the copulas method. Here the sill and nugget of the variogram of the Gaussian random field data were chosen to be 1 and 0 respectively.
3.2. Variogram Plots
In this section, to examine the relationship of the variogram between the data sets and , 500 simulations were done. For the data , constant and non-constant means were considered. For the non-constant mean case, the means of were set as in Method section. For the constant mean case, the mean of were simply taken to be 0.27. In each simulation, the Matheron estimators of the variograms of and were calculated. In order to accurately estimate the variograms, only lags which had more than 30 data pairs were kept. As the and were generated on a regular grid, there were sufficient lags that had more than 30 data pairs, so the data were not binned. To show the result of 500 simulations, the Matheron estimator at each lag is the average value of Matheron estimators at that lag over 500 simulations.
Figure 3 shows the results only for a nugget of 0 in each variogram type of . For the case of nuggets 1/3 and 2/3, similar results to that of Figure 3 were seen. As can be seen in Figure 3, clearly there is spatial association between the . More importantly, the generated were found to have a similar spatial correlation type as the . Note that the variograms of and have similar practical range, which was as expected in Method section. There is no closed form for the connection between the correlation functions of and . However, from the plot, it can be concluded that the copulas method kept a similar correlation type. From this same simulation, note that the nugget and sill of the variogram of are different from that of . A comparison of the variograms in Figures 3 (b) and (c), shows the two plots are very similar. This may be because the non-constant means of had a small variation around 0.27 (the value of the constant mean). However, in the Figure 3 (c) the non-constant mean variogram had a little more fluctuation at the big lags than Figure 3 (b).
(a)
(b)
(c)
Figure 3. Matheron variogram plots of and . In the three plots, the Matheron estimator at each lag was the average value of Matheron estimators at that lag in 500 simulations. Figures (a), (b) and (c) are the Matheron variogram of the normal data , the spatial binary data with constant mean, and the spatial binary data with non-constant mean respectively. In Figure (a), the exp, Gaussian and sph denote the generated with exponential, Gaussian and spherical correlation types respectively, while in Figures (b) and (c), exp, Gaussian and sph denote the generated from the corresponding in Figure (a). This Figure shows results only for a nugget of 0 in each variogram of .
4. Conclusion
Simulating spatial correlated binary data is very important on many cases, but it is not easily to accomplish, as there are restrictions on the parameters of Bernoulli variables. Several authors have proposed different methods for generating correlated binary data. A study of their methods was performed and it was tried to extend their methods to spatially correlated binary data. However, the majority of these methods have limitations with respect to generating spatially correlated binary data with non-constant mean. This paper develops a copulas method to generate spatial correlated binary data. Copulas method is a simulation method, and it is easy to understand and manipulate. This method is wildly used to mathematical experiments, and its procedure is explained in this section. The spatial binary data generated by this method has an inverse spatial pattern comparing with the latent Gaussian random field data, however they have similar empirical variograms. The limitation of this copulas method is that the closed form for the spatial correlation is not available specifically. However, in many applications, the main requirements on simulation is to hold the designed variograms, and from this point the method proposed in this paper is delighted.
Acknowledgements
This paper is funded by the project of National Natural Science Fund, Logistics distribution of artificial order picking random process model analysis and research (Project number: 71371033); and funded by intelligent logistics system Beijing Key Laboratory (No.BZ0211); and funded by scientific-research bases---Science & Technology Innovation Platform---Modern logistics information and control technology research (Project number: PXM2015_014214_000001); University Cultivation Fund Project of 2014-Research on Congestion Model and algorithm of picking system in distribution center (0541502703).
References