American Journal of Theoretical and Applied Statistics
Volume 5, Issue 4, July 2016, Pages: 225-233

Modeling Multivariate Correlated Binary Data

Ahmed Mohamed Mohamed El-Sayed

High Institute for Specific Studies, Department of Management Information Systems, Nazlet Al-Batran, Giza, Egypt

To cite this article:

Ahmed Mohamed Mohamed El-Sayed. Modeling Multivariate Correlated Binary Data. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 4, 2016, pp. 225-233. doi: 10.11648/j.ajtas.20160504.19

Received: June 13, 2016; Accepted: June 22, 2016; Published: July 13, 2016


Abstract: This paper provides the model, estimation and test procedures for the measures of association in correlated binary data associated with covariates in the multivariate case. The generalized linear model (GLM), which satisfies the Markov properties for serial dependence, and the alternative quadratic exponential form (AQEF) are employed for multivariate Bernoulli outcome variables. The log-odds ratios are estimated as measures of association, and appropriate test procedures are suggested. The over-dispersion measure is investigated for the multivariate correlated binary outcomes. The scaled deviance is used as a goodness-of-fit measure for the model. For comparison, we use data on respiratory disorder. In this setting, we show that the vectorized generalized linear model (VGLM) and AQEF procedures give the same estimates of the regression parameters in the bivariate case.

Keywords: Multivariate Bernoulli Distribution, Generalized Linear Model, Scaled Deviance Test, Likelihood Ratio Test, Maximum Likelihood Estimators, Alternative Quadratic Exponential Form


1. Introduction

The dependence between the responses and the explanatory variables has been the focus of recent studies, especially with one or two correlated outcome variables associated with covariates. Several studies have addressed multivariate correlated binary outcomes. Lovison [10] proposed a matrix-valued Bernoulli distribution, based on the log-linear representation introduced by Cox [6], for the multivariate Bernoulli distribution with correlated components. The model is based on the integration of conditional and marginal models. Teugels [12] used the concept of the Kronecker product to give some relationships between the correlated variables, namely, the correlation and odds ratios as measures of association. Zhao and Prentice [16] discussed the pseudo-maximum likelihood for analyzing correlated binary responses. Their parametrization is based on a simple pairwise model in which the association between responses is modeled in terms of correlations. Also, Heagerty [7] and Heagerty and Zeger [8] presented the conditional log-odds interpretation and developed a general parametric class of serial dependence models that permits likelihood-based marginal regression analysis of binary response data. Islam et al. [9] developed a simple procedure for the bivariate binary model with covariate dependence. Many features of the vectorized generalized additive model (VGAM) come from the generalized linear model (GLM) and the generalized additive model (GAM), so readers can refer to Chambers and Hastie [4] for these functions. Additionally, Yee and Wild [15] and the VGAM R user manual [14] should be consulted for general instructions about the software. General books on log-linear models are also recommended, especially Christensen [5], Agresti [1] and McCullagh and Nelder [11]. Finally, El-Sayed et al. [3] introduced an alternative measure, based on the quadratic exponential form in the bivariate case, that defines the underlying pseudo-likelihood function more realistically by modifying the normalizing term, thereby developing the Zhao and Prentice model [16] in the bivariate case; this work was also developed for the trivariate case by El-Sayed [2].

In this paper, the main work is modeling the GLM with serial dependence and the AQEF procedures associated with covariates. The estimation and tests of the association parameters are specified with appropriate link functions for the multivariate correlated binary case. Hence, the bivariate and trivariate AQEF are extended to the multivariate case by modifying the normalizing process. Also, to compare with the AQEF procedure for the log-odds ratios as measures of association and for the regression parameters, we use the GLM approach, which captures the serial dependence with the first-order Markov model. Section (2) introduces the multivariate Bernoulli distribution, namely, the joint probabilities and the log-odds ratios as measures of association explaining the relationship between the marginal, conditional and joint probabilities. Sections (3) and (4) present the modeling of the AQEF and the GLM procedures, respectively, in the multivariate case. Section (5) gives a brief introduction to the VGLM procedure. Section (6) presents the numerical examples using the respiratory disorder data.

2. Multivariate Bernoulli Distribution

In this section, we will present the joint probability and the log-likelihood function for  correlated binary outcomes variables each following the Bernoulli distribution.

Let  be a dimensional vector of possibly correlated Bernoulli outcomes variables. The most general form of the joint mass function for  is

(1)

The corresponding log-likelihood function, for  observations, is

(2)

For the special case, , we have the joint mass function for the correlated Bernoulli outcome variables,  and , as

(3)

and the log-likelihood function, for  observations, is

(4)
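The bivariate case can be illustrated with a minimal sketch. It assumes the usual four-cell parametrization of the bivariate Bernoulli distribution, in which each outcome pair has its own cell probability; the function names and the numerical cell probabilities are hypothetical and are not taken from the paper.

```python
import math

def bivariate_pmf(y1, y2, p):
    """Joint mass of (y1, y2); p maps each cell (a, b) to P(Y1=a, Y2=b)."""
    return (p[(1, 1)] ** (y1 * y2)
            * p[(1, 0)] ** (y1 * (1 - y2))
            * p[(0, 1)] ** ((1 - y1) * y2)
            * p[(0, 0)] ** ((1 - y1) * (1 - y2)))

def log_likelihood(pairs, p):
    """Sum of log joint masses over the observed (y1, y2) pairs."""
    return sum(math.log(bivariate_pmf(y1, y2, p)) for y1, y2 in pairs)

def log_odds_ratio(p):
    """Log-odds ratio of the 2x2 table of cell probabilities."""
    return math.log(p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)]))

# Hypothetical cell probabilities and observations, for illustration only.
cells = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.3}
sample = [(1, 1), (0, 0), (1, 0), (1, 1)]
print(log_likelihood(sample, cells), log_odds_ratio(cells))
```

Exactly one of the four exponents equals one for any observed pair, so the product simply selects the probability of the observed cell.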

The next sections explain parameter estimation and appropriate test procedures for both the AQEF and GLM procedures for the multivariate Bernoulli distribution.

3. Multivariate AQEF Procedure

In this section, we extend the bivariate alternative quadratic exponential form proposed by El-Sayed et al. [3] to the multivariate case. The joint mass function for  correlated binary variables  is

(5)

where, , , are natural parameters, and

, ,

are associated parameters, and so on.

To obtain the normalizing term, , in the function (5), we can use this constraint

(6)

In this case, the normalizing constant can be obtained as

(7)

where the summation is over all  possible values of . Then, the normalizing constant is

(8)

For the special case, , the joint probability mass function for  and  is

(9)
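The role of the normalizing constant can be sketched numerically. The code below assumes the standard quadratic exponential form with main-effect and pairwise association terms only, normalized by summing over all binary outcome vectors; it does not reproduce the modified normalizing term of the AQEF, and the names `theta` and `assoc` are hypothetical.

```python
import itertools
import math

def unnormalized(y, theta, assoc):
    """exp(sum_j theta_j*y_j + sum_{j<l} assoc[(j, l)]*y_j*y_l)."""
    linear = sum(t * yj for t, yj in zip(theta, y))
    pairwise = sum(a * y[j] * y[l] for (j, l), a in assoc.items())
    return math.exp(linear + pairwise)

def normalizing_constant(theta, assoc):
    """Sum of the unnormalized mass over all 2**k binary vectors."""
    k = len(theta)
    return sum(unnormalized(y, theta, assoc)
               for y in itertools.product((0, 1), repeat=k))

def joint_pmf(y, theta, assoc):
    """Normalized joint mass of the binary vector y."""
    return unnormalized(y, theta, assoc) / normalizing_constant(theta, assoc)

# Hypothetical bivariate parameters, for illustration only.
theta = [0.5, -0.2]
assoc = {(0, 1): 0.8}
probs = {y: joint_pmf(y, theta, assoc)
         for y in itertools.product((0, 1), repeat=2)}
log_or = math.log(probs[(1, 1)] * probs[(0, 0)] / (probs[(1, 0)] * probs[(0, 1)]))
print(probs, log_or)
```

Under this assumed form the bivariate log-odds ratio equals the pairwise association parameter, which is why that parameter serves as a measure of association.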

3.1. Natural Parameters Estimation

The log-likelihood function, for  observations, can be written as

(10)

where  is defined as shown in (8).

Taking the first derivatives of (10) with respect to  and setting them equal to zero, we have:

(11)

Solving the equations (11), numerically, we can get the estimates , respectively.

3.2. Testing Hypothesis for Natural Parameters

We can test the null hypothesis  against the alternative hypothesis , . To test the significance of association parameters, we can test the null hypothesis  against the alternative hypothesis , . Also, we can test the null hypothesis , , and so on. All tests can be done using the Likelihood ratio test (LRT).
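As an illustration of the LRT for an association parameter, the sketch below tests independence of two binary outcomes from the four cell counts of a 2x2 table. It assumes a saturated multinomial model under the alternative and a chi-square reference distribution with one degree of freedom; the function names and the counts are hypothetical.

```python
import math

def _cell_loglik(count, prob):
    """Contribution count*log(prob), with the convention 0*log(0) = 0."""
    return count * math.log(prob) if count > 0 else 0.0

def lrt_independence(n11, n10, n01, n00):
    """LRT statistic and p-value for H0: the two outcomes are independent."""
    counts = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    n = sum(counts.values())
    p1 = (n11 + n10) / n  # estimated P(Y1 = 1)
    p2 = (n11 + n01) / n  # estimated P(Y2 = 1)
    ll_full = sum(_cell_loglik(c, c / n) for c in counts.values())
    ll_null = sum(
        _cell_loglik(c, (p1 if a else 1 - p1) * (p2 if b else 1 - p2))
        for (a, b), c in counts.items()
    )
    stat = 2.0 * (ll_full - ll_null)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical counts, for illustration only.
stat, p_value = lrt_independence(30, 10, 10, 30)
print(stat, p_value)
```

The same recipe applies to the other hypotheses: fit the model with and without the constraint, and compare twice the difference in maximized log-likelihoods with the appropriate chi-square quantile.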

3.3. Modeling Multivariate AQEF Procedure

In this section, we use the following link functions to generalize the model, with correlated dependent binary variables associated with some covariates,  (not always binary variables). The marginal probabilities  are given by the regression model

(12)

A regression model expressing the association between these responses, associated with some covariates, , can be given by

(13)

The covariates, , are selected because they show some significant association with the variables, , in multivariate analysis.

Now, we will study the effect of covariates  on the log-likelihood function (10), using the equations (12) and (13).
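The way covariates enter the model can be sketched for the bivariate case. The code assumes that (12) is a logit link on each marginal probability and that (13) is a linear predictor for the log-odds ratio; both are assumptions, since the displayed equations are not available here. The joint cell probabilities are then recovered from the two marginals and the odds ratio with the Plackett formula, a technique substituted for illustration. All names and coefficient values are hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(x, coef):
    """Inner product of the covariate vector and its coefficients."""
    return sum(c * xi for c, xi in zip(coef, x))

def joint_from_marginals(p1, p2, psi):
    """Cell probabilities with marginals p1, p2 and odds ratio psi."""
    if abs(psi - 1.0) < 1e-12:
        p11 = p1 * p2  # independence
    else:
        a = 1.0 + (p1 + p2) * (psi - 1.0)
        disc = a * a - 4.0 * psi * (psi - 1.0) * p1 * p2
        p11 = (a - math.sqrt(disc)) / (2.0 * (psi - 1.0))
    return {(1, 1): p11, (1, 0): p1 - p11,
            (0, 1): p2 - p11, (0, 0): 1.0 - p1 - p2 + p11}

def cell_probabilities(x, beta1, beta2, alpha):
    """Joint cells for covariates x under the assumed link functions."""
    p1 = inv_logit(linear_predictor(x, beta1))
    p2 = inv_logit(linear_predictor(x, beta2))
    psi = math.exp(linear_predictor(x, alpha))
    return joint_from_marginals(p1, p2, psi)

# Hypothetical intercept-plus-one-covariate design, for illustration only.
x = [1.0, 2.0]
cells = cell_probabilities(x, [0.2, 0.1], [-0.3, 0.05], [0.5, 0.0])
print(cells)
```

Substituting these cell probabilities into the log-likelihood turns it into a function of the regression coefficients alone, which is what the estimating equations are then solved for.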

3.4. Regression Parameters Estimation

The log-likelihood function can be expressed as follows:

(14)

where  is defined as shown below

(15)

Taking the first derivatives of (14) with respect to  and setting them equal to zero, we have:

(16)

Solving the equations (16), numerically, we can get the estimates , respectively.

3.5. Testing Hypothesis for Regression Parameters

We can test the null hypothesis, , using

(17)

Finally, we can test the null hypothesis,  ( or  or... or ), using

(18)

The estimated dispersion parameter  can be used as a measure for the over-dispersion. So, let us define

The quantity  follows the non-central  distribution. Under independence, the estimator of dispersion parameter  can be defined as

(19)

The value of  should be close to one for Bernoulli data. To evaluate , we must obtain the estimate of the marginals, , using the equation (12), as

(20)

Also, to specify the goodness of fit model, we can use the scaled deviance function

(21)

where  is the number of estimated parameters,  is the dispersion parameter estimate as defined in (19), and the deviance function is

(22)
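The dispersion and goodness-of-fit measures can be sketched for a single binary outcome. The code assumes the standard GLM definitions: the Pearson chi-square divided by the residual degrees of freedom as the dispersion estimate, and the Bernoulli deviance divided by that estimate as the scaled deviance. These may differ in detail from (19), (21) and (22), and the names and data are hypothetical.

```python
import math

def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by (n - p)."""
    chi2 = sum((yi - mi) ** 2 / (mi * (1.0 - mi)) for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

def bernoulli_deviance(y, mu):
    """Deviance of binary data: -2 times the Bernoulli log-likelihood."""
    return -2.0 * sum(yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
                      for yi, mi in zip(y, mu))

def scaled_deviance(y, mu, n_params):
    """Deviance divided by the estimated dispersion."""
    return bernoulli_deviance(y, mu) / pearson_dispersion(y, mu, n_params)

# Hypothetical outcomes and fitted marginals, for illustration only.
y = [1, 0, 1, 0]
mu = [0.5, 0.5, 0.5, 0.5]
print(pearson_dispersion(y, mu, 1), scaled_deviance(y, mu, 1))
```

In the multivariate setting the sums would also run over the outcome components, with the fitted marginals taken from the regression model.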

4. Multivariate GLM Procedure

The Markov structures of dependence often adequately describe serial stochastic dependence in specified data. This pattern of dependence has been studied and so only a few remarks will be made here. Markov dependence of first order implies

(23)

Using the conditional log-odds interpretation of Heagerty [7] and Heagerty and Zeger [8], and the Markov property, the joint mass function for the variables  can be defined as

(24)

For the special case, , the joint probability mass function for  and  is

(25)
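The first-order Markov factorization can be sketched as follows. The code assumes that each transition probability follows a logistic model in the previous outcome, so that the coefficient of the previous outcome is a conditional log-odds ratio; the parameter names `p_first`, `theta` and `gamma` and their values are hypothetical.

```python
import itertools
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def markov_joint_pmf(y, p_first, theta, gamma):
    """P(y_1) * prod_{j>=2} P(y_j | y_{j-1}) for a binary sequence y.

    Assumes logit P(y_j = 1 | y_{j-1}) = theta[j-2] + gamma[j-2] * y_{j-1}
    for j = 2, ..., k, so theta and gamma each have k - 1 entries.
    """
    prob = p_first if y[0] == 1 else 1.0 - p_first
    for j in range(1, len(y)):
        p_j = inv_logit(theta[j - 1] + gamma[j - 1] * y[j - 1])
        prob *= p_j if y[j] == 1 else 1.0 - p_j
    return prob

# Hypothetical parameters for k = 3 outcomes, for illustration only.
p_first, theta, gamma = 0.6, [-0.4, 0.1], [1.2, 0.7]
total = sum(markov_joint_pmf(y, p_first, theta, gamma)
            for y in itertools.product((0, 1), repeat=3))
print(total)
```

Because each factor depends only on the immediately preceding outcome, the log-likelihood splits into one term for the first outcome and one term per transition.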

4.1. Natural Parameters Estimation

In this section, we present the estimation of parameters of the multivariate Bernoulli distribution. For  observations, we can get the log-likelihood function as

(26)

Taking the first derivatives of (26) with respect to  and  and setting them equal to zero, we have

(27)

Solving the equations (27), numerically, we have the estimates  and .

4.2. Testing Hypothesis for Natural Parameters

We can test the null hypothesis  against the alternative hypothesis , . To test the association parameters, we can test the null hypothesis  against the alternative hypothesis , . All tests can be done using the Likelihood ratio test (LRT).

4.3. Modeling Multivariate GLM Procedure

In this section, we use the same link functions as for the AQEF to determine the regression model. A regression model which expresses the link functions and the association between the correlated binary responses, , associated with covariates, , can be given by the equations (12) and (13).

4.4. Regression Parameters Estimation

Now, we study the effect of covariates on the log-likelihood function (26), which becomes

(28)

Taking the first derivative of (28) with respect to  and setting it equal to zero, we get the estimating equations

(29)

Solving the equations (29), numerically, we have the vector estimates .

4.5. Hypothesis Test for Regression Parameters

We can test the regression parameters using the null hypothesis , by the function

(30)

Finally, we can test the association parameters using the null hypothesis , by the function

(31)

The estimate of dispersion parameter  can be defined as shown in the equation (19). Also, to specify the goodness of fit model, we can use the scaled deviance function (21).

5. Multivariate VGAM Procedure

The conditional distribution of vectorized generalized linear models (VGLM), Yee and Wild [15], for multivariate correlated binary responses (), given some covariates, , is given by the function:

where  is the normalizing term. Similar to the GLM and AQEF procedures, we can get the estimate of natural parameters, the estimate of regression parameters, the estimate of dispersion parameters, the scaled deviance and the LRTs.

6. Numerical Examples

Respiratory Disorder Data: Source: Stokes, Davis, and Koch (1995), SAS and R programs.

These data are taken from a clinical trial comparing two treatments for a respiratory illness. The data contain 111 patients from two different clinics (centers) who were randomized to receive either placebo (= 0) or active (= 1) treatment. Patients were examined at baseline (giving the baseline respiratory status) and at four visits during the treatment. At each examination, the respiratory status was determined. The data frame has 444 observations and 8 variables: outcome (the respiratory status at each visit, categorized as good = 1, poor = 0), center (center 1 = 1, center 2 = 2), id (repetition), age (age at time of entry into the study, a continuous variable), baseline (baseline respiratory status, good = 1, poor = 0), treatment (placebo = P, active = A, coded as P = 0 and A = 1 to give binary data), sex (female = 1, male = 0) and visit (four visits). We suppose that, for the bivariate case , the response variables in this model are the "outcome" variable, represented by the binary variable , and the "treatment" variable, represented by the binary variable . The explanatory variables in this model are center, age, baseline, sex and visit. In this example, the two dependent correlated binary variables  and  represent the outcome and the treatment variables, respectively, and one explanatory variable, , represents the visit. In the next examples, we use the VGLM procedure of Yee [14] and Yee and Wild [15], which depends on the log-linear approach in the bivariate case. The estimates are obtained using the BB package of the R program [13].

Table 1 gives the results for the VGAM, AQEF and GLM procedures:

Table 1. Results of VGAM, AQEF and GLM procedures.

, , p = 6 parameters.

From Table 1, we have found that:

The VGLM and AQEF procedures have the same estimates, but the GLM procedure has different estimates.

For the scaled deviance measure as a goodness of fit of the model, we found all measures have values less than , p = 6 parameters.

This means that all measures have a good fit.

For the estimate of the dispersion parameter , the GLM procedure has the smallest value.

For the LRT of the null hypothesis , we find that, for all procedures, the value of the LRT exceeds . This means that, for all procedures, the two correlated dependent variables  and  are significantly affected by the explanatory variable.

Thus, the patient's respiratory status and the treatment are significantly affected by the visit. Hence the test of the association parameters reflects the significant association between  and  associated with  covariates.

In sum, the previous results show that the VGLM and AQEF procedures give the same results. We can therefore use the Wald statistic to test the significance of the parameters of the regression model.

The results in Table 1 are expressed in the regression models shown below:

For the GLM procedure, we have the regression model:

(32)

Also, for the VGAM and AQEF procedures, we have the regression model:

(33)

Table 2 gives the estimates, standard errors and Wald statistics for the regression parameters for the VGAM and AQEF procedures, which have the same results.

Table 2. Estimates, Standard error and Wald statistic.
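A Wald statistic of the kind reported in such a table can be computed as in this minimal sketch. It assumes the usual form, the estimate divided by its standard error, compared with the standard normal distribution; the function name and the input values are hypothetical and are not the values of Table 2.

```python
import math

def wald_test(estimate, std_error):
    """Wald z statistic and two-sided p-value under a standard normal."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimate and standard error, for illustration only.
z, p_value = wald_test(0.84, 0.30)
print(z, p_value)
```

A parameter is declared significant at the 5% level when the absolute value of z exceeds about 1.96, which is equivalent to a p-value below 0.05.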

From Table 2, the Wald statistics show that the dependent variables  and  together are significantly affected by the explanatory variable. This confirms the results obtained for the LRT in Table 1. We can also use the VGAM package to fit the model with more than one covariate. We apply this to the respiratory disorder data, taking the dependent correlated binary variables to be outcome () and treatment (), and the explanatory variables to be center , sex , age , visit  and baseline .

Table 3 gives the results associated with more than one covariate:

Table 3. Logits, Measure of association, Standard error and Wald statistic.

Log-likelihood: , .

From Table 3, we have found that:

The two dependent correlated binary variables, outcome () and treatment (), are together significantly affected by the explanatory variable age.

The dependent variable outcome () is affected significantly by the explanatory variables, baseline  and age .

The dependent variable treatment  is affected significantly by the explanatory variables, baseline  and sex .

From Table 3, we have the regression model:

(34)


References

  1. Agresti A. Categorical data analysis (second edition). New Jersey, United States: John Wiley & Sons; 2002.
  2. El-Sayed A M M. Modeling trivariate binary data. Al-Azhar University, Journal of College of Science 2016; Accepted.
  3. El-Sayed A M M, Islam M A, Alzaid A A. Estimation and test of measures of association for correlated binary data. Bulletin of the Malaysian Mathematical Sciences Society 2013; 2, 36, 4: 985-1008.
  4. Chambers J M, Hastie T J. Statistical Models in S. New York: Chapman and Hall; 1993.
  5. Christensen R. Log-linear Models and Logistic Regression (second edition). New York, United States: Springer-Verlag; 1997.
  6. Cox D R. The analysis of multivariate binary data. Journal of the Royal Statistical Society, Series C (Applied Statistics) 1972; 21: 113-120.
  7. Heagerty P J. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics 2002; 58: 342-351.
  8. Heagerty P J and Zeger S L. Marginalized multi-level models and likelihood inference (with discussion). Statistical Science 2002; 15: 1-26.
  9. Islam M A, Chowdhury R I, Briollais L. A bivariate binary model for testing dependence in outcomes. Bulletin of the Malaysian Mathematical Sciences Society 2012; 2, 35, 4: 845-858.
  10. Lovison G. A matrix-valued Bernoulli distribution. Journal of Multivariate Analysis 2006; 97: 1573-1585.
  11. McCullagh P, Nelder J A. Generalized linear models (second edition). London, United Kingdom: Chapman & Hall; 1989.
  12. Teugels J L. Some representations of the multivariate Bernoulli and Binomial distributions. Journal of Multivariate Analysis 1990; 32: 256-268.
  13. Varadhan R, Gilbert P D. BB: An R package for solving a large system of nonlinear equations and for optimizing a high-dimensional nonlinear objective function. Journal of Statistical Software 2009; 32, 4: 1-26.
  14. Yee T W. The VGAM package, R News 2008; 8, 2: 28-39.
  15. Yee T W, Wild C J. Vector generalized additive models. Journal of the Royal Statistical Society, Series B, Methodological 1996; 58: 481-493.
  16. Zhao L P, Prentice R L. Correlated binary regression using a generalized quadratic model. Biometrika 1990; 77: 642-648.
