Modeling Multivariate Correlated Binary Data

This paper provides the model, estimation and test procedures for the measures of association in the correlated binary data associated with covariates in multivariate case. The generalized linear model (GLM) which satisfies the Markov properties for serial dependence, and the alternative quadratic exponential form (AQEF) are employed for multivariate Bernoulli outcome variables. The log-odds ratios as measures of association have been estimated, and the appropriate test procedures are suggested. The over-dispersion measure is investigated for the multivariate correlated binary outcomes. The scaled deviance is used as a goodness of fit of the model. For comparison, we have used the data on the respiratory disorder. In such situation, we indicate that the vectorized generalized linear models (VGLM) and AQEF procedures have the same estimates of regression parameters in the bivariate case.


Introduction
The dependence between the responses and the explanatory variables have been focused in the recent studies specially with one and two correlated outcomes variables associated with covariates. These studies make an attempt to focus on the multivariate correlated binary outcomes. Lovison [10] proposed a matrix-valued Bernoulli distribution, based on the log-linear representation introduced by Cox [6], for the multivariate Bernoulli distribution with correlated components. The model is based on the integration of conditional and marginal models. Teugels [12] used the concept of the Kronecker product to give some relationships between the correlated variables, namely, the correlation and odds ratios as measures of association. Zhao and Prentice [16] discussed the pseudo-maximum likelihood for analyzing correlated binary responses. Their parametrization is based on a simple pairwise model in which the association between responses is modeled in terms of correlations. Also, Heagerty [7], Heagerty and Zeger [8] presented the the conditional log-odds interpretation, and developed a general parametric class of the serial dependence models that permits the likelihood based marginal regression analysis of binary response data. Islam et al. [9] developed a new simple procedure to take account of the bivariate binary model with covariate dependence. Many of the vectorized generalized additive model (VGAM) features come from generalized linear model (GLM) and generalized additive model (GAM), so that readers with these functions can be returned to Chambers and Hastie [4]. Additionally, Yee and Wild [15] and the VGAM user R-manual, [14], should be consulted for general instructions about the software. General books dealt with log-linear model are referred as well, especially Christensen [5], Agresti [1] and McCullagh and Nelder [11]. Finally, El-Sayed et al. [3] introduced an alternative measure, based on the quadratic exponential form in the bivariate case, to make it more realistic, in terms of defining the underlying pseudo likelihood function, by modifying the normalizing term and developed Zhao and Prentice model [16] in the bivariate case, and this work also is devoloped in the trivariate case by El-Sayed [2].
In this paper, the major work is modeling the GLM with serial dependence, and the AQEF procedures associated with covariates. The estimations and tests of the association parameters are specified with appropriate link functions for the multivariate correlated binary case. Hence, the bivariate and trivariate AQEF will be extended to the multivariate case by modifying the the normalizing process. Also, to compare with the AQEF procedure for the log-odds ratios as measures of association and the regression parameters, we will use the GLM approach which demonstrates the serial dependence with the first-order Markov model. Section (2) presents the introduction to the multivariate Bernoulli distribution, namely, the joint probabilities and the log-odds ratios as measures of association explaining the relationship between the marginal, conditional and joint probabilities. Sections (3) and (4) present the modeling of the GLM and the AQEF procedures, in the multivariate case, respectively. Section (5) present simple introduction to VGLM procedure. Section (6) explained the numerical examples using the respiratory disorder data.

Multivariate Bernoulli Distribution
In this section, we will present the joint probability and the log-likelihood function for K correlated binary outcomes variables each following the Bernoulli distribution. The corresponding log-likelihood function, for n observations, is (2) For special case, k = 2 , we have the joint mass function for the correlated Bernoulli outcomes variables, Y 1 and Y 2 , as (3) and the log-likelihood function, for n observations, is (4) The next sections explain the parameters estimation and appropriate test procedures for both the AQEF and GLM procedures for the multivariate Bernoulli distribution as following:

Multivariate AQEF Procedure
In this section, we will extend the bivariate alternative quadratic exponential form which proposed by El-Sayed et al. [3] to the multivariate case. So, the joint mass function for K correlated binary variables where, log are associated parameters, and so on.
To obtain the normalizing term, c θ ψ In this case, the normalizing constant can be obtained as (7) the summation over all k 2 possible values of Y . Then, the normalizing constant is (8) For special case, k = 2 , the joint probability mass function for Y 1 and Y 2 is (9)

Natural Parameters Estimation
The log-likelihood function, for n observations, can be written as (10) where c θ ψ ( , ) is defined as shown in (8).

Testing Hypothesis for Natural Parameters
We < < , and so on. All tests can be done using the Likelihood ratio test (LRT).

Modeling Multivariate AQEF Procedure
In this section, we will use the next link functions to generalize the model, with correlated dependent binary variables associated with some covariates, x (not always binary variables). The marginal probabilities j p j k ( = 1, 2, .., ) is given by the the regression model (12) A regression model expresses the association between these responses, associated with some covariates, x , can be given by (13) The covariates, x , which are selected show some significant association with the variables, , , ..., , in multivariate analysis.
Now, we will study the effect of covariates x on the log-likelihood function (10), using the equations (12) and (13).

Testing Hypothesis for Regression Parameters
We can test the null hypothesis, The estimated dispersion parameter ϕ can be used as a measure for the over-dispersion. So, let us define ) follows the non-central χ 2 distribution. Under independence, the estimator of dispersion parameter ϕ can be defined as (19) the value of φ should be closed to one for a Bernoulli data. To evaluate φ , we must obtain the estimate of marginals, ˆj p , using the equation (12), as Also, to specify the goodness of fit model, we can use the scaled deviance function

Multivariate GLM Procedure
The Markov structures of dependence often adequately describe serial stochastic dependence in specified data. This pattern of dependence has been studied and so only a few remarks will be made here. Markov dependence of first order implies (23) Using the conditional logg-odds interpretation, Heagerty [7], and Heagerty and Zeger [8], and the Markov property, the joint mass function for the variables Y can be defined as (24) For special case, k = 2 , the joint probability mass function for Y 1 and Y 2 is (25)

Natural Parameters Estimation
In this section, we present the estimation of parameters of the multivariate Bernoulli distribution. For n observations, we can get the log-likelihood function as

Testing Hypothesis for Natural Parameters
We can test the null hypothesis

Modeling Multivariate GLM Procedure
In this section, we will use the same link functions similar to the AQEF to determine the regression model. A regression model which expresses the link functions and the association between the correlated binary responses, Y , associated with covariates, x , can be given by the equations (12) and (13).

Regression Parameters Estimation
Now, we study the effect of covariates on the log-likelihood function (26) which is become

Hypothesis Test for Regression Parameters
We can test the regression parameters using the null hypothesis The estimate of dispersion parameter ϕ can be defined as shown in the equation (19). Also, to specify the goodness of fit model, we can use the scaled deviance function (21).

Multivariate VGAM Procedure
The conditional distribution of vectorized generalized linear models (VGAM), Yee and Wild [15], for multivariate correlated binary responses ( , ,...., k Y Y Y ), given that some covariates, x , is given by the function:  is the normalizing term. Similar to the GLM and AQEF procedures, we can get the estimate of natural parameters, the estimate of regression parameters, the estimate of dispersion parameters, the scaled deviance and the LRTs.

Numerical Examples
Respiratory Disorder Data: Source: Stokes, Davis, and Koch (1995), SAS and R programs.
These data is taken from a clinical trial of patients comparing two treatments for a respiratory illness. The data contains (111) patients from two different clinics (centers) which were randomized to receive either placebo = 0 or active = 1 treatment. Patients were examined at baseline (represent the baseline respiratory status) and at four visits during the treatment. At each examination, the respiratory status was determined. A data frame are (444) observations and (8) variables which are: outcome variable (represent the respiratory status at each visit [categorized as good = 1, poor = 0]), center (center 1=1, center 2 = 2), id (repetition), age (age at time of entry into the study which represents a continuous variable), baseline (baseline respiratory status good or not, hence [good = 1, poor = 0]), treatment (placebo = P, active = A), hence to be binary data we can put P = 0 and A = 1, sex (female = 1, male = 0) and visit (four visits). We suppose that, for the bivariate case ( = 2 ) K , the response variables in this model are two variables: the "outcome" variable represented by the binary variable Y 1 and the "treatment" variable represented by the binary variable Y 2 . Explanatory variables in this model are six variables: center, age, baseline, sex and visit. In this example, the two dependent correlated binary variables Y 1 and Y 2 , represent the outcome and the treatment variables respectively. One explanatory variable X , represents the visit. In the next examples, we use the VGLM procedure, Yee [14], Yee and Wild [15], which depends on the log-linear approach in the bivariate case. The estimates obtained using the BB-package of R program, [13]. Table 1 explains the results for the GLM, QEF and AQEF procedures as following: For the estimate of dispersion parameter ϕ , the procedure GLM has the smallest value.
For the LRT to test the null hypothesis H α 0 : = 0 , we find, for all procedures, the value of LRT is more than χ 2 (0 .0 5 , 1 ) = 3 .8 4 1 5 . This means that, for all procedures, the two correlated dependent variables Y 1 and Y 2 are affected significantly with the explanatory variable X .
Then, the patient respiratory status, contributed and the treatment, are affected significantly by each visit. Hence the test of associated parameters reflect the significant association between Y 1 and Y 2 associated with x covariates.
In sum, the previous results proved that the same results are obtained for the VGLM and AQEF procedures. Then, we can use the Wald statistic to test the significance of the parameters of regression model as shown below.
The results in Table 1, are demonstrated in the regression model shown below: For the GLM procedure, we have the regression model: Also, for the VGAM and AQEF procedures, we have the regression model:  From Table 2, the Wald statistics reflect the dependent variables Y 1 and Y 2 together are affected significantly with the explanatory variable, X . This confirms the results obtained for the LRT in Table 1. Also, we can use the VGAM-package to fit the model using more than one covariates. Applying that on the respiratory disorder data, considering the dependent correlated binary variables are outcome ( Y 1 ) and treatment ( Y 2 ), and the the explanatory variables are: center X 1 ( ) , sex X 2 ( ) , age X 3 ( ) , visit X 4 ( ) and baseline X 5 ( ) . Table 3 represents the results associated with more than one covariates: From Table 3, we have found that: The two dependent correlated binary variables, outcome ( Y 1 ), and treatment ( Y 2 ) are together affected significantly by the explanatory variable age . X 3 ( ) The dependent variable outcome ( Y 1 ) is affected significantly by the explanatory variables, baseline X 5 ( ) and age X 3 ( ) .
The dependent variable treatment Y 2 ( ) is affected significantly by the explanatory variables, baseline X 5 ( ) and sex X 2 ( ) . From Table 3, we have the regression model: