Modeling Multivariate Correlated Binary Data
Ahmed Mohamed Mohamed El-Sayed
High Institute for Specific Studies, Department of Management Information Systems, Nazlet Al-Batran, Giza, Egypt
To cite this article:
Ahmed Mohamed Mohamed El-Sayed. Modeling Multivariate Correlated Binary Data. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 4, 2016, pp. 225-233. doi: 10.11648/j.ajtas.20160504.19
Received: June 13, 2016; Accepted: June 22, 2016; Published: July 13, 2016
Abstract: This paper presents the model, estimation and test procedures for measures of association in correlated binary data with covariates in the multivariate case. The generalized linear model (GLM), which satisfies the Markov property for serial dependence, and the alternative quadratic exponential form (AQEF) are employed for multivariate Bernoulli outcome variables. The log-odds ratios are estimated as measures of association, and appropriate test procedures are suggested. The over-dispersion measure is investigated for the multivariate correlated binary outcomes, and the scaled deviance is used to assess the goodness of fit of the model. For comparison, we use the respiratory disorder data. For these data, we show that the vectorized generalized linear model (VGLM) and AQEF procedures give the same estimates of the regression parameters in the bivariate case.
Keywords: Multivariate Bernoulli Distribution, Generalized Linear Model, Scaled Deviance Test, Likelihood Ratio Test, Maximum Likelihood Estimators, Alternative Quadratic Exponential Form
1. Introduction
The dependence between responses and explanatory variables has been the focus of recent studies, especially those with one or two correlated outcome variables associated with covariates. The present study attempts to focus on multivariate correlated binary outcomes. Lovison [10] proposed a matrix-valued Bernoulli distribution, based on the log-linear representation introduced by Cox [6], for the multivariate Bernoulli distribution with correlated components. The model is based on the integration of conditional and marginal models. Teugels [12] used the Kronecker product to give some relationships between the correlated variables, namely the correlations and odds ratios as measures of association. Zhao and Prentice [16] discussed pseudo-maximum likelihood for analyzing correlated binary responses. Their parametrization is based on a simple pairwise model in which the association between responses is modeled in terms of correlations. Also, Heagerty [7] and Heagerty and Zeger [8] presented the conditional log-odds interpretation and developed a general parametric class of serial dependence models that permits likelihood-based marginal regression analysis of binary response data. Islam et al. [9] developed a simple procedure for the bivariate binary model with covariate dependence. Many features of the vectorized generalized additive model (VGAM) come from the generalized linear model (GLM) and the generalized additive model (GAM), so readers unfamiliar with these functions are referred to Chambers and Hastie [4]. Additionally, Yee and Wild [15] and the VGAM R user manual [14] should be consulted for general instructions about the software. General books on log-linear models are also recommended, especially Christensen [5], Agresti [1] and McCullagh and Nelder [11]. Finally, El-Sayed et al. [3] introduced an alternative measure, based on the quadratic exponential form in the bivariate case, that makes the underlying pseudo-likelihood function more realistic by modifying the normalizing term, thereby developing the Zhao and Prentice model [16] in the bivariate case. This work was extended to the trivariate case by El-Sayed [2].
In this paper, the main task is to model the GLM with serial dependence and the AQEF procedure, both associated with covariates. The estimation and testing of the association parameters are specified with appropriate link functions for the multivariate correlated binary case. The bivariate and trivariate AQEF are extended to the multivariate case by modifying the normalizing process. To compare with the AQEF procedure for the log-odds ratios as measures of association and for the regression parameters, we use the GLM approach, which represents the serial dependence with a first-order Markov model. Section (2) introduces the multivariate Bernoulli distribution, namely the joint probabilities and the log-odds ratios as measures of association, explaining the relationship between the marginal, conditional and joint probabilities. Sections (3) and (4) present the modeling of the AQEF and the GLM procedures, respectively, in the multivariate case. Section (5) gives a brief introduction to the VGLM procedure. Section (6) presents numerical examples using the respiratory disorder data.
2. Multivariate Bernoulli Distribution
In this section, we present the joint probability mass function and the log-likelihood function for correlated binary outcome variables, each following the Bernoulli distribution.
Let be a dimensional vector of possibly correlated Bernoulli outcomes variables. The most general form of the joint mass function for is
(1)
The corresponding log-likelihood function, for observations, is
(2)
For the special case, , we have the joint mass function for the correlated Bernoulli outcome variables, and , as
(3)
and the log-likelihood function, for observations, is
(4)
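As an illustration of the bivariate case, the following minimal sketch evaluates a bivariate Bernoulli mass function, its log-likelihood and the log-odds ratio. The notation is an assumption made here for illustration and is not taken from the paper: `p00`, `p01`, `p10`, `p11` denote the four cell probabilities P(Y1 = a, Y2 = b), and the mass function is written in the usual product form with indicator exponents.

```python
import math

def joint_pmf(y1, y2, p00, p01, p10, p11):
    """Bivariate Bernoulli mass function in product form (assumed notation)."""
    return (p11 ** (y1 * y2)
            * p10 ** (y1 * (1 - y2))
            * p01 ** ((1 - y1) * y2)
            * p00 ** ((1 - y1) * (1 - y2)))

def log_likelihood(pairs, p00, p01, p10, p11):
    """Sum of log joint probabilities over observed (y1, y2) pairs."""
    return sum(math.log(joint_pmf(y1, y2, p00, p01, p10, p11))
               for y1, y2 in pairs)

def log_odds_ratio(p00, p01, p10, p11):
    """Log-odds ratio as a measure of association between Y1 and Y2."""
    return math.log((p11 * p00) / (p10 * p01))

# Hypothetical cell probabilities and observations, for illustration only.
probs = dict(p00=0.4, p01=0.1, p10=0.2, p11=0.3)
pairs = [(0, 0), (1, 1), (1, 0), (0, 0), (1, 1)]
loglik = log_likelihood(pairs, **probs)
association = log_odds_ratio(**probs)
```

A log-odds ratio of zero corresponds to independence of the two outcomes in this parametrization.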
The next sections describe parameter estimation and the appropriate test procedures under both the AQEF and GLM procedures for the multivariate Bernoulli distribution.
3. Multivariate AQEF Procedure
In this section, we extend the bivariate alternative quadratic exponential form proposed by El-Sayed et al. [3] to the multivariate case. The joint mass function for correlated binary variables is
(5)
where, , , are natural parameters, and
, ,
are associated parameters, and so on.
To obtain the normalizing term, , in the function (5), we can use this constraint
(6)
In this case, the normalizing constant can be obtained as
(7)
the summation over all possible values of . Then, the normalizing constant is
(8)
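The idea of obtaining a normalizing constant by summing over all possible binary outcome vectors can be sketched as follows. This is a minimal sketch of a standard quadratic exponential form with main effects and pairwise association terms only; the paper's alternative form modifies the normalizing term and allows higher-order terms, so it may differ. The names `theta` (natural parameters) and `omega` (pairwise association parameters) are assumptions made for illustration.

```python
import itertools
import math

def qef_unnormalized(y, theta, omega):
    """exp(linear terms + pairwise association terms) for a binary vector y."""
    k = len(y)
    linear = sum(theta[j] * y[j] for j in range(k))
    pairwise = sum(omega[(j, l)] * y[j] * y[l]
                   for j in range(k) for l in range(j + 1, k))
    return math.exp(linear + pairwise)

def normalizing_constant(theta, omega):
    """Sum of the unnormalized mass over all 2^k binary vectors."""
    k = len(theta)
    return sum(qef_unnormalized(y, theta, omega)
               for y in itertools.product((0, 1), repeat=k))

def qef_pmf(y, theta, omega):
    """Normalized joint mass function."""
    return qef_unnormalized(y, theta, omega) / normalizing_constant(theta, omega)

# Hypothetical parameter values for k = 3 outcomes.
theta = [0.5, -1.0, 0.2]
omega = {(0, 1): 0.8, (0, 2): -0.3, (1, 2): 0.4}
prob_all_ones = qef_pmf((1, 1, 1), theta, omega)
```

Enumeration costs 2^k evaluations, so it is practical only for a small number of outcomes.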
For the special case, , the joint probability mass function for and is
(9)
3.1. Natural Parameters Estimation
The log-likelihood function, for observations, can be written as
(10)
where is defined as shown in (8).
Taking the first derivatives of (10) with respect to , and setting them equal to zero, we have:
(11)
Solving equations (11) numerically, we obtain the estimates , respectively.
3.2. Testing Hypothesis for Natural Parameters
We can test the null hypothesis against the alternative hypothesis , . To test the significance of the association parameters, we can test the null hypothesis against the alternative hypothesis , . Also, we can test the null hypothesis , , and so on. All of these tests can be carried out using the likelihood ratio test (LRT).
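A generic likelihood ratio test can be sketched as follows. The statistic is twice the difference between the maximized log-likelihoods of the unrestricted and restricted models, referred to a chi-square distribution whose degrees of freedom equal the number of restricted parameters. The function names and the numerical log-likelihood values below are hypothetical; the chi-square tail probability uses the closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return the LRT statistic and its asymptotic chi-square p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(stat, df)

# Hypothetical maximized log-likelihoods; two parameters restricted under H0.
stat, p_value = likelihood_ratio_test(-120.0, -115.5, df=2)
```

The null hypothesis is rejected at level 0.05 when the p-value falls below 0.05, which is equivalent to the statistic exceeding the corresponding chi-square critical value.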
3.3. Modeling Multivariate AQEF Procedure
In this section, we use the following link functions to generalize the model to correlated dependent binary variables associated with some covariates (which are not necessarily binary). The marginal probabilities are given by the regression model
(12)
A regression model that expresses the association between these responses, associated with some covariates, , can be given by
(13)
The selected covariates, , are those that show a significant association with the variables, , in multivariate analysis.
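The role of the link functions can be illustrated with a short sketch. It assumes, for illustration only, a logit link for each marginal probability and an identity link for the log-odds ratio, both linear in the covariates; the paper's equations (12) and (13) may use a different specification. The names `beta` and `gamma` for the regression coefficients are hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def marginal_probability(x, beta):
    """P(Y = 1 | x) under an assumed logit link, with x including the intercept."""
    eta = sum(b * v for b, v in zip(beta, x))
    return inv_logit(eta)

def log_odds_ratio(x, gamma):
    """Association (log-odds ratio) modeled as linear in the covariates."""
    return sum(g * v for g, v in zip(gamma, x))

# Hypothetical covariate vector (intercept, visit) and coefficients.
x = [1.0, 2.0]
mu = marginal_probability(x, beta=[-0.4, 0.2])   # linear predictor is 0.0
psi = log_odds_ratio(x, gamma=[0.5, 0.25])       # linear predictor is 1.0
```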
Now, we will study the effect of covariates on the log-likelihood function (10), using the equations (12) and (13).
3.4. Regression Parameters Estimation
The log-likelihood function can be expressed as follows:
(14)
where is defined as shown below
(15)
Taking the first derivatives of (14) with respect to , and setting them equal to zero, we have:
(16)
Solving equations (16) numerically, we obtain the estimates , respectively.
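To illustrate what solving score equations numerically involves, the following sketch applies Newton-Raphson iteration to an ordinary logistic regression with one covariate. This is a swapped-in, simpler likelihood used only for illustration: it is not the AQEF log-likelihood (14), and Newton-Raphson is not necessarily the solver used in the paper. All names and data are hypothetical.

```python
import math

def fit_logistic(xs, ys, iterations=25, tol=1e-10):
    """Newton-Raphson for logit P(y = 1) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = mu * (1.0 - mu)
            g0 += y - mu            # score for the intercept
            g1 += (y - mu) * x      # score for the slope
            h00 += w                # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # step = information^{-1} * score
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Hypothetical data with a binary covariate: 1 of 4 successes at x = 0, 3 of 4 at x = 1.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = fit_logistic(xs, ys)
```

With a single binary covariate the maximum likelihood estimates have a closed form (the empirical log-odds at x = 0 and the empirical log-odds ratio), which gives a convenient check on the iteration.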
3.5. Testing Hypothesis for Regression Parameters
We can test the null hypothesis, , using
(17)
Finally, we can test the null hypothesis, ( or or... or ), using
(18)
The estimated dispersion parameter can be used as a measure for the over-dispersion. So, let us define
The quantity follows the non-central distribution. Under independence, the estimator of dispersion parameter can be defined as
(19)
The value of should be close to one for Bernoulli data. To evaluate , we must obtain the estimates of the marginals, , using equation (12), as
(20)
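A dispersion estimate of this kind can be sketched as follows. The sketch uses the standard Pearson estimator, the Pearson chi-square statistic divided by the residual degrees of freedom n - p, which is an assumption here and may not coincide term by term with the paper's equation (19). The fitted marginal probabilities `mus` are hypothetical values.

```python
def pearson_dispersion(ys, mus, n_params):
    """Pearson chi-square divided by residual degrees of freedom (n - p)."""
    chi2 = sum((y - mu) ** 2 / (mu * (1.0 - mu)) for y, mu in zip(ys, mus))
    return chi2 / (len(ys) - n_params)

# Hypothetical binary responses and fitted marginal probabilities.
ys = [1, 0, 1, 0, 1, 0]
mus = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
phi_hat = pearson_dispersion(ys, mus, n_params=2)
```

Values well above one suggest over-dispersion relative to the Bernoulli variance.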
Also, to assess the goodness of fit of the model, we can use the scaled deviance function
(21)
where is the number of estimated parameters, and is the dispersion parameter estimate as defined in (19). The deviance function is
(22)
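The deviance and scaled deviance can be sketched for independent binary responses as follows. This is the usual Bernoulli deviance, in which the saturated model contributes zero log-likelihood, divided by a dispersion estimate; it is an assumed form that may differ from the paper's equations (21) and (22). The inputs are hypothetical.

```python
import math

def bernoulli_deviance(ys, mus):
    """D = -2 * sum[y log(mu) + (1 - y) log(1 - mu)] for binary y."""
    return -2.0 * sum(y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
                      for y, mu in zip(ys, mus))

def scaled_deviance(ys, mus, dispersion):
    """Deviance divided by the dispersion estimate."""
    return bernoulli_deviance(ys, mus) / dispersion

# Hypothetical responses, fitted probabilities and dispersion estimate.
ys = [1, 0]
mus = [0.5, 0.5]
deviance = bernoulli_deviance(ys, mus)
scaled = scaled_deviance(ys, mus, dispersion=2.0)
```

The scaled deviance is commonly compared with a chi-square critical value to judge the fit.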
4. Multivariate GLM Procedure
The Markov structures of dependence often adequately describe serial stochastic dependence in specified data. This pattern of dependence has been studied and so only a few remarks will be made here. Markov dependence of first order implies
(23)
Using the conditional log-odds interpretation of Heagerty [7] and Heagerty and Zeger [8], together with the Markov property, the joint mass function for the variables can be defined as
(24)
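The first-order Markov factorization can be sketched as follows: the joint probability of a binary sequence is the probability of the first outcome multiplied by the one-step conditional probabilities. The names `p_first` and `transition` and the numerical values are hypothetical, and the transition probabilities are taken to be the same at every step for simplicity.

```python
import math

def markov_joint_probability(ys, p_first, transition):
    """P(y_1, ..., y_k) = P(y_1) * prod_j P(y_j | y_{j-1}).

    transition[a] is the assumed P(Y_j = 1 | Y_{j-1} = a) for a in {0, 1}.
    """
    prob = p_first if ys[0] == 1 else 1.0 - p_first
    for prev, cur in zip(ys, ys[1:]):
        p_one = transition[prev]
        prob *= p_one if cur == 1 else 1.0 - p_one
    return prob

def conditional_log_odds_ratio(transition):
    """Log-odds ratio between consecutive outcomes implied by the transitions."""
    logit = lambda p: math.log(p / (1.0 - p))
    return logit(transition[1]) - logit(transition[0])

# Hypothetical values.
transition = {0: 0.2, 1: 0.7}
prob = markov_joint_probability((1, 1, 0), p_first=0.6, transition=transition)
gamma = conditional_log_odds_ratio(transition)
```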
For the special case, , the joint probability mass function for and is
(25)
4.1. Natural Parameters Estimation
In this section, we present the estimation of parameters of the multivariate Bernoulli distribution. For observations, we can get the log-likelihood function as
(26)
Taking the first derivatives of (26) with respect to and , and setting them equal to zero, we have
(27)
Solving equations (27) numerically, we obtain the estimates and .
4.2. Testing Hypothesis for Natural Parameters
We can test the null hypothesis against the alternative hypothesis , . To test the association parameters, we can test the null hypothesis against the alternative hypothesis , . All of these tests can be carried out using the likelihood ratio test (LRT).
4.3. Modeling Multivariate GLM Procedure
In this section, we use the same link functions as for the AQEF procedure to determine the regression model. A regression model that expresses the link functions and the association between the correlated binary responses, , associated with covariates, , is given by equations (12) and (13).
4.4. Regression Parameters Estimation
Now, we study the effect of covariates on the log-likelihood function (26), which becomes
(28)
Taking the first derivatives of (28) with respect to , and setting them equal to zero, we get the estimating equations
(29)
Solving equations (29) numerically, we obtain the vector estimates .
4.5. Hypothesis Test for Regression Parameters
We can test the regression parameters using the null hypothesis , by the function
(30)
Finally, we can test the association parameters using the null hypothesis , by the function
(31)
The estimate of the dispersion parameter can be defined as shown in equation (19). Also, to assess the goodness of fit of the model, we can use the scaled deviance function (21).
5. Multivariate VGLM Procedure
The conditional distribution under the vectorized generalized linear model (VGLM) of Yee and Wild [15], for multivariate correlated binary responses (), given some covariates, , is given by the function:
where is the normalizing term. As with the GLM and AQEF procedures, we can obtain the estimates of the natural parameters, the regression parameters and the dispersion parameter, as well as the scaled deviance and the LRTs.
6. Numerical Examples
Respiratory Disorder Data: Source: Stokes, Davis, and Koch (1995), SAS and R programs.
These data are taken from a clinical trial comparing two treatments for a respiratory illness. The data cover 111 patients from two different clinics (centers) who were randomized to receive either placebo = 0 or active = 1 treatment. Patients were examined at baseline (giving the baseline respiratory status) and at four visits during the treatment. At each examination, the respiratory status was determined. The data frame has 444 observations and 8 variables: outcome (the respiratory status at each visit, categorized as good = 1, poor = 0), center (center 1 = 1, center 2 = 2), id (repetition), age (age at time of entry into the study, a continuous variable), baseline (baseline respiratory status, good = 1, poor = 0), treatment (placebo = P, active = A, coded as P = 0 and A = 1 to give binary data), sex (female = 1, male = 0) and visit (four visits).
We suppose that, for the bivariate case , the response variables in this model are two variables: the "outcome" variable represented by the binary variable and the "treatment" variable represented by the binary variable . The explanatory variables in this model are center, age, baseline, sex and visit. In this example, the two dependent correlated binary variables and , represent the outcome and the treatment variables, respectively. One explanatory variable , represents the visit. In the next examples, we use the VGLM procedure, Yee [14], Yee and Wild [15], which depends on the log-linear approach in the bivariate case. The estimates were obtained using the BB package for R [13].
Table 1 gives the results for the GLM, QEF and AQEF procedures as follows:
, , p = 6 parameters.
From Table 1, we have found that:
The VGLM and AQEF procedures have the same estimates, but the GLM procedure has different estimates.
For the scaled deviance measure as a goodness of fit of the model, we found all measures have values less than , p = 6 parameters.
This means that all measures have a good fit.
For the estimate of dispersion parameter , the procedure GLM has the smallest value.
For the LRT of the null hypothesis , the value of the LRT statistic exceeds for all procedures. This means that, for all procedures, the two correlated dependent variables and are significantly affected by the explanatory variable.
Thus, the patient's respiratory status and the treatment are significantly affected by each visit. Hence, the test of the association parameters reflects a significant association between and in the presence of covariates.
In sum, the previous results show that the VGLM and AQEF procedures give the same results. We can then use the Wald statistic to test the significance of the parameters of the regression model, as shown below.
The results in Table 1 are summarized in the regression models shown below:
For the GLM procedure, we have the regression model:
(32)
Also, for the VGLM and AQEF procedures, we have the regression model:
(33)
Table 2 gives the estimates, standard errors and Wald statistics for the regression parameters under the VGLM and AQEF procedures, which have the same results.
From Table 2, the Wald statistics show that the dependent variables and are jointly affected significantly by the explanatory variable . This confirms the results obtained for the LRT in Table 1. We can also use the VGAM package to fit the model with more than one covariate. We apply this to the respiratory disorder data, taking the correlated binary dependent variables to be outcome () and treatment (), and the explanatory variables to be center , sex , age , visit and baseline .
Table 3 gives the results with more than one covariate:
Log-likelihood: , .
From Table 3, we have found that:
The two dependent correlated binary variables, outcome () and treatment (), are jointly affected significantly by the explanatory variable age .
The dependent variable outcome () is affected significantly by the explanatory variables, baseline and age .
The dependent variable treatment is affected significantly by the explanatory variables, baseline and sex .
From Table 3, we have the regression model:
(34)
References