Bias Adjustment Methods for Analysis of a Non-randomized Controlled Trials of Right Heart Catheterization for Patients in ICU

: Kaplan-Meier estimate or proportional hazards regression is commonly used directly to estimate the effect of treatment on survival time in randomized clinical studies. However, such methods usually lead to biased estimate of treatment effect in non-randomized or observational studies because the treated and untreated groups cannot be compared directly due to potential systematical difference in baseline characteristics. Researchers have developed various methods for adjusting biased estimates by balancing out confounding covariates such as matching or stratification on propensity score, inverse probability treatment weighting. However, very few studies have compared the performance of these methods. In this paper, we conducted an intensive case study to compare the performance of various bias correction methods for non-randomized studies and applied these methods to the right-heart catheterization (RHC) study to investigate the impact of RHC on the survival time of critically ill patients in the intensive care unit. Our findings suggest that, after bias adjustment procedures, RHC was associated with increased mortality. The inverse probability treatment weighting outperforms other bias adjustment methods in terms of bias, mean-squared error of the hazard ratio estimators, type I error and power. In general, a combination of these bias adjustment methods could be applied to make the estimation of the treatment effect more efficient.


Introduction
In randomized clinical studies, the effect of treatment on patients' survival time can be estimated by comparing treated and untreated subjects directly. In this case, Kaplan-Meier estimate or proportional hazards regression is used directly to estimate the effect of treatment on survival time. However, it is not easy to materialize a randomized study in daily life. There is an increasing number of nonrandomized studies in recent years. In an observational (or nonrandomized) study, the treated and untreated groups cannot be compared directly because they may systematically differ at baseline characteristics. For example, the patients' health condition and medical history are essential factors when doctors make a diagnosis. The treatment assignment to a patient is dependent on covariates like age, gender, health condition, and medical history, etc. As a result, the effect of medical treatment on patients' survival time may be confounded by their baseline covariates. Therefore, systematic differences in baseline characteristics between the treated and untreated groups must be considered in assessing the impact of treatment on survival time in observational studies.
The propensity score plays an important role in balancing the treated and untreated subjects to make them comparable. Rosenbaum and Rubin proposed that propensity score is the conditional probability assignment to a particular treatment given a vector of observed covariates [1][2]. They indicated that adjustment for the scalar propensity score contributes to control all confounders and eliminate bias due to observed covariates. Propensity Score is a scalar function of the covariates that includes the information required to achieve the balance of distribution of baseline covariates. The most common methods based on propensity score are matching, stratification, regression adjustment, and probability weighting [3][4]. With the application of the propensity score, the treated and untreated patients who have similar propensity scores will have a similar distribution of observed background covariates. Therefore, the effect of treatment will be unrelated to confounders, as a result of which, treated and the untreated subject is comparable like what we could attain in randomized studies.
The dataset that motivated this paper pertains to day 1 of hospitalization and the treatment variable "swang1" is whether or not a patient received a Right Heart Catheterization (RHC), also called the Swan-Ganz catheter, on the first day in which the patient qualified for the SUPPORT study [5]. RHC is a test used to see how well your heart is pumping (how much blood it pumps per minute) and to measure the pressures in the heart and lungs. In an RHC, the doctor guides a special catheter (a small, hollow tube) to the right side of the heart then passes the tube into the pulmonary artery. The doctor observes blood flow through the heart and measures the pressures inside the heart and lungs. A sensitivity analysis provided some evidence that patients receiving RHC had decreased survival time. But the sensitivity analysis indicated that any unmeasured confounder would have to be somewhat strong to explain away the results. Our goal is to estimate the effect of RHC treatment on the patients' survival time after reducing the confounding bias. However, systematic differences in patients in the two groups may exist, and these differences could lead to a biased estimate of treatment effect; which is known as the causal effect in a nonrandomized study.
As mentioned before, the distributions of baseline covariates between treatment 0 and treatment 1 subjects are quite different. What's more, as we will see in the matching methods, the distributions of the propensity score in the two treatment groups are different, which reveals the systematic difference in the two studies and the problem of confounding. The remainder of the article focus on the application and comparison of the following three methods. Section 2 introduce matching on propensity score method. Section 3 introduce stratification on propensity score method. Section 4 introduce inverse probability treatment weighting method. We apply each method to the Right Heart Catheterization study to compare the survival time of RHC treated group and control group. The article concludes with a discussion on the choice of methods under different scenarios in Section 5.

Matching on Propensity Score
The propensity score is presented in both randomized trials and observational studies. In randomized trials, the true propensity score is known and defined by the study design. In observational studies, true propensity scores are generally not known but can be estimated through the data [6]. The propensity score is the conditional probability assignment to a particular treatment given a vector of observed covariates [1]: Where the dependent variable is binary, =1 is associated with the RHC treatment and =0 is corresponding to control.
, i=0, 1 are observed values of the vector of covariates [6]. Propensity scores are generally calculated using one of two methods: Logistic regression or Classification and Regression Tree Analysis [6]. In practice, the propensity score is most often estimated using a logistic regression model, in which treatment status is regressed on observed baseline characteristics [7]. The estimated propensity score is the predicted probability of treatment derived from the fitted regression model [8].
Where the parameters α, β are estimated by maximum likelihood logistic regression.
Matching is a commonly used method to select "matched" pairs on background covariates that we believe need to be controlled. Even though it seems difficult to find patients who are similar on all important covariates, especially when there are plenty of covariates of interest, propensity score matching solves this problem by allowing us to control for as many covariates as we want simultaneously by matching a single scalar variable [9]. Rosenbaum and Rubin introduced three techniques for constructing a matched sample: (i) nearest available matching on the estimated propensity score; (ii) Mahalanobis metric matching including the propensity score; and (iii) nearest available Mahalanobis metric matching within calipers defined by the propensity score. Therefore, once the propensity scores are estimated by the logistic regression method, we apply the nearest available matching approach to reduce the confounding bias in the RHC study. In this method, the absolute difference between the estimated propensity scores for the control and treated groups is minimized [6]. Given randomly ordered control and treated subjects, the first treated subject is selected along with a control subject with a propensity score closest in value to it [10]. Generally, if a treated subject and a control subject have the same propensity score, the observed covariates are automatically controlled for [6]. Therefore, any differences between the treatment and control groups will be accounted for and will not be a result of the observed covariates.
To confirm the effect of the propensity score matching method on reducing systematic difference, it is necessary to compare the covariates between treatment 0 and treatment 1 before and after matching. Our goal is to reduce the difference in the mean of individual covariate between treatment 0 and treatment 1 after matching method. To decide whether there is a significant difference in the mean of individual covariate between treatment 0 and treatment 1, visualizations like box plots, bar plots are carried out first and then a two-sample t-test is applied to compare the results statistically.
Since there are 50 covariates in the dataset makes it too complicated to conclude the changes that matching influenced, and according to the variable description, not all of the covariates are useful in the model. There may be some errors in analyzing the results of matching without any variable selection. The Lasso method has been tried first for variable selection in the Cox model. LASSO can be computed via R Package glmnet [11]. But the final results showed that there are still 42 covariates remaining in the model whose coefficient is larger than zero. It is not convenient to implement a comparison among all of the 42 covariates. Then we can try to use the stepwise package which provides the final model with 28 covariates. Apparently, it is not the perfect result even though it provides a much simpler model. We can do further selection from the final 28 covariates.
According to the Cox matching adjusted model with the selected covariates, table 1 comparing the P-value results of the Cox match adjusted model with full covariates and the 28 covariates from the variable selection. The majority of the 28 covariates in the stepwise final model have smaller P-value, which means the corresponding covariates are more significant in this model. Meanwhile, the P-value of some covariates increases relatively. Therefore, those covariates whose P-value becomes smaller while are less than 0.05 before and after variable selection are reasonable to represent the most significant ones. It is more convenient to concentrate on these 9 covariates and compare the mean of them after the matching method.
In order to confirm the effects of matching, first of all, we draw the boxplots and bar plots of these covariates chosen from stepwise before and after matching. Here we use the boxplot of "surv2md1", "das2d3pc" and the results of propensity score and bar plots of "hema", "chfhx", "meta", "chrpulhx", "psychhx", "dnr1Yes", "renal". Although plots are showing the approximate equivalence between treatment 0 and 1, in favor of unbiased estimate of treatment effect before matching, it is not statistically significant at the 0.05 level of significance. As a result, it is not enough to conclude the effect of matching only by the plots of covariates. Further statistical steps are necessary. To be specific, a two-sample ttest is applied here to test whether the difference of a covariate's mean in treatment 0 and treatment 1 is zero.  Table 2 indicates the mean and standard deviation of covariates in subsets under treat 0 and treat 1 before matching. Among the 9 significant covariates, there are 8 covariates with P-value less than 0.05, which is sufficient evidence to reject the null hypothesis and conclude that the confounding bias exists. Similarly, the visualization and twosample t-test are conducted for relevant data after matching. It can be seen from the box plot of the PS's before and after matching that the unbalance has been reduced a lot after matching. Also, the test statistics and P-value in table 3 revealed that the differences between covariates under treatment 0 and 1 decreases, since most covariates' P-value are larger than 0.05. Even though the P-value of "survmd1" and "dnr1" is still less than 0.05, the significance becomes less with the P-value increasing much more.
Since the systematic differences between the patients in treatment 0 group and treatment 1 group have been greatly reduced, the effect of treatment on survival time could be compared directly. Figure 3 is the comparison plot of the Kaplan-Meier estimates before and after matching. The log-rank test statistic is 19.35 with P-value 1.00e-05 before matching, and 23.65 with P-value 1.20E-06 after matching. In other words, the result of the treatment effect (P-value 1.00e-05) is not accurate statistically without matching adjustment. The results provided evidence that the difference of survival functions between the two groups is more significant at significance level 0.05 after propensity score matching and the patients who received RHC had lower survival time than those who did not receive RHC.

Stratification on Propensity Score
Stratification on propensity score can also ameliorate the confounding effects of covariates. Each observation for the subject is classified into a propensity quantile based on the propensity score [12]. According to Rosenbaum and Rubin's results, creating five strata based on a continuous variable like the propensity quantile with the stratum boundaries determined by its distribution in the exposed and the comparison group combined eliminates approximately 90% of measured confounding [13]. Therefore, the patients can be assigned to five strata using the propensity score quantile as the cut-off. Within each stratum, the treated patients and untreated patients will have roughly similar propensity score values, also a similar distribution of the measured baseline covariates. The effect of the treatment can be estimated by comparing the outcomes directly between subjects with treatment 0 and subjects with treatment 1 in one stratum if the propensity score has been estimated correctly [7].
To confirm that the systematic difference has been reduced after stratification, it is necessary to compare the covariates' mean under treatment 0 and treatment 1 before and after stratification. The same problem occurs here as with matching when there are 50 covariates in the dataset, which is too complex to conclude whether the stratification makes a difference. Variable selection will be operated again as before. Similarly, the Lasso method has been tried for variable selection but there are 32 covariates left in the final result with coefficients larger than zero. Therefore, I still apply stepwise here aiming to obtain a simpler model and then 28 covariates are selected from the stepwise function with stratification. A further selection is similar as before.
According to the Cox stratification adjusted model with the selected covariates, table 4 comparing the P-value results of the Cox stratification adjusted model with full covariates and the 28 covariates from the variable selection. The majority of the 28 covariates in the stepwise final model have smaller P-value, which means the corresponding covariates are more significant in this model. Meanwhile, the P-value of some covariates increases relatively. So, those covariates Heart Catheterization for Patients in ICU whose P-value becomes smaller while are less than 0.05 before and after variable selection are chosen to represent the most significant ones. It is reasonable to concentrate on these 8 covariates and compare the mean of them after the stratification. To confirm the effects of stratification, a two-sample t-test is applied here to test whether the difference of a covariate's mean in treatment 0 and treatment 1 is zero, which is related to test whether the systematic difference in covariates has been reduced. Table 5 shows the mean and standard deviation of corresponding covariates in subsets under treatment 0 and treatment 1 before stratification. All of the 8 covariates' Pvalue is less than 0.05. That is sufficient evidence to reject the null hypothesis and conclude that the confounding bias exists and the stratification adjustment is necessary when evaluating the effect of treatment on survival time. Similarly, the two-sample t-test is conducted for relevant data after stratification. It can be seen from the test statistics and Pvalue in table 6 that the systematic differences between covariates under treatment 0 and treatment 1 decreases, since most covariates' P-value increased and the significance of the difference in mean between treatment 0 and treatment 1 decreased. Even though the P-value of the covariates is still less than 0.05, the significance becomes less with the P-value increasing. The reason for the zero P-value is that the stratified two-sample t-test function defines the extreme P-value as zero.
To compare the mean of selected covariates in the subject of treatment 0 and subject of treatment 1 more accurately and sufficiently, table 7 and 8 indicate the mean of each covariate in each stratification group and use two-sample t-test respectively to test whether there is a significant difference in the mean of covariates between treatment 0 and treatment 1 after stratification. Apparently, most of the P-values are larger than 0.05 which concludes to fail to reject the null hypothesis and illustrates that the systematic difference and confounding bias are reduced.
Since systematic differences between the patients in treatment 0 group and treatment 1 group have been greatly reduced, the effect of treatment on survival time is comparable. Figures 4 and 5 is the Cox proportional hazard regression model for treatment 0 and treatment 1 after stratification. It is obvious that the balance of the covariates is better achieved after stratification. Figure 6 are the comparison plots of the Kaplan-Meier estimates between treatment 0 and 1 in each stratification group. As we can see from the five plots, the survival time of patients after RHC treatment is relatively decreased, which leads to the same conclusion as propensity score matching.

Inverse Probability of Treatment Weighting
Kaplan Meier estimator is widely used in clinical studies to compare survival time between different treatment groups. However, if certain covariates corresponding to low survival rates are more strongly represented in one group than another, which is considered as over-represented, the survival estimated by the Kaplan-Meier method form one group would appear to be worse than survival estimated from the other group. Another approach reducing confounding effects was proposed by Xie and Liu in 2005 [14]. They developed the Adjusted Kaplan Meier estimator (AKME) using the inverse probability of treatment weighting (IPTW). The estimated propensity score, the probability of being treated in a certain group conditioning on a set of covariates, is used to construct the weights for subjects. A weight is assigned to each individual as the inverse of the propensity score. For example, a subject with a higher propensity score, which is considered as overrepresented, is assigned with a lower weight. On the other hand, subjects with a lower propensity score, considered as under-represented, will be given a higher weight [15]. They also proposed a weighted log-rank test for statistical comparison of the survival functions of the two groups.
As with the matching and stratification, we apply the IPTW method to the Right Heart Catheterization study. The propensity score of each patient is estimated using logistic regression in the same way. Then the Kaplan-Meier estimators of both the treatment group and control group are adjusted with the weight as the inverse of the propensity score. If the propensity score is estimated correctly, the sampling bias will be removed after weighting adjustment. Figure 7 shows the Kaplan-Meier estimator on the survival function of the two groups before and after weighting adjustment. We can see from the plot that the survival curve of the subject with treatment 1 is lower than the subject with treatment 0. It becomes more obvious after adjustment. We also perform the log-rank test for statistical comparison of the survival functions. Table 9 shows the comparison of the hazard ratio estimate with or without IPTW procedure. The log-rank test statistic without weighting is 19.35 with P-value 1.00E-05, while the weighted log-rank test statistic is 75.45 with a P-value less than 2.00e-16. We conclude that the difference in survival functions between the two groups is more significant at significance level 0.05 after weighting with the inverse of the propensity score. Moreover, the plot shows that the survival time of subjects with treatment 1, who received the RHC, tends to be lower than the survival time of those not receiving RHC.

Discussions and Conclusions
In this paper, we discussed three bias adjustment methods for causal inference in non-randomized clinical trials. According to the application results from three bias adjustment methods on the Right Heart Catheterization study, we conclude from the Cox proportional-hazards regression that patients receiving RHC had decreased survival time. Moreover, the difference in survival time between the two groups becomes more significant at significance level 0.05 after reducing the confounding bias.
Matching on propensity score is a good method for removing the bias between the treated group and the control group on the background covariates. It is preferred when the sample size of the control group is much large than the sample size of the treatment group. Stratification is preferred when the sample size is large enough since the estimation would be unreliable if there are not enough patients in each stratum. The IPTW method showed better performance in general. One may consider matching or stratification when the control group variance is much larger than the variance of the treatment group. Overall, a combination of the methods could be applied to make the estimation of the treatment effect more efficient.