Selection of Reliable and Valid Surgeon Performance Measures

Objective: To identify measures of surgeon performance that are valid, reliable, and capable of classifying surgeon risk.
Data Sources: A surgical quality improvement program dataset, unique to selected hospitals and surgeons, containing abstracted surgical case records.
Study Design: Six criteria were employed to assess the validity of 24 candidate measures of surgeon performance: 1) the presence of a surgeon random intercept; 2) a surgeon signal that is greater than zero; 3) surgeon majority control; 4) reliability of the surgeon random intercept of at least 0.7; 5) the capacity to identify both low- and high-risk surgeons; and 6) the presence of a learning/improvement effect.
Data Collection/Extraction Methods: Surgical case review nurses abstracted cases for each surgeon using a structured sampling and abstraction methodology.
Principal Findings: Comparing outcomes requires risk adjustment and the use of the "true score" approach but is limited by case volume constraints and a confounding factor, i.e., the hospital, if used to judge surgeons' performance. Assessing surgeon performance requires a measure of the surgeon's effects on the consequences (postoperative occurrences) of surgical procedures, i.e., the surgeon-specific random intercept, which is a product of a multilevel risk adjustment model.
Conclusion: Morbidities and mortality lack the characteristics necessary to be used as measures of surgeon performance. However, the process (task-time) measures LOS and OT both have high event rates, high reliability, and are capable of classifying surgeon risk.


Introduction
Surgeon performance measurements are potentially helpful for quality improvement [1], consumer decision support [2], and surgeon management [3,4]. Models of the surgeon role in modern multidisciplinary care include the "captain of the ship" and "member of the team" models [5,6]. According to the "captain of the ship" model, the surgeon assumes responsibility for patient and intervention selection. In contrast, in the "member of the team" model, decisions are made by the team. In a published statement regarding physician-led team-based surgical care, the American College of Surgeons (ACS) endorsed the team approach: "Optimal care is best provided by a coordinated multidisciplinary team recognizing each member's expertise. Coordinated surgical care provides the best outcomes, lowers costs, and increases patient satisfaction" (Statement on Physician-Led Team-Based Surgical Care) [7].
Recent studies investigating surgeon performance have focused on establishing the feasibility of evaluating surgeon performance and reliability using discrete measures. However, few studies have focused on identifying surgeon performance [8,9], and no studies have compared surgeon-related risk with the risk related to demographics, preoperative conditions, and surgical procedures. Iezzoni proposed that the purpose of risk adjustment is to obtain "meaningful comparisons within the health care system that generally require risk adjustment, accounting for patient-associated factors before comparing outcomes across different patients, treatments, providers, health plans or populations" [10]. The true risk score is the sum of the fixed and random effects identified by a multilevel mixed-effects risk model. The fixed effects consist of patient demographic factors, indicators of the presence or absence of patient preoperative conditions thought to impact the prevalence of postoperative complications, and case-mix factors that reflect surgical procedure risk. In a three-level risk model, the random effects are estimated for the risk added by the surgeons and hospitals. The "true score" used to assess surgeon performance compares the sum of the fixed and random effects to the sum of the fixed effects. It is expressed as the relative risk or odds ratio of a postoperative complication. In a three-level system (patient, surgeon, and hospital), the "true score" is a patient-level measure.
A measure's validity is affected by adequate observations, the performance measure's prevalence, and the sample size. Adams offered the following list of validity determinants for physician measures: 1) the level of physicians' control over the measure, 2) proper adjustment of case-mix differences among physicians, 3) whether another level in the system partially controls the measure, and 4) whether the measure is correlated with other established quality measures [11]. The level of surgeons' control over the candidate measures of surgeon performance has not previously been assessed.
The aim is to understand the impact of the constraints of validity, reliability, and model specification on the selection of surgeon performance measures.

Evaluation Framework
The following criteria were applied to determine the suitability of each candidate measure as a surgeon performance measure: 1) the presence of a surgeon random intercept; 2) a surgeon signal that is greater than zero; 3) surgeon majority control; 4) reliability of the surgeon random intercept of at least 0.7; 5) the capacity to identify both low- and high-risk surgeons; and 6) the presence of a learning/improvement effect.
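The six criteria can be read as a screening checklist applied to each candidate measure. The Python sketch below is illustrative only: the class, its field names, the example values, and the rule that at least one surgeon must reach the reliability threshold are assumptions made for the example and are not taken from the study's data or software.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateMeasure:
    """Summary statistics for one candidate measure (hypothetical field names)."""
    name: str
    has_surgeon_intercept: bool  # 1) test supports a surgeon random intercept
    surgeon_signal: float        # 2) estimated between-surgeon variation
    hospital_signal: float       # 3) compared with the surgeon signal
    reliable_surgeons: int       # 4) surgeons with reliability of at least 0.7
    low_risk_surgeons: int       # 5) surgeons whose interval lies below zero
    high_risk_surgeons: int      # 5) surgeons whose interval lies above zero
    has_learning_effect: bool    # 6) significant improvement over time


def failed_criteria(m: CandidateMeasure) -> List[int]:
    """Return the numbers of the criteria that the measure fails."""
    checks = {
        1: m.has_surgeon_intercept,
        2: m.surgeon_signal > 0,
        3: m.surgeon_signal > m.hospital_signal,
        # Assumption: at least one surgeon must reach the 0.7 threshold.
        4: m.reliable_surgeons > 0,
        5: m.low_risk_surgeons > 0 and m.high_risk_surgeons > 0,
        6: m.has_learning_effect,
    }
    return [number for number, passed in checks.items() if not passed]


# Hypothetical values for illustration only.
strong = CandidateMeasure("measure A", True, 0.30, 0.10, 500, 40, 60, True)
weak = CandidateMeasure("measure B", False, 0.0, 0.05, 0, 0, 0, False)
print(failed_criteria(strong))
print(failed_criteria(weak))
```

Returning the list of failed criteria, rather than a single pass/fail flag, keeps visible why a measure was excluded.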

Risk Model
Twelve months of abstracted data included in the dataset for this study, with 29,267 surgical cases, 644 surgeons, and 23 hospitals, was used to evaluate 24 candidate measures of surgeon performance. The candidate measures included the following postoperative occurrences: mortality, acute renal failure (ARF), bleeding/transfusion (BT), cardiac arrest requiring CPR (CPR), deep incisional surgical site infections (DSSI), deep venous thrombosis (DVT), myocardial infarction (MI), ventilator use for more than 48 hours (ONVENT), organ/space SSI (OSSI), pneumonia (PNA), progressive renal insufficiency (PRI), pulmonary embolism (PE), sepsis, septic shock (SHOCK), stroke/cerebrovascular accident (CVA), superficial SSI (SSSI), unplanned intubation (UI), urinary tract infection (UTI), wound disruption (WD), patients with morbidity (PTSWMB), readmission (READ), return to the operating room (ROR), operative time (OT) and length of stay (LOS). In total, 15,366 inpatient cases were used to risk-adjust the LOS. Only one procedure was performed in 19,412 cases, which were used to risk-adjust the OT. Cases with multiple procedures are likely to confound the OT risk adjustment and were thus excluded from the OT analysis. The dataset was generated by surgical case reviewers based on sampled cases reported by surgeons.
Multilevel mixed-effects models appropriate to the type of postoperative occurrence were used for risk adjustment. A logistic model was used for binary occurrences (all except for the LOS and OT). A negative binomial model was used for the LOS (in days). A linear model was used for OT (in hours rounded to the nearest 0.01). Random intercepts were included at the second and third levels of the models, i.e., surgeons and hospitals, respectively. A three-level model was used to estimate the patient risk score because it reflects the patient, surgeon, and hospital system in which the surgery and postoperative occurrences occur.
Standard demographic, preoperative risk factors, and procedure identifiers were included as covariates. Variables for patient age, gender, body mass index, number of procedures per case, procedure groups, and a Current Procedural Terminology (CPT) code-based measure of postoperative occurrence risk were employed. In this study, the grouping method was based on 47 categories of CPT codes representing different procedures, such as hernia repair, colectomy, and vascular bypass/repair. The CPT code-based measure of each postoperative occurrence risk was estimated by constructing multilevel mixed-effects models; a random intercept was created for each CPT code used in previous periods as an independent variable in the risk model.

Model Specification
Each risk model was tested to determine whether a multilevel model was required, using a likelihood ratio test to compare the multilevel model with the standard regression. A significant likelihood ratio test indicated that the multilevel model performed better than the standard regression and that at least one of the additional levels was helpful. The variance components were then analyzed to test the hypothesis that the between-surgeon, within-hospital variance is zero for each candidate measure of surgeon performance. A likelihood ratio test compared the full three-level model to an otherwise identical model in which the between-surgeon variance was set to zero by removing the surgeon random effect. If this hypothesis cannot be rejected, the surgeon signal and surgeon random intercept are not significant, and the surgeon's performance cannot be assessed.
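The variance-component test described above can be sketched as follows. This is a minimal illustration, not the study's Stata code: the function name and the log-likelihood values are hypothetical, and halving the p-value (because a variance cannot be negative, so the null value lies on the boundary of the parameter space) is a common convention that the text does not state explicitly.

```python
import math


def lr_test_variance_component(loglik_full, loglik_reduced, boundary_corrected=True):
    """Likelihood ratio test for a single variance component (1 degree of freedom).

    loglik_full:    log-likelihood of the model with the surgeon random intercept
    loglik_reduced: log-likelihood of the same model without it
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Chi-square survival function with 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    if boundary_corrected:
        # Assumption: 50:50 mixture of chi-square(0) and chi-square(1).
        p_value = 0.5 * p_value if statistic > 0 else 1.0
    return statistic, p_value


# Hypothetical log-likelihoods for illustration only.
statistic, p_value = lr_test_variance_component(-5120.4, -5126.9)
print(f"LR statistic = {statistic:.2f}, p = {p_value:.5f}")
```

A small p-value rejects the hypothesis of zero between-surgeon variance, which is the precondition for interpreting the surgeon random intercept at all.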

Measuring and Classifying Surgeon Performance
For each candidate measure, surgeons' performance was compared using each surgeon's Bayesian posterior mean (random intercept) and its 95% prediction interval. Values were assigned to the random intercepts using empirical Bayes predictions based on the following estimates: covariate coefficients (β), between-surgeon variance (ψ), and within-surgeon variance (θ). According to Bayes' theorem for linear models, the posterior distribution (posterior means/surgeon random intercepts) is proportional to the prior distribution multiplied by the likelihood of the responses.
The prior is a vector of shrinkage factors, and the likelihood is the surgeon's specific mean total residual. Surgeons with random intercepts and 95% prediction intervals above zero had significantly larger intercepts than the all-case-averaged intercept, and the postoperative occurrence risk was higher. In comparison, surgeons who had 95% prediction intervals below zero had intercepts significantly smaller than the all-case-averaged intercept, and the risk of a postoperative occurrence was lower.
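For the simplest case, a two-level linear random-intercept model, the shrinkage idea can be sketched in a few lines. The sketch below assumes the standard closed-form result for that model, in which the empirical Bayes prediction is the surgeon's mean total residual multiplied by a shrinkage factor ψ / (ψ + θ / n); the function names, the normal-approximation interval, and all numeric values are hypothetical, and the study's three-level and non-linear models require numerical methods instead.

```python
import math


def empirical_bayes_intercept(mean_total_residual, n_cases, psi, theta):
    """Empirical Bayes prediction of a surgeon random intercept.

    psi:   between-surgeon variance
    theta: within-surgeon (case-level residual) variance
    Returns the posterior mean and an approximate 95% interval.
    """
    shrinkage = psi / (psi + theta / n_cases)
    posterior_mean = shrinkage * mean_total_residual
    posterior_sd = math.sqrt((1.0 - shrinkage) * psi)
    lower = posterior_mean - 1.96 * posterior_sd
    upper = posterior_mean + 1.96 * posterior_sd
    return posterior_mean, lower, upper


def classify(lower, upper):
    """Label a surgeon by whether the interval excludes zero."""
    if lower > 0:
        return "high risk"
    if upper < 0:
        return "low risk"
    return "not distinguishable from average"


# Hypothetical inputs: same mean residual, different case volumes.
for n_cases in (100, 4):
    mean, lower, upper = empirical_bayes_intercept(0.5, n_cases, psi=0.04, theta=1.0)
    print(n_cases, round(mean, 3), classify(lower, upper))
```

With identical mean residuals, the low-volume surgeon is pulled much closer to zero and is left with an interval that spans zero, which is why case volume limits the ability to classify surgeons.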

Reliability
The reliability of each surgeon performance measure was calculated as the ratio of the between-surgeon variance to the sum of the between- and within-surgeon variances. The between-surgeon variance is the variance of the surgeon random intercepts, which is reported as a random surgeon effect in the multilevel mixed-effects model. The within-surgeon variance is the squared standard error of the measurement (random intercept), which is reported as the standard error of the empirical Bayes estimator of the random effects. A reliability score of 0.7 was used as the required threshold for identifying a surgeon as having a high or low risk for any postoperative occurrence [12].
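The reliability ratio defined above is simple enough to state in code. The following Python sketch uses hypothetical function names and input values; only the formula and the 0.7 threshold come from the text.

```python
RELIABILITY_THRESHOLD = 0.7


def surgeon_reliability(between_surgeon_variance, standard_error):
    """Reliability = between / (between + within), where within = SE squared."""
    within_surgeon_variance = standard_error ** 2
    return between_surgeon_variance / (
        between_surgeon_variance + within_surgeon_variance
    )


def meets_threshold(reliability, threshold=RELIABILITY_THRESHOLD):
    return reliability >= threshold


# Hypothetical values: the same between-surgeon variance with a precise
# (small standard error) and an imprecise (large standard error) estimate.
precise = surgeon_reliability(0.09, 0.15)
imprecise = surgeon_reliability(0.09, 0.30)
print(round(precise, 2), meets_threshold(precise))
print(round(imprecise, 2), meets_threshold(imprecise))
```

Because the standard error of a surgeon's random intercept generally shrinks as that surgeon's case count grows, reliability is computed per surgeon, and low-volume surgeons tend to fall below the threshold even when the between-surgeon variance is substantial.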

Measure Validity
To test Adams's first and third validity criteria, i.e., the physician control level, the between-surgeon and between-hospital variances (signals) estimated by the risk models were compared to identify the level of the system with majority control over the candidate measure. A postoperative risk model in which the surgeon has a larger signal than the hospital suggests that the surgeon has majority control and that the measure can be used as a surgeon performance measure if the other criteria are met. Spearman's rank correlation coefficients were used to assess the correlations among the candidate measures and to test Adams's fourth criterion.

Sensitivity Analysis of the Reliability Assessment
The surgeon reliability assessments from a two-level model, in which the random intercept is estimated only for the surgeons, were compared to those from the current three-level model, in which both the surgeon and hospital random intercepts are estimated. If the surgeon-only models had higher reliabilities, this would confirm the need to include a hospital random intercept.

Identification of a Learning/Improvement Effect
Evidence of learning requires measurable improvement over time. A second larger dataset of 171,116 cases was used to establish the presence of a learning/improvement effect.
Each postoperative occurrence was tested as an improvement measure over the 12 years for which there is data, using the three-level, mixed-effects model with a variable for year. The measure is the coefficient or odds ratio (95% confidence interval) for the independent variable year, dependent upon regression type.
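How the year effect is read depends on the regression type: a raw coefficient for the linear model, and an exponentiated coefficient (an odds ratio, or a rate ratio for the count model) otherwise. The sketch below illustrates that conversion under a normal approximation for the 95% confidence interval; the function names and the coefficient and standard error values are hypothetical and are not results from the study.

```python
import math


def year_effect(coefficient, standard_error, exponentiate, z=1.96):
    """Return (estimate, lower, upper) for the year variable.

    exponentiate=True  -> ratio scale (e.g. odds ratio), null value 1
    exponentiate=False -> coefficient scale (linear model), null value 0
    """
    lower = coefficient - z * standard_error
    upper = coefficient + z * standard_error
    if exponentiate:
        return math.exp(coefficient), math.exp(lower), math.exp(upper)
    return coefficient, lower, upper


def shows_improvement(upper, exponentiate):
    """Improvement means a significant decrease over time."""
    null_value = 1.0 if exponentiate else 0.0
    return upper < null_value


# Hypothetical logistic-model estimate for a binary occurrence.
odds_ratio, or_lower, or_upper = year_effect(-0.05, 0.01, exponentiate=True)
print(round(odds_ratio, 3), shows_improvement(or_upper, exponentiate=True))

# Hypothetical linear-model estimate whose interval includes zero.
coef, c_lower, c_upper = year_effect(-0.002, 0.004, exponentiate=False)
print(coef, shows_improvement(c_upper, exponentiate=False))
```

On the coefficient scale, the units follow the outcome: for OT modeled in hours, a coefficient of -0.007 per year corresponds to about 0.42 minutes less operative time per year.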
All analyses were performed using 64-bit STATA/MP 16.1 for Windows (College Station, Texas, U.S.A.). The modeling methods proposed by Rabe-Hesketh and Skrondal were followed [13].
The surgeon effects represent the surgeon's impact on the true score (in this example, OT). The hospital and surgeon signals, which are expressed as the mean and standard deviation, and the range of the surgeon effects are shown in Table 1. The hospital signal is zero for three candidate measures: PE, CVA, and UI. The surgeon signal is zero for six candidate measures: CPR, DVT, mortality, MI, PE, and SHOCK. The range of the surgeon effects (random intercepts) is zero for all candidate measures in which the signal is zero, and for PRI. The surgeon effects are expressed in probability units for all candidate measures, except for OT, which is reported in hours, and LOS, which is reported in days. The surgeon did not have majority control, i.e., the hospital signal was larger than the surgeon signal, for eleven postoperative occurrences: mortality, CPR, DSSI, DVT, MI, OSSI, PNA, PRI, sepsis, SHOCK, and SSSI. Table 2 shows the likelihood ratio test results. The three-level risk models offered no improvement over a standard regression analysis for five postoperative occurrences: ARF, CPR, ONVENT, CVA, and UI. The hypothesis that a surgeon random intercept does not exist could not be rejected for ten risk models: ARF, DSSI, ONVENT, OSSI, PNA, PRI, sepsis, CVA, SSSI, and UI.

Results
The associations between the candidate measures and other quality measures (all candidates) are reported in the appendix. Each candidate measure was associated with other candidate measures, ranging from a low of 4 measures for CPR to a high of 20 measures for BT. The presence of a learning/improvement effect was confirmed in ten of the candidate measures: in operative time the annual improvement (coefficient) was -0.007; P<0.0005; ARF the odds ratio ( performance included a 42% larger between-surgeon variance (0.1447 versus 0.1018), and the surgeon random intercept variance was greater by 22% (0.081 versus 0.066). Comparing the two models, the likelihood ratio test statistic was 18.32 (P<0.0001), indicating that the three-level model with both hospital and surgeon random intercepts fits better than the model with only the surgeon random intercepts. The following measures did not identify either high- or low-risk surgeons: mortality, ARF, CPR, DSSI, DVT, MI, ONVENT, OSSI, PNA, PRI, PE, sepsis, SHOCK, CVA/stroke, SSSI, UI, and UTI.

Discussion
The establishment of a high-quality registry of clinical information for surgical cases and outcomes facilitates quality improvement efforts [14]. In addition to the LOS and OT, surgical case mortality and morbidity have been proposed as measures for the assessment of the quality of surgical intervention. The LOS has been promoted as a quality measure by the Committee on Trauma of the ACS and has been positively impacted by Enhanced Recovery After Surgery protocols [15,16]. The duration of surgery has previously been used as a quality measure in the United Kingdom [17-19].
In this three-level random intercept model of patient risk, the sum of the fixed (β) and random effects ($u_j + v_k + \epsilon_{ijk}$) is the true risk score for patient i, with hospital k, surgeon j, and risk, demographic and case-mix factors, x. Here $u_j$ is the surgeon random intercept, $v_k$ is the hospital random intercept, and $\epsilon_{ijk}$ is the patient-level residual. The random intercept model shifts the overall regression line according to each surgeon and hospital, but the slope, β, remains constant. The random effect (random intercept) of surgeon j represents the individual differences compared to other surgeons due to personal characteristics that are not included as variables in the model. Since neither β nor x varies by surgeon and $v_k$ varies by hospital, exploring potential measures of surgeon performance required comparing $u_j$, the surgeon random intercept, among surgeons $u_1$ through $u_{644}$ in the current study for most assessed measures (LOS and OT provided results for 586 and 593 surgeons, respectively). The fixed effects, or the "slope" of the model, are important in developing surgeon random intercepts that are properly adjusted for the covariates. The fixed effects do not add to the surgeon's performance assessment once the intercepts have been estimated and do not impact the surgeon ranking. The surgeon random intercept estimation is adjusted by the inclusion of the fixed effects representing the patient demographics, preoperative risk, case-mix factors, and hospital random effects. A larger surgeon random intercept indicates that, for the same fixed-effects result, a patient's risk for an increased LOS, for example, is greater.
The patient-level "true score" measure includes the surgeon and hospital random effects plus the fixed effects in the numerator of the incident rate ratio, creating a measure of surgeon performance that is confounded by the hospital effects. Both the hospital and surgeon effects range from negative to positive for each candidate measure (Table 1); controlling for the between-hospital variance in the estimation of the surgeon performance by using a three-level model reduces the error associated with an ambiguous performance measure. Surgeon performance can be estimated using an incident rate ratio, where the numerator is the surgeon random effects plus fixed effects and the denominator is the fixed effects. However, because surgeon performance is measured using a random intercept, a comparison to the population-averaged intercept is intuitively more appealing and eliminates the counterintuitive comparison to the fixed effects, which include the patient demographics, patient preoperative risks and procedure risks, all of which may influence but are not measures of surgeon performance.
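A small numeric sketch may make the confounding concrete. It assumes a log-link model (as with the negative binomial model for LOS), so that each ratio is a ratio of exponentiated linear predictors and the fixed-effects part cancels; the variable names and all values are hypothetical.

```python
import math


def expected_rate(linear_predictor):
    """Expected outcome under a log link (assumed for this illustration)."""
    return math.exp(linear_predictor)


# Hypothetical values on the log scale.
fixed_part = 1.20        # patient's fixed-effects linear predictor
surgeon_effect = -0.10   # surgeon random intercept (better than average)
hospital_effect = 0.25   # hospital random intercept (worse than average)

# Patient-level "true score": fixed plus both random effects versus fixed only.
true_score = expected_rate(
    fixed_part + surgeon_effect + hospital_effect
) / expected_rate(fixed_part)

# Surgeon-only ratio: fixed plus surgeon random effect versus fixed only.
surgeon_ratio = expected_rate(fixed_part + surgeon_effect) / expected_rate(fixed_part)

print(round(true_score, 3))     # equals exp(surgeon_effect + hospital_effect)
print(round(surgeon_ratio, 3))  # equals exp(surgeon_effect)
```

In this example the surgeon's own effect lowers the expected outcome, yet the patient-level ratio exceeds 1 because of the hospital effect. The surgeon-only ratio depends solely on the surgeon random intercept, which is why comparing intercepts directly conveys the same information without reference to the fixed effects.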
Assessing the performance of the three-level model by testing for random intercepts at both level 2 (surgeon) and level 3 (hospital) is helpful and reduces the potential error of using model results as a measure of surgeon performance when no surgeon random intercept is present. In our experience with this dataset, only seven of the 24 candidate measures have surgeon random intercepts. Requiring majority surgeon control, as measured in this study, prevents holding the surgeon accountable for an outcome that has historically been controlled by the hospital. Surgeons who work in a three-level system at more than one level (as many do) may have an opportunity to influence policy at the hospital level and, consequently, can play a role in improving a target measure that is traditionally not influenced by surgeons. The candidate measures with no surgeon random intercept, i.e., no surgeon effect on their outcome, could require a non-traditional approach to establish a surgeon effect. In contrast, surgeons have a large effect on OT and can use technologies, such as robotic or other forms of minimally invasive surgery, to mitigate the impact of long procedure durations. Finally, surgeons can also influence the OT through additional learning and experience, as shown in this study.
The prevalence of postoperative morbidity and the distribution of cases among surgeons do not favor the use of morbidity as a measure of surgeon performance. Thirty-two percent of surgeons have fewer than 10 cases; eight percent of surgeons have only one case. Only 10 percent of surgeons have patients with DSSI, while 30 percent of surgeons have no patients with postoperative morbidity. WD has the largest surgeon signal, but with only 104 events, only two surgeons met the reliability threshold of 0.7. In total, 3,288 of 29,267 cases had any morbidity, and only 15 surgeons met the reliability threshold for this measure. The second most prevalent morbidity is transfusion, with 1,483 cases, and only 28 surgeons met the reliability threshold. Shih et al. concluded that, when assessing colectomy complication rates, statistical noise, as reflected in low reliability, is a significant determinant of most surgeons' specific complication rates because of the low case volume. Hall et al. reported that 61.9 percent of surgeons achieved a reliability of 0.7 for their morbidity measure. However, Hall et al. did not control for the between-hospital variance of morbidity because they used only a two-level model, including the surgeon and patient. The exclusion of the hospital level from their model of postoperative occurrences created the potential for a confounded between-surgeon variance and an inflated estimate of reliability. Postoperative process measures, such as RORs and READs, also suffer from a low prevalence, uneven case distribution, and low reliability. No surgeons met the minimum reliability threshold of 0.7 for unplanned READs within 30 days of discharge (1,605 events). Only 4 surgeons met the reliability threshold for ROR (1,101 events). In contrast, 104,799 inpatient days and 47,258 operative hours were reported. Three hundred seventy-five (58.4%) surgeons met the reliability threshold for LOS, and 527 (81.8%) surgeons met the reliability threshold for OT. Only 34 of 644 (5.3%) surgeons had no inpatient days, and all surgeons had OT.
The generalizable results of this study include several important points: OT is an excellent surgeon performance measure, while most postoperative outcome measures are limited by a low prevalence, no or low surgeon control, or an inability to classify risk. The LOS is a good surgeon performance measure, while BT and PTSWMB may be used selectively but lack the characteristics to be widely applicable. Careful consideration of the surgeon signal's presence and magnitude provides insight into the possible mechanisms by which reductions in postoperative occurrences can be achieved and whether the primary vector occurs at the hospital or surgeon level. The intraclass correlation could be used to determine the relative level of surgeon control in linear, logistic, and probit models. Because the LOS was modeled with a multilevel mixed-effects negative binomial model, and because of the desire to compare the control levels across model types, the surgeon signal was evaluated instead. The study results that are unlikely to be generalizable include the surgeon and hospital signals because care approaches may vary geographically and over time. However, this lack of generalizability also presents an opportunity for further studies to explore how the most significant surgeon effect can be achieved by examining the varied approaches to care for each postoperative occurrence.

Conclusions
Comparing outcomes across surgeons differs from measuring surgeon performance. Comparing outcomes requires a risk adjustment and the use of the "true score" approach. Still, it is limited by the constraints of case volume and a confounding factor, i.e., the hospital, if used to judge surgeons' performance. Assessing surgeon performance requires a measure of the surgeon's effects on the consequences (postoperative occurrences) of surgical procedures, i.e., the surgeon-specific random intercept, which is a product of a multilevel risk adjustment model. Postoperative morbidities and mortality lack the characteristics necessary to be used as measures of surgeon performance. The combination of low prevalence rates, low case numbers, low reliability, and limited ability to classify surgeons by risk generally precludes their use. The postoperative measures of process, ROR, and READs are also affected by low prevalence rates, low case numbers, and low reliability at the surgeon level. However, the process measures LOS and OT both have high event rates and high reliability. Controlling for the between-hospital variance of the postoperative occurrence in a three-level model reduces the probability of the hospital's influence on the candidate measure of surgeon performance. There is no control for between-hospital variance in a two-level model, and the surgeon reliability may be artifactually higher. Improvement or learning effects enhance the appeal of measures for evaluating surgeon performance.