A Text-Mining Framework for Supporting Systematic Reviews
Dingcheng Li1, 2, Zhen Wang1, 3, Liwei Wang1, Sunghwan Sohn1, Feichen Shen1, Mohammad Hassan Murad3, 4, Hongfang Liu1
1Department of Health Sciences Research, Mayo Clinic, Rochester, USA
2Watson Health Cloud, IBM, Rochester, USA
3Robert D. and Patricia E. Kern Centre for the Science of Health Care Delivery, Mayo Clinic, Rochester, USA
4Division of Preventive Medicine, Mayo Clinic, Rochester, USA
To cite this article:
Dingcheng Li, Zhen Wang, Liwei Wang, Sunghwan Sohn, Feichen Shen, Mohammad Hassan Murad, Hongfang Liu. A Text-Mining Framework for Supporting Systematic Reviews. American Journal of Information Management. Vol. 1, No. 1, 2016, pp. 1-9. doi: 10.11648/j.infomgmt.20160101.11
Received: July 21, 2016; Accepted: August 3, 2016; Published: August 31, 2016
Abstract: Systematic reviews (SRs) involve the identification, appraisal, and synthesis of all relevant studies for focused questions in a structured, reproducible manner. High-quality SRs follow strict procedures and require significant resources and time. We investigated advanced text-mining approaches to reduce the burden of abstract screening in SRs and to provide a high-level information summary. We propose a text-mining framework for supporting SRs that consists of three self-defined, semantics-based ranking metrics: keyword relevance, indexed-term relevance and topic relevance. Keyword relevance is based on the user-defined keyword list used in the search strategy. Indexed-term relevance is derived from the indexed vocabulary that domain experts develop for indexing journal articles and books. Topic relevance is defined as the semantic similarity among retrieved abstracts in terms of topics generated by latent Dirichlet allocation, a Bayesian model for discovering topics. We tested the proposed framework on three published SRs addressing a variety of topics (Mass Media Interventions, Rectal Cancer and Influenza Vaccine). The results showed that when 91.8%, 85.7%, and 49.3% of the abstract screening labor was saved, the recalls were as high as 100% for the three cases, respectively. Relevant studies identified manually showed strong topic similarity in the topic analysis, which supports the inclusion of topic analysis as a relevance metric. These results demonstrate that advanced text-mining approaches can significantly reduce the abstract screening labor of SRs and provide an informative summary of relevant studies.
Keywords: Systematic Review, Text Mining, Topic Modeling, Keyword Relevance, Indexed-Term Relevance, Topic Relevance, Data Mining
1. Introduction
Evidence-based medicine (EBM) plays a significant role in informing decisions about the care of individual patients. However, the large number of new publications in the health sciences hinders physicians and researchers from keeping up with the latest literature. Therefore, there is a great need for evidence summaries.
Narrative reviews usually involve rapid reviewing so that results can be obtained in a timely manner. For example, at an individual level, busy physicians want to find quick answers among thousands of publications; at a team level, a group of researchers may want to learn the current trends in a popular research area. In both cases, they may rely on high-reputation journals or highly cited articles to find what they need. In contrast to narrative reviews, systematic reviews (SRs) involve a detailed and comprehensive plan and search strategy, with the goal of reducing bias by identifying, appraising, and synthesizing all relevant studies on a particular topic. Therefore, SRs do not rely on journal ranking or abstract-counts to determine whether a study is relevant.
High-quality SRs follow strict procedures and require significant resources and time. At least eight time-consuming steps are needed to conduct a systematic review. Allen and Olkin estimated that an SR with 1000 potential studies retrieved for abstract screening needed 952 working hours to complete. A recent evaluation of 63 SRs conducted by 114 reviewers found that, on average, a reviewer spent 0.9 minutes, 7 minutes and 53 minutes on abstract screening, full-text screening, and data extraction, respectively [9,10]. To keep up with the latest literature, 7% of SRs needed to be updated at the time of publication, 4% within a year and 11% within 2 years. Therefore, methods that can increase the efficiency of abstract screening without compromising credibility are highly desirable.
In this study, we propose a text-mining framework that aims to reduce the burden of screening abstracts in SRs by using diverse relevance ranking metrics, namely keyword, indexed-term and topic relevance (the relevance metrics are defined in detail in the Methods section). The approach to reducing the screening burden is fully unsupervised. All ranking metrics are derived from information retrieval algorithms, and the framework offers the flexibility of adding or replacing ranking metrics. In addition, the framework features topic analysis. Specifically, topic analysis based on Latent Dirichlet Allocation (LDA) is a fully unsupervised model built on word co-occurrences that can group similar documents together. Since its introduction, LDA has been widely used in natural language processing [13,14], image processing [15,16], biomedical informatics, and bioinformatics to improve classification [17,19,20], summarization and other tasks [19,22]. After conducting the automatic systematic review, we investigated the topic distribution of the abstracts retrieved for each case study in order to find topic similarities and provide an informative summary.
In the following sections, we first introduce related work, then describe our approach in detail, and finally present experimental results from three case studies.
2. Related Works
Attempts to automate abstract screening in SRs started around 2006. O'Mara-Eves et al. described the evolution of such approaches and summarized 44 studies that implicitly or explicitly addressed screening workload problems. They concluded that efficiencies and reductions in workload are potentially achievable with text-mining approaches. Across the studies, a workload saving of 30% to 70% was reported as possible using such methods, although it may be associated with a loss of 5% of relevant studies (i.e., a 95% recall). Unlike other text-mining applications, systematic reviewers generally place strong emphasis on high recall (95% to 100%), that is, a desire to identify all the relevant studies, even if that means a vast number of irrelevant studies need to be considered to find them.
Existing automated methods for reducing screening burden in SRs include supervised machine learning and active learning. Identifying relevant abstracts can be defined as a binary document classification task in which a classifier is trained to label abstracts as relevant or irrelevant. Different supervised machine learning algorithms have been explored, including naïve Bayes, AdaBoost, and SVM by Aphinyanaphongs et al. [24,25], perceptron-based voting by Cohen et al., a factorized version of complement naïve Bayes (FCNB) by Uzuner et al., ensembles of SVMs by Wallace et al., and evolutionary SVM by Bekhuis and Demner-Fushman [29,30]. However, supervised machine learning requires annotated training data, for which informatics researchers rely on existing data gathered in previous SRs. For a given new topic, previous SRs may not be available to serve as training data. In addition, only a small percentage of the retrieved abstracts are relevant, which makes the training data highly imbalanced. To overcome these limitations, Wallace et al. proposed an active online learning approach that starts with a small training set and interactively obtains more training data. To avoid potential overfitting, Jonnalagadda and Petitti incorporated distributional semantics into the active learning process.
In contrast, we treat the identification of relevant abstracts as an information retrieval (IR) task and consider diverse IR relevance ranking metrics. Some previous IR-based work exists [33,34]. In our proposed SR framework, we also incorporate topic analysis to provide an informative summary as well as to improve relevance ranking. To our knowledge, our work is the first to integrate a topic model into an IR approach to reduce the screening burden in SRs for new studies. The closest work to ours is that of Bekhuis et al., who built a database of abstracts from 5 systematic reviews and then extracted 5 feature sets from the abstracts, including indexing and topic features, to train Bayesian classifiers that update relevant articles for previous studies. Two essential differences exist between our approach and theirs. First, they used topic probabilities and KL-divergences to generate topic features, while we calculate topic relevance from term-topic distributions and document-topic distributions. Second, they focus on finding related new publications by leveraging previous studies as training data, while we focus on discovering new studies in an unsupervised way.
3. Methods
Figure 1 provides an overview of the proposed framework. The core of the framework is the three relevance-ranking methods, which are derived from both Query and Topic Analysis. The Query component performs the screening and incorporates diverse IR ranking metrics to rank studies according to their relevance. The Topic Analysis component groups similar studies together and investigates the topic distribution to provide an information summary. The details of the framework are provided below. Three published Cochrane systematic reviews were used as case studies.
3.1. System Input
The system input includes a list of keywords, the corresponding search strategies adopted from the Cochrane SRs and a list of abstracts retrieved by a librarian for given SR protocols. The keyword list captures important concepts in the SR protocol and is utilized to assess keyword, indexed-term and topic relevance. For both keyword and indexed-term relevance assessment, keywords will be employed as query terms and the search strategies utilized to retrieve abstracts. For topic relevance, weights associated with keywords will be used to compute the relevance score. The abstract list will be used as the collection for identifying relevant studies. In Cochrane SRs, the search strategies are a mix of free text and indexed terms. Since those studies attempt to be comprehensive, diverse databases are involved and search strategies are subtly different for each of them. In this proposed framework, we only utilize a MEDLINE search strategy due to accessibility. In order to see the contributions of each relevance metric, we also separate free text and indexed terms as illustrated below.
3.2. Relevance Ranking
The three semantics-based relevance ranking metrics are keyword relevance, indexed-term relevance and topic relevance. Keyword relevance and indexed-term relevance are similar in that both measure how relevant an abstract is to the keyword list. The Lucene score was adopted to compute this relevance. It is based on term frequency and inverse document frequency (TF-IDF), computed after screening with a general stop-word list, and combines the Boolean model with the vector space model (VSM). Specifically, we index the abstract collection using Lucene. A query is then formed from the keyword list. The score returned by Lucene for searching the title and the abstract of each study is used to measure keyword relevance. In this combined model, thresholds are applied to the weights obtained by the VSM so that a binary score can be assigned to each weight. Since keywords are generated by users, we may regard keyword relevance as user-defined semantics.
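As a rough illustration of TF-IDF keyword scoring, the sketch below ranks abstracts against a keyword list in plain Python. It is a simplified stand-in and not Lucene's actual scoring formula: the stop-word list, the tokenizer, the square-root term-frequency damping and the function names (`tokenize`, `keyword_relevance`) are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "of", "and", "in", "a", "for", "to", "on"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def keyword_relevance(abstracts, keywords):
    """Return one TF-IDF style score per abstract for the given keyword list."""
    docs = [Counter(tokenize(a)) for a in abstracts]
    n_docs = len(docs)
    query_terms = set(tokenize(" ".join(keywords)))
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc[term]
            if tf == 0:
                continue
            # Document frequency: number of abstracts containing the term.
            df = sum(1 for d in docs if term in d)
            idf = 1.0 + math.log(n_docs / df)
            score += math.sqrt(tf) * idf
        scores.append(score)
    return scores
```

Abstracts that share more (and rarer) query terms receive higher scores; an abstract with no query term scores zero.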
Indexed-term relevance is based on Medical Subject Headings (MeSH), a comprehensive controlled vocabulary for indexing journal articles and books in the life sciences, provided by the National Library of Medicine (NLM). Usually, each abstract indexed by PubMed is assigned a group of relevant MeSH terms. We therefore assume that those MeSH terms reflect the degree of relevance among the given studies. The score returned by Lucene for searching the indexed MeSH terms is used to measure indexed-term relevance. Indexed-term relevance can vary with the indexed vocabulary used by a given database system. Indexed terms are usually defined by experts in specific fields, so this relevance can also be regarded as expert-defined semantics.
Topic relevance is derived from topic analysis with LDA, detailed in the next section. Each relevance score is normalized across the abstract collection using unit-length scaling (i.e., $x' = x / \lVert x \rVert$, where $x$ is the vector of scores over all abstracts).
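A minimal sketch of unit-length scaling is shown below, assuming the Euclidean norm is taken over the scores of all abstracts for one metric. The function name `unit_length_scale` and the handling of an all-zero score vector are choices made for this example.

```python
import math


def unit_length_scale(scores):
    """Divide each score by the Euclidean norm of the whole score vector."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        # No abstract matched at all; keep every score at zero.
        return [0.0 for _ in scores]
    return [s / norm for s in scores]
```

After scaling, the scores of each metric lie on a comparable scale, which is what makes a simple sum of metrics meaningful.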
3.3. Topic Analysis with Latent Dirichlet Allocation
In this component, we use the LDA implementation in the Mallet toolkit. All retrieved abstracts and their titles in the case studies are used to construct the LDA models. Stop words are removed from the raw documents in a pre-processing step. Perplexity optimization is then used to find the best number of topics: a grid search looks for the lowest perplexity with the number of topics ranging from 5 to 100. Once the number of topics is fixed, 1000 iterations are performed to obtain the topic distributions of the given studies. After the LDA results are obtained, each topic, represented by its highest-probability words (roughly the top 10 words) returned by LDA, is used to provide a high-level information summary. A topic is considered prominent if it covers more than 10% of abstracts.
We assume that the studies retained after manual screening tend to have similar topic distributions. Hence, one more relevance metric is defined from the topic distributions and incorporated into the abstract-screening framework. Topic relevance comes from the abstract itself. Specifically, given a query (q, the keyword list), the topic relevance score of an abstract (d) is calculated as:
$$\mathrm{rel}_{topic}(q, d) = \sum_{z} P(q \mid z, \hat{\phi}) \, P(z \mid d, \hat{\theta})$$
where $\hat{\phi}$ and $\hat{\theta}$ are the posterior estimates of $\phi$ (the topic distribution over words) and $\theta$ (the topic distribution of an abstract). The values of the hyper-parameters $\alpha$ and $\beta$ need to be determined beforehand. The former controls the abstract distributions while the latter controls the word distributions. The optimal values for $\alpha$ and $\beta$ can also be obtained through grid search. Here, we follow the usual heuristic practice by setting $\alpha$ to 50 divided by the number of topics and $\beta$ to 0.01.
The term $P(q \mid z, \hat{\phi})$ is the probability of the query word given a topic (z), tuned by $\hat{\phi}$. $P(z \mid d, \hat{\theta})$ is the probability of topic z (namely, the common hidden semantics of some words or some documents) given abstract d, tuned by $\hat{\theta}$. The product of $P(q \mid z, \hat{\phi})$ and $P(z \mid d, \hat{\theta})$ indicates how close query word q is to abstract d under topic z. The implementation of topic relevance is based on the posterior estimates $\hat{\phi}$ and $\hat{\theta}$, which are outputs of Mallet.
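The description above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the per-topic products are summed over topics, the contributions of the individual query words are then summed (how the words of a keyword list are aggregated is an assumption here), and the data layout and the names `lda_hyperparameters` and `topic_relevance` are hypothetical.

```python
def lda_hyperparameters(num_topics):
    """Heuristic priors described in the text: alpha = 50 / K and beta = 0.01."""
    return 50.0 / num_topics, 0.01


def topic_relevance(query_words, doc_topic, topic_word):
    """Sum over query words and topics of P(word | topic) * P(topic | abstract).

    doc_topic:  list of P(z | d) for one abstract, indexed by topic.
    topic_word: list of dicts, one per topic, mapping word -> P(word | z).
    Summing over query words is an assumption of this sketch.
    """
    score = 0.0
    for word in query_words:
        for z, p_topic_given_doc in enumerate(doc_topic):
            # Words absent from a topic's dictionary contribute probability 0.
            score += topic_word[z].get(word, 0.0) * p_topic_given_doc
    return score
```

In practice the two inputs would be read from the topic model's output (the estimated topic-word and document-topic distributions); the toy dictionaries used to exercise the function are not real model output.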
4.1. Data Sources
We retrospectively evaluated our framework using three published Cochrane SRs that were chosen to cover different topics (Table 1). The SRs assessed mass media interventions for reducing mental health-related stigma, postoperative adjuvant chemotherapy in rectal cancer, and the effect of vaccination on preventing influenza in healthy children. The numbers of abstracts retrieved from MEDLINE with the search strategy described above were 3,303, 4,075 and 811, and the numbers remaining after manual screening were 7, 10 and 49 (0.22%, 0.25% and 6%), respectively.
4.2. Evaluation Metrics
We adopted a few metrics that have been utilized previously to measure the screening performance. For a given ranking threshold T, Table 2 provides the definition of each metric.
For a given ranking threshold, the recall change and the reduction in screening burden are the standard metrics used by previous efforts on reducing SR workload [26,32]. We also pooled the combined effect size of the outcomes using the DerSimonian and Laird random-effects model to show whether meta-analysis estimates derived from the results obtained with our framework differ from those in the published Cochrane review (i.e., the gold-standard list of studies obtained manually). The difference in effect size was tested using the interaction test described by Altman and Bland.
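For readers unfamiliar with the two statistical tools named above, the sketch below shows the standard DerSimonian and Laird moment estimator and the z statistic for the difference between two estimates. It is a generic textbook formulation, not the authors' analysis code; the inputs are assumed to be per-study effects on an additive scale (for example log risk ratios) with their variances, and the function names are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool study effects with a random-effects model.

    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    return pooled, math.sqrt(1.0 / sum_w_star), tau2


def interaction_z(effect_a, se_a, effect_b, se_b):
    """z statistic for the difference between two independent estimates."""
    return (effect_a - effect_b) / math.sqrt(se_a ** 2 + se_b ** 2)
```

A z value close to zero indicates that the estimate pooled from the automatically selected studies does not differ detectably from the estimate pooled from the manually selected studies.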
Table 1. The three Cochrane SRs used as case studies.
|Case|Mass media interventions for reducing mental health-related stigma|Postoperative adjuvant chemotherapy in rectal cancer operated for cure|Vaccines for preventing influenza in healthy children|
|Objective|To assess the effects of mass media interventions on reducing stigma (discrimination and prejudice) related to mental ill health compared to inactive controls, and to make comparisons of effectiveness based on the nature of the intervention (e.g. number of mass media components), the content of the intervention (e.g. type of primary message), and the type of media (e.g. print, internet).|To quantitatively summarize the available evidence regarding the impact of postoperative adjuvant chemotherapy on the survival of patients with surgically resectable rectal cancer.|To appraise all comparative studies evaluating the effects of influenza vaccines in healthy children, assess vaccine efficacy (prevention of confirmed influenza) and effectiveness (prevention of influenza-like illness (ILI)) and document adverse events associated with influenza vaccines.|
|Eligibility Criteria|Undergraduate university students from seven upper level psychology courses, two introductory psychology courses, one introductory communications, and two advanced communications|Adults undergoing surgery for rectal cancer who received no adjuvant chemotherapy and those receiving any postoperative chemotherapy regimen.|School children from 2 boarding schools aged 4 to 7 years and 8 to 15 years. There does not appear to be any attrition|
The screening performance was assessed for three combinations of relevance metrics by the distribution of relevant studies across five ranking intervals, I (1-100), II (101-200), III (201-300), IV (301-400) and V (above 400):
A. keyword relevance
B. linear combination of keyword relevance and indexed-term relevance
C. linear combination of keyword, indexed-term and topic relevance
The choice of intervals is based on what has been reported in the literature (that is, a workload saving of between 30% and 70% is expected to be associated with a loss of 5% of relevant studies).
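The combinations and ranking intervals above can be sketched as follows, assuming each metric has already been normalized and that the combination is a plain unweighted sum. The function names and the data layout (one score list per metric, aligned by abstract index) are hypothetical.

```python
def combine_and_rank(metric_scores):
    """Sum the normalized scores per abstract and return indices, best first.

    metric_scores: list of score lists, one list per relevance metric,
    all aligned by abstract index.
    """
    totals = [sum(per_abstract) for per_abstract in zip(*metric_scores)]
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)


def interval_distribution(ranking, relevant, width=100, n_intervals=5):
    """Fraction of relevant abstracts whose rank falls in each interval.

    The last interval collects every rank beyond (n_intervals - 1) * width.
    `relevant` must be a non-empty set of abstract indices.
    """
    counts = [0] * n_intervals
    for position, abstract_id in enumerate(ranking):
        if abstract_id in relevant:
            counts[min(position // width, n_intervals - 1)] += 1
    return [c / len(relevant) for c in counts]
```

Passing one, two or three score lists to `combine_and_rank` corresponds to combinations A, B and C, and `interval_distribution` yields the per-interval proportions of relevant studies for a given ranking.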
Table 2. Definitions of the evaluation metrics.
|Metric|Definition|
|Ranking threshold (T)|Number of top-ranked abstracts used as the threshold.|
|True positive (TP)|Number of abstracts ranked above the threshold that match the studies included by human reviewers (a few professional systematic reviewers).|
|Recall|The ratio of true positives to the number of relevant studies identified manually.|
|Precision|The ratio of true positives to the threshold.|
|Screening saved|The total number of abstracts retrieved minus the threshold, divided by the total number of abstracts retrieved.|
|Combined effect size|A summary estimate that results from meta-analysis of the individual studies included in a systematic review.|
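The definitions above translate directly into code. The sketch below is illustrative; the function name `screening_metrics` and the dictionary keys are choices made for this example.

```python
def screening_metrics(ranking, relevant, threshold):
    """Compute true positives, recall, precision and screening saved.

    ranking:   abstract identifiers ordered from most to least relevant.
    relevant:  set of identifiers included by the human reviewers.
    threshold: number of top-ranked abstracts that are screened manually.
    """
    screened = ranking[:threshold]
    tp = sum(1 for abstract_id in screened if abstract_id in relevant)
    return {
        "true_positives": tp,
        "recall": tp / len(relevant),
        "precision": tp / threshold,
        "screening_saved": (len(ranking) - threshold) / len(ranking),
    }
```

For instance, with 4,075 retrieved abstracts, 10 relevant studies of which 8 rank within the top 400, and a threshold of 400, the function gives a recall of 0.8 and a screening saving of (4075 - 400) / 4075, about 90.2%, the figures reported for Case 2. The exact rank positions of the relevant studies do not affect these values as long as 8 of them fall within the threshold.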
4.3. Results of Case Studies
Table 3 and Table 4 show the screening performance of our framework and the topic distribution of each case study, respectively. Only combination C was used in Table 3 and Table 4 since it showed the best performance. Although systematic reviewers generally place strong emphasis on high recall, we also report the screening labor for lower recall rates in order to provide a comprehensive view across the three case studies. Figure 2 depicts the proportion of relevant studies in the five ranking intervals. In the following, we detail the results case by case.
Case 1. Mass Media Intervention
The total number of retrieved abstracts is 3,303 and the number of true positives is 7, so about 0.2% of the abstracts are true positives. When the ranking threshold is 300, we achieved a recall of 100% with 91.8% of the screening labor saved. For combination A, where only keyword relevance was used, the ratios of relevant studies in intervals I and II are 0.14 (1 out of 7) and 0.29 (2 out of 7), respectively. Adding indexed-term relevance (combination B) reversed the proportions for intervals I and II (now 0.29 and 0.14, respectively). After adding topic relevance (combination C), the ratio for interval I increases by 0.43 (i.e., to 0.72, 5 out of 7).
The number of topics selected through perplexity optimization was 20. Two prominent topics (with 4 and 2 abstracts, respectively) were found. The top topic words for one (Topic 7) include brain, cortex, cognitive and temporal, and those for the other (Topic 17) include depression, anxiety, mood and suicidal.
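Table 3. Screening performance of the framework for the three case studies.
|Case 1 mass media intervention|
|Screening saved (%)|0|89|91.8|94.5|97.3|
|Combined effect size and 95% confidence interval|0.92|0.92|0.92|0.94|0.96|
|Case 2 rectal cancer study|
|Screening saved (%)|0|85.7|90.2|92.6|95.1|97.5|
|Combined effect size and 95% confidence interval|0.92|0.92|0.93|0.94|0.95|0.96|
|Case 3 flu vaccine study|
|Screening saved (%)|0|49.3|61.7|73.4|86.3|
|Combined effect size and 95% confidence interval|1.02|1.02|1.01|0.99|0.99|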
Table 4. Topic distribution of each case study. Each cell lists the key words of a topic, with the number of relevant studies in parentheses.
|Topic No|Case 1 mass media intervention|Case 2 rectal cancer study|Case 3 flu vaccine study|
|1|smoking, tobacco, prevalence, cessation|cases, tumor, treated, tissue, bone, years|antibody, vaccine, influenza, hemagglutinin|
|2|drug, users, abuse, reduction, addiction|survival, patients, rates, surgery, lower|years, age, children, groups, chronic, months (7)|
|3|studies, trials, interventions, reports|months, medium, relapse, developed|label, respiratory, media, acute (9)|
|4|adolescents, screening, factors, age (1)|radiotherapy, radiation, rectal, acute|asthma, vaccination, pulmonary, exacerbations (21)|
|5|participants, weight, increased, trial|tumor, surgical, biopsy, vincristine (4)|virus, antibody, h1n1, h3n2, inhibition, antigen (12)|
|6|patients, placebo, dose, baseline|mortality, induction, complications, deaths|patients, group, residents, population|
|7|brain, cortex, cognitive, temporal (4)|cancer, adjuvant, colorectal, adverse|cost, effectiveness, economic, criteria|
|8|children, behavioral, ratings, families|malignant, progression, brain, surgical|placebo, dose, days, recipients, adults|
|9|alcohol, survey, questionnaire, questions|prognostic, retrospectively, regression|coverage, increased, persons, data, season|
|10|women, hiv, sexual, aids, African|chemotherapy, neoadjuvant, pathologic (3)|elderly, high, pandemic, deaths, morbidity|
|11|internet, web, computer, feedback|carcinoma, pelvic, endometrial, squamous|respiratory, symptoms, fever, illnesses|
|12|interviews, communication, dementia (1)|lung, patients, cisplatin, prospective|reactions, split, immunogenicity, safety|
|13|psychological, measure, scale, sample|cancer, surgery, therapeutic, oncology||
|14|social, autism, fear, examined|breast, tamoxifen, mastectomy, relapse||
|15|memory, auditory, attention, motor|resection, liver, metastases, hepatic||
|16|mental, public, caregivers, policy|trials, randomized, systematic, advantage (2)||
|17|depression, anxiety, mood, suicidal (2)|adjuvant, margins, nodal, invasion||
|18|community, prevention, local, based|complications, performed, laparoscopic||
|19|exposure, blood, beta, central, amyloid|dose, fluorouracil, paclitaxel, regimen||
|20|disorders, lead, association, diagnostic|trials, adjuvant, randomized, systematic, survival||
Case 2. Rectal Cancer
The total number of retrieved abstracts is 4,075 and the number of true positives is 10, so about 0.25% of the abstracts are true positives. When the ranking threshold is 400, we achieved a recall of 80% (8 of the 10 abstracts) with 90.2% of the screening labor saved. This result did not reach the goal of high recall. Therefore, we also used a threshold of 600 for this case, where the recall is 100% and 85.7% of the screening labor is saved. The ratio of relevant studies for combination A in interval I is 0.40 (4 abstracts). For combination B, the ratio in interval I increases to 0.50 (i.e., an increase of 0.10, or one more abstract found). Topic relevance (combination C) brings another increase of 0.10 (one more abstract) in interval I (now 0.60, or 6 out of 10).
The optimal number of topics selected through perplexity optimization was 20. The three prominent topics include 4, 3 and 2 abstracts, respectively. The top words for the first topic (Topic 5) are clinically related words including tumor, surgical, biopsy, vincristine, removal, malignant and resection. The second topic (Topic 10) is more therapy related, consisting of chemotherapy, neoadjuvant and pathologic. The third one (Topic 16) comprises trials, adjuvant, randomized, systematic, survival and regimens.
Case 3. Influenza Vaccine
For this SR study, the number of true positives is 49, about 6% of the retrieved abstracts. A recall of 98% (48 out of 49 abstracts) with a 49.3% saving in screening labor is achieved when the threshold is 400. For this study, the best ratio achieved in interval I is 0.39 (19 out of 49) by combination B. After adding topic relevance (combination C), there is a slight decrease of 0.06 in interval I compared to combination B but an increase of 0.08 in interval II. Counting intervals I and II together, the best results come from combination C (0.59, 0.66 and 0.68, or 29, 32 and 33 abstracts, for A, B and C, respectively).
The optimal number of topics is 12. The 49 relevant studies are mainly distributed among four topics. One topic (Topic 4) includes 21 abstracts, with asthma, vaccination, pulmonary and exacerbations as dominant words. Another topic (Topic 5) includes 12 abstracts, with words such as vaccine, antibody, virus, h1n1 and h3n2. The third one (Topic 3) includes 9 abstracts, in which label, respiratory, media and acute are the top words, and the fourth one (Topic 2) includes 7 abstracts, with years, age, chronic, children and groups among its words.
We have described a text-mining framework that reduces the abstract screening burden in SRs while maintaining a high recall rate, and that can also provide an informative summary. This framework is partially inspired by our prior work on automated reference assignment, which explores methods for automatically assigning references to expert-written content, and it is also a significant extension of our other work on reducing screening labor. Compared with related work, the proposed framework has multiple advantages. First, it is purely unsupervised. The use of diverse relevance ranking metrics does not require any training data, unlike supervised learning or active learning. Second, topic analysis enables the systematic exploration of topics, which can help reviewers gain a better understanding of the relevant studies. Third, our framework has good portability and extensibility. As mentioned in the Introduction, we focus on newly conducted SRs, while prior work focused on updating existing reviews. However, extending our framework to update published SRs is possible with minimal effort. We can either run our framework on the newly added studies to test how relevant they are to the previous studies, or use all relevance scores as features to train classifiers. It will be interesting to use public resources to make comparisons with other approaches, which we leave as future work.
More importantly, the evaluation on three diverse systematic reviews demonstrates robust performance: adding indexed-term relevance and topic relevance boosts performance compared with using keyword relevance alone. MeSH terms, as an indexed-term system, are derived from experts, so it is understandable that MeSH may be a good relevance metric. In Case 1, topic relevance was more helpful than the other two relevance metrics, and it brought improvements for both Case 2 and Case 3 as well. Hence, we consider it a reasonable relevance metric given the unsupervised nature and the modularity of topic modeling. We can flexibly extend topic modeling to incorporate diverse features and to strengthen the model with more representative variables, such as domain knowledge, indexed terms and external resources.
One limitation is that we evaluated our framework retrospectively. To truly assess the contribution of the framework, a prospective indexed study is needed where two groups of systematic reviewers, one with the support of our system and the other following the traditional SR workflow, would conduct demonstrative SRs. The outcome of the two groups can be compared in terms of time spent on abstract screening and the final list of studies selected.
In addition, our current approach for combining relevance metrics is simply an unweighted linear combination. We observed that the contribution of each relevance metric is not always consistent across SR studies. In the future, we plan to give end users the option of weighting the relevance ranking metrics differently.
Another limitation of this study is that only MEDLINE was searched, due to accessibility and feasibility issues. EMBASE and other databases are also known to be important to search in a comprehensive SR. Future work should evaluate text-mining approaches on other databases to enhance the portability of the proposed framework.
A credible SR should summarize evidence from studies selected according to explicit methodological criteria. Studies should not be selected based on the reputation of journals (impact factor) or authors. Otherwise, the SR would propagate publication bias and not represent the totality of evidence. Therefore, the ranking metrics in our framework (keyword relevance, indexed-term relevance and topic relevance) are all purely semantics-based. If a rapid (not systematic) review is needed, journal relevance and citation relevance could potentially be used as supplements to our framework.
6. Conclusion and Future Work
We demonstrated that a text-mining framework for supporting SRs, based on diverse relevance ranking metrics, can reduce the labor of SRs to a large degree while maintaining comparably high recall. We also incorporated topic analysis into the framework to provide a high-level summary of the latest developments in intervention trials on given topics. Future work will test such a framework in prospective studies and integrate limited-supervision techniques iteratively into the SR workflow to further increase recall and reduce screening burden.
DL led the study design, methodology implementation and drafted the manuscript. DL and FS implemented the data extraction and formation. LW, ZW, MHM, SS and HL gave guidance and consultations on the study designs and on the manuscript editing. HL provided institutional support and manuscript editing. All authors read and approved the final manuscript.
The study was supported by the following grants: R01GM102282A1, R01LM11934A1, R01LM11829A1, R01LM11369A1 and 1K99LM012021-01A1.
Special thanks to Yanshan Wang and Yue Yu for their help with the final proofreading and editing, and to the systematic review group at the Mayo Clinic for its strong support.