Bioinformatics Analysis Identifies Potential Key Genes of Peripheral Blood Mononuclear Cell in Idiopathic Pulmonary Fibrosis

Idiopathic pulmonary fibrosis (IPF) is a chronic progressive fibrotic interstitial pneumonia with progressive worsening of dyspnea and lung function. The etiology of IPF is unknown, and the pathogenesis remains unclear. Our study aimed to investigate the key genes of the peripheral blood mononuclear cell in IPF by bioinformatics analysis. Our study used the online Gene Expression Omnibus (GEO) microarray expression profiling dataset GSE28042 to identify differentially expressed genes (DEGs) between IPF patients and healthy controls. We performed the Gene Ontology (GO) and pathway enrichment analyses of genes for annotation, visualization, and integrated discovery. The STRING database constructed Protein-protein interaction (PPI) network analysis, and hub genes were identified by the CytoHubba plugin. Moreover, we used the receiver operating characteristic (ROC) curve to assess the diagnostic value of the hub genes. In total, 28 upregulated and 44 downregulated genes were identified in the differential expression analysis. The protein-protein interaction network (PPI) was established with 69 nodes and 68 edges. The top 10 hub genes were JUN, FOS, STAT3, SOCS3, JUNB, DUSP1, IL4, FCER1A, MS4A2, and CPA3. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways enriched for the important module containing hub genes contained Fc epsilon RI signaling pathway, TNF signaling pathway, Jak-STAT signaling pathway, and MAPK signaling pathway. Additionally, the identified hub genes show high functional similarity and diagnostic value in IPF. Our study used bioinformatics analysis to provide new insight into the mechanisms underlying IPF. However, more experiments are needed to explore the relationships between the top 10 hub genes and IPF in the future.


Introduction
Idiopathic pulmonary fibrosis (IPF) is defined as a specific form of chronic, progressive fibrosing interstitial pneumonia and its cause is unknown [1]. Its clinical features are unexplained exertional dyspnea, chronic dry cough, or Velcro rale on examination [2]. The histopathology of this disease is interstitial fibrosis with spatial heterogeneity and patchy involvement of lung parenchyma, and microscopic honeycombing [3]. The incidence of IPF is high and its incidence rates increased over time in most countries. Its incidence ranges from 0.2 per 100000 per year to 93.7 per 100000 per year based on estimates from Europe and North America [4]. Besides, median survival from the time of diagnosis is only 2.5 to 3.5 years [5]. Currently, the risk factors of IPF are environmental, genetic, epigenetic alterations, and aging [6]. However, the pathogenesis is still not exact.
So far, understanding of the pathogenesis of IPF includes epithelial cell dysfunction caused by genetic susceptibility, defined profibrotic processes caused by TGF-beta activation, and progressive pulmonary fibrosis [7,8]. However, recent studies have shown that epithelial cell dysfunction is still a central cause of IPF [7]. Currently, the diagnosis of IPF requires exclusion of other known causes of interstitial disease (ILD), high-resolution, and surgical lung biopsy [1]. However, the accuracy of the diagnosis of IPF requires experienced physicians, which has certain limitations. IPF patients are usually in the terminal stages of the disease when diagnosed and there is no particularly effective treatment at this point. Therefore, early diagnosis and early treatment were most important.
Bioinformatics analyses can enable researchers to search online biological databases to explore the pathogenesis and molecular diagnosis, such as juvenile dermatomyositis [9], major depressive disorder [10], Tuberculosis [11]. Here, we use the peripheral blood mononuclear cell (PBMC) microarray dataset GSE28042 created by Herazo-Maya et al. to investigated the differentially expressed genes (DEGs) between IPF patients and healthy controls to explore the key genes. Our findings will provide new insights into the clinical diagnosis and treatment of IPF.

Microarray Data
The microarray data of GSE28042 was downloaded from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih,gov/geo/) database. The dataset was based on the GPL6480 Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name Version). The data type was expression profiling by the array and the species selected was Homo sapiens. The peripheral blood mononuclear cell samples (PBMC) included 75 patients with the diagnosis of IPF (IPF group) and 19 healthy controls (control group). The clinical details of GSE28042 were listed ( Table 1). The annotation file for GPL6480 was also downloaded from the GEO.

Analysis of Differentially Expressed Genes (DEGs)
We used the online analysis tool GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r/) to screen the DEGs between the IPF group and the control group. The GEO2R can allow us to compare two or more groups of samples to identify genes that are differentially expressed across experimental conditions. The results are presented as a table of genes ordered by significance. To exclude gender differences, we divided the IPF group and the control group into two groups (Male IPF group, Female IPF group, Male control group, Female control group), respectively, depending on the gender. P-values and adjusted P-values (adj. p) were calculated using t-tests. Genes with log2 fold change (FC) >1 and adj. p <0.05 were identified as DEGs. A Venn diagram of DEGs was drawn using the online tool Venny 2.1 (http://bioinfogp.cnb.csic.es/tools/venny/). The heatmap for the DEGs was created using R software (version 4.0.2).

Gene Ontology (GO) and Pathway Enrichment Analysis of DEGs
GO, a bioinformatics tool aims to establish a vocabulary that defines and describes the functions of genes and proteins for a variety of species. GO is divided into three parts: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC) [12]. KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database that systematically analyzes the metabolic pathways of gene products in cells and their functions [13]. KEGG integrates data on genomes, chemical molecules, and biochemical systems, including metabolic pathways, drugs, diseases, genes, and genomes [13]. The Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://david.ncifcrf.gov; version 6.8) is a free online biological information database and it provides a comprehensive set of functional annotation tools for researchers to understand biological meaning behind a large list of genes. We performed GO and KEGG pathway enrichment analyses using the DAVID online database to analyze the function of DEGs. P-value <0.05 was set as the cut-off criteria.

Protein-protein Interaction (PPI) Network Analysis and Module Analysis
A significant number of proteins do not function alone. Proteins interact with each other to form complexes that then do their work. We used the Search Tool for the Retrieval of Interacting Genes (STRING; https://string-db.org/; version 11.0) online database to systematically predict and construct a PPI network of all DEGs. A combined score >0.4 of PPI pairs was regarded as a significant interaction. Cytoscape (version 3.8.0), a bioinformatic software available online, can be used to construct and visualize the network of PPI. MCODE (version 1.5.1), a plugin of Cytoscape software, can construct functional modules by clustering in a large network of PPI. CytoHubba, mainly used for exploring PPI network hub genes, is a Cytoscape plugin. We selected the genes with the highest ranking by the maximum correlation criterion (MCC). The selected genes are represented by redder color.

Go and Pathway Enrichment Analysis of Hub Genes
To analyze the function of hub genes, biological analyses were performed using the DAVID online database. P-value <0.05 was set as the cut-off criteria.

Hub Genes Diagnostic Efficacy Evaluation
The receiver operating characteristic (ROC) curve can be used to assess the diagnostic accuracy. We use the "pROC" package of the R software to plot the ROC curve, calculate the area under the curve (AUC) and evaluate the diagnostic capability of hub genes to distinguish IPF patients and healthy controls.

Statistical Analysis
All statistical analyses were performed as the means±standard deviation. The R software (version 4.0.2) was used to analyze the data. A P-value <0.05 was considered statistically significant.

Differentially Expressed Genes
We downloaded the microarray expression dataset GSE28042 from the GEO database and analyzed the DEGs between IPF patients and healthy controls using the GEO2R tool. In total, 107 upregulated and 244 downregulated genes were identified between male IPF patients and male healthy controls. Besides, 54 upregulated and 212 downregulated genes were identified between female IPF patients and female healthy controls. The intersection of these two datasets identified 28 upregulated and 44 downregulated genes ( Table  2). The Venn diagram and heatmap for the DEGs are presented in Figure 1.

Go and Pathway Enrichment Analysis of DEGs
To analyze the functions and mechanisms of DEGs, the functional and pathway enrichment analyses of upregulated and downregulated DEGs were performed using the DAVID 6.8 online tool. In our study, a total of 51 GO terms and 8 pathways of DEGs (P-value <0.05), including 27 BPs, 12 CCs, and 12 MFs, were obtained, and the top five of each item are shown in Table 3. GO analysis results showed that changes in BPs of upregulated DEGs were significantly enriched in response to cAMP, positive regulation of cell differentiation, cellular response to calcium ion, response to drug, and regulation of cell cycle. Downregulated DEGs in BPs were significantly enriched in oxygen transport, bicarbonate transport, positive regulation of mast cell degranulation, angiotensin maturation, and positive regulation of cytosolic calcium ion concentration. Changes in CCs of upregulated DEGs were mainly enriched in nuclear chromatin, lamin filament, ciliary transition zone, transcription factor complex, and nuclear inner membrane. Downregulated DEGs in CCs were mainly enriched in hemoglobin complex, haptoglobin-hemoglobin complex, integral component of plasma membrane, endocytic vesicle lumen, and blood microparticle. Changes in MFs of upregulated DEGs were mainly enriched in RNA polymerase II core promoter proximal region sequence-specific DNA binding, transcription factor activity, RNA polymerase II core promoter proximal region sequence-specific binding, transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding, and transcription factor binding. Changes in MFs of downregulated DEGs were mainly enriched in oxygen transporter activity, oxygen binding, heme binding, iron ion binding, and haptoglobin binding ( Table 3).
The pathways enriched by upregulated DEGs were mainly related to the TNF signaling pathway, Osteoclast differentiation, Herpes simplex infection, Prolactin signaling pathway, and Hepatitis B. The pathways enriched by downregulated DEGs were mainly related to Asthma, Fc epsilon RI signaling pathway, and Hematopoietic cell lineage (Table 3). If there were more than five terms enriched in this category, the top five terms were selected according to P-value.

PPI Network Construction and Hub Gene Identification
Protein interactions among the DEGs were predicted with STRING online database. A PPI network with 69 nodes and 68 edges was obtained and the PPI network was visualized by Cytoscape ( Figure 2B). The cytoHubba plugin was then used to analyze hub genes with MCC, and genes with the top 10 scores were identified as hub genes ( Figure 2C). As shown in Figure 2C, six upregulated genes (JUN, FOS, STAT3, SOCS3, JUNB, DUSP1) and four downregulated genes (IL4, FCER1A, MS4A2, CPA3) were identified. The gene symbols, full names, and scores of hub genes are shown in Table 4.

Go and Pathway Enrichment Analysis of Hub Genes
We performed a functional enrichment analysis for hub genes. The GO analysis demonstrated that changes in BPs were mainly enriched in cellular response to calcium ion, regulation of cell cycle, positive regulation of mast cell degranulation, response to muscle stretch, B cell activation, positive regulation of pri-miRNA transcription from RNA polymerase II promoter, positive regulation of cell differentiation, SMAD protein signal transduction, transforming growth factor beta receptor signaling pathway, and response to drug. Changes in CCs were significantly enriched in external side of plasma membrane and nucleoplasm. Changes in MFs for the hub genes were enriched mainly in transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding, and RNA polymerase II core promoter proximal region sequence-specific DNA binding.
The pathways enriched by hub genes were mainly related to Asthma, Fc epsilon RI signaling pathway, Inflammatory bowel disease (IBD), Leishmaniasis, T cell receptor signaling pathway, TNF signaling pathway, Osteoclast differentiation, Jak-STAT signaling pathway, Prolactin signaling pathway, Measles, Hepatitis B, Herpes simplex infection, and MAPK signaling pathway (Table 5).

Using Hub Genes for IPF Diagnosis
The diagnostic accuracy of the top 10 hub genes was assessed using ROC curve analysis (

Discussion
In this study, we analyzed the DEGs in PBMCs from IPF patients and healthy controls. We performed independently for the male IPF and female IPF patients to pinpoint the potential gens. The results of the microarray analysis revealed the expression of 72 DEGs (28 upregulated genes and 44 downregulated genes). The associations between these genes were revealed by constructing a PPI network. The top 10 genes with the highest scores were identified, including JUN, FOS, STAT3, SOCS3, JUNB, DUSP1, IL4, FCER1A, MS4A2, and CPA3. There were six upregulated genes (JUN, FOS, STAT3, SOCS3, JUNB, DUSP1) and four downregulated genes (IL4, FCER1A, MS4A2, CPA3). In addition, we use GO and pathway enrichment analysis to perform their functions. At the same time, we used the ROC curve to analyze the AUC of the 10 top hub genes.
JUN, FOS, STAT3, SOCS3, JUNB, and DUSP1 are upregulated genes in PBMC of IPF. JUN and FOS are a subunit of the activator protein-1 (AP-1) [14]. AP-1 is a dimeric complex composed of the JUN (c-JUN, JUND, JUNB), FOS, ATF, and MAF protein families [14][15][16]. It can be seen that JUNB is a member of the JUN protein. Moreover, JUN and FOS are important members of the transcription AP-1 [17]. One study has reported that AP-1 induction may be associated with increased proliferation and extracellular matrix (ECM) production in IPF fibroblasts [18]. And Werning et al. have recently found that c-JUN and FOS are expressed in fibroblasts of IPF [19]. Besides, Chang et al. have shown that JUNB can regulate Epithelial-to-mesenchymal transition (EMT) [20]. One study has shown that EMT plays an important role in the pathogenesis of IPF [21]. Therefore, JUN, FOS, and JUNB may be involved in forming IPF.
STAT3 is a ubiquitously expressed latent cytoplasmic protein that regulates lung fibrosis [22]. Waters et al. have shown that STAT3 can promote fibroblast senescence to promote fibrosis [23]. Milara et al. have found that lung from patients with IPF expressed higher levels of STAT3, as well as phosphorylated [24]. Recently, it has been demonstrated that STAT3 regulates lung fibroblast-myofibroblast activation and differentiation in IPF [25]. Given these findings, STAT3 can be used as a biomarker in IPF.
SOCS3 is one of the most studied members of the SOCS family that consists of eight proteins (SOCS1-7 and cytokine-inducible SH2-containing protein, CISH) [26]. SOCS3 can be a major regulator of STAT3 activation [27]. One study has found that SOCS3 expression was shown to be elevated for up to 30 days in bleomycin-induced fibrosis [28]. A study by Akram et al. has shown that a significant increase in SOCS3 expression was observed in IPF AEC and macrophages compared to control lung tissue using dual immunohistochemical analysis [29]. Shocher et al. have found that primary human fibroblast culture from IPF (IPF-HLFs) expressed higher levels of SOCS3 when tested basal levels in HLFs [30]. Therefore, it may be hypothesized that SOCS3 may play a significant role in IPF. DUSP1, also named mitogen-activated protein kinase (MAPK) phosphatase (MKP-1) that dephosphorylates and deactivate MAPKs, acts as a negative regulator of the MAPK signaling pathway [31,32]. Redente et al. have shown that DUSP1-deficient mice reduced pulmonary fibrosis in bleomycin-induced fibrosis and pulmonary fibrosis was attenuated in mice given bleomycin using DUSP1 inhibitors [33]. Besides, one study has found that DUSP1 plays a critical role in promoting pulmonary fibrosis from macrophages to fibroblasts in vivo experiments [34]. These studies suggest that DUSP1 plays a crucial role in IPF and maybe a relevant therapeutic target.
IL4, FCER1A, MS4A2, and CPA3 are downregulated genes in PBMC of IPF. IL-4 is a fibrogenic cytokine that increases collagen production by fibroblasts [35]. But one study has found that pulmonary fibrosis and lung injury and inflammation in the bleomycin-induced fibrosis model of IL-4-deficient mice were substantially less than wild-type mice [36]. These results suggest that IL-4 has both fibrogenic and anti-inflammatory in the context of bleomycin-induced lung fibrosis and injury. FCER1A gene encodes the а-subunit (FCER1а) of the high-affinity IgE receptor consisting of an а-chain (FCER1), an β-chain (MS4A2) and two γ-chains (FCER1G) [37][38][39]. FCER1 and AMS4A2 expressing in mast cells (MCs) are associated with asthma [37,38,40]. But CPA3 is one of the MC-restricted proteases that are secreted by MCs [41]. It is also associated with asthma [41]. What's more, it is demonstrated that FCER1A, MS4A2, and CPA3 may be favorable prognostic indicators in non-small cell lung cancer [42,43]. However, there is no research on the relationship between FCER1A, MS4A2, and CPA3 and IPF so far. Based on our study, we can predict that FCER1A, MS4A2, and CPA3 may be PBMC markers in IPF.
In our study, all the AUC values of the top 10 genes were in the range 0.880-0.980 concerning the ROC curve. These genes possess high accuracy except that FCER1A indicated moderate accuracy [44]. KEGG enrichment analysis of these genes showed that these genes were mainly linked to the Fc epsilon RI signaling pathway, TNF signaling pathway, Jak-STAT signaling pathway, and MAPK signaling pathway. These signaling pathways play an essential role in the pathogenesis of IPF. We speculated that these genes might play an important role in IPF. But our study has a few limitations. Firstly, to fully identify the key genes in PBMC of IPF, it is better to combine venous blood samples and lung tissue to explore. Second, the sample size of the dataset we explored was too small. Therefore, it is necessary to increase the samples to improve the diagnostic accuracy of these hub genes in IPF. Third, a single microarray analysis was used in our study. It may result in a high false-positive rate. Thus, it is necessary to improve the detection capability by combining multiple individual data in future studies. What's more, some key genes and pathways were not found in IPF in previous studies. For this reason, we need more experimental evidence to prove the relationship between these key genes and IPF.

Conclusion
Our study used bioinformatics analysis to identify the associated biological functions and pathways involved in IPF to explore the pathogenesis, diagnosis, and prognosis of IPF. Moreover, we identified 10 key genes as potential diagnostic biomarkers through PPI network analysis and the ROC curve analysis. However, more experiments are needed to validate the results of our study further.