In-Silico Screening of Biomarker Genes of Hepatocellular Carcinoma Using R/Bioconductor

: Hepatocellular Carcinoma is a primary malignancy of the liver. It is the fifth most common cancer around the world and is a leading cause of cancer related deaths. For about 40 years HCC has been predominantly linked with Hepatitis B and Hepatitis C infection. This work aims to find out potential biomarkers for HBV and HCV infected HCC through rigorous computational analyses. This was achieved by collecting gene expression microarray data from GEO (Gene Expression Omnibus) database as GSE series (GSE38941, GSE26495, GSE51489, GSE41804, GSE49954, GSE16593) and pre-processing it using Bioconductor repository for R. Following a robust mechanism including the use of statistical testing techniques and tools, the data was screened for DEGs (Differentially Expressed Genes). 3354 down regulated genes and 785 up regulated genes for HBV and 3462 down regulated and 251 up regulated genes for HCV were obtained. For a comparative study of DEGs from HBV and HCV, they were merged to look for potential biomarkers whose differential expression may result in carcinoma. A total of 17 biomarkers (1 up-regulated and 16 downregulated), was obtained which were further subjected to Cytoscape to generate a GRN using STRING app. Furthermore, module level analysis was performed as it offers robustness and a better understanding of complex GRNs. The work also focuses on the topological properties of the network. The results point out to the presence of a hierarchical framework in the network. They also shed a light on the interactions of biomarkers whose down regulation may result in HCC. These results can be used for future research and in exploring drug targets for this disease.


Introduction
Hepatocellular carcinoma, also known by the name of malignant hepatoma, is a primary malignancy of the liver. It represents a poor prognostic cancer and is the fifth most common cancer in the world [1]. The primary cause for this cancer appears to be chronic liver disease and liver cirrhosis [2]. Annually, the cancer is diagnosed in about more than half a million people worldwide. Early diagnose can sometimes be cured with surgery or transplant but in more advanced cases it cannot be cured. The few cases (less than 5%) of HCC that do not develop on the background of chronic liver disease, are diagnosed late and usually have poor chances of cure [3].
Age is an important factor for this cancer as the people of 50 years or more have a higher risk of HCC as compared to the young population. Interestingly, the rate of this malignancy is higher in males than in females. Being the fifth most common cancer in men it appears to be the seventh most common cancer in women [4]. For more than 30 years HCC has been predominantly associated with chronic infection of Hepatitis B and Hepatitis C virus. Thus, the problem is even more overwhelming in regions where the incidence of chronic viral hepatitis B and/or C is of high prevalence [5].
It should be noted that the occurrence of HCC is increasing in several developing nations and is likely to increase in the same manner [6]. Although the rate of this tumor is low for the developed world, there is a distinct geographical variation in its incidence, with 81% of cases occurring in the developing world and 54% of these occurring in China [5]. In Chinese and black African population, mainly infected with HBV, the patients are younger, while in Sub-Saharan Africa (high incidence of HBV infection), where the incidence of HCC is the highest, it can appear in the third decade of life. In Asia and sub-Saharan Africa there are as many as 120 cases per 100,000. However for the Asia-Pacific region, it appears to be the third most common cause of cancer-related deaths [7].
According to the data from Surveillance Epidemiology and End Results (SEER), HCV infected HCC is a major cause of cancer mortality in the United States.
For a better understanding of the disease the biological data can be viewed using a computational approach and analysed accordingly. Comparative studies on HBV and HCV-infected HCC have shown that there exists distinct differential gene expression pattern (for each of them). In this work we use gene expression microarray data and analyse it to generate a Gene Regulatory Network using R and Cytoscape. We further find subnetworks and communities and then trace the biomarkers following the network. We also perform module enrichment and GO enrichment analysis.

Pre-processing of Data
The microarray data was downloaded from GEO (Gene expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/) database on NCBI. A stepwise search was performed for the identification of HCC-related gene expression profiles of humans using the keyword 'hbv' and 'hcv' for the corresponding GSE series. The datasets containing a comparison between normal and control tissues were preferred. A total of six datasets was downloaded (GSE38941, GSE26495, GSE51489; GSE41804, GSE49954, GSE16593) i.e. three for each type. For the pre-processing of microarray data Affy package was used. The packages were loaded onto the R environment from Bioconductor for data normalization and background correction was done [8]. Later, the text files generated were converted into a gene expression matrix with probe_id as rows and gene expression as the columns.

Screening for Differentially Expressed Genes
The gene expression matrix was used to obtain DEGs. The infected and normal controls were separated and average value of gene expression was calculated for each probe number. The probe numbers of the expression profile were later converted into the corresponding gene symbols following the correlation between gene and probe from the platform GPL570.
The fold change (FC) was calculated by subtracting the average values of infected samples from the corresponding values of normal controls [9]. A threshold of 0.5 was used for HBV and of 1 for HCV. Further, FC values were filtered to obtain DEGs. The common genes between HBV and HCV were found using VENNY 2.1.0. They were further screened for cancerous genes using NCG (Network of Cancer Genes, ncg.kcl.ac.uk) [10].

Network Construction and Topological Properties
The cancerous genes obtained from NCG (Network of Cancer Genes) were mapped onto Cytoscape to construct a gene regulatory network [11]. It is one of the common uses of Cytoscape to map attribute data onto a biological network, such as a protein-protein interaction network or metabolic pathway. The network was constructed using STRING database application (in the public database section) in Cytoscape [12]. Furthermore, the topological properties of the GRN were considered by constructing plots for degree distribution: node-degree distribution P(K), clustering coefficient C(K) and neighbourhood connectivity C N (K) and centralities: betweeness C B (K), closeness C C (K), eigen vector C E (K).

Finding Communities
The final and most important step of the process was to generate potential modules from the GRN so obtained. The modules were generated using R and Cytoscape simultaneously. An sif file was generated for the GRN from Cytoscape which was loaded onto R to be broken into communities. Text files containing the gene names of the respective modules were also generated through R by following a series of commands. The gene names were then used in Cytoscape to break the network into corresponding modules and biomarkers were traced following the hierarchy. This process was carried out until no communities were left that could be broken further and no genes were left to be traced. GO enrichment analysis was performed using DAVID 6.8 (Database for Annotation Visualization and Integrated Discovery, https://david.ncifcrf.gov/) [13].

Analysis of Topological Properties
The topological analysis of different large-scale biological networks highlights some recurrent properties: power law distribution of degree, scale-freeness, small world, which have been proposed to confer functional advantages such as robustness to environmental changes and tolerance to random mutations [14]. Network analysis acts as a powerful way for understanding the function and evolution of biological processes, provided that smaller functional modules are equally focused establishing a link between their topological properties and their dynamical behaviour. The topological parameters namely probability of degree distributions (P(K)), clustering co-efficient (C(K)) and neighbourhood connectivity (C N (K)) exhibit power law and are analysed here.
The behaviour of the network is characterized by equations (K) ~ − (K) ~ − (K) ~ ∅ The +ve value in theta of connectivity parameter shows assortative nature of the network. While, the -ve value in alpha (α) of degree distribution shows availability of each node in the network. The -ve value in beta of clustering parameter shows dissociation in the communication between the nodes in network.
The basic centrality parameters, namely, betweeness C B (K), closeness C C (K), eigen vector C E (K) of the network also exhibit hierarchical behaviour given by, C B (K) = 3.78 C C (K) = 0.21 C E (K) = 1.05 The + ve values of the centralities exponents shows the strong regulating behaviour of the nodes in the network.

Gene Ontology (GO) Enrichment Analysis
To understand the DEGs so obtained, it is essential to have the knowledge of their specific function. The genes obtained from the microarray data were divided into up-regulated and down-regulated elements from which cancerous genes were found namely BUB1B, RUNX1T1, COL3A1, EGFR, FGFR2, GPC3, LAMA4, MKI67, NEK2, PEG3, PLCG2, R1T1, TTK, CCNB2, AKR1B10, CASC5 (up-regulated) and MALAT1 (down-regulated). For these genes, the Gene Ontology was found using DAVID and three important categories namely Biological Process (bp), Cellular Component (cc) and Molecular Function (mf) were noted. It gave the results as in Table 1.

Network and Module Representation
A visual representation of the network and its corresponding sub-networks i.e. modules was done using Adobe Illustrator or AI. Each level was represented in different colours and the biomarkers in higher level of the network were highlighted in red. The result obtained marked the presence of hierarchy in the network. Table 2 shows the potential biomarkers that occurred in the higher levels of hierarchy. The hierarchical network is represented in Fig. 2.

Network Properties and Gene Tracing
We can identify potential hubs from the network by knowing the maximum number of interactions each node has. The potential hubs identified in the network were: PEG3, MKI67, RUNX1T1, GPC3, FGFR2, EGFR, RIT1 and COL3A1. A plot for Hamiltonian energy was then generated for the potential biomarkers at different levels (Fig. 3.). The plot shows an active participation in the lower levels as compared to higher ones.
A plot was generated for modularity at different levels [15]. The plot shows a decreasing trend as it moves towards the higher levels, meaning modularity decreases from one level to the other and follows the same trend as we go on. Fig. 4. shows the plot for Modularity.  At first two communities emerge from the network i.e. C1 and C2 which further divide into two more communities. C1 has 10 biomarkers namely GPC3, COL3A1, EGFR, PLCG2, RIT1, PEG3, AKR1B10, RUNX1T1, MKI67, FGFR2 while C2 contains 5 biomarkers namely NEK2, TTK, CASC5, CCNB2, BUB1B. As the divergence increases each community divides into two except for C121, which diverges into 3 communities. The communities are represented in different colour for different levels (Fig. 5). EGFR, FGFR, RUNX1T1, RIT1 and AKR1B10 go till the fourth level. The community at fourth level i.e. C1212 diverges into 1212a and 1212b and gives rise to the last level that is the fifth level which has PEG3 and MKI67 in community 1212a. We then represent the hierarchy in the network and trace the potential biomarkers till the last level of the hierarchy. Fig. 6 represents the hierarchical network and the division of genes into different communities and modules. Each level is shown in a different colour for a better understanding.

Conclusion
This work aims at providing an insight into the regulatory network of HBV and HCV infected Hepatocellular Carcinoma. The communities (and sub-networks) provide an insight into the interactions of the biomarkers involved in causing the malignancy. It can be noted that potential biomarkers for HCC, like AKR1B10, COL3A1, FGFR2, EGFR, PEG3 and MKI67 are present till the higher levels of the network. Also the topological properties suggest two things: degree distribution plots (P(K), C(K), C N (K)) suggest the presence of hierarchy in the network while the centralities (C B (K), C C (K), C E (K)) suggest the assortative nature of the network. The presence of biomarkers in higher levels, the plot for modularity and the topological properties together suggest the presence of potential hubs at every level of the hierarchy of network. In conclusion, this work can be used for future research to explore for potential drug targets and when worked upon can provide methods for the development of better treatment against HBV and HCV infected HCC.