Computational Analysis of Single Nucleotide Polymorphisms (SNPs) in Human SLC5A1 Gene

Abstract: Glucose-galactose malabsorption (GGM) is an autosomal recessive disease manifesting within the first weeks of life. It is characterized by a selective failure to absorb dietary glucose and galactose from the intestine, leading to severe, life-threatening diarrhea and dehydration. Mutations in the Na+/glucose co-transporter gene (SLC5A1) have been determined to be associated with congenital GGM. In this study, different computational tools were used to investigate the non-synonymous single nucleotide polymorphisms (nsSNPs) in the SLC5A1 gene and to determine their effects on protein function and structure. The SLC5A1 gene was investigated in the NCBI database and SNPs were analyzed using seven computational tools (SIFT, PolyPhen-2, PROVEAN, SNPs&GO, PhD-SNP, I-Mutant and MUpro). The protein structural analysis was done using Project HOPE and Chimera after homology modeling by CPHmodels 3.2. In addition, GeneMANIA software was used to study the association between this gene and related ones. A total of 166 nsSNPs were obtained from the SNP database in NCBI in 2019. A total of 37 SNPs were predicted to be deleterious using SIFT, while 25 SNPs were predicted to be probably damaging by PolyPhen-2 and 30 SNPs were predicted to be deleterious by PROVEAN. The results of SIFT, PolyPhen-2, PROVEAN, SNPs&GO and PhD-SNP collectively revealed that 16 SNPs were predicted to be highly damaging.


Introduction
Glucose-galactose malabsorption (GGM) is an autosomal recessive disease that manifests within the first weeks of life and is characterized by a selective failure to absorb dietary glucose and galactose from the intestine [1]. Patients with GGM present with neonatal onset of severe, life-threatening watery diarrhea and dehydration [2]. The disease was first described in 1962 [3]. The diarrhea ceases within one hour after oral intake of lactose, glucose, and galactose is stopped, but promptly returns when one or more of the offending sugars is reintroduced into the diet [4].
Secondary active transport of glucose occurs via symport with sodium, using sodium-glucose transport (SGLT) proteins, in the choroid plexus, the proximal tubules of the kidneys, and the intestine [4]. Mutations in the Na+/glucose co-transporter gene SLC5A1 (Solute Carrier Family 5 Member 1, Sodium/Glucose Cotransporter) can cause structural and functional defects in the SGLT1 protein, so that glucose and galactose are not absorbed from the intestine, leading to the clinical manifestations [5]. More than 40 SLC5A1 mutations have been identified to date in patients with congenital glucose-galactose malabsorption [6].
The SLC5A1 gene encoding the SGLT1 membrane protein was cloned and sequenced in 1987 [7]. This gene is located on chromosome 22q13.1 and is composed of 15 exons. SLC5A1 is expressed mainly in the intestine and kidney. The translated protein is composed of 664 amino acids with a molecular mass of approximately 73 kDa, consisting of a core of 13 transmembrane domains [8][9].
For various reasons, it might not be feasible to perform laboratory studies for all SNPs in a specific gene, let alone the whole genome. Computational studies are therefore becoming indispensable for identifying and prioritizing SNPs with functional importance from an enormous number of non-risk alleles. Computational methods are fast and flexible and can predict functionally significant SNPs with a high accuracy of 80-85% [10] if combined with other techniques such as sequencing, structure and phylogenetic relationships. In this study, different computational methods were used to identify the single nucleotide polymorphisms (SNPs) in the SLC5A1 gene and the effects of the predicted mutations on protein function and structure.

Methodology
The SLC5A1 gene was investigated in the dbSNP/NCBI database using computational analysis. The SNPs and the related Ensembl protein (ENSP) were obtained from the SNP database (dbSNP) http://www.ncbi.nlm.nih.gov/snp/ and the UniProt database during the year 2019. Several software tools were used for the analysis.

GeneMANIA
(http://www.genemania.org). GeneMANIA finds genes related to the input genes, using a very large set of functional association data. Association data include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. GeneMANIA can be used to find new members of a pathway, additional genes that were missed in screening, or new genes with a specific function [11]. The input was the SLC5A1 gene name, and the results are shown as a diagram and tables describing the relations between the different genes.

SIFT: "Sorting Intolerant from Tolerant"
(http://siftdna.org/www/SIFT_dbSNP.html). It is a sequence homology-based tool that presumes important amino acids will be conserved in the protein family. Hence, changes at well-conserved positions tend to be predicted as deleterious. A list of nonsynonymous SNP IDs (rsIDs) obtained from the dbSNP database was the input for SIFT, and only the SNPs predicted to be deleterious were chosen for further analysis. The cutoff value in the SIFT program is a tolerance index of ≥0.05, below which a substitution is predicted to be deleterious. The higher the tolerance index, the less functional impact a particular amino acid substitution is likely to have [12].
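The cutoff rule described above can be illustrated with a minimal sketch. It assumes that a tolerance index of exactly 0.05 counts as tolerated, following the "≥0.05" wording in the text; the function name and the variant scores are hypothetical and are not results from this study.

```python
SIFT_CUTOFF = 0.05  # tolerance index cutoff stated in the text


def classify_sift(score: float, cutoff: float = SIFT_CUTOFF) -> str:
    """Label a substitution from its SIFT tolerance index (0 to 1)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT tolerance index must lie between 0 and 1")
    # Assumption: scores at or above the cutoff are treated as tolerated.
    return "tolerated" if score >= cutoff else "deleterious"


# Hypothetical scores for illustration only.
example_scores = {"variant_A": 0.00, "variant_B": 0.03, "variant_C": 0.42}

# Keep only the variants that would be passed on to the next tools.
deleterious = [
    name for name, score in example_scores.items()
    if classify_sift(score) == "deleterious"
]
print(deleterious)
```

Only the variants labeled deleterious at this step would be carried forward, which mirrors the filtering order used in this study.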

PolyPhen-2 (Polymorphism Phenotyping v2)
(http://genetics.bwh.harvard.edu/pph2/). It is an online bioinformatics program that predicts the possible impact of an amino acid substitution on the stability and function of human proteins using structural and comparative evolutionary considerations. The program searches for 3D protein structures, multiple alignments of homologous sequences and amino acid contact information in several protein structure databases, then calculates position-specific independent count (PSIC) scores for each of the two variants and computes the difference between the two PSIC scores. The higher the PSIC score difference, the higher the functional impact a particular amino acid substitution is likely to have [13]. Prediction outcomes are classified as benign, possibly damaging or probably damaging. For structural and functional predictions, SNPs that were predicted to be deleterious by SIFT were submitted to PolyPhen-2 as a protein sequence in FASTA format (obtained from Expasy), along with the position of the mutation and the native and substituted amino acids.
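Because several of the tools take a FASTA sequence plus a position and the native and new residues, it can help to check each substitution against the sequence before submission. The following is a minimal sketch of such a check; the helper names and the toy sequence are hypothetical, and it is not part of any of the tools described here.

```python
import re
from typing import Tuple

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_substitution(code: str) -> Tuple[str, int, str]:
    """Split a one-letter substitution code such as 'G5R' into its parts."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    wild, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if wild not in AMINO_ACIDS or mutant not in AMINO_ACIDS or position < 1:
        raise ValueError(f"invalid residue or position in {code!r}")
    return wild, position, mutant


def read_fasta_sequence(text: str) -> str:
    """Return the sequence of a single-record FASTA string, header removed."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(line for line in lines if not line.startswith(">"))


def matches_sequence(sequence: str, code: str) -> bool:
    """True if the native residue in the code is found at that position."""
    wild, position, _ = parse_substitution(code)
    return position <= len(sequence) and sequence[position - 1] == wild


# Toy record for illustration only; this is NOT the SLC5A1 sequence.
toy_fasta = ">toy_protein\nMKTAG\nLLGRW\n"
toy_sequence = read_fasta_sequence(toy_fasta)

print(matches_sequence(toy_sequence, "G5R"))  # native residue agrees
print(matches_sequence(toy_sequence, "A5R"))  # native residue disagrees
```

A mismatch usually means the substitution was numbered against a different isoform or protein record than the submitted sequence.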

PROVEAN (Protein Variation Effect Analyzer)
(http://provean.jcvi.org/index.php). It is a software tool that predicts the effect of all classes of protein sequence variation, such as single amino acid substitutions, insertions, deletions, and multiple substitutions, on protein function. Prediction outcomes are classified as deleterious or neutral [14]. The protein sequence in FASTA format was again the input for this software.

SNPs&GO (Predicting Disease Associated Variations Using GO (Gene Ontology) Terms)
(http://snps.biofold.org/snps-and-go/snps-and-go.html). It is a method that, starting from a protein sequence, predicts whether a mutation is disease related or not by exploiting the protein functional annotation. SNPs&GO collects in a single framework information derived from the protein sequence, evolutionary information, and function as encoded in the Gene Ontology terms, and is reported to outperform other available predictive methods [15]. The protein sequence and mutation sites were the input for this software.

PhD-SNP (Predictor of Human Deleterious Single Nucleotide Polymorphisms)
PhD-SNP is a web-based tool available at (http://snps.biofold.org/phd-snp/phd-snp.html). It is a support vector machine (SVM)-based method that predicts disease-associated nsSNPs using sequence information. The protein sequence and mutation positions were the input. For each mutation, PhD-SNP returns an output score (ranging from 0 to 1) that represents the probability of the nsSNP being associated with disease. The method uses 0.5 as the threshold above which nsSNPs are predicted to be disease associated [16].

Protein Stability
To predict the effect of single point mutations on protein stability, two software tools were used:

I-Mutant Suite
(http://gpcr2.biocomp.unibo.it/cgi/predictors/IMutant3.0.cgi). It is a support vector machine (SVM)-based tool for the automatic prediction of protein stability changes upon single point mutations. The input was the protein sequence, the position of the SNP in the protein and the new residue. The method predicts whether a mutation largely destabilizes the protein (Gibbs free energy change DDG < -0.5 kcal/mol), largely stabilizes it (DDG > 0.5 kcal/mol) or has a weak effect (-0.5 ≤ DDG ≤ 0.5 kcal/mol) [17].
International Journal of Biomedical Science and Engineering 2019; 7(4): 85-91
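The three stability classes given for I-Mutant can be written as a small classification function. This is a sketch of the thresholds quoted in the text, not of the SVM itself; the function name and the DDG values are hypothetical.

```python
def classify_ddg(ddg_kcal_per_mol: float, threshold: float = 0.5) -> str:
    """Map a predicted DDG value (kcal/mol) to the three classes in the text."""
    if ddg_kcal_per_mol < -threshold:
        return "largely destabilizing"
    if ddg_kcal_per_mol > threshold:
        return "largely stabilizing"
    # Values between -threshold and +threshold, inclusive, fall here.
    return "weak effect"


# Hypothetical DDG values, not results from this study.
for value in (-1.3, -0.2, 0.9):
    print(f"{value:+.1f} kcal/mol -> {classify_ddg(value)}")
```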

MUpro
(http://mupro.proteomics.ics.uci.edu/). It is another web server for predicting protein stability changes upon mutation. It uses support vector machines to predict protein stability changes for single-site mutations from sequence information. The protein sequence and the point of mutation were the input, and the output is either increased or decreased stability [18].

Project HOPE
(http://www.cmbi.ru.nl/hope/). It is a fully automatic program that analyzes the structural and functional effects of point mutations. It builds a report with text, figures, and animations [19]. The protein sequence in FASTA format, the wild-type and new amino acids, and the point of substitution were the input for Project HOPE.

CPHmodels 3.2
(http://www.cbs.dtu.dk/services/CPHmodels/). It is a web server that predicts protein 3D structure using single-template homology modeling. The template recognition is based on profile-profile alignment guided by secondary structure and exposure predictions [20].

Results and Discussion
The goal of this study was to analyze the nsSNPs in the SLC5A1 gene and the effect of the predicted mutations at the proteomic level. The SLC5A1 gene plays a vital role in the human body, and GeneMANIA predicted that it is co-expressed with, and shares domains with, 11 genes (Figure 1 and Table 1). A total of 166 nsSNPs were obtained from the SNP database (dbSNP) in NCBI. Following analysis using SIFT, 37 SNPs were predicted to be deleterious. A total of 25 SNPs were predicted to be probably damaging by PolyPhen-2 and 30 SNPs were predicted to be deleterious by PROVEAN (Table 2, Appendix A1). Analysis with SNPs&GO and PhD-SNP gave different results: 23 SNPs were predicted to be disease related by SNPs&GO compared to 32 by PhD-SNP (Figure 2, Appendix A2). Sixteen SNPs were predicted to be highly damaging (Table 3). Regarding protein stability, I-Mutant predicted decreased stability for all SNPs except two, rs200304934 and rs202070786, which showed increased stability, while with MUpro only one mutation, rs199872285, showed increased protein stability. The prediction accuracy based on sequence information alone is close to the accuracy of methods that depend on tertiary structure information. MUpro thus overcomes one important shortcoming of approaches that require tertiary structures to make accurate predictions, and can be used on a genomic scale to predict stability changes for large numbers of proteins with unknown tertiary structure [18].
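The step that combines the five function-prediction tools into one set of highly damaging SNPs can be sketched as a set intersection. This assumes that "highly damaging" means flagged by every tool, which the text suggests but does not define precisely. The identifiers below are hypothetical placeholders, not dbSNP rsIDs from this study.

```python
from typing import Dict, Set


def consensus(calls: Dict[str, Set[str]]) -> Set[str]:
    """Return the variants flagged as damaging by every tool in `calls`."""
    if not calls:
        return set()
    # Assumption: a variant counts only if all tools agree.
    return set.intersection(*calls.values())


# Hypothetical per-tool damaging calls (placeholder IDs, not real rsIDs).
damaging_calls = {
    "SIFT": {"snp_1", "snp_2", "snp_3", "snp_4"},
    "PolyPhen-2": {"snp_1", "snp_2", "snp_4"},
    "PROVEAN": {"snp_1", "snp_2", "snp_3"},
    "SNPs&GO": {"snp_1", "snp_2"},
    "PhD-SNP": {"snp_1", "snp_2", "snp_3"},
}

highly_damaging = sorted(consensus(damaging_calls))
print(highly_damaging)
```

Requiring agreement across all tools is a conservative choice: it reduces false positives at the cost of dropping variants that a single tool misses.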
The SNPs were further submitted to Project HOPE to examine the effect of each amino acid substitution on protein structure. Each amino acid has its own specific size, charge, and hydrophobicity value, and the wild-type residue and the newly introduced mutant residue often differ in these properties. Differences in size, which were predicted for all SNPs, can affect contact with the lipid membrane. In addition, differences in hydrophobicity can affect the hydrophobic interactions with the membrane lipids and can result in loss of hydrogen bonds and/or disturb correct folding. This was predicted for the following SNPs: rs121912669, rs33939896, rs199872285, rs200304934, rs201079555, rs201598524, rs371505974, rs201271081 and rs370932142 (Table 4).
A difference in charge between the wild-type and mutant residues can also affect protein function and can cause loss of interactions with other molecules or residues. This was predicted for the following SNPs: rs121912669, rs201079555, rs199573966, rs201598524, rs371505974, rs373203939, rs202070786 and rs201271081 (Table 4).
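The property differences discussed above can be made concrete with a small worked example. The sketch below compares hydrophobicity on the Kyte-Doolittle hydropathy scale and the formal side-chain charge at neutral pH for a glycine-to-arginine change, the type of substitution in G89R. It is an illustration under these assumptions (histidine is treated as neutral) and is not the method used by Project HOPE.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Formal side-chain charge at neutral pH; histidine assumed neutral here.
SIDE_CHAIN_CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def compare_residues(wild: str, mutant: str) -> dict:
    """Return hydropathy and charge changes for a wild -> mutant substitution."""
    for residue in (wild, mutant):
        if residue not in KYTE_DOOLITTLE:
            raise ValueError(f"unknown amino acid: {residue!r}")
    return {
        "hydropathy_change": round(
            KYTE_DOOLITTLE[mutant] - KYTE_DOOLITTLE[wild], 1
        ),
        "charge_change": SIDE_CHAIN_CHARGE.get(mutant, 0)
        - SIDE_CHAIN_CHARGE.get(wild, 0),
    }


# Glycine to arginine: a small neutral residue replaced by a large,
# positively charged and strongly hydrophilic one.
change = compare_residues("G", "R")
print(change)
```

A large negative hydropathy change together with a gained charge is the kind of difference that would be expected to disturb interactions with membrane lipids in a transmembrane protein.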
Three SNPs, namely rs121912669, rs371505974 and rs200406921, have been reported in previous studies [8,22] to be associated with SGLT, and in this study they were predicted to be highly damaging by all software. In addition, a recent study revealed two novel SNPs (G89R and G435D) among Saudi patients with congenital glucose-galactose malabsorption [8]. Another SNP (rs121912668) has also been reported to be disease related in a previous study [1]; in the current study it was predicted to be highly damaging by all software programs except PolyPhen-2, which predicted it to be possibly damaging with a high score of 0.936.
It is thus important to differentiate between disease-associated and neutral SNPs, since this will help in understanding the relationship between genotype and phenotype and provide better diagnostic strategies.

Conclusions
In this study, 16 nsSNPs in the SLC5A1 gene were predicted to be highly damaging. Three of the predicted SNPs were also reported in clinical trials, while the others need further confirmatory studies. Predicting the phenotypic effect of nsSNPs using computational algorithms will also help in better understanding genetic variation in relation to disease, although computational predictions need further confirmation by clinical studies.