Novel Candidate Genes for Somatic Cell Count in Frizarta Dairy Sheep

Aim of the present study was to identify genomic regions and candidate genes impacting on somatic cell count in the Frizarta dairy sheep. A total number of 482 Frizarta ewes genotyped with the medium density SNP array with available records on milk somatic cell count were used. Associations between genomic markers and the trait under study were detected by application of a multi-locus mixed model treating markers as fixed additive effects. Positional candidate genes identified within 1Mb flanking distances from significant markers were in silico prioritized based on their functional similarity to a training gene list including 1,120 genes associated with the term ‘immunity’. Association analysis pinpointed 4 chromosomewide significant SNPs dispersed on four autosomes (OAR2, OAR18, OAR19 and OAR22). A total number of 37 positional candidate genes were identified within the searched genomic distances while 13 candidate genes were highly prioritized. Seven highly prioritized genes (NFIB, GFRA1, PSIP1, ARHGAP5, HECTD1, EMX2, STRN3) along with genes FREM1 and GPR33 had evidenced involvement in immune-related processes. Current results extent previous findings by providing novel candidate genes for the somatic cell count phenotype in dairy sheep.


Introduction
Somatic cell count (SCC) in dairy sheep milk is important in many aspects, including health and production. As in dairy cows, mastitis in dairy ewes is associated with increased SCC in milk [1]. Hence, milk with elevated SCC is usually considered as an indication of intra-mammary infection (IMI) and selection for decreased SCC could lead to reduced susceptibility to mastitis [2]. IMI in dairy sheep differs from the respective bovine infections in both incidence and aetiology. In dairy ewes, lower incidence of clinical mastitis (CM) versus subclinical mastitis (SCM) is observed, while the major pathogens in this species are the coagulasenegative Staphylococci [2]. Furthermore, in dairy ewes, SCC can reach highest counts (e.g. 4·10 6 cells/ml) without mastitis symptoms, with milk still of normal macroscopic appearance [3]. However, SCC levels between affected and non-affected udders seem to clearly differentiate, with sheep milk from udders free from IMI having an average of 0.185·10 6 cells/ml when contrasted to an average equal to 1.445·10 6 cells/ml of infected halves [3]. In milk from uninfected mammary glands, macrophages are the predominant cell type of the cell population (46 to 84%), followed by lymphocytes (11 to 20%), polymorphonuclear neutrophilic leukocytes (PMN) (2 to 28%) and epithelial cells (1 to 2%). In infected mammary glands, the percentage of PMN increases to 50% at a SCC of 0.20·10 6 cells/ml and up to 90% at a SCC over 3·10 6 cells/ml [3] playing a protective role in the mammary gland [4].
The advent of high-throughput genotyping platforms made genome-wide association studies (GWAS) a reality and the identification of the responsible functional genes involved in SCC in dairy sheep a promising task. Early genome scans in the Spanish Churra sheep identified a single genome-wide suggestive QTL (Quantitative Trait Locus) for SCS (Somatic Cell Score) on OAR20 [5] with peak QTL location close to marker OLADRBPS, which is located in the major histocompatibility complex (MHC). In a crossbred population of Sarda x Lacaune breeds, there were two genome-wide significant QTLs identified on OAR6 and OAR13, along with 11 genome-wide 'suggestive' QTLs associated with SCS (2). In a crossbred population of Lacaune x Manech breeds, a 'suggestive' genome-wide significant QTL was found on OAR14 for SCS [6]. In an Awassi x Merino crossbred population, a significant QTL for SCC was found on OAR14, whereas two 'suggestive' QTLs for this trait were reported on OAR17 and OAR22 [7]. Rupp et al. [8] identified a major QTL associated with SCC on ovine chromosome 3. Fine mapping of the region, provided one strong candidate SNP that mapped within the coding sequence of a highly conserved gene, suppressor of cytokine signalling 2 (SOCS2), mediated by the JAK/STAT signaling pathway. Using transcriptional profiling of the milk somatic cells of susceptible and resistant sheep infected by Staphylococcus sp., Bonnefont et al. [9] provided a list of differentially expressed genes between the resistant and susceptible animals that were associated with immune and inflammatory responses, leukocyte adhesion, cell migration and signal transduction. Most recently, Banos et al. [10] identified SNPs associated with mastitis traits on chromosomes 2, 3, 5, 16 and 19 and proposed relevant  candidate genes such as SOCS2, CTLA4, C6, C7, C9,  PTGER4, DAB2, CARD6, OSMR, PLXNC1, IDH1, ICOS, FYB and LYFR, implicated in innate immunity.
In the present study, we performed a genome scan of 482 Frizarta dairy ewes using a medium density genotyping SNP array to identify genomic regions associated with the SCC phenotype. We then applied in silico gene prioritization analysis of the positional candidate genes to identify the most plausible functional candidate genes for the trait. Current findings are expected to contribute to a better understanding of the genetic mechanisms underlying the SCC phenotype in the Frizarta dairy sheep.

SNP Genotyping and Quality Control
A total of 524 dairy ewes of the Frizarta breed were originally sampled. These ewes were randomly chosen from 7 herds of the Cooperative of the Agricultural and Livestock Union of Western Greece (ALUWG) in Agrinio (north-western Greece). Ewes are kept under an intense production system, with standardized conditions and feeding regime. DNA was extracted from blood samples using the NucleoSpin Blood kit (Macheray-Nagel). Sample genotyping based on the Illumina OvineSNP50 BeadChip was performed commercially at Neogen Europe, Ltd. From the 524 original samples, one sample could not be genotyped. Quality Control (QC) of the remaining 523 genotypes was performed in two stages, first on an 'individual' and second on a 'marker' basis. On the first level, samples were removed if they had: i) call rate<0.95 and ii) overall autosomal heterozygosity rate outside the 1.3 interquartile range (0.346-0.389). Using these criteria, the number of available animals (samples) was reduced to 503. Marker QC was based on the following criteria: (i) call rate (>0.95), (ii) minor allele frequency (MAF)≥0.05 and (iii) Fisher's Hardy-Weinberg equilibrium (HWE) p<0.0001. Only mapped autosomal SNPs were considered. From the originally available SNPs (n=54,013), the final number of SNPs retained for the GWAS was 42,884.

Phenotypic Data
There were SCC records for 482 out of the 503 genotyped animals fulfilling the QC criteria. These records were obtained using a MilkoScan™ FT system. Prior to analysis, original SCC values (given in 10 5 cells/ml) were logarithmically transformed to approximate normality using the following transformation function: lSCC=log(SCC.10 -5 cells/ml). lSCC ranged from 0.07 (104,972 cells/ml) to 4.42 (2,140,684 cells/ml) with average and standard deviation equal to 1.93 (381,055 cells/ml) and 0.96 (194,531 cells/ml), respectively. SCC records were obtained from ewes dispersed in 7 herds, 5 lactations, 4 production years, 3 lactation stages and 8 classes of month measurements. Detailed inspection of the number of observations per class effect showed that some of month measurement classes were not adequately represented. For this reason, grouping of observations for this particular effect was performed by grouping classes as follows: calendar month 1 (1, 2 and 3), 8 (8 and 9) and finally 11 (11 and 12). Furthermore, three lactation stages were defined as follows: early: (1-100 days), middle (101-160 days) and late (>160 days) days of lactation. The final data set included SCC records per 7 herds, 5 lactations (classes: 2, 3, 4, 5 and 6), 4 production years (classes: 2011, 2012, 2013 and 2014), 4 months of measurement (classes: 1, 8, 10 and 11) and 3 lactation stages (classes: early, middle and late) ( Table 1). 2.107 ± 0.080 6 (77) 1.780 ± 0.109 6 (137) 1.058 ± 0.080 7 (14) 1.903 ± 0.177 Multifactor analysis of variance (ANOVA) of lSCC including all the fixed effects (herd, production year, lactation number, month of measurements and stage of lactation) revealed statistical significance of only herd effect (p<0.001). This factor alone explained 37.3% of the variance of the trait (results not shown). No interaction term was found to be statistically significant.

Marker Association Analysis
A multi-locus mixed linear (MLMM) model [11] was used to select significant SNPs as fixed (additive) covariates. This method employs a stepwise mixed-model regression procedure with forward selection and backward elimination. From the three model selection criteria that are implemented in MLMM for multi-testing correction, the modified Bayesian information criterion (mBIC), the extended Bayesian information criterion (eBIC) and the multiple Bonferroni criterion (mBonf), we used the eBIC criterion to select the significant SNP co-factors using a p threshold value of 0.10. Specifically, lSCC data were analysed using the following mixed model: y X wa u e = β + + Ζ + where y is the vector of the lSCC, X is the incidence matrix relating observations to fixed effects, β is the vector of fixed (environmental) effects: herd (7 classes . Note that although the additional fixed effects (apart from herd) were found not to be statistically significant during multifactor ANOVA, an expanding fixed model was used during association analysis as these effects jointly explained an additional proportion of 3% of the lSCC variance. Furthermore, w is the vector of the SNP effects with elements coded as 0, 1 or 2 for homozygote of the reference allele, heterozygote and homozygote of the other allele, α is the vector of the fixed effect for the reference allele of the candidate SNP to be tested for association, Z is the incidence matrix relating observations to the random polygenic random effects, u is the vector of random polygenic effects, and e is the vector of random residuals. The random effects were assumed to be normally distributed with zero means and the following covariance structure: σ are the polygenic and error variance components, I is the nxn identity matrix, and G is the nxn genomic relationship matrix with elements of pairwise relationship coefficient using all the 42,884 SNPs. The genomic relationship coefficient between two individuals j and k, was estimated as follows: where x ij and x ik the numbers (0, 1 or 2) of the reference allele(s) for the i th SNP of the j th and k th individuals, respectively, and p i is the frequency of the reference allele. Note that inclusion of the genomic relationship matrix in the model has been shown to correct for possible population structure and stratification in the data [12]. This analysis was carried out with SNP and Variation Suite ver. 8.7.0 (Golden Helix, Inc. 2016).

Search for QTLs and Candidate Genes
Since in this breed levels of linkage disequilibrium (LD) were higher than 0 between markers at genomic distances up to 1 Mb (results not shown), we searched within 1 Mb upstream and downstream each significant SNP for reported QTLs and positional candidate genes for the trait under study. The SheepQTLdb (release 36, August 22nd, 2018) and the latest sheep genome Oar_v4.0: http://www.ncbi.nlm.nih.gov/genome/?term=ovis+aries/ along with NCBI annotation release 102 of the sheep genome (http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Ovis_ aries/102/) were used, respectively.

In Silico Gene Prioritization Analysis
We performed in silico prioritization analysis (PA) of the positional candidate genes using the ToppGene portal (https://toppgene.cchmc.org/prioritization.jsp). PA was based on the functional similarity of the candidate genes to a training gene list including n=1,221 genes that have been associated with the term 'immunity'. The InnateDB (https://www.innatedb.com/) that is a knowledgebase of genes, proteins, experimentally-verified interactions and signaling pathways involved in the innate immune response of humans, mice and bovines to microbial infection was mined to retrieve the relevant associated genes. The following semantic annotations were used during PA: GO: Molecular Function, GO: Biological Process, Human and Mouse Phenotype, Pathway, Interaction, Gene Family and Co-expression. From the initial number of training genes, a total number of n=1,120 genes were mapped and finally used for training during PA. Note that two candidate genes i.e. FREM1 and GPR33 were omitted from PA as they were included in the training gene list. Genes with overall p-values lower than 0.05 were considered as highly prioritized. Figure 1 shows the Q-Q (Quantile-Quantile) plot of the expected and the observed p-values (on the -log 10 scale) of the 42,884 SNPs. As Q-Q plot clearly shows, there is no evidence of any systematic bias due to population structure or analytical approach, a suggestion that was also supported by the estimated value for the genomic inflation factor (λ=1.051). The Q-Q plot along with the Manhattan plot depicted in Figure 2 also show that 4 SNPs depart from the expected probability indicating that they might be associated with the trait under study.  A detailed description of the 4 departed SNPs is provided in Table 2. Specifically, the 4 SNPs that reached chromosome-wise statistical significance i.e. −log 10 (p value)=4.094 were detected on OAR2, OAR18, OAR19 and OAR22.

Searched QTLs and Positional Candidate Genes
A search for 'Somatic Cell Score' QTLs at SheepQTLdb revealed two QTLs, one on OAR2 mapped from 86.5 to 117.9 Mb and another on OAR22 mapped from 5.8 to 31.8 Mb (results not shown). In both cases, reported QTLs lie about 4 Mb away from the significant SNPs. Table 3 presents a list of positional candidate genes within 1 Mb distance from the 4 significant SNPs. A total number of 37 genes were identified within the searched regions with 8, 10, 2 and 17 candidate genes located on OAR2, OAR18, OAR19 and OAR22, respectively (Table 3).

Discussion
GWAS are powerful in determining genomic regions associated with a trait. Nevertheless, these regions often contain tens or hundreds of positional candidate genes and experimentally identifying the true causal genetic variants requires considerable costs, effort and time. One of the most intriguing challenge is thus as how to narrow down the candidates list and pinpoint the most plausible genetic variants for the trait under investigation. To address this challenge, in the present study, we applied in silico PA of the positional candidate genes and ended up with a total of 13 highly prioritized (top) genes. A thorough search of the respective literature with regard to the biological function(s) of the top prioritized genes followed. This search showed that 7 prioritized genes i.e. NFIB, GFRA1, PSIP1, ARHGAP5, HECTD1, STRN3 and EMX2 have documented involvement in immune-related processes. Specifically, NFIB (Nuclear Factor I/B, ranked 1 st in PA) is a member of the nuclear factor I family of proteins; the latter are known to be involved in viral and cellular transcription and specifically CD4 transcription [13]. GFRA1 (GDNF family receptor alpha 1, ranked 2 nd in PA) is connected to GDNF (Glial cellderived neurotrophic factor) gene that exerts its effect on target cells by binding to GDNF family receptor-a. GDNFs are reported to be regulated by inflammatory cytokines [14]. PSIP1 (PC4 and SFRS1 interacting protein 1, ranked 3 rd ) encodes the lens epithelium-derived growth factor p75 (LEDGF/p75), an important host co-factor that interacts with HIV-1 integrase to target integration of viral cDNA into active the outcome of HIV-1 infection genes [15]. ARHGAP5 (Rho GTPase activating protein 5, ranked 4 th ) participates in focal adhesion and specifically in leukocyte transendothelial migration, while HECTD1 (HECT domain containing E3 ubiquitin protein ligase 1, ranked 6 th ) participates to focal adhesion and macrophage activation [16]. STRN3 (striatin 3, ranked 11 th ) and EMX2 (empty spiracles homeobox 2, ranked 13 th ) are members of the gene network of Wnt/b-catenin pathway that plays a critical role in cell differentiation, growth, proliferation, survival and immune cell function [17].
Apart from the aforementioned genes, the list with the most plausible candidate genes should also comprise genes FREM1 and GPR33 that they were found among the training genes, thus having evidenced involvement in immune-related processes. Specifically, FREM1 (FRAS1 related extracellular matrix 1) is an extracellular protein with multiple annotated functional domains that interact with integrin, collagen, fibronectin, and interleukin 1 receptor (IL1R1) and influence transendothelial migration, epithelial integrity and inflammatory responses [18]. GPR33 (G protein-coupled receptor 33) is an orphan member of the chemokine-like receptor family and is highly expressed in dendritic cells that provide a functional link between innate and acquired immunity and orchestrate the interplay between T-and Blymphocytes [19].
We further explored the role of the current candidate genes with regard to immunity by constructing a network depicting human genes co-expressed in memory CD4 Tcells using information from the Immuno-Navigator database and the Network Analyst platform (https://www.networkanalyst.ca/). The resulting gene network is shown in Figure 3. As Figure 3 shows, this network is formed by 6 member genes including four of the candidate genes (ARHGAP5, STRN3, HECTD1, PSIP1) along with two connected genes (STK38, SART3). Figure 3. A minimum network showing 6 member genes co-expressed in human memory CD4 T cells. Apart from the four candidate genes (ARHGAP5, STRN3, HECTD1, PSIP1) denoted as blue coloured spheres, this network includes two more connected genes (STK38, SART3) denoted as red coloured spheres. Network was constructed using the NetworkAnalyst platform (ver. 3.0, https://www.networkanalyst.ca/) using data mined from the Immuno-Navigator database.
Intuitively, genes that include or are in close proximity to the lead SNPs while having functional relevance with the trait under study are considered ideal functional candidates. However, proximity of a gene to the significant marker does not guarantee functional relevance and causative candidate genes may also exist among distantly located loci from the associated SNP in both qualitative [20] and quantitative traits [21]. In line with this scenario, the distance of the most plausible candidate genes from the respective significant SNPs ranged from 52 kb (ARHGAP5) to 972 (STRN3) kb, in the present study.

Conclusion
Employment of in silico prioritization analysis on results of genome wide associations helped identifying novel candidate genes for SCC with documented involvement in immune-related processes such as focal adhesion, transendothelial migration, macrophage activation and inflammatory responses. In light of the relatively small number of animals used here, further studies employing higher number of animals with higher density arrays are warranted to disentangle the genetic basis of the SCC phenotype in dairy sheep.

Funding
This study was supported by the Greek Ministry of Education and Religious Affairs, Action 'Cooperation 2011' (grant number 447919/11SYN_3_1087).