Gene Regulatory Network Inference Using Prominent Swarm Intelligence Methods
Md Julfikar Islam, M. S. R. Tanveer, M. A. H. Akhand
Dept. of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh
Email address:
To cite this article:
Md Julfikar Islam, M. S. R. Tanveer, M. A. H. Akhand. Gene Regulatory Network Inference Using Prominent Swarm Intelligence Methods. Computational Biology and Bioinformatics. Vol. 4, No. 5, 2016, pp. 37-44. doi: 10.11648/j.cbb.20160405.11
Received: November 25, 2016; Accepted: December12, 2016; Published: January 16, 2017
Abstract: Genes are the basic blue print of life in an organism containing the physiological and behavioral characteristics. A gene regulatory network (GRN) is a set of genes, or parts of genes, that interact with each other to control a specific cell function. GRN inference is the reverse engineering approach to predict the biological network from the gene expression data. Biochemical system theory based S-System is a popular model in GRN inference and the model is defined with its different parameters. The task of S-System based GRN inference is its parameter estimation which is an optimization problem. Several studies employed Particle Swarm Optimization (PSO) and other pioneer optimization techniques to estimate S-System model. In this paper several prominent swarm intelligence (SI) techniques have been studied and adapted for S-System parameter estimation. They are Group Search Optimizer, Grey Wolf Optimizer and PSO. Proficiency of optimization techniques are compared to infer GRN from SOS DNA real gene expression data and DREAM 4 Silico data.
Keywords: Gene Regulatory Network (GRN), GRN Inference, Swarm Intelligence, S-System Model
1. Introduction
Genes are the basic blue print of life in an organism containing the physiological and behavioral characteristics [1]. There are thousands of genes situated in the DNA double helix, contained in the chromosomes inside cell nucleus in wounded form. Each gene codes for a protein. The protein is produced from a gene through a sequence of bio-chemical reaction when the cell environment is undergone some internal or external change [2]. The process of protein production is called the gene expression and the rate of expression is called the gene expression level. The protein from one gene may affect the expression of any other gene contained in the same cell or in any other cell of the organism. This phenomenon is called the regulation of the first gene to the second. For a set of genes the pairwise regulatory relationship forms the network called Gene Regulatory Network (GRN).
GRN inference is the reverse engineering approach to uncover the dynamic and intertwined nature of gene regulation in cellular systems [3]. In GRN inference the gene regulation network is inferred from gene expression data. The inference of genetic networks faces a great challenge in which mutual interactions among genes are estimated using time-series data of gene expression patterns. The inferred model of a genetic network is considered as an idea tool which helps biologists generate hypotheses as long as facilitate the design of their experiments.
For GRN inference researchers have used a number of approaches [4,5]. They proposed numerous models to describe biochemical networks have ranged from simple Boolean networks to detailed sets of differential equations of an arbitrary form [6,7]. Nowadays, S-System model is a popular model derived from biochemical system theory to infer GRN from gene expression data [8,9].
The GRN inference through S-System means estimation of its parameters values from time-series data of gene expression patterns. S-System has many different parameters and they have different optimal values for a unique set of genes. The number of S-System parameters is proportional to the number of network components. Hence, S-System model based GRN inference is an optimization problem and any optimization technique can be used. For this reason, researchers have used optimization techniques such as genetic algorithms (GA) [10], global optimization methods [11], Particle Swarm Optimization (PSO) [26], and linear time variant models [12] to estimate parameters of S-System model for different gene expression data.
In recent years nature inspired Swarm Intelligence (SI) has become popular in the field of optimization. PSO is the pioneer SI based optimization technique and used for S-System model [26]. Other recently developed SI based methods might also interesting for S-System model and the aim of this study is to investigate several prominent SI based methods for S-System parameter estimations. In this study, PSO, Group Search Optimizer and Grey Wolf Optimizer are investigated to estimate the S-System parameters.
This paper is organized in the following way. A brief description about S-System based GRN inference technique is given in Section 2. In Section 3, adaptation of selected SI techniques for S-System model is explained. Experimental studies have been discussed in Section 4. Finally the conclusion is drawn in Section 5.
2. S-System Based GRN Inference Technique
A number of approaches have been investigated to infer GRNs from gene expression data with the aim of improving the network inference accuracy and scalability. Basically, the methods can be categorized into two types: information theoretic approaches and model based approaches. In the information theoretic approach, the network is inferred through measuring the dependences or causalities between transcription factors and target genes. A number of prominent methods in this category use Mutual Information (MI) and its variants [13,14,15]. On the other hand, in a model based approach nonlinear differential equations are used to express the chemical reaction of transcription, translation and other cellular processes. In model based approach, S-System model is one of the most widely used models for GRN inference.
The S-System model is provided by the Biochemical system theory (BST) [16] to represent and analyze biological systems. The model is a nonlinear differential equation model of GRNs, and it can describe various dynamics of the relationships among genes. It represents a network as a set of differential equations:
(1)
where, represents the expression level of the ith gene of the network; is the number of genes in the network; are rate constants; and are kinetic orders. It is to mention that the kinetic orders regulate the synthesis and degradation of due to . The GRN shown in Figure 1 contains 5 genes.
In this network the rate constants are {α_{1,} α_{2}, α_{3}, α_{4}, α_{5}} and {β_{1}, β_{2}, β_{3}, β_{4}, β_{5}} and the kinetic orders are {g_{i,}_{ }_{1}, g_{i,}_{ }_{2}, g_{i,}_{ }_{3}, g_{i,}_{ }_{4}, g_{i,}_{ }_{5}} and {h_{i,}_{ }_{1}, h_{i,}_{ }_{2}, h_{i,}_{ }_{3}, h_{i,}_{ }_{4}, h_{i,}_{ }_{5}} where i=1 to 5. The kinetic orders g_{ij} and h_{ij} determine the structure of the regulatory network. If g_{ij} > 0 gene j induces the synthesis of gene i. If g_{ij} < 0 gene j inhibits the synthesis of gene i. Analogously, a positive (negative) value of h_{ij} indicates that gene j induces (suppresses) the degradation of the mRNA level of gene i. So in Figure 1, as g_{12} > 0, so gene 2 activates gene 1. As well as, for being h_{23} > 0, gene 3 deactivates gene 2.
For the purpose of evaluation of an S-System model for GRN, a numerical solver has to be used such as Runge–Kutta. In the Runge–Kutta method, we have the following differential equation:
(2)
(3)
where f (t, x) is a nonlinear differential equation. For the case of the S-System
(4)
Using a fourth-order Runge–Kutta method, we can integrate the solution as
(5)
(6)
where the factors are defined as
(7)
(8)
(9)
(10)
The numerical solutions often take a long time to calculate given that (4) depends on each of the N variables to do an update when the networks are large. On the other hand, to optimize the parameters with an optimization technique, it is necessary to evaluate the S-System for multiple candidates and multiple iterations. Linear Lagrange polynomial was used by Tsai and Wang [8] to reduce the calculation’s time by introducing experimental while they used the Runge–Kutta method for solving the S-System of a network. Thus, each new step in the numerical solution of the S-System is defined as:
(11)
where the on the right side of the equation represents the value of the experimental values at time t and is evaluated at x with parameters according to equation (1). The parameter denoted the smoothness rate which is to set small, as a result the approximation does not overshoot. In this paper, for notation convenienceis named to the set of all parameters in the S-System. With a view to decreasing the calculations and the inference time, equation (11) is used instead of the update in (5). Instead of performing the joint solution of all genes in the classical Runge–Kutta method in (4), the experimental values are used for the solution of so that the system can solve each gene’s solution independently.
Since the aim of this study to find the best parameters θ for the network, it is necessary to formulate it as an optimization problem. Tominaga et al. [17] standardized the use of the mean squared error (MSE) evaluation to measure a candidate’s fitness in the S-System. Thus, the fitness function is
(12)
where, T represents the number of time samples in the experimental data, N is the number of genes, and _{cal} and _{exp} refer to the calculated and experimental values of the gene expression’s data, respectively. Since different networks can have the same time series data, in this study experimental time series have been divided in M sets. This will force the parameters θ to recreate the same dynamics at different initial points, which helps to infer the true parameters of the network. Each of the N genes’ time expression is divided on M sets, which will create N×M training sets for the system. In this study, the values of the parameters θ have been inferred using several SI techniques with regularization parameters.
A regularization term λE (θ) is often introduced in an error function to avoid over fitting as well as to restrict the search space. It achieves this by restraining the growth of the parameters. There are different regularization terms, which are chosen according to the data. Ng [18] showed that an L1 regularizer is a good choice when trying to infer sparse parameters in a logistic regression. Thus, the regularization term for each of the N genes i will take the form,
(13)
where, the parameters θ^{*} for large GRNs. The L1 regularization is only used on sparse parameters. It has been applied to the h_{ij} and g_{ij} parameters, while the parameters α and β are left without regularization. Thus, with the regularization term and the decoupling included the following optimization function is defined for each gene i as
(14)
where, M represents the different time series in which the data have been divided and is calculated using (11). In this study S-System has been decoupled so that this value has been calculated for each individual gene. The problem then has two objective functions, where the parameters {h_{ij}, g_{ij}} have been constrained to be small and at the same time to have the best MSE fitness for the original time series data.
In this study, experimental time series have been divide in M sets because different networks can have the same time series data. As a result, this has forced the parameters θ to recreate the same dynamics at different initial points and helped us to infer the true parameters of the network. Each of the N genes’ time expression is divided on M sets and they create N×M training sets for the system.
3. Adaption of Prominent SI Techniques for S-System Model
In general, the GRN inference problem is formulated as a function optimization problem to minimize the sum of the squared relative error by Tominaga et al. [17] in (12). Where is an experimentally observed gene expression level at time t of the ith gene, is a numerically computed gene expression level acquired by solving (1). Where, N is the number of components in the network and T is the number of sampling points of observed data. Since 2N (N+1) S-System parameters need to be determined in order to solve (1), so this function optimization problem is 2N (N+1) dimensional.
In this study, Particle Swarm Optimization, Group Search Optimizer and Grey Wolf Optimizer are investigated to estimate the S-System parameters. Table 1 demonstrates the parameters for network with 5 genes as shown in Figure 1. For 5 network components the dimension of the problem (for N=5) is D= 2×N (N+1) = 2×5 (5+1) = 60. Therefore, it is required to optimize 60 parameter values for network presented in Figure 1. It is notable that number of parameters will be larger for network with more genes. The following section briefly describe the SI based methods and their adaptation to estimate S-System parameters.
1 | .. | 5 | 6 | .. | 10 | 11 | .. | 35 | 36 | .. | 60 |
α_{1} | .. | α_{5} | β_{1} | .. | Β_{5} | g_{1} | .. | g_{25} | h_{1} | .. | h_{25} |
3.1. Particle Swarm Optimization (PSO)
PSO was proposed by Eberhart and Kennedy [21]. The algorithms is evaluated according to the idea of swarm intelligence based on the swarming habits by certain kinds of animals (such as birds and fish). The basic operations of PSO are performed simultaneously maintaining several candidate solutions in the search space. During each iteration of the algorithm, the fitness of each candidate solution is determined by the objective function being optimized. In PSO, each particle represents a potential solution. At every iteration each particle moves to a new position (i.e., search a new point) based on the calculated velocity.
In PSO, each particle updates position based on the calculated velocity comparing the best solution of population and its own best solution. Population of particles is distributed uniformly for multi-dimension search space optimization problem. Equation (15) is to calculate velocities of particles and (16) to update positions.
(15)
(16)
In (1) the global best location is denoted by and represents the best location ever encountered by this particle. An inertia weight is included in (1) to avoid the swarm being trapped into a local minimum. Both and are learning parameters and, are random parameters in a range of [0,1].
In PSO the position of each particle is considered as the solution of the problem. So that to adapt PSO for S-System model, each particle position has been adapted according to Table 1. As well as the velocity will be adapted according to the equations (15) and (16). Finding the best particle position in PSO denotes the estimation of proper S-System model. In every particle, value of each dimension, must be within the pre-defined range. For getting the best particle fitness function is shown in (14).
3.2. Group Search Optimizer (GSO)
GSO is a novel optimization technique developed inspired by animal searching behavior [22]. Animal searching behavior may be described as an active movement through which animal can find resources such as foods, mates, nesting sites. One major consequence of living together is that it is group searching strategy which allows group members to increase patch finding rates. Simply this has led to the adoption of two foraging strategies within groups which are 1) producing, e.g., searching for food; and 2) scrounging, e.g., joining resources uncovered by others. The second one is also referred to as conspecific attraction, kleptoparasitism.
In GSO algorithm population is called a group and each individual animal is called a member. In an n dimensional search space, the ith member at the kth searching bout has a current position which is and a head angle. The search direction of the ith member is a unit vector which is measured from polar to Cartesian coordinate transformation.
,
(17)
Practically in a 3-D search space, if at the k-th searching bout, the ith member’s head angle is = (), then the search direction unit vector is = (1/2, /4, /2). Each iteration, a group member located in the most promising area and which confers the best fitness value chosen as the producer. It then stops and scans the environment to seek resources actually that is the optima. At the kth iteration producer will scan at zero degree in (18) and then scan laterally by randomly sampling other two points in right according to (19) and left by (20).
(18)
(19)
(20)
is a normally distributed random number with mean 0 and standard deviation 1 and is random number sequence in the range (0, 1). If the resource of the best point among the three is better than producer’s current position, then it will fly to that point; otherwise it will have to stay in its present position and turn its head to a new randomly generated angle, . If the producer fails to find a better area after a iterations, it will have to turn its head back to zero degree In GSO, some group of members are selected as scroungers and they try to get opportunities from the producer. The random walk toward the producer of the ith scrounger is modeled in (21) at kth iteration.
(21)
is constant in a uniform random sequence which range is (0, 1). In this equation "" is the Hadamard product, which calculates the entry wise product of the two vectors. A few worst members in GSO are considered as dispersed or ranger members who perform random walks. At the k-th iteration, ranger generates a random head angle; chooses a random distance and move to the new point using (22).
(22)
In GSO the expected solution is denoted through the efficient location of the producer. For GRN inference, GSO has been adapted to S-System model through the parameter according to the declaration of Table 1. The best location found by the producer is identified by (14).
3.3. Grey Wolf Optimizer (GWO)
GWO is the most recently developed SI technique based on the hunting behaviors of Grey wolf (Canis lupus) which belongs to Canidae family [23]. It is seen in nature that Grey wolves prefer to live in a pack. The main phases of their hunting are i) tracking, chasing, and approaching the prey; ii) Pursuing, encircling, and harassing the prey until it stops moving; and iii) attack towards the prey. In the pack, wolves are categorized into several different types and each type perform specific task.
In GWO, the wolf with fittest solution mark as the alpha (α). Consequently, the second and third best ones are named beta (β) and delta (δ) respectively. The rest of the candidate solutions are assumed to be omega (ω). In the GWO, the hunting (optimization) is guided by α, β, and δ. The ω wolves have to follow these three types of wolves. During hunting Grey wolves encircle the prey. The mathematical model of encircling behavior is illuminated through the following (23) and (24).
(23)
(24)
Here, t shows the current iteration, and are both coefficient vectors, represents the position vector of the prey, and the position vector of a grey wolf is indicated by. The vectors and are calculated through (25) and (26).
(25)
(26)
In (25) and (26), component is decreased from 2 to 0 linearly over the course of iterations and, are both random vectors in [0, 1]. The mathematical simulation of the hunting behavior of grey wolves illustrates that the alpha, beta, and delta have fair knowledge about the potential location of prey. As a result, the first three best solutions obtained so far are saved and also oblige the other search agents (including the omegas). The following equations are used to fulfill this purpose.
(27)
(28)
(29)
According to GWO the best solution is the position of alpha wolf . And to adapt GWO for S-System model, every position of wolf has been adapted according to the Table 1. The best position of alpha is identified by the fitness function in (14). The beta and gamma wolf also identified by the same fitness function.
4. Experimental Studies
This section gives experimental settings S-System model parameters and gene expression data. After that settings of PSO, GSO and GWO are explained. Finally experimental results have been presented and discussed accordingly.
4.1. Gene Expression Benchmark Datasets
In this study, both synthetic and real gene expression benchmark data are considered. The gene expression data is available in a two dimensional matrix form in which each column represents an individual gene and each row represents the expression level of all genes within an experiment. Table 2 shows the brief description of the datasets which shows a considerable variety in the number of types, gene number, series, and sample size.
Network Name | Gene Size | Series | Samples | Type | Source |
SOS DNA network | 8 | 4 | 50 | Real | Uri Alon [24] |
InSilico_Size10_1 | 10 | 5 | 21 | Synthetic | DREAM4 [25] |
SOS DNA network is a well-known real genetic network published by the Uri Alon [24] group. It is the time series data of different multi array experiments. In their experiments, eight genes are expressed (uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA, and polB). They irradiate their DNA with ultraviolet light, which will affect some genes, and the network will repair itself, thus expressing auto regulation. They did four experiments for different light intensities. Each experiment had 50 time steps spaced by 6 min. During each time step, they take measures of the eight genes.
In DREAM4 [25] the datasets are provided for InSilico Network Challenge. The goal of the in silico network challenge is to reverse engineer gene regulation networks from given in silico gene expression datasets. Network topologies are obtained by extracting subnetworks from transcriptional regulatory networks of E. coli and S. cerevisiae. They adapted the subnetwork extraction method to preferentially include parts of the network with cycles. Auto-regulatory interactions were removed, i.e., there are no self-interactions in the in silico networks. All networks and data are generated with version 2.0 of GNW. There are three different datasets provided in this challenge. They are InSilico_Size10, InSilico_Size100, and InSilico_Size100_ Multifactorial.
In silico network challenge the dataset InSilico_Size_10 contains the gene expression data of the network that contains 10 genes. There are 5 different datasets (InSilico_Size_10_1, InSilico_Size_10_2, InSilico_Size_10_3, InSilico_Size_10_4 and InSilico_Size_10_5). In this study InSilico_Size_10_1 dataset has been used.
4.2. Experimental Setup
In this study search region has been set in the range of [−2, 2] for the kinetic orders and [0, 5] for the rate constants α and β. The search region is set in such a manner that the search will be fast to converge. Population size and generation are set to 40 and 5000 respectively to draw the comparison among these algorithms. In PSO acceleration coefficients are 2.0 and the inertia weight ω is in range [0. 2, 0.9]. In GSO the initial head angle of each individual, is set to be (π/4,..., π/4). The constant a was given by) where n is the dimension of the search space. The maximum pursuit angle is π/a^{2}. The maximum turning angle is set to be. The maximum pursuit distance is calculated from the following equation:
(30)
where, and are the lower and upper bounds for the ith dimension. In this study the result is taken from the best among 5 trials. The experiments have been done on a PC (Intel Core i7 @4.40GHz CPU, 8GB RAM, Windows 7 OS, MATLAB 2015).
4.3. Evaluation Technique
To evaluate the performance any GRN inference techniques the inferred network is compared with the true network parameters. Also when the true network parameters are not given then the network is inferred from time series gene expression data. After that, time series data is generated from the new inferred network. Finally the newly generated time series data is matched with the corresponding previous time series data. If the two datasets get matched then the network is inferred correctly.
The performance evaluated by receiver operator characteristic (ROC) curve. In general, ROC curve is a graphical tool for depicting true hit rate along the vertical axis (the number of target events correctly classified as targets) as compared to false alarm rate along the horizontal axis (the number of target events incorrectly classified as non-targets). In GRN inference evaluation, the ROC curve is created by plotting the fraction of true positive rate (i.e., true positives out of the total actual positives) vs. false positive rate (i.e., the fraction of false positives out of the total actual negatives), at various threshold settings. The following equations are used to calculate true positive rate (TPR) and false positive rate (FPR).
(31)
(32)
Here TP=True Positive (i.e., links are correctly identified), FP=False Positive (i.e., identified links are not correct), TN=True Negative (i.e., correctly identified that there is no links between genes), FN=False Negative (i.e., failed to identify links between genes).
4.4. Evaluation on SOS DNA Real Gene Expression Data
To infer SOS DNA network, at first the S-System parameters have been estimated from time-series data provided by the Uri Alon Laboratory [24]. Among the 8 genes, 2 genes (uvrY, ruvA) have little involvement in regulations [26]. Therefore, this study considered 6 genes as like study of Leon et al. [26] and actual network is shown in Figure 2.
The S-System parameters are estimated first then the network is reconstructed. The kinetic orders g_{ij} and h_{ij} determine the structure of the regulatory network. In the case g_{ij} > 0, gene j induces the synthesis of gene i. If g_{ij} < 0, gene j inhibits the synthesis of gene i. analogously, a positive (negative) value of h_{ij} indicates that gene j induces (suppresses) the degradation of the mRNA level of gene i. Now the graphical representation of the SOS network and the networks inferred by the SI techniques are represented in Figure 3. The performance evaluation of the algorithms used in this study on the SOS DNA dataset are shown in the Table 3. And the ROC plot is shown in Figure 4.
Connection Status | True Network | GSO | GWO | PSO |
True Positive (TP) | 5 | 2 | 2 | 2 |
True Negative (TN) | 31 | 18 | 17 | 18 |
False Positive (FP) | - | 14 | 14 | 13 |
False Negative (FN) | - | 2 | 3 | 3 |
From Table 3 it is observed that TP and TN in SOS DNA true network are 5 and 31, respectively. In experimental result, it is seen that all the methods identified TP as 2; GSO and PSO identified TN as 18; and GWO showed TN 17. For good inference of GRN, FP and FN are expected to minimum. From this point of view, PSO and GSO are competitive and GWO is worse. To visualize these result ROC plot has been drawn in Figure 4, where the x-axis represents the FPR and the y-axis represents the TPR. In the ROC plot, the algorithm having a high TPR and a low FPR GSO is the best in SOS DNA experiment.
4.5. Evaluation on DREAM4 Data
Synthetic DREAM4 InSilico_Size10_1 dataset is for 10 genes and is relatively larger than SOS DNA. Table 4 shows the summary of true and inferred networks. From Table 4 it is observed that, in InSilico_Size10_1 true network, TP is 15 and TN is 75. It is noticeable that GSO identified all true connections showing TP value as 15. On the other hand, GSO and PSO showed TP values 10 and 12, respectively. Moreover, FN is zero for GSO; whereas FN values for GWO and PSO are 5 and 3, respectively. From the ROC plot drawn in Figure 5, GSO is also shown the best among the three SI methods for DREAM4 data.
Connection Status | True Network | GSO | GWO | PSO |
True Positive (TP) | 15 | 15 | 10 | 12 |
True Negative (TN) | 75 | 6 | 17 | 17 |
False Positive (FP) | - | 69 | 58 | 58 |
False Negative (FN) | - | 0 | 5 | 3 |
5. Conclusions
Network systems are easy to understand and visualize the genetic interactions in cell organisms. So GRN is a significant way to represent the gene regulations. S-System model is considered as an effective way of GRN inference. S-System parameter estimation as an optimization task and three prominent SI based methods are investigated in this study for to estimate its parameter. In this study both real (i.e., SOS DNA) and synthetic (i.e., DREAM4 InSilico_Size10_1) datasets have been used for S-System model based GRN inference. Among these techniques GWO has shown fair performance in both datasets for GRN inference. Finally, the study revealed the scope of SI based optimization in GRN inference through S-System parameter estimation.
References