Input Dataset Survey of In-Silico Tools for Inference and Visualization of Gene Regulatory Networks (GRN)

Abstract: Understanding gene regulatory networks (GRNs) is considered fundamental to answering many biological questions, and the input dataset plays a crucial role in inferring and visualizing a GRN [5, 14, 17, 23, 34, 37, 40, 41, 44, 45]. Several software tools [2, 5, 7, 10, 11, 14, 21, 22, 25, 31-33, 37, 38, 40, 41, 44] have recently been developed for GRN inference; some are designed for a particular dataset, a particular organism, or a particular diseased cell. The questions that prompted this review are: what kind of omic data is needed to construct a GRN? Does a GRN built from a particular kind of data have any peculiar properties? And could data from various omic experiments be integrated in the form of a knowledge base? The input datasets for GRN inference are transcriptome information, which we analyze comprehensively together with the two major technologies (sources) that produce it. We consider four omic datasets and their two source technologies, which are hybridization-based and sequence-based. Datasets from microarray and ChIP-Chip experiments are hybridization-based, while RNA-seq and ChIP-seq datasets are sequence-based. Software tools published on the Omic Tool website (http://omictools.com/gene-regulatory-networks-c435-p1.html) are analyzed for this review. We find, however, that the major disparity is whether the dataset is a ChIP-X (ChIP-Chip and ChIP-seq) or an expression (microarray and RNA-seq) dataset, not whether its source is hybridization-based or sequence-based. Moreover, ChIP-X datasets make it possible to investigate a wider range of biological problems. The importance of gene regulatory networks motivates a GRN software template that contains all the additional data from ChIP-X experiments, together with a knowledge base of biological prior knowledge that integrates data from different omic datasets.


Introduction
Investigating gene regulatory networks is an approach in bioinformatics to study the interactions among genetic materials (genes, proteins, enzymes, ligands, etc.) and between these materials and other cellular components. Gene regulatory network inference is a quickly evolving field, with new developments and algorithms being published almost daily. Modeling a GRN with computational techniques involves developing a virtual biological cell that represents the dynamics of the interactions and reactions among cellular components.
The development and functioning of an organism's cells emerge from interactions in genetic regulatory networks [16], and the regulation of gene expression is achieved through interactions between DNA, RNA, proteins, and small molecules. This regulatory system can be described by a network structure called a genetic regulatory network (GRN) [35]. Several mathematical and computational models have been developed to analyze the gene regulatory networks and metabolic networks of different cells, especially in different disease traits [4, 6, 9, 12, 15, 16, 35, 43, 44].
Various mathematical and computational approaches have been used to infer gene regulatory networks, and this has resulted in the development of different in-silico tools for the reconstruction and visualization of gene regulatory networks [10, 15, 20, 26, 35, 43, 45]. Classifications of these computational techniques, based on different criteria, have been reviewed in several articles. In [16], a general review of existing models was performed along a number of dimensions: the approaches were compared on whether the models were discrete or continuous, static or dynamic, deterministic or stochastic, and qualitative or quantitative. [36] classified the computational models of gene regulatory networks into four classes: (i) logical models, (ii) continuous models, (iii) single-molecule models and (iv) hybrid models. In yet another review, [27] categorized network inference algorithms by their major properties: (i) the underlying method, (ii) the result, (iii) the directionality of interactions, (iv) the consideration of dynamics, (v) the integration of prior knowledge (PK), (vi) non-linearity or linearity, (vii) the explicit consideration of stimulation, (viii) the consideration of stochastics and application of probabilities, (ix) the network size, (x) the amount of required data and (xi) the availability as a software tool.
Some of the tools for investigating and visualizing gene regulatory networks were presented in [24], including LegumeGRN, geWorkbench, GENeVis, Cytoscape, NetBioV and FastMEDUSA. The strengths and weaknesses of these tools were discussed based on a study of their features. In this paper, we perform a comprehensive analysis of GRN software tools based on their various input datasets. We analyze the kinds of input datasets used to visualize gene regulatory networks, and investigate the peculiar properties that a GRN acquires when it is built from a particular dataset obtained from a particular source. We also analyze tools that use additional parameters apart from the major dataset, and the features that these additional parameters add to the GRN.
The paper is arranged as follows: Section 2 describes the various input datasets used to visualize GRNs. The GRN visualization and inference tools are analyzed in detail in Section 3 to reveal how each tool is built around a particular dataset. Section 4 presents the conclusion, while suggestions are outlined in the last section.

GRN Input Data
The analysis of large datasets derived from various biological experiments plays a vital role in functional genomics, and good inference of gene regulatory networks is one such analysis. The availability of high-throughput technologies that measure the expression of thousands of genes simultaneously has given rise to different kinds of genomic datasets used to visualize GRNs [22], and the majority of GRN software tools were built around a particular dataset. The datasets used for in-silico inference and visualization of gene regulatory networks are transcriptome information, and the transcripts are acquired and quantified through two major technologies: (i) hybridization arrays and (ii) sequence-based approaches. The transcriptome gives the complete set of transcripts in a cell, and their quantities, so that the functional constituents of the cell can be analyzed [39]. The transcriptome information produced by these technologies is as follows:

a) mRNA: Messenger RNA is the major RNA molecule produced at the transcription stage of gene expression, that is, the conversion of DNA to RNA molecules. It carries genetic information from DNA to the ribosome, where it specifies the amino acid sequence of the corresponding protein. mRNAs are arranged into codons of three bases each, and each codon encodes a specific amino acid, except the stop codons, which terminate protein synthesis. mRNA is a sequence of nucleotides [28].

b) microRNA (miRNA): This is a conserved class of small non-coding RNAs whose function is RNA silencing and post-transcriptional regulation of gene expression [30]. miRNAs are about 22 nt long and are processed by Dicer from precursors with a characteristic hairpin secondary structure [3].

c) tRNA: Transfer RNA is typically 76 to 90 nucleotides in length and serves as the carrier of amino acids to the ribosome, which adds each amino acid to the protein being synthesized as an elongating chain of amino acid residues, using the information on the mRNA to determine which amino acid comes next. Each type of amino acid has its own type of tRNA, which binds the amino acid and transports it to the growing end of the polypeptide chain when the next code word on the mRNA calls for it [28].

d) rRNA: Ribosomal RNA is an important part of the ribosome. It makes up about 60% of the ribosome, the remaining 40% being protein molecules. A ribosome is divided into a large and a small subunit, each of which contains its own rRNA molecule or molecules [28]. The nucleotide sequence of rRNA is highly complex and depends on whether it is eukaryotic or prokaryotic rRNA, and on whether it belongs to the large or the small subunit. The basic function of ribosomes is the translation of mRNA into protein by linking amino acids together. The rRNA of the small subunit reads the mRNA to determine the order of amino acids, while the rRNA of the large subunit links the amino acids together.

e) Other non-coding RNAs: These include snoRNAs, microRNAs, siRNAs, snRNAs, exRNAs, piRNAs and the long ncRNAs.

These datasets, generally known as gene expression data, are the major input datasets used to infer and visualize GRNs in the literature [10,11,13,14,19,21,22,25,26,29,31,34,37,38,40,41]. Their common attribute is that they are all made up of nucleotide bases, but they differ in length and in the functions they perform. It is therefore crucial to know the structure and function(s) of a particular dataset in advance, before using it to infer and visualize a GRN.
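The codon structure of mRNA described above can be illustrated with a minimal Python sketch. The codon table here is deliberately partial and the function name `translate` is hypothetical; it is meant only to show how a sequence is read three bases at a time until a stop codon is reached, not to serve as a complete translation routine.

```python
# Partial codon table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Read an mRNA string three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # stop codons terminate protein synthesis
        # "X" marks codons that are missing from the partial table above.
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


print(translate("AUGUUUGGCGCUUAA"))  # prints "MFGA"
```

The same three-base reading frame is what distinguishes mRNA from the shorter regulatory RNAs in the list, which are not translated.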
In addition to these major datasets, other datasets act as additional input parameters in inferring GRNs, such as transcription start sites (TSSs), transcription factors (TFs), promoter elements and binding signals. The use and combination of these datasets, and the influence of the input dataset on the outcome, are analyzed in the following section. A number of organisms and disease traits, such as cancer and cardiovascular diseases, have been profiled, and the data are available in public databases such as the Gene Expression Omnibus (GEO) repository.
Every input dataset for in-silico investigation of regulatory networks has a source, and the sources have been categorized into two groups: hybridization-based and sequence-based technologies [29]. The primary aim of both technologies is to obtain the transcriptome information of many different genes at the same time. Hybridization-based technology is a technique for identifying, among a sample of many different DNA fragments, the fragment(s) containing a particular nucleotide sequence. This is achieved by combining two complementary single-stranded DNA molecules and allowing them to form a single double-stranded molecule through base pairing. The hybridization-based techniques widely used for GRN analysis and visualization are DNA microarrays and ChIP (chromatin immunoprecipitation) microarrays (ChIP-chip), whereas the widely used sequence-based techniques are RNA-seq and ChIP-seq (chromatin immunoprecipitation sequencing), which are Next Generation Sequencing techniques. Table 1 shows the properties of these techniques [17,34,18]. A full understanding of each dataset is paramount in inferring a gene regulatory network, and integrating multiple datasets from distinct sources, where it is necessary, is challenging [40].
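The base-pairing principle behind hybridization-based technology can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the probe and fragment sequences are invented, the helper names `reverse_complement` and `hybridization_sites` are hypothetical, and only perfect complementarity is considered, whereas real hybridization tolerates some mismatches and depends on experimental conditions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the strand that would base-pair with `seq`."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridization_sites(probe, fragment):
    """Start positions in `fragment` where `probe` pairs perfectly."""
    target = reverse_complement(probe)
    width = len(target)
    return [i for i in range(len(fragment) - width + 1)
            if fragment[i:i + width] == target]


# Invented example sequences (assumptions, not data from any experiment).
fragments = {"frag1": "GGATCCTTAGC", "frag2": "TTTTTTTTTT"}
probe = "GCTAAGG"

for name, seq in fragments.items():
    print(name, hybridization_sites(probe, seq))
# prints:
# frag1 [4]
# frag2 []
```

Only the fragment containing the complementary sequence is reported, which mirrors how a probe picks out one fragment from a mixed sample.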

GRN Visualization Tools
We perform a comprehensive analysis of several software tools for investigating and visualizing gene regulatory networks, obtained from the publicly available biological tools on the Omic Tool website (http://omictools.com/gene-regulatory-networks-c435-p1.html). Their properties in relation to their input datasets are reported in the literature [1, 2, 5, 7, 8, 10, 11, 13, 14, 19-22, 25, 26, 29, 31, 34, 37, 38, 40, 41]. We present the summarized analysis of the tools in Table 2, taking cognizance of the input dataset among other properties.

a) ARACNE (Algorithm for the Reconstruction of Accurate Cellular Network): ARACNE is a powerful network inference tool designed to scale up to the complexity of regulatory networks in mammalian cells. It uses microarray datasets, both realistic (from human B cells) and synthetic. The microarray data make it possible to measure statistical interactions and dependencies using mutual information, which does not require discretization of the expression level [29]. However, ARACNE is unable to infer edge directionality because it does not use temporal data.

b) ARTIVA (Auto Regressive TIme VArying Model): ARTIVA uses time-course gene expression data to perform a gene-by-gene analysis and to infer the topology of the network and how it changes over time, using two microarray datasets. Data on the developmental stages of Drosophila melanogaster and benomyl data are the major inputs to the software, complemented by gene ontology information to perform the knock-out procedure, and by functional annotations and transcription factor binding information to assess the biological relevance of the results [26].
ARTIVA provides a powerful and evolving mechanism for inferring gene regulatory networks because it investigates the nature of the incoming input, works with continuous datasets, and needs no threshold to define up- and down-regulated groups of genes. It is therefore concluded that ARTIVA will be able to incorporate data originating from other sources such as ChIP-Chip or ChIP-seq experiments.

c) AtmiRNET: AtmiRNET is another software tool for biological network inference, used to explore mechanisms of transcriptional regulation and microRNA functions in Arabidopsis thaliana. The gene regulatory networks give users an intuitive insight into the pivotal roles of Arabidopsis miRNAs through the crosstalk between miRNA transcriptional regulation (upstream) and miRNA-mediated (downstream) gene circuits [5].
The input data include Arabidopsis miRNAs, core promoter elements and high-confidence transcription factors. It was observed that plant miRNAs are primarily encoded in intergenic regions and, unlike animal miRNAs, have their own promoters.

d) CMGRN: Constructing Multilevel Gene Regulatory Networks (CMGRN) performs several investigative biological functions as well as gene regulatory network inference. It uses ChIP-seq counts, binding data, gene expression profiles, and miRNAs and their targets as input datasets. According to [14], the ChIP-seq data allow high-fidelity mapping of different regulators; the binding data are used to identify regulatory modules and reconstruct the GRN; the gene counts are used to infer the causal relationship between transcription factors (TFs) and epigenetic modifications; and the gene expression data and miRNA/regulatory signals of TFs are used to construct the GRN from multi-level factors.
In summary, CMGRN generates hierarchical regulatory network structures controlled by interacting factors at the transcriptional, post-transcriptional and epigenetic layers.

e) ChIP-Array: This is a web server for biological network visualization that integrates ChIP-X (ChIP-Chip, ChIP-seq, etc.) and gene expression data from human, mouse, yeast, fruit fly and Arabidopsis, so that the ChIP-X and expression data are analyzed together. It requires binding locations from ChIP-X data, differential expression data from a gene expression profile, and other parameters to construct the GRN [34]. The direct and indirect target genes regulated by a transcription factor (TF) of interest are detected, which ultimately aids the characterization of the function(s) of the TF.
f) ChIP-Array2: ChIP-Array2 is an enhanced version of ChIP-Array that accommodates additional types of omics data, from rat and worm, to investigate a more comprehensive gene regulatory network involving diverse regulatory components. ChIP-Array2 can detect the direct and indirect target genes separately, because the ChIP-X and expression data are treated independently. It can run without either the ChIP-X or the expression data: direct target genes are detected with only expression data, while indirect target genes are detected with only ChIP-X data [37].

g) iRegulon: This is another powerful network inference tool, designed to identify master regulators and detect target genes in human, mouse and Drosophila. It is used for motif discovery and provides access to many cancer-related TF-target subnetworks (regulons) [21]. Its input datasets are TF binding data (from ENCODE ChIP-seq), co-expressed genes downstream of a TF perturbation, miRNAs, and genes involved in the same signaling pathway.
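The way a tool such as ARACNE derives an undirected network from mutual information can be illustrated with the following sketch. It is not ARACNE's implementation: it assumes jointly Gaussian expression values so that mutual information can be computed from the correlation coefficient without discretization, and then applies a pruning step based on the data processing inequality, which removes the weakest edge in every fully connected gene triplet. The function names, the threshold and the simulated data are all assumptions made for illustration.

```python
import itertools

import numpy as np


def gaussian_mi(expr):
    """Pairwise mutual information, assuming Gaussian data (genes in rows)."""
    rho = np.corrcoef(expr)
    np.fill_diagonal(rho, 0.0)                # ignore self-information
    rho = np.clip(rho, -0.999999, 0.999999)   # avoid log(0)
    return -0.5 * np.log(1.0 - rho ** 2)


def dpi_prune(mi, threshold):
    """Keep edges above `threshold`, then drop the weakest edge per triplet."""
    keep = mi > threshold
    np.fill_diagonal(keep, False)
    remove = np.zeros_like(keep)
    for i, j, k in itertools.combinations(range(mi.shape[0]), 3):
        if keep[i, j] and keep[j, k] and keep[i, k]:
            edges = [(mi[i, j], i, j), (mi[j, k], j, k), (mi[i, k], i, k)]
            _, a, b = min(edges)              # weakest edge is likely indirect
            remove[a, b] = remove[b, a] = True
    return keep & ~remove


# Simulated regulatory chain gene0 -> gene1 -> gene2 (invented data).
rng = np.random.default_rng(0)
g0 = rng.normal(size=500)
g1 = g0 + 0.5 * rng.normal(size=500)
g2 = g1 + 0.5 * rng.normal(size=500)
expr = np.vstack([g0, g1, g2])

network = dpi_prune(gaussian_mi(expr), threshold=0.1)
print(network.astype(int))
```

The resulting adjacency matrix is symmetric, which reflects the point made above that a mutual-information network built from non-temporal data carries no edge direction.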
Finally, we observed that a large percentage of the tools, if not all, take additional input parameters such as TSSs, TFs, protein-protein interactions, functional annotations, binding information and other gene ontology information to unravel the complexity of the interactions and relationships in a GRN.
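As an illustration of how binding information and expression data can be combined, the sketch below classifies target genes in the spirit of the ChIP-X plus expression integration described above. It rests on assumptions that are not taken from any tool's documentation: a direct target is taken to be a gene that is both bound by the TF and differentially expressed, and an indirect target is a differentially expressed gene regulated by a direct target according to a hypothetical prior-knowledge map `tf_targets`. All gene names are invented.

```python
def classify_targets(bound_genes, de_genes, tf_targets):
    """Split differentially expressed genes into direct and indirect targets.

    bound_genes: genes with a binding site for the TF of interest (ChIP-X).
    de_genes:    differentially expressed genes (expression data).
    tf_targets:  hypothetical prior knowledge, regulator -> set of targets.
    """
    direct = bound_genes & de_genes
    indirect = set()
    for regulator in direct:
        # Genes downstream of a direct target that also change expression.
        indirect |= tf_targets.get(regulator, set()) & de_genes
    indirect -= direct
    return direct, indirect


# Invented example inputs.
bound = {"G1", "G2", "G5"}
changed = {"G1", "G3", "G4", "G5"}
prior = {"G1": {"G2", "G3"}, "G9": {"G4"}}

direct, indirect = classify_targets(bound, changed, prior)
print(sorted(direct), sorted(indirect))  # prints ['G1', 'G5'] ['G3']
```

In this toy example G4 changes expression but is explained by neither route, which is the kind of case where further prior knowledge would be needed.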

Conclusion
Analysis of input datasets, especially with respect to the technology behind the source data, gives future researchers insight into the kind of data to use for a specific investigation. We have dissected GRN input datasets and found that, irrespective of the source technology, a dataset is either expression data or regulatory (ChIP-X) data. Although Next Generation Sequencing (NGS) allows the elucidation and demarcation of complex transcriptional regulatory networks [18], the major disparity is whether the dataset is a ChIP-X or an expression dataset, not whether its source is hybridization-based or sequence-based.
Gene expression datasets come from microarray or RNA-seq experiments, while ChIP-X datasets come from either ChIP-Chip or ChIP-seq experiments. We observed from this analysis that no single method is suitable for GRN inference under all biological conditions [10]. The suitability of an inference method depends on the kind of dataset employed, and the important features of each dataset can suggest other hypotheses for investigating organisms or diseased cells.
Notwithstanding the presence of expression data in all the tools analyzed above, ChIP-X datasets have been observed to help elucidate the genetic, epigenetic and environmental states of a cell, and to help to a great extent in determining the phenotype of the cell. In fact, [41] submitted that no true causal relationships can be represented with any purely expression-driven method, and that the problem can be solved by using ChIP-seq binding data. Also, [13] observed that, with ChIP-X datasets, GRN visualization can be combined with the modeling of metabolic and signaling networks to model the global operation of cells with unprecedented completeness and accuracy. Besides, with ChIP-X datasets, the relevant subnetworks that underlie observed genetic interactions can be reconstructed [21], and cooperation with other regulatory elements, such as splicing factors and long non-coding RNAs, can be studied [38].
Finally, the networks produced by inference tools with more input parameters are more accurate than those produced from an individual dataset, and the E. coli dataset is believed to be the benchmark among biological datasets [19].

Suggestions
Taking this comprehensive analysis into consideration, and given the importance of gene regulatory networks as the fundamental approach to the majority of biological questions, we make the following suggestions.

a) A concerted effort should be made to design a GRN software template that contains all the additional data from ChIP-X experiments, which are what distinguish ChIP-X data from expression data. This will make expression datasets useful for further investigations such as histone modification analysis, DNA methylation analysis, functional analysis of transcription factors and Transcription Factor Binding Site (TFBS) mapping.

b) A knowledge base of prior knowledge should be built and made available to every developer of GRN inference tools. We observed that all the tools require information such as gene ontology, functional annotation, protein-protein interaction and binding information to reconstruct the GRN successfully. Such a knowledge base can also help to achieve suggestion (a) by integrating data from different omic datasets into a single resource.

c) Both the expression and the ChIP-X datasets of E. coli should be obtained and made available for use as the simulated dataset for all GRN inference tools, because the majority of the authors of the papers reviewed used E. coli data successfully.