Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network

The mechanism of prokaryotic gene expression remains incompletely understood. Promoters are regions of the genome that are located upstream of genes and regulate gene expression. Although more and more E. coli K-12 promoter sequences have been obtained experimentally, and some regions such as the -10 region and the -30 region have been described, the features of promoter sequences are far from being explicitly characterized. Here, we address this challenge using an approach based on a deep convolutional neural network (CNN). We collected six classes of E. coli K-12 promoter sequences, all of which are annotated with strong evidence and belong to only one promoter class in the RegulonDB database. We then applied the CNN model to recognize the six classes of promoters. The CNN model achieved an accuracy above 97% for all six classes. Next, we extracted the weight matrix of the last convolution layer of the CNN with the Grad-CAM algorithm and converted it to an information content matrix. Finally, we visualized the information content matrix as promoter logos using the logomaker tool and identified the promoter features in the six classes of promoters. Our approach not only found the previously described promoter feature regions but also discovered promoter features with better sensitivity and accuracy. We provide a novel computational approach for discovering features in biological sequences.


Introduction
Promoters are regions of DNA that are located upstream of genes and regulate gene expression [1,2]. In bacteria, the promoter is recognized by RNA polymerase and an associated σ factor. In E. coli, seven classes of σ promoters have been found: σ24, σ28, σ32, σ38, σ54, σ70 and σ19 [3,4]. RegulonDB is a database of the regulatory network of gene expression in E. coli K-12. Currently, RegulonDB contains about 8000 E. coli K-12 promoter sequences. Among them, about 1200 promoter sequences are annotated with strong evidence as belonging to one or more σ classes [5].
The convolutional neural network (CNN) is one of the most important models in deep learning [6,7]. CNNs have become the gold standard for numerous image analysis tasks [8] and surpass many established algorithms, such as support vector machines and random forests [9][10][11]. CNNs also perform better than the PSSM (Position-Specific Scoring Matrix) method in recognizing E. coli K-12 promoters [12][13][14][15]. In a previous study, we demonstrated that a CNN outperforms the PSSM method in identifying different promoter classes; however, it was unclear why the CNN performs better [16].
The weight matrix of the last convolution layer of a CNN contains the features extracted from the input data, and the Grad-CAM algorithm makes it possible to visualize the weight matrices of the intermediate convolution layers of a CNN [17]. In this work, we first trained a CNN model and then used the Grad-CAM algorithm to obtain the weight matrix of the last convolution layer. We then converted this matrix to an information content matrix. Finally, we used the logomaker tool [18] to visualize the information content matrix as a promoter logo. Our method not only successfully displays the well-known -10 and -30 regions shown by the Weblogo method [19], but is also more accurate than the Weblogo result. Moreover, our method is more sensitive in discovering dominant positions and bases in the promoter sequence outside the -10 and -30 regions. These factors explain why the CNN performs better than the PSSM in discovering promoter features.

Promoter Sequences
The E. coli K-12 promoter DNA sequences were obtained from the RegulonDB database (http://regulondb.ccg.unam.mx/menu/download/datasets/index.jsp). We collected six classes of promoter sequences: σ24 (66), σ28 (10), σ32 (51), σ38 (102), σ54 (19) and σ70 (766), where the number in parentheses is the number of sequences. In the RegulonDB database, these promoter sequences are all annotated with strong evidence and belong to only one promoter class. Each promoter sequence is 81 nt long and consists of a 60-nt upstream region, the 1-nt transcription start site (position 0) and a 20-nt downstream region (Figure 1A). In this study, we used the 60-nt upstream region as the input to the CNN.
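The input preparation described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: it assumes that each 81-nt sequence is written 5' to 3' (so that the first 60 characters are the upstream region) and that the CNN input is a 4 x 60 one-hot matrix with rows in the order A, C, G, T. The function names and the row order are hypothetical.

```python
import numpy as np

BASES = "ACGT"  # assumed row order; the text does not specify it


def upstream_region(promoter_81nt):
    """Return the 60 nt preceding the transcription start site.

    Assumes the 81-nt string is written 5'->3': 60 nt upstream,
    1 nt start site, 20 nt downstream.
    """
    if len(promoter_81nt) != 81:
        raise ValueError("expected an 81-nt promoter sequence")
    return promoter_81nt[:60]


def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) matrix of 0/1 values."""
    x = np.zeros((len(BASES), len(seq)), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        x[BASES.index(base), j] = 1.0  # raises ValueError on non-ACGT symbols
    return x


# Toy sequence for illustration only (not a real promoter).
toy = "A" * 60 + "G" + "T" * 20
x = one_hot(upstream_region(toy))
```

Each column of `x` contains exactly one 1, marking the base observed at that position.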
Next, we constructed a convolutional neural network. The CNN contains three convolution layers. Each convolution layer is followed by a batch normalization layer and a dropout layer to reduce overfitting. We set the padding parameter to "same" so that the matrix of the last convolution layer has the same size as the input matrix. The last convolution layer is followed by a flatten layer and an output layer (Figure 1C).
We performed 10-fold cross-validation to train the CNN model. We saved the model weights in an h5 file after each round of training and used the weights from the previous round as the starting weights for the next round. After several rounds, the performance of the CNN no longer improved. We used accuracy (Acc), specificity (Spec), sensitivity (Sen) and the ROC curve to evaluate the performance of the CNN model.
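The effect of "same" padding can be illustrated with a small one-dimensional sketch. It is not the network used here; it only shows, under the assumption of an odd kernel length and stride 1, why zero-padding by (k - 1) / 2 on each side keeps the output as long as the input. The function name is hypothetical.

```python
import numpy as np


def conv1d_same(x, kernel):
    """Slide an odd-length kernel over x with zero padding so that
    the output has the same length as the input (stride 1)."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    pad = (k - 1) // 2
    padded = np.pad(np.asarray(x, dtype=float), pad)  # zeros on both sides
    return np.array([np.dot(padded[i:i + k], kernel) for i in range(len(x))])


signal = np.arange(60, dtype=float)          # stands in for one row of a 60-nt input
out = conv1d_same(signal, np.array([1.0, 0.0, -1.0]))
```

Because every convolution layer preserves the input length in this way, each position of the last convolution layer can be mapped back to one position of the promoter sequence.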
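The three scalar metrics can be computed from the confusion counts as sketched below. The sketch assumes that each promoter class is evaluated as a binary (one-versus-rest) problem with labels 1 for the class and 0 otherwise; that framing and the function name are assumptions, not details taken from this study.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    sen = tp / (tp + fn) if (tp + fn) else float("nan")   # true positive rate
    spec = tn / (tn + fp) if (tn + fp) else float("nan")  # true negative rate
    return acc, sen, spec


# Toy labels: 2 positives and 8 negatives, one positive missed.
acc, sen, spec = binary_metrics([1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                                [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

In this toy case the accuracy is 0.9 although only half of the positives are recovered, which is why sensitivity and specificity are reported alongside accuracy.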

Obtaining the Last Convolution Layer Matrix (G) Using Grad-CAM Technique
The last convolution layer matrix contains the features extracted from the input matrix. For each promoter sequence, we used the Grad-CAM technique [17] to obtain its last convolution layer matrix (G) (Figure 1D).
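The core Grad-CAM computation can be sketched in a few lines. The sketch follows the standard formulation of the technique (channel weights obtained by averaging the gradients over all positions, followed by a weighted sum of the feature maps and a ReLU). It assumes that the activations of the last convolution layer and the gradients of the class score with respect to them have already been obtained from the trained model, and that both have shape 4 x 60 x K for K filters; `feature_maps`, `gradients` and `grad_cam` are hypothetical names.

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """Standard Grad-CAM map from activations and gradients.

    feature_maps, gradients: arrays of shape (rows, cols, channels).
    Returns an array of shape (rows, cols).
    """
    alpha = gradients.mean(axis=(0, 1))                   # one weight per channel
    cam = np.tensordot(feature_maps, alpha, axes=([2], [0]))
    return np.maximum(cam, 0.0)                           # ReLU keeps positive evidence


# Toy tensors for illustration: two channels with constant activations.
feature_maps = np.ones((4, 60, 2))
gradients = np.stack([np.full((4, 60), 2.0), np.full((4, 60), -1.0)], axis=-1)
G = grad_cam(feature_maps, gradients)
```

With "same" padding the resulting matrix has the same 4 x 60 layout as the input, so each entry can be read as the importance of one base at one position.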

Generating the Promoter Sequence Feature Matrix (S)
In matrix G, the rows represent the four bases. However, in a particular promoter sequence only one base occurs at each position, so we generated the promoter sequence feature matrix (S) for each promoter sequence from the matrix G (Figure 1E). In detail:
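A minimal sketch of this step, assuming that S keeps the entry of G for the base actually observed at each position and sets the other three rows to zero. This masking rule is an assumption made for illustration; the function name and the one-hot input are hypothetical.

```python
import numpy as np


def sequence_feature_matrix(G, x_onehot):
    """Keep G only where the sequence has the corresponding base.

    G and x_onehot must have the same shape (4 x length).
    """
    G = np.asarray(G, dtype=float)
    x = np.asarray(x_onehot, dtype=float)
    if G.shape != x.shape:
        raise ValueError("G and the one-hot matrix must have the same shape")
    return G * x  # element-wise mask


# Toy example with two positions: base A at position 0, base G at position 1.
G_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
x_toy = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
S_toy = sequence_feature_matrix(G_toy, x_toy)
```

Under this assumption every column of S has at most one non-zero entry, which matches the observation that only one base occurs at each position.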

Generating the Promoter Feature Matrix (P)
After we obtained the matrix S for each promoter sequence in the six promoter classes, we created the promoter feature matrix P for each promoter class (Figure 1F). We calculated the P matrix as follows: where m is the number of sequences in a particular promoter class.
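One plausible reading of this step, given that the definition involves the number of sequences m, is that P is the element-wise average of the m matrices S of a class. The sketch below implements that reading; a plain sum would be an equally simple alternative, and the function name is hypothetical.

```python
import numpy as np


def promoter_feature_matrix(S_list):
    """Average the per-sequence matrices S of one promoter class (assumed rule)."""
    S = np.stack([np.asarray(s, dtype=float) for s in S_list])  # shape (m, 4, length)
    m = S.shape[0]
    return S.sum(axis=0) / m


# Toy class with m = 2 sequences of length 2.
S1 = np.array([[2.0, 0.0], [0.0, 0.0], [0.0, 4.0], [0.0, 0.0]])
S2 = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
P_toy = promoter_feature_matrix([S1, S2])
```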

Generating the Promoter Feature Entropy Matrix (E)
Finally, we transformed the promoter feature matrix P into the promoter feature entropy matrix E (Figure 1G) as follows: where P_ij is the element in row i and column j of the matrix P.
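A common convention for sequence logos expresses each position in bits of information content. The sketch below applies that convention to P, assuming that P is non-negative: each column is normalized to sum to 1, the Shannon entropy H_j of the column is computed, and each base is given the height p_ij * (2 - H_j). This is an illustration of the general technique under stated assumptions and is not necessarily the transformation used in this study; the function name and the `eps` parameter are hypothetical.

```python
import numpy as np


def information_matrix(P, eps=1e-12):
    """Convert a non-negative 4 x length matrix into per-base information (bits)."""
    P = np.asarray(P, dtype=float)
    n_rows = P.shape[0]
    col = P.sum(axis=0, keepdims=True)
    # Columns with no signal are treated as uniform, i.e. zero information.
    prob = np.divide(P, col, out=np.full_like(P, 1.0 / n_rows), where=col > 0)
    entropy = -(prob * np.log2(prob + eps)).sum(axis=0)
    info = np.clip(np.log2(n_rows) - entropy, 0.0, None)  # 2 bits maximum for DNA
    return prob * info


# Toy matrix: position 0 is fully conserved, position 1 is uniform.
P_demo = np.array([[1.0, 0.25], [0.0, 0.25], [0.0, 0.25], [0.0, 0.25]])
E_demo = information_matrix(P_demo)
```

A fully conserved position reaches the maximum of 2 bits, whereas a position with no base preference contributes nothing to the logo.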

Creating Promoter Logo
We used logomaker [18] to visualize the promoter feature entropy matrix E and create promoter logos.

Identification of Promoters with CNN
First, we identified promoters with the CNN. Table 1 shows that the accuracies of the CNN model for recognizing the six promoter classes are all above 97%, while the AUCs (area under the ROC curve) for the six promoter classes are above 0.84, except for σ38 (AUC = 0.63) (Figure 2). The good performance of the CNN in identifying promoters is a prerequisite for discovering features in promoters.

Promoter Features
We used the Grad-CAM technique [17] to extract the weight matrix (G) of the last convolution layer of the CNN (Figure 1D), and then transformed the matrix G in turn into the promoter sequence feature matrix (S) (Figure 1E), the promoter feature matrix (P) (Figure 1F), and the promoter feature entropy matrix (E) (Figure 1G). The matrix E contains the promoter features in terms of information content. Finally, we visualized the matrix E using the logomaker tool [18]. Figure 3 shows the feature logos for the six classes of promoters. Our method discovered all feature regions successfully found by the Weblogo method, for example the -10 region and the -30 region (Figure 3).
The Weblogo method is based on the probability of a base occurring at a position in the promoter sequence [19]. In detail, the Weblogo method generates a PSSM (Position-Specific Scoring Matrix) and visualizes it. In a previous study, we demonstrated that the CNN outperforms the PSSM in promoter identification. An interesting question is why the CNN performs better than the PSSM. In this study, we found that the CNN could discover the importance of each base at each position in the promoter sequence more precisely than the PSSM. For example, in Figure 3B, at position -30, Weblogo shows that A is the dominant base, while our method shows that both A and G are important and that G is the dominant base. Moreover, our method is more sensitive than the PSSM method. For example, in Figure 3D, Weblogo shows faint signals outside the -10 region, whereas our method gives more signal detail. This better sensitivity and accuracy contribute to the CNN outperforming the PSSM method.
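For comparison, the per-position base probabilities that underlie a frequency-based logo can be sketched as follows. The sketch assumes equal-length, aligned sequences; the function name, the A, C, G, T row order and the optional pseudocount are illustrative assumptions.

```python
import numpy as np

BASES = "ACGT"  # assumed row order


def position_probability_matrix(seqs, pseudocount=0.0):
    """Estimate the probability of each base at each aligned position."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    counts = np.full((len(BASES), length), float(pseudocount))
    for s in seqs:
        for j, base in enumerate(s.upper()):
            counts[BASES.index(base), j] += 1.0
    return counts / counts.sum(axis=0, keepdims=True)


# Toy alignment of four dinucleotides.
ppm = position_probability_matrix(["AC", "AG", "AT", "AC"])
```

Such a matrix depends only on how often each base occurs at each position, whereas the Grad-CAM-derived matrices weight each base by its contribution to the CNN's class score.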

Conclusions
In this study, we demonstrated that a deep convolutional neural network model performs better than the traditional bioinformatics algorithm in finding features in DNA sequences. The approach could also be applied to finding features in protein amino acid sequences.