Chinese NER with SoftLexicon and Residual Gated CNNs

Abstract: Improving the accuracy and speed of Named Entity Recognition (NER), a key task in natural language processing, can further enhance downstream tasks. To address the insufficient recognition of nested entities and ambiguous entities by convolutional layers that lack context, we propose a method combining residual gated convolution and an attention mechanism. It fuses local continuous features into global ones so that stacked convolutional layers better capture contextual semantic information. Moreover, the embedding layer is optimized to fuse character and lexical information by introducing a dictionary, and is combined with a pre-trained BERT model carrying a priori semantics; the decoding layer works at the entity level to alleviate the problems of nested and ambiguous entities in long text sequences. To reduce the abundant parameters of the BERT model, only the residual gated convolutional layers are updated during training, while the BERT layer parameters are frozen. In experiments on the MSRA corpus, the BERT-SoftLexicon-RGCNN-GP model outperforms other models on the entity recognition task with an F1 value of 94.96%, and its training speed also exceeds that of the bidirectional LSTM model. Our model not only maintains a more efficient training speed but also recognizes Chinese entities more precisely, which is of practical value for fields requiring both accuracy and speed.


Introduction
Named Entity Recognition (NER) is the identification of relevant entity types in natural text and the determination of their labels. Identifiable information in the financial domain includes entities such as companies, brands, and legal entities; in the medical field, diseases, symptoms, and patient ages. As a fundamental task of natural language processing, accurate NER can effectively improve downstream tasks such as knowledge graphs, automatic question answering, and machine translation.
The mainstream recognition approach is to utilize the bidirectional capturing capability of BiLSTM for long text sequences, build a suitable model structure, and improve the probability calculation and the representation of the embedding layer to raise the accuracy of named entity recognition. With the growth of text sequence length and model parameter counts, some researchers have adopted convolutional layers as the backbone of recognition models. Parallel CNN models reduce training time, and stacked convolutional layers capture contextual semantics to improve entity recognition accuracy in long utterances; however, purely deep CNNs cannot solve the long-distance dependency problem.
Therefore, we propose an entity recognition model incorporating residual gated convolution, which combines local continuous features and high-dimensional spatial semantics to selectively keep association information, and adds an attention mechanism to capture the label-relevant semantics in the sequence: (1) To address the inadequate grasp of contextual information by stacked CNNs, we introduce a residual gated convolution component with an attention mechanism, which fuses local features into global ones and reduces invalid information input, thereby alleviating the gradient vanishing problem in convolutional layers and the semantic dependency problems caused by cross-layer propagation.
(2) To address the problems of nested entities and unregistered words, a pre-trained language model combined with lexical enhancement is proposed as the embedding layer; it introduces lexical information and the a priori semantics of a large language model to mine latent semantic information and alleviate the conflict between unregistered words and nested entities.
(3) To address the threshold problem of sequence label prediction on multi-class datasets, Global Pointer (GP) [1] is invoked to perform label prediction at entity-level granularity, reducing the extraction errors caused by correct sequence labeling but overly strict or lax decision conditions.

Related Research
NER approaches include rule-based [2,3], machine-learning-based [4-8], and deep-learning-based methods. In recent years, neural-network-based deep learning models have become a hot research topic due to the limitations of manual features.
Since neural networks automatically learn and capture semantic features from the corpus, the effectiveness of entity recognition relies heavily on the representation of word embeddings. There are three types of word embedding representations in current NER tasks: word-level, character-level, and hybrid representations.
The word level is an intuitive granularity of sentence splitting inherited from traditional recognition tasks [8]. Nowadays, word segmentation is usually performed with the help of external tools such as Jieba and HanLP to improve efficiency. Huang [10] proposed a word-level model based on LSTM-CRF [9] that effectively improves entity recognition performance. However, the coarse granularity of the word level not only significantly increases the parameters of the embedding layer but also introduces the out-of-vocabulary (OOV) problem.
Character-level representation solves the OOV problem and also avoids the propagation of subword errors. In English corpora, the minimum division is the character level, and prefixes and suffixes composed of characters provide important cues for annotation. Lample [11] modified the word embedding layer to concatenate morphological features extracted from prefixes and suffixes with word vectors as input to the LSTM.
Hybrid representation fuses multiple features as the input of a neural network. Ma [12] and Chiu [13] further optimized the input by adding CNNs (convolutional neural networks) to encode character information and capture long-range semantics, mixing character embedding and word embedding as the input of an LSTM to construct contextual information. Dong [14] introduced radical-level features of Chinese characters combined with semantic features. Peng [15] chose a lexicon to enhance the embedding layer information and took spliced word vectors as input to a bidirectional LSTM. Lattice LSTM [16], built on the character-based model, integrated hidden lexical-level semantic information and achieved an F1 value of 93.18% on the MSRA corpus.
The named entity recognition task can be regarded as a sequence labeling task, in which RNNs (recurrent neural networks) are widely used. Because the BiLSTM model has a strong ability to capture contextual semantics and model sequences, achieving remarkable results in sequence labeling tasks, a series of subsequent studies [10-13] used it as the structural basis. Nevertheless, as sequences grow, its long-sequence modeling ability diminishes, so some studies [17,18] used CNNs, which offer higher parallelism than LSTM, as the backbone structure and addressed contextual semantic capture by stacking deep CNNs. Strubell [18] proposed IDCNN for named entity recognition to improve training speed while maintaining recognition accuracy.
Overall, we use a convolutional encoder fused with residual gating together with GP as the decoding layer to better identify nested entities, introducing lexicon annotations into the semantic part of the extracted vocabulary. In more detail, the encoder stage uses stacked dilated convolution kernels to process the entire text sequence in parallel, expand the scope of feature capture, and extract contextual high-level semantic features of sentences. Then, a residual gated unit is introduced to fuse local context features into the global representation, acquire contextual semantics, and alleviate the gradient vanishing caused by cross-layer propagation. In addition, to solve the problem of the same entity being labeled differently in different contexts, a multi-head attention mechanism is introduced to extract global sentence features and address the long-distance dependency problem.

Model
The overall structure of the entity recognition model is shown in Figure 1; the whole model is divided into three parts: the embedding layer, the encoding layer, and the decoding layer. The embedding layer contains the pre-trained model, which yields a dynamic vector representation that effectively alleviates the problem of polysemy. The vectors of the embedding layer are then input to the encoding layer for feature extraction, which extracts word features in the convolution layer and further acquires high-dimensional lexical semantics in combination with the residual gated convolution layer. In the residual gated convolution module, the parameters are set to N=5, with a 3×3 convolution kernel and dilations of 1, 2, 4, 1, 1. Finally, the decoding layer uses GP to complete label prediction for the sequence, achieving the globally optimal sequence through global normalization combined with relative position encoding.
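As a rough sketch of this three-stage data flow (assuming PyTorch; the embedding, encoder, and decoding head below are simplified stand-ins for the components detailed in the following sections, not the authors' implementation):

```python
import torch
import torch.nn as nn

class NERPipelineSketch(nn.Module):
    """Illustrative data flow only: embedding -> residual dilated convolutions
    -> per-token query/key projections for entity-level decoding."""
    def __init__(self, vocab_size=21128, dim=768, num_types=3, head_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # stand-in for BERT+lexicon
        # N=5 blocks, kernel size 3, dilations 1,2,4,1,1 as stated above
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 1, 1)
        ])
        self.head = nn.Linear(dim, num_types * head_dim * 2)  # GP projections

    def forward(self, token_ids):                   # (batch, seq)
        x = self.embed(token_ids).transpose(1, 2)   # Conv1d wants (b, dim, seq)
        for conv in self.convs:
            x = torch.relu(conv(x)) + x             # simplified residual link
        return self.head(x.transpose(1, 2))         # (batch, seq, types*2*64)
```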

Embedding Layer
The embedding layer consists of the BERT [20] model embedding and the SoftLexicon [15] vector representation. The structure of the BERT model is shown in Figure 2: it uses the Transformer encoder as its basic architecture, and its input layer is the sum of word embedding, segmentation embedding, and position embedding carrying temporal information; the SoftLexicon part additionally integrates label embedding from an external dictionary. To avoid word segmentation, a character-level embedding representation is adopted to reduce the number of unregistered words, and the character-level embedding is fed into the BERT pre-trained model. Word enhancement effectively alleviates boundary recognition errors by assigning the words matched at each character to the sets {B, M, E, S} and concatenating the result to BERT's word vector to form the final embedding layer.
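The B/M/E/S word-set construction can be sketched as follows (a minimal illustration assuming the dictionary is a plain Python set; all names are ours):

```python
def match_word_sets(sentence, lexicon):
    """For each character position, collect dictionary words that begin (B),
    continue (M), or end (E) at it, or match it as a single-char word (S)."""
    n = len(sentence)
    sets = [{"B": [], "M": [], "E": [], "S": []} for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            w = sentence[i:j + 1]
            if w not in lexicon:
                continue
            if i == j:
                sets[i]["S"].append(w)          # single-character word
            else:
                sets[i]["B"].append(w)          # word begins here
                for m in range(i + 1, j):
                    sets[m]["M"].append(w)      # word continues here
                sets[j]["E"].append(w)          # word ends here
    return sets
```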
Since the frequency of a word is a static value obtained offline, using word frequency as the weight greatly speeds up the weight calculation for each word. Specifically, let $z(w)$ denote the frequency of word $w$ from the dictionary in the statistical data and $e^w(w)$ the embedding of the corresponding word; the weighted representation of a word set $S$ is then $v^s(S) = \frac{4}{Z}\sum_{w\in S} z(w)\,e^w(w)$, where $Z = \sum_{w\in B\cup M\cup E\cup S} z(w)$. The embeddings of the four sets are combined into one fixed-dimensional feature and appended to each character representation.
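A minimal sketch of this frequency-weighted pooling, assuming precomputed frequency and embedding lookups (all names are illustrative, not the authors' code):

```python
import torch

def soft_lexicon_feature(word_sets, freq, word_vecs, dim=50):
    """word_sets: (B, M, E, S) tuple of word lists matched at one character;
    freq: word -> static frequency z(w); word_vecs: word -> embedding tensor
    e_w(w). Returns the concatenated four-set feature of shape (4 * dim,)."""
    # normalizer Z: total frequency over all words in the four sets
    Z = sum(freq.get(w, 1) for ws in word_sets for w in ws) or 1
    pooled = []
    for ws in word_sets:                     # pool B, M, E, S separately
        if ws:
            v = sum(freq.get(w, 1) * word_vecs[w] for w in ws)
            pooled.append(4.0 / Z * v)
        else:
            pooled.append(torch.zeros(dim))  # empty set -> zero vector
    # the fixed-dimension feature appended to the character representation
    return torch.cat(pooled)
```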

Residual Gated Convolution
After the sequence is encoded by the pre-training layer, it is fed into a gated convolutional unit with residuals. By adding a gating mechanism to the one-dimensional dilated convolution, the necessary high-dimensional semantic information is selectively extracted while the context selection range is expanded. The residual gated linear unit is improved from the GLU (gated linear unit) by introducing a residual mechanism on top of it, as in Eq. 8:

$$\sigma = \mathrm{sigmoid}\big(\mathrm{Conv1D}_2(X)\big), \qquad Y = X \otimes (1 - \sigma) + \mathrm{Conv1D}_1(X) \otimes \sigma$$

The weights of the two convolutions are not shared: $\mathrm{Conv1D}_2$ is followed by the sigmoid activation function and $\mathrm{Conv1D}_1$ performs the linear operation, so Eq. 9 and Eq. 10 are equivalent. The information passing probability of the input $X$ is controlled by two parts: the first part passes directly with probability $1-\sigma$, and the second part passes with probability $\sigma$ through the convolutional gate, which alleviates the gradient vanishing problem by widening the information transmission channels.
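A sketch of this unit in PyTorch (assuming equal input and output channel counts so the residual can be added directly):

```python
import torch
import torch.nn as nn

class ResidualGatedConv1d(nn.Module):
    """One residual gated dilated convolution block:
    Y = X * (1 - sigma) + Conv1D_1(X) * sigma, sigma = sigmoid(Conv1D_2(X)).
    The two convolutions do not share weights."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2  # keep sequence length fixed
        self.conv_lin = nn.Conv1d(channels, channels, kernel_size,
                                  padding=pad, dilation=dilation)
        self.conv_gate = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, seq)
        sigma = torch.sigmoid(self.conv_gate(x))
        return x * (1 - sigma) + self.conv_lin(x) * sigma
```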

Dilated Convolution
Dilated convolution was first applied in the image domain to expand the receptive field of the convolution kernel while keeping the size of the feature map constant, as in Figure 4. The model stacks four identically sized dilated convolution blocks, each containing three dilated convolution layers with dilation widths of 1, 1, and 2. This scheme [18] covers the entire sequence rapidly by letting the receptive field grow exponentially while keeping the number of parameters unchanged.
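The dilation schedule of this paragraph can be sketched as follows (assuming PyTorch; the ReLU between layers is our assumption, as the text does not specify the activation):

```python
import torch.nn as nn

def dilated_conv_stack(channels, kernel_size=3):
    """Four identical blocks, each with three dilated convolutions of
    dilation 1, 1, 2, so the receptive field grows while the parameter
    count per layer stays fixed (IDCNN-style)."""
    layers = []
    for _ in range(4):                        # four stacked blocks
        for d in (1, 1, 2):                   # dilation widths within a block
            pad = d * (kernel_size - 1) // 2  # keep sequence length unchanged
            layers.append(nn.Conv1d(channels, channels, kernel_size,
                                    padding=pad, dilation=d))
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```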

Attention Mechanism
The encoder of the Transformer consists of a combination of an attention mechanism and a feedforward neural network, as in Figure 5. The attention mechanism is the core part of the encoder. After the input passes through the BERT model, the attention operation is performed to calculate, for each word vector, the information related to all other vectors.
After Q, K, and V are projected into different linear spaces, attention is computed in each head as in Eq. 12, and all attention results are spliced together:
$$\mathrm{head}_i = \mathrm{Attention}\big(QW_i^Q,\, KW_i^K,\, VW_i^V\big), \qquad \mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}\big(\mathrm{head}_1, \cdots, \mathrm{head}_h\big)W^O \quad (13)$$

The spliced result is then added to the BERT layer's input through a residual connection and normalized to obtain a normally distributed result, which is fed into the feedforward neural network, where the dimensionality reduction is completed through two linear transformations.
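A compact sketch of Eq. 12 and Eq. 13 (assuming the projection matrices are plain square tensors; this mirrors standard multi-head attention rather than the authors' exact code):

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=12):
    """Project x into per-head Q/K/V, apply scaled dot-product attention in
    each head (Eq. 12), then concatenate and project with w_o (Eq. 13)."""
    batch, seq, dim = x.shape
    d_k = dim // num_heads
    def split(t):                                   # (batch, heads, seq, d_k)
        return t.view(batch, seq, num_heads, d_k).transpose(1, 2)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # scaled dot products
    heads = F.softmax(scores, dim=-1) @ v           # per-head attention
    concat = heads.transpose(1, 2).reshape(batch, seq, dim)
    return concat @ w_o                             # output projection
```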

Decoding Layer
The sequence of encoded vectors obtained after encoding the input sentence $t$ of length $n$ is $[h_1, h_2, \cdots, h_n]$. The encoding vector of each token is put into two linear layers to obtain the query and key belonging to each entity class, respectively:

$$q_{i,\alpha} = W_{q,\alpha} h_i + b_{q,\alpha}, \qquad k_{i,\alpha} = W_{k,\alpha} h_i + b_{k,\alpha}$$

Here $\alpha$ denotes a class of entities; this is equivalent to trying different $q$ and $k$ for each entity class and scoring each consecutive substring $t_{[i:j]}$ as an entity of type $\alpha$:

$$s_\alpha(i, j) = q_{i,\alpha}^{\top} k_{j,\alpha}$$

To make the scores within entities position-aware, the rotary position encoding RoPE is added to the conversions in consecutive substrings, i.e. a transformation matrix $\mathcal{R}_i$ satisfying $\mathcal{R}_i^{\top}\mathcal{R}_j = \mathcal{R}_{j-i}$, so that

$$s_\alpha(i, j) = (\mathcal{R}_i q_{i,\alpha})^{\top} (\mathcal{R}_j k_{j,\alpha}) = q_{i,\alpha}^{\top} \mathcal{R}_{j-i}\, k_{j,\alpha}$$

Since the final scoring function corresponds to $n(n+1)/2$ binary classification problems for each entity type $\alpha$, and the entity candidates of each type are severely class-imbalanced, the loss function uses a multi-label generalization of single-target cross-entropy:

$$\mathcal{L} = \log\Big(1 + \sum_{(i,j)\in P_\alpha} e^{-s_\alpha(i,j)}\Big) + \log\Big(1 + \sum_{(i,j)\in Q_\alpha} e^{s_\alpha(i,j)}\Big) \quad (18)$$

For each sample, $P_\alpha$ is the set of head-tail pairs of all entities of type $\alpha$, and $Q_\alpha$ is the set of head-tail pairs of entities of other types or of non-entities, considering only combinations with $i \le j$:

$$P_\alpha = \{(i, j) \mid t_{[i:j]} \text{ is an entity of type } \alpha\} \quad (20)$$

All fragments $t_{[i:j]}$ satisfying $s_\alpha(i, j) > 0$ are output as entities of type $\alpha$.
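A sketch of the GP scoring head and the Eq. 18 loss (RoPE is omitted for brevity, and the per-type head dimension of 64 is our assumption):

```python
import torch
import torch.nn as nn

class GlobalPointerSketch(nn.Module):
    """Entity-level span scoring: per entity type, project each token into a
    query and a key, then score every span (i, j) as q_i . k_j.
    Spans with i > j are masked out."""
    def __init__(self, hidden_dim, num_types, head_dim=64):
        super().__init__()
        self.num_types, self.head_dim = num_types, head_dim
        self.proj = nn.Linear(hidden_dim, num_types * head_dim * 2)

    def forward(self, h):                          # h: (batch, seq, hidden)
        b, n, _ = h.shape
        qk = self.proj(h).view(b, n, self.num_types, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]        # (batch, seq, types, d)
        scores = torch.einsum('bmtd,bntd->btmn', q, k) / self.head_dim ** 0.5
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=h.device))
        return scores.masked_fill(~mask, -1e12)    # keep only spans with i <= j

def multilabel_ce(scores, labels):
    """Multi-label cross-entropy of Eq. 18; labels is a 0/1 tensor of gold
    spans with the same shape as scores: (batch, types, seq, seq)."""
    s, y = scores.flatten(2), labels.flatten(2).bool()
    zeros = torch.zeros_like(s[..., :1])           # contributes the "1 +" term
    s_pos = torch.cat([torch.where(y, -s, torch.full_like(s, -1e12)), zeros], -1)
    s_neg = torch.cat([torch.where(y, torch.full_like(s, -1e12), s), zeros], -1)
    return (torch.logsumexp(s_pos, -1) + torch.logsumexp(s_neg, -1)).mean()
```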

Details
The evaluation indicators of the experiment are precision P, recall R, and the F1 value. The BIO labels of MSRA are converted to the BMES scheme. The experimental environment is a 1080 Ti GPU with 64 GB of memory. The model parameters are set as follows: the BERT model adopts the bert-base-chinese version with 12 attention heads; the hidden layer dimension is set to 768, the batch size to 128, the learning rate to 1e-3, and the maximum input text length to 512. The Adam optimizer is used, and Dropout is set to 0.2 to prevent overfitting.
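Collected into a single configuration sketch (the dictionary and its key names are ours, for reference only):

```python
# Hyperparameters as stated above, gathered in one place
config = {
    "bert_model": "bert-base-chinese",  # 12 attention heads
    "hidden_dim": 768,
    "batch_size": 128,
    "learning_rate": 1e-3,
    "max_seq_len": 512,
    "optimizer": "Adam",
    "dropout": 0.2,
    "label_scheme": "BMES",             # converted from MSRA's BIO labels
}
```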

Model Computing Efficiency
The computational efficiency of the RGCNN model is compared on MSRA with baseline models, comprising the BiLSTM model commonly used in NER tasks, IDCNN based on dilated convolution, and the GRN [19] model based on residual gating; we compare the single-step time for processing the same batch of samples and updating the weights once. CNN-based models are generally faster to train than RNN models and achieve higher F1 values. Among them, IDCNN is nearly 3 times faster than BiLSTM, the single-step time of the dilated-convolution-based RGCNN is 2.5 times faster than that of BiLSTM, and its F1 value is 6.11% higher. The accuracy of the GRN model resembles that of RGCNN, but its single-step time and recall are not as good as RGCNN's.
It can be seen that the CNN-based models have a significant speed advantage. The main reason is that the RNN model must obtain global information recursively, while CNN obtains it by enlarging the receptive field through layer stacking, and the operations within each layer are parallel, so the model's speed is greatly improved.

Impact of Embedding Layer on Entity Extraction
Based on the RGCNN model, a pre-training model is added to verify the influence of prior semantics on entity extraction. Character embedding, word embedding, and context embedding, of dimensions 128, 128, and 256 respectively, are introduced for comparative experiments against the BERT model. We then compare the accuracy of other vocabulary-enhancement methods on the NER task. Table 2 shows that the F1 value of the embedding layer with fused context is improved by 3.5% over using RGCNN alone, verifying the contribution of contextual information to entity recognition. All evaluation criteria improve after embedding enhancement. Owing to the sufficient word-frequency statistics in the MSRA training set, RGCNN+SoftLexicon differs from BERT by only 1.3%. The fusion of BERT and SoftLexicon yields a larger improvement for entity recognition, mainly attributed to the a priori semantics of the BERT model.

Model Validation
The following comparative experiments are conducted to verify the effectiveness of the proposed model, as shown in Table 3: 1) to verify the effect of BERT model depth on entity recognition, model depths of 6, 8, and 10 layers are selected; 2) based on the optimal model depth, the two decoding methods CRF and GP are compared; 3) the RGCNN+GP model is compared with the mainstream BiLSTM+CRF model. The groups with different BERT depths demonstrate that the F1 value of the BERT pre-trained model reaches its highest value of 93.42% at depth 8, its lowest at depth 6, and a value between those of depths 6 and 8 at depth 10. This indicates that appropriately deepening the network is beneficial for entity recognition accuracy, but as the model continues to deepen, its learning ability decreases and the recognition effect degrades. The comparison between groups 2 and 4 shows that decoding with GP gives better results than CRF, because GP's loss function and evaluation metrics are entity-based, which works well on the entity-level MSRA dataset, while the improvement is not obvious on tag-level data. The F1 value of BERT+RGCNN+GP is 4.3% higher than that of the baseline BiLSTM+CRF model. The models in Table 1 are also compared with their counterparts after adding the BERT layer in Table 4, which shows that the a priori semantics of the pre-trained model significantly improve the evaluation metrics of all the base models. Moreover, among all the models using BERT in Table 4, BERT-IDCNN-CRF has the shortest single-step time, and BERT-RGCNN-CRF is closest to it. Our model's fast single-step time is attributed to the reduced number of training parameters: the BERT pre-trained language model has more than 100 million parameters, and BERT fine-tuning updates all of them, whereas the RGCNN model combined with BERT fixes the parameters of the BERT layer and only updates the upper-layer parameters. Therefore, the number of trainable parameters of RGCNN is significantly reduced, to about 59,000.

Conclusion
To address the problems of gradient vanishing across layers and insufficient access to contextual information in CNN-based entity extraction models, residual gated connections and attention mechanisms are appended to one-dimensional dilated convolution, obtaining contextual semantic information while maintaining the training speed of the CNN architecture. The BERT-RGCNN-GP model achieves an F1 value of 94.96% for entity extraction on MSRA. The feasibility of the method is verified; the next step is to consider adding the identification of semantic relations to further enhance the entity extraction effect.

Figure 6. Model training process.

As shown in Figure 6, the training curves illustrate the F1 values of RGCNN at different depths over training epochs, where the first three groups are the training results of the BERT+RGCNN+GP model with depths of 6, 8, and 10 layers. The F1 value of the 10-layer model reached a maximum of 92.57% at 30 epochs.

Table 3. Comparison of model groups.

Table 4. Mainstream model comparison.