Layered Feature Recognition Algorithm Based on Combined Convolution

In recent years, deep learning algorithms have been gradually understood and accepted, but they require a large number of samples for training. Since the rise of deep learning, the classical algorithms of the past seem to have faded. In this paper, we build an intelligent pattern recognition model by combining several classical algorithms and extending the convolution operation. The new model is trained on a single regular sample, and under this condition its generalization ability goes well beyond that of deep learning algorithms. Experimental results on the MNIST, QMNIST, CMU PIE and Extended Yale B databases indicate that the proposed model outperforms the related methods under the same conditions.


Introduction
When designing deep learning algorithms, there has been no consensus on the size of the convolution kernel. In general [1], kernel sizes include 3 × 3 and 5 × 5. Another open question is which kind of kernel to use at which layer. All of this requires manual trial and error: hand-designed network architectures make it difficult to debug parameters and determine the network structure. With so many open questions, why not think backwards and start from the study of the convolution kernels themselves? The literature [2,3] visualized the features learned by layers and units. Each kernel or unit is a shared weight acquired by the back-propagation algorithm; convolving the learned weights with the layer's input yields the features. Going back to the beginning, a kernel has several properties, such as size (3 × 3, 5 × 5 and so on), mode of action (full, same, valid), and form (normal or dilated convolution). In this paper, we study the information hidden in these convolution kernels from another angle. The rest of this paper is organized as follows. Section 2 reviews related work on kernels, which lays a solid foundation for our idea and motivates the new view. Section 3 introduces our algorithm with theoretical analysis. Section 4 evaluates the performance of the new algorithm on the MNIST, QMNIST [4], CMU PIE [5] and Extended Yale B [6] datasets. Section 5 discusses the remaining problems and concludes.

Kernel View
Kai Yu [7] believed that the first convolutional layer learns image edge features (see Figure 1). In his presentation, he demonstrated the computational flow of tasks such as recognizing human faces, cars, elephants and chairs, and presented the visualization results of each layer (see Figure 2).
The first-layer features readily recall the traditional edge-detection operators, such as the Sobel operator (see "(1)") and the Prewitt operator [8] (see "(2)").
These operators are all third-order matrices. Any third-order matrix can be expanded over nine mutually orthogonal matrices that form a basis (see "(3)"). Therefore, formula (1) can be represented as the linear combination in "(4)".
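The decomposition above can be illustrated numerically. The following is a minimal sketch, assuming the common horizontal Sobel kernel for "(1)" and the nine single-entry matrices (each with a single 1) as the orthogonal basis of "(3)"; both choices are assumptions, since the paper's own formulas are not reproduced here.

```python
import numpy as np

# One common convention for the horizontal Sobel kernel (assumed for (1)).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Nine single-entry matrices E_ij form an orthogonal basis of all
# 3x3 matrices under the Frobenius inner product (assumed for (3)).
basis = [np.zeros((3, 3)) for _ in range(9)]
for idx in range(9):
    basis[idx][idx // 3, idx % 3] = 1.0

# Coefficients of the Sobel kernel in this basis are simply its entries.
coeffs = [float(np.sum(sobel_x * E)) for E in basis]

# Reconstruct the kernel from the linear combination, as in (4).
reconstructed = sum(c * E for c, E in zip(coeffs, basis))
assert np.allclose(reconstructed, sobel_x)
```

Any other third-order operator (e.g. Prewitt) decomposes over the same basis with different coefficients.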

Gradient Descent Method
Following the survey of Sebastian Ruder [9], given input data x, coefficients θ were assumed to exist that make formula (5) valid. If n takes a finite value, there is a bias b between the true value Y and the estimated value h_θ. Formula (5) can then be reformed as "(6)", and the loss function can be defined as "(7)". Minimizing the loss value yields the update formula (8), in which the step-size coefficient is the learning rate and b is the bias.
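The update rule can be sketched concretely. This is a minimal example, assuming a linear model h_θ(x) = θᵀx, a mean-squared loss for (7), and a plain gradient-descent update for (8); the symbols eta (learning rate) and the synthetic data are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # input data x (100 samples)
true_theta = np.array([1.5, -2.0, 0.5]) # ground-truth coefficients
Y = X @ true_theta                      # true value Y (noise-free here)

theta = np.zeros(3)                     # coefficients theta to learn
eta = 0.1                               # learning rate (assumed symbol)

for _ in range(200):
    h = X @ theta                       # estimated value h_theta
    grad = X.T @ (h - Y) / len(X)       # gradient of the squared loss
    theta -= eta * grad                 # update, in the spirit of (8)

assert np.allclose(theta, true_theta, atol=1e-3)
```

With a finite number of iterations the residual Y - h_θ plays the role of the bias b in the text.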

Similar Principal Component Analysis (SPCA)
The Principal Component Analysis (PCA, see Algorithm 1) algorithm is widely used for dimension reduction, but O'TOOLE [10] deemed part of it unreasonable. HAN [11] gave an improved algorithm called Similar PCA (SPCA); the details are shown in Algorithm 2. The dimensions of the PCA and SPCA results are the same, but SPCA retains some information that PCA discards, and this can be used for sample generation.
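Algorithms 1 and 2 are not reproduced here, but the split between what PCA keeps and what it discards can be made explicit. The following is a sketch of SVD-based PCA only (not of SPCA itself); the data, the target dimension d, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))           # 50 samples, 10 features (assumed)
Xc = X - X.mean(axis=0)                 # center the data

# SVD-based PCA: rows of Vt are principal directions, ordered by
# decreasing variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 3                                   # target dimension (assumed)
Z = Xc @ Vt[:d].T                       # PCA projection (the kept part)
residual = Xc - Z @ Vt[:d]              # the information PCA discards

# Kept and discarded parts together reconstruct the data exactly;
# SPCA is described as retaining (part of) the residual as well.
assert np.allclose(Z @ Vt[:d] + residual, Xc)
```

The residual term is the information the text says SPCA retains for sample generation.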

Main Theorem
For pattern recognition, or statistical pattern recognition, Bishop [12] held that it is a combination of theories and methods involving a great deal of information processing; at root, the term has no precise definition. Hinton [13,14] proposed the concept of deep learning and used it to solve problems in pattern recognition.
For any third-order matrix, define the basis of formula (3) as the true kernel K; the matrix can then be represented as in "(9)", where k_11, k_12, k_13, … are the coefficients of the true kernel and w denotes a weight matrix, i.e. a convolution kernel. Convolving the weight matrix with the input data matrix gives x̃ = w ⊗ x (10), where the action mode is "same". Assuming that a series of such results can be linearly combined as in formula (6), the bias b is then the distance between Y and h_θ, which we name the "impression".
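Formulas (9)-(10) can be sketched in code. This is a minimal illustration, assuming the single-entry 3 × 3 matrices as the true-kernel basis K_i (as in the earlier basis discussion) and a zero-padded "same"-mode convolution; the helper name conv2d_same and the example coefficients are assumptions.

```python
import numpy as np

def conv2d_same(x, w):
    """2-D convolution with 'same' output size (zero padding)."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    wf = w[::-1, ::-1]                  # flip kernel: true convolution
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * wf)
    return out

# w as a linear combination of basis kernels, in the spirit of (9):
# with single-entry basis matrices, the combination is just the
# coefficient matrix itself.
k = np.arange(1.0, 10.0)                # coefficients k_11 .. k_33 (assumed)
w = k.reshape(3, 3)                     # the resulting weight matrix

x = np.arange(25.0).reshape(5, 5)       # a small input matrix (assumed)
x_tilde = conv2d_same(x, w)             # formula (10): x~ = w (x) x, "same"
assert x_tilde.shape == x.shape
```

The "same" mode keeps the output the size of the input, which is what allows the layered impressions below to be stacked.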
Definition 1. A Pattern consists of w, k_i and b. P3 is expressed as the patterns implied by all third-order matrices: in formula (11), each w requires one summation, and there is only one impression (b). In order to layer the impression, w is split as shown in Table 1.
So the Pattern can be rewritten as "(12)". The impression b1 is the bias of the 1st layer and the impression b2 is the bias of the 2nd layer. By analogy, the patterns of all fifth-order matrices are shown in Table 2. Table 2. Sub-w and Corresponding Impressions of P5.

(Columns of Table 2: symbol combination; impression and expression.)
The Pattern of P5 is given in "(13)". The relative parameters of the expressions of P7, P9 and P11 are shown in Table 3. Table 3. Sub-w and Corresponding Impressions of P7, P9 and P11.

Supplementary Views
In the process of pattern learning, only θ is a hyper-parameter to be learned (see "(8)"); the parameter k must be designed by hand. Assuming the input is a single sample image, the impressions b1 and b2 can serve as the data source for pattern discrimination after training. During testing, each test datum can be impressed by the learned pattern. Hypothesis 1. If the pattern is learned, for example from one image of the digit 1, the following should hold: (1) all digits that map to 1 should be recognized as 1 by the pattern; (2) digits other than those in (1) should not be recognized as 1 by the pattern; (3) more importantly, the input data can be replaced with other digits' data.
Hypothesis 2. The more regular the training image data, the less likely the learned impressions are to be misidentified.
Methods of data normalization include normalization of the input data [15,16] and normalization of the weight data [17]. In this paper, the training data were normalized by the 2-norm: each column of the training data was divided by the module length (2-norm) of that column. In the process of learning the pattern, a drop method modeled on dropout [18] was used in order to reduce the amount of computation (see Algorithm 3), adopting the viewpoint of contribution from PCA [19]. In the output layer of ACM, the function σ(Δ, x) (see "(14)") is similar to the rectified linear unit (ReLU) [20].
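The two preprocessing pieces named above can be sketched directly. The exact form of σ(Δ, x) in (14) is not reproduced in the text, so a plain ReLU stands in for it here; the function names are assumptions.

```python
import numpy as np

def normalize_columns(X):
    """Divide each column by its 2-norm (module length), as described."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0             # guard against all-zero columns
    return X / norms

def relu(x):
    """Rectified linear unit; the paper's sigma(Delta, x) in (14) is
    said to be similar to this."""
    return np.maximum(x, 0.0)

X = np.array([[3.0, 0.0],
              [4.0, 0.0]])
Xn = normalize_columns(X)
assert np.isclose(np.linalg.norm(Xn[:, 0]), 1.0)
assert relu(-2.0) == 0.0 and relu(2.0) == 2.0
```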
The pseudo-code of the ACM computing flow is given in Algorithm 4.
Algorithm 4 ACM
Input: x_i (i = 1, 2, …, 10; each x_i is one image of the digits 0-9), k, θ, K, P, and the test data y.
Output: predicted labels l.
Learning pattern:
1: Initialize the parameters k and θ; calculate K1 and K2;
2: Substitute formula (11) (with K1) into formula (7) and update the parameters with formula (8); use the drop method to obtain the 1st-layer parameter ω1 and the impression b1;
3: Repeat step 2 with K2, changing the input data to b1, to obtain ω2 and b2;
Testing pattern:
4: According to the pattern learned at the 1st layer, calculate the ten types of impressions of the training samples and the 1st-layer impression of the test set; this gives the 1st-level predicted label l of y. For the 2nd layer, repeat with the 2nd-layer impressions to obtain the 2nd-level predicted label l of y.

(Q) MNIST Data Set
All experiments were based on the MNIST hand-written digit recognition benchmark. All images were pre-normalized into unit 784-dimensional vectors. The data set was divided into a training set of 60000 images and a test set of 10000 images.
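The stated preprocessing amounts to flattening each 28 × 28 image and scaling it to unit 2-norm. A minimal sketch, using a random stand-in image rather than actual MNIST data:

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(28, 28)).astype(float)  # stand-in digit

vec = image.reshape(784)                # flatten to a 784-dim vector
vec = vec / np.linalg.norm(vec)         # scale to unit 2-norm

assert vec.shape == (784,)
assert np.isclose(np.linalg.norm(vec), 1.0)
```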
The QMNIST dataset was generated from the original data in NIST Special Database 19, with the goal of matching the MNIST preprocessing as closely as possible. The QMNIST test set contains 60000 testing examples.
All test data were divided into two categories. One was the original test data; the other converted every non-zero value of the original test data into 1 ("*_1", where * is the name of the test set), so that the second category retains only the stroke shape. To facilitate computation in MATLAB, the test data were collated as shown in Table 4: the MNIST and MNIST_1 test data are 784 × 10000 matrices, and the QMNIST and QMNIST_1 test data are 784 × 60000 matrices.
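The "*_1" variants are a simple binarization. A sketch (the function name and the toy batch are assumptions):

```python
import numpy as np

def binarize(X):
    """Map every non-zero value to 1, keeping only the stroke shape."""
    return (X != 0).astype(X.dtype)

# Toy stand-in for a batch of pixel intensities in [0, 1].
mnist_batch = np.array([[0.0, 0.2, 0.9],
                        [0.0, 0.0, 0.5]])
mnist_1 = binarize(mnist_batch)
assert np.array_equal(mnist_1, np.array([[0.0, 1.0, 1.0],
                                         [0.0, 0.0, 1.0]]))
```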
The experimental training sets were also divided into two categories: 1. From the original training set of classes 0-9, one picture was randomly selected from each class to form a group; in total, 10 groups and 100 pictures.
2. According to SPCA and Hypothesis 2, a regular library of the digits 0-9 was generated from the original library and treated as a new training set. Referring to Algorithm 4, the experimental flow is shown in Figure 3. Take one group of data as an example (see Figure 4). Using the data "0" for training, the original data and the normalized data are shown in Figure 5. Following Algorithm 4, the loss curve under each Pattern is shown in Figure 6 and the impressions under each Pattern in Figure 7. The correct recognition rate (CRR) on the MNIST test set is shown in Table 5: all Patterns and both kinds of impressions can map to number 1, but the recognition rate was very unsatisfactory. The CRR on the QMNIST test set (Table 6) is similar, with a low overall recognition rate. The results of the remaining 99 training groups in the first category, together with all recognized and unrecognized results, are available at the open source link (OSL). From the above analysis, the results were close to Hypothesis 1, but not ideal.

Regular library of training data
The figures (see Figure 8) were generated based on SPCA; these pictures look much more regular than the original handwriting. Again the digit 0 was used for training for comparison; the impressions under each Pattern are shown in Figure 9, and the experimental data for digits 1 to 9 are in the OSL. The training data for "0" and the normalized data are shown in Figure 10, the training progress in Figure 11 (refer to Algorithm 4), and the intermediate results (impressions) in Figure 12: the first row (origin_*) is the initial data, the 2nd row is the impression of the 1st layer, and the 3rd row is the impression of the 2nd layer.
Compared with the first category, the CRR on the MNIST, MNIST_1, QMNIST and QMNIST_1 test sets is shown in Tables 7, 8, 9 and 10 respectively. The learning results and all recognition results are in the OSL. These results indicate that Pattern is worth exploring and studying.
Trained on only one picture (see Figure 5) and applied to the whole MNIST test set (10000 images), with different Patterns and different layers of impressions used to recognize the digits 0-9, the overall recognition rate (ORR) was around 30%. Under the same setup on the whole QMNIST test set (60000 images), the ORR was likewise around 30%.
Trained on only one picture (see Figure 10) and applied to the whole MNIST test set (10000), the ORR was around 60%; compared with Table 5, the ORR was greatly improved. The recognition rate for digit 0 was higher than for the other digits, which supports Hypothesis 1. On MNIST_1 (10000), the ORR was more than 80%; compared with Table 7, the ORR was again greatly improved. The recognition rates for digits 0 and 1 were higher than for the other digits, but the other rates were not low, which further supports Hypotheses 1 and 2. On QMNIST (60000), the ORR was around 60%, similar to Table 7. On QMNIST_1 (60000), the ORR was more than 80%; compared with Table 9, the ORR was greatly improved, the recognition rates for digits 0 and 1 were again higher than for the other digits, and the other rates were not low, further verifying Hypotheses 1 and 2.

Experiments on CMU PIE Database
Considering frontal face samples and the influence of a single factor, such as lighting changes, the CMU PIE [5] and Extended Yale B [6] databases were chosen for the experiments. For comparison with [21], the latest research under the same conditions was selected to design the experiments. In the discrimination process, the average value of each group was subtracted to weaken the influence of illumination change, and an illumination threshold was set and tuned to its optimal value.
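The group-mean subtraction can be sketched as follows. This assumes each group is stored as a matrix with one image per column, and models an illumination change as a constant brightness offset; both are illustrative assumptions.

```python
import numpy as np

def subtract_group_mean(X):
    """Subtract the group's mean image (pixel-wise, across columns),
    weakening the influence of a global illumination offset."""
    return X - X.mean(axis=1, keepdims=True)

# Hypothetical group: 3 face "images" (columns) differing only by a
# constant brightness offset.
base = np.array([1.0, 2.0, 3.0])
group = np.stack([base, base + 5.0, base - 5.0], axis=1)
centered = subtract_group_mean(group)

# After centering, the shared face structure cancels out and only the
# per-image offsets remain, with zero mean across the group.
assert np.allclose(centered.mean(axis=1), 0.0)
```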
The CMU PIE database contains 68 subjects with 41368 face images under varying pose, illumination and expression. In this paper, a subset (C27) was chosen that contains 1428 images of 68 individuals under different illumination conditions; the images were all reshaped to 32 × 32. Twenty-one experiments were conducted to evaluate the performance of the ACM model. Sample images from the CMU PIE database are shown in Figure 13. One image from each subject was chosen as the training image and the other 20 images as the test data. The average recognition accuracy of the methods over all training images in the CMU PIE database is shown in Table 11. ACM achieved the highest average recognition accuracy: 92.00% (1st layer) and 99.46% (2nd layer). We conclude that the ACM model is more robust to illumination changes than the other methods in the task of single face sample recognition.

Experiments on Extended Yale B Database
Since the lighting conditions in the CMU PIE database are not very complex, ACM was also verified on the Extended Yale B database, which covers more complex illumination variation. The database includes 38 subjects under nine poses and 64 illumination conditions; most importantly, the images used here contain frontal faces only, so the only factor interfering with recognition is the change in illumination. For comparison, the frontal-pose images captured under the 64 lighting conditions for each of the 38 persons were used, divided into 5 subsets according to the angle of the light-source direction (see Table 12). In our experiments, the images were all resized to 64 × 64. Images with the 0° light condition were treated as training data, and all data from the subsets were used for testing. Sample images are shown in Figure 14, and Table 13 lists the experimental results on the Extended Yale B database. As shown in Table 13, ACM is competitive on Sub1 and Sub2 and has the best performance on Sub3, Sub4 and Sub5, especially Sub4 and Sub5, where the illumination conditions are extremely poor. Each layer of ACM deepens the subjective impression of its class, and by adjusting the illumination threshold parameter the influence of illumination changes can be reduced. Hence, we conclude that ACM is better suited to single sample recognition under varying illumination.

Conclusion
In the experiments, we adopted convolution alone instead of a full neural network, and obtained surprising results under the condition of single-sample training. ACM focuses on the shape of the sample itself and learns from it. Whether the sample is a handwritten digit or a face, its shape does not change, so the learned pattern should be consistent; this has been verified by the experimental results. ACM is one attempt at pattern recognition; the following parts analyse its characteristics.

Sparse Property
K has an obvious sparsity property. Many algorithms pursue sparsity because it has two advantages: first, it performs automatic feature selection; second, it makes the model easier to interpret. Considering this property, we need to supplement it in the later stages of the ACM algorithm: as the number of layers increases and the patterns become more abundant (refer to Table 14), the sparsity of the later patterns will gradually decrease. There are so many combinations that one of the next tasks is to study whether these modes are combined or duplicated, or whether their effects are equivalent.

Combining Neural Network Algorithms
For the above combinations, if the sparsity decreases, combination with deep learning algorithms (DLA) can be considered. In neural network algorithms, high sparsity leads to vanishing gradients, so when debugging the parameters of such algorithms, high sparsity is generally not allowed. Once the sparsity is reduced, the fusion of ACM and DLA becomes another of the next tasks.

Pattern Fusion
In addition to the algorithm fusion scheme above, can we try pattern fusion?
Just as the various operators proposed by traditional algorithms were combined with one another, we can draw on those combinations directly. From the experimental results, the advantages of the individual patterns are not outstanding; for example, the results of P3 and P5 are neither absolutely superior nor inferior to each other. For this reason, another branch of future work is to study the integration of these patterns in order to achieve major breakthroughs.
After careful study of these results, it can be concluded that in the sparse layers the pattern has learned the common characteristics of the samples; a shallow reading of the algorithm is that it extracts those common characteristics. From this point of view, the algorithm should be deepened with more layers, whether through pattern fusion or algorithm fusion.
In this paper, we proposed the ACM algorithm, a novel pattern recognition model for single-sample recognition. Compared with other classical algorithms dealing with illumination changes, ACM is a more efficient and stable representation for single face sample recognition. ACM combines convolution kernels and a drop operation to extract information, and most importantly, the information is extracted from a sample that is expressed regularly; this has been verified by the (Q)MNIST recognition experiments. ACM is worthy of further study.