One Approach to the Problem of the Existence of a Solution in Neural Networks

Artificial neural networks are widely used to solve various applied problems. For the successful application of artificial neural networks, it is necessary to choose the correct network architecture and to select its parameters, the threshold values of the elements, the activation functions, etc. Evaluating neural network parameters on the basis of a study of the probabilistic behavior of the network is a promising approach. The development of probabilistic methods for perceptron-type pattern recognition systems has been considered in a number of works. The concept of the characteristic function of the perceptron, introduced by S. V. Dayan, was used by him to prove theorems on the existence of a perceptron solution. At the same time, the issues of choosing a network architecture and of the theoretical assessment and optimization of neural network parameters remain relevant. In this paper, we propose a mathematical apparatus for studying the relationship between the probability of correct classification of input data and the number of elements in the hidden layers of a neural network. To evaluate the network performance and to estimate some parameters of the neural network, such as the number of associative elements as a function of the number of classes, the mathematical expectation and variance of the weights at the input of the output layer are considered. A theorem on the necessary and sufficient condition for the existence of a solution for a neural network is proved. By a solution of a neural network we mean its ability to recognize images with a nonzero probability. The results of the proved theorem and its corollaries coincide with the results obtained in a different way by F. Rosenblatt and S. Dayan for the perceptron.


Introduction
Artificial neural networks have been developed for a long time. They are widely used for solving various applied problems. Currently, there is a significant increase in interest in artificial intelligence, caused by both the development of technical means and the demand of the software market for a qualitatively new product.
Against this background, numerous attempts are being made to apply various models of neural networks. Artificial neural networks are becoming more common owing to their ability to solve tasks that are difficult to formalize, to process data in parallel, to use large amounts of data, etc. [1][2][3][4][5][6].
For the successful application of artificial neural networks, it is necessary to choose the correct network architecture and to select its parameters, the threshold values of the elements, the activation functions, etc. [2][3][4][5][27].
Research on the successful construction and use of artificial neural networks is conducted mainly in the following areas: the selection of optimal learning algorithms; the selection and optimization of neural network parameters (such as the number of layers, the number of neurons in each layer, and the activation functions); and problems related to the convergence of the neural network. Since these tasks are interrelated, research on them has mainly been conducted in parallel.
F. Rosenblatt proved a convergence theorem for the perceptron, which states that an elementary perceptron, regardless of the initial state of the weight coefficients and the order in which the stimuli occur, will always reach a solution in finite time. F. Rosenblatt also presented proofs of some accompanying theorems and their corollaries, which showed what requirements the architecture of artificial neural networks and the methods of their training should meet [1].
Research on neural networks intensified in the 1970s.
In 1970, A. G. Ivakhnenko developed the group method of data handling, which makes it possible not only to calculate the weights of the connections between neurons but also to determine the number of layers in the network and the number of neurons in each layer, depending on the needs of the applied task [5][6][7][8][9].
In 1989, several authors obtained a result stating that a perceptron with one hidden layer is a universal approximator, that is, it can approximate any continuous function if a continuous, monotonically increasing, and bounded function is used as the activation function of the neural elements of the hidden layer [10][11][12]. Moreover, the accuracy of the approximation depends on the number of neurons in the hidden layer. Thus, a perceptron with one hidden layer and an activation function of the aforementioned type is a universal classifier. In [12], it was also stated that, for a three-layer network to solve the pattern classification problem (that is, for the perceptron to converge), an inequality must hold [the inequality is missing in the source] that relates the number of elements in the input layer, the number of neurons in the hidden layer, the number of classes into which the input space of images must be split, and the size of the training sample.
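The dependence of the approximation accuracy on the number of hidden neurons can be illustrated with a small numerical sketch. It is not the construction used in [10][11][12]; it is a simple staircase built from steep sigmoids, and the function names (`build_network`, `predict`, `max_error`), the `steepness` parameter, and the target function are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    # Logistic function: continuous, monotonically increasing, bounded.
    # Written in a numerically safe form for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_network(f, a, b, hidden, steepness=50.0):
    """One-hidden-layer net approximating f on [a, b] by a staircase.

    Each hidden unit is a steep sigmoid centred in one subinterval;
    its output weight is the increment of f over that subinterval.
    Returns (output bias, list of (input weight, bias, output weight)).
    """
    h = (b - a) / hidden
    k = steepness / h  # illustrative choice of slope
    units = []
    for j in range(hidden):
        left, right = a + j * h, a + (j + 1) * h
        mid = 0.5 * (left + right)
        units.append((k, -k * mid, f(right) - f(left)))
    return f(a), units

def predict(net, x):
    bias, units = net
    return bias + sum(v * sigmoid(w * x + b) for w, b, v in units)

def max_error(f, net, a, b, samples=1000):
    # Largest absolute deviation on a uniform grid of sample points.
    grid = (a + (b - a) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - predict(net, x)) for x in grid)

if __name__ == "__main__":
    target = lambda x: x * x
    for hidden in (4, 16, 64):
        net = build_network(target, 0.0, 1.0, hidden)
        print(hidden, max_error(target, net, 0.0, 1.0))
```

Under these assumptions the maximum error shrinks as the hidden layer grows, which is the qualitative behavior described above.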
To select the network structure, the use of genetic algorithms has also been investigated. It should be noted that the conditions for the convergence of such algorithms are not well studied, and even less is known about their rate of convergence [13].
In 1992, the cresceptron neural network architecture appeared [14,15]. The cresceptron changes its topology during training, by analogy with networks that use the group method of data handling [5]. An important idea proposed in the cresceptron is the use of max-pooling layers instead of average-pooling layers. Max-pooling layers are now widely used in convolutional neural networks. However, modern convolutional networks are trained with the error backpropagation algorithm, which is more efficient [16,17].
In 2006-2007, deep convolutional networks based on supervised learning were developed. The work [16] describes the application of the error backpropagation algorithm to training a deep neural network with an architecture similar to the neocognitron and the cresceptron, consisting of alternating convolution and max-pooling layers. This neural network architecture is still actively used. In 2015, Sergey Ioffe and Christian Szegedy proposed using special batch normalization layers in neural networks [18][19][20][21]. In [22], it was shown that the error backpropagation algorithm converges faster if the input data are normalized. It was noticed that when a signal propagates through a neural network, its mathematical expectation and variance change from layer to layer, which negatively affects the learning process. Ioffe and Szegedy proposed performing normalization not only at the input of the neural network but also before each layer of the network.
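The normalization step itself can be sketched in a few lines. The following is a minimal illustration that only brings each feature of a batch to zero mean and unit variance; the learnable scale and shift parameters of a full batch normalization layer are deliberately omitted, and the function name `batch_normalize` and the constant `eps` are assumptions made for this example.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize every feature (column) of a batch to zero mean, unit variance.

    batch: list of equally long lists of floats, one list per sample.
    eps:   small constant that avoids division by zero (assumed value).
    """
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]

if __name__ == "__main__":
    # Two features on very different scales.
    data = [[1.0, 200.0], [2.0, 220.0], [3.0, 240.0]]
    for row in batch_normalize(data):
        print(row)
```

After the transformation both features have comparable scale, so the mean and variance seen by the next layer no longer depend on the original units of the signal.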
Some researchers have considered probabilistic neural networks (PNN), which are widely used in classification problems. The essence of such networks is that the outputs of the network can be interpreted as estimates of the probability that an element belongs to a certain class, so that the network in effect learns to estimate the probability density function [23,24].
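One common way to realize this idea is to place a Gaussian kernel on every training example and to normalize the per-class averages so that the outputs sum to one. The sketch below follows that scheme under the assumption of equal class priors; the function name `pnn_predict`, the smoothing parameter `sigma`, and the toy data are illustrative and are not taken from [23,24].

```python
import math

def pnn_predict(train, x, sigma=0.5):
    """Estimate class membership probabilities for point x.

    train: dict mapping a class label to a list of training points (tuples).
    sigma: kernel width (assumed value, normally tuned on data).
    Returns a dict mapping each label to a value in [0, 1]; values sum to 1.
    """
    scores = {}
    for label, points in train.items():
        total = 0.0
        for p in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[label] = total / len(points)  # average kernel response
    norm = sum(scores.values())
    if norm == 0.0:
        # x is far from every example: fall back to a uniform answer.
        return {label: 1.0 / len(scores) for label in scores}
    return {label: s / norm for label, s in scores.items()}

if __name__ == "__main__":
    train = {
        "a": [(0.0, 0.0), (0.2, 0.1)],
        "b": [(3.0, 3.0), (3.1, 2.8)],
    }
    print(pnn_predict(train, (0.1, 0.1)))
```

The outputs can be read as probability estimates in the sense described above: a point close to the examples of one class receives a value near one for that class.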
The task of estimating a probability density from data belongs to the field of Bayesian statistics. In contrast to Bayesian statistics, conventional statistics determines the probability of an outcome for a given model. In this case, the density has a definite form and the model parameters are estimated analytically. Bayesian statistics makes it possible to evaluate the correctness of a model from the available reliable data, that is, to estimate the probability density of the distributions of the model parameters from the available data [25].
Another approach to estimating the probability density is based on kernel estimates [26]. In this case, if there is a sufficient number of training examples, the method gives a fairly good approximation to the true probability density.
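A kernel estimate of a one-dimensional density can be sketched as follows. The Gaussian kernel, the fixed bandwidth, and the function name `gaussian_kde` are assumptions made for the illustration; [26] may use a different kernel or bandwidth rule.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x.

    Each sample contributes a Gaussian bump of width `bandwidth`;
    the estimate is the average of all bumps.
    """
    n = len(samples)
    coef = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return coef * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

if __name__ == "__main__":
    rng = random.Random(0)
    true_at_zero = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at 0
    for n in (20, 200, 2000):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimate = gaussian_kde(data, bandwidth=0.3)(0.0)
        print(n, estimate, true_at_zero)
```

The usage example compares the estimate at zero with the known value for a standard normal sample, which gives a rough sense of how the approximation behaves as the number of examples grows.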
Work on the creation of perceptron-type pattern recognition systems was also carried out in a different direction, namely, the development of probabilistic methods for studying the perceptron [27]. The basis of this approach is the concept of the characteristic function of the perceptron (CFP), introduced by S. V. Dayan [28][29][30]. Using the CFP, theorems on the existence of a perceptron solution, on the choice of the number of elements of the hidden layer, on the length of the training sequence, etc., are proved [27,31]. This direction of research continues to be developed by colleagues and students of S. V. Dayan [27][31][32][33][34][35][36]. At the same time, the issues of choosing a network architecture and of the theoretical assessment and optimization of neural network parameters remain relevant.
In this paper, we propose a mathematical apparatus for studying the relationship between the probability of correct classification of input data and the number of elements in the hidden layers of a neural network. A necessary and sufficient condition for the existence of a solution of a neural network is proved. By a solution of a neural network we mean its ability to recognize images with a nonzero probability. As a consequence of the proved theorem, the results obtained by F. Rosenblatt [1] and S. Dayan [27] for the perceptron are recovered.

The Proposed Mathematical Model of Neural Network and Solution of the Problem
In this work, a feedforward neural network is investigated. This kind of neural network consists of a layer of input nodes, hidden layers, and an output layer. The neurons have unidirectional connections; there are no connections between elements within a layer and no feedback connections between layers. The neurons of the input layer are connected to the neurons of the hidden layer by excitatory and inhibitory connections in a random way. The outputs of all the neurons of the hidden layer are connected to the neurons of the output layer. Neurons in each layer are referred to as input, hidden, and output elements, respectively [1,27,31,32,34].
The input layer is represented by the receptor field S, the hidden layer consists of N associative elements forming the set A, and the output layer consists of a finite number of reacting R-elements. The outputs of all associative elements are connected to the reacting elements [34][35][36].
An image corresponding to an external stimulus is formed in the receptor field. By an image we mean a vector whose coordinates correspond to the individual elements of the receptor field and take the value 1 or 0, depending on whether the corresponding element is excited or not.
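The structure described above can be sketched as follows. The number of connections per A-element, the use of signs +1 and -1 for excitatory and inhibitory connections, the firing threshold, and the function names are assumptions introduced only for this illustration; the cited works may define these details differently.

```python
import random

def make_connections(num_receptors, num_a_elements, per_element, rng):
    """Randomly connect receptors of the field S to the A-elements.

    For every A-element, `per_element` distinct receptors are chosen and each
    connection is marked excitatory (+1) or inhibitory (-1) at random.
    """
    connections = []
    for _ in range(num_a_elements):
        receptors = rng.sample(range(num_receptors), per_element)
        connections.append([(r, rng.choice((1, -1))) for r in receptors])
    return connections

def a_layer_activity(image, connections, threshold=1):
    """Return the 0/1 activity of every A-element for a binary image.

    An A-element is excited when the signed sum of its active inputs
    reaches the (assumed) threshold.
    """
    return [
        1 if sum(sign * image[r] for r, sign in conns) >= threshold else 0
        for conns in connections
    ]

if __name__ == "__main__":
    rng = random.Random(0)
    conns = make_connections(num_receptors=16, num_a_elements=8,
                             per_element=4, rng=rng)
    image = [rng.choice((0, 1)) for _ in range(16)]  # binary image on S
    print(a_layer_activity(image, conns))
```

The resulting list of zeros and ones is the pattern of excited A-elements that the R-elements would receive for the given image.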
We consider an N-valued function defined on some set of input vectors; for each vector it takes N values, indexed by $k = 1, 2, \dots, N$. The function maps the receptor field to the associative layer, its $k$-th value being the weight of the $k$-th associative element for the given input vector. If there are two different mappings, then for a given vector an inequality between them holds [the inequality is missing in the source]. For all stimuli belonging to the same class $i$ and for each element of $A$, activity functions are introduced that take the values 0 and 1 and characterize the activity of the element under the influence of stimuli from class $i$ [27]. If the external environment is divided into $d$ classes, with a given number of representatives allocated to each class, then in the $k$-th A-element, using the mapping, the weight is accumulated and calculated by formula (1) [27,34,35] [formula (1) is missing in the source]. The quantities entering formula (1) are the initial weight of the $k$-th A-element, the increment $\delta_i$ of the weight of the A-element when one stimulus from the $i$-th class is shown, and the activity of the $k$-th A-element under the influence of stimuli from class $j$.
When an image appears on the receptor field, the A-element can either be excited or remain unexcited. Let us denote by $P_i$ the probability that an A-element is excited when a single image from class $i$ appears, $i = 1, 2, \dots, d$.
As a measure of the quality of recognition, S. V. Dayan introduced the characteristic function of the perceptron-type neural network (CFP) [27][28][29][30]. For each class $i$, the characteristic function $\zeta_i$ has the form

$$\zeta_i = P_i - \sum_{j} P_{ij},$$

where $P_i$ ($i = 1, 2, \dots, d$) is the probability that an A-element is excited by a stimulus from class $i$, and the terms $P_{ij}$ are excitation probabilities for stimuli from the classes $i$ and $j$. Note that the CFP characterizes the probability that the A-element is excited when a stimulus belonging to a certain class is shown and is not excited by a stimulus not belonging to that class.
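The verbal description of the CFP given in the note can be made concrete with a small empirical sketch. It does not implement the analytical definition from [27]; it only counts, for a given table of A-element activities, how often an element is excited by one class and by no other class. The table layout (`activity[i][k]` equals 1 when A-element `k` is excited by a representative of class `i`) and the function names are assumptions.

```python
def excitation_probability(activity_row):
    """Fraction of A-elements excited by a representative of one class."""
    return sum(activity_row) / len(activity_row)

def selective_fraction(activity, i):
    """Fraction of A-elements excited by class i and by no other class.

    activity: list of rows, one per class; each row holds 0/1 values,
              one per A-element.
    """
    num_classes = len(activity)
    num_elements = len(activity[i])
    count = 0
    for k in range(num_elements):
        excited_by_i = activity[i][k] == 1
        silent_elsewhere = all(
            activity[j][k] == 0 for j in range(num_classes) if j != i
        )
        if excited_by_i and silent_elsewhere:
            count += 1
    return count / num_elements

if __name__ == "__main__":
    # Three classes, four A-elements (toy data).
    activity = [
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 1],
    ]
    for i in range(len(activity)):
        print(i, excitation_probability(activity[i]),
              selective_fraction(activity, i))
```

A class whose selective fraction is zero has no A-element that responds to it alone, which is the situation in which recognition of that class cannot rely on any single element.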
[A formula is missing in the source at this point.] The quantities entering it are the number of consecutively shown stimuli, the number $N$ of A-elements, and the increment $\delta_i$ of the weight of the A-element when one stimulus from the $i$-th class is shown.
When the control stimulus is presented, summing formula (1) over all A-elements gives the total weight at the output of the associative layer. The variance of the weights is then given by a formula that is missing in the source. Using the above concepts of the characteristic function and of the mathematical expectation and variance of the weights, the following theorem is proved.
Theorem. If a set of neural networks and a classification of the external environment are given, then for the existence of a solution it is necessary and sufficient that inequality (4) hold [inequality (4) is garbled in the source].

Sufficiency.
If condition (4) is satisfied, then its left-hand side is nonnegative. Let us show that it is strictly positive.
Using the law of large numbers, one can find the dependence of the probability of correct identification on the expectation and variance of the input quantity [the corresponding estimates are missing in the source]. Consequently, for large µ and small σ, the correct separation of stimuli occurs with a probability close to one, so the second term on the right-hand side of (4) tends to zero. Since the relationship (8) given in [27, p. 331] holds [relationship (8) is missing in the source], substituting estimate (8) into (7) and strengthening it, we obtain (9) [formula (9) is missing in the source]. Considering inequality (5), we find that the probability of correct identification is at least one minus that second term, so the right-hand side of (4) is at least 1. Substituting the last inequality into (9), we get for all classes the inequalities

$$\frac{1}{N}\sum_{i=1}^{d} 1 < 1, \qquad \frac{d}{N} < 1, \qquad d < N,$$

where $d$ is the number of classes and $N$ is the number of A-elements. The result obtained coincides with the result of the F. Rosenblatt theorem [1, Theorem 3, Corollary 2, p. 101] and the theorem on the choice of the number of A-elements [27, Theorem 11, Corollary 9, p. 331].
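The statement that a large expectation and a small variance give a probability of correct separation close to one can be illustrated with Chebyshev's inequality, a standard bound that is used here in place of the estimates of the paper. If the input quantity $X$ has mean $\mu > 0$ and standard deviation $\sigma$, then $X \le 0$ implies $|X - \mu| \ge \mu$, so $P(X > 0) \ge 1 - \sigma^2/\mu^2$. The sketch below compares this bound with a simulated frequency; treating a positive input as a correct decision and sampling from a normal distribution are both assumptions.

```python
import random

def chebyshev_lower_bound(mu, sigma):
    """Lower bound on P(X > 0) for any X with mean mu > 0 and std sigma."""
    return max(0.0, 1.0 - (sigma / mu) ** 2)

def empirical_positive_rate(mu, sigma, trials=20000, seed=0):
    """Simulated frequency of X > 0 for normally distributed X (assumption)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.gauss(mu, sigma) > 0)
    return hits / trials

if __name__ == "__main__":
    for mu, sigma in ((2.0, 1.0), (5.0, 1.0), (10.0, 1.0)):
        print(mu, sigma,
              chebyshev_lower_bound(mu, sigma),
              empirical_positive_rate(mu, sigma))
```

As the ratio of σ to µ decreases, the bound approaches one, in line with the qualitative argument used in the proof.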
The dependence (4) is a convenient mathematical tool for studying the statistical characteristics of neural networks. The obtained estimates can be used to define the network architecture in practical tasks.

Conclusion
In this work, a mathematical apparatus for studying the probabilistic behavior of a perceptron-type neural network is developed. This apparatus is based on the characteristic function of a perceptron.
A theorem on the necessary and sufficient condition for the existence of a solution of a neural network is proved.
As a consequence of this theorem, results are obtained that are consistent with the results of F. Rosenblatt and S. Dayan [1,2].
The results obtained are of both theoretical and practical interest. Until now, in most cases, network parameters have been selected empirically and refined through experiments.
The results obtained in this work connect the network architecture with the probability of correct pattern recognition. These results give constructive recommendations for the construction of recognition systems based on neural networks.