Asymptotic Behaviour of Gradient Learning Algorithms in Neural Network Models for the Identification of Nonlinear Systems
Valerii N. Azarskov^{1}, Dmytro P. Kucherov^{1}, Sergii A. Nikolaienko^{2}, Leonid S. Zhiteckii^{2}
^{1}Faculty of Computer Science, National Aviation University, Kiev, Ukraine
^{2}Cybernetics Centre, Dept. of Automated Data Processing Systems, Kiev, Ukraine
Valerii N. Azarskov, Dmytro P. Kucherov, Sergii A. Nikolaienko, Leonid S. Zhiteckii. Asymptotic Behaviour of Gradient Learning Algorithms in Neural Network Models for the Identification of Nonlinear Systems. American Journal of Neural Networks and Applications. Vol. 1, No. 1, 2015, pp. 1-10. doi: 10.11648/j.ajnna.20150101.11
Abstract: This paper deals with studying the asymptotic properties of multilayer neural network models used for the adaptive identification of a wide class of nonlinearly parameterized systems in a stochastic environment. To adjust the neural network's weights, standard online gradient-type learning algorithms are employed. The learning set is assumed to be infinite but bounded. A Lyapunov-like tool is utilized to analyze the ultimate behaviour of the learning processes in the presence of stochastic input variables. New sufficient conditions guaranteeing the global convergence of these algorithms in the stochastic framework are derived. Their main feature is that they need no penalty term to achieve the boundedness of the weight sequence. To demonstrate the asymptotic behaviour of the learning algorithms and support the theoretical studies, some simulation examples are also given.
Keywords: Neural Network, Nonlinear Model, Gradient Learning Algorithm, Stochastic Environment, Convergence
1. Introduction
Over the past decades, interest has been increasing in the use of multilayer neural networks as models for the adaptive identification of nonlinearly parameterized dynamic systems [1–4]. This has been motivated by the theoretical works of several researchers including, in particular, Cybenko and Funahashi [5,6], who proved that, even with one hidden layer, a neural network can uniformly approximate any continuous mapping over a compact domain, provided that the network has a sufficient number of neurons with corresponding weights.
Several learning methods for updating the weights of neural networks have been advanced in the literature. Most of these methods rely on the gradient concept [4,7]. Although this concept has been successfully used in many empirical studies, there are very few fundamental results dealing with the convergence of gradient algorithms for learning neural networks. One of these results is based on utilizing the Lyapunov stability theory [2,8].
The asymptotic behaviour of online adaptive gradient algorithms for network learning has been studied by many authors. In particular, White [9] investigated the convergence of the learning process for so-called feedforward network models with a single hidden layer by using the stochastic approximation theory. Convergence results have been derived in [10–16], among many others, under the assumption that the input signals have a probabilistic nature. In this stochastic approach, the learning rate goes to zero as the number of learning steps tends to infinity. Unfortunately, this means that the learning goes faster in the beginning and slows down in the later stages.
The convergence analysis of learning algorithms of a deterministic (non-stochastic) nature has been given in [17–22]. In contrast to the stochastic approach, several of these results allow one to employ a constant learning rate [19,23]. However, they assume that the learning set is finite, whereas in online identification schemes this set is theoretically infinite. To the best of the authors' knowledge, there are no general results in the literature concerning the global convergence properties of training procedures with a fixed learning rate applicable to the case of an infinite learning set.
The distinguishing feature of multilayer neural networks is that they describe nonlinearly parameterized models to be identified. This leads to difficulties in deriving their convergence properties in the general case.
To avoid these difficulties in the non-stochastic case, the assumption that the nonlinear functions involved are convex (concave) is introduced in [24]. However, such an assumption is not appropriate for the neural network description of nonlinearity.
A popular approach to analyzing the asymptotic behaviour of online gradient algorithms in the stochastic case is based on martingale convergence theory [25]. This approach has been exploited in [26,27] to derive some local convergence results in the stochastic framework for standard online gradient algorithms with a constant learning rate.
This paper is an extension of [26,27]. The main effort is focused on establishing sufficient conditions under which the global convergence of gradient algorithms for learning neural network models in a stochastic environment is ensured. The key idea in deriving these convergence results is based on the use of the Lyapunov methodology [28].
2. Description of the System
Let
y(n) = f(x(n)) (1)
be the nonlinear equation describing, in compact form, a complex system to be identified. In this equation, y(n) and x(n) = [x_1(n), …, x_N(n)]^T are the scalar output and the so-called state vector, respectively, both available for measurement at each nth time instant, and f represents some unknown nonlinear mapping. (Note that x(n) may include the current inputs of this system and possibly its past inputs and also outputs; see [7, subsect. 5.15].) Without loss of generality, one supposes that the nonlinearity
f = f(x), x ∈ X (2)
is a continuous and smooth function on a bounded but infinite set X ⊂ R^N.
To approximate (2) by a suitable nonlinearly parameterized function, a two-layer neural network model containing M neurons in its hidden layer is employed. The inputs to each jth neuron of this layer at the time instant n are the components of x(n). Its output signal at the nth time instant is specified as
o_j(n) = φ(Σ_{i=1}^{N} w_{ij} x_i(n) + b_j), j = 1, …, M, (3)
where x_i(n) denotes the ith component of x(n), and w_{ij} and b_j are the weight coefficients and the bias of this jth neuron, respectively. φ(·) denotes the so-called activation function, defined usually as one of the sigmoid functions
φ(v) = 1 / (1 + e^{−v}) (4)
or
φ(v) = tanh v. (5)
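In code, the two activation choices (4) and (5) read as follows (a minimal sketch; the function names are ours, not the paper's notation):

```python
import math

def sigmoid(v):
    """Logistic activation (4): maps any real v into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def tanh_act(v):
    """Hyperbolic-tangent activation (5): maps any real v into (-1, 1)."""
    return math.tanh(v)
```

Both functions are smooth and monotonically increasing, which is what the gradient computations below rely on.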
There is only one neuron in the output (second) layer, whose inputs are the outputs of the hidden layer's neurons. The output signal of the second layer at the time instant n is determined by
ŷ(n) = Σ_{j=1}^{M} c_j o_j(n) + b_0, (6)
where c_1, …, c_M are the weights of this neuron and b_0 is its bias.
Since φ(·) defined by (4) and (5) is nonlinear, it follows from (3), (6) that ŷ(n) is a nonlinear function depending on x(n) and also on the K-dimensional parameter vector
w = [w_{11}, …, w_{NM}, b_1, …, b_M, c_1, …, c_M, b_0]^T, K = (N + 2)M + 1. (7)
To emphasize this fact, define the output signal of the neural network in the form
ŷ = f̂(x, w) (8)
using the notation f̂(·,·) for the mapping implemented by the network. Taking into account that the neural network plays the role of a model of (1), rewrite (8) as follows:
ŷ(n) = f̂(x(n), w(n)). (9)
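The forward pass (3)-(6) can be sketched as a short function; here the tanh activation (5) is used, and the variable names (`W`, `b`, `c`, `b0`) are illustrative rather than the paper's notation:

```python
import math

def forward(x, W, b, c, b0):
    """Forward pass of the two-layer model (3)-(6): M hidden tanh neurons
    feeding one linear output neuron.  W[j][i] and b[j] are the hidden-layer
    weights and biases; c[j] and b0 are those of the output neuron."""
    hidden = [math.tanh(sum(wji * xi for wji, xi in zip(Wj, x)) + bj)
              for Wj, bj in zip(W, b)]
    return sum(cj * oj for cj, oj in zip(c, hidden)) + b0
```

With all weights at zero the hidden outputs vanish and the model output reduces to the bias b0, which is a quick sanity check of the wiring.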
Now, define the variable
ε(x, w) = f(x) − f̂(x, w) (10)
representing the discrepancy between the nonlinearity (2) and its neural network model for a fixed w. Due to (1), it yields the current model error
e(n) = y(n) − ŷ(n), (11)
which can be measured at the nth time instant. Further, introduce the usual quadratic loss function
E(n) = e²(n)/2. (12)
To adapt the neural network model to the uncertain system (1), the standard online gradient learning algorithm
w(n + 1) = w(n) − η ∇_w E(n), (13)
taken, for example, from [4,7] is utilized. In this algorithm, ∇_w E(n) denotes the gradient of E(n) with respect to w at w = w(n) for given x(n), y(n), and η > 0 is the learning rate (step size) of (13). Thus, (3), (6), (8) and (13) together with (9) and (12) describe the learning system necessary for the adaptive identification of (1). For a better understanding of its performance, the structure of this system is depicted in Fig. 1.
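A single step of (13) on the loss (12) can be sketched as follows; the gradient is obtained by the chain rule through (3)-(6) with the tanh activation (5), and all names are illustrative, not the paper's:

```python
import math

def grad_step(x, y, W, b, c, b0, eta):
    """One iteration of the online gradient algorithm (13) on the quadratic
    loss E(n) = e(n)^2 / 2, where e(n) = y(n) - y_hat(n) is the model error
    (11) and the model is the two-layer tanh network (3)-(6)."""
    # Forward pass (3), (6).
    hidden = [math.tanh(sum(wji * xi for wji, xi in zip(Wj, x)) + bj)
              for Wj, bj in zip(W, b)]
    y_hat = sum(cj * oj for cj, oj in zip(c, hidden)) + b0
    e = y - y_hat                                   # model error (11)
    # Hidden-layer gradients via the chain rule; tanh'(v) = 1 - tanh(v)^2.
    deltas = [e * cj * (1.0 - oj * oj) for cj, oj in zip(c, hidden)]
    W = [[wji + eta * dj * xi for wji, xi in zip(Wj, x)]
         for Wj, dj in zip(W, deltas)]
    b = [bj + eta * dj for bj, dj in zip(b, deltas)]
    # Output-layer updates: dE/dc_j = -e * o_j and dE/db0 = -e,
    # so descent adds eta * e * o_j and eta * e.
    c = [cj + eta * e * oj for cj, oj in zip(c, hidden)]
    b0 = b0 + eta * e
    return W, b, c, b0, e
```

Repeatedly applying the step to one fixed sample drives the model error toward zero, which is plain gradient descent on a single-sample loss.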
3. Problem Formulation
Let {x(n)} be a sequence of vectors appearing randomly in accordance with some probability density function p(x) whose support is X. Furthermore, p(x) has the following properties: the probability that x(n) falls into any subset of X whose dimension is less than N is zero, and Pr{x(n) ∈ X′} > 0 for any subset X′ ⊆ X of nonzero measure, where Pr{·} denotes the probability of the corresponding random event. Additionally, it is assumed that p(x) represents a continuous function which may become zero only at some isolated points of X.
Now, introduce the performance index
J(w) = E_x{E(n)}, (14)
which evaluates the quality of the learning process, with E(n) given in (12). In this expression, E_x{·} denotes the expectation of E(n) with respect to the random x(n).
The aim of this paper consists in studying the asymptotic properties of the learning procedure (13). More precisely, the following problem is stated. It is required to derive the conditions under which the sequence {J(w(n))} caused by this recursive algorithm will converge in the sense that
J(w(n)) → J_∞ as n → ∞ (15)
with probability 1 (almost surely), where J(w(n)) is determined by (14) for w = w(n).
4. Preliminaries
Let w* be a vector minimizing J(w), i.e.,
w* = arg min_w J(w). (16)
Consider, first, the case when f(x) can exactly be approximated by a neural network representation for all x ∈ X, implying
f(x) = f̂(x, w*) for all x ∈ X. (17)
In this case, called in [4, p. 304] the ideal case, one has J(w*) = 0 (by virtue of (12), (14)).
It turns out that, at least in the ideal case, the set containing these w* is not a one-point set [26,27]. To show this, let w* be a vector satisfying (17). Then, (3) and (6) together with (4) give that another vector, obtained from w* for instance by permuting the hidden-layer neurons together with their weights and biases (see (7)), will also satisfy (17).
Introduce the scalar variable V(n) representing the square of the Euclidean distance between w(n) and a w*, and define
V(n) = ‖w(n) − w*‖². (18)
Denote its limit, when it exists, by V_∞. Since V(n) ≥ 0 (due to (18)), it is clear that if
V(n + 1) ≤ V(n) for all n, (19)
then the sequence {V(n)} always has a limit as n tends to infinity, i.e.,
lim_{n→∞} V(n) = V_∞, (20)
where V_∞ is a random value (in general), meaning that the algorithm (13) converges. On the other hand, the fact that {V(n)} is a monotonically nonincreasing sequence is not necessary to achieve (20) in principle.
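The quantities just introduced are straightforward to monitor in an implementation; a small sketch (function names are ours) of the squared distance (18) and the monotonicity test (19):

```python
def squared_distance(w, w0):
    """V(n) = ||w(n) - w*||^2, the squared Euclidean distance (18);
    w and w0 are flat parameter vectors of equal length."""
    return sum((wi - w0i) ** 2 for wi, w0i in zip(w, w0))

def is_nonincreasing(values, tol=0.0):
    """Check the monotonicity condition (19) along a recorded
    trajectory V(0), V(1), ... of the learning process."""
    return all(b <= a + tol for a, b in zip(values, values[1:]))
```

In the experiments of Section 5 exactly such a recorded trajectory of V(n) is what distinguishes the supermartingale regime from the non-monotone one.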
Note that the existence of the limit (20) does not imply that V_∞ = 0 even when the condition (17) is satisfied. Hence,
w(n) → w* as n → ∞ (21)
with probability 1 is not guaranteed without additional assumptions on f̂. Moreover, the limit (20) may not exist if {x(n)} is an arbitrary non-stochastic sequence leading to the violation of (19) [26]. Nevertheless, if the asymptotic property (21) takes place, then w(n) converges in the sense of (21) to some element of
Ω* = {w : f(x) = f̂(x, w) for all x ∈ X}, (22)
the so-called limit set introduced in [27, sect. 1.3], in which w* lies.
Comment 1. Note that the limit set given by (22) represents a nonlinear manifold in R^K whose dimension is, in general, less than K.
It can be understood that the algorithm (13) "attempts" to solve the infinite set of equations
f(x) = f̂(x, w) for all x ∈ X (23)
with respect to the unknown w. In fact, this algorithm may give a solution of (23) which is determined as an element of the limit set (22) but not necessarily as w*.
It was observed that the condition (19), meaning that {V(n)} is a monotonically nonincreasing sequence, may not be satisfied if the neural network model contains the hidden layer, neither for non-stochastic nor for stochastic {x(n)}.
In [26,27], it has been established that if the initial w(0) is chosen in a neighbourhood of some w* giving a small V(0) (according to (18)), then {V(n)} behaves as a so-called positive supermartingale [25,29] if {x(n)} is a stochastic sequence. This implies that the conditional expectation of V(n + 1) satisfies
E{V(n + 1) | V(n), …, V(0)} ≤ V(n). (24)
(Notice that (24) represents a stochastic counterpart of (19).) By virtue of the well-known Doob's theorem [25], the property (24) yields
V(n) → V_∞ a.s. (25)
similar to (20) for the non-stochastic case. However, if w(0) lies far enough from w*, then the condition (24) may not be satisfied. In this case, instead of (24), another condition,
(26)
which is stronger, may take place.
The asymptotic behaviour of the gradient algorithm (13) for an arbitrary w(0) is now described in the following theorem.
Theorem 1. Let (17) hold and
(27) a.s.
Then {V(n)} will converge with probability 1 (a.s.) for any w(0) provided that the condition (26) is satisfied.
Proof. Using the classical Borel–Cantelli lemma [25, Sect. 15.3], from (27) one can conclude that there exists a finite number n* such that c_n = 0 for all n ≥ n*. Since {c_n} is nonnegative, this gives
(28) a.s.
meaning that all conditions of the modified positive-supermartingale result established in Corollary D.5.1 of [29, p. 501] are satisfied. By this result, the validity of (25) follows.
Comment 2. Of course, the convergence condition (27) given in Theorem 1 (or (28)) makes only mathematical sense because it cannot be verified beforehand. Nevertheless, such conditions are somewhat useful for understanding the asymptotic behaviour of the learning algorithm (13) in a stochastic environment.
At first sight, it seems that the variable V(w) given by (18) might be exploited as a Lyapunov function for analyzing the asymptotic behaviour of (13) in a stochastic framework. In fact, by definition, V(w) has the property
(29)
Meanwhile, the partial derivatives of V(w) with respect to the components of w are not continuous at all w. To demonstrate this feature, let the limit set (22) contain two points w*₁ and w*₂. Then V(w) becomes
V(w) = min{‖w − w*₁‖², ‖w − w*₂‖²},
where the minimum picks the nearer of the two points. In this case, the components of the gradient ∇V(w), which represent these partial derivatives, are discontinuous at the w belonging to the boundary between the domains ‖w − w*₁‖ < ‖w − w*₂‖ and ‖w − w*₁‖ > ‖w − w*₂‖ (see Fig. 2). Thereby, the requirement
‖∇V(w′) − ∇V(w″)‖ ≤ L ‖w′ − w″‖ (30)
with a Lipschitz constant L < ∞, advanced in [28], is not satisfied for arbitrary w′, w″. Thus, V having the form (18) is indeed not admissible for studying the global convergence properties of (13) based on the results of [28].
5. Observations
To demonstrate some of the asymptotic properties of (13) pointed out in the section above, several simulation experiments were conducted with a scalar nonlinear system (1). It can be shown that its nonlinearity can explicitly be approximated by the two-layer neural network model described by (3), (4), (6) and (8).
In all of the experiments, the learning rate was taken as η = 0.01.
Fig. 3 illustrates the results of the first simulation example, where {x(n)} was chosen as a non-stochastic sequence. It can be observed that in this example V(n) defined by (18) has no limit, implying that the learning algorithm (13) does not converge.
In the other experiments, {x(n)} was generated as a sequence of independent identically distributed (i.i.d.) pseudo-random numbers (the stochastic cases). Namely, in the second example, the initial w(0) was taken close to w*. In this case, the first difference of V(n) defined by (18) changed its sign (see Fig. 4). Nevertheless, V(n) tends to zero as n increases, as shown in Fig. 4. This observation supports the convergence property of (13), showing that {V(n)} is here a supermartingale.
Simulation results of the third and fourth experiments are presented in Fig. 5, left and right, respectively. The initial estimate w(0) in both examples was chosen so that the distance between w(0) and w* was large enough. It was observed that at an initial stage of the learning process, V(n) was increasing for several n, as shown in Fig. 5, left. Further, V(n) became decreasing. Such a behaviour of this sequence led to the feature that {V(n)} is nonincreasing for all sufficiently large n.
In the fourth example, the initial w(0) was chosen to be close to that in the third example. One can observe that in this case the behaviour of V(n) is similar (see Fig. 5, right).
It turned out that in the third and fourth simulation examples, the condition (24) is not satisfied, whereas the learning algorithm (13) nevertheless remains convergent. In these examples, instead of (24), the condition (26) takes place. This fact is demonstrated in Fig. 6.
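Experiments of this kind can be re-created in a hedged sketch: the target is itself a tiny tanh network, so the ideal case (17) holds, the inputs x(n) are i.i.d., and V(n) from (18) is tracked along the online gradient iterations (13). Except for η = 0.01, all numeric settings here are illustrative, not the authors' exact values:

```python
import math, random

def run_experiment(steps=20000, eta=0.01, seed=1):
    """Track V(n) = ||w(n) - w*||^2 under the online gradient algorithm
    (13) for a 1-input, 1-hidden-neuron tanh model whose target parameters
    w* are known, with i.i.d. inputs uniform on [-1, 1]."""
    rng = random.Random(seed)
    # "True" parameters w* generating the data (ideal case (17)).
    ws = {"w": 1.5, "b": -0.5, "c": 2.0, "b0": 0.3}
    target = lambda x: ws["c"] * math.tanh(ws["w"] * x + ws["b"]) + ws["b0"]
    # Initial estimate near w* (the supermartingale regime of Sect. 4).
    w, b, c, b0 = 1.2, -0.3, 1.7, 0.1
    V = []
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        o = math.tanh(w * x + b)
        e = target(x) - (c * o + b0)          # model error (11)
        # Gradient step (13) on E(n) = e^2 / 2, via the chain rule.
        w += eta * e * c * (1.0 - o * o) * x
        b += eta * e * c * (1.0 - o * o)
        c += eta * e * o
        b0 += eta * e
        V.append((w - ws["w"]) ** 2 + (b - ws["b"]) ** 2
                 + (c - ws["c"]) ** 2 + (b0 - ws["b0"]) ** 2)
    return V

V = run_experiment()
```

With this close initialization the recorded V(n) trends downward, mirroring the second example's supermartingale behaviour; starting far from w* instead reproduces the initial growth of V(n) seen in the third experiment.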
6. Main Results
The global stochastic convergence analysis of the gradient learning algorithm (13) is based on employing the fundamental convergence conditions established in the Key Technical Lemma below, which is a slightly reformulated version of Theorem 3 of [28].
Key Technical Lemma. Let V(w) be a function satisfying (29) and (30). Define the scalar variable
(31)
and denote
.
Suppose:
(i)
(ii)
Introduce the additional variable
(32)
Then the algorithm (13) yields convergence a.s. provided that the conditions
(33)
(34)
are satisfied; i.e., the limit (25) will be achieved.
Related results following from Theorem 3′ of [28] are:
Corollary. Under the conditions of the Key Technical Lemma, the convergence takes place a.s. provided that
(35)
is satisfied.
Now, one can present the first convergence result, summarized in the theorem below.
Theorem 2. Suppose the assumption (17) holds. Then the gradient algorithm (13) with a constant learning rate η will converge with probability 1 and
(36) a.s.
for any initial w(0) chosen randomly, if the conditions (35) with the quantities specified by
(37)
(38)
are satisfied.
Proof. Set
(39)
Then the conditions (29) and (30) can be shown to be valid. This indicates that V of the form (39) may be taken as the Lyapunov function. By virtue of (31), such a choice of V defines the corresponding scalar variable. Putting the quantities determined by (37) and (38), respectively, one can conclude that the conditions (i), (ii) of the Key Technical Lemma are satisfied. Applying its Corollary proves the convergence with probability 1.
Due to the definition (39) of V together with the assumption (17), the result (36) follows.
Figure 6. Variables and in Example 3.
Now, consider the general case, where f(x) cannot exactly be approximated by f̂(x, w) (as in (17)). Obviously, in this case the choice of a constant learning rate η is not appropriate [4].
The convergence results are established in the following theorem.
Theorem 3. Subject to the conditions
(40)
the gradient algorithm (13) yields
a.s.
provided that with determined by (37).
Proof. Setting V as in the proof of Theorem 2,
it can be shown that the requirements (29) and (30) will be satisfied. It then follows that condition (ii) of the Key Technical Lemma holds asymptotically.
Suppose (ii) is not satisfied. Then, there is a finite index such that
(41)
Since this index is assumed to be finite, the requirement (33) will be satisfied for all sufficiently large n, provided that (i) takes place and condition (b) of (40) is satisfied.
Further, the series of the learning rates
diverges, whereas the series of their squares
converges (because of the validity of (a)). This gives that (34) also takes place.
Thus, all the conditions of the Key Technical Lemma are satisfied. By this Lemma, the convergence holds a.s. But this contradicts the assumption (41). Hence, this assumption is false. This fact proves the validity of the result given in the theorem.
Remark. It can be concluded that, under the conditions of Theorem 3, the features described above are observed for all sufficiently large n.
Fig. 7 demonstrates these features.
Figure 7. Variables in Example 3.
As can be seen, one of these variables is bounded away from zero whereas the other is upper bounded. This gives the convergence, as shown in Fig. 7 below.
Comment 3. The conditions established in Theorems 2 and 3 are sufficient to guarantee the global convergence of (13), for any w(0), with probability 1 both in the ideal and non-ideal cases. Under these conditions, the requirement (15)
will obviously be satisfied (the final result). Again, the essential feature of this result lies in the fact that these convergence properties can be achieved without adding a penalty term to the loss function, as in [16].
Of course, the calculation of the quantities needed for choosing a suitable constant learning rate η according to (37), (38) seems to be hard. Meanwhile, η may be replaced by a time-varying η(n) satisfying the requirements (35) if necessary. Note that such requirements are usual in stochastic learning theory [7].
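One common time-varying schedule of the kind mentioned here can be sketched as follows; the constants a and b are illustrative, and we do not claim this is the exact form required by (35):

```python
def eta_schedule(n, a=0.1, b=10.0):
    """Harmonic-type learning-rate schedule eta(n) = a / (b + n), a standard
    choice in stochastic learning theory: the sum of eta(n) over n diverges
    while the sum of eta(n)^2 converges, so the step sizes vanish slowly
    enough to keep learning yet fast enough to suppress the noise."""
    return a / (b + n)
```

Such a schedule reproduces the behaviour noted in the introduction: learning is fast in the beginning and slows down in the later stages.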
7. Conclusion
The main contribution of this paper consists in the theoretical and experimental study of the asymptotic properties of standard online gradient algorithms applicable to learning neural networks in a stochastic framework. Namely, new sufficient conditions for the global convergence of these algorithms have been established. It was shown that adding a penalty term to the current error function is indeed not necessary to guarantee their convergence properties. Further analysis will address the asymptotic behaviour of online gradient learning algorithms in the presence of noise, whose importance was pointed out in [4].
Acknowledgment
The authors would like to thank Prof. S.V. Kovalevskyy for his invitation to submit this work.
References