American Journal of Neural Networks and Applications
Volume 1, Issue 1, August 2015, Pages: 1-10

Asymptotic Behaviour of Gradient Learning Algorithms in Neural Network Models for the Identification of Nonlinear Systems

Valerii N. Azarskov1, Dmytro P. Kucherov1, Sergii A. Nikolaienko2, Leonid S. Zhiteckii2

1Faculty of Computer Science, National Aviation University, Kiev, Ukraine

2Cybernetics Centre, Dept. of Automated Data Processing Systems, Kiev, Ukraine


To cite this article:

Valerii N. Azarskov, Dmytro P. Kucherov, Sergii A. Nikolaienko, Leonid S. Zhiteckii. Asymptotic Behaviour of Gradient Learning Algorithms in Neural Network Models for the Identification of Nonlinear Systems. American Journal of Neural Networks and Applications. Vol. 1, No. 1, 2015, pp. 1-10. doi: 10.11648/j.ajnna.20150101.11


Abstract: This paper studies the asymptotic properties of multilayer neural network models used for the adaptive identification of a wide class of nonlinearly parameterized systems in a stochastic environment. To adjust the neural network's weights, standard online gradient-type learning algorithms are employed. The learning set is assumed to be infinite but bounded. A Lyapunov-like tool is utilized to analyze the ultimate behaviour of the learning processes in the presence of stochastic input variables. New sufficient conditions guaranteeing the global convergence of these algorithms in the stochastic framework are derived. Their main feature is that no penalty term is needed to achieve the boundedness of the weight sequence. To demonstrate the asymptotic behaviour of the learning algorithms and support the theoretical studies, some simulation examples are also given.

Keywords: Neural Network, Nonlinear Model, Gradient Learning Algorithm, Stochastic Environment, Convergence


1. Introduction

Over the past decades, interest has been increasing toward the use of multilayer neural networks as models for the adaptive identification of nonlinearly parameterized dynamic systems [1–4]. This has been motivated by the theoretical works of several researchers including, in particular, Cybenko and Funahashi [5,6], who proved that, even with one hidden layer, a neural network can uniformly approximate any continuous mapping over a compact domain, provided that the network has a sufficient number of neurons with corresponding weights.

Several learning methods for updating the weights of neural networks have been advanced in the literature. Most of these methods rely on the gradient concept [4,7]. Although this concept has been successfully used in many empirical studies, there are very few fundamental results dealing with the convergence of gradient algorithms for learning neural networks. One of these results is based on utilizing the Lyapunov stability theory [2,8].

The asymptotic behaviour of online adaptive gradient algorithms for network learning has been studied by many authors. In particular, White [9] investigated the convergence of the learning process for the so-called feedforward network models with a single hidden layer by using stochastic approximation theory. Convergence results have been derived in [10–16], among many others, under the assumption that the input signals are of a probabilistic nature. In this stochastic approach, the learning rate goes to zero as the number of learning steps tends to infinity. Unfortunately, this means that learning proceeds quickly at the beginning and slows down in the later stages.

The convergence analysis of learning algorithms of a deterministic (non-stochastic) nature has been given in [17–22]. In contrast to the stochastic approach, several of these results allow one to employ a constant learning rate [19,23]. However, they assume that the learning set is finite, whereas in online identification schemes this set is theoretically infinite. To the best of the authors' knowledge, there are no general results in the literature concerning the global convergence properties of training procedures with a fixed learning rate applicable to the case of an infinite learning set.

The distinguishing feature of multilayer neural networks is that they describe nonlinearly parameterized models that need to be identified. This leads to difficulties in deriving their convergence properties in the general case.

To avoid these difficulties in the non-stochastic case, the assumption that such nonlinear functions are convex (concave) is introduced in [24]. However, this assumption is not appropriate for the neural network description of nonlinearity.

A popular approach to analyzing the asymptotic behaviour of online gradient algorithms in the stochastic case is based on martingale convergence theory [25]. This approach has been exploited in [26,27] to derive local convergence results in a stochastic framework for standard online gradient algorithms with a constant learning rate.

This paper is an extension of [26,27]. The main effort is focused on establishing sufficient conditions under which the global convergence of gradient algorithms for learning neural network models in a stochastic environment is ensured. The key idea in deriving these convergence results is the use of the Lyapunov methodology [28].

2. Description of the System

Let

y(n) = f(x(n))                (1)

be the nonlinear equation in compact form describing a complex system to be identified. In this equation, y(n) and x(n) = [x_1(n), …, x_M(n)]^T are the scalar output and the so-called state vector, respectively, available for measurement at each nth time instant n = 0, 1, 2, …, and f represents some unknown nonlinear mapping. (Note that x(n) may include the current inputs of this system and possibly its past inputs and also outputs; see [7, subsect. 5.15].) Without loss of generality, one supposes that the nonlinearity

f(x), x ∈ X                (2)

is a continuous and smooth function on a bounded but infinite set X ⊂ R^M.

To approximate (2) by a suitable nonlinearly parameterized function, a two-layer neural network model containing N neurons in its hidden layer is employed. The inputs to each jth neuron of this layer at the time instant n are the components of x(n). Its output signal at the nth time instant is specified as

o_j(n) = φ( Σ_{i=1}^{M} w_{ji} x_i(n) + b_j ),  j = 1, …, N,                (3)

where x_i(n) denotes the ith component of x(n), and w_{ji} and b_j are the weight coefficients and the bias of this jth neuron, respectively. φ(·) denotes the so-called activation function, usually defined as one of the sigmoid functions

φ(s) = 1 / (1 + e^{−s})                (4)

or

φ(s) = (1 − e^{−s}) / (1 + e^{−s}).                (5)

There is only one neuron in the output (second) layer, whose inputs are the outputs of the hidden layer's neurons. The output signal of the second layer, ŷ(n), at the time instant n is determined by

ŷ(n) = Σ_{j=1}^{N} c_j o_j(n) + c_0,                (6)

where c_1, …, c_N are the weights of this neuron and c_0 is its bias.

Since φ defined by (4) or (5) is nonlinear, it follows from (3), (6) that ŷ(n) is a nonlinear function depending on x(n) and also on the ν-dimensional parameter vector

w = [w_{11}, …, w_{NM}, b_1, …, b_N, c_0, c_1, …, c_N]^T,  ν = N(M + 2) + 1.                (7)

To emphasize this fact, define the output signal of the neural network in the form

ŷ = f̂(x, w),                (8)

using the notation f̂ for the mapping realized by (3), (6). Taking into account that the neural network plays the role of a model of (1), rewrite (8) as follows:

ŷ(n) = f̂(x(n), w(n)).                (9)
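To make the model concrete, the forward pass (3)–(6) can be sketched in plain Python. The sketch is illustrative only: it assumes the logistic sigmoid (4) for the hidden layer and a linear output neuron, and all argument names (`w_hidden`, `b_hidden`, `w_out`, `b_out`) are hypothetical.

```python
import math

def forward(w_hidden, b_hidden, w_out, b_out, x):
    """Forward pass of the two-layer model: sigmoid hidden neurons as in
    (3)-(4), followed by a single linear output neuron as in (6)."""
    hidden = []
    for w_j, b_j in zip(w_hidden, b_hidden):
        s = sum(w_ji * x_i for w_ji, x_i in zip(w_j, x)) + b_j  # weighted sum plus bias
        hidden.append(1.0 / (1.0 + math.exp(-s)))               # sigmoid activation (4)
    # linear output neuron: weighted sum of hidden outputs plus its own bias
    return sum(c_j * h_j for c_j, h_j in zip(w_out, hidden)) + b_out
```

For instance, a single hidden neuron with zero weights and output weight 2 returns 2·σ(0) = 1 for a zero input.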

Now, define the variable

ε(x, w) = f(x) − f̂(x, w),                (10)

representing the discrepancy between the nonlinearity (2) and its neural network model for a fixed w. Due to (1), it yields the current model error

e_n = y(n) − ŷ(n) = ε(x(n), w(n)),                (11)

which can be measured at the nth time instant. Further, introduce the usual quadratic loss function

E_n = e_n² / 2.                (12)

To adapt the neural network model to the uncertain system (1), the standard online gradient learning algorithm

w(n + 1) = w(n) − η ∇E_n(w(n)),                (13)

taken, for example, from [4,7], is utilized. In this algorithm, ∇E_n(w(n)) denotes the gradient of E_n with respect to w at w = w(n) for given x(n), and η > 0 is the learning rate (step size) of (13). Thus, (3), (6), (8) and (13) together with (9) and (12) describe the learning system necessary for the adaptive identification of (1). For a better understanding of its performance, the structure of this system is depicted in Fig. 1.
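One step of the online gradient update (13) for this architecture can be written out explicitly by differentiating the quadratic loss (12). This is a minimal sketch under the same assumptions (logistic hidden units, linear output neuron); the parameter layout and the function name are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def gradient_step(w_hidden, b_hidden, w_out, b_out, x, y, eta):
    """One online update of type (13): descend the gradient of E_n = e_n^2/2
    with learning rate eta, returning the new weights and the pre-update error."""
    # forward pass through the hidden sigmoid layer and the linear output
    h = [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(w_j, x)) + b_j)
         for w_j, b_j in zip(w_hidden, b_hidden)]
    y_hat = sum(c_j * h_j for c_j, h_j in zip(w_out, h)) + b_out
    e = y_hat - y                                    # current model error
    # gradients of the quadratic loss w.r.t. each group of weights
    new_w_out = [c_j - eta * e * h_j for c_j, h_j in zip(w_out, h)]
    new_b_out = b_out - eta * e
    new_w_hidden, new_b_hidden = [], []
    for w_j, b_j, c_j, h_j in zip(w_hidden, b_hidden, w_out, h):
        d = e * c_j * h_j * (1.0 - h_j)              # backpropagated delta
        new_w_hidden.append([w_ji - eta * d * x_i for w_ji, x_i in zip(w_j, x)])
        new_b_hidden.append(b_j - eta * d)
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out, e
```

Iterating this step on a single fixed input-output pair drives the error to zero, since one training pair is always realizable by the model.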

Figure 1. Configuration of online learning system.

3. Problem Formulation

Let {x(n)} be a sequence of vectors appearing randomly in accordance with some probability density function p(x) whose support is X.

Furthermore, p(x) has the following properties: the probability P{x(n) ∈ X₀} is zero for any subset X₀ ⊂ X whose dimension is less than M, and P{x(n) ∈ X′} > 0 if X′ ⊆ X is of positive measure, where P{·} denotes the probability of the corresponding random event.

Additionally, it is assumed that p(x) represents a continuous function which may become zero only at some isolated points of X.

Now, introduce the performance index

J(w) = E_x[E_n] = ∫_X ½ ε²(x, w) p(x) dx,                (14)

which evaluates the quality of the learning process with E_n given in (12). In this expression, E_x[·] denotes the expectation with respect to the random x(n).

The aim of this paper consists in studying the asymptotic properties of the learning procedure (13). More precisely, the following problem is stated. It is required to derive conditions under which the sequence {w(n)} produced by this recursive algorithm will converge in the sense that

J(w(n)) → J*  as  n → ∞                (15)

with probability 1 (almost surely), where J(w(n)) is determined by (14) for w = w(n).
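In practice, the performance index (14) can only be estimated by averaging the loss (12) over a sample of inputs drawn from p(x). A minimal sketch follows; the arguments `model`, `w`, `sample`, and `f` are placeholders for the network map, its weights, the drawn inputs, and the true nonlinearity.

```python
def estimate_J(model, w, sample, f):
    """Monte Carlo estimate of the performance index (14):
    the average of e^2/2 over the sampled inputs."""
    total = 0.0
    for x in sample:
        e = model(w, x) - f(x)     # discrepancy at this input
        total += 0.5 * e * e       # quadratic loss (12)
    return total / len(sample)
```

For a scalar linear "model" w·x and target f(x) = x, the estimate over the sample {1, 2} with w = 0 equals (0.5 + 2.0)/2 = 1.25, and vanishes at w = 1.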

4. Preliminaries

Let w* be a vector minimizing J(w), i.e.,

w* = arg min_w J(w).                (16)

Consider, first, the case when f(x) can exactly be approximated by a neural network representation for all x ∈ X, implying

f(x) ≡ f̂(x, w*)  for all x ∈ X.                (17)

In this case, called in [4, p. 304] the ideal case, one has J(w*) = 0 (by virtue of (12), (14)).

It turns out that, at least in the ideal case, the set Ω* containing these w* is not a single point [26,27]. To show this, note that, due to (7), interchanging any two hidden neurons together with their weights leaves the mapping (8) unchanged. Hence, if w* is a vector satisfying (17), then (3) and (6) together with (4) give that another vector, obtained from w* by such an interchange, will also satisfy (17).

Introduce the scalar variable V(w) = ‖w − w*‖², representing the square of the Euclidean distance between w and a minimizer w*, and define

V(n) := ‖w(n) − w*‖².                (18)

Denote ΔV(n) := V(n + 1) − V(n). Since V(n) ≥ 0 (due to (18)), it is clear that if

ΔV(n) ≤ 0  for all n,                (19)

then the sequence {V(n)} always has a limit V∞ as n tends to infinity, i.e.,

V(n) → V∞  as  n → ∞,                (20)

where V∞ ≥ 0 is (in general) a random value, meaning that the algorithm (13) converges. On the other hand, the fact that {V(n)} is a monotonically non-increasing sequence is, in principle, not necessary for achieving (20).
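The quantities used in this argument are easy to track numerically along a simulated trajectory; a small sketch (function names hypothetical):

```python
def V(w, w_star):
    """Squared Euclidean distance of type (18) between the current
    weight vector and a fixed minimizer w*."""
    return sum((wi - si) ** 2 for wi, si in zip(w, w_star))

def is_nonincreasing(vs):
    """Check the monotonicity condition (19) along a recorded trajectory
    V(0), V(1), ... of the learning process."""
    return all(b <= a for a, b in zip(vs, vs[1:]))
```

Recording V at every step of (13) and testing the record with `is_nonincreasing` is one way to distinguish monotone from non-monotone phases of a learning run.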

Note that the existence of the limit (20) does not imply that V∞ = 0, even when the condition (17) is satisfied. Hence,

w(n) → w*  as  n → ∞                (21)

with w* ∈ Ω* is not guaranteed without additional assumptions on {x(n)}. Moreover, the limit (20) may not exist if {x(n)} is an arbitrary non-stochastic sequence leading to the violation of (19) [26]. Nevertheless, if the asymptotic property (21) takes place, then w(n) converges in the sense of (21) to some element of

Ω∞ := {w : ε(x, w) = 0 for all x ∈ X},                (22)

the so-called limit set introduced in [27, sect. 1.3].

Comment 1. Note that the limit set Ω∞ given by (22) represents a nonlinear manifold in R^ν whose dimension is less than ν.

It can be understood that the algorithm (13) "attempts" to solve the infinite set of equations

ε(x, w) = 0  for all x ∈ X                (23)

with respect to the unknown w. In fact, this algorithm may give a solution of (23) determined as an element of the limit set (22), but not necessarily as w*.

It was observed that the condition (19), meaning that {V(n)} is a monotonically non-increasing sequence, may not be satisfied if the neural network model contains a hidden layer, neither for non-stochastic nor for stochastic {x(n)}.

In [26,27], it has been established that if the initial w(0) is chosen in an ε-neighbourhood of some w*, giving a small V(0) (according to (18)), then {V(n)} behaves as a so-called positive supermartingale [25,29] if {x(n)} is a stochastic sequence. This implies that the conditional expectation satisfies

E[V(n + 1) | V(n), …, V(0)] ≤ V(n).                (24)

(Notice that (24) represents a stochastic counterpart of (19).) By virtue of the well-known Doob theorem [25], the property (24) yields

V(n) → V∞ < ∞  a.s.,                (25)

similar to (20) for the non-stochastic case. However, if w(0) lies far enough from Ω*, then the condition (24) may not be satisfied. In this case, instead of (24), another condition,

E[V(n + 1) | V(n), …, V(0)] ≤ V(n) + c_n  with  c_n ≥ 0,                (26)

which is less restrictive, may take place.

The asymptotic behaviour of the gradient algorithm (13) for an arbitrary w(0) is now derived in the following theorem.

Theorem 1. Let (17) hold and

Σ_{n=0}^{∞} P{c_n > 0} < ∞.                (27)

Then {V(n)} will converge with probability 1 (a.s.) for any w(0), provided that the condition (26) is satisfied.

Proof. Using the classical Borel–Cantelli lemma [25, Sect. 15.3], it can be concluded from (27) that there exists a finite number n* such that c_n = 0 for all n ≥ n*. Since {c_n} is nonnegative, this gives

Σ_{n=0}^{∞} c_n < ∞  a.s.,                (28)

meaning that all the conditions of the modified positive supermartingale result established in Corollary D.5.1 of [29, p. 501] are satisfied. By this result, the validity of (25) follows. ∎

Comment 2. Of course, the convergence condition (27) given in Theorem 1 (or (28)) is only of mathematical interest, because it cannot be verified beforehand. Nevertheless, these conditions are somewhat useful for understanding the asymptotic behaviour of the learning algorithm (13) in a stochastic environment.

At first sight, it seems that the variable V(w) given by (18) might be exploited as a Lyapunov function for analyzing the asymptotic behaviour of (13) in a stochastic framework. In fact, by definition, V(w) has the property

V(w) ≥ 0  with  V(w*) = 0.                (29)

Meanwhile, the partial derivatives of V(w) with respect to the components of w are not continuous everywhere. To demonstrate this feature, let M = N = 1. Then, since the set Ω* contains at least two minimizers w*′ and w*″, the distance to the nearest of them becomes

V(w) = min{ ‖w − w*′‖², ‖w − w*″‖² },

where w*′, w*″ ∈ Ω*. In this case, the components of the gradient ∇V(w), which represent these partial derivatives, are discontinuous at the points w belonging to the boundary between the domains in which each of the two minimizers is the nearest one (see Fig. 2). Thereby, the requirement

‖∇V(w′) − ∇V(w″)‖ ≤ L ‖w′ − w″‖                (30)

with a Lipschitz constant L < ∞, advanced in [28], is not satisfied for all w′, w″ from R^ν. Thus, V(w) having the form (18) is indeed not admissible for studying the global convergence properties of (13) on the basis of the results of [28].

Figure 2. Illustration of the two-layer network's properties with M=1, N=1.

5. Observations

To demonstrate some asymptotic properties of (13) pointed out in the section above, several simulation experiments with the scalar nonlinear system (1) were conducted. Its nonlinearity can be exactly represented by the two-layer neural network model described by (3), (4), (6) and (8) for appropriate weights.

In all of the experiments, the learning rate was taken as η = 0.01.

Fig. 3 illustrates the results of the first simulation example, in which {x(n)} was chosen as a non-stochastic sequence. It can be observed that in this example, V(n) defined by (18) has no limit, implying that the learning algorithm (13) does not converge.

Figure 3. Behaviour of gradient learning algorithm (13) in Example 1.

In the other experiments, {x(n)} was generated as a sequence of independent identically distributed (i.i.d.) pseudo-random numbers (the stochastic cases). Namely, in the second example, the initial w(0) was taken close to Ω*. In this case, the first difference ΔV(n) of V(n) defined by (18) changed its sign (see Fig. 4). Nevertheless, V(n) tends to zero as n increases, as shown in Fig. 4. This observation supports the convergence property of (13), showing that {V(n)} is here a supermartingale.

Simulation results of the third and fourth experiments are presented in Fig. 5, left and right, respectively. The initial estimate w(0) in both examples was chosen so that the distance between w(0) and Ω* was large enough. It was observed that at the initial stage of the learning process, V(n) was increasing, with ΔV(n) > 0 for several n, as shown in Fig. 5, left. Further on, V(n) became decreasing. Such a behaviour of this sequence led to the feature that ΔV(n) ≤ 0 for all sufficiently large n.

In the fourth example, the initial w(0) was chosen to be close to that in the third example. One can observe that V(n) behaves similarly in this case (see Fig. 5, right).

It turned out that in the third and fourth simulation examples, the condition (24) is not satisfied, whereas the learning algorithm (13) nevertheless remains convergent. In these examples, instead of (24), the condition (26) takes place. This fact is demonstrated in Fig. 6.

Figure 4. Behaviour of gradient learning algorithm (13) in Example 2.
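These stochastic examples are easy to reproduce in miniature. The sketch below trains a network with one sigmoid hidden neuron (M = N = 1) on i.i.d. inputs with the constant rate η = 0.01; since the exact nonlinearity and weights of the paper's examples are not reproduced here, the target f(x) = 2σ(x) is a hypothetical choice made only because it is exactly representable by the model, as the ideal case (17) requires.

```python
import math
import random

def run_experiment(steps=20000, eta=0.01, seed=1):
    """Online gradient learning of type (13) for a one-hidden-neuron sigmoid
    network driven by i.i.d. random inputs; returns the mean squared model
    error over a fixed test grid before and after training."""
    rng = random.Random(seed)
    sigmoid = lambda s: 1.0 / (1.0 + math.exp(-s))
    f = lambda x: 2.0 * sigmoid(x)          # hypothetical representable target

    def mse(w1, b1, c, b):
        xs = [-3.0 + 6.0 * k / 99 for k in range(100)]
        return sum((c * sigmoid(w1 * x + b1) + b - f(x)) ** 2 for x in xs) / 100

    w1, b1, c, b = 0.5, 0.1, 0.5, 0.0       # initial weights
    before = mse(w1, b1, c, b)
    for _ in range(steps):
        x = rng.uniform(-3.0, 3.0)          # i.i.d. input (the stochastic case)
        h = sigmoid(w1 * x + b1)
        e = (c * h + b) - f(x)              # current model error
        d = e * c * h * (1.0 - h)           # backpropagated delta
        w1, b1 = w1 - eta * d * x, b1 - eta * d
        c, b = c - eta * e * h, b - eta * e
    return before, mse(w1, b1, c, b)
```

Running it shows the model error decreasing over the run, in line with the convergent behaviour observed in the stochastic examples.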

6. Main Results

The global stochastic convergence analysis of the gradient learning algorithm (13) is based on the fundamental convergence conditions established in the Key Technical Lemma below, which is a slightly reformulated version of Theorem 3 of [28].

Key Technical Lemma. Let V(w) be a function satisfying (29) and (30). Define the scalar variable

(31)

and denote

.

Suppose:

(i)  

(ii)  

Introduce the additional variable

(32)

Then the algorithm (13) yields  a.s. provided that  and

(33)

(34)

i.e., the limit (25) will be achieved for

Figure 5. Behaviour of gradient learning algorithm (13) in Examples 3 (left) and 4 (right).

A related result, which follows from Theorem 3′ of [28], is:

Corollary. Under the conditions of the Key Technical Lemma, if  and  and  then  a.s. provided that

(35)

is satisfied.

Now, one is able to present the first convergence result summarized in the theorem below.

Theorem 2. Suppose the assumption (17) holds. Then the gradient algorithm (13) with a constant learning rate η will converge with probability 1 in the sense that

J(w(n)) → 0  as  n → ∞  a.s.                (36)

for any randomly chosen initial w(0), if the conditions (35) with η and L specified by

(37)

(38)

are satisfied.

Proof. Set

V(w) = J(w).                (39)

Then the conditions (29) and (30) can be shown to be valid. This indicates that V(w) of the form (39) may be taken as the Lyapunov function; by virtue of (31), such a choice of V(w) specifies the corresponding variable of the Lemma. Putting η and L determined by (37) and (38), respectively, one can conclude that the conditions (i), (ii) of the Key Technical Lemma are satisfied. Applying its Corollary proves that V(w(n)) → 0 with probability 1.

Due to the definition (39) of V(w) together with the assumption (17), the result (36) follows. ∎

Figure 6. Variables   and  in Example 3.

Now, consider the general case, where f(x) cannot be exactly approximated by f̂(x, w) (as in (17)). Obviously, in this case, J* := min_w J(w) > 0, and the choice of a constant learning rate η is not appropriate [4].

The convergence results for this case are established in the following theorem.

Theorem 3. Subject to the conditions

(40)

the gradient algorithm (13) yields

J(w(n)) → J*  a.s.

provided that the learning rate η(n) is chosen according to (40), with L determined by (37).

Proof. Setting

V(w) = J(w) − J*,

it can be shown that the requirements (29) and (30) will be satisfied. It remains to verify that condition (ii) of the Key Technical Lemma holds asymptotically.

Suppose (ii) is not satisfied. Then, there is a finite limit such that

(41)

holds.

Since this limit is assumed to be finite, there exists a finite number such that the requirement (33) will be satisfied for all sufficiently large n, provided that (i) takes place and the condition (b) of (40) is satisfied (due to the fact that (b) means η(n) → 0 as n → ∞).

Further, if the assumption above holds, then the first of the series appearing in (34) diverges, whereas the second one converges (because of the validity of (a)). This gives that (34) also takes place.

Since all the conditions of the Key Technical Lemma are then satisfied, this Lemma yields V(n) → 0 a.s. Therefore, J(w(n)) → J* with probability 1. But this contradicts the assumption made via (41). Hence, this assumption is false, which proves the validity of the result given in the theorem. ∎

Remark. Introducing suitable auxiliary variables, it can be concluded that, under the conditions of Theorem 3, the features demonstrated in Fig. 7 are observed for all sufficiently large n.

Figure 7. Variables  and  in Example 3.

As can be seen, one of these variables is bounded away from zero, whereas the other is upper bounded. This gives the limit behaviour shown in Fig. 7.

Comment 3. The conditions established in Theorems 2 and 3 are sufficient to guarantee the global convergence of (13) (for any w(0)) with probability 1, both in the ideal and non-ideal cases. Under these conditions, the requirement (15) will obviously be satisfied (the final result). Again, the essential feature of this result lies in the fact that these convergence properties can be achieved without adding a penalty term to E_n, as in [16].

Of course, the calculation of the quantities needed for choosing a suitable constant learning rate η according to (37), (38) seems to be hard. Meanwhile, η may be replaced by a time-varying η(n) satisfying the requirements (35) if necessary. Note that such learning rates are usual in stochastic learning theory [7].
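A standard choice of such a time-varying rate is a harmonic schedule; a minimal sketch (the constant `eta0` is an arbitrary illustrative value):

```python
def eta_schedule(n, eta0=0.1):
    """Time-varying learning rate eta(n) = eta0/(n + 1): its sum over n
    diverges while the sum of its squares converges -- the usual pair of
    stochastic-approximation requirements on a decreasing step size."""
    return eta0 / (n + 1)
```

Replacing the constant η in (13) by `eta_schedule(n)` at step n gives an update whose step sizes satisfy requirements of this type.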

7. Conclusion

The main contribution of this paper consists in the theoretical and experimental study of the asymptotic properties of standard online gradient algorithms applicable to learning neural networks in a stochastic framework. Namely, new sufficient conditions for the global convergence of these algorithms have been established. It was shown that adding a penalty term to the current error function is indeed not necessary to guarantee their convergence properties. Further work will address the asymptotic behaviour of online gradient learning algorithms in the presence of noise, whose importance was pointed out in [4].

Acknowledgment

The authors would like to thank Prof. S.V. Kovalevskyy for his invitation to submit this work.


References

  1. J. Suykens and B. D. Moor, "Nonlinear system identification using multilayer neural networks: some ideas for initial weights, number of hidden neurons and error criteria," in Proc. 12th IFAC World Congress, vol. 3, Sydney, Australia, July 1993, pp. 49–52.
  2. E. S. Kosmatopoulos, M. M. Polycarpou, M. A. Christodoulou, and P. A. Ioannou, "High-order neural network structures for identification of dynamical systems," IEEE Trans. on Neural Networks, vol. 6, pp. 422–431, 1995.
  3. A. U. Levin and K. S. Narendra, "Recursive identification using feedforward neural networks," Int. J. Contr., vol. 61, pp. 533–547, 1995.
  4. Ya. Z. Tsypkin, J. D. Mason, E. D. Avedyan, K. Warwick, and I. K. Levin, "Neural networks for identification of nonlinear systems under random piecewise polynomial disturbances," IEEE Trans. on Neural Networks, vol. 10, pp. 303–311, 1999.
  5. G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control, Signals, Syst., vol. 2, pp. 303–313, 1989.
  6. K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 182–192, 1989.
  7. Ya. Z. Tsypkin, Adaptation and Learning in Automatic Systems, New York: Academic Press, 1971.
  8. L. Behera, S. Kumar, and A. Patnaik, "On adaptive learning rate that guarantees convergence in feedforward networks," IEEE Trans. on Neural Networks, vol. 17, pp. 1116–1125, 2006.
  9. H. White, "Some asymptotic results for learning in single hidden-layer neural network models," J. Amer. Statist. Assoc., vol. 84, pp. 117–134, 1987.
  10. C. M. Kuan and K. Hornik, "Convergence of learning algorithms with constant learning rates," IEEE Trans. on Neural Networks, vol. 2, pp. 484–489, 1991.
  11. Z. Luo, "On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks," Neural Comput., vol. 3, pp. 226–245, 1991.
  12. W. Finnoff, "Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima," Neural Comput., vol. 6, pp. 285–295, 1994.
  13. A. A. Gaivoronski, "Convergence properties of backpropagation for neural nets via theory of stochastic gradient methods," Optim. Methods Software, vol. 4, pp. 117–134, 1994.
  14. T. L. Fine and S. Mukherjee, "Parameter convergence and learning curves for neural networks," Neural Comput., vol. 11, pp. 749–769, 1999.
  15. V. Tadic and S. Stankovic, "Learning in neural networks by normalized stochastic gradient algorithm: local convergence," in Proc. 5th Seminar Neural Netw. Appl. Electr. Eng., Yugoslavia, Sept. 2000, pp. 11–17.
  16. H. Zhang, W. Wu, F. Liu, and M. Yao, "Boundedness and convergence of online gradient method with penalty for feedforward neural networks," IEEE Trans. on Neural Networks, vol. 20, pp. 1050–1054, 2009.
  17. O. L. Mangasarian and M. V. Solodov, "Serial and parallel backpropagation convergence via nonmonotone perturbed minimization," Optim. Methods Software, pp. 103–106, 1994.
  18. W. Wu, G. Feng, and X. Li, "Training multilayer perceptrons via minimization of ridge functions," Advances in Comput. Mathematics, vol. 17, pp. 331–347, 2002.
  19. N. Zhang, W. Wu, and G. Zheng, "Convergence of gradient method with momentum for two-layer feedforward neural networks," IEEE Trans. on Neural Networks, vol. 17, pp. 522–525, 2006.
  20. W. Wu, G. Feng, X. Li, and Y. Xu, "Deterministic convergence of an online gradient method for BP neural networks," IEEE Trans. on Neural Networks, vol. 16, pp. 1–9, 2005.
  21. Z. B. Xu, R. Zhang, and W. F. Jing, "When does online BP training converge?" IEEE Trans. on Neural Networks, vol. 20, pp. 1529–1539, 2009.
  22. H. Shao, W. Wu, and L. Liu, "Convergence and monotonicity of an online gradient method with penalty for neural networks," WSEAS Trans. Math., vol. 6, pp. 469–476, 2007.
  23. S. W. Ellacott, "The numerical analysis approach," in Mathematical Approaches to Neural Networks (J. G. Taylor, ed.), Elsevier Science Publishers B.V., pp. 103–137, 1993.
  24. F. P. Skantze, A. Kojic, A. P. Loh, and A. M. Annaswamy, "Adaptive estimation of discrete time systems with nonlinear parameterization," Automatica, vol. 36, pp. 1879–1887, 2000.
  25. M. Loeve, Probability Theory, New York: Springer-Verlag, 1963.
  26. L. S. Zhiteckii, V. N. Azarskov, and S. A. Nikolaienko, "Convergence of learning algorithms in neural networks for adaptive identification of nonlinearly parameterized systems," in Proc. 16th IFAC Symposium on System Identification, Brussels, Belgium, 2012, pp. 1593–1598.
  27. V. N. Azarskov, L. S. Zhiteckii, and S. A. Nikolaienko, "Sequential learning processes in neural networks applied as models of nonlinear systems," Electronics and Control Systems, no. 3(37), pp. 124–132, 2013.
  28. B. T. Polyak, "Convergence and convergence rate of iterative stochastic algorithms, I: general case," Autom. Remote Control, vol. 12, pp. 1858–1868, 1976.
  29. G. C. Goodwin and K. S. Sin, Adaptive Filtering, Prediction and Control, Englewood Cliffs, NJ: Prentice-Hall, 1984.
