Improvement of Echo State Network Generalization by Selective Ensemble Learning Based on BPSO
Xiaodong Zhang, Xuefeng Yan
Key Laboratory of Advanced Control and Optimization for Chemical Processes of Ministry of Education, East China University of Science and Technology, Shanghai, P. R. China
To cite this article:
Xiaodong Zhang, Xuefeng Yan. Improvement of Echo State Network Generalization by Selective Ensemble Learning Based on BPSO. Automation, Control and Intelligent Systems. Vol. 4, No. 6, 2016, pp. 84-88. doi: 10.11648/j.acis.20160406.11
Received: November 14, 2016; Accepted: November 21, 2016; Published: December 1, 2016
Abstract: The echo state network (ESN) is a special type of recurrent neural network that has become increasingly popular in machine learning domains such as time series forecasting, data clustering, and nonlinear system identification. It is characterized by a large, randomly constructed recurrent neural network (RNN) called the "reservoir", in which the neurons are sparsely connected and the weights remain unchanged during training, so that only the output layer needs to be trained. However, the reservoir is criticized for its randomness and instability, which stem from the random initialization of its connectivity and weights. In this article, we introduce selective ensemble learning based on binary particle swarm optimization (BPSO) to improve the generalization performance of the ESN. Two widely studied tasks are used to demonstrate the feasibility and superiority of the selective ESN ensemble based on BPSO (SESNE-BPSO). The results indicate that the SESNE-BPSO model performs better than the general ESN ensemble, the single standard ESN, and several other improved ESN models.
Keywords: Echo State Network, Reservoir Computing, Artificial Neural Network, Ensemble Learning, Selective Ensemble, Particle Swarm Optimization
1. Introduction
In recent years, reservoir computing (RC) [1,2] has been extensively studied in the machine learning community as a novel approach to training recurrent neural networks (RNNs). The RC approach relies on a large, randomly constructed RNN called the "reservoir", wherein the neurons are sparsely connected and the weights remain unchanged during training. With this approach, only the weights from the reservoir to the readout layer require training, which is done through linear regression methods. Therefore, the RC approach has numerous advantages, such as high modeling accuracy, strong modeling capacity, and low computational complexity. The echo state network (ESN) [3,4], liquid state machines, and Evolino are some examples of the RC approach. In this paper, we discuss the most popular form of RC, the ESN.
The ESN is characterized by a large reservoir (generally 100 to 1000 neurons) that converts the input data into a high-dimensional dynamic state space, which can be regarded as an "echo" of the recent input history. The ESN has been applied in a wide range of domains, such as nonlinear system identification and time series prediction [8,9]. However, one flaw of the ESN is that the properties of its reservoir are poorly understood. The randomly generated connectivity and weight structure of the internal neurons in the reservoir may lead to randomness and instability in the prediction performance of the ESN. Nevertheless, random and unstable prediction is not always a disadvantage for a machine learning algorithm. In ensemble learning, one of the most popular machine learning approaches, the randomness and diversity of the individual learners contribute to the generalization performance of the ensemble. Therefore, ensemble learning is introduced to the ESN model to address this problem.
Ensemble learning [11-13] is a machine learning approach that improves learning performance by training multiple component learners to solve the same task. The final output of the ensemble is the average of the outputs of all individual learners. Ensemble learning has been widely recognized to provide better generalization performance than a single component learner. Its effectiveness can be explained by the bias and variance decomposition of the ensemble error: ensemble learning can reduce both the bias and the variance of the ensemble error. As previous studies have shown, the trade-off between the accuracy and the diversity of the individuals is the key to improving the generalization performance of an ensemble. However, should all the trained individual networks be included in the ensemble? Zhou et al. proposed that a selected subset of the individuals can be more effective than an ensemble of all of them. The selective ensemble, which combines diverse individuals selected from a large pool of accurate trained networks, has been proved effective both theoretically and practically.
One of the most important steps in building a selective ensemble is selecting the diverse individuals from a number of accurate trained networks, which can be regarded as a feature selection problem. Several classical feature selection methods, such as forward selection and backward elimination, can be applied to select the most effective subset of individuals. However, most of these methods are greedy search algorithms, which tend to stagnate in local optima. Evolutionary computation techniques, by contrast, are well known for their global search ability. Compared with genetic algorithms (GA), particle swarm optimization (PSO) has advantages such as fewer parameters and faster convergence. Additionally, deciding whether each individual is selected into the ensemble is a discrete optimization problem. Therefore, a discrete binary version of PSO, called binary particle swarm optimization (BPSO), is introduced to solve this binary combinatorial optimization problem.
In this paper, the selective ensemble based on the BPSO algorithm is introduced to the ESN to improve its generalization performance. To our knowledge, this is the first time that a selective ensemble algorithm has been applied to the ESN.
2. Echo State Network
2.1. Architecture of the ESN
The ESN is a kind of RNN whose structure can be divided into three sections: a linear input layer with $K$ input neurons, a large and fixed RNN with $N$ internal neurons, and a linear readout layer with $L$ output neurons. The fixed RNN part, in which the neurons are sparsely connected and the weights remain unchanged during training, is called the "reservoir". Fig. 1 illustrates the basic structure of the ESN.
The states of the internal neurons and the output variables at a specific time point are expressed as follows:
$$x(t+1) = f\left(W^{in} u(t+1) + W x(t) + W^{back} y(t)\right)$$
$$y(t+1) = \left(W^{out}\right)^{T} x(t+1)$$
where $f$ is the internal neuron activation function (typically a tanh sigmoid function), and $u(t)$, $x(t)$, and $y(t)$ are the input variable, internal neuron state, and output variable at a specified time step $t$, respectively. $W^{in}$ is the $N \times K$ matrix of input weights to the reservoir; $W$ is the $N \times N$ matrix of internal connection weights of the reservoir; $W^{out}$ is the $N \times L$ matrix of output (readout) weights from the reservoir; and $W^{back}$ is the $N \times L$ matrix of feedback weights from the output to the reservoir. The reservoir state is initialized as a zero vector. The superscript $T$ represents the transpose.
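As a rough illustration of the reservoir update described above, the following NumPy sketch drives a small reservoir and collects its states. The function names `esn_step` and `run_reservoir`, the matrix sizes, the weight ranges, and the use of teacher-forced outputs as the feedback signal are assumptions made for this example, not details given in the paper.

```python
import numpy as np

def esn_step(x, u_next, y_prev, W_in, W, W_back, f=np.tanh):
    """One reservoir update: x(t+1) = f(W_in u(t+1) + W x(t) + W_back y(t))."""
    return f(W_in @ u_next + W @ x + W_back @ y_prev)

def run_reservoir(inputs, targets, W_in, W, W_back):
    """Drive the reservoir and return the (n, N) matrix of visited states.

    inputs has shape (n, K) and targets has shape (n, L). Feeding the
    target back as y(t) (teacher forcing) is an assumption of this sketch.
    """
    n, N = inputs.shape[0], W.shape[0]
    x = np.zeros(N)                        # reservoir starts from the zero state
    y_prev = np.zeros(targets.shape[1])    # no feedback before the first step
    states = np.empty((n, N))
    for t in range(n):
        x = esn_step(x, inputs[t], y_prev, W_in, W, W_back)
        states[t] = x
        y_prev = targets[t]
    return states

# Toy dimensions and weights, chosen only for illustration.
rng = np.random.default_rng(0)
K, N, L = 1, 20, 1
W_in = rng.uniform(-0.5, 0.5, (N, K))
W = rng.uniform(-0.5, 0.5, (N, N)) * 0.1
W_back = np.zeros((N, L))
u = rng.uniform(0.0, 0.5, (50, K))
d = rng.uniform(0.0, 0.5, (50, L))
states = run_reservoir(u, d, W_in, W, W_back)
```

Each row of `states` is one reservoir state; stacking them row by row gives the state matrix used when the readout is trained.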
2.2. Training of the ESN
As discussed above, the input weights $W^{in}$, the internal connection weights $W$, and the feedback weights $W^{back}$ are fixed matrices generated in advance from stochastic values drawn from a uniform distribution, which means that the only trainable matrix is the output weight matrix $W^{out}$. For the ESN to maintain the "echo state property", which means that the internal neuron state is a nonlinear transformation of the entire history of the input signal, the spectral radius of the internal connection weights should be set to less than 1. Thus $W$ is generally scaled by $\alpha / |\lambda_{\max}|$, where $|\lambda_{\max}|$ is the spectral radius of the unscaled $W$ and $\alpha$ is a scaling parameter between 0 and 1.
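The spectral-radius scaling can be sketched as follows. The helper name `make_reservoir`, the connection density of 0.05, the weight range, and the value 0.8 for the scaling parameter are illustrative assumptions; the paper does not state the values it used.

```python
import numpy as np

def make_reservoir(N, density=0.05, alpha=0.8, seed=0):
    """Random sparse reservoir rescaled so that its spectral radius is alpha."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (N, N))
    W[rng.random((N, N)) > density] = 0.0       # keep roughly `density` of the links
    rho = np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius of the unscaled W
    if rho == 0.0:
        raise ValueError("reservoir has zero spectral radius; redraw the weights")
    return W * (alpha / rho)

W = make_reservoir(100)
rho_scaled = np.max(np.abs(np.linalg.eigvals(W)))
```

After scaling, `rho_scaled` equals `alpha` up to floating-point error, so the reservoir satisfies the condition that the spectral radius is below 1.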
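The internal neuron states obtained during the training process are collected in the state matrix
$$X = \left[x(1), x(2), \ldots, x(n)\right]^{T}$$
and the corresponding output data stream is collected in the matrix
$$Y = \left[y(1), y(2), \ldots, y(n)\right]^{T}$$
where $n$ represents the number of training samples. Consequently, the output matrix $W^{out}$ to be adjusted during training should solve a linear regression problem:
$$X W^{out} = Y$$
The common method uses the least-squares solution:
$$W^{out} = \arg\min_{W} \left\| X W - Y \right\|$$
where $\|\cdot\|$ denotes the Euclidean norm, and the desired $W^{out}$ is calculated by the following equation:
$$W^{out} = X^{\dagger} Y = \left(X^{T} X\right)^{-1} X^{T} Y$$
This is implemented by the pseudo-inverse algorithm.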
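A minimal sketch of the readout training, assuming the reservoir states are stacked row by row in a matrix `X` and the targets in a matrix `Y`. The function name `train_readout` and the synthetic data are hypothetical; `numpy.linalg.pinv` computes the Moore-Penrose pseudo-inverse.

```python
import numpy as np

def train_readout(X, Y):
    """Least-squares output weights via the pseudo-inverse of the state matrix."""
    return np.linalg.pinv(X) @ Y

# Synthetic check: recover known weights from noise-free data.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))   # 200 collected states of a 10-neuron reservoir
W_true = rng.standard_normal((10, 1))
Y = X @ W_true
W_out = train_readout(X, Y)
```

With noise-free data and more samples than neurons, `W_out` matches `W_true` up to rounding, which is a quick sanity check before using the routine on real reservoir states.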
3. Selective ESN Ensembles Based on BPSO
3.1. Review of BPSO
Particle swarm optimization (PSO) was first proposed by Kennedy and Eberhart to solve numerical optimization problems. As an evolutionary computation technique, PSO uses a population of particles that imitates the behavior of bird flocks to search for the best solution to a problem. Each particle represents a candidate solution. A discrete binary version of PSO (BPSO) was subsequently proposed in 1997 to solve combinatorial optimization problems. In BPSO, each dimension of a particle’s position is limited to 0 or 1. The velocity and position of each particle are updated according to Eq. (9)-(11):
$$v_{id} = w\,v_{id} + c_{1}\,\mathrm{rand}()\,(p_{id} - x_{id}) + c_{2}\,\mathrm{rand}()\,(p_{gd} - x_{id}) \qquad (9)$$
$$\text{if } \mathrm{rand}() < S(v_{id}) \text{ then } x_{id} = 1, \text{ else } x_{id} = 0 \qquad (10)$$
$$S(v_{id}) = \frac{1}{1 + e^{-v_{id}}} \qquad (11)$$
where $\mathrm{rand}()$ represents a random function on the domain [0,1]; $p_{id}$ denotes the personal best of the $i$-th particle and $p_{gd}$ denotes the global best of the swarm; $w$, $c_{1}$, and $c_{2}$ are the parameters of the algorithm; $x_{id}$ denotes the position of particle $i$ in dimension $d$; and $v_{id}$ represents the velocity of particle $i$ in dimension $d$, which is limited to the range $[-V_{\max}, V_{\max}]$.
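The BPSO update for a single particle can be sketched in plain Python as follows. The function name `bpso_update` and the default values for the inertia weight `w`, the coefficients `c1` and `c2`, and the velocity limit `v_max` are assumptions chosen for illustration, not settings reported in the paper.

```python
import math
import random

def sigmoid(v):
    """Map a velocity to the probability that the bit becomes 1."""
    return 1.0 / (1.0 + math.exp(-v))

def bpso_update(position, velocity, pbest, gbest,
                w=0.8, c1=2.0, c2=2.0, v_max=4.0, rng=random):
    """Return the new (position, velocity) of one particle after one BPSO step."""
    new_pos, new_vel = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))               # clamp to [-v_max, v_max]
        x = 1 if rng.random() < sigmoid(v) else 0    # stochastic bit assignment
        new_pos.append(x)
        new_vel.append(v)
    return new_pos, new_vel

rng = random.Random(0)
pos, vel = bpso_update([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 1], rng=rng)
```

Clamping the velocity keeps the sigmoid away from 0 and 1, so every bit retains some chance of flipping and the swarm keeps exploring.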
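In this section, the selective ESN ensemble based on BPSO (SESNE-BPSO) is described in detail. For the selective ensemble problem, each dimension of a particle’s position takes the value 1 or 0 to denote whether the corresponding originally generated ESN is selected or not. The position of a particle thus denotes the selection status of the whole ensemble, and the dimension of each particle equals the size of the originally generated ensemble:
$$E^{*} = \left\{\, \mathrm{ESN}_{d} \in E : x_{id} = 1 \,\right\}$$
where $E^{*}$ and $E$ denote the selective ensemble and the originally generated ensemble, respectively, and $x_{id}$ is the $d$-th dimension of the position of particle $i$.
The objective function to be optimized is the normalized root mean square error (NRMSE):
$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{t=1}^{n} \left(y_{\mathrm{target}}(t) - y(t)\right)^{2}}{n\,\sigma^{2}}}$$
where $y_{\mathrm{target}}(t)$ is the desired output (target), $y(t)$ is the output, $\sigma^{2}$ is the variance of $y_{\mathrm{target}}$, and $n$ is the total number of values of $y_{\mathrm{target}}$.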
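A small sketch of the error measure and of the averaged output of a selected subset, in plain Python. The names `nrmse` and `ensemble_output`, the use of the population variance of the target, and the toy numbers are assumptions made for this example.

```python
import math

def nrmse(target, output):
    """Root mean square error normalized by the variance of the target."""
    n = len(target)
    mean = sum(target) / n
    var = sum((d - mean) ** 2 for d in target) / n   # population variance (assumed)
    mse = sum((d - y) ** 2 for d, y in zip(target, output)) / n
    return math.sqrt(mse / var)

def ensemble_output(predictions, mask):
    """Average the predictions of the ESNs whose mask bit is 1."""
    chosen = [p for p, bit in zip(predictions, mask) if bit == 1]
    if not chosen:
        raise ValueError("at least one ESN must be selected")
    return [sum(vals) / len(chosen) for vals in zip(*chosen)]

target = [0.0, 1.0, 2.0, 3.0]
preds = [[0.0, 1.0, 2.0, 3.0],    # a perfect individual
         [1.0, 2.0, 3.0, 4.0],    # a biased individual
         [3.0, 3.0, 3.0, 3.0]]    # a constant individual
err_two = nrmse(target, ensemble_output(preds, [1, 1, 0]))
err_mean = nrmse(target, [1.5] * 4)
```

Predicting the mean of the target everywhere gives an NRMSE of 1, so values well below 1 indicate that a model captures structure beyond the mean.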
The procedure for the SESNE-BPSO can be summarized as follows:
(1) All the data are divided into three parts: the training set, the validation set, and the testing set.
(2) Generate the individual ESNs of the original ensemble $E$, with the input weights and the internal connection weights initialized at random values. The other parameters of the standard ESN, such as the sparse degree of the reservoir, the spectral radius, and the input extension, are determined on the validation set.
(3) Train each generated ESN on the training data using the algorithm described in Section 2.2.
(4) Choose the NRMSE of the selected ensemble $E^{*}$ as the objective function, and select $E^{*}$ from the original ensemble $E$ by minimizing this error on the validation set with BPSO.
(5) Retain the selected ensemble with the best validation performance as the final model.
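The selection in steps (4) and (5) might be sketched as below, with noisy copies of a sine wave standing in for the validation outputs of trained ESNs. All names, the swarm settings, and the toy data are hypothetical. Seeding one particle with the full ensemble is a design choice of this sketch, not something stated in the paper; it guarantees that the selected subset is never worse on the validation set than averaging all members.

```python
import math
import random

def nrmse(target, output):
    n = len(target)
    mean = sum(target) / n
    var = sum((d - mean) ** 2 for d in target) / n
    return math.sqrt(sum((d - y) ** 2 for d, y in zip(target, output)) / n / var)

def fitness(mask, preds, target):
    """Validation NRMSE of the averaged output of the selected members."""
    chosen = [p for p, bit in zip(preds, mask) if bit]
    if not chosen:
        return float("inf")                      # an empty ensemble is invalid
    avg = [sum(vals) / len(chosen) for vals in zip(*chosen)]
    return nrmse(target, avg)

def select_ensemble(preds, target, n_particles=10, n_iter=30,
                    w=0.8, c1=2.0, c2=2.0, v_max=4.0, seed=0):
    rng = random.Random(seed)
    dim = len(preds)
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    pos[0] = [1] * dim                           # assumed: seed with the full ensemble
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, preds, target) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i], preds, target)
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Toy stand-in for validation outputs of 8 trained ESNs with different accuracy.
data_rng = random.Random(1)
target = [math.sin(0.3 * t) for t in range(100)]
noise_levels = [0.05, 0.1, 0.2, 0.4, 0.8, 0.05, 0.1, 0.8]
preds = [[d + data_rng.gauss(0.0, s) for d in target] for s in noise_levels]
mask, err = select_ensemble(preds, target)
full_err = fitness([1] * len(preds), preds, target)
```

In an actual run, `preds` would hold the validation outputs of the 20 trained ESNs, and the returned `mask` would then be applied unchanged to their testing outputs.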
4. Experiment and Result
4.1. Experimental Setup
In this section, the proposed SESNE-BPSO method is evaluated using two extensively studied tasks taken from previous literature on the ESN. The model performance is evaluated by the percentage of the NRMSE. The results of the proposed SESNE-BPSO are compared with those of the general ESN ensemble (ESN-En), which combines all the originally generated ESNs, the single standard ESN, and two other improved ESN models. The number of originally generated ESNs is 20 for both of the following experiments.
4.2. Experiment Tasks and Results
A) NARMA system
The 10th-order nonlinear autoregressive moving average (NARMA) system is described by the following equation:
$$y(t+1) = 0.3\,y(t) + 0.05\,y(t) \sum_{i=0}^{9} y(t-i) + 1.5\,u(t-9)\,u(t) + 0.1$$
where $y(t)$ denotes the NARMA system output at time $t$ and $u(t)$ represents the system input at time $t$, which is an independent, identically distributed stream of values generated uniformly from [0, 0.5]. The NARMA system identification task has been described by Jaeger; the ESN is trained to output $y(t)$ based on $u(t)$. Modeling the NARMA system is generally difficult because the system is strongly nonlinear and requires a substantially long memory to accurately reproduce the output. The current output of the system is determined by both the input data and the previous output data from up to 10 steps ago. The NARMA data set used in this experiment contains 6,000 values, which are divided into 3 parts. The first part is the training data set with 2,000 values, the second part is the validation data set with 2,000 values, and the third part is the testing data set with the remaining 2,000 values. The first 100 values of each part are used to wash out the initial memory of the dynamic reservoir. The reservoir size ($N$) of the ESNs for this task is set to 100. The experiments are performed 10 times because of the random initialization of the ESN. The testing NRMSE of the single standard ESN, ESN-En, and SESNE-BPSO is displayed in Table 1. Mean represents the mean value of the NRMSE, SD stands for its standard deviation, Max indicates its maximum value, and Min stands for its minimum value.
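For readers who want to reproduce the benchmark data, the following sketch generates the commonly used 10th-order NARMA series and splits it as described in the text. The function name `narma10`, the zero initial outputs, and the seed are assumptions of this example.

```python
import random

def narma10(n, seed=0):
    """Generate n inputs u(t) ~ U[0, 0.5] and the 10th-order NARMA outputs y(t)."""
    rng = random.Random(seed)
    u = [rng.uniform(0.0, 0.5) for _ in range(n)]
    y = [0.0] * n                        # the first 10 outputs are left at zero
    for t in range(9, n - 1):
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * sum(y[t - i] for i in range(10))
                    + 1.5 * u[t - 9] * u[t]
                    + 0.1)
    return u, y

u, y = narma10(6000)
train, valid, test = slice(0, 2000), slice(2000, 4000), slice(4000, 6000)
washout = 100                            # leading values of each part used as washout
u_train, y_train = u[train], y[train]
```

The validation and testing parts are obtained the same way with `u[valid]`, `y[valid]`, `u[test]`, and `y[test]`.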
B) Laser Time Series
The laser chaotic time series data used in this prediction task is a real-world sequence obtained from the Santa Fe Competition by sampling the intensity of a far-infrared laser in a chaotic regime. The task is to forecast the next value $y(t+1)$ (one-step-ahead forecast) from the history values up to time $t$. Laser time series prediction is generally difficult because of the numerical round-off noise and the diverse time scales of the series, especially in the breakdown events of the sequence. The laser data set used in this experiment contains 10,000 values, which are divided into 3 parts. The first part is the training data set with 6,000 values, the second part is the validation data set with 2,000 values, and the third part is the testing data set with the remaining 2,000 values. The first 1000 values of each part are also used to wash out the initial memory of the dynamic reservoir. This laser series prediction task needs feedback connections. The reservoir size ($N$) of the ESNs for this task is also set to 100. The bias input is a constant value of 0.02. The experiments are conducted 10 times because of the random initialization of the ESN. The testing performance of the single standard ESN, ESN-En, and SESNE-BPSO is displayed in Table 2.
To validate the performance of the proposed SESNE-BPSO model, other improved ESN models, such as the L2-Boost ESN and the scale-free highly clustered ESN (SFHC-ESN), are included for comparison. The results of the comparison are presented in Table 3.
The results in Table 1 and Table 2 indicate that the ESN-En model clearly improves the generalization performance compared with the standard ESN, and that the proposed SESNE-BPSO outperforms the ESN-En. Furthermore, according to Table 3, SESNE-BPSO performs better than several other improved ESN models. These results illustrate that selective ensemble learning based on the BPSO algorithm improves the generalization performance of the ESN ensemble.
5. Conclusion
In this paper, a novel ESN ensemble called SESNE-BPSO is proposed. Ensemble learning is introduced to improve the generalization performance of the ESN model. The diversity of the individual ESNs in the ensemble is one of the key factors in reducing the ensemble generalization error, and diverse ESNs arise naturally from the random initialization of the input and internal weights. Selective ensemble learning based on the BPSO algorithm is then applied to further improve the performance of the ESN ensemble. Two widely used tasks are performed to test the proposed SESNE-BPSO model. The results indicate that SESNE-BPSO performs better than the general ESN ensemble, the standard ESN, and other improved ESN models. Consequently, the findings demonstrate the feasibility and superiority of applying selective ensemble learning based on BPSO to the ESN.
Acknowledgements
The authors gratefully acknowledge the support of the following foundations: 973 project of China (2013CB733605).