A Survey of Generative Adversarial Networks Based on Encoder-Decoder Model

Abstract: The generative model has become a very important type of model in the field of artificial intelligence in recent years. Such models comprehend data through neural networks, then create new data according to the probability distribution of the input data and predict results according to the data characteristics, so that the whole processing pipeline is "intelligent". At present, two typical applications of the generative model are the Generative Adversarial Networks (GAN) model and the Encoder-Decoder model, both of which have a strong ability to generate image data. In the GAN model, the generator simulates the distribution of the real data while the discriminator tries to distinguish real data from generated data.


Introduction
In recent years, the remarkable improvement of computer processing power and the explosive growth of data across industries have driven the rapid development of deep learning. Among the resulting techniques, applying generative models to data processing is currently the most promising approach [1,2]. The Generative Adversarial Networks (GAN) model [3] and the Encoder-Decoder model [4] are two successful applications built on the generative model.
The GAN model consists of a Generator and a Discriminator. The purpose of the Generator is to learn the real data distribution as closely as possible, while the purpose of the Discriminator is to determine whether the input data comes from the real data or from the Generator. To improve the generation ability of the Generator, its generated data should not be distinguishable by the Discriminator; to improve its discriminating ability, the Discriminator must judge ever more accurately whether the input data is real or generated from noise. Both need to be continuously optimized to reach a Nash equilibrium [3,5]. GAN can effectively solve the problem of generating interpretable data: in particular, for high-dimensional data, the neural networks it uses limit neither the dimension nor the type of the data, which greatly broadens the scope of sample selection for data generation. At the same time, the neural network can integrate a variety of loss functions, increasing the freedom of design [6].
GAN transforms a random noise vector drawn from one probability distribution into a sample from the probability distribution of the real data set. GAN creatively trains two neural networks jointly, and the training process requires no approximate inference, which reduces the training difficulty and improves training efficiency.
Encoder-Decoder is a framework whose most prominent feature is that it is end-to-end [7]. The Encoder converts the input data into a fixed-length intermediate vector, and the Decoder converts that intermediate vector into the output data. Encoder and Decoder are both neural network models and can adopt any neural network architecture according to the application requirements.
In the Encoder-Decoder framework, the dimension of the intermediate vector is generally much smaller than that of the input data, so the Encoder can be used for dimensionality reduction. Noise can then be added to the intermediate vector and the framework trained to restore the original input data as faithfully as possible [8]. In this way, the intermediate vector captures the deep features that can be extracted from the original input data.
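As a concrete illustration (not taken from the cited works), the following minimal PyTorch sketch perturbs the intermediate vector with Gaussian noise and trains the network to restore the clean input; all layer sizes and the noise level are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input to a lower-dimensional intermediate vector.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: reconstruct the original input from the intermediate vector.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x, noise_std=0.1):
        z = self.encoder(x)                            # dimensionality reduction
        z_noisy = z + noise_std * torch.randn_like(z)  # perturb the intermediate vector
        return self.decoder(z_noisy)                   # try to restore the clean input

model = Autoencoder()
x = torch.rand(16, 784)                                # a dummy batch
loss = nn.functional.mse_loss(model(x), x)             # minimize reconstruction difference
loss.backward()
```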
Intuitively, the two models have clear relative advantages and disadvantages. GAN needs no pre-modeling, that is, no pre-hypothesized data distribution, and its loss function is easy to design: as long as there is a standard, the task can be handed to the Discriminator for adversarial training, finally yielding the generated data. In Encoder-Decoder, by contrast, the generated data and the original data must follow the same distribution. However, GAN cannot directly compare the difference between the generated data and the original data, whereas Encoder-Decoder can.
Based on this, scholars have proposed methods of combining GAN with Encoder-Decoder and realized several applications. This paper revolves around the current mainstream applications of Encoder-Decoder within GAN. Chapter 1 outlines the basic theories and concepts of the two models. Chapter 2 reviews the variational-inference methods that apply Encoder-Decoder to the Generator structure based on probability theory. Chapter 3 introduces the energy-function methods that apply Encoder-Decoder to the Discriminator structure based on the energy model. Chapter 4 introduces the association-transformation methods that combine GAN and Encoder-Decoder for differently distributed data. Chapter 5 summarizes these methods.

Basic Concepts
The theory and model of GAN were proposed by Ian Goodfellow et al. In GAN, the Generator and Discriminator can be any differentiable functions, represented here by $G$ and $D$. The noise data is represented by $z$, and $G(z)$ represents the data generated by $G$ to match, as far as possible, the probability distribution of the real data $x$. The Generator continuously learns the probability distribution of the real data; its goal is to map the input random noise onto the distribution of the real data so that the Discriminator cannot tell them apart. The Discriminator judges whether the received data is real data $x$ or data $x'$ generated by the Generator from the noise $z$, distinguishing the "true" data from the "false" data produced by the generative model. The goal of $D$ is thus a two-class discrimination of the data source: if the data comes from the real data $x$, it is judged "true" and labeled 1; if it comes from $G(z)$, it is judged "false" and labeled 0. Let $p_g(x)$ denote the probability distribution of the generated data $G(z)$ and $p_{data}(x)$ the probability distribution of the real data $x$. The optimized loss function is defined as:

$$V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

and:

$$G^* = \arg\min_G \max_D V(D, G)$$

The goal of the Generator is to make its generated data $G(z)$ indistinguishable from the real data when judged by the Discriminator.
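A minimal sketch of how this objective is optimized in practice, alternating between the two players (PyTorch; the network sizes, learning rates, and the common non-saturating generator loss are illustrative assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    batch = x_real.size(0)
    z = torch.randn(batch, 64)

    # Discriminator update: maximize log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    loss_d = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator update: push D's output on generated data toward 1.
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()

train_step(torch.rand(16, 784))
```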
When updating the parameters of the discriminative model, $D$ wants $D(x)$ to be as large as possible for real data, and for the generated data $G(z)$, the larger $\log(1 - D(G(z)))$ the better; hence $\max_D$ must be solved. When updating the parameters of the generative model, $G$ wants $D(G(z))$ to be as large as possible, which makes $V(D, G)$ smaller; that is, $\min_G$ is required for the generative model. These two adversarial optimization processes make the performance of the Generator and Discriminator improve constantly. The ideal final state is reached when the Discriminator can no longer correctly determine the data source, at which point the Generator can be considered to have learned the real data distribution. The basic structure of GAN is shown in Figure 1.

The idea and structure of Encoder-Decoder were proposed by Ilya Sutskever et al. [9]. Encoder-Decoder can be understood as the process "encoding → intermediate vector → decoding". The input data $x$ is encoded by the Encoder to obtain the code vector $c$, and the output data $x'$ is then obtained after Decoder processing. Neural network algorithms usually serve as the structures of the Encoder and Decoder. The model passes the input vector $x$ through one neural network in the Encoder to obtain an intermediate vector $c$, which is a dimensionality-reduction process. Then another neural network in the Decoder decodes and reconstructs the dimensions to obtain generated data $x'$ similar to the input data. By comparing the two data sets $x$ and $x'$ and minimizing the difference between them, the parameters of the Encoder and Decoder are trained. The basic framework of Encoder-Decoder is shown in Figure 2. Encoder-Decoder is an end-to-end model: training continuously improves the similarity between $x'$ and $x$, i.e., it minimizes the reconstruction loss $\mathcal{L}$.
Because the Encoder-Decoder model and the GAN model are based on different theories, their data processing methods and results differ.
(1) The data generated by GAN comes from random noise, so mode collapse is possible. If GAN is to generate specific details, it must traverse the entire distribution of the input data to determine which part of the input determines those details [10]. The Encoder-Decoder, through its encoding process, can extract features of the same type as the input data and then generate the desired content from specific noise, so that the generated data and the input data share the same probability distribution [11].
(2) Training a GAN model requires reaching a Nash equilibrium, and gradient descent guarantees a Nash equilibrium only for convex functions; convergence is also difficult when the input data is discrete or sparse [12]. Although some GAN variants can mitigate these problems, GAN remains harder to train than Encoder-Decoder.
(3) GAN uses an adversarial network, while Encoder-Decoder optimizes a surrogate objective rather than sample fidelity itself, which results in data generated by Encoder-Decoder being much less identifiable than data generated by GAN [13].
(4) If the discriminator in a GAN is well trained, the generator can perfectly learn the distribution of the training samples; that is, GAN is asymptotically consistent, while Encoder-Decoder is biased.
Based on this, some GAN models based on Encoder-Decoder are obtained by combining the advantages of Encoder-Decoder and GAN.

Variational Inference Model
Viewing GAN as a probability-based model (PBM) is a commonly recognized approach [13]. The essence of the Discriminator is to compute the conditional probability $P(c|x)$ that $x$ belongs to a certain class $c$, while the essence of the Generator is to compute the joint probability $P(x, c)$ of $x$ over the whole distribution. For PBM, the Generator is the core part: in PBM theory, the Generator is designed before the Discriminator, and the divergence between positive and negative samples is missing from the Generator's computation. The Discriminator must learn to estimate this divergence to assist the Generator, which also makes the Generator structure in PBM comparatively complicated [14]. Therefore, the Encoder-Decoder model can be applied to the Generator structure based on PBM theory.

Variational Self-coding
Assume that there is a distribution over real random variables; the input data can be considered as samples drawn from this distribution, but the distribution of the real data is unknown. John Paisley et al. [15] put forward the idea of approximating the distribution with a controllable and known one: assume that $z$ obeys some common distribution such as a normal or uniform distribution, then train a model $x' = g(z)$ and use the distribution of the model's output to approximate the distribution of the real data, that is, transform one distribution so that the two overlap as much as possible. This is the idea of the Variational Auto-Encoder (VAE).
Anders Boesen Lindbo Larsen et al. [16] combined VAE and GAN to form VAEGAN. Here, the VAE includes an Encoder and a Decoder, and the GAN includes a Generator and a Discriminator. In the VAE, $z$ is generated from $x$ through the Encoder and $x'$ from $z$ through the Decoder: $z \sim \text{Enc}(x) = q_\phi(z|x)$ and $x' \sim \text{Dec}(z) = p_\theta(x|z)$. The Generator in the GAN can be regarded as the Decoder of the VAE. After $z$ is converted into $x'$, the Discriminator judges whether $x'$ is "true" or "false" and gives a score. Comparing the two components, the discrimination effect of GAN is better than that of VAE, but the training process of VAE is easier than that of GAN.
Suppose the input vector $x$ is encoded into a latent variable $z$ and then decoded back into the output vector $x'$. VAE generates $z^{(i)}$ from a prior distribution $p_{\theta^*}(z)$ and then generates $x^{(i)}$ from a conditional distribution $p_{\theta^*}(x|z)$. However, the true parameter $\theta^*$ and the latent variable $z^{(i)}$ are unknown, so the usual solution is maximum likelihood estimation. Since the true posterior is intractable, an approximate posterior $q_\phi(z|x)$ is introduced, and the expectation $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is estimated by the Monte Carlo method. The loss function of VAE is then:

$$\mathcal{L}_{VAE} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z))$$

So the loss function of VAEGAN is:

$$\mathcal{L} = \mathcal{L}_{prior} + \mathcal{L}_{llike} + \mathcal{L}_{GAN}$$

where $\mathcal{L}_{prior}$ is the KL term, $\mathcal{L}_{llike}$ is a reconstruction error measured in a feature space of the Discriminator, and $\mathcal{L}_{GAN}$ is the usual adversarial loss. Initially, the Generator behaves like the VAE decoder: a vector $z$ is randomly generated, and the Discriminator then determines true or false. After that, the Discriminator is fixed, and gradient descent updates the Generator parameters so that the Discriminator output is as close as possible to 1.
The basic framework of VAEGAN is shown in Figure 3: after the input data is processed by the Encoder-Decoder, the similarity between the generated data and the input data is compared; the trained Decoder is then used as the Generator in the GAN, and the quality of the generated data is improved through the Discriminator's judgment.
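A schematic sketch of the three VAEGAN loss terms, assuming illustrative PyTorch stand-ins for the Encoder, Decoder, and Discriminator (`feat` stands for an intermediate Discriminator layer; all shapes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent = 32
enc_net = nn.Linear(784, 2 * latent)                       # outputs [mu, logvar]
dec = nn.Sequential(nn.Linear(latent, 784), nn.Sigmoid())  # Decoder / Generator
feat = nn.Linear(784, 128)                                 # D's intermediate features
head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())      # D's probability output
disc = lambda v: head(feat(v))

x = torch.rand(16, 784)
mu, logvar = enc_net(x).chunk(2, dim=1)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterization trick
x_rec = dec(z)

l_prior = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
l_llike = F.mse_loss(feat(x_rec), feat(x))                 # feature-space reconstruction
z_p = torch.randn(16, latent)
ones, zeros = torch.ones(16, 1), torch.zeros(16, 1)
l_gan = (F.binary_cross_entropy(disc(x), ones) +
         F.binary_cross_entropy(disc(x_rec), zeros) +
         F.binary_cross_entropy(disc(dec(z_p)), zeros))
loss = l_prior + l_llike + l_gan                           # combined VAEGAN objective
```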

Variational Mutual Information
From the perspective of feature learning, GAN imposes no conditional restrictions on the input random noise $z$, which means no dimension of $z$ has an explicit feature representation, so it is impossible to determine what kind of features a given noise dimension generates. Xi Chen et al. [17] improved the objective function of GAN and proposed a new GAN model, InfoGAN, where "Info" stands for mutual information. InfoGAN adds a latent code $c$ to the Generator's input and uses mutual information to indicate the degree of correlation between $x$ and $c$. To enforce a close relationship between $x$ and $c$, the mutual information must be maximized.
Accordingly, the mutual information between $c$ and $G(z, c)$ is added to the GAN objective as a regularization term $I(c; G(z, c))$. Based on the loss function of the original GAN, this gives:

$$\min_G \max_D V_I(D, G) = V(D, G) - \lambda I(c; G(z, c))$$

In practice it is difficult to maximize $I(c; G(z, c))$ directly: it involves the posterior probability distribution $P(c|x)$, which is hard to obtain. At the same time, because the Generator has a high degree of freedom, the learning process could reach a degenerate solution with $P(c|x) = P(c)$, causing $c$ to lose its intended function. Therefore, an auxiliary probability distribution $Q(c|x)$ is defined to approximate $P(c|x)$ and obtain a lower bound on the mutual information. This method is called variational mutual information maximization.
When the auxiliary distribution $Q(c|x)$ approaches the true posterior $P(c|x)$, the bound becomes tight and the maximum value of the mutual information is attained. With the variational lower bound

$$L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}[\log Q(c|x)] + H(c)$$

the objective function is equivalent to:

$$\min_{G,Q} \max_D V_{InfoGAN}(D, G) = V(D, G) - \lambda L_I(G, Q)$$

The InfoGAN model uses DCGAN as its network structure [18,19]. The code $c$ and noise $z$ are input to the Generator, and the generated data and real data are fed randomly to the Discriminator for judgment. $Q$ and $D$ share the convolutional layers, and $Q$ outputs the probability distribution of $c$. Here $G$ and $Q$ can be seen as an Encoder-Decoder structure, while $G$ and $D$ are the GAN models. The basic framework of InfoGAN is shown in Figure 4.
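For a categorical code $c$, the variational term $L_I$ reduces (up to the constant entropy $H(c)$) to a cross-entropy between the code fed to the Generator and $Q$'s prediction. The sketch below is illustrative; the code size and the weight $\lambda$ are assumptions.

```python
import torch
import torch.nn.functional as F

def info_loss(q_logits, c_true):
    """q_logits: Q's predicted distribution over the code for generated samples.
    c_true: the code actually fed to the Generator (class indices).
    Maximizing E[log Q(c|x)] lower-bounds I(c; G(z, c)), since H(c) is constant,
    so we minimize the negative log-likelihood."""
    return F.cross_entropy(q_logits, c_true)

c = torch.randint(0, 10, (16,))      # sample a 10-way categorical code
q_logits = torch.randn(16, 10)       # stand-in for Q(G(z, c)) outputs
lam = 1.0                            # assumed regularization weight lambda
g_regularizer = lam * info_loss(q_logits, c)  # added to the Generator's loss
```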

Adversarial Inference
The ALI (Adversarially Learned Inference) method proposed by Vincent Dumoulin et al. [20] and the BiGAN (Bidirectional GAN) method proposed by Jeff Donahue et al. [21] use an adversarial approach to jointly train an inference network, acting as the Encoder, and a generation network, acting as the Decoder. The Discriminator receives joint pairs of data and latent code, $(x, E(x))$ or $(G(z), z)$, and must decide which path produced them.
The basic framework of ALI is shown in Figure 5.
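A minimal PyTorch sketch of this joint-pair discrimination, with all module shapes as illustrative assumptions:

```python
import torch
import torch.nn as nn

E = nn.Linear(784, 64)                         # inference network (Encoder)
G = nn.Linear(64, 784)                         # generation network (Decoder)
D = nn.Sequential(nn.Linear(784 + 64, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

x = torch.rand(16, 784)
z = torch.randn(16, 64)
d_real = D(torch.cat([x, E(x)], dim=1))        # pair from the data distribution
d_fake = D(torch.cat([G(z), z], dim=1))        # pair from the prior
loss_d = -(torch.log(d_real + 1e-8) + torch.log(1 - d_fake + 1e-8)).mean()
```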

Introspective Inference
The IntroVAE method proposed by Huaibo Huang et al. [22] introduces GAN into the VAE, dispensing with the separate Discriminator structure of GAN and the standalone Decoder structure of the AutoEncoder, and achieves self-introspective learning: the model itself can judge the quality of its generated samples and improve accordingly.
Let $E(x) = D_{KL}(q_\phi(z|x)\,\|\,p(z))$ denote the KL term computed by the Encoder, and $L_{AE}$ the reconstruction loss. IntroVAE trains the Encoder to push the latent variables of real images close to the prior distribution while pushing the latent variables of fake images away from it; at the same time, the Generator is trained to bring the latent variables of fake images close to the prior. That is:

$$L_E = E(x) + \alpha \sum_{s=r,p} [m - E(x_s)]^+ + \beta L_{AE}(x, x_r)$$

$$L_G = \alpha \sum_{s=r,p} E(x_s) + \beta L_{AE}(x, x_r)$$

Here $[\cdot]^+ = \max(0, \cdot)$; $m$, $\alpha$ and $\beta$ are hyperparameters; $x_r$ is the reconstructed sample and $x_p$ a sample generated from the prior. Unlike GAN and AutoEncoder [23], the Encoder and Generator in IntroVAE are adversarial, yet they also keep the error between the output image and the real image as small as possible. For real data, this method is consistent with the traditional VAE and keeps the stability of the AutoEncoder; for fake data, the adversarial mechanism improves the quality of the generated images.
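A sketch of the two hinge-style losses above, assuming the KL terms and reconstruction error have already been computed; the hyperparameter values are illustrative.

```python
import torch

def introvae_losses(kl_real, kl_rec, kl_prior, l_ae, m=10.0, alpha=0.25, beta=1.0):
    """kl_real: E(x) for real data; kl_rec / kl_prior: E(x_s) for reconstructed
    and prior-sampled data; l_ae: pixel reconstruction error.
    [t]^+ = max(0, t) is implemented with clamp."""
    l_enc = kl_real + alpha * ((m - kl_rec).clamp(min=0) +
                               (m - kl_prior).clamp(min=0)) + beta * l_ae
    l_gen = alpha * (kl_rec + kl_prior) + beta * l_ae
    return l_enc, l_gen

l_enc, l_gen = introvae_losses(torch.tensor(2.0), torch.tensor(1.0),
                               torch.tensor(0.5), torch.tensor(3.0))
```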

Energy Function Model
Viewing GAN as an energy-based model (EBM) is another design approach [24]. It treats the Discriminator as an energy function with no explicit probabilistic interpretation, trained to assign low energy values to regions of high data density and high energy values elsewhere. The EBM view is built around the Discriminator, which requires negative samples during training; these are provided by the Generator. This shows that the Discriminator structure in EBM can be designed much more freely. Therefore, the Encoder-Decoder model can be applied to the Discriminator structure based on EBM theory.

Energy Function
The EBGAN (Energy-Based GAN) method proposed by Junbo Zhao et al. [25] applies the concept of energy function to the Discriminator, and the Discriminator uses the Encoder-Decoder structure.
The Discriminator $D$ is regarded as an energy function that gives low energy to real data and high energy to generated data, outputting a mean squared reconstruction error for both, $D(x) = \|\text{Dec}(\text{Enc}(x)) - x\|$. The loss functions of the Discriminator and Generator are:

$$L_D(x, z) = D(x) + [m - D(G(z))]^+$$

$$L_G(z) = D(G(z))$$

where $[\cdot]^+ = \max(0, \cdot)$ means that in $L_D$, the margin term is positive when the reconstruction error of the fake data is less than the margin $m$, and 0 otherwise. To address the mode-collapse problem of GAN, EBGAN uses a Pulling-away Term (PT) so that the Generator produces diverse data:

$$f_{PT}(S) = \frac{1}{N(N-1)} \sum_i \sum_{j \neq i} \left( \frac{S_i^\top S_j}{\|S_i\| \|S_j\|} \right)^2$$

The idea is that each generated fake sample yields a vector after the Encoder; the cosine similarity between every pair of vectors is computed and the mean squared value taken. The closer the vectors are to mutually orthogonal, the smaller the value of $f_{PT}(S)$.
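The pulling-away term can be sketched directly from its definition; the batch size and feature dimension below are illustrative.

```python
import torch

def pulling_away_term(s):
    """s: Encoder outputs for a batch of generated samples, shape (N, d)."""
    n = s.size(0)
    s_norm = s / s.norm(dim=1, keepdim=True)        # unit-normalize each vector
    cos = s_norm @ s_norm.t()                       # pairwise cosine similarities
    off_diag = cos - torch.eye(n)                   # drop the self-similarity terms
    return (off_diag ** 2).sum() / (n * (n - 1))    # mean over ordered pairs

f_pt = pulling_away_term(torch.randn(8, 32))        # near 0 when vectors are orthogonal
```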
In EBGAN, the Generator follows Occam's razor principle, while the Discriminator is based on assigning higher energy to generated samples. The basic framework of EBGAN is shown in Figure 6.

Boundary Equilibrium
David Berthelot et al. applied the Encoder to the Discriminator in GAN and proposed BEGAN (Boundary Equilibrium GAN) [26]. Unlike most other GAN methods, which start from matching the sample distribution, BEGAN starts from the distribution of reconstruction errors: it matches the auto-encoder loss distributions of real and generated samples using the Wasserstein distance.
The loss of the Discriminator's auto-encoder, i.e. the error between the input data passing through the Encoder and the output data generated by the Decoder, is:

$$\mathcal{L}(v) = |v - D(v)|^\eta, \quad \eta \in \{1, 2\}$$

This loss induces two distributions: the loss_real distribution formed by real data and the loss_fake distribution formed by the Generator's fake data. The Wasserstein distance is used to measure the distance between these two distributions, thereby improving the Discriminator's ability to distinguish true from false.
Assume that the loss distributions produced by real data and fake data are $\mu_r$ and $\mu_f$ respectively. Experiments suggest that the reconstruction errors of the inputs are approximately independent and identically distributed and obey a normal distribution, so the two one-dimensional distributions loss_real and loss_fake can be written as $\mu_r = \mathcal{N}(m_r, c_r)$ and $\mu_f = \mathcal{N}(m_f, c_f)$, where $m_r, m_f$ and $c_r, c_f$ are the means and variances of loss_real and loss_fake, respectively. Their Wasserstein distance is defined as:

$$W(\mu_r, \mu_f) = \inf_{\gamma \in \Gamma(\mu_r, \mu_f)} \mathbb{E}_{(l_r, l_f) \sim \gamma}\big[\,|l_r - l_f|\,\big]$$

Here $l_r$ and $l_f$ are random samples from $\mu_r$ and $\mu_f$; their joint distribution is $\gamma$, and all admissible $\gamma$ constitute the set $\Gamma(\mu_r, \mu_f)$. The joint distribution achieving the smallest value over $\Gamma(\mu_r, \mu_f)$ is the target distribution, and its expected value is the required distance. The Discriminator wants this distance to grow, but the optimal $\gamma$ is unknown and very difficult to solve directly, so a lower bound is used instead. By Jensen's inequality, $W(\mu_r, \mu_f) \geq |m_r - m_f|$. Intuitively, when the Discriminator is well trained, the error of real data should be smallest, i.e. $m_r \to 0$, so $W(\mu_r, \mu_f) \geq m_f - m_r$. Maximizing the Discriminator objective is thus equivalent to minimizing $m_r - m_f$, while the Generator's goal is to minimize the distance between the two distributions, achieved by minimizing $m_f$. Finally, when the losses of Generator and Discriminator are balanced, the Discriminator cannot distinguish true from false; at this point $W(\mu_r, \mu_f) \to 0$, so:

$$\mathbb{E}[\mathcal{L}(x)] = \mathbb{E}[\mathcal{L}(G(z))]$$

Since the Generator improves more slowly than the Discriminator during training, which would make training unstable, a balancing coefficient $k_t$ is added to $\mathcal{L}_D$ to regulate the difference between them:

$$\mathcal{L}_D = \mathcal{L}(x) - k_t\,\mathcal{L}(G(z)), \qquad \mathcal{L}_G = \mathcal{L}(G(z)), \qquad k_{t+1} = k_t + \lambda_k\big(\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z))\big)$$
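A minimal sketch of this balancing step, with $\gamma$ and $\lambda_k$ set to illustrative values:

```python
import torch

k_t, gamma, lambda_k = 0.0, 0.5, 1e-3   # gamma: diversity ratio; values illustrative

def began_step(loss_real, loss_fake):
    """loss_real, loss_fake: auto-encoder losses L(x) and L(G(z)) for one batch."""
    global k_t
    loss_d = loss_real - k_t * loss_fake   # Discriminator objective
    loss_g = loss_fake                     # Generator objective
    # Move k_t toward the equilibrium gamma * L(x) = L(G(z)), clipped to [0, 1].
    delta = gamma * float(loss_real) - float(loss_fake)
    k_t = min(max(k_t + lambda_k * delta, 0.0), 1.0)
    return loss_d, loss_g

loss_d, loss_g = began_step(torch.tensor(0.8), torch.tensor(0.3))
```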

Association Transformation of Different Distribution Data
For the problem of association and style transformation between image datasets with different probability distributions, the integration of Encoder-Decoder with GAN has made great progress in recent years. The problem is in fact that of learning a mapping function: provided the underlying data distribution is not changed, style conversion between different image data sets is performed by finding correspondences such as similar semantics, and this conversion is unsupervised.
Ming-Yu Liu et al. extended the distribution $p(x)$ to the joint distribution $p(x_1, x_2)$ and proposed CoGAN to deal with domain-adaptation problems [27]. CoGAN learns a joint distribution over two domain types with different attributes. It consists of two GANs, one for each domain type. If the two GANs were trained independently on the two domains, the product $p(x_1)\,p(x_2)$ of the two marginal distributions would not equal the joint distribution $p(x_1, x_2)$, so the model could not learn a joint distribution with different attributes. In CoGAN, the two GANs share and constrain the weights in some layers of their Generators, so that CoGAN can learn a joint distribution even when no paired correspondence between the two domain types is available. Because the shared weights sit in the layers of the Generator that carry high-level semantic information, the Discriminators can decompose the high-level semantics while the remaining layers of each Generator still render the content of their own domain types.
Specifically, CoGAN is composed of GAN1 and GAN2, each generating and discriminating images of its own domain type. In training, the two Generators share the weights of the layers responsible for high-level semantics, i.e. $\theta_{G_1^{(i)}} = \theta_{G_2^{(i)}}$ for the shared layers $i$, and the two Discriminators share part of the weights of their final layers, i.e. $\theta_{D_1^{(j)}} = \theta_{D_2^{(j)}}$ for the shared layers $j$. When the weights of all Generator layers of GAN1 and GAN2 are shared, Generator1 and Generator2 can be collectively regarded as an Encoder. The basic framework of CoGAN is shown in Figure 7.
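A minimal PyTorch sketch of this weight-sharing idea, in which one shared module ties the semantic layers of the two Generators and each domain keeps a private output head (all shapes illustrative):

```python
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Linear(64, 256), nn.ReLU())  # weights tied across domains
head1 = nn.Linear(256, 784)                             # domain-1 specific decoding
head2 = nn.Linear(256, 784)                             # domain-2 specific decoding

def generate_pair(z):
    h = shared(z)                   # common high-level semantics for both domains
    return head1(h), head2(h)       # two corresponding images, one per domain

x1, x2 = generate_pair(torch.randn(16, 64))
```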
In contrast to CoGAN, which introduces the Encoder-Decoder structure into GAN, Guillaume Lample et al. integrated GAN into the Encoder-Decoder and proposed Fader Networks [28]. Fader Networks combine the Encoder and a Discriminator to form the generation-discrimination pair of a GAN. After the Encoder produces the latent representation $E(x)$, the Discriminator continuously judges whether $E(x)$ is still related to the attribute $y$ of the original data; continuously optimizing this process disentangles $E(x)$ from $y$. Finally, when the Decoder generates the new image, attributes $y$ that meet the requirements can be added, thus achieving the goal of controlling the image to be generated as needed.
Thus the Discriminator judges whether $E(x)$ is related to $y$, and the objective of the Encoder-Decoder is:

$$\mathcal{L}(\theta_{enc}, \theta_{dec}) = \mathbb{E}_{(x,y)}\Big[\,\|D_{\theta_{dec}}(E_{\theta_{enc}}(x), y) - x\|_2^2 - \lambda_E \log P_{\theta_{dis}}(1 - y \mid E_{\theta_{enc}}(x))\,\Big]$$

Here $\lambda_E > 0$ needs to be adjusted precisely: a larger $\lambda_E$ limits the amount of attribute information contained in $E(x)$, resulting in image blurring, while a smaller $\lambda_E$ weakens the adversarial pressure that removes the dependence on $y$. The basic framework of Fader Networks is shown in Figure 8.

Jun-Yan Zhu et al. designed CycleGAN for image style transfer [29]. That is, given two differently distributed image datasets $X_1$ and $X_2$, CycleGAN can convert the distribution of samples in the $X_1$ space into the distribution of the $X_2$ space; the goal of CycleGAN is therefore to learn the mapping from $X_1$ to $X_2$.
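The central constraint in [29] is cycle consistency: translating to the other domain and back should recover the input. A minimal sketch with stand-in linear translators (shapes illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

g12 = nn.Linear(784, 784)   # stand-in translator: domain X1 -> X2
g21 = nn.Linear(784, 784)   # stand-in translator: domain X2 -> X1

def cycle_consistency(x1, x2):
    # x1 -> x2' -> x1'' should return to x1, and symmetrically for x2.
    return F.l1_loss(g21(g12(x1)), x1) + F.l1_loss(g12(g21(x2)), x2)

loss_cyc = cycle_consistency(torch.rand(4, 784), torch.rand(4, 784))
```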
XGAN [30] contains two encoders, two decoders, and two discriminators, plus a domain classifier. The discriminators in XGAN act similarly to those in CycleGAN. XGAN passes $x_1$ through Encoder1 to get the code $z_1$, generates $x_2$ through Decoder2, feeds $x_2$ into Encoder2 to output $z_2$, and finally computes the distance between $z_1$ and $z_2$ as a loss function.
In XGAN, the loss function of the two auto-encoders is the sum of the per-domain reconstruction errors, $\mathcal{L}_{rec} = \mathcal{L}_{rec,1} + \mathcal{L}_{rec,2}$ with $\mathcal{L}_{rec,i} = \mathbb{E}_{x \sim p_i}\|x - d_i(e_i(x))\|_2^2$. The classifier in XGAN is used to distinguish which domain an image came from after encoding.
If the encoded image is still distinguishable, the encoding carries not only feature information but also domain information; if it cannot be distinguished, the encoding captures feature information common to the two domains:

$$\mathcal{L}_{dann} = \mathbb{E}_{x_1 \sim p_1}\,\ell\big(1, C(e_1(x_1))\big) + \mathbb{E}_{x_2 \sim p_2}\,\ell\big(2, C(e_2(x_2))\big)$$

where $\ell$ is a classification loss. The two encoders must also maintain feature consistency when encoding the two domains; the semantic-consistency loss requires that re-encoding a translated image recovers the original code, e.g. $\mathcal{L}_{sem,1\to2} = \mathbb{E}_{x_1 \sim p_1}\,\|e_1(x_1) - e_2(d_2(e_1(x_1)))\|$. XGAN also introduces an optional teacher loss: when prior knowledge is available, it can be integrated into the XGAN model, and it is asymmetric. The basic framework of XGAN is shown in Figure 10.

Asha Anoosheh et al. proposed ComboGAN [31], applying the Encoder-Decoder structure to the problem of differently distributed image data. ComboGAN lets $x_1$ pass through Encoder1 to get $z_1$, generates $x_2$ through Decoder2, feeds $x_2$ into Encoder2 to output $z_2$, and finally generates $x_1'$ through Decoder1; the distance between $x_1$ and $x_1'$ is used as a loss function. The basic framework of ComboGAN is shown in Figure 11.
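A sketch of the XGAN terms above, together with a ComboGAN-style pixel cycle; the encoders, decoders, and classifier are illustrative stand-ins (domain labels use 0/1 indexing in code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

e1, e2 = nn.Linear(784, 64), nn.Linear(784, 64)   # per-domain encoders
d1, d2 = nn.Linear(64, 784), nn.Linear(64, 784)   # per-domain decoders
C = nn.Linear(64, 2)                              # domain classifier

x1, x2 = torch.rand(8, 784), torch.rand(8, 784)
# L_dann: the classifier tries to tell which domain a code came from; the
# encoders are trained adversarially so the codes become domain-invariant.
l_dann = F.cross_entropy(C(e1(x1)), torch.zeros(8, dtype=torch.long)) + \
         F.cross_entropy(C(e2(x2)), torch.ones(8, dtype=torch.long))
# L_sem: re-encoding a translated image should give back the original code.
l_sem = F.l1_loss(e2(d2(e1(x1))), e1(x1))
# ComboGAN-style cycle: x1 -> z -> x2 -> z' -> x1', compared with the original x1.
l_cycle = F.l1_loss(d1(e2(d2(e1(x1)))), x1)
```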
Considering that the structure of the Discriminator is similar to that of an Encoder, Jianlin Su proposed the O-GAN model [32]. The Discriminator output is a classification scalar while an Encoder output is a vector, so the Discriminator is written as a composite function $D(x) \triangleq T(E(x))$, where $T$ is the mapping from the encoding (noise) space to the discrimination space. Here, to make the network realize the functions of both generator and encoder, O-GAN adds a Pearson correlation coefficient $\rho(z, E(G(z)))$ between the sampled noise and the encoding of the generated image as a regularization term. In this way, the Discriminator is divided into two parts, where $E$ is the encoder. For $T$, O-GAN directly uses the average of the encoder output, i.e. $T(E(x)) = \mathrm{avg}(E(x))$.
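A sketch of this Pearson correlation regularizer, computed per sample across the latent dimensions (shapes illustrative; a small epsilon is assumed for numerical stability):

```python
import torch

def pearson_rho(z, z_rec):
    """z: noise fed to the Generator; z_rec: E(G(z)); both of shape (N, d)."""
    z_c = z - z.mean(dim=1, keepdim=True)          # center each latent vector
    r_c = z_rec - z_rec.mean(dim=1, keepdim=True)
    num = (z_c * r_c).sum(dim=1)
    den = z_c.norm(dim=1) * r_c.norm(dim=1) + 1e-8
    return (num / den).mean()                      # maximized during training

rho = pearson_rho(torch.randn(16, 128), torch.randn(16, 128))
```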

Summary
GAN and Encoder-Decoder are both generative models. Their output is determined by the input, and either model can be used to learn a feature representation of the input. Both models learn directly from input samples, so label information is unnecessary. Specifically, the generative model in GAN no longer needs a rigorously formatted representation of the generated data as traditional models do, which avoids the model becoming intractable due to the complexity of the data, and also avoids data becoming impossible to input or generate due to the complexity of the model. However, GAN risks mode collapse, and its input data needs to be continuous. Encoder-Decoder is composed of two multilayer neural networks; the input and output data must express the same probability distribution with the same number of nodes. The significance of the model lies in the intermediate vector layer, which performs "dimensionality reduction" while preserving the feature distribution of the input data to the greatest extent. At the same time, the data generated by the Encoder-Decoder loses information because of this dimensionality reduction. Building on the complementary advantages and disadvantages of the two models, new models can be created by combining them, overcoming their respective shortcomings and producing data that better meets the application's needs.