Similarity Noise Training for Image Denoising

Deep learning has attracted a lot of attention lately. Thanks to its high modeling performance, technological advances, and the availability of big data for training, deep learning has achieved remarkable improvements in both high-level and low-level vision tasks. One crucial ingredient in the success of a deep learning-based model is an adequately large data set to fuel the training stage. In many cases, however, large well-labeled data sets are hard to acquire. Recent works have shown that it is possible to optimize denoising models by minimizing the difference between different noise instances of the same image. Yet it is not common practice to collect data with different noise instances of the same sample. To address this issue, we propose a method that enables deep convolutional neural network models for Gaussian denoising to be trained when no ground truth data is available. More specifically, we propose to train a deep learning-based denoising model using only a single noise instance of each image. With that in mind, we develop a non-local self-similarity noise training method that uses only one noise instance.


Introduction
Recently, deep learning-based models have pushed computer vision tasks to a new level of performance, making deep learning the go-to framework when dealing with images. Many breakthroughs have been witnessed in high-level tasks such as image classification [1][2][3] and object detection [4][5][6], and in low-level vision tasks such as image denoising [7][8][9][10], inpainting [11], single image super-resolution [12] and more. Great emphasis has been placed on developing deeper and more complex architectures to improve modeling performance, as well as on improving training techniques.
One major reason that deep learning models achieve such success is the availability of large labeled data sets to fuel the training process and, in turn, to reach high performance. Yet large data sets with clean ground truth labels are not always available, and are sometimes not even feasible to collect.
Image denoising is a low-level vision problem where deep learning has become extremely popular; it is also a pre-processing step in many computer vision tasks. It aims to solve an inverse problem of the form $y = x + v$, where we seek to estimate the clean image $x$ from its noisy measurement $y$ by reducing the noise perturbation $v$. Across the denoising literature, and especially in natural image denoising, it is usual to model the noise perturbation as Additive White Gaussian Noise (AWGN). Although real-world noise does not exactly match this distribution, modeling it this way is still useful in practice.
Two main approaches can be distinguished in the denoising literature. Following its chronological development, we first encounter prior-based methods [13][14][15][16][17], in which sparsity and non-local self-similarity played a central role. These priors are usually mathematical models engineered by researchers to capture the inherent structure of images. Such methods were long the reference choice for image denoising; indeed, until recently the quest for better denoising performance was largely a matter of improving the mathematical modeling of the inner structure of images. Although they provide a well-founded theoretical basis and stable convergence criteria, these hand-crafted models lack modeling capacity, as no model can perfectly describe images. This has led to the adoption of learning-based models. This second stream of methods has become quite popular lately: here the burden of engineering the model is lifted by learning strategies [7,8,10,18], which has led to state-of-the-art performance in most computer vision tasks.
In a deep learning-based framework, two aspects are critical for the success of the model: the model architecture and the training data. Training a deep learning model requires a large labeled data set. For image denoising, pairs of (noisy, ground truth) samples are needed to perform the training by minimizing a loss function (for example, the Mean Squared Error (MSE)) between the network estimate on the noisy sample and its ground truth (noise-free) image. That being said, in many cases ground truth data can be hard or technically costly to acquire. One example is medical imaging, where 3D magnetic resonance imaging (MRI) needs hours of acquisition for a single high-quality volume, and reducing the acquisition time introduces noise that is harmful to medical diagnosis.
Recently, an interesting property has been observed in the context of image restoration tasks [19]. In that work it is shown that the training process of various image restoration tasks, including image denoising, can be robust to zero-mean perturbations of the ground truth data: if the ground truth used during training is corrupted by zero-mean Gaussian noise, the denoising performance is not significantly harmed, and in many cases performance similar to that of training with clean targets can be achieved. Said differently, we can train a deep learning denoising model by minimizing the loss between two different noise degradations of the same sample and still obtain a denoised output. This behavior is appealing and motivates us to further explore other training scenarios with no ground truth images.
In this paper we address the issue of training deep learning-based denoising models when no ground truth data set is available. In contrast with noise2noise training, our method relies on only a single noise degradation of each sample. We achieve this by joining noise2noise training with the non-local self-similarity image prior. To that end, we design a patch-group-based convolutional neural network (CNN) model that suits the needs of non-local self-similarity and of the noise2noise training method [19]. We also discuss the training process, which raises a number of difficulties, and propose adequate approaches to deal with them.

Related Works
Several recent attempts have been made at training without a ground truth data set. In [19], it has been pointed out that deep neural networks can be trained without ground truth data by mapping noisy samples to noisy samples. In [18], a generative adversarial network has been proposed for forming pairs of training data, which has led to impressive results in blind image denoising. Others, like [20], have chosen to introduce a statistical noise prior by modifying the training objective function using Stein's unbiased risk estimate (SURE).
In [19], pairs of noisy observations of the same data are assumed to be available, which is not common, since acquiring such pairs is usually impractical. In contrast, we propose to use a single noisy instance of the data and form our pairs from similar patches. The method in [20] assumes statistical prior knowledge of the noise through the SURE estimator; here we assume no statistical prior knowledge of the noise.

Our Method
Our goal is to design a training method for image denoising that requires no ground truth. Most existing deep learning methods, namely supervised ones, rely exclusively on training the model by minimizing the discrepancy between the model prediction and its corresponding label or ground truth. This is obviously the default approach to adopt when ground truth is available:

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{M} \sum_{i=1}^{M} \lVert f_{\theta}(y_i) - x_i \rVert_2^2$$

Here, $f$ is the model to be optimized, $\theta$ the model's parameters, $M$ the number of training samples and $\{(y_i, x_i)\}_{i=1}^{M}$ the (noisy, clean) pairs. From [19] we know that, in the context of image restoration, training is not restricted to ground truth data, and that noisy targets can also achieve satisfying denoising performance. This property is highly interesting, since it changes our perception of how learning models can achieve specific tasks. But when multiple degradations of the same sample are not available, it is still not clear how this property can be used to our advantage.
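The property reported in [19] can be motivated by a short calculation. The following is only a sketch under assumed notation, not a derivation taken from [19]: $f_\theta$ is the denoiser, $y$ the noisy input, $x$ the clean image, and $y' = x + n'$ a second noisy observation whose noise $n'$ has zero mean and is independent of $y$ given $x$.

```latex
\begin{aligned}
\mathbb{E}\,\lVert f_\theta(y) - y' \rVert_2^2
  &= \mathbb{E}\,\lVert f_\theta(y) - x - n' \rVert_2^2 \\
  &= \mathbb{E}\,\lVert f_\theta(y) - x \rVert_2^2
     - 2\,\mathbb{E}\,\langle f_\theta(y) - x,\; n' \rangle
     + \mathbb{E}\,\lVert n' \rVert_2^2 \\
  &= \mathbb{E}\,\lVert f_\theta(y) - x \rVert_2^2 + \mathbb{E}\,\lVert n' \rVert_2^2 .
\end{aligned}
```

The cross term vanishes because $n'$ is zero-mean and independent of $f_\theta(y) - x$, and the last term does not depend on $\theta$. In expectation, the noisy-target and clean-target objectives therefore share the same minimizers. When the target comes from a similar patch rather than from the same clean image, the identity holds only approximately, up to the mismatch between the two underlying clean patches.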
Indeed, it is rare to find data sets that include pairs of noisy data; the most common case is a data set that contains only a single noisy capture of each sample. In this setting the noise2noise training method proposed in [19], while highly relevant, does not meet practical training requirements, since it relies on the availability of multiple degradations of the same sample and is impractical otherwise. It would be advantageous to find a way to train exclusively on noisy data. Yet a learning model needs an objective function with training pairs to optimize the model parameters, and here non-local self-similarity suits our needs perfectly. Natural images exhibit redundancies across different spatial positions in the image. It is thus possible to extract training pairs from the given data using similar patches, which enables us to train a denoising model with a noise2noise-style method using only corrupted data, with no corresponding ground truth versions.
Motivated by the above reasoning, we combine the non-local self-similarity prior with the noise2noise training method and design a CNN architecture suited to this combination; our training model is composed of two networks. In the following subsections we present the details of our training and implementation. We take inspiration from the way Block-Matching and 3D filtering (BM3D) [14] exploits non-local self-similarity and propose a patch-group-based CNN model. Groups of N similar patches are first formed by performing block matching on the noisy images (with respect to a reference patch, within a search window of size W). These N patches are fed to a first CNN and denoised jointly. The resulting estimates are then fed to a second CNN, which provides the final estimate of the clean reference patch. Figure 1 shows a diagram of the two-network architecture.
To make the analogy with BM3D, the first network is the equivalent of collaborative filtering, although here we employ it in terms of learning. Our assumption is that, just as collaborative filtering aims to enhance denoising performance by filtering groups of similar patches together, our training process, and in turn the denoising performance, can gain from training on and denoising groups of similar patches together. We name this network the Collaborative Learning-Net (CL-Net). The second network aims to aggregate the group of similar patches, and we call it the Aggregation-Net (Ag-Net). More details about the CL-Net and the Ag-Net are given in subsections 3.2.1 and 3.2.2.

Block Matching
In our training setting, we choose a patch size of 20 × 20. Most state-of-the-art methods that implement a non-local self-similarity prior use a small patch size for block matching, generally 7 × 7; in the context of CNNs, and for learning purposes, a larger patch size is usually needed. We have found empirically that 20 × 20 achieves a good balance between learning size and block matching error. As is typical, the Euclidean distance (MSE) is used as the similarity measure, and similar patches are collected in a 27 × 27 window centered on the reference patch. Each patch group is formed of 9 patches including the reference patch; we refer to a patch as $p_i$, $i \geq 1$, with $p_1$ the reference patch. We have found empirically that increasing the number of patches in a group produces denoising artifacts.
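The block matching step described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `block_match` and its arguments are hypothetical, and it assumes that candidate patches must lie entirely inside the 27 × 27 window, which gives offsets of at most ±3 pixels for 20 × 20 patches. The text does not say how the window is defined, so other readings are possible.

```python
import numpy as np


def block_match(noisy, top, left, patch=20, window=27, group_size=9):
    """Group the reference patch with its most similar neighbouring patches.

    The reference patch has its top-left corner at (top, left). Candidates are
    ranked by mean squared difference to the reference. The returned group has
    shape (group_size, patch, patch), with the reference stored first.
    """
    noisy = np.asarray(noisy, dtype=np.float64)
    height, width = noisy.shape
    ref = noisy[top:top + patch, left:left + patch]
    # Assumption: candidates stay fully inside the search window.
    radius = (window - patch) // 2

    scored = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the reference is added separately below
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + patch > height or x + patch > width:
                continue  # skip candidates that leave the image
            cand = noisy[y:y + patch, x:x + patch]
            scored.append((float(np.mean((cand - ref) ** 2)), y, x))

    scored.sort()  # smallest distance first
    positions = [(top, left)] + [(y, x) for _, y, x in scored[:group_size - 1]]
    group = np.stack([noisy[y:y + patch, x:x + patch] for y, x in positions])
    return group, positions
```

Under the stated window assumption there are at most 49 candidate positions per reference patch, so an exhaustive search is cheap. Because the distances are computed on noisy patches, the ranking becomes less reliable as the noise level grows.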

Model Architecture
Designing a suitable network architecture for the task at hand is a crucial step for the success of the model. Here we seek an architecture that suits both a patch group formulation and a noise-to-noise driven training. State-of-the-art CNN denoising models [7] achieve their performance with a 3 × 3 filter size, as previously advocated by Simonyan et al. [21]. The network depth also plays an essential role in determining the receptive field size: in [7] it has been shown that, for a CNN with 3 × 3 filters, the depth of the network correlates with the receptive field size, and a CNN with $d$ convolutional layers results in a $(2d+1) \times (2d+1)$ receptive field.
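As a quick check of this relation, the receptive field of a stack of stride-1 convolutions can be computed directly. The helper name `receptive_field` is purely illustrative, and the formula assumes no pooling, striding or dilation.

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the receptive field of stacked stride-1 convolutions.

    Each layer widens the field by (kernel_size - 1) pixels, which gives
    2 * num_layers + 1 for 3 x 3 filters.
    """
    return num_layers * (kernel_size - 1) + 1


# For example, a depth of 5 with 3 x 3 filters covers an 11 x 11 region.
print(receptive_field(5))  # 11
```

With 20 × 20 patches, a shallow network of this kind already sees a large part of each patch, which is consistent with choosing a small depth for a patch-based model.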
Since we work with a patch-based model, we choose a rather small network depth of 5 convolutional layers. Rectified Linear Units (ReLU), $\max(\cdot, 0)$, are adopted as activation functions. Batch normalization has been shown to enhance performance, and we adopt it in our network as well. It is important to distinguish between the Collaborative Learning-Net pipeline and the Aggregation-Net pipeline. For the Collaborative Learning-Net we perform convolutions in a filter group manner, with 8 groups of 64 filters per layer, one for each channel. This means that, although they pass through the same network, the channels are actually processed separately while still working jointly, because we map each of their estimates to the reference patch. The reason for adopting this strategy is that, with a combined forward pass, the Collaborative Learning-Net ends up with all the patches in the group producing the same output (estimates of the reference patch) and losing their distinctive structures, which does not satisfy our needs for aggregation. It is thus preferable that the Collaborative

Collaborative Learning
The Collaborative Learning-Net is designed to perform two different tasks simultaneously. On the one hand, the network learns its denoising function, since we train it to map groups of similar input patches to their noisy reference patch. This follows our previous assumption about the similarity noise-to-noise training method. Figure 2, which plots the MSE loss across epochs when training on a single noisy image, supports this assumption: we can clearly see convergence toward a denoised output. We can also see that training our network by minimizing the MSE between the model prediction on groups of similar noisy patches and their noisy reference patch also drives down the MSE between the same predictions and the ground truth reference patch. This indicates that our noise similarity training method achieves the expected denoising behavior.
Moreover, we can see a high correlation between the ground truth target loss and the noisy target loss, which again shows the reliability of the method. On the other hand, while learning to restore the patch groups, the network also learns to work collaboratively. This is mainly because, during the learning process, all the N similar patches in each group, referred to as $p_i$, $i \geq 2$, are trained to map to the same reference patch and thus collaborate toward its final aggregation. Note that this formulation involves only noisy samples from the noisy image.
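To make the noisy-target formulation concrete, the sketch below builds one training pair from a patch group and evaluates the MSE against the noisy reference. The array layout (noisy reference stored first) and the names `make_cl_pair` and `mse` are assumptions made for illustration; no network is involved, and the similar patches themselves stand in for a prediction.

```python
import numpy as np


def make_cl_pair(group):
    """Split a patch group into (inputs, target) for noisy-target training.

    `group` has shape (K, H, W) with the noisy reference patch at index 0
    (an assumed layout). The K - 1 similar patches are the inputs, and the
    target repeats the noisy reference so that every similar patch is mapped
    to the same reference patch.
    """
    group = np.asarray(group, dtype=np.float64)
    inputs = group[1:]
    target = np.broadcast_to(group[0], inputs.shape)
    return inputs, target


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))


rng = np.random.default_rng(0)
group = rng.normal(size=(9, 20, 20))   # placeholder for a real patch group
inputs, target = make_cl_pair(group)
loss = mse(inputs, target)             # loss of an identity "prediction"
```

Excluding the reference patch from the inputs means that the target is never available to the model as an input, so the loss cannot be driven to zero by copying a noisy patch.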

Aggregation
The Aggregation-Net produces the final estimate of the clean reference patch. The Collaborative Learning-Net does a good job at denoising and outputs patch groups that approach the reference patches, but the estimates in these patch groups are linked only by their common target (i.e., mapping to the noisy reference patch) and do not yet form a single clean estimate of the reference patch. The Aggregation-Net therefore takes the N similar patches and passes them jointly into one channel. Here again we employ a noise mapping formulation, meaning that the Aggregation-Net is trained to map to the noisy reference patch.
CL-Net and Ag-Net denote the Collaborative Learning-Net and the Aggregation-Net, and Θ, Ψ their respective parameters. Note that when the Collaborative Learning-Net is being optimized, it is fed with similar patches without their reference patch. During our experiments we noticed that providing the CL-Net with the reference patch in the patch group input tends to overfit the model, since the network then just learns to copy noisy patches. Once the CL-Net converges, it is used to produce the inputs of the second network, this time including the reference patch. The group mean of the estimated patches, defined as $\bar{p} = \frac{1}{N} \sum_{i} \hat{p}_i$ where $\hat{p}_i$ are the estimated patches, is also passed to the Aggregation-Net. We found that this helps the Aggregation-Net to better estimate the clean reference patch by providing it with a basic estimate computed from the estimated patches.
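Written out, the two training stages described above might take the following form. This is a reconstruction from the prose, not the original equations: $p_1$ denotes the noisy reference patch, $p_i$ ($i \geq 2$) its similar patches, $\hat{p}_i$ the CL-Net estimates, $\bar{p}$ their group mean and $g$ the index of a patch group. The squared-error loss is assumed from the MSE used elsewhere in the paper.

```latex
\begin{aligned}
\hat{\Theta} &= \arg\min_{\Theta} \sum_{g} \sum_{i \geq 2}
   \bigl\lVert \bigl[\text{CL-Net}_{\Theta}\bigl(\{p^{g}_{j}\}_{j \geq 2}\bigr)\bigr]_{i}
   - p^{g}_{1} \bigr\rVert_2^2 , \\
\hat{\Psi} &= \arg\min_{\Psi} \sum_{g}
   \bigl\lVert \text{Ag-Net}_{\Psi}\bigl(\{\hat{p}^{g}_{i}\}_{i},\, \bar{p}^{g}\bigr)
   - p^{g}_{1} \bigr\rVert_2^2 .
\end{aligned}
```

Both targets are noisy reference patches, so no clean image enters either objective. The second stage is optimized after the first has converged, with $\Theta$ held fixed while the estimates $\hat{p}_i$ are produced.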

Experiments
Here we present the experiments conducted with our similarity noise training method, together with its evaluation and performance. To analyze the training process and the denoising results, we conduct the training on a test set of 12 widely used images; Figure 3 shows the test set. The BSD500 data set [22] has also been used to train on a larger scale and to study the effect of a big data set on the training: 400 training images are cropped to 180 × 180, AWGN is added before performing block matching, groups of similar patches are formed, and flips and rotations are applied for data augmentation.

We consider three training data sets: single image training, test set training and external data set training (BSD500). Table 1 shows the denoising performance for the three data sets. It is interesting to see that training on a single image achieves the best result in terms of PSNR, followed by test set (Set 12) training and lastly external set (BSD500) training. This is to be expected, as the training can better adapt to the given data and thus achieves better denoising than a model trained on a different data set. This encourages us to further develop the model for practical use in cases where no ground truth data is available.

Table 2 shows the PSNR values on Set 12 for different noise levels (sigma = 15, 25, 50). It can be noticed that the training method suffers at strong noise levels, mainly because of the block matching mismatches that occur under high noise; moderate noise levels such as 15 and 25 produce more stable training. Though the primary goal of this work is to explore training without ground truth, the final aim is to attenuate noise in images. For reference, we compare the denoising performance of our method with two non-local self-similarity-based denoising methods, BM3D [14] and Non-Local Means (NLM) [13]; this comparison gives a fair placement of the denoising performance.
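For readers who wish to set up a comparable evaluation, the noise synthesis and the PSNR metric mentioned above can be sketched as follows. This is a minimal illustration, assuming images on a 0 to 255 intensity scale and unclipped noise; the function names `add_awgn` and `psnr` are hypothetical and are not taken from the released code.

```python
import numpy as np


def add_awgn(clean, sigma, rng):
    """Add zero-mean white Gaussian noise with standard deviation `sigma`."""
    clean = np.asarray(clean, dtype=np.float64)
    return clean + rng.normal(0.0, sigma, size=clean.shape)


def psnr(estimate, clean, peak=255.0):
    """Peak signal-to-noise ratio in dB for images on a [0, peak] scale."""
    estimate = np.asarray(estimate, dtype=np.float64)
    clean = np.asarray(clean, dtype=np.float64)
    err = np.mean((estimate - clean) ** 2)
    if err == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / err))


rng = np.random.default_rng(0)
clean = np.full((256, 256), 128.0)      # placeholder for a real test image
for sigma in (15, 25, 50):
    noisy = add_awgn(clean, sigma, rng)
    print(sigma, round(psnr(noisy, clean), 2))
```

For unclipped noise, the PSNR of the noisy input itself is close to $20 \log_{10}(255/\sigma)$, i.e. roughly 24.6, 20.2 and 14.2 dB for sigma = 15, 25 and 50. This gives a baseline against which any denoiser's gain can be read.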
In terms of noise attenuation, our method performs better than classic NLM. Figures 5, 6 and 7 show further visual comparisons with BM3D denoising results.

Both networks have been trained for 50 epochs using Stochastic Gradient Descent with momentum 0.9, weight decay 0.0001 and a mini-batch size of 64. All trainings have been conducted using MatConvNet [23] in a Matlab (R2018a) environment, running on a PC with an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10 GHz and an Nvidia GeForce GTX 1080 Ti. Trained models and code can be found at https://github.com/AbderraoufKhodja/SNT-NoGroundTruthTraining.git.
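The optimizer settings above correspond to a standard SGD update with momentum and L2 weight decay, sketched below in NumPy. This is a generic textbook formulation and an assumption on our part: the exact update rule implemented by MatConvNet may differ in detail, the name `sgd_momentum_step` is hypothetical, and the learning rate is not stated in the text, so it is left as a required argument.

```python
import numpy as np


def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay.

    velocity <- momentum * velocity - lr * (grad + weight_decay * w)
    w        <- w + velocity
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity


# Toy usage on a single parameter vector; `lr=0.1` is an arbitrary value.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.5, 0.5])             # would come from a mini-batch of 64
w, v = sgd_momentum_step(w, grad, v, lr=0.1)
```

The weight decay term pulls each weight toward zero in proportion to its value, while the momentum term averages gradients across successive mini-batches.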

Discussion
Although we experiment only on single instances of noisy samples, and without statistical modeling of the noise corruption, we still manage to achieve good denoising results. In Figure 4 we can see that our method succeeds in recovering most of the detail of the image Barbara, which exhibits abundant repetitive patterns. Visually, compared with state-of-the-art denoisers like DnCNN [7], our method shows better recovery, especially in regions with repetitive patterns. The major drawback is that the method has more difficulty recovering smooth regions and introduces noise artifacts there, which diminishes the denoising performance. The proposed framework is simple and straightforward, and there is still room for improvement, either through a better exploitation of the non-local self-similarity prior or by introducing a statistical prior. We think that combining our approach with [20] could benefit both frameworks, and we will address these aspects in future work.

Conclusion
Image denoising is a classic problem that has been extensively studied in the scientific literature, and the adoption of deep learning-based methods opens a new door for improvement and investigation. Deep learning denoisers perform remarkably well in a supervised regime, but quickly fall short when no labeled data set is at hand. Despite the high restoration performance, the supervised nature of the training in most formulations of the problem makes it challenging to deploy such models optimally in different scenarios. To that end, we proposed a simple yet practical approach to train denoising CNNs without ground truth data and without statistical modeling of the noise corruption, given only single instances of noisy images. We hope that this work will inspire further advances in unsupervised methods for training denoising models, so as to better close the gap between classic methods and modern deep learning techniques in terms of flexibility and practicality in various settings. In future work we will investigate ways to improve training performance and to handle more challenging types of noise.