Rohit
Rohit PhD student at University of Pennsylvania

Deep Image Prior


My summary of the paper “Deep Image Prior”. Original paper is here.

Aim

The paper aims to solve “inverse” problems in image processing, such as denoising, super-resolution, and inpainting, using the image prior implied by a handcrafted neural network architecture, without any training data.

This paper uses the structure of ConvNets to learn the inverse function. The only information to reconstruct the output image is in the degraded input image, and the handcrafted structure of the network used in the reconstruction. This is remarkable because no aspect of the network is learnt from data.

Introduction

Consider a “good image” \(x \in X\) and a degraded version of the image \(y \in Y\). We only have \(y\) and wish to recover \(x\). We do that using Bayes' rule:

\[P_{posterior}(x|y) \propto P_{l}(y|x) P_{prior}(x)\]

The first term \(P_l(y|x)\) is the likelihood of the corrupted image given the good image, basically the degradation model. The second term \(P_{prior}(x)\) is the prior distribution over good images. So a candidate \(\hat{x}\) is good if the likelihood is high, and the prior is also high.

Priors can be explicit or implicit. Some examples of priors over “good images” are:

  • Good images should be smooth (Ising/Potts prior).
    • Binary/Categorical signals
  • Good images should have small total variation (https://en.wikipedia.org/wiki/Total_variation_denoising).
    • Continuous signals
  • Good images should be from a particular distribution defined by a dataset of images.
  • Good images should have more low-frequency components than high-frequency ones (Gaussian smoothing).

To find a candidate \(\hat{x}\) we need to maximize the log-Posterior. Doing so gives us an optimization problem:

\[\hat{x} = \arg\max_{x} \log(P_{posterior}(x|y)) = \arg\max_x \log(P_{l}(y|x)) + \log(P_{prior}(x)) \\ = \arg\min_{x} E(x, y) + R(x)\]

where

\[P_{l}(y|x) = \frac{\exp(-E(x, y))}{Z_0} \\ P_{prior}(x) = \frac{\exp(-R(x))}{Z_1}\]

Different priors will lead to different choices of \(R(x)\). Let’s call this the regularizer function.
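
To see why the normalizing constants do not appear in the final objective, take the negative logarithm of both terms (a short derivation, assuming \(Z_0\) and \(Z_1\) do not depend on \(x\), as is the case for the Gaussian noise model in the denoising example):

\[-\log P_{l}(y|x) - \log P_{prior}(x) = E(x, y) + R(x) + \log Z_0 + \log Z_1\]

The last two terms are constant in \(x\), so maximizing the log-posterior is the same as minimizing \(E(x, y) + R(x)\).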

Explicit regularizers (like TV or Gaussian) have a definite form of \(R(x)\). However, if the image is generated by some mechanism, the prior term becomes implicit. One can also add explicit regularizers on top to encode more prior knowledge; it's up to the end user to specify them.

An example - Explicit Priors in denoising

An example of explicit priors can be shown in denoising. In denoising, the good signal \(x \in \mathbb{R}^d\) is corrupted by the following process

\[y = x + \epsilon \quad , \quad\epsilon \sim \mathcal{N}(0, \sigma^2 I_{d})\]

The likelihood therefore becomes:

\[P_l(y|x) = C \exp\Bigg(\frac{-\|y-x\|^2}{2\sigma^2}\Bigg)\]

and therefore,

\[E(x, y) = \frac{\|x-y\|^2}{2\sigma^2}\]

Simply minimizing \(E\) will give us \(\hat{x} = y\) which is not a good solution for denoising. Therefore, adding priors may be helpful. Consider the following regularizer:

\[R(x) = \sum_i (x_i - x_{i+1})^2\] \[P_{prior}(x) = \frac{\exp(-\sum_i(x_i - x_{i+1})^2)}{Z}\]

which promotes smoothness in space. Minimizing \(E(x, y) + \lambda R(x)\), where \(\lambda > 0\) sets the strength of the regularizer, will give a smoother solution for \(x\). Here is the output for a sinusoidal wave corrupted by noise.

/blog/assets/images/dip/Untitled.png
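
The sinusoid experiment can be reproduced with a few lines of NumPy. This is a minimal sketch, not the code used for the figure: the signal length, noise level `sigma`, weight `lam`, and the helper name `denoise_smooth` are all assumptions chosen for illustration. Because both terms are quadratic, the minimizer has a closed form.

```python
import numpy as np

def denoise_smooth(y, sigma, lam):
    """Minimize ||x - y||^2 / (2 sigma^2) + lam * sum_i (x_i - x_{i+1})^2 in closed form."""
    n = y.size
    # Forward-difference matrix D of shape (n-1, n): (D x)_i = x_{i+1} - x_i.
    D = np.diff(np.eye(n), axis=0)
    # Setting the gradient to zero gives (I / sigma^2 + 2 lam D^T D) x = y / sigma^2.
    A = np.eye(n) / sigma**2 + 2.0 * lam * D.T @ D
    return np.linalg.solve(A, y / sigma**2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
x_clean = np.sin(2 * np.pi * 2 * t)      # two periods of a sine wave (assumed test signal)
sigma = 0.3                              # assumed noise standard deviation
y = x_clean + sigma * rng.standard_normal(t.size)
x_hat = denoise_smooth(y, sigma, lam=50.0)

mse_noisy = np.mean((y - x_clean) ** 2)
mse_hat = np.mean((x_hat - x_clean) ** 2)
print(f"MSE of noisy signal: {mse_noisy:.4f}, MSE after smoothing: {mse_hat:.4f}")
```

Larger values of `lam` give smoother outputs but eventually flatten the sinusoid itself, which is the usual trade-off with an explicit prior.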

Implicit Priors

Images generated from ConvNets have constraints. ConvNets capture the local and translation-invariant nature of convolutions, and the usage of a sequence of such operators captures the relationship of pixel neighborhoods at multiple scales. These are the priors built into the network that generates the image \(x\).

Given a ConvNet \(f\) with parameters \(\theta\) having some implicit prior, we solve the following problem:

\[\theta^* = \arg\min_\theta E(f_\theta(z); y), \quad\quad x^* = f_{\theta^*}(z)\]

where \(z\) is a random, fixed code vector. A local minimizer of this problem is obtained using an optimizer such as gradient descent, starting from a random initialization of the parameters \(\theta\). Since there is no data to tune the parameters, the prior comes only from the structure of the ConvNet.

/blog/assets/images/dip/Untitled%201.png

How are neural networks a prior?

After seeing the power of neural networks to solve complex problems, it seems that neural nets might fit to any image, including noise. However, the choice of network architecture has a major effect on how the solution space is searched.

In particular, “bad” solutions are resisted by the network, and it descends much more quickly towards natural-looking images. For example, these are the loss curves in a reconstruction problem:

/blog/assets/images/dip/Untitled%202.png

We see that the network offers higher impedance to noise than to natural images. In this reconstruction experiment, the optimization is simply \(\arg\min_\theta \| f_\theta(z) - x_0 \|^2\), where \(x_0\) is the target image.

In the limit, the ConvNet can fit noise as well, but it does so very reluctantly. The parameterization offers high impedance to noise and low impedance to signal. However, to prevent the ConvNet from overfitting to the noise, early stopping has to be used.
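
The impedance effect can be illustrated with a much simpler stand-in for the ConvNet. The sketch below is an assumption-laden toy, not the paper's setup: the "network" is a fixed linear low-pass operator `K` (a circular Gaussian blur) and the output is `K @ theta`. Gradient descent on `theta` fits a structured target quickly and a shuffled copy of the same values very slowly. All names and sizes are hypothetical.

```python
import numpy as np

def circular_lowpass(n, width):
    """Circulant Gaussian blur matrix K: a fixed, linear stand-in for f_theta."""
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n - d)                     # wrap-around distance
    K = np.exp(-0.5 * (d / width) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def fit_curve(K, target, steps, lr=0.5):
    """Gradient descent on theta for ||K theta - target||^2, starting from theta = 0.
    Returns the mean squared residual recorded before each update."""
    theta = np.zeros(K.shape[1])
    losses = []
    for _ in range(steps):
        residual = K @ theta - target
        losses.append(float(np.mean(residual ** 2)))
        theta -= lr * 2.0 * (K.T @ residual)     # gradient of the sum of squares
    return np.array(losses)

n = 128
t = np.arange(n) / n
clean = np.sin(2 * np.pi * 2 * t)                # structured target
shuffled = np.random.default_rng(0).permutation(clean)  # same values, structure destroyed

K = circular_lowpass(n, width=3.0)
loss_clean = fit_curve(K, clean, steps=50)
loss_shuffled = fit_curve(K, shuffled, steps=50)
print(f"final loss, structured target: {loss_clean[-1]:.2e}")
print(f"final loss, shuffled target:   {loss_shuffled[-1]:.2e}")
```

Both targets start from the same loss, so the gap after a fixed number of steps comes only from how well each target matches the directions the parameterization can move along quickly.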

Here are examples of how that would work in the image-space:

  • Super-resolution

/blog/assets/images/dip/Untitled%203.png

The ground truth satisfies the energy equation, and adding a conventional prior may tweak the energy so that the candidate is close to the ground truth. DIP has a similar effect, but it tweaks the optimization trajectory via re-parameterization.

  • Denoising

/blog/assets/images/dip/Untitled%204.png

Here the ground truth image doesn’t satisfy the energy equation. If run for long enough, DIP will obtain a near-zero cost far from the ground truth. However, the optimization path will pass close to the ground truth, and early stopping can help recover a good solution.
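
The role of early stopping in denoising can be seen in a toy setting. This is only a sketch under strong assumptions: the ConvNet is replaced by a fixed linear low-pass operator `K`, the signal is a 1D sinusoid, and the clean signal is used purely to measure the error along the trajectory (in practice it is unavailable, and the stopping iteration has to be chosen some other way).

```python
import numpy as np

def circular_lowpass(n, width):
    """Circulant Gaussian blur matrix K: a fixed, linear stand-in for f_theta."""
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(d, n - d)                     # wrap-around distance
    K = np.exp(-0.5 * (d / width) ** 2)
    return K / K.sum(axis=1, keepdims=True)

n = 128
t = np.arange(n) / n
clean = np.sin(2 * np.pi * 2 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.3 * rng.standard_normal(n)     # observed signal y

K = circular_lowpass(n, width=3.0)
theta = np.zeros(n)
lr, steps = 0.05, 3000                           # assumed step size and budget
err_to_clean = np.empty(steps)
for step in range(steps):
    x = K @ theta                                # current reconstruction
    err_to_clean[step] = np.mean((x - clean) ** 2)
    theta -= lr * 2.0 * (K.T @ (x - noisy))      # descend on ||K theta - y||^2

best_step = int(np.argmin(err_to_clean))
noisy_err = float(np.mean((noisy - clean) ** 2))
print(f"error of noisy input:      {noisy_err:.4f}")
print(f"best error (step {best_step}):  {err_to_clean[best_step]:.4f}")
print(f"error at the last step:    {err_to_clean[-1]:.4f}")
```

The loss against the noisy target keeps decreasing, while the error against the clean signal reaches its minimum at an intermediate step and then grows as noise is gradually fitted.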

How to “sample” from this prior

Although the prior is implicit, we can still draw samples in a loose sense. To do that, take different initializations \(\theta\) for the same code vector \(z\) and same architecture.

/blog/assets/images/dip/Untitled%205.png

For example, this figure shows the outputs \(f_\theta(z)\) for two different initializations. Each column is a different architecture. Resulting images are far from independent noise and correspond to stochastic processes. There is a clear self-similarity in patches of the image across scales.
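
A rough 1D analogue of this sampling procedure is sketched below. It is not the architecture from the paper: the tiny decoder (nearest-neighbour upsampling, 3-tap convolutions, LeakyReLU), the channel count, and the weight scaling are all assumptions made to keep the example short. The code \(z\) stays fixed and only the random weights change between samples.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def conv1d_same(x, w):
    """Zero-padded 'same' 1D convolution. x: (C_in, L), w: (C_out, C_in, k), k odd."""
    k = w.shape[2]
    length = x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    return sum(w[:, :, j] @ xp[:, j:j + length] for j in range(k))

def random_decoder_output(z, seed, stages=4, k=3):
    """Forward pass of an untrained decoder whose weights theta are drawn from `seed`."""
    rng = np.random.default_rng(seed)
    channels = z.shape[0]
    h = z
    for _ in range(stages):
        h = np.repeat(h, 2, axis=1)              # nearest-neighbour upsampling by 2
        w = rng.standard_normal((channels, channels, k)) / np.sqrt(channels * k)
        h = leaky_relu(conv1d_same(h, w))
    w_out = rng.standard_normal((1, channels, 1)) / np.sqrt(channels)
    return conv1d_same(h, w_out)[0]              # single-channel output signal

def lag1_autocorr(x):
    """Correlation between neighbouring samples; near zero for independent noise."""
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

# Fixed code z of shape (channels, length); only theta changes between samples.
z = np.random.default_rng(0).uniform(0.0, 0.1, size=(8, 8))
samples = [random_decoder_output(z, seed) for seed in (1, 2, 3)]
for s in samples:
    print(s.shape, round(lag1_autocorr(s), 3))
```

The lag-1 autocorrelation is only a crude indicator of spatial structure; plotting the samples is a more direct way to compare them with independent noise.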


Applications

Denoising and general reconstruction

Since the parameterization presents higher impedance to noise than to signal, denoising is a natural application for this work.

The method also works for blind denoising where the noise model is not known. They simply use the loss

\[E(x, y) = \|x - y\|^2_2, \quad x = f_\theta(z)\]

If the noise model is known, the loss function can be changed accordingly; otherwise one can use the default L2 loss. This allows the method to be used in a plug-and-play fashion.

Denoising performance (non-Gaussian noise)

Observations:

  • Overly wide skip connections lead to weak priors and fitting happens too fast.
  • Lack of skip connections leads to slow fitting and prior that is too strong.

Stark differences in architecture lead to varying results.

/blog/assets/images/dip/Untitled%206.png

Super-resolution

The unknown high-resolution image is \(x \in \mathbb{R}^{3\times tH \times tW}\), and the observation \(y \in \mathbb{R}^{3\times H\times W}\) is its downsampled version, where \(t\) is the upsampling factor. To solve the inverse problem, the data term is:

\[E(x, y) = \| d(x) - y\|^2\]

where \(d(.)\) is the down-sampling operator. Super-resolution is ill-posed because there are infinitely many solutions that give the same downsampled image. A regularizer is needed to select one of the solutions.
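
The ill-posedness is easy to check numerically. In the sketch below, \(d(\cdot)\) is taken to be average pooling by a factor \(t\); this is an assumption made for simplicity, and the function names are hypothetical. Two clearly different high-resolution images produce the same downsampled image and hence the same data term.

```python
import numpy as np

def downsample(x, t):
    """Average-pool a (C, t*H, t*W) image by factor t: one simple choice of d(.)."""
    c, th, tw = x.shape
    return x.reshape(c, th // t, t, tw // t, t).mean(axis=(2, 4))

def sr_energy(x, y, t):
    """Data term E(x, y) = ||d(x) - y||^2."""
    return float(np.sum((downsample(x, t) - y) ** 2))

t = 2
rng = np.random.default_rng(0)
x1 = rng.uniform(size=(3, 8, 8))                 # one high-resolution candidate
y = downsample(x1, t)                            # observed low-resolution image

# A checkerboard of +1/-1 averages to zero inside every 2x2 block,
# so adding it changes x but not d(x).
checker = (-1.0) ** np.add.outer(np.arange(8), np.arange(8))
x2 = x1 + 0.2 * checker

print(sr_energy(x1, y, t), sr_energy(x2, y, t))
print("max difference between candidates:", np.abs(x1 - x2).max())
```

The data term cannot distinguish `x1` from `x2`, so something else (an explicit regularizer or the implicit prior of the network) has to choose between them.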

/blog/assets/images/dip/Untitled%207.png

/blog/assets/images/dip/Untitled%208.png

Inpainting

The data term here is:

\[E(x, y) = \| (x - y) \odot m \|^2\]

where \(m\) is the binary mask of known pixels, and \(\odot\) is the Hadamard product. A prior is required here because the unknown region can be filled arbitrarily without affecting the data term.
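
A small numerical check of this data term, as a sketch with hypothetical names and an arbitrary random mask: filling the unknown pixels with anything at all leaves the energy unchanged, while changing a known pixel does not.

```python
import numpy as np

def inpainting_energy(x, y, m):
    """Data term E(x, y) = ||(x - y) * m||^2, with m = 1 on known pixels."""
    return float(np.sum(((x - y) * m) ** 2))

rng = np.random.default_rng(0)
y = rng.uniform(size=(3, 8, 8))                          # observed image
m = (rng.uniform(size=(1, 8, 8)) > 0.5).astype(float)    # mask shared across channels

x_a = y.copy()                                           # keep everything as observed
x_b = np.where(m == 1, y, rng.uniform(size=y.shape))     # arbitrary fill in the holes
x_c = y + 0.1 * m                                        # perturb only the known pixels

print(inpainting_energy(x_a, y, m))   # zero
print(inpainting_energy(x_b, y, m))   # zero as well: the holes are not penalized
print(inpainting_energy(x_c, y, m))   # positive
```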

For random masks with high dropout, DIP can fill the regions pretty well. It’s not supposed to work well for heavily semantic inpainting (since there is no prior from a dataset), but it still works decently.

/blog/assets/images/dip/Untitled%209.png

/blog/assets/images/dip/Untitled%2010.png

/blog/assets/images/dip/Untitled%2011.png

Natural pre-image

The pre-image is a method to study invariances of a lossy function such as a deep network. Let \(\Phi\) be the first several layers of a neural network. The pre-image is the set:

/blog/assets/images/dip/Untitled%2012.png

This is the set of images that produce the same representation in the neural network \(\Phi\). However, directly optimizing over this set can result in non-natural images with artifacts. Previous work has regularized the problem with priors such as TV or with priors learned from a dataset. Priors learned from a dataset are often found to regress towards the mean, so learnable discriminators and perceptual losses are used in such works. That is not required in DIP since there is only one image to reconstruct.

/blog/assets/images/dip/Untitled%2013.png

Activation maximization

Similar to the pre-image method, but here the image should maximize the activation of the neural network.

/blog/assets/images/dip/Untitled%2014.png

where \(m\) is the mask of a chosen neuron. This can be done at any level, even the final layer, which can give an idea of what images the neural network “thinks of” when it is associated with some class.

/blog/assets/images/dip/Untitled%2015.png

Image enhancement

Given a target image \(x_0\) we use the denoising formulation to obtain a coarse image \(x_c\). The fine details are computed as

\[x_f = x_0 - x_c\]

and the enhanced image is calculated as

\[x_e = x_0 + x_f\]
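
The two equations combine to \(x_e = 2x_0 - x_c\), which boosts whatever the coarse image leaves out. The sketch below illustrates this on a step edge. Note the substitution: the coarse image here comes from a plain box blur rather than from DIP, purely as an assumed stand-in, and `box_blur` and `enhance` are hypothetical helper names.

```python
import numpy as np

def box_blur(img, radius):
    """Mean filter with edge padding; a stand-in for the coarse DIP output x_c."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def enhance(x0, xc):
    """x_f = x_0 - x_c (fine details), x_e = x_0 + x_f (enhanced image)."""
    xf = x0 - xc
    return x0 + xf

x0 = np.zeros((8, 8))
x0[:, 4:] = 1.0                      # vertical step edge
xc = box_blur(x0, radius=1)
xe = enhance(x0, xc)

print(x0[0])
print(np.round(xe[0], 3))            # contrast is exaggerated on both sides of the edge
```

The enhanced values can leave the original intensity range, so clipping may be needed before display.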

Results of image enhancement. The first row shows the enhanced images and the second row the corresponding coarse images. Even for a small number of iterations, the coarse approximation preserves the edges of large objects.


Flash and no-flash image reconstruction [Qualitative]

No-flash images usually have higher noise than flash images, but one might want an image with the noise level of the flash image and the style of the no-flash image. DIP can be used to generate such an image: instead of using a random code vector, one can use the flash image as the input.

/blog/assets/images/dip/Untitled%2017.png

Technical details

  • Mostly used hourglass-based encoder-decoder architecture.
  • Use LeakyReLU as a non-linearity.
  • The choice of downsampling method doesn’t matter much.
  • In upsampling, transposed convolutions do worse in general than bilinear or nearest-neighbour sampling.
  • \(z\) is filled with
    • uniform noise
    • meshgrid - works better with large hole inpainting
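
The two choices for \(z\) can be constructed as follows. This is a sketch: the channel count, the noise range `scale`, and the function names are assumptions for illustration, so check the paper for the values actually used.

```python
import numpy as np

def noise_code(channels, height, width, scale=0.1, seed=0):
    """Code tensor filled with uniform noise in [0, scale)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, scale, size=(channels, height, width))

def meshgrid_code(height, width):
    """Two-channel code holding the normalized x and y coordinate of every pixel."""
    xs = np.linspace(0.0, 1.0, width)
    ys = np.linspace(0.0, 1.0, height)
    gx, gy = np.meshgrid(xs, ys)     # each has shape (height, width)
    return np.stack([gx, gy])        # shape (2, height, width)

z_noise = noise_code(channels=32, height=64, width=48)
z_mesh = meshgrid_code(height=64, width=48)
print(z_noise.shape, z_mesh.shape)
```

The meshgrid code varies smoothly across the image, unlike the noise code, which has no spatial structure of its own.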

More hyperparameters are in the paper.
