
Deep Image Prior

Posted on Fri, Jul 23, 2021 paper-summary

Aim

The paper aims to solve "inverse" problems in image processing, such as denoising, super-resolution, and inpainting, using image priors that come from a handcrafted neural network rather than from data.

This paper uses the structure of ConvNets to learn the inverse function. The only information available to reconstruct the output image is the degraded input image itself and the handcrafted structure of the network used in the reconstruction. This is remarkable because no aspect of the network is learnt from data.

Introduction

Consider a "good image" $x \in X$ and a degraded version of the image $y \in Y$. We only have $y$ and wish to recover $x$. We do that using Bayes' rule:

$$P_{posterior}(x|y) \propto P_{l}(y|x)\, P_{prior}(x)$$

The first term $P_{l}(y|x)$ is the likelihood of the corrupted image given the good image, basically the degradation model. The second term $P_{prior}(x)$ is the prior distribution over good images. So a candidate $\hat{x}$ is good if the likelihood is high, and the prior is also high.

Priors over "good images" can be explicit (e.g., total-variation or smoothness regularizers) or implicit (e.g., the structure of the network that generates the image).

To find a candidate $\hat{x}$ we need to maximize the log-posterior. Doing so gives us an optimization problem:

$$\hat{x} = \argmax_{x} \log P_{posterior}(x|y) = \argmax_x \big[ \log P_{l}(y|x) + \log P_{prior}(x) \big] = \argmin_{x} \big[ E(x, y) + R(x) \big]$$

where

$$P_{l}(y|x) = \frac{\exp(-E(x, y))}{Z_0}, \qquad P_{prior}(x) = \frac{\exp(-R(x))}{Z_1}$$

Different priors will lead to different choices of $R(x)$. Let's call this the regularizer function.

Explicit regularizers (like TV or Gaussian smoothness) have a definite form of $R(x)$. However, if the image is generated by some mechanism, then the prior term becomes implicit. One can also add explicit regularizers to encode further prior knowledge; it is up to the end user to specify them.

An example - Explicit Priors in denoising

Denoising gives a concrete example of an explicit prior. The good signal $x \in \mathbb{R}^d$ is corrupted by the following process:

$$y = x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_{d})$$

The likelihood therefore becomes:

$$P_l(y|x) = C \exp\left(-\frac{\|y-x\|^2}{2\sigma^2}\right)$$

and therefore,

$$E(x, y) = \frac{\|x-y\|^2}{2\sigma^2}$$

Simply minimizing $E$ will give us $\hat{x} = y$, which is not a good solution for denoising. Therefore, adding priors may be helpful. Consider the following regularizer:

$$R(x) = \sum_i (x_i - x_{i+1})^2$$

$$P_{prior}(x) = \frac{\exp\left(-\sum_i(x_i - x_{i+1})^2\right)}{Z}$$

which promotes smoothness in space. Minimizing $E(x, y) + \lambda R(x)$ will give a smoother solution for $x$. Here is the output for a sinusoidal wave corrupted by noise.
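
A minimal numpy sketch of that toy experiment (the signal, noise level, $\lambda$, and step size below are illustrative choices, not taken from the post):

```python
import numpy as np

# Toy 1-D denoising with the explicit smoothness prior R(x) = sum_i (x_i - x_{i+1})^2.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
x_true = np.sin(t)
sigma, lam, lr = 0.3, 5.0, 0.015
y = x_true + sigma * rng.standard_normal(t.shape)   # corrupted observation

x = y.copy()                                        # start the estimate at y
for _ in range(1000):
    grad_E = (x - y) / sigma**2                     # gradient of E(x, y) = ||x - y||^2 / (2 sigma^2)
    dx = np.diff(x)                                 # x_{i+1} - x_i
    grad_R = np.zeros_like(x)
    grad_R[:-1] -= 2 * dx                           # d/dx_i     of (x_i - x_{i+1})^2
    grad_R[1:] += 2 * dx                            # d/dx_{i+1} of (x_i - x_{i+1})^2
    x -= lr * (grad_E + lam * grad_R)               # gradient descent on E + lambda * R

print(f"noisy MSE:    {np.mean((y - x_true) ** 2):.4f}")
print(f"denoised MSE: {np.mean((x - x_true) ** 2):.4f}")
```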

Implicit Priors

Images generated by ConvNets have built-in constraints. Convolutions are local and translation-invariant operations, and a sequence of such operators captures the relationships of pixel neighborhoods at multiple scales. These constraints are the priors built into the network that generates the image $x$.

Given a ConvNet $f$ with parameters $\theta$ having some implicit prior, we solve the following problem:

$$\theta^* = \argmin_\theta E(f_\theta(z); y), \qquad x^* = f_{\theta^*}(z)$$

where $z$ is a random, fixed code vector. A local minimizer of this problem is obtained using an optimizer such as gradient descent, starting from a random initialization of the parameters $\theta$. Since there is no data to tune the parameters, the prior comes only from the structure of the ConvNet.
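
A minimal PyTorch sketch of this procedure. `build_net` is an assumed user-supplied constructor for some ConvNet (e.g., an encoder-decoder that maps a 32-channel noise tensor to a 3-channel image); the code size, learning rate, and iteration count are illustrative rather than the paper's exact settings, and the L2 data term shown is the denoising case:

```python
import torch

def deep_image_prior(y, build_net, num_iter=3000, lr=0.01):
    """Fit f_theta(z) to the degraded observation y (shape 1x3xHxW).

    The only "prior" is the structure of the network returned by build_net;
    nothing is pretrained and no dataset is involved.
    """
    net = build_net()
    z = torch.randn(1, 32, y.shape[-2], y.shape[-1])   # random code vector, kept fixed
    opt = torch.optim.Adam(net.parameters(), lr=lr)

    for _ in range(num_iter):                          # early stopping = capping num_iter
        opt.zero_grad()
        x = net(z)                                     # f_theta(z)
        loss = ((x - y) ** 2).mean()                   # data term E(f_theta(z); y); swap per task
        loss.backward()
        opt.step()

    with torch.no_grad():
        return net(z)                                  # x* = f_{theta*}(z)
```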

How are neural networks a prior?

Given the power of neural networks to solve complex problems, one might expect them to fit any image, including pure noise. However, the choice of network architecture has a major effect on how the solution space is searched.

In particular, "bad" solutions are resisted by the network, and it descends much more quickly towards natural-looking images. For example, these are the loss curves in a reconstruction problem:

We see that neural networks have a higher impedance to noise. In the previous case (reconstruction), the optimization is simply $\argmin_\theta \| f_\theta(z) - x_0 \|^2$.

In the limit, the ConvNet can fit noise as well, but it does so very reluctantly: the parameterization offers high impedance to noise and low impedance to signal. However, to prevent the ConvNet from eventually fitting the noise, early stopping has to be used.

Here are examples of how that would work in image space:

In the first case, the ground truth satisfies the energy equation, and adding a conventional prior may tweak the energy so that the candidate is close to the ground truth. DIP has a similar effect, but it tweaks the optimization trajectory via re-parameterization.

In the second case, the ground truth image doesn't satisfy the energy equation (as in denoising). If run for long enough, DIP will reach a near-zero cost far from the ground truth. However, the optimization path passes close to the ground truth, and early stopping can recover a good solution.

How to "sample" from this prior

Although the prior is implicit, we can still draw samples in a loose sense. To do that, take different initializations of $\theta$ for the same code vector $z$ and the same architecture.

For example, this figure shows the outputs $f_\theta(z)$ for two different initializations; each column is a different architecture. The resulting images are far from independent noise and look like samples from a spatial stochastic process: there is a clear self-similarity in image patches across scales.
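
In code, drawing such a "sample" is just re-instantiating the network with fresh random weights; no optimization is involved (a hypothetical sketch reusing the `build_net` constructor assumed earlier):

```python
import torch

z = torch.randn(1, 32, 256, 256)     # one fixed code vector
samples = []
for _ in range(4):
    net = build_net()                # each call re-initializes theta at random
    with torch.no_grad():
        samples.append(net(z))       # f_theta(z) for this particular random theta
```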

Applications

Denoising and general reconstruction

Since the parameterization presents higher impedance to noise than to signal, denoising is a natural application for this work.

The method also works for blind denoising, where the noise model is not known. The authors simply use the loss

$$E(x, y) = \|x - y\|^2_2, \qquad x = f_\theta(z)$$

If the noise model is known, the loss function can be changed accordingly; otherwise one can use the default L2 loss. This allows the method to be used in a plug-and-play fashion.
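
Concretely, blind denoising is just the loop sketched earlier applied directly to the noisy observation with the default L2 data term (the iteration count below is illustrative):

```python
# Hypothetical usage of the deep_image_prior sketch above. Early stopping
# (a modest num_iter) is what keeps the output from fitting the noise.
x_hat = deep_image_prior(y_noisy, build_net, num_iter=1800)
```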

Denoising performance (non-Gaussian noise)

| Name | PSNR (dB) |
| --- | --- |
| Deep Image Prior (architecture of [46]) | 41.95 |
| CBM3D | 30.13 |
| DIP + UNet | 35.05 |
| DIP + ResNet | 31.95 |

Observations:

Stark differences in architecture lead to markedly different results: compare DIP with the architecture of [46] against DIP + ResNet.

Super-resolution

The target is a high-resolution image $x \in \mathbb{R}^{3\times tH \times tW}$ and the observed input $y \in \mathbb{R}^{3\times H\times W}$ is its downsampled version. To solve the inverse problem, the data term is:

$$E(x, y) = \| d(x) - y\|^2$$

where $d(\cdot)$ is the down-sampling operator. Super-resolution is ill-posed because there are infinitely many solutions that give the same downsampled image. A regularizer is needed to select one of the solutions.
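
A minimal sketch of this data term; bilinear interpolation is used below as a stand-in for the downsampling operator $d(\cdot)$ (the actual choice of downsampler in the paper may differ):

```python
import torch.nn.functional as F

def sr_data_term(x, y, factor):
    """E(x, y) = ||d(x) - y||^2 with a simple bilinear d(.)."""
    d_x = F.interpolate(x, scale_factor=1.0 / factor, mode="bilinear", align_corners=False)
    return ((d_x - y) ** 2).mean()

# Inside the DIP loop sketched earlier, replace the denoising loss with
#   loss = sr_data_term(net(z), y, factor=4)
# where z now has spatial size tH x tW so that net(z) is the high-resolution estimate.
```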

Inpainting

The data term here is:

$$E(x, y) = \| (x - y) \odot m \|^2$$

where $m$ is the binary mask of known pixels and $\odot$ is the Hadamard product. A prior is required here because the unknown region can be filled arbitrarily without affecting the data term.
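
A minimal sketch of this masked data term ($m$ is 1 on known pixels and 0 on the holes; names are illustrative):

```python
def inpainting_data_term(x, y, m):
    """E(x, y) = ||(x - y) * m||^2: only the known pixels constrain the output."""
    return (((x - y) * m) ** 2).sum()

# Inside the DIP loop:  loss = inpainting_data_term(net(z), y, m)
# The masked-out region is filled in purely by the network's structural prior.
```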

For random masks with high dropout, DIP can fill in the missing regions quite well. It is not expected to work well for heavily semantic inpainting (since there is no prior learned from a dataset), but it still works decently.

Natural pre-image

The pre-image is a method to study the invariances of a lossy function such as a deep network. Let $\Phi$ be the first several layers of a neural network. The pre-image of an image $x_0$ is the set

$$\{ x \in X : \Phi(x) = \Phi(x_0) \}$$

This is the set of images that result in the same representation under the network $\Phi$. However, directly optimizing over this set can produce unnatural images with artifacts. Prior work therefore regularizes the search with a TV prior or with priors learned from a dataset; dataset priors often regress towards the mean, so those works add learned discriminator and perceptual losses. None of that is required with DIP, since there is only one image to reconstruct.
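
A hypothetical sketch of the DIP pre-image, using torchvision's VGG16 features as the representation $\Phi$; the specific network, layer cut-off, and preprocessing are assumptions, not the paper's exact setup:

```python
import torch
from torchvision.models import vgg16

# Phi: the first several layers of a pretrained network (torchvision >= 0.13 API).
phi = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in phi.parameters():
    p.requires_grad_(False)

def preimage_loss(x, x0):
    """|| Phi(x) - Phi(x0) ||^2: match the representation of the target image x0."""
    # For real use, x and x0 should be normalized with ImageNet statistics.
    return ((phi(x) - phi(x0)) ** 2).mean()

# Inside the DIP loop sketched earlier:  loss = preimage_loss(net(z), x0)
```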

Activation maximization

Similar to the pre-image method, but here the generated image should maximize a chosen activation of the network, e.g., by rewarding the masked activation $\langle \Phi(x), m \rangle$, where $m$ is the mask of a chosen neuron. This can be done at any level, even the final layer, which can give an idea of what images the neural network "thinks of" when it is associated with some class.
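
Continuing the pre-image sketch above, activation maximization only swaps the loss (the exact objective is an assumption; $m$ is a mask selecting the neuron or class of interest):

```python
def activation_max_loss(x, m):
    """Minimize the negative masked activation, i.e. maximize <Phi(x), m>."""
    return -(phi(x) * m).sum()

# Inside the DIP loop:  loss = activation_max_loss(net(z), m)
```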

Image enhancement

Given a target image $x_0$, we use the denoising formulation to obtain a coarse image $x_c$. The fine details are computed as

$$x_f = x_0 - x_c$$

and the enhanced image is calculated as

$$x_e = x_0 + x_f$$
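
A hypothetical sketch, assuming the `deep_image_prior` helper from earlier is stopped early so that $x_c$ is a coarse, edge-preserving approximation of $x_0$:

```python
x_c = deep_image_prior(x_0, build_net, num_iter=500)   # early-stopped run -> coarse image
x_f = x_0 - x_c                                        # fine details
x_e = x_0 + x_f                                        # enhanced image (equivalently 2 * x_0 - x_c)
```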

Results of image enhancement: the first row shows enhanced images, the second row the corresponding coarse images. Even for a small number of iterations, the coarse approximation preserves the edges of large objects.

Flash and no-flash image reconstruction [Qualitative]

No-flash images usually have more noise than flash images, but one might want the noise level of the flash image combined with the style of the no-flash image. DIP can be used to generate such an image: instead of a random code vector, one can use the flash image as the input.
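
A hypothetical sketch of this variant: the flash image replaces the random code $z$, and the output is fitted to the no-flash image (the network returned by `build_net` would need to accept a 3-channel input here):

```python
import torch

def flash_no_flash_dip(flash, no_flash, build_net, num_iter=1000, lr=0.01):
    """Reconstruct a low-noise image with the no-flash lighting, using the flash image as input."""
    net = build_net()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(num_iter):
        opt.zero_grad()
        loss = ((net(flash) - no_flash) ** 2).mean()   # fit to the no-flash observation
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(flash)                              # the flash image plays the role of z
```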

Technical details

Further hyperparameter details are in the paper.