Deep Image Matting

This blog explains the Deep Image Matting paper by Xu et al.


Introduction

The Deep Image Matting paper by Xu et al. was a pivotal paper in image matting research. It used deep neural networks to address some key challenges of the task and made design choices that later works took inspiration from. The goal of the paper is to take an input image and a user-specified refinement region and predict an alpha matte, which represents the opacity values in the refinement region. For a broader introduction to matting and some of its applications, have a look at A friendly introduction to Background Removal. The method presented in this paper is not fully automatic and requires a user-generated trimap. To see image matting in action, try out our background remover EraseBG. Official upgraded launch date: 10th March.

In the figure below, column 1 shows the input image and column 2 shows an image that marks the refinement region with grey values. Both of these images are input to the DIM model. The third and fourth columns show the alpha mattes predicted by a classical method, Closed-form Matting, and by the proposed method. We see that the classical method has difficulty with the similar dark colours of the hair and tower.

Existing approaches at the time, such as Bayesian Matting or Matting Laplacian, relied largely on colour information to solve the matting problem. Purely colour-based approaches are bound to fail in many natural scenes, as foregrounds and backgrounds often have similar colours. Instead, the authors propose to use deep neural networks, which can consider structural and textural information as well as colour. In cases where the foreground and background share similar colours, the network can rely on structural and textural information to predict the correct matte. In Fig 1 above we see that the classical method has difficulty with the similar dark colours of the hair and tower, while the deep network relies on its structural knowledge of hair to predict the correct alpha matte. The major contributions of this paper are:

  1. Formulating the task of image matting as a deep learning problem
  2. Creating a dataset and augmentation techniques for the problem

Dataset

To create the dataset, the authors collect images of simple subjects on simple backgrounds and manually create ground truth alpha mattes and foreground images in Photoshop. The foreground image represents the true colours of the subject as if it were non-transparent. The authors propose a dataset of 493 unique foreground subjects; each subject has a foreground image (Fig 2c) and a ground truth alpha matte (Fig 2b).

Using the matting equation, we can then composite the extracted foreground F onto a newly selected background B with the alpha matte α, where I is the resulting image and i indexes pixels:

\[I_{i}=\alpha_{i} F_{i}+\left(1-\alpha_{i}\right) B_{i} \quad \alpha_{i} \in[0,1]\]
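As a rough illustration, the matting equation can be applied to whole arrays at once. This is a minimal NumPy sketch, not the authors' code; the function name `composite` and the toy values are assumptions made for the example.

```python
import numpy as np

def composite(alpha, foreground, background):
    """Blend per pixel: I = alpha * F + (1 - alpha) * B."""
    # alpha: (H, W) in [0, 1]; foreground and background: (H, W, 3) float arrays.
    a = np.clip(alpha, 0.0, 1.0)[..., None]  # add a channel axis for broadcasting
    return a * foreground + (1.0 - a) * background

# Toy 2x2 example: constant foreground (200) over constant background (100).
alpha = np.array([[0.0, 0.5], [1.0, 0.25]])
fg = np.full((2, 2, 3), 200.0)
bg = np.full((2, 2, 3), 100.0)
image = composite(alpha, fg, bg)
```

A pixel with alpha 0 takes the background colour, a pixel with alpha 1 takes the foreground colour, and everything in between is a linear mix of the two.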

The second row in Fig 2 shows examples of such compositions on different backgrounds.

During training the network takes in a user generated trimap which specifies a refinement region for which the network should output its prediction. This means that the network does not need to learn about object classes like humans or dogs; the semantic information of the image is provided by the trimap to the network. The authors train the model on random crops as the network only needs to predict transparency values which can be determined locally.

As creating matting data is labour-intensive and expensive, the authors follow earlier work and use a training scheme based on synthetic data generated from the small dataset they created.

The training dataset is created as follows:

  • A random crop of size 320×320, 480×480 or 640×640, centred on a pixel in the refinement region, is sampled from the image.
  • Each crop is resized to a fixed input size of 320×320.
  • The crop is then composed on a COCO background to create a synthetic image using the matting equation.

Each of the 493 foregrounds is composited onto 100 different backgrounds, giving a total dataset of 493 × 100 = 49,300 images.
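The cropping and resizing steps can be sketched roughly as follows. This is only an illustration under several assumptions that are not from the paper: the refinement region is encoded as the value 128 in the trimap, borders are handled by reflect padding, and resizing is done with a simple nearest-neighbour lookup. The function name `sample_training_crop` is hypothetical.

```python
import numpy as np

def sample_training_crop(image, trimap, rng, out_size=320, crop_sizes=(320, 480, 640)):
    """Crop a random-sized square centred on a refinement pixel, then resize."""
    # Assumption: refinement (unknown) pixels are marked with the value 128.
    ys, xs = np.nonzero(trimap == 128)
    k = rng.integers(len(ys))
    cy, cx = int(ys[k]), int(xs[k])
    size = int(rng.choice(crop_sizes))
    half = size // 2
    # Assumption: reflect-pad so crops near the border keep their full size.
    img_p = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    tri_p = np.pad(trimap, ((half, half), (half, half)), mode="reflect")
    # After padding, the centre pixel sits at (cy + half, cx + half),
    # so the window starts at (cy, cx) in padded coordinates.
    img_c = img_p[cy:cy + size, cx:cx + size]
    tri_c = tri_p[cy:cy + size, cx:cx + size]
    # Nearest-neighbour resize via index lookup (stand-in for a proper resize).
    idx = np.arange(out_size) * size // out_size
    return img_c[idx][:, idx], tri_c[idx][:, idx]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(400, 400, 3), dtype=np.uint8)
trimap = np.zeros((400, 400), dtype=np.uint8)
trimap[180:220, 180:220] = 128  # toy refinement region
img_crop, tri_crop = sample_training_crop(image, trimap, rng)
```

Whatever crop size is drawn, the output is always 320×320 and its centre pixel lies in the refinement region, so every training sample contains pixels the network actually has to predict.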

Model

Their model is a fairly straightforward encoder-decoder architecture. The network takes a 4-channel input: the image concatenated with the trimap. The encoder is initialised from the first 14 layers of a pretrained VGG16 and contains 5 max-pooling layers. The decoder has 6 convolution layers and outputs the alpha matte as a 1-channel map.
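To make the input shape concrete, here is a small NumPy sketch of the 4-channel input and of how far 5 max pools shrink a 320×320 crop. It assumes each pool halves the resolution (2×2, stride 2, as in standard VGG16); `make_network_input` is a hypothetical helper name.

```python
import numpy as np

def make_network_input(image, trimap):
    """Stack an RGB image (H, W, 3) and a trimap (H, W) into an (H, W, 4) array."""
    return np.concatenate([image, trimap[..., None]], axis=-1)

x = make_network_input(
    np.zeros((320, 320, 3), dtype=np.float32),
    np.zeros((320, 320), dtype=np.float32),
)
# Assumption: every max pool halves the side length, so five of them
# shrink it by a factor of 2**5 = 32.
bottleneck_side = x.shape[0] // 2 ** 5
```

Under that assumption the encoder output for a 320×320 crop is only 10 pixels wide, which is why the decoder has to upsample back to full resolution.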

The large receptive field of the encoder-decoder architecture leads to smooth, continuous predictions rather than sharp ones. The authors therefore add a small refinement block after the encoder-decoder to further refine its predicted alpha matte. The refinement network contains 4 convolutional layers operating at the full input resolution, so that low-level details are not compressed. The input to this block is a 4-channel concatenation of the original image and the alpha matte predicted by the encoder-decoder. The refinement block is formulated as a residual learning problem to sharpen the previously predicted matte.
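The residual formulation can be sketched as "coarse matte plus a predicted correction". In the sketch below, `residual_fn` is a hypothetical stand-in for the 4 convolutional layers, and clipping the result to [0, 1] is an assumption of this example rather than a detail taken from the paper.

```python
import numpy as np

def apply_refinement(image, coarse_alpha, residual_fn):
    """Refine a coarse matte by adding a predicted correction (residual learning)."""
    # 4-channel input to the refinement stage: RGB image plus the coarse matte.
    x = np.concatenate([image, coarse_alpha[..., None]], axis=-1)
    residual = residual_fn(x)  # hypothetical stand-in for the 4 conv layers
    # Assumption: keep the refined matte a valid opacity map.
    return np.clip(coarse_alpha + residual, 0.0, 1.0)

image = np.zeros((4, 4, 3))
coarse = np.full((4, 4), 0.9)
# Dummy residual predictor that always suggests +0.2.
refined = apply_refinement(image, coarse, lambda x: np.full(x.shape[:2], 0.2))
```

Because the block only has to predict a correction, an all-zero residual leaves the coarse matte unchanged, which makes the refinement stage easy to train on top of an already reasonable prediction.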

Losses

The paper uses two losses, the Alpha Prediction Loss and the Compositional Loss. The Alpha Prediction Loss is a differentiable approximation of the L1 loss (absolute difference) between the predicted and ground truth alpha mattes at each pixel i, where ε is a small constant:

\[\mathcal{L}_{\alpha}^{i}=\sqrt{\left(\alpha_{p}^{i}-\alpha_{g}^{i}\right)^{2}+\epsilon^{2}}, \quad \alpha_{p}^{i}, \alpha_{g}^{i} \in[0,1]\]
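A minimal NumPy sketch of this per-pixel loss is given below. The default `eps=1e-6` and the function name are assumptions of the example.

```python
import numpy as np

def alpha_prediction_loss(alpha_pred, alpha_gt, eps=1e-6):
    """Per-pixel smooth approximation of |alpha_pred - alpha_gt|."""
    # eps keeps the square root differentiable where the difference is zero.
    return np.sqrt((alpha_pred - alpha_gt) ** 2 + eps ** 2)

pred = np.array([0.2, 0.5, 1.0])
gt = np.array([0.0, 0.5, 0.7])
loss = alpha_prediction_loss(pred, gt)
```

For differences much larger than ε the loss is practically the absolute difference; where prediction and ground truth agree exactly it bottoms out at ε instead of 0.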

Once our network predicts an alpha matte, we can use it with the ground truth foreground and a random background to make a new composition. We can do the same with the ground truth alpha matte. The Compositional Loss then computes the approximate L1 loss between the composition made with the predicted alpha matte, c_p, and the composition made with the ground truth alpha matte, c_g, forcing the network to predict the alpha that is correct for the foreground map.

\[\mathcal{L}_{c}^{i}=\sqrt{\left(c_{p}^{i}-c_{g}^{i}\right)^{2}+\epsilon^{2}}\]
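The compositional loss can be sketched the same way: build both compositions with the matting equation, then compare them. The function name, the default `eps=1e-6` and the toy values are assumptions of this example.

```python
import numpy as np

def compositional_loss(alpha_pred, alpha_gt, foreground, background, eps=1e-6):
    """Per-pixel, per-channel smooth L1 between the two compositions."""
    a_p = alpha_pred[..., None]  # (H, W, 1) for broadcasting over RGB
    a_g = alpha_gt[..., None]
    c_p = a_p * foreground + (1.0 - a_p) * background  # uses predicted alpha
    c_g = a_g * foreground + (1.0 - a_g) * background  # uses ground truth alpha
    return np.sqrt((c_p - c_g) ** 2 + eps ** 2)

alpha_pred = np.full((2, 2), 0.8)
alpha_gt = np.full((2, 2), 0.5)
white = np.ones((2, 2, 3))
black = np.zeros((2, 2, 3))
loss_contrast = compositional_loss(alpha_pred, alpha_gt, white, black)
loss_same = compositional_loss(alpha_pred, alpha_gt, white, white)
```

With a white foreground on a black background the alpha error of 0.3 shows up directly in the composition. When foreground and background have the same colour the loss collapses to ε, since an alpha error is invisible in the composite there; this term therefore weights errors by how much they change the final image.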

The final loss is a weighted sum of the two losses. Both losses are computed only over the unknown (refinement) region of the trimap, since the alpha values elsewhere are already given.
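Combining the two terms and restricting them to the pixels the trimap marks for refinement might look like the sketch below. The equal weighting `w=0.5`, the averaging over masked pixels, and the assumption that the compositional loss has already been reduced to one value per pixel are choices made for this example.

```python
import numpy as np

def overall_loss(alpha_loss, comp_loss, refine_mask, w=0.5):
    """Weighted sum of both per-pixel losses, averaged over refinement pixels."""
    # refine_mask: (H, W) array with 1 inside the refinement region, 0 elsewhere.
    n = max(float(refine_mask.sum()), 1.0)  # avoid division by zero
    mean_alpha = (alpha_loss * refine_mask).sum() / n
    mean_comp = (comp_loss * refine_mask).sum() / n
    return w * mean_alpha + (1.0 - w) * mean_comp

alpha_loss = np.array([[0.2, 0.4], [1.0, 1.0]])
comp_loss = np.array([[0.1, 0.3], [1.0, 1.0]])
refine_mask = np.array([[1.0, 1.0], [0.0, 0.0]])
total = overall_loss(alpha_loss, comp_loss, refine_mask)
```

The large errors in the second row do not contribute because they lie outside the mask, so only the pixels the network is asked to refine drive the gradient.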

Results

The paper reports very good results compared to other techniques at the time. The method is more robust to similar foreground and background colours and predicts sharper alpha mattes. It also obtains much better SAD (sum of absolute differences) scores on the alphamatting.com test set. Kindly check the paper for more metrics and visual comparisons.

Conclusion

Overall, this paper introduced several ideas that have greatly contributed to the field of image matting research. Kindly check the paper for more details. To see image matting in action, try out our background remover EraseBG. Official upgraded launch date: 10th March.

Written on February 25, 2021