CVPR 2020 is yet another major AI conference taking place 100% virtually this year. But regardless of the format, the conference still showcases the most interesting cutting-edge research ideas in computer vision and image generation.
Here we’ve picked out the research papers that started trending within the AI research community months before their actual presentation at CVPR 2020. These papers cover the efficiency of object detectors, novel techniques for converting RGB-D images into 3D photos, and autoencoders that go beyond the capabilities of generative adversarial networks (GANs) in image generation and manipulation.
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- EfficientDet: Scalable and Efficient Object Detection
- 3D Photography using Context-aware Layered Depth Inpainting
- Adversarial Latent Autoencoders
Cutting-Edge Research Papers From CVPR 2020
1. EfficientDet: Scalable and Efficient Object Detection, by Mingxing Tan, Ruoming Pang, Quoc V. Le
Original Abstract
Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and EfficientNet backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single-model and single-scale, our EfficientDet-D7 achieves state-of-the-art 52.2 AP on COCO test-dev with 52M parameters and 325B FLOPs, being 4×–9× smaller and using 13×–42× fewer FLOPs than previous detectors. Code is available on GitHub.
Our Summary
The large size of object detection models hinders their deployment in real-world applications such as self-driving cars and robotics. To address this problem, the Google Research team introduces two optimizations: (1) a weighted bi-directional feature pyramid network (BiFPN) for efficient multi-scale feature fusion, and (2) a novel compound scaling method. Combining these optimizations with EfficientNet backbones, the authors develop a family of object detectors called EfficientDet. The experiments demonstrate that these detectors consistently achieve higher accuracy than prior detectors while using far fewer parameters and multiply-adds (FLOPs).
What’s the core idea of this paper?
- To improve the efficiency of object detection models, the authors suggest:
- A weighted bi-directional feature pyramid network (BiFPN) for easy and fast multi-scale feature fusion. It learns the importance of different input features and repeatedly applies top-down and bottom-up multi-scale feature fusion (see the sketch after this list).
- A new compound scaling method for simultaneous scaling of the resolution, depth, and width for all backbone, feature network, and box/class prediction networks.
- These optimizations, together with the EfficientNet backbones, allow the development of a new family of object detectors, called EfficientDet.
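To make the weighted fusion idea more concrete, here is a minimal PyTorch sketch of BiFPN-style “fast normalized fusion”: each input feature map gets a learned, non-negative scalar weight that is normalized before the features are summed. Treat it as an illustrative assumption on our part rather than the authors’ official TensorFlow implementation, and note that in the actual BiFPN each fusion node is followed by additional convolution layers.

```python
# Minimal sketch of BiFPN-style "fast normalized fusion" (illustrative only,
# not the official EfficientDet code). Shapes and layer names are assumptions.
import torch
import torch.nn as nn


class WeightedFusion(nn.Module):
    """Fuse several feature maps of the same shape with learned, normalized weights."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one scalar weight per input
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.weights)          # keep weights non-negative
        w = w / (w.sum() + self.eps)          # fast normalization instead of softmax
        return sum(w[i] * x for i, x in enumerate(inputs))


# Example: fuse a feature map with an upsampled feature from a coarser level.
p4_in = torch.randn(1, 64, 32, 32)
p5_upsampled = torch.randn(1, 64, 32, 32)
fused = WeightedFusion(num_inputs=2)([p4_in, p5_upsampled])
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```

In BiFPN, nodes like this are stacked into repeated top-down and bottom-up passes, so the network learns how much each resolution should contribute at every fusion point.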
What’s the key achievement?
- The evaluation demonstrates that EfficientDet object detectors achieve better accuracy than previous state-of-the-art detectors while having far fewer parameters, in particular:
- the EfficientDet model with 52M parameters achieves a state-of-the-art 52.2 AP on the COCO test-dev dataset, outperforming the previous best detector by 1.5 AP while being 4× smaller and using 13× fewer FLOPs;
- with simple modifications, the EfficientDet model achieves 81.74% mIoU on Pascal VOC 2012 semantic segmentation, outperforming DeepLabV3+ by 1.7% while using 9.8× fewer FLOPs;
- the EfficientDet models run 3× to 8× faster on GPU/CPU than previous detectors.
What does the AI community think?
- The paper was accepted to CVPR 2020, the leading conference in computer vision.
- The high level of interest in the code implementations of this paper makes this research one of the highest-trending papers introduced recently.
What are possible business applications?
- The high accuracy and efficiency of the EfficientDet detectors may enable their application for real-world tasks, including self-driving cars and robotics.
Where can you get implementation code?
- The authors released the official TensorFlow implementation of EfficientDet.
- The PyTorch implementation of this paper can be found here and here.
2. 3D Photography using Context-aware Layered Depth Inpainting, by Meng-Li Shih, Shih-Yang Su, Johannes Kopf, Jia-Bin Huang
Original Abstract
We propose a method for converting a single RGB-D input image into a 3D photo – a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view. We use a Layered Depth Image with explicit pixel connectivity as underlying representation, and present a learning-based inpainting model that synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner. The resulting 3D photos can be efficiently rendered with motion parallax using standard graphics engines. We validate the effectiveness of our method on a wide range of challenging everyday scenes and show fewer artifacts compared with the state of the arts.
Our Summary
The research team presents a new learning-based approach to generating a 3D photo from a single RGB-D image. The depth in the input image can either come from a cell phone with a stereo camera or be estimated from an RGB image. The authors suggest explicitly storing connectivity across pixels in the representation. To deal with the resulting complexity of the topology and the difficulty of applying a global CNN to the problem, the research team breaks the problem into many local inpainting sub-problems that are solved iteratively. The introduced algorithm results in 3D photos with synthesized textures and structures in occluded regions. The experiments demonstrate its effectiveness compared to the existing state-of-the-art techniques.
What’s the core idea of this paper?
- The core technical novelty of the suggested approach lies in creating a completed Layered Depth Image representation using context-aware color and depth inpainting.
- The algorithm takes an RGB-D image as an input and generates a Layered Depth Image (LDI) with color and depth inpainted in the parts that were occluded in the input image:
- First, a trivial LDI is initialized with a single layer everywhere.
- Then, the method detects major depth discontinuities and groups them into connected depth edges, which become the basic units for the main algorithm (see the sketch after this list).
- In the core part of the algorithm:
- each depth edge is selected iteratively;
- the LDI pixels across the edge are disconnected and only background pixels are considered for inpainting;
- the local context region is extracted from the “known” side of the edge to generate a synthesis region on the “unknown” side;
- the synthesized pixels are merged back into the LDI.
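As a concrete illustration of one step in this pipeline, the NumPy sketch below finds depth discontinuities by thresholding disparity differences between neighboring pixels. The threshold value and the implementation details are our own simplifying assumptions, not the authors’ exact procedure; in the paper, the detected discontinuities are further grouped into connected depth edges and then inpainted with learned color and depth networks.

```python
# Simplified depth-discontinuity detection (illustrative only, not the authors' code).
import numpy as np


def depth_discontinuities(depth: np.ndarray, threshold: float = 0.04) -> np.ndarray:
    """Return a boolean map marking pixels with a large depth jump to a neighbor."""
    disparity = 1.0 / np.maximum(depth, 1e-6)     # work in disparity space, common for RGB-D
    jump_x = np.abs(np.diff(disparity, axis=1))   # horizontal neighbor differences
    jump_y = np.abs(np.diff(disparity, axis=0))   # vertical neighbor differences
    edges = np.zeros_like(depth, dtype=bool)
    edges[:, :-1] |= jump_x > threshold
    edges[:-1, :] |= jump_y > threshold
    return edges


# Example: a synthetic depth map with a near square in front of a far background.
depth = np.full((64, 64), 10.0)
depth[20:40, 20:40] = 2.0
print(depth_discontinuities(depth).sum())  # nonzero: edges appear around the square
```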
What’s the key achievement?
- The experimental results demonstrate that the introduced algorithm results in significantly fewer visual artifacts than existing state-of-the-art techniques:
- according to visual comparisons, the content and structures synthesized around depth boundaries look plausible even for challenging examples;
- in quantitative comparisons, the views synthesized by the suggested method exhibit better perceptual quality than those of alternative approaches.
What does the AI community think?
- The paper was accepted to CVPR 2020, the leading conference in computer vision.
- The code implementation of this research paper is attracting lots of attention from the AI community.
What are possible business applications?
- 3D photography provides a much more immersive experience than regular 2D images, so the ability to easily generate a 3D photo from a single RGB-D image can be useful in many business areas, including real estate, e-commerce, marketing, and advertising.
Where can you get implementation code?
- The authors released the code implementation of the suggested approach to 3D photo inpainting on GitHub.
- Examples of the resulting 3D photos in a wide range of everyday scenes can be viewed here.
3. Adversarial Latent Autoencoders, by Stanislav Pidhorskyi, Donald Adjeroh, Gianfranco Doretto
Original Abstract
Autoencoder networks are unsupervised approaches aiming at combining generative and representational properties by learning simultaneously an encoder-generator map. Although studied extensively, the issues of whether they have the same generative power of GANs, or learn disentangled representations, have not been fully addressed. We introduce an autoencoder that tackles these issues jointly, which we call Adversarial Latent Autoencoder (ALAE). It is a general architecture that can leverage recent improvements on GAN training procedures. We designed two autoencoders: one based on a MLP encoder, and another based on a StyleGAN generator, which we call StyleALAE. We verify the disentanglement properties of both architectures. We show that StyleALAE can not only generate 1024×1024 face images with comparable quality of StyleGAN, but at the same resolution can also produce face reconstructions and manipulations based on real images. This makes ALAE the first autoencoder able to compare with, and go beyond the capabilities of a generator-only type of architecture.
Our Summary
The research group from West Virginia University investigates whether autoencoders can have the same generative power as GANs while learning disentangled representations. In particular, they introduce an autoencoder, called Adversarial Latent Autoencoder (ALAE), that can generate images with quality comparable to state-of-the-art GANs while also learning a less entangled representation. This is achieved by allowing the latent distribution to be learned from data and the output data distribution to be learned with an adversarial strategy. Finally, the autoencoder’s reciprocity is imposed in the latent space. The experiments demonstrate that the introduced autoencoder architecture with a generator derived from StyleGAN, called StyleALAE, has generative power comparable to that of StyleGAN but can also produce face reconstructions and image manipulations based on real images rather than generated ones.
ALAE architecture
What’s the core idea of this paper?
- Introducing a novel autoencoder architecture, called Adversarial Latent Autoencoder (ALAE), that has generative power comparable to state-of-the-art GANs while learning a less entangled representation. The novelty of the approach lies in three major factors:
- To address entanglement, the latent distribution is allowed to be learned from data.
- The output distribution is learned in adversarial settings.
- The autoencoder’s reciprocity is imposed in the latent space rather than in the data space (see the sketch below).
- Designing two ALAEs:
- One based on a multilayer perceptron (MLP) encoder and a symmetric generator.
- One called StyleALAE that has the generator derived from a StyleGAN, a specifically designed companion encoder, and a progressively growing architecture.
StyleALAE architecture
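To make the latent-space reciprocity more concrete, here is a minimal PyTorch sketch in which tiny MLPs stand in for ALAE’s mapping network F, generator G, and encoder E. It shows only the latent reconstruction term; the adversarial part (a discriminator applied to the encoder’s output) is omitted, and none of this is the authors’ official implementation.

```python
# Toy ALAE-style latent reciprocity (illustrative only, not the official code).
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # assumed toy dimensions

F = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                  nn.Linear(latent_dim, latent_dim))          # mapping network: prior z -> latent w
G = nn.Sequential(nn.Linear(latent_dim, data_dim))            # generator: w -> data
E = nn.Sequential(nn.Linear(data_dim, latent_dim))            # encoder: data -> w

z = torch.randn(16, latent_dim)   # sample from the imposed prior
w = F(z)                          # map prior samples into the learned latent space
x = G(w)                          # generated samples
w_rec = E(x)                      # encode the generated samples back to latent space

# Reciprocity is imposed between w and E(G(w)), i.e. in latent space,
# instead of the usual pixel-space reconstruction x ≈ G(E(x)).
latent_recon_loss = ((w - w_rec) ** 2).mean()
latent_recon_loss.backward()
```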
What’s the key achievement?
- Qualitative and quantitative evaluations demonstrate that:
- Both the MLP-based autoencoder and StyleALAE learn a latent space that is more disentangled than the imposed prior distribution.
- StyleALAE can generate high-resolution (1024 × 1024) face and bedroom images of comparable quality to that of StyleGAN.
- Thanks to also learning an encoder network, StyleALAE goes beyond the capabilities of GANs and allows face reconstruction and image manipulation at high resolution based on real images rather than generated ones.
Reconstruction of unseen images with StyleALAE at 1024 × 1024
What does the AI community think?
- The paper was accepted to CVPR 2020, the leading conference in computer vision.
- The official repository of the paper on GitHub received over 2000 stars, making it one of the highest-trending papers in this research area.
What are possible business applications?
- The suggested approach enables images to be generated and manipulated with a high level of visual detail, and thus may have numerous applications in real estate, marketing, advertising, etc.
Where can you get implementation code?
- The PyTorch implementation of this research, together with the pre-trained models, is available on GitHub.
If you are interested in the latest Computer Vision research breakthroughs, check out the following articles:
- 10 Cutting-Edge Research Papers In Computer Vision From 2019
- Top 10 Research Papers In Computer Vision and Image Generation From 2018
- 5 New Generative Adversarial Network (GAN) Architectures For Image Synthesis
- 4 Cutting-Edge AI Techniques for Video Generation
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.