Will transformers revolutionize computer vision like they did with natural language processing?
That’s one of the major research questions investigated by computer vision scientists in 2020. Early results indicate that transformers achieve very promising performance on image recognition tasks.
Beyond transformers in vision applications, we also noticed continued interest in learning 3D objects from images and in generating realistic images with GANs and autoencoders.
To help you navigate through the overwhelming number of great computer vision papers presented in 2020, we’ve curated and summarized the top 10 CV research papers from this year. We hope that these research summaries will be a good starting point to help you understand the latest trends in this research area.
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- EfficientDet: Scalable and Efficient Object Detection
- Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild
- 3D Photography using Context-aware Layered Depth Inpainting
- Adversarial Latent Autoencoders
- On Learning Sets of Symmetric Elements
- Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems
- Generative Pretraining from Pixels
- RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
- Training Generative Adversarial Networks with Limited Data
- An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
Best Computer Vision Research Papers 2020
1. EfficientDet: Scalable and Efficient Object Detection, by Mingxing Tan, Ruoming Pang, Quoc V. Le
Original Abstract
Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and EfficientNet backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single-model and single-scale, our EfficientDet-D7 achieves state-of-the-art 52.2 AP on COCO test-dev with 52M parameters and 325B FLOPs, being 4×–9× smaller and using 13×–42× fewer FLOPs than previous detectors. Code is available on https://github.com/google/automl/tree/master/efficientdet.
Our Summary
The large size of object detection models deters their deployment in real-world applications such as self-driving cars and robotics. To address this problem, the Google Research team introduces two optimizations, namely (1) a weighted bi-directional feature pyramid network (BiFPN) for efficient multi-scale feature fusion and (2) a novel compound scaling method. By combining these optimizations with the EfficientNet backbones, the authors develop a family of object detectors, called EfficientDet. The experiments demonstrate that these object detectors consistently achieve higher accuracy with far fewer parameters and multiply-adds (FLOPs).
What’s the core idea of this paper?
- To improve the efficiency of object detection models, the authors suggest:
- A weighted bi-directional feature pyramid network (BiFPN) for easy and fast multi-scale feature fusion. It learns the importance of different input features and repeatedly applies top-down and bottom-up multi-scale feature fusion.
- A new compound scaling method for simultaneous scaling of the resolution, depth, and width for all backbone, feature network, and box/class prediction networks.
- These optimizations, together with the EfficientNet backbones, allow the development of a new family of object detectors, called EfficientDet.
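To make the weighted fusion concrete, here is a minimal PyTorch sketch of BiFPN’s “fast normalized fusion”, in which each input feature map receives a learnable non-negative weight. The class name and the two-input usage are our own illustration, not the authors’ code:

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Weighted fusion of multi-scale features, as used at BiFPN nodes.

    Each input feature map gets a learnable scalar weight; ReLU keeps the
    weights non-negative, and they are normalized to sum to roughly 1.
    """
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, *features: torch.Tensor) -> torch.Tensor:
        # All inputs are assumed already resized to the same spatial shape.
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(wi * f for wi, f in zip(w, features))

# Usage: fuse two same-shape feature maps at one BiFPN node.
fuse = FastNormalizedFusion(num_inputs=2)
fused = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```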
What’s the key achievement?
- The evaluation demonstrates that EfficientDet object detectors achieve better accuracy than previous state-of-the-art detectors while having far fewer parameters, in particular:
- the EfficientDet model with 52M parameters achieves state-of-the-art 52.2 AP on the COCO test-dev dataset, outperforming the previous best detector by 1.5 AP while being 4× smaller and using 13× fewer FLOPs;
- with simple modifications, the EfficientDet model achieves 81.74% mIOU accuracy, outperforming DeepLabV3+ by 1.7% on Pascal VOC 2012 semantic segmentation with 9.8× fewer FLOPs;
- the EfficientDet models are up to 3×–8× faster on GPU/CPU than previous detectors.
What does the AI community think?
- The paper was accepted to CVPR 2020, the leading conference in computer vision.
- The high level of interest in the code implementations of this paper makes this research one of the highest-trending papers introduced recently.
What are possible business applications?
- The high accuracy and efficiency of the EfficientDet detectors may enable their application for real-world tasks, including self-driving cars and robotics.
Where can you get implementation code?
- The authors released the official TensorFlow implementation of EfficientDet.
- The PyTorch implementation of this paper can be found here and here.
2. Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild, by Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi
Original Abstract
We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover very accurately the 3D shape of human faces, cat faces and cars from single-view images, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.
Our Summary
The research group from the University of Oxford studies the problem of learning 3D deformable object categories from single-view RGB images without additional supervision. To decompose the image into depth, albedo, illumination, and viewpoint without direct supervision for these factors, they suggest starting by assuming objects to be symmetric. Then, considering that real-world objects are never fully symmetrical, at least due to variations in pose and illumination, the researchers augment the model by explicitly modeling illumination and predicting a dense map with probabilities that any given pixel has a symmetric counterpart. The experiments demonstrate that the introduced approach achieves better reconstruction results than other unsupervised methods. Moreover, it outperforms the recent state-of-the-art method that leverages keypoint supervision.
What’s the core idea of this paper?
- The goal of the introduced approach is to reconstruct the 3D pose, shape, albedo, and illumination of a deformable object from a single RGB image under two challenging conditions:
- no access to 2D or 3D ground truth information such as keypoints, segmentation, depth maps, or prior knowledge of a 3D model;
- using an unconstrained collection of single-view images without having multiple views of the same instance.
- To achieve this goal, the researchers suggest:
- leveraging symmetry as a geometric cue to constrain the decomposition;
- explicitly modeling illumination and using it as an additional cue for recovering the shape;
- augmenting the model to account for potential lack of symmetry – particularly, predicting a dense map that contains the probability of a given pixel having a symmetric counterpart in the image.
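To illustrate how explicit illumination modeling helps recover shape, the sketch below computes a simple Lambertian shading term from a predicted depth map and light direction. It follows the general photo-geometric recipe rather than the authors’ exact implementation; the finite-difference normals and default coefficients are our assumptions:

```python
import torch
import torch.nn.functional as F

def lambertian_shading(depth, light_dir, ambient=0.5, diffuse=0.5):
    """Shade a (B, 1, H, W) depth map with a per-image light direction.

    Surface normals are approximated with finite differences; shading is
    ambient + diffuse * max(0, n . l), the standard Lambertian model.
    """
    # Finite-difference gradients of the depth surface.
    dzdx = depth[..., :, 1:] - depth[..., :, :-1]
    dzdy = depth[..., 1:, :] - depth[..., :-1, :]
    dzdx = F.pad(dzdx, (0, 1, 0, 0), mode="replicate")
    dzdy = F.pad(dzdy, (0, 0, 0, 1), mode="replicate")
    normals = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    normals = F.normalize(normals, dim=1)                   # (B, 3, H, W)
    l = F.normalize(light_dir, dim=1).view(-1, 3, 1, 1)     # light_dir: (B, 3)
    ndotl = (normals * l).sum(dim=1, keepdim=True).clamp(min=0)
    return ambient + diffuse * ndotl                        # (B, 1, H, W)

# The canonical image is then albedo * shading, later warped to the
# predicted viewpoint; symmetry is exploited by flipping depth and albedo.
shading = lambertian_shading(torch.rand(2, 1, 64, 64),
                             torch.tensor([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]))
```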
What’s the key achievement?
- Qualitative evaluation of the suggested approach demonstrates that it reconstructs 3D faces of humans and cats with high fidelity, containing fine details of the nose, eyes, and mouth.
- The method reconstructs higher-quality shapes compared to other state-of-the-art unsupervised methods, and even outperforms the DepthNet model, which uses 2D keypoint annotations for depth prediction.
What does the AI community think?
- The paper received the Best Paper Award at CVPR 2020, the leading conference in computer vision.
What are future research areas?
- Reconstructing more complex objects by extending the model to use either multiple canonical views or a different 3D representation, such as a mesh or a voxel map.
- Improving model performance under extreme lighting conditions and for extreme poses.
Where can you get implementation code?
- The implementation code and demo are available on GitHub.
3. 3D Photography using Context-aware Layered Depth Inpainting, by Meng-Li Shih, Shih-Yang Su, Johannes Kopf, Jia-Bin Huang
Original Abstract
We propose a method for converting a single RGB-D input image into a 3D photo – a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view. We use a Layered Depth Image with explicit pixel connectivity as underlying representation, and present a learning-based inpainting model that synthesizes new local color-and-depth content into the occluded region in a spatial context-aware manner. The resulting 3D photos can be efficiently rendered with motion parallax using standard graphics engines. We validate the effectiveness of our method on a wide range of challenging everyday scenes and show fewer artifacts compared with the state of the arts.
Our Summary
The research team presents a new learning-based approach to generating a 3D photo from a single RGB-D image. The depth in the input image can either come from a cell phone with a stereo camera or be estimated from an RGB image. The authors suggest explicitly storing connectivity across pixels in the representation. To deal with the resulting complexity of the topology and the difficulty of applying a global CNN to the problem, the research team breaks the problem into many local inpainting sub-problems that are solved iteratively. The introduced algorithm results in 3D photos with synthesized textures and structures in occluded regions. The experiments demonstrate its effectiveness compared to the existing state-of-the-art techniques.
What’s the core idea of this paper?
- The core technical novelty of the suggested approach lies in creating a completed Layered Depth Image representation using context-aware color and depth inpainting.
- The algorithm takes an RGB-D image as an input and generates a Layered Depth Image (LDI) with color and depth inpainted in the parts that were occluded in the input image:
- First, a trivial LDI is initialized with a single layer everywhere.
- Then, the method detects major depth discontinuities and groups them into connected depth edges that become the basic units for the main algorithm.
- In the core part of the algorithm:
- each depth edge is selected iteratively;
- the LDI pixels across the edge are disconnected and only background pixels are considered for inpainting;
- the local context region is extracted from the “known” side of the edge to generate a synthesis region on the “unknown” side;
- the synthesized pixels are merged back into the LDI.
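As a small illustration of the first step, the sketch below marks depth-discontinuity pixels with a simple threshold on neighboring disparity differences. The threshold value and function name are our own; the paper additionally cleans up these pixels and links them into connected depth edges:

```python
import torch

def depth_discontinuities(depth: torch.Tensor, threshold: float = 0.04) -> torch.Tensor:
    """Mark pixels whose disparity jumps sharply to a neighbor.

    depth: (H, W) disparity map normalized to [0, 1].
    Returns a boolean (H, W) mask of discontinuity pixels, which the
    method then groups into connected depth edges.
    """
    mask = torch.zeros_like(depth, dtype=torch.bool)
    dx = (depth[:, 1:] - depth[:, :-1]).abs() > threshold   # horizontal jumps
    dy = (depth[1:, :] - depth[:-1, :]).abs() > threshold   # vertical jumps
    mask[:, 1:] |= dx
    mask[:, :-1] |= dx
    mask[1:, :] |= dy
    mask[:-1, :] |= dy
    return mask

edges = depth_discontinuities(torch.rand(128, 128))
```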
What’s the key achievement?
- The experimental results demonstrate that the introduced algorithm results in significantly fewer visual artifacts than existing state-of-the-art techniques:
- according to visual comparisons, content and structures that are synthesized around depth boundaries look plausible even in the case of challenging examples;
- following the quantitative comparisons, the synthesized views generated by the suggested method exhibit better perceptual quality than alternative approaches.
What does the AI community think?
- The paper was accepted to CVPR 2020, the leading conference in computer vision.
- The code implementation of this research paper is attracting lots of attention from the AI community.
What are possible business applications?
- 3D photography provides a much more immersive experience than regular 2D images, so the ability to easily generate a 3D photo from a single RGB-D image can be useful in many business areas, including real estate, e-commerce, marketing, and advertising.
Where can you get implementation code?
- The authors released the code implementation of the suggested approach to 3D photo inpainting on GitHub.
- Examples of the resulting 3D photos in a wide range of everyday scenes can be viewed here.
4. Adversarial Latent Autoencoders, by Stanislav Pidhorskyi, Donald Adjeroh, Gianfranco Doretto
Original Abstract
Autoencoder networks are unsupervised approaches aiming at combining generative and representational properties by learning simultaneously an encoder-generator map. Although studied extensively, the issues of whether they have the same generative power of GANs, or learn disentangled representations, have not been fully addressed. We introduce an autoencoder that tackles these issues jointly, which we call Adversarial Latent Autoencoder (ALAE). It is a general architecture that can leverage recent improvements on GAN training procedures. We designed two autoencoders: one based on a MLP encoder, and another based on a StyleGAN generator, which we call StyleALAE. We verify the disentanglement properties of both architectures. We show that StyleALAE can not only generate 1024×1024 face images with comparable quality of StyleGAN, but at the same resolution can also produce face reconstructions and manipulations based on real images. This makes ALAE the first autoencoder able to compare with, and go beyond the capabilities of a generator-only type of architecture.
Our Summary
The research group from West Virginia University investigates whether autoencoders can have the same generative power as GANs while learning disentangled representations. In particular, they introduce an autoencoder, called Adversarial Latent Autoencoder (ALAE), that can generate images with quality comparable to state-of-the-art GANs while also learning a less entangled representation. This is achieved by allowing the latent distribution to be learned from data and the output data distribution to be learned with an adversarial strategy. Finally, the autoencoder’s reciprocity is imposed in the latent space. The experiments demonstrate that the introduced autoencoder architecture with the generator derived from a StyleGAN, called StyleALAE, has generative power comparable to that of StyleGAN but can also produce face reconstructions and manipulations based on real images rather than generated ones.
What’s the core idea of this paper?
- Introducing a novel autoencoder architecture, called Adversarial Latent Autoencoder (ALAE), that has generative power comparable to state-of-the-art GANs while learning a less entangled representation. The novelty of the approach lies in three major factors:
- To address entanglement, the latent distribution is allowed to be learned from data.
- The output distribution is learned in adversarial settings.
- To implement the above optimizations, the autoencoder’s reciprocity is imposed in the latent space.
- Designing two ALAEs:
- One that is based on the multilayer perceptron (MLP) as an encoder and a symmetric generator.
- One called StyleALAE that has the generator derived from a StyleGAN, a specifically designed companion encoder, and a progressively growing architecture.
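The sketch below shows the distinctive piece of ALAE training: reciprocity imposed in latent space rather than pixel space. The tiny MLP modules are illustrative stand-ins, not the StyleALAE architecture:

```python
import torch
import torch.nn as nn

latent_dim = 256
F_map = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
                      nn.Linear(latent_dim, latent_dim))   # F: z -> w
G = nn.Linear(latent_dim, 3 * 64 * 64)                     # G: w -> image
E = nn.Linear(3 * 64 * 64, latent_dim)                     # E: image -> w

z = torch.randn(8, latent_dim)
w = F_map(z)            # the latent distribution is learned from data
x = G(w)                # generated image (flattened here for brevity)
w_rec = E(x)            # encode the image back into latent space

# Reciprocity in latent space: make E(G(w)) match w.
latent_recon_loss = ((w_rec - w) ** 2).mean()
# The adversarial loss (omitted here) is computed on top of E's output,
# so the discriminator also operates in the latent space.
```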
What’s the key achievement?
- Qualitative and quantitative evaluations demonstrate that:
- Both the MLP-based autoencoder and StyleALAE learn a latent space that is more disentangled than the imposed one.
- StyleALAE can generate high-resolution (1024 × 1024) face and bedroom images of comparable quality to that of StyleGAN.
- Thanks to also learning an encoder network, StyleALAE goes beyond the capabilities of GANs and allows face reconstruction and image manipulation at high resolution based on real images rather than generated ones.
What does the AI community think?
- The paper was accepted to CVPR 2020, the leading conference in computer vision.
- The official repository of the paper on GitHub received over 2000 stars, making it one of the highest-trending papers in this research area.
What are possible business applications?
- The suggested approach enables images to be generated and manipulated with a high level of visual detail, and thus may have numerous applications in real estate, marketing, advertising, etc.
Where can you get implementation code?
- The PyTorch implementation of this research, together with the pre-trained models, is available on GitHub.
5. On Learning Sets of Symmetric Elements, by Haggai Maron, Or Litany, Gal Chechik, Ethan Fetaya
Original Abstract
Learning from unordered sets is a fundamental learning setup, recently attracting increasing attention. Research in this area has focused on the case where elements of the set are represented by feature vectors, and far less emphasis has been given to the common case where set elements themselves adhere to their own symmetries. That case is relevant to numerous applications, from deblurring image bursts to multi-view 3D shape recognition and reconstruction.
In this paper, we present a principled approach to learning sets of general symmetric elements. We first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of elements, like translation in the case of images. We further show that networks that are composed of these layers, called Deep Sets for Symmetric Elements layers (DSS), are universal approximators of both invariant and equivariant functions. DSS layers are also straightforward to implement. Finally, we show that they improve over existing set-learning architectures in a series of experiments with images, graphs, and point-clouds.
Our Summary
The research paper focuses on learning sets whose elements exhibit certain symmetries, a case relevant to learning with sets of images, point clouds, or graphs. The research team from NVIDIA Research, Stanford University, and Bar-Ilan University introduces a principled approach to learning such sets: they first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of the elements, and then show that networks composed of these layers are universal approximators of both invariant and equivariant functions. The experiments demonstrate that the proposed approach achieves significant improvements over previous approaches.
What’s the core idea of this paper?
- The research paper introduces a new principled approach to learning from unordered sets by utilizing the symmetries that the set elements exhibit:
- The authors describe the symmetry group of the sets and characterize the space of linear layers that are equivariant to this group. These layers are called Deep Sets for Symmetric Elements layers (DSS).
- Then, the researchers prove that if invariant networks for the elements of interest are universal, the corresponding invariant DSS networks on sets of such elements are also universal. The same result holds for equivariant networks and equivariant DSS networks.
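For sets of images, a DSS layer combines a Siamese per-element layer with a layer applied to the sum over the set, making it equivariant both to element reordering and to translation within each image. Below is a minimal PyTorch sketch (our own, following the paper’s recipe):

```python
import torch
import torch.nn as nn

class DSSConvLayer(nn.Module):
    """A DSS layer for sets of images.

    One conv is applied to each element (the Siamese part) and another to
    the sum over the set (the aggregation part); their outputs are added.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.siamese = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.aggregate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, channels, H, W)
        b, n, c, h, w = x.shape
        per_elem = self.siamese(x.reshape(b * n, c, h, w)).reshape(b, n, -1, h, w)
        pooled = self.aggregate(x.sum(dim=1))        # sum over the set
        return per_elem + pooled.unsqueeze(1)        # broadcast over elements

# Usage: a set of 5 RGB images per example.
layer = DSSConvLayer(3, 16)
out = layer(torch.randn(2, 5, 3, 32, 32))            # -> (2, 5, 16, 32, 32)
```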
What’s the key achievement?
- The experimental results show that DSS layers outperform previous approaches in a series of tasks, including classification, frame selection in images and shapes, highest-quality image selection, color-channel matching, and burst image deblurring.
What does the AI community think?
- The paper received the Outstanding Paper Award at ICML 2020.
6. Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems, by Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Carola-Bibiane Schönlieb, Hua Huang
Original Abstract
Plug-and-play (PnP) is a non-convex framework that combines ADMM or other proximal algorithms with advanced denoiser priors. Recently, PnP has achieved great empirical success, especially with the integration of deep learning-based denoisers. However, a key problem of PnP based approaches is that they require manual parameter tweaking. It is necessary to obtain high-quality results across the high discrepancy in terms of imaging conditions and varying scene content. In this work, we present a tuning-free PnP proximal algorithm, which can automatically determine the internal parameters including the penalty parameter, the denoising strength and the terminal time. A key part of our approach is to develop a policy network for automatic search of parameters, which can be effectively learned via mixed model-free and model-based deep reinforcement learning. We demonstrate, through numerical and visual experiments, that the learned policy can customize different parameters for different states, and often more efficient and effective than existing handcrafted criteria. Moreover, we discuss the practical considerations of the plugged denoisers, which together with our learned policy yield state-of-the-art results. This is prevalent on both linear and nonlinear exemplary inverse imaging problems, and in particular, we show promising results on Compressed Sensing MRI and phase retrieval.
Our Summary
A key issue with plug-and-play (PnP) approaches is the need to manually tweak parameters. The PnP algorithm introduced in this paper is tuning-free and can automatically determine internal parameters, including the penalty parameter, the denoising strength, and the terminal time. The parameters are optimized with a reinforcement learning (RL) algorithm, where a high reward is given if the policy leads to faster convergence and better restoration accuracy. The extensive numerical and visual experiments demonstrate the effectiveness of the suggested approach on compressed sensing MRI and phase retrieval problems.
What’s the core idea of this paper?
- PnP algorithms offer promising image recovery results. However, their performance is very sensitive to the internal parameter selection (i.e., the penalty parameter, the denoising strength, and the terminal time). The common approach is manual parameter tweaking for each specific problem setting, which is very cumbersome and time-consuming.
- To address this problem, the researchers introduce an RL-based method with a policy network that can customize well-suited parameters for different images:
- an automated parameter selection problem is formulated as a Markov decision process;
- a policy agent gets higher rewards for faster convergence and better restoration accuracy;
- the discrete terminal time and the continuous denoising strength and penalty parameters are optimized jointly.
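A schematic version of such a loop is sketched below for a masked linear measurement model: at every iteration, a policy picks the denoising strength, the penalty parameter, and whether to terminate. The denoiser, data step, and policy here are deliberately simplified stand-ins, not the authors’ networks:

```python
import torch

def denoise(x, sigma):
    # Stand-in prior step; in the paper this is a deep CNN denoiser.
    return x / (1.0 + sigma)

def data_step(z, y, mask, mu):
    # Proximal data-fidelity step for the measurement model y = mask * x:
    # argmin_x 0.5*||mask*x - y||^2 + (mu/2)*||x - z||^2.
    return (mask * y + mu * z) / (mask + mu)

def policy(state):
    # Stand-in for the learned RL policy: returns denoising strength,
    # penalty parameter, and a termination decision.
    return 0.1, 0.5, state["iter"] >= 30

y = torch.randn(64, 64)                       # noisy measurements
mask = (torch.rand(64, 64) > 0.5).float()     # observed-pixel mask
x = mask * y                                  # initialization
for it in range(100):
    sigma, mu, stop = policy({"iter": it})
    if stop:                                  # learned terminal time
        break
    z = denoise(x, sigma)                     # prior / denoising step
    x = data_step(z, y, mask, mu)             # data-fidelity step
```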
What’s the key achievement?
- An extensive range of numerical and visual experiments demonstrate that the introduced tuning-free PnP algorithm:
- outperforms state-of-the-art techniques by a large margin on the linear inverse imaging problem, namely compressed sensing MRI (especially under difficult settings);
- demonstrates state-of-the-art performance on the non-linear inverse imaging problem, namely phase retrieval, where it produces cleaner and clearer results than competing techniques;
- often reaches a level of performance comparable to the “oracle” parameters tuned via the inaccessible ground truth.
What does the AI community think?
- The paper received the Outstanding Paper Award at ICML 2020.
What are possible business applications?
- The introduced tuning-free PnP proximal algorithm can be applied to different inverse imaging problems, including magnetic resonance imaging (MRI), computed tomography (CT), microscopy, and inverse scattering.
Where can you get implementation code?
- The implementation of this research paper will be released on GitHub.
7. Generative Pretraining from Pixels, by Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
Original Abstract
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full finetuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
Our Summary
Generative pre-training methods have had a substantial impact on natural language processing over the last few years. The OpenAI research team re-evaluates these techniques on images and demonstrates that generative pre-training is competitive with other self-supervised approaches. The introduced approach consists of a pre-training stage, where both autoregressive and BERT objectives are explored, and a fine-tuning stage. The authors apply a Transformer architecture to predict pixels instead of language tokens. The experiments demonstrate that generative image modeling learns state-of-the-art representations for low-resolution datasets and achieves comparable results to other self-supervised methods on ImageNet.
What’s the core idea of this paper?
- The authors claim that generative pre-training methods for images can be competitive with other self-supervised approaches when using a flexible architecture such as Transformer, an efficient likelihood-based objective, and significant computational resources (2048 TPU cores).
- They introduce Image GPT, or iGPT, which is based on GPT-2 but where the sequence Transformer architecture predicts pixels instead of language tokens:
- First, raw images are resized to low resolution and reshaped into a 1D sequence.
- Second, autoregressive next pixel prediction or masked pixel prediction (BERT) is chosen as the pre-training objective.
- Finally, the quality of the representations learned with these objectives is evaluated with linear probes or fine-tuning.
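A minimal sketch of this pipeline is shown below: downsample, quantize pixels into discrete tokens, flatten to 1D, and train with a next-pixel cross-entropy loss. For brevity we quantize a single channel to 256 levels, whereas the paper clusters RGB values into a 9-bit color palette, and we stand in random tensors for the Transformer’s logits:

```python
import torch
import torch.nn.functional as F

images = torch.rand(4, 1, 224, 224)                 # toy input batch
low_res = F.interpolate(images, size=(32, 32))      # resize to low resolution
tokens = (low_res * 255).long().clamp(0, 255)       # quantize into discrete tokens
seq = tokens.flatten(start_dim=1)                   # (B, 1024) 1D pixel sequence

# Autoregressive objective: predict pixel t from pixels < t. In iGPT the
# logits come from a GPT-2-style sequence Transformer over pixel tokens.
logits = torch.randn(4, seq.shape[1], 256, requires_grad=True)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 256),    # predictions for positions 1..T-1
    seq[:, 1:].reshape(-1),             # next-pixel targets
)
```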
What’s the key achievement?
- The experiments demonstrate that iGPT:
- outperforms a supervised WideResNet on CIFAR-10, CIFAR-100, and STL-10 datasets;
- achieves 72% accuracy on ImageNet, which is competitive with the recent contrastive learning approaches that require fewer parameters but work with higher resolution and utilize knowledge of the 2D input structure;
- after fine-tuning, achieves 99% accuracy on CIFAR-10, on par with GPipe, the best model that pre-trains using ImageNet labels.
What does the AI community think?
- The paper received an Honorable Mention at ICML 2020.
What are future research areas?
- Exploring more efficient self-attention approaches.
- Revisiting the representation learning capabilities of other families of generative models (e.g., flows, VAEs).
Where can you get implementation code?
- TensorFlow implementation of iGPT by the OpenAI team is available here.
- PyTorch implementation of the model is available here.
8. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow, by Zachary Teed and Jia Deng
Original Abstract
We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at https://github.com/princeton-vl/RAFT.
Our Summary
The researchers from Princeton University investigate the problem of optical flow, the task of estimating per-pixel motion between video frames. They introduce Recurrent All-Pairs Field Transforms (RAFT), a deep network architecture that consists of three key components: (1) a feature encoder to extract a feature vector for each pixel; (2) a correlation layer to compute the visual similarity between pixels; and (3) a recurrent update operator to retrieve values from the correlation volumes and iteratively update a flow field. The experiments demonstrate that RAFT achieves state-of-the-art performance on both Sintel and KITTI datasets.
What’s the core idea of this paper?
- The researchers introduce a new deep network architecture for optical flow, called Recurrent All-Pairs Field Transforms (RAFT). It consists of three main components:
- A feature encoder that extracts per-pixel features from both input images, along with a context encoder that extracts features only from the first frame.
- A correlation layer that constructs a 4D correlation volume by taking the inner product of all pairs of feature vectors, with subsequent pooling to produce lower-resolution volumes.
- A recurrent GRU-based update operator that iteratively updates a flow field by retrieving values from the correlation volumes.
- The RAFT architecture is inspired by many existing works but is essentially novel:
- RAFT maintains and updates a single fixed flow field at high resolution, in contrast to the prevailing approach where the flow is first estimated at low resolution and then upsampled.
- The update operator of RAFT is recurrent and lightweight, whereas recent approaches are mostly limited to a fixed number of iterations. Its design also allows lookups on 4D multi-scale correlation volumes, in contrast to prior work that typically uses only plain convolution or correlation layers.
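The all-pairs correlation volume at the heart of RAFT is easy to express directly. The sketch below builds the full 4D volume from two feature maps; it is a simplified illustration that omits the pooling used to produce the multi-scale pyramid:

```python
import torch

def all_pairs_correlation(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Build RAFT-style 4D correlation from per-pixel features.

    f1, f2: (B, C, H, W) feature maps from the two frames.
    Returns a (B, H, W, H, W) volume of dot products between every pair
    of pixels; RAFT then average-pools the last two dimensions to build
    a multi-scale pyramid for the recurrent lookups.
    """
    b, c, h, w = f1.shape
    corr = torch.einsum("bchw,bcuv->bhwuv", f1, f2)
    return corr / (c ** 0.5)    # scale by sqrt of feature dimension

volume = all_pairs_correlation(torch.randn(1, 64, 46, 62),
                               torch.randn(1, 64, 46, 62))
```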
What’s the key achievement?
- RAFT achieves:
- State-of-the-art accuracy: on KITTI, F1-all error of 5.10% compared to the previous best result of 6.10%, and on Sintel, an end-point-error of 2.855 pixels compared to the previous best published result of 4.098 pixels.
- Strong generalization: an end-point-error of 5.04 pixels on KITTI, when trained only on synthetic data.
- High efficiency: it processes 1088×436 video at 10 frames per second on a 1080Ti GPU.
What does the AI community think?
- The paper received the Best Paper Award at ECCV 2020, one of the key conferences in computer vision.
- The paper is trending in the AI research community, as evident from the repository stats on GitHub.
What are possible business applications?
- RAFT can improve the performance of computer vision systems in tracking a specific object of interest or tracking all objects of a particular type or category in the video.
Where can you get implementation code?
- The source code and demos are available on GitHub.
9. Training Generative Adversarial Networks with Limited Data, by Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, Timo Aila
Original Abstract
Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.42.
Our Summary
Despite a seemingly unlimited number of images available online, it’s usually difficult to collect a large dataset for training a generative adversarial network (GAN) for specific real-world applications. Datasets with images of a certain type are usually relatively small, which results in the discriminator overfitting to the training samples. To address this issue, the NVIDIA research team introduces an adaptive discriminator augmentation (ADA) approach that allows the application of a wide range of augmentation techniques, while ensuring that these augmentations do not leak into generated images. The approach is based on evaluating the discriminator and training the generator only using augmented images. The experiments on several datasets demonstrate that the suggested approach achieves good results with only a few thousand images.
What’s the core idea of this paper?
- Specific applications of GANs usually require images of a certain type that are not easily available in large numbers.
- Small datasets lead to a discriminator overfitting to the training samples.
- Data augmentation is a standard solution to the overfitting problem. However, when applied to GAN training, standard dataset augmentations tend to ‘leak’ into generated images (e.g., noisy augmentation leads to noisy results).
- To avoid leaking, the NVIDIA researchers suggest evaluating the discriminator and training the generator only using augmented images. They call their approach stochastic discriminator augmentation.
- They also introduce a variant of this approach, called adaptive discriminator augmentation (ADA), where the augmentation strength is adjusted algorithmically.
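The sketch below shows the control logic of ADA: every discriminator input, real or generated, is augmented with probability p, and p is adapted online to keep the overfitting heuristic r_t = E[sign(D(reals))] near a target. The toy generator, discriminator, and placeholder augmentation are our stand-ins; the target of 0.6 and the 500k-image adaptation horizon follow the paper’s defaults as we understand them:

```python
import torch
import torch.nn as nn

G = nn.Linear(16, 64)                      # toy generator stand-in
D = nn.Linear(64, 1)                       # toy discriminator stand-in
real_data = torch.randn(1000, 64)

p, target_rt, adjust_imgs, batch = 0.0, 0.6, 500_000, 32

def maybe_augment(x: torch.Tensor, p: float) -> torch.Tensor:
    # Placeholder augmentation (sign flip); ADA applies a pipeline of
    # geometric/color/noise transforms, each with probability p.
    mask = (torch.rand(x.shape[0], 1) < p).float()
    return mask * -x + (1 - mask) * x

for step in range(100):
    reals = real_data[torch.randint(0, 1000, (batch,))]
    fakes = G(torch.randn(batch, 16))
    d_real = D(maybe_augment(reals, p))    # D only ever sees (possibly)
    d_fake = D(maybe_augment(fakes, p))    # augmented images
    # ... usual non-saturating GAN losses and optimizer steps go here ...
    r_t = torch.sign(d_real).mean().item() # overfitting heuristic
    p += (1 if r_t > target_rt else -1) * batch / adjust_imgs
    p = min(max(p, 0.0), 1.0)              # keep p a valid probability
```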
What’s the key achievement?
- Matching StyleGAN2 performance with an order of magnitude fewer images.
- Achieving a new record Fréchet inception distance (FID) of 2.42 on CIFAR-10, compared to the previous state of the art of 5.59.
- Introducing MetFaces, a novel benchmark dataset for limited data scenarios.
What does the AI community think?
- The paper was accepted to NeurIPS 2020, the top conference in artificial intelligence.
What are future research areas?
- Searching for the most effective set of augmentations.
- Exploring the effectiveness of recently published techniques, such as the U-net discriminator or multi-modal generator.
What are possible business applications?
- The introduced approach allows a significant reduction in the number of training images, which lowers the barrier for using GANs in many applied fields.
Where can you get implementation code?
- The implementation code and models are available on GitHub.
10. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Original Abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Our Summary
The authors of this paper show that a pure Transformer can perform very well on image classification tasks. They introduce Vision Transformer (ViT), which is applied directly to sequences of image patches by analogy with tokens (words) in NLP. When trained on large datasets of 14M–300M images, Vision Transformer approaches or beats state-of-the-art CNN-based models on image recognition tasks. In particular, it achieves an accuracy of 88.36% on ImageNet, 90.77% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.16% on the VTAB suite of 19 tasks.
What’s the core idea of this paper?
- When applying the Transformer architecture to images, the authors follow the design of the original NLP Transformer as closely as possible.
- The introduced Transformer-based approach to image classification includes the following steps (see the code sketch below):
- splitting images into fixed-size patches;
- linearly embedding each of them;
- prepending an extra learnable ‘classification token’ to the resulting sequence;
- adding position embeddings to the sequence of vectors;
- feeding the sequence to a standard Transformer encoder.
- Similarly to Transformers in NLP, Vision Transformer is typically pre-trained on large datasets and fine-tuned to downstream tasks.
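Here is a minimal sketch of the ViT input pipeline described above. Dimensions follow the ViT-Base 16×16 configuration; the variable names are our own:

```python
import torch
import torch.nn as nn

patch, dim, img = 16, 768, 224
n_patches = (img // patch) ** 2                          # 196 patches for 224/16

# A strided conv splits the image into patches and linearly embeds them.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))         # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

x = torch.randn(8, 3, img, img)
tokens = to_patches(x).flatten(2).transpose(1, 2)        # (8, 196, 768)
tokens = torch.cat([cls_token.expand(8, -1, -1), tokens], dim=1)
tokens = tokens + pos_embed                              # add position embeddings
# `tokens` now feeds a standard Transformer encoder; classification reads
# the final representation of the first ([class]) token.
```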
What’s the key achievement?
- Vision Transformer pre-trained on the JFT300M dataset matches or outperforms ResNet-based baselines while requiring substantially fewer computational resources to pre-train. It achieves an accuracy of:
- 88.36% on ImageNet;
- 90.77% on ImageNet-ReaL;
- 94.55% on CIFAR-100;
- 97.56% on Oxford-IIIT Pets;
- 99.74% on Oxford Flowers-102;
- 77.16% on the VTAB suite of 19 tasks.
What does the AI community think?
- The paper is trending in the AI research community, as evident from the repository stats on GitHub.
- It is also under review for ICLR 2021, one of the key conferences in deep learning.
What are future research areas?
- Applying Vision Transformer to other computer vision tasks, such as detection and segmentation.
- Exploring self-supervised pre-training methods.
- Analyzing the few-shot properties of Vision Transformer.
- Exploring contrastive pre-training.
- Further scaling ViT.
What are possible business applications?
- Thanks to their efficient pre-training and high performance, Transformers may replace convolutional networks in many computer vision applications, including navigation, automatic inspection, and visual surveillance.
Where can you get implementation code?
- The PyTorch implementation of Vision Transformer is available on GitHub.
If you like these research summaries, you might be also interested in the following articles:
- 2020’s Top AI & Machine Learning Research Papers
- GPT-3 & Beyond: 10 NLP Research Papers You Should Read
- AAAI 2021: Top Research Papers With Business Applications
- ICLR 2021: Key Research Papers