UPDATE: We’ve also summarized the top 2020 Computer Vision research papers.
Today we can see how computer vision (CV) systems are revolutionizing whole industries and business functions with successful applications in healthcare, security, transportation, retail, banking, agriculture, and more.
In 2019, we saw lots of novel architectures and approaches that further improved the perceptive and generative capacities of visual systems. To help you navigate through the overwhelming number of great computer vision papers presented this year, we’ve curated and summarized the top 10 CV research papers of 2019 that will help you understand the latest trends in this research area.
The papers that we selected cover optimization of convolutional networks, unsupervised learning in computer vision, image generation and evaluation of machine-generated images, visual-language navigation, captioning changes between two images with natural language, and more.
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- Learning the Depths of Moving People by Watching Frozen People
- Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
- A Theory of Fermat Paths for Non-Line-of-Sight Shape Reconstruction
- Reasoning-RCNN: Unifying Adaptive Global Reasoning into Large-scale Object Detection
- Fixing the Train-Test Resolution Discrepancy
- SinGAN: Learning a Generative Model from a Single Natural Image
- Local Aggregation for Unsupervised Learning of Visual Embeddings
- Robust Change Captioning
- HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
10 Important Computer Vision Research Papers of 2019
1. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, by Mingxing Tan and Quoc V. Le
Original Abstract
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at this URL.
Our Summary
The researchers from the Google Research Brain Team introduce a better way to scale up Convolutional Neural Networks (CNNs). Conventionally, CNNs are first developed and then later scaled up, in terms of depth, width, or the resolution of the input images, as more resources become available. The authors show that if just one of these parameters is scaled up, or if the parameters are all scaled up arbitrarily, this leads to rapidly diminishing returns relative to the extra computational power needed. Instead, they demonstrate that there is an optimal ratio of depth, width, and resolution in order to maximize efficiency and accuracy. This is called compound scaling. The result is that EfficientNet’s performance surpasses the accuracy of other CNNs on ImageNet by up to 6% while being up to ten times more efficient in terms of speed and size.
What’s the core idea of this paper?
- The depth (number of layers), width, and input resolution of a CNN should be scaled up at a specific ratio relative to each other, rather than arbitrarily (see the compound-scaling sketch after this list).
- Moreover, since the effectiveness of model scaling depends heavily on the baseline network, the researchers leveraged a neural architecture search to develop a new baseline model and scaled it up to obtain a family of models, called EfficientNets.
- You can choose one of the EfficientNets depending on the available resources.
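To make the compound-scaling rule concrete, here is a minimal Python sketch. The coefficients α = 1.2, β = 1.1, γ = 1.15 are the values the paper reports for the EfficientNet-B0 baseline; the helper function itself is our illustration, not the authors' code.

```python
# Compound scaling sketch: depth, width, and resolution grow together,
# governed by a single compound coefficient phi (alpha=1.2, beta=1.1,
# gamma=1.15 are the values found by grid search on EfficientNet-B0).

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15, base_resolution=224):
    """Return (depth multiplier, width multiplier, input resolution) for phi."""
    depth_mult = alpha ** phi                                 # number of layers
    width_mult = beta ** phi                                  # number of channels
    resolution = int(round(base_resolution * gamma ** phi))   # input image size
    return depth_mult, width_mult, resolution

# Larger phi means roughly 2**phi times the baseline FLOPs,
# since alpha * beta**2 * gamma**2 is approximately 2 in the paper.
for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}px")
```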
What’s the key achievement?
- EfficientNets achieve new state-of-the-art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average.
- In particular, EfficientNet-B7 with 66M parameters achieves 84.4% top-1 accuracy and 97.1% top-5 accuracy on ImageNet and is 8 times smaller and 6 times faster than GPipe (557M parameters), the previous state-of-the-art scalable CNN.
What does the AI community think?
- The paper was presented orally at ICML 2019, the leading conference in machine learning.
What are future research areas?
- The authors state on the Google AI blog that they expect EfficientNets to “serve as a new foundation for future computer vision tasks”.
What are possible business applications?
- The results of this research can be very important for business applications of computer vision, since the suggested approach enables CNNs to deliver more accurate results at a lower computational cost.
Where can you get implementation code?
- The authors have released the source code for their TensorFlow implementation of EfficientNet here.
- There is also a PyTorch implementation available here.
2. Learning the Depths of Moving People by Watching Frozen People, by Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, William T. Freeman
Original Abstract
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects’ motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because people are stationary, training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and show various 3D effects produced using our predicted depth.
Our Summary
Humans are adept at interpreting the geometry and depth of moving objects in a natural scene even with one eye closed, but computers have difficulty reconstructing depth when motion is involved. Currently, depth reconstruction relies on having a still subject with a camera that moves around it or a multi-camera array to capture moving subjects. The Google Research team proposes a new single-camera method for generating depth maps of entire natural scenes in the case of simultaneous subject and camera motion. The introduced deep neural network is trained on a novel database of YouTube videos in which people imitate still mannequins, which allow for traditional stereo mapping of natural human poses. The experiments demonstrate the effectiveness of the suggested approach in predicting depth in a number of real-world video sequences.
What’s the core idea of this paper?
- This research addresses the challenge of mapping depth in a natural scene with a human subject where both the subject and the single camera are simultaneously moving.
- The authors train a deep neural network using a database of YouTube videos of people imitating mannequins (the Mannequin Challenge Dataset), from which depth can be mapped with existing stereo techniques.
- The suggested network takes an RGB image, a mask of human regions, and an initial depth estimate of the environment as input, and outputs a dense depth map over the entire image, covering both the environment and the humans (see the interface sketch after this list).
- Initial depth is estimated through motion parallax between two frames in a video, assuming humans are moving and the rest of the scene is stationary.
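To illustrate that interface, here is a toy PyTorch sketch: the RGB frame, the human mask, and the masked initial depth are concatenated channel-wise and mapped to a dense depth map. The tiny convolutional stack is a placeholder of our own, not the authors' network architecture.

```python
import torch
import torch.nn as nn

class DepthNetSketch(nn.Module):
    """Toy stand-in for the depth network: concatenates an RGB frame (3 ch),
    a human-region mask (1 ch), and an initial parallax-based depth map (1 ch),
    and regresses a dense depth map. Only the input/output interface is
    intended to be representative."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),   # dense depth for the full image
        )

    def forward(self, rgb, human_mask, initial_depth):
        # Parallax depth is unreliable where humans move, so mask it out.
        masked_depth = initial_depth * (1 - human_mask)
        x = torch.cat([rgb, human_mask, masked_depth], dim=1)
        return self.net(x)

model = DepthNetSketch()
depth = model(torch.rand(1, 3, 128, 224),   # RGB frame
              torch.rand(1, 1, 128, 224),   # human mask
              torch.rand(1, 1, 128, 224))   # initial depth from parallax
print(depth.shape)  # torch.Size([1, 1, 128, 224])
```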
What’s the key achievement?
- Suggesting a model that is able to recreate depth maps of moving scenes with significantly greater accuracy for both humans and their surroundings compared to existing methods.
- Introducing the Mannequin Challenge Dataset, a set of 2,000 YouTube videos in which humans pose without moving while a camera circles around the scene.
What does the AI community think?
- The paper received Best Paper Award (Honorable Mention) at CVPR 2019, the leading conference on computer vision and pattern recognition.
What are future research areas?
- Expanding models to work for moving non-human objects such as cars and shadows.
- Incorporating more than two views at a time into the model to eliminate temporal inconsistencies.
What are possible business applications?
- Producing accurate 3D video effects, including synthetic depth-of-field, depth-aware inpainting, and inserting virtual objects into a 3D scene.
- Using multiple frames to expand the field of view while maintaining an accurate scene depth.
Where can you get implementation code?
- Implementation code and trained models are available on GitHub.
3. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation, by Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang
Original Abstract
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
Our Summary
Vision-language navigation entails a machine using verbal instructions and visual perception to navigate a real 3D environment. This is a challenging task for artificial intelligence because it requires matching verbal clues to a given physical environment as well as parsing semantic instructions with respect to that environment. In this paper, the researchers propose a new Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via Reinforcement Learning (RL). The suggested framework encourages the agent to focus on the right sub-instructions and follow trajectories that match instructions. In addition, the researchers introduce a Self-Supervised Imitation Learning (SIL) method for the exploration of previously unseen environments, where an agent learns to imitate its own good experiences. The RCM approach outperforms the previous state-of-the-art vision-language navigation method on the Room-to-Room (R2R) dataset, improving the SPL score from 28% to 35%.
What’s the core idea of this paper?
- Vision-language navigation requires a machine to parse verbal instructions, match those instructions to a visual environment, and then navigate that environment based on sub-phrases within the verbal instructions.
- To address this challenging task, the researchers introduce a novel Reinforced Cross-Modal Matching approach that utilizes both extrinsic and intrinsic rewards for reinforcement learning:
- It includes a reasoning navigator that learns from both the natural language instructions and the local visual scene to infer which phrases to focus on and where to look.
- The agent is equipped with a matching critic that evaluates an executed path based on the probability of reconstructing the original instruction from it.
- In addition, a fine-grained intrinsic reward signal encourages the agent to better understand textual input and penalizes it for choosing trajectories that do not match the instructions.
- The paper also introduces a Self-Supervised Imitation Learning (SIL) method for exploration of previously unseen environments:
- The navigator performs multiple roll-outs, and the good trajectories, as determined by the matching critic, are later used for the navigator to imitate.
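Conceptually, SIL boils down to scoring the agent's own roll-outs with the matching critic and imitating the best one. The sketch below shows only that control flow; the `rollout_fn`, `critic_score_fn`, and `imitate_fn` callables are hypothetical stand-ins for the navigator, the matching critic, and the behavior-cloning update.

```python
import random

def sil_step(rollout_fn, critic_score_fn, imitate_fn, instruction, num_rollouts=4):
    """One SIL iteration: sample several of the agent's own roll-outs,
    keep the one the matching critic scores highest (i.e., from which the
    instruction is easiest to reconstruct), and imitate it."""
    rollouts = [rollout_fn(instruction) for _ in range(num_rollouts)]
    scores = [critic_score_fn(instruction, traj) for traj in rollouts]
    best = max(zip(scores, rollouts), key=lambda pair: pair[0])[1]
    imitate_fn(instruction, best)   # behavior cloning on the best trajectory
    return best

# Toy usage with dummy callables, just to exercise the control flow.
best = sil_step(
    rollout_fn=lambda instr: [random.choice("NESW") for _ in range(5)],
    critic_score_fn=lambda instr, traj: random.random(),
    imitate_fn=lambda instr, traj: None,
    instruction="walk past the sofa and stop at the kitchen door",
)
print(best)
```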
What’s the key achievement?
- The RCM framework outperforms the previous state-of-the-art vision-language navigation methods on the R2R dataset by:
- improving the SPL score from 28% to 35%;
- increasing the success rate by 8.1%.
- Moreover, using SIL to imitate the RCM agent’s previous best experiences on the training set results in an average path length drop from 15.22m to 11.97m and an even better result on the SPL metric (38%).
What does the AI community think?
- The paper received three “Strong Accept” peer reviews and was accepted for oral presentation at CVPR 2019, the leading conference on computer vision and pattern recognition.
What are future research areas?
- Using the SIL approach to explore other unseen environments.
What are possible business applications?
- The introduced framework can be leveraged in many real-world applications, including:
- in-home robots moving around a home or office following instructions;
- personal assistants accepting verbal instructions and navigating a complex environment to perform certain tasks.
4. A Theory of Fermat Paths for Non-Line-of-Sight Shape Reconstruction, by Shumian Xin, Sotiris Nousias, Kiriakos N. Kutulakos, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan, Ioannis Gkioulekas
Original Abstract
We present a novel theory of Fermat paths of light between a known visible scene and an unknown object not in the line of sight of a transient camera. These light paths either obey specular reflection or are reflected by the object’s boundary, and hence encode the shape of the hidden object. We prove that Fermat paths correspond to discontinuities in the transient measurements. We then derive a novel constraint that relates the spatial derivatives of the path lengths at these discontinuities to the surface normal. Based on this theory, we present an algorithm, called Fermat Flow, to estimate the shape of the non-line-of-sight object. Our method allows, for the first time, accurate shape recovery of complex objects, ranging from diffuse to specular, that are hidden around the corner as well as hidden behind a diffuser. Finally, our approach is agnostic to the particular technology used for transient imaging. As such, we demonstrate mm-scale shape recovery from pico-second scale transients using a SPAD and ultrafast laser, as well as micron-scale reconstruction from femto-second scale transients using interferometry. We believe our work is a significant advance over the state-of-the-art in non-line-of-sight imaging.
Our Summary
In many security and safety applications, the scene hidden from the camera’s view is of great interest. Currently, it is possible to estimate the shape of hidden, non-line-of-sight (NLOS) objects by measuring the intensity of photons scattered from them. However, this method relies on single-photon avalanche photodetectors that are prone to misestimating photon intensities and requires an assumption that reflection from NLOS objects is Lambertian. The researchers propose a new theory of NLOS photons that follow specific geometric paths, called Fermat paths, between the LOS and NLOS scene. The resulting method can reconstruct the surface of hidden objects that are around a corner or behind a diffuser without depending on the reflectivity of the object.
Non-line-of-sight imaging
What’s the core idea of this paper?
- Existing methods for profiling hidden objects depend on measuring the intensities of reflected photons, which requires assuming Lambertian reflection and infallible photodetectors.
- The research team suggests reconstructing non-line-of-sight shapes by relying on geometric constraints imposed by Fermat’s principle:
- Fermat paths correspond to discontinuities in the transient measurements.
- Specifically, the lengths of the Fermat paths that contribute to the transient can be identified as discontinuities in the transient measurements.
- Given a collection of Fermat pathlengths, the procedure produces an oriented point cloud for the NLOS surface.
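The two geometric steps above can be caricatured in a few lines of NumPy: detect the discontinuities (sharp onsets) in a transient to read off Fermat path lengths, then differentiate those path lengths across scan points, since their spatial derivatives constrain the surface normal. This is a toy illustration of the idea, not the Fermat Flow algorithm itself.

```python
import numpy as np

def fermat_pathlengths(transient, bin_width):
    """Toy detector: Fermat path lengths show up as discontinuities in the
    transient, located here as large peaks in the finite-difference derivative."""
    d = np.abs(np.diff(transient))
    threshold = d.mean() + 3 * d.std()
    return np.where(d > threshold)[0] * bin_width   # same units as bin_width

def oriented_points(pathlength_grid, xs, ys):
    """Given Fermat path lengths tau(x, y) on a grid of scan points, the
    spatial gradient of tau constrains the surface normal; we return the
    numerical gradient alongside tau as a stand-in for the oriented point
    cloud that Fermat Flow produces."""
    dtau_dy, dtau_dx = np.gradient(pathlength_grid, ys, xs)
    return pathlength_grid, dtau_dx, dtau_dy

# Toy usage: a synthetic transient with an abrupt onset at bin 40.
transient = np.zeros(200)
transient[40:] = np.exp(-0.05 * np.arange(160))
print(fermat_pathlengths(transient, bin_width=0.003))  # detected onset, ~0.12 m
```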
What’s the key achievement?
- The Fermat Flow algorithm derived from the introduced theory can successfully reconstruct the surface of the hidden objects independent of the specific transient imaging technology used.
- The Fermat paths theory applies to the scenarios of:
- reflective NLOS (looking around a corner);
- transmissive NLOS (seeing through a diffuser).
What does the AI community think?
- The paper received the Best Paper Award at CVPR 2019, the leading conference on computer vision and pattern recognition.
What are future research areas?
- Exploring the links between the geometric approach described here and newly introduced backprojection approaches for profiling hidden objects.
- Combining geometric and backprojection approaches for other related applications, including acoustic and ultrasound imaging, lensless imaging, and seismic imaging.
What are possible business applications?
- Enhanced security from cameras or sensors that can “see” beyond their field of view.
- Potential use for autonomous vehicles to “see” around corners.
5. Reasoning-RCNN: Unifying Adaptive Global Reasoning into Large-scale Object Detection, by Hang Xu, Chenhan Jiang, Xiaodan Liang, Liang Lin, Zhenguo Li
Original Abstract
In this paper, we address the large-scale object detection problem with thousands of categories, which poses severe challenges due to long-tail data distributions, heavy occlusions, and class ambiguities. However, the dominant object detection paradigm is limited by treating each object region separately without considering crucial semantic dependencies among objects. In this work, we introduce a novel Reasoning-RCNN to endow any detection networks the capability of adaptive global reasoning over all object regions by exploiting diverse human commonsense knowledge. Instead of only propagating the visual features on the image directly, we evolve the high-level semantic representations of all categories globally to avoid distracted or poor visual features in the image. Specifically, built on feature representations of basic detection network, the proposed network first generates a global semantic pool by collecting the weights of previous classification layer for each category, and then adaptively enhances each object features via attending different semantic contexts in the global semantic pool. Rather than propagating information from all semantic information that may be noisy, our adaptive global reasoning automatically discovers most relative categories for feature evolving. Our Reasoning-RCNN is light-weight and flexible enough to enhance any detection backbone networks, and extensible for integrating any knowledge resources. Solid experiments on object detection benchmarks show the superiority of our Reasoning-RCNN, e.g. achieving around 16% improvement on VisualGenome, 37% on ADE in terms of mAP and 15% improvement on COCO.
Our Summary
Image detection algorithms struggle with large-scale detection across complex scenes because of the high number of object categories within an image, heavy occlusions, ambiguities between object classes, and small-scale objects within the image. To address this problem, the researchers introduce a simple global reasoning framework, Reasoning-RCNN, which explicitly incorporates multiple kinds of commonsense knowledge and also propagates visual information globally from all the categories. The experiments demonstrate that the proposed method significantly outperforms current state-of-the-art object detection methods on the VisualGenome, ADE, and COCO benchmarks.
An example of how the proposed adaptive global reasoning facilitates large-scale object detection
What’s the core idea of this paper?
- Large-scale object detection has a number of significant challenges including highly imbalanced object categories, heavy occlusions, class ambiguities, tiny-size objects, etc.
- To overcome these challenges, the researchers introduce a novel Reasoning-RCNN network that enables adaptive global reasoning over categories with certain relations or similar attributes:
- First, the model generates a global semantic pool over all categories in a large-scale image by collecting the weights of the prior classification layer.
- Second, a category-wise knowledge graph is designed to encode linguistic knowledge (e.g. attributes, co-occurrence, relationships).
- Third, the current image is encoded by an attention mechanism to automatically discover the most relevant categories for each object.
- Fourth, the enhanced categories are mapped back to the regions by a soft-mapping mechanism, enabling refinement of inaccurate classification results from the previous stage.
- Fifth, new enhanced features of each region are concatenated with the original features to enhance the performance of both classification and localization in an end-to-end manner.
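A schematic PyTorch sketch of this reasoning step: the classifier weights double as the global semantic pool, each region attends over it, and the attended context is concatenated back onto the region feature. Dimensions and layers are illustrative assumptions, and the knowledge-graph propagation step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalReasoningSketch(nn.Module):
    """Toy version of the adaptive global reasoning module."""
    def __init__(self, num_classes=1000, feat_dim=256):
        super().__init__()
        # Classification layer of the base detector; its weight matrix
        # (num_classes x feat_dim) doubles as the global semantic pool.
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.query = nn.Linear(feat_dim, feat_dim)

    def forward(self, region_feats):                                # (R, feat_dim)
        semantic_pool = self.classifier.weight                      # (C, feat_dim)
        # Attention: each region softly selects its most relevant categories.
        attn = F.softmax(self.query(region_feats) @ semantic_pool.t(), dim=-1)  # (R, C)
        context = attn @ semantic_pool                              # (R, feat_dim)
        # Enhanced feature = original region feature + attended semantic context.
        return torch.cat([region_feats, context], dim=-1)          # (R, 2*feat_dim)

module = GlobalReasoningSketch()
enhanced = module(torch.rand(8, 256))   # 8 region proposals
print(enhanced.shape)                   # torch.Size([8, 512])
```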
An overview of adaptive global reasoning module
What’s the key achievement?
- Reasoning-RCNN outperforms the current state-of-the-art object detection methods, including Faster R-CNN, RetinaNet, RelationNet, and DetNet.
- In particular, the model achieves the following improvements in terms of mean average precision (mAP):
- 15% on VisualGenome with 1000 categories;
- 16% on VisualGenome with 3000 categories;
- 37% on ADE;
- 15% on MS-COCO;
- 2% on Pascal VOC.
What does the AI community think?
- The paper was accepted for oral presentation at CVPR 2019, the key conference in computer vision.
What are future research areas?
- Embedding the reasoning framework used in Reasoning-RCNN into other tasks, including instance-level segmentation.
What are possible business applications?
- The proposed approach can significantly improve the performance of systems that rely on large-scale object detection (e.g., threat detection on city streets).
Where can you get implementation code?
- Implementation code for Reasoning-RCNN is available on GitHub.
6. Fixing the Train-Test Resolution Discrepancy, by Hugo Touvron, Andrea Vedaldi, Matthijs Douze, Hervé Jégou
Original Abstract
Data-augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the typical size of the objects seen by the classifier at train and test time. We experimentally validate that, for a target test resolution, using a lower train resolution offers better classification at test time.
We then propose a simple yet effective and efficient strategy to optimize the classifier performance when the train and test resolutions differ. It involves only a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet-50 trained on 128×128 images, and 79.8% with one trained on 224×224 images. In addition, if we use extra training data we get 82.5% with the ResNet-50 trained with 224×224 images.
Conversely, when training a ResNeXt-101 32×48d pre-trained in weakly-supervised fashion on 940 million public images at resolution 224×224 and further optimizing for test resolution 320×320, we obtain a test top-1 accuracy of 86.4% (top-5: 98.0%) (single-crop). To the best of our knowledge this is the highest ImageNet single-crop, top-1 and top-5 accuracy to date.
Our Summary
The Facebook AI research team draws our attention to the fact that even though the best possible performance of convolutional neural networks is achieved when the training and testing data distributions match, the data preprocessing procedures are typically different for training and testing. These differences result in a significant discrepancy between the size of objects at training and at test time. To address this problem and yet keep the benefits of existing preprocessing protocols, the researchers propose jointly optimizing the resolutions and scales of images at training and testing. For example, they demonstrate that using lower resolution crops at training than at test time improves the classifier performance and significantly decreases the processing time. The experiments demonstrate that the introduced approach sets a new state of the art in image classification on ImageNet.
What’s the core idea of this paper?
- The difference in image preprocessing procedures at training and at testing has a detrimental effect on the performance of the image classifier:
- To augment training data, the common practice is to extract a rectangle with random coordinates from the image (i.e., a Region of Classification or RoC).
- At test time, the RoC is extracted from a central part of the image.
- This results in a significant discrepancy between the objects’ size as seen by the classifier at train and test time.
- To address this problem, the researchers suggest joint optimization of resolutions and scales of images at training and at test time:
- The analysis shows that:
- increasing the size of image crops at test time compensates for the random selection of RoC at training time;
- using lower resolution crops at training than at test time improves the performance of the model.
- Thus, the Facebook AI team suggests keeping the same RoC sampling and only fine-tuning two layers of the network to compensate for the changes in the crop size.
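A minimal PyTorch sketch of this recipe: keep the network trained at the lower resolution, then unfreeze only the final batch-norm layer and the classifier and fine-tune them on crops at the test resolution. Which layers exactly to unfreeze, and all hyperparameters below, are our assumptions for illustration, not the authors' released code.

```python
import torch
import torchvision

# Step 1: assume a ResNet-50 already trained with low-resolution crops (e.g., 128x128).
model = torchvision.models.resnet50(num_classes=1000)

# Step 2: fine-tune at the test resolution, updating only the last batch-norm
# layer and the classifier to adapt to the new apparent object sizes.
for p in model.parameters():
    p.requires_grad = False
for p in model.layer4[-1].bn3.parameters():   # final batch normalization
    p.requires_grad = True
for p in model.fc.parameters():               # classifier
    p.requires_grad = True

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)

# Dummy fine-tuning step at test resolution (replace with real 224x224 center crops).
images = torch.rand(4, 3, 224, 224)
labels = torch.randint(0, 1000, (4,))
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```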
What’s the key achievement?
- Improving the performance of the ResNet-50 model in image classification on ImageNet by obtaining:
- top-1 accuracy of 77.1% when trained on 128×128 images;
- top-1 accuracy of 79.8% when trained on 224×224 images;
- top-1 accuracy of 82.5% when trained on 224×224 images with extra training data.
- Enabling a ResNeXt-101 32×48d, pre-trained in a weakly-supervised fashion on 940 million public images at a resolution of 224×224 and fine-tuned for a test resolution of 320×320, to set a new state of the art in image classification on ImageNet:
- top-1 accuracy of 86.4%;
- top-5 accuracy of 98.0%.
What are possible business applications?
- The suggested approach can boost the performance of AI systems for automated image organization in large databases, image classification on stock websites, visual product search, and more.
Where can you get implementation code?
- The authors provide the official PyTorch implementation of the introduced method for fixing the train-test resolution discrepancy.
7. SinGAN: Learning a Generative Model from a Single Natural Image, by Tamar Rott Shaham, Tali Dekel, Tomer Michaeli
Original Abstract
We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused to be real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.
Our Summary
The researchers from Technion and Google Research introduce SinGAN, a new model for the unconditional generation of high-quality images given a single natural image. Their approach is based on the notion that the internal statistics of patches within a single image are usually sufficient for learning a powerful generative model. Thus, SinGAN contains a pyramid of fully convolutional lightweight GANs, where each GAN is responsible for learning the patch distribution at a different scale. The images generated by the introduced model semantically resemble the training image but include new object configurations and structures.
Image generation learned from a single training image
What’s the core idea of this paper?
- To learn an unconditional generative model from a single image, the researchers suggest using patches of a single image as training samples instead of whole-image samples as in the conventional GAN setting.
- The SinGAN generative framework:
- consists of a hierarchy of patch-GANs, each responsible for capturing the distribution of patches at a different scale (e.g., some GANs learn global properties and shapes of large objects like “sky at the top” and “ground at the bottom”, and other GANs learn fine details and texture information);
- goes beyond texture generation and can deal with general natural images;
- allows images of arbitrary size and aspect ratio to be generated;
- enables control over the variability of generated samples via selection of the scale from which to start the generation at test time.
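The coarse-to-fine sampling loop can be sketched as follows: generation starts from pure noise at the coarsest scale, and each finer-scale generator refines an upsampled version of the previous output plus fresh noise. The `generators` list, shapes, and residual formulation below are illustrative placeholders; passing a downsampled real image with a later `start_scale` is what controls the variability of the output.

```python
import torch
import torch.nn.functional as F

def singan_sample(generators, scale_shapes, start_scale=0, injected=None):
    """Coarse-to-fine sampling sketch. generators[i] is the patch-GAN generator
    for scale i (coarsest first), scale_shapes[i] its (H, W). If `injected` is
    given, generation starts from it at `start_scale`, which controls how much
    the result can deviate from the training image."""
    sample = injected
    for i in range(start_scale, len(generators)):
        h, w = scale_shapes[i]
        noise = torch.randn(1, 3, h, w)
        if sample is None:
            prev = torch.zeros(1, 3, h, w)             # coarsest scale: noise only
        else:
            prev = F.interpolate(sample, size=(h, w), mode='bilinear',
                                 align_corners=False)  # upsample previous output
        # Each generator adds a refinement on top of the upsampled image.
        sample = prev + generators[i](noise + prev)
    return sample

# Toy usage with identity "generators", just to exercise the loop.
gens = [torch.nn.Identity() for _ in range(4)]
shapes = [(25, 34), (33, 44), (42, 56), (55, 73)]
out = singan_sample(gens, shapes)
print(out.shape)  # torch.Size([1, 3, 55, 73])
```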
What’s the key achievement?
- The experiments demonstrate that SinGAN:
- can generate images that depict new realistic structures and object configurations, while preserving the content of the training image;
- successfully preserves global image properties and fine details;
- can realistically synthesize reflections and shadows;
- generates samples that are hard to distinguish from the real ones.
What does the AI community think?
- The paper received the Best Paper Award at ICCV 2019, one of the leading conferences in computer vision.
What are possible business applications?
- The SinGAN model can assist with a number of image manipulation tasks, including image editing, superresolution, harmonization, generating images from paintings, and creating animations from a single image.
Where can you get implementation code?
- The official PyTorch implementation of SinGAN is available on GitHub.
8. Local Aggregation for Unsupervised Learning of Visual Embeddings, by Chengxu Zhuang, Alex Lin Zhai, Daniel Yamins
Original Abstract
Unsupervised approaches to learning in neural networks are of substantial interest for furthering artificial intelligence, both because they would enable the training of networks without the need for large numbers of expensive annotations, and because they would be better models of the kind of general-purpose learning deployed by humans. However, unsupervised networks have long lagged behind the performance of their supervised counterparts, especially in the domain of large-scale visual recognition. Recent developments in training deep convolutional embeddings to maximize non-parametric instance separation and clustering objectives have shown promise in closing this gap. Here, we describe a method that trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space, while allowing dissimilar instances to separate. This aggregation metric is dynamic, allowing soft clusters of different scales to emerge. We evaluate our procedure on several large-scale visual recognition datasets, achieving state-of-the-art unsupervised transfer learning performance on object recognition in ImageNet, scene recognition in Places 205, and object detection in PASCAL VOC.
Our Summary
The research team from Stanford University addresses the problem of object detection and recognition with unsupervised learning. To tackle this problem, they introduce the Local Aggregation (LA) procedure, which causes dissimilar inputs to move apart in the embedding space while allowing similar inputs to converge into clusters. Specifically, the researchers suggest starting with the non-linear embedding of inputs in a lower-dimensional space, and then iteratively identifying close neighbors in the embedding space. The experiments demonstrate the robustness of the presented approach for downstream tasks, including object recognition, scene recognition, and object detection.
What’s the core idea of this paper?
- The paper introduces a novel unsupervised learning algorithm that enables local non-parametric aggregation of similar images in a latent feature space.
- The overall goal of the presented Local Aggregation (LA) procedure is to learn an embedding function that maps images to features in a representation space where similar images group together and different images are separated:
- For each input image, a deep neural network is used to embed the image into a lower-dimensional space.
- Then, the model identifies close neighbors, whose embeddings are similar, and background neighbors, which are used to set the distance scale for judging closeness.
- Through optimization, the current embedding vector is pushed closer to its close neighbors and further from its background neighbors.
- The representation resulting from the introduced procedure supports downstream computer vision tasks.
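A simplified version of this push-pull objective is sketched below: similarities to a memory bank of stored embeddings are turned into probabilities, and the loss rewards mass on the close neighbors relative to the background neighbors. The temperature and the hard-coded neighbor indices are placeholders; in the actual procedure the neighbor sets are identified dynamically.

```python
import torch
import torch.nn.functional as F

def local_aggregation_loss(v, memory_bank, close_idx, background_idx, tau=0.07):
    """Simplified LA objective for one embedding v: maximize the probability of
    v's close neighbors C relative to its background neighbors B, i.e. minimize
    -log( P(C ∩ B | v) / P(B | v) ) under a non-parametric softmax."""
    v = F.normalize(v, dim=0)
    bank = F.normalize(memory_bank, dim=1)
    sims = torch.exp(bank @ v / tau)                 # similarity to every stored embedding
    close_and_bg = torch.tensor(sorted(set(close_idx) & set(background_idx)))
    p_close = sims[close_and_bg].sum()
    p_background = sims[torch.tensor(background_idx)].sum()
    return -torch.log(p_close / p_background)

# Toy usage: 100 stored embeddings of dimension 128.
bank = torch.randn(100, 128)
v = torch.randn(128)
loss = local_aggregation_loss(v, bank, close_idx=[1, 2, 3],
                              background_idx=list(range(50)))
print(loss.item())
```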
What’s the key achievement?
- Local aggregation significantly outperforms other architectures in:
- object recognition, with LA-trained ResNet-50 achieving 60.2% top-1 accuracy on ImageNet – higher than AlexNet trained directly on the supervised task;
- scene categorization, by demonstrating a strong transfer learning performance on the Places dataset with 50.1% accuracy for LA-trained ResNet-50;
- object detection, by achieving state-of-the-art results in unsupervised transfer learning for the PASCAL detection task (i.e., mean Average Precision of 69.1% with ResNet-50).
What does the AI community think?
- The paper was nominated for the Best Paper Award at ICCV 2019, one of the leading conferences in computer vision.
What are future research areas?
- Exploring the possibility of detecting similarities with non-local manifold learning-based priors.
- Improving dissimilarity detection by analyzing representational change over multiple steps of learning.
- Applying the LA objective to other domains, including video and audio.
- Comparing the LA procedure with biological vision systems.
What are possible business applications?
- This research is an important step towards making unsupervised learning applicable to real-world computer vision tasks and enabling object detection and object recognition systems to perform well without the costly collection of annotations.
Where can you get implementation code?
- The TensorFlow implementation of the Local Aggregation algorithm is available on GitHub.
9. Robust Change Captioning, by Dong Huk Park, Trevor Darrell, Anna Rohrbach
Original Abstract
Describing what has changed in a scene can be useful to a user, but only if generated text focuses on what is semantically relevant. It is thus important to distinguish distractors (e.g. a viewpoint change) from relevant changes (e.g. an object has moved). We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning. Our model learns to distinguish distractors from semantic changes, localize the changes via Dual Attention over “before” and “after” images, and accurately describe them in natural language via Dynamic Speaker, by adaptively focusing on the necessary visual inputs (e.g. “before” or “after” image). To study the problem in depth, we collect a CLEVR-Change dataset, built off the CLEVR engine, with 5 types of scene changes. We benchmark a number of baselines on our dataset, and systematically study different change types and robustness to distractors. We show the superiority of our DUDA model in terms of both change captioning and localization. We also show that our approach is general, obtaining state-of-the-art results on the recent realistic Spot-the-Diff dataset which has no distractors.
Our Summary
The UC Berkeley research team introduces a novel Dual Dynamic Attention (DUDA) model for tracking semantically relevant changes between two images and accurately describing these changes in natural language. The Dual Attention component of the model predicts separate spatial attention for both the “before” and “after” images, while the Dynamic Speaker component generates a change description by adaptively focusing on the necessary visual inputs from the Dual Attention network. To address change captioning in the presence of distractors, the researchers also present a new CLEVR-Change dataset with 80K image pairs covering 5 scene change types and containing distractors. The experiments demonstrate that the DUDA model outperforms the baselines on the CLEVR-Change dataset in terms of change captioning and localization.
What’s the core idea of this paper?
- The research team proposes a Dual Dynamic Attention Model (DUDA) for change detection and captioning:
- The model includes the Dual Attention component for change localization and the Dynamic Speaker component for generating change descriptions.
- Both neural networks are trained jointly using caption-level supervision, and without information about the change location.
- Given “before” and “after” images, the model detects whether the scene has changed; if so, it localizes the changes in both images and then generates a sentence that describes the change and is spatially and temporally grounded in the image pair (see the schematic sketch after this list).
- The paper also introduces a new CLEVR-Change dataset that:
- contains 80K “before”/“after” image pairs;
- considers 5 scene change types, such as color or material change, adding, dropping, or moving an object;
- includes image pairs with only distractors (i.e., illumination/viewpoint change) and images with both distractors and a semantically relevant scene change.
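Here is a schematic PyTorch sketch of the two model components: Dual Attention pools separate spatial attention maps over the “before” and “after” feature maps, yielding “before”, “after”, and difference vectors that a dynamic speaker would then re-weight at every word-generation step. The layer shapes and attention form are our simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualAttentionSketch(nn.Module):
    """Toy Dual Attention: separate spatial attention over the 'before' and
    'after' convolutional feature maps, producing one pooled vector per image
    plus their difference."""
    def __init__(self, channels=256):
        super().__init__()
        self.attn_before = nn.Conv2d(2 * channels, 1, kernel_size=1)
        self.attn_after = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def _pool(self, feat, attn_layer, pair):
        weights = torch.sigmoid(attn_layer(pair))           # (B, 1, H, W)
        return (feat * weights).flatten(2).sum(-1)          # (B, C)

    def forward(self, feat_before, feat_after):
        pair = torch.cat([feat_before, feat_after], dim=1)  # both views inform attention
        l_before = self._pool(feat_before, self.attn_before, pair)
        l_after = self._pool(feat_after, self.attn_after, pair)
        l_diff = l_after - l_before                          # localized change signal
        return l_before, l_after, l_diff

dual = DualAttentionSketch()
l_b, l_a, l_d = dual(torch.rand(1, 256, 14, 14), torch.rand(1, 256, 14, 14))
# A dynamic speaker would re-weight (l_b, l_a, l_d) with softmax attention at
# every word-generation step before feeding them to the caption decoder.
print(l_b.shape, l_d.shape)  # torch.Size([1, 256]) torch.Size([1, 256])
```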
What’s the key achievement?
- Introducing a new CLEVR-Change benchmark that can assist the research community in training new models for:
- localizing scene changes when the viewpoint shifts;
- correctly referring to objects in complex scenes;
- defining the correspondence between objects when the viewpoint shifts.
- Proposing a change-captioning DUDA model that, when evaluated on the CLEVR-Change dataset, outperforms the baselines across all scene change types in terms of:
- overall sentence fluency and similarity to ground-truth (BLEU-4, METEOR, CIDEr, and SPICE metrics);
- change localization (Pointing Game evaluation).
What does the AI community think?
- The paper was nominated for the Best Paper Award at ICCV 2019, one of the leading conferences in computer vision.
What are future research areas?
- Collecting real-image datasets with “before”/“after” image pairs containing both semantically significant and distractor changes.
What are possible business applications?
- The DUDA model can assist with a variety of realistic applications, including:
- change tracking in medical images;
- surveillance of facilities;
- aerial photography.
10. HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models, by Sharon Zhou, Mitchell L. Gordon, Ranjay Krishna, Austin Narcomey, Li Fei-Fei, Michael S. Bernstein
Original Abstract
Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE) a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model’s outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.
Our Summary
In this paper, the Stanford University research team addresses the evaluation of image generative models. They introduce a gold standard human benchmark, Human eYe Perceptual Evaluation (HYPE), to evaluate the realism of machine-generated images. The first evaluation method, HYPE-time, evaluates the realism of images by measuring the minimum time, in milliseconds, a person needs to distinguish a real image from a fake one. The second method, HYPE∞, measures the rate at which humans confuse fake images with real images, given unlimited time. The experiments with six state-of-the-art GAN architectures and four different datasets demonstrate that HYPE provides reliable scores that can be easily and cheaply reproduced.
What’s the core idea of this paper?
- With automatic metrics being inaccurate on high dimensional problems and human evaluations being unreliable and over-dependent on the task design, a systematic gold standard benchmark for evaluation of generative models is needed.
- To address this problem, the researchers introduce the Human eYe Perceptual Evaluation (HYPE), with two methods of evaluation:
- HYPE-time scores how much time a person needs to distinguish real images from fake images generated by a specific model: the longer it takes, the better the model.
- HYPE∞ measures the human error rate without time constraints: a score above 50% indicates that generated images look even more real than real images.
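To make the metric concrete, here is a minimal sketch of how a HYPE∞-style score could be computed from collected human judgments: the score is the percentage of images, real and generated, that evaluators label incorrectly, with a bootstrap confidence interval for reliability. The data format is hypothetical; the authors' deployed system handles the crowdsourcing pipeline end to end.

```python
import random

def hype_infinity(judgments, n_bootstrap=1000, seed=0):
    """`judgments` is a list of (is_fake, judged_fake) booleans, one per image.
    Returns the error rate (in %) and a 95% bootstrap confidence interval.
    A score above 50% suggests generated images look even more real than
    real images to the evaluators."""
    errors = [is_fake != judged_fake for is_fake, judged_fake in judgments]
    score = 100 * sum(errors) / len(errors)
    rng = random.Random(seed)
    resampled = sorted(
        100 * sum(rng.choices(errors, k=len(errors))) / len(errors)
        for _ in range(n_bootstrap))
    return score, (resampled[int(0.025 * n_bootstrap)],
                   resampled[int(0.975 * n_bootstrap)])

# Toy usage: 200 synthetic judgments where roughly 30% of the labels are wrong.
data = []
for _ in range(200):
    is_fake = random.random() < 0.5
    judged_fake = is_fake if random.random() > 0.3 else not is_fake
    data.append((is_fake, judged_fake))
print(hype_infinity(data))
```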
What’s the key achievement?
- Introducing a gold standard human benchmark for evaluation of generative models that is:
- grounded in psychophysics research;
- reliable and consistent;
- able to produce statistically separable results for different models;
- cost and time efficient.
What does the AI community think?
- The paper was selected for oral presentation at NeurIPS 2019, the leading conference in artificial intelligence.
What are future research areas?
- Extending HYPE to other generative tasks, including text, music, and video generation.
Where can you get implementation code?
- The authors have deployed HYPE online so that any researcher can upload a model and retrieve a HYPE score using Mechanical Turk workers.
If you like these research summaries, you might also be interested in the following articles:
- Top AI & Machine Learning Research Papers From 2019
- What Are Major NLP Achievements & Papers From 2019?
- 10 Important Research Papers In Conversational AI From 2019
- Top 12 AI Ethics Research Papers Introduced In 2019
- Breakthrough Research In Reinforcement Learning From 2019