This year, the International Conference on Learning Representations (ICLR) takes place virtually from May 3rd through May 7th. As usual, it is a premier gathering of professionals researching various deep learning topics with applications in computer vision, natural language processing, speech recognition, robotics, and other fields.
To help you stay aware of the latest AI research breakthroughs, we’ve summarized some of the ICLR 2021 research papers that received the most attention from the AI research community.
If you’d like to skip around, here are the papers we featured:
- An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
- Deformable DETR: Deformable Transformers for End-to-End Object Detection
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- Rethinking Attention with Performers
- Complex Query Answering with Neural Link Predictors
- Hopfield Networks is All You Need
If this in-depth educational content is useful for you, subscribe to our AI research mailing list to be alerted when we release new material.
Top ICLR 2021 Research Papers
1. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Original Abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Our Summary
The authors show that a pure Transformer can perform very well on image classification tasks. They introduce Vision Transformer (ViT), which is applied directly to sequences of image patches, treated by analogy with tokens (words) in NLP. When pre-trained on large datasets of 14M–300M images, Vision Transformer approaches or beats state-of-the-art CNN-based models on image recognition tasks. In particular, it achieves an accuracy of 88.36% on ImageNet, 90.77% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.16% on the VTAB suite of 19 tasks.
What’s the core idea of this paper?
- When applying the Transformer architecture to images, the authors follow the design of the original NLP Transformer as closely as possible.
- The introduced Transformer-based approach to image classification includes the following steps (see the sketch after this list):
- splitting images into fixed-size patches;
- linearly embedding each of them;
- prepending an extra learnable ‘classification token’ to the sequence;
- adding position embeddings to the resulting sequence of vectors;
- feeding the sequence to a standard Transformer encoder.
- Similarly to Transformers in NLP, Vision Transformer is typically pre-trained on large datasets and fine-tuned to downstream tasks.
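To make these steps concrete, here is a minimal sketch of the ViT input pipeline in PyTorch. It is illustrative rather than the authors’ implementation; patch extraction is expressed as a strided convolution, which is equivalent to slicing patches and applying a shared linear layer.

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens for a Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # Strided conv == split into 16x16 patches + linear embedding of each
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)        # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)     # prepend classification token
        x = torch.cat([cls, x], dim=1)                     # (B, 197, 768)
        return x + self.pos_embed                          # add position embeddings

tokens = ViTEmbedding()(torch.randn(2, 3, 224, 224))       # (2, 197, 768)
# `tokens` can now be fed to a standard Transformer encoder (e.g., nn.TransformerEncoder);
# the classification head reads the encoder output at the [class] token position.
```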
What’s the key achievement?
- Vision Transformer pre-trained on the JFT-300M dataset matches or outperforms ResNet-based baselines while requiring substantially fewer computational resources to pre-train. It achieves an accuracy of:
- 88.36% on ImageNet;
- 90.77% on ImageNet-ReaL;
- 94.55% on CIFAR-100;
- 97.56% on Oxford-IIIT Pets;
- 99.74% on Oxford Flowers-102;
- 77.16% on the VTAB suite of 19 tasks.
What does the AI community think?
- The paper is trending in the AI research community, as evident from the repository stats on GitHub.
- It has also been accepted for oral presentation at ICLR 2021, one of the key conferences in deep learning.
What are future research areas?
- Applying Vision Transformer to other computer vision tasks, such as detection and segmentation.
- Exploring self-supervised pre-training methods.
- Analyzing the few-shot properties of Vision Transformer.
- Exploring contrastive pre-training.
- Further scaling ViT.
What are possible business applications?
- Thanks to their efficient pre-training and high performance, Transformers may replace convolutional networks in many computer vision applications, including navigation, automatic inspection, and visual surveillance.
Where can you get implementation code?
- The PyTorch implementation of Vision Transformer is available on GitHub.
2. Deformable DETR: Deformable Transformers for End-to-End Object Detection, by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai
Original Abstract
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
Our Summary
The authors propose Deformable DETR, an improvement on DETR (DEtection TRansformer), which was published in 2020. A key feature of DETR is that it eliminates hand-crafted components such as anchor generation and non-maximum suppression (NMS), thus reducing the number of hyperparameters and the amount of computation. However, DETR suffers from slow convergence and poor performance on small objects. Deformable DETR mitigates both problems with a Deformable Attention Module, which attends only to a small set of sampling points and thereby reduces the quadratic complexity of the standard attention module to linear complexity. This efficiency makes it feasible to process multi-scale feature maps, which in turn improves performance on small objects, and it also speeds up convergence. Experiments on the COCO benchmark show that Deformable DETR achieves better performance (especially on small objects) with 10× fewer training epochs than DETR.
What’s the core idea of this paper?
- Deformable Attention Module: The standard attention module computes attention between features at all pairs of spatial locations. Instead, the authors propose a Deformable Attention Module, inspired by deformable convolution, that attends only to a small set of sampling points around a reference point (see the sketch after this list). The number of sampled points K is a hyperparameter and is much smaller than the number of possible spatial locations, which makes the module efficient enough to process multi-scale feature maps.
- Iterative Bounding Box Refinement: In the original DETR, only the last decoder layer outputs bounding boxes. In Deformable DETR, each decoder layer outputs bounding box estimates and refines the predictions of the previous layer, a mechanism inspired by iterative refinement in optical flow estimation.
- Two-stage Deformable DETR: In the original DETR, object queries fed to the decoder are randomly initialized and carry no information about the current image. Inspired by two-stage object detectors like Faster R-CNN, the authors present a variant of Deformable DETR whose first stage generates region proposals; in the second stage, the top-scoring proposals are fed as object queries to the decoder.
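The sketch below illustrates the deformable attention idea in a single-scale, single-head form. It is a simplification for illustration only: the paper’s module is multi-head and multi-scale, and parametrizes offsets differently; the class name and the tanh-bounded offsets are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Each query attends to K sampled points near its reference point,
    instead of to all H*W locations, giving linear (not quadratic) cost."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.offset_proj = nn.Linear(dim, n_points * 2)   # (dx, dy) for each sampling point
        self.weight_proj = nn.Linear(dim, n_points)       # attention weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.n_points = n_points

    def forward(self, queries, ref_points, feat_map):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2), normalized to [-1, 1]
        # feat_map: (B, C, H, W)
        B, Nq, C = queries.shape
        v = self.value_proj(feat_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # (B, C, H, W)
        # Predict small, bounded offsets around each reference point (a simplifying choice)
        offsets = self.offset_proj(queries).view(B, Nq, self.n_points, 2).tanh() * 0.1
        weights = self.weight_proj(queries).softmax(dim=-1)                     # (B, Nq, K)
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)                 # (B, Nq, K, 2)
        sampled = F.grid_sample(v, locs, align_corners=False)                   # (B, C, Nq, K)
        return (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)         # (B, Nq, C)

# out = DeformableAttentionSketch(256)(torch.randn(2, 100, 256),
#                                      torch.rand(2, 100, 2) * 2 - 1,
#                                      torch.randn(2, 256, 32, 32))
```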
What’s the key achievement?
- Deformable DETR outperforms both the original DETR and Faster R-CNN with FPN, while requiring 10× and 2× fewer training epochs, respectively.
- Deformable DETR performs competitively with state-of-the-art methods on the COCO 2017 test-dev set.
What does the AI community think?
- The paper has been accepted for oral presentation at ICLR 2021, one of the key conferences in deep learning.
What are future research areas?
- Designing more efficient end-to-end object detection models: the inference speed of the proposed method is 19 frames per second (FPS), whereas that of Faster R-CNN + FPN is 26 (higher is better).
What are possible business applications?
- Object detection models like Deformable DETR are used for people detection, defect detection in the manufacturing domain, and for the perception module of self-driving cars.
Where can you get implementation code?
- The implementation of Deformable DETR is available on GitHub.
3. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
Original Abstract
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
Our Summary
The authors from Microsoft Research propose DeBERTa, which introduces two main improvements over BERT: disentangled attention and an enhanced mask decoder. DeBERTa represents each token/word with two vectors that encode its content and its relative position, respectively. The self-attention mechanism in DeBERTa computes content-to-content, content-to-position, and position-to-content attention, whereas self-attention in BERT is equivalent to having only the first two components. The authors hypothesize that position-to-content self-attention is also needed to comprehensively model relative positions in a sequence of tokens. Furthermore, DeBERTa is equipped with an enhanced mask decoder, which supplies the absolute positions of tokens/words to the decoder along with the relative position information. A single scaled-up variant of DeBERTa surpasses the human baseline on the SuperGLUE benchmark for the first time. The ensemble DeBERTa was the top-performing method on SuperGLUE at the time of publication.
What’s the core idea of this paper?
- Disentangled attention: In the original BERT, the content embedding and position embedding are added together before self-attention, so self-attention is applied only to their sum. The authors hypothesize that this accounts only for content-to-content and content-to-position self-attention and that position-to-content self-attention is also needed to model position information completely. DeBERTa keeps two separate vectors representing content and position, and self-attention is computed between the relevant pairs, i.e., content-to-content, content-to-position, and position-to-content (see the sketch after this list). Position-to-position self-attention carries little additional information when relative position embeddings are used, so it is not computed.
- Enhanced mask decoder: The authors hypothesize that the model needs absolute position information to capture syntactic nuances such as distinguishing a subject from an object. Therefore, DeBERTa is provided with absolute position information in addition to relative position information: the absolute position embeddings are fed to the last decoder layer, just before the softmax layer that predicts the masked tokens.
- Scale-invariant fine-tuning: A virtual adversarial training algorithm called scale-invariant fine-tuning is used as a regularization method to improve generalization. The word embeddings are slightly perturbed, and the model is trained to produce the same output as it would on the unperturbed embeddings. The word embedding vectors are first normalized to stochastic vectors (where the elements of each vector sum to 1), which makes the perturbation invariant to the scale of the embeddings and thus to the size of the model.
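The sketch below illustrates the disentangled attention decomposition for a single head. It is a simplification: we assume the relative-position projections have already been gathered into per-token matrices, whereas the paper indexes a shared table of relative-position embeddings by the distance δ(i, j); the 1/√(3d) scaling follows the paper.

```python
import torch

def disentangled_scores(Hq, Hk, Pq, Pk):
    """Hq, Hk: content projections of queries/keys, shape (B, N, d).
    Pq, Pk: relative-position projections, here simplified to (B, N, d)."""
    c2c = Hq @ Hk.transpose(-1, -2)   # content-to-content
    c2p = Hq @ Pk.transpose(-1, -2)   # content-to-position
    p2c = Pq @ Hk.transpose(-1, -2)   # position-to-content (the extra term DeBERTa adds)
    d = Hq.size(-1)
    # position-to-position is omitted: with relative encodings it adds little signal
    return (c2c + c2p + p2c) / (3 * d) ** 0.5   # scaled sum, ready for softmax

attn = torch.softmax(disentangled_scores(*(torch.randn(1, 8, 64) for _ in range(4))), dim=-1)
```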
What’s the key achievement?
- Compared to the current state-of-the-art method RoBERTa-Large, a DeBERTa model trained on half the training data achieves:
- an improvement of +0.9% in accuracy on MNLI (91.1% vs. 90.2%);
- an improvement of +2.3% in accuracy on SQuAD v2.0 (90.7% vs. 88.4%);
- an improvement of +3.6% in accuracy on RACE (86.8% vs. 83.2%).
- A single scaled-up variant of DeBERTa surpasses the human baseline on the SuperGLUE benchmark for the first time (89.9 vs. 89.8). The ensemble DeBERTa is the top-performing method on SuperGLUE at the time of this publication, outperforming the human baseline by a decent margin (90.3 versus 89.8).
What are future research areas?
- Improving pretraining by introducing other useful information, in addition to positions, with the Enhanced Mask Decoder (EMD) framework.
- A more comprehensive study of scale-invariant fine-tuning (SiFT).
What are possible business applications?
- The contextual representations of pretrained language modeling could be used in search, question answering, summarization, virtual assistants, and chatbots, among other tasks.
Where can you get implementation code?
- The implementation of DeBERTa is available on GitHub.
4. Rethinking Attention with Performers, by Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
Original Abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
Our Summary
The authors from Google and DeepMind propose an efficient Transformer architecture called Performer. The attention module in the standard Transformer architecture has quadratic space and time complexity, which makes it inefficient to scale to long-sequence inputs. Most existing techniques for efficient attention rely on a sparsity assumption that has to be verified empirically by trial and error, while other approximations degrade noticeably on long sequences. Performer, in contrast, does not rely on assumptions such as sparsity or low-rankness and is provably accurate in approximating softmax attention. It uses a scalable kernel method termed Fast Attention Via positive Orthogonal Random features (FAVOR+). This method can also be applied to efficiently model kernelizable attention mechanisms beyond softmax, and it provides a framework for comparing softmax with its alternatives. Performer shows competitive results compared to other efficient sparse and dense attention methods on a rich set of tasks ranging from pixel prediction to text models to protein sequence modeling.
What’s the core idea of this paper?
- The authors propose a scalable kernel method, FAVOR+ (see the sketch after this list), that:
- approximates the standard attention weights without any assumptions about sparsity and low-rankness;
- provides strong theoretical guarantees such as unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and lower variance of the approximation;
- enables softmax to be accurately compared to other kernelizable attention mechanisms that are beyond the standard transformer architecture;
- can be combined with ideas for efficient transformers like reversible layers or cluster-based attention.
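Below is a simplified sketch of the FAVOR+ idea: replace exp(q·k/√d) with an inner product of positive random features so attention can be computed in linear time. For brevity it uses i.i.d. Gaussian features rather than the paper’s orthogonal ones and omits feature redrawing, so treat it as an illustration of the mechanism, not the authors’ implementation.

```python
import torch

def favor_attention(q, k, v, n_features=256):
    """q, k: (B, N, d); v: (B, N, dv). Approximates softmax attention in O(N)."""
    B, N, d = q.shape
    w = torch.randn(d, n_features, device=q.device)           # random projection matrix

    def phi(x):
        x = x / d ** 0.25                                     # so q'.k' = q.k / sqrt(d)
        # positive feature map: exp(w^T x - |x|^2 / 2), averaged over m features
        return torch.exp(x @ w - 0.5 * x.pow(2).sum(-1, keepdim=True)) / n_features ** 0.5

    q_p, k_p = phi(q), phi(k)                                 # (B, N, m), strictly positive
    kv = torch.einsum('bnm,bne->bme', k_p, v)                 # (B, m, dv): one pass over keys
    normalizer = q_p @ k_p.sum(dim=1).unsqueeze(-1)           # (B, N, 1)
    return torch.einsum('bnm,bme->bne', q_p, kv) / normalizer.clamp_min(1e-6)

# out = favor_attention(torch.randn(2, 1024, 64), torch.randn(2, 1024, 64),
#                       torch.randn(2, 1024, 64))             # never forms a 1024x1024 matrix
```

Because the keys and values are summarized once into the small `kv` matrix, memory and time grow linearly with sequence length N instead of quadratically.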
What’s the key achievement?
- It is shown empirically that Performer can be 2× faster than Reformer, one of the leading efficient transformer architectures.
What does the AI community think?
- The paper has been accepted for oral presentation at ICLR 2021, one of the key conferences in deep learning.
What are future research areas?
- Exploring more optimal attention mechanisms with the help of the proposed FAVOR+ framework.
What are possible business applications?
- The proposed transformer architecture can be used in machine translation, semantic parsing, protein sequence modeling, and image completion, among others.
Where can you get implementation code?
- The implementation of Performer is available on GitHub.
5. Complex Query Answering with Neural Link Predictors, by Erik Arakelyan, Daniel Daza, Pasquale Minervini, Michael Cochez
Original Abstract
Neural link predictors are immensely useful for identifying missing edges in large scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries that arise in a number of domains, such as queries using logical conjunctions (∧), disjunctions (∨) and existential quantifiers (∃), while accounting for missing edges. In this work, we propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end differentiable objective, where the truth value of each atom is computed by a pre-trained neural link predictor. We then analyse two solutions to the optimisation problem, including gradient-based and combinatorial search. In our experiments, the proposed approach produces more accurate results than state-of-the-art methods – black-box neural models trained on millions of generated queries – without the need of training on a large and diverse set of complex queries. Using orders of magnitude less training data, we obtain relative improvements ranging from 8% up to 40% in Hits@3 across different knowledge graphs containing factual information. Finally, we demonstrate that it is possible to explain the outcome of our model in terms of the intermediate solutions identified for each of the complex query atoms. All our source code and datasets are available online.
Our Summary
The authors propose an approach to answering complex queries over the knowledge stored in a knowledge graph (KG). KGs are graph-structured knowledge bases in which knowledge about the world is stored as relationships between entities. Although KGs are a versatile representation used for many downstream tasks, most real-world KGs are incomplete: either the links/edges between entities or the entities themselves may be missing. Answering complex queries over such incomplete KGs has been a challenge. To address this problem, the authors propose to break a complex query into a sequence of simple queries that can be solved by a pretrained neural link predictor. The overall problem of complex query answering is then posed as continuous optimization (for example, using Adam) or combinatorial optimization to find the variable-to-entity assignments that best satisfy the query. The proposed method performs better than black-box neural models trained on a very large and diverse set of generated queries.
What’s the core idea of this paper?
- The introduced method includes the following steps (see the sketch after this list):
- An existential first-order logical query of any complexity (a sequence of logical steps) is decomposed into atoms (single-step queries) whose truth values are combined using t-norms (for conjunctions) and t-conorms (for disjunctions).
- A neural link predictor, ComplEx, is trained for answering single-step logical queries.
- Answering the complex query is posed as an optimal variable-to-entity assignment after identifying the variables in the complex query.
- An optimal variable-to-entity assignment is achieved with either gradient-based optimization or combinatorial optimization.
- The intermediate logical steps are examined for explainability and understanding the shortcomings of the model.
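A minimal sketch of these steps for the two-hop conjunctive query ?T : r1(a, V) ∧ r2(V, T) is given below. The `score_atom` signature is hypothetical (in the paper, atom truth values come from a pretrained ComplEx link predictor), and the product t-norm is one of the t-norms the authors analyze.

```python
import torch

def answer_2hop_query(anchor, rel1, rel2, entities, score_atom):
    """anchor: (d,) embedding of entity a; rel1, rel2: (d,) relation embeddings;
    entities: (n, d) all entity embeddings; score_atom(h, r, t) -> truth value in [0, 1].
    Signatures are illustrative only."""
    n = entities.size(0)
    # truth value of r1(a, v) for every candidate binding of variable V: (n,)
    s1 = score_atom(anchor.expand(n, -1), rel1.expand(n, -1), entities)
    # truth value of r2(v, t) for every (v, t) pair: (n, n)
    # (materializing all pairs is O(n^2); done here only for clarity)
    s2 = score_atom(entities.repeat_interleave(n, 0), rel2.expand(n * n, -1),
                    entities.repeat(n, 1)).view(n, n)
    # conjunction via the product t-norm, then resolve variable V (here: exact max)
    scores, _ = (s1.unsqueeze(1) * s2).max(dim=0)   # (n,) score for each target T
    return scores                                   # rank entities by this to answer ?T
```

The paper’s gradient-based variant instead optimizes continuous embeddings for the variables with Adam, and the combinatorial variant replaces the exact max with a beam search over top-k candidate entities.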
What’s the key achievement?
- The proposed method produces more accurate results than GQE and Q2B, the strongest prior methods for complex query answering, while using orders of magnitude less training data.
What does the AI community think?
- The paper received an Outstanding Paper Award at ICLR 2021.
What are future research areas?
- Improving the proposed method with respect to processing queries beyond first-order logic.
What are possible business applications?
- Answering complex queries using Knowledge Graphs (KGs) could be used in fact-checking, information retrieval, question answering, and recommendations, among other tasks.
Where can you get implementation code?
- The implementation of the proposed method is available on GitHub.
6. Hopfield Networks is All You Need, by Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
Original Abstract
We introduce a modern Hopfield network with continuous states and a corresponding update rule. The new Hopfield network can store exponentially (with the dimension of the associative space) many patterns, retrieves the pattern with one update, and has exponentially small retrieval errors. It has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models. These heads perform in the first layers preferably global averaging and in higher layers partial averaging via metastable states. The new modern Hopfield network can be integrated into deep learning architectures as layers to allow the storage of and access to raw input data, intermediate results, or learned prototypes. These Hopfield layers enable new ways of deep learning, beyond fully-connected, convolutional, or recurrent networks, and provide pooling, memory, association, and attention mechanisms. We demonstrate the broad applicability of the Hopfield layers across various domains. Hopfield layers improved state-of-the-art on three out of four considered multiple instance learning problems as well as on immune repertoire classification with several hundreds of thousands of instances. On the UCI benchmark collections of small classification tasks, where deep learning methods typically struggle, Hopfield layers yielded a new state-of-the-art when compared to different machine learning methods. Finally, Hopfield layers achieved state-of-the-art on two drug design datasets. The implementation is available at https://github.com/ml-jku/hopfield-layers.
Our Summary
The authors present a modern Hopfield network with continuous states and an update rule that is equivalent to the attention mechanism in Transformers. In the context of this work, a Hopfield network consists of an energy function, which assigns an energy to every possible state given the stored patterns, and an update rule, which moves a query (state) pattern toward an energy minimum, i.e., toward the stored pattern(s) it retrieves. With continuous states and an update rule equivalent to attention, the authors argue that the proposed Hopfield network is a general framework that can serve as a pooling layer, a GRU or LSTM layer, or an attention layer.
The proposed Hopfield networks are continuous and differentiable with respect to their parameters and can be incorporated into any deep learning architecture as a component. As a general framework, the proposed Hopfield networks provide pooling, memory, association, and attention mechanisms, which enables them to summarize a set of vectors, perform better at multiple instance learning (MIL), and use associative memories, among other capabilities. The results achieved by the proposed methods included superior performance on multiple MIL datasets and better performance than standard ML methods on UCI datasets.
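For reference, the paper’s update rule for a state (query) pattern ξ over stored patterns X = (x₁, …, x_N) is ξ_new = X softmax(β Xᵀ ξ). Applied to matrices of queries and of keys/values, one step of this rule recovers exactly the softmax(QKᵀ/√d_k)V attention of Transformers, with β = 1/√d_k.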
What’s the core idea of this paper?
- The authors introduce different types of Hopfield layers for different tasks (see the pooling sketch after this list):
- Layer Hopfield: This layer takes two sets of vectors and models the association between them. It can replace the attention module in the standard transformer architecture, and can thus be used for sequence-to-sequence learning or any operation on point sets (sets of vectors).
- Layer HopfieldPooling: This layer takes a set of vectors and produces a summary of that set. It holds a list of learned queries, each of which produces one output vector. In attention terms, the learned queries attend over the input vectors, which serve as both keys and values. The output is therefore a single vector or a set of vectors, depending on the number of queries. This type of layer is suited to multiple-instance learning, as it can produce a summary of a bag of n instances.
- Layer HopfieldLayer: This layer maps a set of input vectors to a set of output vectors using another set of vectors stored in memory, which can be fixed or trainable. In attention terms, the input vectors act as queries, while the vectors in memory provide the keys and values. This generic layer can approximate Support Vector Machines (SVMs), k-nearest neighbors, approaches that learn vector quantization, and methods for pattern search.
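As an illustration of the pooling variant, here is a minimal plain-PyTorch sketch of the HopfieldPooling idea (the official ml-jku/hopfield-layers package provides the full layers; the class and parameter names below are ours). It performs one Hopfield retrieval step, i.e., one attention step, with learned query patterns.

```python
import torch
import torch.nn as nn

class HopfieldPoolingSketch(nn.Module):
    """Pool a variable-size set of vectors into n_queries summary vectors
    via one modern-Hopfield update step with learned state (query) patterns."""
    def __init__(self, dim, n_queries=1, beta=None):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.beta = beta if beta is not None else 1.0 / dim ** 0.5   # inverse temperature

    def forward(self, x):                       # x: (B, N, dim), a bag of N instances
        # xi_new = X softmax(beta * X^T xi): input vectors serve as keys and values
        attn = torch.softmax(self.beta * self.queries @ x.transpose(1, 2), dim=-1)
        return attn @ x                         # (B, n_queries, dim)

pooled = HopfieldPoolingSketch(dim=128)(torch.randn(8, 50, 128))    # (8, 1, 128)
```

With n_queries = 1, this yields exactly the kind of bag-level summary needed for multiple-instance learning.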
What’s the key achievement?
- The proposed Hopfield networks:
- showed superior performance on multiple-instance learning datasets such as Tiger, Fox, Elephant, and UCSB;
- outperformed standard machine learning methods and deep learning methods on UCI Benchmark Collection tabular datasets.
What are future research areas?
- Studying the advantages and limitations of GRUs/LSTMs, transformers, memory-based neural networks, and pooling layers compared to their Hopfield network equivalents.
What are possible business applications?
- The proposed Hopfield network could be instantiated into many different models where the applications include information retrieval, sequence classification, detecting outliers or border cases, and drug design, among other tasks.
Where can you get implementation code?
- The implementation of the proposed Hopfield Networks is available on GitHub.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more research summaries.