Anyone who has ever approached the world of machine learning has certainly heard of supervised learning and unsupervised learning. These are two important approaches to Machine Learning that have been widely used for years. Only recently, however, has a new term exploded in popularity: Self-Supervised Learning! But let’s get there step by step and look at the various methods one by one, trying to find an analogy with the human brain.
Supervised Learning is like “learning from labelled examples”. The model is trained on data that has been carefully labelled so that each example is associated with a particular class. By studying the characteristics of the examples in each class, the model learns to generalise and can then classify even data it has never seen. This approach therefore requires well-labelled data, which is not always available, and the model may develop biases depending on how the labelling was conducted.
The analogy with the human brain: Study a book that explicitly tells you what a dog is and what a cat is by showing you numerous labelled examples.
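To make this concrete, here is a minimal sketch of a supervised training step in PyTorch (the toy model, random tensors and hyperparameters are placeholders for illustration only, not part of any of the papers discussed here): the key point is that every input comes with a human-provided label.

```python
# Minimal sketch of supervised learning: every input x comes with a label y.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy image classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a labelled dataset: images plus their class indices ("dog", "cat", ...).
images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))

logits = model(images)
loss = loss_fn(logits, labels)   # error between predictions and the human-provided labels
loss.backward()
optimizer.step()
```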
Unsupervised Learning, instead, consists of searching unlabelled data for groups of examples that share common characteristics. In the majority of cases, these methods boil down to clustering. The unsupervised approach does not require the dataset to be labelled, but it does require many examples, considerable computational resources, and a function describing how different any two examples are, which is not always easy to define.
The analogy with the human brain: Observe lots of dogs and cats running around and work out which are dogs and which are cats, dividing them into two groups.
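As a rough sketch of this idea (using scikit-learn’s KMeans on made-up feature vectors, purely for illustration), clustering only needs the examples and a notion of distance between them, never a label:

```python
# Minimal sketch of unsupervised learning: no labels, only distances between examples.
import numpy as np
from sklearn.cluster import KMeans

features = np.random.randn(200, 64)                        # stand-in for unlabelled feature vectors
cluster_ids = KMeans(n_clusters=2).fit_predict(features)   # split the data into two groups
print(cluster_ids[:10])  # group assignments, with no notion of which group is "dog" or "cat"
```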
Self-Supervised Learning is an innovative unsupervised approach that is enjoying great success and is now considered by many to be the future of Machine Learning [1, 3, 6].
The main idea is to train on a dataset, e.g. of images, where each example is provided as input both in its original form and in a transformed version. These transformations can be of any kind, such as cropping or rotation.
The model then has to minimise the difference between the prediction of the network that received the original image, and therefore has a complete and unchanged view of the input, and the prediction of the network that received the transformed image.
With this approach, the resulting model has been shown to generalise extremely well and without any need for labels, producing high-quality representations of the inputs, in some cases even better than supervised approaches! Models trained in this way learn a representation space of their own, in which transformed images obtained from similar subjects end up close together.
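A highly simplified sketch of this training scheme is shown below (hypothetical encoder and toy augmentations; real methods such as SimSiam, BYOL or DINO add extra machinery, e.g. momentum encoders or contrastive terms, to avoid trivial solutions):

```python
# Minimal sketch: push the representation of a transformed image towards
# the representation of its original version (no labels involved).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stand-in for a real backbone
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
augment = T.Compose([T.RandomResizedCrop(32), T.RandomHorizontalFlip()])

images = torch.rand(16, 3, 32, 32)              # unlabelled batch
transformed = augment(images)                   # cropped / flipped version of the same images

z_original = encoder(images).detach()           # target: representation of the unchanged input
z_transformed = encoder(transformed)            # prediction from the transformed input
loss = -F.cosine_similarity(z_transformed, z_original, dim=-1).mean()
loss.backward()
optimizer.step()
```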
The analogy with the human brain: Imagining something that is not present in what you are observing. For example, imagining that a pen will fall when you see it roll towards the edge of the table, or imagining the end of a cat’s tail even though it is hidden behind a tree.
Why Self-Supervised Learning in Vision Transformers?
Although Vision Transformers can achieve better results than other traditional architectures, their success depends on a considerable demand for data. Training these models in a supervised manner therefore requires extensive labelling work that is not always possible or sustainable. Realising self-supervised approaches for Vision Transformers may thus be a way to make these models not only powerful but also easier to apply to a wider range of problems.
To understand how powerful this approach is, let’s move for a second to the field of Natural Language Processing where self-supervised approaches have made it possible to achieve unthinkable results.
GPT-3, one of the largest language models to date with 175 billion parameters, is considered the first step towards Artificial General Intelligence (AGI) [7] and is able to translate texts, summarise them, answer questions and even write code based on a description in words! But to train a large model like this, which is also based on Transformers, you need a lot of data, and GPT-3 in particular was trained with 570GB of text gathered by crawling the internet. If we wanted to train this model in a supervised manner, we would have to label all of this data manually, and that’s just insane!
It would also be possible to overcome this hurdle with other, more classical unsupervised approaches, but we would need to define a suitable similarity measure (and think about what this means for images if we move to Computer Vision), consume a greater amount of computational resources, and then probably end up with a less competent model!
In the following paragraphs, some basic aspects of Vision Transformers will be taken for granted; if you want to go deeper into the subject, I suggest you read my previous overview of the architecture.
SiT: Self-Supervised Vision Transformer
Given the unquestionable advantages of training a model in a self-supervised manner, one of the possible methods proposed is the Self-Supervised Vision Transformer (SiT) [4]. The underlying assumption of this approach is that, by recovering the corrupted part of an image from the uncorrupted part based on the context from the whole visual field, the network will implicitly learn the notion of visual integrity.
In this method, the input image is corrupted according to one of the possible strategies available: random drop, random replace, colour distortion, etc.
The image is then divided into patches and passed through the classic Vision Transformer mechanisms together with two additional tokens, the rotation token used for rotation prediction and the contrastive token used for contrastive learning.
The resulting representations from the transformer encoder are then transformed back into patches and recomposed to obtain the reconstructed image. The model then attempts to reduce the difference between this reconstruction and the original image.
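Leaving the transformer details aside, the training signal looks roughly like the sketch below (the encoder, decoder, heads and corruption are toy stand-ins, not the actual SiT implementation): the network reconstructs the uncorrupted image and, at the same time, predicts the rotation that was applied.

```python
# Rough sketch of a SiT-style training signal: reconstruct the corrupted image
# and predict the applied rotation, with no supervision beyond the image itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))   # stand-in for the ViT encoder
decoder = nn.Linear(256, 3 * 32 * 32)                                 # maps features back to pixels
rotation_head = nn.Linear(256, 4)                                     # predicts 0/90/180/270 degrees

images = torch.rand(8, 3, 32, 32)                                     # unlabelled batch
rotations = torch.randint(0, 4, (8,))
rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2)) for img, k in zip(images, rotations)])

corrupted = rotated.clone()
corrupted[:, :, 8:16, 8:16] = 0.0                                     # "random drop": zero out a block

features = encoder(corrupted)
reconstruction = decoder(features).view_as(rotated)
loss = F.l1_loss(reconstruction, rotated) + F.cross_entropy(rotation_head(features), rotations)
loss.backward()
```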
The weights of the network trained with this approach can then be used as a starting point for another task such as image classification, object detection, segmentation etc.
DINO: Self-Distillation with no labels
One of the approaches that has achieved the most amazing results is certainly DINO [2], which, through a series of data augmentations and the knowledge distillation technique, has been able to carry out image segmentation in an amazing way!
A detailed overview of DINO and its architecture can be found in my previous article on this approach.
This is currently one of the most promising approaches of those presented and was able to highlight the possibilities that the combination of Vision Transformers and Self-Supervision has to offer.
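The core self-distillation idea can be sketched as follows (toy networks and hyperparameters; real DINO adds multi-crop augmentation, output centering and carefully chosen temperatures): a student network is trained to match the output distribution of a teacher network on a different view of the same image, and the teacher is simply a moving average of the student.

```python
# Very rough sketch of self-distillation without labels: the student matches the
# teacher's output on another view of the same image; no gradients flow into the teacher.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))    # stand-in backbone + head
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

images = torch.rand(8, 3, 32, 32)
view_a, view_b = images + 0.1 * torch.randn_like(images), images.flip(-1)  # two augmented views

teacher_probs = F.softmax(teacher(view_a) / 0.04, dim=-1)             # sharpened teacher targets
student_log_probs = F.log_softmax(student(view_b) / 0.1, dim=-1)
loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()        # cross-entropy between distributions
loss.backward()

with torch.no_grad():                                                 # teacher = moving average of student
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.996).add_(0.004 * ps)
```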
EsViT: Efficient Self-Supervised Vision Transformer
DINO was then recently used as the basis for a new, more advanced Vision Transformer, called Efficient Self-Supervised Vision Transformer (EsViT) [8]. Like DINO, EsViT exploits Knowledge Distillation, with a teacher network that does not receive gradient updates and a student network that is continuously updated in an attempt to minimise a loss function. An interesting peculiarity, in this case, is that it employs a multi-stage transformer instead of a monolithic one and exploits sparse attention to reduce computation. The loss function of the overall model is given by the combination of two distinct losses, a Region Loss and a View Loss.
Given an input image, a set of different views is generated using different data augmentation techniques, and these views are combined into pairs. Each pair is then transformed into tokens and used to compute the first component of the loss, the view loss.
The authors highlight: “In DINO the loss function encourages “local-to-global” correspondences only at a coarse level: the large crop and the small crop are matched in the view level, leaving region-to-region correspondence unspecified” [8]. EsViT addresses this problem by also moving to the region level: both images of each pair are divided into patches, which are then exploited by a dense self-supervised learning method that works directly on the local features and takes their correspondences into account.
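Put together, the objective is roughly the sketch below (all tensors, temperatures and shapes are hypothetical stand-ins for the encoder outputs described in the paper): a view-level loss matches the global representations of the two views, while a region-level loss matches each student patch feature to its most similar patch in the teacher’s view.

```python
# Rough sketch of combining a view-level and a region-level loss.
import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs: a global feature and per-patch features for each view.
view_s = torch.randn(8, 128, requires_grad=True)         # student, view level
view_t = torch.randn(8, 128)                              # teacher, view level
region_s = torch.randn(8, 49, 128, requires_grad=True)    # student, 49 patches per image
region_t = torch.randn(8, 49, 128)                         # teacher, 49 patches per image

def soft_cross_entropy(student_logits, teacher_logits):
    teacher_probs = F.softmax(teacher_logits / 0.04, dim=-1)
    return -(teacher_probs * F.log_softmax(student_logits / 0.1, dim=-1)).sum(-1).mean()

view_loss = soft_cross_entropy(view_s, view_t)

# For every student patch, find the teacher patch with the highest cosine similarity.
sim = torch.einsum("bqd,bkd->bqk", F.normalize(region_s, dim=-1), F.normalize(region_t, dim=-1))
best = sim.argmax(dim=-1)                                  # index of the best-matching teacher patch
matched_t = torch.gather(region_t, 1, best.unsqueeze(-1).expand(-1, -1, region_t.size(-1)))
region_loss = soft_cross_entropy(region_s.flatten(0, 1), matched_t.flatten(0, 1))

loss = view_loss + region_loss
loss.backward()
```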
Looking at the attention maps, it can be seen that DINO tends to automatically learn class-specific attention maps leading to foreground object segmentation, regardless of whether the query is located in the foreground or the background, while the attention maps learnt by the various heads of EsViT are more diverse.
This different approach has led EsViT to achieve very good results on ImageNet with a lighter and more efficient model!
Conclusions
The amount of data humanity is producing is staggering and unprecedented. It is estimated that 2.5 quintillion bytes are produced every day, and this figure is set to rise [11]. Internet of Things systems are becoming more and more widespread, with sensors collecting data at every moment; the mass use of social networks and their accessibility allows anyone to put information on the web in a few moments; and satellites collect data of all kinds about our planet.
Just think that 90% of the data in the world was generated over the last two years alone!
This data is literally gold for Machine Learning, the fuel for any model, and its abundance can open the door to countless applications that we cannot even imagine today. It is unthinkable, however, to believe that these models can be trained in a supervised manner because this would require exhausting and unsustainable manual labelling.
Unsupervised, and in particular self-supervised, methods will therefore be increasingly central and important in this sector, and in combination with new architectures such as Vision Transformers, they will be the main actors of the future of Machine Learning.
References and Insights
[1] Facebook AI, “Self-supervised learning: The dark matter of intelligence”
[2] Davide Coccomini, “On DINO, Self-Distillation with no labels”
[3] Nilesh Vijayrania, “Self-Supervised Learning Methods for Computer Vision”
[4] Sara Atito et al., “SiT: Self-supervised vIsion Transformer”
[5] Davide Coccomini, “On Transformers, Timesformers and Attention”
[6] Matvii Kovtun, “Self-supervised Learning, Future of AI”
[7] OpenAI, “GPT-3 Powers the Next Generation of Apps”
[8] Chunyuan Li et al., “Efficient Self-supervised Vision Transformers for Representation Learning”
[9] Davide Coccomini, “Is Attention what you really need in Transformers?”
[10] Davide Coccomini, “Vision Transformers or Convolutional Neural Networks? Both!”
[11] Bernard Marr, “How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read”
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.