It has been clear for some time that Transformers had arrived in the field of computer vision to amaze, but hardly anyone could have imagined such astonishing results from a Vision Transformer so soon after their first application. In this article, we discuss one of the most interesting recent advances in computer vision: DINO, announced a few days ago by Facebook AI in the publication “Emerging Properties in Self-Supervised Vision Transformers” [2]. This research presents a self-supervised method called DINO, defined as a form of self-distillation with no labels, and uses it to train a Vision Transformer.
If you’ve never heard of Vision Transformers, or of Transformers in general, I suggest you take a look at my first article, which covers the topic in depth.
Vision Transformer
Just to give a very brief introduction: Transformers are a deep learning architecture that has been among the most widely used in Natural Language Processing for several years now and which, since 2020, has also been applied to Computer Vision with exceptional results.
The standard Transformer takes as input a sequence of vectors representing the words of a sentence, to which the self-attention mechanism is applied. The intuition that allowed this architecture to be brought into computer vision is to view an image as a sequence of non-overlapping patches which, through a linear transformation, are turned into vectors and treated as if they were the words of a sentence.
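To make this concrete, here is a minimal sketch of what patch embedding could look like in PyTorch; the patch size, embedding dimension, and the use of a strided convolution are illustrative assumptions rather than the exact implementation of any particular model:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects each
    patch to an embedding vector, so it can be treated like a word token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, 768): a "sentence" of patches

# Usage: a batch of 2 images becomes 2 sequences of 196 patch tokens.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```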
DINO: Self-Distillation with No Labels
Facebook AI researchers wondered whether the success of Transformers in Computer Vision stemmed from supervised training, and whether there was a way to build a self-supervised system that could be trained on unlabeled datasets.
The goal was to achieve results with Vision Transformers that were not merely comparable to convolutional networks but clearly superior, even on poorly labeled datasets, thus making the heavy computational demands of Transformers and their appetite for large amounts of data easier to justify.
The inspiration comes from Transformers’ successes in Natural Language Processing, where self-supervised pretraining led to the emergence of extremely effective models such as BERT and GPT.
To work in a self-supervised setting, it is necessary to find intelligent ways of extracting relevant information from the available data. Here, the approach chosen by the researchers was to use two networks with the same architecture, one defined as the student and the other as the teacher.
These two networks take as input two representations of the same image. In particular, for each image in the training set, a multi-crop augmentation is applied to extract two sets of views from it: two large, partially overlapping crops that give a global idea of the image under consideration, and a series of smaller crops that instead give a local representation of the image.
All the views are passed to the student network, while only the global ones are passed to the teacher.
Taking an image of two kittens as an example, two global views are extracted from it, each covering a large part of the image and thus making its content easier to interpret, along with three local views covering smaller areas of the image that are significantly harder for a network to interpret. These views are also augmented using techniques such as random rotation and color jitter.
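As a rough illustration, a multi-crop pipeline along these lines could be written with torchvision; the crop scales, output sizes, and augmentation parameters below are illustrative assumptions, not the exact values used by the authors:

```python
from torchvision import transforms

# Global crops cover a large fraction of the image; local crops are small.
global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomRotation(10),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomRotation(10),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def multi_crop(image, n_local=3):
    """Returns 2 global views plus n_local local views of the same PIL image."""
    return [global_crop(image) for _ in range(2)] + \
           [local_crop(image) for _ in range(n_local)]
```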
During training, only the student is updated, and what we want to achieve is that the pair of networks becomes able to understand that the local and global representations, although apparently different, depict the same subject.
But why are these two networks called student and teacher? And why is only the student trained? The terminology comes from knowledge distillation, a technique used to transfer the knowledge accumulated during training from a large model to a smaller one. In the traditional distillation approach, a student network is trained to match the output of a given teacher network.
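For reference, a minimal sketch of this classic distillation objective might look as follows; the temperature value and the loss formulation are the standard textbook ones, used here purely for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """Cross-entropy between the teacher's softened output distribution
    and the student's: the student learns to mimic the teacher."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```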
In this case, self-distillation is used in a rather different way: both networks have the same size, and the teacher is not defined a priori but is built from the student. During training, the knowledge the student acquires is gradually propagated to the teacher (in the paper, through an exponential moving average of the student’s weights, in the spirit of the mean-teacher approach [4]); the teacher thus benefits from what the student learns across all the views, while producing its own targets from the global views alone.
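Putting the pieces together, here is a highly simplified sketch of one DINO-style training step, written from the description above and the paper’s high-level recipe; the momentum, temperatures, and centering constants are illustrative values, and `student`, `teacher`, `views`, and `center` are placeholders the caller must provide:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # The teacher's weights drift slowly toward the student's (mean-teacher style).
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)

def dino_step(student, teacher, views, center, optimizer,
              t_student=0.1, t_teacher=0.04, center_momentum=0.9):
    # views[:2] are the global crops; views[2:] are the local crops.
    with torch.no_grad():
        teacher_logits = [teacher(v) for v in views[:2]]
        # Centering plus a low temperature ("sharpening") stabilize the targets.
        teacher_probs = [F.softmax((z - center) / t_teacher, dim=-1)
                         for z in teacher_logits]
    student_logp = [F.log_softmax(student(v) / t_student, dim=-1) for v in views]

    # Cross-entropy between each teacher view and every *different* student view.
    loss, n_terms = 0.0, 0
    for ti, tp in enumerate(teacher_probs):
        for si, sl in enumerate(student_logp):
            if si == ti:
                continue
            loss = loss - (tp * sl).sum(dim=-1).mean()
            n_terms += 1
    loss = loss / n_terms

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)

    # Keep a running average of teacher outputs to use as the center.
    with torch.no_grad():
        batch_center = torch.cat(teacher_logits).mean(dim=0)
        center.mul_(center_momentum).add_(batch_center, alpha=1 - center_momentum)
    return loss.item()
```

In the paper, it is precisely this combination of centering and sharpening of the teacher’s outputs that keeps the two networks from collapsing onto a trivial constant solution.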
DINO’s Results
One of the biggest challenges in computer vision has always been segmentation, which is extremely useful in a large number of tasks and requires the network to fully understand what is in an image. Normally this task is tackled with supervised approaches, but the segmentation that emerges from the attention maps of a Vision Transformer trained with DINO is much cleaner and more accurate than that obtained with supervised approaches, all without the need for labels!
By training a Vision Transformer with the DINO algorithm, the researchers also found that the model learns to identify the main object in a scene in much the same way a human would, and with remarkable precision, even handling occlusions that partially hide the object, such as sea waves.
Applying DINO to ImageNet, the researchers saw that it is able to use the features it has learned to divide the images into extremely accurate clusters.
Once again, Transformers have proved to be the bearers of great novelty, and thanks to DINO’s innovative approach, the results are more exciting than ever in a year that has already given us plenty of news in this field.
If you want to know more about Transformers, Vision Transformers, TimeSformers and Self-Attention, have a look at my first article!
References and Insights
[1] Facebook AI, “Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training”
[2] Mathilde Caron et al., “Emerging Properties in Self-Supervised Vision Transformers”
[3] Yannic Kilcher, “DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)”
[4] Antti Tarvainen et al., “Mean Teachers are better role models”
[5] Davide Coccomini, “On Transformers, TimeSformers and Attention”
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.