A Vision Transformer (ViT) is a transformer that is targeted at vision processing tasks such as image recognition.[1]

Vision Transformers

Vision Transformer Architecture for Image Classification

Transformers found their initial applications in natural language processing (NLP) tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known CNN architectures include Xception, ResNet, EfficientNet,[2] DenseNet,[3] and Inception.[1]

Transformers measure the relationships between pairs of input tokens (words, in the case of text strings), a mechanism termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel, but computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT divides the image into small sections, or patches (e.g., 16x16 pixels), and computes relationships among the patches, at a drastically reduced cost. The patches are arranged in a sequence. Each patch is flattened into a vector and multiplied by a learnable embedding matrix. A learnable position embedding is added to each result, and the resulting sequence is fed to the transformer.[1]
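The patch and embedding steps described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the toy sizes, and the randomly initialised stand-ins for the learnable embedding matrix and position embeddings are all assumptions made for the example, as is the 224x224 input used for the cost comparison.

```python
import random


def split_into_patches(image, patch):
    """Cut an H x W x C image (nested lists) into flattened patch vectors.

    Patches are read left to right, top to bottom, which turns the 2-D
    grid of sections into a 1-D sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ])
    return patches


def embed(patches, embedding, positions):
    """Project each flattened patch and add its position embedding."""
    tokens = []
    for patch_vec, pos_vec in zip(patches, positions):
        projected = [sum(x * w for x, w in zip(patch_vec, col)) for col in embedding]
        tokens.append([p + q for p, q in zip(projected, pos_vec)])
    return tokens


def attention_pairs(num_tokens):
    """Token pairs scored by full self-attention: quadratic in the tokens."""
    return num_tokens ** 2


# Toy sizes, chosen only to keep the example small (assumptions).
rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 6
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = split_into_patches(image, P)
# Randomly initialised stand-ins for the learnable parameters.
embedding = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
positions = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in patches]
tokens = embed(patches, embedding, positions)

print(len(patches), len(patches[0]))  # sequence length, flattened patch size
print(len(tokens), len(tokens[0]))    # sequence length, embedding size

# Cost comparison for an assumed 224x224 input with 16x16 patches.
pixel_pairs = attention_pairs(224 * 224)
patch_pairs = attention_pairs((224 // 16) ** 2)
print(pixel_pairs // patch_pairs)
```

Under these assumptions, the 224x224 image yields 196 patch tokens instead of 50,176 pixel tokens, so full attention scores 65,536 times fewer pairs.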

As in the case of BERT, the class token plays a fundamental role in classification tasks. It is a special token whose output is the only input to the final MLP head, since it has been influenced by all the other tokens.
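A small sketch can make the role of the class token concrete. It assumes a heavily simplified encoder: a single attention layer with identity query, key, and value projections, and a head reduced to one linear layer. The names `class_logits` and `head`, and all the numbers, are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity query/key/value projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs


def class_logits(patch_tokens, class_token, head):
    """Prepend the class token, encode, and feed only its output to the head."""
    encoded = self_attention([class_token] + patch_tokens)
    cls_out = encoded[0]  # outputs for the patch tokens are discarded
    return [sum(x * w for x, w in zip(cls_out, row)) for row in head]


class_token = [0.1, 0.1, 0.1, 0.1]  # learnable in a real model
head = [[1.0, 0.0, -1.0, 0.0],      # hypothetical 2-class linear head
        [0.0, 1.0, 0.0, -1.0]]

# Two inputs that differ only in their last patch token.
patches_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
patches_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

logits_a = class_logits(patches_a, class_token, head)
logits_b = class_logits(patches_b, class_token, head)
print(logits_a != logits_b)
```

Although the head never sees the patch outputs directly, changing a single patch changes the logits, because the class token's output is an attention-weighted mixture of every token in the sequence.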

The architecture for image classification is the most common and uses only the Transformer encoder to transform the input tokens. Other applications, however, also use the decoder part of the traditional Transformer architecture.


Transformers, introduced in 2017 in the paper "Attention Is All You Need",[4] spread widely in natural language processing, soon becoming one of the most widely used and promising neural network architectures in the field.

In 2020, Transformers were adapted for computer vision tasks in the paper "An Image is Worth 16x16 Words".[5] The idea is to break an input image into a series of patches which, once transformed into vectors, are treated like words in a standard transformer.

Whereas in natural language processing the attention mechanism captures the relationships between different words of the text being analysed, in computer vision Vision Transformers capture the relationships between different portions of an image.

In 2021 a pure transformer model demonstrated better performance and greater efficiency than CNNs on image classification.[1]

A study in June 2020 added a transformer backend to ResNet, which dramatically reduced costs and increased accuracy.[6][7][8]

In 2021, several important variants of the Vision Transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Among the most relevant is the Swin Transformer,[9] which, through modifications to the attention mechanism and a multi-stage approach, achieved state-of-the-art results on object detection datasets such as COCO. Another notable variant is the TimeSformer, designed for video understanding tasks and able to capture spatial and temporal information through divided space-time attention.[10][11]
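The saving from divided space-time attention can be illustrated with a rough count of attention pairs. The sketch below assumes that divided attention lets each patch attend first over time (the same location in every frame) and then over space (every patch in its own frame), and it ignores the class token. The clip size and the function names are assumptions chosen for illustration.

```python
def joint_attention_pairs(frames, patches_per_frame):
    """Every patch attends to every patch in every frame."""
    tokens = frames * patches_per_frame
    return tokens * tokens


def divided_attention_pairs(frames, patches_per_frame):
    """Each patch attends over time (same location, all frames) and then
    over space (all patches in its own frame)."""
    tokens = frames * patches_per_frame
    return tokens * (frames + patches_per_frame)


frames, patches_per_frame = 8, 196  # assumed clip: 8 frames of 14x14 patches
joint = joint_attention_pairs(frames, patches_per_frame)
divided = divided_attention_pairs(frames, patches_per_frame)
print(joint, divided)
```

For this assumed clip, the joint scheme scores 2,458,624 pairs while the divided scheme scores 319,872, and the gap widens as the number of frames grows.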

Vision Transformers have also moved beyond the laboratory into autonomous driving, one of the most important application fields of computer vision.

Comparison with Convolutional Neural Networks

Because of the comparatively large patch size commonly used, ViT performance depends more heavily than that of convolutional networks on choices such as the optimizer, dataset-specific hyperparameters, and network depth. Preprocessing with a layer of smaller, overlapping (stride < size) convolutional filters helps with performance and stability.[8]
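The difference between a non-overlapping patch stem and an overlapping convolutional stem can be checked with the standard convolution output-size formula. In the sketch below, the 224-pixel input and the stack of four 3x3, stride-2 convolutions are assumptions chosen for illustration, not a configuration taken from the cited work.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


image_size = 224  # assumed input resolution

# Standard ViT stem: one 16x16 convolution with stride 16 (no overlap).
patchify_grid = conv_output_size(image_size, kernel=16, stride=16)

# Hypothetical convolutional stem: four 3x3 convolutions with stride 2.
# Here stride < kernel size, so neighbouring filters overlap.
size = image_size
stem_sizes = []
for _ in range(4):
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    stem_sizes.append(size)

print(patchify_grid, stem_sizes)
```

Both stems end at a 14x14 grid of tokens, so the transformer sees the same sequence length; the convolutional stem simply reaches it in smaller, overlapping steps.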

In the hybrid approach, a CNN translates from the basic pixel level to a feature map. A tokenizer translates the feature map into a series of tokens that are then fed into the transformer, which applies the attention mechanism to produce a series of output tokens. Finally, a projector reconnects the output tokens to the feature map, which allows the analysis to exploit potentially significant pixel-level details. Working on tokens rather than pixels drastically reduces the number of tokens that need to be analyzed, reducing costs accordingly.[6]
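The tokenizer and projector stages can be sketched as two small attention steps. This is an illustrative simplification under stated assumptions: the CNN output is replaced by random features, the transformer stage is left as an identity placeholder, the learned projections inside the projector are omitted, and the names `tokenize`, `project`, and `token_filters` are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tokenize(pixels, token_filters):
    """Pool N pixel features into L tokens, one spatial attention map per token."""
    channels = len(pixels[0])
    tokens = []
    for filt in token_filters:
        attn = softmax([dot(pixel, filt) for pixel in pixels])  # over pixels
        tokens.append([
            sum(a * pixel[c] for a, pixel in zip(attn, pixels))
            for c in range(channels)
        ])
    return tokens


def project(pixels, tokens):
    """Let each pixel attend over the tokens and add the result back to it."""
    refined = []
    for pixel in pixels:
        attn = softmax([dot(pixel, token) for token in tokens])  # over tokens
        refined.append([
            value + sum(a * token[c] for a, token in zip(attn, tokens))
            for c, value in enumerate(pixel)
        ])
    return refined


rng = random.Random(0)
N, C, L = 36, 8, 4  # toy sizes: 6x6 feature map, 8 channels, 4 tokens
pixels = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(N)]  # stand-in for CNN output
token_filters = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(L)]

tokens = tokenize(pixels, token_filters)
tokens_out = tokens  # identity placeholder for the transformer stage
refined = project(pixels, tokens_out)

print(len(pixels) ** 2, len(tokens) ** 2)  # attention pairs: pixels vs tokens
```

With these toy sizes the transformer would score 16 token pairs instead of 1,296 pixel pairs, while the projector still returns one refined feature vector per pixel.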

CNNs and Vision Transformers differ in many respects, chiefly because of their architectures.

In particular, CNNs achieve excellent results even when trained on data volumes much smaller than those required by Vision Transformers.

This difference in behaviour seems to derive from the different inductive biases of the two architectures. The filter-oriented architecture of CNNs lets these networks grasp the local particularities of the analysed images more quickly, but it also limits them, making it harder to capture global relations.[12][13]

Vision Transformers, on the other hand, have a different kind of bias, toward exploring topological relationships between patches. This lets them capture global, longer-range relations as well, but at the cost of training that is more demanding in terms of data.

Vision Transformers have also been reported to be considerably more robust to input image distortions such as adversarial patches or permutations.[14]

However, the choice need not be exclusive: excellent results have been obtained in several computer vision tasks with hybrid architectures that combine convolutional layers with Vision Transformers.[15][16][17]

The Role of Self-Supervised Learning

The considerable need for data during training has made it essential to find alternative ways to train these models,[18] and self-supervised methods now play a central role. With these approaches, a neural network can be trained almost autonomously, deducing the peculiarities of a specific problem without a large dataset or accurately assigned labels. Being able to train a Vision Transformer without a huge vision dataset could be key to the widespread adoption of this architecture.


Vision Transformers have been used in many computer vision tasks with excellent results, in some cases achieving the state of the art.

Among the most relevant areas of application are:


Many open-source implementations of Vision Transformers and their variants are available online. The main versions of the architecture have been implemented in PyTorch,[19] and implementations are also available for TensorFlow.[20]



  1. ^ a b c d Sarkar, Arjun (2021-05-20). "Are Transformers better than CNN's at Image Recognition?". Medium. Retrieved 2021-07-11.
  2. ^ Tan, Mingxing; Le, Quoc V. (2021-06-23). "EfficientNet V2: Smaller Models and Faster Training". arXiv:2104.00298 [cs.CV].
  3. ^ Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Weinberger, Kilian Q. (2018-01-28). "Densely Connected Convolutional Networks". arXiv:1608.06993 [cs.CV].
  4. ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-12-05). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
  5. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  6. ^ a b Synced (2020-06-12). "Facebook and UC Berkeley Boost CV Performance and Lower Compute Cost With Visual Transformers". Medium. Retrieved 2021-07-11.
  7. ^ Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Tomizuka, Masayoshi; Gonzalez, Joseph; Keutzer, Kurt; Vajda, Peter (2020). "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". arXiv:2006.03677 [cs.CV].
  8. ^ a b Xiao, Tete; Singh, Mannat; Mintun, Eric; Darrell, Trevor; Dollár, Piotr; Girshick, Ross (2021-06-28). "Early Convolutions Help Transformers See Better". arXiv:2106.14881 [cs.CV].
  9. ^ Liu, Ze; Lin, Yutong; Cao, Yue; Hu, Han; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Guo, Baining (2021-03-25). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows". arXiv:2103.14030 [cs.CV].
  10. ^ Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo (2021-02-09). "Is Space-Time Attention All You Need for Video Understanding?". arXiv:2102.05095 [cs.CV].
  11. ^ Coccomini, Davide (2021-03-31). "On Transformers, TimeSformers, and Attention. An exciting revolution from text to videos". Towards Data Science.
  12. ^ Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey (2021-08-19). "Do Vision Transformers See Like Convolutional Neural Networks?". arXiv:2108.08810 [cs.CV].
  13. ^ Coccomini, Davide (2021-07-24). "Vision Transformers or Convolutional Neural Networks? Both!". Towards Data Science.
  14. ^ Naseer, Muzammal; Ranasinghe, Kanchana; Khan, Salman; Hayat, Munawar; Khan, Fahad Shahbaz; Yang, Ming-Hsuan (2021-05-21). "Intriguing Properties of Vision Transformers". arXiv:2105.10497 [cs.CV].
  15. ^ Dai, Zihang; Liu, Hanxiao; Le, Quoc V.; Tan, Mingxing (2021-06-09). "CoAtNet: Marrying Convolution and Attention for All Data Sizes". arXiv:2106.04803 [cs.CV].
  16. ^ Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei (2021-03-29). "CvT: Introducing Convolutions to Vision Transformers". arXiv:2103.15808 [cs.CV].
  17. ^ Coccomini, Davide; Messina, Nicola; Gennaro, Claudio; Falchi, Fabrizio (2022). "Combining Efficient Net and Vision Transformers for Video Deepfake Detection". Image Analysis and Processing – ICIAP 2022. Lecture Notes in Computer Science. Vol. 13233. pp. 219–229. arXiv:2107.02612. doi:10.1007/978-3-031-06433-3_19. ISBN 978-3-031-06432-6. S2CID 235742764.
  18. ^ Coccomini, Davide (2021-07-24). "Self-Supervised Learning in Vision Transformers". Towards Data Science.
  19. ^ vit-pytorch on GitHub
  20. ^ Salama, Khalid (2021-01-18). "Image classification with Vision Transformer". keras.io.