Vision Transformer (ViT) Study Material

Data/Machine learning 2021. 12. 5. 18:48

3. What is the Class Token?

One of the interesting things about the Vision Transformer is that the architecture uses a Class Token: a learnable token that is randomly initialized and prepended to the input sequence. What is the reason for this Class Token, and what does it do? Because it is randomly initialized, the Class Token contains no useful information on its own. As the sequence passes through the Transformer layers, however, the Class Token accumulates information from the other tokens in the sequence, and it can accumulate more the deeper the Transformer is. When the Vision Transformer performs the final classification, it uses an MLP head that looks only at the last layer's Class Token and at no other information. This suggests that the Class Token acts as a placeholder that stores information extracted from the other tokens in the sequence. By allocating an empty token for this purpose, the Vision Transformer seems to make it less likely that the final output is biased towards or against any single one of the other tokens.
[1] https://deepganteam.medium.com/vision-transformers-for-computer-vision-9f70418fe41a
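
The mechanism described above can be sketched in a few lines of PyTorch. This is only an illustrative sketch, not the code from the original paper or from any particular library: the class name `TinyViTHead`, the sizes, and the single `nn.Linear` used as a stand-in for the MLP head are all assumptions made for this example, and patch embedding and position embeddings are left out to keep the focus on the Class Token.

```python
import torch
import torch.nn as nn


class TinyViTHead(nn.Module):
    """Hypothetical minimal model: patch tokens -> encoder -> Class Token readout."""

    def __init__(self, dim=32, depth=2, heads=4, num_classes=10):
        super().__init__()
        # Learnable Class Token, randomly initialized; shape (1, 1, dim).
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Stand-in for the MLP head (an assumption: a single linear layer).
        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim), assumed already embedded.
        batch = patch_tokens.shape[0]
        # Copy the same Class Token to every item in the batch.
        cls = self.cls_token.expand(batch, -1, -1)
        # Prepend it, so the sequence length becomes num_patches + 1.
        x = torch.cat([cls, patch_tokens], dim=1)
        # Self-attention lets the Class Token gather information from the patches.
        x = self.encoder(x)
        # Classify using only the Class Token (position 0) of the last layer.
        return self.mlp_head(x[:, 0])


model = TinyViTHead()
patches = torch.randn(3, 16, 32)  # 3 images, 16 patch tokens each, dim 32
logits = model(patches)
print(logits.shape)  # torch.Size([3, 10])
```

Note that the head never sees `x[:, 1:]`, the patch tokens. Whatever the classifier needs has to reach position 0 through attention, which is the "placeholder" behaviour described above.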

4. ViT PyTorch Code

https://github.com/lucidrains/vit-pytorch
