Pre-training in Image Domain

When moving into the image domain, the immediate question is how to form the image "token sequence." A natural choice is to follow the ViT formulation and break an image into a grid of patches (visual tokens).

BEiT. Published as an arXiv preprint in 2021, the idea of BEiT is straightforward. After splitting an image into a 14×14 grid of patches, roughly 40% of the patches are randomly masked, replaced by a learnable embedding, and fed into the transformer. The pre-training objective is to maximize the log-likelihood of the correct visual tokens at the masked positions, and no decoder is needed at this stage. The pipeline is shown in the figure below.

BEiT pre-training pipeline. Image source: https://arxiv.org/abs/2106.08254

In the original paper, the authors also provided a theoretical link between BEiT and the variational autoencoder. So the natural question is: can an autoencoder be used for pre-training?

MAE-ViT. This paper answered the question above by designing a masked autoencoder architecture. Using the same ViT formulation and random masking, the authors proposed to "discard" the masked patches during training, so that only the unmasked patches in the visual token sequence are fed to the encoder. Mask tokens are introduced only at the decoding stage of pre-training, where they are used to reconstruct the masked content. The decoder is flexible, ranging from 1 to 12 transformer blocks with dimensionality between 128 and 1024; more architectural details can be found in the original paper. A minimal sketch of this encode-visible-then-decode-with-mask-tokens flow is given after this section.

Masked Autoencoder architecture. Image source: https://arxiv.org/abs/2111.06377

SimMIM. Slightly different from BEiT and MAE-ViT, this paper proposes using a flexible backbone, such as the Swin Transformer, as the encoder. The proposed prediction head is extremely lightweight: a single linear layer or a 2-layer MLP that regresses the raw pixels of the masked patches.

SimMIM pipeline. Image source: https://arxiv.org/abs/2111.09886
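To make the MAE-style flow described above concrete, here is a minimal PyTorch-flavored sketch. The class name TinyMAE, the layer counts, dimensions, and the 75% mask ratio are illustrative assumptions, not the paper's exact configuration; the key point is that the encoder only ever sees the visible patches, and mask tokens appear only in the lightweight decoder.

```python
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    """Sketch of an MAE-style masked autoencoder (illustrative sizes only)."""

    def __init__(self, num_patches=196, patch_dim=768, enc_dim=768,
                 dec_dim=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)
        # Decoder side: mask token + a few transformer blocks + pixel head.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.pixel_head = nn.Linear(dec_dim, patch_dim)

    def random_masking(self, x):
        """Keep a random subset of patches; return them plus restore indices."""
        B, N, D = x.shape
        num_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
        ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
        ids_keep = ids_shuffle[:, :num_keep]
        x_visible = torch.gather(
            x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return x_visible, ids_restore, num_keep

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened image patches.
        x = self.patch_embed(patches) + self.pos_embed
        x_visible, ids_restore, num_keep = self.random_masking(x)
        latent = self.encoder(x_visible)          # encoder sees visible patches only
        # Re-insert mask tokens at the masked positions for the decoder.
        y = self.enc_to_dec(latent)
        B, N, _ = patches.shape
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        y_full = torch.cat([y, mask_tokens], dim=1)
        y_full = torch.gather(
            y_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y_full.shape[-1]))
        y_full = y_full + self.dec_pos_embed
        recon = self.pixel_head(self.decoder(y_full))  # predicted pixels per patch
        return recon
```

For pre-training, the reconstruction would be compared against the original patch pixels (the paper computes the loss on the masked patches only); after pre-training, the decoder is discarded and only the encoder is kept for downstream tasks.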
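The SimMIM-style objective, masking patches in place, keeping the full-length sequence, and regressing the masked pixels with a very light head under an L1 loss, can be sketched in a few lines. Here encoder, head, and mask_token are hypothetical placeholders for whatever backbone and prediction head one plugs in, and the 0.6 mask ratio is an assumption for illustration.

```python
import torch
import torch.nn.functional as F


def simmim_style_loss(encoder, head, patches, mask_token, mask_ratio=0.6):
    """Sketch: mask patches in place, encode the full sequence,
    regress raw pixels of the masked patches with an L1 loss."""
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio  # True = masked
    x = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patches)
    pred = head(encoder(x))                    # (B, N, D) pixel predictions
    # Loss only on the masked positions, as in masked pixel regression.
    return F.l1_loss(pred[mask], patches[mask])
```

The contrast with the MAE sketch above is the main design point: SimMIM keeps every position in the encoder input and pushes almost all capacity into the backbone, while the prediction head stays as small as a single linear layer.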