Pre-training in Image Domain

When moving into the image domain, the immediate question is how to form the image "token sequence." A natural choice is to follow the ViT formulation and break an image into a grid of patches (visual tokens).

BEiT. Published as an arXiv preprint in 2021, the idea of BEiT is straightforward. After splitting an image into a 14×14 grid of patches, roughly 40% of the patches are randomly masked, replaced by a learnable embedding, and fed into the transformer. The pre-training objective is to maximize the log-likelihood of the correct visual tokens at the masked positions, and no decoder is needed at this stage. The pipeline is shown in the figure below.

BEiT pre-training pipeline. Image source: https://arxiv.org/abs/2106.08254

In the original paper, the authors also provided a theoretical link between BEiT and the variational autoencoder. So the natural question is: can an autoencoder be used for pre-training?

MAE-ViT. This paper answered the question above by designing a masked autoencoder architecture. Using the same ViT formulation and random masking, the authors proposed to "discard" the masked patches during training, so that only the unmasked patches in the visual token sequence are fed to the encoder. Mask tokens are introduced only at the decoding stage of pre-training, where they are used to reconstruct the masked content. The decoder is flexible, ranging from 1 to 12 transformer blocks with dimensionality between 128 and 1024; more architectural details can be found in the original paper. A minimal sketch of this encode-visible-then-decode-with-mask-tokens flow is given after this section.

Masked Autoencoder architecture. Image source: https://arxiv.org/abs/2111.06377

SimMIM. Slightly different from BEiT and MAE-ViT, this paper proposes using a flexible backbone, such as the Swin Transformer, as the encoder. The proposed prediction head is extremely lightweight: a single linear layer or a 2-layer MLP that regresses the raw pixels of the masked patches.

SimMIM pipeline. Image source: https://arxiv.org/abs/2111.09886
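To make the MAE-style flow described above concrete, here is a minimal PyTorch-flavored sketch. The class name TinyMAE, the layer counts, dimensions, and the 75% mask ratio are illustrative assumptions, not the paper's exact configuration; the key point is that the encoder only ever sees the visible patches, and mask tokens appear only in the lightweight decoder.

```python
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    """Sketch of an MAE-style masked autoencoder (illustrative sizes only)."""

    def __init__(self, num_patches=196, patch_dim=768, enc_dim=768,
                 dec_dim=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)
        # Decoder side: mask token + a few transformer blocks + pixel head.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.dec_pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.pixel_head = nn.Linear(dec_dim, patch_dim)

    def random_masking(self, x):
        """Keep a random subset of patches; return them plus restore indices."""
        B, N, D = x.shape
        num_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
        ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
        ids_keep = ids_shuffle[:, :num_keep]
        x_visible = torch.gather(
            x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return x_visible, ids_restore, num_keep

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened image patches.
        x = self.patch_embed(patches) + self.pos_embed
        x_visible, ids_restore, num_keep = self.random_masking(x)
        latent = self.encoder(x_visible)          # encoder sees visible patches only
        # Re-insert mask tokens at the masked positions for the decoder.
        y = self.enc_to_dec(latent)
        B, N, _ = patches.shape
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        y_full = torch.cat([y, mask_tokens], dim=1)
        y_full = torch.gather(
            y_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y_full.shape[-1]))
        y_full = y_full + self.dec_pos_embed
        recon = self.pixel_head(self.decoder(y_full))  # predicted pixels per patch
        return recon
```

For pre-training, the reconstruction would be compared against the original patch pixels (the paper computes the loss on the masked patches only); after pre-training, the decoder is discarded and only the encoder is kept for downstream tasks.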
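The SimMIM-style objective, masking patches in place, keeping the full-length sequence, and regressing the masked pixels with a very light head under an L1 loss, can be sketched in a few lines. Here encoder, head, and mask_token are hypothetical placeholders for whatever backbone and prediction head one plugs in, and the 0.6 mask ratio is an assumption for illustration.

```python
import torch
import torch.nn.functional as F


def simmim_style_loss(encoder, head, patches, mask_token, mask_ratio=0.6):
    """Sketch: mask patches in place, encode the full sequence,
    regress raw pixels of the masked patches with an L1 loss."""
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio  # True = masked
    x = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patches)
    pred = head(encoder(x))                    # (B, N, D) pixel predictions
    # Loss only on the masked positions, as in masked pixel regression.
    return F.l1_loss(pred[mask], patches[mask])
```

The contrast with the MAE sketch above is the main design point: SimMIM keeps every position in the encoder input and pushes almost all capacity into the backbone, while the prediction head stays as small as a single linear layer.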