From patch embeddings to hierarchical attention: ViT, DeiT, Swin, DINOv2 self-supervised features, Segment Anything, patch size trade-offs, and how each architecture plugs into a VLM image encoder.
Convolutional networks dominated vision from AlexNet (2012) through EfficientNet (2019) because their inductive biases — translation equivariance, local receptive fields, weight sharing — matched natural image statistics well. The cost: global context required stacking many layers, and architecture design remained heavily hand-crafted.
Transformers process sequences without locality assumptions. ViT (An Image is Worth 16×16 Words, Dosovitskiy et al., 2020) made a simple observation: split the image into non-overlapping patches, flatten each patch and linearly project it to a token vector, and treat the result as a sequence. Self-attention then models all pairwise patch interactions in a single layer.
A 224×224 image split into 16×16 patches yields 196 tokens of dimension 768. This is a short sequence compared to language (GPT-2 used 1,024 tokens), so the quadratic attention cost is tractable. The patch is the unit of meaning: the model learns which patches matter and which don’t via attention weights.
A ViT-B/16 on 224×224 input has exactly 197 tokens: 196 patch tokens + 1 prepended [CLS] token. The [CLS] token aggregates global information and is used as the image representation for classification or projection to language space.
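As a concrete illustration, here is a minimal PyTorch sketch of the patchify-and-embed step; the strided convolution is equivalent to cutting non-overlapping patches and applying a shared linear projection. The class is a simplification for shape-checking, not the reference implementation.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P×P patches and linearly project each to dimension d."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        n = (img_size // patch) ** 2              # 196 patches at 224px / 16
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                          # x: [B, 3, 224, 224]
        t = self.proj(x).flatten(2).transpose(1, 2)    # [B, 196, 768]
        cls = self.cls.expand(x.shape[0], -1, -1)      # prepend [CLS]
        return torch.cat([cls, t], dim=1) + self.pos   # [B, 197, 768]

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768])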
| Variant | Layers | Heads | Hidden | MLP dim | Params | IN-1K acc |
|---|---|---|---|---|---|---|
| ViT-S/16 | 12 | 6 | 384 | 1536 | 22M | 79.9% |
| ViT-B/16 | 12 | 12 | 768 | 3072 | 86M | 81.8% |
| ViT-L/16 | 24 | 16 | 1024 | 4096 | 307M | 84.0% |
| ViT-H/14 | 32 | 16 | 1280 | 5120 | 632M | 85.2% |
DeiT (Data-efficient Image Transformers, Touvron et al., Facebook 2020) showed that ViT can match ResNets on IN-1K without extra data, purely through improved training recipes. The key innovations: stronger augmentation, mixup, cutmix, label smoothing, stochastic depth, and a novel distillation token.
The distilled variant, DeiT-B⚗, adds a second learned token alongside [CLS]. This distillation token is trained to mimic a CNN teacher (RegNetY-16GF, 82.9% top-1) via soft cross-entropy on the teacher's logits. At inference, the predictions from the [CLS] and distillation heads are averaged.
DeiT proved that the ViT architecture is not intrinsically data-hungry; it was the training recipe that was inadequate. Later image encoders (DINOv2, EVA-CLIP, SigLIP) build on DeiT-style augmentation and regularisation recipes. The distillation token idea also influenced DINOv2's self-distillation design.
import timm

# DeiT-B without distillation head (for embedding, not classification)
model = timm.create_model(
    'deit_base_patch16_224',
    pretrained=True,
    num_classes=0,  # remove head → returns [B, 768] CLS embedding
)

# DeiT-B with distillation (deit_base_distilled_patch16_224):
# features = model.forward_features(x) → [B, 197+1, 768] (extra distill token)
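For reference, a minimal sketch of the soft-distillation objective described above; the temperature and loss weighting here are illustrative assumptions, not the paper's exact settings.
import torch.nn.functional as F

def deit_soft_distill_loss(cls_logits, dist_logits, teacher_logits, labels, tau=3.0, alpha=0.5):
    # [CLS] head: ordinary cross-entropy against the ground-truth label
    ce = F.cross_entropy(cls_logits, labels)
    # distillation head: match the softened CNN-teacher distribution
    kd = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kd

# At inference the two heads' softmax outputs are averaged:
# probs = (cls_logits.softmax(-1) + dist_logits.softmax(-1)) / 2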
Swin Transformer (Liu et al., Microsoft 2021) reintroduced hierarchy and local windows to the Transformer, recovering CNN-like inductive biases while keeping attention-based feature learning. It became the dominant backbone for dense prediction tasks (detection, segmentation) where multi-scale features are essential.
The hierarchical feature maps at strides 4, 8, 16, 32 (1/4, 1/8, 1/16, 1/32 resolution) plug directly into the FPN necks used by Mask R-CNN and Cascade Mask R-CNN, and into UperNet-style segmentation heads. Swin-L as the backbone for HTC++ achieved 58.7 box AP on COCO, surpassing EfficientDet and ResNet-based detectors. For VLM document understanding (LayoutLMv3, UDOP) the hierarchical tokens from Swin carry spatial layout information that a flat ViT cannot represent at the same resolution budget.
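To see the pyramid in practice, a short sketch using timm's features_only interface; the exact model name and whether outputs come back NHWC or NCHW depend on the timm version.
import timm, torch

# Swin-B with feature-pyramid outputs instead of a classification head
backbone = timm.create_model(
    'swin_base_patch4_window7_224',
    pretrained=True,
    features_only=True,   # return the 4 stage outputs (strides 4, 8, 16, 32)
)
feats = backbone(torch.randn(1, 3, 224, 224))
for f in feats:
    print(f.shape)   # roughly 56², 28², 14², 7² spatial grids with growing channel counts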
DINO (Self-DIstillation with NO labels, Caron et al., Meta 2021) showed that self-supervised ViT features, without any label supervision, learn object-level semantic segmentation in their attention maps. DINOv2 (Oquab et al., Meta 2023) scaled this to ViT-g with a curated 142M image dataset (LVD-142M) and produced the best general-purpose vision features available.
The student network sees local and global crops with heavy augmentation; the teacher sees only the global views. Teacher weights are an exponential moving average (EMA) of the student weights (no gradients flow to the teacher: it is a momentum encoder). The loss is cross-entropy between the student's and the teacher's softmax outputs, with centering and sharpening of the teacher outputs to avoid collapse.
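A minimal sketch of that loss and the EMA update, with hypothetical tensor names; the real implementation adds multi-crop views, a projection head, and schedules for the momentum and temperatures.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    # teacher parameters track the student: momentum encoder, no gradients
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)

def dino_loss(student_out, teacher_out, center, t_s=0.1, t_t=0.04):
    # teacher output: centered (subtract running mean) then sharpened (low temperature)
    t = F.softmax((teacher_out.detach() - center) / t_t, dim=-1)
    s = F.log_softmax(student_out / t_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()   # cross-entropy between the two distributions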
The [CLS] attention maps approximate semantic segmentation masks, without any labels. DINOv2 combines DINO (self-distillation), iBOT (masked image modelling: predict masked patch tokens), and SwAV-style online cluster assignments. It is trained on LVD-142M, curated from 1.2B internet images via SSCD-based deduplication and embedding-based retrieval against curated seed datasets such as ImageNet.
Loading facebook/dinov2-giant from Hugging Face:
import torch
from transformers import AutoModel

dinov2 = AutoModel.from_pretrained("facebook/dinov2-giant")

B = 2
x = torch.randn(B, 3, 518, 518)  # native res; also works at 224×224
outputs = dinov2(pixel_values=x)
cls_feat = outputs.last_hidden_state[:, 0]      # [B, 1536]
patch_feats = outputs.last_hidden_state[:, 1:]  # [B, 1369, 1536] (37×37 grid @ 518px)

# For dense tasks: reshape patch tokens to a spatial grid
h = w = 518 // 14  # = 37
feats_2d = patch_feats.reshape(B, h, w, -1).permute(0, 3, 1, 2)  # [B, 1536, 37, 37]
DINOv2 patch tokens carry dense spatial semantic information; CLIP patch tokens are optimised for global image-text matching. For tasks requiring pixel-level understanding (depth, segmentation, OCR) DINOv2 wins. For tasks requiring open-vocabulary text alignment (classification, retrieval, VQA with novel categories) CLIP or SigLIP wins. Some recent VLMs (e.g. InternVL) concatenate both.
SAM (Segment Anything Model, Kirillov et al., Meta 2023) is a promptable segmentation system trained on SA-1B: 11 million images with over 1 billion automatically-generated masks. The image encoder is a ViT-H/16 with windowed attention plus four global attention layers, making it one of the highest-capacity image encoders publicly released at the time.
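Prompting SAM with a single foreground point via the segment_anything package looks roughly like this (the checkpoint filename and the dummy image are placeholders):
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)   # RGB HxWx3, e.g. loaded via PIL/cv2
predictor.set_image(image)                          # runs the ViT-H encoder once per image

# one positive click at pixel (x=500, y=300); label 1 = foreground
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,     # three candidate masks, ranked by predicted IoU
)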
SAM 2 adds a memory bank (streaming memory attention) to propagate masks across video frames. The image encoder switches to Hiera (hierarchical ViT) at 4 scales for efficiency. SAM 2 processes a 1024×1024 frame in ~8ms on an A100 vs ~51ms for SAM, enabling real-time interactive segmentation. Model weights are Apache-2.0 licensed.
Patch size is the primary dial controlling the token count, and hence the memory & compute budget, of any ViT-based system. The relationship is: for an H×W image with patch size P, the number of tokens is (H/P)×(W/P). Attention cost is O(n²), so halving patch size quadruples attention FLOPs.
| Patch | Tokens @ 224px | Tokens @ 448px | Tokens @ 896px | Notes |
|---|---|---|---|---|
| /32 | 49 | 196 | 784 | Very coarse; only viable at 224px for global tasks |
| /16 | 196 | 784 | 3136 | Standard ViT-B/L; good balance |
| /14 | 256 | 1024 | 4096 | EVA-CLIP, DINOv2, SigLIP-SO; finer spatial resolution |
| /8 | 784 | 3136 | 12544 | Dense tasks; expensive; used in DINO-small pre-training |
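The arithmetic behind the table, as a quick sanity check (relative attention cost scales with the square of the token count):
def vit_tokens(h, w, patch):
    return (h // patch) * (w // patch)

for patch in (32, 16, 14, 8):
    n224, n448 = vit_tokens(224, 224, patch), vit_tokens(448, 448, patch)
    print(f"/{patch}: {n224} tokens @224px, {n448} @448px, "
          f"attention-cost ratio vs /16 @224px: {(n224 / 196) ** 2:.1f}x")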
Position embeddings are trained at a fixed resolution. To run at a higher resolution, bicubic interpolation of the 2D position embedding grid is used. This works reasonably for up to 2× the training resolution. Beyond that, fine-tuning on the target resolution is strongly recommended.
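A sketch of that interpolation step, assuming a ViT-style position-embedding tensor of shape [1, 1 + N, d] with the [CLS] slot first:
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    # pos_embed: [1, 1 + old_grid², d]; keep the [CLS] entry, resample the rest bicubically
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. 224px → 448px for a /16 model: 14×14 grid → 28×28 grid
new_pe = resize_pos_embed(torch.randn(1, 197, 768), old_grid=14, new_grid=28)
print(new_pe.shape)   # torch.Size([1, 785, 768])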
Modern VLMs (LLaVA-HD, LLaVA-Next, InternVL2, Qwen2-VL) tile a high-resolution image into sub-images that each fit the encoder’s native resolution, then encode each tile separately plus a global thumbnail. A 1024×1024 image, resized to an 896×896 grid of 448px tiles, gives 4 tiles + 1 global thumbnail = 5 encodings. See Deck 03 for full detail.
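An illustrative tiling routine under those assumptions (resize to the tile grid, then crop; production VLMs also choose the grid to preserve aspect ratio):
import torch
import torch.nn.functional as F

def tile_image(img, tile=448, grid=2):
    # img: [3, H, W] → global thumbnail + grid×grid tiles, each tile×tile pixels
    thumb = F.interpolate(img[None], size=(tile, tile), mode="bilinear")[0]
    full = F.interpolate(img[None], size=(tile * grid, tile * grid), mode="bilinear")[0]
    tiles = [full[:, r*tile:(r+1)*tile, c*tile:(c+1)*tile]
             for r in range(grid) for c in range(grid)]
    return [thumb] + tiles   # 1 + grid² encoder passes

views = tile_image(torch.randn(3, 1024, 1024))
print(len(views), views[0].shape)   # 5 views of [3, 448, 448]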
NaViT packs variable-resolution images into a single sequence without padding, using fractional position embeddings and attention masking so that packed images do not attend to one another. This eliminates the resolution mismatch between training and evaluation and allows mixed-aspect-ratio batching. Not yet mainstream in open models but used in Gemini’s image encoder.
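A toy sketch of the packing idea, ignoring position embeddings (the pack_sequences helper is hypothetical; real NaViT also uses factorized fractional position embeddings and token dropping):
import torch

def pack_sequences(token_lists):
    # token_lists: per-image patch tokens of shape [n_i, d]; pack them into one sequence
    packed = torch.cat(token_lists, dim=0)
    n = packed.shape[0]
    attn_mask = torch.zeros(n, n, dtype=torch.bool)
    start = 0
    for t in token_lists:
        end = start + t.shape[0]
        attn_mask[start:end, start:end] = True   # attend only within the same image
        start = end
    return packed, attn_mask

# a 224×224 image (256 tokens at /14) packed with a 448×224 image (512 tokens)
seq, mask = pack_sequences([torch.randn(256, 1024), torch.randn(512, 1024)])
print(seq.shape, mask.shape)   # [768, 1024] packed tokens, [768, 768] block-diagonal mask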
For a new VLM project in 2025: start with SigLIP-SO400M/14 @ 448px (1024 tokens per image) or DINOv2-g/14 @ 448px depending on whether text alignment or dense spatial features matter more. Avoid /8 patch size unless you have FlashAttention-3 and 80 GB VRAM to spare.
When a ViT is inserted into a VLM the key question is: which tokens do you pass to the language model? The answer differs between architectures and has a large impact on the token budget and the type of visual understanding the model develops.
Most production VLMs keep the ViT frozen during Stage 1 (projection training) and optionally unfreeze it during Stage 2 (full fine-tuning). Unfreezing risks catastrophic forgetting of CLIP’s alignment if fine-tuning data distribution is narrow. EVA-02 and InternViT are specifically designed to be tuned — their MIM pre-training gives them more stable feature representations that survive gradient updates without collapsing.
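A rough sketch of that Stage 1 / Stage 2 freezing logic, with a timm ViT standing in for the actual vision tower and an illustrative MLP projector (module names and dimensions are assumptions):
import torch
import torch.nn as nn
import timm

vision_tower = timm.create_model('vit_large_patch14_clip_224', pretrained=False)  # stand-in ViT
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))

# Stage 1: freeze the ViT, train only the projector
for p in vision_tower.parameters():
    p.requires_grad_(False)
vision_tower.eval()

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2 (optional): unfreeze the last few ViT blocks with a much smaller learning rate
for p in vision_tower.blocks[-4:].parameters():   # assumes a timm-style `blocks` list
    p.requires_grad_(True)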
The main token-compression options:
- Average pooling: simple but loses spatial structure.
- Pixel shuffle: groups P×P spatially adjacent tokens into one token of dimension P²·C (used in InternVL; sketched after this list).
- Q-Former / Perceiver Resampler: cross-attention with N learned queries; compresses strongly but risks dropping fine details needed for OCR.
- No compression (Qwen2-VL, LLaVA-Next): more tokens, higher LLM cost, better detail.
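A sketch of the pixel-shuffle option, merging 2×2 neighbouring patch tokens into one token with 4×C channels (a generic sketch assuming a square token grid, not InternVL's exact implementation):
import torch

def pixel_shuffle_tokens(tokens, grid, ratio=2):
    # tokens: [B, grid*grid, C] → [B, (grid/ratio)², C*ratio²]
    B, N, C = tokens.shape
    x = tokens.reshape(B, grid, grid, C)
    x = x.reshape(B, grid // ratio, ratio, grid // ratio, ratio, C)
    x = x.permute(0, 1, 3, 2, 4, 5)   # group each ratio×ratio neighbourhood together
    return x.reshape(B, (grid // ratio) ** 2, C * ratio * ratio)

out = pixel_shuffle_tokens(torch.randn(2, 1024, 1152), grid=32)   # e.g. SigLIP-SO @ 448px
print(out.shape)   # torch.Size([2, 256, 4608]): 4× fewer tokens, 4× wider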
Deck 03 assembles these encoders into full VLMs: Llama 3.2 V, Qwen2-VL, InternVL2, Pixtral, Gemini, and Claude. You’ll see how the projection connector (MLP, Q-Former, Perceiver Resampler) links the image encoder output to the LLM, and how different training curricula (alignment → instruction tuning → RLHF) shape the final model capabilities.