Vision-Language Models Series — Presentation 02

Vision Transformers — ViT, DeiT, Swin, DINOv2 & SAM

From patch embeddings to hierarchical attention: ViT, DeiT, Swin, DINOv2 self-supervised features, Segment Anything, patch size trade-offs, and how each architecture plugs into a VLM image encoder.

ViT DeiT Swin DINOv2 SAM Self-Supervised Patch Embeddings
CNN bias Patch + PE MHSA Distil / SSL Hierarchical VLM Encode
00

Topics We’ll Cover

01

From CNN to ViT — Why Patches Work

Convolutional networks dominated vision from AlexNet (2012) through EfficientNet (2019) because their inductive biases — translation equivariance, local receptive fields, weight sharing — matched natural image statistics well. The cost: global context required stacking many layers, and architectural engineering was hand-crafted.

Transformers process sequences without locality assumptions. ViT (An Image is Worth 16×16 Words, Dosovitskiy et al., 2020) made a simple observation: split the image into non-overlapping patches, flatten and linearly project each patch to a vector, and treat the result as a token sequence. Self-attention then models all pairwise patch interactions in a single layer.

CNN inductive biases

  • Translation equivariance: shift the cat and its feature map shifts with it — the same filters respond wherever it appears
  • Locality: 3×3 kernels can only see a 3×3 neighbourhood per layer
  • Hierarchical: edge → texture → part → object is baked in by pooling
  • Works well with small data; struggles to model long-range dependencies

ViT inductive biases

  • None by default: position encodings must carry all spatial info
  • Global receptive field from layer 1: every patch attends to every other
  • Data hungry: without pre-training, ViT underperforms ResNet on IN-1K
  • At scale: outperforms CNNs and generalises to broader distributions
The critical insight

A 224×224 image split into 16×16 patches yields 196 tokens of dimension 768. This is a short sequence compared to language (GPT-2 used 1,024 tokens), so the quadratic attention cost is tractable. The patch is the unit of meaning: the model learns which patches matter and which don’t via attention weights.

02

ViT Architecture — Patch Embed, Position, Encoder

A ViT-B/16 on 224×224 input has exactly 197 tokens: 196 patch tokens + 1 prepended [CLS] token. The [CLS] token aggregates global information and is used as the image representation for classification or projection to language space.

ViT-B/16 pipeline: 197 tokens → 12 encoder blocks → [CLS] for classification

  • Input: 224×224×3 image = 150,528 pixels
  • Patch embed: 16×16 patches via Conv2d(3, 768, kernel 16, stride 16) → 196 × 768
  • Prepend learnable [CLS] token → 197 × 768, then add position embeddings
  • Encoder block ×12: LayerNorm → MHSA (12 heads, 64-d each) → LayerNorm → MLP (768→3072→768, GELU), with a residual connection around each sub-layer
  • Head: [CLS] token only → LayerNorm + Linear classifier, or projection to d_embed

ViT family sizes (ImageNet-21K pre-training): ViT-T/16 5.7M · ViT-S/16 22M · ViT-B/16 86M · ViT-L/16 307M · ViT-H/14 632M · ViT-G/14 1.84B

Patch size vs tokens at 224×224: /16 → 196 · /14 → 256 · /8 → 784

Position encoding: learnable 1D (original ViT and most released ViT-B/L checkpoints) or fixed 2D sine-cosine (common in later MAE-style models)
Variant    Layers  Heads  Hidden  MLP dim  Params  IN-1K acc
ViT-S/16   12      6      384     1536     22M     79.9%
ViT-B/16   12      12     768     3072     86M     81.8%
ViT-L/16   24      16     1024    4096     307M    84.0%
ViT-H/14   32      16     1280    5120     632M    85.2%
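The front end of this pipeline is small enough to write out. A minimal sketch of the patchify → [CLS] → position-embedding steps, assuming ViT-B/16 dimensions (module and variable names are illustrative, not from any particular codebase):

import torch
import torch.nn as nn

class ViTFrontEnd(nn.Module):
    """Patchify, prepend [CLS], add position embeddings (ViT-B/16 sizes)."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        num_patches = (img_size // patch) ** 2               # 14 × 14 = 196
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D PE

    def forward(self, x):                                    # x: [B, 3, 224, 224]
        x = self.proj(x).flatten(2).transpose(1, 2)          # [B, 196, 768]
        cls = self.cls.expand(x.shape[0], -1, -1)            # [B, 1, 768]
        x = torch.cat([cls, x], dim=1)                       # [B, 197, 768]
        return x + self.pos                                  # ready for the 12 encoder blocks

tokens = ViTFrontEnd()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768])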
03

DeiT — Data-Efficient Training

DeiT (Data-efficient Image Transformers, Touvron et al., Facebook 2020) showed that ViT can match ResNets on IN-1K without extra data, purely through improved training recipes. The key innovations: stronger augmentation, mixup, cutmix, label smoothing, stochastic depth, and a novel distillation token.

Training recipe improvements

  • RandAugment (9 ops, magnitude 9): rotation, colour jitter, shear
  • Mixup α=0.8 + CutMix α=1.0: convex interpolation of inputs and labels
  • Label smoothing ε=0.1
  • Stochastic depth (drop path) p=0.1: randomly drops residual branches, acts as ensemble
  • Repeated augmentation: each image seen ×3 per epoch with different augs
  • 300 epochs on IN-1K only → DeiT-B: 81.8% top-1

Distillation token

The distilled variant of DeiT-B adds a second learned token alongside [CLS]. This distillation token is trained to mimic a CNN teacher (RegNetY-16GF, 82.9% top-1) via a soft cross-entropy on the teacher's logits, sketched after the list below. At inference the outputs of the [CLS] and distillation heads are averaged.

  • DeiT-B with distillation: 83.1% top-1 (matches ViT-L trained on JFT-300M)
  • Hard distillation (class label): works but less stable than soft
  • CNN teacher outperforms ViT teacher at this scale
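A rough sketch of the two-token objective described above. The temperature, the equal weighting, and the function names are illustrative assumptions, not the exact DeiT hyper-parameters:

import torch
import torch.nn.functional as F

def deit_distillation_loss(cls_logits, dist_logits, teacher_logits, labels,
                           tau=3.0, soft=True):
    """[CLS] head learns from the ground-truth label; the distillation head
    learns from the CNN teacher (soft KL or hard pseudo-label)."""
    ce = F.cross_entropy(cls_logits, labels)
    if soft:
        # soft distillation: KL between tempered student / teacher distributions
        kd = F.kl_div(F.log_softmax(dist_logits / tau, dim=-1),
                      F.softmax(teacher_logits / tau, dim=-1),
                      reduction="batchmean") * tau * tau
    else:
        # hard distillation: the teacher's argmax acts as a pseudo-label
        kd = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * ce + 0.5 * kd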
Why DeiT matters for VLMs

DeiT proved that the ViT architecture is not intrinsically data-hungry — it was the training recipe that was inadequate. All subsequent image encoders (DINOv2, EVA-CLIP, SigLIP) use DeiT-style augmentation recipes. The distillation token idea also influenced DINOv2’s self-distillation design.

DeiT-B via timm (Python)
import timm

# DeiT-B without distillation head (for embedding, not classification)
model = timm.create_model(
    'deit_base_patch16_224',
    pretrained=True,
    num_classes=0   # remove head → returns [B, 768] CLS embedding
)

# DeiT-B with distillation (deit_base_distilled_patch16_224)
# features = model.forward_features(x) → [B, 197+1, 768]  (extra distill token)
04

Swin — Hierarchical Attention

Swin Transformer (Liu et al., Microsoft 2021) reintroduced hierarchy and local windows to the Transformer, recovering CNN-like inductive biases while keeping attention-based feature learning. It became the dominant backbone for dense prediction tasks (detection, segmentation) where multi-scale features are essential.

Swin-B on 224×224: 4 stages, 7×7 windows, alternating W-MSA / SW-MSA

  • Stage 1: patch embed 4×4 → 56×56 tokens, C=128, 2 layers
  • Stage 2: patch merge (2× downsample) → 28×28 tokens, C=256, 2 layers
  • Stage 3: patch merge → 14×14 tokens, C=512, 18 layers
  • Stage 4: patch merge → 7×7 tokens, C=1024, 2 layers
  • Head: global average pooling + MLP

W-MSA vs SW-MSA (alternating layers): W-MSA attends within non-overlapping 7×7 windows → O(n) attention cost, but no cross-window connections. SW-MSA shifts the windows by (3, 3) so attention crosses the previous layer's window boundaries, restoring a global receptive field over depth. Relative position bias: a 2D learnable bias added to the attention logits instead of absolute position encodings.

Swin-B: 88M params, 83.5% IN-1K. Swin-L: 197M, 84.9%. SwinV2-G: 3B, 90.2% (IN-22K → 1K).
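A minimal sketch of the window partitioning and the cyclic shift behind W-MSA / SW-MSA, assuming a [B, H, W, C] feature map (names and the Stage-1 sizes are illustrative):

import torch

def window_partition(x, window_size=7):
    """Split a [B, H, W, C] feature map into non-overlapping windows
    → [B · num_windows, window_size², C] for within-window attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 56, 56, 128)                 # Stage-1 Swin-B feature map
wmsa_tokens = window_partition(x)               # [64, 49, 128] → W-MSA within each window

# SW-MSA: cyclically shift the map by (3, 3) before partitioning so the new windows
# straddle the previous layer's window boundaries (the shift is reversed afterwards)
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
swmsa_tokens = window_partition(shifted)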
Why Swin matters for dense tasks

The hierarchical feature maps (C4, C8, C16, C32 at 1/4, 1/8, 1/16, 1/32 resolution) plug directly into the FPN necks used by Mask R-CNN, Cascade Mask R-CNN, and SegFormer. Swin-L as the backbone for HTC++ achieved 58.7 box AP on COCO, surpassing EfficientDet and ResNet-based detectors. For VLM document understanding (LayoutLMv3, UDOP), the hierarchical tokens from Swin carry spatial layout information that a flat ViT cannot represent at the same resolution budget.

05

DINO / DINOv2 — Self-Supervised Vision Features

DINO (Self-DIstillation with NO labels, Caron et al., Meta 2021) showed that self-supervised ViT features, without any label supervision, learn object-level semantic segmentation in their attention maps. DINOv2 (Oquab et al., Meta 2023) scaled this to ViT-g with a curated 142M image dataset (LVD-142M) and produced the best general-purpose vision features available.

DINO self-distillation

The student network sees all augmented crops (global and local); the teacher network sees only the global views. Teacher weights are an exponential moving average (EMA) of the student weights — a momentum encoder through which no gradients flow. Loss: cross-entropy between the student and teacher softmax outputs, with centering and sharpening of the teacher output to prevent collapse (sketched after the list below).

  • Emergent property: [CLS] attention maps ≈ semantic segmentation masks without labels
  • ViT-S/16 DINO: kNN IN-1K 74.5%, linear probe 77.0%
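A sketch of the two moving parts described above, under simplified assumptions (a single view pair, illustrative temperatures and momentum):

import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher and student distributions.
    DINO sums this over all (student view, teacher global view) pairs."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1)   # centering + sharpening
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """Teacher = exponential moving average of the student (momentum encoder)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps.detach(), alpha=1 - m)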

DINOv2 improvements

Combines DINO (self-distillation), iBOT (masked image modelling — predict masked patch tokens), and SwAV-style online cluster assignments. Trained on LVD-142M: 142M images curated from a pool of 1.2B web images via SSCD-based deduplication and retrieval against curated seed datasets such as ImageNet-22K.

  • ViT-g/14 DINOv2: linear probe IN-1K 86.5%, depth estimation & semantic seg SOTA
  • Frozen features outperform supervised ViT-L on ADE20K, NYUd, SUN-RGBD
  • Weights: facebook/dinov2-giant on HuggingFace
DINOv2 feature extraction
from transformers import AutoModel
import torch

dinov2 = AutoModel.from_pretrained("facebook/dinov2-giant")

# x: [B, 3, 518, 518] pixel values (native resolution; 224×224 also works)
x = torch.randn(2, 3, 518, 518)
outputs = dinov2(pixel_values=x)
cls_feat    = outputs.last_hidden_state[:, 0]          # [B, 1536] global [CLS] feature
patch_feats = outputs.last_hidden_state[:, 1:]         # [B, 1369, 1536] (37×37 grid @ 518px)

# For dense tasks: reshape patch tokens back to a spatial grid
B = x.shape[0]
h = w = 518 // 14   # = 37
feats_2d = patch_feats.reshape(B, h, w, -1).permute(0, 3, 1, 2)   # [B, 1536, 37, 37]
DINOv2 vs CLIP as a VLM backbone

DINOv2 patch tokens carry dense spatial semantic information; CLIP patch tokens are optimised for global image-text matching. For tasks requiring pixel-level understanding (depth, segmentation, OCR), DINOv2 wins. For tasks requiring open-vocabulary text alignment (classification, retrieval, VQA with novel categories), CLIP or SigLIP wins. Some recent VLMs (e.g. InternVL) concatenate both.

06

SAM — Segment Anything

SAM (Segment Anything Model, Kirillov et al., Meta 2023) is a promptable segmentation system trained on SA-1B: 11 million images with over 1 billion automatically-generated masks. The image encoder is a ViT-H/16 with windowed attention plus four global attention layers — one of the highest-capacity image encoders publicly released at the time.

Architecture details

  • Image Encoder: ViT-H (636M params) → image embedding of 64×64 × 256
  • Prompt Encoder: points / boxes / masks → sparse + dense prompt tokens
  • Mask Decoder: lightweight 2-layer transformer → 3 candidate masks (multiple granularities) + predicted IoU scores
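Promptable usage looks like this with the official segment-anything package, assuming the ViT-H checkpoint has been downloaded; the image and point coordinates here are arbitrary stand-ins:

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H variant and wrap it in the promptable predictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)      # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)                            # runs the ViT-H encoder once per image

# One positive point prompt → three candidate masks with predicted IoU scores
masks, iou_scores, low_res_logits = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, iou_scores)                        # (3, 768, 1024), three IoU estimates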

SAM 2 (2024) — video extension

SAM 2 adds a memory bank (streaming memory attention) to propagate masks across video frames. The image encoder switches to Hiera (hierarchical ViT) at 4 scales for efficiency. SAM 2 processes a 1024×1024 frame in ~8ms on an A100 vs ~51ms for SAM, enabling real-time interactive segmentation. Model weights are Apache-2.0 licensed.

07

Patch Size & Resolution Trade-Offs

Patch size is the primary dial controlling the token count, and hence the memory & compute budget, of any ViT-based system. The relationship: for an H×W image with patch size P, the number of tokens is (H/P)×(W/P). Self-attention cost is O(n²) in the token count, so halving the patch size quadruples the number of tokens and raises attention FLOPs by roughly 16×.

Patch  Tokens @ 224px  Tokens @ 448px  Tokens @ 896px  Notes
/32    49              196             784             Very coarse; only viable at 224px for global tasks
/16    196             784             3136            Standard ViT-B/L; good balance
/14    256             1024            4096            EVA-CLIP, DINOv2, SigLIP-SO; finer spatial resolution
/8     784             3136            12544           Dense tasks; expensive; used in DINO-small pre-training
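A throwaway sanity check of the table and the quadratic-attention claim:

def vit_tokens(side, patch):
    """Token count for a square side×side image with the given patch size."""
    return (side // patch) ** 2

for p in (32, 16, 14, 8):
    n = vit_tokens(224, p)
    # relative self-attention cost vs /16, assuming cost ∝ n²
    print(f"/{p}: {n} tokens @224px, ~{(n / 196) ** 2:.2f}× the attention FLOPs of /16")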

Resolution interpolation

Position embeddings are trained at a fixed resolution. To run at a higher resolution, the 2D position-embedding grid is resized with bicubic interpolation. This works reasonably well up to about 2× the training resolution; beyond that, fine-tuning at the target resolution is strongly recommended.
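A sketch of that interpolation for a learnable 1D grid with a [CLS] slot. The helper name is ours; timm and transformers ship their own equivalents:

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=28):
    """Resize a learned ViT position-embedding grid to a new resolution.
    pos_embed: [1, 1 + old_grid², C], with the [CLS] embedding at index 0."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    c = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)
    return torch.cat([cls_pe, patch_pe], dim=1)           # [1, 1 + new_grid², C]

pe_448 = interpolate_pos_embed(torch.randn(1, 197, 768))  # 224px grid → 448px grid
print(pe_448.shape)                                       # torch.Size([1, 785, 768])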

AnyRes / tiling (VLM context)

Modern VLMs (LLaVA-HD, LLaVA-Next, InternVL2, Qwen2-VL) tile a high-resolution image into sub-images that each fit the encoder’s native resolution, then encode each tile separately plus a global thumbnail. A 1024×1024 image with 448px tiles (after resizing to the nearest tile grid) gives 4 tiles + 1 global view = 5 encodings. See Deck 03 for full detail.
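A simplified sketch of the tiling scheme. Real implementations choose a best-fit grid and pad or resize first; the function name and fixed-multiple assumption are ours:

import torch
import torch.nn.functional as F

def anyres_views(img, tile=448):
    """Split an image into non-overlapping tiles plus a global thumbnail.
    img: [3, H, W] with H and W assumed to be multiples of `tile`."""
    c, h, w = img.shape
    tiles = (img.unfold(1, tile, tile).unfold(2, tile, tile)   # [3, nh, nw, tile, tile]
                .permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile))
    thumb = F.interpolate(img[None], size=(tile, tile), mode="bicubic",
                          align_corners=False)[0]              # global low-res view
    return list(tiles) + [thumb]

views = anyres_views(torch.randn(3, 896, 896))   # 896×896 input → 4 tiles + 1 thumbnail
print(len(views), views[0].shape)                # 5 torch.Size([3, 448, 448])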

NaViT — native resolution (Google, 2023)

NaViT packs variable-resolution images into a sequence without padding, using fractional position encodings and sequence-level masking. This eliminates the resolution mismatch at train/eval time and allows mixed-aspect-ratio batching. Not yet mainstream in open models but used in Gemini’s image encoder.

Practical recommendation

For a new VLM project in 2025: start with SigLIP-SO400M/14 @ 448px (1024 tokens per image) or DINOv2-g/14 @ 448px depending on whether text alignment or dense spatial features matter more. Avoid /8 patch size unless you have FlashAttention-3 and 80 GB VRAM to spare.

08

ViT as a VLM Image Encoder

When a ViT is inserted into a VLM the key question is: which tokens do you pass to the language model? The answer differs between architectures and has a large impact on the token budget and the type of visual understanding the model develops.

  • LLaVA-1.5: CLIP ViT-L/14 @ 336px → 576 patch tokens → 2-layer MLP projection → prepended to the LLM input ([CLS] discarded)
  • InternVL 2: InternViT-6B (SigLIP-style) → patch tokens → 2× pixel shuffle (quarters the token count) → MLP projection → LLM
  • Qwen2-VL: ViT at native resolution via 2D RoPE → 2D-RoPE patch tokens → MLP projection → merged into the LLM (no fixed token count)
  • BLIP-2: CLIP ViT-L or ViT-g → Q-Former (32 learned queries) → only 32 tokens reach the LLM (maximum compression, some information loss)

Frozen vs tuned encoder

Most production VLMs keep the ViT frozen during Stage 1 (projection training) and optionally unfreeze it during Stage 2 (full fine-tuning). Unfreezing risks catastrophic forgetting of CLIP’s alignment if the fine-tuning data distribution is narrow. EVA-02 and InternViT are specifically designed to be tuned — their MIM pre-training gives them more stable feature representations that survive gradient updates without collapsing.
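In code, the two stages reduce to toggling requires_grad and giving the encoder a much smaller learning rate once it is unfrozen. A sketch with stand-in modules and illustrative learning rates, not any specific VLM's training script:

import torch
import torch.nn as nn

vision_encoder = nn.Linear(768, 768)    # stand-in for the ViT (hypothetical)
projector = nn.Linear(768, 4096)        # stand-in for the MLP connector

# Stage 1: projection training — ViT frozen
for p in vision_encoder.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2: full fine-tuning — unfreeze the ViT at a much lower learning rate
for p in vision_encoder.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW([
    {"params": vision_encoder.parameters(), "lr": 2e-6},
    {"params": projector.parameters(),      "lr": 2e-5},
])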

Token compression strategies

  • Average pooling: simple but loses spatial structure.
  • Pixel shuffle: groups r×r spatially adjacent tokens into one token with r²·C channels (used in InternVL; sketched below).
  • Q-Former / Perceiver Resampler: cross-attention with N learned queries — strong compression, but risks dropping the fine details needed for OCR.
  • No compression (Qwen2-VL, LLaVA-Next): more tokens and higher LLM cost, but better detail.
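A sketch of the pixel-shuffle merge for r = 2, assuming a square token grid. This mirrors the InternVL idea but is not their exact code; the function name is ours:

import torch

def pixel_shuffle_tokens(x, r=2):
    """Merge r×r neighbouring patch tokens into one token with r²·C channels.
    x: [B, H*W, C] patch tokens laid out on a square H×W grid."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h, w, c)
    x = x.reshape(b, h // r, r, w // r, r, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (h // r) * (w // r), r * r * c)    # [B, N/r², r²·C]

tokens = torch.randn(1, 1024, 768)            # 32×32 grid from a 448px /14 encoder
print(pixel_shuffle_tokens(tokens).shape)     # torch.Size([1, 256, 3072])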

09

What to Take Away

Where to next

Deck 03 assembles these encoders into full VLMs: Llama 3.2 V, Qwen2-VL, InternVL2, Pixtral, Gemini, and Claude. You’ll see how the projection connector (MLP, Q-Former, Perceiver Resampler) links the image encoder output to the LLM, and how different training curricula (alignment → instruction tuning → RLHF) shape the final model capabilities.