Vision-Language Models Series — Presentation 02

Vision Transformers — ViT, DeiT, Swin, DINOv2 & SAM

From patch embeddings to hierarchical attention: ViT, DeiT, Swin, DINOv2 self-supervised features, Segment Anything, patch size trade-offs, and how each architecture plugs into a VLM image encoder.

ViT DeiT Swin DINOv2 SAM Self-Supervised Patch Embeddings
CNN bias Patch + PE MHSA Distil / SSL Hierarchical VLM Encode
00

Topics We’ll Cover

01

From CNN to ViT — Why Patches Work

Convolutional networks dominated vision from AlexNet (2012) through EfficientNet (2019) because their inductive biases — translation equivariance, local receptive fields, weight sharing — matched natural image statistics well. The cost: global context required stacking many layers, and architectural engineering was hand-crafted.

Transformers process sequences without locality assumptions. ViT (An Image is Worth 16×16 Words, Dosovitskiy et al., 2020) made a simple observation: split the image into non-overlapping patches, flatten and linearly project each patch to a vector, and treat the result as a token sequence. Self-attention then models all pairwise patch interactions in a single layer.

CNN inductive biases

  • Translation equivariance: shift the cat and its feature map shifts with it — the same filters respond wherever it appears
  • Locality: 3×3 kernels can only see a 3×3 neighbourhood per layer
  • Hierarchical: edge → texture → part → object is baked in by pooling
  • Works well with small data; struggles to model long-range dependencies

ViT inductive biases

  • None by default: position encodings must carry all spatial info
  • Global receptive field from layer 1: every patch attends to every other
  • Data hungry: without pre-training, ViT underperforms ResNet on IN-1K
  • At scale: outperforms CNNs and generalises to broader distributions
The critical insight

A 224×224 image split into 16×16 patches yields 196 tokens of dimension 768. This is a short sequence compared to language (GPT-2 used 1,024 tokens), so the quadratic attention cost is tractable. The patch is the unit of meaning: the model learns which patches matter and which don’t via attention weights.

02

ViT Architecture — Patch Embed, Position, Encoder

A ViT-B/16 on 224×224 input has exactly 197 tokens: 196 patch tokens + 1 prepended [CLS] token. The [CLS] token aggregates global information and is used as the image representation for classification or projection to language space.

ViT-B/16 pipeline: 197 tokens → 12 encoder blocks → [CLS] for classification

  • Input: 224×224×3 image = 150,528 pixels
  • Patch embed: 16×16 patches via Conv2d(3, 768, kernel 16, stride 16) → 196 × 768
  • Prepend learnable [CLS] token → 197 × 768, then add position embeddings
  • Encoder block ×12: LayerNorm → MHSA (12 heads, 64-d each) → LayerNorm → MLP (768→3072→768, GELU), with a residual connection around each sub-layer
  • Head: [CLS] token only → LayerNorm + Linear classifier, or projection to d_embed

ViT family sizes (ImageNet-21K pre-training): ViT-T/16 5.7M · ViT-S/16 22M · ViT-B/16 86M · ViT-L/16 307M · ViT-H/14 632M · ViT-G/14 1.84B

Patch size vs tokens at 224×224: /16 → 196 · /14 → 256 · /8 → 784

Position encoding: learnable 1D (original ViT and most released ViT-B/L checkpoints) or fixed 2D sine-cosine (common in later MAE-style models)
Variant    Layers  Heads  Hidden  MLP dim  Params  IN-1K acc
ViT-S/16   12      6      384     1536     22M     79.9%
ViT-B/16   12      12     768     3072     86M     81.8%
ViT-L/16   24      16     1024    4096     307M    84.0%
ViT-H/14   32      16     1280    5120     632M    85.2%
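The front end of this pipeline is small enough to write out. A minimal sketch of the patchify → [CLS] → position-embedding steps, assuming ViT-B/16 dimensions (module and variable names are illustrative, not from any particular codebase):

import torch
import torch.nn as nn

class ViTFrontEnd(nn.Module):
    """Patchify, prepend [CLS], add position embeddings (ViT-B/16 sizes)."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        num_patches = (img_size // patch) ** 2               # 14 × 14 = 196
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D PE

    def forward(self, x):                                    # x: [B, 3, 224, 224]
        x = self.proj(x).flatten(2).transpose(1, 2)          # [B, 196, 768]
        cls = self.cls.expand(x.shape[0], -1, -1)            # [B, 1, 768]
        x = torch.cat([cls, x], dim=1)                       # [B, 197, 768]
        return x + self.pos                                  # ready for the 12 encoder blocks

tokens = ViTFrontEnd()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768])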
03

DeiT — Data-Efficient Training

DeiT (Data-efficient Image Transformers, Touvron et al., Facebook 2020) showed that ViT can match ResNets on IN-1K without extra data, purely through improved training recipes. The key innovations: stronger augmentation, mixup, cutmix, label smoothing, stochastic depth, and a novel distillation token.

Training recipe improvements

  • RandAugment (9 ops, magnitude 9): rotation, colour jitter, shear
  • Mixup α=0.8 + CutMix α=1.0: convex interpolation of inputs and labels
  • Label smoothing ε=0.1
  • Stochastic depth (drop path) p=0.1: randomly drops residual branches, acts as ensemble
  • Repeated augmentation: each image seen ×3 per epoch with different augs
  • 300 epochs on IN-1K only → DeiT-B: 81.8% top-1

Distillation token

The distilled variant of DeiT-B adds a second learned token alongside [CLS]. This distillation token is trained to mimic a CNN teacher (RegNetY-16GF, 82.9% top-1) via a soft cross-entropy on the teacher's logits, sketched after the list below. At inference the outputs of the [CLS] and distillation heads are averaged.

  • DeiT-B with distillation: 83.1% top-1 (matches ViT-L trained on JFT-300M)
  • Hard distillation (class label): works but less stable than soft
  • CNN teacher outperforms ViT teacher at this scale
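A rough sketch of the two-token objective described above. The temperature, the equal weighting, and the function names are illustrative assumptions, not the exact DeiT hyper-parameters:

import torch
import torch.nn.functional as F

def deit_distillation_loss(cls_logits, dist_logits, teacher_logits, labels,
                           tau=3.0, soft=True):
    """[CLS] head learns from the ground-truth label; the distillation head
    learns from the CNN teacher (soft KL or hard pseudo-label)."""
    ce = F.cross_entropy(cls_logits, labels)
    if soft:
        # soft distillation: KL between tempered student / teacher distributions
        kd = F.kl_div(F.log_softmax(dist_logits / tau, dim=-1),
                      F.softmax(teacher_logits / tau, dim=-1),
                      reduction="batchmean") * tau * tau
    else:
        # hard distillation: the teacher's argmax acts as a pseudo-label
        kd = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * ce + 0.5 * kd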
Why DeiT matters for VLMs

DeiT proved that the ViT architecture is not intrinsically data-hungry — it was the training recipe that was inadequate. All subsequent image encoders (DINOv2, EVA-CLIP, SigLIP) use DeiT-style augmentation recipes. The distillation token idea also influenced DINOv2’s self-distillation design.

DeiT-B via timm (Python)
import timm

# DeiT-B without distillation head (for embedding, not classification)
model = timm.create_model(
    'deit_base_patch16_224',
    pretrained=True,
    num_classes=0   # remove head → returns [B, 768] CLS embedding
)

# DeiT-B with distillation (deit_base_distilled_patch16_224)
# features = model.forward_features(x) → [B, 197+1, 768]  (extra distill token)
04

Swin — Hierarchical Attention

Swin Transformer (Liu et al., Microsoft 2021) reintroduced hierarchy and local windows to the Transformer, recovering CNN-like inductive biases while keeping attention-based feature learning. It became the dominant backbone for dense prediction tasks (detection, segmentation) where multi-scale features are essential.

Swin-B on 224×224: 4 stages, 7×7 windows, alternating W-MSA / SW-MSA

  • Stage 1: patch embed 4×4 → 56×56 tokens, C=128, 2 layers
  • Stage 2: patch merge (2× downsample) → 28×28 tokens, C=256, 2 layers
  • Stage 3: patch merge → 14×14 tokens, C=512, 18 layers
  • Stage 4: patch merge → 7×7 tokens, C=1024, 2 layers
  • Head: global average pooling + MLP

W-MSA vs SW-MSA (alternating layers): W-MSA attends within non-overlapping 7×7 windows → O(n) attention cost, but no cross-window connections. SW-MSA shifts the windows by (3, 3) so attention crosses the previous layer's window boundaries, restoring a global receptive field over depth. Relative position bias: a 2D learnable bias added to the attention logits instead of absolute position encodings.

Swin-B: 88M params, 83.5% IN-1K. Swin-L: 197M, 84.9%. SwinV2-G: 3B, 90.2% (IN-22K → 1K).
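A minimal sketch of the window partitioning and the cyclic shift behind W-MSA / SW-MSA, assuming a [B, H, W, C] feature map (names and the Stage-1 sizes are illustrative):

import torch

def window_partition(x, window_size=7):
    """Split a [B, H, W, C] feature map into non-overlapping windows
    → [B · num_windows, window_size², C] for within-window attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 56, 56, 128)                 # Stage-1 Swin-B feature map
wmsa_tokens = window_partition(x)               # [64, 49, 128] → W-MSA within each window

# SW-MSA: cyclically shift the map by (3, 3) before partitioning so the new windows
# straddle the previous layer's window boundaries (the shift is reversed afterwards)
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
swmsa_tokens = window_partition(shifted)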
Why Swin matters for dense tasks

The hierarchical feature maps (C4, C8, C16, C32 at 1/4, 1/8, 1/16, 1/32 resolution) plug directly into the FPN necks used by Mask R-CNN, Cascade Mask R-CNN, and SegFormer. Swin-L as the backbone for HTC++ achieved 58.7 box AP on COCO, surpassing EfficientDet and ResNet-based detectors. For VLM document understanding (LayoutLMv3, UDOP), the hierarchical tokens from Swin carry spatial layout information that a flat ViT cannot represent at the same resolution budget.

05

DINO / DINOv2 — Self-Supervised Vision Features

DINO (Self-DIstillation with NO labels, Caron et al., Meta 2021) showed that self-supervised ViT features, without any label supervision, learn object-level semantic segmentation in their attention maps. DINOv2 (Oquab et al., Meta 2023) scaled this to ViT-g with a curated 142M image dataset (LVD-142M) and produced the best general-purpose vision features available.

DINO self-distillation

The student network sees all augmented crops (global and local); the teacher network sees only the global views. Teacher weights are an exponential moving average (EMA) of the student weights — a momentum encoder through which no gradients flow. Loss: cross-entropy between the student and teacher softmax outputs, with centering and sharpening of the teacher output to prevent collapse (sketched after the list below).

  • Emergent property: [CLS] attention maps ≈ semantic segmentation masks without labels
  • ViT-S/16 DINO: kNN IN-1K 74.5%, linear probe 77.0%
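A sketch of the two moving parts described above, under simplified assumptions (a single view pair, illustrative temperatures and momentum):

import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher and student distributions.
    DINO sums this over all (student view, teacher global view) pairs."""
    t = F.softmax((teacher_out - center) / tau_t, dim=-1)   # centering + sharpening
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    """Teacher = exponential moving average of the student (momentum encoder)."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps.detach(), alpha=1 - m)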

DINOv2 improvements

Combines DINO (self-distillation), iBOT (masked image modelling — predict masked patch tokens), and SwAV-style online cluster assignments. Trained on LVD-142M: 142M images curated from a pool of 1.2B web images via SSCD-based deduplication and retrieval against curated seed datasets such as ImageNet-22K.

  • ViT-g/14 DINOv2: linear probe IN-1K 86.5%, depth estimation & semantic seg SOTA
  • Frozen features outperform supervised ViT-L on ADE20K, NYUd, SUN-RGBD
  • Weights: facebook/dinov2-giant on HuggingFace
DINOv2 feature extraction
from transformers import AutoModel
import torch

dinov2 = AutoModel.from_pretrained("facebook/dinov2-giant")

# x: [B, 3, 518, 518] pixel values (native resolution; 224×224 also works)
x = torch.randn(2, 3, 518, 518)
outputs = dinov2(pixel_values=x)
cls_feat    = outputs.last_hidden_state[:, 0]          # [B, 1536] global [CLS] feature
patch_feats = outputs.last_hidden_state[:, 1:]         # [B, 1369, 1536] (37×37 grid @ 518px)

# For dense tasks: reshape patch tokens back to a spatial grid
B = x.shape[0]
h = w = 518 // 14   # = 37
feats_2d = patch_feats.reshape(B, h, w, -1).permute(0, 3, 1, 2)   # [B, 1536, 37, 37]
DINOv2 vs CLIP as a VLM backbone

DINOv2 patch tokens carry dense spatial semantic information; CLIP patch tokens are optimised for global image-text matching. For tasks requiring pixel-level understanding (depth, segmentation, OCR), DINOv2 wins. For tasks requiring open-vocabulary text alignment (classification, retrieval, VQA with novel categories), CLIP or SigLIP wins. Some recent VLMs (e.g. InternVL) concatenate both.

06

SAM — Segment Anything

SAM (Segment Anything Model, Kirillov et al., Meta 2023) is a promptable segmentation system trained on SA-1B: 11 million images with over 1 billion automatically-generated masks. The image encoder is a ViT-H/16 with windowed attention plus four global attention layers — one of the highest-capacity image encoders publicly released at the time.

Architecture details

  • Image Encoder: ViT-H (636M params) → image embedding of 64×64 × 256
  • Prompt Encoder: points / boxes / masks → sparse + dense prompt tokens
  • Mask Decoder: lightweight 2-layer transformer → 3 candidate masks (multiple granularities) + predicted IoU scores
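Promptable usage looks like this with the official segment-anything package, assuming the ViT-H checkpoint has been downloaded; the image and point coordinates here are arbitrary stand-ins:

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H variant and wrap it in the promptable predictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((768, 1024, 3), dtype=np.uint8)      # stand-in for an RGB image (H, W, 3)
predictor.set_image(image)                            # runs the ViT-H encoder once per image

# One positive point prompt → three candidate masks with predicted IoU scores
masks, iou_scores, low_res_logits = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, iou_scores)                        # (3, 768, 1024), three IoU estimates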

SAM 2 (2024) — video extension

SAM 2 adds a memory bank (streaming memory attention) to propagate masks across video frames. The image encoder switches to Hiera (hierarchical ViT) at 4 scales for efficiency. SAM 2 processes a 1024×1024 frame in ~8ms on an A100 vs ~51ms for SAM, enabling real-time interactive segmentation. Model weights are Apache-2.0 licensed.

07

Patch Size & Resolution Trade-Offs

Patch size is the primary dial controlling the token count, and hence the memory & compute budget, of any ViT-based system. The relationship: for an H×W image with patch size P, the number of tokens is (H/P)×(W/P). Self-attention cost is O(n²) in the token count, so halving the patch size quadruples the number of tokens and raises attention FLOPs by roughly 16×.

Patch  Tokens @ 224px  Tokens @ 448px  Tokens @ 896px  Notes
/32    49              196             784             Very coarse; only viable at 224px for global tasks
/16    196             784             3136            Standard ViT-B/L; good balance
/14    256             1024            4096            EVA-CLIP, DINOv2, SigLIP-SO; finer spatial resolution
/8     784             3136            12544           Dense tasks; expensive; used in DINO-small pre-training
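A throwaway sanity check of the table and the quadratic-attention claim:

def vit_tokens(side, patch):
    """Token count for a square side×side image with the given patch size."""
    return (side // patch) ** 2

for p in (32, 16, 14, 8):
    n = vit_tokens(224, p)
    # relative self-attention cost vs /16, assuming cost ∝ n²
    print(f"/{p}: {n} tokens @224px, ~{(n / 196) ** 2:.2f}× the attention FLOPs of /16")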

Resolution interpolation

Position embeddings are trained at a fixed resolution. To run at a higher resolution, the 2D position-embedding grid is resized with bicubic interpolation. This works reasonably well up to about 2× the training resolution; beyond that, fine-tuning at the target resolution is strongly recommended.
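A sketch of that interpolation for a learnable 1D grid with a [CLS] slot. The helper name is ours; timm and transformers ship their own equivalents:

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=28):
    """Resize a learned ViT position-embedding grid to a new resolution.
    pos_embed: [1, 1 + old_grid², C], with the [CLS] embedding at index 0."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    c = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)
    return torch.cat([cls_pe, patch_pe], dim=1)           # [1, 1 + new_grid², C]

pe_448 = interpolate_pos_embed(torch.randn(1, 197, 768))  # 224px grid → 448px grid
print(pe_448.shape)                                       # torch.Size([1, 785, 768])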

AnyRes / tiling (VLM context)

Modern VLMs (LLaVA-HD, LLaVA-Next, InternVL2, Qwen2-VL) tile a high-resolution image into sub-images that each fit the encoder’s native resolution, then encode each tile separately plus a global thumbnail. A 1024×1024 image with 448px tiles (after resizing to the nearest tile grid) gives 4 tiles + 1 global view = 5 encodings. See Deck 03 for full detail.
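A simplified sketch of the tiling scheme. Real implementations choose a best-fit grid and pad or resize first; the function name and fixed-multiple assumption are ours:

import torch
import torch.nn.functional as F

def anyres_views(img, tile=448):
    """Split an image into non-overlapping tiles plus a global thumbnail.
    img: [3, H, W] with H and W assumed to be multiples of `tile`."""
    c, h, w = img.shape
    tiles = (img.unfold(1, tile, tile).unfold(2, tile, tile)   # [3, nh, nw, tile, tile]
                .permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile))
    thumb = F.interpolate(img[None], size=(tile, tile), mode="bicubic",
                          align_corners=False)[0]              # global low-res view
    return list(tiles) + [thumb]

views = anyres_views(torch.randn(3, 896, 896))   # 896×896 input → 4 tiles + 1 thumbnail
print(len(views), views[0].shape)                # 5 torch.Size([3, 448, 448])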

NaViT — native resolution (Google, 2023)

NaViT packs variable-resolution images into a sequence without padding, using fractional position encodings and sequence-level masking. This eliminates the resolution mismatch at train/eval time and allows mixed-aspect-ratio batching. Not yet mainstream in open models but used in Gemini’s image encoder.

Practical recommendation

For a new VLM project in 2025: start with SigLIP-SO400M/14 @ 448px (1024 tokens per image) or DINOv2-g/14 @ 448px depending on whether text alignment or dense spatial features matter more. Avoid /8 patch size unless you have FlashAttention-3 and 80 GB VRAM to spare.

08

ViT as a VLM Image Encoder

When a ViT is inserted into a VLM the key question is: which tokens do you pass to the language model? The answer differs between architectures and has a large impact on the token budget and the type of visual understanding the model develops.

  • LLaVA-1.5: CLIP ViT-L/14 @ 336px → 576 patch tokens → 2-layer MLP projection → prepended to the LLM input ([CLS] discarded)
  • InternVL 2: InternViT-6B (SigLIP-style) → patch tokens → 2× pixel shuffle (quarters the token count) → MLP projection → LLM
  • Qwen2-VL: ViT at native resolution via 2D RoPE → 2D-RoPE patch tokens → MLP projection → merged into the LLM (no fixed token count)
  • BLIP-2: CLIP ViT-L or ViT-g → Q-Former (32 learned queries) → only 32 tokens reach the LLM (maximum compression, some information loss)

Frozen vs tuned encoder

Most production VLMs keep the ViT frozen during Stage 1 (projection training) and optionally unfreeze it during Stage 2 (full fine-tuning). Unfreezing risks catastrophic forgetting of CLIP’s alignment if the fine-tuning data distribution is narrow. EVA-02 and InternViT are specifically designed to be tuned — their MIM pre-training gives them more stable feature representations that survive gradient updates without collapsing.
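In code, the two stages reduce to toggling requires_grad and giving the encoder a much smaller learning rate once it is unfrozen. A sketch with stand-in modules and illustrative learning rates, not any specific VLM's training script:

import torch
import torch.nn as nn

vision_encoder = nn.Linear(768, 768)    # stand-in for the ViT (hypothetical)
projector = nn.Linear(768, 4096)        # stand-in for the MLP connector

# Stage 1: projection training — ViT frozen
for p in vision_encoder.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2: full fine-tuning — unfreeze the ViT at a much lower learning rate
for p in vision_encoder.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW([
    {"params": vision_encoder.parameters(), "lr": 2e-6},
    {"params": projector.parameters(),      "lr": 2e-5},
])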

Token compression strategies

  • Average pooling: simple but loses spatial structure.
  • Pixel shuffle: groups r×r spatially adjacent tokens into one token with r²·C channels (used in InternVL; sketched below).
  • Q-Former / Perceiver Resampler: cross-attention with N learned queries — strong compression, but risks dropping the fine details needed for OCR.
  • No compression (Qwen2-VL, LLaVA-Next): more tokens and higher LLM cost, but better detail.
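A sketch of the pixel-shuffle merge for r = 2, assuming a square token grid. This mirrors the InternVL idea but is not their exact code; the function name is ours:

import torch

def pixel_shuffle_tokens(x, r=2):
    """Merge r×r neighbouring patch tokens into one token with r²·C channels.
    x: [B, H*W, C] patch tokens laid out on a square H×W grid."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h, w, c)
    x = x.reshape(b, h // r, r, w // r, r, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (h // r) * (w // r), r * r * c)    # [B, N/r², r²·C]

tokens = torch.randn(1, 1024, 768)            # 32×32 grid from a 448px /14 encoder
print(pixel_shuffle_tokens(tokens).shape)     # torch.Size([1, 256, 3072])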

09

What to Take Away

Where to next

Deck 03 assembles these encoders into full VLMs: Llama 3.2 V, Qwen2-VL, InternVL2, Pixtral, Gemini, and Claude. You’ll see how the projection connector (MLP, Q-Former, Perceiver Resampler) links the image encoder output to the LLM, and how different training curricula (alignment → instruction tuning → RLHF) shape the final model capabilities.