Architecture patterns, projection connectors, native multimodal vs adapter, AnyRes tiling, token budget, alignment training, multi-image & video extensions, and evaluation on MMMU, MathVista, ChartQA & OCRBench.
As of 2025 there are broadly three tiers of VLM: API-only closed models (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet), open-weight models trained on proprietary data (Pixtral-Large, Qwen2-VL-72B, InternVL2-76B), and fully open models (LLaVA-NeXT, MiniCPM-V 2.6, PaliGemma). The landscape changes monthly; what matters is understanding the architectural patterns.
| Model | LLM backbone | Image encoder | Connector | Max res / tiles | MMMU |
|---|---|---|---|---|---|
| LLaVA-1.5-13B | Vicuna-13B | CLIP ViT-L/14@336 | 2-layer MLP | 336px / 1 | 36.3% |
| Llama 3.2 Vision 90B | Llama 3.1 90B | ViT (cross-attn) | Cross-attention layers | 1120px / 4 | 60.3% |
| Qwen2-VL-72B | Qwen2-72B | ViT 675M (2D-RoPE) | MLP (no fixed token count) | Dynamic / unlimited | 64.5% |
| InternVL2-76B | Llama-3-70B (Hermes-2-Theta) | InternViT-6B | MLP + pixel-shuffle | 448px / 40 tiles | 65.4% |
| Pixtral-Large-124B | Mistral Large 2 | Pixtral-ViT-400M | Linear proj (RoPE-2D) | 1024px / variable | 72.0% |
| Gemini 1.5 Pro | Native multimodal | Native (no separate encoder) | Native | Variable | 62.2% |
| Claude 3.5 Sonnet | Native multimodal | Native | Native | Up to ~8k px | 68.3% |
InternVL2-76B uses InternViT-6B, a 6-billion-parameter image encoder trained with a SigLIP-style contrastive objective and then fine-tuned jointly with the LLM. Most open VLMs use a frozen CLIP ViT-L (307M) with a tiny MLP. The 20× larger encoder, combined with multi-tile input and large-scale instruction data (InternLM2.5 training corpus), pushes MMMU to 65.4%.
The architectural decision that most shapes a VLM’s capabilities is how image tokens are projected into the LLM’s token space. Three patterns dominate:
Q-Former (BLIP-2, InstructBLIP): a Transformer with N learned queries (default: 32) that cross-attends to the frozen image encoder’s output. Only the 32 query tokens are passed to the LLM.
Perceiver Resampler (Flamingo): a stack of transformer layers in which M learned latents (64–256) cross-attend to all image patch tokens; the resampled latents are consumed by gated cross-attention layers inserted between frozen LLM layers.
MLP projection (LLaVA family): a 1–2-layer MLP maps each patch token directly to the LLM’s embedding dimension. All patch tokens (256–2,048 depending on tile count) are prepended to the text sequence.
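A minimal sketch of the MLP-projection pattern in PyTorch; the 1024-dim vision features and 4096-dim LLM embedding space are assumed values for illustration, not any specific model’s configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-1.5-style connector: a 2-layer MLP applied independently to each patch token."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim) from the frozen image encoder
        # returns:      (batch, num_patches, llm_dim), prepended to the text embeddings
        return self.proj(patch_tokens)

# Example: one 336px image at patch size 14 gives a 24x24 = 576-token grid
image_features = torch.randn(1, 576, 1024)
visual_embeds = MLPProjector()(image_features)   # (1, 576, 4096)
```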
The field largely abandoned Q-Former and Perceiver Resampler for new models after LLaVA-1.5 and LLaVA-NeXT showed that a simple 2-layer MLP, paired with a larger, better-trained image encoder and AnyRes tiling, outperforms both on OCR, chart, and document tasks. The LLM context cost of 256–1024 tokens per image is considered acceptable given modern 128k+ context windows.
The architectures above are all adapter VLMs: a separately-trained LLM gets a vision encoder bolted on. Native multimodal models train from scratch with both modalities interleaved, so the LLM weights themselves encode joint image-text understanding rather than learning it post-hoc.
Llama 3.2 Vision (cross-attention adapter): the Llama 3.1 text LLM, pre-trained on 15T tokens, is kept intact; vision is added via cross-attention layers inserted at every 4th LLM layer. Image tokens never enter the self-attention stack directly; they are injected via cross-attention (a simplified sketch follows after these examples).
Gemini 1.5 (native multimodal): trained end-to-end on interleaved image, video, audio, and text from the beginning. No separate image encoder or projection step; the backbone itself (a Mixture-of-Experts architecture) handles all modalities via a shared token sequence.
Claude 3.5 Sonnet: Anthropic has not published detailed architecture papers for Claude 3.x; only the Constitutional AI + RLHF training regime is documented. From its benchmark profile it is likely native multimodal in the same sense as Gemini, trained with interleaved vision-text from early stages. Claude 3.5 Sonnet achieves 68.3% on MMMU and strong performance on charts and screenshots, which suggests high-resolution tile-based processing similar to Gemini’s approach.
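Returning to the cross-attention adapter pattern described above for Llama 3.2 Vision, here is a simplified sketch of one inserted block; the dimensions and gating are illustrative (closer to Flamingo’s published design) rather than taken from released weights.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted between frozen self-attention layers: text hidden states attend
    to the image tokens; a learned gate starts at zero so the text model's
    behaviour is unchanged at initialisation."""
    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, dim); image_tokens: (batch, img_len, dim)
        attended, _ = self.cross_attn(text_hidden, image_tokens, image_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```

Starting the gate at zero is what makes it safe to bolt vision onto a finished LLM: the text model behaves exactly as before until the new layers learn something useful.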
CLIP was trained at 224px. Most high-resolution text and document understanding requires at least 448–1024px. AnyRes (introduced in LLaVA-HD / LLaVA-NeXT) solves this by tiling the image into crops that fit the encoder’s native resolution, encoding each tile independently, and concatenating the tokens alongside a low-resolution global thumbnail.
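A rough sketch of the tiling step, assuming a 336px encoder and a small hand-picked set of candidate grids; real implementations enumerate more grid shapes and pad rather than stretch.

```python
from PIL import Image

# Candidate (cols, rows) grids; real AnyRes implementations enumerate more shapes.
GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]
TILE = 336  # the encoder's native resolution

def pick_grid(width: int, height: int) -> tuple[int, int]:
    """Choose the grid whose aspect ratio best matches the input image."""
    target = width / height
    return min(GRIDS, key=lambda g: abs((g[0] / g[1]) - target))

def anyres_tiles(img: Image.Image) -> list[Image.Image]:
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    # Low-resolution global thumbnail prepended to the per-tile crops
    return [img.resize((TILE, TILE))] + tiles
```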
Qwen2-VL takes this further with a dynamic resolution approach: it discretises the image into 2D patch tokens with 2D-RoPE positional encoding, treating any resolution as valid input without a fixed tile grid. The number of visual tokens per image varies from 256 to ~16,384 depending on input resolution. The LLM receives interleaved <|vision_start|> and <|vision_end|> tokens wrapping the visual tokens.
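A back-of-the-envelope token count under such a dynamic-resolution scheme; the 28px effective stride (a 14px patch followed by a 2×2 merge) and the min/max clamp are assumptions for illustration, not the exact released configuration.

```python
import math

def visual_token_count(width: int, height: int, stride: int = 28,
                       min_tokens: int = 256, max_tokens: int = 16_384) -> int:
    """Tokens grow with image area: one token per stride x stride pixel block."""
    tokens = math.ceil(width / stride) * math.ceil(height / stride)
    return max(min_tokens, min(tokens, max_tokens))

print(visual_token_count(1280, 720))   # ~1,196 tokens for a 720p frame
```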
Visual tokens are the single biggest driver of inference cost in VLMs. Understanding the token budget lets you reason about latency, context length limits, and what fidelity you actually need for a given task.
| Model | Tokens/image (typical) | Max tokens/image | Compression method |
|---|---|---|---|
| BLIP-2 / InstructBLIP | 32 | 32 | Q-Former (32 learned queries) |
| LLaVA-1.5 (336px) | 576 | 576 | None (all patch tokens) |
| Llama 3.2 Vision 11B | 1680 | 6720 | 4 tiles × 420 + global |
| InternVL2-76B (1 tile) | 256 | 10,240 | Pixel shuffle 2×2 merge (1024→256 per tile) |
| Qwen2-VL | 256–16,384 | ~16,384 | None (native res, 2D-RoPE) |
| Pixtral-12B | 1024 per tile | ~16,384 | None (linear proj) |
| Gemini 1.5 Flash | 258 (low res) | 3,072 | Internal tokeniser |
Low budget (roughly 32–576 tokens/image): fast, cheap, suitable for scene classification, coarse object detection, and general Q&A about natural images. Fails at reading small text, chart axis labels, table cell contents, handwriting, and code screenshots.
High budget (roughly 1,024+ tokens/image): accurate OCR, detailed chart reading, full-page document parsing. Necessary for DocVQA (> 90% acc requires ~1024+ tokens), TableVQA, and infographic Q&A. Cost: roughly 4× the LLM forward-pass time and 4× the KV-cache memory of the low-budget regime.
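To make the KV-cache cost concrete, here is a rough estimate for the visual tokens alone, assuming a hypothetical 70B-class decoder with 80 layers, grouped-query attention with 8 KV heads of dimension 128, and an fp16 cache.

```python
def kv_cache_bytes(num_tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Each layer stores K and V: num_tokens x kv_heads x head_dim values apiece
    return num_tokens * layers * kv_heads * head_dim * 2 * bytes_per_value

print(kv_cache_bytes(576) / 2**20)     # ~180 MiB for a single LLaVA-1.5-style image
print(kv_cache_bytes(2_304) / 2**20)   # ~720 MiB for four high-resolution tiles
```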
Pixel shuffle (InternVL, MiniCPM): spatially group 2×2 adjacent patch tokens into one token of 4× the channel depth. Information-theoretically lossless for spatial patterns; preserves layout better than attention pooling. 2D average pooling: simple, fast, but blurs fine details. Neither should be used for OCR tasks — at single-tile 448px resolution you need all 1024 raw patch tokens to reliably read 8pt font.
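A minimal sketch of the pixel-shuffle merge on a square grid of patch tokens; shapes are assumed, and real implementations (e.g. InternVL) follow the rearrangement with an MLP to bring the channel depth back down.

```python
import torch

def pixel_shuffle_merge(tokens: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Merge factor x factor neighbouring patch tokens into one token with
    factor^2 x the channel depth. Pure rearrangement: no information is lost."""
    b, n, c = tokens.shape
    side = int(n ** 0.5)                       # assume a square patch grid
    x = tokens.view(b, side, side, c)
    x = x.view(b, side // factor, factor, side // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (side // factor) ** 2, factor * factor * c)

tokens = torch.randn(1, 1024, 1024)            # 32x32 grid from a 448px tile
merged = pixel_shuffle_merge(tokens)           # (1, 256, 4096)
```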
Most adapter VLMs are trained in two or three stages. Getting the stage boundaries wrong (e.g. fine-tuning the image encoder before the projection has converged) is one of the most common causes of poor downstream performance.
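One way to express those stage boundaries is as parameter-group freezing; the module names here (vision_encoder, projector, llm) are placeholders, not any particular codebase’s API.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    llm: nn.Module, stage: int) -> None:
    if stage == 1:
        # Stage 1 (alignment): only the projector learns; encoder and LLM stay frozen
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, False)
    elif stage == 2:
        # Stage 2 (instruction tuning): unfreeze the LLM once the projection has converged
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, True)
    else:
        # Optional stage 3: unfreeze the image encoder last, typically at a lower learning rate
        set_trainable(vision_encoder, True)
        set_trainable(projector, True)
        set_trainable(llm, True)
```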
LLaVA-1.5 (2023) used 665K alignment + 665K instruction data. InternVL2 (2024) uses >20M alignment samples and >4M instruction samples. The 30× data scale — not the 20× larger image encoder — is arguably the bigger factor in the benchmark gap. Data sourcing (ShareGPT4V, InternData, MMInstruct) matters as much as architecture at this stage of the field.
Single-image VLMs generalise poorly to video and multi-image tasks. The three primary approaches to handling temporal and multi-frame input are frame sampling, temporal token compression (via pooling, a temporal Q-Former, or recurrence), and native video training.
Frame sampling: uniformly sample N frames from the video, encode each frame independently with the image encoder, concatenate all visual tokens, and pass the full sequence (visual + text) to the LLM (sketched after these three approaches).
Temporal compression: encode each frame, then apply a temporal Q-Former or pooling across the time axis to compress N frames into M < N summary tokens. This preserves motion information while constraining the token count.
Native video training: 3D position encoding (frame index plus 2D spatial position) lets the model reason natively about frame differences. Trained with interleaved video-text data, including dense temporal captions.
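A minimal sketch of the frame-sampling approach, with the encoder and projector left abstract; the 8-frame default is an arbitrary illustration.

```python
import torch

def sample_frame_indices(num_frames: int, n: int = 8) -> list[int]:
    """Pick n frame indices spread uniformly across the clip."""
    return torch.linspace(0, num_frames - 1, n).round().long().tolist()

def encode_video(frames: list, image_encoder, projector, n: int = 8) -> torch.Tensor:
    """Encode n uniformly sampled frames independently and concatenate their tokens."""
    idx = sample_frame_indices(len(frames), n)
    per_frame = [projector(image_encoder(frames[i])) for i in idx]
    # The visual sequence grows linearly with n: 8 frames x 576 tokens = 4,608 tokens
    return torch.cat(per_frame, dim=0)
```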
Supporting multiple images in one context requires (1) enclosing each image in sentinel tokens the text tokeniser understands (e.g. <image_1>), and (2) cross-image attention in the LLM, or at minimum positional distinctness between image token groups. LLaVA-Interleave achieves this by prepending per-image index tokens before each image sequence in the prompt. Flamingo’s Perceiver Resampler naturally supports this via interleaved cross-attention insertion.
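A sketch of how a multi-image prompt might be assembled with per-image sentinel tokens; the token strings are illustrative, not any specific model’s vocabulary, and the inference stack is assumed to substitute each placeholder with that image’s visual tokens.

```python
def build_multi_image_prompt(question: str, num_images: int) -> str:
    """Interleave numbered image placeholders before the question; the inference
    stack later replaces each placeholder with that image's visual tokens."""
    parts = [f"<image_{i + 1}>" for i in range(num_images)]
    return "\n".join(parts) + "\n" + question

print(build_multi_image_prompt("Which of these two charts shows higher Q3 revenue?", 2))
# <image_1>
# <image_2>
# Which of these two charts shows higher Q3 revenue?
```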
VLM evaluation benchmarks test very different capability facets. A model that tops MMMU may be mediocre at OCRBench, so it’s essential to pick benchmarks appropriate to your task:
| Benchmark | Task | Scale | What it tests | SOTA (2025 Q1) |
|---|---|---|---|---|
| MMMU | Multi-discipline MCQ | 11,550 Qs, 30 subjects | Expert-level domain knowledge (science, art, medicine) from college textbooks; images include diagrams, charts, photos | GPT-4o 69.1%; Claude 3.5 Sonnet 68.3% |
| MathVista | Visual math reasoning | 6,141 Qs | Geometry, function plots, statistics — requires both vision and multi-step reasoning | GPT-4V 58.1%; Claude 3.5 Sonnet 67.7% |
| ChartQA | Chart Q&A | 9,866 Qs | Reads bar/line/pie charts; relaxed numerical accuracy (±5%) | Gemini 1.5 Pro 87.2%; InternVL2-76B 88.4% |
| OCRBench | OCR & text understanding | 1,000 Qs | Scene text, handwriting, formula recognition, document structure — exact string match | InternVL2-76B 825/1000; Qwen2-VL-72B 866/1000 |
| DocVQA | Document Q&A | 5,349 Qs | Reads typed/printed documents; ANLS metric (handles partial matches) | InternVL2-76B 94.1%; Qwen2-VL-72B 96.5% |
MMMU test-set contamination is a real concern: many training datasets now include MMMU-style question-answer pairs derived from college textbooks. The MMMU team maintains MMMU-Pro, a harder and less contamination-prone variant (filtered questions, more answer options, and a vision-only input setting), tracked on leaderboards such as OpenCompass. Similarly, ChartQA’s training split is small enough that fine-tuning on it directly inflates scores without generalising to real-world charts. Always cross-validate with a held-out private evaluation set if deploying on specialised chart data.
Deck 04 specialises on document AI: Donut (OCR-free end-to-end parsing), ColPali (vision-based PDF retrieval), LayoutLM and UDOP (layout-aware models), and production pipeline patterns for PDF, table, and chart processing.