Vision-Language Models Series — Presentation 03

Modern VLMs — Llama 3.2, Qwen2-VL, InternVL, Gemini & Claude

Architecture patterns, projection connectors, native multimodal vs adapter, AnyRes tiling, token budget, alignment training, multi-image & video extensions, and evaluation on MMMU, MathVista, ChartQA & OCRBench.

LLaVA Llama 3.2 V Qwen2-VL InternVL2 Pixtral Gemini MMMU
Image Encoder Projection LLM Context Alignment Instruct Eval
00

Topics We’ll Cover

  • The VLM landscape: Llama 3.2 Vision, Qwen2-VL, InternVL2, Pixtral, Gemini, Claude
  • Three connector architectures: Q-Former, Perceiver Resampler, simple projection
  • Native multimodal vs adapter models
  • Image preprocessing: resolution, tiling, AnyRes
  • Token budget: how many tokens per image
  • Vision-language alignment training
  • Multi-image & video extensions
  • Evaluation: MMMU, MathVista, ChartQA, OCRBench

01

The VLM Landscape — Llama 3.2 V, Qwen2-VL, InternVL2, Pixtral, Gemini, Claude

As of 2025 there are broadly three tiers of VLM: API-only closed models (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet), open-weight models trained on proprietary data (Pixtral-Large, Qwen2-VL-72B, InternVL2-76B), and fully open models (LLaVA-NeXT, MiniCPM-V 2.6, PaliGemma). The landscape changes monthly; what matters is understanding the architectural patterns.

| Model | LLM backbone | Image encoder | Connector | Max res / tiles | MMMU |
|---|---|---|---|---|---|
| LLaVA-1.5-13B | Vicuna-13B | CLIP ViT-L/14@336 | 2-layer MLP | 336px / 1 | 36.3% |
| Llama 3.2 Vision 90B | Llama 3.1 90B | ViT (cross-attn) | Cross-attention layers | 1120px / 4 | 60.3% |
| Qwen2-VL-72B | Qwen2-72B | ViT 675M (2D-RoPE) | MLP (no fixed token count) | Dynamic / unlimited | 64.5% |
| InternVL2-76B | InternLM2.5-20B×2 | InternViT-6B | MLP + pixel-shuffle | 448px / 40 tiles | 65.4% |
| Pixtral-Large-124B | Mistral Large 2 | Pixtral-ViT-400M | Linear proj (2D-RoPE) | 1024px / variable | 72.0% |
| Gemini 1.5 Pro | Native multimodal | Native (no separate encoder) | Native | Variable | 62.2% |
| Claude 3.5 Sonnet | Native multimodal | Native | Native | Up to ~8k px | 68.3% |
Why InternVL2 leads open-weight models

InternVL2-76B uses InternViT-6B, a 6-billion-parameter image encoder trained with a SigLIP-style contrastive objective and then fine-tuned jointly with the LLM. Most open VLMs use a frozen CLIP ViT-L (307M) with a tiny MLP. The 20× larger encoder, combined with multi-tile input and large-scale instruction data (InternLM2.5 training corpus), pushes MMMU to 65.4%.

02

Three Architectures — Q-Former, Perceiver Resampler, Simple Projection

The architectural decision that most shapes a VLM’s capabilities is how image tokens are projected into the LLM’s token space. Three patterns dominate:

Q-Former (BLIP-2 / InstructBLIP)

A Transformer with N learned queries (default: 32) that cross-attends to the frozen image encoder’s output. Only the 32 query tokens are passed to the LLM.

  • Pro: minimal LLM token cost regardless of image resolution
  • Con: 32 tokens cannot carry fine-grained text (OCR fails badly)
  • Con: introduces an extra bottleneck module to train
  • Used in: BLIP-2, InstructBLIP, MiniGPT-4
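
A simplified sketch of this pattern in PyTorch (the real BLIP-2 Q-Former is initialised from BERT and also processes text; module names and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    """Learned queries cross-attend to frozen image features (BLIP-2-style sketch)."""
    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=4096, num_layers=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        # Each layer: self-attention over the queries + cross-attention to image tokens.
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=12,
                                           dim_feedforward=4 * hidden_dim, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, patch_tokens):              # (B, 576, vision_dim) from the frozen ViT
        memory = self.vision_proj(patch_tokens)
        queries = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out = self.blocks(tgt=queries, memory=memory)  # queries attend to all patch tokens
        return self.to_llm(out)                   # (B, 32, llm_dim): only 32 tokens reach the LLM
```

Whatever the input resolution, the LLM cost stays at 32 tokens per image, which is exactly why fine-grained text is lost.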

Perceiver Resampler (Flamingo / Otter)

A stack of transformer layers where M learned latents (64–256) cross-attend to all image patch tokens. Inserted between frozen LLM layers.

  • Pro: interleaved vision-language (arbitrary image-text sequences)
  • Pro: more latents than Q-Former → better detail retention
  • Con: modifies the LLM forward pass; harder to distribute across GPUs
  • Used in: Flamingo, OpenFlamingo, Idefics, Otter

Simple Linear / MLP Projection (LLaVA family)

A 1–2-layer MLP maps each patch token directly to the LLM’s embedding dimension. All patch tokens (256–2048 depending on tile count) are prepended to the text sequence.

  • Pro: simplest; every patch token preserved; strongest fine detail
  • Pro: fast to train; easy to scale
  • Con: LLM context blown up by hundreds/thousands of tokens
  • Used in: LLaVA-1.5, LLaVA-NeXT, PaliGemma, InternVL2, Pixtral
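
The projection itself is small enough to show in full; this mirrors the published LLaVA-1.5 design (Linear → GELU → Linear), with assumed dimensions of 1024 for CLIP ViT-L/14 patch tokens and 4096 for the LLM embedding space:

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-1.5-style connector: every patch token becomes one LLM input embedding."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens):    # (B, num_patches, vision_dim)
        return self.proj(patch_tokens)  # (B, num_patches, llm_dim), prepended to the text embeddings
```
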
Industry convergence in 2024–2025

The field largely abandoned the Q-Former and Perceiver Resampler for new models after LLaVA-1.5 and LLaVA-NeXT showed that a simple 2-layer MLP, combined with a larger, better-trained image encoder and AnyRes tiling, outperforms both on OCR, chart, and document tasks. The LLM context cost of 256–1024 tokens per image is considered acceptable given modern 128k+ context windows.

03

Native Multimodal vs Adapter — Gemini 1.5 vs Llama 3.2

The architectures above are all adapter VLMs: a separately-trained LLM gets a vision encoder bolted on. Native multimodal models train from scratch with both modalities interleaved, so the LLM weights themselves encode joint image-text understanding rather than learning it post-hoc.

Adapter VLM — Llama 3.2 Vision (Meta, 2024)

Llama 3.1 text LLM pre-trained on 15T tokens. Vision added via cross-attention layers inserted at every 4th LLM layer. Image tokens never enter the self-attention stack directly; they are injected via cross-attention.

  • Image encoder: ViT-H (630M), trained with CLIP then fine-tuned
  • Cross-attention keys/values come from image encoder; queries from LLM
  • Only cross-attention + image encoder weights updated in Stage 1; full model in Stage 2
  • Sizes: 11B and 90B. Llama-3.2-Vision-90B: MMMU 60.3%, DocVQA 91.3%
  • Advantage: preserves strong text capabilities; image encoding is isolatable
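
A hedged sketch of the cross-attention adapter pattern: a block inserted between the LLM's self-attention layers in which text hidden states query the image tokens, with a zero-initialised tanh gate so the pretrained text model's behaviour is untouched at the start of training. The module below is illustrative, not Meta's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted every few LLM layers: text hidden states attend to projected image tokens."""
    def __init__(self, llm_dim=4096, num_heads=32):
        super().__init__()
        self.norm = nn.LayerNorm(llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        # tanh(0) = 0: at initialisation the block is a no-op, preserving text-only behaviour.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, image_tokens):   # (B, T, D), (B, N_img, D)
        attn_out, _ = self.cross_attn(query=self.norm(text_hidden),
                                      key=image_tokens, value=image_tokens)
        return text_hidden + torch.tanh(self.gate) * attn_out
```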

Native multimodal — Gemini 1.5 (Google, 2024)

Trained end-to-end on interleaved image, video, audio, and text from the beginning. No separate image encoder or projection step. The backbone itself (a Mixture-of-Experts architecture) handles all modalities via a shared token sequence.

  • Image/video tokenised at training time via a learned tokeniser (not ViT patches explicitly)
  • 1M token context window: can process entire codebases, long videos, PDFs
  • Gemini 1.5 Pro: MMMU 62.2%, Video-MME 75.0%, long-context bench near-perfect
  • Advantage: no modality boundary artefacts; arbitrary interleaving; better long-context fusion
  • Disadvantage: no open weights; extremely expensive to train
Claude 3.5 architecture (inferred)

Anthropic has not published detailed architecture papers for Claude 3.x. From published benchmarks and the Constitutional AI + RLHF training regime, it is likely native multimodal in the same sense as Gemini: trained with interleaved vision-text from early stages. Claude 3.5 Sonnet achieves MMMU 68.3% and strong performance on charts and screenshots, suggesting high-resolution tile-based processing similar to Gemini’s approach.

04

Image Preprocessing — Resolution, Tiling, AnyRes

CLIP was trained at 224px. Most high-resolution text and document understanding requires at least 448–1024px. AnyRes (introduced in LLaVA-HD / LLaVA-NeXT) solves this by tiling the image into crops that fit the encoder’s native resolution, encoding each tile independently, and concatenating the tokens alongside a low-resolution global thumbnail.

[Figure] AnyRes: 1024×1024 input (aspect ratio preserved) → 2×2 tile grid + global thumbnail → 5 passes through CLIP ViT-L/14@448

  • Tiles 1–4: 512×512 crops, each resized to 448px → 1,024 tokens per tile
  • Global thumbnail: 224px → 256 tokens
  • Token sequence passed to the LLM: 4 × 1,024 + 256 = 4,352 tokens (with InternVL2 pixel-shuffle: ¼ → 1,088 tokens)
  • Grid selection: the model picks the tiling (1×1, 1×2, 2×2, 2×3, ...) that best matches the image aspect ratio with the fewest tiles
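
The grid-selection step in the last bullet can be sketched as follows; the scoring here (keep as much of the original resolution as possible, waste as little canvas as possible, prefer fewer tiles) is illustrative logic rather than the exact LLaVA-NeXT code.

```python
def select_anyres_grid(img_w, img_h, tile=448,
                       grids=((1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2))):
    """Pick the tile grid that retains the most image detail with the least padding."""
    best, best_key = None, None
    for cols, rows in grids:
        canvas_w, canvas_h = cols * tile, rows * tile
        # Scale the image to fit the candidate canvas without changing aspect ratio.
        scale = min(canvas_w / img_w, canvas_h / img_h)
        fit_w, fit_h = int(img_w * scale), int(img_h * scale)
        effective = min(fit_w * fit_h, img_w * img_h)   # detail retained, capped at the original
        wasted = canvas_w * canvas_h - fit_w * fit_h    # padding area on the canvas
        key = (-effective, wasted, cols * rows)         # more detail, less padding, fewer tiles
        if best_key is None or key < best_key:
            best, best_key = (cols, rows), key
    return best

print(select_anyres_grid(1024, 1024))   # (2, 2), as in the figure above
print(select_anyres_grid(1792, 768))    # (3, 2) for a wide screenshot
```
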
Aspect-ratio-aware tiling (Qwen2-VL)

Qwen2-VL takes this further with a dynamic resolution approach: it discretises the image into 2D patch tokens with 2D-RoPE positional encoding, treating any resolution as valid input without a fixed tile grid. The number of visual tokens per image varies from 256 to ~16,384 depending on input resolution. The LLM receives interleaved <|vision_start|> and <|vision_end|> tokens wrapping the visual tokens.
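
To make this concrete, a rough estimate of Qwen2-VL's visual token count as a function of resolution, assuming 14px patches merged 2×2 before the LLM (roughly one token per 28×28 pixel block); the released preprocessor's exact resizing and rounding rules differ, so treat this as an approximation:

```python
def qwen2vl_token_estimate(img_w, img_h, patch=14, merge=2,
                           min_tokens=256, max_tokens=16384):
    """Approximate visual-token count for a native-resolution image."""
    unit = patch * merge                       # one LLM token per 28x28 pixel block
    tokens = (img_w // unit) * (img_h // unit)
    # In practice the image is rescaled so that the count lands inside the budget.
    return max(min_tokens, min(tokens, max_tokens))

print(qwen2vl_token_estimate(1344, 896))    # 1536
print(qwen2vl_token_estimate(3840, 2160))   # 10549 -> a 4K frame is still within budget
```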

05

Token Budget — How Many Tokens Per Image

Visual tokens are the single biggest driver of inference cost in VLMs. Understanding the token budget lets you reason about latency, context length limits, and what fidelity you actually need for a given task.

| Model | Tokens/image (typical) | Max tokens/image | Compression method |
|---|---|---|---|
| BLIP-2 / InstructBLIP | 32 | 32 | Q-Former (32 learned queries) |
| LLaVA-1.5 (336px) | 576 | 576 | None (all patch tokens) |
| Llama 3.2 Vision 11B | 1,680 | 6,720 | 4 tiles × 420 + global |
| InternVL2-76B (1 tile) | 256 | 10,240 | Pixel-shuffle 2× (1024 → 256 per tile) |
| Qwen2-VL | 256–16,384 | ~16,384 | None (native res, 2D-RoPE) |
| Pixtral-12B | 1,024 per tile | ~16,384 | None (linear proj) |
| Gemini 1.5 Flash | 258 (low res) | 3,072 | Internal tokeniser |

The OCR accuracy vs token budget trade-off

Low token budget (≤ 256)

Fast, cheap, suitable for: scene classification, coarse object detection, general Q&A about natural images. Fails at: reading small text, chart axis labels, table cell contents, handwriting, code screenshots.

High token budget (≥ 1024)

Accurate OCR, detailed chart reading, full-page document parsing. Necessary for: DocVQA (> 90% acc requires ~1024+ tokens), TableVQA, infographic Q&A. Cost: 4× longer LLM forward pass, 4× KV cache memory.

Token compression strategies that work

  • Pixel shuffle (InternVL, MiniCPM): spatially group 2×2 adjacent patch tokens into one token of 4× the channel depth. Information-theoretically lossless for spatial patterns; preserves layout better than attention pooling.
  • 2D average pooling: simple, fast, but blurs fine details.

Neither should be used for OCR tasks: at single-tile 448px resolution you need all 1,024 raw patch tokens to reliably read 8pt font.
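
Pixel shuffle is pure reshaping, which is why no information is discarded; a sketch of the operation on a square grid of patch tokens (illustrative rather than InternVL's exact code):

```python
import torch

def pixel_shuffle_tokens(tokens, grid=32, factor=2):
    """(B, grid*grid, C) -> (B, (grid/factor)**2, C*factor**2) by folding 2x2 neighbours into channels."""
    b, n, c = tokens.shape
    assert n == grid * grid, "expects a square patch grid"
    x = tokens.view(b, grid, grid, c)
    # Split each spatial axis into (blocks, within-block) and move the within-block
    # positions into the channel dimension.
    x = x.view(b, grid // factor, factor, grid // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (grid // factor) ** 2, c * factor * factor)

tokens = torch.randn(1, 1024, 1024)         # 32x32 patch grid from one 448px tile
print(pixel_shuffle_tokens(tokens).shape)   # torch.Size([1, 256, 4096])
```

The MLP projector then maps the widened channels to the LLM dimension, so the net effect is 4× fewer visual tokens per tile.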

06

Vision-Language Alignment Training

Most adapter VLMs are trained in two or three stages. Getting the stage boundaries wrong (e.g. fine-tuning the image encoder before the projection has converged) is one of the most common causes of poor downstream performance.

Stage 1 — Projection Alignment
Freeze image encoder + LLM. Train only the MLP projection. Data: 600K–1.2M image-caption pairs (LAION, CC3M, ShareGPT4V). Loss: next-token prediction on caption. Goal: projection learns to map visual tokens into LLM’s vocabulary space. ~1 epoch.
Stage 2 — Visual Instruction Tuning
Unfreeze projection (and optionally LLM). Train on instruction-following data: LLaVA-Instruct-150K, ShareGPT4V-100K, VQAv2, GQA, TextVQA, ChartQA, DocVQA, OCRBench, RefCOCO, VIST, MathVista, MMMU training split. Mixed batch. SFT loss (response tokens only). 1–3 epochs.
Stage 3 (optional) — Preference Optimisation
RLHF / DPO on preference pairs for visual hallucination reduction. Datasets: LLaVA-RLHF, POVID, SILKIE. Target: reduce “I see a cat” when there is no cat (object hallucination, measured by CHAIR score).
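
A hedged sketch of how these stage boundaries translate into code, assuming a composite model with vision_encoder, projector, and llm submodules (placeholder names): Stage 1 trains the projector only, Stage 2 additionally unfreezes the LLM, and the SFT loss is computed on response tokens only by masking everything else with label -100.

```python
import torch.nn.functional as F

def set_stage(model, stage):
    """Freeze/unfreeze submodules per training stage (illustrative module names)."""
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:                       # Stage 1: projection alignment
        trainable = [model.projector]
    else:                                # Stage 2/3: projector + LLM (encoder stays frozen here)
        trainable = [model.projector, model.llm]
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True

def sft_loss(logits, labels):
    """Next-token loss on response tokens only; prompt and image positions carry label -100."""
    return F.cross_entropy(logits[:, :-1].flatten(0, 1),
                           labels[:, 1:].flatten(),
                           ignore_index=-100)
```
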
LLaVA-1.5 vs InternVL2 data scale

LLaVA-1.5 (2023) used 665K alignment + 665K instruction data. InternVL2 (2024) uses >20M alignment samples and >4M instruction samples. The 30× data scale — not the 20× larger image encoder — is arguably the bigger factor in the benchmark gap. Data sourcing (ShareGPT4V, InternData, MMInstruct) matters as much as architecture at this stage of the field.

07

Multi-Image & Video Extensions

Single-image VLMs generalise poorly to video and multi-image tasks. The three primary approaches to handle temporal and multi-frame inputs are: frame sampling, memory-efficient encoding with recurrence, and native video training.

Frame sampling (simple)

Uniformly sample N frames from the video. Encode each frame independently with the image encoder. Concatenate all visual tokens. Pass the full sequence (visual + text) to the LLM.

  • Used by: LLaVA-NeXT-Video, most early video VLMs
  • N=8–32 frames is typical for 1–5 min clips
  • Context blowup: 32 × 256 = 8,192 visual tokens
  • Temporal ordering only via position in sequence
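
A minimal sketch of this approach, reusing the per-frame encoder + projector from the earlier sections (encode_frame is a placeholder for that pipeline):

```python
import torch

def encode_video_by_sampling(frames, encode_frame, num_frames=8):
    """Uniformly sample frames and concatenate their visual tokens along the sequence axis.

    frames:       list of frames in temporal order (e.g. decoded PIL images)
    encode_frame: callable mapping one frame -> (tokens_per_frame, llm_dim) visual tokens
    """
    idx = torch.linspace(0, len(frames) - 1, num_frames).round().long().tolist()
    token_chunks = [encode_frame(frames[i]) for i in idx]
    # Temporal order is represented only by position in the concatenated sequence.
    return torch.cat(token_chunks, dim=0)   # (num_frames * tokens_per_frame, llm_dim)
```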

Temporal compression (Video-LLaVA)

Encode each frame. Apply a temporal Q-Former or pooling across the time axis to compress N frames to M < N summary tokens. Preserves motion information while constraining token count.

  • Video-LLaVA: 256 tokens per video regardless of length
  • mPLUG-Owl3: 64 video tokens via temporal mean pooling
  • Risk: temporal details (e.g. frame order, fast motion) may be lost
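
In its simplest form the temporal compression is just a pooling step over the frame axis; a sketch assuming identically shaped per-frame token grids:

```python
import torch

def temporal_mean_pool(frame_tokens):
    """(num_frames, tokens_per_frame, dim) -> (tokens_per_frame, dim).

    Each spatial position is averaged across time, so any video costs the same number of
    LLM tokens as one frame; frame order and fast motion are necessarily blurred away.
    """
    return frame_tokens.mean(dim=0)
```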

Native video training (Qwen2-VL / Gemini)

3D position encoding (frame index + 2D spatial) allows the model to reason natively about frame differences. Trained with interleaved video-text data including dense temporal captions.

  • Qwen2-VL: 3D-RoPE; processes up to 20-min video
  • Gemini 1.5 Pro: 1M context = ~1 hour of video @ 1fps
  • Gemini 1.5: Video-MME long-form 64.0% (best open score: 67.3%)
Multi-image interleaving (Idefics2 / LLaVA-Interleave)

Supporting multiple images in one context requires: (1) each image to be enclosed in image tokens that the text tokeniser understands (<image_1> sentinel tokens), (2) the LLM to have cross-image attention — or at minimum, positional distinctness between image token groups. LLaVA-Interleave achieves this by prepending per-image tokens before each image sequence in the prompt. Flamingo’s Perceiver Resampler naturally supports this via interleaved cross-attention insertion.
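
A sketch of the interleaving mechanics, assuming a tokeniser with a reserved image-placeholder id and a projector as in the earlier sections (all names illustrative): the prompt is tokenised, and each placeholder position is replaced by that image's projected token block before the sequence enters the LLM.

```python
import torch

def build_interleaved_input(text_ids, image_token_id, text_embed, image_embeds):
    """Splice projected image tokens into the text embedding sequence.

    text_ids:     (T,) token ids containing one placeholder id per image
    text_embed:   callable mapping (T,) ids -> (T, D) text embeddings
    image_embeds: list of (N_i, D) projected visual tokens, in order of appearance
    """
    embeds = text_embed(text_ids)                                        # (T, D)
    positions = (text_ids == image_token_id).nonzero(as_tuple=True)[0].tolist()
    chunks, img_iter, start = [], iter(image_embeds), 0
    for pos in positions:
        chunks.append(embeds[start:pos])      # text up to the placeholder
        chunks.append(next(img_iter))         # this image's visual tokens
        start = pos + 1                       # skip the placeholder itself
    chunks.append(embeds[start:])
    return torch.cat(chunks, dim=0)           # (T - num_images + sum(N_i), D)
```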

08

Eval — MMMU, MathVista, ChartQA, OCRBench

VLM evaluation benchmarks test very different capability facets. A model that tops MMMU may be mediocre at OCRBench, so it’s essential to pick benchmarks appropriate to your task:

| Benchmark | Task | Scale | What it tests | SOTA (2025 Q1) |
|---|---|---|---|---|
| MMMU | Multi-discipline MCQ | 11,550 Qs, 30 subjects | Expert-level domain knowledge (science, art, medicine) from college textbooks; images include diagrams, charts, photos | GPT-4o 69.1%; Claude 3.5 Sonnet 68.3% |
| MathVista | Visual math reasoning | 6,141 Qs | Geometry, function plots, statistics; requires both vision and multi-step reasoning | GPT-4V 58.1%; Claude 3.5 Sonnet 67.7% |
| ChartQA | Chart Q&A | 9,866 Qs | Reads bar/line/pie charts; relaxed numerical accuracy (±5%) | Gemini 1.5 Pro 87.2%; InternVL2-76B 88.4% |
| OCRBench | OCR & text understanding | 1,000 Qs | Scene text, handwriting, formula recognition, document structure; exact string match | InternVL2-76B 825/1000; Qwen2-VL-72B 866/1000 |
| DocVQA | Document Q&A | 5,349 Qs | Reads typed/printed documents; ANLS metric (handles partial matches) | InternVL2-76B 94.1%; Qwen2-VL-72B 96.5% |
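
Two of these benchmarks use non-trivial metrics; below are hedged reference implementations of ChartQA-style relaxed accuracy (±5% on numeric answers) and the ANLS score used by DocVQA. The thresholds follow the common convention (ANLS cutoff 0.5), but check the official evaluators before reporting numbers.

```python
def relaxed_accuracy(pred: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style: numeric answers may deviate by ±5%; other answers need an exact match."""
    try:
        p, t = float(pred), float(target)
        return p == t if t == 0 else abs(p - t) / abs(t) <= tolerance
    except ValueError:
        return pred.strip().lower() == target.strip().lower()

def anls(pred: str, target: str, threshold: float = 0.5) -> float:
    """Normalised Levenshtein similarity for one answer (DocVQA-style); scores below 0.5 count as 0."""
    pred, target = pred.strip().lower(), target.strip().lower()
    m, n = len(pred), len(target)
    dp = list(range(n + 1))                   # classic dynamic-programming edit distance
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (pred[i - 1] != target[j - 1]))
            prev = cur
    nls = 1.0 - dp[n] / max(m, n, 1)
    return nls if nls >= threshold else 0.0
```
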
Benchmark saturation & the leaderboard problem

MMMU test-set contamination is a real concern: many training datasets now include MMMU-style question-answer pairs derived from college textbooks. OpenCompass and the MMMU team maintain an MMMU-Pro variant with process-level reasoning questions (harder; less contamination-prone). Similarly, ChartQA’s training split is small enough that fine-tuning on it directly inflates scores without generalising to real-world charts. Always cross-validate with a held-out private evaluation set if deploying on specialised chart data.

09

What to Take Away

  • Connector choice shapes everything: simple MLP projection plus AnyRes tiling has become the dominant pattern, while Q-Former and Perceiver Resampler are largely legacy choices.
  • Visual token count is the main driver of inference cost; match the token budget to the task (coarse Q&A needs far less than OCR or document parsing).
  • Native multimodal models (Gemini, likely Claude) avoid modality-boundary artefacts but are closed; adapter models like Llama 3.2 Vision preserve text capability and build on open backbones.
  • Alignment data scale and quality matter at least as much as architecture, as the LLaVA-1.5 vs InternVL2 gap shows.
  • Pick benchmarks that match your task: MMMU for domain knowledge, MathVista for visual reasoning, ChartQA/DocVQA/OCRBench for text-heavy inputs.

Where to next

Deck 04 specialises in document AI: Donut (OCR-free end-to-end parsing), ColPali (vision-based PDF retrieval), LayoutLM and UDOP (layout-aware models), and production pipeline patterns for PDF, table, and chart processing.