Architecture patterns, projection connectors, native multimodal vs adapter, AnyRes tiling, token budget, alignment training, multi-image & video extensions, and evaluation on MMMU, MathVista, ChartQA & OCRBench.
As of 2025 there are broadly three tiers of VLM: API-only closed models (Gemini 1.5 Pro, GPT-4o, Claude 3.5 Sonnet), open-weight models trained on proprietary data (Pixtral-Large, Qwen2-VL-72B, InternVL2-76B), and fully open models (LLaVA-NeXT, MiniCPM-V 2.6, PaliGemma). The landscape changes monthly; what matters is understanding the architectural patterns.
| Model | LLM backbone | Image encoder | Connector | Max res / tiles | MMMU |
|---|---|---|---|---|---|
| LLaVA-1.5-13B | Vicuna-13B | CLIP ViT-L/14@336 | 2-layer MLP | 336px / 1 | 36.3% |
| Llama 3.2 Vision 90B | Llama 3.1 90B | ViT (cross-attn) | Cross-attention layers | 1120px / 4 | 60.3% |
| Qwen2-VL-72B | Qwen2-72B | ViT 675M (2D-RoPE) | MLP (no fixed token count) | Dynamic / unlimited | 64.5% |
| InternVL2-76B | Llama-3-70B (Hermes-2-Theta) | InternViT-6B | MLP + pixel-shuffle | 448px / 40 tiles | 65.4% |
| Pixtral-Large-124B | Mistral Large 2 | Pixtral-ViT-400M | Linear proj (RoPE-2D) | 1024px / variable | 72.0% |
| Gemini 1.5 Pro | Native multimodal | Native (no separate encoder) | Native | Variable | 62.2% |
| Claude 3.5 Sonnet | Native multimodal | Native | Native | Up to ~8k px | 68.3% |
InternVL2-76B uses InternViT-6B, a 6-billion-parameter image encoder trained with a SigLIP-style contrastive objective and then fine-tuned jointly with the LLM. Most open VLMs use a frozen CLIP ViT-L (307M) with a tiny MLP. The 20× larger encoder, combined with multi-tile input and large-scale instruction data (InternLM2.5 training corpus), pushes MMMU to 65.4%.
The architectural decision that most shapes a VLM’s capabilities is how image tokens are projected into the LLM’s token space. Three patterns dominate:
Q-Former (BLIP-2, InstructBLIP): a Transformer with N learned queries (default: 32) that cross-attends to the frozen image encoder’s output. Only the 32 query tokens are passed to the LLM.
Perceiver Resampler (Flamingo): a stack of transformer layers in which M learned latents (64–256) cross-attend to all image patch tokens; the resampled latents are consumed by gated cross-attention layers inserted between frozen LLM layers.
MLP projection (LLaVA family): a 1–2-layer MLP maps each patch token directly to the LLM’s embedding dimension. All patch tokens (256–2,048 depending on tile count) are prepended to the text sequence.
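A minimal sketch of the MLP-projection pattern in PyTorch; the 1024-dim vision features and 4096-dim LLM embedding space are assumed values for illustration, not any specific model’s configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-1.5-style connector: a 2-layer MLP applied independently to each patch token."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim) from the frozen image encoder
        # returns:      (batch, num_patches, llm_dim), prepended to the text embeddings
        return self.proj(patch_tokens)

# Example: one 336px image at patch size 14 gives a 24x24 = 576-token grid
image_features = torch.randn(1, 576, 1024)
visual_embeds = MLPProjector()(image_features)   # (1, 576, 4096)
```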
The field largely abandoned Q-Former and Perceiver Resampler for new models after LLaVA-1.5 and LLaVA-NeXT showed that a simple 2-layer MLP, paired with a larger, better-trained image encoder and AnyRes tiling, outperforms both on OCR, chart, and document tasks. The LLM context cost of 256–1024 tokens per image is considered acceptable given modern 128k+ context windows.
The architectures above are all adapter VLMs: a separately-trained LLM gets a vision encoder bolted on. Native multimodal models train from scratch with both modalities interleaved, so the LLM weights themselves encode joint image-text understanding rather than learning it post-hoc.
Llama 3.2 Vision (cross-attention adapter): the Llama 3.1 text LLM, pre-trained on 15T tokens, is kept intact; vision is added via cross-attention layers inserted at every 4th LLM layer. Image tokens never enter the self-attention stack directly; they are injected via cross-attention (a simplified sketch follows after these examples).
Gemini 1.5 (native multimodal): trained end-to-end on interleaved image, video, audio, and text from the beginning. No separate image encoder or projection step; the backbone itself (a Mixture-of-Experts architecture) handles all modalities via a shared token sequence.
Claude 3.5 Sonnet: Anthropic has not published detailed architecture papers for Claude 3.x; only the Constitutional AI + RLHF training regime is documented. From its benchmark profile it is likely native multimodal in the same sense as Gemini, trained with interleaved vision-text from early stages. Claude 3.5 Sonnet achieves 68.3% on MMMU and strong performance on charts and screenshots, which suggests high-resolution tile-based processing similar to Gemini’s approach.
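Returning to the cross-attention adapter pattern described above for Llama 3.2 Vision, here is a simplified sketch of one inserted block; the dimensions and gating are illustrative (closer to Flamingo’s published design) rather than taken from released weights.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted between frozen self-attention layers: text hidden states attend
    to the image tokens; a learned gate starts at zero so the text model's
    behaviour is unchanged at initialisation."""
    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, dim); image_tokens: (batch, img_len, dim)
        attended, _ = self.cross_attn(text_hidden, image_tokens, image_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```

Starting the gate at zero is what makes it safe to bolt vision onto a finished LLM: the text model behaves exactly as before until the new layers learn something useful.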
CLIP was trained at 224px. Most high-resolution text and document understanding requires at least 448–1024px. AnyRes (introduced in LLaVA-HD / LLaVA-NeXT) solves this by tiling the image into crops that fit the encoder’s native resolution, encoding each tile independently, and concatenating the tokens alongside a low-resolution global thumbnail.
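A rough sketch of the tiling step, assuming a 336px encoder and a small hand-picked set of candidate grids; real implementations enumerate more grid shapes and pad rather than stretch.

```python
from PIL import Image

# Candidate (cols, rows) grids; real AnyRes implementations enumerate more shapes.
GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]
TILE = 336  # the encoder's native resolution

def pick_grid(width: int, height: int) -> tuple[int, int]:
    """Choose the grid whose aspect ratio best matches the input image."""
    target = width / height
    return min(GRIDS, key=lambda g: abs((g[0] / g[1]) - target))

def anyres_tiles(img: Image.Image) -> list[Image.Image]:
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    # Low-resolution global thumbnail prepended to the per-tile crops
    return [img.resize((TILE, TILE))] + tiles
```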
Qwen2-VL takes this further with a dynamic resolution approach: it discretises the image into 2D patch tokens with 2D-RoPE positional encoding, treating any resolution as valid input without a fixed tile grid. The number of visual tokens per image varies from 256 to ~16,384 depending on input resolution. The LLM receives interleaved <|vision_start|> and <|vision_end|> tokens wrapping the visual tokens.
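A back-of-the-envelope token count under such a dynamic-resolution scheme; the 28px effective stride (a 14px patch followed by a 2×2 merge) and the min/max clamp are assumptions for illustration, not the exact released configuration.

```python
import math

def visual_token_count(width: int, height: int, stride: int = 28,
                       min_tokens: int = 256, max_tokens: int = 16_384) -> int:
    """Tokens grow with image area: one token per stride x stride pixel block."""
    tokens = math.ceil(width / stride) * math.ceil(height / stride)
    return max(min_tokens, min(tokens, max_tokens))

print(visual_token_count(1280, 720))   # ~1,196 tokens for a 720p frame
```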
Visual tokens are the single biggest driver of inference cost in VLMs. Understanding the token budget lets you reason about latency, context length limits, and what fidelity you actually need for a given task.
| Model | Tokens/image (typical) | Max tokens/image | Compression method |
|---|---|---|---|
| BLIP-2 / InstructBLIP | 32 | 32 | Q-Former (32 learned queries) |
| LLaVA-1.5 (336px) | 576 | 576 | None (all patch tokens) |
| Llama 3.2 Vision 11B | 1680 | 6720 | 4 tiles × 420 + global |
| InternVL2-76B (1 tile) | 256 | 10,240 | Pixel shuffle 2×2 merge (1024→256 per tile) |
| Qwen2-VL | 256–16,384 | ~16,384 | None (native res, 2D-RoPE) |
| Pixtral-12B | 1024 per tile | ~16,384 | None (linear proj) |
| Gemini 1.5 Flash | 258 (low res) | 3,072 | Internal tokeniser |
Low budget (roughly 32–576 tokens/image): fast, cheap, suitable for scene classification, coarse object detection, and general Q&A about natural images. Fails at reading small text, chart axis labels, table cell contents, handwriting, and code screenshots.
High budget (roughly 1,024+ tokens/image): accurate OCR, detailed chart reading, full-page document parsing. Necessary for DocVQA (> 90% acc requires ~1024+ tokens), TableVQA, and infographic Q&A. Cost: roughly 4× the LLM forward-pass time and 4× the KV-cache memory of the low-budget regime.
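To make the KV-cache cost concrete, here is a rough estimate for the visual tokens alone, assuming a hypothetical 70B-class decoder with 80 layers, grouped-query attention with 8 KV heads of dimension 128, and an fp16 cache.

```python
def kv_cache_bytes(num_tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Each layer stores K and V: num_tokens x kv_heads x head_dim values apiece
    return num_tokens * layers * kv_heads * head_dim * 2 * bytes_per_value

print(kv_cache_bytes(576) / 2**20)     # ~180 MiB for a single LLaVA-1.5-style image
print(kv_cache_bytes(2_304) / 2**20)   # ~720 MiB for four high-resolution tiles
```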
Pixel shuffle (InternVL, MiniCPM): spatially group 2×2 adjacent patch tokens into one token of 4× the channel depth. Information-theoretically lossless for spatial patterns; preserves layout better than attention pooling. 2D average pooling: simple, fast, but blurs fine details. Neither should be used for OCR tasks — at single-tile 448px resolution you need all 1024 raw patch tokens to reliably read 8pt font.
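A minimal sketch of the pixel-shuffle merge on a square grid of patch tokens; shapes are assumed, and real implementations (e.g. InternVL) follow the rearrangement with an MLP to bring the channel depth back down.

```python
import torch

def pixel_shuffle_merge(tokens: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Merge factor x factor neighbouring patch tokens into one token with
    factor^2 x the channel depth. Pure rearrangement: no information is lost."""
    b, n, c = tokens.shape
    side = int(n ** 0.5)                       # assume a square patch grid
    x = tokens.view(b, side, side, c)
    x = x.view(b, side // factor, factor, side // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (side // factor) ** 2, factor * factor * c)

tokens = torch.randn(1, 1024, 1024)            # 32x32 grid from a 448px tile
merged = pixel_shuffle_merge(tokens)           # (1, 256, 4096)
```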
Most adapter VLMs are trained in two or three stages. Getting the stage boundaries wrong (e.g. fine-tuning the image encoder before the projection has converged) is one of the most common causes of poor downstream performance.
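One way to express those stage boundaries is as parameter-group freezing; the module names here (vision_encoder, projector, llm) are placeholders, not any particular codebase’s API.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(vision_encoder: nn.Module, projector: nn.Module,
                    llm: nn.Module, stage: int) -> None:
    if stage == 1:
        # Stage 1 (alignment): only the projector learns; encoder and LLM stay frozen
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, False)
    elif stage == 2:
        # Stage 2 (instruction tuning): unfreeze the LLM once the projection has converged
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, True)
    else:
        # Optional stage 3: unfreeze the image encoder last, typically at a lower learning rate
        set_trainable(vision_encoder, True)
        set_trainable(projector, True)
        set_trainable(llm, True)
```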
LLaVA-1.5 (2023) used 665K alignment + 665K instruction data. InternVL2 (2024) uses >20M alignment samples and >4M instruction samples. The 30× data scale — not the 20× larger image encoder — is arguably the bigger factor in the benchmark gap. Data sourcing (ShareGPT4V, InternData, MMInstruct) matters as much as architecture at this stage of the field.
Single-image VLMs generalise poorly to video and multi-image tasks. The three primary approaches to handling temporal and multi-frame input are frame sampling, temporal token compression (via pooling, a temporal Q-Former, or recurrence), and native video training.
Frame sampling: uniformly sample N frames from the video, encode each frame independently with the image encoder, concatenate all visual tokens, and pass the full sequence (visual + text) to the LLM (sketched after these three approaches).
Temporal compression: encode each frame, then apply a temporal Q-Former or pooling across the time axis to compress N frames into M < N summary tokens. This preserves motion information while constraining the token count.
Native video training: 3D position encoding (frame index plus 2D spatial position) lets the model reason natively about frame differences. Trained with interleaved video-text data, including dense temporal captions.
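A minimal sketch of the frame-sampling approach, with the encoder and projector left abstract; the 8-frame default is an arbitrary illustration.

```python
import torch

def sample_frame_indices(num_frames: int, n: int = 8) -> list[int]:
    """Pick n frame indices spread uniformly across the clip."""
    return torch.linspace(0, num_frames - 1, n).round().long().tolist()

def encode_video(frames: list, image_encoder, projector, n: int = 8) -> torch.Tensor:
    """Encode n uniformly sampled frames independently and concatenate their tokens."""
    idx = sample_frame_indices(len(frames), n)
    per_frame = [projector(image_encoder(frames[i])) for i in idx]
    # The visual sequence grows linearly with n: 8 frames x 576 tokens = 4,608 tokens
    return torch.cat(per_frame, dim=0)
```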
Supporting multiple images in one context requires (1) enclosing each image in sentinel tokens the text tokeniser understands (e.g. <image_1>), and (2) cross-image attention in the LLM, or at minimum positional distinctness between image token groups. LLaVA-Interleave achieves this by prepending per-image index tokens before each image sequence in the prompt. Flamingo’s Perceiver Resampler naturally supports this via interleaved cross-attention insertion.
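A sketch of how a multi-image prompt might be assembled with per-image sentinel tokens; the token strings are illustrative, not any specific model’s vocabulary, and the inference stack is assumed to substitute each placeholder with that image’s visual tokens.

```python
def build_multi_image_prompt(question: str, num_images: int) -> str:
    """Interleave numbered image placeholders before the question; the inference
    stack later replaces each placeholder with that image's visual tokens."""
    parts = [f"<image_{i + 1}>" for i in range(num_images)]
    return "\n".join(parts) + "\n" + question

print(build_multi_image_prompt("Which of these two charts shows higher Q3 revenue?", 2))
# <image_1>
# <image_2>
# Which of these two charts shows higher Q3 revenue?
```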
VLM evaluation benchmarks test very different capability facets. A model that tops MMMU may be mediocre at OCRBench, so it’s essential to pick benchmarks appropriate to your task:
| Benchmark | Task | Scale | What it tests | SOTA (2025 Q1) |
|---|---|---|---|---|
| MMMU | Multi-discipline MCQ | 11,550 Qs, 30 subjects | Expert-level domain knowledge (science, art, medicine) from college textbooks; images include diagrams, charts, photos | GPT-4o 69.1%; Claude 3.5 Sonnet 68.3% |
| MathVista | Visual math reasoning | 6,141 Qs | Geometry, function plots, statistics — requires both vision and multi-step reasoning | GPT-4V 58.1%; Claude 3.5 Sonnet 67.7% |
| ChartQA | Chart Q&A | 9,866 Qs | Reads bar/line/pie charts; relaxed numerical accuracy (±5%) | Gemini 1.5 Pro 87.2%; InternVL2-76B 88.4% |
| OCRBench | OCR & text understanding | 1,000 Qs | Scene text, handwriting, formula recognition, document structure — exact string match | InternVL2-76B 825/1000; Qwen2-VL-72B 866/1000 |
| DocVQA | Document Q&A | 5,349 Qs | Reads typed/printed documents; ANLS metric (handles partial matches) | InternVL2-76B 94.1%; Qwen2-VL-72B 96.5% |
MMMU test-set contamination is a real concern: many training datasets now include MMMU-style question-answer pairs derived from college textbooks. The MMMU team maintains MMMU-Pro, a harder and less contamination-prone variant (filtered questions, more answer options, and a vision-only input setting), tracked on leaderboards such as OpenCompass. Similarly, ChartQA’s training split is small enough that fine-tuning on it directly inflates scores without generalising to real-world charts. Always cross-validate with a held-out private evaluation set if deploying on specialised chart data.
Deck 04 specialises on document AI: Donut (OCR-free end-to-end parsing), ColPali (vision-based PDF retrieval), LayoutLM and UDOP (layout-aware models), and production pipeline patterns for PDF, table, and chart processing.