Vision-Language Models Series — Presentation 04

Document AI — Donut, ColPali & Layout-Aware Models

OCR pipelines vs end-to-end vision models, Donut, ColPali vision-based PDF retrieval, LayoutLM, UDOP, modern VLMs as document processors, chart-to-text, and production PDF pipeline patterns.

Donut · ColPali · LayoutLM · UDOP · Document AI · OCR-Free · PDF Parsing
Pipeline: PDF Page → OCR / Vision → Layout Parse → Embed / Retrieve → Generate
00

Topics We’ll Cover

01

Documents as a Special Multimodal Target

A PDF page is not a natural image. It is a precisely typeset composition of text, tables, figures, and layout structure that encodes meaning through both visual appearance and positional arrangement. Standard NLP models see it as raw text (after OCR) and discard spatial context. Standard vision models see it as pixels and miss the semantic structure. Documents demand a hybrid understanding.

What makes documents hard

  • Multi-column layout: reading order is not left-to-right, top-to-bottom; columns must be detected and linearised
  • Tables: cell adjacency encodes relational structure invisible to flat text
  • Figures with captions: caption often appears below or beside the figure; cross-referencing requires layout
  • Formulae: LaTeX/MathML symbols require specialist tokenisation
  • Scanned documents: OCR noise + skew + font variation
  • Mixed languages: RTL text, CJK characters, ligatures

Document AI task taxonomy

  • Document classification: invoice vs contract vs receipt
  • Key information extraction (KIE): total amount, date, vendor name
  • Document Q&A (DocVQA): answer free-form Qs about a page
  • Table structure recognition: detect rows, columns, merged cells
  • Document retrieval: find the relevant page in a 1,000-page corpus
  • Chart understanding: extract data series values from bar/line/scatter plots
Why “just OCR it and feed to an LLM” often fails

Tesseract / AWS Textract OCR produces flat text that loses column order, table structure, and figure-caption links. A 10-column financial table becomes a meaningless run-on string. Vision models that reason over the page image retain the 2D layout context. The field has split into two schools: improve the OCR and layout-parsing pipeline (the Amazon, Adobe, ABBYY approach) vs train end-to-end vision models that skip OCR entirely (the Donut / ColPali approach).

02

OCR-then-LLM Pipelines — When & Why

Despite the appeal of end-to-end vision models, OCR-then-LLM pipelines remain dominant in production document AI because they provide explainable intermediate representations, modular error correction, and strong performance on well-formatted PDFs and born-digital documents.

Pipeline: PDF / Image → OCR Engine (Tesseract / Textract / Azure / Surya) → Layout Parser (Detectron2 / DocLayout-YOLO / Nougat) → Markdown/HTML (reading order + structure) → LLM (GPT-4 / Llama / Claude)

Tool | Type | Strengths | Weaknesses
Tesseract 5 | Open-source OCR | Free, offline, 100+ langs, LSTM engine | Poor on low-DPI scans; no layout; 70–80% char accuracy on handwritten text
AWS Textract | Cloud OCR + layout | Table/form extraction, bounding boxes, confidence scores | Cost at scale ($1.50/1k pages); US-region data residency constraint
Surya | Open-source, neural | Multi-language, layout-aware, benchmarks near Textract quality | GPU needed; 3× slower than Tesseract on CPU
Nougat (Meta) | End-to-end, vision | Scientific PDFs: preserves LaTeX equations, tables, figures natively | Hallucination on non-scientific docs; slow (1 page/sec on A100)
Marker | Open-source pipeline | Combines Surya OCR + layout; Markdown output; fast | Table merging heuristics can fail on complex grids

Rule of thumb

Use OCR-then-LLM when: documents are born-digital PDFs (not scanned), layout is simple (single-column, no complex tables), and you need citation of exact text spans. Use end-to-end vision models when: documents are scanned images, contain complex tables or charts, or you need to retrieve by visual similarity.

03

Donut — OCR-Free Document Understanding

Donut (Document Understanding Transformer, Kim et al., NAVER 2022) eliminates the OCR step entirely. It is an encoder-decoder transformer that takes a page image as input and produces structured text (JSON, key-value pairs, HTML) directly, treating document understanding as a conditional generation task.

Donut architecture: Swin encoder + BART-style decoder, conditioned on a task token

  • Page image: 2560×1920 px, resized + padded to model resolution
  • Swin-B encoder: 1280 patch tokens, 4-stage hierarchical; no OCR, no text input
  • BART decoder: 8 layers, 768-d, cross-attention on Swin features, teacher-forced training
  • Structured output: <s_date>2024-03-01</s_date> <s_total>€142.50</s_total> (JSON / HTML / plain text)

Task conditioning

  • DocVQA: prepend <s_docvqa><s_question>{q}</s_question> → decoder generates the answer
  • Document classification: prepend <s_rvlcdip> → decoder generates the class label token
  • KIE (receipts, forms): prepend <s_cord-v2> → decoder generates key-value JSON

Training: IIT-CDIP 11M + SynthDoG (synthetic document generator) 1M
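
A minimal inference sketch, assuming the public Hugging Face checkpoint naver-clova-ix/donut-base-finetuned-cord-v2 (receipt KIE) and a placeholder image path:

# Minimal Donut KIE sketch (assumed checkpoint: naver-clova-ix/donut-base-finetuned-cord-v2)
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model     = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2").eval()

image        = Image.open("receipt.png").convert("RGB")            # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values  # [1, 3, H, W]

# The task prompt conditions the decoder: <s_cord-v2> selects the receipt-KIE schema
task_prompt       = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
    )

# Convert the generated tag sequence (<s_total>…</s_total> etc.) into nested JSON
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))
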
Donut benchmarks & limitations

DocVQA: 67.5% ANLS (vs Textract + LLM pipeline ~85%). CORD (receipt KIE): 91.3% F1. RVL-CDIP document classification: 95.3%. Donut excels at templated forms and receipts, but struggles on multi-page documents, complex tables, and low-DPI scans. The Swin encoder serialises 2D layout information but cannot re-read patches; it cannot handle text that wraps across the encoder’s receptive field boundaries at coarse scales.

04

ColPali — Vision-Based Document Retrieval

ColPali (Faysse et al., 2024) addresses the document retrieval problem: given a text query, find the relevant page(s) in a large PDF corpus. Traditional approaches OCR the PDFs, embed the text, and retrieve. ColPali embeds each page as a set of patch-level visual embeddings using a PaliGemma-based VLM, then scores queries against patches via late interaction (the ColBERT mechanism).

Indexing (offline)

For each document page:

  1. Render to 448×448 image
  2. Encode with the ColPali model (PaliGemma-3B backbone) → 1024 patch embeddings of 128-d each
  3. Store all 1024 patch embeddings per page (L2-normalised)
  4. Index: each page = 1024 × 128-d = 131,072 floats = 0.5 MB at fp32

Retrieval (online)

For each query string:

  1. Tokenise → encode the query through the same ColPali model → M query token embeddings (128-d)
  2. MaxSim scoring (ColBERT-style): for each query token, find max cosine similarity across all 1024 page patches; sum over query tokens
  3. Score(Q, P) = Σ_{q ∈ Q} max_{p ∈ P} cos(q, p)
  4. Rank pages by score; return top-K
ColPali inference with colpali-engine library
from colpali_engine.models import ColPali, ColPaliProcessor
import torch
from PIL import Image

model     = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Index: encode page images (page_paths = rendered PDF pages; placeholder filenames here)
page_paths = ["page_001.png", "page_002.png", "page_003.png"]
pages  = [Image.open(p) for p in page_paths]
inputs = processor.process_images(pages)
with torch.no_grad():
    page_embeds = model(**inputs)   # [N_pages, n_patch_tokens, 128]

# Query: encode the text query with the same model
query_input = processor.process_queries(["What is the total revenue for Q3 2023?"])
with torch.no_grad():
    query_embed = model(**query_input)  # [1, M_q, 128]

# MaxSim scoring (ColBERT-style late interaction)
scores    = processor.score_multi_vector(query_embed, page_embeds)  # [1, N_pages]
top_pages = scores.topk(min(5, scores.shape[1])).indices
ViDoRe benchmark results

On ViDoRe (Visual Document Retrieval benchmark, 10 diverse PDF corpora), ColPali achieves NDCG@5 = 89.3 vs BM25 = 44.1, DPR-text (OCR → dense embed) = 74.3, and BGE-M3 (text) = 83.5. ColPali is the first retrieval model to significantly outperform text-based retrieval on document pages containing tables, charts, and complex layouts. The key reason: it does not require accurate OCR to retrieve — layout, colour, and chart shape contribute to the match score.

05

Layout-Aware Models — LayoutLM, UDOP

Before end-to-end vision models, the state of the art in document understanding was to augment a BERT-style language model with explicit 2D position encodings derived from OCR bounding boxes. This allowed the model to reason about both word content and spatial layout simultaneously.

LayoutLM v1

  • BERT token embeddings + 2D position embeddings (x0, y0, x1, y1 from OCR) + image region embeddings (ResNet CNN)
  • → FUNSD F1: 79.3%

LayoutLM v2

  • Text + layout + image jointly pre-trained
  • Visual-layout pre-training on IIT-CDIP 11M
  • Spatial-aware self-attention (relative 2D bias)
  • → FUNSD F1: 82.8%, DocVQA ANLS: 78.1%

LayoutLM v3

  • Unified text-image pre-training (no CNN)
  • Masked layout-language modelling
  • Word-patch alignment objective
  • → FUNSD F1: 92.1%, DocVQA ANLS: 83.4%

UDOP (2023)

  • T5-based encoder-decoder
  • Joint text, layout (x0, y0, x1, y1), and image tokens
  • Unified seq-to-seq for all doc tasks
  • → DocVQA ANLS: 84.7%, FUNSD F1: 93.3%
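
To make the OCR + 2D position idea concrete, a minimal sketch using the Hugging Face LayoutLMv3 classes; the base checkpoint is shown (its token-classification head is untrained, so swap in a FUNSD-fine-tuned checkpoint for meaningful labels), and apply_ocr=True assumes pytesseract is installed:

# Minimal LayoutLMv3 sketch: the processor runs OCR and builds token ids plus
# normalised (0-1000) bounding boxes for the 2D position embeddings
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

ckpt      = "microsoft/layoutlmv3-base"   # classification head untrained; use a FUNSD-fine-tuned model in practice
processor = LayoutLMv3Processor.from_pretrained(ckpt, apply_ocr=True)
model     = LayoutLMv3ForTokenClassification.from_pretrained(ckpt, num_labels=7).eval()

image    = Image.open("form.png").convert("RGB")   # placeholder path to a scanned form
encoding = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**encoding).logits              # [1, seq_len, num_labels]

predictions = logits.argmax(-1).squeeze(0)         # one BIO label id per sub-word token
print(predictions[:20])
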
UDOP’s “joint text and layout reconstruction”

UDOP pre-trains on a novel objective: given a page with some text tokens masked, predict both the masked text and the bounding box coordinates of those tokens. This forces the model to learn a tighter binding between what a word says and where it appears — critical for KIE tasks where the same word (e.g. “Total”) can be a column header or a label depending on its position in the table.

06

Modern VLMs as Document Processors

As of 2024–2025, the best results on DocVQA and similar benchmarks come not from purpose-built document models but from general VLMs with high-resolution tiling. The key is getting enough tokens per page to read the text without explicit OCR.

Model | DocVQA ANLS | InfoVQA | ChartQA | OCRBench | Approach
LayoutLMv3-Large | 83.4% | 46.0% | – | – | OCR + 2D pos embed
UDOP | 84.7% | 58.5% | – | – | OCR + T5 seq2seq
Donut-base | 67.5% | 11.6% | – | – | OCR-free (Swin+BART)
InternVL2-76B | 94.1% | 88.7% | 88.4% | 825 | High-res tiling (40 tiles)
Qwen2-VL-72B | 96.5% | 92.2% | 88.3% | 866 | Native res, 2D-RoPE
GPT-4V | 88.4% | 75.1% | 78.5% | 645 | Unknown (tiling probable)
Claude 3.5 Sonnet | ~91% | – | – | – | Native multimodal
Why Qwen2-VL leads on DocVQA

Qwen2-VL’s dynamic resolution encoding with 2D-RoPE allows it to process a full A4 page at up to 1280px resolution with no padding artefacts. InternVL2-76B achieves comparable results via 40 tiles of 448px (equivalent effective resolution ~2240px). Both avoid the resolution ceiling of fixed 448px single-image encoders, which caps DocVQA at ~85% ANLS. The LLM backbone scale (72B vs 7B) also matters: larger models are better at interpreting partially degraded or ambiguous text regions without explicit OCR grounding.
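
A minimal DocVQA-style sketch, assuming the 7B-Instruct checkpoint and the qwen_vl_utils helper shipped alongside the model card; the page path and question are placeholders:

# Minimal DocVQA-style sketch with Qwen2-VL (7B-Instruct checkpoint) via Hugging Face
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
# min_pixels / max_pixels bound the dynamic-resolution token budget per image
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/page_017.png"},  # placeholder page render
        {"type": "text", "text": "What is the total revenue for Q3 2023? Quote the exact figure."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)], skip_special_tokens=True
)[0]
print(answer)
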

07

Tables & Charts — Chart-to-Text

Tables and charts are the hardest document elements for both OCR pipelines and vision models. A table is a 2D relational structure; a chart is a visual encoding of quantitative data. Both require understanding geometric structure, not just reading text.

Table structure recognition

TableTransformer (Smock et al., Microsoft 2022) uses DETR (DEtection TRansformer) to detect table bounding boxes and then predict row/column/cell structure as a set of objects. Trained on PubTables-1M (995K annotated tables from scientific publications) and FinTabNet (112K financial tables). TableTransformer achieves 92.3% TEDS-S on PubTabNet — beating earlier CNN approaches by 5pp.
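
A minimal table-detection sketch using the microsoft/table-transformer-detection checkpoint on the Hugging Face Hub (the page image path is a placeholder); structure recognition on the cropped table region works the same way with the companion structure-recognition checkpoint:

# Minimal table-detection sketch with Table Transformer (DETR-based)
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection").eval()

image  = Image.open("report_page.png").convert("RGB")   # placeholder path to a page render
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above 0.9 confidence, rescaled to the original image size
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
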

Chart-to-text (DePlot, Matcha)

DePlot (Liu et al., Google 2023)

Fine-tunes PaLI on chart linearisation: given a chart image, output a markdown table of the underlying data. The table is then passed to an LLM (PaLM) for Q&A. Two-step pipeline separates visual parsing from language reasoning.

  • ChartQA: 56.0% (one-shot, no fine-tuning on ChartQA)
  • The linearised table is inspectable — useful for explainability

Matcha (Liu et al., Google 2023)

Fine-tunes PaLI end-to-end on chart Q&A, chart-to-text, and chart data extraction jointly. No separate linearisation step; model answers directly from the chart image.

  • ChartQA: 90.2% (fine-tuned)
  • FigureQA: 93.5%
  • Trained on ChartQA + PlotQA + Chart-to-Text dataset (30K samples)
Production chart extraction recipe

For a production chart extraction system that does not require fine-tuning: (1) use DePlot (or Qwen2-VL with an “extract the data table from this chart” prompt) to get a structured markdown table, (2) pass the table to an LLM for Q&A or analysis, (3) validate numeric outputs against the extracted values. This two-step approach is more debuggable than end-to-end generation and lets the LLM reason over clean tabular data rather than raw pixels.
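
A minimal sketch of step (1) using the google/deplot checkpoint (a Pix2Struct model) via Hugging Face; the chart path is a placeholder:

# Minimal chart-to-table sketch with DePlot (a Pix2Struct checkpoint)
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model     = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image  = Image.open("sales_chart.png").convert("RGB")   # placeholder chart image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)

predictions      = model.generate(**inputs, max_new_tokens=512)
linearised_table = processor.decode(predictions[0], skip_special_tokens=True)
print(linearised_table)   # rows separated by <0x0A>, cells by " | "; pass this to an LLM for step (2)
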

08

Pipeline Patterns for Production Document AI

Production document AI must handle heterogeneous inputs (scanned, born-digital, mixed), latency constraints, cost, and observability. Four patterns cover most real-world requirements:

Pattern 1: OCR + RAG (classic)

Use when: large corpus (>10k pages), text-centric documents, latency < 500ms for retrieval.

  1. Ingest: Marker/Surya OCR + layout → Markdown per page
  2. Chunk: page-level or paragraph-level (512 tokens)
  3. Embed: OpenAI text-embedding-3-small or BGE-M3
  4. Store: pgvector or Qdrant
  5. Query: dense retrieval + BM25 hybrid + reranker (Cohere Rerank-3)
  6. Generate: GPT-4o / Claude 3.5 with retrieved chunks in context
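
A minimal sketch of the embed-and-retrieve core of steps 3-5, assuming BAAI/bge-m3 loads via sentence-transformers (any dense embedder works the same way) and using a brute-force cosine search in place of pgvector/Qdrant; chunks and query are illustrative:

# Minimal embed-and-retrieve core for steps 3-5 (brute-force stand-in for a vector DB)
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-m3")

chunks = [
    "Q3 2023 revenue was EUR 14.2M, up 8% year on year.",
    "The lease agreement terminates on 31 December 2026.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)   # [N, dim]

query     = "What was revenue in Q3 2023?"
query_vec = embedder.encode([query], normalize_embeddings=True)   # [1, dim]

# Cosine similarity reduces to a dot product on L2-normalised vectors
scores = chunk_vecs @ query_vec.T
top_k  = np.argsort(-scores[:, 0])[:5]
for i in top_k:
    print(f"{scores[i, 0]:.3f}  {chunks[i]}")
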

Pattern 2: ColPali + VLM (vision-native)

Use when: complex layouts, charts, mixed text/image, multilingual scans.

  1. Ingest: render each PDF page to 448px PNG; run colpali-v1.2 → store patch embeddings
  2. Query: encode query with ColPali; MaxSim retrieve top-K pages
  3. Generate: pass top-K page images directly to Qwen2-VL-72B or GPT-4V as multi-image prompt
  4. Post-process: extract structured fields from VLM response

Pattern 3: Routing by document type

Use when: heterogeneous corpus with predictable document types (invoices, contracts, reports).

  • Classify page: VLM + prompt or fine-tuned LayoutLMv3 classifier
  • Invoice: route to KIE model (UDOP or Donut fine-tuned on CORD)
  • Contract: route to OCR + clause extraction LLM
  • Chart-heavy report: route to ColPali + DePlot pipeline
  • Benefits: each route can be optimised independently; observability per type
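
A skeletal routing sketch; classify_page and the handler functions are hypothetical stand-ins for the components named above:

# Skeletal router for Pattern 3 (all functions are placeholders to be backed by real models)
from typing import Callable

def classify_page(image_path: str) -> str:
    """Return 'invoice', 'contract' or 'chart_report' (VLM prompt or LayoutLMv3 classifier)."""
    raise NotImplementedError

def extract_invoice_fields(image_path: str) -> dict:     # UDOP / Donut fine-tuned on CORD
    raise NotImplementedError

def extract_contract_clauses(image_path: str) -> dict:   # OCR + clause-extraction LLM
    raise NotImplementedError

def extract_chart_data(image_path: str) -> dict:         # ColPali + DePlot pipeline
    raise NotImplementedError

ROUTES: dict[str, Callable[[str], dict]] = {
    "invoice": extract_invoice_fields,
    "contract": extract_contract_clauses,
    "chart_report": extract_chart_data,
}

def process_page(image_path: str) -> dict:
    doc_type = classify_page(image_path)
    if doc_type not in ROUTES:
        raise ValueError(f"No route for document type: {doc_type}")
    result = ROUTES[doc_type](image_path)
    return {"doc_type": doc_type, **result}   # tag output so metrics can be sliced per type
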

Pattern 4: Unified VLM (simplest)

Use when: low volume (<1k pages/day), highest possible accuracy required, budget not a constraint.

  • Render every page to high-res PNG (≥ 1024px)
  • Send to Claude 3.5 / GPT-4V / Qwen2-VL-72B API directly
  • Structured output via tool-use / JSON mode
  • Cost: ~$0.005–$0.02 per page depending on model and tile count
  • Latency: 2–8 sec per page including API round-trip
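
A minimal sketch of Pattern 4 with the OpenAI Python SDK: one base64-encoded page image plus JSON mode; the field names and file path are illustrative:

# Minimal Pattern 4 sketch: one high-res page image → structured JSON via JSON mode
import base64, json
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

with open("invoice_page.png", "rb") as f:   # placeholder path, rendered at >= 1024 px
    page_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract vendor_name, invoice_date and total_amount as JSON. "
                                     "Use null for any field you cannot read."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
fields = json.loads(response.choices[0].message.content)
print(fields)
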
Observability & quality gates

Always instrument document AI pipelines with: (1) per-page confidence scores from OCR or VLM (use token log-probabilities or self-evaluation prompts); (2) hallucination checks by cross-referencing extracted numbers against visible text; (3) a ground-truth hold-out set of 100–500 labelled pages per document type. ANLS score on the hold-out set should be monitored weekly — document corpus drift is real as new document templates enter production.
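
For the weekly hold-out check, a minimal sketch of the ANLS metric (average normalised Levenshtein similarity, as used by DocVQA scoring) with the standard 0.5 threshold:

# Minimal ANLS sketch: per question, take the best similarity over accepted answers,
# zeroing scores below the threshold tau
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions: list[str], ground_truths: list[list[str]], tau: float = 0.5) -> float:
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        total += best if best >= tau else 0.0
    return total / max(len(predictions), 1)

print(anls(["€142.50"], [["€142.50", "142.50 EUR"]]))   # → 1.0
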

09

What to Take Away

Series complete

You have now covered the full VLM stack: CLIP contrastive learning and the shared embedding space (Deck 01), ViT architectures and their spatial properties (Deck 02), modern VLM connector designs and evaluation (Deck 03), and document-specific pipelines (this deck). The natural next steps are the deck series on Retrieval-Augmented Generation and on Multimodal Agents, both of which build directly on these components.