Vision-Language Models Series — Presentation 04

Document AI — Donut, ColPali & Layout-Aware Models

OCR pipelines vs end-to-end vision models, Donut, ColPali vision-based PDF retrieval, LayoutLM, UDOP, modern VLMs as document processors, chart-to-text, and production PDF pipeline patterns.

Donut · ColPali · LayoutLM · UDOP · Document AI · OCR-Free · PDF Parsing
Pipeline: PDF Page → OCR / Vision → Layout Parse → Embed / Retrieve → Generate
00

Topics We’ll Cover

01

Documents as a Special Multimodal Target

A PDF page is not a natural image. It is a precisely typeset composition of text, tables, figures, and layout structure that encodes meaning through both visual appearance and positional arrangement. Standard NLP models see it as raw text (after OCR) and discard spatial context. Standard vision models see it as pixels and miss the semantic structure. Documents demand a hybrid understanding.

What makes documents hard

  • Multi-column layout: reading order is not left-to-right, top-to-bottom; columns must be detected and linearised
  • Tables: cell adjacency encodes relational structure invisible to flat text
  • Figures with captions: caption often appears below or beside the figure; cross-referencing requires layout
  • Formulae: LaTeX/MathML symbols require specialist tokenisation
  • Scanned documents: OCR noise + skew + font variation
  • Mixed languages: RTL text, CJK characters, ligatures

Document AI task taxonomy

  • Document classification: invoice vs contract vs receipt
  • Key information extraction (KIE): total amount, date, vendor name
  • Document Q&A (DocVQA): answer free-form Qs about a page
  • Table structure recognition: detect rows, columns, merged cells
  • Document retrieval: find the relevant page in a 1,000-page corpus
  • Chart understanding: extract data series values from bar/line/scatter plots
Why “just OCR it and feed to an LLM” often fails

Tesseract / AWS Textract OCR produces flat text that loses column order, table structure, and figure-caption links. A 10-column financial table becomes a meaningless run-on string. Vision models that reason over the page image retain the 2D layout context. The field has split into two schools: improve the OCR and layout-parsing pipeline (the Amazon, Adobe, ABBYY approach) vs train end-to-end vision models that skip OCR entirely (the Donut / ColPali approach).

02

OCR-then-LLM Pipelines — When & Why

Despite the appeal of end-to-end vision models, OCR-then-LLM pipelines remain dominant in production document AI because they provide explainable intermediate representations, modular error correction, and strong performance on well-formatted PDFs and born-digital documents.

Pipeline: PDF / Image → OCR Engine (Tesseract / Textract / Azure / Surya) → Layout Parser (Detectron2 / DocLayout-YOLO / Nougat) → Markdown/HTML (reading order + structure) → LLM (GPT-4 / Llama / Claude)

Tool | Type | Strengths | Weaknesses
Tesseract 5 | Open-source OCR | Free, offline, 100+ langs, LSTM engine | Poor on low-DPI scans; no layout; 70–80% char accuracy on handwritten text
AWS Textract | Cloud OCR + layout | Table/form extraction, bounding boxes, confidence scores | Cost at scale ($1.50/1k pages); US-region data residency constraint
Surya | Open-source, neural | Multi-language, layout-aware, benchmarks near Textract quality | GPU needed; 3× slower than Tesseract on CPU
Nougat (Meta) | End-to-end, vision | Scientific PDFs: preserves LaTeX equations, tables, figures natively | Hallucination on non-scientific docs; slow (1 page/sec on A100)
Marker | Open-source pipeline | Combines Surya OCR + layout; Markdown output; fast | Table merging heuristics can fail on complex grids

Rule of thumb

Use OCR-then-LLM when: documents are born-digital PDFs (not scanned), layout is simple (single-column, no complex tables), and you need citation of exact text spans. Use end-to-end vision models when: documents are scanned images, contain complex tables or charts, or you need to retrieve by visual similarity.

03

Donut — OCR-Free Document Understanding

Donut (Document Understanding Transformer, Kim et al., NAVER 2022) eliminates the OCR step entirely. It is an encoder-decoder transformer that takes a page image as input and produces structured text (JSON, key-value pairs, HTML) directly, treating document understanding as a conditional generation task.

Donut architecture: Swin encoder + BART-style decoder, conditioned on a task token

  • Page image: 2560×1920 px, resized + padded to model resolution
  • Swin-B encoder: 1280 patch tokens, 4-stage hierarchical; no OCR, no text input
  • BART decoder: 8 layers, 768-d, cross-attention on Swin features, teacher-forced training
  • Structured output: <s_date>2024-03-01</s_date> <s_total>€142.50</s_total> (JSON / HTML / plain text)

Task conditioning

  • DocVQA: prepend <s_docvqa><s_question>{q}</s_question> → decoder generates the answer
  • Document classification: prepend <s_rvlcdip> → decoder generates the class label token
  • KIE (receipts, forms): prepend <s_cord-v2> → decoder generates key-value JSON

Training: IIT-CDIP 11M + SynthDoG (synthetic document generator) 1M
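
A minimal inference sketch, assuming the public Hugging Face checkpoint naver-clova-ix/donut-base-finetuned-cord-v2 (receipt KIE) and a placeholder image path:

# Minimal Donut KIE sketch (assumed checkpoint: naver-clova-ix/donut-base-finetuned-cord-v2)
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model     = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2").eval()

image        = Image.open("receipt.png").convert("RGB")            # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values  # [1, 3, H, W]

# The task prompt conditions the decoder: <s_cord-v2> selects the receipt-KIE schema
task_prompt       = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
    )

# Convert the generated tag sequence (<s_total>…</s_total> etc.) into nested JSON
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))
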
Donut benchmarks & limitations

DocVQA: 67.5% ANLS (vs Textract + LLM pipeline ~85%). CORD (receipt KIE): 91.3% F1. RVL-CDIP document classification: 95.3%. Donut excels at templated forms and receipts, but struggles on multi-page documents, complex tables, and low-DPI scans. The Swin encoder serialises 2D layout information but cannot re-read patches; it cannot handle text that wraps across the encoder’s receptive field boundaries at coarse scales.

04

ColPali — Vision-Based Document Retrieval

ColPali (Faysse et al., 2024) addresses the document retrieval problem: given a text query, find the relevant page(s) in a large PDF corpus. Traditional approaches OCR the PDFs, embed the text, and retrieve. ColPali embeds each page as a set of patch-level visual embeddings using a PaliGemma-based VLM, then scores queries against patches via late interaction (the ColBERT mechanism).

Indexing (offline)

For each document page:

  1. Render to 448×448 image
  2. Encode with the ColPali model (PaliGemma-3B backbone) → 1024 patch embeddings of 128-d each
  3. Store all 1024 patch embeddings per page (L2-normalised)
  4. Index: each page = 1024 × 128-d = 131,072 floats = 0.5 MB at fp32

Retrieval (online)

For each query string:

  1. Tokenise → encode the query through the same ColPali model → M query token embeddings (128-d)
  2. MaxSim scoring (ColBERT-style): for each query token, find max cosine similarity across all 1024 page patches; sum over query tokens
  3. Score(Q, P) = Σ_{q ∈ Q} max_{p ∈ P} cos(q, p)
  4. Rank pages by score; return top-K
ColPali inference with colpali-engine library
from colpali_engine.models import ColPali, ColPaliProcessor
import torch
from PIL import Image

model     = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Index: encode page images (page_paths = rendered PDF pages; placeholder filenames here)
page_paths = ["page_001.png", "page_002.png", "page_003.png"]
pages  = [Image.open(p) for p in page_paths]
inputs = processor.process_images(pages)
with torch.no_grad():
    page_embeds = model(**inputs)   # [N_pages, n_patch_tokens, 128]

# Query: encode the text query with the same model
query_input = processor.process_queries(["What is the total revenue for Q3 2023?"])
with torch.no_grad():
    query_embed = model(**query_input)  # [1, M_q, 128]

# MaxSim scoring (ColBERT-style late interaction)
scores    = processor.score_multi_vector(query_embed, page_embeds)  # [1, N_pages]
top_pages = scores.topk(min(5, scores.shape[1])).indices
ViDoRe benchmark results

On ViDoRe (Visual Document Retrieval benchmark, 10 diverse PDF corpora), ColPali achieves NDCG@5 = 89.3 vs BM25 = 44.1, DPR-text (OCR → dense embed) = 74.3, and BGE-M3 (text) = 83.5. ColPali is the first retrieval model to significantly outperform text-based retrieval on document pages containing tables, charts, and complex layouts. The key reason: it does not require accurate OCR to retrieve — layout, colour, and chart shape contribute to the match score.

05

Layout-Aware Models — LayoutLM, UDOP

Before end-to-end vision models, the state of the art in document understanding was to augment a BERT-style language model with explicit 2D position encodings derived from OCR bounding boxes. This allowed the model to reason about both word content and spatial layout simultaneously.

LayoutLM v1

  • BERT token embeddings + 2D position embeddings (x0, y0, x1, y1 from OCR) + image region embeddings (ResNet CNN)
  • → FUNSD F1: 79.3%

LayoutLM v2

  • Text + layout + image jointly pre-trained
  • Visual-layout pre-training on IIT-CDIP 11M
  • Spatial-aware self-attention (relative 2D bias)
  • → FUNSD F1: 82.8%, DocVQA ANLS: 78.1%

LayoutLM v3

  • Unified text-image pre-training (no CNN)
  • Masked layout-language modelling
  • Word-patch alignment objective
  • → FUNSD F1: 92.1%, DocVQA ANLS: 83.4%

UDOP (2023)

  • T5-based encoder-decoder
  • Joint text, layout (x0, y0, x1, y1), and image tokens
  • Unified seq-to-seq for all doc tasks
  • → DocVQA ANLS: 84.7%, FUNSD F1: 93.3%
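
To make the OCR + 2D position idea concrete, a minimal sketch using the Hugging Face LayoutLMv3 classes; the base checkpoint is shown (its token-classification head is untrained, so swap in a FUNSD-fine-tuned checkpoint for meaningful labels), and apply_ocr=True assumes pytesseract is installed:

# Minimal LayoutLMv3 sketch: the processor runs OCR and builds token ids plus
# normalised (0-1000) bounding boxes for the 2D position embeddings
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

ckpt      = "microsoft/layoutlmv3-base"   # classification head untrained; use a FUNSD-fine-tuned model in practice
processor = LayoutLMv3Processor.from_pretrained(ckpt, apply_ocr=True)
model     = LayoutLMv3ForTokenClassification.from_pretrained(ckpt, num_labels=7).eval()

image    = Image.open("form.png").convert("RGB")   # placeholder path to a scanned form
encoding = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**encoding).logits              # [1, seq_len, num_labels]

predictions = logits.argmax(-1).squeeze(0)         # one BIO label id per sub-word token
print(predictions[:20])
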
UDOP’s “joint text and layout reconstruction”

UDOP pre-trains on a novel objective: given a page with some text tokens masked, predict both the masked text and the bounding box coordinates of those tokens. This forces the model to learn a tighter binding between what a word says and where it appears — critical for KIE tasks where the same word (e.g. “Total”) can be a column header or a label depending on its position in the table.

06

Modern VLMs as Document Processors

As of 2024–2025, the best results on DocVQA and similar benchmarks come not from purpose-built document models but from general VLMs with high-resolution tiling. The key is getting enough tokens per page to read the text without explicit OCR.

Model | DocVQA ANLS | InfoVQA | ChartQA | OCRBench | Approach
LayoutLMv3-Large | 83.4% | 46.0% | – | – | OCR + 2D pos embed
UDOP | 84.7% | 58.5% | – | – | OCR + T5 seq2seq
Donut-base | 67.5% | 11.6% | – | – | OCR-free (Swin+BART)
InternVL2-76B | 94.1% | 88.7% | 88.4% | 825 | High-res tiling (40 tiles)
Qwen2-VL-72B | 96.5% | 92.2% | 88.3% | 866 | Native res, 2D-RoPE
GPT-4V | 88.4% | 75.1% | 78.5% | 645 | Unknown (tiling probable)
Claude 3.5 Sonnet | ~91% | – | – | – | Native multimodal
Why Qwen2-VL leads on DocVQA

Qwen2-VL’s dynamic resolution encoding with 2D-RoPE allows it to process a full A4 page at up to 1280px resolution with no padding artefacts. InternVL2-76B achieves comparable results via 40 tiles of 448px (equivalent effective resolution ~2240px). Both avoid the resolution ceiling of fixed 448px single-image encoders, which caps DocVQA at ~85% ANLS. The LLM backbone scale (72B vs 7B) also matters: larger models are better at interpreting partially degraded or ambiguous text regions without explicit OCR grounding.
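
A minimal DocVQA-style sketch, assuming the 7B-Instruct checkpoint and the qwen_vl_utils helper shipped alongside the model card; the page path and question are placeholders:

# Minimal DocVQA-style sketch with Qwen2-VL (7B-Instruct checkpoint) via Hugging Face
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
# min_pixels / max_pixels bound the dynamic-resolution token budget per image
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/page_017.png"},  # placeholder page render
        {"type": "text", "text": "What is the total revenue for Q3 2023? Quote the exact figure."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)], skip_special_tokens=True
)[0]
print(answer)
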

07

Tables & Charts — Chart-to-Text

Tables and charts are the hardest document elements for both OCR pipelines and vision models. A table is a 2D relational structure; a chart is a visual encoding of quantitative data. Both require understanding geometric structure, not just reading text.

Table structure recognition

TableTransformer (Smock et al., Microsoft 2022) uses DETR (DEtection TRansformer) to detect table bounding boxes and then predict row/column/cell structure as a set of objects. Trained on PubTables-1M (995K annotated tables from scientific publications) and FinTabNet (112K financial tables). TableTransformer achieves 92.3% TEDS-S on PubTabNet — beating earlier CNN approaches by 5pp.
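
A minimal table-detection sketch using the microsoft/table-transformer-detection checkpoint on the Hugging Face Hub (the page image path is a placeholder); structure recognition on the cropped table region works the same way with the companion structure-recognition checkpoint:

# Minimal table-detection sketch with Table Transformer (DETR-based)
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection").eval()

image  = Image.open("report_page.png").convert("RGB")   # placeholder path to a page render
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above 0.9 confidence, rescaled to the original image size
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(v, 1) for v in box.tolist()])
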

Chart-to-text (DePlot, Matcha)

DePlot (Liu et al., Google 2023)

Fine-tunes PaLI on chart linearisation: given a chart image, output a markdown table of the underlying data. The table is then passed to an LLM (PaLM) for Q&A. Two-step pipeline separates visual parsing from language reasoning.

  • ChartQA: 56.0% (one-shot, no fine-tuning on ChartQA)
  • The linearised table is inspectable — useful for explainability

Matcha (Liu et al., Google 2023)

Fine-tunes PaLI end-to-end on chart Q&A, chart-to-text, and chart data extraction jointly. No separate linearisation step; model answers directly from the chart image.

  • ChartQA: 90.2% (fine-tuned)
  • FigureQA: 93.5%
  • Trained on ChartQA + PlotQA + Chart-to-Text dataset (30K samples)
Production chart extraction recipe

For a production chart extraction system that does not require fine-tuning: (1) use DePlot (or Qwen2-VL with an “extract the data table from this chart” prompt) to get a structured markdown table, (2) pass the table to an LLM for Q&A or analysis, (3) validate numeric outputs against the extracted values. This two-step approach is more debuggable than end-to-end generation and lets the LLM reason over clean tabular data rather than raw pixels.
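
A minimal sketch of step (1) using the google/deplot checkpoint (a Pix2Struct model) via Hugging Face; the chart path is a placeholder:

# Minimal chart-to-table sketch with DePlot (a Pix2Struct checkpoint)
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model     = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image  = Image.open("sales_chart.png").convert("RGB")   # placeholder chart image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)

predictions      = model.generate(**inputs, max_new_tokens=512)
linearised_table = processor.decode(predictions[0], skip_special_tokens=True)
print(linearised_table)   # rows separated by <0x0A>, cells by " | "; pass this to an LLM for step (2)
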

08

Pipeline Patterns for Production Document AI

Production document AI must handle heterogeneous inputs (scanned, born-digital, mixed), latency constraints, cost, and observability. Four patterns cover most real-world requirements:

Pattern 1: OCR + RAG (classic)

Use when: large corpus (>10k pages), text-centric documents, latency < 500ms for retrieval.

  1. Ingest: Marker/Surya OCR + layout → Markdown per page
  2. Chunk: page-level or paragraph-level (512 tokens)
  3. Embed: OpenAI text-embedding-3-small or BGE-M3
  4. Store: pgvector or Qdrant
  5. Query: dense retrieval + BM25 hybrid + reranker (Cohere Rerank-3)
  6. Generate: GPT-4o / Claude 3.5 with retrieved chunks in context
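
A minimal sketch of the embed-and-retrieve core of steps 3-5, assuming BAAI/bge-m3 loads via sentence-transformers (any dense embedder works the same way) and using a brute-force cosine search in place of pgvector/Qdrant; chunks and query are illustrative:

# Minimal embed-and-retrieve core for steps 3-5 (brute-force stand-in for a vector DB)
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-m3")

chunks = [
    "Q3 2023 revenue was EUR 14.2M, up 8% year on year.",
    "The lease agreement terminates on 31 December 2026.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)   # [N, dim]

query     = "What was revenue in Q3 2023?"
query_vec = embedder.encode([query], normalize_embeddings=True)   # [1, dim]

# Cosine similarity reduces to a dot product on L2-normalised vectors
scores = chunk_vecs @ query_vec.T
top_k  = np.argsort(-scores[:, 0])[:5]
for i in top_k:
    print(f"{scores[i, 0]:.3f}  {chunks[i]}")
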

Pattern 2: ColPali + VLM (vision-native)

Use when: complex layouts, charts, mixed text/image, multilingual scans.

  1. Ingest: render each PDF page to 448px PNG; run colpali-v1.2 → store patch embeddings
  2. Query: encode query with ColPali; MaxSim retrieve top-K pages
  3. Generate: pass top-K page images directly to Qwen2-VL-72B or GPT-4V as multi-image prompt
  4. Post-process: extract structured fields from VLM response

Pattern 3: Routing by document type

Use when: heterogeneous corpus with predictable document types (invoices, contracts, reports).

  • Classify page: VLM + prompt or fine-tuned LayoutLMv3 classifier
  • Invoice: route to KIE model (UDOP or Donut fine-tuned on CORD)
  • Contract: route to OCR + clause extraction LLM
  • Chart-heavy report: route to ColPali + DePlot pipeline
  • Benefits: each route can be optimised independently; observability per type
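
A skeletal routing sketch; classify_page and the handler functions are hypothetical stand-ins for the components named above:

# Skeletal router for Pattern 3 (all functions are placeholders to be backed by real models)
from typing import Callable

def classify_page(image_path: str) -> str:
    """Return 'invoice', 'contract' or 'chart_report' (VLM prompt or LayoutLMv3 classifier)."""
    raise NotImplementedError

def extract_invoice_fields(image_path: str) -> dict:     # UDOP / Donut fine-tuned on CORD
    raise NotImplementedError

def extract_contract_clauses(image_path: str) -> dict:   # OCR + clause-extraction LLM
    raise NotImplementedError

def extract_chart_data(image_path: str) -> dict:         # ColPali + DePlot pipeline
    raise NotImplementedError

ROUTES: dict[str, Callable[[str], dict]] = {
    "invoice": extract_invoice_fields,
    "contract": extract_contract_clauses,
    "chart_report": extract_chart_data,
}

def process_page(image_path: str) -> dict:
    doc_type = classify_page(image_path)
    if doc_type not in ROUTES:
        raise ValueError(f"No route for document type: {doc_type}")
    result = ROUTES[doc_type](image_path)
    return {"doc_type": doc_type, **result}   # tag output so metrics can be sliced per type
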

Pattern 4: Unified VLM (simplest)

Use when: low volume (<1k pages/day), highest possible accuracy required, budget not a constraint.

  • Render every page to high-res PNG (≥ 1024px)
  • Send to Claude 3.5 / GPT-4V / Qwen2-VL-72B API directly
  • Structured output via tool-use / JSON mode
  • Cost: ~$0.005–$0.02 per page depending on model and tile count
  • Latency: 2–8 sec per page including API round-trip
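
A minimal sketch of Pattern 4 with the OpenAI Python SDK: one base64-encoded page image plus JSON mode; the field names and file path are illustrative:

# Minimal Pattern 4 sketch: one high-res page image → structured JSON via JSON mode
import base64, json
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

with open("invoice_page.png", "rb") as f:   # placeholder path, rendered at >= 1024 px
    page_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract vendor_name, invoice_date and total_amount as JSON. "
                                     "Use null for any field you cannot read."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
        ],
    }],
)
fields = json.loads(response.choices[0].message.content)
print(fields)
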
Observability & quality gates

Always instrument document AI pipelines with: (1) per-page confidence scores from OCR or VLM (use token log-probabilities or self-evaluation prompts); (2) hallucination checks by cross-referencing extracted numbers against visible text; (3) a ground-truth hold-out set of 100–500 labelled pages per document type. ANLS score on the hold-out set should be monitored weekly — document corpus drift is real as new document templates enter production.
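
For the weekly hold-out check, a minimal sketch of the ANLS metric (average normalised Levenshtein similarity, as used by DocVQA scoring) with the standard 0.5 threshold:

# Minimal ANLS sketch: per question, take the best similarity over accepted answers,
# zeroing scores below the threshold tau
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(predictions: list[str], ground_truths: list[list[str]], tau: float = 0.5) -> float:
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, a) / max(len(p), len(a), 1)
            best = max(best, 1.0 - nl)
        total += best if best >= tau else 0.0
    return total / max(len(predictions), 1)

print(anls(["€142.50"], [["€142.50", "142.50 EUR"]]))   # → 1.0
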

09

What to Take Away

Series complete

You have now covered the full VLM stack: CLIP contrastive learning and the shared embedding space (Deck 01), ViT architectures and their spatial properties (Deck 02), modern VLM connector designs and evaluation (Deck 03), and document-specific pipelines (this deck). The natural next steps are the deck series on Retrieval-Augmented Generation and on Multimodal Agents, both of which build directly on these components.