OCR pipelines vs end-to-end vision models, Donut, ColPali vision-based PDF retrieval, LayoutLM, UDOP, modern VLMs as document processors, chart-to-text, and production PDF pipeline patterns.
A PDF page is not a natural image. It is a precisely typeset composition of text, tables, figures, and layout structure that encodes meaning through both visual appearance and positional arrangement. Standard NLP models see it as raw text (after OCR) and discard spatial context. Standard vision models see it as pixels and miss the semantic structure. Documents demand a hybrid understanding.
Tesseract / AWS Textract OCR produces flat text that loses column order, table structure, and figure-caption links. A 10-column financial table becomes a meaningless run-on string. Vision models that reason over the page image retain the 2D layout context. The field has split into two schools: improve the OCR and layout parsing pipeline (the Amazon, Adobe, and ABBYY approach) vs. train end-to-end vision models that skip OCR entirely (the Donut / ColPali approach).
Despite the appeal of end-to-end vision models, OCR-then-LLM pipelines remain dominant in production document AI because they provide explainable intermediate representations, modular error correction, and strong performance on well-formatted PDFs and born-digital documents.
| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| Tesseract 5 | Open-source OCR | Free, offline, 100+ langs, LSTM engine | Poor on low-DPI scans; no layout; 70–80% char accuracy on hand-written |
| AWS Textract | Cloud OCR + layout | Table/form extraction, bounding boxes, confidence scores | Cost at scale ($1.50/1k pages); US-region data residency constraint |
| Surya | Open-source, neural | Multi-language, layout-aware, benchmarks near Textract quality | GPU needed; 3× slower than Tesseract on CPU |
| Nougat (Meta) | End-to-end, vision | Scientific PDFs: preserves LaTeX equations, tables, figures natively | Hallucination on non-scientific docs; slow (1 page/sec on A100) |
| Marker | Open-source pipeline | Combines Surya OCR + layout; Markdown output; fast | Table merging heuristics can fail on complex grids |
Use OCR-then-LLM when: documents are born-digital PDFs (not scanned), layout is simple (single-column, no complex tables), and you need citation of exact text spans. Use end-to-end vision models when: documents are scanned images, contain complex tables or charts, or you need to retrieve by visual similarity.
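As a concrete starting point, here is a minimal sketch of the first stage of an OCR-then-LLM pipeline using pytesseract (the file name, confidence threshold, and row-bucketing heuristic are illustrative). `image_to_data` returns per-word text with bounding boxes and confidences, which is what lets downstream steps reconstruct reading order and cite exact spans:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

page = Image.open("scan_page_001.png")             # illustrative file name
data = pytesseract.image_to_data(page, output_type=Output.DICT)

words = []
for text, conf, left, top in zip(data["text"], data["conf"], data["left"], data["top"]):
    if text.strip() and float(conf) > 50:          # drop empty boxes and low-confidence noise
        words.append({"text": text, "x": left, "y": top, "conf": float(conf)})

# Naive reading order: bucket by row, then sort left-to-right. This is exactly the
# step that breaks on multi-column pages and tables, as discussed above.
words.sort(key=lambda w: (w["y"] // 20, w["x"]))
flat_text = " ".join(w["text"] for w in words)     # feed this (plus the boxes) to the LLM stage
```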
Donut (Document Understanding Transformer, Kim et al., NAVER 2022) eliminates the OCR step entirely. It is an encoder-decoder transformer that takes a page image as input and produces structured text (JSON, key-value pairs, HTML) directly, treating document understanding as a conditional generation task.
DocVQA: 67.5% ANLS (vs Textract + LLM pipeline ~85%). CORD (receipt KIE): 91.3% F1. RVL-CDIP document classification: 95.3%. Donut excels at templated forms and receipts, but struggles on multi-page documents, complex tables, and low-DPI scans. The Swin encoder serialises 2D layout information but cannot re-read patches; it cannot handle text that wraps across the encoder’s receptive field boundaries at coarse scales.
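To make the conditional-generation framing concrete, here is a hedged usage sketch via the Hugging Face API (the CORD receipt fine-tune checkpoint and the input file are assumptions; the pattern follows the standard Donut example):

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"   # receipt-KIE fine-tune
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

receipt = Image.open("receipt.jpg").convert("RGB")
pixel_values = processor(receipt, return_tensors="pt").pixel_values

# The task prompt selects the output schema; generation emits a token sequence
# that decodes straight into nested key-value JSON -- no OCR step anywhere.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids,
                             max_length=512)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()   # drop the task-start token
print(processor.token2json(sequence))    # e.g. {"menu": [...], "total": {...}}
```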
ColPali (Faysse et al., 2024) addresses the document retrieval problem: given a text query, find the relevant page(s) in a large PDF corpus. Traditional approaches OCR the PDFs, embed the text, and retrieve. ColPali embeds each page as a set of patch-level visual embeddings using a PaliGemma-based VLM, then scores queries against patches via late interaction (the ColBERT mechanism).
For each document page, the index stores a grid of patch-level embeddings (roughly 1,024 vectors of dimension 128 per page). For each query string, the model produces one 128-dimensional embedding per query token. Retrieval then scores every query token against every page patch:
from colpali_engine.models import ColPali, ColPaliProcessor
import torch
from PIL import Image

model = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16)
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Index: encode page images
pages = [Image.open(p) for p in page_paths]
inputs = processor.process_images(pages)
with torch.no_grad():
    page_embeds = model(**inputs)  # [N_pages, 1024, 128]

# Query
query_input = processor.process_queries(["What is the total revenue for Q3 2023?"])
with torch.no_grad():
    query_embed = model(**query_input)  # [1, M_q, 128]

# MaxSim scoring (ColBERT)
scores = processor.score_multi_vector(query_embed, page_embeds)  # [1, N_pages]
top_pages = scores.topk(5).indices
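For reference, the MaxSim score that `score_multi_vector` returns can be written down in a few lines. A minimal sketch, ignoring padding and batching details:

```python
import torch

def maxsim(query_embed: torch.Tensor, page_embeds: torch.Tensor) -> torch.Tensor:
    """Late-interaction score: for each query token, take its best-matching page
    patch, then sum those maxima over the query tokens.

    query_embed: [M_q, D] token embeddings for one query
    page_embeds: [N_pages, N_patches, D] patch embeddings for every indexed page
    returns:     [N_pages] one score per page
    """
    sim = torch.einsum("qd,npd->nqp", query_embed, page_embeds)  # all token-patch dot products
    return sim.max(dim=-1).values.sum(dim=-1)
```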
On ViDoRe (Visual Document Retrieval benchmark, 10 diverse PDF corpora), ColPali achieves NDCG@5 = 89.3 vs BM25 = 44.1, DPR-text (OCR → dense embed) = 74.3, and BGE-M3 (text) = 83.5. ColPali is the first retrieval model to significantly outperform text-based retrieval on document pages containing tables, charts, and complex layouts. The key reason: it does not require accurate OCR to retrieve — layout, colour, and chart shape contribute to the match score.
Before end-to-end vision models, the state of the art in document understanding was to augment a BERT-style language model with explicit 2D position encodings derived from OCR bounding boxes. This allowed the model to reason about both word content and spatial layout simultaneously.
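A minimal sketch of that pattern with LayoutLMv3 (the checkpoint, label count, and OCR words/boxes below are illustrative; boxes are in the 0-1000 normalised coordinates the model expects):

```python
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base",
                                                         num_labels=5)   # e.g. BIO field tags

image = Image.open("invoice_page.png").convert("RGB")
words = ["Invoice", "Total", "$1,250.00"]                                # OCR word strings
boxes = [[80, 40, 220, 70], [60, 600, 140, 630], [500, 600, 640, 630]]  # per-word boxes, 0-1000

# The processor tokenises the words, aligns each subword with its bounding box,
# and packs the page image for the visual branch.
inputs = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # [1, seq_len, num_labels]: one prediction per subword
```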
UDOP pre-trains on a novel objective: given a page with some text tokens masked, predict both the masked text and the bounding box coordinates of those tokens. This forces the model to learn a tighter binding between what a word says and where it appears — critical for KIE tasks where the same word (e.g. “Total”) can be a column header or a label depending on its position in the table.
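A schematic illustration of what such a layout-aware masked-prediction pair looks like (the serialisation format here is invented for exposition, not UDOP's actual vocabulary): the model must recover both the hidden word and its normalised box from the surrounding words, boxes, and page image.

```python
# Schematic only: word strings paired with (x1, y1, x2, y2) boxes; the masked slot
# hides both the text and its coordinates, and the target restores both.
masked_input = [
    ("Invoice",  (80, 40, 220, 70)),
    ("<mask_0>", None),                 # hidden word AND hidden box
    ("$1,250.00", (500, 600, 640, 630)),
]
target = {"<mask_0>": ("Total", (60, 600, 140, 630))}
```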
As of 2024–2025, the best results on DocVQA and similar benchmarks come not from purpose-built document models but from general VLMs with high-resolution tiling. The key is getting enough tokens per page to read the text without explicit OCR.
| Model | DocVQA ANLS | InfoVQA | ChartQA | OCRBench | Approach |
|---|---|---|---|---|---|
| LayoutLMv3-Large | 83.4% | 46.0% | — | — | OCR + 2D pos embed |
| UDOP | 84.7% | 58.5% | — | — | OCR + T5 seq2seq |
| Donut-base | 67.5% | 11.6% | — | — | OCR-free (Swin+BART) |
| InternVL2-76B | 94.1% | 88.7% | 88.4% | 825 | High-res tiling (40 tiles) |
| Qwen2-VL-72B | 96.5% | 92.2% | 88.3% | 866 | Native res, 2D-RoPE |
| GPT-4V | 88.4% | 75.1% | 78.5% | 645 | Unknown (tiling probable) |
| Claude 3.5 Sonnet | ~91% | — | — | — | Native multimodal |
Qwen2-VL’s dynamic resolution encoding with 2D-RoPE allows it to process a full A4 page at up to 1280px resolution with no padding artefacts. InternVL2-76B achieves comparable results via 40 tiles of 448px (equivalent effective resolution ~2240px). Both avoid the resolution ceiling of fixed 448px single-image encoders, which caps DocVQA at ~85% ANLS. LLM backbone scale (72B vs 7B) also matters: larger models are better at reading partially degraded text regions without explicit OCR grounding.
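A quick back-of-the-envelope helper makes the resolution arithmetic concrete (the 448px tile size comes from the table above; the A4 render DPI is an assumption, and real preprocessors add aspect-ratio matching and a global thumbnail tile):

```python
import math

def tile_grid(page_w_px: int, page_h_px: int, tile: int = 448) -> tuple[int, int, int]:
    """How many tile x tile crops cover a page of the given pixel size."""
    cols, rows = math.ceil(page_w_px / tile), math.ceil(page_h_px / tile)
    return cols, rows, cols * rows

# A4 rendered at 150 DPI is roughly 1240 x 1754 px.
cols, rows, n = tile_grid(1240, 1754)
print(cols, rows, n)                 # 3 x 4 grid, 12 tiles
print(cols * 448, rows * 448)        # effective canvas ~1344 x 1792 px

# Rendering at ~250 DPI (~2066 x 2922 px) pushes this to a 5 x 7 grid (35 tiles),
# approaching the 40-tile budget quoted for InternVL2 above.
```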
Tables and charts are the hardest document elements for both OCR pipelines and vision models. A table is a 2D relational structure; a chart is a visual encoding of quantitative data. Both require understanding geometric structure, not just reading text.
TableTransformer (Smock et al., Microsoft 2022) uses DETR (DEtection TRansformer) to detect table bounding boxes and then predict row/column/cell structure as a set of objects. Trained on PubTables-1M (995K annotated tables from scientific publications) and FinTabNet (112K financial tables). TableTransformer achieves 92.3% TEDS-S on PubTabNet — beating earlier CNN approaches by 5pp.
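A hedged sketch of running the structure-recognition checkpoint through the Hugging Face object-detection API (the crop path and confidence threshold are illustrative; the model emits the table, rows, columns, and spanning cells as detected objects):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

ckpt = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TableTransformerForObjectDetection.from_pretrained(ckpt)

# Input is a pre-cropped table region (the companion detection checkpoint finds tables on full pages).
table_crop = Image.open("table_crop.png").convert("RGB")
inputs = processor(images=table_crop, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode DETR-style predictions into labelled boxes: table, rows, columns, spanning cells.
target_sizes = torch.tensor([table_crop.size[::-1]])   # (height, width)
detections = processor.post_process_object_detection(outputs, threshold=0.7,
                                                     target_sizes=target_sizes)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2),
          [round(v) for v in box.tolist()])
```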
One line of work fine-tunes PaLI on chart linearisation: given a chart image, the model outputs a markdown table of the underlying data; the table is then passed to an LLM (PaLM) for Q&A. This two-step pipeline separates visual parsing from language reasoning.
A second line fine-tunes PaLI end-to-end on chart Q&A, chart-to-text, and chart data extraction jointly: there is no separate linearisation step, and the model answers directly from the chart image.
For a production chart extraction system that does not require fine-tuning: (1) use DePlot (or Qwen2-VL with an “extract the data table from this chart” prompt) to get a structured markdown table, (2) pass the table to an LLM for Q&A or analysis, (3) validate numeric outputs against the extracted values. This two-step approach is more debuggable than end-to-end and allows the LLM to reason without the visual grounding constraint.
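A hedged sketch of step (1) with the public DePlot checkpoint (the chart file name is illustrative; the prompt string follows the model card's plot-to-table usage):

```python
import torch
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

chart = Image.open("q3_revenue_chart.png").convert("RGB")
inputs = processor(images=chart,
                   text="Generate underlying data table of the figure below:",
                   return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=512)
table = processor.decode(generated[0], skip_special_tokens=True)

# Step 2: put `table` in the LLM prompt for Q&A or analysis.
# Step 3: before surfacing an answer, check that every number the LLM quotes
#         actually appears in `table`; simple string/number matching catches
#         most hallucinated values.
print(table)
```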
Production document AI must handle heterogeneous inputs (scanned, born-digital, mixed), latency constraints, cost, and observability. Four patterns cover most real-world requirements:
1. OCR → text-embedding retrieval. Use when: large corpus (>10k pages), text-centric documents, retrieval latency under 500 ms. Embed the OCR text with text-embedding-3-small or BGE-M3 and retrieve against the text index.
2. Vision-based page retrieval. Use when: complex layouts, charts, mixed text/image, multilingual scans. Encode pages with colpali-v1.2 and store the patch embeddings for late-interaction scoring.
3. Classify first, then route to a type-specific extractor. Use when: heterogeneous corpus with predictable document types (invoices, contracts, reports).
4. Send each page to a frontier VLM. Use when: low volume (<1k pages/day), highest possible accuracy required, budget not a constraint.
Always instrument document AI pipelines with: (1) per-page confidence scores from OCR or VLM (use token log-probabilities or self-evaluation prompts); (2) hallucination checks by cross-referencing extracted numbers against visible text; (3) a ground-truth hold-out set of 100–500 labelled pages per document type. ANLS score on the hold-out set should be monitored weekly — document corpus drift is real as new document templates enter production.
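For the weekly hold-out check, ANLS is simple enough to compute in-house. A minimal sketch using the standard DocVQA definition (normalised edit distance with a 0.5 threshold):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best normalised similarity over the reference answers,
    zeroed when the normalised distance exceeds tau. Average over questions
    (and over the weekly hold-out set) to get the tracked score."""
    def score(ans: str) -> float:
        a, b = prediction.strip().lower(), ans.strip().lower()
        nl = levenshtein(a, b) / max(len(a), len(b), 1)
        return 1 - nl if nl < tau else 0.0
    return max(score(ans) for ans in answers)

assert anls("Q3 2023", ["Q3 2023"]) == 1.0   # exact match
assert anls("$1,250", ["$9,999"]) == 0.0     # too far: zeroed by the threshold
```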
You have now covered the full VLM stack: CLIP contrastive learning and the shared embedding space (Deck 01), ViT architectures and their spatial properties (Deck 02), modern VLM connector designs and evaluation (Deck 03), and document-specific pipelines (this deck). The natural next steps are the deck series on Retrieval-Augmented Generation and on Multimodal Agents, both of which build directly on these components.