
Vision-Language Models

From CLIP and ViTs to modern VLMs (Llama 3.2 Vision, Qwen2-VL, InternVL, Pixtral, Gemini, Claude) and the document AI stack (ColPali, Donut).

CLIP · SigLIP · ViT · VLM · ColPali · Document AI

Presentations in This Series

  1. CLIP & Contrastive →
    CLIP architecture and the InfoNCE loss; OpenCLIP, EVA-CLIP, SigLIP (sigmoid loss); scaling curves; zero-shot classification & retrieval; the building block for VLMs. A loss sketch follows this list.
  2. Vision Transformers →
    From CNNs to ViT via patch embeddings; DeiT, Swin, DINOv2, SAM; patch-size and resolution trade-offs; ViT as the VLM image encoder. A patch-embedding sketch follows this list.
  3. Modern VLMs →
    Three connector architectures (Q-Former, Perceiver Resampler, simple projection); Llama 3.2 Vision, Qwen2-VL, InternVL, Pixtral, Gemini, Claude; AnyRes; token budgets; evals (MMMU, MathVista, ChartQA, OCRBench). A projection sketch follows this list.
  4. Document AI →
    OCR-then-LLM pipelines vs. OCR-free models (Donut); ColPali for vision-based retrieval; LayoutLM, UDOP; tables & charts; production patterns. A late-interaction scoring sketch follows this list.
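
Code Sketches

The sketches below illustrate the core mechanisms named in the series. All are minimal PyTorch illustrations with toy inputs and placeholder dimensions, not the reference implementations.

First, the two contrastive objectives from the CLIP deck: InfoNCE treats each batch row of the image-text similarity matrix as an N-way classification, while the SigLIP loss replaces the softmax with an independent sigmoid per pair (the temperature `t` and bias `b` here mirror the paper's init values but are illustrative).

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embedding pairs."""
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors -> dot = cosine sim
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0))        # matched pairs on the diagonal
    # Each row (image->text) and column (text->image) is an N-way classification.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """SigLIP-style loss: every (i, j) pair is an independent binary decision,
    so no batch-wide softmax is needed. Normalization over pairs is a sketch choice."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

imgs, txts = torch.randn(8, 512), torch.randn(8, 512)  # toy batch of 8 pairs
print(clip_infonce_loss(imgs, txts).item(), siglip_sigmoid_loss(imgs, txts).item())
```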
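Next, the patch-embedding step from the ViT deck. A stride-p convolution splits the image into non-overlapping p×p patches and projects each one to a token; the toy shapes below make the resolution trade-off concrete, since halving the patch size quadruples the token count.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Non-overlapping patch embedding: a stride=patch_size conv is equivalent to
    'flatten each patch, then apply a shared Linear layer'."""
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14): one vector per patch
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): a sequence of tokens

print(PatchEmbed(patch_size=16)(torch.randn(1, 3, 224, 224)).shape)  # 196 tokens
print(PatchEmbed(patch_size=8)(torch.randn(1, 3, 224, 224)).shape)   # 784 tokens: 4x cost
```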
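For the modern-VLMs deck, the "simple projection" connector: a small MLP maps vision-encoder patch features into the LLM's token-embedding space so image tokens can be concatenated with text tokens. The dimensions and the 576-token count are placeholders in the LLaVA style, not taken from any specific model.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """'Simple projection' connector: a two-layer MLP from vision-encoder
    feature width to LLM embedding width. Dimensions are placeholders."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):   # (B, num_patches, vision_dim)
        # Output rows act as 'image tokens', concatenated with the text
        # token embeddings before the LLM forward pass.
        return self.mlp(patch_feats)  # (B, num_patches, llm_dim)

# 576 patch features per single image crop; AnyRes-style tiling multiplies this
# per tile, which is where the token-budget pressure comes from.
print(VisionProjector()(torch.randn(1, 576, 1024)).shape)
```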
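Finally, for the document AI deck, ColPali's ColBERT-style late-interaction scoring: each query-token embedding is compared against every page-patch embedding, the best match per query token is kept (MaxSim), and the per-token maxima are summed to score the page. The embedding width and patch counts below are placeholders, not ColPali's exact shapes.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb, page_emb):
    """Late-interaction (MaxSim) scoring over L2-normalized multi-vector embeddings."""
    sim = query_emb @ page_emb.t()      # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()  # best patch per query token, summed

# Toy index: a 12-token query ranked against two pages of patch embeddings.
query = F.normalize(torch.randn(12, 128), dim=-1)
pages = [F.normalize(torch.randn(1024, 128), dim=-1) for _ in range(2)]
ranked = sorted(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]), reverse=True)
print(ranked)  # page indices, best match first
```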