
Vision-Language Models

From CLIP and ViTs to modern VLMs (Llama 3.2 Vision, Qwen2-VL, InternVL, Pixtral, Gemini, Claude) and the document AI stack (ColPali, Donut).

CLIP · SigLIP · ViT · VLM · ColPali · Document AI

Presentations in This Series

  1. CLIP & Contrastive →
    CLIP architecture and the InfoNCE loss; OpenCLIP, EVA-CLIP, SigLIP (sigmoid loss); scaling curves; zero-shot classification & retrieval; the building block for VLMs. A loss sketch follows this list.
  2. Vision Transformers →
    From CNNs to ViT via patch embeddings; DeiT, Swin, DINOv2, SAM; patch-size and resolution trade-offs; ViT as the VLM image encoder. A patch-embedding sketch follows this list.
  3. Modern VLMs →
    Three connector architectures (Q-Former, Perceiver Resampler, simple projection); Llama 3.2 Vision, Qwen2-VL, InternVL, Pixtral, Gemini, Claude; AnyRes; token budgets; evals (MMMU, MathVista, ChartQA, OCRBench). A projection sketch follows this list.
  4. Document AI →
    OCR-then-LLM pipelines vs. OCR-free models (Donut); ColPali for vision-based retrieval; LayoutLM, UDOP; tables & charts; production patterns. A late-interaction scoring sketch follows this list.
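
Code Sketches

The sketches below illustrate the core mechanisms named in the series. All are minimal PyTorch illustrations with toy inputs and placeholder dimensions, not the reference implementations.

First, the two contrastive objectives from the CLIP deck: InfoNCE treats each batch row of the image-text similarity matrix as an N-way classification, while the SigLIP loss replaces the softmax with an independent sigmoid per pair (the temperature `t` and bias `b` here mirror the paper's init values but are illustrative).

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embedding pairs."""
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors -> dot = cosine sim
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0))        # matched pairs on the diagonal
    # Each row (image->text) and column (text->image) is an N-way classification.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(image_emb, text_emb, t=10.0, b=-10.0):
    """SigLIP-style loss: every (i, j) pair is an independent binary decision,
    so no batch-wide softmax is needed. Normalization over pairs is a sketch choice."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

imgs, txts = torch.randn(8, 512), torch.randn(8, 512)  # toy batch of 8 pairs
print(clip_infonce_loss(imgs, txts).item(), siglip_sigmoid_loss(imgs, txts).item())
```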
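Next, the patch-embedding step from the ViT deck. A stride-p convolution splits the image into non-overlapping p×p patches and projects each one to a token; the toy shapes below make the resolution trade-off concrete, since halving the patch size quadruples the token count.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Non-overlapping patch embedding: a stride=patch_size conv is equivalent to
    'flatten each patch, then apply a shared Linear layer'."""
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14): one vector per patch
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): a sequence of tokens

print(PatchEmbed(patch_size=16)(torch.randn(1, 3, 224, 224)).shape)  # 196 tokens
print(PatchEmbed(patch_size=8)(torch.randn(1, 3, 224, 224)).shape)   # 784 tokens: 4x cost
```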
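For the modern-VLMs deck, the "simple projection" connector: a small MLP maps vision-encoder patch features into the LLM's token-embedding space so image tokens can be concatenated with text tokens. The dimensions and the 576-token count are placeholders in the LLaVA style, not taken from any specific model.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """'Simple projection' connector: a two-layer MLP from vision-encoder
    feature width to LLM embedding width. Dimensions are placeholders."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):   # (B, num_patches, vision_dim)
        # Output rows act as 'image tokens', concatenated with the text
        # token embeddings before the LLM forward pass.
        return self.mlp(patch_feats)  # (B, num_patches, llm_dim)

# 576 patch features per single image crop; AnyRes-style tiling multiplies this
# per tile, which is where the token-budget pressure comes from.
print(VisionProjector()(torch.randn(1, 576, 1024)).shape)
```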
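Finally, for the document AI deck, ColPali's ColBERT-style late-interaction scoring: each query-token embedding is compared against every page-patch embedding, the best match per query token is kept (MaxSim), and the per-token maxima are summed to score the page. The embedding width and patch counts below are placeholders, not ColPali's exact shapes.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb, page_emb):
    """Late-interaction (MaxSim) scoring over L2-normalized multi-vector embeddings."""
    sim = query_emb @ page_emb.t()      # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()  # best patch per query token, summed

# Toy index: a 12-token query ranked against two pages of patch embeddings.
query = F.normalize(torch.randn(12, 128), dim=-1)
pages = [F.normalize(torch.randn(1024, 128), dim=-1) for _ in range(2)]
ranked = sorted(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]), reverse=True)
print(ranked)  # page indices, best match first
```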