Presentations in This Series
- CLIP & Contrastive → CLIP architecture and InfoNCE; OpenCLIP, EVA-CLIP, SigLIP (sigmoid loss); scale curves; zero-shot classification & retrieval; building block for VLMs.
- Vision Transformers → From CNN to ViT, patch embeddings; DeiT, Swin, DINOv2, SAM; patch size and resolution trade-offs; ViT as a VLM image encoder.
- Modern VLMs → Three architectures (Q-Former, Perceiver Resampler, simple projection); Llama 3.2 Vision, Qwen2-VL, InternVL, Pixtral, Gemini, Claude; AnyRes; token budget; evaluation benchmarks (MMMU, MathVista, ChartQA, OCRBench).
- Document AI → OCR-then-LLM pipelines vs OCR-free (Donut); ColPali for vision-based retrieval; LayoutLM, UDOP; tables & charts; production patterns.