Vision-Language Models Series — Presentation 01

CLIP & Contrastive Vision-Language Learning

CLIP, OpenCLIP, EVA-CLIP, SigLIP, InfoNCE loss, contrastive scale curves, zero-shot classification & retrieval, and how CLIP became the backbone of modern generative models.

00

Topics We’ll Cover

01

Multimodal Alignment — Text + Image in One Space

Before CLIP, vision models were trained on fixed label sets. To add a new category you re-trained the classification head. CLIP (Contrastive Language-Image Pre-training, Radford et al., OpenAI 2021) learns a joint embedding space from 400 million noisy image-text pairs scraped from the web. Both modalities collapse to a common unit-normed vector; cosine similarity becomes the universal comparator.

Classic supervised vision

ResNet-50 trained on ImageNet-1K: 1,000 output logits, fixed. Adding “radiograph” as a class requires labelled data and full fine-tuning. Zero-shot transfer is essentially impossible.

  • Label space: finite, curated
  • Transfer: linear probe required
  • Generalisation: closed-vocabulary

CLIP contrastive learning

ViT-L/14 and a Transformer text encoder share a 768-d embedding space. Any string is a valid “class”. Comparison is a dot product. New concepts cost nothing at inference time.

  • Label space: arbitrary natural language
  • Transfer: prompt engineering, no grad
  • Generalisation: open-vocabulary

The key insight is that the web already contains supervisory signal: alt-text, captions, surrounding context. The model does not need human-curated labels — it just has to learn to assign high similarity to the correct (image, text) pair and low similarity to the 32,767 other texts in a batch of 32,768.

The embedding space property

Once trained, the space is compositional. “a dog wearing a hat” lands between the “dog” region and the “hat” region without any hat-dog training images. This compositionality is what makes CLIP embeddings so widely reused in downstream systems.

02

CLIP Architecture & Training

CLIP uses two independent encoders that share no weights. After encoding, both outputs are L2-normalised and projected to the same dimensionality. Training maximises the cosine similarity of correct pairs and minimises it for all other pairs within the batch.

[Diagram] Image (224×224 px) → Image Encoder (ViT-B/32, ViT-L/14, or ResNet-50/101) → Linear Proj (512-d or 768-d). Text ("a photo of a dog") → Text Encoder (63M-param Transformer, max 77 BPE tokens) → Linear Proj (512-d or 768-d). Cosine similarity I · T on normalised embeddings. Batch of N pairs → N×N similarity matrix → InfoNCE on the diagonal.
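A minimal sketch of that two-tower forward pass, with stand-in linear "encoders" on random features (in real CLIP these outputs come from the ViT and the text Transformer):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d_img, d_txt, d_emb = 8, 2048, 512, 768

# Stand-in encoder outputs (in CLIP: ViT patch pooling / text Transformer EOS token)
img_features = torch.randn(N, d_img)
txt_features = torch.randn(N, d_txt)

# Independent linear projections into the shared embedding space
img_proj = torch.nn.Linear(d_img, d_emb, bias=False)
txt_proj = torch.nn.Linear(d_txt, d_emb, bias=False)

# Project, then L2-normalise so a dot product equals cosine similarity
I = F.normalize(img_proj(img_features), dim=-1)
T = F.normalize(txt_proj(txt_features), dim=-1)

sim = I @ T.T     # [N, N] similarity matrix; diagonal entries are the matched pairs
print(sim.shape)  # torch.Size([8, 8])
```

Everything model-specific (encoder architectures, projection widths) is an assumption here; the structural point is that the two towers share no weights and meet only at the similarity matrix.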
Variant            | Image encoder           | Embed dim | Params (total) | WIT-400M zero-shot IN-1K
CLIP RN50          | ResNet-50               | 1024      | 102M           | 59.6%
CLIP ViT-B/32      | ViT-B, patch 32         | 512       | 151M           | 63.3%
CLIP ViT-B/16      | ViT-B, patch 16         | 512       | 150M           | 68.3%
CLIP ViT-L/14      | ViT-L, patch 14         | 768       | 428M           | 75.3%
CLIP ViT-L/14@336  | ViT-L, patch 14, 336 px | 768       | 428M           | 76.2%
Training data: WIT-400M

OpenAI's WebImageText (WIT) dataset was 400M image-text pairs collected from public internet sources. It was never released. This opacity drove the OpenCLIP project to re-create CLIP on public datasets (LAION-400M, LAION-2B, DataComp-1B). WIT's composition — in particular, its balance of creative-commons art vs photographic vs diagram data — still isn't fully characterised.

03

Contrastive Loss Math — InfoNCE

The training objective is the InfoNCE loss (van den Oord et al., 2018), also called NT-Xent in SimCLR. CLIP applies it symmetrically to both modalities. For a batch of N pairs, the model sees an N × N similarity matrix; diagonal entries are the positives.

InfoNCE loss (both directions, temperature τ)
# I = image embeddings  [N, d],  L2-normalised
# T = text  embeddings  [N, d],  L2-normalised
# logit_scale = exp(learnable log-scale), initialised to 1/0.07 ≈ 14.3

import torch, torch.nn.functional as F

logits = logit_scale * I @ T.T          # [N, N] pairwise similarities
N      = I.shape[0]
labels = torch.arange(N, device=logits.device)   # positives on the diagonal

loss_i = F.cross_entropy(logits,   labels)  # image→text
loss_t = F.cross_entropy(logits.T, labels)  # text→image
loss   = (loss_i + loss_t) / 2

The temperature τ controls the peakiness of the softmax distribution. CLIP initialises the learnable log logit scale to log(1/0.07) ≈ 2.66 (i.e. τ = 0.07) and learns it jointly. A lower τ sharpens the distribution and amplifies gradients from hard negatives; a higher τ flattens it, so easy and hard negatives contribute almost equally and the hard ones stop driving learning.
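To see what τ does, compare the softmax of the same similarity row at two temperatures (the similarity values are illustrative, not model outputs):

```python
import torch

sims = torch.tensor([0.30, 0.25, 0.10, 0.05])  # one row of cosine similarities

sharp = torch.softmax(sims / 0.07, dim=0)  # CLIP's initial τ = 0.07
flat  = torch.softmax(sims / 1.00, dim=0)  # τ = 1: nearly uniform

# At low τ most mass sits on the top entry, but the near-miss at 0.25
# (a hard negative) still gets substantial probability, hence gradient;
# at τ = 1 all four entries are nearly indistinguishable.
print(sharp, flat)
```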

Why cross-entropy implements InfoNCE

Cross-entropy over a row of the logit matrix is equivalent to maximising the log-probability of the positive pair relative to all negatives, which is exactly the InfoNCE bound on mutual information I(I;T). With batch size N, the bound is at most log(N) nats. Larger batches give a tighter bound and more hard negatives per update.
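The log(N) ceiling is easy to check numerically: with completely uninformative similarities (all logits equal) the symmetric loss equals log(N) exactly, and it can only decrease from there as the encoders learn:

```python
import math
import torch
import torch.nn.functional as F

N = 256
logits = torch.zeros(N, N)          # uninformative: every pair looks identical
labels = torch.arange(N)

loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.T, labels)) / 2

print(loss.item(), math.log(N))     # both ≈ 5.545
```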

Gradient explosion at small τ

CLIP clamps the learnable log-scale at log(100) ≈ 4.6 during training, so the effective temperature can never collapse below 0.01. Without this, gradients become enormous and training destabilises. OpenCLIP and SigLIP both inherit this constraint.
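In code, the safeguard is a one-line in-place clamp applied after each optimiser step (a sketch assuming the open_clip convention of storing the scale in log space):

```python
import math
import torch

# Learnable log logit-scale, initialised to log(1/0.07) ≈ 2.66
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

# ... in the training loop, after optimizer.step():
with torch.no_grad():
    logit_scale.clamp_(max=math.log(100))  # effective τ never drops below 0.01

print(logit_scale.exp())  # ≈ 14.29 at init, capped at 100 thereafter
```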

Batch size matters enormously

CLIP trained with batch N = 32,768 (the largest ResNet variant ran on 592 V100s, the largest ViT on 256). At N = 256 you get only 255 negatives per positive. LAION-5B training used N = 86,016 (48 A100s × 1,792 per GPU). More negatives → harder task → richer features.

False negatives in large batches

With N = 32,768 images of generic subjects, many texts describe multiple images. These “false negatives” contaminate the loss. SigLIP's sigmoid formulation handles this more gracefully by treating each pair independently rather than as a single-correct softmax problem.

04

OpenCLIP, EVA-CLIP, SigLIP — What Each Fixed

CLIP's closed training data created a reproducibility gap. Three major follow-ups each targeted a different limitation:

OpenCLIP LAION / stability.ai

What it fixed: reproducibility. Open weights, open training code (open_clip library), open data (LAION-400M, LAION-2B, DataComp-1B).

  • ViT-H/14 on LAION-2B: 78.0% IN-1K zero-shot
  • ViT-bigG/14 on LAION-2B: 80.1%
  • DataComp-XL ViT-L/14: 79.2% (better data curation matters more than model size)
  • PyPI: pip install open_clip_torch

EVA-CLIP BAAI 2023

What it fixed: capacity. Uses masked image modelling pre-training (EVA) to initialise the image encoder before contrastive fine-tuning. This unlocks huge ViTs that otherwise diverge from random init.

  • EVA-CLIP-18B: 18B param image encoder, 83.0% IN-1K zero-shot
  • EVA02-CLIP-bigE/14+: 82.0% with only 5B params
  • MIM pre-training saves ~4× data vs random init at scale

SigLIP Google 2023

What it fixed: loss formulation. Replaces softmax over the batch with a per-pair sigmoid binary cross-entropy. This removes the false-negative problem and decouples batch size from loss validity.

  • SigLIP-B/16: 76.1% IN-1K at batch 16k vs CLIP 70.0% at same batch
  • SigLIP-SO400M/14: 83.1% (PaLI-3 backbone)
  • Trained on WebLI (10B pairs), open weights on HuggingFace
Model                | Data               | Loss            | Max IN-1K 0-shot | Open weights
CLIP ViT-L/14        | WIT-400M (closed)  | InfoNCE softmax | 76.2%            | Yes (weights only)
OpenCLIP ViT-bigG/14 | LAION-2B (open)    | InfoNCE softmax | 80.1%            | Yes (weights + code + data)
EVA-CLIP-18B         | LAION-400M + CC12M | InfoNCE softmax | 83.0%            | Yes
SigLIP-SO400M        | WebLI-10B (closed) | Sigmoid BCE     | 83.1%            | Yes (weights)
05

Sigmoid vs Softmax Loss

The original CLIP uses softmax across the entire batch for each positive, making the problem a single-correct multi-class classification. SigLIP treats each pair in the N × N matrix as an independent binary classification using sigmoid binary cross-entropy. This has deep consequences.

SigLIP sigmoid loss (Zhai et al. 2023, eq. 1)
# logits: [N, N] = image @ text.T * scale
# labels: +1 on diagonal, -1 off-diagonal

import torch
import torch.nn.functional as F

N = logits.shape[0]
labels = 2 * torch.eye(N, device=logits.device) - 1   # {-1, +1}

# Per-pair sigmoid binary cross-entropy (logsigmoid is numerically stable):
loss = -F.logsigmoid(labels * logits).sum() / N

# Equivalently:
# loss = F.binary_cross_entropy_with_logits(
#     logits, (labels + 1) / 2, reduction='sum') / N

Key differences

Softmax (CLIP / OpenCLIP)

  • One positive per row; all others are hard negatives
  • Loss normalises over entire batch → false negatives dilute signal
  • Large batches required for stable training
  • N×N matrix must fit in memory on one device (or use all-gather)
  • Temperature leaks into the gradient magnitude globally

Sigmoid (SigLIP)

  • Each pair is independent: no normalisation across batch
  • False negatives just create noisy positive labels — less catastrophic
  • Works well at smaller batch sizes (tested down to B=32)
  • All-gather still used for throughput, but not required for correctness
  • Bias initialisation trick: init logit bias to -10 to offset the overwhelming fraction of negative pairs
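A quick sketch of why the -10 bias matters. At initialisation the raw similarities are near zero, so without the bias every one of the N² pairs contributes log 2 to the sum, giving a starting loss of roughly N·log 2; with the bias, the overwhelming negative majority is already nearly satisfied (assumes zero initial similarities for illustration):

```python
import torch
import torch.nn.functional as F

N = 1024
z = torch.zeros(N, N)              # raw similarities ≈ 0 at initialisation
labels = 2 * torch.eye(N) - 1      # {-1, +1}

def sigmoid_loss(logits):
    return -F.logsigmoid(labels * logits).sum() / N

loss_no_bias = sigmoid_loss(z)       # every pair contributes log 2 → ≈ N·log 2 ≈ 710
loss_bias    = sigmoid_loss(z - 10)  # negatives ≈ satisfied; only positives pay → ≈ 10

print(loss_no_bias.item(), loss_bias.item())
```

The bias shifts the model's starting point to "assume every pair is a negative", which matches the data's actual label statistics.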
Why SigLIP became the preferred VLM backbone

PaLI-3 and several open VLMs (most prominently PaliGemma) use SigLIP-SO400M as the image encoder because it transfers better to fine-grained tasks like OCR, charts, and dense captioning: the sigmoid loss forces richer per-image discrimination rather than relative ranking within a batch.

06

Scale Curves — Dataset, Model, Batch Size

The DataComp benchmark (Gadre et al., 2023) and a series of OpenCLIP ablations produced the clearest picture yet of what actually drives CLIP zero-shot performance. The answer is mostly data quality, then model size, then compute budget — a different ranking from language-model scaling laws.

Figure: IN-1K zero-shot accuracy, OpenCLIP/DataComp ViT ablations (approximate). LAION-2B ViT-L/14: 76.9% · LAION-2B ViT-H/14: 78.0% · LAION-2B ViT-bigG/14: 80.1% · DataComp-1B ViT-L/14: 79.2%. Data curation beats a 2× increase in model scale.

The key takeaways from DataComp

Compute-optimal frontier

Unlike Chinchilla (equal tokens & params), CLIP's optimal frontier is harder to characterise because the effective dataset size after curation is not fixed. The DataComp paper recommends: filter aggressively (CLIP-score > 0.3 + English-text heuristics), use ViT-L or larger, batch ≥ 32k, and train for 13B image-text pair-steps. Beyond that, returns diminish.
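The CLIP-score filter in that recipe reduces to: score every candidate pair with a fixed CLIP model, keep pairs above the threshold. A sketch with stand-in embeddings (the 0.3 cutoff follows the DataComp recipe; the embeddings here are synthetic, with the first half constructed to be genuinely matched):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
M, d = 1000, 768
# Stand-ins for embeddings from the fixed CLIP scoring model
img_emb = F.normalize(torch.randn(M, d), dim=-1)
txt_emb = F.normalize(torch.randn(M, d), dim=-1)
# Pretend the first half are genuine captions: text close to its image
txt_emb[:500] = F.normalize(img_emb[:500] + 0.02 * torch.randn(500, d), dim=-1)

clip_scores = (img_emb * txt_emb).sum(dim=-1)  # per-pair cosine similarity
keep = clip_scores > 0.3                       # DataComp-style cutoff
print(int(keep.sum()))                         # only the matched half survives
```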

07

Zero-Shot Classification & Retrieval

CLIP's canonical downstream use is zero-shot classification: construct one text prompt per class label, encode all prompts, then rank the image against the text embeddings by cosine similarity. No fine-tuning, no labelled data.

Zero-shot ImageNet with open_clip (Python)
import open_clip, torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Build class embeddings once (cache for large label sets)
classnames = ["a dog", "a cat", "a car"]
templates  = ["a photo of {}", "a {} in the wild"]
texts = [t.format(c) for c in classnames for t in templates]
with torch.no_grad():
    txt_feats = model.encode_text(tokenizer(texts)).float()
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)
    # Average over templates for each class
    txt_feats = txt_feats.view(len(classnames), len(templates), -1).mean(1)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)

img = preprocess(Image.open("photo.jpg")).unsqueeze(0)
with torch.no_grad():
    img_feat = model.encode_image(img).float()
    img_feat /= img_feat.norm(dim=-1, keepdim=True)

probs = (img_feat @ txt_feats.T).softmax(dim=-1)
pred  = classnames[probs.argmax()]

Prompt engineering matters

OpenAI found that ensembling 80 hand-crafted templates (“a photo of a {}”, “a blurry photo of the {}”, …) improved IN-1K from 71.3% to 76.2% for ViT-L/14. The averaged embedding is more robust than any single prompt. This gave rise to the field of CLIP prompt tuning (CoOp, CoCoOp, CLIP-Adapter) which learns the prompt prefix in continuous embedding space rather than by hand.

Cross-modal retrieval

The same embedding space enables text → image and image → text retrieval. On MS-COCO Recall@1: CLIP ViT-L/14 achieves 58.4% image→text, 37.8% text→image. SigLIP-SO400M pushes these to ~70% / ~50%. This is the backbone of every DALL-E 2 / Stable Diffusion retrieval augmentation pipeline.
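Recall@K on a retrieval set drops straight out of the similarity matrix. A sketch with stand-in embeddings (a real COCO evaluation would substitute encoded images and captions):

```python
import torch
import torch.nn.functional as F

def recall_at_k(img: torch.Tensor, txt: torch.Tensor, k: int = 1) -> float:
    """Fraction of images whose matching text (same index) ranks in the top-k."""
    sim = F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).T  # [N, N]
    topk = sim.topk(k, dim=-1).indices                           # [N, k]
    match = torch.arange(len(img)).unsqueeze(-1)                 # ground-truth index
    return (topk == match).any(dim=-1).float().mean().item()

torch.manual_seed(0)
N, d = 100, 64
img = torch.randn(N, d)
txt = img + 0.1 * torch.randn(N, d)   # texts noisily aligned with their images
print(recall_at_k(img, txt, k=1))      # close to 1.0 on this easy synthetic set
```

Swapping the argument order gives the text→image direction; the asymmetry in the COCO numbers above comes from each image having multiple valid captions.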

08

As a Building Block (Text-to-Image, VLMs)

CLIP is rarely deployed standalone. Its embedding space and image encoder are load-bearing components in three major downstream architectures:

DALL-E 2
CLIP text embed → prior (diffusion) → CLIP image embed → decoder (unCLIP)
Stable Diff. 1.x
CLIP ViT-L/14 text encoder → cross-attention conditioning in UNet (image encoder unused)
SD 2.x / XL
SD 2.x: OpenCLIP ViT-H text encoder; SDXL: OpenCLIP ViT-bigG + CLIP ViT-L text encoders → concatenated 2048-d conditioning
LLaVA / VLMs
CLIP ViT-L/14 image encoder → MLP projection → prepended to LLM token sequence
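The LLaVA-style glue is just an MLP mapping CLIP patch features into the LLM's embedding width. A shape-level sketch (the 1024-d patch width matches ViT-L/14's hidden size; the 4096-d LLM width and 2-layer GELU MLP follow the LLaVA-1.5 recipe):

```python
import torch
import torch.nn as nn

# LLaVA-1.5-style projector: two linear layers with a GELU in between
projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

# 576 patch tokens for a 336×336 input at patch size 14 (a 24×24 grid)
patch_feats = torch.randn(1, 576, 1024)   # from the frozen CLIP ViT
visual_tokens = projector(patch_feats)    # [1, 576, 4096]

# These tokens are prepended to the text token embeddings fed to the LLM
print(visual_tokens.shape)
```

Only the projector (and the LLM) receive gradients in that setup; the CLIP encoder stays frozen.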

The frozen-vs-tuned decision

Most text-to-image systems keep the CLIP text encoder frozen during diffusion training, treating it as a fixed feature extractor. LLaVA-1.5 keeps the CLIP image encoder frozen too (only the MLP projection and LLM are trained). This is why a fine-tuned diffusion model that breaks the CLIP conditioning — e.g. by further training the text encoder on NSFW data — loses compositionality on common prompts.

CLIP score as an evaluation metric

CLIP score (Hessel et al., 2021) measures image-text alignment by computing the cosine similarity of a generated image and its prompt using a fixed CLIP model (usually ViT-L/14). It has become the standard automated metric for text-to-image models: DALL-E 3 reports 0.754 CLIP-score-H on COCO-30K, vs Stable Diffusion XL at 0.743. The metric is imperfect — it rewards style over semantic precision — but it is reproducible and correlates with human preference at scale.
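The metric itself is tiny: Hessel et al. define CLIP-S(I, T) = w · max(cos(I, T), 0) with w = 2.5. A sketch over stand-in embeddings (real use would encode the generated image and its prompt with a fixed CLIP model):

```python
import torch
import torch.nn.functional as F

def clip_score(img_emb: torch.Tensor, txt_emb: torch.Tensor,
               w: float = 2.5) -> torch.Tensor:
    """CLIPScore (Hessel et al., 2021): w * max(cosine, 0), per pair."""
    cos = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    return w * cos.clamp(min=0)

torch.manual_seed(0)
img = torch.randn(4, 768)               # stand-ins for encoded generations
txt = img + 0.3 * torch.randn(4, 768)   # prompts roughly aligned with images
print(clip_score(img, txt))
```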

09

What to Take Away

Where to next

Deck 02 drills into the image encoder side: ViT, DeiT, Swin, DINOv2, and SAM. Once you understand those architectures the CLIP image encoder column becomes much clearer, and you’ll be able to reason about why patch size and resolution are the primary knobs when adapting a CLIP model to a new task.