Vision-Language Models Series — Presentation 01

CLIP & Contrastive Vision-Language Learning

CLIP, OpenCLIP, EVA-CLIP, SigLIP, InfoNCE loss, contrastive scale curves, zero-shot classification & retrieval, and how CLIP became the backbone of modern generative models.

00

Topics We’ll Cover

01

Multimodal Alignment — Text + Image in One Space

Before CLIP, vision models were trained on fixed label sets. To add a new category you re-trained the classification head. CLIP (Contrastive Language-Image Pre-training, Radford et al., OpenAI 2021) learns a joint embedding space from 400 million noisy image-text pairs scraped from the web. Both modalities collapse to a common unit-normed vector; cosine similarity becomes the universal comparator.

Classic supervised vision

ResNet-50 trained on ImageNet-1K: 1,000 output logits, fixed. Adding “radiograph” as a class requires labelled data and full fine-tuning. Zero-shot transfer is essentially impossible.

  • Label space: finite, curated
  • Transfer: linear probe required
  • Generalisation: closed-vocabulary

CLIP contrastive learning

ViT-L/14 and a Transformer text encoder share a 768-d embedding space. Any string is a valid “class”. Comparison is a dot product. New concepts cost nothing at inference time.

  • Label space: arbitrary natural language
  • Transfer: prompt engineering, no grad
  • Generalisation: open-vocabulary

The key insight is that the web already contains supervisory signal: alt-text, captions, surrounding context. The model does not need human-curated labels — it just has to learn to assign high similarity to the correct (image, text) pair and low similarity to the 32,767 other texts in a batch of 32,768.

The embedding space property

Once trained, the space is compositional. “a dog wearing a hat” lands between the “dog” region and the “hat” region without any hat-dog training images. This compositionality is what makes CLIP embeddings so widely reused in downstream systems.

02

CLIP Architecture & Training

CLIP uses two independent encoders that share no weights. After encoding, both outputs are L2-normalised and projected to the same dimensionality. Training maximises the cosine similarity of correct pairs and minimises it for all other pairs within the batch.

[Diagram] Image (224×224 px) → Image Encoder (ViT-B/32, ViT-L/14, or ResNet-50/101) → Linear Proj (512-d or 768-d). Text ("a photo of a dog") → Text Encoder (63M-param Transformer, max 77 BPE tokens) → Linear Proj (512-d or 768-d). Cosine similarity I · T on normalised embeddings. Batch of N pairs → N×N similarity matrix → InfoNCE on the diagonal.
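A minimal sketch of that two-tower forward pass, with stand-in linear "encoders" on random features (in real CLIP these outputs come from the ViT and the text Transformer):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d_img, d_txt, d_emb = 8, 2048, 512, 768

# Stand-in encoder outputs (in CLIP: ViT patch pooling / text Transformer EOS token)
img_features = torch.randn(N, d_img)
txt_features = torch.randn(N, d_txt)

# Independent linear projections into the shared embedding space
img_proj = torch.nn.Linear(d_img, d_emb, bias=False)
txt_proj = torch.nn.Linear(d_txt, d_emb, bias=False)

# Project, then L2-normalise so a dot product equals cosine similarity
I = F.normalize(img_proj(img_features), dim=-1)
T = F.normalize(txt_proj(txt_features), dim=-1)

sim = I @ T.T     # [N, N] similarity matrix; diagonal entries are the matched pairs
print(sim.shape)  # torch.Size([8, 8])
```

Everything model-specific (encoder architectures, projection widths) is an assumption here; the structural point is that the two towers share no weights and meet only at the similarity matrix.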
Variant            | Image encoder           | Embed dim | Params (total) | WIT-400M zero-shot IN-1K
CLIP RN50          | ResNet-50               | 1024      | 102M           | 59.6%
CLIP ViT-B/32      | ViT-B, patch 32         | 512       | 151M           | 63.3%
CLIP ViT-B/16      | ViT-B, patch 16         | 512       | 150M           | 68.3%
CLIP ViT-L/14      | ViT-L, patch 14         | 768       | 428M           | 75.3%
CLIP ViT-L/14@336  | ViT-L, patch 14, 336 px | 768       | 428M           | 76.2%
Training data: WIT-400M

OpenAI's WebImageText (WIT) dataset was 400M image-text pairs collected from public internet sources. It was never released. This opacity drove the OpenCLIP project to re-create CLIP on public datasets (LAION-400M, LAION-2B, DataComp-1B). WIT's composition — in particular, its balance of creative-commons art vs photographic vs diagram data — still isn't fully characterised.

03

Contrastive Loss Math — InfoNCE

The training objective is the InfoNCE loss (van den Oord et al., 2018), also called NT-Xent in SimCLR. CLIP applies it symmetrically to both modalities. For a batch of N pairs, the model sees an N × N similarity matrix; diagonal entries are the positives.

InfoNCE loss (both directions, temperature τ)
# I = image embeddings  [N, d],  L2-normalised
# T = text  embeddings  [N, d],  L2-normalised
# logit_scale = exp(learnable log-scale), initialised to 1/0.07 ≈ 14.3

import torch, torch.nn.functional as F

logits = logit_scale * I @ T.T          # [N, N] pairwise similarities
N      = I.shape[0]
labels = torch.arange(N, device=logits.device)   # positives on the diagonal

loss_i = F.cross_entropy(logits,   labels)  # image→text
loss_t = F.cross_entropy(logits.T, labels)  # text→image
loss   = (loss_i + loss_t) / 2

The temperature τ controls the peakiness of the softmax distribution. CLIP initialises the learnable log logit scale to log(1/0.07) ≈ 2.66 (i.e. τ = 0.07) and learns it jointly. A lower τ sharpens the distribution and amplifies gradients from hard negatives; a higher τ flattens it, so easy and hard negatives contribute almost equally and the hard ones stop driving learning.
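To see what τ does, compare the softmax of the same similarity row at two temperatures (the similarity values are illustrative, not model outputs):

```python
import torch

sims = torch.tensor([0.30, 0.25, 0.10, 0.05])  # one row of cosine similarities

sharp = torch.softmax(sims / 0.07, dim=0)  # CLIP's initial τ = 0.07
flat  = torch.softmax(sims / 1.00, dim=0)  # τ = 1: nearly uniform

# At low τ most mass sits on the top entry, but the near-miss at 0.25
# (a hard negative) still gets substantial probability, hence gradient;
# at τ = 1 all four entries are nearly indistinguishable.
print(sharp, flat)
```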

Why cross-entropy implements InfoNCE

Cross-entropy over a row of the logit matrix is equivalent to maximising the log-probability of the positive pair relative to all negatives, which is exactly the InfoNCE bound on mutual information I(I;T). With batch size N, the bound is at most log(N) nats. Larger batches give a tighter bound and more hard negatives per update.
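The log(N) ceiling is easy to check numerically: with completely uninformative similarities (all logits equal) the symmetric loss equals log(N) exactly, and it can only decrease from there as the encoders learn:

```python
import math
import torch
import torch.nn.functional as F

N = 256
logits = torch.zeros(N, N)          # uninformative: every pair looks identical
labels = torch.arange(N)

loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.T, labels)) / 2

print(loss.item(), math.log(N))     # both ≈ 5.545
```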

Gradient explosion at small τ

CLIP clamps the learnable log-scale at log(100) ≈ 4.6 during training, so the effective temperature can never collapse below 0.01. Without this, gradients become enormous and training destabilises. OpenCLIP and SigLIP both inherit this constraint.
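In code, the safeguard is a one-line in-place clamp applied after each optimiser step (a sketch assuming the open_clip convention of storing the scale in log space):

```python
import math
import torch

# Learnable log logit-scale, initialised to log(1/0.07) ≈ 2.66
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

# ... in the training loop, after optimizer.step():
with torch.no_grad():
    logit_scale.clamp_(max=math.log(100))  # effective τ never drops below 0.01

print(logit_scale.exp())  # ≈ 14.29 at init, capped at 100 thereafter
```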

Batch size matters enormously

CLIP trained with batch N = 32,768 (the largest ResNet variant ran on 592 V100s, the largest ViT on 256). At N = 256 you get only 255 negatives per positive. LAION-5B training used N = 86,016 (48 A100s × 1,792 per GPU). More negatives → harder task → richer features.

False negatives in large batches

With N = 32,768 images of generic subjects, many texts describe multiple images. These “false negatives” contaminate the loss. SigLIP's sigmoid formulation handles this more gracefully by treating each pair independently rather than as a single-correct softmax problem.

04

OpenCLIP, EVA-CLIP, SigLIP — What Each Fixed

CLIP's closed training data created a reproducibility gap. Three major follow-ups each targeted a different limitation:

OpenCLIP LAION / stability.ai

What it fixed: reproducibility. Open weights, open training code (open_clip library), open data (LAION-400M, LAION-2B, DataComp-1B).

  • ViT-H/14 on LAION-2B: 78.0% IN-1K zero-shot
  • ViT-bigG/14 on LAION-2B: 80.1%
  • DataComp-XL ViT-L/14: 79.2% (better data curation matters more than model size)
  • PyPI: pip install open_clip_torch

EVA-CLIP BAAI 2023

What it fixed: capacity. Uses masked image modelling pre-training (EVA) to initialise the image encoder before contrastive fine-tuning. This unlocks huge ViTs that otherwise diverge from random init.

  • EVA-CLIP-18B: 18B param image encoder, 83.0% IN-1K zero-shot
  • EVA02-CLIP-bigE/14+: 82.0% with only 5B params
  • MIM pre-training saves ~4× data vs random init at scale

SigLIP Google 2023

What it fixed: loss formulation. Replaces softmax over the batch with a per-pair sigmoid binary cross-entropy. This removes the false-negative problem and decouples batch size from loss validity.

  • SigLIP-B/16: 76.1% IN-1K at batch 16k vs CLIP 70.0% at same batch
  • SigLIP-SO400M/14: 83.1% (PaLI-3 backbone)
  • Trained on WebLI (10B pairs), open weights on HuggingFace
Model                | Data               | Loss            | Max IN-1K 0-shot | Open weights
CLIP ViT-L/14        | WIT-400M (closed)  | InfoNCE softmax | 76.2%            | Yes (weights only)
OpenCLIP ViT-bigG/14 | LAION-2B (open)    | InfoNCE softmax | 80.1%            | Yes (weights + code + data)
EVA-CLIP-18B         | LAION-400M + CC12M | InfoNCE softmax | 83.0%            | Yes
SigLIP-SO400M        | WebLI-10B (closed) | Sigmoid BCE     | 83.1%            | Yes (weights)
05

Sigmoid vs Softmax Loss

The original CLIP uses softmax across the entire batch for each positive, making the problem a single-correct multi-class classification. SigLIP treats each pair in the N × N matrix as an independent binary classification using sigmoid binary cross-entropy. This has deep consequences.

SigLIP sigmoid loss (Zhai et al. 2023, eq. 1)
# logits: [N, N] = image @ text.T * scale
# labels: +1 on diagonal, -1 off-diagonal

import torch
import torch.nn.functional as F

N = logits.shape[0]
labels = 2 * torch.eye(N, device=logits.device) - 1   # {-1, +1}

# Per-pair sigmoid binary cross-entropy (logsigmoid is numerically stable):
loss = -F.logsigmoid(labels * logits).sum() / N

# Equivalently:
# loss = F.binary_cross_entropy_with_logits(
#     logits, (labels + 1) / 2, reduction='sum') / N

Key differences

Softmax (CLIP / OpenCLIP)

  • One positive per row; all others are hard negatives
  • Loss normalises over entire batch → false negatives dilute signal
  • Large batches required for stable training
  • N×N matrix must fit in memory on one device (or use all-gather)
  • Temperature leaks into the gradient magnitude globally

Sigmoid (SigLIP)

  • Each pair is independent: no normalisation across batch
  • False negatives just create noisy positive labels — less catastrophic
  • Works well at smaller batch sizes (tested down to B=32)
  • All-gather still used for throughput, but not required for correctness
  • Bias initialisation trick: init logit bias to -10 to offset the overwhelming fraction of negative pairs
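A quick sketch of why the -10 bias matters. At initialisation the raw similarities are near zero, so without the bias every one of the N² pairs contributes log 2 to the sum, giving a starting loss of roughly N·log 2; with the bias, the overwhelming negative majority is already nearly satisfied (assumes zero initial similarities for illustration):

```python
import torch
import torch.nn.functional as F

N = 1024
z = torch.zeros(N, N)              # raw similarities ≈ 0 at initialisation
labels = 2 * torch.eye(N) - 1      # {-1, +1}

def sigmoid_loss(logits):
    return -F.logsigmoid(labels * logits).sum() / N

loss_no_bias = sigmoid_loss(z)       # every pair contributes log 2 → ≈ N·log 2 ≈ 710
loss_bias    = sigmoid_loss(z - 10)  # negatives ≈ satisfied; only positives pay → ≈ 10

print(loss_no_bias.item(), loss_bias.item())
```

The bias shifts the model's starting point to "assume every pair is a negative", which matches the data's actual label statistics.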
Why SigLIP became the preferred VLM backbone

PaLI-3 and several open VLMs (most prominently PaliGemma) use SigLIP-SO400M as the image encoder because it transfers better to fine-grained tasks like OCR, charts, and dense captioning: the sigmoid loss forces richer per-image discrimination rather than relative ranking within a batch.

06

Scale Curves — Dataset, Model, Batch Size

The DataComp benchmark (Gadre et al., 2023) and a series of OpenCLIP ablations produced the clearest picture yet of what actually drives CLIP zero-shot performance. The answer is mostly data quality, then model size, then compute budget — a different ranking from language-model scaling laws.

Figure: IN-1K zero-shot accuracy, OpenCLIP/DataComp ViT ablations (approximate). LAION-2B ViT-L/14: 76.9% · LAION-2B ViT-H/14: 78.0% · LAION-2B ViT-bigG/14: 80.1% · DataComp-1B ViT-L/14: 79.2%. Data curation beats a 2× increase in model scale.

The key takeaways from DataComp

Compute-optimal frontier

Unlike Chinchilla (equal tokens & params), CLIP's optimal frontier is harder to characterise because the effective dataset size after curation is not fixed. The DataComp paper recommends: filter aggressively (CLIP-score > 0.3 + English-text heuristics), use ViT-L or larger, batch ≥ 32k, and train for 13B image-text pair-steps. Beyond that, returns diminish.
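The CLIP-score filter in that recipe reduces to: score every candidate pair with a fixed CLIP model, keep pairs above the threshold. A sketch with stand-in embeddings (the 0.3 cutoff follows the DataComp recipe; the embeddings here are synthetic, with the first half constructed to be genuinely matched):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
M, d = 1000, 768
# Stand-ins for embeddings from the fixed CLIP scoring model
img_emb = F.normalize(torch.randn(M, d), dim=-1)
txt_emb = F.normalize(torch.randn(M, d), dim=-1)
# Pretend the first half are genuine captions: text close to its image
txt_emb[:500] = F.normalize(img_emb[:500] + 0.02 * torch.randn(500, d), dim=-1)

clip_scores = (img_emb * txt_emb).sum(dim=-1)  # per-pair cosine similarity
keep = clip_scores > 0.3                       # DataComp-style cutoff
print(int(keep.sum()))                         # only the matched half survives
```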

07

Zero-Shot Classification & Retrieval

CLIP's canonical downstream use is zero-shot classification: construct one text prompt per class label, encode all prompts, then rank the image against the text embeddings by cosine similarity. No fine-tuning, no labelled data.

Zero-shot ImageNet with open_clip (Python)
import open_clip, torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Build class embeddings once (cache for large label sets)
classnames = ["a dog", "a cat", "a car"]
templates  = ["a photo of {}", "a {} in the wild"]
texts = [t.format(c) for c in classnames for t in templates]
with torch.no_grad():
    txt_feats = model.encode_text(tokenizer(texts)).float()
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)
    # Average over templates for each class
    txt_feats = txt_feats.view(len(classnames), len(templates), -1).mean(1)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)

img = preprocess(Image.open("photo.jpg")).unsqueeze(0)
with torch.no_grad():
    img_feat = model.encode_image(img).float()
    img_feat /= img_feat.norm(dim=-1, keepdim=True)

probs = (img_feat @ txt_feats.T).softmax(dim=-1)
pred  = classnames[probs.argmax()]

Prompt engineering matters

OpenAI found that ensembling 80 hand-crafted templates (“a photo of a {}”, “a blurry photo of the {}”, …) improved IN-1K from 71.3% to 76.2% for ViT-L/14. The averaged embedding is more robust than any single prompt. This gave rise to the field of CLIP prompt tuning (CoOp, CoCoOp, CLIP-Adapter) which learns the prompt prefix in continuous embedding space rather than by hand.

Cross-modal retrieval

The same embedding space enables text → image and image → text retrieval. On MS-COCO Recall@1: CLIP ViT-L/14 achieves 58.4% image→text, 37.8% text→image. SigLIP-SO400M pushes these to ~70% / ~50%. This is the backbone of every DALL-E 2 / Stable Diffusion retrieval augmentation pipeline.
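Recall@K on a retrieval set drops straight out of the similarity matrix. A sketch with stand-in embeddings (a real COCO evaluation would substitute encoded images and captions):

```python
import torch
import torch.nn.functional as F

def recall_at_k(img: torch.Tensor, txt: torch.Tensor, k: int = 1) -> float:
    """Fraction of images whose matching text (same index) ranks in the top-k."""
    sim = F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).T  # [N, N]
    topk = sim.topk(k, dim=-1).indices                           # [N, k]
    match = torch.arange(len(img)).unsqueeze(-1)                 # ground-truth index
    return (topk == match).any(dim=-1).float().mean().item()

torch.manual_seed(0)
N, d = 100, 64
img = torch.randn(N, d)
txt = img + 0.1 * torch.randn(N, d)   # texts noisily aligned with their images
print(recall_at_k(img, txt, k=1))      # close to 1.0 on this easy synthetic set
```

Swapping the argument order gives the text→image direction; the asymmetry in the COCO numbers above comes from each image having multiple valid captions.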

08

As a Building Block (Text-to-Image, VLMs)

CLIP is rarely deployed standalone. Its embedding space and image encoder are load-bearing components in three major downstream architectures:

DALL-E 2
CLIP text embed → prior (diffusion) → CLIP image embed → decoder (unCLIP)
Stable Diff. 1.x
CLIP ViT-L/14 text encoder → cross-attention conditioning in UNet (image encoder unused)
SD 2.x / XL
SD 2.x: OpenCLIP ViT-H text encoder; SDXL: OpenCLIP ViT-bigG + CLIP ViT-L text encoders → concatenated 2048-d conditioning
LLaVA / VLMs
CLIP ViT-L/14 image encoder → MLP projection → prepended to LLM token sequence
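The LLaVA-style glue is just an MLP mapping CLIP patch features into the LLM's embedding width. A shape-level sketch (the 1024-d patch width matches ViT-L/14's hidden size; the 4096-d LLM width and 2-layer GELU MLP follow the LLaVA-1.5 recipe):

```python
import torch
import torch.nn as nn

# LLaVA-1.5-style projector: two linear layers with a GELU in between
projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

# 576 patch tokens for a 336×336 input at patch size 14 (a 24×24 grid)
patch_feats = torch.randn(1, 576, 1024)   # from the frozen CLIP ViT
visual_tokens = projector(patch_feats)    # [1, 576, 4096]

# These tokens are prepended to the text token embeddings fed to the LLM
print(visual_tokens.shape)
```

Only the projector (and the LLM) receive gradients in that setup; the CLIP encoder stays frozen.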

The frozen-vs-tuned decision

Most text-to-image systems keep the CLIP text encoder frozen during diffusion training, treating it as a fixed feature extractor. LLaVA-1.5 keeps the CLIP image encoder frozen too (only the MLP projection and LLM are trained). This is why a fine-tuned diffusion model that breaks the CLIP conditioning — e.g. by further training the text encoder on NSFW data — loses compositionality on common prompts.

CLIP score as an evaluation metric

CLIP score (Hessel et al., 2021) measures image-text alignment by computing the cosine similarity of a generated image and its prompt using a fixed CLIP model (usually ViT-L/14). It has become the standard automated metric for text-to-image models: DALL-E 3 reports 0.754 CLIP-score-H on COCO-30K, vs Stable Diffusion XL at 0.743. The metric is imperfect — it rewards style over semantic precision — but it is reproducible and correlates with human preference at scale.
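The metric itself is tiny: Hessel et al. define CLIP-S(I, T) = w · max(cos(I, T), 0) with w = 2.5. A sketch over stand-in embeddings (real use would encode the generated image and its prompt with a fixed CLIP model):

```python
import torch
import torch.nn.functional as F

def clip_score(img_emb: torch.Tensor, txt_emb: torch.Tensor,
               w: float = 2.5) -> torch.Tensor:
    """CLIPScore (Hessel et al., 2021): w * max(cosine, 0), per pair."""
    cos = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    return w * cos.clamp(min=0)

torch.manual_seed(0)
img = torch.randn(4, 768)               # stand-ins for encoded generations
txt = img + 0.3 * torch.randn(4, 768)   # prompts roughly aligned with images
print(clip_score(img, txt))
```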

09

What to Take Away

Where to next

Deck 02 drills into the image encoder side: ViT, DeiT, Swin, DINOv2, and SAM. Once you understand those architectures the CLIP image encoder column becomes much clearer, and you’ll be able to reason about why patch size and resolution are the primary knobs when adapting a CLIP model to a new task.