CLIP, OpenCLIP, EVA-CLIP, SigLIP, InfoNCE loss, contrastive scale curves, zero-shot classification & retrieval, and how CLIP became the backbone of modern generative models.
Before CLIP, vision models were trained on fixed label sets. To add a new category you re-trained the classification head. CLIP (Contrastive Language-Image Pre-training, Radford et al., OpenAI 2021) learns a joint embedding space from 400 million noisy image-text pairs scraped from the web. Both modalities are mapped to unit-normed vectors in the same space; cosine similarity becomes the universal comparator.
ResNet-50 trained on ImageNet-1K: 1,000 output logits, fixed. Adding “radiograph” as a class requires labelled data and full fine-tuning. Zero-shot transfer is essentially impossible.
ViT-L/14 and a Transformer text encoder share a 768-d embedding space. Any string is a valid “class”. Comparison is a dot product. New concepts cost nothing at inference time.
The key insight is that the web already contains supervisory signal: alt-text, captions, surrounding context. The model does not need human-curated labels; it just has to learn to assign high similarity to the correct (image, text) pair and low similarity to the 32,767 other texts in a batch of 32,768.
Once trained, the space is compositional. “a dog wearing a hat” lands between the “dog” region and the “hat” region without any hat-dog training images. This compositionality is what makes CLIP embeddings so widely reused in downstream systems.
CLIP uses two independent encoders that share no weights. After encoding, both outputs are projected to the same dimensionality and L2-normalised. Training maximises the cosine similarity of correct pairs and minimises it for all other pairs within the batch.
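A minimal sketch of this dual-encoder forward pass, assuming generic `vision_backbone` / `text_backbone` modules (placeholders, not CLIP's actual class names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Schematic two-tower model; the backbones stand in for a ViT and a text Transformer."""
    def __init__(self, vision_backbone, text_backbone, vision_dim, text_dim, embed_dim=768):
        super().__init__()
        self.visual = vision_backbone        # maps images    -> [B, vision_dim]
        self.text = text_backbone            # maps token ids -> [B, text_dim]
        self.visual_proj = nn.Linear(vision_dim, embed_dim, bias=False)
        self.text_proj = nn.Linear(text_dim, embed_dim, bias=False)
        # Learnable inverse temperature, initialised to 1/0.07 as in CLIP
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, images, tokens):
        I = F.normalize(self.visual_proj(self.visual(images)), dim=-1)  # [B, embed_dim]
        T = F.normalize(self.text_proj(self.text(tokens)), dim=-1)      # [B, embed_dim]
        return I, T, self.logit_scale.exp()
```

The `I`, `T`, and `logit_scale` returned here are exactly the inputs the training loss below consumes.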
| Variant | Image encoder | Embed dim | Params (total) | WIT-400M zero-shot IN-1K |
|---|---|---|---|---|
| CLIP RN50 | ResNet-50 | 1024 | 102M | 59.6% |
| CLIP ViT-B/32 | ViT-B patch=32 | 512 | 151M | 63.3% |
| CLIP ViT-B/16 | ViT-B patch=16 | 512 | 150M | 68.3% |
| CLIP ViT-L/14 | ViT-L patch=14 | 768 | 428M | 75.3% |
| CLIP ViT-L/14@336 | ViT-L patch=14, 336px | 768 | 428M | 76.2% |
OpenAI's WebImageText (WIT) dataset was 400M image-text pairs collected from public internet sources. It was never released. This opacity drove the OpenCLIP project to re-create CLIP on public datasets (LAION-400M, LAION-2B, DataComp-1B). WIT's composition — in particular, its balance of creative-commons art vs photographic vs diagram data — still isn't fully characterised.
The training objective is the InfoNCE loss (van den Oord et al., 2018), also called NT-Xent in SimCLR. CLIP applies it symmetrically to both modalities. For a batch of N pairs, the model sees an N × N similarity matrix; diagonal entries are the positives.
import torch
import torch.nn.functional as F

# I = image embeddings [N, d], L2-normalised
# T = text embeddings  [N, d], L2-normalised
# logit_scale = exp(learned log-scale), initialised to 1/0.07 ≈ 14.3
N = I.shape[0]
logits = logit_scale * I @ T.T                   # [N, N] similarity matrix; positives on the diagonal
labels = torch.arange(N, device=logits.device)   # row i matches column i
loss_i = F.cross_entropy(logits, labels)         # image → text direction
loss_t = F.cross_entropy(logits.T, labels)       # text → image direction
loss = (loss_i + loss_t) / 2                     # symmetric InfoNCE
The temperature τ controls the peakiness of the softmax distribution. CLIP parameterises the inverse temperature as a learnable log-scale, initialised to log(1/0.07) ≈ 2.66, and learns it jointly with the encoders. A lower τ sharpens the distribution and amplifies gradients from hard negatives; push it too low and a handful of hard negatives dominate the gradient while easy pairs saturate and contribute nothing, which destabilises training.
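A toy illustration of how τ reshapes one row of the softmax (the similarity values are made up; index 0 plays the positive pair):

```python
import torch

sims = torch.tensor([0.9, 0.7, 0.6, 0.2])   # cosine similarities; index 0 is the positive
for tau in (1.0, 0.07, 0.01):
    p_pos = torch.softmax(sims / tau, dim=0)[0].item()
    print(f"tau={tau}: positive gets {p_pos:.3f} of the probability mass")
# tau=1.00 -> ~0.33 (nearly flat, negatives barely distinguished)
# tau=0.07 -> ~0.93 (hard negatives still carry gradient)
# tau=0.01 -> ~1.00 (saturated: easy pairs contribute almost no gradient)
```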
Cross-entropy over a row of the logit matrix is equivalent to maximising the log-probability of the positive pair relative to all negatives — which is exactly the InfoNCE bound on mutual information I(I;T). With batch size N, the bound is at most log(N) bits. Larger batches give a tighter bound and more hard negatives per update.
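In the standard formulation (van den Oord et al., 2018), the per-direction loss and the resulting lower bound read:

$$
\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(\operatorname{sim}(I_i,T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\!\left(\operatorname{sim}(I_i,T_j)/\tau\right)},
\qquad
I(\text{image};\text{text}) \;\ge\; \log N - \mathcal{L}_{\text{InfoNCE}}
$$

Doubling N therefore raises the ceiling by at most one bit, which is part of why CLIP-style training pushes batch sizes so high.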
CLIP clamps the learnable log-scale to a maximum of ln(100) ≈ 4.6 during training, so the scale never exceeds 100 and the effective temperature never drops below 0.01. Without this cap, gradients become enormous and training destabilises. OpenCLIP and SigLIP both inherit this constraint.
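A minimal sketch of how the clamp is typically applied inside the training step (mirroring the pattern used in open_clip's training loop; `model`, `loss`, and `optimizer` are assumed to exist):

```python
import math
import torch

loss.backward()
optimizer.step()
# Re-clamp the learnable log-scale after every step so exp(logit_scale) never exceeds 100
with torch.no_grad():
    model.logit_scale.clamp_(0, math.log(100))
```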
CLIP used batch N = 32,768 across 592 V100s. At N = 256 you get only 255 negatives per positive. LAION-5B training used N = 86,016 (48 A100s × 1,792 per GPU). More negatives → harder task → richer features.
With N = 32,768 images of generic subjects, many texts describe multiple images. These “false negatives” contaminate the loss. SigLIP's sigmoid formulation handles this more gracefully by treating each pair independently rather than as a single-correct softmax problem.
CLIP's closed training data created a reproducibility gap. Three major follow-ups each targeted a different limitation:
**OpenCLIP.** What it fixed: reproducibility. Open weights, open training code (the open_clip library, installable via `pip install open_clip_torch`), and open data (LAION-400M, LAION-2B, DataComp-1B).

**EVA-CLIP.** What it fixed: capacity. Uses masked image modelling pre-training (EVA) to initialise the image encoder before contrastive fine-tuning. This unlocks huge ViTs that otherwise diverge from random init.

**SigLIP.** What it fixed: loss formulation. Replaces softmax over the batch with a per-pair sigmoid binary cross-entropy. This removes the false-negative problem and decouples batch size from loss validity.
| Model | Data | Loss | Max IN-1K 0-shot | Open weights |
|---|---|---|---|---|
| CLIP ViT-L/14 | WIT-400M (closed) | InfoNCE softmax | 76.2% | Yes (weights only) |
| OpenCLIP ViT-bigG/14 | LAION-2B (open) | InfoNCE softmax | 80.1% | Yes (weights + code + data) |
| EVA-CLIP-18B | LAION-400M + CC12M | InfoNCE softmax | 83.0% | Yes |
| SigLIP-SO400M | WebLI-10B (closed) | Sigmoid BCE | 83.1% | Yes (weights) |
The original CLIP uses softmax across the entire batch for each positive, making the problem a single-correct multi-class classification. SigLIP treats each pair in the N × N matrix as an independent binary classification using sigmoid binary cross-entropy. This has deep consequences.
import torch
import torch.nn.functional as F

# logits: [N, N] = logit_scale * image @ text.T + logit_bias
# labels: +1 on the diagonal (matched pairs), -1 everywhere else
labels = 2 * torch.eye(N, device=logits.device) - 1          # {-1, +1}

# Per-pair sigmoid binary cross-entropy, summed over all N^2 pairs
loss = -F.logsigmoid(labels * logits).sum() / N

# Equivalent formulation via BCE-with-logits (targets in {0, 1}):
# loss = F.binary_cross_entropy_with_logits(logits, (labels + 1) / 2, reduction='sum') / N
SigLIP also adds a learnable bias to the logits, initialised to -10 to offset the overwhelming fraction of negative pairs in the N × N matrix.

PaLI-3, PaLM-E, Gemini, and several open VLMs (PaliGemma, InternVL2) use SigLIP-SO400M as the image encoder because it transfers better to fine-grained tasks like OCR, charts, and dense captioning, precisely because the sigmoid loss forces richer per-image discrimination rather than relative ranking within a batch.
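A usage sketch, assuming the Hugging Face transformers SigLIP integration and the `google/siglip-so400m-patch14-384` checkpoint; note the sigmoid in place of the softmax:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-so400m-patch14-384"   # assumed checkpoint name
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=Image.open("photo.jpg"),
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Each (image, text) pair is scored independently; probabilities need not sum to 1
probs = torch.sigmoid(outputs.logits_per_image)   # [1, len(texts)]
```

Because each pair is scored independently, the per-class probabilities need not sum to 1, which is often what you want for multi-label or open-vocabulary tagging.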
The DataComp benchmark (Gadre et al., 2023) and a series of OpenCLIP ablations produced the clearest picture yet of what actually drives CLIP zero-shot performance. The answer is mostly data quality, then model size, then compute budget — a different ranking from language-model scaling laws.
Unlike Chinchilla, which scales tokens and parameters in equal proportion, CLIP's optimal frontier is harder to characterise because the effective dataset size after curation is not fixed. The DataComp paper recommends: filter aggressively (CLIP-score > 0.3 plus English-text heuristics), use ViT-L or larger, batch ≥ 32k, and train for roughly 13B image-text pair-steps. Beyond that, returns diminish.
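A sketch of the CLIP-score filtering step under those recommendations (the 0.3 threshold is the one quoted above; `candidate_pairs` is a hypothetical list of (image_path, caption) tuples):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

def clip_score(image_path, caption):
    """Cosine similarity between one image and its caption under a fixed CLIP model."""
    img = preprocess(Image.open(image_path)).unsqueeze(0)
    txt = tokenizer([caption])
    with torch.no_grad():
        i = torch.nn.functional.normalize(model.encode_image(img), dim=-1)
        t = torch.nn.functional.normalize(model.encode_text(txt), dim=-1)
    return (i @ t.T).item()

# Keep only pairs whose image-text similarity clears the threshold
kept = [(p, c) for p, c in candidate_pairs if clip_score(p, c) > 0.3]
```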
CLIP's canonical downstream use is zero-shot classification: construct one text prompt per class label, encode all prompts, then rank the image against the text embeddings by cosine similarity. No fine-tuning, no labelled data.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()

# Build class embeddings once (cache them for large label sets)
classnames = ["a dog", "a cat", "a car"]
templates = ["a photo of {}", "a {} in the wild"]
texts = [t.format(c) for c in classnames for t in templates]
with torch.no_grad():
    txt_feats = model.encode_text(tokenizer(texts)).float()
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)
    # Average over templates for each class, then re-normalise
    txt_feats = txt_feats.view(len(classnames), len(templates), -1).mean(1)
    txt_feats /= txt_feats.norm(dim=-1, keepdim=True)

img = preprocess(Image.open("photo.jpg")).unsqueeze(0)
with torch.no_grad():
    img_feat = model.encode_image(img).float()
    img_feat /= img_feat.norm(dim=-1, keepdim=True)

# Scale by the learned logit scale (~100) so the softmax matches training-time sharpness;
# the argmax is unchanged either way
probs = (model.logit_scale.exp() * img_feat @ txt_feats.T).softmax(dim=-1)
pred = classnames[probs.argmax()]
OpenAI found that ensembling 80 hand-crafted templates (“a photo of a {}”, “a blurry photo of the {}”, …) improved IN-1K from 71.3% to 76.2% for ViT-L/14. The averaged embedding is more robust than any single prompt. This gave rise to CLIP prompt tuning and adaptation methods (CoOp, CoCoOp, CLIP-Adapter), which learn the prompt prefix, or a lightweight adapter, in continuous embedding space rather than crafting it by hand.
The same embedding space enables text → image and image → text retrieval. On MS-COCO Recall@1, CLIP ViT-L/14 achieves 58.4% image→text and 37.8% text→image; SigLIP-SO400M pushes these to roughly 70% and 50%. The same similarity search underpins the retrieval-augmentation pipelines built around DALL-E 2 and Stable Diffusion.
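A minimal text → image retrieval sketch over pre-computed embeddings (reusing `model` and `tokenizer` from the zero-shot example; `img_feats` is assumed to be an [M, d] matrix of L2-normalised image embeddings built with the same model):

```python
import torch

def retrieve(query_text, img_feats, k=5):
    """Rank a gallery of pre-computed image embeddings against one text query."""
    with torch.no_grad():
        q = model.encode_text(tokenizer([query_text])).float()
        q = q / q.norm(dim=-1, keepdim=True)
    sims = (q @ img_feats.T).squeeze(0)   # cosine similarity to every gallery image
    return sims.topk(k)                   # (scores, gallery indices)

scores, indices = retrieve("a dog wearing a hat", img_feats)
```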
CLIP is rarely deployed standalone. Its embedding space and image encoder are load-bearing components in three major downstream roles: text conditioning for diffusion models, frozen vision encoders for vision-language models, and automated evaluation via CLIP score.
Most text-to-image systems keep the CLIP text encoder frozen during diffusion training, treating it as a fixed feature extractor. LLaVA-1.5 keeps the CLIP image encoder frozen too (only the MLP projection and LLM are trained). This is why a fine-tuned diffusion model that breaks the CLIP conditioning — e.g. by further training the text encoder on NSFW data — loses compositionality on common prompts.
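A sketch of the freezing pattern (the `clip_model.visual` / `text_encoder` handles are assumptions about how the components are exposed; the point is that only the projection and the downstream model receive gradients):

```python
import torch.nn as nn

def freeze(module: nn.Module):
    """Stop gradients flowing into a pre-trained CLIP component."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()   # also fixes dropout / normalisation behaviour

freeze(clip_model.visual)   # LLaVA-style: the CLIP image encoder stays fixed
freeze(text_encoder)        # diffusion-style: the CLIP text encoder stays fixed
# Only the projection MLP and the LLM / U-Net remain trainable.
```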
CLIP score (Hessel et al., 2021) measures image-text alignment by computing the cosine similarity of a generated image and its prompt using a fixed CLIP model (usually ViT-L/14). It has become the standard automated metric for text-to-image models: DALL-E 3 reports 0.754 CLIP-score-H on COCO-30K, vs Stable Diffusion XL at 0.743. The metric is imperfect — it rewards style over semantic precision — but it is reproducible and correlates with human preference at scale.
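A sketch of the metric following the Hessel et al. definition, which rescales the clipped cosine by w = 2.5 (reusing `model`, `preprocess`, and `tokenizer` from the zero-shot example above):

```python
import torch
from PIL import Image

def clip_score(image_path, prompt, w=2.5):
    """CLIPScore = w * max(cos(image, text), 0), computed with a fixed CLIP model."""
    img = preprocess(Image.open(image_path)).unsqueeze(0)
    txt = tokenizer([prompt])
    with torch.no_grad():
        i = torch.nn.functional.normalize(model.encode_image(img).float(), dim=-1)
        t = torch.nn.functional.normalize(model.encode_text(txt).float(), dim=-1)
    return w * torch.clamp(i @ t.T, min=0).item()
```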
Deck 02 drills into the image encoder side: ViT, DeiT, Swin, DINOv2, and SAM. Once you understand those architectures the CLIP image encoder column becomes much clearer, and you’ll be able to reason about why patch size and resolution are the primary knobs when adapting a CLIP model to a new task.