CLIP: Connecting text and images

OpenAI Blog · Models

Summary

CLIP is OpenAI's vision-language model that learns from text-image pairs from the internet, enabling zero-shot visual classification without task-specific training data. It addresses major limitations in traditional computer vision by reducing dependence on expensive labeled datasets and improving real-world generalization.

We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

# CLIP: Connecting text and images

Source: [https://openai.com/index/clip/](https://openai.com/index/clip/)

CLIP was designed to mitigate a number of major problems in the standard deep learning approach to computer vision:

**Costly datasets**: Deep learning needs a lot of data, and vision models have traditionally been trained on manually labeled datasets that are expensive to construct and only provide supervision for a limited number of predetermined visual concepts. The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text–image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior work, notably self-supervised learning,[14](https://openai.com/index/clip/#citation-bottom-14),[15](https://openai.com/index/clip/#citation-bottom-15),[16](https://openai.com/index/clip/#citation-bottom-16) contrastive methods,[17](https://openai.com/index/clip/#citation-bottom-17),[18](https://openai.com/index/clip/#citation-bottom-18),[19](https://openai.com/index/clip/#citation-bottom-19),[20](https://openai.com/index/clip/#citation-bottom-20),[21](https://openai.com/index/clip/#citation-bottom-21) self-training approaches,[22](https://openai.com/index/clip/#citation-bottom-22),[23](https://openai.com/index/clip/#citation-bottom-23) and generative modeling.[24](https://openai.com/index/clip/#citation-bottom-24),[25](https://openai.com/index/clip/#citation-bottom-25),[26](https://openai.com/index/clip/#citation-bottom-26),[27](https://openai.com/index/clip/#citation-bottom-27)

**Narrow**: An ImageNet model is good at predicting the 1000 ImageNet categories, but that’s all it can do “out of the box.” If we wish to perform any other task, an ML practitioner needs to build a new dataset, add an output head, and fine-tune the model. In contrast, CLIP can be adapted to perform a wide variety of visual classification tasks without needing additional training examples. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations (see the code sketch below). The accuracy of this classifier is often competitive with fully supervised models. We show random, non-cherry-picked predictions of zero-shot CLIP classifiers on examples from various datasets below.

**Poor real-world performance**: Deep learning systems are often reported to achieve human or even superhuman performance[28](https://openai.com/index/clip/#citation-bottom-28),[A](https://openai.com/index/clip/#citation-bottom-A) on vision benchmarks, yet when deployed in the wild, their performance can be far below the expectation set by the benchmark. In other words, there is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild. To verify the “cheating hypothesis”, we also measure how CLIP’s performance changes when it is able to “study” for ImageNet.
When a linear classifier is fitted on top of CLIP’s features, it improves CLIP’s accuracy on the ImageNet test set by almost 10%. However, this classifier does *no better* on average across an evaluation suite of 7 other datasets measuring “robust” performance.[30](https://openai.com/index/clip/#citation-bottom-30)
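
The zero-shot workflow described above can be illustrated with a minimal sketch. This is not OpenAI's reference code (that lives at [https://github.com/openai/CLIP](https://github.com/openai/CLIP)); it uses the Hugging Face `transformers` CLIP classes, and the checkpoint name, class names, prompt template, and image path are illustrative assumptions.

```python
# Zero-shot image classification with CLIP: embed the class names as text,
# embed the image, and treat image-text similarity scores as class logits.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "Tell" the text encoder the names of the task's visual concepts.
# Class names and prompt template are hypothetical examples.
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one similarity score per prompt; softmax turns
# them into per-class probabilities for this image.
probs = outputs.logits_per_image.softmax(dim=-1)
for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

The prompts only need to be embedded once: the resulting text embeddings act as the weight vectors of a linear classifier over CLIP's image features, which is what the post means by the text encoder "outputting" a classifier. The linear probe mentioned above is the supervised counterpart, where those weights are instead fitted to labeled image features.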

Similar Articles

Hierarchical text-conditional image generation with CLIP latents

OpenAI Blog

OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.

krthr/clip-embeddings

Replicate Explore

A CLIP-based embedding model hosted on Replicate that generates 768-dimensional embeddings for both images and text using the clip-vit-large-patch14 architecture, costing ~$0.00022 per run.

Alien Dreams: An Emerging Art Scene

ML at Berkeley

The article highlights the emerging scene of AI-generated art using OpenAI's CLIP model as a steering mechanism for generative models, showcasing various examples of text-to-image outputs.

andreasjansson/clip-features

Replicate Explore

A model on Replicate that outputs CLIP ViT-L/14 features for text and images, allowing similarity computation between inputs.
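
Both Replicate models listed above reduce to the same operation: embed an image and a piece of text with CLIP and compare the embeddings. A minimal sketch of that comparison, assuming the Hugging Face `transformers` CLIP classes, with an illustrative image path and caption (the `clip-vit-large-patch14` checkpoint matches the 768-dimensional embeddings mentioned above):

```python
# Compute cosine similarity between a CLIP image embedding and a CLIP
# text embedding; higher values mean the caption better matches the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")            # hypothetical image
text = "a painting of a mountain lake"     # hypothetical caption

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# L2-normalize both embeddings so the dot product equals cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
similarity = (img_emb @ txt_emb.T).item()
print(f"cosine similarity: {similarity:.3f}")
```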