CLIP: Connecting text and images

OpenAI Blog · Models

Summary

CLIP is OpenAI's vision-language model that learns from text-image pairs from the internet, enabling zero-shot visual classification without task-specific training data. It addresses major limitations in traditional computer vision by reducing dependence on expensive labeled datasets and improving real-world generalization.

We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

# CLIP: Connecting text and images

Source: [https://openai.com/index/clip/](https://openai.com/index/clip/)

CLIP was designed to mitigate a number of major problems in the standard deep learning approach to computer vision:

**Costly datasets**: Deep learning needs a lot of data, and vision models have traditionally been trained on manually labeled datasets that are expensive to construct and only provide supervision for a limited number of predetermined visual concepts. The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text–image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior work, notably self-supervised learning,[14](https://openai.com/index/clip/#citation-bottom-14),[15](https://openai.com/index/clip/#citation-bottom-15),[16](https://openai.com/index/clip/#citation-bottom-16) contrastive methods,[17](https://openai.com/index/clip/#citation-bottom-17),[18](https://openai.com/index/clip/#citation-bottom-18),[19](https://openai.com/index/clip/#citation-bottom-19),[20](https://openai.com/index/clip/#citation-bottom-20),[21](https://openai.com/index/clip/#citation-bottom-21) self-training approaches,[22](https://openai.com/index/clip/#citation-bottom-22),[23](https://openai.com/index/clip/#citation-bottom-23) and generative modeling.[24](https://openai.com/index/clip/#citation-bottom-24),[25](https://openai.com/index/clip/#citation-bottom-25),[26](https://openai.com/index/clip/#citation-bottom-26),[27](https://openai.com/index/clip/#citation-bottom-27)

**Narrow**: An ImageNet model is good at predicting the 1000 ImageNet categories, but that’s all it can do “out of the box.” If we wish to perform any other task, an ML practitioner needs to build a new dataset, add an output head, and fine-tune the model. In contrast, CLIP can be adapted to perform a wide variety of visual classification tasks without needing additional training examples. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations (see the code sketch below). The accuracy of this classifier is often competitive with fully supervised models. We show random, non-cherry-picked predictions of zero-shot CLIP classifiers on examples from various datasets below.

**Poor real-world performance**: Deep learning systems are often reported to achieve human or even superhuman performance[28](https://openai.com/index/clip/#citation-bottom-28),[A](https://openai.com/index/clip/#citation-bottom-A) on vision benchmarks, yet when deployed in the wild, their performance can be far below the expectation set by the benchmark. In other words, there is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild. To verify the “cheating hypothesis”, we also measure how CLIP’s performance changes when it is able to “study” for ImageNet.
When a linear classifier is fitted on top of CLIP’s features, it improves CLIP’s accuracy on the ImageNet test set by almost 10%. However, this classifier does *no better* on average across an evaluation suite of 7 other datasets measuring “robust” performance.[30](https://openai.com/index/clip/#citation-bottom-30)
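
The zero-shot workflow described above can be illustrated with a minimal sketch. This is not OpenAI's reference code (that lives at [https://github.com/openai/CLIP](https://github.com/openai/CLIP)); it uses the Hugging Face `transformers` CLIP classes, and the checkpoint name, class names, prompt template, and image path are illustrative assumptions.

```python
# Zero-shot image classification with CLIP: embed the class names as text,
# embed the image, and treat image-text similarity scores as class logits.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "Tell" the text encoder the names of the task's visual concepts.
# Class names and prompt template are hypothetical examples.
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]

image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one similarity score per prompt; softmax turns
# them into per-class probabilities for this image.
probs = outputs.logits_per_image.softmax(dim=-1)
for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

The prompts only need to be embedded once: the resulting text embeddings act as the weight vectors of a linear classifier over CLIP's image features, which is what the post means by the text encoder "outputting" a classifier. The linear probe mentioned above is the supervised counterpart, where those weights are instead fitted to labeled image features.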

Similar Articles

Hierarchical text-conditional image generation with CLIP latents

OpenAI Blog

OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.

krthr/clip-embeddings

Replicate Explore

A CLIP-based embedding model hosted on Replicate that generates 768-dimensional embeddings for both images and text using the clip-vit-large-patch14 architecture, costing ~$0.00022 per run.

Alien Dreams: An Emerging Art Scene

ML at Berkeley

The article highlights the emerging scene of AI-generated art using OpenAI's CLIP model as a steering mechanism for generative models, showcasing various examples of text-to-image outputs.

andreasjansson/clip-features

Replicate Explore

A model on Replicate that outputs CLIP ViT-L/14 features for text and images, allowing similarity computation between inputs.
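
Both Replicate models listed above reduce to the same operation: embed an image and a piece of text with CLIP and compare the embeddings. A minimal sketch of that comparison, assuming the Hugging Face `transformers` CLIP classes, with an illustrative image path and caption (the `clip-vit-large-patch14` checkpoint matches the 768-dimensional embeddings mentioned above):

```python
# Compute cosine similarity between a CLIP image embedding and a CLIP
# text embedding; higher values mean the caption better matches the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")            # hypothetical image
text = "a painting of a mountain lake"     # hypothetical caption

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# L2-normalize both embeddings so the dot product equals cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
similarity = (img_emb @ txt_emb.T).item()
print(f"cosine similarity: {similarity:.3f}")
```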