krthr/clip-embeddings

Replicate Explore

Summary

A CLIP-based embedding model hosted on Replicate that generates 768-dimensional embeddings for both images and text using the clip-vit-large-patch14 architecture, costing ~$0.00022 per run.

# CLIP Embeddings for Images and Text on Replicate

Source: [https://replicate.com/krthr/clip-embeddings](https://replicate.com/krthr/clip-embeddings)

## Run time and cost

This model costs approximately $0.00022 to run on Replicate, or about 4,545 runs per $1, though this varies with your inputs. It is also open source, and you can [run it on your own computer with Docker](https://replicate.com/krthr/clip-embeddings/api). The model runs on [Nvidia T4 GPU hardware](https://replicate.com/docs/billing), and predictions typically complete within 1 second.

## Readme

Get text & image embeddings using CLIP.

### Details

- Model used: `clip-vit-large-patch14`
- Length of the embeddings: `768`

### Response

```
{
  "embedding": [0.1, 0.2, ..., 0.5]
}
```

Model created over 1 year ago.
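As a minimal sketch of how a call to this model might look from Python using Replicate's official client: the input parameter names (`text`, `image`) and the unpinned model reference are assumptions to verify against the model's API page.

```python
# Sketch of querying krthr/clip-embeddings via the Replicate Python client.
# Assumes REPLICATE_API_TOKEN is set in the environment, and that the model
# accepts `text` and `image` inputs (check the API page for the exact schema).
import replicate

# Embed a piece of text. Newer clients resolve the latest version when no
# version hash is given; you can also pin one: "krthr/clip-embeddings:<version>".
text_output = replicate.run(
    "krthr/clip-embeddings",
    input={"text": "a photo of an astronaut riding a horse"},
)
text_embedding = text_output["embedding"]
print(len(text_embedding))  # expected: 768

# Embed an image by URL (local files can be passed as open file handles).
image_output = replicate.run(
    "krthr/clip-embeddings",
    input={"image": "https://example.com/astronaut.jpg"},
)
image_embedding = image_output["embedding"]
```

Because text and images land in the same 768-dimensional space, the two embeddings above can be compared directly.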

Similar Articles

andreasjansson/clip-features

Replicate Explore

A model on Replicate that outputs CLIP ViT-L/14 features for text and images, allowing similarity computation between inputs.
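Since both models return equal-length vectors for text and images, "similarity computation between inputs" reduces to cosine similarity over the embeddings. A small illustrative helper (the function name is mine, not part of either model's API):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare a text embedding against an image embedding, both
# 768-dimensional vectors produced by the same CLIP model:
# score = cosine_similarity(text_embedding, image_embedding)
```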

beautyyuyanli/multilingual-e5-large

Replicate Explore

The multilingual-e5-large embedding model is now available on Replicate, costing ~$0.00098 per run and completing in ~1 second on Nvidia L40S hardware.

CLIP: Connecting text and images

OpenAI Blog

CLIP is OpenAI's vision-language model that learns visual concepts from text-image pairs collected from the internet, enabling zero-shot visual classification without task-specific training data. It addresses major limitations in traditional computer vision by reducing dependence on expensive labeled datasets and improving real-world generalization.
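To illustrate the zero-shot mechanism (my sketch, not code from the article): CLIP scores an image against a set of candidate captions and picks the best match. Here is how that looks with Hugging Face transformers and the same clip-vit-large-patch14 checkpoint the Replicate model uses:

```python
# Illustrative zero-shot classification with CLIP via Hugging Face
# transformers. Requires: pip install transformers torch pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")  # hypothetical local image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores;
# softmax over the labels turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```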

Hierarchical text-conditional image generation with CLIP latents

OpenAI Blog

OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.

New embedding models and API updates

OpenAI Blog

OpenAI released two new embedding models: text-embedding-3-small (5x cheaper than ada-002 with 40%+ MIRACL improvement) and text-embedding-3-large (best performance with up to 3072 dimensions). Both models show significant performance gains on standard benchmarks while reducing costs.
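For context, a hedged sketch of calling these models through OpenAI's Python client, including the `dimensions` parameter the release introduced for shortening embeddings (assumes OPENAI_API_KEY is set in the environment):

```python
# Sketch of using the new embedding models via the openai Python client.
from openai import OpenAI

client = OpenAI()

# Small model: the cheaper default, returning 1536-dimensional vectors.
small = client.embeddings.create(
    model="text-embedding-3-small",
    input="CLIP produces joint embeddings for images and text.",
)
print(len(small.data[0].embedding))  # 1536

# Large model, shortened to 256 dimensions via the `dimensions` parameter.
large = client.embeddings.create(
    model="text-embedding-3-large",
    input="CLIP produces joint embeddings for images and text.",
    dimensions=256,
)
print(len(large.data[0].embedding))  # 256
```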