krthr/clip-embeddings
Summary
A CLIP-based embedding model hosted on Replicate that generates 768-dimensional embeddings for both images and text using the clip-vit-large-patch14 architecture, costing ~$0.00022 per run.
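As a rough sketch, the model can be called through the Replicate Python client. The input field names ("text", "image") and the shape of the returned output are assumptions based on the summary above, not the model's documented schema, so check the model page on Replicate before relying on them.

```python
# Minimal sketch of calling krthr/clip-embeddings with the Replicate Python client.
# The "text" and "image" input fields and the output format are assumptions.
import replicate

# Embed a piece of text.
text_output = replicate.run(
    "krthr/clip-embeddings",
    input={"text": "a photo of a golden retriever"},
)

# Embed an image from a URL.
image_output = replicate.run(
    "krthr/clip-embeddings",
    input={"image": "https://example.com/dog.jpg"},
)

# Both calls are expected to return 768-dimensional CLIP embeddings.
print(text_output)
print(image_output)
```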
Similar Articles
andreasjansson/clip-features
A model on Replicate that outputs CLIP ViT-L/14 features for text and images, allowing similarity computation between inputs.
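Similarity between such features is usually measured with cosine similarity. The sketch below is a generic implementation under that assumption, not code taken from the model itself.

```python
# Generic cosine similarity between two embedding vectors, e.g. a text
# embedding and an image embedding produced by a CLIP model.
import numpy as np

def cosine_similarity(a, b) -> float:
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example with random stand-in vectors of CLIP ViT-L/14 size (768 dims).
text_vec = np.random.rand(768)
image_vec = np.random.rand(768)
print(cosine_similarity(text_vec, image_vec))
```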
beautyyuyanli/multilingual-e5-large
The multilingual E5-large embedding model is now available on Replicate, costing ~$0.00098 per run and completing in about 1 second on an Nvidia L40S GPU.
CLIP: Connecting text and images
CLIP is OpenAI's vision-language model that learns from text-image pairs from the internet, enabling zero-shot visual classification without task-specific training data. It addresses major limitations in traditional computer vision by reducing dependence on expensive labeled datasets and improving real-world generalization.
Hierarchical text-conditional image generation with CLIP latents
OpenAI proposes a hierarchical two-stage model for text-conditional image generation using CLIP latents: a prior that generates CLIP image embeddings from text captions, and a diffusion-based decoder that generates images from embeddings. The approach improves image diversity and enables zero-shot language-guided image manipulations.
New embedding models and API updates
OpenAI released two new embedding models: text-embedding-3-small (5x cheaper than ada-002 with 40%+ MIRACL improvement) and text-embedding-3-large (best performance with up to 3072 dimensions). Both models show significant performance gains on standard benchmarks while reducing costs.
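For illustration, the text-embedding-3 models accept a dimensions parameter for shortening the returned vectors. The following is a minimal sketch using the official OpenAI Python client; it assumes OPENAI_API_KEY is set in the environment.

```python
# Minimal sketch: embedding text with text-embedding-3-large and requesting
# a reduced number of dimensions. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="CLIP connects text and images through a shared embedding space.",
    dimensions=1024,  # up to 3072 for text-embedding-3-large
)

embedding = response.data[0].embedding
print(len(embedding))  # -> 1024
```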