I shipped a multimodal semantic search over 68k artworks. Modal + L40S + Qwen3-VL-Embedding

Reddit r/AI_Agents 06/03/26, 10:55 PM News

multimodal-semantic-search qwen3-vl-embedding faiss modal l40s cloudflare-r2 artwork-search

Summary

A developer built a multimodal semantic search over 68k artworks from the National Gallery of Art using Qwen3-VL-Embedding, FAISS, Modal, and Cloudflare R2. The system achieves warm response times of ~1.3s and cold starts of ~44s, supporting both text-to-image and image-to-image queries.

What I built: text → artworks search over the National Gallery of Art's open-access collection. Query something like "a Vermeer with afternoon light" or "lonely winter scene" and you get 64 visually similar results, ranked by multimodal similarity. Image-to-image works too: upload an artwork, get similar ones.

**The data:**

* 68,816 artworks from the NGA open-access dataset (CC0, public domain)
* Each record: title, artist, year, IIIF thumbnail URL
* Stored in a Cloudflare R2 bucket

**The stack:**

* **Embedding**: Qwen3-VL-Embedding-2B. Multimodal (text or image) → single vector. I'm indexing at 1024-dim from a 2048-dim base.
* **Index**: FAISS, flat inner product on L2-normalized vectors. ~250MB on disk.
* **Serving**: Modal, running FastAPI on the backend. L40S GPU, scale-to-zero. Warm request ~1.3s end-to-end, cold ~44s (image pull + embedder load).
* **Frontend**: Next.js on Vercel, with search and pagination endpoints that proxy to Modal.
* **Storage**: Cloudflare R2 for the FAISS index, the metadata, and the thumbnail cache. R2's zero-egress pricing is the only reason this is viable as a side project.

**Things I learned the hard way:**

1. **Cold start is the real enemy.** My container on Modal has an 8-minute scaledown window: after 8 min idle, it's gone. Fresh startup is ~31s (image pull + the embedder loaded onto the L40S in bfloat16, ~4GB GPU working set); the first search request then adds another ~1.5s for the embedder's first inference. I keep a warm container alive with a Vercel cron ping every 7 minutes. Measured today on the production build: first-cold is 43.8s end-to-end; warm is 0.4-2.0s end-to-end (mean 1.3s, median 1.45s, 10/10 success). On a warm container, each search request is 35-41ms server-side: 34-39ms is the embedder, 0.9-2.4ms is FAISS over 68k vectors. Most of the 1-2s client delta is the Vercel↔Modal round trip, not the model.
2. **IIIF saves a ton of work.** NGA's IIIF endpoint serves arbitrary-size thumbnails on demand. I normalize all candidate images to a fixed thumbnail size so I'm not shipping 6MB TIFFs to users on slow connections. Search never loads the full image, only the embedding.
3. **Vector dim is a real tradeoff.** I'm indexing at 1024-dim from a 2048-dim base; FAISS handles 68k vectors in 0.9-2.4ms. Doubling the dim (back to the full 2048) would double the index size and noticeably slow search.
4. **Pagination on a flat index is awkward.** I return the top 64 from a top-80 nearest-neighbor search. "Search more" pages through the broader top-256 with an offset pinned to multiples of 16, range 64-240. Three more pages feels like the right amount before the results start to feel stale.

This is a solo project. I'm a UK-based master's student (finance background; I am very much not a CV engineer), so slow replies are likely.
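
For anyone who wants the retrieval and paging logic spelled out, here is a rough stdlib-only sketch. It is not the project's code. It assumes the 1024-dim vectors come from keeping the leading components of the 2048-dim embedding and re-normalizing, which the post does not confirm. It also replaces FAISS with a plain brute-force loop, which computes the same scores as a flat inner-product index, only far more slowly. The function names, the tiny dimensions, and the page size of 16 are all made up for illustration.

```python
import math
import random


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then rescale to unit L2 length.

    Assumption: leading-component truncation; the post only says
    "1024-dim from a 2048-dim base".
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def inner_product(a, b):
    # On unit-length vectors the inner product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def flat_search(query, corpus, k):
    """Brute-force top-k: score every vector, sort by score descending.

    Returns a list of (corpus_index, score) pairs.
    """
    scored = [(i, inner_product(query, v)) for i, v in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def clamp_offset(offset, step=16, lo=64, hi=240):
    """Pin a requested "search more" offset to a multiple of `step` in [lo, hi]."""
    snapped = (offset // step) * step
    return max(lo, min(hi, snapped))


def page(ranked, offset, size):
    """Slice one page out of an already ranked result list."""
    return ranked[offset:offset + size]


if __name__ == "__main__":
    rng = random.Random(0)
    base_dim, index_dim = 8, 4  # toy stand-ins for 2048 and 1024
    raw = [[rng.gauss(0, 1) for _ in range(base_dim)] for _ in range(300)]
    corpus = [truncate_and_normalize(v, index_dim) for v in raw]

    query = corpus[42]  # pretend this is the embedded text or image query
    ranked = flat_search(query, corpus, k=256)  # the broader top-256
    first_page = ranked[:64]  # what the initial search returns
    more = page(ranked, clamp_offset(70), size=16)  # hypothetical page size

    print(first_page[0])
    print(len(more))
```

Because every stored vector and the query are unit length, ranking by inner product is the same as ranking by cosine similarity, which is why a flat inner-product index is enough here. The clamp means a client can send any offset and still land on a valid slice of the top-256.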

Similar Articles

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

@UnslothAI: 4-bit Qwen3.6 MTP GGUF managed to search 70+ sites from a single prompt. Try this locally on 20GB RAM via Unsloth Studi…

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Qwen3.7 Preview lands on Arena (1 minute read)
