A developer built a multimodal semantic search over 68k artworks from the National Gallery of Art using Qwen3-VL-Embedding, FAISS, Modal, and Cloudflare R2. The system achieves warm response times of ~1.3s and cold starts of ~44s, supporting both text-to-image and image-to-image queries.
**What I built:** text → artworks search over the National Gallery of Art's open-access collection. Query like "a Vermeer with afternoon light" or "lonely winter scene" and you get 64 visually similar results, ranked by multimodal similarity. Image-to-image works too: upload an artwork, get similar ones.

**The data:**

* 68,816 artworks from the NGA open-access dataset (CC0, public domain)
* Each record: title, artist, year, IIIF thumbnail URL
* Stored in a Cloudflare R2 bucket

**The stack:**

* **Embedding**: Qwen3-VL-Embedding-2B. Multimodal (text or image) → single vector. I'm indexing at 1024-dim, truncated from the 2048-dim base.
* **Index**: FAISS, flat inner product on L2-normalized vectors (so the scores are cosine similarities). \~250MB on disk.
* **Serving**: Modal, running FastAPI on the backend. L40S GPU, scale-to-zero. Warm request \~1.3s end-to-end, cold \~44s (image pull plus embedder load); measurements are in the first lesson below.
* **Frontend**: Next.js on Vercel, with search and pagination endpoints that proxy to Modal.
* **Storage**: Cloudflare R2 for the FAISS index, the metadata, and the thumbnail cache. R2's zero-egress pricing is the only reason this is viable as a side project.

**Things I learned the hard way:**

1. **Cold start is the real enemy.** My container on Modal has an 8-minute scaledown window: after 8 min idle, it's gone. Fresh startup is \~31s (image pull + the embedder loaded onto the L40S in bfloat16, \~4GB GPU working set); the first search request then adds another \~1.5s for the embedder's first inference. I keep a warm container alive with a Vercel cron ping every 7 minutes. Measured today on the production build: first-cold is 43.8s end-to-end; warm is 0.4-2.0s end-to-end (mean 1.3s, median 1.45s, 10/10 success). On a warm container, each search request is 35-41ms server-side: 34-39ms is the embedder, 0.9-2.4ms is FAISS over 68k vectors. Most of the 1-2s client delta is the Vercel↔Modal round trip, not the model.
2. **IIIF saves a ton of work.** NGA's IIIF endpoint serves arbitrary-size thumbnails on demand. I request every candidate image at a fixed thumbnail size so I'm not shipping 6MB TIFFs to users on slow connections. Search never loads the full image, only the embedding.
3. **Vector dim is a real tradeoff.** I'm indexing at 1024-dim from a 2048-dim base; FAISS handles 68k vectors in 0.9-2.4ms. Doubling the dim would double the index size and noticeably slow search.
4. **Pagination on a flat index is awkward.** I return the top 64 from a top-80 nearest-neighbor search. "Search more" pages through the broader top-256 with an offset pinned to multiples of 16, range 64-240. Three more pages feels like the right amount before the results start to feel stale.

This is a solo project: I'm a UK-based master's student with a finance background (very much not a CV engineer), so replies may be slow.
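
The retrieval path described in the post (truncate the 2048-dim embedding to 1024 dims, L2-normalize, score by inner product, then page through a fixed candidate window) can be sketched in a few lines. This is a minimal pure-Python illustration on toy vectors, not the author's code: it stands in for what a flat inner-product FAISS index computes exhaustively, and the function names (`truncate_and_normalize`, `flat_inner_product_search`, `clamp_offset`, `flat_index_bytes`) are hypothetical. The post does not say how many results each "Search more" page returns, so only the offset rule (multiples of 16, range 64-240) is modeled, and float32 storage is assumed for the size estimate.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then rescale to unit L2 length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def flat_inner_product_search(query, corpus, k):
    """Score every corpus vector against the query (no approximation).

    On unit-length vectors the inner product equals cosine similarity.
    Returns (index, score) pairs, best first.
    """
    scored = [
        (i, sum(q * c for q, c in zip(query, vec)))
        for i, vec in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def clamp_offset(offset, step=16, lo=64, hi=240):
    """Snap a requested offset down to a multiple of `step`, within [lo, hi]."""
    snapped = (offset // step) * step
    return max(lo, min(hi, snapped))


def flat_index_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw vector storage of a flat index, assuming float32 components."""
    return n_vectors * dim * bytes_per_float


# Toy data: 4-dim "base" embeddings truncated to 2 dims
# (standing in for the post's 2048 -> 1024 truncation).
raw_corpus = [
    [3.0, 4.0, 9.0, 9.0],
    [1.0, 0.0, 5.0, 5.0],
    [0.0, 2.0, 7.0, 7.0],
]
corpus = [truncate_and_normalize(v, 2) for v in raw_corpus]
query = truncate_and_normalize([1.0, 2.0, 8.0, 8.0], 2)
ranked = flat_inner_product_search(query, corpus, k=2)
```

Under the float32 assumption, `flat_index_bytes(68816, 1024)` gives 281,870,336 bytes (about 282 MB), the same ballpark as the \~250MB quoted in the post, and the same call with `dim=2048` gives exactly twice that, which is the index-size half of the dimension tradeoff.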
This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.
UnslothAI announces that its 4-bit Qwen3.6 MTP GGUF model can search over 70 websites from a single prompt, running locally on 20GB RAM via Unsloth Studio. The update adds automatic MTP and speculative decoding support.
This article provides a technical guide on training and fine-tuning multimodal embedding and reranker models using the Sentence Transformers library, demonstrating performance improvements on Visual Document Retrieval tasks with Qwen3-VL.
This paper introduces On-Policy Data Evolution (ODE) and a visual-native agent harness to improve multimodal deep search agents. By enabling reusable visual evidence and closed-loop data generation, ODE significantly boosts the performance of Qwen3-VL agents across multiple benchmarks, surpassing Gemini 2.5 Pro.
Alibaba Qwen announces two major model releases: Qwen3-Omni, the first natively end-to-end omni-modal AI unifying text, image, audio and video, and Qwen3-Next-80B-A3B, an ultra-efficient MoE model with 3B activated parameters per token, achieving SOTA performance and 10x faster inference than Qwen3-32B.