Building an Open Source Edge Semantic Cache for LLMs in Rust/WASM – Sanity check on the architecture? [D]

Reddit r/MachineLearning Tools

Summary

Proposes building an open-source, lightweight semantic cache for LLMs using Rust/WASM at the CDN edge to reduce latency and API costs, seeking community feedback on architecture and use-case validity.

Hey everyone, I am planning out a new open-source infrastructure project and want to get some brutal feedback on the architecture and use-case validity from people running high volume LLM workloads in production. **The Problem:** Python-based proxies/gateways introduce too much latency overhead for real-time streaming agent steps or fast UI completions. Additionally, centralized semantic caching still suffers from cross-region network latency (e.g., London to us-east-1), and enterprise API costs remain a massive bottleneck for repetitive/predictable user queries (like customer support or structured data extraction). **The Proposed Architecture:** Instead of a heavy centralized gateway, the goal is to build a lightweight, zero-dependency semantic cache running directly at the CDN Edge using WebAssembly (WASM) compiled from Rust. The flow looks like this: 1. **Inbound Prompt:** Hits the edge node closest to the user (e.g., Cloudflare Workers / Fastly Compute). 2. **Edge Embedding:** The Rust/WASM module intercepts the raw text prompt and instantly generates a vector using an edge-native lightweight model (e.g., `bge-small-en-v1.5`). 3. **Similarity Index Check:** It performs a fast cosine similarity check against an edge vector database (like Cloudflare Vectorize) to find the nearest semantic neighbor. 4. **Cache Hit:** If similarity >= threshold (e.g., 0.88), it pulls the full generated response text from an edge KV store and returns it in \~5ms. The main LLM provider is never billed or touched. 5. **Cache Miss:** It proxies the streaming request to OpenAI/Anthropic/vLLM, streams it back to the client, and asynchronously updates the edge vector index and KV store. **Why Rust/WASM?** To achieve sub-millisecond execution overhead on the proxy itself, avoid garbage collection pauses, and maintain a tiny memory footprint suitable for edge runtime constraints where traditional databases or Python scripts cannot run. **My Questions for the Community:** 1. For those running LLMs in production (especially customer support, internal RAG, or autonomous agents), what is your realistic semantic cache hit rate? Is the power law of repetitive queries high enough in your domains to justify this? 2. What are the biggest footguns with semantic caching at the edge? (e.g., Cache invalidation strategies, handling system prompt updates, or drift in embedding models). 3. Would you actually use a drop-in open-source template/CLI that lets you spin this up on your own edge account, or do you prefer centralized API gateways?
Original Article

Similar Articles

I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B

Reddit r/LocalLLaMA

The author released a pure Rust, CPU-only inference implementation of the LFM2.5-8B-A1B model (4-bit Q4KM quantization), achieving a decode speed of approximately 37 tokens/s and memory usage around 7GB. The goal is to make LLMs runnable on cheap VPS or older machines. The implementation is open source and published as a cargo crate.

LMCache/LMCache

GitHub Trending (daily)

LMCache is an open-source KV cache management layer for LLM inference that reduces time-to-first-token and improves throughput by enabling persistent storage and reuse of KV cache across serving engines.