Building an Open Source Edge Semantic Cache for LLMs in Rust/WASM – Sanity check on the architecture? [D]

Reddit r/MachineLearning 06/12/26, 09:53 AM Tools

open-source semantic-cache edge-computing wasm rust llm infrastructure

Summary

Proposes building an open-source, lightweight semantic cache for LLMs using Rust/WASM at the CDN edge to reduce latency and API costs, seeking community feedback on architecture and use-case validity.

Hey everyone, I am planning out a new open-source infrastructure project and want to get some brutal feedback on the architecture and use-case validity from people running high-volume LLM workloads in production.

**The Problem:** Python-based proxies/gateways introduce too much latency overhead for real-time streaming agent steps or fast UI completions. Additionally, centralized semantic caching still suffers from cross-region network latency (e.g., London to us-east-1), and enterprise API costs remain a massive bottleneck for repetitive/predictable user queries (like customer support or structured data extraction).

**The Proposed Architecture:** Instead of a heavy centralized gateway, the goal is to build a lightweight, zero-dependency semantic cache running directly at the CDN edge using WebAssembly (WASM) compiled from Rust. The flow looks like this:

1. **Inbound Prompt:** Hits the edge node closest to the user (e.g., Cloudflare Workers / Fastly Compute).
2. **Edge Embedding:** The Rust/WASM module intercepts the raw text prompt and generates a vector using an edge-native lightweight model (e.g., `bge-small-en-v1.5`).
3. **Similarity Index Check:** It performs a fast cosine similarity check against an edge vector database (like Cloudflare Vectorize) to find the nearest semantic neighbor.
4. **Cache Hit:** If similarity >= threshold (e.g., 0.88), it pulls the full generated response text from an edge KV store and returns it in ~5ms. The main LLM provider is never billed or touched.
5. **Cache Miss:** It proxies the streaming request to OpenAI/Anthropic/vLLM, streams it back to the client, and asynchronously updates the edge vector index and KV store.

**Why Rust/WASM?** To achieve sub-millisecond execution overhead on the proxy itself, avoid garbage collection pauses, and maintain a tiny memory footprint suitable for edge runtime constraints where traditional databases or Python scripts cannot run.

**My Questions for the Community:**

1. For those running LLMs in production (especially customer support, internal RAG, or autonomous agents), what is your realistic semantic cache hit rate? Are queries in your domain repetitive enough (i.e., is the power-law head large enough) to justify this?
2. What are the biggest footguns with semantic caching at the edge? (e.g., cache invalidation strategies, handling system prompt updates, or drift in embedding models.)
3. Would you actually use a drop-in open-source template/CLI that lets you spin this up on your own edge account, or do you prefer centralized API gateways?
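
To make the hit/miss decision in steps 3 to 5 concrete, here is a minimal sketch of the lookup logic in plain Rust. Everything in it is an assumption for illustration: `SemanticCache`, `CacheEntry`, and `cosine_similarity` are hypothetical names, an in-memory `Vec` stands in for the edge vector index and KV store, the scan is brute force rather than an approximate nearest-neighbor query, and embeddings are assumed to be computed elsewhere.

```rust
/// One cached completion: the prompt embedding plus the stored response text.
/// (Hypothetical type; a real deployment would split this across a vector
/// index and a KV store.)
struct CacheEntry {
    embedding: Vec<f32>,
    response: String,
}

/// In-memory stand-in for the edge vector index + KV store.
struct SemanticCache {
    /// Minimum cosine similarity that counts as a hit (e.g. 0.88).
    threshold: f32,
    entries: Vec<CacheEntry>,
}

/// Cosine similarity of two vectors.
/// Returns 0.0 for empty, mismatched-length, or zero-norm inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

impl SemanticCache {
    fn new(threshold: f32) -> Self {
        Self { threshold, entries: Vec::new() }
    }

    /// Miss path (step 5): store the prompt embedding and the response
    /// once the upstream stream has completed.
    fn insert(&mut self, embedding: Vec<f32>, response: String) {
        self.entries.push(CacheEntry { embedding, response });
    }

    /// Steps 3 and 4: find the nearest stored embedding by brute force and
    /// return its response only if the similarity clears the threshold.
    fn lookup(&self, query: &[f32]) -> Option<&str> {
        let mut best: Option<(f32, &CacheEntry)> = None;
        for entry in &self.entries {
            let score = cosine_similarity(query, &entry.embedding);
            if best.map_or(true, |(s, _)| score > s) {
                best = Some((score, entry));
            }
        }
        match best {
            Some((score, entry)) if score >= self.threshold => {
                Some(entry.response.as_str())
            }
            _ => None,
        }
    }
}
```

With a threshold of 0.88, a query embedding close to a stored one returns the cached text from `lookup`, while an orthogonal one returns `None` and would fall through to the upstream provider. One thing the sketch leaves out on purpose is the invalidation question from point 2: nothing here ties an entry to the system prompt or embedding model version that produced it.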


