Maybe dumb question, but how do you serve multiple users with the full context length?
Summary
A user asks how llama.cpp can serve multiple users while giving each one the full context length, noting that it seems to share a single context pool across users rather than providing a dedicated context per user.
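The arithmetic behind the question can be sketched roughly as follows. This is a minimal illustration, assuming the server's total context (set with `-c` / `--ctx-size`) is divided evenly among the parallel slots (set with `-np` / `--parallel`). Whether the split is strictly even depends on the llama.cpp version and KV-cache settings, so treat the even split as an assumption. The function names are hypothetical and are not part of llama.cpp.

```python
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens available to each slot, ASSUMING the total context is split evenly."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return total_ctx // n_parallel


def total_context_needed(per_user_ctx: int, n_users: int) -> int:
    """Total context to request so that each user gets per_user_ctx tokens
    under the same even-split assumption."""
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    return per_user_ctx * n_users


if __name__ == "__main__":
    # A 32k total context shared by 4 slots leaves 8k tokens per slot.
    print(per_slot_context(32768, 4))
    # Giving each of 4 users a full 32k window needs 128k tokens in total.
    print(total_context_needed(32768, 4))
```

Under that assumption, giving every user the model's full window means multiplying the total context by the number of slots, and KV-cache memory use grows with it.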
Similar Articles
Seeking resources to read about llama.cpp server and how offloading works
A user shares their experience with llama.cpp server's model offloading, noting performance trade-offs and quiet operation, and asks for resources to understand how the tool manages memory across VRAM and system RAM.
How do you keep long sessions from eating the whole context window?
A user shares a custom Plugin SDK hook that gradually compresses older turns while keeping recent ones raw to prevent context window exhaustion in long OpenClaw sessions, reducing re-sent context by 80%.
Local compression helps
A user shares a tip to use Ollama's local llama3.1:8b model for compressing conversation context in agent workflows, reducing latency and token usage compared to sending context to providers.
@ickma2311: Efficient AI Lecture 15: Long-Context LLM Long context is not just a bigger prompt window. The key question is: which p…
This post summarizes Efficient AI Lecture 15 on long-context LLMs, covering RoPE position interpolation for context extension, the needle-in-haystack evaluation, and StreamingLLM's attention sink phenomenon and KV cache eviction strategy.
@MaximeRivest: current llm architecture is stupid (if not stupid its, at least, wasteful). take these 3 prompts of 4 context chunks: […
A tweet criticizes current LLM architecture for wasteful recomputation due to order-dependent context, and proposes encoding context units separately to enable order-invariant, efficient caching and generation.