Maybe dumb question, but how do you serve multiple users with the full context length?

Reddit r/LocalLLaMA 06/15/26, 07:59 PM Tools

Summary

A user asks how llama.cpp can serve multiple users each with full context length, noting that it seems to only share the context pool rather than providing dedicated context per user.

After experimenting with llama.cpp, I'm wondering about something. Let's say we have an LLM with a context size of 128k, and we want to support up to 8 parallel users while giving **each** client the full context. How does that work with llama.cpp? AFAIK it only allows *sharing* the 128k between users, not actually providing 128k _per_ user. Is there something I'm missing? Thanks
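To make the numbers in the question concrete, here is a minimal sketch of the arithmetic. It assumes the total context is split evenly across parallel slots, which is how the post describes the behavior, and it takes "128k" to mean 131072 tokens. The function names are made up for illustration and are not part of llama.cpp.

```python
# Hypothetical helpers for illustration only; not a llama.cpp API.

def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens each slot gets if the total context is divided evenly (assumption)."""
    return total_ctx // n_parallel


def total_ctx_for_full_slots(model_ctx: int, n_parallel: int) -> int:
    """Total context needed so that every slot can hold the model's full context."""
    return model_ctx * n_parallel


MODEL_CTX = 128 * 1024  # assumption: "128k" means 131072 tokens
N_USERS = 8

shared = per_slot_context(MODEL_CTX, N_USERS)
needed = total_ctx_for_full_slots(MODEL_CTX, N_USERS)

print(f"128k shared by {N_USERS} users: {shared} tokens per user")
print(f"128k for each of {N_USERS} users: {needed} tokens in total")
```

Under that assumption, giving every user the full window would presumably mean setting `--ctx-size` to the larger total alongside `--parallel 8`, memory permitting, since the KV cache grows with the total context. How a given build actually divides the context is worth checking in the server documentation.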
