Maybe dumb question, but how do you serve multiple users with the full context length?

Reddit r/LocalLLaMA Tools

Summary

A user asks how llama.cpp can serve multiple users each with full context length, noting that it seems to only share the context pool rather than providing dedicated context per user.

After experimenting with llama.cpp, I'm wondering about something. Say we have an LLM with a context size of 128k, and we want to support up to 8 parallel users while giving **each** client the full context. How does that work with llama.cpp? AFAIK it only allows *sharing* the 128k between users, not actually providing 128k _per_ user. Is there something I'm missing? Thanks
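
To make the arithmetic behind the question concrete, here is a minimal sketch. It assumes the server's total context (llama-server's `-c` / `--ctx-size`) is divided evenly across the parallel slots (`-np` / `--parallel`); that even split is my understanding, not something confirmed here, and newer builds may behave differently. The helper names are hypothetical, and the sketch ignores whether the KV cache for that much context would fit in memory.

```python
# Assumption: total context is split evenly across parallel slots.
# Both function names are hypothetical helpers, not llama.cpp APIs.

def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Context available to each slot when total_ctx is shared evenly."""
    return total_ctx // n_parallel


def total_context_needed(per_user_ctx: int, n_parallel: int) -> int:
    """Total context to request so each slot gets per_user_ctx tokens."""
    return per_user_ctx * n_parallel


MODEL_CTX = 128 * 1024  # "128k" taken as 131072 tokens
USERS = 8

# Sharing one 128k pool between 8 users:
shared_per_user = per_slot_context(MODEL_CTX, USERS)

# Total context that would be needed for 128k per user:
required_total = total_context_needed(MODEL_CTX, USERS)

print(f"shared pool: {shared_per_user} tokens per user")
print(f"full context for all: {required_total} tokens in total")
```

Under that assumption, giving every user the full window would mean asking for 8 times the model's context as the total, rather than 128k.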