Llama.cpp server running ~2 weeks straight. Loses its mind?
Summary
A user reports that Qwen3.6 models served by the llama.cpp server become significantly less capable after about two weeks of continuous operation, and that restarting sessions does not resolve the issue.
Similar Articles
qwen3.6 just stops
A user reports that the Qwen3.6 model stops mid-task when served via vLLM with specific Docker and speculative decoding configurations.
LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more
LlamaStation v0.9 is a Windows GUI for llama.cpp that offers a clean interface with full parameter control, multiple backends (official, TurboQuant, AtomicChat, BeeLlama), real-time VRAM monitoring, per-model profiles, voice mode, and headless mode, all without intermediate layers like Ollama.
Seeking resources to read about llama.cpp server and how offloading works
A user shares their experience with llama.cpp server's model offloading, noting performance trade-offs and quiet operation, and asks for resources to understand how the tool manages memory across VRAM and system RAM.
How do i prevent llama.cpp from offloading on Swap?
A user seeks advice on preventing llama.cpp from offloading the KV cache to swap before RAM is fully exhausted, and shares their configuration: an M2 Max with 96GB RAM running a large Qwen model.
Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into
The author shares a working llama-server config for running the Qwen3.6 35B MoE model on an 8GB RTX 4060, highlights a max_tokens trap caused by unconstrained internal reasoning, and describes the fix: a per-request thinking_budget_tokens setting.