Tag
A guide to running the Pi AI agent securely inside a Docker Sandbox while llama-server runs on the host machine for local GPU inference.
The author upgraded their Hermes agents with TencentDB Agent Memory, using a local Qwen 3.5-4B model served by llama-server for structured JSON extraction and multi-step tool use, and implemented a resilient layered memory pipeline with cursor-based checkpointing.
Llampart 1.0.0 is a standalone local web UI for llama-server with translations, extended settings, and a polished conversation sidebar, released under the MIT license.
A discussion of how the Pi coding agent controls the thinking verbosity of the Qwen 35B A3B model on llama-server when other clients fail to do so.
Llama-Studio is a WebUI for managing llama-server sessions, allowing configuration, monitoring, and control of multiple instances for local development and experimentation.
Georgi Gerganov shared a one-liner to launch the quantized 27B Qwen3.6 model with llama-server using default speculative-decoding settings.
The author shares a working llama-server config for running the 35B-MoE Qwen3.6 model on an 8GB RTX 4060, highlighting a max_tokens trap caused by unconstrained internal reasoning and the fix: a per-request thinking_budget_tokens setting.