llama-server


Pi + Docker Sandbox + llama-server setup guide

Reddit r/AI_Agents ↗ · 3d ago

A guide to running the Pi AI agent securely inside a Docker Sandbox while llama-server runs on the host machine for local GPU inference.



@Michaelzsguo: Today I upgraded my Hermes agents with TencentDB Agent Memory. I did not connect it to a cloud LLM. Instead, I wired it…

X AI KOLs Timeline ↗ · 2026-05-24

The author upgraded their Hermes agents with TencentDB Agent Memory, using a local Qwen 3.5-4B model served by llama-server for structured JSON extraction and multi-step tool use. The setup implements a resilient, layered memory pipeline with cursor-based checkpointing.



llampart 1.0.0 - I released a standalone local web UI for llama-server with translations, extended settings and a polished conversation sidebar

Reddit r/LocalLLaMA ↗ · 2026-05-24

Llampart 1.0.0 is a standalone local web UI for llama-server with translations, extended settings, and a polished conversation sidebar, released under the MIT license.



How does Pi coding agent control Qwen's thinking verbosity? (Qwen 35B A3B, llama-server)

Reddit r/LocalLLaMA ↗ · 2026-05-17

A discussion of how the Pi coding agent controls the thinking verbosity of the Qwen 35B A3B model on llama-server when other clients fail to do so.



Llama-Studio, WebUI for llama-server Management

Reddit r/LocalLLaMA ↗ · 2026-05-14

Llama-Studio is a WebUI for managing llama-server sessions, allowing configuration, monitoring, and control of multiple instances for local development and experimentation.



@ggerganov: llama-server -hf ggml-org/Qwen3.6-27B-GGUF --spec-default

X AI KOLs Following ↗ · 2026-04-22

Georgi Gerganov shared a one-liner to launch the quantized 27B Qwen3.6 model with llama-server using default speculative-decoding settings.



Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into

Reddit r/LocalLLaMA ↗ · 2026-04-21

The author shares a working llama-server config for running the Qwen3.6 35B MoE model on an 8GB RTX 4060 and highlights a max_tokens trap caused by unconstrained internal reasoning, along with the fix: setting thinking_budget_tokens per request.

