@no_stp_on_snek: vllm-swift 0.6.3 + longctx 0.3.2 are out


Summary

vllm-swift 0.6.3 and longctx 0.3.2 releases bring triattentionv3 with 256k context on Apple Silicon, Gemma 4 MTP drafter support, Hermes tool calling with auto-recovery, and a longctx-svc daemon for scaling to 12M-token corpora.

vllm-swift 0.6.3 + longctx 0.3.2 are out. highlights:

- triattentionv3 + longctx rescue path hits 256k niah on apple silicon (yes triattention now mildly viable)
- gemma 4 mtp drafter, swift-native, 1.5x at 4-bit k=2
- hermes tool calling + auto-recovery of leaked json tool_calls
- enginecore pgroup kill (the "memory leak" on shutdown, fixed)
- longctx-svc daemon: pulls relevant code per turn, testing showed scaling up to 12m-token corpora, exposes tools over mcp

longctx is still experimental but i've been using it in agentic workflows. And so much more!

release notes:
http://github.com/TheTom/vllm-swift/blob/main/CHANGELOG.md…
http://github.com/TheTom/longctx/blob/main/CHANGELOG.md…

Repos
https://github.com/TheTom/vllm-swift…
https://github.com/TheTom/longctx


TheTom/vllm-swift

Source: https://github.com/TheTom/vllm-swift

vllm-swift

A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.

Run vLLM workloads on Apple Silicon with a native Swift/Metal hot path.
OpenAI-compatible API. Up to 2.6× faster short-context decode.

Quick Start

1. Install

Homebrew (recommended for Mac power users):

brew tap TheTom/tap && brew install vllm-swift

pip (everyone else, including dev containers and non-brew Macs):

pip install vllm-swift

The pip wheel bundles the prebuilt Swift bridge dylib + Metal kernel library, so no compile or brew step is required. Apple Silicon, Python 3.10+, macOS 11+.

From source:

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh       # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)

2. Run

vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960

Homebrew users don’t need activate.sh; vllm-swift serve handles everything.

Server running at http://localhost:8000 (OpenAI-compatible API).

Drop-in replacement for vLLM on Apple Silicon. All vllm serve flags work unchanged.
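
Because the API is OpenAI-compatible, any OpenAI client can talk to it. A minimal sketch with the official openai Python package (the model name assumes you served with --served-model-name qwen3-4b; the API key is a placeholder, since the local server does not check one unless you configure it):

# Minimal chat completion against a local vllm-swift server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="local",  # placeholder; not checked by default
)

# Non-streaming request
resp = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Summarize what vllm-swift does in one sentence."}],
    temperature=0.0,
)
print(resp.choices[0].message.content)

# Streaming (SSE) request
stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()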

Performance (M5 Max 128GB)

Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM’s offline LLM API.

Qwen3-0.6B

| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |

Qwen3-4B

| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |

Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
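
For a quick local sanity check (as opposed to the offline methodology above), you can measure aggregate decode throughput over the HTTP API; note this includes HTTP and scheduling overhead, so it will read somewhat lower than the offline numbers. A minimal sketch, assuming a server on localhost:8000 with --served-model-name qwen3-4b:

# Rough concurrent decode throughput over the HTTP API (includes HTTP overhead).
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
CONCURRENCY = 8
GEN_TOKENS = 50

def one_request(_):
    resp = client.chat.completions.create(
        model="qwen3-4b",
        messages=[{"role": "user", "content": "Write a short poem about the sea."}],
        max_tokens=GEN_TOKENS,
        temperature=0.0,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    generated = sum(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.0f} tok/s aggregate")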

TurboQuant+ KV Cache Compression

TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.

Qwen3.5 2B (4-bit weights)

| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|---|---|---|---|---|---|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |

Architecture

The entire forward pass runs in Swift/Metal. Python is used only for orchestration.

Python (vLLM API, tokenization, scheduling)  ← github.com/vllm-project/vllm
  ↓ ctypes FFI
C bridge (bridge.h)
  ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
  ↓
Metal GPU
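
For illustration only, the Python side of that ctypes hop looks roughly like the sketch below. The symbol name and signature here are hypothetical placeholders (the real exports are declared in bridge.h); the point is just to show how orchestration code calls into the Swift dylib without Python in the forward pass:

# Hypothetical sketch of the Python -> Swift ctypes hop.
# Symbol names and signatures below are placeholders, not the real bridge API.
import ctypes

lib = ctypes.CDLL("libVLLMBridge.dylib")  # located via DYLD_LIBRARY_PATH

# e.g. int decode_step(const int32_t* token_ids, int32_t n_tokens, float* logits_out)
lib.decode_step.argtypes = [
    ctypes.POINTER(ctypes.c_int32),
    ctypes.c_int32,
    ctypes.POINTER(ctypes.c_float),
]
lib.decode_step.restype = ctypes.c_int32

def decode_step(token_ids, vocab_size):
    ids = (ctypes.c_int32 * len(token_ids))(*token_ids)
    logits = (ctypes.c_float * vocab_size)()
    status = lib.decode_step(ids, len(token_ids), logits)
    if status != 0:
        raise RuntimeError(f"bridge returned status {status}")
    return list(logits)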

Features

  • OpenAI-compatible API (/v1/completions, /v1/chat/completions)
  • Streaming (SSE) responses
  • Chat templates (applied by vLLM, model-specific)
  • Batched concurrent decode with BatchedKVCache (fully batched projections + attention)
  • Per-request temperature sampling in batched path
  • Auto model download from HuggingFace Hub
  • TurboQuant+ KV cache compression (turbo3, turbo4v2) via mlx-swift-lm
  • longctx code-aware retrieval companion (--enable-longctx, experimental)
  • TriAttention V3 query-aware KV-cache eviction (env-gated, experimental — pair with longctx, see Effectively-unbounded context)
  • Decode and prompt logprobs (see the sketch after this list)
  • Greedy and temperature sampling
  • EOS / stop token detection (vLLM scheduler)
  • VLM (vision-language model) support (experimental)
  • Works with Hermes, OpenCode, and any OpenAI-compatible client
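
For the logprobs feature above, the server follows the standard OpenAI completions logprobs shape; a minimal sketch with requests (model name assumes --served-model-name qwen3-4b):

# Request per-token logprobs via the OpenAI-compatible completions endpoint.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen3-4b",
        "prompt": "The capital of France is",
        "max_tokens": 5,
        "temperature": 0,
        "logprobs": 3,  # top-3 alternatives per generated token
    },
).json()

choice = resp["choices"][0]
print(choice["text"])
for token, top in zip(choice["logprobs"]["tokens"], choice["logprobs"]["top_logprobs"]):
    print(token, top)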

Use with AI tools

# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes

Then point your tool at it:

# Hermes — set in ~/.hermes/config.yaml:
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'

Configuration

vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:

Basic serving

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960

Agent / tool calling (Hermes, OpenCode, etc.)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes

Chain-of-thought models (strip <think> tags)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1

Long context with TurboQuant+

Compress KV cache 3-5× to fit longer context with modest throughput cost:

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

| Scheme | Compression | Best for |
|---|---|---|
| turbo4v2 | ~3× | Recommended — best quality/compression balance |
| turbo3 | ~4.6× | Maximum compression, higher PPL trade-off |

Long context with rescue — TriAttention V3 + longctx (experimental)

Both pieces are off by default. The combined wiring has been exercised on Qwen3.5-2B-4bit (M5 Max and M2 Mac mini) up to 256K context. Other model families load fine but are less tested. APIs and defaults may change.

Two independent features that compose:

  • longctx — retrieval companion that adds a ## Retrieved code context block to chat completions, sourced from the user’s repo. See longctx.
  • TriAttention V3 — query-aware KV-cache eviction policy. Independent Swift port and hybrid extension of Mao et al., “TriAttention: Efficient Long Reasoning with Trigonometric KV Compression” (arXiv:2604.04921). Drops low-salience cells once the cache passes a budget. Design doc.

When to enable what

| Workload | Use this |
|---|---|
| Coding assistant on a local repo | longctx alone |
| Long single-shot prompt that fits the model’s context window | TurboQuant+ KV codec (turbo4v2) |
| Long multi-turn chat that would otherwise outgrow GPU memory | V3 + longctx |
| Retrieval-heavy workloads (NIAH-style) at 32K+ | V3 + longctx |

V3 alone is not recommended. Eviction is one-way; without a recovery layer, evicted facts are gone. On Qwen3.5-2B-4bit at 32K → 256K, V3-only misses recall at every rung; V3+longctx passes at every rung (table at the bottom of this section).

Installing longctx-svc

--enable-longctx and the V3 rescue path both need the longctx-svc Python service. Install once into the bundled vllm-swift venv:

~/.vllm-swift/venv/bin/pip install longctx-svc

Or install globally if you’d rather:

pip install longctx-svc

longctx alone (code-aware retrieval)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --enable-longctx

The sidecar boots automatically when --enable-longctx is set. Each chat completion’s prompt is scanned for absolute file paths; the first path’s project root is detected (.git, package.json, etc.), the repo is indexed, and top-K relevant chunks are spliced in as a system message. Works alongside --enable-auto-tool-choice. See longctx for scope tuning, watch-mode, and tester notes.
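
In practice the only client-side requirement is that a turn mentions an absolute path inside the repo you want indexed; everything else happens server-side. A minimal sketch (the path and repo are made-up examples):

# Example request that triggers longctx retrieval: the prompt contains an
# absolute file path, so the sidecar detects the project root, indexes it,
# and splices retrieved chunks in as a system message before generation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{
        "role": "user",
        "content": "In /Users/me/projects/myapp/Sources/App/Router.swift, "
                   "where is the /v1/chat/completions route registered?",
    }],
)
print(resp.choices[0].message.content)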

V3 + longctx (long multi-turn chat)

# Start longctx-svc separately (the auto-spawn from --enable-longctx
# is enough for the code-aware path; the V3 rescue path benefits
# from a long-running shared instance)
~/.vllm-swift/venv/bin/longctx-svc serve --host 127.0.0.1 --port 5054 &

# Serve with V3 + longctx wired together
VLLM_TRIATT_ENABLED=1 \
VLLM_TRIATT_BUDGET=230400 \
VLLM_TRIATT_WINDOW=128 \
VLLM_TRIATT_PREFIX=32 \
VLLM_TRIATT_WARMUP=256 \
VLLM_TRIATT_HYBRID=2 \
LONGCTX_ENDPOINT=http://127.0.0.1:5054 \
vllm-swift serve ~/models/Qwen3.5-2B-4bit \
  --served-model-name qwen35-2b \
  --enable-longctx \
  --max-model-len 262144

Set VLLM_TRIATT_BUDGET to ~90% of --max-model-len for a 10% eviction headroom. The auto Tier-3 rehydrate hook fires before each turn’s prefill, queries longctx with the user’s question, and prepends recovered chunks as a system message. The model sees a normal multi-turn chat with the recovered context up top.
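
Since the rehydrate hook fires only on the multi-turn chat path (see Limitations below), keep sending the running message history rather than single-shot prompts. A minimal sketch of a two-turn exchange where the second question may depend on facts V3 has already evicted:

# Multi-turn chat so the Tier-3 rehydrate hook can fire before each prefill.
# Model name matches --served-model-name qwen35-2b from the command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
history = [{"role": "user", "content": "Here is a long design document: ..."}]

first = client.chat.completions.create(model="qwen35-2b", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content or ""})

# Second turn: the question may refer to facts evicted from the KV cache;
# the rescue path queries longctx and prepends recovered chunks up top.
history.append({"role": "user", "content": "Which section defined the retry budget?"})
second = client.chat.completions.create(model="qwen35-2b", messages=history)
print(second.choices[0].message.content)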

V3 environment variables

| Var | Default | Purpose |
|---|---|---|
| VLLM_TRIATT_ENABLED | unset (off) | master switch |
| VLLM_TRIATT_BUDGET | required | KV cells to keep (set to ~90% of --max-model-len for 10% eviction headroom) |
| VLLM_TRIATT_WINDOW | 128 | always-keep recent window |
| VLLM_TRIATT_PREFIX | 32 | always-keep prompt prefix |
| VLLM_TRIATT_WARMUP | 256 | tokens before first eviction round |
| VLLM_TRIATT_HYBRID | 2 | eviction policy mode |
| LONGCTX_ENDPOINT | unset | URL of longctx-svc — required for the rescue path |

Limitations / current state

  • V3 cache (TriAttentionKVCache) is FP16 only. Stacking V3 with TurboQuant codecs (turbo4v2, turbo8v4) is not yet supported. Track progress at mlx-swift-lm task #187.
  • V3 hooks are wired on Qwen3 / Qwen3.5 / Qwen3-MoE / Llama / Mistral3 / Phi / Phi3 / Gemma3 / GLM4. Other model families fall back to non-V3 caches.
  • Tier-3 rehydrate auto-fires only through the chat-completions multi-turn path (ChatSession in mlx-swift-lm). Single-shot completions skip it; document or NIAH-style workloads need to be structured as a 2+ turn chat.
  • longctx-svc is an alpha companion service with its own caveats; see its README.

Receipts

12-cell ramp on Qwen3.5-2B-4bit (M5 Max), 32K → 256K planted-fact NIAH:

| ctx | baseline turbo8v4 | V3 only | V3 + longctx |
|---|---|---|---|
| 32K | ✓ HIT | ✗ miss | ✓ HIT |
| 64K | ✓ HIT | ✗ miss | ✓ HIT |
| 128K | ✓ HIT | ✗ miss | ✓ HIT |
| 256K | ✓ HIT | ✗ miss | ✓ HIT |

Source paper: longctx and TriAttention.

Full setup (agent + reasoning + TurboQuant+)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

All flags

vllm-swift serve <model> [options]

  --served-model-name NAME   Clean model name for API clients (recommended)
  --max-model-len N          Max sequence length (default: model config)
  --port PORT                API server port (default: 8000)
  --gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
  --dtype float16            Model dtype (default: float16)
  --enable-auto-tool-choice  Enable tool/function calling
  --tool-call-parser NAME    Tool call format (hermes, llama3, mistral, etc.)
  --enable-reasoning         Enable chain-of-thought parsing
  --reasoning-parser NAME    Reasoning format (deepseek_r1, etc.)
  --additional-config JSON   Extra config (kv_scheme, kv_bits)

All standard vLLM flags work — these are just the most common ones.
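
To confirm which model name clients should send (i.e., that --served-model-name took effect), the OpenAI-compatible /v1/models endpoint lists it; a quick sketch with requests:

# List served model IDs to confirm --served-model-name took effect.
import requests

for m in requests.get("http://localhost:8000/v1/models").json()["data"]:
    print(m["id"])  # e.g. qwen3-4b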

Documentation

| Doc | What’s in it |
|---|---|
| docs/PERFORMANCE.md | Full perf matrix vs vllm-metal, methodology, long-context cells |
| docs/MODEL_COMPATIBILITY.md | Empirical pass / soft-fail / hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing) |
| docs/TROUBLESHOOTING.md | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |
| CHANGELOG.md | Release history |

Changelog

See CHANGELOG.md for release history.

Known Limitations (early development)

  • LoRA not supported (Swift engine limitation)
  • Chunked prefill disabled (Swift engine handles full sequences)
  • top_p sampling not supported in batched decode path (temperature works)
  • Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
  • Requires macOS on Apple Silicon (no Linux/CUDA)

Install

Homebrew

brew tap TheTom/tap && brew install vllm-swift

Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.

To update to the latest version:

vllm-swift update

# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift

From source

git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh       # builds Swift, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Manual (full control)

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
  vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Troubleshooting

Homebrew checksum error on reinstall:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift

“No module named vllm” or plugin not loading after brew install:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup

vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you’re on an older bottle or installing vLLM manually:

# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup

# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm

activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.

Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:

cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
   $(dirname $(echo $DYLD_LIBRARY_PATH | cut -d: -f1))/

Download a model

vllm-swift download mlx-community/Qwen3-4B-4bit

# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit

# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest

Project Structure

vllm_swift/           Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/       C bridge (@_cdecl exports)
  bridge.h                  C API (prefill, decode, batched decode)
scripts/
  install.sh                One-step build + install
  build_bottle.sh           Build + upload Homebrew bottle
  integration_test.sh       End-to-end smoke test
homebrew/
  vllm-swift.rb             Homebrew formula
tests/                      84 tests, 97% coverage

Requirements

  • macOS 14+ on Apple Silicon
  • Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
  • Python 3.10+
  • vLLM 0.19+
  • mlx-swift-lm (pulled automatically by Swift Package Manager)

License

Apache-2.0
