@no_stp_on_snek: vllm-swift 0.6.3 + longctx 0.3.2 are out. highlights: - triattentionv3 + longctx rescue path hits 256k niah on apple si…
Summary
vllm-swift 0.6.3 and longctx 0.3.2 releases bring triattentionv3 with 256k context on Apple Silicon, Gemma 4 MTP drafter support, Hermes tool calling with auto-recovery, and a longctx-svc daemon for scaling to 12M-token corpora.
Cached at: 05/14/26, 04:29 AM
vllm-swift 0.6.3 + longctx 0.3.2 are out. highlights:
- triattentionv3 + longctx rescue path hits 256k niah on apple silicon (yes triattention now mildly viable)
- gemma 4 mtp drafter, swift-native, 1.5x at 4-bit k=2
- hermes tool calling + auto-recovery of leaked json tool_calls
- enginecore pgroup kill (the “memory leak” on shutdown, fixed)
- longctx-svc daemon: pulls relevant code per turn, testing showed scaling up to 12m-token corpora, exposes tools over mcp

longctx is still experimental but i’ve been using it in agentic workflows. And so much more!

release notes:
http://github.com/TheTom/vllm-swift/blob/main/CHANGELOG.md…
http://github.com/TheTom/longctx/blob/main/CHANGELOG.md…

Repos:
https://github.com/TheTom/vllm-swift…
https://github.com/TheTom/longctx
TheTom/vllm-swift
Source: https://github.com/TheTom/vllm-swift
A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.
OpenAI-compatible API. Up to 2.6× faster short-context decode.
Quick Start
1. Install
Homebrew (recommended for Mac power users):
brew tap TheTom/tap && brew install vllm-swift
pip (everyone else, including dev containers and non-brew Macs):
pip install vllm-swift
The pip wheel bundles the prebuilt Swift bridge dylib + Metal kernel library, so no compile or brew step is required. Apple Silicon, Python 3.10+, macOS 11+.
From source:
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH (generated by install.sh)
2. Run
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096 # increase as needed, max 40960
Homebrew users don’t need `activate.sh`; `vllm-swift serve` handles everything.
Server running at http://localhost:8000 (OpenAI-compatible API).
Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
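Since the server speaks the standard OpenAI chat-completions protocol, any client library works against it. A minimal sketch using only the Python standard library; the model name `qwen3-4b` matches the `--served-model-name` used in the examples below, and the request/response shape is the standard `/v1/chat/completions` schema.

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build a standard /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://localhost:8000", "qwen3-4b", "Hello"))
```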
Performance (M5 Max 128GB)
Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM’s offline LLM API.
Qwen3-0.6B
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |
Qwen3-4B
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |
Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
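The per-model speedups implied by the single-request column of the tables above can be checked with quick arithmetic (these are just ratios of the published tok/s numbers):

```python
# Single-request decode throughput from the tables above (tok/s).
qwen3_0_6b = (364, 111)   # (vllm-swift, vllm-metal)
qwen3_4b = (147, 104)

for name, (swift, metal) in [("Qwen3-0.6B", qwen3_0_6b), ("Qwen3-4B", qwen3_4b)]:
    # Ratio of Swift/Metal engine throughput to the Python/MLX engine.
    print(f"{name}: {swift / metal:.2f}x")
```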
TurboQuant+ KV Cache Compression
TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.
Qwen3.5 2B (4-bit weights)
| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|---|---|---|---|---|---|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
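To see what these compression ratios buy in memory terms, here is a back-of-the-envelope KV-cache size calculation. The layer/head dimensions below are illustrative placeholders, not Qwen3.5 2B's actual config; substitute your model's values.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: float = 2.0, compression: float = 1.0) -> float:
    """Approximate size of the K and V caches for one sequence.

    The leading 2 accounts for K and V; bytes_per_elem=2.0 is FP16.
    """
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / compression

# Illustrative config (NOT the real Qwen3.5 2B numbers).
fp16 = kv_cache_bytes(4096, layers=28, kv_heads=4, head_dim=128)
turbo4v2 = kv_cache_bytes(4096, layers=28, kv_heads=4, head_dim=128, compression=3.0)
print(f"FP16:     {fp16 / 1e6:.0f} MB")
print(f"turbo4v2: {turbo4v2 / 1e6:.0f} MB")
```

At a fixed memory budget, a 3× codec therefore fits roughly 3× the context length.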
Architecture
The entire forward pass runs in Swift/Metal. Python is used only for orchestration.
Python (vLLM API, tokenization, scheduling) ← github.com/vllm-project/vllm
↓ ctypes FFI
C bridge (bridge.h)
↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
↓
Metal GPU
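The ctypes FFI hop in the diagram can be pictured as below. This is a hypothetical sketch: the dylib name matches the repo, but the symbol name and signature (`vllm_bridge_decode`) are invented for illustration; the real exports are declared in `bridge.h`.

```python
import ctypes

def load_bridge(dylib_path: str) -> ctypes.CDLL:
    """Load the Swift bridge dylib and declare a (hypothetical) C signature.

    Raises OSError if the dylib cannot be found -- which is why
    activate.sh sets DYLD_LIBRARY_PATH for source builds.
    """
    lib = ctypes.CDLL(dylib_path)
    # Hypothetical export for illustration; see bridge.h for the real API.
    lib.vllm_bridge_decode.argtypes = [
        ctypes.c_void_p,                  # engine handle
        ctypes.POINTER(ctypes.c_int32),   # input token ids
        ctypes.c_size_t,                  # token count
    ]
    lib.vllm_bridge_decode.restype = ctypes.c_int32  # next token id
    return lib
```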
Features
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in batched path
- Auto model download from HuggingFace Hub
- TurboQuant+ KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm
- longctx code-aware retrieval companion (`--enable-longctx`, experimental)
- TriAttention V3 query-aware KV-cache eviction (env-gated, experimental; pair with longctx, see Effectively-unbounded context)
- Decode and prompt logprobs
- Greedy and temperature sampling
- EOS / stop token detection (vLLM scheduler)
- VLM (vision-language model) support (experimental)
- Works with Hermes, OpenCode, and any OpenAI-compatible client
Use with AI tools
# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
--served-model-name qwen3-4b \
--enable-auto-tool-choice --tool-call-parser hermes
Then point your tool at it:
# Hermes — set in ~/.hermes/config.yaml:
# base_url: http://localhost:8000/v1
# model: qwen3-4b
# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode
# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'
Configuration
vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:
Basic serving
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960
Agent / tool calling (Hermes, OpenCode, etc.)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-auto-tool-choice --tool-call-parser hermes
Chain-of-thought models (strip <think> tags)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-reasoning --reasoning-parser deepseek_r1
Long context with TurboQuant+
Compress KV cache 3-5× to fit longer context with modest throughput cost:
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
| Scheme | Compression | Best for |
|---|---|---|
| `turbo4v2` | ~3× | Recommended: best quality/compression balance |
| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |
Long context with rescue — TriAttention V3 + longctx (experimental)
Both pieces are off by default. Wiring works on Qwen3.5-2B-4bit (M5 Max + M2 Mac mini) up to 256K. Other model families load fine but are less tested. APIs and defaults may change.
Two independent features that compose:
- longctx: retrieval companion that adds a `## Retrieved code context` block to chat completions, sourced from the user’s repo. See longctx.
- TriAttention V3: query-aware KV-cache eviction policy. Independent Swift port and hybrid extension of Mao et al., “TriAttention: Efficient Long Reasoning with Trigonometric KV Compression” (arXiv:2604.04921). Drops low-salience cells once the cache passes a budget. Design doc.
When to enable what
| Workload | Use this |
|---|---|
| Coding assistant on a local repo | longctx alone |
| Long single-shot prompt that fits the model’s context window | TurboQuant+ KV codec (turbo4v2) |
| Long multi-turn chat that would otherwise outgrow GPU memory | V3 + longctx |
| Retrieval-heavy workloads (NIAH-style) at 32K+ | V3 + longctx |
V3 alone is not recommended. Eviction is one-way; without a recovery layer, evicted facts are gone. On Qwen3.5-2B-4bit at 32K → 256K, V3-only misses recall at every rung; V3+longctx passes at every rung (table at the bottom of this section).
Installing longctx-svc
--enable-longctx and the V3 rescue path both need the longctx-svc Python service. Install once into the bundled vllm-swift venv:
~/.vllm-swift/venv/bin/pip install longctx-svc
Or install globally if you’d rather:
pip install longctx-svc
longctx alone (code-aware retrieval)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--enable-longctx
The sidecar boots automatically when --enable-longctx is set. Each chat completion’s prompt is scanned for absolute file paths; the first path’s project root is detected (.git, package.json, etc.), the repo is indexed, and top-K relevant chunks are spliced in as a system message. Works alongside --enable-auto-tool-choice. See longctx for scope tuning, watch-mode, and tester notes.
V3 + longctx (long multi-turn chat)
# Start longctx-svc separately (the auto-spawn from --enable-longctx
# is enough for the code-aware path; the V3 rescue path benefits
# from a long-running shared instance)
~/.vllm-swift/venv/bin/longctx-svc serve --host 127.0.0.1 --port 5054 &
# Serve with V3 + longctx wired together
VLLM_TRIATT_ENABLED=1 \
VLLM_TRIATT_BUDGET=230400 \
VLLM_TRIATT_WINDOW=128 \
VLLM_TRIATT_PREFIX=32 \
VLLM_TRIATT_WARMUP=256 \
VLLM_TRIATT_HYBRID=2 \
LONGCTX_ENDPOINT=http://127.0.0.1:5054 \
vllm-swift serve ~/models/Qwen3.5-2B-4bit \
--served-model-name qwen35-2b \
--enable-longctx \
--max-model-len 262144
Set VLLM_TRIATT_BUDGET to ~90% of --max-model-len for a 10% eviction headroom. The auto Tier-3 rehydrate hook fires before each turn’s prefill, queries longctx with the user’s question, and prepends recovered chunks as a system message. The model sees a normal multi-turn chat with the recovered context up top.
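The 90% rule of thumb is simple arithmetic; a tiny helper (illustrative, not part of the CLI — note the example above rounds down further to 230400, a more conservative round number):

```python
def triatt_budget(max_model_len: int, headroom: float = 0.10) -> int:
    """KV-cell budget leaving `headroom` fraction free for eviction slack."""
    return int(max_model_len * (1.0 - headroom))

# 90% of the 262144 --max-model-len used in the example above.
print(triatt_budget(262144))
```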
V3 environment variables
| Var | Default | Purpose |
|---|---|---|
| `VLLM_TRIATT_ENABLED` | unset (off) | master switch |
| `VLLM_TRIATT_BUDGET` | required | KV cells to keep (set to ~90% of `--max-model-len` for 10% eviction headroom) |
| `VLLM_TRIATT_WINDOW` | 128 | always-keep recent window |
| `VLLM_TRIATT_PREFIX` | 32 | always-keep prompt prefix |
| `VLLM_TRIATT_WARMUP` | 256 | tokens before first eviction round |
| `VLLM_TRIATT_HYBRID` | 2 | eviction policy mode |
| `LONGCTX_ENDPOINT` | unset | URL of longctx-svc; required for the rescue path |
Limitations / current state
- V3 cache (`TriAttentionKVCache`) is FP16 only. Stacking V3 with TurboQuant codecs (`turbo4v2`, `turbo8v4`) is not yet supported. Track progress at mlx-swift-lm task #187.
- V3 hooks are wired on Qwen3 / Qwen3.5 / Qwen3-MoE / Llama / Mistral3 / Phi / Phi3 / Gemma3 / GLM4. Other model families fall back to non-V3 caches.
- Tier-3 rehydrate auto-fires only through the chat-completions multi-turn path (`ChatSession` in mlx-swift-lm). Single-shot completions skip it; document or NIAH-style workloads need to be structured as a 2+ turn chat.
- longctx-svc is an alpha companion service with its own caveats; see its README.
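Because Tier-3 rehydrate only fires on the multi-turn chat path, a single-document NIAH-style job can be wrapped as a two-turn conversation. A sketch of one possible message structure (the splitting scheme here is an assumption, not a documented recipe):

```python
def as_two_turn_chat(document: str, question: str) -> list[dict]:
    """Wrap a long document + question as a 2-turn chat so the Tier-3
    rehydrate hook can fire before the second turn's prefill."""
    return [
        {"role": "user", "content": f"Here is the document:\n\n{document}"},
        {"role": "assistant", "content": "Understood. Ready for your question."},
        {"role": "user", "content": question},
    ]
```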
Receipts
12-cell ramp on Qwen3.5-2B-4bit (M5 Max), 32K → 256K planted-fact NIAH:
| ctx | baseline turbo8v4 | V3 only | V3 + longctx |
|---|---|---|---|
| 32K | ✓HIT | ✗miss | ✓HIT |
| 64K | ✓HIT | ✗miss | ✓HIT |
| 128K | ✓HIT | ✗miss | ✓HIT |
| 256K | ✓HIT | ✗miss | ✓HIT |
Sources: longctx and the TriAttention paper (arXiv:2604.04921).
Full setup (agent + reasoning + TurboQuant+)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-auto-tool-choice --tool-call-parser hermes \
--enable-reasoning --reasoning-parser deepseek_r1 \
--additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
All flags
vllm-swift serve <model> [options]
--served-model-name NAME Clean model name for API clients (recommended)
--max-model-len N Max sequence length (default: model config)
--port PORT API server port (default: 8000)
--gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
--dtype float16 Model dtype (default: float16)
--enable-auto-tool-choice Enable tool/function calling
--tool-call-parser NAME Tool call format (hermes, llama3, mistral, etc.)
--enable-reasoning Enable chain-of-thought parsing
--reasoning-parser NAME Reasoning format (deepseek_r1, etc.)
--additional-config JSON Extra config (kv_scheme, kv_bits)
All standard vLLM flags work — these are just the most common ones.
Documentation
| Doc | What’s in it |
|---|---|
| docs/PERFORMANCE.md | Full perf matrix vs vllm-metal, methodology, long-context cells |
| docs/MODEL_COMPATIBILITY.md | Empirical pass / soft-fail / hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing) |
| docs/TROUBLESHOOTING.md | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |
| CHANGELOG.md | Release history |
Changelog
See CHANGELOG.md for release history.
Known Limitations (early development)
- LoRA not supported (Swift engine limitation)
- Chunked prefill disabled (Swift engine handles full sequences)
- top_p sampling not supported in batched decode path (temperature works)
- Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
- Requires macOS on Apple Silicon (no Linux/CUDA)
Install
Homebrew
brew tap TheTom/tap && brew install vllm-swift
Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.
To update to the latest version:
vllm-swift update
# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift
From source
git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh # builds Swift, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
Manual (full control)
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
Troubleshooting
Homebrew checksum error on reinstall:
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift
“No module named vllm” or plugin not loading after brew install:
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup
vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you’re on an older bottle or installing vLLM manually:
# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup
# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm
activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.
Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:
cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
$(dirname $(echo $DYLD_LIBRARY_PATH | cut -d: -f1))/
Download a model
vllm-swift download mlx-community/Qwen3-4B-4bit
# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit
# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest
Project Structure
vllm_swift/ Python plugin (vLLM WorkerBase)
swift/
Sources/VLLMBridge/ C bridge (@_cdecl exports)
bridge.h C API (prefill, decode, batched decode)
scripts/
install.sh One-step build + install
build_bottle.sh Build + upload Homebrew bottle
integration_test.sh End-to-end smoke test
homebrew/
vllm-swift.rb Homebrew formula
tests/ 84 tests, 97% coverage
Requirements
- macOS 14+ on Apple Silicon
- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
- Python 3.10+
- vLLM 0.19+
- mlx-swift-lm (pulled automatically by Swift Package Manager)
License
Apache-2.0