@no_stp_on_snek: vllm-swift 0.6.3 + longctx 0.3.2 are out. highlights: - triattentionv3 + longctx rescue path hits 256k niah on apple si…
Summary
vllm-swift 0.6.3 and longctx 0.3.2 releases bring triattentionv3 with 256k context on Apple Silicon, Gemma 4 MTP drafter support, Hermes tool calling with auto-recovery, and a longctx-svc daemon for scaling to 12M-token corpora.
Cached at: 05/14/26, 04:29 AM
vllm-swift 0.6.3 + longctx 0.3.2 are out. highlights:
- triattentionv3 + longctx rescue path hits 256k niah on apple silicon (yes triattention now mildly viable)
- gemma 4 mtp drafter, swift-native, 1.5x at 4-bit k=2
- hermes tool calling + auto-recovery of leaked json tool_calls
- enginecore pgroup kill (the “memory leak” on shutdown, fixed)
- longctx-svc daemon: pulls relevant code per turn, testing showed scaling up to 12m-token corpora, exposes tools over mcp

longctx is still experimental but i’ve been using it in agentic workflows. And so much more!

release notes:
http://github.com/TheTom/vllm-swift/blob/main/CHANGELOG.md…
http://github.com/TheTom/longctx/blob/main/CHANGELOG.md…

Repos:
https://github.com/TheTom/vllm-swift…
https://github.com/TheTom/longctx
TheTom/vllm-swift
Source: https://github.com/TheTom/vllm-swift
A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.
OpenAI-compatible API. Up to 2.6× faster short-context decode.
Quick Start
1. Install
Homebrew (recommended for Mac power users):
brew tap TheTom/tap && brew install vllm-swift
pip (everyone else, including dev containers and non-brew Macs):
pip install vllm-swift
The pip wheel bundles the prebuilt Swift bridge dylib + Metal kernel library, so no compile or brew step is required. Apple Silicon, Python 3.10+, macOS 11+.
From source:
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH (generated by install.sh)
2. Run
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096 # increase as needed, max 40960
Homebrew users don’t need `activate.sh`; `vllm-swift serve` handles everything.
Server running at http://localhost:8000 (OpenAI-compatible API).
Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
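Since the server speaks the standard OpenAI chat-completions protocol, any client library works against it. A minimal sketch using only the Python standard library; the model name `qwen3-4b` matches the `--served-model-name` used in the examples below, and the request/response shape is the standard `/v1/chat/completions` schema.

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build a standard /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://localhost:8000", "qwen3-4b", "Hello"))
```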
Performance (M5 Max 128GB)
Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM’s offline LLM API.
Qwen3-0.6B
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |
Qwen3-4B
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |
Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
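The per-model speedups implied by the single-request column of the tables above can be checked with quick arithmetic (these are just ratios of the published tok/s numbers):

```python
# Single-request decode throughput from the tables above (tok/s).
qwen3_0_6b = (364, 111)   # (vllm-swift, vllm-metal)
qwen3_4b = (147, 104)

for name, (swift, metal) in [("Qwen3-0.6B", qwen3_0_6b), ("Qwen3-4B", qwen3_4b)]:
    # Ratio of Swift/Metal engine throughput to the Python/MLX engine.
    print(f"{name}: {swift / metal:.2f}x")
```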
TurboQuant+ KV Cache Compression
TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.
Qwen3.5 2B (4-bit weights)
| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|---|---|---|---|---|---|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
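To see what these compression ratios buy in memory terms, here is a back-of-the-envelope KV-cache size calculation. The layer/head dimensions below are illustrative placeholders, not Qwen3.5 2B's actual config; substitute your model's values.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: float = 2.0, compression: float = 1.0) -> float:
    """Approximate size of the K and V caches for one sequence.

    The leading 2 accounts for K and V; bytes_per_elem=2.0 is FP16.
    """
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / compression

# Illustrative config (NOT the real Qwen3.5 2B numbers).
fp16 = kv_cache_bytes(4096, layers=28, kv_heads=4, head_dim=128)
turbo4v2 = kv_cache_bytes(4096, layers=28, kv_heads=4, head_dim=128, compression=3.0)
print(f"FP16:     {fp16 / 1e6:.0f} MB")
print(f"turbo4v2: {turbo4v2 / 1e6:.0f} MB")
```

At a fixed memory budget, a 3× codec therefore fits roughly 3× the context length.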
Architecture
The entire forward pass runs in Swift/Metal. Python is used only for orchestration.
Python (vLLM API, tokenization, scheduling) ← github.com/vllm-project/vllm
↓ ctypes FFI
C bridge (bridge.h)
↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
↓
Metal GPU
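The ctypes FFI hop in the diagram can be pictured as below. This is a hypothetical sketch: the dylib name matches the repo, but the symbol name and signature (`vllm_bridge_decode`) are invented for illustration; the real exports are declared in `bridge.h`.

```python
import ctypes

def load_bridge(dylib_path: str) -> ctypes.CDLL:
    """Load the Swift bridge dylib and declare a (hypothetical) C signature.

    Raises OSError if the dylib cannot be found -- which is why
    activate.sh sets DYLD_LIBRARY_PATH for source builds.
    """
    lib = ctypes.CDLL(dylib_path)
    # Hypothetical export for illustration; see bridge.h for the real API.
    lib.vllm_bridge_decode.argtypes = [
        ctypes.c_void_p,                  # engine handle
        ctypes.POINTER(ctypes.c_int32),   # input token ids
        ctypes.c_size_t,                  # token count
    ]
    lib.vllm_bridge_decode.restype = ctypes.c_int32  # next token id
    return lib
```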
Features
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in batched path
- Auto model download from HuggingFace Hub
- TurboQuant+ KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm
- longctx code-aware retrieval companion (`--enable-longctx`, experimental)
- TriAttention V3 query-aware KV-cache eviction (env-gated, experimental; pair with longctx, see Effectively-unbounded context)
- Decode and prompt logprobs
- Greedy and temperature sampling
- EOS / stop token detection (vLLM scheduler)
- VLM (vision-language model) support (experimental)
- Works with Hermes, OpenCode, and any OpenAI-compatible client
Use with AI tools
# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
--served-model-name qwen3-4b \
--enable-auto-tool-choice --tool-call-parser hermes
Then point your tool at it:
# Hermes — set in ~/.hermes/config.yaml:
# base_url: http://localhost:8000/v1
# model: qwen3-4b
# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode
# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'
Configuration
vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:
Basic serving
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960
Agent / tool calling (Hermes, OpenCode, etc.)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-auto-tool-choice --tool-call-parser hermes
Chain-of-thought models (strip <think> tags)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-reasoning --reasoning-parser deepseek_r1
Long context with TurboQuant+
Compress KV cache 3-5× to fit longer context with modest throughput cost:
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
| Scheme | Compression | Best for |
|---|---|---|
| `turbo4v2` | ~3× | Recommended: best quality/compression balance |
| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |
Long context with rescue — TriAttention V3 + longctx (experimental)
Both pieces are off by default. Wiring works on Qwen3.5-2B-4bit (M5 Max + M2 Mac mini) up to 256K. Other model families load fine but are less tested. APIs and defaults may change.
Two independent features that compose:
- longctx: retrieval companion that adds a `## Retrieved code context` block to chat completions, sourced from the user’s repo. See longctx.
- TriAttention V3: query-aware KV-cache eviction policy. Independent Swift port and hybrid extension of Mao et al., “TriAttention: Efficient Long Reasoning with Trigonometric KV Compression” (arXiv:2604.04921). Drops low-salience cells once the cache passes a budget. Design doc.
When to enable what
| Workload | Use this |
|---|---|
| Coding assistant on a local repo | longctx alone |
| Long single-shot prompt that fits the model’s context window | TurboQuant+ KV codec (turbo4v2) |
| Long multi-turn chat that would otherwise outgrow GPU memory | V3 + longctx |
| Retrieval-heavy workloads (NIAH-style) at 32K+ | V3 + longctx |
V3 alone is not recommended. Eviction is one-way; without a recovery layer, evicted facts are gone. On Qwen3.5-2B-4bit at 32K → 256K, V3-only misses recall at every rung; V3+longctx passes at every rung (table at the bottom of this section).
Installing longctx-svc
--enable-longctx and the V3 rescue path both need the longctx-svc Python service. Install once into the bundled vllm-swift venv:
~/.vllm-swift/venv/bin/pip install longctx-svc
Or install globally if you’d rather:
pip install longctx-svc
longctx alone (code-aware retrieval)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--enable-longctx
The sidecar boots automatically when --enable-longctx is set. Each chat completion’s prompt is scanned for absolute file paths; the first path’s project root is detected (.git, package.json, etc.), the repo is indexed, and top-K relevant chunks are spliced in as a system message. Works alongside --enable-auto-tool-choice. See longctx for scope tuning, watch-mode, and tester notes.
V3 + longctx (long multi-turn chat)
# Start longctx-svc separately (the auto-spawn from --enable-longctx
# is enough for the code-aware path; the V3 rescue path benefits
# from a long-running shared instance)
~/.vllm-swift/venv/bin/longctx-svc serve --host 127.0.0.1 --port 5054 &
# Serve with V3 + longctx wired together
VLLM_TRIATT_ENABLED=1 \
VLLM_TRIATT_BUDGET=230400 \
VLLM_TRIATT_WINDOW=128 \
VLLM_TRIATT_PREFIX=32 \
VLLM_TRIATT_WARMUP=256 \
VLLM_TRIATT_HYBRID=2 \
LONGCTX_ENDPOINT=http://127.0.0.1:5054 \
vllm-swift serve ~/models/Qwen3.5-2B-4bit \
--served-model-name qwen35-2b \
--enable-longctx \
--max-model-len 262144
Set VLLM_TRIATT_BUDGET to ~90% of --max-model-len for a 10% eviction headroom. The auto Tier-3 rehydrate hook fires before each turn’s prefill, queries longctx with the user’s question, and prepends recovered chunks as a system message. The model sees a normal multi-turn chat with the recovered context up top.
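The 90% rule of thumb is simple arithmetic; a tiny helper (illustrative, not part of the CLI — note the example above rounds down further to 230400, a more conservative round number):

```python
def triatt_budget(max_model_len: int, headroom: float = 0.10) -> int:
    """KV-cell budget leaving `headroom` fraction free for eviction slack."""
    return int(max_model_len * (1.0 - headroom))

# 90% of the 262144 --max-model-len used in the example above.
print(triatt_budget(262144))
```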
V3 environment variables
| Var | Default | Purpose |
|---|---|---|
| `VLLM_TRIATT_ENABLED` | unset (off) | master switch |
| `VLLM_TRIATT_BUDGET` | required | KV cells to keep (set to ~90% of `--max-model-len` for 10% eviction headroom) |
| `VLLM_TRIATT_WINDOW` | 128 | always-keep recent window |
| `VLLM_TRIATT_PREFIX` | 32 | always-keep prompt prefix |
| `VLLM_TRIATT_WARMUP` | 256 | tokens before first eviction round |
| `VLLM_TRIATT_HYBRID` | 2 | eviction policy mode |
| `LONGCTX_ENDPOINT` | unset | URL of longctx-svc; required for the rescue path |
Limitations / current state
- V3 cache (`TriAttentionKVCache`) is FP16 only. Stacking V3 with TurboQuant codecs (`turbo4v2`, `turbo8v4`) is not yet supported. Track progress at mlx-swift-lm task #187.
- V3 hooks are wired on Qwen3 / Qwen3.5 / Qwen3-MoE / Llama / Mistral3 / Phi / Phi3 / Gemma3 / GLM4. Other model families fall back to non-V3 caches.
- Tier-3 rehydrate auto-fires only through the chat-completions multi-turn path (`ChatSession` in mlx-swift-lm). Single-shot completions skip it; document or NIAH-style workloads need to be structured as a 2+ turn chat.
- longctx-svc is an alpha companion service with its own caveats; see its README.
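Because Tier-3 rehydrate only fires on the multi-turn chat path, a single-document NIAH-style job can be wrapped as a two-turn conversation. A sketch of one possible message structure (the splitting scheme here is an assumption, not a documented recipe):

```python
def as_two_turn_chat(document: str, question: str) -> list[dict]:
    """Wrap a long document + question as a 2-turn chat so the Tier-3
    rehydrate hook can fire before the second turn's prefill."""
    return [
        {"role": "user", "content": f"Here is the document:\n\n{document}"},
        {"role": "assistant", "content": "Understood. Ready for your question."},
        {"role": "user", "content": question},
    ]
```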
Receipts
12-cell ramp on Qwen3.5-2B-4bit (M5 Max), 32K → 256K planted-fact NIAH:
| ctx | baseline turbo8v4 | V3 only | V3 + longctx |
|---|---|---|---|
| 32K | ✓HIT | ✗miss | ✓HIT |
| 64K | ✓HIT | ✗miss | ✓HIT |
| 128K | ✓HIT | ✗miss | ✓HIT |
| 256K | ✓HIT | ✗miss | ✓HIT |
Sources: longctx and the TriAttention paper (arXiv:2604.04921).
Full setup (agent + reasoning + TurboQuant+)
vllm-swift serve ~/models/Qwen3-4B-4bit \
--served-model-name qwen3-4b \
--max-model-len 40960 \
--enable-auto-tool-choice --tool-call-parser hermes \
--enable-reasoning --reasoning-parser deepseek_r1 \
--additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
All flags
vllm-swift serve <model> [options]
--served-model-name NAME Clean model name for API clients (recommended)
--max-model-len N Max sequence length (default: model config)
--port PORT API server port (default: 8000)
--gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
--dtype float16 Model dtype (default: float16)
--enable-auto-tool-choice Enable tool/function calling
--tool-call-parser NAME Tool call format (hermes, llama3, mistral, etc.)
--enable-reasoning Enable chain-of-thought parsing
--reasoning-parser NAME Reasoning format (deepseek_r1, etc.)
--additional-config JSON Extra config (kv_scheme, kv_bits)
All standard vLLM flags work — these are just the most common ones.
Documentation
| Doc | What’s in it |
|---|---|
| docs/PERFORMANCE.md | Full perf matrix vs vllm-metal, methodology, long-context cells |
| docs/MODEL_COMPATIBILITY.md | Empirical pass / soft-fail / hard-fail across local MLX models with root-cause classification (model intrinsic, vLLM upstream, env-missing) |
| docs/TROUBLESHOOTING.md | Symptom → diagnostic → fix for known failure patterns (parser mismatch, reasoning consuming the turn, Gemma-4 boot failure, etc.) |
| CHANGELOG.md | Release history |
Changelog
See CHANGELOG.md for release history.
Known Limitations (early development)
- LoRA not supported (Swift engine limitation)
- Chunked prefill disabled (Swift engine handles full sequences)
- top_p sampling not supported in batched decode path (temperature works)
- Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
- Requires macOS on Apple Silicon (no Linux/CUDA)
Install
Homebrew
brew tap TheTom/tap && brew install vllm-swift
Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.
To update to the latest version:
vllm-swift update
# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift
From source
git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh # builds Swift, installs plugin, creates activate.sh
source activate.sh # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
Manual (full control)
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
Troubleshooting
Homebrew checksum error on reinstall:
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift
“No module named vllm” or plugin not loading after brew install:
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup
vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you’re on an older bottle or installing vLLM manually:
# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup
# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm
activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.
Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:
cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
$(dirname $(echo $DYLD_LIBRARY_PATH | cut -d: -f1))/
Download a model
vllm-swift download mlx-community/Qwen3-4B-4bit
# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit
# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest
Project Structure
vllm_swift/ Python plugin (vLLM WorkerBase)
swift/
Sources/VLLMBridge/ C bridge (@_cdecl exports)
bridge.h C API (prefill, decode, batched decode)
scripts/
install.sh One-step build + install
build_bottle.sh Build + upload Homebrew bottle
integration_test.sh End-to-end smoke test
homebrew/
vllm-swift.rb Homebrew formula
tests/ 84 tests, 97% coverage
Requirements
- macOS 14+ on Apple Silicon
- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
- Python 3.10+
- vLLM 0.19+
- mlx-swift-lm (pulled automatically by Swift Package Manager)
License
Apache-2.0