@Prince_Canuma：今天我们发布了最大规模的 MLX-VLM 更新：v0.6.0 ……并且我们正在提升。这次更新旨在将你的 Apple 设备……

X AI KOLs Following 2026/06/01 20:14 工具

mlx-vlm apple-ml vision-language-model local-ai speculative-decoding agent-tools open-source

摘要

MLX-VLM v0.6.0 已发布，新增推测解码、兼容 Anthropic API 的智能体服务器、新模型（DeepSeek V4、ZAYA1-VL 等）、图像生成/编辑以及音频输入支持，使 Apple 设备上能运行本地 AI 智能体。

今天我们发布了最大规模的 MLX-VLM 更新：v0.6.0 ……并且我们正在提升。这次更新旨在将你的 Apple 设备变成真正的本地智能体机器。从你的桌面到你的口袋。新特性： * 全面的推测解码 — Gemma 4 EAGLE3 + DFlash、Qwen MTP、DeepSeek V4 MTP。更快的 token，更少的等待。 * 智能体就绪服务器 — 原生 Anthropic /v1/messages API、有状态的 /v1/responses、工具调用、Codex 上下文预算。直接将 Claude Code 和 Codex 接入本地模型。 * 海量新模型 — DeepSeek V4、ZAYA1-VL、MiniCPM-V 4.6、LFM2 MoE、Step-3.7 Flash、Laguna 等。 * 图像生成与编辑 — FLUX.2（基础版 + klein）、PrismML Bonsai。 * 音频输入 — Qwen3 Omni、Gemma 4 音频、base64 聊天音频。 * TurboQuant KV 缓存 — RHT 校正的快速路径，内存占用更少。 * 模块化服务器、更好的指标、更干净的流式传输。在你手中的硬件上运行真正的智能体。 GitHub：http://github.com/Blaizzy/mlx-vlm

查看原文

查看缓存全文

缓存时间: 2026/06/02 03:41

今天我们发布了迄今为止最大规模的 MLX-VLM 版本：v0.6.0……并且我们正在提升——这次更新旨在将你的 Apple 设备变成真正的本地智能体机器。从桌面到口袋，一应俱全。

新特性：

随处投机解码 — Gemma 4 EAGLE3 + DFlash、Qwen MTP、DeepSeek V4 MTP。更快的 token，更少的等待。
智能体就绪服务器 — 原生 Anthropic /v1/messages API、有状态 /v1/responses、工具调用、Codex 上下文预算。直接将 Claude Code 和 Codex 接入本地模型。
海量新模型 — DeepSeek V4、ZAYA1-VL、MiniCPM-V 4.6、LFM2 MoE、Step-3.7 Flash、Laguna 等等。
图像生成与编辑 — FLUX.2（base + klein）、PrismML Bonsai。
音频输入 — Qwen3 Omni、Gemma 4 音频、base64 聊天音频。
TurboQuant KV 缓存 — RHT 校正的快速路径，减少内存占用。
模块化服务器，更好的指标，更清晰的流式传输。

在您手中的硬件上运行真正的智能体。

Github: http://github.com/Blaizzy/mlx-vlm

Blaizzy/mlx-vlm

来源：https://github.com/Blaizzy/mlx-vlm 上传 Python 包 (https://github.com/Blaizzy/mlx-vlm/actions/workflows/python-publish.yml)

MLX-VLM

MLX-VLM 是一个用于在 Mac 上使用 MLX 进行视觉语言模型（VLM）和全能模型（支持音频和视频的 VLM）的推理与微调的包。

模型特定文档

一些模型提供了详细的文档，包含提示格式、示例和最佳实践：

模型	文档
DeepSeek-OCR	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/deepseekocr/README.md)
DeepSeek-OCR-2	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/deepseekocr_2/README.md)
DOTS-OCR	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/dots_ocr/README.md)
DOTS-MOCR	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/dots_ocr/README.md)
GLM-OCR	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/glm_ocr/README.md)
Phi-4 Reasoning Vision	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/phi4_siglip/README.md)
MiniCPM-o	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/minicpmo/README.md)
Phi-4 Multimodal	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/phi4mm/README.md)
MolmoPoint	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/molmo_point/README.md)
Moondream3	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/moondream3/README.md)
Gemma 4	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/gemma4/README.md)
Falcon-OCR	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/falcon_ocr/README.md)
Granite Vision 3.2	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/granite_vision/README.md)
Granite 4.0 Vision	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/granite4_vision/README.md)
MiniCPM-V 4.6	文档 (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/minicpmv4_6/README.md)

安装

开始使用最简单的方法是通过 pip 安装 mlx-vlm 包：

pip install -U mlx-vlm

使用

命令行界面 (CLI)

使用 CLI 从模型生成输出：

# 文本生成
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"

# 图像生成
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg

# 音频生成（新）
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# 多模态生成（图像 + 音频）
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav

思考预算

对于思考模型（如 Qwen3.5），你可以限制在思考块中花费的 token 数量：

mlx_vlm.generate --model mlx-community/Qwen3.5-2B-4bit \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --enable-thinking \
  --prompt "Solve 2+2"

标志	描述
`--enable-thinking`	在聊天模板中激活思考模式
`--thinking-budget`	思考块内允许的最大 token 数量
`--thinking-start-token`	开启思考块的 token（默认：``）
`--thinking-end-token`	关闭思考块的 token（默认：``）

当预算超支时，模型会被强制输出 \n 并切换到回答。如果传入了 --enable-thinking 但模型的聊天模板不支持，则预算仅在模型自行生成起始 token 时生效。

在服务器上，思考模式默认禁用。使用 --enable-thinking 启动服务器，使其成为未指定该选项的请求的默认模式：

mlx_vlm.server --model Qwen/Qwen3.5-4B --enable-thinking

请求可以通过 enable_thinking: true 或 enable_thinking: false 覆盖服务器默认值。

投机解码

通过使用较小的“起草者”模型起草几个候选 token，并在单个目标前向传播中验证它们，从而加速生成。支持三种起草者系列。

标志	描述
`--draft-model`	起草者的 HuggingFace 仓库或本地路径
`--draft-kind`	起草者系列 — `dflash`（默认）、`eagle3` 或 `mtp`（Gemma 4）
`--draft-block-size`	覆盖起草者配置的块大小

有关包括批量生成在内的 Python API 示例，请参阅 docs/usage.md。

DFlash (Qwen3.5)

一种轻量级块扩散起草者，每轮预测多个 token，通常快 2-3 倍。

# 使用投机解码进行文本生成
mlx_vlm.generate --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --prompt "Write a quicksort in Python." \
  --max-tokens 512 --temperature 0 --enable-thinking

# 也可与图像一起使用
mlx_vlm.generate --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --image examples/images/cats.jpg \
  --prompt "Describe this image." \
  --max-tokens 256 --temperature 0 --enable-thinking

# 使用投机解码的服务器
mlx_vlm.server --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash

DFlash 起草缓存窗口化功能可通过 Python API 使用。在投机解码期间，目标模型仍会用其完整的 KV 缓存验证每个提议的 token；此旋钮仅更改 DFlash 起草者的缓存。当设置了 draft_window_size 时，起草者在其自己的 KV 缓存中最多保留那么多的最近已提交 token，而不是关注整个生成的 prefix。这减少了起草端的缓存长度和内存，但由于起草者的上下文比目标验证器少，接受率可能降低。

在 MLX 上，对于 Qwen3.5 DFlash，完整的起草缓存通常更快，因此窗口化默认为 None；仅当你想尝试这种紧凑的最近 token 缓存权衡时才进行设置：

from mlx_vlm import load
from mlx_vlm.generate import generate
from mlx_vlm.speculative.drafters import load_drafter

model, processor = load("Qwen/Qwen3.5-4B")
draft_model, draft_kind = load_drafter("z-lab/Qwen3.5-4B-DFlash")
draft_model.config.draft_window_size = 256  # None 禁用窗口化

result = generate(
    model, processor,
    "Write a quicksort in Python.",
    max_tokens=512,
    temperature=0,
    draft_model=draft_model,
    draft_kind=draft_kind,
)

Gemma 4 MTP

多 Token 预测 (https://ai.google.dev/gemma/docs/mtp/mtp)：谷歌的 4 层“助手”起草者，与目标共享 K/V，并从固定位置自回归地起草多个 token。传递 --draft-kind mtp 以调度 MTP 循环。

mlx_vlm.generate --model mlx-community/gemma-4-31B-it-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp --draft-block-size 4 \
  --prompt "Explain speculative decoding in 3 sentences." \
  --max-tokens 256 --temperature 0

# 服务器
mlx_vlm.server --model mlx-community/gemma-4-31B-it-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp --draft-block-size 4

支持的配对（目标 ↔ 起草者）：

目标	起草者
`mlx-community/gemma-4-E2B-it-bf16`	`mlx-community/gemma-4-E2B-it-assistant-bf16`
`mlx-community/gemma-4-E4B-it-bf16`	`mlx-community/gemma-4-E4B-it-assistant-bf16`
`mlx-community/gemma-4-26B-A4B-it-bf16`	`mlx-community/gemma-4-26B-A4B-it-assistant-bf16`
`mlx-community/gemma-4-31B-it-bf16`	`mlx-community/gemma-4-31B-it-assistant-bf16`

测量到的加速（贪婪，逐字节相同输出）：在 B=4 时，26B-A4B 上最高 3.94×，31B 上最高 2.29×。有关完整扫描和架构说明，请参阅 mlx_vlm/speculative/drafters/gemma4_assistant/README.md。

Gemma 4 EAGLE-3

EAGLE-3 (https://sgl-project.github.io/SpecForge/concepts/EAGLE3.html) 使用轻量级单层推测器从三个目标隐藏状态捕获中起草。Red Hat Speculators 检查点自动检测为 --draft-kind eagle3。

mlx_vlm.generate --model mlx-community/gemma-4-31B-it-bf16 \
  --draft-model RedHatAI/gemma-4-31B-it-speculator.eagle3 \
  --prompt "Explain speculative decoding in 3 sentences." \
  --max-tokens 256 --temperature 0

# 服务器
mlx_vlm.server --model mlx-community/gemma-4-31B-it-bf16 \
  --draft-model RedHatAI/gemma-4-31B-it-speculator.eagle3

Gradio 聊天界面

使用 Gradio 启动聊天界面：

mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Python 脚本

以下是如何在 Python 脚本中使用 MLX-VLM 的示例：

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# 加载模型
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# 准备输入
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] 也可以使用 PIL.Image.Image 对象
prompt = "Describe this image."

# 应用聊天模板
formatted_prompt = apply_chat_template(
    processor,
    config,
    prompt,
    num_images=len(image)
)

# 生成输出
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

音频示例

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# 加载支持音频的模型
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# 准备音频输入
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

# 应用包含音频的聊天模板
formatted_prompt = apply_chat_template(
    processor,
    config,
    prompt,
    num_audios=len(audio)
)

# 生成包含音频的输出
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)

多模态示例（图像 + 音频）

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# 加载多模态模型
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# 准备输入
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""  # 这里填写你的提示词

# 应用聊天模板
formatted_prompt = apply_chat_template(
    processor,
    config,
    prompt,
    num_images=len(image),
    num_audios=len(audio)
)

# 生成输出
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)

服务器 (FastAPI)

启动服务器：

mlx_vlm.server --port 8080
# 启动时预加载模型（Hugging Face 仓库或本地路径）
mlx_vlm.server --model <model-name>
# 使用适配器预加载模型
mlx_vlm.server --model <model-name> --adapter-path <path>
# 启用远程代码信任（某些模型需要）
mlx_vlm.server --trust-remote-code
# 默认启用思考模式（请求可覆盖）
mlx_vlm.server --model Qwen/Qwen3.5-4B --enable-thinking

服务器选项

--model：服务器启动时预加载模型，接受 Hugging Face 仓库 ID 或本地路径（可选，如果省略则在第一个请求时延迟加载）
--adapter-path：与预加载模型一起使用的适配器权重路径
--draft-model：投机解码起草者路径或 HF id（例如 z-lab/Qwen3.5-4B-DFlash、RedHatAI/gemma-4-31B-it-speculator.eagle3、google/gemma-4-31B-it-assistant）——启用投机解码以获得约 2 倍或更高的吞吐量
--draft-kind：起草者系列 — dflash（默认）、eagle3 或 mtp（Gemma 4）
--draft-block-size：覆盖起草者配置的块大小
--host：主机地址（默认：0.0.0.0）
--port：端口号（默认：8080）
--trust-remote-code：从 Hugging Face Hub 加载模型时信任远程代码
--enable-thinking：默认启用思考模式，用于未设置 enable_thinking 的请求
--kv-bits：KV 缓存量化的位数（例如 8 表示均匀，3.5 表示 TurboQuant）
--kv-quant-scheme：KV 缓存量化后端（uniform 或 turboquant）
--kv-group-size：均匀 KV 缓存量化的组大小（默认：64）
--max-kv-size：KV 缓存的最大 token 数
--vision-cache-size：缓存的最大视觉特征数（默认：20）
--log-level：日志级别 — DEBUG、INFO、WARNING、ERROR、CRITICAL（默认：INFO）

你也可以通过环境变量设置远程代码信任：

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server

服务器提供多个端点以满足不同用例，并支持动态模型加载/卸载和缓存（一次一个模型）。

连续批处理

服务器支持连续批处理，以在处理多个并发请求时获得更高的吞吐量。新请求会立即加入活动批次，而无需等待现有请求完成，并且支持图像和纯文本请求的混合批次。

当服务器加载模型时，连续批处理会自动启用。你可以在启动时预加载模型，以便它准备好立即服务：

mlx_vlm.server --port 8080 --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit

通过健康端点验证：

curl http://localhost:8080/health
# {"status":"healthy","loaded_model":"...","apc_enabled":false}

如果省略了 --model，模型将在第一个请求时加载。

自动前缀缓存 (APC)

自动前缀缓存在共享相同前缀的请求之间重用块级 K/V 缓存状态。这对于重复的长文档、长聊天历史或检索上下文非常有用，其中每个请求追加