@Prince_Canuma: Congrats to @OpenBMB on the release of MiniCPM-V 4.6! MLX-VLM supports it on Day 0, thanks to Magic Yang. Runs…

X AI KOLs Timeline 2026/05/11 19:47 Models

vision-language-model open-source mac-inference mlx-framework minicpm multimodal

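Summary

OpenBMB has released the MiniCPM-V 4.6 vision-language model. The MLX-VLM package supported it on release day, enabling fast inference on Apple Silicon Macs.

Congrats to @OpenBMB on the release of MiniCPM-V 4.6! Day-0 support on MLX-VLM 🚀 thanks to Magic Yang. It runs at 125 tok/s in full precision on an M3 Max. https://t.co/qb0isGJOyJ https://t.co/hL9OS3rrKx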

Cached: 2026/05/11 20:45


Blaizzy/mlx-vlm

Source: https://github.com/Blaizzy/mlx-vlm

MLX-VLM

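MLX-VLM is a Python package for inference and fine-tuning of vision-language models (VLMs) and omni models (VLMs with audio and video support) on a Mac using MLX.

Model-specific documentation

Some models have detailed documentation covering prompt formats, examples, and best practices:

Model	Documentation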
DeepSeek-OCR	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/deepseekocr/README.md)
DeepSeek-OCR-2	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/deepseekocr_2/README.md)
DOTS-OCR	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/dots_ocr/README.md)
DOTS-MOCR	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/dots_ocr/README.md)
GLM-OCR	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/glm_ocr/README.md)
Phi-4 Reasoning Vision	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/phi4_siglip/README.md)
MiniCPM-o	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/minicpmo/README.md)
Phi-4 Multimodal	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/phi4mm/README.md)
MolmoPoint	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/molmo_point/README.md)
Moondream3	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/moondream3/README.md)
Gemma 4	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/gemma4/README.md)
Falcon-OCR	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/falcon_ocr/README.md)
Granite Vision 3.2	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/granite_vision/README.md)
Granite 4.0 Vision	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/granite4_vision/README.md)
MiniCPM-V 4.6	Docs (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/minicpmv4_6/README.md)

Installation

The easiest way to get started is to install the mlx-vlm package with pip:

pip install -U mlx-vlm

Usage

Command-line interface (CLI)

Generate output from a model using the CLI:

# Text generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"

# Generation from an image
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg

# Generation from audio (new)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# Multimodal generation (image + audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
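
When these commands are driven from a script (for example, to loop over many files), it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown above; the helper name build_generate_command is hypothetical and not part of MLX-VLM.

```python
import shlex
from typing import List, Optional


def build_generate_command(
    model: str,
    prompt: Optional[str] = None,
    image: Optional[str] = None,
    audio: Optional[str] = None,
    max_tokens: int = 100,
    temperature: Optional[float] = None,
) -> List[str]:
    """Build an argv list for mlx_vlm.generate (hypothetical helper)."""
    cmd = ["mlx_vlm.generate", "--model", model, "--max-tokens", str(max_tokens)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if image is not None:
        cmd += ["--image", image]
    if audio is not None:
        cmd += ["--audio", audio]
    return cmd


cmd = build_generate_command(
    "mlx-community/gemma-3n-E2B-it-4bit",
    prompt="Describe what you see and hear",
    image="/path/to/image.jpg",
    audio="/path/to/audio.wav",
)
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it on a machine where mlx-vlm is installed; a list avoids shell-quoting problems with prompts that contain spaces or quotes.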

Thinking budget

For thinking models (e.g., Qwen3.5), you can cap the number of tokens spent inside the thinking block:

mlx_vlm.generate --model mlx-community/Qwen3.5-2B-4bit \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --enable-thinking \
  --prompt "Solve 2+2"

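Flag	Description
`--enable-thinking`	Activates thinking mode in the chat template
`--thinking-budget`	Maximum number of tokens allowed inside the thinking block
`--thinking-start-token`	Token that opens the thinking block (default: `<think>`)
`--thinking-end-token`	Token that closes the thinking block (default: `</think>`)

When the budget is exceeded, the model is forced to emit `\n` followed by the thinking end token and move on to the answer. If --enable-thinking is passed but the model's chat template does not support it, the budget applies only when the model generates the start token on its own.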
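
The budget rule can be illustrated with a toy decoding loop. This is a sketch under stated assumptions, not MLX-VLM's implementation: next_token is a hypothetical stand-in for the model, tokens are plain strings, and the forced output is assumed to be a newline followed by the thinking end token.

```python
from typing import Callable, List


def generate_with_budget(
    next_token: Callable[[List[str]], str],
    budget: int,
    max_tokens: int = 32,
    start_token: str = "<think>",
    end_token: str = "</think>",
    eos_token: str = "<eos>",
) -> List[str]:
    """Greedy loop that force-closes the thinking block after `budget` tokens."""
    out: List[str] = []
    in_block = False  # True while between start_token and end_token
    used = 0          # thinking tokens spent in the current block
    while len(out) < max_tokens:
        if in_block and used >= budget:
            # Budget spent: skip sampling and close the block ourselves.
            out.extend(["\n", end_token])
            in_block = False
            continue
        tok = next_token(out)
        if tok == eos_token:
            break
        out.append(tok)
        if tok == start_token:
            in_block, used = True, 0
        elif tok == end_token:
            in_block = False
        elif in_block:
            used += 1
    return out


def toy_model(context: List[str]) -> str:
    """Hypothetical model that would keep thinking forever if allowed."""
    if "</think>" in context:
        return "<eos>" if context[-1] == "4" else "4"
    return "hmm" if "<think>" in context else "<think>"


result = generate_with_budget(toy_model, budget=3)
print(result)
```

Because the counter only starts once the start token appears in the output, a model that never opens a thinking block is unaffected, which matches the fallback behavior described above.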

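On the server, thinking mode is off by default. Pass --enable-thinking at startup to make it the default for requests that do not specify it:

mlx_vlm.server --model Qwen/Qwen3.5-4B --enable-thinking

A client request can override the server default with enable_thinking: true or enable_thinking: false.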
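
The precedence between the server default and a per-request value can be sketched in a few lines. The function name resolve_enable_thinking and the request dictionaries are hypothetical; only the enable_thinking field name comes from the text above.

```python
from typing import Any, Dict


def resolve_enable_thinking(request: Dict[str, Any], server_default: bool) -> bool:
    """Use the request's explicit enable_thinking value, else the server default."""
    value = request.get("enable_thinking")
    return server_default if value is None else bool(value)


# Server started with --enable-thinking, so its default is True.
server_default = True
print(resolve_enable_thinking({}, server_default))                          # default applies
print(resolve_enable_thinking({"enable_thinking": False}, server_default))  # request wins
```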

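Speculative decoding

Speeds up generation by having a small "draft" model propose several candidate tokens, which the target model then verifies in a single forward pass. Two drafter types are supported.

Flag	Description
`--draft-model`	Hugging Face repo name or local path of the draft model
`--draft-kind`	Drafter type: `dflash` (default) or `mtp` (Gemma 4)
`--draft-block-size`	Overrides the block size from the draft model's config

For Python API examples such as batching, see docs/usage.md.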
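
The draft-and-verify idea can be shown with a toy greedy example. This is a generic sketch of speculative decoding, not the DFlash or MTP code path: both "models" are hypothetical functions from a token list to the next token, and verification is done position by position where a real implementation would score the whole block in one batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(target: Model, draft: Model, context: List[int], block_size: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens accepted."""
    # 1) The draft model proposes block_size tokens autoregressively.
    proposal: List[int] = []
    for _ in range(block_size):
        proposal.append(draft(context + proposal))
    # 2) The target checks each position; stop at the first disagreement.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token instead
            return accepted
        accepted.append(tok)
    # 3) Every draft token matched, so the target's next token is a bonus.
    accepted.append(target(context + accepted))
    return accepted


def speculative_generate(
    target: Model, draft: Model, prompt: List[int], max_new_tokens: int, block_size: int = 4
) -> Tuple[List[int], int]:
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, block_size))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


tokens, rounds = speculative_generate(target_model, draft_model, [0], max_new_tokens=8)
print(tokens, rounds)
```

With greedy acceptance the output is identical to decoding with the target alone; the gain is that it takes fewer target rounds than tokens generated.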

DFlash (Qwen3.5)

A lightweight block-diffusion drafter that predicts multiple tokens per round, typically giving a 2–3× speedup.

# Text generation with speculative decoding
mlx_vlm.generate --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --prompt "Write a quicksort in Python." \
  --max-tokens 512 --temperature 0 --enable-thinking

# Image input is also supported
mlx_vlm.generate --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --image examples/images/cats.jpg \
  --prompt "Describe this image." \
  --max-tokens 256 --temperature 0 --enable-thinking

# Server with speculative decoding enabled
mlx_vlm.server --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash

Gemma 4 MTP

Multi-token prediction (https://ai.google.dev/gemma/docs/mtp/mtp): Google's 4-layer "assistant" drafter shares K/V with the target model and autoregressively generates multiple tokens from a fixed position. Pass --draft-kind mtp to trigger the MTP loop.

mlx_vlm.generate --model mlx-community/gemma-4-31B-it-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp --draft-block-size 4 \
  --prompt "Explain speculative decoding in 3 sentences." \
  --max-tokens 256 --temperature 0

# Server
mlx_vlm.server --model mlx-community/gemma-4-31B-it-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp --draft-block-size 4

Supported pairs (target model ↔ drafter):

Target model	Drafter
`mlx-community/gemma-4-E2B-it-bf16`	`mlx-community/gemma-4-E2B-it-assistant-bf16`
`mlx-community/gemma-4-E4B-it-bf16`	`mlx-community/gemma-4-E4B-it-assistant-bf16`
`mlx-community/gemma-4-26B-A4B-it-bf16`	`mlx-community/gemma-4-26B-A4B-it-assistant-bf16`
`mlx-community/gemma-4-31B-it-bf16`	`mlx-community/gemma-4-31B-it-assistant-bf16`

Measured speedups (greedy sampling, byte-identical output): up to 3.94× on 26B-A4B and up to 2.29× on 31B (B=4). For full performance numbers and architecture notes, see mlx_vlm/speculative/drafters/gemma4_assistant/README.md.

Gradio chat UI

Launch a Gradio-based chat interface:

mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Python script

Here is an example of using MLX-VLM in a Python script:

import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare the input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")]  # PIL.Image.Image objects also work
prompt = "Describe this image."

# Apply the chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

Audio example

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load a model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare the audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

# Apply the chat template with audio
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

# Generate output (with audio)
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)

Multimodal example (image + audio)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load a multimodal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare the inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""

# Apply the chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image), num_audios=len(audio)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)

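Server (FastAPI)

Start the server:

mlx_vlm.server --port 8080

# Preload a model at startup (Hugging Face repo or local path)
mlx_vlm.server --model <hf_repo_or_local_path>

# Preload a model with an adapter
mlx_vlm.server --model <hf_repo_or_local_path> --adapter-path <adapter_path>

# Enable trust_remote_code (required by some models)
mlx_vlm.server --trust-remote-code

# Turn thinking on by default for requests that do not specify it
mlx_vlm.server --model Qwen/Qwen3.5-4B --enable-thinking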

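Server options

--model: Preloads a model at server startup; accepts a Hugging Face repo ID or a local path (optional; if omitted, the model is loaded lazily on the first request)
--adapter-path: Path to adapter weights to use with the preloaded model
--draft-model: Path or HF ID of the draft model for speculative decoding (e.g., z-lab/Qwen3.5-4B-DFlash, google/gemma-4-31B-it-assistant); can raise throughput by roughly 2× or more
--draft-kind: Drafter type: dflash (default) or mtp (Gemma 4)
--draft-block-size: Overrides the block size from the draft model's config
--host: Host address (default: 0.0.0.0)
--port: Port number (default: 8080)
--trust-remote-code: Trusts remote code when loading models from the Hugging Face Hub
--enable-thinking: Turns thinking on by default for requests that do not set enable_thinking
--kv-bits: Number of bits for KV cache quantization (e.g., 8 for uniform, 3.5 for TurboQuant)
--kv-quant-scheme: KV cache quantization backend (uniform or turboquant)
--kv-group-size: Group size for uniform KV cache quantization (default: 64)
--max-kv-size: Maximum KV cache size, in tokens
--vision-cache-size: Maximum number of cached vision features (default: 20)
--log-level: Logging level: DEBUG, INFO, WARNING, ERROR, or CRITICAL (default: INFO)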
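
To make the --kv-bits and --kv-group-size options more concrete, here is a generic sketch of uniform (affine) group quantization in pure Python. It assumes each group of values stores its own scale and offset; it is not MLX-VLM's kernel, and TurboQuant is not covered. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, offset)


def quantize_uniform(values: List[float], bits: int = 8, group_size: int = 64) -> List[Group]:
    """Quantize values in independent groups of group_size elements."""
    levels = (1 << bits) - 1  # e.g. 255 for 8 bits
    groups: List[Group] = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((v - lo) / scale) for v in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_uniform(groups: List[Group]) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [code * scale + lo for codes, scale, lo in groups for code in codes]


values = [math.sin(i / 7.0) for i in range(128)]
groups = quantize_uniform(values, bits=8, group_size=64)
restored = dequantize_uniform(groups)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(len(groups), max_err)
```

Smaller groups track local value ranges more closely at the cost of storing more scales and offsets; fewer bits shrink the cache but increase the rounding error.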

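Trusting remote code can also be enabled through an environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server

The server exposes several endpoints for different use cases and supports dynamic model loading and unloading with caching (one model loaded at a time).

Continuous batching

The server supports continuous batching, which gives higher throughput when handling multiple concurrent requests. New requests join the active batch immediately instead of waiting for existing requests to finish, and a batch can mix image and text-only requests.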
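
The scheduling idea can be shown with a small simulation. This is a toy model of continuous batching, not the server's scheduler: each decode step produces one token for every active request, requests are admitted as soon as they arrive, and max_batch is a hypothetical limit added for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def run_continuous_batching(
    arrivals: List[Tuple[int, str, int]], max_batch: int = 4
) -> Dict[str, int]:
    """Simulate decoding; arrivals are (arrival_step, request_id, tokens_needed).

    Returns the step count at which each request finished.
    """
    pending: Deque[Tuple[int, str, int]] = deque(sorted(arrivals))
    active: Dict[str, int] = {}    # request_id -> tokens still to generate
    finished: Dict[str, int] = {}
    step = 0
    while pending or active:
        # Admit arrived requests into the running batch right away.
        while pending and pending[0][0] <= step and len(active) < max_batch:
            _, rid, needed = pending.popleft()
            active[rid] = needed
        # One decode step yields one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished[rid] = step + 1
                del active[rid]
        step += 1
    return finished


finished = run_continuous_batching([(0, "a", 5), (1, "b", 2), (2, "c", 1)])
print(finished)
```

In this run the short requests "b" and "c" complete while the long request "a" is still decoding; with one-at-a-time processing they would have to wait for "a" to finish first.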

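Continuous batching is enabled automatically once the server has loaded a model. You can preload a model at startup so the server is ready to serve immediately:

mlx_vlm.server --port 8080 --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit

Confirm with the health check endpoint:

curl http://localhost:8080/health
# {"status":"healthy","loaded_model":"...","apc_enabled":false}

If --model is omitted, the model is loaded on the first request.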
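
A script can run the same health check before sending traffic. The sketch below parses the response shape shown above; treating an empty or missing loaded_model as "not ready yet" is an assumption, and parse_health and check_server are hypothetical helper names.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def parse_health(body: str) -> Dict[str, Any]:
    """Summarize a /health response body."""
    data = json.loads(body)
    model = data.get("loaded_model")
    return {
        # Assumption: no model name means the model has not been loaded yet.
        "ready": data.get("status") == "healthy" and bool(model),
        "model": model,
        "apc": bool(data.get("apc_enabled", False)),
    }


def check_server(url: str = "http://localhost:8080/health", timeout: float = 2.0) -> Dict[str, Any]:
    """Fetch and parse the health endpoint of a running server."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))


sample = (
    '{"status":"healthy",'
    '"loaded_model":"mlx-community/Qwen2.5-VL-3B-Instruct-4bit",'
    '"apc_enabled":false}'
)
summary = parse_health(sample)
print(summary)
```

check_server() is only defined here, not called, so the snippet runs without a server; call it once mlx_vlm.server is listening on port 8080.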

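Automatic prefix caching (APC)

Automatic prefix caching reuses KV cache state at the block level when different requests share the same prefix. It suits repeated long documents, long conversation histories, and retrieval-augmented workloads where each request appends only a short suffix.

APC has two tiers:

Warm memory: keeps reusable APCBlock tensors in process memory. This is the fastest path, but it occupies both the reusable block pool and the runtime KVCache.
Warm disk: persists cached prefixes as safetensors shards so they survive process restarts. Warm-disk reads build the per-layer prompt cache directly, without promoting the restored blocks into the APCBlock pool; writes can populate both the memory and disk tiers.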
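
Block-level prefix reuse can be illustrated with a toy in-memory cache. This is a sketch of the general technique, not the APCManager internals: it assumes each full block of tokens is keyed by a hash chained over all preceding blocks, stores a placeholder string instead of real KV tensors, and uses the hypothetical class name ToyPrefixCache.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 16


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with everything before it (chained)."""
    hashes: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size]))
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}  # hash -> placeholder for KV state

    def lookup(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            hit += BLOCK_SIZE
        return hit

    def store(self, tokens: List[int]) -> None:
        for h in block_hashes(tokens):
            self.blocks.setdefault(h, "kv-state")


document = list(range(100))        # shared long prefix
request1 = document + [1000, 1001]
request2 = document + [2000]

cache = ToyPrefixCache()
cold_hit = cache.lookup(request1)  # nothing cached yet
cache.store(request1)
warm_hit = cache.lookup(request2)  # shared full blocks are reused
print(cold_hit, warm_hit, len(request2) - warm_hit)
```

Only whole blocks are reusable, so the second request still prefills the tail of the document that did not fill a block, plus its own suffix.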

Python script

Use APCManager directly with stream_generate:

from pathlib import Path
from mlx_vlm import load, stream_generate
from mlx_vlm.apc import APCManager, DiskBlockStore
from mlx_vlm.prompt_utils import apply_chat_template

model_id = "Qwen/Qwen3-VL-4B-Instruct"
model, processor = load(model_id)

disk = DiskBlockStore(
    Path("~/.cache/mlx-vlm/caching").expanduser(),
    namespace=model_id,
    max_bytes=3 * (1 << 30),  # 3 GiB disk cap; set to None for unlimited
)

apc = APCManager(num_blocks=4096, block_size=16, disk=disk)
document = Path("long_document.txt").read_text()

try:
    # The first request computes the full prefix and stores reusable KV blocks.
    prompt1 = apply_chat_template(
        processor, model.config,
        prompt=f"{document}\n\nSummarize the key decisions.",
        num_images=0,
    )
    for _ in stream_generate(
        model, processor, prompt1,
        max_tokens=128, temperature=0.0, apc_manager=apc
    ):
        pass

    # The second request shares the same document prefix and prefills only the suffix.
    prompt2 = apply_chat_template(
        processor, model.config,
        prompt=f"{document}\n\nList the open engineering risks.",
        num_images=0,
    )
    for chunk in stream_generate(
        model, processor, prompt2,
        max_tokens=128, temperature=0.0, apc_manager=apc
    ):
        print(chunk.text, end="", flush=True)
    print(apc.stats_snapshot())
finally:
    apc.close()

To compare cold start, warm memory, warm disk, and disk eviction behavior, use the same direct API approach:

import os
import tempfile
import time
from pathlib import Path
from mlx_vlm import load, stream_generate
from mlx_vlm.apc import APCManager, DiskBlockStore
from mlx_vlm.prompt_utils import apply_chat_template

model_id = "Qwen/Qwen3-VL-4B-Instruct"
contexts = [8000, 20000, 50000, 100000]
disk_cap_gb = 0   # 0 means no cap
shard_max_blocks = 256
context_sweep_max_tokens = 1   # one token is enough to measure prefill reuse
test_prompt_tokens = 8000
fill_prompts = 80
eviction_disk_cap_gb = 3.0

os.environ["APC_DISK_SHARD_MAX_BLOCKS"] = str(shard_max_blocks)

model, processor = load(model_id)
tokenizer = processor

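Related articles

@AdinaYakup: MiniCPM V4.6, a 1B multimodal LLM that actually runs on a phone, just released by @OpenBMB. 1B - Apache2.0. Supports iOS, Android,…

X AI KOLs Following

OpenBMB has released MiniCPM V4.6, a 1B-parameter multimodal LLM optimized for mobile devices and licensed under Apache 2.0. It features hybrid visual token compression and is claimed to deliver about 1.5× the throughput of Qwen3.5 0.8B when running natively on iOS, Android, and HarmonyOS.

MiniCPM-V 4.6

Product Hunt

MiniCPM-V 4.6 is a highly efficient 1.3B-parameter vision-language model optimized for mobile devices.

@FeitengLi: OpenBMB has open-sourced MiniCPM-V 4.6: 1.3B parameters (SigLIP2-400M + Qwen3.5-0.8B), 262k context, and over 50% fewer visual-encoding FLOPs than the previous generation. Token cost on the same task compared with Qwen3.5-0…

X AI KOLs Timeline

OpenBMB releases MiniCPM-V 4.6, a 1.3B-parameter multimodal LLM with 262k context and significantly reduced visual encoding FLOPs, achieving strong benchmark performance and broad inference framework support.

@Prince_Canuma: Today we are releasing the biggest MLX-VLM update yet: v0.6.0 … and we are stepping it up. This update is designed to turn your Apple device…

X AI KOLs Following

MLX-VLM v0.6.0 has been released, adding speculative decoding, an Anthropic API-compatible agent server, new models (DeepSeek V4, ZAYA1-VL, and others), image generation/editing, and audio input support, so that local AI agents can run on Apple devices.

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Papers with Code Trending

MiniCPM-V 4.5 is an 8B-parameter multimodal LLM that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a new data strategy, and a hybrid reinforcement learning method. The model reportedly outperforms larger closed-source and open-source baselines overall while significantly reducing GPU memory usage and inference time.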
