ServiceNow-AI/SuperApriel-15B-Instruct · Hugging Face

Reddit r/LocalLLaMA 2026/04/22 14:10 Models

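Summary

ServiceNow has released SuperApriel-15B-Instruct: a single 15B-parameter checkpoint with 8 built-in mixer presets that can switch between 1× and 10.7× decode throughput while retaining up to 96% of quality at 32K context.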

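A 15B-parameter **token-mixer supernet** that offers **8 optimized deployment presets** at a 32K sequence length, spanning 1.0× to 10.7× decode throughput, all from a single checkpoint. Derived from [Apriel-1.6](https://huggingface.co/ServiceNow-AI/Apriel-1.6-15b-Thinker) through stochastic distillation and targeted supervised fine-tuning.

* **Model size:** 15B parameters
* **Layers:** 48 decoder layers, each with 4 mixer variants
* **Context length:** 262K positions (runtime-dependent)
* **Languages:** English (best)

# Highlights

* **Flexible deployment from a single checkpoint:** multiple presets that trade throughput against quality
* **Four mixers per layer:** Full Attention (FA), Sliding Window Attention (SWA), Gated DeltaNet (GDN), Kimi Delta Attention (KDA)
* **Instruction-tuned:** targeted SFT on multiple Pareto-optimal placements
* **Speculative decoding support:** full attention serves as the target model and an efficient placement from the same checkpoint serves as the draft model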


Cached: 2026/04/22 17:35


Source: https://huggingface.co/ServiceNow-AI/SuperApriel-15B-Instruct

# SuperApriel-15b-Instruct

/ˈɑː.pri.əl/

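A 15B-parameter token-mixer supernet that provides 8 deployment presets at a 32K sequence length from a single checkpoint, spanning 1.0× to 10.7× decode throughput. It is derived from Apriel-1.6 (https://huggingface.co/ServiceNow-AI/Apriel-1.6-15b-Thinker) through stochastic distillation and targeted supervised fine-tuning.

* **Model size:** 15B parameters
* **Layers:** 48 decoder layers, each with 4 mixer variants
* **Context length:** 262K positions (runtime-dependent)
* **Languages:** English (best)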

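## Highlights

* **Flexible deployment from one checkpoint:** multiple presets that trade speed against quality
* **Four mixers per layer:** Full Attention (FA), Sliding Window Attention (SWA), Gated DeltaNet (GDN), Kimi Delta Attention (KDA)
* **Instruction-tuned:** targeted SFT on 8 Pareto-optimal placements
* **Speculative decoding support:** within the same checkpoint, full attention acts as the target and an efficient placement acts as the draft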
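To make the speculative-decoding highlight concrete, here is a minimal sketch of one greedy draft-and-verify step. It is not the model's or vLLM's implementation: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and the two toy functions merely stand in for the full-attention target and an efficient draft placement.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[int]:
    """Run one greedy speculative step and return the newly accepted tokens."""
    # 1. The cheap draft proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target checks each proposed token in order. A real engine scores
    #    all k positions in one batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    """Toy stand-in for the full-attention placement: count up modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy stand-in for an efficient placement: agrees except after a 3."""
    return 0 if seq[-1] == 3 else toy_target(seq)


print(speculative_step([0], toy_draft, toy_target))  # [1, 2, 3, 4]
```

The accepted tokens always match what the target alone would have produced greedily; the draft only changes how many target steps are needed to get there.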

See the report for the full benchmarks, quality-retention curves, and the story behind the model.

## Performance Overview

### Pareto Frontier: Speedup vs. Quality

Pareto frontier plot (https://huggingface.co/ServiceNow-AI/SuperApriel-15B-Instruct/blob/main/assets/pareto_3panel.png)

### Throughput Comparison

Throughput plot (https://huggingface.co/ServiceNow-AI/SuperApriel-15B-Instruct/blob/main/assets/throughput_new_model.png)

## Deployment Presets

Each preset assigns one mixer to every layer. All presets share a single checkpoint; only the mixer selected for each layer at inference time differs.
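As a rough illustration of what a preset means, the sketch below models the supernet as 48 layers that each hold four mixer weight sets, and a placement as one mixer name per layer. The data structures, the `materialize` helper, and the `hybrid` pattern are made up for illustration; only the layer count and the four mixer names come from this card.

```python
MIXERS = ("attention", "sliding_window", "gdn", "kda")  # mixer names used in this card
NUM_LAYERS = 48

# Toy stand-in for the shared checkpoint: every layer stores all four mixers.
supernet = [{m: f"layer{i}.{m}.weights" for m in MIXERS} for i in range(NUM_LAYERS)]


def materialize(placement):
    """Pick one mixer per layer from the shared supernet weights."""
    if len(placement) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} entries, got {len(placement)}")
    return [supernet[i][mixer] for i, mixer in enumerate(placement)]


all_attention = ["attention"] * NUM_LAYERS
# Hypothetical hybrid pattern, NOT one of the official presets.
hybrid = ["attention" if i % 4 == 0 else "gdn" for i in range(NUM_LAYERS)]

print(materialize(all_attention)[:2])
print(materialize(hybrid)[:2])

# Number of distinct placements expressible with one checkpoint.
num_placements = len(MIXERS) ** NUM_LAYERS
print(f"{num_placements:.3e} possible placements")
```

With four choices per layer the space of placements is far too large to enumerate, which is why only a small set of optimized presets is published.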

## Benchmark Results

### Full Results (S2: Targeted SFT with 8 Placements)

| Config | Speedup @32k | AIME'24 | AIME'25 | MATH-500 | GSM8K | FDA | SWDE | RULER | Tau2 | MMLU-Pro | AIME(NV) | GPQA | HLE | LCB | IFBench | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| all-attention | 1.0× | 93.3 | 86.7 | 91.8 | 92.3 | 78.3 | 89.5 | 79.4 | 56.7 | 76.8 | 82.7 | 72.0 | 8.2 | 68.6 | 63.1 | 74.2 |
| Reg\|Lklhd-26 | 2.85× | 86.7 | 83.3 | 92.0 | 91.1 | 79.9 | 88.2 | 74.4 | 30.7 | 76.2 | 80.0 | 70.7 | 10.0 | 69.2 | 63.1 | 71.1 |
| Idealized\|All-18 | 1.99× | 90.0 | 86.7 | 92.0 | 92.1 | 78.0 | 86.6 | 67.1 | 52.6 | 76.3 | 82.2 | 68.2 | 6.9 | 67.3 | 58.8 | 71.8 |
| Reg\|Lklhd-18 | 4.76× | 86.7 | 76.7 | 92.4 | 91.7 | 81.7 | 89.8 | 60.5 | 46.2 | 76.3 | 74.4 | 68.7 | 6.6 | 64.8 | 59.3 | 69.7 |
| Idealized\|Lklhd-6 | 6.2× | 83.3 | 76.7 | 92.4 | 92.3 | 76.9 | 88.9 | 66.1 | 40.4 | 73.6 | 62.2 | 65.0 | 6.1 | 57.0 | 54.9 | 66.8 |
| Idealized\|All-6 | 6.13× | 83.3 | 80.0 | 92.2 | 91.7 | 75.6 | 87.4 | 61.9 | 34.2 | 73.3 | 56.7 | 61.0 | 5.9 | 55.9 | 55.3 | 65.3 |
| Reg\|Lklhd-13 | 6.9× | 76.7 | 73.3 | 90.4 | 91.2 | 68.6 | 85.1 | 57.0 | 28.6 | 69.3 | 26.7 | 61.2 | 5.5 | 52.6 | 57.1 | 60.2 |
| Reg\|Lklhd-10 | 10.69× | 76.7 | 66.7 | 90.6 | 90.8 | 65.2 | 82.9 | 48.6 | 23.4 | 68.2 | 24.4 | 52.5 | 4.5 | 50.2 | 56.2 | 57.2 |
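The "quality retained" figure quoted for a preset can be read off this table as the preset's All score divided by the all-attention All score. The sketch below does that arithmetic; the numbers are copied from the table, while the `retention` helper and the choice of the All column as the quality measure are assumptions made for illustration.

```python
# preset -> (speedup @32k, "All" score), copied from the full-results table
ALL_SCORES = {
    "all-attention": (1.0, 74.2),
    "Reg|Lklhd-26": (2.85, 71.1),
    "Idealized|All-18": (1.99, 71.8),
    "Reg|Lklhd-18": (4.76, 69.7),
    "Idealized|Lklhd-6": (6.2, 66.8),
    "Idealized|All-6": (6.13, 65.3),
    "Reg|Lklhd-13": (6.9, 60.2),
    "Reg|Lklhd-10": (10.69, 57.2),
}

BASELINE = ALL_SCORES["all-attention"][1]


def retention(preset: str) -> float:
    """Fraction of the all-attention 'All' score that a preset keeps."""
    return ALL_SCORES[preset][1] / BASELINE


for name, (speedup, _) in sorted(ALL_SCORES.items(), key=lambda kv: kv[1][0]):
    print(f"{name:18s} {speedup:6.2f}x  {100 * retention(name):5.1f}% of baseline")
```

By this measure Reg|Lklhd-26 keeps about 95.8% of the baseline score at 2.85× throughput, and Reg|Lklhd-10 keeps about 77.1% at 10.69×.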

### Comparison with Other Hybrid Models

| Model | Speedup @32k | Math (avg) | All tasks |
|---|---|---|---|
| Super Apriel (all-attention) | 1.0× | 91.0 | 74.2 |
| Super Apriel (Reg\|Lklhd-26) | 2.85× | 88.3 | 71.1 |
| Super Apriel (Reg\|Lklhd-18) | 4.76× | 86.8 | 69.7 |
| Super Apriel (Idealized\|Lklhd-6) | 6.2× | 86.2 | 66.8 |
| Apriel-H1 15B | 1.97× | 80.4 | 58.4 |
| Nemotron-Nano 12B v2 | 5.85× | 74.5 | 62.4 |
| Falcon-H1R 7B | 4.61× | 78.6 | 64.9 |
| Nemotron-3-Nano 30B | 4.09× | 89.0 | 72.6 |

## Model Overview

SuperApriel-15b-Instruct is trained in two stages:

* Stage 1 (stochastic distillation): all four mixers are trained simultaneously on 266B tokens, with a frozen Apriel-1.6 as the teacher. See SuperApriel-15b-Base (https://huggingface.co/ServiceNow-AI/SuperApriel-15b-Base).
* Stage 2 (targeted SFT): supervised fine-tuning on 60B tokens for the 8 Pareto-optimal configurations found by Bayesian placement optimization. Shared parameters (FFN, embeddings, norms) are frozen; only the mixer weights are trained.

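### Architecture Details

| Component | Details |
|---|---|
| Total parameters | 15B |
| Decoder layers | 48 |
| Q/KV heads | 32 / 8 (grouped-query attention), d_h = 128 |
| Hidden dimension | 5,120 |
| FFN width | 14,336 (SiLU-gated) |
| Vocabulary | 131,072 tokens |
| Vision encoder | Pixtral (16×16 patches) |

### Mixer Types

| Mixer | Time complexity | Memory complexity | Notes |
|---|---|---|---|
| Full Attention (FA) | O(n²) | O(n) KV cache | Standard grouped-query attention |
| Sliding Window (SWA) | O(w·n) | O(w) | Local window of 4,096 tokens |
| Gated DeltaNet (GDN) | O(n) | O(1) fixed state | Matrix-valued recurrent state, delta rule |
| Kimi Delta Attention (KDA) | O(n) | O(1) fixed state | Linear attention, channel-wise gating |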
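The memory column can be turned into concrete numbers for the two attention-type mixers using the head counts given under the architecture details. The sketch below is back-of-the-envelope arithmetic only: it assumes a bf16 KV cache (2 bytes per value) and a single sequence, and it leaves out GDN and KDA because this card does not state their state size. The function name is hypothetical.

```python
NUM_KV_HEADS = 8      # from the architecture details
HEAD_DIM = 128        # d_h
BYTES_PER_VALUE = 2   # assumption: KV cache stored in bf16
SWA_WINDOW = 4096     # sliding-window size from the mixer table


def kv_bytes_per_layer(seq_len: int, mixer: str) -> int:
    """KV-cache bytes that one layer holds for one sequence."""
    per_token = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # keys + values
    if mixer == "attention":
        cached_tokens = seq_len                   # O(n): every token is kept
    elif mixer == "sliding_window":
        cached_tokens = min(seq_len, SWA_WINDOW)  # O(w): only the window is kept
    else:
        raise ValueError("recurrent state size is not specified in this card")
    return per_token * cached_tokens


MIB = 2 ** 20
for mixer in ("attention", "sliding_window"):
    size = kv_bytes_per_layer(32 * 1024, mixer)
    print(f"{mixer:15s} {size / MIB:6.0f} MiB per layer at 32K tokens")
```

Under these assumptions a full-attention layer holds 128 MiB per 32K-token sequence while a sliding-window layer holds 16 MiB, so 48 full-attention layers need about 6 GiB per sequence.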

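## How to Use

The recommended serving backend is vLLM with the Fast-LLM plugin, which supports preset selection and runtime switching. For simple use cases, the model can also be run directly with Transformers (see the Transformers section).

### Use with vLLM

Preset selection and throughput optimization require the Fast-LLM vLLM plugin (https://github.com/ServiceNow/Fast-LLM). There are two serving modes:

* Single-preset mode: loads only the selected mixer weights (about 27 GB in bf16). Unused weights are not loaded, so the model fits comfortably on a single GPU.
* Supernet mode: loads all four mixer weights for every layer (about 46 GB in bf16) and switches placements at runtime through collective_rpc (3–20 ms, no engine reload).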

#### Installation

Install vLLM and the Apriel2 plugin (the full training framework is not required):

python -m venv .venv
source .venv/bin/activate

pip install vllm==0.14.1

git clone git@github.com:ServiceNow/Fast-LLM.git
cd Fast-LLM && git checkout feature/vllm-apriel2-models

# Install only the vLLM plugin
pip install -e ./april2-vllm-plugin/

The plugin depends only on torch, transformers, and einops; it has no training dependencies.

#### Single-Preset Mode

To load one preset on a single GPU, copy its config over the model's main config:

import json, shutil
from vllm import LLM, SamplingParams

model_dir = "ServiceNow-AI/SuperApriel-15b-Instruct"  # or a local path (the copy below needs a local directory)

preset = "Reg_Lklhd-26"  # preset directories are under preset_configs/
shutil.copy(f"{model_dir}/preset_configs/{preset}/config.json",
            f"{model_dir}/config.json")

llm = LLM(
    model=model_dir,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

output = llm.generate(
    ["What is 2 + 3?"],
    SamplingParams(max_tokens=200, temperature=0.6),
)
print(output[0].outputs[0].text)

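Available presets: all-attention, Reg_Lklhd-26, Reg_Lklhd-18, Reg_Lklhd-13, Reg_Lklhd-10, Idealized_All-18, Idealized_All-6, Idealized_Lklhd-6, extra_bayesian-mix-7.

**Note:** Custom placements must include at least one attention-type layer (FA or SWA). Configurations that use only recurrent mixers (GDN/KDA) are not yet supported because of a vLLM KV-cache limitation. All official presets satisfy this requirement.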
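A custom placement can be checked against this rule before it is handed to the engine. The sketch below assumes a placement is a flat list with one mixer-name string per layer, using the names that `get_mixer_names` reports in this card; the real layout of `decoder.pattern` in the preset configs may differ, and `check_placement` is a hypothetical helper, not part of the plugin.

```python
from collections import Counter

ATTENTION_MIXERS = {"attention", "sliding_window"}  # FA and SWA
RECURRENT_MIXERS = {"gdn", "kda"}


def check_placement(pattern, num_layers=48):
    """Validate a per-layer mixer list and return how often each mixer is used."""
    if len(pattern) != num_layers:
        raise ValueError(f"expected {num_layers} layers, got {len(pattern)}")
    unknown = set(pattern) - ATTENTION_MIXERS - RECURRENT_MIXERS
    if unknown:
        raise ValueError(f"unknown mixer names: {sorted(unknown)}")
    if not ATTENTION_MIXERS & set(pattern):
        # The note above: recurrent-only placements are not yet supported.
        raise ValueError("placement needs at least one FA or SWA layer")
    return Counter(pattern)


# Made-up example pattern: one sliding-window layer, the rest KDA.
custom = ["sliding_window"] + ["kda"] * 47
print(check_placement(custom))

try:
    check_placement(["gdn"] * 48)
except ValueError as err:
    print("rejected:", err)
```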

#### Supernet Mode (Runtime Switching)

Load the full supernet and switch presets on the fly at runtime, without restarting the engine. On a single GPU, set enforce_eager=True (this skips CUDA graphs to save memory):

import json
from vllm import LLM, SamplingParams

model_dir = "ServiceNow-AI/SuperApriel-15b-Instruct"
# Make sure the default supernet config.json is in effect

llm = LLM(
    model=model_dir,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
    max_model_len=4096,
    enforce_eager=True,   # required on a single GPU; can be commented out with two GPUs
    # tensor_parallel_size=2,  # two-GPU tensor parallelism; allows CUDA graphs
)

# List the available mixers
mixer_names = llm.collective_rpc("get_mixer_names")
print(mixer_names[0])  # ['attention', 'gdn', 'kda', 'sliding_window']

# Load a preset and switch to it
preset = "Reg_Lklhd-26"
with open(f"{model_dir}/preset_configs/{preset}/config.json") as f:
    pattern = json.load(f)["decoder"]["pattern"]

llm.collective_rpc("set_layer_placements", args=(pattern,))  # about 3–20 ms

output = llm.generate(
    ["What is 2 + 3?"],
    SamplingParams(max_tokens=200, temperature=0.6),
)
print(output[0].outputs[0].text)

# Switch to another preset
with open(f"{model_dir}/preset_configs/all-attention/config.json") as f:
    pattern = json.load(f)["decoder"]["pattern"]
llm.collective_rpc("set_layer_placements", args=(pattern,))

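| Configuration | Model size | KV cache available | Notes |
|---|---|---|---|
| Single preset, 1 GPU | 27 GB | 39 GB | Best for fixed deployments |
| Supernet, 1 GPU (enforce_eager) | 46 GB | 20 GB | Runtime switching, smaller KV capacity |
| Supernet, 2 GPUs (TP=2) | 23 GB per GPU | 50 GB per GPU | Full compilation + CUDA graphs |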

#### Serving an OpenAI-Compatible API

Use vllm serve to deploy a single preset behind an OpenAI-style API. Select the preset before starting the server:

# Select the preset
MODEL_DIR="ServiceNow-AI/SuperApriel-15b-Instruct"
cp $MODEL_DIR/preset_configs/Reg_Lklhd-26/config.json $MODEL_DIR/config.json

# Start the server
vllm serve $MODEL_DIR \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --served-model-name SuperApriel-15b-Instruct

Query it with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "SuperApriel-15b-Instruct",
    "messages": [{"role": "user", "content": "What is 2 + 3?"}],
    "max_tokens": 200,
    "temperature": 0.6
  }'

Or use the OpenAI Python client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="SuperApriel-15b-Instruct",
    messages=[{"role": "user", "content": "What is 2 + 3?"}],
    max_tokens=200,
    temperature=0.6,
)
print(response.choices[0].message.content)

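Switching presets requires stopping the server, swapping the config, and restarting.

#### Per-Request Preset Selection

The supernet mode above switches presets globally. In production, requests can instead be routed to different presets, for example latency-sensitive traffic to a high-throughput placement and complex reasoning to full attention.

🔴 TODO: per-request preset selection through a placement_id field in the vLLM serving API is not yet supported; it is to be managed separately from the global collective_rpc switch.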
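Until per-request selection exists, one possible workaround is to group pending requests by the preset they should use, so that the global switch is invoked once per group rather than once per request. The sketch below shows only that grouping logic; the request classes, the routing table, and both helper names are hypothetical, and the preset names are taken from the list of available presets in this card.

```python
from collections import defaultdict

# Hypothetical routing table: request class -> preset name.
ROUTES = {
    "latency_sensitive": "Reg_Lklhd-10",   # fastest preset in the results table
    "default": "Reg_Lklhd-26",
    "complex_reasoning": "all-attention",  # full-attention baseline
}


def choose_preset(request_class: str) -> str:
    """Map a request class to a preset, falling back to the default route."""
    return ROUTES.get(request_class, ROUTES["default"])


def group_by_preset(requests):
    """Group (request_class, prompt) pairs so each preset is switched to once."""
    groups = defaultdict(list)
    for request_class, prompt in requests:
        groups[choose_preset(request_class)].append(prompt)
    return dict(groups)


pending = [
    ("latency_sensitive", "Autocomplete this sentence"),
    ("complex_reasoning", "Prove the claim step by step"),
    ("latency_sensitive", "Suggest a short title"),
    ("unlabelled", "Summarize this ticket"),
]
for preset, prompts in group_by_preset(pending).items():
    # In supernet mode, this is where the global switch would run once,
    # followed by a single batched generate call over `prompts`.
    print(preset, len(prompts))
```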

### Use with Transformers

The model is compatible with AutoModelForCausalLM. Each preset config specifies the mixer for every layer, and the matching weights are loaded automatically from the supernet.

#### Installation

pip install transformers torch einops accelerate

Presets that include GDN/KDA additionally require:

pip install causal-conv1d mamba-ssm

Pure-attention presets (such as all-attention) do not need these packages.

#### Usage

The repository's default supernet config targets vLLM. Before using Transformers, copy the chosen preset's config.json over it:

import re
import shutil
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "ServiceNow-AI/SuperApriel-15b-Instruct"  # or a local path (the copy below needs a local directory)

preset = "all-attention"  # or any other preset
shutil.copy(f"{model_dir}/preset_configs/{preset}/config.json",
            f"{model_dir}/config.json")

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Positive real numbers $x$ and $y$ satisfy $y^3=x^2$ and $(y-x)^2=4y^2$. What is $x+y$?\nPut your answer in \\boxed{}."
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=[]
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

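generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
# Decode only the newly generated tokens: the prompt's system message also
# contains the [BEGIN FINAL RESPONSE] marker and would confuse the regex.
new_tokens = generated_ids[0][model_inputs["input_ids"].shape[1]:]
output = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Extract the final answer; fall back to the raw output if the markers are missing.
match = re.search(
    r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]", output, re.DOTALL
)
response = match.group(1).strip() if match else output.strip()
print("response:", response)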

The available presets are the same as for vLLM. For recommended inference parameters, see Apriel-1.6-15b-Thinker (https://huggingface.co/ServiceNow-AI/Apriel-1.6-15b-Thinker).

## Chat Template

<|begin_system|>
You are a thoughtful, systematic AI assistant from ServiceNow Language Models (SLAM) lab. Analyze each question carefully, present your reasoning step-by-step, then provide the final response after the marker [BEGIN FINAL RESPONSE].
<|begin_user|>
# user message here

<|begin_assistant|>
Here are my reasoning steps:
# thoughts here
[BEGIN FINAL RESPONSE]
# assistant response here
[END FINAL RESPONSE]
<|end|>

The model first outputs its reasoning, then gives the final answer between [BEGIN FINAL RESPONSE] and [END FINAL RESPONSE].
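A client usually wants the reasoning and the final answer as separate fields. The sketch below splits a raw completion on the two markers from the template; `parse_completion` and `ParsedCompletion` are hypothetical helpers, and the sample string is hand-written rather than real model output.

```python
from typing import NamedTuple, Optional

BEGIN = "[BEGIN FINAL RESPONSE]"
END = "[END FINAL RESPONSE]"


class ParsedCompletion(NamedTuple):
    reasoning: str
    final: Optional[str]  # None when the begin marker never appeared


def parse_completion(text: str) -> ParsedCompletion:
    """Split a completion into its reasoning part and its final response."""
    reasoning, marker, rest = text.partition(BEGIN)
    if not marker:
        # Generation was cut off before the final response started.
        return ParsedCompletion(text.strip(), None)
    # Tolerate a missing end marker by taking everything after BEGIN.
    final = rest.split(END, 1)[0]
    return ParsedCompletion(reasoning.strip(), final.strip())


sample = (
    "Here are my reasoning steps:\n"
    "2 + 3 = 5\n"
    "[BEGIN FINAL RESPONSE]\n"
    "5\n"
    "[END FINAL RESPONSE]\n"
    "<|end|>"
)
parsed = parse_completion(sample)
print(parsed.final)      # 5
print(parsed.reasoning)
```

Returning `None` for a missing begin marker lets callers distinguish a truncated generation from an empty answer.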

### Usage Guidelines

Use the default chat template, which already includes the system prompt; additional instructions can be appended inside the user message.
During generation the model starts with Here are my reasoning steps:\n (imp
