unsloth/Kimi-K2.6-GGUF

Hugging Face Models Trending · 2026/04/20 15:27 · Model

Summary

Unsloth has released quantized GGUF versions of Kimi K2.6, an open-source 1T-parameter MoE model optimized for long-horizon coding, autonomous agent swarms, and production-grade design tasks.

Task: image-text-to-text
Tags: transformers, gguf, compressed-tensors, unsloth, image-text-to-text, arxiv:2602.02276, base_model:moonshotai/Kimi-K2.6, base_model:quantized:moonshotai/Kimi-K2.6, license:other, endpoints_compatible, region:us, conversational

Cached: 2026/04/21 19:47

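Source: https://huggingface.co/unsloth/Kimi-K2.6-GGUF

Read our guide to running Kimi K2.6!

See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

To run Kimi K2.6 at full precision, use Q8 (UD-Q8_K_XL), which is only 10 GB larger than Q4 (UD-Q4_K_XL).
Smaller GGUFs are still being uploaded.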

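UD-Q8_K_XL is completely lossless relative to Kimi-K2.6. Kimi uses int4 for the MoE weights and BF16 for everything else, and Q8_K_XL follows the same scheme.
UD-Q4_K_XL is the same as UD-Q8_K_XL, except that the remaining (non-MoE) tensors are Q8_0.
BF16 is provided only as a fully dequantized artifact; prefer UD-Q8_K_XL.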
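
As a rough sketch of fetching only one quantization from the repository: the snippet below assumes the `huggingface_hub` package is installed and that the shards of each quant carry the quant name (for example `UD-Q4_K_XL`) somewhere in their file paths. The glob pattern, the helper names, the local directory, and the sample file names are illustrative assumptions, not something the model card states.

```python
from fnmatch import fnmatch

REPO_ID = "unsloth/Kimi-K2.6-GGUF"  # repository named in the source link


def quant_pattern(quant: str) -> str:
    """Glob pattern assumed to match every shard of one quant (naming is an assumption)."""
    return f"*{quant}*"


def download_quant(quant: str, local_dir: str = "Kimi-K2.6-GGUF") -> str:
    """Download only the files matching one quant and return the local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[quant_pattern(quant)],
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Hypothetical file names, used only to show what the pattern would select.
    names = [
        "UD-Q4_K_XL/shard-00001.gguf",
        "UD-Q8_K_XL/shard-00001.gguf",
    ]
    print([n for n in names if fnmatch(n, quant_pattern("UD-Q4_K_XL"))])
```

Calling `download_quant("UD-Q8_K_XL")` would then pull only the files that match that pattern; check the repository's file listing first, since the actual layout may differ.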

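Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

Key Features

Long-horizon coding: significant improvements on complex end-to-end coding tasks, generalizing robustly across languages such as Rust, Go, and Python and across domains such as front-end, DevOps, and performance optimization.
Coding-driven design: turns simple prompts and visual inputs directly into production-grade interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and polished animations with precise aesthetic judgment.
Upgraded agent swarm: scales horizontally to 300 sub-agents and 4,000 coordinated steps, dynamically decomposing tasks into parallel, domain-specific subtasks and delivering complete outputs such as documents, websites, and spreadsheets in a single autonomous run.
Proactive open orchestration: in unattended settings, supports persistent 24/7 background agents that proactively manage schedules, execute code, and coordinate across platforms, with strong performance.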

2. Model Overview

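Architecture	Mixture-of-Experts (MoE)
Total parameters	1T
Activated parameters	32B
Number of layers (including dense layer)	61
Number of dense layers	1
Attention hidden dimension	7168
MoE hidden dimension (per expert)	2048
Number of attention heads	64
Number of experts	384
Selected experts per token	8
Number of shared experts	1
Vocabulary size	160K
Context length	256K
Attention mechanism	MLA
Activation function	SwiGLU
Vision encoder	MoonViT
Vision encoder parameters	400M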
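
To make the sparsity figures above concrete, here is a small back-of-the-envelope calculation. The parameter counts and expert counts are copied from the overview; the storage estimate assumes every parameter is stored at one uniform bit width and ignores scales, metadata, and mixed precision, so it is only an order-of-magnitude guide rather than the size of any published file.

```python
# Figures copied from the model overview.
TOTAL_PARAMS = 1e12       # 1T total parameters
ACTIVE_PARAMS = 32e9      # 32B activated per token
NUM_EXPERTS = 384
EXPERTS_PER_TOKEN = 8
SHARED_EXPERTS = 1


def active_fraction(total: float, active: float) -> float:
    """Share of parameters that take part in computing one token."""
    return active / total


def weight_bytes(params: float, bits_per_param: float) -> float:
    """Rough weight storage under a uniform bit width (a simplifying assumption)."""
    return params * bits_per_param / 8


if __name__ == "__main__":
    print(f"active share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    print(f"experts used per token: {EXPERTS_PER_TOKEN + SHARED_EXPERTS} "
          f"(of {NUM_EXPERTS} routed + {SHARED_EXPERTS} shared)")
    for bits in (4, 8, 16):
        gb = weight_bytes(TOTAL_PARAMS, bits) / 1e9
        print(f"{bits}-bit uniform storage: about {gb:.0f} GB")
```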

3. Evaluation Results

Benchmark	Kimi K2.6	GPT-5.4 (xhigh)	Claude Opus 4.6 (max effort)	Gemini 3.1 Pro (thinking high)	Kimi K2.5
Agentic
HLE-Full (w/ tools)	54.0	52.1	53.0	51.4	50.2
BrowseComp	83.2	82.7	83.7	85.9	74.9
BrowseComp (Agent Swarm)	86.3	78.4	—	—	—
DeepSearchQA (f1-score)	92.5	78.6	91.3	81.9	89.0
DeepSearchQA (accuracy)	83.0	63.7	80.6	60.2	77.1
WideSearch (item-f1)	80.8	—	—	—	72.7
Toolathlon	50.0	54.6	47.2	48.8	27.8
MCPMark	55.9	62.5*	56.7*	55.9*	29.5
Claw Eval (pass³)	62.3	60.3	70.4	57.8	52.3
Claw Eval (pass@3)	80.9	78.4	82.4	82.9	75.4
APEX-Agents	27.9	33.3	33.0	32.0	11.5
OSWorld-Verified	73.1	75.0	72.7	—	63.3
Coding
Terminal-Bench 2.0 (Terminus-2)	66.7	65.4*	65.4	68.5	50.8
SWE-Bench Pro	58.6	57.7	53.4	54.2	50.7
SWE-Bench Multilingual	76.7	—	77.8	76.9*	73.0
SWE-Bench Verified	80.2	—	80.8	80.6	76.8
SciCode	52.2	56.6	51.9	58.9	48.7
OJBench (python)	60.6	—	60.3	70.7	54.7
LiveCodeBench (v6)	89.6	—	88.8	91.7	85.0
Reasoning & Knowledge
HLE-Full	34.7	39.8	40.0	44.4	30.1
AIME 2026	96.4	99.2	96.7	98.3	95.8
HMMT 2026 (Feb)	92.7	97.7	96.2	94.7	87.1
IMO-AnswerBench	86.0	91.4	75.3	91.0*	81.8
GPQA-Diamond	90.5	92.8	91.3	94.3	87.6
Vision
MMMU-Pro	79.4	81.2	73.9	83.0*	78.5
MMMU-Pro (w/ python)	80.1	82.1	77.3	85.3*	77.7
CharXiv (RQ)	80.4	82.8*	69.1	80.2*	77.5
CharXiv (RQ w/ python)	86.7	90.0*	84.7	89.9*	78.7
MathVision	87.4	92.0*	71.2*	89.8*	84.2
MathVision (w/ python)	93.2	96.1*	84.6*	95.7*	85.0
BabyVision	39.8	49.7	14.8	51.6	36.5
BabyVision (w/ python)	68.5	80.2*	38.4*	68.3*	40.5
V* (w/ python)	96.9	98.4*	86.4*	96.9*	86.9

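Footnotes

General testing details
- Kimi K2.6 and Kimi K2.5 were run with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh, and Gemini 3.1 Pro with thinking high.
- Unless otherwise specified, Kimi K2.6 experiments used temperature = 1.0, top-p = 1.0, and a context length of 262,144 tokens.
- Benchmarks without publicly available scores were re-evaluated under the same conditions and are marked with an asterisk (*); all other results are cited from official reports.
Reasoning benchmarks
- IMO-AnswerBench scores for GPT-5.4 and Claude 4.6 are taken from z.ai/blog/glm-5.1.
- Humanity’s Last Exam (HLE) and other reasoning tasks used a maximum generation length of 98,304 tokens. Results are reported on HLE-Full by default; on the text-only subset, Kimi K2.6 scores 36.4% without tools and 55.5% with tools.
Tool-augmented / agentic tasks
- Kimi K2.6 was equipped with search, code-interpreter, and web-browsing tools.
- HLE-Full (w/ tools) used a maximum generation length of 262,144 tokens with a per-step limit of 49,152 tokens; once the context window is exceeded, only the most recent round of tool messages is retained.
- BrowseComp used the same discard-all context management strategy as Kimi K2.5 and DeepSeek-V3.2.
- DeepSearchQA used no context management, and tasks that exceeded the context length were counted as failures; competitor scores are cited from the Claude Opus 4.7 System Card.
- WideSearch used the "hide tool result" context management strategy.
- System prompts are the same as in the Kimi K2.5 technical report.
- Claw Eval used version 1.1 with max-tokens-per-step = 16384.
- APEX-Agents was evaluated on 452 public tasks (excluding two tasks with external dependencies).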
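
The HLE-Full (w/ tools) policy described above, keeping only the most recent round of tool messages once the window is exceeded, can be sketched roughly as follows. This is not the evaluation harness itself: the word-count token estimate, the function names, and the reading of a "round" as the trailing run of consecutive tool messages are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(message: Message) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(message["content"].split())


def keep_latest_tool_round(
    messages: List[Message],
    budget: int,
    count: Callable[[Message], int] = approx_tokens,
) -> List[Message]:
    """If the history exceeds `budget`, drop every tool message except the last round."""
    if sum(count(m) for m in messages) <= budget:
        return list(messages)

    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if not tool_idx:
        return list(messages)

    # Walk back from the last tool message to the start of its consecutive run.
    start = tool_idx[-1]
    while start > 0 and messages[start - 1]["role"] == "tool":
        start -= 1

    # Non-tool messages are always kept; tool messages only from the last round.
    return [m for i, m in enumerate(messages) if m["role"] != "tool" or i >= start]


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "find the answer"},
        {"role": "assistant", "content": "searching"},
        {"role": "tool", "content": "old result one two three"},
        {"role": "assistant", "content": "searching again"},
        {"role": "tool", "content": "new result"},
    ]
    for m in keep_latest_tool_round(history, budget=5):
        print(m["role"], "|", m["content"])
```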
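Coding tasks
- Terminal-Bench 2.0 used the default Terminus-2 agent framework with its JSON parser, with preserve thinking enabled.
- The SWE-Bench series used an in-house framework with a minimal tool set.
- All coding scores are averaged over 10 independent runs.
Vision benchmarks
- max-tokens = 98,304, reported as avg@3.
- With the Python tool, max-tokens-per-step = 65,536 and max-steps = 50.
- MMMU-Pro follows the official protocol, preserving the input order and prepending images.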
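
The results table uses several multi-trial metrics (pass@3, pass³, avg@3) without defining them. The sketch below uses the definitions these names commonly carry: pass@k counts a task as solved if any of k trials succeeds, pass^k only if all k succeed, and avg@k averages the per-trial success rate. Whether each benchmark applies exactly these definitions is an assumption; the function names and sample data are hypothetical.

```python
from typing import Dict, List

Trials = Dict[str, List[bool]]  # task id -> outcome of each trial


def pass_at_k(trials: Trials) -> float:
    """Share of tasks solved in at least one trial."""
    return sum(any(t) for t in trials.values()) / len(trials)


def pass_hat_k(trials: Trials) -> float:
    """Share of tasks solved in every trial."""
    return sum(all(t) for t in trials.values()) / len(trials)


def avg_at_k(trials: Trials) -> float:
    """Per-trial success rate, averaged over tasks."""
    return sum(sum(t) / len(t) for t in trials.values()) / len(trials)


if __name__ == "__main__":
    # Made-up outcomes for four tasks, three trials each.
    sample = {
        "task-a": [True, True, True],
        "task-b": [True, False, True],
        "task-c": [False, False, False],
        "task-d": [False, True, False],
    }
    print("pass@3:", pass_at_k(sample))
    print("pass^3:", pass_hat_k(sample))
    print("avg@3:", round(avg_at_k(sample), 3))
```

Under these definitions pass^k can never exceed avg@k, which in turn can never exceed pass@k, which matches the ordering of the two Claw Eval rows in the table.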

4. Native INT4 Quantization

Kimi-K2.6 adopts the same native INT4 quantization method as Kimi-K2-Thinking.
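
For readers unfamiliar with 4-bit weight storage, the following is a generic sketch of symmetric group-wise INT4 quantization. It is not confirmed to be the scheme Kimi-K2.6 uses: the rounding rule, the per-group scale, and the function names are assumptions chosen only to illustrate how a group of weights maps to 4-bit codes plus one scale.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # range of a signed 4-bit integer


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Map one group of weights to signed 4-bit codes and a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from codes and the group scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    group = [0.9, -0.2, 0.05, 0.4]
    codes, scale = quantize_group(group)
    restored = dequantize_group(codes, scale)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print("codes:", codes)
    print("worst absolute error:", worst, "(bounded by scale / 2 =", scale / 2, ")")
```

Smaller groups give each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales.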

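5. Deployment

The official API is available at platform.moonshot.ai and is compatible with the OpenAI and Anthropic API formats.
Deployments can be verified with the Kimi Vendor Verifier.
Recommended inference engines:
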
vLLM

SGLang

KTransformers

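Kimi-K2.6 has the same architecture as Kimi K2.5, so K2.5 deployment methods can be reused directly.
Required transformers version: >=4.57.1, <5.0.0
See the Model Deployment Guide for deployment examples.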
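
A quick way to check the transformers version constraint above before deploying is sketched below. The parser is deliberately minimal and handles only plain `X.Y.Z` release strings; the helper names are hypothetical.

```python
from typing import Tuple

MIN_VERSION = (4, 57, 1)  # inclusive lower bound from the requirement
MAX_VERSION = (5, 0, 0)   # exclusive upper bound from the requirement


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' release string; suffixes such as '.dev0' are not handled."""
    return tuple(int(part) for part in text.split("."))


def transformers_version_ok(version: str) -> bool:
    """True if `version` satisfies >=4.57.1, <5.0.0."""
    return MIN_VERSION <= parse_version(version) < MAX_VERSION


if __name__ == "__main__":
    for candidate in ("4.57.0", "4.57.1", "4.60.2", "5.0.0"):
        print(candidate, transformers_version_ok(candidate))
```

In practice the argument would be `transformers.__version__`; for versions with pre-release or development suffixes, a full version parser such as the one in the `packaging` library is a safer choice.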

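6. Model Usage

The examples below show how to call the official API.
If you use a third-party API deployed with vLLM or SGLang, note the following:

Chat with video content is an experimental feature and is currently supported only by the official API.

Recommended temperature: 1.0 for Thinking mode, 0.6 for Instant mode.

Recommended top_p: 0.95.

For Instant mode, pass {'chat_template_kwargs': {"thinking": False}} in extra_body.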
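
The recommendations above can be collected into one small helper that builds the keyword arguments for a chat-completions request. This is only a sketch: the function name and its flags are hypothetical, and the official-API form of the Instant-mode switch is taken from the code samples in this section.

```python
from typing import Any, Dict


def sampling_kwargs(thinking: bool = True, official_api: bool = True) -> Dict[str, Any]:
    """Build request kwargs following the recommended sampling settings."""
    kwargs: Dict[str, Any] = {
        "temperature": 1.0 if thinking else 0.6,  # Thinking: 1.0, Instant: 0.6
        "top_p": 0.95,
    }
    if not thinking:
        if official_api:
            # Form used with the official API in the samples of this section.
            kwargs["extra_body"] = {"thinking": {"type": "disabled"}}
        else:
            # Form used with vLLM/SGLang deployments.
            kwargs["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
    return kwargs


if __name__ == "__main__":
    print(sampling_kwargs())
    print(sampling_kwargs(thinking=False, official_api=False))
```

The result can be unpacked into a call such as `client.chat.completions.create(model=model_name, messages=messages, **sampling_kwargs(thinking=False))`.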

Chat Completion

import openai
import base64
import requests

def simple_chat(client: openai.OpenAI, model_name: str):
    messages = [
        {'role': 'system', 'content': 'You are Kimi, an AI assistant created by Moonshot AI.'},
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'which one is bigger, 9.11 or 9.9? think carefully.'}
            ],
        },
    ]
    response = client.chat.completions.create(
        model=model_name, messages=messages, stream=False, max_tokens=4096
    )
    print('====== Below is reasoning content in Thinking Mode ======')
    print(f'reasoning content: {response.choices[0].message.reasoning}')
    print('====== Below is response in Thinking Mode ======')
    print(f'response: {response.choices[0].message.content}')

    # To use instant mode, pass {"thinking": {"type": "disabled"}} in extra_body
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'disabled'}},  # this is for official API
        # extra_body= {'chat_template_kwargs': {"thinking": False}}  # this is for vLLM/SGLang
    )
    print('====== Below is response in Instant Mode ======')
    print(f'response: {response.choices[0].message.content}')

Chat with Visual Content

K2.6 supports image and video input.

Image example:

import openai
import base64
import requests

def chat_with_image(client: openai.OpenAI, model_name: str):
    url = 'https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/kimi-logo.png'
    image_base64 = base64.b64encode(requests.get(url).content).decode()
    messages = [
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe this image in detail.'},
                {
                    'type': 'image_url',
                    'image_url': {'url': f'data:image/png;base64,{image_base64}'},
                },
            ],
        }
    ]

    response = client.chat.completions.create(
        model=model_name, messages=messages, stream=False, max_tokens=8192
    )
    print('====== Below is reasoning content in Thinking Mode ======')
    print(f'reasoning content: {response.choices[0].message.reasoning}')
    print('====== Below is response in Thinking Mode ======')
    print(f'response: {response.choices[0].message.content}')

    # Instant mode is also supported: pass {"thinking": {"type": "disabled"}} in extra_body
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'disabled'}},  # this is for official API
        # extra_body= {'chat_template_kwargs': {"thinking": False}}  # this is for vLLM/SGLang
    )
    print('====== Below is response in Instant Mode ======')
    print(f'response: {response.choices[0].message.content}')
    return response.choices[0].message.content
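
When the image is a local file rather than a URL, the same message shape can be built with a small helper. This is a sketch under assumptions: the helper names are hypothetical, and the MIME type is guessed from the file extension with the standard library.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a local file and encode it, inferring the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot infer MIME type for {path!r}")
    return to_data_url(Path(path).read_bytes(), mime_type)


def image_part(url: str) -> dict:
    """Content part in the same shape as the image example above."""
    return {"type": "image_url", "image_url": {"url": url}}


if __name__ == "__main__":
    # Placeholder bytes stand in for real image data.
    print(image_part(to_data_url(b"hi", "image/png")))
```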

Video example:

import openai
import base64
import requests

def chat_with_video(client: openai.OpenAI, model_name: str):
    url = 'https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/demo_video.mp4'
    video_base64 = base64.b64encode(requests.get(url).content).decode()
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text","text": "Describe the video in detail."},
                {
                    "type": "video_url",
                    "video_url": {"url": f"data:video/mp4;base64,{video_base64}"},
                },
            ],
        }
    ]

    response = client.chat.completions.create(model=model_name, messages=messages)
    print('====== Below is reasoning content in Thinking Mode ======')
    print(f'reasoning content: {response.choices[0].message.reasoning}')
    print('====== Below is response in Thinking Mode ======')
    print(f'response: {response.choices[0].message.content}')

    # Instant mode is also supported: pass {"thinking": {"type": "disabled"}} in extra_body
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
        extra_body={'thinking': {'type': 'disabled'}},  # this is for official API
        # extra_body= {'chat_template_kwargs': {"thinking": False}}  # this is for vLLM/SGLang
    )
    print('====== Below is response in Instant Mode ======')
    print(f'response: {response.choices[0].message.content}')
    return response.choices[0].message.content

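Preserve Thinking

Kimi K2.6 supports a preserve_thinking mode that retains the full reasoning content across multi-turn conversations, which improves performance in coding-agent scenarios.
The feature is disabled by default. Example: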

def chat_with_preserve_thinking(client: openai.OpenAI, model_name: str):
    messages = [
        {
            "role": "user",
            "content": "Tell me three random numbers."
        },
        {
            "role": "assistant",
            "reasoning_content": "I'll start by listing five numbers: 473, 921, 235, 215, 222, and I'll tell you the first three.",
            "content": "473, 921, 235"
        },
        {
            "role": "user",
            "content": "What are the other two number
