REAP-pruned Nemotron-3-Super (512→256 experts) + GRPO fine-tuning + FP8/AWQ, 90%+ on AIME 2026, with benchmarks

Reddit r/LocalLLaMA 2026/04/22 14:25 Models

Summary

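Community release: NVIDIA Nemotron-3-Super-120B pruned to 64B with REAP, fine-tuned for math with GRPO reinforcement learning, then quantized to AWQ/FP8. It runs on a single H100 or RTX PRO 6000 and scores 90%+ on AIME 2026.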

Hi r/LocalLLaMA, I'm releasing the model I tinkered with during AIMO3 (the Kaggle competition). I took NVIDIA's Nemotron-3-Super-120B-A12B (a LatentMoE + Mamba2 hybrid architecture), pruned it from 512 to 256 experts with REAP (dropping the MTP layers along the way), fine-tuned it with LoRA-RL + GRPO on about 270 AIMO3 + AstralMath problems, and finally quantized it to AWQ and FP8. The result: 120B→64B, it runs on a single H100 or RTX PRO 6000 Blackwell, and it scores 90%+ on AIME 2026.

# Models

* BF16 (full weights, about 129 GB VRAM): [Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16](https://huggingface.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16)
* FP8 dynamic quantization (W8A8, about 72 GB): [Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-FP8](https://huggingface.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-FP8)
* AWQ (W4A16, about 43 GB): [Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-AWQ](https://huggingface.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-AWQ)

# AIME 2026 (30 problems, averaged over 4 attempts, system-role prompt)

| Variant | avg@4 | pass@4 | Tool calls |
|:-|:-|:-|:-|
| 120B base ([MathArena leaderboard](https://matharena.ai/?view=problem&comp=aime--aime_2026)) | 0.9000 | n/a | None |
| This post, AWQ | 0.9083 | 0.9333 | None |
| This post, FP8 | 0.9167 | 0.9667 | None |

Tools were disabled for this evaluation, but the model itself is very good at Python tool-integrated reasoning!

# AWQ vs FP8 trade-offs

FP8 throughput is about 40% lower than AWQ, but quality is higher: it solves one more problem at pass@4 and is numerically more stable on the hardest problem. FP8 also converges to an answer faster, which partially offsets the throughput disadvantage.

# vLLM patch

vLLM's fused `grouped_topk` CUDA kernel hits an illegal memory access when experts_per_group > 128 (the pruned model has 256 experts with n_group=1). The repo includes a small patch that skips the fused kernel in that case.

# Links

* Evaluation repo: [https://github.com/madmax0404/nemotron-3-super-reap-pruned-awq-and-fp8-aime-2026-benchmarks](https://github.com/madmax0404/nemotron-3-super-reap-pruned-awq-and-fp8-aime-2026-benchmarks)
* HF organization page: [https://huggingface.co/Max-and-Omnis](https://huggingface.co/Max-and-Omnis)

Hardware: 1× RTX PRO 6000 Blackwell, vLLM 0.19.1. Questions about any part of the REAP→GRPO→AWQ/FP8 pipeline are welcome!
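For readers who want to reproduce the two scores in the table, here is a minimal sketch of how avg@4 and pass@4 could be computed from per-attempt correctness. The post does not define the metrics, so this assumes avg@k is the mean accuracy over all attempts and pass@k is the fraction of problems solved by at least one of the k attempts. The function names and the toy data are hypothetical and are not taken from the evaluation repo.

```python
from typing import Sequence


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Mean accuracy over every attempt on every problem (assumed meaning of avg@k)."""
    total = sum(len(attempts) for attempts in results)
    correct = sum(sum(attempts) for attempts in results)
    return correct / total


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct attempt (assumed meaning of pass@k)."""
    return sum(any(attempts) for attempts in results) / len(results)


# Toy data, not real results: 3 problems x 4 attempts each.
toy = [
    [True, True, True, True],      # always solved
    [True, False, True, False],    # solved in half of the attempts
    [False, False, False, False],  # never solved
]

print(f"avg@4  = {avg_at_k(toy):.4f}")   # 6 correct out of 12 attempts
print(f"pass@4 = {pass_at_k(toy):.4f}")  # 2 of 3 problems solved at least once
```

Under these assumed definitions, the reported numbers are consistent with 109 and 110 correct attempts out of 120 for AWQ and FP8 (avg@4), and with 28 and 29 of 30 problems solved at least once (pass@4).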
