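I released a softmax-free attention model at GPT-2 medium scale (~354M parameters, 11.5B tokens): structural sparsity plus a tile-skipping kernel for long-context memory savings. Open weights + custom Triton kernels [R]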

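Reddit r/MachineLearning 2026/06/21 10:46 Model

Summary

RRT-355M is a GPT-2 medium-scale softmax-free attention model with 354M parameters, trained from scratch on 11.5B tokens. It uses structural sparsity and a tile-skipping kernel for long-context efficiency, and scores 0.1558 on the 22-task CORE benchmark, against 0.1770 for GPT-2 medium.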

Tripstoph/RRT-Foundation · Hugging Face

Source: https://huggingface.co/Tripstoph/RRT-Foundation

RRT-355M: Softmax-free attention at GPT-2 Medium scale

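Core result: a GPT-2 Medium-sized checkpoint (~354 M parameters) trained from scratch without softmax, evaluated on a standardized 22-task in-context learning benchmark, and shipped with open-source kernels whose sparse inference is bit-identical to dense inference on this checkpoint.

This Hugging Face repository ships only the weights, config, and substrate constants. Inference requires the RRT engine on GitHub (RRT-LLM-FOUNDATION (https://github.com/tripstoph/RRT-LLM-FOUNDATION), AGPL-3.0). Loading the weights as GPT-2 through the standard transformers library will produce incorrect output.

Training is complete. No further checkpoints are planned for this repository.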

Capability evaluation (22-task CORE)

| Model | CORE | Notes |
| --- | --- | --- |
| GPT-2 124M | 0.1211 | Floor reference, same harness |
| GPT-2 medium | 0.1770 | Dense softmax control, size-matched |
| RRT-355M | 0.1558 | Softmax-free, this checkpoint |
| Pythia 410M | 0.1895 | Modern baseline, same harness |

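CORE = mean centered accuracy over 22 in-context learning tasks (DCLM protocol (https://arxiv.org/abs/2406.11794), Karpathy nanochat eval_bundle). RRT-355M is 0.021 below the GPT-2 medium control and 0.035 above the GPT-2 124M floor: a measurable trade-off, not a capability collapse.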
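The aggregation behind a CORE score can be sketched in a few lines. This assumes the usual DCLM-style centering, where each task's raw accuracy is rescaled so that the random-guess baseline maps to 0 and a perfect score maps to 1. The function names and the two example task results are illustrative and do not come from the eval bundle; only the three CORE values are taken from the table above.

```python
def centered_accuracy(accuracy, random_baseline):
    """Rescale raw accuracy so chance level maps to 0 and perfect to 1."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)


def core_score(results):
    """Mean centered accuracy over (accuracy, random_baseline) pairs."""
    return sum(centered_accuracy(a, b) for a, b in results) / len(results)


# Hypothetical per-task results: (raw accuracy, random-guess baseline).
example = [(0.40, 0.25), (0.10, 0.0)]
print(round(core_score(example), 4))

# Gaps quoted in the text, recomputed from the reported CORE values.
core = {"GPT-2 124M": 0.1211, "GPT-2 medium": 0.1770, "RRT-355M": 0.1558}
gap_to_control = core["RRT-355M"] - core["GPT-2 medium"]
gap_to_floor = core["RRT-355M"] - core["GPT-2 124M"]
print(f"{gap_to_control:+.3f} {gap_to_floor:+.3f}")  # -0.021 +0.035
```

Because every task is centered before averaging, a four-way multiple-choice task and a free-form task contribute on the same 0-to-1 scale, which is why a difference as small as 0.02 is still worth reporting.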

**Task asymmetry (RRT − GPT-2 medium, centered score):** gains on multiple-choice reasoning (arc_easy +0.12, agi_eval_lsat_ar +0.09, openbook_qa +0.07); the largest drops are on continuation tasks (lambada_openai −0.16, coqa −0.13, squad −0.07).

**Not evaluated:** MMLU, GSM8K, HumanEval, chat/instruction benchmarks, or fine-tuned downstream tasks. See eval/eval_summary.json (https://huggingface.co/Tripstoph/RRT-Foundation/blob/main/eval/eval_summary.json) in this repository and the full write-up in docs/EVALUATION.md (https://github.com/tripstoph/RRT-LLM-FOUNDATION/blob/main/docs/EVALUATION.md) on GitHub.

Task asymmetry: centered-score deltas vs. GPT-2 medium, selected CORE tasks (https://huggingface.co/Tripstoph/RRT-Foundation/blob/main/figures/task_asymmetry.png)

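Mechanism and training

| Metric | Value | Notes |
| --- | --- | --- |
| Structural edge sparsity | 99.66 % | Fidelity-gated; measured during training |
| Training data | FineWeb-Edu | ~11.5 B tokens, 4× H100, 22k iterations |
| Best validation loss (checkpoint) | 2.8001 | Iteration 21 000 |
| Weight file | ~1011 MB bf16 | `model.safetensors` |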

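Three metrics, not to be confused: (1) structural sparsity during training, (2) coarse-grained block skipping at inference (34–55% at long context), (3) the CORE behavioural score above.

Each attention edge applies a friction ln(max(i−j, 1)) and a gate μ = η / (1 + η^n)^(1/n) with n = 1.25. An INT8 pre-check skips blocks that contain no active edges; on this checkpoint the result is bit-identical to the dense path. v2 kernel: 21/22 CORE tasks match v1 (Δ CORE −0.0016).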
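To get a feel for the two formulas, the sketch below evaluates them directly. The model card does not say how the friction term feeds into η, so the two functions are kept separate here; the function names, the restriction to η ≥ 0, and the sample positions are assumptions made for illustration.

```python
import math

N_GATE = 1.25  # exponent n from the text


def friction(i, j):
    """Distance friction ln(max(i - j, 1)) for query position i and key position j."""
    return math.log(max(i - j, 1))


def gate(eta, n=N_GATE):
    """Gate mu = eta / (1 + eta**n)**(1/n), evaluated here for eta >= 0 only."""
    return eta / (1.0 + eta ** n) ** (1.0 / n)


# Friction is zero for the adjacent key and grows with the log of the distance.
print([round(friction(8, j), 3) for j in (7, 6, 0)])

# The gate is close to eta for small inputs and stays below 1 for large ones.
print([round(gate(eta), 3) for eta in (0.01, 1.0, 100.0)])
```

For η ≥ 0 the gate is bounded in [0, 1): it behaves like η near zero and approaches 1 as η grows, so it can zero out an edge without any normalization across the row, which is the role softmax would otherwise play.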

Per-layer structural sparsity at the end of training (https://huggingface.co/Tripstoph/RRT-Foundation/blob/main/figures/sparsity_per_layer.png)

Systems notes (secondary)

| Metric | Value | Notes |
| --- | --- | --- |
| INT8 block skip @ T=2048 / 8192 | 34% / 55% | Layer 12 microbenchmark, H100 |
| Kernel vs SDPA @ T=2048 | 11.5× | Not end-to-end generation |
| Peak attention memory @ T=16384 | 5.5 GB | GPT-2 XL reference forward pass, RTX 3070 |
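The block-skip percentage is a ratio of tiles found empty by the pre-check to tiles that would otherwise be computed. A rough illustration of that accounting follows, with a plain boolean edge mask standing in for the INT8 pre-check. The tile size, the mask, and the function name are hypothetical and are not taken from the RRT engine.

```python
def skippable_tile_fraction(active, tile):
    """Fraction of causal (lower-triangular) tiles that contain no active edge.

    active[i][j] is True when the edge from query i to key j is active.
    """
    T = len(active)
    n_tiles = (T + tile - 1) // tile
    total = skipped = 0
    for bi in range(n_tiles):
        for bj in range(bi + 1):  # causal: only tiles on or below the diagonal
            rows = range(bi * tile, min((bi + 1) * tile, T))
            cols = range(bj * tile, min((bj + 1) * tile, T))
            has_edge = any(active[i][j] for i in rows for j in cols if j <= i)
            total += 1
            skipped += not has_edge
    return skipped / total


# Hypothetical mask: each query keeps only its 2 most recent keys.
T, TILE = 8, 2
mask = [[0 <= i - j < 2 for j in range(T)] for i in range(T)]
print(skippable_tile_fraction(mask, TILE))  # 0.3
```

A tile can be skipped only when every edge inside it is inactive, so the skip rate at tile granularity is always lower than the per-edge sparsity. That is consistent with 99.66 % edge sparsity translating into a 34–55% block-skip rate.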

Files in this repo

| File | Purpose |
| --- | --- |
| `model.safetensors` | bf16 weights |
| `config.json` | Architecture metadata |
| `rrt_substrate_constants.json` | Required for inference; contains only `n_backbone` and `C_max` |
| `eval/` | CORE summary JSON, comparison CSV, parity notes |
| `figures/` | Key figures from the benchmark report |
| `tokenizer_pointer.txt` | `openai-community/gpt2` BPE tokenizer |
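Since the constants file is required for inference and is documented as holding exactly two keys, a quick check before handing it to the engine can catch a wrong or stale download. The helper names below are hypothetical and are not part of the RRT engine; the value types are not documented in the card, so the sketch checks key names only.

```python
import json
from pathlib import Path

EXPECTED_KEYS = {"n_backbone", "C_max"}  # the two keys named in the file table


def validate_constants(data):
    """Raise ValueError unless the mapping holds exactly the documented keys."""
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    return data


def load_substrate_constants(path):
    """Read the JSON constants file from disk and validate its keys."""
    return validate_constants(json.loads(Path(path).read_text(encoding="utf-8")))
```

A typical call would be `load_substrate_constants("rrt_substrate_constants.json")` from the directory holding the downloaded repository files.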

Reproduce

git clone https://github.com/tripstoph/RRT-LLM-FOUNDATION.git
cd RRT-LLM-FOUNDATION
pip install -e .
python eval/run_core_eval.py --model rrt:_state/ckpt.pt --snapshot-dir engine --seed 1337
# Quick check (~a few minutes): python eval/smoke_core.py --model rrt:_state/ckpt.pt --snapshot-dir engine

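Expected full CORE: 0.1558. Claims and evidence: docs/CLAIMS.md (https://github.com/tripstoph/RRT-LLM-FOUNDATION/blob/main/docs/CLAIMS.md) on GitHub.

Scope

RRT-355M validates the attention mechanism in isolation. Broader pipeline work will be explored separately under the Relational Autopoietic Substrate (RAS); this repository commits to no timeline and no additional model releases.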

Limitations

- Custom Triton engine (Hopper sm_90); AutoModelForCausalLM is not supported
- CORE score is below that of the dense GPT-2 medium of the same size
- Single checkpoint; no scaling study in this repository
- Speed and memory figures are kernel benchmarks at the stated context lengths

Citation

@misc{rrt-355m-2026,
  author       = {Tripstoph},
  title        = {RRT-355M: Softmax-free attention at GPT-2 Medium scale},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Tripstoph/RRT-Foundation}},
  note         = {Mechanism-validation weights; engine on GitHub under AGPL-3.0.},
}

Last updated: 2026-06-21
