@tom_doerr: Build custom AI agents with reinforcement learning https://github.com/agentica-project/rllm…


Summary

rLLM is an open-source framework for post-training language agents with reinforcement learning. Its released models, such as DeepSWE-Preview and DeepCoder-14B-Preview, have achieved state-of-the-art results.


agentica-project/rllm Source: https://github.com/agentica-project/rllm

rLLM

rLLM has moved to a new organization: rllm-org/rllm (https://github.com/rllm-org/rllm). This repository is now an archived version of rLLM.

🚀 Reinforcement Learning for Language Agents

🌟 Documentation (https://rllm-project.readthedocs.io/en/latest)
Discord (https://discord.gg/BDH46HT9en)
Website (https://www.agentica-project.com)
Twitter/X (https://x.com/Agentica_)
Github (https://github.com/agentica-project/rllm)
Hugging Face Collection (https://huggingface.co/agentica-org)

rLLM is an open-source framework for post-training language agents with reinforcement learning. With rLLM, you can easily build custom agents and environments, train them with reinforcement learning, and deploy them for real-world workloads.
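
To make the agent/environment pattern described above concrete, here is a framework-agnostic sketch of the rollout loop that RL post-training builds on: an environment scores an agent's actions, and the collected trajectories supply the rewards a trainer would optimize against. All class and method names below are illustrative, not rLLM's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Each step records (observation, action, reward).
    steps: list = field(default_factory=list)

class EchoEnv:
    """Toy single-step environment: rewards the agent for echoing the task."""
    def __init__(self, task: str):
        self.task = task

    def step(self, action: str) -> tuple[str, float, bool]:
        reward = 1.0 if action == self.task else 0.0
        return "done", reward, True  # (next observation, reward, done)

def rollout(env: EchoEnv, policy) -> Trajectory:
    """Collect one episode; an RL trainer would update `policy` on these rewards."""
    traj = Trajectory()
    obs, done = env.task, False
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        traj.steps.append((obs, action, reward))
        obs = next_obs
    return traj

traj = rollout(EchoEnv("hello"), policy=lambda obs: obs)
print(traj.steps[-1][2])  # reward of the final step
```

In a real setup the policy is a language model, the environment wraps a task harness (e.g. a code sandbox or repository), and the trainer consumes batches of such trajectories.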

Releases 📰

[2025/07/01] We released DeepSWE-Preview, a 32B software-engineering (SWE) agent trained with pure RL that achieves 59% on SWE-Bench-Verified (42.2% Pass@1), ranking #1 among open-weight models on the SWE-Bench leaderboard.

  • 🍽️ In-depth blog post on SWE agents and the RL training recipe (https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art[…]-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33?pvs=73)
  • 🤗 HF Model DeepSWE-Preview (https://huggingface.co/agentica-org/DeepSWE-Preview)
  • 🤗 HF Dataset R2E-Gym-Subset (https://huggingface.co/datasets/R2E-Gym/R2E-Gym-Subset)
  • 📄 Training scripts (https://github.com/agentica-project/rllm/tree/main/examples/swe)
  • 📈 Wandb training logs (https://wandb.ai/mluo/deepswe): all training runs and ablations.
  • 🔎 Evaluation logs (https://drive.google.com/file/d/10LIwpJeaFuiX6Y-qEG2a4a335PEuQJeS/view?usp=sharing): 16 runs on SWE-Bench-Verified.
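
The release above reports Pass@1 averaged over multiple evaluation runs. A standard way to compute such metrics is the unbiased pass@k estimator from Chen et al. (2021); the function below is a generic sketch of that formula, not code from rLLM's evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c passed,
    is a passing attempt."""
    if n - c < k:
        return 1.0  # fewer failures than samples: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# For example, with 16 runs per task (as in the eval logs above),
# a task solved in 7 of 16 runs contributes this pass@1 estimate:
print(pass_at_k(16, 7, 1))  # → 0.4375
```

Per-task estimates are then averaged over the benchmark to get the reported Pass@1.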

[2025/04/08] We released DeepCoder-14B-Preview (https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), a 14B coding model that achieves **60.6%** Pass@1 accuracy on LiveCodeBench (+8% improvement), matching the performance of o3-mini-2025-01-31 (Low) and o1-2024-12-17.

  • ⬆️ In-depth blog post on the training recipe and insights (https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51)
  • 🤗 HF Models DeepCoder-14B-Preview (https://huggingface.co/agentica-org/DeepCoder-14B-Preview), DeepCoder-1.5B-Preview (https://huggingface.co/agentica-org/DeepCoder-1.5B-Preview)
  • 🤗 HF Dataset DeepCoder-Preview-Dataset (https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset)
  • 📄 Training scripts (https://github.com/agentica-project/rllm/tree/main/scripts/deepcoder/train): the exact hyperparameters used to reach o3-mini-level performance.
  • 📈 Wandb training logs (https://wandb.ai/mluo/deepcoder): all training runs and ablations.
  • 🔎 Evaluation logs (https://drive.google.com/file/d/1tr_xXvCJnjU0tLO7DNtFL85GIr3aGYln/view?usp=sharing): DeepCoder's logs on LiveCodeBench and Codeforces.

[2025/02/10] We released DeepScaleR-1.5B-Preview (https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2), a 1.5B model that achieves 43.1% Pass@1 on AIME, surpassing O1-Preview. We achieved this by iteratively scaling DeepSeek's GRPO algorithm, growing the thinking context length from 8K to 16K to 24K.

  • 🍗 In-depth blog post on the training recipe and insights (https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2)
  • 🤗 HF Model DeepScaleR-1.5B-Preview (https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)
  • 🤗 HF Dataset DeepScaleR-Preview-Dataset (https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) / 🗂️ JSON dataset (https://github.com/agentica-project/deepscaler/tree/main/deepscaler/data)
  • 📄 Training scripts (https://github.com/agentica-project/deepscaler/tree/main/scripts/train): the exact hyperparameters used to reach 43.1% on AIME.
  • 📈 Wandb training logs (https://wandb.ai/mluo/deepscaler-1.5b): all training runs and ablations.
    • Due to a Wandb migration bug, the 8K training run appears compressed into 400-500 steps. The data is identical; the original run was 1600 steps.
  • 🔎 Evaluation logs (https://drive.google.com/file/d/1V_rYKoL35WmubbmWN6PeFg4zo5QOug8X/view?pli=1): generations from DeepScaleR, DeepSeek Distill, and Still 1.5B on 1000+ math problems.
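
DeepScaleR's recipe scales DeepSeek's GRPO algorithm, whose key idea is replacing a learned critic with group-relative advantages: each sampled completion's reward is normalized against the other completions for the same prompt. The snippet below is a minimal sketch of that normalization step only, not rLLM's actual training code.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage estimation as in GRPO: normalize each
    completion's reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled solutions with binary correctness rewards:
# correct samples get positive advantage, incorrect ones negative.
print([round(a, 3) for a in grpo_advantages([1.0, 0.0, 0.0, 1.0])])
```

These advantages then weight the policy-gradient update in place of critic-based estimates, which is what makes the approach cheap enough to scale context length aggressively.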

Quick Start 🎯

Installation

# Clone the repository
git clone --recurse-submodules https://github.com/rllm-org/rllm.git
cd rllm

# Create a conda environment
conda create -n rllm python=3.10
conda activate rllm

# Install all dependencies
pip install -e ./verl
pip install -e .

Acknowledgements

  • Our training experiments are built on a heavily modified fork of verl (https://github.com/volcengine/verl), an open-source RLHF library.
  • Our models are trained on top of DeepSeek-R1-Distill-Qwen-1.5B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), DeepSeek-R1-Distill-Qwen-14B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), and Qwen3-32B (https://huggingface.co/Qwen/Qwen3-32b).
  • This work is a product of the Berkeley Sky Computing Lab (https://skycomputing.berkeley.edu/), Berkeley AI Research (https://bair.berkeley.edu/), and a successful collaboration with Together AI.

Citation

Citing rLLM:

@misc{rllm2025,
  title={rLLM: A Framework for Post-Training Language Agents},
  author={Sijun Tan and Michael Luo and Colin Cai and Tarun Venkat and Kyle Montgomery and Aaron Hao and Tianhao Wu and Arnav Balyan and Manan Roongta and Chenguang Wang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31}},
  note={Notion Blog}
}

Citing DeepSWE:

@misc{deepswe2025,
  title={DeepSWE: Training a State-of-the-Art Coding Agent from Scratch by Scaling RL},
  author={Michael Luo and Naman Jain and Jaskirat Singh and Sijun Tan and Ameen Patel and Qingyang Wu and Alpay Ariyak and Colin Cai and Tarun Venkat and Shang Zhu and Ben Athiwaratkun and Manan Roongta and Ce Zhang and Li Erran Li and Raluca Ada Popa and Koushik Sen and Ion Stoica},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33}},
  note={Notion Blog},
  year={2025}
}

Citing DeepCoder:

@misc{deepcoder2025,
  title={DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level},
  author={Michael Luo and Sijun Tan and Roy Huang and Ameen Patel and Alpay Ariyak and Qingyang Wu and Xiaoxiang Shi and Rachel Xin and Colin Cai and Maurice Weber and Ce Zhang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51}},
  note={Notion Blog},
  year={2025}
}

Citing DeepScaleR:

@misc{deepscaler2025,
  title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},
  author={Michael Luo and Sijun Tan and Justin Wong and Xiaoxiang Shi and William Y. Tang and Manan Roongta and Colin Cai and Jeffrey Luo and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2}},
  note={Notion Blog}
}
