Tag
A blog post summarizing ten recent agentic RL frameworks and best practices, covering modular interfaces, trajectory structure, action masks, process rewards, advantage normalization, scalable rollouts, stability/exploration, and task curriculum.
Discussion of recent agentic RL papers, highlighting action masking as a common technique and its evolution with world modeling papers like ECHO and PaW.
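For readers unfamiliar with the term, action masking in multi-turn agent training usually means computing the loss only over tokens the policy itself generated and zeroing out tokens that came from the user or from tool outputs. The sketch below illustrates that idea in plain Python. The role names, the `build_action_mask` and `masked_mean` helpers, and the toy numbers are assumptions made for illustration, not code from any of the papers mentioned here.

```python
from typing import List, Tuple


def build_action_mask(
    segments: List[Tuple[str, List[int]]]
) -> Tuple[List[int], List[int]]:
    """Flatten (role, token_ids) segments into token ids plus a 0/1 mask.

    Assumption: only segments with role "assistant" were sampled from the
    policy, so only those tokens receive mask value 1.
    """
    tokens: List[int] = []
    mask: List[int] = []
    for role, ids in segments:
        tokens.extend(ids)
        mask.extend([1 if role == "assistant" else 0] * len(ids))
    return tokens, mask


def masked_mean(values: List[float], mask: List[int]) -> float:
    """Average per-token values over positions where mask == 1."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / count


# Hypothetical trajectory: user prompt, model turn, tool output, model turn.
segments = [
    ("user", [1, 2]),
    ("assistant", [3, 4, 5]),
    ("tool", [6, 7]),
    ("assistant", [8]),
]
tokens, mask = build_action_mask(segments)

# Made-up per-token losses; the large values sit on masked positions.
per_token_loss = [10.0, 10.0, 1.0, 2.0, 3.0, 10.0, 10.0, 2.0]
loss = masked_mean(per_token_loss, mask)
```

With the mask applied, the tool and user tokens contribute nothing to the averaged loss, so the policy is not trained to predict text it did not produce.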
APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.
TRACE is a unified rollout budget allocation framework that enhances reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness. It improves efficiency and accuracy on agentic benchmarks like Multi-Hop QA.
OpenEnv, a library for creating agentic execution environments to train open source agents with reinforcement learning, is becoming more open with a new governance committee including Meta-PyTorch, Hugging Face, Nvidia, and others, aiming to provide a protocol layer that works across models and harnesses.
Sergio Paniego attributes frontier agents' performance to models being trained inside their deployment harness. The new work 'Polar: Agentic RL on Any Harness at Scale' by NVIDIA AI turns harnesses such as Codex, Claude Code, Qwen Code, or Pi into RL training environments without modifying their internals.
StepPO introduces a step-centric paradigm for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming token-centric methods in multi-turn interaction tasks.
This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.
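The re-encoding problem behind this invariant can be reproduced with a toy tokenizer. The sketch below assumes a made-up three-entry vocabulary and a greedy longest-match encoder. It is not the article's code or any real tokenizer, but it shows how decoding sampled tokens to text and encoding that text again can yield different token ids than the ones the model actually sampled.

```python
from typing import Dict, List

# Hypothetical vocabulary: "ab" exists both as one token and as "a" + "b".
VOCAB: Dict[int, str] = {0: "a", 1: "b", 2: "ab"}


def decode(ids: List[int]) -> str:
    """Concatenate the string pieces for each token id."""
    return "".join(VOCAB[i] for i in ids)


def encode(text: str) -> List[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    pieces = sorted(VOCAB.items(), key=lambda kv: -len(kv[1]))
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        for token_id, piece in pieces:
            if text.startswith(piece, pos):
                ids.append(token_id)
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot encode text at position {pos}")
    return ids


# Suppose the policy sampled "a" then "b" as two separate tokens.
sampled = [0, 1]

# Round-tripping through text merges them into the single token "ab".
reencoded = encode(decode(sampled))

# Token-in, token-out: build the next-turn context from the sampled ids
# directly, appending only the newly encoded tool output.
prompt_ids = [1]
tool_output_ids = encode("b")
tito_context = prompt_ids + sampled + tool_output_ids
text_roundtrip_context = encode(decode(prompt_ids + sampled)) + tool_output_ids
```

Here `tito_context` keeps the exact ids the policy produced, while `text_roundtrip_context` contains a token the policy never sampled, so log-probabilities computed on it would no longer correspond to the rollout.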
Sanbu 散步 released Hands-On Modern RL, a code-first tutorial that runs from CartPole+PPO basics through LLM post-training (RLHF, DPO, GRPO) to agentic RL. An English version is coming soon.
Skill0.5 is a novel agentic reinforcement learning framework that combines general skill internalization with task-specific skill utilization via a dynamic difficulty-aware router, improving out-of-distribution generalization in complex task environments as demonstrated on ALFWorld and WebShop.
NVIDIA releases Polar, an open-source infrastructure for black-box agentic reinforcement learning, enabling training of coding agents like Claude Code or Codex with any agent harness or framework.
This paper proposes AKBE, an on-policy method for LLM agent reinforcement learning that dynamically identifies when tool use is needed and when internal knowledge suffices, improving accuracy by 1.85 on average and reducing tool calls by 18% compared with standard agentic RL.
This paper proposes using Masked Diffusion Language Models (MDLMs) as text-based world models for agentic reinforcement learning, showing that their any-order denoising objective avoids prefix mode collapse and leads to stronger performance than autoregressive baselines.
EnvFactory automates the creation of executable tool environments and natural multi-turn trajectories for training LLMs with agentic reinforcement learning, achieving superior performance on benchmarks like BFCLv3 and MCP-Atlas with fewer environments than prior work.
DR-Venus-4B is a 4B-parameter deep-research agent trained on only 10K open samples via agentic SFT+RL with turn-level rewards, outperforming prior sub-9B agents and rivaling 30B models on research benchmarks while remaining deployable on edge devices.