@cwolferesearch: I've been reading a ton of agentic RL papers recently. Out of all the work, one of the only commonly-used tricks is act…
Summary
Discussion of recent agentic RL papers, highlighting action masking as a common technique and its evolution with world modeling papers like ECHO and PaW.
Cached at: 06/20/26, 10:25 PM
I’ve been reading a ton of agentic RL papers recently. Across all of this work, one of the few commonly used tricks is action masking, but this approach is evolving with RL + world modeling papers like ECHO / PaW.
[1/N]
The idea of action masking is to remove the contribution of non-LLM-generated tokens (e.g., environment feedback / tool outputs) to the policy gradient. This is basically the agentic RL version of masking prompt tokens when you run SFT. The benefits of action masking have been widely replicated across different papers. As a result, this trick is almost universally adopted in recent agent papers.
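To make the masking idea concrete, here is a minimal sketch in plain Python. It assumes a simple REINFORCE-style token loss (not the clipped objectives most papers actually use), and the names `masked_pg_loss` and `action_mask` are hypothetical, chosen only for illustration. A mask value of 1 marks an LLM-generated token and 0 marks an environment / tool-output token.

```python
from typing import Sequence


def masked_pg_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    action_mask: Sequence[int],
) -> float:
    """REINFORCE-style loss averaged over action tokens only.

    Illustrative assumption: a vanilla policy-gradient term
    (-advantage * logprob) per token, with no clipping or KL penalty.
    """
    if not (len(logprobs) == len(advantages) == len(action_mask)):
        raise ValueError("all inputs must have the same length")

    n_action = sum(action_mask)
    if n_action == 0:
        # No LLM-generated tokens in this sequence: nothing to learn from.
        return 0.0

    # Tokens with mask == 0 (tool outputs, environment feedback)
    # are multiplied by zero and contribute nothing to the loss.
    total = sum(-a * lp * m for lp, a, m in zip(logprobs, advantages, action_mask))
    return total / n_action


# Trajectory layout (hypothetical): action, tool output, action, tool output.
logprobs = [-1.0, -2.0, -0.5, -3.0]
advantages = [1.0, 1.0, 1.0, 1.0]
action_mask = [1, 0, 1, 0]
loss = masked_pg_loss(logprobs, advantages, action_mask)
print(loss)
```

Changing the log-probabilities at the masked positions leaves the loss untouched, which is the same effect as masking prompt tokens in SFT.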
[2/N]
Despite action masking being so common, recent papers have shown that completely removing non-action tokens from the objective is not optimal. We want the LLM to not only take actions, but also form a world model (i.e., be able to predict environment observations / feedback). To do this, we want to train on both action and environment tokens, as proposed in papers like ECHO / PaW.
[3/N]
Concretely, this can be implemented by:
- Using RL on action tokens.
- Using SFT on tool response tokens.
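In this case, the SFT objective is formulated as RL with a constant positive advantage. For a plain policy-gradient loss, a token with a fixed positive advantage contributes a scaled negative log-likelihood term, which is the SFT loss. This allows the SFT objective to be implemented in the normal RL policy update flow with no additional cost.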
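The two-part recipe above can be sketched as a single per-token advantage vector. This is a rough illustration under stated assumptions, not the ECHO or PaW implementation: it uses a vanilla policy-gradient loss, one trajectory-level advantage shared by all action tokens, and a hypothetical `obs_advantage` constant whose value and normalization the thread does not specify.

```python
from typing import List, Sequence


def build_token_advantages(
    action_mask: Sequence[int],
    rl_advantage: float,
    obs_advantage: float = 1.0,  # hypothetical constant, must be positive
) -> List[float]:
    """Action tokens get the RL advantage; tool-response tokens get a constant."""
    if obs_advantage <= 0:
        raise ValueError("obs_advantage must be positive to act like SFT")
    return [rl_advantage if m else obs_advantage for m in action_mask]


def pg_loss(logprobs: Sequence[float], advantages: Sequence[float]) -> float:
    """Vanilla policy-gradient loss, averaged over all tokens."""
    if len(logprobs) != len(advantages):
        raise ValueError("logprobs and advantages must have the same length")
    return -sum(a * lp for lp, a in zip(logprobs, advantages)) / len(logprobs)


# Trajectory layout (hypothetical): action, tool output, action, tool output.
logprobs = [-1.0, -2.0, -0.5, -3.0]
action_mask = [1, 0, 1, 0]

advantages = build_token_advantages(action_mask, rl_advantage=-0.5)
loss = pg_loss(logprobs, advantages)
print(advantages, loss)
```

With `obs_advantage=1.0`, the tool-response positions contribute exactly their negative log-likelihood, so the same loss function handles both the RL and the SFT-style terms. No second forward pass or separate loss is needed in this sketch, which matches the "no additional cost" point.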
[4/N]
This basic trick has a large performance impact; see an example from ECHO in the image below. I find this approach especially interesting because it goes against a commonly accepted norm (action masking). I love simple and effective tricks like this, and it makes you wonder what other performance improvements are possible if we question default settings!
[5/N]
Here are all of the links for further reading:
- Prime Intellect blog on ECHO / PaW: https://primeintellect.ai/blog/true-agents-model-the-world…
- ECHO: https://arxiv.org/abs/2605.24517
- PaW: https://arxiv.org/abs/2606.02388