Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
Summary
Introduces SR²AM, a framework for efficient agentic reasoning via self-regulated simulative planning, achieving competitive performance with models 20-30x larger while using 26-95% fewer reasoning tokens.
Cached at: 05/22/26, 06:21 PM
Source: https://huggingface.co/papers/2605.22138
Efficient reasoning is not about shorter chain-of-thought, but about better allocation of simulation (i.e., knowing when to imagine possible futures and when to act directly).
Current adaptive-reasoning approaches (effort knobs, token budgets in Opus 4.7 and GPT-5.5) control how much the model thinks. SR²AM asks a more structural question: what kind of thinking should the model do at each step?
We decompose agentic deliberation into three systems:
- System I (reactive execution): fast, pattern-based reasoning and action for familiar situations
- System II (simulative reasoning): predicting future states through a world model and evaluating consequences before committing. This is what separates planning from longer chain-of-thought
- System III (self-regulation): a learned configurator that autonomously decides when to simulate, how far ahead, and when to skip planning entirely
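The three-way split above can be sketched as a small control loop. This is a toy illustration only, not the SR²AM implementation: the number-line environment, the trap positions, and every function name (`world_model`, `reactive_action`, `simulative_action`, `rule_configurator`, `run`) are assumptions made for the example, and the configurator here is a hand-written rule, whereas the post describes a learned one.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

ACTIONS = (3, 1)  # hypothetical moves: leap three cells or step one cell


def world_model(pos: int, action: int) -> int:
    """Toy world model: predicts the position an action leads to."""
    return pos + action


def reactive_action(pos: int, goal: int) -> int:
    """System I stand-in: habitually take the largest move that does not overshoot."""
    return 3 if pos + 3 <= goal else 1


def survives(pos: int, goal: int, traps: FrozenSet[int],
             depth: int, stats: Dict[str, int]) -> bool:
    """True if some imagined rollout from `pos` avoids traps for `depth` more steps."""
    if pos == goal or depth == 0:
        return True
    for action in ACTIONS:
        nxt = world_model(pos, action)
        stats["model_calls"] += 1
        if nxt <= goal and nxt not in traps and survives(nxt, goal, traps, depth - 1, stats):
            return True
    return False


def simulative_action(pos: int, goal: int, traps: FrozenSet[int],
                      horizon: int, stats: Dict[str, int]) -> int:
    """System II stand-in: pick the first action whose imagined future stays safe."""
    for action in ACTIONS:
        nxt = world_model(pos, action)
        stats["model_calls"] += 1
        if nxt <= goal and nxt not in traps and survives(nxt, goal, traps, horizon - 1, stats):
            return action
    return reactive_action(pos, goal)  # no safe plan found: fall back to habit


def rule_configurator(pos: int, traps: FrozenSet[int]) -> int:
    """System III stand-in: return a lookahead horizon, or 0 to skip planning.
    Hand-written rule (assumption): plan only when a trap lies within six cells."""
    return 2 if any(pos < t <= pos + 6 for t in traps) else 0


def run(goal: int, traps: FrozenSet[int],
        configure: Callable[[int, FrozenSet[int]], int]) -> Tuple[bool, List[int], Dict[str, int]]:
    """Walk from 0 to `goal`, letting `configure` choose the mode at every step."""
    pos, path = 0, [0]
    stats = {"model_calls": 0, "simulated_steps": 0, "reactive_steps": 0}
    while pos != goal:
        horizon = configure(pos, traps)
        if horizon > 0:
            action = simulative_action(pos, goal, traps, horizon, stats)
            stats["simulated_steps"] += 1
        else:
            action = reactive_action(pos, goal)
            stats["reactive_steps"] += 1
        pos = world_model(pos, action)  # the toy environment matches the model exactly
        path.append(pos)
        if pos in traps:
            return False, path, stats
    return True, path, stats


TRAPS = frozenset({4, 6})


def never_plan(pos: int, traps: FrozenSet[int]) -> int:
    return 0  # pure System I


def always_plan(pos: int, traps: FrozenSet[int]) -> int:
    return 2  # System II at every step


for name, cfg in [("reactive", never_plan), ("always", always_plan),
                  ("regulated", rule_configurator)]:
    ok, path, stats = run(20, TRAPS, cfg)
    print(name, ok, path, stats)
```

In this toy run the reactive-only agent lands on a trap, while the regulated agent reaches the goal along the same path as the always-planning agent but with fewer world-model calls, because it stops simulating once no trap is nearby. The numbers are properties of the toy setup and say nothing about the results reported for SR²AM.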
Last year, in our companion paper SiRA, we showed that simulative reasoning yields up to 124% improvement over reactive baselines, and that strong reasoning models (o1, o3-mini) fail as planners without this structure.
SR²AM adds the self-regulation layer. The result is that RL enables the model to plan further ahead (+22.8% horizon) rather than more often (+2% frequency). In terms of performance, our 30B model is competitive with DeepSeek-V3.2 (685B) and Kimi-K2.5 (1T) while using 26–95% fewer reasoning tokens.
This is a prototype using language-based world models. Stay tuned for our next steps on multimodal and physical world models.
The concept of a configurator, which decides when and how deeply to engage a reasoning process, is not specific to planning, but extensible to learning and adaptation going forward.
📄 SR²AM: https://arxiv.org/abs/2605.22138
📄 SiRA: https://arxiv.org/abs/2507.23773
🌐 Project: https://sailing-lab.github.io/sr2am-self-regulated-planning
💻 Code: https://github.com/sailing-lab/sr2am
🤗 SR²AM-v0.1-8B: https://huggingface.co/sailing-lab/SR2AM-v0.1-8B
🤗 SR²AM-v1.0-30B: https://huggingface.co/sailing-lab/SR2AM-v1.0-30B
Similar Articles
Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
@mdeng34: Frontier LLMs are converging on efficient, adaptive reasoning. Opus 4.7 lets the model decide how deeply to reason. GPT…
New research introduces SR²AM, a configurator that self-regulates when to use simulative reasoning, improving efficiency and performance in LLMs.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
The paper introduces BRIGHT-Pro, a new benchmark for reasoning-intensive retrieval, and RTriever-Synth, a synthetic corpus used to fine-tune RTriever-4B for improved performance in agentic search systems.
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
Study reveals that answer tokens in thinking LLMs follow a structured self-reading pattern—forward drift plus focus on key anchors—during quantitative reasoning, and proposes a training-free SRQ steering method to exploit this for accuracy gains.
SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent
This paper proposes SAM, a state-adaptive memory framework that dynamically manages interaction histories for long-horizon agentic reasoning, enabling intent-driven recall without retraining the backbone model. It outperforms strong baselines across multiple benchmarks like BrowseComp and HLE.