Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
Summary
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
Cached at: 04/21/26, 07:21 AM
Source: https://huggingface.co/papers/2604.17696
Abstract
STRATAGEM addresses limitations in reasoning transfer for language models by using a reasoning transferability coefficient and evolution reward to promote abstract, domain-agnostic patterns over game-specific heuristics.
Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
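The abstract does not give formulas for either component, so the following is only an illustrative sketch of what "trajectory modulation" could look like, not the paper's actual method. It assumes, hypothetically, that each self-play trajectory carries a transferability score in [0, 1] that scales the terminal game outcome, and an evolution score that is added as a weighted bonus. The names `Trajectory`, `modulated_return`, and `evolution_weight` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    outcome: float          # terminal game reward, e.g. +1.0 for a win, -1.0 for a loss
    transferability: float  # hypothetical coefficient in [0, 1]; higher = more domain-agnostic reasoning
    evolution: float        # hypothetical bonus for reasoning that adapts over the course of the game


def modulated_return(traj: Trajectory, evolution_weight: float = 0.5) -> float:
    """Combine the terminal outcome with the two hypothetical signals.

    Assumed form (not taken from the paper):
        return = transferability * outcome + evolution_weight * evolution
    """
    if not 0.0 <= traj.transferability <= 1.0:
        raise ValueError("transferability is assumed to lie in [0, 1]")
    return traj.transferability * traj.outcome + evolution_weight * traj.evolution


# Two winning trajectories: under this assumed form, the one whose reasoning
# is scored as more transferable receives the larger training signal.
abstract_win = Trajectory(outcome=1.0, transferability=0.9, evolution=0.2)
heuristic_win = Trajectory(outcome=1.0, transferability=0.3, evolution=0.2)
```

Under this assumed form, outcome-only self-play corresponds to fixing the coefficient at 1 and the evolution weight at 0, in which case both winning trajectories above would be reinforced equally.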
Similar Articles
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
STRIDE-ED is a strategy-grounded reasoning framework for empathetic dialogue systems that uses structured multi-stage reasoning combined with a data refinement pipeline and two-stage training (supervised fine-tuning + multi-objective RL) to improve emotional understanding and response generation. The framework demonstrates consistent improvements across open-source LLMs on both automatic metrics and human evaluations.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA proposes strategic trajectory abstraction for long-horizon LLM agents, using hierarchical GRPO-style rollout with diverse strategy sampling and critical self-judgment to improve sample efficiency and final performance over frontier models and prior RL baselines.
TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs
TriEx introduces a tri-view game-based framework that aligns self-reasoning, opponent belief states, and oracle audits to make multi-agent LLM decisions auditable and reveal mismatches between stated rationales and actual behavior.
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
This paper introduces TESSY, a teacher-student cooperative framework for fine-tuning reasoning models that generates on-policy SFT data by decoupling generation into capability tokens (from teacher) and style tokens (from student), addressing catastrophic forgetting issues when using off-policy teacher data.
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
Study reveals that answer tokens in thinking LLMs follow a structured self-reading pattern—forward drift plus focus on key anchors—during quantitative reasoning, and proposes a training-free SRQ steering method to exploit this for accuracy gains.