Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

arXiv cs.AI Papers

Summary

This paper introduces SWARR, a two-stage recipe using supervised fine-tuning and reinforcement learning to adapt sliding-window attention models for mathematical reasoning, showing that RL can narrow the performance gap with self-attention while maintaining efficiency.

arXiv:2606.11634v1

# Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
Source: [https://arxiv.org/abs/2606.11634](https://arxiv.org/abs/2606.11634)
[View PDF](https://arxiv.org/pdf/2606.11634)

> Abstract: The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
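
The scaling contrast in the abstract can be made concrete with a small sketch. It is not taken from the paper: the function names, the window size of 4, and the convention that the window is causal and includes the current token are all assumptions chosen for illustration. The sketch builds the two attention masks and counts how many query-key pairs each one allows.

```python
# Illustrative sketch only: names, window size, and the window convention
# (causal, including the current token) are assumptions, not the paper's code.

def causal_mask(n: int) -> list[list[bool]]:
    """Full causal self-attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """Causal sliding window: token i attends only to the last `window`
    positions, i.e. those j with i - window < j <= i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    window = 4  # hypothetical window size
    for n in (8, 16, 32, 64):
        full = attended_pairs(causal_mask(n))
        swa = attended_pairs(sliding_window_mask(n, window))
        print(f"n={n:3d}  full causal pairs={full:5d}  sliding-window pairs={swa:4d}")
```

With a fixed window, the sliding-window count grows roughly as `n * window`, while the full causal count grows as `n * (n + 1) / 2`. The mask also shows why long-range dependencies in SFT data can be a problem: any token further back than `window` positions is simply not visible to the current query within a single layer.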

## Submission history

From: Kai Liu

**[v1]** Wed, 10 Jun 2026 03:56:03 UTC (1,103 KB)

Similar Articles

Improving mathematical reasoning with process supervision

OpenAI Blog

OpenAI demonstrates that process supervision—rewarding intermediate reasoning steps rather than just final answers—improves mathematical reasoning while reducing alignment costs. This approach produces more interpretable, human-aligned reasoning without sacrificing model performance.

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

arXiv cs.CL

AtManRL is a method that uses differentiable attention manipulation and reinforcement learning to train LLMs to generate more faithful chain-of-thought reasoning by ensuring reasoning tokens causally influence final predictions. Experiments on GSM8K and MMLU with Llama-3.2-3B demonstrate the approach can identify influential reasoning tokens and improve reasoning transparency.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

arXiv cs.AI

This paper investigates jailbreak attacks on Large Reasoning Models (LRMs), revealing that attack success correlates with attention patterns. The authors propose a reinforcement learning-based jailbreak method that incorporates attention signals into the reward function and uses diverse persuasion strategies, achieving significantly higher attack success rates across multiple benchmarks.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

arXiv cs.CL

This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

arXiv cs.LG

RASFT is a novel supervised fine-tuning framework for large language models that adapts expert supervision based on the model's own reasoning capabilities, achieving better performance on mathematical and code reasoning benchmarks compared to standard SFT and reinforcement learning methods.