A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
Summary
This paper proposes a local perturbation theory to explain cross-domain interference in multi-domain RL for LLMs, showing that interference is driven by a second-order damage term in a low-dimensional conflict subspace, and demonstrates that brief domain refresh or training-free rollback can selectively recover lost capabilities.
Cached at: 06/03/26, 07:36 AM
Source: https://huggingface.co/papers/2606.02398
Abstract
Multi-domain reinforcement learning in language models causes performance degradation through shared computational pathways, but targeted refresh and rollback techniques can selectively recover lost capabilities with minimal side effects.
Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering (QA), and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
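To make the rollback idea in the abstract more concrete, here is a minimal toy sketch in Python. It is not the paper's procedure: the quadratic stand-in for the earlier-domain (Math) loss, the diagonal curvature values, the scoring rule `h_i * delta_i^2`, and the helper names `select_conflict_coords` and `rollback` are all assumptions made for illustration. The Math checkpoint is placed at the optimum of the toy loss, so the first-order term vanishes and all damage from the later (QA) update is second-order.

```python
def quadratic_loss(theta, optimum, curvature):
    """Toy earlier-domain loss: 0.5 * sum_i h_i * (theta_i - optimum_i)^2."""
    return 0.5 * sum(
        h * (t - o) ** 2 for t, o, h in zip(theta, optimum, curvature)
    )


def select_conflict_coords(theta_before, theta_after, curvature, k):
    """Pick the k coordinates with the largest second-order damage proxy.

    Assumption: damage is scored per coordinate as h_i * delta_i^2, i.e. a
    diagonal-curvature approximation. The paper's proxy may differ.
    """
    scores = [
        h * (a - b) ** 2
        for a, b, h in zip(theta_after, theta_before, curvature)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def rollback(theta_before, theta_after, coords):
    """Training-free rollback: restore only the selected coordinates."""
    restored = list(theta_after)
    for i in coords:
        restored[i] = theta_before[i]
    return restored


# Hypothetical toy setup: 6 parameters, two of them on high-curvature
# "shared routes" (indices 0 and 3) for the Math loss.
math_optimum = [0.0] * 6
curvature = [5.0, 0.1, 0.1, 4.0, 0.1, 0.1]

theta_math = [0.0] * 6                             # checkpoint after Math training
theta_after_qa = [0.3, 0.5, -0.4, -0.2, 0.6, 0.1]  # after later QA training

coords = select_conflict_coords(theta_math, theta_after_qa, curvature, k=2)
theta_rolled = rollback(theta_math, theta_after_qa, coords)

loss_after = quadratic_loss(theta_after_qa, math_optimum, curvature)
loss_rolled = quadratic_loss(theta_rolled, math_optimum, curvature)

print("conflict coordinates:", coords)
print("Math loss after QA training:", loss_after)
print("Math loss after rollback:", loss_rolled)
```

In this toy setting, restoring two of six coordinates removes most of the Math loss increase while leaving the other four QA-updated coordinates untouched, which mirrors the "localized damage" reading of the rollback result. Real models have neither a diagonal curvature nor a known optimum, so the selection step here is only a stand-in for whatever proxy the paper uses.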
Similar Articles
Rethinking the Divergence Regularization in LLM RL
This paper introduces DRPO, which replaces the hard mask in DPPO with a smooth advantage-weighted quadratic regularizer to improve stability and efficiency in LLM reinforcement learning by providing continuous gradient corrections beyond trust-region boundaries.
Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
This paper addresses the 'Lost in Conversation' problem where LLMs struggle with information revealed across multiple turns. It proposes a scalable sharding pipeline to create multi-turn training data from single-turn QA datasets and uses reinforcement learning with verifiable rewards to train a memory-augmented policy that maintains a compact rolling memory, improving multi-turn reasoning accuracy and generalizing zero-shot to harder tasks.
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
LaRA is a layer-wise representation analysis framework that detects data contamination in RL post-trained LLMs by measuring geometric deviations across model layers, outperforming output-level baselines.
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs, which allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency, achieving theoretical regret bounds and outperforming GRPO on mathematical reasoning tasks.
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.