Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
Summary
This paper identifies surrogate hacking and temporal uncertainty as failure modes in multi-timescale RL, and proposes a Target Decoupling architecture that removes routing from the actor, using the critic for auxiliary representation learning. The method eliminates policy collapse on the LunarLander-v2 benchmark and stably surpasses the 'Environment Solved' threshold without hyperparameter hacking.
Source: https://huggingface.co/papers/2604.13517
Motivation: Fusing multi-timescale signals in RL can trigger optimization pathologies. We identify two specific failure modes in Actor-Critic architectures: Surrogate Objective Hacking (policy gradients exploiting dynamic routing weights) and the Paradox of Temporal Uncertainty (myopic degeneration under gradient-free routing).
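As a rough illustration of the first failure mode, the toy sketch below assumes the routed advantage is a softmax-weighted sum of per-horizon advantages and that the routing logits receive gradient from the surrogate. The weighting scheme, the numbers, and every name here (`routed_advantage`, `routing_gradient`, `advantages`) are assumptions made for illustration, not the paper's formulation. The point is only that gradient ascent can raise the surrogate by moving the routing weights while the action distribution stays fixed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routed_advantage(logits, advantages):
    """Surrogate term for one (state, action): sum_k w_k * A_k, with w = softmax(logits)."""
    weights = softmax(logits)
    return sum(w * a for w, a in zip(weights, advantages))

def routing_gradient(logits, advantages):
    """Gradient of the routed advantage w.r.t. the logits: w_k * (A_k - sum_j w_j * A_j)."""
    weights = softmax(logits)
    mixed = sum(w * a for w, a in zip(weights, advantages))
    return [w * (a - mixed) for w, a in zip(weights, advantages)]

# Hypothetical advantage estimates for one fixed (state, action) pair:
# the short horizon looks good, the long horizon looks bad.
advantages = [1.0, 0.2, -0.5]
logits = [0.0, 0.0, 0.0]  # start from uniform routing

before = routed_advantage(logits, advantages)
for _ in range(200):
    grad = routing_gradient(logits, advantages)
    logits = [z + 0.5 * g for z, g in zip(logits, grad)]  # plain gradient ascent
after = routed_advantage(logits, advantages)
final_weights = softmax(logits)

print(f"surrogate before: {before:.3f}, after: {after:.3f}")
print("routing weights:", [round(w, 3) for w in final_weights])
```

In this toy setting the surrogate rises only because the weight mass shifts onto the most flattering (short) horizon; the behaviour being evaluated never changes.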
Method: We propose a Target Decoupling architecture (“Representation over Routing”). We remove routing aggregation from the Actor. Instead, the Critic fits multiple temporal horizons as an auxiliary representation learning task, while the Actor updates solely on the long-term advantage.
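The decoupling described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the released implementation: Monte Carlo returns stand in for whatever targets the paper uses (it may rely on GAE or bootstrapping), the critic is represented only by its per-horizon predictions, and all names (`discounted_returns`, `critic_loss`, `actor_advantages`, `ppo_clip_objective`) and discount values are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def critic_loss(value_heads, rewards, gammas):
    """Sum over horizons of the mean squared error; value_heads[k][t] is head k at step t.

    Every head is trained, so the shared representation sees all timescales.
    """
    loss = 0.0
    for head, gamma in zip(value_heads, gammas):
        targets = discounted_returns(rewards, gamma)
        loss += sum((v - g) ** 2 for v, g in zip(head, targets)) / len(rewards)
    return loss

def actor_advantages(value_heads, rewards, gammas):
    """Advantage from the longest-horizon head only; no routing or mixing across heads."""
    k = max(range(len(gammas)), key=lambda i: gammas[i])
    targets = discounted_returns(rewards, gammas[k])
    return [g - v for g, v in zip(targets, value_heads[k])]

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Standard PPO clipped surrogate, averaged over samples."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)

# Toy delayed-reward episode: the only reward arrives at the last step.
rewards = [0.0, 0.0, 1.0]
gammas = [0.5, 0.99]                  # assumed short and long horizons
value_heads = [[0.0] * 3, [0.0] * 3]  # untrained critic predictions

print("critic loss:", critic_loss(value_heads, rewards, gammas))
adv = actor_advantages(value_heads, rewards, gammas)
print("actor objective:", ppo_clip_objective([1.0, 1.0, 1.0], adv))
```

Because the actor never sees the short-horizon heads, there are no routing weights in its objective for the policy gradient to exploit; the extra heads influence the actor only through whatever representation they share with the critic.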
Results: On the LunarLander-v2 delayed-reward benchmark, our decoupled agent avoids the “hovering for survival” local optimum. It eliminates policy collapse and stably surpasses the “Environment Solved” threshold without hyperparameter hacking.
Code and reproducible scripts are open-sourced in the repo.
Similar Articles
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
This paper introduces a critique-and-routing controller for multi-agent LLM systems that formulates coordination as a sequential decision problem. It uses policy gradients to optimize the controller for iterative refinement, outperforming baselines while reducing reliance on top-tier models.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.
TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
Proposes TEMPO, a policy optimization method that trains LLMs to reason exclusively from pre-cutoff information by using a two-mode reward and GRPO-based training, reducing knowledge leakage by 2–13% while improving task performance by 6–13%.
more ai slop to slop around~
This post extends E8 lattice geometric activation injection to supervised LLM safety routing, using STE-snapped E8 policy heads. While achieving near-perfect routing on clean data, the approach catastrophically fails under adversarial stress, requiring a hybrid symbolic-geometric architecture with audited deterministic rules.