Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Hugging Face Daily Papers 05/21/26, 12:00 AM Papers

Summary

This paper identifies surrogate hacking and temporal uncertainty as failure modes in multi-timescale RL, and proposes a Target Decoupling architecture that removes routing from the actor, using the critic for auxiliary representation learning. The method eliminates policy collapse on the LunarLander-v2 benchmark and stably surpasses the 'Environment Solved' threshold without hyperparameter hacking.

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines. The source code to reproduce our experiments is publicly available at https://github.com/ben-dlwlrma/Representation-Over-Routing.

Original Article

Cached at: 05/26/26, 02:44 PM

Paper page - Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Source: https://huggingface.co/papers/2604.13517

Motivation: Fusing multi-timescale signals in RL can trigger optimization pathologies. We identify two specific failure modes in Actor-Critic architectures: Surrogate Objective Hacking (policy gradients exploiting dynamic routing weights) and the Paradox of Temporal Uncertainty (myopic degeneration under gradient-free routing).
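One way to read the first failure mode is sketched below as a toy example. It is not taken from the paper's code. The per-horizon advantages, the softmax router, and the names `routed_advantage` and `routing_gradient` are all hypothetical. The probability ratio is held at 1, so the policy never changes, and the surrogate value equals the routed advantage; gradient ascent on the routing logits alone still raises it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routed_advantage(logits, advantages):
    """Advantage after mixing per-horizon advantages with softmax routing weights."""
    weights = softmax(logits)
    return sum(w * a for w, a in zip(weights, advantages))

def routing_gradient(logits, advantages):
    """Gradient of the routed advantage w.r.t. each routing logit.

    For softmax weights w, d/d logit_k sum_j w_j A_j = w_k * (A_k - sum_j w_j A_j).
    """
    weights = softmax(logits)
    mixed = sum(w * a for w, a in zip(weights, advantages))
    return [w * (a - mixed) for w, a in zip(weights, advantages)]

# Hypothetical advantages of one fixed (state, action) pair under three horizons.
advantages = [-1.0, 0.5, 2.0]
logits = [0.0, 0.0, 0.0]  # uniform routing to start

before = routed_advantage(logits, advantages)
for _ in range(200):
    grad = routing_gradient(logits, advantages)
    # Ascent on the routing logits only; the policy itself is never updated.
    logits = [l + 0.5 * g for l, g in zip(logits, grad)]
after = routed_advantage(logits, advantages)

print(f"surrogate before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the objective improves purely by shifting weight toward the most flattering horizon, which is the kind of shortcut a router exposed to policy gradients could take.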

Method: We propose a Target Decoupling architecture ("Representation over Routing"). We remove routing aggregation from the Actor. Instead, the Critic fits multiple temporal horizons as an auxiliary representation learning task, while the Actor updates solely on the long-term advantage.
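The decoupling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: the discount factors in `GAMMAS`, the function names, and the use of plain Monte Carlo returns (rather than whatever advantage estimator the paper uses) are all assumptions made for the example.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo discounted return at every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Assumed horizons for illustration; the last entry plays the long-term role.
GAMMAS = (0.5, 0.9, 0.99)

def critic_loss(value_heads, rewards, gammas=GAMMAS):
    """Mean squared error over all heads: every horizon is an auxiliary target.

    value_heads[k][t] is the prediction of head k at step t.
    """
    total, count = 0.0, 0
    for preds, gamma in zip(value_heads, gammas):
        targets = discounted_returns(rewards, gamma)
        for v, g in zip(preds, targets):
            total += (v - g) ** 2
            count += 1
    return total / count

def actor_advantages(value_heads, rewards, gammas=GAMMAS):
    """Advantages for the policy update, built from the long-horizon head only."""
    long_targets = discounted_returns(rewards, gammas[-1])
    return [g - v for g, v in zip(long_targets, value_heads[-1])]

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Standard PPO clipped surrogate, averaged over steps."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)

# Delayed reward: nothing until the final step.
rewards = [0.0, 0.0, 10.0]
heads = [[0.0, 0.0, 0.0] for _ in GAMMAS]
adv = actor_advantages(heads, rewards)

# Changing the short-horizon heads alters the critic loss but not the actor signal.
perturbed = [[5.0, 5.0, 5.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
adv_perturbed = actor_advantages(perturbed, rewards)
```

The point of the design is visible in the last lines: the short-horizon heads shape the critic's training signal, but nothing they output reaches the policy update.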

Results: On the LunarLander-v2 delayed-reward benchmark, our decoupled agent avoids the “hovering for survival” local optimum. It eliminates policy collapse and stably surpasses the “Environment Solved” threshold without hyperparameter hacking.

Code and reproducible scripts are open-sourced at https://github.com/ben-dlwlrma/Representation-Over-Routing.

Similar Articles

Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

