Tag
This paper proposes a performance-driven state abstraction method for reinforcement learning that directly optimizes decision quality, using a multi-timescale framework to jointly adapt the policy and a tree-structured abstraction. The algorithm refines or aggregates regions of the state space based on Q-value discrepancies, achieving better sample efficiency and faster replanning than the baselines.
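To make the refine-or-aggregate idea concrete, here is a minimal sketch and not the paper's algorithm. It assumes a one-dimensional state space, a binary tree over intervals, and two hypothetical thresholds (`split_tol`, `merge_tol`). The names `Node`, `q_spread`, and `adapt` are illustrative. A leaf is split when sampled Q-values inside it disagree by more than `split_tol`, and two sibling leaves are merged when the Q-values under their parent differ by less than `merge_tol`.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# One sample = (state, Q-values for every action at that state).
Sample = Tuple[float, Sequence[float]]


@dataclass
class Node:
    """Abstraction-tree node covering the half-open interval [lo, hi)."""
    lo: float
    hi: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None

    def leaves(self) -> List["Node"]:
        if self.is_leaf():
            return [self]
        return self.left.leaves() + self.right.leaves()


def q_spread(node: Node, samples: Sequence[Sample]) -> float:
    """Largest per-action gap between Q-values of samples inside the node."""
    inside = [q for s, q in samples if node.lo <= s < node.hi]
    if len(inside) < 2:
        return 0.0  # assumption: too little data counts as no discrepancy
    n_actions = len(inside[0])
    return max(
        max(q[a] for q in inside) - min(q[a] for q in inside)
        for a in range(n_actions)
    )


def adapt(node: Node, samples: Sequence[Sample],
          split_tol: float, merge_tol: float) -> None:
    """One adaptation pass: refine discrepant leaves, merge similar siblings."""
    if node.is_leaf():
        if q_spread(node, samples) > split_tol:
            mid = 0.5 * (node.lo + node.hi)
            node.left, node.right = Node(node.lo, mid), Node(mid, node.hi)
        return
    adapt(node.left, samples, split_tol, merge_tol)
    adapt(node.right, samples, split_tol, merge_tol)
    both_leaves = node.left.is_leaf() and node.right.is_leaf()
    if both_leaves and q_spread(node, samples) < merge_tol:
        node.left = node.right = None  # aggregate the two children
```

In a multi-timescale setup, a pass like `adapt` would presumably run less often than the policy update, so the abstraction changes slowly relative to the Q-estimates it depends on. The summary does not specify that schedule, so it is left out of the sketch.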
This paper identifies surrogate hacking and temporal uncertainty as failure modes in multi-timescale RL, and proposes a Target Decoupling architecture that removes routing from the actor and uses the critic for auxiliary representation learning. The method eliminates policy collapse on the LunarLander-v2 benchmark and stably surpasses the 'Environment Solved' threshold without hyperparameter hacking.
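For readers unfamiliar with the 'Environment Solved' threshold, the sketch below shows one way such a check could be computed from a list of episode returns. It assumes the commonly used LunarLander convention of an average return of at least 200 over 100 consecutive episodes. The paper's exact criterion is not stated in this summary, and the function name `first_solved_episode` is illustrative.

```python
from collections import deque
from statistics import mean
from typing import Iterable, Optional

SOLVED_RETURN = 200.0  # assumed threshold (common LunarLander convention)
WINDOW = 100           # assumed number of consecutive episodes averaged


def first_solved_episode(returns: Iterable[float],
                         threshold: float = SOLVED_RETURN,
                         window: int = WINDOW) -> Optional[int]:
    """Return the 0-based index of the first episode at which the mean of
    the last `window` returns reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)  # sliding window of the latest returns
    for i, episode_return in enumerate(returns):
        recent.append(episode_return)
        if len(recent) == window and mean(recent) >= threshold:
            return i
    return None
```

A run that crosses the threshold once and later falls back below it would still return an index here, so a claim of stably surpassing the threshold would need an additional check over the remainder of training.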