Tag
Introduces LatentGym, a controllable testbed for studying cross-task experiential learning in LLM agents, enabling measurement of exploration vs exploitation and revealing how frontier models fail to adapt across related tasks.
This paper proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation in LLM reinforcement learning by addressing policy entropy collapse through difficulty-aware coefficient allocation and initial-anchored target entropy. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements in both accuracy and exploration capability.
DiPO introduces a novel reinforcement learning approach for LLMs that uses perplexity-based sample partitioning to disentangle exploration and exploitation subspaces, combined with a bidirectional reward allocation mechanism for more stable policy optimization. The method demonstrates superior performance on mathematical reasoning and function calling tasks.