Tag
This viewpoint paper proposes interpreting national AI development as a learning system using Human-Centered Learning Mechanics, arguing that AI sovereignty depends on a country's ability to regulate its information dynamics. It provides a mathematical model and policy implications for France, reframing AI policy as governance of a non-equilibrium learning system.
This paper investigates how alignment training objectives reshape linguistic features in large language models. It finds that instruction-tuned systems collapse language entropy significantly more than scale alone would suggest, and that entropy regularization can mitigate this collapse.
This paper provides a refined theoretical analysis of actor-critic methods with entropy regularization. It shows that an exact critic acts as a strong variance reducer, enabling sample complexity comparable to deterministic policy gradient, and that these benefits are preserved when the learned critic is sufficiently accurate.
This paper proposes Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for studying open and controlled learning systems. It formalizes entropy regularization through an effective information force, derives convergence and generalization results, and provides a conditional interpretation of scaling-law behavior.
This paper proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation in LLM reinforcement learning by addressing policy entropy collapse through difficulty-aware coefficient allocation and initial-anchored target entropy. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements in both accuracy and exploration capability.
OpenAI researchers demonstrate a precise mathematical equivalence between soft (entropy-regularized) Q-learning and policy gradient methods in reinforcement learning, providing theoretical insight into why Q-learning works despite inaccurate value estimates. They validate this equivalence empirically on the Atari benchmark and show a Q-learning method can closely match A3C's learning dynamics.
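For readers unfamiliar with the entropy-regularized ("soft") setting that several of the summaries above refer to, the following is a minimal sketch of the standard single-state case. It is not taken from any of the papers listed. It assumes a plain entropy bonus with temperature `tau` (equivalently, a uniform reference policy), and the function names are hypothetical. It illustrates the textbook identity that the Boltzmann policy over Q-values attains the soft value `tau * log(sum(exp(Q / tau)))` as its regularized objective.

```python
import math

def soft_value(q_values, tau):
    """Soft value V = tau * log(sum_a exp(Q(a) / tau)), computed stably."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def boltzmann_policy(q_values, tau):
    """Policy pi(a) = exp((Q(a) - V) / tau); sums to 1 by construction."""
    v = soft_value(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_objective(probs, q_values, tau):
    """Expected Q under the policy plus tau times the policy entropy."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + tau * entropy(probs)

# Illustrative numbers only (assumed, not from any paper).
q = [1.0, 2.0, 0.5]
tau = 0.5
pi = boltzmann_policy(q, tau)
uniform = [1.0 / len(q)] * len(q)
```

Under these assumptions, `regularized_objective(pi, q, tau)` coincides with `soft_value(q, tau)`, and any other policy, such as `uniform`, scores no higher. Lowering `tau` makes `pi` more peaked, which is the entropy-collapse direction that the regularization term counteracts.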