This paper introduces Path-Coupled Bellman Flows (PCBF), a continuous-time distributional reinforcement learning method that uses flow matching to model return distributions without heuristic projections. It addresses boundary mismatch and high-variance issues in previous flow-based approaches by coupling current and successor return flows through shared base noise.
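A minimal sketch of the shared-base-noise coupling idea, assuming a 1-D return distribution, a straight-line probability path, and an Euler ODE solver; `VelocityNet`, `sample_return`, and `coupled_fm_loss` are illustrative names, not the paper's actual code:

```python
# Sketch of flow matching for return distributions where the Bellman target
# is generated from the SAME base noise that anchors the flow-matching path.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """v_theta(x, t, s): velocity field conditioned on state features."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, t, s):
        return self.net(torch.cat([x, t, s], dim=-1))

def sample_return(v_net, s, z, steps: int = 20):
    """Integrate the flow ODE from base noise z to a return sample (Euler)."""
    x = z.clone()
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full_like(x, i * dt)
        x = x + dt * v_net(x, t, s)
    return x

def coupled_fm_loss(v_net, v_tgt, s, s_next, r, gamma: float = 0.99):
    """Flow-matching loss whose Bellman target shares the base noise z."""
    z = torch.randn_like(r)                       # one base sample per transition
    with torch.no_grad():
        g_next = sample_return(v_tgt, s_next, z)  # successor return from same z
        x1 = r + gamma * g_next                   # Bellman-backed-up target
    t = torch.rand_like(r)
    x_t = (1 - t) * z + t * x1                    # straight-line path sample
    v_star = x1 - z                               # conditional target velocity
    return ((v_net(x_t, t, s) - v_star) ** 2).mean()
```

Generating the successor return from the same base sample `z` that anchors the current flow-matching path is what couples the two flows; per the summary above, this coupling is the paper's remedy for boundary mismatch and high-variance targets.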
This paper introduces POISE, a method for stable policy optimization in large reasoning models that estimates policy-gradient baselines from the model's own internal states, reducing computational overhead relative to PPO and GRPO.
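A loose sketch of that idea, assuming a scalar reward per response and a linear value head over the policy model's last hidden state; `HiddenStateBaseline` and `internal_baseline_step` are hypothetical names, not POISE's published design:

```python
# Sketch: a baseline read off the model's own internal states instead of a
# separate critic (as in PPO) or group sampling (as in GRPO).
import torch
import torch.nn as nn

class HiddenStateBaseline(nn.Module):
    """Lightweight value head over the policy model's last hidden state."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, hidden_dim)
        return self.head(h).squeeze(-1)

def internal_baseline_step(logprobs, hidden, rewards, baseline, pol_opt, base_opt):
    """One REINFORCE-style update using the internal-state baseline."""
    b = baseline(hidden.detach())            # baseline from internal states only
    adv = rewards - b.detach()               # variance-reduced advantage
    pol_loss = -(adv * logprobs).mean()      # policy-gradient surrogate
    base_loss = ((b - rewards) ** 2).mean()  # regress baseline onto returns
    pol_opt.zero_grad(); pol_loss.backward(); pol_opt.step()
    base_opt.zero_grad(); base_loss.backward(); base_opt.step()
```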
OpenAI researchers derive a bias-free action-dependent baseline for variance reduction in policy gradient methods, demonstrating improved learning efficiency on high-dimensional control tasks and in multi-agent and partially observed environments.
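A toy sketch of such a baseline for a factorized policy, where each per-dimension baseline b_i(s, a_{-i}) may see every action dimension except its own, which is what keeps the estimator unbiased; `factorized_pg_loss` and its inputs are illustrative assumptions, not the paper's exact parameterization:

```python
# Toy example of an action-dependent, per-dimension baseline for a
# factorized policy; q_hat and baselines stand in for learned estimators.
import torch

def factorized_pg_loss(logprob_per_dim: torch.Tensor,
                       q_hat: torch.Tensor,
                       baselines: torch.Tensor) -> torch.Tensor:
    """
    logprob_per_dim: (B, D) log pi_i(a_i | s) for each action dimension i
    q_hat:           (B,)   Q/return estimate for the sampled (s, a)
    baselines:       (B, D) b_i(s, a_{-i}); each column may depend on the
                     state and the OTHER action dimensions but not on a_i
                     itself, so E_{a_i}[grad log pi_i(a_i|s) * b_i] = 0
                     and subtracting it introduces no bias.
    """
    adv = q_hat.unsqueeze(-1) - baselines            # per-dimension advantage
    return -(logprob_per_dim * adv.detach()).mean()  # surrogate whose gradient
                                                     # is the policy gradient
```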