Tag
This paper proposes HT-GRPO, a hierarchical reinforcement learning method for diffusion multi-modal large language models that uses a sketch-then-paint training scheme and hierarchical credit assignment to improve image generation quality and reward alignment.
OpenAI research proposes hierarchical reinforcement learning where agents break down complex tasks into sequences of high-level actions rather than low-level ones, significantly improving efficiency for long-horizon tasks by reducing search complexity from thousands of steps to dozens.