Tag
This paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC) for fine-tuning Vision-Language-Action (VLA) policies using online reinforcement learning with sparse binary episode outcomes. HABC separates viability and efficiency objectives via adaptive critic heads and intervention-aware credit assignment, significantly improving success rates on contact-rich bimanual manipulation tasks.
This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes such as reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, which raises success rates well above one-shot generation (e.g., DoorKey-8×8 from 2.3% to 97.6%).
The paper introduces Reflection-Enhanced Self-Distillation (Resd), a framework that transforms failure feedback into corrective supervision for LLMs, enabling efficient learning from rare successes. It outperforms standard self-distillation baselines and achieves faster early improvement than GRPO with fewer samples.
The post explains why reinforcement learning struggles with long-horizon tasks, where rewards are sparse, and highlights GEPA, a method that uses trajectory-level textual reflection to preserve richer feedback signals for optimization.
OpenAI presents Hindsight Experience Replay (HER), a reinforcement learning technique that enables robots to learn from failed attempts by retroactively treating achieved alternative outcomes as successful goals, allowing learning even with sparse reward signals.
OpenAI presents Hindsight Experience Replay (HER), a technique enabling sample-efficient reinforcement learning from sparse binary rewards without complex reward engineering. It is demonstrated on robotic arm manipulation tasks including pushing, sliding, and pick-and-place, and validated on physical robots.
OpenAI researchers propose a framework using stochastic neural networks for hierarchical reinforcement learning that pre-trains useful skills guided by a proxy reward, then leverages these skills for faster learning in downstream tasks with sparse rewards or long horizons.
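The hindsight relabeling idea described in the HER summaries above can be illustrated with a small sketch. It assumes a toy grid world where states and goals are integer pairs, a 0 / -1 sparse reward, and goals resampled from states reached later in the same episode; the names `Transition`, `sparse_reward`, and `her_relabel` are hypothetical and are not taken from any OpenAI code release.

```python
import random
from typing import List, NamedTuple, Optional, Tuple

Goal = Tuple[int, int]  # assumption: states and goals are grid coordinates


class Transition(NamedTuple):
    state: Goal
    action: int
    next_state: Goal
    goal: Goal
    reward: float


def sparse_reward(achieved: Goal, goal: Goal) -> float:
    """Binary sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0


def her_relabel(
    episode: List[Transition],
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> List[Transition]:
    """Return the original transitions plus k hindsight copies of each.

    Each copy replaces the goal with a state actually reached at the same
    or a later step of the episode, then recomputes the reward, so even a
    failed episode yields some transitions that count as successes.
    """
    rng = rng or random.Random(0)
    out: List[Transition] = []
    for t, tr in enumerate(episode):
        out.append(tr)  # keep the original transition unchanged
        future = episode[t:]  # steps from t to the end of the episode
        for _ in range(k):
            new_goal = rng.choice(future).next_state
            out.append(
                tr._replace(
                    goal=new_goal,
                    reward=sparse_reward(tr.next_state, new_goal),
                )
            )
    return out
```

An off-policy learner would then train on the relabeled list instead of the raw episode. With a goal the agent never reaches, every original transition carries a reward of -1, while the relabeled copies include transitions with reward 0.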