Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
Summary
This paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC) for fine-tuning Vision-Language-Action (VLA) policies using online reinforcement learning with sparse binary episode outcomes. HABC separates viability and efficiency objectives via adaptive critic heads and intervention-aware credit assignment, significantly improving success rates on contact-rich bimanual manipulation tasks.
Source: https://huggingface.co/papers/2606.17043 Published on Jun 15
Submitted by Siyuan (https://huggingface.co/SiyuanH) on Jun 16
Abstract
Hierarchical Advantage-Weighted Behavior Cloning (HABC) addresses sparse reward challenges in robot learning by separately optimizing viability and efficiency objectives through adaptive critic heads and intervention-aware credit assignment, significantly improving success rates in contact-rich manipulation tasks.
When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
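The abstract does not give the functional forms, so the following is only a minimal sketch of how the gated weighting and the intervention-aware labeling might fit together. Everything concrete in it is an assumption made for illustration: the sigmoid gate with its `threshold` and `sharpness` parameters, the convex mix of the two advantages, the clipped exponential weight (borrowed from common advantage-weighted regression practice), and the rule that only the trailing autonomous segment receives the episode outcome. None of the function names come from the paper.

```python
import math
from typing import List, Optional


def gate(viability: float, threshold: float = 0.8, sharpness: float = 20.0) -> float:
    """Hypothetical gate g_t: near 0 when viability is low, near 1 when it is high."""
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def combined_advantage(adv_viability: float, adv_efficiency: float, viability: float) -> float:
    """Assumed convex mix: viability dominates until the gate opens."""
    g = gate(viability)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def actor_weight(advantage: float, temperature: float = 1.0, max_weight: float = 20.0) -> float:
    """Assumed per-transition weight: clipped exponential of the advantage."""
    return min(math.exp(advantage / temperature), max_weight)


def label_outcomes(is_intervention: List[bool], success: bool) -> List[Optional[float]]:
    """Assumed labeling rule: only the final autonomous segment gets the outcome.

    Steps at or before the last intervention stay unlabeled (None), so the
    episode result never crosses an intervention boundary.
    """
    labels: List[Optional[float]] = [None] * len(is_intervention)
    outcome = 1.0 if success else 0.0
    # Walk backwards from the end and stop at the first intervention step.
    for t in range(len(is_intervention) - 1, -1, -1):
        if is_intervention[t]:
            break
        labels[t] = outcome
    return labels


if __name__ == "__main__":
    # Low viability: the viability advantage drives the weight.
    print(actor_weight(combined_advantage(1.0, -1.0, viability=0.2)))
    # High viability: the efficiency advantage takes over.
    print(actor_weight(combined_advantage(1.0, -1.0, viability=0.99)))
    # Only the two autonomous steps after the intervention are labeled.
    print(label_outcomes([False, True, True, False, False], success=True))
```

Under these assumptions, a transition with a positive viability advantage and a negative efficiency advantage is up-weighted while success is uncertain and down-weighted once the state is judged viable, which matches the prioritization the abstract describes.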
Similar Articles
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
EventVLA introduces a sparse visual evidence memory framework for long-horizon robotic manipulation, achieving an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
InSight: Self-Guided Skill Acquisition via Steerable VLAs
InSight presents a framework for autonomous skill acquisition in vision-language-action (VLA) models by enabling steerability at the primitive-action level and using a VLM-guided data flywheel to generate demonstrations, achieving manipulation tasks like block flipping and pouring without human demonstrations.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.