Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Hugging Face Daily Papers 06/15/26, 12:00 AM Papers

Summary

This paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC) for fine-tuning Vision-Language-Action (VLA) policies with online reinforcement learning when each episode yields only a sparse binary outcome. HABC separates viability and efficiency objectives via dedicated critic heads combined by a state-adaptive gate, and uses intervention-aware credit assignment. On three contact-rich bimanual manipulation tasks it raises real-robot success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
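The gating and weighting step described above can be illustrated with a minimal sketch. The abstract does not give the functional form of the gate g_t, the rule for merging the two advantages, or the mapping from advantage to weight, so everything below is an assumption made for illustration: a sigmoid gate on the estimated success probability, a convex mix of the two advantages, and a clipped exponential weight in the style of advantage-weighted regression. The names `gate`, `combined_advantage`, `bc_weight`, and `weighted_bc_loss`, and all default parameters, are hypothetical.

```python
import math
from typing import Sequence


def gate(viability: float, threshold: float = 0.8, sharpness: float = 20.0) -> float:
    """Hypothetical gate g_t in [0, 1] (assumed sigmoid form).

    `viability` is the viability critic's estimate of success probability.
    The gate stays near 0 while success is uncertain and approaches 1
    once viability is high.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def combined_advantage(adv_viability: float, adv_efficiency: float, viability: float) -> float:
    """Assumed convex mix: viability dominates at low g_t, efficiency at high g_t."""
    g = gate(viability)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def bc_weight(advantage: float, temperature: float = 1.0, max_weight: float = 20.0) -> float:
    """Assumed exponential advantage weight, clipped for numerical stability."""
    return min(math.exp(advantage / temperature), max_weight)


def weighted_bc_loss(neg_log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Mean of per-transition behavior-cloning losses scaled by their weights."""
    total = sum(w * nll for w, nll in zip(weights, neg_log_probs))
    return total / max(len(neg_log_probs), 1)
```

With these assumed defaults, a transition in a state whose estimated viability is 0.2 is weighted almost entirely by its viability advantage, while one in a state with viability 0.99 is weighted mostly by its efficiency advantage.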

Original Article

Cached at: 06/16/26, 11:31 AM

Source: https://huggingface.co/papers/2606.17043 Published on Jun 15

Submitted by Siyuan (https://huggingface.co/SiyuanH) on Jun 16

Abstract

Hierarchical Advantage-Weighted Behavior Cloning (HABC) addresses sparse reward challenges in robot learning by separately optimizing viability and efficiency objectives through adaptive critic heads and intervention-aware credit assignment, significantly improving success rates in contact-rich manipulation tasks.

When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
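The intervention-aware labeling idea can be sketched as a simple mask over an episode's transitions. This is only one possible reading of the abstract, which does not spell out the exact rule: here the binary episode outcome is attached to transitions whose action came from the current policy, and transitions inside intervention segments are left unlabeled. The `Transition` record, its `autonomous` flag, and `assign_outcome_labels` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transition:
    step: int
    # True if the current policy chose the action,
    # False if the action came from an intervention (assumed flag).
    autonomous: bool
    # Outcome label used for critic supervision; None means "no label".
    outcome_label: Optional[int] = None


def assign_outcome_labels(episode: List[Transition], success: bool) -> List[Transition]:
    """Attach the episode outcome only to policy-executed transitions.

    Intervention transitions keep outcome_label=None so that the
    episode-level outcome does not leak across intervention boundaries.
    """
    for transition in episode:
        transition.outcome_label = int(success) if transition.autonomous else None
    return episode


# Example: the policy acts for two steps, an intervention takes over for two, then the policy resumes.
flags = [True, True, False, False, True]
episode = [Transition(step=i, autonomous=flag) for i, flag in enumerate(flags)]
labels = [t.outcome_label for t in assign_outcome_labels(episode, success=True)]
```

A stricter variant could also withhold the label from autonomous segments that end in an intervention; whether HABC does this is not stated in the abstract.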

Get this paper in your agent:

hf papers read 2606.17043

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

InSight: Self-Guided Skill Acquisition via Steerable VLAs

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
