Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

arXiv cs.AI 05/14/26, 04:00 AM Papers

multi-agent reinforcement-learning instruction-following macro-action value-correction marl compliance

Summary

Proposes MAVIC, a method for multi-agent reinforcement learning that corrects value estimates at instruction boundaries to enable compliance with external natural language instructions while preserving base task performance.

arXiv:2605.12655v1 Announce Type: new Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

Cached at: 05/14/26, 06:13 AM

# Macro-Action Based Multi-Agent Instruction Following through Value Cancellation
Source: [https://arxiv.org/abs/2605.12655](https://arxiv.org/abs/2605.12655)
[View PDF](https://arxiv.org/pdf/2605.12655)

> Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
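
The abstract gives no equations, so the following is only a rough sketch of one possible reading of "modifies the bootstrapping target itself", not the paper's algorithm. It assumes a critic conditioned on the active instruction, and that at an instruction boundary the backup bootstraps from the next-state value under the instruction that was active during the macro-action rather than under the incoming one. The function name `bootstrap_target` and all of its parameters are hypothetical.

```python
def bootstrap_target(rewards, gamma, v_next_current, v_next_incoming,
                     switched, correct=True):
    """Bootstrapping target for one macro-action (illustrative assumption only).

    rewards:         per-step rewards collected while the macro-action ran
    gamma:           per-step discount factor
    v_next_current:  critic value of the next state under the instruction
                     that was active during the macro-action
    v_next_incoming: critic value of the next state under the instruction
                     that becomes active afterwards
    switched:        True if the instruction changed at this boundary
    correct:         if False, fall back to the naive backup for comparison
    """
    # Discounted reward accumulated over the duration of the macro-action.
    accumulated = sum(gamma ** k * r for k, r in enumerate(rewards))
    # The bootstrap term is discounted by the macro-action's length.
    discount = gamma ** len(rewards)

    if switched and correct:
        # Assumed correction: keep the continuation value of the current
        # objective so the incoming objective does not leak into this backup.
        bootstrap = v_next_current
    else:
        # Naive backup: bootstrap from whatever instruction is active next.
        bootstrap = v_next_incoming
    return accumulated + discount * bootstrap


# Toy numbers, chosen only to show the two targets diverging at a boundary.
corrected = bootstrap_target([1.0, 1.0], 0.5, 4.0, -8.0, switched=True)
naive = bootstrap_target([1.0, 1.0], 0.5, 4.0, -8.0, switched=True, correct=False)
print(corrected, naive)
```

Note that the rewards are left untouched in both branches; only the bootstrapped term changes, which is the distinction from reward shaping that the abstract draws.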

## Submission history

From: Wo Wei Lin \[[view email](https://arxiv.org/show-email/e97b481f/2605.12655)\] **\[v1\]** Tue, 12 May 2026 19:01:16 UTC (356 KB)

Similar Articles

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Recursive Multi-Agent Systems

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
