Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

arXiv cs.AI Papers

Summary

Proposes MAVIC, a method for multi-agent reinforcement learning that corrects value estimates at instruction boundaries to enable compliance with external natural language instructions while preserving base task performance.

arXiv:2605.12655v1 (announce type: new)

Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

# Macro-Action Based Multi-Agent Instruction Following through Value Cancellation
Source: [https://arxiv.org/abs/2605.12655](https://arxiv.org/abs/2605.12655)
[View PDF](https://arxiv.org/pdf/2605.12655)

> Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
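The abstract does not give MAVIC's equations, so the following is only a minimal sketch of the general idea it describes: at an instruction boundary, the bootstrapping target drops the value under the incoming instruction and restores the continuation value under the current one. Every name here (`MacroTransition`, `naive_target`, `corrected_target`, the field names) is hypothetical, and the exact form of the correction is an assumption, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MacroTransition:
    """One macro-action segment (hypothetical container, not from the paper)."""
    rewards: Sequence[float]   # per-step rewards under the current instruction
    v_next_current: float      # critic value of the next state under the current instruction
    v_next_incoming: float     # critic value of the next state under the incoming instruction
    interrupted: bool          # True if a new instruction cut the macro-action short


def discounted_sum(rewards: Sequence[float], gamma: float) -> float:
    """Discounted return accumulated while the macro-action was running."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def naive_target(t: MacroTransition, gamma: float) -> float:
    """Uncorrected backup: after an interruption it bootstraps from the
    incoming instruction's value, mixing two objectives in one target."""
    boot = t.v_next_incoming if t.interrupted else t.v_next_current
    return discounted_sum(t.rewards, gamma) + (gamma ** len(t.rewards)) * boot


def corrected_target(t: MacroTransition, gamma: float) -> float:
    """Assumed correction: subtract the incoming objective's value and add
    back the continuation value under the current objective."""
    target = naive_target(t, gamma)
    if t.interrupted:
        discount = gamma ** len(t.rewards)
        target += discount * (t.v_next_current - t.v_next_incoming)
    return target


example = MacroTransition(
    rewards=[1.0, 1.0], v_next_current=4.0, v_next_incoming=-8.0, interrupted=True
)
print(naive_target(example, gamma=0.5), corrected_target(example, gamma=0.5))
```

In this toy case the naive target is pulled down by the incoming instruction's value, while the corrected target equals what the backup would have been had the macro-action not been interrupted. When `interrupted` is false the two targets coincide.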

## Submission history

From: Wo Wei Lin

**[v1]** Tue, 12 May 2026 19:01:16 UTC (356 KB)
