Macro-Action Based Multi-Agent Instruction Following through Value Cancellation
Summary
Proposes MAVIC, a method for multi-agent reinforcement learning that corrects value estimates at instruction boundaries to enable compliance with external natural language instructions while preserving base task performance.
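As a rough illustration of what correcting a value estimate at an instruction boundary could look like, the sketch below contrasts two bootstrapping targets for a single macro-action transition. It is not the paper's algorithm: the `MacroTransition` fields, the function names, and the specific rule (bootstrapping from the value under the instruction context that was active while the macro-action ran) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MacroTransition:
    """One macro-action transition (hypothetical container, not from the paper)."""
    reward: float           # discounted reward accumulated while the macro-action ran
    duration: int           # number of primitive steps the macro-action lasted
    v_next_current: float   # critic value of the next state under the instruction
                            # context that was active during the macro-action
    v_next_incoming: float  # critic value of the next state under the instruction
                            # context that arrives at the boundary
    boundary: bool          # True if a new instruction interrupted the macro-action


def naive_target(t: MacroTransition, gamma: float) -> float:
    """Bootstrap from whichever context is active at the next state.

    At a boundary this mixes the incoming instruction's value into the
    target for the current context, which is the coupling the summary
    describes as a failure mode.
    """
    v_next = t.v_next_incoming if t.boundary else t.v_next_current
    return t.reward + gamma ** t.duration * v_next


def boundary_corrected_target(t: MacroTransition, gamma: float) -> float:
    """Assumed correction: always bootstrap from the continuation value
    under the current context, ignoring the incoming instruction's value."""
    return t.reward + gamma ** t.duration * t.v_next_current


interrupted = MacroTransition(
    reward=1.0, duration=2, v_next_current=10.0, v_next_incoming=-4.0, boundary=True
)
print(naive_target(interrupted, gamma=0.5))               # 1.0 + 0.25 * -4.0
print(boundary_corrected_target(interrupted, gamma=0.5))  # 1.0 + 0.25 * 10.0
```

When no instruction arrives (`boundary=False`), the two targets coincide, so in this sketch the correction only changes updates at instruction boundaries.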
Cached at: 05/14/26, 06:13 AM
# Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

Source: [https://arxiv.org/abs/2605.12655](https://arxiv.org/abs/2605.12655) [View PDF](https://arxiv.org/pdf/2605.12655)

> Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

## Submission history

From: Wo Wei Lin [[view email](https://arxiv.org/show-email/e97b481f/2605.12655)]

**[v1]** Tue, 12 May 2026 19:01:16 UTC (356 KB)
Similar Articles
MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning
MetaAgent-X introduces an end-to-end reinforcement learning framework that jointly optimizes the design and execution of automatic multi-agent systems, overcoming the frozen-executor ceiling and achieving up to 21.7% gains over existing baselines.
Recursive Multi-Agent Systems
This paper introduces RecursiveMAS, a framework that extends recursive scaling principles to multi-agent systems for improved collaborative reasoning efficiency and accuracy. It demonstrates significant speedups and token reduction across various benchmarks compared to standard baselines.
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
Proposes VeGAS, a test-time framework for MLLM-based embodied agents that samples multiple candidate actions and uses a generative verifier to select the most reliable one, achieving up to 36% relative improvement over CoT baselines on challenging tasks.
When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
This paper identifies a failure mode in LLM-based multi-agent systems where plans fail due to agents misjudging their knowledge (epistemic miscalibration) and proposes EPC-AW, a workflow that uses information-consistency and epistemic state refinement to improve system-level success by 9.75%.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
AgentV-RL introduces an Agentic Verifier framework that enhances reward modeling through bidirectional verification with forward and backward agents augmented with tools, achieving 25.2% improvement over state-of-the-art ORMs. The approach addresses error propagation and grounding issues in verifiers for complex reasoning tasks through multi-turn deliberative processes combined with reinforcement learning.