Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments
Summary
This paper introduces MuSix, a framework for embodied agents that uses scale-aware world model mixture and evolution to handle multi-scale reasoning and dynamic adaptation in evolving environments, achieving improvements over baselines on EmbodiedBench and HAZARD.
Cached at: 07/02/26, 05:40 AM
# Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments

Source: [https://arxiv.org/abs/2607.00457](https://arxiv.org/abs/2607.00457) [View PDF](https://arxiv.org/pdf/2607.00457)

> Abstract: Embodied agents operating in the real world require multi-scale reasoning and knowledge adaptation as conditions change. We identify two challenges in applying Mixture of Experts (MoE) to this setting: routing lacks an explicit notion of scale, preventing targeted updates at specific scales, and a uniform update policy cannot accommodate the different rates at which knowledge at each scale becomes outdated. We present MuSix, a framework that addresses both challenges through scale-aware world model mixture and evolution. A two-stage routing mechanism grounds scale selection in experiential distance, a measure of situational novelty inspired by Construal Level Theory: a meta-router first maps this quantity to a weight over continuous scale space, then per-scale base routers select world models within the identified scale. For adaptation, scale-dependent forgetting rates allow low-scale knowledge to refresh rapidly while high-scale abstractions persist, and gated inter-scale transfer maintains coherence across the hierarchy. Experiments on EmbodiedBench and HAZARD show that MuSix improves over state-of-the-art baselines on multi-scale reasoning and dynamic adaptation.

## Submission history

From: Jinwoo Jang
**[v1]** Wed, 1 Jul 2026 05:23:56 UTC (9,191 KB)
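
The two-stage routing and scale-dependent forgetting described in the abstract can be illustrated with a toy sketch. This is not the MuSix implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. That includes the discretization of the continuous scale space into three scale centers, the Gaussian-style meta-router, the softmax base routers, the exponential decay form, the convention that larger experiential distance maps to a higher scale, and all function and parameter names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def meta_route(distance, scale_centers, temperature=0.5):
    """Stage 1 (assumed form): map experiential distance to weights over scales.

    Each scale gets a higher weight the closer its center lies to the
    observed distance. The real scale space is continuous; it is
    discretized here only to keep the sketch short.
    """
    logits = [-((distance - c) ** 2) / temperature for c in scale_centers]
    return softmax(logits)


def base_route(scores_per_scale, scale_weights):
    """Stage 2 (assumed form): pick world models within each scale.

    Each scale has its own router (a softmax over hypothetical per-model
    scores); the result is scaled by that scale's weight from stage 1.
    """
    mixture = []
    for weight, scores in zip(scale_weights, scores_per_scale):
        probs = softmax(scores)
        mixture.append([weight * p for p in probs])
    return mixture


def decay(strength, scale_index, base_rate=0.5, steps=1):
    """Scale-dependent forgetting (assumed exponential form).

    The forgetting rate shrinks as the scale index grows, so low-scale
    knowledge fades quickly while high-scale abstractions persist.
    """
    rate = base_rate / (scale_index + 1)
    return strength * math.exp(-rate * steps)


# Hypothetical setup: three scales, two world models per scale.
scale_centers = [0.0, 0.5, 1.0]
scores = [[0.2, 0.1], [1.0, 0.0], [0.3, 0.7]]

weights = meta_route(0.9, scale_centers)  # a fairly novel situation
mixture = base_route(scores, weights)
```

In this toy setup, a distance of 0.9 puts most of the weight on the highest scale, and a unit of memory strength at scale 0 decays faster per step than one at scale 2. Gated inter-scale transfer is not covered by the sketch.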
Similar Articles
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a unified framework for multi-agent multi-view video world modeling that achieves accurate control of multiple agents while maintaining multi-view consistency through a Multi-Agent Condition Module and Global State Encoder.
Multi-Agent World Models (3 minute read)
γ-World is a generative multi-agent world model that supports independently controllable, permutation-symmetric agents using Simplex Rotary Agent Encoding and Sparse Hub Attention, achieving real-time 24 FPS rollouts and zero-shot generalization from two to four players.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World introduces a self-evolving training framework for general agent intelligence that autonomously discovers real-world environments and tasks via the Model Context Protocol, enabling continuous learning. Agent-World-8B and 14B models outperform strong proprietary models across 23 challenging agent benchmarks.
tencent/HY-Embodied-0.5
Tencent releases HY-Embodied-0.5, a suite of foundation models designed for embodied AI agents featuring a Mixture-of-Transformers (MoT) architecture with efficient 2B and powerful 32B variants for real-world robot control and spatial-temporal reasoning.
ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
ABot-M0.5 is a new World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency, achieving state-of-the-art results on long-horizon and fine-grained manipulation benchmarks.