Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments

arXiv cs.AI Papers

Summary

This paper introduces MuSix, a framework for embodied agents that uses scale-aware world model mixture and evolution to handle multi-scale reasoning and dynamic adaptation in evolving environments, achieving improvements over baselines on EmbodiedBench and HAZARD.

arXiv:2607.00457v1. Abstract: Embodied agents operating in the real world require multi-scale reasoning and knowledge adaptation as conditions change. We identify two challenges in applying Mixture of Experts (MoE) to this setting: routing lacks an explicit notion of scale, preventing targeted updates at specific scales, and a uniform update policy cannot accommodate the different rates at which knowledge at each scale becomes outdated. We present MuSix, a framework that addresses both challenges through scale-aware world model mixture and evolution. A two-stage routing mechanism grounds scale selection in experiential distance, a measure of situational novelty inspired by Construal Level Theory: a meta-router first maps this quantity to a weight over continuous scale space, then per-scale base routers select world models within the identified scale. For adaptation, scale-dependent forgetting rates allow low-scale knowledge to refresh rapidly while high-scale abstractions persist, and gated inter-scale transfer maintains coherence across the hierarchy. Experiments on EmbodiedBench and HAZARD show that MuSix improves over state-of-the-art baselines on multi-scale reasoning and dynamic adaptation.

Cached at: 07/02/26, 05:40 AM

# Multi-scale Mixture of World Models for Embodied Agents in Evolving Environments
Source: [https://arxiv.org/abs/2607.00457](https://arxiv.org/abs/2607.00457)
[View PDF](https://arxiv.org/pdf/2607.00457)

> Abstract: Embodied agents operating in the real world require multi-scale reasoning and knowledge adaptation as conditions change. We identify two challenges in applying Mixture of Experts (MoE) to this setting: routing lacks an explicit notion of scale, preventing targeted updates at specific scales, and a uniform update policy cannot accommodate the different rates at which knowledge at each scale becomes outdated. We present MuSix, a framework that addresses both challenges through scale-aware world model mixture and evolution. A two-stage routing mechanism grounds scale selection in experiential distance, a measure of situational novelty inspired by Construal Level Theory: a meta-router first maps this quantity to a weight over continuous scale space, then per-scale base routers select world models within the identified scale. For adaptation, scale-dependent forgetting rates allow low-scale knowledge to refresh rapidly while high-scale abstractions persist, and gated inter-scale transfer maintains coherence across the hierarchy. Experiments on EmbodiedBench and HAZARD show that MuSix improves over state-of-the-art baselines on multi-scale reasoning and dynamic adaptation.
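
The abstract gives no equations, so the following is only an illustrative sketch of how two-stage routing and scale-dependent forgetting could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the discretisation of the continuous scale space into three centres, the Gaussian-style softmax in the meta-router, the dot-product base router, the exponential decay rule, and the convention that a larger experiential distance maps to a higher (more abstract) scale. Gated inter-scale transfer is not covered.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def meta_route(distance, scale_centers, temperature=0.5):
    """Stage 1 (hypothetical): map a scalar experiential distance to
    weights over scales. Scales whose centre lies closer to the
    distance receive more weight."""
    logits = [-((distance - c) ** 2) / temperature for c in scale_centers]
    return softmax(logits)


def base_route(state, model_keys):
    """Stage 2 (hypothetical): within one scale, weight world models by
    the dot product between state features and a per-model key."""
    logits = [sum(s * k for s, k in zip(state, key)) for key in model_keys]
    return softmax(logits)


def two_stage_route(distance, state, scale_centers, keys_per_scale):
    """Combine both stages into one weight per (scale, model) pair."""
    scale_weights = meta_route(distance, scale_centers)
    mixture = {}
    for s, (w_scale, keys) in enumerate(zip(scale_weights, keys_per_scale)):
        for m, w_model in enumerate(base_route(state, keys)):
            mixture[(s, m)] = w_scale * w_model
    return scale_weights, mixture


def decay_memory(strengths, forgetting_rates):
    """Assumed forgetting rule: exponential decay with one rate per
    scale, so a larger rate makes that scale's knowledge fade faster."""
    return [v * math.exp(-r) for v, r in zip(strengths, forgetting_rates)]


# Three assumed scales (low, mid, high), two world models per scale.
scale_centers = [0.0, 0.5, 1.0]
keys_per_scale = [[[1.0, 0.0], [0.0, 1.0]] for _ in scale_centers]

# A fairly novel situation (distance 0.9) with a 2-d state feature.
scale_weights, mixture = two_stage_route(
    0.9, [1.0, 0.0], scale_centers, keys_per_scale
)

# Low-scale knowledge is given the largest forgetting rate.
strengths = decay_memory([1.0, 1.0, 1.0], [0.5, 0.1, 0.01])
```

In this sketch the mixture weights sum to one across all (scale, model) pairs, and after one decay step the low-scale strength has dropped the most while the high-scale strength is nearly unchanged, which mirrors the qualitative behaviour the abstract describes.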

## Submission history

From: Jinwoo Jang \[[view email](https://arxiv.org/show-email/3b4d85fb/2607.00457)\]
**\[v1\]** Wed, 1 Jul 2026 05:23:56 UTC (9,191 KB)

Similar Articles

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Hugging Face Daily Papers

MultiWorld is a unified framework for multi-agent multi-view video world modeling that achieves accurate control of multiple agents while maintaining multi-view consistency through a Multi-Agent Condition Module and Global State Encoder.

Multi-Agent World Models (3 minute read)

TLDR AI

γ-World is a generative multi-agent world model that supports independently controllable, permutation-symmetric agents using Simplex Rotary Agent Encoding and Sparse Hub Attention, achieving real-time 24 FPS rollouts and zero-shot generalization from two to four players.

tencent/HY-Embodied-0.5

Hugging Face Models Trending

Tencent releases HY-Embodied-0.5, a suite of foundation models designed for embodied AI agents featuring a Mixture-of-Transformers (MoT) architecture with efficient 2B and powerful 32B variants for real-world robot control and spatial-temporal reasoning.

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Hugging Face Daily Papers

ABot-M0.5 is a new World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency, achieving state-of-the-art results on long-horizon and fine-grained manipulation benchmarks.