Tag
This paper introduces MuSix, a framework for embodied agents that uses scale-aware world model mixture and evolution to handle multi-scale reasoning and dynamic adaptation in evolving environments, achieving improvements over baselines on EmbodiedBench and HAZARD.
LabGuard is a framework that translates natural-language laboratory safety rules into executable runtime monitors for embodied agents, reducing unsafe events from 39.5% to 23.8% while maintaining task success.
WorldLines is a benchmark for long-horizon embodied household assistance that covers memory QA and embodied task planning under partial observability; the paper also proposes ObsMem, a visibility-aware memory framework.
Introduces AgentSpec, a modular specification framework for systematically composing and analyzing embodied LLM agent scaffolds, revealing that performance depends on scaffold compatibility and interaction effects rather than isolated module strength.
This paper presents RECENT, a framework that enables efficient skill grounding in embodied agents using small language models (sLMs) by refactoring code-based skills rather than regenerating them from scratch, achieving performance comparable to LLM-based methods.
Cosmos 3 is a family of omnimodal world models from NVIDIA that jointly processes language, image, video, audio, and action sequences using a unified mixture-of-transformers architecture, achieving state-of-the-art performance in understanding and generation tasks for Physical AI.
This paper proposes Polar, a multimodal memory-augmented framework for personalizing embodied MLLM agents over long-term user interactions, using a knowledge graph and episodic memory to ground user-intended instances from accumulated context.
DexHoldem is a real-world benchmark for evaluating embodied agents in dexterous manipulation tasks, using Texas Hold'em with a ShadowHand to test primitive execution, perception, and decision-making in a closed-loop setting.
Ego2World converts egocentric cooking videos (HD-EPIC) into executable symbolic worlds with graph-transition rules, enabling evaluation of belief-state planning under partial observation. Experiments show that belief memory improves task completion, suggesting it should be a first-class target in embodied agent evaluation.
Proposes VeGAS, a test-time framework for MLLM-based embodied agents that samples multiple candidate actions and uses a generative verifier to select the most reliable one, achieving up to 36% relative improvement over CoT baselines on challenging tasks.
The paper introduces 'Continual Harness,' a framework enabling embodied AI agents to self-improve online without environment resets. It demonstrates significant progress in playing Pokémon games, achieving human-level performance through automated prompt and skill refinement.
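The sample-then-verify selection described in the VeGAS entry can be illustrated with a minimal Python sketch. This is not the paper's implementation: `sample_and_verify`, `toy_sampler`, and `toy_verifier` are hypothetical names, and the two toy functions stand in for the MLLM policy and the generative verifier, which the summary does not specify.

```python
from typing import Callable, List


def sample_and_verify(
    observation: str,
    sampler: Callable[[str, int], str],
    verifier: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidate actions and return the one the verifier scores highest."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Draw n candidates; the index is passed so a sampler can vary its output.
    candidates: List[str] = [sampler(observation, i) for i in range(n)]
    # Keep the candidate with the highest verifier score (first one wins ties).
    return max(candidates, key=lambda action: verifier(observation, action))


def toy_sampler(observation: str, index: int) -> str:
    # Hypothetical stand-in for sampling an action from an MLLM policy.
    actions = ["open drawer", "pick up cup", "move to sink", "wait"]
    return actions[index % len(actions)]


def toy_verifier(observation: str, action: str) -> float:
    # Hypothetical stand-in for a generative verifier: the fraction of
    # action words that also appear in the observation.
    observed = set(observation.lower().split())
    words = action.split()
    return sum(word in observed for word in words) / len(words)


best = sample_and_verify("a cup is on the table", toy_sampler, toy_verifier, n=4)
print(best)
```

In a real agent, both callables would wrap model calls, and the number of samples `n` trades extra test-time compute for a better chance that at least one candidate passes verification.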