MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft
Summary
The MineExplorer benchmark evaluates multimodal large language model agents' open-world exploration abilities in Minecraft using atomic and multi-hop tasks designed through multi-agent synthesis. Experiments show that open-world exploration remains challenging, with strong models degrading sharply over longer trajectories.
Cached at: 06/02/26, 03:35 PM
Source: https://huggingface.co/papers/2605.30931
Abstract
MineExplorer benchmark evaluates multimodal large language models’ open-world exploration capabilities in Minecraft through atomic and multi-hop tasks designed via multi-agent synthesis.
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. Then we organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
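To make the idea of a task graph with rule-based milestone evaluators more concrete, here is a minimal Python sketch. It is not taken from the MineExplorer code base: the `Milestone` class, the `evaluate` and `progress` functions, the dictionary-based world state, and the milestone names are all hypothetical, chosen only to illustrate how a multi-hop task might be scored when each step has hidden prerequisites.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical world snapshot: counters observed at one step of a trajectory.
State = Dict[str, int]


@dataclass
class Milestone:
    """One node of an illustrative task graph (not the paper's schema)."""
    name: str
    check: Callable[[State], bool]            # rule-based predicate on a state
    requires: List[str] = field(default_factory=list)  # prerequisite milestones


def evaluate(milestones: List[Milestone], trajectory: List[State]) -> List[str]:
    """Return the milestones reached, in order, honoring prerequisites."""
    done: List[str] = []
    for state in trajectory:
        progressed = True
        # Re-scan until no new milestone fires on this state, so a chain of
        # milestones satisfied by the same state is credited together.
        while progressed:
            progressed = False
            for m in milestones:
                if m.name in done:
                    continue
                if all(r in done for r in m.requires) and m.check(state):
                    done.append(m.name)
                    progressed = True
    return done


def progress(milestones: List[Milestone], trajectory: List[State]) -> float:
    """Fraction of milestones completed over the whole trajectory."""
    return len(evaluate(milestones, trajectory)) / len(milestones)


# A three-hop chain: each milestone only counts after its prerequisite.
MILESTONES = [
    Milestone("find_key", lambda s: s.get("key", 0) >= 1),
    Milestone("open_chest", lambda s: s.get("chest_open", 0) >= 1,
              requires=["find_key"]),
    Milestone("retrieve_item", lambda s: s.get("item", 0) >= 1,
              requires=["open_chest"]),
]

# An assumed agent trajectory that stalls before the final hop.
TRAJECTORY = [
    {"key": 0},
    {"key": 1},
    {"key": 1, "chest_open": 1},
]

if __name__ == "__main__":
    print(evaluate(MILESTONES, TRAJECTORY))
    print(progress(MILESTONES, TRAJECTORY))
```

Under these assumptions, partial credit falls out naturally: an agent that completes the first two hops but never retrieves the item scores two of three milestones, which is one way a benchmark could expose degradation on longer trajectories rather than reporting only binary success.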
Similar Articles
Look Before You Leap: Autonomous Exploration for LLM Agents
This paper identifies autonomous exploration as a critical capability for LLM agents and proposes the Explore-then-Act paradigm, which decouples information gathering from task execution to improve adaptability and real-world performance. It also introduces Exploration Checkpoint Coverage as a verifiable metric for evaluating exploration breadth.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
GROW proposes a novel reinforcement learning framework that adapts GRPO to multi-turn VLM agent tasks by decomposing trajectories into state-action pairs and computing advantages between them, achieving state-of-the-art performance on over 800 Minecraft tasks.
Some considerations on learning to explore via meta-reinforcement learning
OpenAI researchers introduce E-MAML and E-RL², two meta-reinforcement learning algorithms designed to improve exploration in tasks where discovering optimal policies requires significant exploration. The work demonstrates these algorithms' effectiveness on novel environments including Krazy World and maze tasks.
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
This paper proposes a method to train LLM agents with intrinsic meta-evolution capabilities, enabling spontaneous self-improvement without external rewards at inference time. Applied to Qwen3-30B and Seed-OSS-36B, the approach yields a 20% performance boost on web navigation benchmarks, with a 14B model outperforming Gemini-2.5-Flash.