MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

minecraft mllm-agents open-world-exploration benchmark multi-agent-synthesis evaluation

Summary

The MineExplorer benchmark evaluates multimodal large language model agents' open-world exploration abilities in Minecraft using atomic and multi-hop tasks designed through multi-agent synthesis. Experiments show that open-world exploration remains challenging, with strong models degrading sharply over longer trajectories.

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.

Original Article


Cached at: 06/02/26, 03:35 PM

Paper page - MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Source: https://huggingface.co/papers/2605.30931

Abstract

MineExplorer benchmark evaluates multimodal large language models’ open-world exploration capabilities in Minecraft through atomic and multi-hop tasks designed via multi-agent synthesis.

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
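To make the idea of a multi-hop task with hidden prerequisites and rule-based milestone evaluators more concrete, here is a minimal Python sketch. It is not MineExplorer's implementation: the `Milestone` class, the flat dictionary used as world state, the `evaluate` and `completion_rate` helpers, and the key/chest/delivery task are all hypothetical names chosen for illustration. The sketch assumes a milestone counts as reached only when its rule holds on some observed state and every prerequisite milestone has already been reached.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical world state: a flat dict of observed counters.
# This is an illustrative assumption, not the benchmark's state format.
State = Dict[str, int]


@dataclass
class Milestone:
    name: str
    check: Callable[[State], bool]  # rule-based predicate over a state
    requires: List[str] = field(default_factory=list)  # prerequisite milestones


def evaluate(milestones: List[Milestone], trajectory: List[State]) -> Dict[str, bool]:
    """Walk the trajectory and mark each milestone reached once its rule
    holds and all of its prerequisites have been reached."""
    reached = {m.name: False for m in milestones}
    for state in trajectory:
        progressed = True
        # Repeat so that several chained milestones can fire on one state.
        while progressed:
            progressed = False
            for m in milestones:
                if reached[m.name]:
                    continue
                if all(reached[r] for r in m.requires) and m.check(state):
                    reached[m.name] = True
                    progressed = True
    return reached


def completion_rate(reached: Dict[str, bool]) -> float:
    """Fraction of milestones reached."""
    return sum(reached.values()) / len(reached)


# Hypothetical three-hop task: find a key, open a chest, deliver an item.
task = [
    Milestone("find_key", lambda s: s.get("key", 0) >= 1),
    Milestone("open_chest", lambda s: s.get("chest_opened", 0) >= 1, ["find_key"]),
    Milestone("deliver_item", lambda s: s.get("delivered", 0) >= 1, ["open_chest"]),
]

# The agent finds the key and opens the chest but never delivers the item.
trajectory = [{"key": 0}, {"key": 1}, {"key": 1, "chest_opened": 1}]
result = evaluate(task, trajectory)
print(result, completion_rate(result))
```

Scoring by milestones rather than only by final success gives partial credit along a trajectory, which is one way to observe where an agent stalls as the number of hops grows.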


Get this paper in your agent:

hf papers read 2605.30931

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Look Before You Leap: Autonomous Exploration for LLM Agents

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Some considerations on learning to explore via meta-reinforcement learning

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
