The MineExplorer benchmark evaluates the open-world exploration abilities of multimodal large language model (MLLM) agents in Minecraft, using atomic and multi-hop tasks designed through multi-agent synthesis. Experiments show that open-world exploration remains challenging: strong models degrade sharply over longer trajectories.