Can LLMs Follow Strict 2D Spatial Constraints? (Tested with Sokoban)

Reddit r/LocalLLaMA 2026/06/03 08:58 News

llm-evaluation spatial-reasoning sokoban benchmark zero-shot model-comparison formatting-constraints

Summary

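A benchmark evaluated LLMs on a strict Sokoban puzzle with formatting constraints. Only ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking succeeded; the other models failed through illegal moves or formatting errors.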

I recently tested how well modern large language models (LLMs) handle spatial geometry and logical reasoning in a zero-shot setting. To rule out lucky guessing, I used a custom **Sokoban** map with extremely strict formatting constraints (no chain of thought, raw direction commands only). The results show a wide gap between the top-tier closed-source models and the rest.

---

### 📊 Test Results

Here is how each model performed when attempting the puzzle under the strict layout constraints:

#### ✅ Pass (solved + perfect formatting)

* **ChatGPT**
* **Qwen3.7-max**
* **Gemini 3.5-thinking**

#### 🔴 Fail (illegal moves, deadlocks, or formatting breakdown)

* **Gemini 3.5-flash**
* **Gemini 3.1 Pro**
* **Qwen3.7-plus** (fast/thinking modes)
* **Qwen3.6-plus**
* **Qwen3.6-35B-A3B**
* **GLM-5**
* **Gemma4-26B-A4B**

*(Note: Claude models were not included in this test due to account access restrictions.)*

---

### 📝 Test Prompt Used

You can copy the prompt below to test how other models handle spatial tracking:

```text
You are a perfect Sokoban automatic solver. Based on the standard XSB format character map provided below, calculate the sequence of moves required to push all boxes ($) to their respective goals (. or +).

1. Symbol Definitions:
# : Wall
(Space) : Floor
@ : Player
$ : Box (not on goal)
. : Goal (empty)
* : Box on Goal
+ : Player on Goal

2. Core Movement Rules:
- The player moves one step at a time to an adjacent floor: UP, DOWN, LEFT, or RIGHT.
- The player can only push a single box; the player cannot pull boxes, nor can they push two consecutive boxes at once.
- Avoid pushing boxes into corners/deadlocks that make the level unsolvable.

3. [Extremely Strict] Output Format Requirements:
Perform all path deductions within your internal state machine or mental simulation.
- The final result [MUST ONLY] consist of a sequence of these four uppercase words: UP, DOWN, LEFT, RIGHT.
- All steps must be output on a single line, strictly separated by English commas (,). [DO NOT] include spaces and [DO NOT] include newlines.
- The entire response [IS STRICTLY FORBIDDEN] from containing any introductory text, concluding remarks, Chain of Thought (CoT), extra punctuation (except the commas between steps), or any characters other than these four words.

Correct Output Example Format:
UP,UP,LEFT,DOWN,RIGHT,RIGHT,DOWN

4. Level Map Data to be Solved:
[
" ###",
" ## # ####",
" ## ### #",
"## $ #",
"# @$ # #",
"### $### #",
" # #.. #",
" ## ##.# ##",
" # ##",
" # ##",
" #######"
]
```
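The post does not say how responses were graded, so the following is only a minimal sketch of one way to check a model's answer against the prompt's rules, written in Python. The names `parse_moves`, `parse_level`, `check_solution`, and `TOY_LEVEL` are hypothetical, and the three-row toy level is an illustration, not the map from the prompt. The sketch assumes the level is fully enclosed by walls (cells outside the listed rows are treated as floor) and rejects anything that deviates from the comma-separated format, including a trailing newline.

```python
import re

# Strict format from the prompt: uppercase words joined by commas,
# with no spaces, newlines, or any other characters.
MOVE_RE = re.compile(r"(?:UP|DOWN|LEFT|RIGHT)(?:,(?:UP|DOWN|LEFT|RIGHT))*")
DELTAS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def parse_moves(response):
    """Return the list of moves, or None if the response breaks the format."""
    if MOVE_RE.fullmatch(response) is None:
        return None
    return response.split(",")


def parse_level(rows):
    """Split an XSB map into walls, goals, boxes, and the player position."""
    walls, goals, boxes, player = set(), set(), set(), None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in ".*+":   # empty goal, box on goal, player on goal
                goals.add((r, c))
            if ch in "$*":    # box, box on goal
                boxes.add((r, c))
            if ch in "@+":    # player, player on goal
                player = (r, c)
    return walls, goals, boxes, player


def check_solution(rows, response):
    """Classify a response: format_error, illegal_move, unsolved, or solved."""
    moves = parse_moves(response)
    if moves is None:
        return "format_error"
    walls, goals, boxes, player = parse_level(rows)
    for move in moves:
        dr, dc = DELTAS[move]
        target = (player[0] + dr, player[1] + dc)
        if target in walls:
            return "illegal_move"          # walked into a wall
        if target in boxes:
            beyond = (target[0] + dr, target[1] + dc)
            if beyond in walls or beyond in boxes:
                return "illegal_move"      # box blocked by a wall or another box
            boxes.remove(target)
            boxes.add(beyond)
        player = target
    return "solved" if boxes == goals else "unsolved"


# Hypothetical toy level (not the map from the post): one box, one goal.
TOY_LEVEL = [
    "#####",
    "#@$.#",
    "#####",
]
print(check_solution(TOY_LEVEL, "RIGHT"))
```

Separating the format check from the move simulation mirrors the failure categories in the results list: a response can fail on formatting before any move is evaluated, or pass formatting and still contain an illegal push. This sketch does not detect deadlocks directly; a deadlocked sequence simply ends as `unsolved`.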
