Can LLMs follow strict 2D spatial constraints? (Tested with Sokoban)

Reddit r/LocalLLaMA News

Summary

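A benchmark evaluated LLMs on a strict Sokoban puzzle with format constraints and found that only ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking succeeded, while the other models failed due to illegal moves or format errors.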

I recently tested how well modern large language models (LLMs) handle spatial geometry and logical reasoning in a zero-shot setting. To rule out lucky guesses, I used a custom **Sokoban** map with extremely strict format constraints (no chain of thought allowed, only raw direction commands as output). The results show a huge gap between the top closed-source models and the rest.

---

### 📊 Test Results

Here is how each model performed when trying to solve the puzzle while strictly respecting the layout constraints:

#### ✅ Passed (solved the puzzle + perfect format)

* **ChatGPT**
* **Qwen3.7-max**
* **Gemini 3.5-thinking**

#### 🔴 Failed (illegal moves, deadlock, or format collapse)

* **Gemini 3.5-flash**
* **Gemini 3.1 Pro**
* **Qwen3.7-plus** (fast/thinking modes)
* **Qwen3.6-plus**
* **Qwen3.6-35B-A3B**
* **GLM-5**
* **Gemma4-26B-A4B**

*(Note: Claude models were not included in this test due to account access restrictions.)*

---

### 📝 Test Prompt Used

You can copy the prompt below to test how other models handle spatial tracking:

```text
You are a perfect Sokoban automatic solver. Based on the standard XSB format character map provided below, calculate the sequence of moves required to push all boxes ($) to their respective goals (. or +).

1. Symbol Definitions:
# : Wall
(Space) : Floor
@ : Player
$ : Box (not on goal)
. : Goal (empty)
* : Box on Goal
+ : Player on Goal

2. Core Movement Rules:
- The player moves one step at a time to an adjacent floor: UP, DOWN, LEFT, or RIGHT.
- The player can only push a single box; the player cannot pull boxes, nor can they push two consecutive boxes at once.
- Avoid pushing boxes into corners/deadlocks that make the level unsolvable.

3. [Extremely Strict] Output Format Requirements:
Perform all path deductions within your internal state machine or mental simulation.
- The final result [MUST ONLY] consist of a sequence of these four uppercase words: UP, DOWN, LEFT, RIGHT.
- All steps must be output on a single line, strictly separated by English commas (,). [DO NOT] include spaces and [DO NOT] include newlines.
- The entire response [IS STRICTLY FORBIDDEN] from containing any introductory text, concluding remarks, Chain of Thought (CoT), extra punctuation (except the commas between steps), or any characters other than these four words.

Correct Output Example Format:
UP,UP,LEFT,DOWN,RIGHT,RIGHT,DOWN

4. Level Map Data to be Solved:
[
" ###",
" ## # ####",
" ## ### #",
"## $ #",
"# @$ # #",
"### $### #",
" # #.. #",
" ## ##.# ##",
" # ##",
" # ##",
" #######"
]
```
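
If you want to score responses mechanically instead of by eye, something like the following rough sketch could work. It is not the harness used for the results above. It assumes the raw reply is available as a string and that the level is fully enclosed by walls; the names `check_solution`, `STRICT_OUTPUT`, and `TINY_LEVEL` are made up for illustration, and the tiny level is a stand-in, not the map from the post. It checks the comma-separated format first, then replays the moves under the push rules from the prompt. It does not detect deadlocks early; an unsolved run simply ends as "boxes not on goals".

```python
import re

# Strict format from the prompt: uppercase direction words separated by commas,
# on one line, with no spaces, newlines, or any other characters.
STRICT_OUTPUT = re.compile(r"(?:UP|DOWN|LEFT|RIGHT)(?:,(?:UP|DOWN|LEFT|RIGHT))*")

# (row delta, column delta) for each direction word.
DELTAS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def check_solution(level, response):
    """Return (ok, reason) for a raw model response against an XSB level."""
    if not STRICT_OUTPUT.fullmatch(response):
        return False, "format error"

    # Parse the XSB rows into sets of coordinates.
    walls, goals, boxes, player = set(), set(), set(), None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in ".*+":
                goals.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in "@+":
                player = (r, c)
    if player is None:
        return False, "level has no player"

    # Replay the moves: walking into a wall is illegal, and a box can only
    # be pushed if the cell behind it is neither a wall nor another box.
    for step, move in enumerate(response.split(","), start=1):
        dr, dc = DELTAS[move]
        target = (player[0] + dr, player[1] + dc)
        if target in walls:
            return False, f"illegal move at step {step}: wall"
        if target in boxes:
            beyond = (target[0] + dr, target[1] + dc)
            if beyond in walls or beyond in boxes:
                return False, f"illegal move at step {step}: blocked box"
            boxes.remove(target)
            boxes.add(beyond)
        player = target

    if boxes == goals:
        return True, "solved"
    return False, "boxes not on goals"


# Hypothetical one-push level, used only to exercise the checker.
TINY_LEVEL = [
    "#####",
    "#@$.#",
    "#####",
]

print(check_solution(TINY_LEVEL, "RIGHT"))
print(check_solution(TINY_LEVEL, "RIGHT, RIGHT"))
print(check_solution(TINY_LEVEL, "LEFT"))
```

The format check uses `fullmatch`, so even a trailing newline counts as a failure. That matches the prompt's wording, but you may want to strip whitespace first if your API client appends one.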
