On SWEBench Pro, 68.5% of GPT 5.5's failures are caused by broken or incorrect test cases, which is 28.9% of the entire benchmark

Reddit r/ArtificialInteligence 2026/05/26 23:15 News

benchmark evaluation gpt swebench test-cases ai-reliability

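Summary

An analysis finds that 68.5% of GPT 5.5's failures on SWEBench Pro stem from broken or incorrect test cases, which amounts to 28.9% of the whole benchmark. Similar problems affect other major AI benchmarks, raising concerns about the accuracy of current evaluation methods.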

[https://deepswe.datacurve.ai/blog](https://deepswe.datacurve.ai/blog)

GPT 5.5's actual score should be 86.7%. Similar errors exist in other benchmarks, including:

* MMLU [https://arxiv.org/abs/2406.04127](https://arxiv.org/abs/2406.04127)
* ARC AGI [https://www.reddit.com/r/singularity/comments/1hjjj5c/comment/m37bw8p/](https://www.reddit.com/r/singularity/comments/1hjjj5c/comment/m37bw8p/)
* SpatialBench [https://x.com/YafahEdelman/status/2031178437243916509?s=20](https://x.com/YafahEdelman/status/2031178437243916509?s=20)
* HLE [https://www.futurehouse.org/research-announcements/hle-exam](https://www.futurehouse.org/research-announcements/hle-exam)
* SWEBench Verified [https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
* GPQA [https://epochai.substack.com/p/gpqa-diamond-whats-left](https://epochai.substack.com/p/gpqa-diamond-whats-left)
* FrontierMath: Tiers 1-4 (found by LLMs): [https://epoch.ai/frontiermath/tiers-1-4?view=graph&tab=release-date&tier=Core+%28Tiers+1-3%](https://epoch.ai/frontiermath/tiers-1-4?view=graph&tab=release-date&tier=Core+%28Tiers+1-3%29)

It looks like even the human experts who create benchmarks hallucinate. I guess that means humans aren't capable of reasoning or consciousness 😔

How long until LLMs get so good that we no longer know how to measure them accurately?
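For readers who want to check the arithmetic, here is a rough sketch of how the 86.7% figure appears to follow from the two headline percentages. It assumes that 28.9% is the share of all benchmark tasks that failed because of a broken test, and that "actual score" means counting every one of those tasks as a pass. Neither assumption is spelled out in the post, and the function and variable names are made up for illustration. The second number it prints, with broken tasks dropped from the benchmark entirely, is an alternative reading that the post does not use.

```python
# Figures quoted in the post, written as fractions.
BROKEN_SHARE_OF_FAILURES = 0.685   # share of GPT 5.5's failures blamed on broken tests
BROKEN_SHARE_OF_BENCHMARK = 0.289  # the same failures as a share of all tasks


def adjusted_scores(broken_of_failures: float, broken_of_benchmark: float) -> dict:
    """Derive implied and adjusted pass rates from the two quoted shares.

    Assumption (not stated in the post): both shares describe the same
    set of tasks, so dividing one by the other gives the failure rate.
    """
    # broken_of_benchmark = broken_of_failures * failure_rate
    failure_rate = broken_of_benchmark / broken_of_failures
    reported = 1.0 - failure_rate

    # Reading 1: every broken-test failure is counted as a pass.
    credited = reported + broken_of_benchmark

    # Reading 2 (alternative, not from the post): broken tasks are
    # removed from the benchmark before scoring.
    excluded = reported / (1.0 - broken_of_benchmark)

    return {
        "failure_rate": failure_rate,
        "reported": reported,
        "credited": credited,
        "excluded": excluded,
    }


scores = adjusted_scores(BROKEN_SHARE_OF_FAILURES, BROKEN_SHARE_OF_BENCHMARK)
for name, value in scores.items():
    print(f"{name}: {value:.1%}")
```

Under the first reading the credited score comes out at about 86.7%, matching the number in the post. The implied unadjusted score is a by-product of the division and is not quoted in the post itself.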

