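SWE-rebench leaderboard update: new models including GLM-5.2, Qwen3.6-27B, Qwen3.6-35B-A3B, and Gemma 4 31B, plus an improved UI
Summary
The SWE-rebench leaderboard has been updated with new models, including GLM-5.2, Qwen3.6, and Gemma 4 31B, along with an improved UI. It ranks models by their performance on software engineering tasks.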
Cached: 2026/07/01 16:17
# SWE-rebench Leaderboard
Source: https://swe-rebench.com/
| Rank | Model | Resolved rate | Pass@5 | Cost per problem | Tokens per problem | Cached tokens |
|---|---|---|---|---|---|---|
| 1 | | 62.7% ± 0.91% | 70.0% | $2.25 | 2,120,660 | 90.0% |
| 2 | | 61.6% ± 0.64% | 72.7% | $1.84 | 1,866,497 | 91.6% |
| 3 | | 60.4% ± 1.37% | 71.8% | $1.75 | 1,898,131 | 92.5% |
| 4 | | 59.6% ± 1.98% | 72.7% | $1.74 | 1,878,248 | 93.6% |
| 5 | OpenAI gpt-5.5-2026-04-23-medium | 58.9% ± 0.78% | 70.0% | $0.98 | 708,418 | 83.5% |
| 6 | | 56.5% ± 1.20% | 67.3% | $2.02 | 2,479,387 | 95.3% |
| 7 | OpenAI gpt-5.4-2026-03-05-medium | 54.9% ± 1.02% | 70.9% | $0.60 | 834,452 | 83.5% |
| 8 | | 53.1% ± 1.45% | 66.4% | $1.32 | 1,526,135 | 94.2% |
| 9 | | 53.0% ± 0.53% | 64.5% | $0.23 | 1,031,653 | 98.7% |
| 10 | | 51.3% ± 0.55% | 63.6% | $1.29 | 2,644,577 | 95.6% |
| 11 | | 51.1% ± 1.20% | 66.4% | $0.75 | 1,545,445 | 80.1% |
| 12 | | 51.1% ± 1.13% | 71.8% | $0.75 | 2,623,456 | 87.0% |
| 13 | | 50.7% ± 0.93% | 65.5% | $0.94 | 2,664,001 | 91.8% |
| 14 | | 49.5% ± 0.98% | 61.8% | $0.77 | 1,848,593 | 75.7% |
| 15 | | 47.8% ± 1.37% | 60.9% | $1.53 | 1,828,649 | 93.6% |
| 16 | | 46.5% ± 1.27% | 64.5% | $0.61 | 2,466,977 | 90.4% |
| 17 | | 45.6% ± 1.27% | 67.3% | $1.06 | 6,885,818 | 93.5% |
| 18 | | 42.7% ± 1.29% | 61.8% | $0.22 | 2,247,891 | 76.9% |
| 19 | | 42.4% ± 0.84% | 61.8% | $0.12 | 2,586,998 | 88.6% |
| 20 | | 38.4% ± 0.97% | 57.3% | $0.07 | 2,996,077 | 95.5% |
| 21 | | 38.2% ± 0.86% | 59.1% | $0.39 | 2,256,182 | 86.4% |
| 22 | | 36.5% ± 0.45% | 50.9% | $0.56 | 1,875,624 | 14.2% |
| 23 | | 33.8% ± 0.93% | 54.5% | $0.18 | 2,229,925 | 78.4% |
| 24 | | 16.5% ± 1.13% | 37.3% | $0.32 | 2,238,420 | 69.6% |
Ranks 25 to 111 are listed with N/A in every column. The named entries among them are:

- 36: Mistral Devstral-2-123B-Instruct-2512
- 37: Mistral Devstral-Small-2-24B-Instruct-2512
- 44: Gemini gemini-2.5-flash-preview-05-20 no-thinking
- 45: Gemini gemini-2.5-flash-preview-05-20 no-thinking
- 63: OpenAI gpt-5-mini-2025-08-07-high
- 64: OpenAI gpt-5-mini-2025-08-07-medium
- 67: OpenAI gpt-5.2-2025-12-11-medium
- 84: Meta Llama-4-Maverick-17B-128E-Instruct
- 85: Meta Llama-4-Scout-17B-16E-Instruct
- 93: Qwen Qwen2.5-Coder-32B-Instruct
- 95: Qwen Qwen3-235B-A22B no-thinking
- 97: Qwen Qwen3-235B-A22B-Instruct-2507
- 98: Qwen Qwen3-235B-A22B-Thinking-2507
- 99: Qwen Qwen3-30B-A3B-Instruct-2507
- 100: Qwen Qwen3-30B-A3B-Thinking-2507
- 104: Qwen Qwen3-Coder-30B-A3B-Instruct
- 105: Qwen Qwen3-Coder-480B-A35B-Instruct
- 107: Qwen Qwen3-Next-80B-A3B-Instruct
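When reading the scored entries, it can help to check whether the gap between two scores is large relative to their ± values. The following is a minimal sketch, not the leaderboard's own method. It assumes that each ± figure is a standard error and that the two scores are independent; the page itself does not state either. The function name `gap_in_standard_errors` is hypothetical, and the numbers are copied from ranks 1, 2, 5, and 7 above.

```python
import math


def gap_in_standard_errors(score_a, err_a, score_b, err_b):
    """Return (score_a - score_b) divided by the combined error.

    Assumption: err_a and err_b are standard errors of independent
    estimates, so they combine as the square root of the sum of squares.
    """
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return (score_a - score_b) / combined


# Scores and ± values in percentage points, taken from the leaderboard.
rank1_vs_rank2 = gap_in_standard_errors(62.7, 0.91, 61.6, 0.64)
rank5_vs_rank7 = gap_in_standard_errors(58.9, 0.78, 54.9, 1.02)

print(f"rank 1 vs rank 2: {rank1_vs_rank2:.2f} combined standard errors")
print(f"rank 5 vs rank 7: {rank5_vs_rank7:.2f} combined standard errors")
```

Under these assumptions, the gap between ranks 1 and 2 comes out at about one combined standard error, while the gap between ranks 5 and 7 is about three, so the ordering of the top two entries is much less firm than the ordering of the two OpenAI entries.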