@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…
Summary
The user tested Qwen3.7-Max and believes it matches top models like Claude 4.6 and Gemini 3.1 Pro in frontend, computational, and agent capabilities. Its reasoning ability has improved significantly, and with a monthly release cadence it has become a top-tier domestic model.
Cached at: 05/25/26, 04:55 PM
Another breakthrough for a domestic model, now on par with top models like Claude 4.6 and Gemini 3.1 Pro.
Just finished testing Qwen3.7-Max — here are a few honest impressions.
As soon as the API launched last night, I topped up my account and picked three test tasks (see video) to evaluate Qwen3.7-Max’s frontend capability, computational ability, and agent capability. In my opinion, it can truly be called the best domestic model.
When I previously ran the same tasks on DeepSeek-v4 Pro and Kimi 2.6, both had lower single-run completion rates than Qwen3.7-Max. My subjective ranking is Qwen3.7-Max > Kimi 2.6 > DeepSeek-v4 Pro. Qwen has also overtaken Claude Opus 4.6 on the Terminal-Bench leaderboard this time, which matches my hands-on experience.
For reasoning ability, I threw a few Olympiad-style math problems and some HMMT questions at it. Its accuracy isn’t the absolute highest, but it’s clearly a tier above version 3.6, which I tested last month. One interesting detail: when it doesn’t know an answer, it honestly admits uncertainty rather than fabricating a plausible but wrong response, very much like Claude.
Another thing: Qwen’s iteration speed is staggering. Its voice on Twitter may not be as loud as Kimi’s or DeepSeek’s, but Qwen released 3.5 in March, 3.6 in April, and now 3.7 in May, so it’s already on a monthly cadence. And each iteration brings noticeable improvements. It’s now firmly in the top tier.
Overseas, on OpenRouter, Qwen3.6-Plus just broke the platform’s record with 1.4 trillion tokens in API calls in a single day. Developers are voting with their wallets, and real money speaks.
This generation of Qwen clearly leans toward agentic capabilities. Under extreme stress testing, long-horizon tasks ran for 35 hours without crashing, and cross-agent framework compatibility is significantly better than the previous generation.
See the video for the specific tests.
Similar Articles
@intheworldofai: Qwen 3.7-Max is genuinely one of the most impressive agentic coding models I’ve tested in a while. I had it generate a …
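Alibaba has released Qwen 3.7 Max, a flagship coding model designed for the agent era. The model stands out in long-horizon autonomous execution, frontend generation, and 3D scene construction. It matches or even surpasses top closed-source models on several benchmarks, making it a Chinese model close to the frontier.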
@zhixianio: After receiving the new machine, I began an 'ascetic' practice of forcing myself to use local models for common tasks. I thought it would be painful, but both speed and quality greatly exceeded my expectations: Model: Qwen3.6-35B-A3B-oQ6-fp16-mtp, Running: oMLX, with N…
The author uses the Qwen3.6-35B-A3B model with oMLX on a new local machine for daily tasks and finds that both speed and quality far exceed expectations, even outperforming remote LLMs in PA and coding scenarios. This demonstrates a significant improvement in on-device AI capabilities.
@WEB3_furture: COOL! Someone took the newly released Qwen 3.7-Max, Claude Opus 4.7, and GPT-5.5 for an Agent loop comparison: letting the model write its own Tetris bot, test it, and directly PK after 10 consecutive iterations. Results: Qwen 3.7-Max: +$…
Someone ran an agent-loop comparison of Qwen 3.7-Max, Claude Opus 4.7, and GPT-5.5, letting each model write its own Tetris bot and iterate for 10 rounds before the bots competed. The results show Qwen 3.7-Max leading in both performance and cost.
@sitinme: A 26M parameter model can do Function Call, and is even stronger than Qwen-0.6B? This team's out-of-the-box approach is too wild! Nowadays, large models have ever-growing parameter counts, but one question has never been seriously considered: does calling a tool really need hundreds of billions of parameters? Think about it, when you say 'Check today's...'
The Cactus team distilled Gemini 3.1 into a specialized model called Needle with only 26M parameters, specifically for Function Call. Its performance surpasses Qwen-0.6B, demonstrating the potential of small models in tool calling scenarios.
@jakevin7: An interesting thing. The DeepSeek V4 technical report conducted a comprehensive evaluation of all major LLMs, concluding that Gemini 3.1 Pro has the strongest world knowledge among all models. Not GPT, not Claude, but Gemini. But when people use Gemini...
According to the DeepSeek V4 technical report's evaluation of mainstream LLMs, Gemini 3.1 Pro is considered to have the strongest world knowledge, but users generally find it hard to use because the model does not proactively use search tools.