@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…


Summary

The author tested Qwen3.7-Max and believes it matches top models such as Claude 4.6 and Gemini 3.1 Pro in frontend, computation, and agent capabilities. Its reasoning has improved noticeably over version 3.6, and with a monthly release cadence it has become a first-tier domestic model.


Cached at: 05/25/26, 04:55 PM

Another breakthrough for a domestic model, now on par with top models like Claude 4.6 and Gemini 3.1 Pro.

Just finished testing Qwen3.7-Max — here are a few honest impressions.

As soon as the API launched last night, I topped up my account and picked three test tasks (see video) to evaluate Qwen3.7-Max’s frontend capability, computational ability, and agent capability. In my opinion, it can truly be called the best domestic model.

When I previously ran the same tasks on DeepSeek-v4 Pro and Kimi 2.6, both had lower single-run completion rates than Qwen3.7-Max. My subjective ranking is Qwen3.7-Max > Kimi 2.6 > DeepSeek-v4 Pro. Qwen has also overtaken Claude Opus 4.6 on the Terminal-Bench leaderboard this time, which matches my hands-on experience.
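For readers who want to reproduce this kind of comparison, here is a minimal sketch of how a single-run completion rate could be computed and used to rank models. The function names, the model labels, and the pass/fail values are all hypothetical placeholders, not the author's data or method; the post does not say how the runs were scored.

```python
from typing import Dict, List


def single_run_completion_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed on the first (and only) attempt."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


def rank_models(results: Dict[str, List[bool]]) -> List[str]:
    """Sort model names from highest to lowest single-run completion rate."""
    return sorted(
        results,
        key=lambda name: single_run_completion_rate(results[name]),
        reverse=True,
    )


# Hypothetical placeholder outcomes for three tasks; NOT results from the post.
example_results = {
    "model_a": [True, True, True],
    "model_b": [True, True, False],
    "model_c": [True, False, False],
}

ranking = rank_models(example_results)
```

With only three tasks per model, a single failure moves the rate by a third, so a ranking like this is best read as a rough impression rather than a benchmark.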

For reasoning ability, I threw a few Olympiad-style math problems and some HMMT questions at it. Its accuracy isn’t the absolute highest, but it’s clearly a tier above what I tested with version 3.6 last month. One interesting detail: when it doesn’t know an answer, it honestly admits uncertainty rather than fabricating a plausible but wrong response — very much like Claude.

Another thing: Qwen’s iteration speed is staggering. It may not have as loud a voice on Twitter as Kimi or DeepSeek, but Qwen released 3.5 in March, 3.6 in April, and now 3.7 in May, so it is already on a monthly cadence. Each iteration brings noticeable improvements, and it is now firmly in the top tier.

On OpenRouter, an overseas platform, Qwen3.6-Plus just broke the platform record at 1.4 trillion tokens per day. Developers are voting with real money.

This generation of Qwen clearly leans toward agentic capabilities. Under extreme stress testing, long-horizon tasks ran for 35 hours without crashing, and cross-agent framework compatibility is significantly better than the previous generation.

See the specific test video.
