model-benchmarks

#model-benchmarks

DeepSeek-V4-Flash-0731 now far surpassing the DeepSeek-V4-Pro-Preview in benchmarks

Reddit r/LocalLLaMA ↗ · 14h ago

DeepSeek's new V4-Flash-0731 model is now far outperforming the V4-Pro-Preview in benchmarks, marking a significant improvement in the model family.

@enginenerdx: same model, three medals: sonnet 5 scored 14/42 on the IMO in a web ui, 21 in claude code, 35 in a structured multi-age…

X AI KOLs Timeline ↗ · 6d ago

A comparison shows that a single model's score on the 2026 IMO varies dramatically with the evaluation harness: Sonnet 5 scored 14/42 in a web UI, 21 in Claude Code, and 35 in a structured multi-agent setup. The gap suggests that much of the current gain at the frontier comes from better orchestration.

@omarsar0: Just had a great discussion on dynamic workflows. Rough notes: - applies to a very small set of use cases - think of it…

X AI KOLs Following ↗ · 2026-06-25

Discussion of dynamic workflows for test-time compute, including their limited use cases, benefits for research experiments, and the need for better benchmarks. Mentions models like Mythos and Opus 4.8 for agent orchestration.

I stopped trusting model benchmarks and started running my own eval set, here is what changed [D]

Reddit r/MachineLearning ↗ · 2026-06-25

The author describes losing faith in public AI model benchmarks due to vendor-created metrics, self-reported parameters, and lack of independent verification, and advocates for building custom evaluation sets from real production traffic to make more relevant model comparisons.

Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room

Reddit r/LocalLLaMA ↗ · 2026-05-20

Qwen3.7 Max ranks 5th on Artificial Analysis benchmarks, matching GPT-5.4 and outperforming Gemini 3.5 Flash, while Qwen3.6 27B trails significantly.

@0xLogicrw: Google DeepMind researcher Lun Wang announces departure, and in a long post completely dismisses the current AI evaluation approach. The current evaluation systems are all 'fighting the last war' — they can only passively test capabilities the model already possesses, and have no way to predict what new abilities the next generation of models will suddenly evolve. Compared to data, …

X AI KOLs Timeline ↗ · 2026-05-18

Google DeepMind researcher Lun Wang leaves the company and writes a post criticizing the current AI evaluation system, arguing that it lags behind model evolution and cannot predict new capabilities, leaving the industry in a state of 'flying blind'.
