A user compares open-source and frontier AI models on a single-file HTML canvas car driving animation task, presenting results and a gallery.
Hey y'all, I was inspired by this post: [https://www.reddit.com/r/LocalLLaMA/comments/1tf3p6c/local\_qwen\_36\_vs\_frontier\_models\_on\_a\_coding/](https://www.reddit.com/r/LocalLLaMA/comments/1tf3p6c/local_qwen_36_vs_frontier_models_on_a_coding/)

I know this isn't exactly local, but I wanted to share what I tested and what each model delivered. I ran the same single-file Canvas prompt across multiple models using my harness ( [https://github.com/AidenGeunGeun/OpenCodeOrchestra](https://github.com/AidenGeunGeun/OpenCodeOrchestra) ). The models were able to use whatever tools they had access to; some used auditor models and some did not. There are some clear winners and some ambiguous results.

https://preview.redd.it/2ehkh47vfo1h1.png?width=2972&format=png&auto=webp&s=d1e643f7b8bd0c3bab241838731848109359f1e1

**The results are here:** [**https://aidengeungeun.github.io/oco-canvas-car-scene-compare/**](https://aidengeungeun.github.io/oco-canvas-car-scene-compare/)

Setup:

* Same prompt for every run
* One isolated Orchestrator per model
* Highest available thinking/effort setting for each model
* Output target: one standalone HTML file, no libraries, no external assets
* Task: realistic side-view car driving scene with parallax scenery, spinning wheels, subtle body motion, cinematic lighting, and seamless looping

Models included:

* GPT-5.5 xhigh
* GPT-5.4 xhigh
* Claude Opus 4.7 (max effort)
* Claude Opus 4.6 (max effort)
* Claude Sonnet 4.6 (high effort, max doesn't exist on Sonnet)
* Kimi K2.6
* DeepSeek V4 Pro
* DeepSeek V4 Flash
* GLM-5.1
* MiniMax M2.7
* Qwen 3.6 Plus
* Grok 4.3

I used the highest thinking setting available for each model. tok/s and generation time were not measured.
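For anyone wondering what the "parallax scenery", "spinning wheels", and "seamless looping" parts of the task involve, here is a rough sketch of one way to get a clean loop. It is my own illustration, not taken from the prompt or from any model's output, and the names (`Layer`, `layerOffset`, `wheelAngle`, `repeats`, `turnsPerLoop`) are hypothetical. The idea is that every layer scrolls a whole number of tiles per loop and the wheels make a whole number of turns, so the last frame lines up with the first.

```typescript
// Hypothetical helpers for illustration only; not from any model's output.
interface Layer {
  tileWidth: number; // width in px of the repeating scenery tile
  repeats: number;   // whole number of tiles scrolled per loop (more = closer layer)
}

// Horizontal scroll offset (px) of a layer at time t, for a loop of loopSeconds.
// Because `repeats` is an integer, the offset returns to 0 at the loop boundary.
function layerOffset(layer: Layer, t: number, loopSeconds: number): number {
  const phase = (t % loopSeconds) / loopSeconds; // 0 <= phase < 1
  return (phase * layer.repeats * layer.tileWidth) % layer.tileWidth;
}

// Wheel rotation (radians) at time t. An integer turnsPerLoop means the
// wheel is back at angle 0 when the loop restarts.
function wheelAngle(t: number, loopSeconds: number, turnsPerLoop: number): number {
  const phase = (t % loopSeconds) / loopSeconds;
  return (phase * turnsPerLoop * 2 * Math.PI) % (2 * Math.PI);
}
```

In a real page, `t` would typically come from the `requestAnimationFrame` timestamp (milliseconds, so divided by 1000), and distant layers would get a smaller `repeats` than near ones to produce the parallax effect.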
**Gallery:** [**https://aidengeungeun.github.io/oco-canvas-car-scene-compare/**](https://aidengeungeun.github.io/oco-canvas-car-scene-compare/)

Source: [https://github.com/AidenGeunGeun/oco-canvas-car-scene-compare](https://github.com/AidenGeunGeun/oco-canvas-car-scene-compare)

We know that models are capable of doing this kind of work, but I was wondering how a wide variety of open-weights models compare to frontier models, especially the ones that are used often. I tried to use MiMo-V2.5-pro too, but that model had billing issues with the OpenCode Go subscription, so I couldn't use it.

Take a look!
A user compares local quantized Qwen 3.6 models against frontier models on a single-file HTML canvas driving animation task, finding that the local 27B Qwen quant delivers competitive results with better parallax and motion than some frontier outputs.
This paper argues that traditional benchmarks both overestimate and underestimate frontier AI capabilities, and proposes 'open-world evaluations'—long-horizon, real-world tasks assessed qualitatively—as a complementary approach. The CRUX project is introduced, with a demonstration where an AI agent successfully published an iOS app to the App Store with minimal intervention.
Discussion on how routing and post-training open-source models can outperform frontier models in accuracy, speed, and cost, with Harvey's partnership with Fireworks AI demonstrating hybrid legal agents beating frontier models on quality and cost.
Developer releases an open-source medieval-themed dashboard that visualizes multiple Claude Code agents as 2D village characters to simplify real-time tracking of parallel coding sessions.