Open Source vs frontier models on a single-file HTML canvas driving animation - results

Reddit r/LocalLLaMA News

Summary

A user compares open-source and frontier AI models on a single-file HTML canvas car driving animation task, presenting results and a gallery.

Hey y'all, I was inspired by this post: [https://www.reddit.com/r/LocalLLaMA/comments/1tf3p6c/local\_qwen\_36\_vs\_frontier\_models\_on\_a\_coding/](https://www.reddit.com/r/LocalLLaMA/comments/1tf3p6c/local_qwen_36_vs_frontier_models_on_a_coding/)

I know this isn't exactly local, but I wanted to share what I tested and what each model delivered. I ran the same single-file Canvas prompt across multiple models using my harness ([https://github.com/AidenGeunGeun/OpenCodeOrchestra](https://github.com/AidenGeunGeun/OpenCodeOrchestra)). The models were able to use whatever tools they had access to; some used auditor models and some did not. There are some clear winners, and some results are ambiguous.

https://preview.redd.it/2ehkh47vfo1h1.png?width=2972&format=png&auto=webp&s=d1e643f7b8bd0c3bab241838731848109359f1e1

**The results are here:** [**https://aidengeungeun.github.io/oco-canvas-car-scene-compare/**](https://aidengeungeun.github.io/oco-canvas-car-scene-compare/)

Setup:

* Same prompt for every run
* One isolated Orchestrator per model
* Highest available thinking/effort setting for each model
* Output target: one standalone HTML file, no libraries, no external assets
* Task: realistic side-view car driving scene with parallax scenery, spinning wheels, subtle body motion, cinematic lighting, and seamless looping

Models included:

* GPT-5.5 xhigh
* GPT-5.4 xhigh
* Claude Opus 4.7 (max effort)
* Claude Opus 4.6 (max effort)
* Claude Sonnet 4.6 (high effort; max doesn't exist on Sonnet)
* Kimi K2.6
* DeepSeek V4 Pro
* DeepSeek V4 Flash
* GLM-5.1
* MiniMax M2.7
* Qwen 3.6 Plus
* Grok 4.3

I used the highest thinking setting available for each model. tok/s and generation time were not measured.
**Gallery:** [**https://aidengeungeun.github.io/oco-canvas-car-scene-compare/**](https://aidengeungeun.github.io/oco-canvas-car-scene-compare/)

Source: [https://github.com/AidenGeunGeun/oco-canvas-car-scene-compare](https://github.com/AidenGeunGeun/oco-canvas-car-scene-compare)

We know that models are capable of this kind of work, but I was wondering how a wide variety of open-weight models compare to frontier models, especially the ones that are used often. I tried to use MiMo-V2.5-pro too, but that model had billing issues with the OpenCode Go subscription, so I couldn't include it.

Take a look!
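
For readers wondering what "parallax scenery, spinning wheels, subtle body motion, and seamless looping" means in practice, here is a minimal TypeScript sketch of the timing math such a scene could use. It is not taken from the prompt or from any model's output; every name and constant (`LOOP_SECONDS`, `layerOffset`, `wheelAngle`, `bodyBob`) is hypothetical. The idea it illustrates is one common way to get a seamless loop: derive everything from a single phase in [0, 1) and give each moving part a whole number of cycles per loop.

```typescript
// Hypothetical sketch: names and constants are illustrative assumptions,
// not part of the original prompt or any model's submission.

/** Duration of one full animation loop in seconds (assumed value). */
const LOOP_SECONDS = 8;

/** Positive modulo, so negative inputs still wrap into [0, n). */
function wrap(value: number, n: number): number {
  return ((value % n) + n) % n;
}

/** Normalized loop phase in [0, 1) for an elapsed time in seconds. */
function loopPhase(elapsedSeconds: number): number {
  return wrap(elapsedSeconds, LOOP_SECONDS) / LOOP_SECONDS;
}

/**
 * Horizontal scroll offset (pixels) of a scenery layer whose artwork
 * repeats every `tileWidth` pixels. `tilesPerLoop` should be an integer so
 * the layer is back at its starting offset when the loop restarts.
 * Distant layers get fewer tiles per loop, so they scroll slower (parallax).
 */
function layerOffset(phase: number, tileWidth: number, tilesPerLoop: number): number {
  return wrap(phase * tilesPerLoop * tileWidth, tileWidth);
}

/** Wheel angle in radians; a whole number of turns per loop avoids a visible jump. */
function wheelAngle(phase: number, turnsPerLoop: number): number {
  return wrap(phase * turnsPerLoop * 2 * Math.PI, 2 * Math.PI);
}

/** Subtle vertical body bob in pixels; a sine with whole cycles per loop is seamless. */
function bodyBob(phase: number, cyclesPerLoop: number, amplitudePx: number): number {
  return Math.sin(phase * cyclesPerLoop * 2 * Math.PI) * amplitudePx;
}
```

In a browser, these values would typically be computed once per frame inside a `requestAnimationFrame` callback and fed to the 2D context, for example `ctx.translate(-layerOffset(...), 0)` for each scenery layer and `ctx.rotate(wheelAngle(...))` for each wheel. Keeping the functions free of DOM access makes the loop math easy to check in isolation.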