(Interactive)OpenCode Racing Game Comparison Qwen3.6 35B vs Qwen3.5 122B vs Qwen3.5 27B vs Qwen3.5 4B vs Gemma 4 31B vs Gemma 4 26B vs Qwen3 Coder Next vs GLM 4.7 Flash

Reddit r/LocalLLaMA 04/21/26, 04:30 AM Models

benchmark opencode coding-agent multi-model racing-game interactive comparison

Summary

An informal benchmark comparing 8 AI models (Qwen3.6 35B, Qwen3.5 series, Gemma 4 series, GLM 4.7 Flash) in creating racing games via OpenCode/Playwright MCP, testing their coding agent capabilities and documenting various implementation quirks.

You can play them here: [https://fatheredpuma81.github.io/LLM\_Racing\_Games/](https://fatheredpuma81.github.io/LLM_Racing_Games/) This started out as a simple test for Qwen3 Coder Next vs Qwen3.5 4B because they have similar benchmark numbers and then I just kept trying other models and decided I might as well share it even if I'm not that happy with how I did it. Read the "How this works" in the top right if you want to know how it was but the TLDR is: Disabled vision, sent same initial prompt in Plan mode, enabled Playwright MCP and sent the same start prompt, and then spent 3 turns testing the games and pointing out what issues I saw to the LLMs. There's a ton of things I'd do differently if I ever got around to redoing this. Keeping and showing all 4 versions of the HTML for 1, not disabling Vision which hindered Qwen 27B a ton (it was only disabled for an apples to apples comparison between 4B and Coder), and idk I had a bunch more thoughts on it but I'm too tired to remember them. Some interesting notes: * Qwen3 Coder Next's game does appear to have a track but it's made up of invisible walls. * Gemma 4 31B and Qwen3.5 27B both output the full code on every turn while the rest all primarily edited the code. * Gemma 4 31B's game actually had a road at one point. * Qwen3.5 27B Accidentally disabling Playwright MCP on the final turn is what gave us a car that actually moves and steers at a decent speed. The only thing that really changed between the 1st HTML and last was it added trees. * Qwen3.5 27B is the only one with tires that turn. Not that you can see it. * Gemma 4 26B was the only one to add sound. * Gemma 4 26B added a Team Rocket car blasting off again when you touched a wall but then OpenCode more or less crashed in the middle of it so I had to roll back which resulted in the less interesting Sound version. * GLM 4.7 Flash and Gemma 4 26B were the only ones to spawn a subagent. GLM used it for research during Planning and Gemma used it to implement sound on the final turn. * Found out GLM 4.7 Flash can't do Q8\_0 K Cache Quantization without breaking. * Qwen3.5 4B installed its own version of Playwright using NPX and then it started using both on bugfix turn 2/3. * GLM 4.7 Flash failed its final output to a white screen so I jumped back a turn and asked it to output the code full again. So it only got 2 turns I guess? * Qwen3.6 35B's game actually regressed in a lot of ways from the start. There was no screen jitter, the track was a lot more narrow, and the hit boxes were spot on with the walls. The minimap was a lot more broken though I think it got confused between Minimap Track and physical track.

Original Article

(Interactive)OpenCode Racing Game Comparison Qwen3.6 35B vs Qwen3.5 122B vs Qwen3.5 27B vs Qwen3.5 4B vs Gemma 4 31B vs Gemma 4 26B vs Qwen3 Coder Next vs GLM 4.7 Flash

Similar Articles

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Gemma 4 beats Qwen 3.5 (UPDATE), and Qwen 3.6 27B + MiniMax M2.7 is the best OpenCode setup

I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090

Youtuber tries Qwen 3.5 35B, Qwen 3.6 35B, and Gemma 4 27b to reverse engineer some large JS, with good results for Qwen 3.6

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint

Submit Feedback

Similar Articles

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Gemma 4 beats Qwen 3.5 (UPDATE), and Qwen 3.6 27B + MiniMax M2.7 is the best OpenCode setup

I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090

Youtuber tries Qwen 3.5 35B, Qwen 3.6 35B, and Gemma 4 27b to reverse engineer some large JS, with good results for Qwen 3.6

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint
Qwen3.5-9B outperforms gemma-4-12b-it on 5 of 8 benchmarks despite having a smaller footprint, with gemma only slightly better at coding.