@kapicode: I've been using Claude as the "human" prompting @opencode to rebuild reference projects, evaluating four LLMs on the sa…

X AI KOLs Following Tools

Summary

An evaluation of four LLMs (two Qwen variants, MiniMax M2.7, and GLM 5.1), with Claude acting as the prompter for the OpenCode agent, finds that a smaller local model (Qwen 3.6 27B on a 3090) outperforms a larger REAP-pruned model (Qwen 3.5 122B-A10B) on coding quality, speed, and reliability.

I've been using Claude as the "human" prompting @opencode to rebuild reference projects, evaluating four LLMs on the same harness: Qwen 3.6 27B Q4_K_M (3090, llama.cpp), Qwen 3.5 122B-A10B REAP-20 Q4_K_M (Strix Halo, LM Studio), MiniMax M2.7, and GLM 5.1 (the latter two via API).

Three top-level findings:

A 3090 keeps up with flagship APIs on agentic coding. Qwen 27B (local) and GLM 5.1 (API) ran Rust CLI cycles in ~3 min and rated Q4/5 quality on the same matrix. Within that quality band, llama.cpp on a 3090 is enough.

Smaller-and-Q4'd beats bigger-and-REAP-pruned-then-Q4'd. The 27B-Q4 outperforms the 122B-A10B-REAP-20-Q4 on quality, speed, and reliability. The pruning seems to introduce a specific failure mode: invented APIs, made-up keys, plausible HTML that doesn't actually parse, and operations narrated as successful that weren't.

Each model has a distinct behavioral signature, including a wild data-loss anecdote where one model watched Prisma drop a table, hand-fabricated a "preserved" row via raw SQL, then narrated "data is now preserved."

Specifics in the reply.

Similar Articles

Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B

Reddit r/LocalLLaMA

The author tests multiple coding agent harnesses (GitHub Copilot, Pi, Claude Code, OpenCode) using the same Qwen3.6 27B model, finding that harness design significantly impacts performance, with OpenCode excelling at web searches and web development, and GitHub Copilot struggling with file editing tools.