I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Reddit r/LocalLLaMA 04/20/26, 09:01 PM News

benchmark apple-silicon code-generation local-llms qwen gemma moe

Summary

A developer benchmarked 21 local LLMs on a MacBook Air M5 using HumanEval+ and found that Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Gemma 4 models scored far below expectations (31.1% for the 31B), possibly due to Q4_K_M quantization effects.

There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which model writes correct code and how fast does it run on real hardware, tested under identical conditions so the results are directly comparable. No cherry-picked prompts, no subjective impressions, just pass@1 on 164 coding problems with an expanded test suite.

# Full Results Table

|**Model**|**HumanEval+**|**Speed (tok/s)**|**VRAM**|
|:-|:-|:-|:-|
|Qwen 3.6 35B-A3B (MoE)|89.6%|16.9|20.1 GB|
|Qwen 2.5 Coder 32B|87.2%|2.5|18.6 GB|
|Qwen 2.5 Coder 14B|86.6%|5.9|8.5 GB|
|Qwen 2.5 Coder 7B|84.2%|11.3|4.5 GB|
|Phi 4 14B|82.3%|5.3|8.6 GB|
|Devstral Small 24B|81.7%|3.5|13.5 GB|
|Gemma 3 27B|78.7%|3.0|15.6 GB|
|Mistral Small 3.1 24B|75.6%|3.6|13.5 GB|
|Gemma 3 12B|75.6%|5.7|7.0 GB|
|Phi 4 Mini 3.8B|70.7%|19.6|2.5 GB|
|Gemma 3 4B|64.6%|16.5|2.5 GB|
|Mistral Nemo 12B|64.6%|6.9|7.1 GB|
|Llama 3.1 8B|61.0%|10.8|4.7 GB|
|Llama 3.2 3B|60.4%|24.1|2.0 GB|
|Mistral 7B v0.3|37.2%|11.5|4.2 GB|
|Gemma 3 1B|34.2%|46.6|0.9 GB|
|Llama 3.2 1B|32.9%|59.4|0.9 GB|
|Gemma 4 31B|31.1%|5.5|18.6 GB|
|Gemma 4 E4B|14.6%|36.7|5.2 GB|
|Gemma 4 26B-A4B MoE|12.2%|16.2|16.1 GB|
|Gemma 4 E2B|9.2%|29.2|3.4 GB|

**Notable findings**

**Qwen 3.6 35B-A3B is the clear winner** at 89.6%, and the MoE architecture means it runs at 16.9 tok/s despite being nominally a 35B model. Active parameter count is what matters for speed; total parameter count is what matters for quality. This model threads that needle well.

**Best bang-for-RAM: Qwen 2.5 Coder 7B.** 84.2% at 11.3 tok/s in 4.5 GB. If you have 8 GB of RAM and want a daily coding assistant, this is probably your model.

**The Gemma 4 results are surprising and worth discussing.** Gemma 4 31B scores 31.1%, which is lower than Llama 3.2 1B (32.9%) and well below Gemma 3 27B (78.7%). The Gemma 4 MoE variant (26B-A4B) comes in at 12.2%. I ran these multiple times to confirm. The Q4_K_M quantization may be hitting the Gemma 4 architecture harder than others, or the HumanEval+ task distribution may not favor its strengths. Open to theories. (https://www.reddit.com/r/LocalLLaMA/s/2pgedDFBYt)

**Phi 4 Mini 3.8B is a sleeper pick** at 70.7% and 19.6 tok/s in 2.5 GB. If you need something fast and small that still writes reasonable code, it outperforms several much larger models.

# Methodology notes

* EvalPlus HumanEval+ was chosen over standard HumanEval because it adds more test cases per problem, reducing the chance of models passing by luck
* Each model evaluated in isolation (no concurrent processes)

Full writeup: [https://medium.com/@enescingoz/i-benchmarked-21-coding-models-on-a-macbook-air-heres-which-ones-actually-write-good-code-1a59441dee14](https://medium.com/@enescingoz/i-benchmarked-21-coding-models-on-a-macbook-air-heres-which-ones-actually-write-good-code-1a59441dee14)

GitHub repo (code + raw results): [https://github.com/enescingoz/mac-llm-bench](https://github.com/enescingoz/mac-llm-bench)

HuggingFace dataset: [https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon](https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon)

What model should I test next? I have a few slots open for the next run and want to prioritize based on what this community is actually using. Also, if you have a Mac and want to contribute your own results on different hardware (M3, M4 Pro, M4 Max, etc.), the framework is fully open source and contributions are welcome.


