I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Reddit r/LocalLLaMA News

Summary

A developer benchmarked 21 local LLMs on a MacBook Air M5 using HumanEval+ and found that Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best bang-for-RAM at 84.2% in 4.5 GB. Notably, the Gemma 4 models scored far below expectations (31.1% for the 31B), possibly due to Q4_K_M quantization effects.
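The "best for limited RAM" pick can be checked against the reported numbers. The sketch below is illustrative only: the rows are a subset copied from the post's results table, while `Result`, `best_under_budget`, `score_per_gb`, and the 5 GB budget (roughly what an 8 GB machine might spare for a model) are assumptions made for this example, not part of the author's framework.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    model: str
    score: float   # HumanEval+ pass@1, in percent
    speed: float   # generation speed, tokens per second
    mem_gb: float  # memory footprint as reported in the post


# Subset of rows copied from the post's results table.
RESULTS = [
    Result("Qwen 3.6 35B-A3B (MoE)", 89.6, 16.9, 20.1),
    Result("Qwen 2.5 Coder 32B", 87.2, 2.5, 18.6),
    Result("Qwen 2.5 Coder 14B", 86.6, 5.9, 8.5),
    Result("Qwen 2.5 Coder 7B", 84.2, 11.3, 4.5),
    Result("Phi 4 14B", 82.3, 5.3, 8.6),
    Result("Phi 4 Mini 3.8B", 70.7, 19.6, 2.5),
    Result("Llama 3.1 8B", 61.0, 10.8, 4.7),
    Result("Llama 3.2 3B", 60.4, 24.1, 2.0),
    Result("Llama 3.2 1B", 32.9, 59.4, 0.9),
]


def best_under_budget(results, budget_gb: float,
                      min_speed: float = 0.0) -> Optional[Result]:
    """Return the highest-scoring model that fits the memory budget, or None."""
    fits = [r for r in results
            if r.mem_gb <= budget_gb and r.speed >= min_speed]
    return max(fits, key=lambda r: r.score, default=None)


def score_per_gb(r: Result) -> float:
    """Naive ratio: percentage points of HumanEval+ per GB of memory."""
    return r.score / r.mem_gb


if __name__ == "__main__":
    # Hypothetical 5 GB budget.
    print(best_under_budget(RESULTS, budget_gb=5.0).model)
    # No practical memory limit among the listed models.
    print(best_under_budget(RESULTS, budget_gb=24.0).model)
    # A raw score-per-GB ratio favors the smallest model instead.
    print(max(RESULTS, key=score_per_gb).model)
```

With these rows, the 5 GB budget selects Qwen 2.5 Coder 7B, while a raw score-per-GB ratio would select Llama 3.2 1B. "Bang-for-RAM" is therefore better read as the highest score under a memory cap than as a literal ratio.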

There are plenty of "bro trust me, this model is better for coding" discussions out there. I wanted to replace the vibes with actual data: which model writes correct code and how fast does it run on real hardware, tested under identical conditions so the results are directly comparable. No cherry-picked prompts, no subjective impressions, just pass@1 on 164 coding problems with an expanded test suite.

# Full Results Table

| **Model** | **HumanEval+** | **Speed (tok/s)** | **VRAM** |
|:-|:-|:-|:-|
| Qwen 3.6 35B-A3B (MoE) | 89.6% | 16.9 | 20.1 GB |
| Qwen 2.5 Coder 32B | 87.2% | 2.5 | 18.6 GB |
| Qwen 2.5 Coder 14B | 86.6% | 5.9 | 8.5 GB |
| Qwen 2.5 Coder 7B | 84.2% | 11.3 | 4.5 GB |
| Phi 4 14B | 82.3% | 5.3 | 8.6 GB |
| Devstral Small 24B | 81.7% | 3.5 | 13.5 GB |
| Gemma 3 27B | 78.7% | 3.0 | 15.6 GB |
| Mistral Small 3.1 24B | 75.6% | 3.6 | 13.5 GB |
| Gemma 3 12B | 75.6% | 5.7 | 7.0 GB |
| Phi 4 Mini 3.8B | 70.7% | 19.6 | 2.5 GB |
| Gemma 3 4B | 64.6% | 16.5 | 2.5 GB |
| Mistral Nemo 12B | 64.6% | 6.9 | 7.1 GB |
| Llama 3.1 8B | 61.0% | 10.8 | 4.7 GB |
| Llama 3.2 3B | 60.4% | 24.1 | 2.0 GB |
| Mistral 7B v0.3 | 37.2% | 11.5 | 4.2 GB |
| Gemma 3 1B | 34.2% | 46.6 | 0.9 GB |
| Llama 3.2 1B | 32.9% | 59.4 | 0.9 GB |
| Gemma 4 31B | 31.1% | 5.5 | 18.6 GB |
| Gemma 4 E4B | 14.6% | 36.7 | 5.2 GB |
| Gemma 4 26B-A4B MoE | 12.2% | 16.2 | 16.1 GB |
| Gemma 4 E2B | 9.2% | 29.2 | 3.4 GB |

**Notable findings**

**Qwen 3.6 35B-A3B is the clear winner** at 89.6%, and the MoE architecture means it runs at 16.9 tok/s despite being nominally a 35B model. Active parameter count is what matters for speed; total parameter count is what matters for quality. This model threads that needle well.

**Best bang-for-RAM: Qwen 2.5 Coder 7B.** 84.2% at 11.3 tok/s in 4.5 GB. If you have 8 GB of RAM and want a daily coding assistant, this is probably your model.

**The Gemma 4 results are surprising and worth discussing.** Gemma 4 31B scores 31.1%, which is lower than Llama 3.2 1B (32.9%) and well below Gemma 3 27B (78.7%). The Gemma 4 MoE variant (26B-A4B) comes in at 12.2%. I ran these multiple times to confirm.
The Q4\_K\_M quantization may be hitting the Gemma 4 architecture harder than others, or the HumanEval+ task distribution may not favor its strengths. Open to theories. (https://www.reddit.com/r/LocalLLaMA/s/2pgedDFBYt)

**Phi 4 Mini 3.8B is a sleeper pick** at 70.7% and 19.6 tok/s in 2.5 GB. If you need something fast and small that still writes reasonable code, it outperforms several much larger models.

# Methodology notes

* EvalPlus HumanEval+ was chosen over standard HumanEval because it adds more test cases per problem, reducing the chance of models passing by luck
* Each model evaluated in isolation (no concurrent processes)

Full writeup: [https://medium.com/@enescingoz/i-benchmarked-21-coding-models-on-a-macbook-air-heres-which-ones-actually-write-good-code-1a59441dee14](https://medium.com/@enescingoz/i-benchmarked-21-coding-models-on-a-macbook-air-heres-which-ones-actually-write-good-code-1a59441dee14)

GitHub repo (code + raw results): [https://github.com/enescingoz/mac-llm-bench](https://github.com/enescingoz/mac-llm-bench)

HuggingFace dataset: [https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon](https://huggingface.co/datasets/enescingoz/humaneval-apple-silicon)

What model should I test next? I have a few slots open for the next run and want to prioritize based on what this community is actually using. Also, if you have a Mac and want to contribute your own results on different hardware (M3, M4 Pro, M4 Max, etc.), the framework is fully open source and contributions are welcome.
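For readers unfamiliar with the metric, the following is a minimal sketch of why an expanded test suite matters for pass@1. It is not EvalPlus's API or the author's harness (see the linked repo for that). `solves`, `pass_at_1`, `candidate_max`, and the test cases are hypothetical names and data invented for illustration, and it assumes one generated sample per problem.

```python
def solves(candidate, tests):
    """A problem counts as solved only if every test case passes."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any test counts as a failure.
            return False
    return True


def pass_at_1(passed):
    """pass@1 with one sample per problem: the fraction of problems solved."""
    return sum(passed) / len(passed)


# Hypothetical model output for "return the largest element of a list".
# The bug: starting from 0 gives the wrong answer on all-negative input.
def candidate_max(xs):
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


# Hypothetical tests: a small base suite and one added edge case.
base_tests = [(([1, 2, 3],), 3), (([5, 3, 4],), 5)]
extra_tests = [(([-3, -1, -2],), -1)]

if __name__ == "__main__":
    print(solves(candidate_max, base_tests))                # True
    print(solves(candidate_max, base_tests + extra_tests))  # False
    print(pass_at_1([True, True, False, True]))             # 0.75
```

The buggy candidate passes the small base suite but fails once the edge case is added, which is the "passing by luck" effect that a larger per-problem test suite is meant to reduce.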

Similar Articles

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Reddit r/LocalLLaMA

A user compares Qwen3.6 35B-A3B and Gemma 4 26B-A4B-IT running locally on a 16GB VRAM GPU via LM Studio, finding Qwen3.6 produces more detailed outputs while both run at comparable speeds. The post is an informal community comparison using quantized models.

The Qwen 3.6 35B A3B hype is real!!!

Reddit r/LocalLLaMA

The author benchmarks small local LLMs, highlighting Qwen 3.6 35B A3B for its superior ability to map academic code to research papers compared to models like Gemma 4 and Nemotron 3 Nano.

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

Reddit r/LocalLLaMA

A developer successfully ran Gemma4 26b MoE on an Apple MacBook Air M5 using MLX with turboquant and a custom kernel, achieving faster prompt processing and generation speeds than llama.cpp with lower memory usage. The implementation includes instructions for local deployment.