Testing Local LLMs in Practice: Code Generation, Quality vs. Speed

Reddit r/LocalLLaMA Tools

Summary

The author built a benchmark harness to evaluate local LLMs for autonomous Go code generation, focusing on log parser generation for SIEM pipelines, and published results comparing quality vs. speed.

Hello, I spent the last few months building an AI agent that autonomously writes Go code using local LLMs. The primary use case is log parser generation for SIEM pipelines.

A large part of the work ended up being the evaluation itself: how do you objectively measure whether a model is actually useful for autonomous coding tasks? So I built a harness that:

1. lets agents generate real Go parsers,
2. compiles the generated Go code,
3. validates the extracted fields and their types,
4. measures parsing quality against expected schemas,
5. and tracks throughput/speed over longer runs.

Given the current release cadence of open-weight models, the results are interesting. I published the first public version of the benchmark and methodology here: [https://ndocs.teskalabs.com/logman.io/blog/2026/04/14/testing-local-llms-in-practice-code-generation-quality-vs-speed/](https://ndocs.teskalabs.com/logman.io/blog/2026/04/14/testing-local-llms-in-practice-code-generation-quality-vs-speed/)

Feedback is very welcome. Also: which model should I test next?
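The post doesn't show the harness internals, but the field/type validation step could look roughly like the Go sketch below. `ExpectedSchema`, `ValidateFields`, and the sample log record are illustrative assumptions, not the author's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExpectedSchema maps field names to the Go type name the parser
// is expected to produce (e.g. "string", "float64").
type ExpectedSchema map[string]string

// ValidateFields checks a parser's output against the expected schema and
// returns a list of mismatches (missing fields or wrong dynamic types).
func ValidateFields(parsed map[string]any, schema ExpectedSchema) []string {
	var errs []string
	for field, wantType := range schema {
		val, ok := parsed[field]
		if !ok {
			errs = append(errs, fmt.Sprintf("missing field %q", field))
			continue
		}
		if gotType := fmt.Sprintf("%T", val); gotType != wantType {
			errs = append(errs, fmt.Sprintf("field %q: got %s, want %s", field, gotType, wantType))
		}
	}
	return errs
}

func main() {
	// A generated parser would produce a record like this from a raw log line;
	// here a decoded JSON sample stands in for the parser's output.
	var parsed map[string]any
	json.Unmarshal([]byte(`{"src_ip":"10.0.0.1","bytes":512,"action":"allow"}`), &parsed)

	schema := ExpectedSchema{"src_ip": "string", "bytes": "float64", "action": "string"}
	errs := ValidateFields(parsed, schema)
	fmt.Println("mismatches:", len(errs)) // JSON numbers decode as float64, so "bytes" passes
}
```

A per-field mismatch list (rather than a single pass/fail) is what lets a harness score partial parsing quality against the expected schema, not just compile success.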
Original Article

Similar Articles

PlayCoder: Making LLM-Generated GUI Code Playable

Hugging Face Daily Papers

PlayCoder introduces the PlayEval benchmark and a multi-agent framework that iteratively repairs LLM-generated GUI applications, achieving an end-to-end playable-code rate of up to 20.3%.

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Reddit r/LocalLLaMA

A developer benchmarked 21 local LLMs on a MacBook Air M5 using HumanEval+ and found that Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models significantly underperformed expectations (31.1% for 31B), possibly due to Q4_K_M quantization effects.

LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance

arXiv cs.CL

This paper analyzes how different fine-tuning strategies (FFT, LoRA, quantized LoRA) and model scales affect LLM interpretive behavior for automated code compliance tasks using perturbation-based attribution analysis. The findings show FFT produces more focused attribution patterns than parameter-efficient methods, and larger models develop specific interpretive strategies with diminishing performance returns beyond 7B parameters.