Testing Local LLMs in Practice: Code Generation, Quality vs. Speed
Summary
The author built a benchmark harness to evaluate local LLMs for autonomous Go code generation, focusing on log parser generation for SIEM pipelines, and published results comparing output quality against generation speed.
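The harness code itself is not reproduced here; the sketch below is one plausible shape for such a loop, assuming an OpenAI-compatible chat endpoint at http://localhost:8080 (as exposed by llama.cpp-style servers) and using a bare "go build" as a crude stand-in for the article's quality checks. The model tags, prompt, and endpoint are illustrative assumptions, not taken from the article.

// Minimal sketch of a quality-vs-speed harness for locally hosted models.
// Assumptions: an OpenAI-compatible endpoint on localhost:8080; "go build"
// as a proxy for the article's quality scoring; hypothetical model tags.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

// generate asks the local model for a Go log parser and returns the raw
// reply plus wall-clock latency. A real harness would also strip any
// markdown fences from the reply before compiling it.
func generate(model, prompt string) (string, time.Duration, error) {
	body, _ := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	start := time.Now()
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", 0, err
	}
	if len(out.Choices) == 0 {
		return "", 0, fmt.Errorf("empty response")
	}
	return out.Choices[0].Message.Content, time.Since(start), nil
}

// compiles writes the generated source into a throwaway module and checks
// that it builds; pass/fail here is only a rough quality signal.
func compiles(src string) bool {
	dir, err := os.MkdirTemp("", "llm-bench-*")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)
	os.WriteFile(filepath.Join(dir, "go.mod"), []byte("module bench\n\ngo 1.22\n"), 0o644)
	os.WriteFile(filepath.Join(dir, "parser.go"), []byte(src), 0o644)
	cmd := exec.Command("go", "build", "./...")
	cmd.Dir = dir
	return cmd.Run() == nil
}

func main() {
	// Illustrative SIEM-flavored task; the article's actual prompts differ.
	prompt := "Write a Go package that parses syslog lines into structured fields."
	for _, model := range []string{"model-a", "model-b"} { // hypothetical tags
		src, took, err := generate(model, prompt)
		if err != nil {
			fmt.Printf("%-10s error: %v\n", model, err)
			continue
		}
		fmt.Printf("%-10s %6.1fs  builds=%v\n", model, took.Seconds(), compiles(src))
	}
}

Extending this toward the article's setup would mean swapping the build check for real parser tests and recording tokens per second alongside latency.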
Similar Articles
Tested how OpenCode Works with Self-Hosted LLMs: Qwen 3.5, 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash - v2
A developer benchmarked multiple self-hosted LLMs (Qwen 3.5/3.6, Gemma 4, Nemotron 3, GLM-4.7) with OpenCode on two coding tasks, revealing speed and quality trade-offs on RTX 4080 hardware.
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
A comprehensive dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (Accuracy, Readability, Consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder introduces the PlayEval benchmark and a multi-agent framework that iteratively repairs LLM-generated GUI applications, achieving up to 20.3% end-to-end playable code.
I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed
A developer benchmarked 21 local LLMs on a MacBook Air M5 using HumanEval+ and found Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models significantly underperformed expectations (31.1% for 31B), possibly due to Q4_K_M quantization effects.
LLM Attribution Analysis Across Different Fine-Tuning Strategies and Model Scales for Automated Code Compliance
This paper analyzes how different fine-tuning strategies (FFT, LoRA, quantized LoRA) and model scales affect LLM interpretive behavior for automated code compliance tasks, using perturbation-based attribution analysis. The findings show that FFT produces more focused attribution patterns than parameter-efficient methods, and that larger models develop specific interpretive strategies, with diminishing performance returns beyond 7B parameters.