This post presents the second update of a benchmark for local vision language models, comparing 23 models across 30 images with revised settings, and provides performance recommendations for different VRAM tiers. Key findings include that thinking mode hurts vision performance and that MoE models underperform dense models for perception tasks.
I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it useless. I have increased it to maximum level, with the following optimal setttings which were posted here recently: --image-min-tokens 560 --image-max-tokens 2240 I used the -b 4096 -ub 4096 parameters to avoid splitting the image tokens into multiple blocks (default value is 512) Switched from ollama to llama.cpp I expanded my dataset from 20 to 30 images, to cover more use cases I expanded the benchmark to test the impact of thinking vs non-thinking The first benchmark only included Q4 quants; I expanded it to Q8 quants for small models The first benchmark only tested each image once; now 3x tests per image In total, 23 models x 30 images x 3 tests = 2,070 tests (not including failures, tunings, re-runs), 60 to 70 inference hours. I have three recommendations this time, one per hardware tier: VRAM tier Pick Size Score Speed 4–8 GB Qwen3.5 4B (nothink) @ Q4 3.2 GB 75.5/100 20 s/img 12–16 GB Qwen3-VL 8B @ Q8 (not Q4) 8.1 GB 74.4/100 26 s/img 24+ GB Qwen3.6 27B (nothink) @ Q4 16.9 GB 79.6/100 70 s/img I noticed a few interesting outcomes, which I did not expect: Thinking mode hurts vision. Every Qwen hybrid thinker scored higher with enable_thinking=false. This is because vision is perception, not reasoning. Thinking adds instability, timeouts, and empty outputs. MoE size is misleading for vision. MoE models tie with much smaller dense models, and perform worse than equivalent dense models. It makes sense in retrospect if when you see that a MoE is a collection of small models. Their big total parameter count buys knowledge breadth, not perception depth which scales with density. Q8 is not a guaranteed improvement. It improves Gemma 4 (more consistent, less hallucinations), cripples Qwen hybrid thinkers (they spend too long thinking, resulting in frequent timeouts). The only Q8 that's a strict win is Qwen3-VL 8B-Q8. Here are the full quality ranking, sorted by effective score (raw × completion rate). σ = stability across 3 runs. # Variant Quant Mode Score σ Successful Note 1 Qwen3.6 27B Q4 nothink 79.6 0.24 90/90 Champion 2 Qwen3.6 27B Q4 think 78.2 0.26 81/90 Same model, slower 3 Qwen3.6 35B-A3B Q4 nothink 76.4 0.55 90/90 MoE 4 Qwen3.5 4B Q4 nothink 75.5 0.48 90/90 Best pts/GB 5 GLM-4.6V-Flash 9B Q4 — 75.1 0.53 90/90 Best for chinese OCR 6 Qwen3.6 35B-A3B Q4 think 75.0 0.31 90/90 MoE 7 Gemma 4 31B Q4 — 74.6 0.45 90/90 Slow (93 s) 8 Qwen3-VL 8B Q8 — 74.4 0.33 90/90 Only perfect Q8 9 Qwen3-VL 8B Q4 — 73.1 0.52 90/90 10 Qwen3.5 9B Q4 nothink 73.1 0.58 90/90 11 Gemma 4 26B-A4B Q4 — 72.7 0.51 90/90 12 Qwen3.5 9B Q4 think 72.7 0.52 90/90 13 GLM-9B Q8 — 73.4 raw / 68.5 eff 0.51 84/90 Drop vs Q4 14 Qwen3.5 4B Q4 think 70.6 0.77 90/90 Unstable 15 Qwen3-VL 4B Q4 — 65.9 0.76 90/90 Degenerates 16 Qwen3.5 4B Q8 nothink 65.7 0.51 partial Drop vs Q4 17 Qwen3-VL 4B Q8 — 65.3 1.03 87/93 Worst σ 18 Gemma 4 12B Q8 — 76.6 raw / 59.7 eff 0.28 74/95 22% timeouts 19 Gemma 4 12B Q4 — 64.1 0.66 90/90 Hallucinations 20 Gemma 4 E4B Q8 — 63.9 0.46 78/90 21 Gemma 4 E4B Q4 — 58.8 0.60 90/90 Wrong counts 22 Qwen3.5 9B Q8 nothink partial — ~85% fail Unusable 23 Qwen3.5 9B Q8 think partial — ~60% fail Unusable Here is bit more info about some of those models, that the above numbers cannot express, based on reading their actual output: Qwen3.6-27B (Q4=16.9GB) : Best quality, best stability, no failures with thinking disabled. The no-thinking mode has a huge beneficial on speed, and avoids the timeouts due to reasoning too long. Gives very direct answers. Qwen3.6-35B-A3B (Q4=21.9GB) : Based on the numbers it might appear like a good speedy alternatives, but it rarely performs better than smaller models. Biggest problem, apart from its size, is the huge variance and unpredictability of its responses. Skip it, not worth using MoE for vision. Qwen3-VL-8B-Instruct (Q4=5.8GB Q8=8.1GB) : The only model with 100% reliability on Q8. Q8 brings big over Q4, for both quality and consistency. Qwen3.5-4B (Q4=3.2GB) : Use with thinking disabled; when enabled, on dense images, it can easily exhaust its token budget and error, or timeout. Q8 was a lot worse than Q4, with again timeouts on dense images. None of those problems with Q4 non-thinking. Test methodology specs: Apple M2 Max, 96GB RAM runtime: llama.cpp b9690 via llama-server models: 11 base models, Q4_K_M; Q8_0 added for 7 of the smaller ones hybrid thinking models (Qwen3.5/3.6) tested both with and without thinking enabled 30 images across screenshots, photos, posters, art, medical, scientific graphs, dense scenes, and multilingual content 3 runs per (model × image), median run scored hybrid scoring: 40% deterministic probes (OCR, counts, hallucination checks) + 60% LLM judge based on human created detailed ground truth description for each image timeout: 300s per call (fail fast on runaway thinking)
A technical overview of the state of local AI models in mid-2026, highlighting how open-weight models have narrowed the gap to frontier models through advances in mixture-of-experts and sparse attention, enabling efficient local inference.
MemLens is a new benchmark for evaluating memory capabilities in large vision-language models through multi-session conversations. It compares long-context and memory-augmented approaches, revealing limitations in both and motivating hybrid architectures.
Introduces ScreenLeak, a benchmark for measuring PII redaction in computer-use AI data, and presents two local models (v45_phase3 for text and rfdetr_v8 for images) achieving near-frontier performance at low latency.
This paper introduces LEVANTE-bench, a benchmark that systematically evaluates vision-language models on six cognitive tasks and compares their performance to children aged 5-12, finding that current VLMs align only partially with children's cognitive abilities.