A detailed benchmark of 20 small LLMs quantized for a 6GB GPU, measuring speed and VRAM usage at various context lengths, with qualitative probing for tool-use and instruction following. The report aims to help users with modest hardware choose models for local, private automation tasks.
I'm looking for models that can run on my GPU and actually do something useful. At this size even a small difference between models can be a "big" improvement, so I went to the LM Studio database and searched many variants from the same families, trying to select the newer models. Then I asked Claude to pick known benchmarks and run some qualitative tests. Next I'll test with real use cases and then select a "team". Most people run local models on more powerful machines, but the majority of people barely have a 6GB GPU, so this review may help them. The report follows:

**The problem.** I want local models doing repetitive overnight work (file organization, tagging, log triage) on a 6GB laptop GPU: zero cost, private, no rate limits. The real question isn't "which model is best" but "which of these specific quants actually fit in 6GB and behave correctly on *my* tasks." Leaderboard scores don't answer that: they're run on full-precision weights and generic benchmarks, not the Q4/Q6 GGUF you'll actually load.

**Why qualitative probing instead of full benchmark suites.** Running BFCL-v3/v4 + IFEval + MMLU across 20 models on one 6GB GPU is on the order of days-to-a-week of compute, and most of that signal is *already published per model family*. What's not published is how a given quant behaves on the exact behaviors I need. So I built a fixed 6-probe set targeting those behaviors: (1) parseable tool-call, (2) multi-turn tool-call (does it chain with the real tool result or hallucinate a placeholder), (3) strict JSON, (4) instruction adherence (IFEval-style), (5) plan decomposition, (6) no path hallucination, plus a GSM8K-style arithmetic check. I judged the outputs directly and triangulated against published BFCL/IFEval to catch quant-level regressions. That turns a week into ~1 hour and tests the thing that actually matters.
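The probes were judged by hand, but the mechanical parts of probes (1) and (3) could be pre-screened automatically before reading anything. A minimal sketch, assuming the model is asked to reply with a bare JSON object shaped like `{"name": ..., "arguments": {...}}`; that shape, the `move_file` tool, and both function names are hypothetical and not taken from the actual test setup:

```python
import json


def is_strict_json(text):
    """True only if the whole reply is a single JSON object (no prose, no code fences)."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def check_tool_call(text, tools):
    """Return a list of problems with a tool-call reply; an empty list means it passed.

    `tools` maps each allowed function name to the set of its required argument names.
    """
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return ["not parseable JSON"]
    if not isinstance(call, dict):
        return ["not a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        # Catches hallucinated function names.
        return [f"unknown function name: {name!r}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    missing = tools[name] - set(args)
    if missing:
        return [f"missing arguments: {sorted(missing)}"]
    return []


# Hypothetical tool schema for a file-organization task.
TOOLS = {"move_file": {"src", "dst"}}

reply = '{"name": "move_file", "arguments": {"src": "a.txt", "dst": "archive/"}}'
print(check_tool_call(reply, TOOLS))
```

A check like this only filters out replies that are structurally broken; whether a multi-turn call uses the real tool result or a placeholder still needs a human (or a task-specific comparison) to judge.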
Then a separate performance pass measured prefill (prompt-processing) speed and generation tok/s at 1k/8k/32k context, N=5 each, through LM Studio's OpenAI-compatible API.

**The 20 models.** Granite 4.1 3B (lmstudio-community, unsloth, nikolaykozloff Q6/Q8) · Granite-3B-function-calling-xLAM (Salesforce/unsloth) · Granite-3B-sft-claude-opus-reasoning · Granite 4.1 8B · Granite 4.1 8B base · LFM2.5-8B-A1B (liquidai official, unsloth, RemySkye-i1) · Gemma-4-e2b (google base, agentic, ×opus-4.7-turbo, ×deepseek-v4) · LFM2.5-1.2B-Instruct · LFM2.5-VL-1.6B (liquidai, unsloth) · Qwen3.5-4B (base, claude-4.6-opus-reasoning-distilled) · Nemotron-3-Nano-4B.

**Results** (gen tok/s, N=5, σ<2.5 throughout; VRAM = full-GPU load; n/a = no value reported):

| Model | VRAM | tok/s @1k | tok/s @8k | tok/s @32k | Max ctx (GPU) | Note |
|---|---|---|---|---|---|---|
| lfm2.5-1.2b-instruct | 1.9G | 129 | 118 | 102 | 256k | clean, fast |
| unsloth/lfm2.5-vl-1.6b | 3.0G | 207 | 182 | 142 | 128k | fastest overall (vision) |
| liquidai/lfm2.5-vl-1.6b | 2.7G | 128 | 115 | 100 | 256k | vision |
| liquidai/lfm2.5-8b-a1b | 5.4G | 99 | 97 | 90 | 64k | MoE, holds 32k well |
| unsloth/lfm2.5-8b-a1b | 4.6G | 121 | 112 | 102 | 128k | fast but drops files |
| lfm2.5-8b-a1b-i1 | 5.4G | 108 | 99 | 95 | 32k | reasoning variant |
| gemma-4-agentic-e2b | 2.4G | 82 | 78 | 70 | 256k | lightest, holds 32k |
| google/gemma-4-e2b (base) | 3.6G | 78 | 79 | 69 | 256k | base, noisy |
| gemma-4-e2b×opus-turbo | 2.4G | 82 | 78 | 71 | 256k | broken chat template |
| gemma4-e2b-deepseek | 2.4G | 83 | 78 | 71 | 256k | hallucinated paths |
| unsloth/granite-4.1-3b | 4.8G | 70 | 60 | 40 | 32k | quality ≈ 8B |
| granite-3b-xLAM (fc) | 4.8G | 71 | 61 | 41 | 32k | no edge vs base 3B |
| lmstudio/granite-4.1-3b | 4.9G | 66 | 59 | 42 | 32k | solid baseline |
| granite-3b-sft-reasoning | 4.8G | 68 | 58 | 39 | 32k | reasoning tax |
| nikolaykozloff/granite-3b | 4.6G | 45 | 40 | n/a | 24k | hallucinated fn name |
| nvidia/nemotron-3-nano-4b | 3.7G | 58 | 56 | 48 | 128k | least ctx-degradation |
| qwen3.5-4b-distilled | 5.1G | 52 | 50 | 43 | 32k | reasoning, verbose |
| qwen3.5-4b (base) | 5.6G | 52 | 49 | 43 | 32k | fine, unremarkable |
| granite-4.1-8b base | 5.1G | 38 | 32 | n/a | ~10k | base, hallucinates |
| granite-4.1-8b | 5.6G | 28 | 25 | n/a | ~10k | slow + ctx-capped |

Three cross-cutting findings:

(a) **Reasoning-tuned models cost, they don't fail.** With a tight token cap they look broken (truncated mid-thought), but given room they answer correctly at 2–3× the tokens. That's a latency/cost signal for batch work, not a quality reason to cut them (though two still dropped a file in open-ended decomposition even with budget).

(b) **Third-party fine-tunes are a landmine.** Hallucinated function names, a broken jinja chat template (dead on arrival for multi-turn tool calls), hallucinated paths; the base/official-instruct builds were consistently safer.

(c) **Context tax is real and universal.** Every model with a 32k measurement loses gen speed from 1k to 32k, from roughly 10% to just over 40% depending on the model, with no thermal throttling across N=5.

**The picks.**

**LFM2.5-1.2B-Instruct**: the cheap, always-on model. 1.9GB VRAM, 1.5s load, 129 tok/s at 1k, and clean on the JSON / instruction-adherence / no-hallucination probes. Its weak planning is irrelevant for a low-stakes always-resident role. Highest prefill in the whole set (~8.5k tok/s at 8k), so it ingests short inputs near-instantly.

**Granite-4.1-3B (instruct)**: the quality-per-VRAM baseline. On my probes it matched Granite-8B on output quality while running 2–3× faster (60 tok/s at 8k vs 25), and it's the only dense 3B that cleanly holds 32k context. Notably, the "function-calling-xLAM" fine-tune showed **no** advantage once tested multi-turn: the single-turn impression that it chained tools better collapsed under a proper multi-turn probe. Use the plain instruct.

**Gemma-4-agentic-e2b**: the surprise. Just 2.4GB VRAM (lightest non-trivial model here), holds 256k context, and sustains 70 tok/s at 32k with high prefill (~3.8k tok/s).
It gave clean, complete decomposition plans. It's the one model flexible enough to act as either a light orchestrator or a fast worker, which matters when you're juggling roles in 6GB.

**Nemotron-3-Nano-4B**: the long-context worker. Slower at small context (58 tok/s at 1k), but it **holds its speed as context grows**: still 48 tok/s at 32k, where the Granite-3Bs fall to ~41, at only 3.7GB and a 128k ceiling. Best choice when the worker has to read a large input in one shot.

**LFM2.5-8B-A1B (liquidai)**: the orchestrator, and the headline result. This 8B/1B-active MoE does **90 tok/s at 32k context** for ~5.4GB. The obvious dense alternative, Granite-8B, does 25–28 tok/s and caps out around 10k context for about the same VRAM (5.6GB), so the MoE is 3–4× faster with 3× the usable context. I tested the unsloth build too (faster, at 102 tok/s at 32k, with a 128k ceiling), but it dropped a file in open-ended decomposition even with a generous token budget, so the official liquidai build wins on completeness; unsloth stays as a speed fallback.

**Takeaways.** Benchmark on your own quants and your own tasks: published scores won't catch a broken chat template or a quant that hallucinates function names. On VRAM-constrained hardware an MoE punches far above its active parameter count. And a tight, targeted probe set judged by hand gets you a defensible decision in an hour instead of a week of GPU time.
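The context-tax and MoE-versus-dense comparisons are plain ratios over the results table. A minimal sketch using a handful of rows copied from that table; the function names `context_tax` and `speedup` are illustrative and not part of any tool used in the test:

```python
# Gen tok/s at 1k and at 32k context, copied from the results table.
GEN_TOK_S = {
    "lfm2.5-1.2b-instruct": (129, 102),
    "liquidai/lfm2.5-8b-a1b": (99, 90),
    "gemma-4-agentic-e2b": (82, 70),
    "unsloth/granite-4.1-3b": (70, 40),
    "nvidia/nemotron-3-nano-4b": (58, 48),
}


def context_tax(tok_s_short, tok_s_long):
    """Percent of generation speed lost going from short to long context."""
    return (1 - tok_s_long / tok_s_short) * 100


def speedup(tok_s_a, tok_s_b):
    """How many times faster model A generates than model B."""
    return tok_s_a / tok_s_b


# List models from smallest to largest slowdown between 1k and 32k.
for model, (at_1k, at_32k) in sorted(
    GEN_TOK_S.items(), key=lambda item: context_tax(*item[1])
):
    tax = context_tax(at_1k, at_32k)
    print(f"{model:28s} {at_1k:4d} -> {at_32k:4d} tok/s  ({tax:.0f}% slower)")

# MoE orchestrator vs dense Granite-8B, both at 8k context (97 vs 25 tok/s).
print(f"MoE vs dense 8B at 8k: {speedup(97, 25):.1f}x")
```

Swapping in the rows for your own shortlist gives a quick way to rank candidates by how much speed they keep at the context length your worker will actually see.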
A GitHub repository providing practical configurations and benchmarks for running local LLMs (like Qwen3.6 27B) on dual RTX 5060 Ti 16GB cards using vLLM and llama.cpp.
AirLLM is an open-source tool that optimizes inference memory usage, enabling 70B LLMs to run on a single 4GB GPU without quantization, and supports 405B models on 8GB VRAM.
A user benchmarks 21 consumer GPUs on vast.ai running a small TTS model (OmniVoice) with peak VRAM of 5GB, comparing performance relative to real-time and to an RTX 3090.
A tool that estimates which LLMs fit on a user's GPU memory, ranking models by performance while considering memory constraints and quantization levels.
A user benchmarks the Nvidia RTX 5090 for LLM inference using llama.cpp, measuring prompt processing and token generation at various power limits, finding that prompt processing is more sensitive to power limits than token generation, and noting differences from the RTX 4090.