@TraffAlex: Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Eve…
Summary
A guide to the best local LLMs for consumer GPUs as of June 2026, using llama.cpp to run models like Gemma 4-12B, Qwen3.6-27B, and Nex-N2-Mini on 8-32GB VRAM, with setup and launch commands.
Cached at: 06/15/26, 02:50 AM
Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026)
What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud.
━━━ 8-16GB VRAM ━━━
Gemma 4-12B (Google)
• Smartest model in this size class, competes with stuff 2× bigger
• Unsloth’s MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup)
• Minimum 8GB VRAM recommended for Q4_K_M quant
• GGUF → http://huggingface.co/unsloth/gemma-4-12b-it-GGUF…

LFM2.5-8B-A1B (LiquidAI)
• Hybrid MoE, only 1B active params, absurdly fast for its size
• Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget
• GGUF → http://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF…
━━━ 16-32GB VRAM ━━━
Qwen3.6-27B (Qwen)
• Scored 1.00 on tool-efficiency benchmarks: best local agent available
• 40 deterministic tasks, 32k/128k context needle tests, all passed
• GGUF → http://huggingface.co/unsloth/Qwen3.6-27B-GGUF…
• MTP version (faster) → http://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF…

Qwopus3.6-27B-v2 (Jackrong)
• Best quantization of Qwen3.6-27B: topped 5 agent & coding benchmarks (1200 samples)
• If you’re running Q4, this is the one to grab
• GGUF → http://huggingface.co/Jackrong/Qwopus3.6-27B-v2-GGUF…
• MTP version → http://huggingface.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF…

Gemma 4-31B QAT (Google/Unsloth)
• QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup)
• Excellent for multi-agent / subagent workflows
• GGUF → http://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF…

Nex-N2-Mini (Nex AGI)
• Post-train of Qwen3.5-35B-A3B, an MoE with only 3B active params
• Fits on 16GB+ VRAM, overflow loads from system RAM
• Adaptive thinking saves ~20% tokens with no quality loss
• For deep multi-step reasoning, nothing in this size comes close
• GGUF → http://huggingface.co/sjakek/Nex-N2-mini-GGUF…
━━━ Quick Picks ━━━
• 16GB all-rounder → Gemma 4-12B with MTP GGUFs
• 32GB all-rounder → Qwen3.6-27B / Qwopus-v2
• Agents & tool use → Qwen3.6-27B or Qwopus Q4
• Deep reasoning → Nex-N2-Mini (MoE, fits 16GB+)
• Tight budget → LFM2.5-8B-A1B
• Cheapest full build: 1× used RTX 3090 (24GB) + rest of PC ≈ $1000-1500
━━━ Setup on Windows ━━━
- Download llama.cpp → http://github.com/ggml-org/llama.cpp/releases… (latest .zip)
- Extract to any folder (e.g. C:\llama.cpp)
- Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance)
- Run one of the commands below depending on your hardware
━━━ Launch Commands ━━━
SINGLE GPU — Standard model (no MTP):
llama-server.exe ^
  -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^
  --ctx-size 180000 ^
  --flash-attn on ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --batch-size 1024 --ubatch-size 512 ^
  -ngl 100 ^
  -np 1 ^
  --port 8080 ^
  --jinja
SINGLE GPU — MTP model (faster inference):
llama-server.exe ^
  -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^
  --ctx-size 180000 ^
  --flash-attn on ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --batch-size 1024 --ubatch-size 512 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 3 ^
  -ngl 100 ^
  -np 1 ^
  --port 8080 ^
  --jinja
DUAL GPU — Split across two cards:
llama-server.exe ^
  -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^
  --ctx-size 180000 ^
  --flash-attn on ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --batch-size 1024 --ubatch-size 512 ^
  -ngl 100 ^
  --tensor-split 0.55,0.45 ^
  --main-gpu 0 ^
  -np 1 ^
  --port 8080 ^
  --jinja
DUAL GPU + MTP + Vision (multimodal):
llama-server.exe ^
  -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^
  --ctx-size 180000 ^
  --flash-attn on ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --batch-size 1024 --ubatch-size 512 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 3 ^
  -ngl 100 ^
  --tensor-split 0.60,0.40 ^
  --main-gpu 0 ^
  -np 1 ^
  --port 8080 ^
  --jinja ^
  --mmproj C:\models\mmproj-F16.gguf
━━━ Parameter Breakdown ━━━
-m
Path to the .gguf model file you downloaded (e.g. C:\models\Qwen3.6-27B-Q5_K_M.gguf).
--ctx-size 180000
Context window in tokens. 180k = huge context for long conversations or big codebases. Reduce to 32768 or 65536 if you don’t need long context to use less VRAM.

--flash-attn on
Flash Attention: dramatically speeds up inference and reduces VRAM usage. Works on RTX 30xx/40xx/50xx. Always enable this.

--cache-type-k q4_0 / --cache-type-v q4_0
Quantizes the KV cache (key/value attention cache) to 4-bit. This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory. Quality impact is minimal; this is a free performance win.

--batch-size 1024 / --ubatch-size 512
batch-size = how many tokens are processed in one forward pass (throughput). ubatch-size = micro-batch actually sent to the GPU per step. Higher = faster prompt processing but needs more VRAM. If you run out of VRAM, lower these (e.g. 512/256).

-ngl 100
Number of layers to offload to GPU. 100 = all layers on GPU (full offload). This is what you want if the model fits in your VRAM. If it doesn’t fit, reduce this (e.g. -ngl 40); the remaining layers run on CPU/RAM.

--tensor-split 0.55,0.45
How to split model layers across multiple GPUs. Values are ratios. 0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%. Adjust based on your VRAM: give more to the card with more memory. Example: 0.70,0.30 for a 24GB + 12GB setup. Not needed for single GPU setups.

--main-gpu 0
Which GPU handles the batch computation (the “orchestrator”). Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers. Minor performance impact; usually just leave it at 0.

-np 1
Number of parallel slots (concurrent requests). 1 = one user at a time. Increase to 2-4 if you want multiple clients connected simultaneously. Each extra slot uses additional VRAM for its own KV cache.

--port 8080
Which port the server listens on. Change if port 8080 is busy.

--jinja
Enables Jinja2 template processing, required for proper chat formatting. Most modern models expect this. Always include it.

--spec-type draft-mtp
Enables Multi-Token Prediction (MTP) speculative decoding. Only works with MTP GGUF models (downloaded separately). The model predicts multiple tokens at once and verifies them for a big speed boost.
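--spec-draft-n-max 3
How many tokens the MTP draft head proposes per step. 3 is a good default. Higher = potentially faster when most drafts are accepted, but every rejected draft token is wasted work, so the gains taper off. Because drafted tokens are verified by the main model, output quality should not depend on this value.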
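--mmproj
Path to the multimodal projector file (e.g. C:\models\mmproj-F16.gguf) from the model’s HuggingFace repo. Only needed for vision input.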
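To get a feel for why the KV cache settings matter at 180k context, here is a rough back-of-the-envelope sketch. The layer count, KV head count and head dimension are hypothetical placeholders, not the real Qwen3.6-27B configuration (read the real values from your GGUF metadata). The bytes-per-element figures assume ggml's block layouts (32 values in 18 bytes for q4_0, 34 bytes for q8_0), and the estimate ignores sliding-window layers and other runtime overhead.

```python
# Rough KV-cache size estimate. All model dimensions used below are
# hypothetical placeholders; read the real ones from your GGUF metadata.

BYTES_PER_ELEMENT = {
    "f16": 2.0,        # 16-bit float
    "q8_0": 34 / 32,   # 32 values packed into 34 bytes
    "q4_0": 18 / 32,   # 32 values packed into 18 bytes
}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_size, cache_type="f16"):
    """Approximate combined size of the K and V caches, in GiB."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K plus V
    total_bytes = elements_per_token * ctx_size * BYTES_PER_ELEMENT[cache_type]
    return total_bytes / 1024**3


# Hypothetical dense model: 64 layers, 8 KV heads, head dimension 128.
hypothetical = dict(n_layers=64, n_kv_heads=8, head_dim=128)
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(ctx_size=180_000, cache_type=cache_type, **hypothetical)
    print(f"{cache_type}: {size:.1f} GiB")
```

Whatever the real dimensions are, the ratio is fixed: a q4_0 cache takes about 28% of the space of an f16 cache at the same context length.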
━━━ Your Hardware → Your Command ━━━
Single GPU (8-24GB VRAM): Use the “Single GPU” command. Change -m to your model path.
• 8GB card → Gemma 4-12B Q4 or LFM2.5-8B
• 12GB card → Gemma 4-12B Q5/Q6
• 16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini
• 24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6
Dual GPU: Use the “Dual GPU” command. Adjust --tensor-split based on your VRAM ratio.
• 24GB + 24GB → --tensor-split 0.50,0.50
• 24GB + 12GB → --tensor-split 0.70,0.30
• 24GB + 8GB → --tensor-split 0.75,0.25
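If your cards don't match one of the pairs above, a starting ratio can be derived from the VRAM sizes. This is a minimal sketch that splits purely in proportion to VRAM; the helper names are made up for illustration. Note that pure proportionality gives 0.67,0.33 for 24GB + 12GB, slightly different from the 0.70,0.30 suggested above, so treat the result as a first guess to tune.

```python
def tensor_split(vram_gb):
    """Per-GPU ratios proportional to each card's VRAM, rounded to 2 places."""
    total = sum(vram_gb)
    return [round(v / total, 2) for v in vram_gb]


def as_flag(ratios):
    """Format ratios as a llama-server --tensor-split argument."""
    return "--tensor-split " + ",".join(f"{r:.2f}" for r in ratios)


print(as_flag(tensor_split([24, 24])))  # --tensor-split 0.50,0.50
print(as_flag(tensor_split([24, 12])))  # --tensor-split 0.67,0.33
print(as_flag(tensor_split([24, 8])))   # --tensor-split 0.75,0.25
```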
Want speed? Use MTP versions of models with the “MTP” commands.
Want vision? Add --mmproj with the projector file from the model’s HuggingFace repo.
- Once running, you get:
• Web chat UI → http://localhost:8080
• OpenAI-compatible API → http://localhost:8080/v1
• Playground → http://localhost:8080/playground
━━━ Why /v1 API Is the Killer Feature ━━━
One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer.
Works out of the box with:
• IDEs: Cursor, Continue, Windsurf, Cline, Roo Code
• CLI tools: aider, Open Interpreter, OpenCode
• Frameworks: LangChain, LlamaIndex, LiteLLM
• Any OpenAI SDK (Python, Node, Go, Rust)

Why this beats cloud APIs:
• 100% private: code never leaves your machine
• $0 per token: no rate limits, no quotas, no surprise bills
• Works fully offline
• Zero telemetry, no training on your data
• Swap models by dropping in a different .gguf, no app changes needed
• Run 32k–128k context windows without burning money

Good combos:
• Cursor + Qwopus-v2 → near-frontier quality, zero API cost
• Continue + Qwen3.6-27B → best local coding agent
• aider + Gemma 4-12B MTP → 162 tok/s, feels instant
• OpenCode + Nex-N2-Mini → deep reasoning on 16GB
Set any OpenAI-compatible client to your local endpoint:
set OPENAI_API_KEY=sk-dummy
set OPENAI_BASE_URL=http://localhost:8080/v1
(Any non-empty string works for the key.)
Every OpenAI-compatible tool now hits your local GPU.
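For a quick check without any SDK, the endpoint can also be called with the Python standard library. This is a minimal sketch, assuming the server from the launch commands is listening on port 8080. The function names and the "local" model name are placeholders chosen for illustration; the sketch assumes a single-model server does not care which model name is sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # the llama-server endpoint from this guide


def build_chat_request(prompt, base_url=BASE_URL, api_key="sk-dummy", model="local"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,  # placeholder name, assumed to be ignored by the server
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Say hello in five words."))` should print the model's reply.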
Shoutouts: @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs
unsloth/gemma-4-12b-it-GGUF · Hugging Face
Source: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF
Read our How to Run Gemma 4 12B Guide!
License: Apache 2.0 | Authors: Google DeepMind
This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution.
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in five distinct sizes: E2B, E4B, 12B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
Gemma 4 introduces key capability and architectural advancements:
- Reasoning: All models in the family are designed as highly capable reasoners, with configurable thinking modes.
- Extended Multimodalities: Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B, E4B, and 12B models).
- Diverse & Efficient Architectures: Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
- Optimized for On-Device: Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
- Increased Context Window: The small models feature a 128K context window, while the medium models support 256K.
- Enhanced Coding & Agentic Capabilities: Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
- Native System Prompt Support: Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.
Models Overview
Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (12B, 26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.
The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
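The difference between the two attention types can be sketched in plain Python. This is an illustration only: the model card does not state how many local layers sit between global ones, so `global_every` is a hypothetical parameter, and the function names are made up for this example.

```python
def layer_pattern(n_layers, global_every):
    """Mark every `global_every`-th layer global and force the last layer global."""
    kinds = [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(n_layers)
    ]
    kinds[-1] = "global"  # the card says the final layer is always global
    return kinds


def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full (global) causal attention; an integer window
    restricts each query to itself and the window - 1 tokens before it.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


print(layer_pattern(7, global_every=3))
print(causal_mask(4, window=2)[3])  # last token sees only positions 2 and 3
print(causal_mask(4)[3])            # last token sees every earlier position
```

Local layers only need to cache keys and values for the most recent window, which is where the memory saving at long context comes from.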
Dense Models

| Property | E2B | E4B | 12B Unified | 31B Dense |
|---|---|---|---|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 11.95B | 30.7B |
| Layers | 35 | 42 | 48 | 60 |
| Sliding Window | 512 tokens | 512 tokens | 1024 tokens | 1024 tokens |
| Context Length | 128K tokens | 128K tokens | 256K tokens | 256K tokens |
| Vocabulary Size | 262K | 262K | 262K | 262K |
| Supported Modalities | Text, Image, Audio | Text, Image, Audio | Text, Image, Audio | Text, Image |
| Vision Encoder Parameters | ~150M | ~150M | - | ~550M |
| Audio Encoder Parameters | ~300M | ~300M | - | No Audio |

The “E” in E2B and E4B stands for “effective” parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.
The “Unified” in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM’s embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass.
Mixture-of-Experts (MoE) Model

| Property | 26B A4B MoE |
|---|---|
| Total Parameters | 25.2B |
| Active Parameters | 3.8B |
| Layers | 30 |
| Sliding Window | 1024 tokens |
| Context Length | 256K tokens |
| Vocabulary Size | 262K |
| Expert Count | 8 active / 128 total and 1 shared |
| Supported Modalities | Text, Image |
| Vision Encoder Parameters | ~550M |

The “A” in 26B A4B stands for “active parameters” in contrast to the total number of parameters the model contains. By only activating a 4B subset of parameters during inference, the Mixture-of-Experts model runs much faster than its 26B total might suggest. This makes it an excellent choice for fast inference compared to the dense 31B model since it runs almost as fast as a 4B-parameter model.
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked in the table are for instruction-tuned models.
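| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 12B Unified | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) |
|---|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 77.2% | 69.4% | 60.0% | 67.6% |
| AIME 2026 no tools | 89.2% | 88.3% | 77.5% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 72.0% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 1659 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 78.8% | 58.6% | 43.4% | 42.4% |
| Tau2 (average over 3) | 76.9% | 68.2% | 69.0% | 42.2% | 24.5% | 16.2% |
| HLE no tools | 19.5% | 8.7% | 5.2% | - | - | - |
| HLE with search | 26.5% | 17.2% | - | - | - | - |
| BigBench Extra Hard | 74.4% | 64.8% | 53.0% | 33.1% | 21.9% | 19.3% |
| MMMLU | 88.4% | 86.3% | 83.4% | 76.6% | 67.4% | 70.7% |
| Vision | | | | | | |
| MMMU Pro | 76.9% | 73.8% | 69.1% | 52.6% | 44.2% | 49.7% |
| OmniDocBench 1.5 (average edit distance, lower is better) | 0.131 | 0.149 | 0.164 | 0.181 | 0.290 | 0.365 |
| MATH-Vision | 85.6% | 82.4% | 79.7% | 59.5% | 52.4% | 46.0% |
| MedXPertQA MM | 61.3% | 58.1% | 48.7% | 28.7% | 23.5% | - |
| Audio | | | | | | |
| CoVoST | - | - | 38.5* | 35.54 | 33.47 | - |
| FLEURS (lower is better) | - | - | 0.069* | 0.08 | 0.09 | - |
| Long Context | | | | | | |
| MRCR v2 8 needle 128k (average) | 66.4% | 44.1% | 43.4% | 25.4% | 19.1% | 13.5% |

*Excluding Chinese language.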
Core Capabilities
Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:
- Thinking: Built-in reasoning mode that lets the model think step-by-step before answering.
- Long Context: Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (12B, 26B A4B/31B).
- Image Understanding: Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
- Video Understanding: Analyze video by processing sequences of frames.
- Interleaved Multimodal Input: Freely mix text and images in any order within a single prompt.
- Function Calling: Native support for structured tool use, enabling agentic workflows.
- Coding: Code generation, completion, and correction.
- Multilingual: Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
- Audio (E2B, E4B, and 12B only): Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.
Getting Started
You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:
pip install -U transformers torch accelerate
Once you have everything installed, you can proceed to load the model with the code below:
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
Once the model is loaded, you can start generating output:
# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
To enable reasoning, set enable_thinking=True and the parse_response function will take care of parsing the thinking output.
Below, you will also find snippets for processing audio (E2B, E4B, 12B only), images, and video alongside text:
Code for processing Audio
Make sure to install the following packages:
pip install -U transformers torch torchvision librosa accelerate
You can then load the model with the code below:
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:
# Prompt - add audio after text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
            {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/journal1.wav"},
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
Code for processing Images
Make sure to install the following packages:
pip install -U transformers torch torchvision accelerate
You can then load the model with the code below:
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:
# Prompt - add image before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
Code for processing Videos
Make sure to install the following packages:
pip install -U transformers torch torchvision librosa accelerate
You can then load the model with the code below:
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-12B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
Once the model is loaded, you can start generating output by directly referencing the video URL in the prompt:
# Prompt - add video before text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
            {"type": "text", "text": "Describe this video."}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
Best Practices
For the best performance, use these configurations and best practices:
1. Sampling Parameters
Use the following standardized sampling configuration across all use cases:
temperature=1.0
top_p=0.95
top_k=64
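To make the three knobs concrete, here is a minimal pure-Python sketch of what temperature, top-k and top-p do to a next-token distribution. It applies them in one common order (temperature, then top-k, then top-p); real runtimes may order or implement the steps differently, and the function names are made up for this illustration.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Apply temperature, then top-k, then top-p; return {token: probability}."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # top-k: keep only the k most likely tokens, then renormalize.
    ranked = ranked[:top_k]
    norm_k = sum(prob for prob, _ in ranked)
    ranked = [(prob / norm_k, tok) for prob, tok in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm_p = sum(prob for prob, _ in kept)
    return {tok: prob / norm_p for prob, tok in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -5.0}
print(filter_distribution(toy_logits, top_k=3, top_p=0.9))
```

In the toy example, top-k drops "d" and top-p then drops "c", so only "a" and "b" remain candidates.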
2. Thinking Mode Configuration
Compared to Gemma 3, the models use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:
- Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token.
- Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: <|channel>thought\n[Internal reasoning]<channel|>
- Disabled Thinking Behavior: For all models except for the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block: <|channel>thought\n<channel|>[Final answer]
Note that many libraries like Transformers and llama.cpp handle the complexities of the chat template for you.
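If you do end up handling raw output yourself, the structure described above can be split with a small regular expression. This is a minimal sketch, not the library's own parser; it assumes the `\n` after `thought` is a literal newline character and that the final answer follows the closing tag. The name `split_thinking` is made up for this example.

```python
import re

# Matches the thought block structure: <|channel>thought\n ... <channel|>
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)


def split_thinking(raw):
    """Return (thought, answer); thought is "" when the block is empty or absent."""
    match = THOUGHT_BLOCK.search(raw)
    if match is None:
        return "", raw.strip()
    thought = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thought, answer


print(split_thinking("<|channel>thought\nadd 2 and 2<channel|>4"))
print(split_thinking("<|channel>thought\n<channel|>Hello"))
```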
3. Multi-Turn Conversations
- No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.
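When you manage the message list yourself, this rule amounts to stripping thought blocks from earlier assistant turns before sending the next request. A minimal sketch, assuming assistant content is stored as raw strings that may still contain the thought tags; `clean_history` is a hypothetical helper name.

```python
import re

# Same tag structure as the thinking format: <|channel>thought\n ... <channel|>
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def clean_history(messages):
    """Return a copy of the history with thought blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        content = message["content"]
        if message["role"] == "assistant" and isinstance(content, str):
            content = THOUGHT_BLOCK.sub("", content).strip()
        cleaned.append({**message, "content": content})
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<|channel>thought\nsimple sum<channel|>4"},
    {"role": "user", "content": "And doubled?"},
]
print(clean_history(history))
```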
4. Modality order
For optimal performance with multimodal inputs, place:
- Image content before the text in your prompt.
- Audio content after the text in your prompt.
5. Variable Image Resolution
Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don’t require fine-grained understanding.
- The supported token budgets are: 70, 140, 280, 560, and 1120.
- Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
- Use higher budgets for tasks like OCR, document parsing, or reading small text.
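Since only five budgets are supported, a helper that snaps a desired token count to a valid value can be handy. This is a small sketch with made-up names; how the chosen budget is actually passed to the processor or runtime is not covered here, and the per-task starting points are assumptions that merely follow the low/high guidance above.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def pick_budget(min_tokens):
    """Smallest supported budget >= min_tokens, capped at the largest one."""
    for budget in SUPPORTED_BUDGETS:
        if budget >= min_tokens:
            return budget
    return SUPPORTED_BUDGETS[-1]


# Hypothetical per-task starting points (assumptions, not values from the card).
TASK_DEFAULTS = {
    "classification": 70,
    "captioning": 140,
    "video": 70,
    "ocr": 1120,
    "document": 1120,
}

print(pick_budget(100))            # 140
print(pick_budget(5000))           # 1120
print(TASK_DEFAULTS["video"] * 60) # tokens for 60 frames at the lowest budget: 4200
```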
6. Audio
Use the following prompt structures for audio processing:
- Audio Speech Recognition (ASR)
Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
- Automatic Speech Translation (AST)
Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
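The two prompt structures can be kept as templates and filled in per request. The sketch below assumes the line breaks shown in the prompts are literal newlines, and it builds the message in the same shape as the audio example earlier in this card, with the audio placed after the text. The lowercase placeholder names and the `audio_message` helper are choices made for this example.

```python
ASR_PROMPT = (
    "Transcribe the following speech segment in {language} into {language} text.\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)

AST_PROMPT = (
    "Transcribe the following speech segment in {source}, then translate it "
    "into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, "
    "then one newline, then output the string '{target}: ', then the "
    "translation in {target}."
)


def audio_message(prompt, audio_url):
    """Build a single-turn chat message with the text first and the audio after it."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_url},
            ],
        }
    ]


asr = audio_message(ASR_PROMPT.format(language="English"), "sample.wav")
ast = audio_message(AST_PROMPT.format(source="German", target="French"), "sample.wav")
print(ast[0]["content"][0]["text"])
```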
7. Audio and Video Length
All models support image inputs and can process videos as frames whereas the E2B, E4B, and 12B models also support audio inputs. Audio supports a maximum length of 30 seconds. Video supports a maximum of 60 seconds assuming the images are processed at one frame per second.
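For longer recordings, one simple workaround is to split the input so each request stays inside these limits. This is a naive sketch with made-up helper names: it cuts audio at fixed 30-second boundaries (which can split words) and samples video at one frame per second for the first 60 seconds only.

```python
MAX_AUDIO_SECONDS = 30
MAX_VIDEO_SECONDS = 60


def audio_chunks(duration_s, max_len=MAX_AUDIO_SECONDS):
    """Split [0, duration_s) into consecutive (start, end) windows of at most max_len seconds."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def frame_timestamps(duration_s, fps=1.0, max_len=MAX_VIDEO_SECONDS):
    """Timestamps (in seconds) of frames sampled at `fps`, limited to the first max_len seconds."""
    usable = min(duration_s, max_len)
    n_frames = int(usable * fps)
    return [i / fps for i in range(n_frames)]


print(audio_chunks(75))              # three windows: 0-30, 30-60, 60-75
print(len(frame_timestamps(90)))     # 60 frames
```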
Model Data
Data used for model training and how the data was processed.
Training Dataset
Our pre-training dataset is a large-scale, diverse collection of data encompassing a wide range of domains and modalities, including web documents, code, images, and audio, with a cutoff date of January 2025. Here are the key components:
- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.
Data Preprocessing
Here are the key data cleaning and filtering methods applied to the training data:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Ethics and Safety
As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.
Evaluation Approach
Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google’s AI principles, as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:
- Content related to child sexual abuse material and exploitation
- Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
- Sexually explicit content
- Hate speech (e.g., dehumanizing members of protected groups)
- Harassment (e.g., encouraging violence against people)
Evaluation Results
For all areas of safety testing, we saw major improvements in all categories of content safety relative to previous Gemma models. Overall, Gemma 4 models significantly outperform Gemma 3 and 3n models in improving safety, while keeping unjustified refusals low. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models’ performance.
Usage and Limitations
These models have certain limitations that users should be aware of.
Intended Usage
Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
  - Audio Processing and Interaction: The E2B, E4B, and 12B models can analyze and interpret audio inputs, enabling voice-driven interactions and transcriptions.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data
  - The quality and diversity of the training data significantly influence the model’s capabilities. Biases or gaps in the training data can lead to limitations in the model’s responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models perform well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model’s performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:
- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. Gemma 4 models underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the Responsible Generative AI Toolkit.
- Transparency and Accountability
  - This model card summarizes details on the models’ architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
Risks identified and mitigations:
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
- Perpetuation of biases: Developers are encouraged to monitor continuously (using evaluation metrics and human review) and to explore de-biasing techniques during model training, fine-tuning, and other use cases.
Benefits
At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.
Running with llama.cpp (text, vision, and audio)
This is an omni GGUF, so the same files handle text, images, and audio. Grab any recent stock llama.cpp build and start the server. The multimodal projector (mmproj) is downloaded automatically when you use -hf, so you do not need to pass it yourself:
llama-server -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL --jinja -c 8192
# add -ngl 999 if you have a GPU build
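Loading a multi-gigabyte GGUF can take a while, and requests sent before the model is up will fail. The following is a minimal sketch of waiting for readiness, assuming the server listens on the default 127.0.0.1:8080 and answers on llama-server's /health endpoint. The helper name wait_until_ready and the timeout values are illustrative assumptions, not part of the model card.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url="http://127.0.0.1:8080", timeout=300.0, interval=1.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or `timeout` seconds pass.

    Hypothetical helper: returns True once the server answers 200, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused (server not started yet) or a non-200 status
            # such as 503 while the model is still loading: keep waiting.
            pass
        time.sleep(interval)
    return False
```

Call wait_until_ready() once after launching the server and only start sending chat requests when it returns True.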
Then query it through the OpenAI-compatible API:
import json, base64, urllib.request

def ask(content, max_tokens=256):
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Gemma 4 is a thinking model. Set this to False (or raise max_tokens),
        # otherwise the reply lands in reasoning_content and "content" is empty.
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request("http://127.0.0.1:8080/v1/chat/completions",
                                 json.dumps(body).encode(),
                                 {"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]

b64 = lambda p: base64.b64encode(open(p, "rb").read()).decode()

# Text
print(ask("What is 1+1?"))

# Vision (any image file)
print(ask([
    {"type": "text", "text": "What is in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64," + b64("image.jpg")}},
]))

# Audio (16 kHz mono WAV works best)
print(ask([
    {"type": "text", "text": "Transcribe this audio."},
    {"type": "input_audio", "input_audio": {"data": b64("audio.wav"), "format": "wav"}},
]))
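If you prefer to leave thinking enabled, the reply may arrive in reasoning_content with an empty content field, as the comment in ask() warns. A small sketch of a fallback, assuming the response message carries both fields in the shape that comment describes. extract_reply is a hypothetical helper name, not something the server or the model card provides.

```python
def extract_reply(response):
    """Return the visible answer from a parsed chat-completions response dict.

    Falls back to "reasoning_content" when "content" is missing or blank,
    which can happen when thinking is enabled and max_tokens runs out.
    """
    message = response["choices"][0]["message"]
    content = message.get("content") or ""
    if content.strip():
        return content
    # Assumption: the reasoning text lives under this key, per the comment in ask().
    return message.get("reasoning_content") or ""
```

To use it, have the request code return the whole parsed JSON response and pass that to extract_reply, so the caller can tell an empty answer from a truncated reasoning trace.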
Tips:
- Pass --jinja so the Gemma 4 chat template is applied.
- For audio, feed a 16 kHz mono WAV (convert with ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav). Clean speech transcribes best.
- To force a specific projector precision add --mmproj-url .../mmproj-F16.gguf, or pass --no-mmproj to disable multimodal.
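Before sending audio, it can help to confirm the file really is 16 kHz mono. A minimal sketch using Python's standard wave module, assuming an uncompressed PCM WAV such as the ffmpeg command in the tips produces by default. is_16k_mono is a hypothetical helper name.

```python
import wave

def is_16k_mono(path):
    """Return True if the PCM WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

If the check returns False, re-run the ffmpeg conversion from the tips before passing the file to the server.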