@mylifcc: I'm already running Gemma-4-12b on my Mac. Tech stack: llama.cpp + GGUF Q4_K_M + Metal 32K context, local OpenAI-compatible API. Measured about 36 tok/s, resident RSS about…
Summary
User shares their experience using llama.cpp with the GGUF Q4_K_M quantized version of Gemma-4-12b on a Mac, achieving local inference speed of about 36 tok/s and memory usage of about 10GB.
View Cached Full Text
Cached at: 06/03/26, 09:54 PM
I’ve already got Gemma-4-12b running on my Mac. The tech stack is:
llama.cpp + GGUF Q4_K_M + Metal
32K context, local OpenAI-compatible API
Measured ~36 tok/s, resident RSS ≈ 10GB
Hard to believe — only 10GB of RAM used!
If you also have a Mac with 16GB+ of RAM, check out my setup. You don’t have to use it forever, but can you resist trying it? https://t.co/F3yL6fyAoh
Similar Articles
Gemma4 26b MoE running in MLX with turboquant (and custom kernel)
A developer successfully ran Gemma4 26b MoE on Apple MacBook Air M5 using MLX with turboquant and a custom kernel, achieving faster prompt processing and generation speeds than llama.cpp with lower memory usage. The implementation includes instructions for local deployment.
Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup
A user shares a hands-on comparison of running Gemma 4 with LiteRT-LM on mobile devices versus their previous llama.cpp setup, noting significantly better memory usage (1.5-2 GB vs 4-5 GB) and faster inference (2-4 seconds vs 7-10 seconds) on smartphones like Samsung S25 Ultra and iPhone 13 Pro Max.
@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …
Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.
@HuggingModels: Gemma 4 is here, and it's optimized for Apple Silicon. This 4-bit quantized model runs fast on your Mac, not just in th…
Gemma 4 is a 4-bit quantized model optimized for Apple Silicon, enabling fast local inference on Mac devices, reducing reliance on cloud computing.
@ai_xiaomu: Here comes a full-featured multimodal local model that runs on a MacBook with 16GB: 1. Download LM Studio; 2. Search for Gemma 4 12B and install it; 3. Ask Codex to configure the local API parameters for you; 4. Then enjoy the freedom of tokens.
Guides users on running the Gemma 4 12B multimodal local model on a MacBook with 16GB RAM using LM Studio and Codex, enabling free token usage.