@mylifcc: I'm already running Gemma-4-12b on my Mac. Tech stack: llama.cpp + GGUF Q4_K_M + Metal 32K context, local OpenAI-compatible API. Measured about 36 tok/s, resident RSS about…

X AI KOLs Timeline Tools

Summary

User shares their experience using llama.cpp with the GGUF Q4_K_M quantized version of Gemma-4-12b on a Mac, achieving local inference speed of about 36 tok/s and memory usage of about 10GB.

I'm already using Gemma-4-12b on my Mac. Tech stack: llama.cpp + GGUF Q4_K_M + Metal 32K context, local OpenAI-compatible API Measured about 36 tok/s, resident RSS about 10GB Unbelievable, only 10GB of memory usage! If you also have a Mac with 16GB or more, check out my setup. You don't have to use it all the time, but can you resist giving it a try? https://t.co/F3yL6fyAoh
Original Article
View Cached Full Text

Cached at: 06/03/26, 09:54 PM

I’ve already got Gemma-4-12b running on my Mac. The tech stack is:

llama.cpp + GGUF Q4_K_M + Metal
32K context, local OpenAI-compatible API
Measured ~36 tok/s, resident RSS ≈ 10GB

Hard to believe — only 10GB of RAM used!
If you also have a Mac with 16GB+ of RAM, check out my setup. You don’t have to use it forever, but can you resist trying it? https://t.co/F3yL6fyAoh

Similar Articles

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

Reddit r/LocalLLaMA

A developer successfully ran Gemma4 26b MoE on Apple MacBook Air M5 using MLX with turboquant and a custom kernel, achieving faster prompt processing and generation speeds than llama.cpp with lower memory usage. The implementation includes instructions for local deployment.

Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup

Reddit r/LocalLLaMA

A user shares a hands-on comparison of running Gemma 4 with LiteRT-LM on mobile devices versus their previous llama.cpp setup, noting significantly better memory usage (1.5-2 GB vs 4-5 GB) and faster inference (2-4 seconds vs 7-10 seconds) on smartphones like Samsung S25 Ultra and iPhone 13 Pro Max.