@Michaelzsguo: https://x.com/Michaelzsguo/status/2053217839729791221

X AI KOLs Timeline 05/09/26, 08:57 PM News

local-llm deployment hardware quantization ollama llama-cpp tutorial

Summary

This article is a guide to deploying large language models locally. It covers hardware selection, memory calculations, runtime comparisons, and quantization options, taking readers from a first run to a well-tuned local inference setup.

https://t.co/yl8cj2zWtD

Original Article

Cached at: 05/10/26, 02:20 AM

Local LLMs: From Getting Started to Running Well

0. TL;DR

Many people get stuck with local LLMs not because the model is too complex, but because they don’t know which layer they’re stuck on.

You think you’re running a model, but you’re actually making five decisions simultaneously: Hardware, Memory, Runtime, Model Selection, Quantization. Every layer has pitfalls, every layer can be optimized individually, but they constrain each other. Almost all subsequent issues can be traced back to these five layers.

Two ways to read: If you just want to get it running, read up to Section 1 and start. If you are already tuning speed, context, and quantization, Sections 3 to 6 will be more useful.

1. Get It Running First

Don’t start by researching 70B, Q4, GGUF. Just make a model talk on your machine first.

Apple Silicon Mac (Mac Studio or MacBook Pro) is currently the simplest starting platform. Three commands, done within 10 minutes:

brew install ollama   # If no Homebrew, go to ollama.com to download the installer directly
ollama pull qwen3:8b  # If tag doesn't exist, check available tags at ollama.com/library
ollama run qwen3:8b

Why choose qwen3:8b: Small enough (Q4 quantization ~5GB), smart enough, runs on 8GB RAM, supports both Chinese and English. No need to spend time on model selection for the first run.

Once the model has downloaded and started, you will see the >>> prompt. Ask it any question and wait for the answer to start streaming. Then run these two commands to check what was installed:

ollama list                 # Installed models: name, quantization version, file size
ollama show qwen3:8b        # Model details: quantization level, chat template, layers, context length

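Understand these numbers (ollama run --verbose prints timing statistics after each reply):

tokens/s (TPS): Generation speed. On Apple M-series chips, an 8B model usually reaches 20-60 TPS, which feels smooth. Below 5 TPS usually points to a memory or quantization problem.
prompt eval: Time spent processing the prompt before the first token appears, which drives first-token latency. It grows with model size and prompt length.
Memory usage: Be careful once usage exceeds 80%. System overhead raises latency and can even trigger swap.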
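
The same numbers can be derived from the timing fields that Ollama's /api/generate response carries. The sketch below is a minimal illustration, assuming the fields eval_count, eval_duration, prompt_eval_count and prompt_eval_duration (durations in nanoseconds); the sample values are made up, and summarize_timings is a hypothetical helper name, not part of Ollama.

```python
NS_PER_SECOND = 1e9  # Ollama reports durations in nanoseconds


def summarize_timings(resp: dict) -> dict:
    """Turn raw timing fields from a non-streaming response into readable metrics."""
    gen_seconds = resp["eval_duration"] / NS_PER_SECOND
    prompt_seconds = resp["prompt_eval_duration"] / NS_PER_SECOND
    return {
        # tokens generated per second (the TPS figure discussed above)
        "generation_tps": resp["eval_count"] / gen_seconds,
        # time spent on the prompt before the first token can appear
        "prompt_eval_seconds": prompt_seconds,
        # prompt tokens processed per second
        "prompt_tps": resp["prompt_eval_count"] / prompt_seconds,
    }


# Made-up sample values, for illustration only.
sample_response = {
    "eval_count": 120,
    "eval_duration": 3_000_000_000,
    "prompt_eval_count": 24,
    "prompt_eval_duration": 200_000_000,
}

stats = summarize_timings(sample_response)
print(stats["generation_tps"])  # 40.0
```

Feeding it the JSON returned by a real request (with "stream": false) gives the figures for your own machine.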

It’s running. Now let’s break down the layers.

2. Hardware: Memory Bandwidth is the Core Metric

Many people focus on GPU compute power (FLOPS) when selecting hardware. For local LLMs, this intuition is wrong.

Local inference is usually limited not by how fast the hardware can compute, but by how fast it can read. Every generated token requires reading the model weights from memory once, so memory bandwidth sets the ceiling.

This explains why Apple Silicon often exceeds expectations on local LLMs.
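
The bandwidth argument gives a rough upper bound on generation speed: if every token reads all weights once, TPS cannot exceed bandwidth divided by model size. The sketch below is a back-of-the-envelope estimate under that assumption only. It ignores KV Cache reads and runtime overhead, so real speeds will be lower, and max_tps is a hypothetical helper name. The bandwidth figures are the ones quoted in this article.

```python
def max_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling: one full read of the weights per generated token."""
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures as quoted in this article (GB/s).
devices = {
    "MacBook Pro M4 Max": 546,
    "Mac Studio M3 Ultra": 819,
    "RTX 4090": 1008,
}

model_size_gb = 5  # roughly an 8B model at Q4

for name, bandwidth in devices.items():
    print(f"{name}: at most ~{max_tps(bandwidth, model_size_gb):.0f} tokens/s")
```

The absolute values matter less than the pattern: doubling model size halves the ceiling, whatever the FLOPS figure says.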

Apple Silicon Path

M series chips use Unified Memory Architecture: CPU and GPU share the same memory pool with no dedicated VRAM, so weights never have to be copied across PCIe into a separate pool, and the GPU can use a large share of system memory at full memory bandwidth.

MacBook Pro M4 Max: Two tiers of 410 / 546 GB/s, 128GB config usually corresponds to high-end 16-core CPU / 40-core GPU, bandwidth 546 GB/s
Mac Studio M3 Ultra: Memory bandwidth 819 GB/s, optional large memory configs, is the smoothest Apple desktop path for running large local models

NVIDIA Path

RTX 4090 (24GB VRAM): Memory bandwidth approx 1008 GB/s, ceiling for single consumer card
RTX 3090 (24GB VRAM): Bandwidth approx 936 GB/s, lower price, supports NVLink multi-card
Multi-card solution: The 3090 supports NVLink for faster card-to-card transfers, while the 4090 does not. In both cases the model is split across cards (model splitting/parallelism), so 2×24GB cannot simply be treated as one 48GB pool of available VRAM, and complexity is high
Datacenter cards (A100, H100): Significantly higher bandwidth at a far higher price, not an option for consumer scenarios

3. Calculate Memory Requirements

Hardware is just the ceiling. What really decides if it can run is the memory budget.

Memory requirements for running models locally can be split into three parts:

Total Memory Requirement ≈ Weights Memory + KV Cache + Runtime Overhead

Weights Memory is easy to calculate:

Weights Memory = Parameters × Bytes per Parameter

FP16 / BF16 ≈ 2 bytes/param → 8B FP16 ≈ 16 GB
Q8 ≈ 1 byte/param → 8B Q8 ≈ 8 GB
Q4 ≈ 0.5 byte/param → 8B Q4 ≈ 4 GB (approx 5 GB with +20% overhead)
Q3 ≈ 0.375 byte/param → 8B Q3 ≈ 3 GB

8B FP16 ≈ 16GB, Q8 ≈ 8GB, Q4 ≈ 4-5GB. This is why Q4 is the sweet spot.
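
The weights formula is simple enough to script. The following sketch reproduces the numbers above; it assumes 1 GB = 10^9 bytes and uses the approximate bytes-per-parameter values from this section. weights_gb is a hypothetical helper name, and real GGUF files differ somewhat because quantized formats store extra scale data.

```python
# Approximate bytes per parameter, as listed in this section.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5, "Q3": 0.375}


def weights_gb(params_billion: float, quant: str, overhead: float = 0.0) -> float:
    """Weights Memory = Parameters x Bytes per Parameter, plus optional overhead."""
    return params_billion * BYTES_PER_PARAM[quant] * (1 + overhead)


for quant in BYTES_PER_PARAM:
    print(f"8B {quant}: ~{weights_gb(8, quant):.1f} GB")

# Q4 with the ~20% overhead mentioned above
print(f"8B Q4 with overhead: ~{weights_gb(8, 'Q4', overhead=0.2):.1f} GB")
```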

KV Cache is the second variable, and the one many people ignore:

KV Cache ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes/element

Note: Modern models generally use GQA (Grouped Query Attention) or MQA (Multi-Query Attention), where the number of KV heads is smaller than the number of attention heads. Use num_kv_heads in the formula, not the total num_heads. For single-user chat, batch_size is usually 1.

KV Cache grows linearly with ctx_size. 128K vs 4K is 32× KV Cache; but how much total memory grows depends on the proportion of weights vs KV Cache.
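
To see the linear growth in concrete terms, the formula can be evaluated directly. The sketch below assumes an illustrative 8B-class GQA shape (36 layers, 8 KV heads, head_dim 128) with an FP16 cache. These shape values are assumptions for the example, so read the real ones from your model's metadata (for instance via ollama show). kv_cache_bytes is a hypothetical helper name.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_element: int = 2) -> int:
    """KV Cache size; the leading 2 covers the separate K and V tensors."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Assumed model shape, for illustration only.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(seq_len=ctx, **shape) / 2**30
    print(f"ctx={ctx:>6}: KV Cache ~ {gib:.2f} GiB")
```

With this assumed shape, the cache is small next to the weights at 4K but grows past them at 128K, which is the effect described above.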

ctx_size is the memory assassin

This is the pitfall beginners fall into most easily. Many assume that memory usage is fixed once the model is loaded, but the context window size is the real memory variable.

# Control context size via API options (Ollama's option syntax may change across versions; the point is to set num_ctx to your target value)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Explain KV Cache",
    "options": { "num_ctx": 32768 },
    "stream": false
  }'

In Ollama, qwen3:8b typically defaults to a 4096-token context (the default can differ by version). Raising it to 32K increases memory usage mainly through the KV Cache term. This is by design, not a bug: a longer context needs more KV Cache to store the tokens already computed.

Don’t look at absolute numbers in this chart, look at the ratio first: for the same model, the gap between 4K and 32K configs comes mainly from the KV Cache segment.

Quick Reference Table: By Model Size (Q4)

4. Runtime: Choose the Right Tool, Don’t Overthink

You think you’re using qwen3:8b, but what you touch first is Ollama. How the model loads, how the template is applied, how many layers offload to GPU, are all decided by the runtime first. Before you truly understand the model, the runtime has already made a bunch of choices.

Runtime	Who it’s for	Core Features
Ollama	Beginner’s choice	One-click install, automatically handles the prompt template, built-in local API
llama.cpp	Advanced users	Ollama’s underlying engine, finest parameter control, can be used independently
MLX	Apple Silicon + willing to tune	Apple native framework, fastest speed in some scenarios
LM Studio	Don’t want to touch command line	GUI, model library integrated, drag-and-drop operation

Advanced users wanting fine control can skip Ollama and use llama.cpp directly. All those parameters Ollama handles for you are specified manually here:

# --ctx-size: context size
# --n-gpu-layers: layers offloaded to GPU, 99 = all
# --flash-attn: Flash Attention
# --jinja: use the model's built-in chat template, reduces prompt format pitfalls
./llama-cli \
  --model ./models/Qwen3-30B-A3B-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --jinja

The --jinja flag deserves a separate mention: it lets llama.cpp read the model’s built-in chat template, rather than relying on default formats. Add this, and many mysterious quality issues disappear.

Local models aren’t just chat UIs. Once a model is running, it can connect to your agents, scripts, and toolchains via API:

# Ollama local native API (/api/generate)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Explain what KV Cache is in three sentences",
    "stream": false
  }'

# If connecting to OpenAI-compatible client, use /v1/chat/completions path
# Your code only needs to change one base_url to switch from cloud model to local
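
The base_url switch mentioned in the comments above can be sketched with the Python standard library alone. This is a minimal illustration, assuming Ollama is serving its OpenAI-compatible endpoint at http://localhost:11434/v1; build_chat_request and chat are hypothetical helper names, and the api_key value is a placeholder that a local server is assumed to ignore.

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:11434/v1"  # assumed local Ollama endpoint


def build_chat_request(base_url: str, model: str, user_text: str,
                       api_key: str = "ollama") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request against any base_url."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder for local use
        },
        method="POST",
    )


def chat(base_url: str, model: str, user_text: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(base_url, model, user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request(LOCAL_BASE_URL, "qwen3:8b", "Explain KV Cache")
print(req.full_url)
```

Calling chat(LOCAL_BASE_URL, "qwen3:8b", "Explain KV Cache") sends the request to the local server; pointing base_url at a cloud provider instead is the only change needed on the client side.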

One last boundary to clarify: this article is about local single-machine inference, not production-grade multi-user serving. If you need to run a shared inference service for multiple people, that is the domain of vLLM and SGLang, and the logic there is completely different from this article.

5. Model Selection: Understand Model Names First

A model filename on HuggingFace might look like this:

Qwen3-30B-A3B-Q4_K_M.gguf

Every segment has meaning. Understand it, and you know what you are downloading.

Qwen3: Model family (Alibaba product, 3rd generation)
30B: Total parameters: 30 billion
A3B: Activated 3B: MoE architecture, only activates approx 3 billion parameters per inference
Q4_K_M: 4-bit K-quant (llama.cpp's block-wise quantization family), Medium variant
.gguf: File format: llama.cpp ecosystem, weights + metadata integrated
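
The naming breakdown above can be turned into a small parser. The sketch below is only an illustration of the segment structure, assuming the exact pattern family-total[-Aactive]-quant.gguf. Real filenames on HuggingFace often carry extra segments (such as Instruct or a date) that this pattern will not match, and parse_model_filename is a hypothetical helper name.

```python
import re

# Assumed pattern: <family>-<total>B[-A<active>B]-<quant>.gguf
MODEL_NAME = re.compile(
    r"^(?P<family>[A-Za-z]+\d*(?:\.\d+)?)"   # e.g. Qwen3
    r"-(?P<total>\d+(?:\.\d+)?B)"            # total parameters, e.g. 30B
    r"(?:-A(?P<active>\d+(?:\.\d+)?B))?"     # optional MoE active parameters
    r"-(?P<quant>Q\d+(?:_[A-Z0-9]+)*)"       # quantization, e.g. Q4_K_M
    r"\.gguf$"
)


def parse_model_filename(filename: str) -> dict:
    """Split a GGUF filename into its named segments."""
    match = MODEL_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized filename layout: {filename}")
    return match.groupdict()


info = parse_model_filename("Qwen3-30B-A3B-Q4_K_M.gguf")
print(info)
```

For a dense model such as Qwen3-8B-Q4_K_M.gguf, the active field comes back as None, which is a quick way to tell MoE and dense files apart.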

5a. Base / Instruct / Chat

Even under the same Qwen or Llama name, Base and Instruct versions serve different usage scenarios. Base continues whatever text you give it; Instruct follows your instructions. Chat is further optimized for dialogue format on top of Instruct. Running locally almost always calls for the Instruct or Chat version. Base is for continued training.

Prompt format pitfalls

Instruct models are highly sensitive to prompt format. Different model families use different chat templates; the Qwen3 and Llama 3 formats, for example, are not the same:

Wrong approach (throwing question directly to model): “Please explain KV Cache”

Correct format (Qwen3 chat template, illustrative; prefer letting the runtime read the model’s built-in template):

<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Please explain KV Cache
<|im_end|>
<|im_start|>assistant

Ollama applies the correct template automatically, and so does llama.cpp with --jinja. With the wrong format, model quality drops for no obvious reason. This is a real pitfall, often misdiagnosed as “this model sucks”.
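
To make "applying a template" concrete, here is a minimal sketch that wraps a message list in ChatML-style markers like the ones shown above. It is illustrative only: the exact whitespace and any extra fields are defined by the model's own built-in template, which the runtime should normally apply for you, and render_chatml is a hypothetical helper name.

```python
def render_chatml(messages: list) -> str:
    """Wrap each message in ChatML-style markers and open the assistant turn."""
    parts = [
        f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
        for msg in messages
    ]
    # The trailing open turn tells the model it is the assistant's turn to write.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please explain KV Cache"},
])
print(prompt)
```

Sending the bare question skips all of these markers, which is why the same model can look much worse without its template.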

5b. Quantization: Trade-off between Memory and Quality

Quantization compresses an 8B model from about 16GB at FP16 down to about 5GB at Q4. You lose a bit of precision, and in exchange the model actually fits in your machine.

FP16/BF16 → Q8 → Q6 → Q4 → Q3 → Q2
Precision: High → Low
Size: Large → Small

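K-quants: The K in Q4_K_M marks llama.cpp's k-quant family, which quantizes weights in blocks that each carry their own scale. _S/_M/_L stand for Small/Medium/Large and indicate how much of the model is kept at higher precision. At the same bit width, K-quants usually give better quality than the older uniform quantization formats.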

Practical rules for choosing quantization:

Memory enough → Q8
Quality priority, limited memory → Q5_K_M or Q6_K
Daily driver → Q4_K_M (Sweet spot)
Memory forces you → Q3 (cautious), Q2 basically not recommended

A counter-intuitive conclusion: In many daily tasks, an 8B model at Q4 is often a better choice than a 3B model at FP16. This is not because quantization is lossless, but because the benefit of a larger model usually outweighs the quality lost at Q4.

Quantization	Memory Save	Quality Risk	Best Scenario	Avoid When
Q8	Low (vs FP16 approx -50%)	Very Low	Memory enough, quality priority	Memory tight
Q6_K	Medium	Very Low	Quality sensitive but memory limited	—
Q5_K_M	Medium-Low	Low	Balanced choice	—
Q4_K_M	High	Acceptable	Daily driver (Sweet spot)	Extremely quality sensitive
Q3_K_M	Very High	Medium-High	Memory forces no choice	Code/Reasoning tasks
Q2_K	Extreme	High	Basically not recommended	Almost all serious tasks

5c. MoE: Total Parameters and Active Parameters are Not the Same

The most misunderstood part of MoE is here. The A3B suffix is telling you: it has 30B total, but only activates approx 3B per inference.

Memory requirement depends on total parameters: All weights must be loaded into memory
Compute load depends on active parameters: Only this part participates in current inference computation

So for 30B-A3B: memory must fit all 30B weights (approx 18GB @ Q4), while the compute load is closer to that of a 3B model. Actual inference speed also depends on how fast the weights can be read, the runtime implementation, and hardware bandwidth, so it is not purely 3B speed, but it is significantly faster than a Dense 30B.
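
The split between what must fit in memory and what is touched per token can be written out as arithmetic. The sketch below uses the Q4 approximation of 0.5 bytes per parameter and the ~20% overhead from Section 3; treating the per-token read as just the active parameters is a simplification, and moe_budget is a hypothetical helper name.

```python
def moe_budget(total_params_b: float, active_params_b: float,
               bytes_per_param: float = 0.5, overhead: float = 0.2):
    """Return (memory needed in GB, approx weights read per token in GB)."""
    # All experts must be resident, so memory follows TOTAL parameters.
    memory_gb = total_params_b * bytes_per_param * (1 + overhead)
    # Only the routed experts are used per token, so reads follow ACTIVE parameters.
    read_per_token_gb = active_params_b * bytes_per_param
    return memory_gb, read_per_token_gb


moe_memory, moe_read = moe_budget(total_params_b=30, active_params_b=3)
dense_memory, dense_read = moe_budget(total_params_b=30, active_params_b=30)

print(f"30B-A3B:   ~{moe_memory:.0f} GB in memory, ~{moe_read:.1f} GB read per token")
print(f"Dense 30B: ~{dense_memory:.0f} GB in memory, ~{dense_read:.1f} GB read per token")
```

Both models need the same memory, but the MoE variant touches roughly a tenth of the weights per token, which is where its speed advantage comes from.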

MoE lets you try models with higher total parameters at a lower compute cost per token, but quality still depends on training, data, routing, and quantization; MoE is not naturally smarter. A 32GB machine can’t run a Dense 70B but can run a MoE 30B-A3B. Whether it’s worth trying depends on the specific task.

Bigger models aren’t always better: For local agents doing tool calls, latency and stable instruction following may matter more than benchmark rankings. A fast 14B that reliably follows instructions may be more usable than a sluggish 70B. Task match is more important than parameter count.

6. Optimization: From Running to Running Well

If the model is already running fast enough, skip this section. It covers which knobs to turn when it’s not fast enough.

6a. Basic Parameters

# Ollama main runtime parameters (passing via API options is more stable)
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3:8b",
    "prompt": "Your question",
    "options": {
      "num_ctx": 8192,       
      "num_gpu": 99,         
      "num_batch": 512       
    },
    "stream": false
  }'

num_ctx: Context window size, directly affects KV Cache memory.
num_gpu: Number of layers offloaded to GPU.
num_batch: Batch size for prompt processing, affects prompt processing speed.
temperature, top-p, top-k: Control generation randomness. They are not performance knobs, so beginners should stick to the defaults.

6b. Attention Optimization

Flash Attention: Standard attention materializes the full attention matrix in memory. Flash Attention computes it in blocks instead, which greatly reduces memory traffic and gives higher speed and lower memory usage at the same ctx_size. Some runtimes and configs enable it by default, so check your version’s docs to confirm. In llama.cpp, enable it explicitly with --flash-attn on.

KV Cache Quantization: Quantizes the KV Cache itself (usually to Q8 or Q4), further reducing memory usage at long context. The effect is most noticeable at 32K+ ctx.
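
A rough estimate shows how much KV Cache quantization saves. The sketch below assumes the ggml block layouts q8_0 (34 bytes per 32 elements) and q4_0 (18 bytes per 32 elements), and an illustrative 8B-class shape of 36 layers, 8 KV heads and head_dim 128. The shape is an assumption for the example, and kv_cache_gib is a hypothetical helper name. In llama.cpp the cache type is typically set with --cache-type-k and --cache-type-v, but check your build's help output.

```python
# Bytes per cached element for each cache type (block size 32 plus an FP16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(cache_type: str, seq_len: int, num_layers: int = 36,
                 num_kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV Cache size in GiB for a single sequence (batch_size = 1)."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


for cache_type in BYTES_PER_ELEMENT:
    print(f"{cache_type}: ~{kv_cache_gib(cache_type, 32768):.2f} GiB at 32K ctx")
```

Under these assumptions, q8_0 roughly halves the cache and q4_0 cuts it to a bit over a quarter, at some cost in quality that is worth testing on your own tasks.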

6c. Smart Quantization: imatrix

Standard quantization reduces precision equally for all weights. imatrix (Importance Matrix) quantization first runs a batch of calibration data, measures how much each weight group matters to the output, then keeps higher precision for “important” weights and compresses “unimportant” ones more aggressively.

At the same Q4 level, an imatrix Q4 usually delivers noticeably higher quality than a standard Q4. On HuggingFace, imat or imatrix in the filename is a good sign.

Turbo Quant follows a similar idea. Different toolchains use different names, but the essence is importance-guided non-uniform quantization.

6d. Emerging Inference Acceleration

Speculative Decoding: A small “draft model” quickly generates candidate tokens, and the main model verifies them in one batch. Tokens that pass verification are used directly; only from the first rejected token onward is output recomputed. When the draft model and main model agree often, TPS can improve significantly. A draft model from the same model family usually works best, but this is not mandatory.
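
The draft-then-verify loop can be shown with toy models. The sketch below is a simplified greedy variant, assuming deterministic next-token functions in place of real models. A real runtime verifies all draft tokens in a single batched forward pass and may use probabilistic acceptance; here verification is a plain loop, and toy_target and toy_draft are made-up stand-ins.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches target-only decoding."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position in order.
        for token in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own token
            if expected != token or len(out) >= goal:
                break  # mismatch (or done): discard the rest of the draft
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the last token is a multiple of 5.
    guess = toy_target(ctx)
    return guess if ctx[-1] % 5 else (guess + 1) % 17


baseline = greedy_decode(toy_target, [2], n_new=12)
speculated = speculative_decode(toy_target, toy_draft, [2], n_new=12)
print(baseline == speculated)  # True
```

The output is identical to decoding with the main model alone; the speedup in a real system comes from checking several draft tokens per main-model pass.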

MTP (Multi-Token Prediction): Traditional inference predicts one token per step, while MTP designs the model to predict multiple tokens per step, in training or at inference. Some new models build this capability in at training time, so no extra draft model is needed. Both routes head in the same direction: getting more output from a single forward pass.

6e. Don’t Just Look at TPS

TPS is speed, not quality. A model running fast but talking nonsense is meaningless.

Don’t just look at benchmarks. Test four things with tasks you will actually use: Can it output valid JSON, can it remember the last paragraph of long material, can it write a runnable function, and can it avoid hallucinating on questions where you know the answer. A 10-minute test is more direct than a benchmark, because it runs on your own tasks.
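
The JSON check is the easiest of the four to automate. The sketch below is one possible way to score a reply, assuming you have already collected the model's raw text output; check_json_reply and the sample replies are hypothetical.

```python
import json


def check_json_reply(reply: str, required_keys: set) -> bool:
    """Pass only if the reply is pure JSON (an object) with all required keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False  # extra prose or broken syntax counts as a failure
    return isinstance(data, dict) and required_keys.issubset(data)


# Made-up replies, for illustration only.
good_reply = '{"term": "KV Cache", "size_gb": 4.5}'
chatty_reply = 'Sure! Here is the JSON: {"term": "KV Cache", "size_gb": 4.5}'

print(check_json_reply(good_reply, {"term", "size_gb"}))    # True
print(check_json_reply(chatty_reply, {"term", "size_gb"}))  # False
```

Running the same prompt a few dozen times and counting the passes gives a quick stability figure for agent or tool-calling use.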

7. Glossary

8. Wrap-up: Five-Layer Decision, Explained Once

Next time you see a model not running, don’t blame the model first. Ask first: Is memory insufficient, ctx opened too large, quantization chosen wrong, or runtime not configured well?

Every layer can be optimized individually, but they constrain each other:

A fast runtime with insufficient bandwidth still hits a TPS ceiling
Pushing quantization lower saves memory, but quality drops too
A well-chosen model that doesn’t fit in memory won’t run

Understanding this stack is understanding local models.

If you find a config that works well on a particular device and model combination, you are welcome to share it.
