Local LLM Inference Optimization: The Complete Guide

Reddit r/LocalLLaMA Tools

Summary

A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.

No content available
Original Article
View Cached Full Text

Cached at: 06/22/26, 01:38 AM

# Local LLM Inference Optimization: The Complete Guide Source: [https://carteakey.dev/blog/local-inference/local-llm-optimization/](https://carteakey.dev/blog/local-inference/local-llm-optimization/) > **Note:**This post was drafted with significant AI assistance, synthesizing notes, bench results, and scripts from the[l3ms homelab toolkit](https://github.com/carteakey/l3ms)and the series of model\-running posts on this site\. The experiments, numbers, and failure modes documented here are real \- the synthesis and prose are AI\-assisted\. ## Preface Over the past year I've written posts on running[gpt\-oss\-120b](https://carteakey.dev/blog/local-inference/optimizing-gpt-oss-120b-local-inference/),[Qwen3\-Coder\-Next](https://carteakey.dev/blog/local-inference/optimizing-qwen3-coder-next-local-inference/),[Gemma 4 26B](https://carteakey.dev/blog/local-inference/running-gemma-4-26b-a4b-locally/),[Qwen3\.6\-35B\-A3B](https://carteakey.dev/blog/local-inference/running-qwen3-6-35b-a3b-locally/), and[Gemma 4 MTP](https://carteakey.dev/blog/running-gemma-4-mtp-locally/)locally on consumer hardware\. Each post has its own notes, failure modes, and tuning results \- but the same lessons keep appearing: enable XMP, pin to P\-cores, quantize your KV cache, don't trust the power profile\. This is my attempt at a master reference\. Instead of re\-discovering flags in every new model post, I want one doc to link back to\. If you're hitting a performance wall, starting from scratch, or just want to understand what each knob actually does \- start here\. The scope is intentionally wide\. We start from "should I even run locally?" and drill all the way down to CUDA environment variables and specific failure modes\. Skip to wherever you're stuck\. --- ## 1\. TL;DR: Start Here - **If you want maximum control and performance:**use`llama\.cpp`directly\. This guide assumes that path\. - **If you want desktop UX, model browsing, and a good local OpenAI\-compatible endpoint:**LM Studio is perfectly reasonable\. - **If you want multi\-user serving, batching, and production throughput:**evaluate`vLLM`\. - **If you are on Apple Silicon:**compare`llama\.cpp`Metal with`mlx`; unified memory changes the sizing math\. - **If TG is bad on MoE models:**check RAM speed before touching flags\. XMP/EXPO being off can cost 2\-3x\. - **If you hit VRAM limits:**reduce context, quantize KV cache, lower`\-\-parallel`, then tune layer placement\. - **If you use MTP speculative decoding:**benchmark draft acceptance and KV cache precision together; raw TPS is not enough\. - **If you are running a single\-user homelab:**prefer`\-\-parallel 1`, explicit context sizing, and static placement once you have a stable config\. ### 1\.1 Where to Jump In This is a reference, not a linear tutorial\. Start with the part that matches the problem: - **MoE generation is slow:**check[RAM speed](https://carteakey.dev/blog/local-inference/local-llm-optimization/#6-1-the-memory-hierarchy-the-most-important-mental-model), then[layer placement](https://carteakey.dev/blog/local-inference/local-llm-optimization/#10-layer-placement-the-core-optimization-for-moe)and[P\-core pinning](https://carteakey.dev/blog/local-inference/local-llm-optimization/#14-3-taskset-p-core-pinning-linux)\. - **The model does not fit or dies later in a session:**start with[`\-\-fit`](https://carteakey.dev/blog/local-inference/local-llm-optimization/#10-4-fit-on-recommended-starting-point),[context and KV cache](https://carteakey.dev/blog/local-inference/local-llm-optimization/#11-context-and-kv-cache), then the[known OOM causes](https://carteakey.dev/blog/local-inference/local-llm-optimization/#known-tg-variability-root-causes)\. - **Vision fails at load or on the first image:**go to[Vision / Multimodal](https://carteakey.dev/blog/local-inference/local-llm-optimization/#19-vision-multimodal)\. The projector and image batch need their own headroom\. - **MTP is no faster than normal decoding:**check[draft acceptance and KV precision](https://carteakey.dev/blog/local-inference/local-llm-optimization/#18-2-the-kv-cache-constraint-ctk-f16-ctv-f16), not just reported TG\. - **You use LM Studio or Ollama:**the[hardware](https://carteakey.dev/blog/local-inference/local-llm-optimization/#6-hardware),[OS](https://carteakey.dev/blog/local-inference/local-llm-optimization/#7-os-choice), and[security](https://carteakey.dev/blog/local-inference/local-llm-optimization/#21-security-notes)sections still apply\. Most llama\.cpp flags do not\. ### 1\.2 Safe Starting Profiles These are conservative baselines for a single\-user server\. They are starting points, not universal optimums; model architecture still changes the memory math\. Workload`\-\-fit\-target`ContextKV cache`\-\-parallel`BatchText, 12 GB VRAM512 MiB64k`q8\_0`/`q8\_0`11024Text, 24 GB VRAM512–768 MiB128k`q8\_0`/`q8\_0`1–21024Vision, 12 GB VRAM2048 MiB64k`q8\_0`/`q8\_0`1256MTP speculative decoding512\+ MiB64k`f16`/`f16`11024> **Avoid the exciting failure modes:**do not squeeze vision below`\-\-fit\-target 2048`on a 12 GB card; do not enable`GGML\_CUDA\_GRAPH\_OPT=1`with less than 512 MiB headroom; do not include E\-cores in a hybrid Intel CPU's thread range; and do not copy`q8\_0`KV settings into an MTP config without measuring draft acceptance\. All four can look fine in a short benchmark and fail in a real session\. ## 2\. Optimization Priority Checklist Ordered by typical impact\. Each item links to the section with the full explanation\. \#ActionImpactSection1**Enable XMP/EXPO in BIOS**2\-3x TG on MoE§6\.12**Use MTP speculative drafting**2\.0x\-2\.6x TG speedup§18\.13**Use QAT low\-bit models**\(e\.g\. Q4 QAT\)Recovers much of the lost low\-bit quality§9\.34**Run Linux**or tune Windows power plan~15\-20% TPS§75**Replace`power\-profiles\-daemon`with`tuned\-ppd`**Eliminates intermittent 20\-30% TG drop§7\.46**Build llama\.cpp from source; keep updated**MoE kernel improvements per release§8\.27**Use`\-\-fit on`**for VRAM\-optimal layer placementMajor TG; no manual tuning§10\.48**Use`\-ctk q8\_0 \-ctv q8\_0`**when not using MTPFrees KV VRAM for extra GPU layers§11\.29**Keep KV cache at`f16`for MTP unless tested otherwise**Preserves draft acceptance on tested Gemma 4 MTP configs§18\.210**Set`\-\-parallel 1`**for single\-user homelabReclaims KV VRAM for weights§11\.311**Pin to P\-cores**with`taskset \-c`\+20\-30% TG on Intel hybrid§14\.312**Enable`\-\-flash\-attn on`**Required for large\-context stability§11\.413**Enable`\-\-no\-mmap`**Eliminates TG jitter from page faults§15\.114**Enable`\-\-mlock`**Prevents mid\-session swap degradation§15\.215**Go headless**\(`systemctl isolate multi\-user\.target`\)Frees 200\-400 MB RAM \+ compositor VRAM§7\.416**iGPU for display**\(motherboard HDMI\)Frees 500\-1000 MB VRAM§6\.217**Set`LLAMA\_SET\_ROWS=1`**Cache locality for MoE expert access§17\.118**Set`GGML\_CUDA\_GRAPH\_OPT=1`**only with enough headroomReduces CUDA dispatch overhead§17\.119**Evaluate ik\_llama\.cpp**for generation\-heavy workloadsPossible TG win at cost of PP§20## 3\. What to Measure Before Tuning Optimization only makes sense if you know which phase is slow\. MetricWhat it tells youCommon bottleneck**TTFT**\(time to first token\)How long before output startsmodel load, prompt processing, cold cache**PP**\(prompt processing\)How fast the model reads input contextbatch size, GPU kernels, long prompts**TG**\(token generation\)How fast output streams after prefillVRAM/RAM bandwidth, layer placement, CPU pinning**VRAM at load**Whether weights, KV cache, and mmproj fitcontext size, parallel slots, quant level**VRAM after long sessions**Whether memory grows into OOM territoryCUDA graphs, VMM pool growth, fit headroom**RAM bandwidth / swap**Whether hybrid MoE weights are bottleneckedXMP/EXPO, channels,`\-\-mlock`, swappiness**Draft acceptance rate**Whether speculative decoding is helpingdraft quality, KV precision, spec lengthDo not optimize from a single short prompt\. Short prompts hide KV cache costs, long\-context VMM growth, and parallel\-slot allocation\. Benchmark at the context length you actually serve\. --- ## 4\. Glossary TermDefinition**GGUF**File format for quantized LLM weights used by llama\.cpp\. Stores weights, metadata, and tokenizer in a single binary\.**Quantization**Reducing weight numerical precision \(FP16 → Q4, etc\.\) to shrink model size and accelerate compute\. More bits = higher quality, larger file\.**QAT \(Quantization\-Aware Training\)**Training/fine\-tuning a model with quantization noise injected\. Allows near\-lossless 8\-bit intelligence at a 4\-bit memory size\.**MTP \(Multi\-Token Prediction\)**Speculative decoding method native to MTP\-trained models \(e\.g\. Gemma 4\)\. Uses a companion draft model to generate multiple candidate tokens in parallel, which the base model validates in one GPU step\.**PP / Prompt Processing**Tokens per second during the prefill phase \- how fast the model reads your input\. GPU\-bound\.**TG / Token Generation**Tokens per second during autoregressive decode \- how fast you see output stream\. Memory\-bandwidth\-bound\.**This is what the user feels\.****KV Cache**Buffer storing attention Key/Value tensors for all prior context tokens\. Grows linearly with context length\. Lives in VRAM\.**Context Window**Maximum total tokens \(input \+ output\) in one session\. Determines KV cache size\.**Dense Model**Standard transformer: all parameters active per token\. Must fit in VRAM for full\-speed inference\.**MoE / Mixture of Experts**Architecture where each token activates only a small subset of "expert" networks\. Enables very large total parameters with low per\-token compute\. Expert weights that don't fit in VRAM can spill to system RAM\.**Active Parameters**For MoE: the subset of params computed per token\. gpt\-oss\-120b: ~5B active of 120B total\. Qwen3\-Coder\-Next: ~3B active of 80B total\. TG speed tracks active count, not total\.**VRAM**Video RAM \- on\-die GPU memory\. Lowest latency, highest bandwidth storage for inference\.**Perplexity**Statistical measure of model surprise on a test corpus\. Lower = better\. Standard proxy for comparing quantization quality levels\.**llama\-bench**CLI tool for synthetic PP and TG benchmarking, included in llama\.cpp\.**llama\-fit\-params**CLI tool that probes free VRAM and outputs optimal`\-ngl`and`\-\-override\-tensor`flags without starting a server\.**`\-ngl`/ n\-gpu\-layers**Number of transformer blocks to load onto GPU VRAM\.**`\-ot`/ override\-tensor**Regex\-based per\-tensor placement override \- route specific weight tensors to CPU or GPU\.**XMP / EXPO**BIOS profiles \(Intel XMP / AMD EXPO\) that run system RAM at its rated speed\. "Auto" in BIOS often defaults to JEDEC base \- a fraction of rated speed\.**Fit**`\-\-fit on`flag: auto\-probes VRAM and computes optimal placement at startup, accounting for KV cache\. --- ## 5\. The Inference Landscape ### 5\.1 Why Run Locally? - **Privacy**: prompts never leave your machine\. - **Cost**: zero marginal cost after hardware\. Amortizes quickly under heavy use\. - **Control**: any model, any quant, any parameters\. No deprecations, no rate limits, no pricing changes\. - **Offline**: works without internet\. - **Experimentation**: swap models, tune parameters, run evals without API contracts\. ### 5\.2 Cloud vs Local \- Honest Tradeoffs Hosted APISelf\-hosted cloud GPULocal hardwareSetupMinutesHoursHours–daysModel quality ceilingFrontierYour choiceYour choicePer\-token cost$1–15/M$0\.10–0\.50/GPU\-hrZero \(marginal\)PrivacyProvider sees dataYou controlFullHardware investmentNoneNone$500–$3000\+Most serious users end up with both: cloud APIs for frontier tasks, local for everything privacy\-sensitive, experimental, or routine\. ### 5\.3 Local Inference Tools ToolBest forNotes**llama\.cpp**Performance tuning, full flag control, any hardwareBuild from source; CLI\-centric**Ollama**Zero\-config, model management, DockerUses llama\.cpp internally; limited tuning surface**LM Studio**Desktop GUI, Windows/macOS, model browsing, local OpenAI\-compatible APIGood UX, supports server/headless workflows, JIT loading, TTL, and auto\-evict**vLLM**Multi\-user production serving, continuous batchingDesigned for full\-VRAM serving; not suited for consumer hybrid setups**exllamav2**Maximum speed for dense modelsCUDA\-only; excellent for models that fully fit in VRAM**mlx**Apple SiliconmacOS only; leverages unified memory; no CUDA**This guide focuses on llama\.cpp\.**Most concepts \(KV cache, quantization, layer placement\) generalize across tools\. ### 5\.4 Backends Within llama\.cpp BackendBuild flagBest for**CUDA**`\-DGGML\_CUDA=ON`NVIDIA GPUs; highest performance and most tested**Vulkan**`\-DGGML\_VULKAN=ON`AMD and Intel GPUs; cross\-platform; works on NVIDIA too**Metal**\(auto on macOS\)Apple Silicon; unified memory**CPU\-only**\(no GPU flag\)Reference; small models or debugging**RPC**`\-DGGML\_RPC=ON`Experimental distributed inference> **This document assumes CUDA\.**Where behavior differs on Vulkan or CPU\-only, it's marked`\[Vulkan\]`or`\[CPU\]`\. If you're on AMD/Intel GPU, most flag logic is the same but CUDA\-specific env vars don't apply\. --- ## 6\. Hardware ### 6\.1 The Memory Hierarchy \- The Most Important Mental Model Token generation speed is limited by how fast the runtime can stream active weights through the memory hierarchy\. Rough bandwidth numbers: ``` VRAM (GPU on-die) ~600–1000 GB/s Unified memory (Apple Silicon) ~200–400 GB/s System RAM ~50–200 GB/s (varies hugely by speed and channel config) NVMe SSD ~5–7 GB/s SATA SSD / HDD ~0.5–3 GB/s ``` **Dense models**: every token reads all active weights\. Model must fit in VRAM for full\-speed inference\. Any spill to system RAM causes a large TG drop\. **MoE models**: only a fraction of expert weights is needed per token, but those weights must still be streamed \- often from system RAM\. TG ∝ RAM bandwidth\. This means a RAM kit running at rated speed vs JEDEC base can deliver 2–3× the bandwidth, translating almost linearly to TG throughput on MoE models\. **Check that your RAM is running at its rated speed:** ``` sudo dmidecode -t memory | grep -E "Speed|Configured" # "Configured Memory Speed" must match your XMP/EXPO profile speed. # If it doesn't, enable XMP/EXPO in BIOS. ``` This is consistently one of the largest single wins available for MoE inference\. "Auto" BIOS settings commonly default to JEDEC base speed regardless of the kit's rating\. ### 6\.2 GPU / VRAM VRAM is the primary inference resource\. More VRAM = more layers on GPU = faster inference\. VRAMDense examplesMoE hybrid examples8 GB7B–13B Q430B–70B, heavy RAM offload12 GB13B–20B Q4; 7B Q870B–120B\+ with CPU offload24 GB34B Q4; 13B Q8120B\+ moderate offload48 GB\+70B Q8; 34B FP16Most MoE partially/fully on GPU80 GB\+120B\+ FP16No offload needed**iGPU display trick \(desktop NVIDIA\):**route display through the motherboard video output instead of the GPU\. This frees 500–1000 MB VRAM the GPU was using for desktop composition\. ### 6\.3 CPU For**dense models**\(all in VRAM\): CPU is nearly idle during inference\. Core count has minimal impact\. For**MoE hybrid**: CPU executes expert forward passes for every token\. Expert compute is the TG bottleneck\.**Intel hybrid \(12th gen\+\):**P\-cores and E\-cores have significantly different throughput for matrix operations\. E\-cores drag TG down by 20–30%\. Always pin to P\-cores: ``` taskset -c 0-11 llama-server ... # i5-12600K: cores 0-11 are P-cores ``` Thread count: set`\-\-threads`to P\-core count, leaving 1–2 for the OS\. More threads than P\-cores is counterproductive\. ### 6\.4 What Is "Good Enough" Throughput? TG speedExperience< 5 t/sPainful; barely usable5–10 t/sFunctional; near human reading speed \(~7 t/s\)10–20 t/sComfortable for interactive chat20–40 t/sFast; coding agents feel snappy40\+ t/sNear\-instant for most tasksFor coding agents: TG dominates\. First\-token latency matters less once the context is warm\. Every t/s gained compounds across a full session\. --- ## 7\. OS Choice ### 7\.1 Linux Highest\-performance path for CUDA inference\. - **~15–20% TPS advantage**over Windows in practice: leaner CUDA driver overhead, better scheduling under sustained load, direct memory control\. - Full access to CPU governor, NUMA, huge pages, cgroups, headless mode\. - Recommended distributions: CachyOS \(real\-time kernel, best power management tuning surface\), Ubuntu/Debian \(easiest CUDA packages\), Arch \(rolling, latest drivers\)\. ### 7\.2 Windows Reasonable; CUDA support is solid\. - Some overhead from Windows scheduler and CUDA runtime\. - **Power plan**: set to "High Performance" or "Ultimate Performance"\. Default "Balanced" throttles CPU under sustained inference load\. - NVIDIA Control Panel → "Power management mode" → "Prefer maximum performance"\. - WSL2: close to native for CUDA; some VRAM overhead from virtualization\. Generally acceptable if dual\-boot isn't an option\. ### 7\.3 macOS Different runtime stack \- CUDA guidance does not apply\. - Metal backend activates automatically\. - **Apple Silicon unified memory**: CPU and GPU share the same physical pool\. A machine with 128 GB RAM effectively has 128 GB "VRAM"\. Transforms the dense model size ceiling\. - Memory bandwidth is excellent \(~400 GB/s on M3 Max\); competitive with mid\-range NVIDIA for the models it can run\. - `mlx`framework is worth evaluating alongside llama\.cpp for Metal workloads\. ### 7\.4 Linux OS Tuning These settings have measurable impact on TG throughput\. #### CPU Governor and Power Profile ``` # Must be "performance" on all cores cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # EPP must also be "performance" - governor alone is not enough on intel_pstate cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference # Verify actual P-core frequency near max boost grep "cpu MHz" /proc/cpuinfo | sort -rn | head -6 ``` **The`power\-profiles\-daemon`trap**: KDE and GNOME ship with`power\-profiles\-daemon`, which can set a non\-performance HWP \(hardware P\-state\) mode on some boots\. The insidious symptom: all sysfs checks \(`scaling\_governor`,`energy\_performance\_preference`,`cpu MHz`\) report "performance" \- but TG runs 20–30% below expected, and it varies between boots\. The degradation happens at the hardware MSR level where standard tooling doesn't look\. Fix \- replace with`tuned\-ppd`: ``` # Arch / CachyOS sudo pacman -S tuned-ppd # removes power-profiles-daemon automatically sudo systemctl enable --now tuned sudo tuned-adm profile throughput-performance # Reboot. TG will be stable and correct across all boots. # Ubuntu / Debian sudo apt install tuned sudo systemctl enable --now tuned sudo tuned-adm profile throughput-performance ``` #### Transparent Huge Pages ``` cat /sys/kernel/mm/transparent_hugepage/enabled # Recommended: [always] echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled ``` #### Headless Mode ``` # Stop desktop compositor - frees 200-400 MB RAM + compositor VRAM sudo systemctl isolate multi-user.target sudo sync && sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" # Restore when done sudo systemctl isolate graphical.target ``` Without a display server, use`zellij`in a TTY for split panes:`zellij`in any TTY gives full terminal multiplexing without X or Wayland\. --- ## 8\. Why llama\.cpp, and How to Build It ### 8\.1 Why Not Ollama? Ollama uses llama\.cpp internally but exposes a minimal, fixed\-default configuration surface\. Every flag in §10\-§17 of this guide \- layer placement, KV quantization, fit parameters, batch sizes, CUDA env vars \- is unavailable or unexposed in Ollama\. Ollama is the right choice for quick setup and model management\. If you're reading this guide, you've outgrown it\. ### 8\.2 Building from Source Always build from source\. Distro packages are outdated and not compiled for your GPU\. MoE inference performance improves significantly with each llama\.cpp release\. **CUDA build:** ``` git clone https://github.com/ggml-org/llama.cpp cd llama.cpp mkdir build && cd build cmake .. \ -DCMAKE_BUILD_TYPE=Release \ -DGGML_CUDA=ON \ -DLLAMA_CURL=ON \ -DGGML_NATIVE=ON \ -DGGML_LTO=ON \ -DGGML_CUDA_GRAPHS=ON \ -DGGML_CUDA_FA=ON \ -DGGML_CUDA_FA_ALL_QUANTS=ON \ -DCMAKE_CUDA_ARCHITECTURES=89 # 89 = RTX 40-series | 86 = RTX 30-series | 75 = RTX 20-series | 61 = GTX 10-series cmake --build . --config Release \ --target llama-server llama-bench llama-fit-params llama-cli --parallel ``` **Vulkan build**\(AMD, Intel GPU\): ``` cmake .. \ -DCMAKE_BUILD_TYPE=Release \ -DGGML_VULKAN=ON \ -DLLAMA_CURL=ON \ -DGGML_NATIVE=ON # CUDA-specific build flags do not apply here ``` Keep your build updated\. Pull and rebuild regularly \- especially before benchmarking a new model\. ### 8\.3 Key Binaries BinaryPurpose`llama\-server`OpenAI\-compatible HTTP inference server`llama\-bench`Synthetic PP and TG benchmarking`llama\-fit\-params`Dry\-run VRAM probe → outputs optimal`\-ngl`\+`\-ot`flags`llama\-cli`Interactive CLI prompt for quick tests`llama\-sweep\-bench`Sweep across parameters \(batch size, ngl, etc\.\) --- ## 9\. Model Selection and Quantization ### 9\.1 Dense vs MoE \- Choose Your Tuning Strategy DenseMoEActive params/tokenAllSmall fraction \(e\.g\. 3B of 80B\)Must fit in VRAMYes, for best performanceNo \- experts can live in RAMPrimary tuning leverQuant level \+ contextLayer placement \+ RAM bandwidthTG speed driverVRAM bandwidthRAM bandwidth \+ GPU layer countExample modelsGemma 4, Llama 3, MistralQwen3, DeepSeek, gpt\-oss### 9\.2 Quantization Reference QuantSize vs FP16QualityNotesQ2\_K~25%Noticeable degradationUse only under extreme size constraintsQ4\_K\_M~35%GoodBest size/quality balance for most situationsQ4\_K\_XL / UD\-Q4\_K\_XL~35%Better than Q4\_K\_MUnsloth Dynamic: layer\-importance\-aware allocationQ5\_K\_M / Q5\_K\_XL~40%Very close to FP16Strong default when VRAM allowsQ6\_K~50%Near\-losslessHigh\-VRAM setupsQ8\_0~65%Effectively losslessIf storage/RAM permitsFP16100%ReferenceMaximum quality; maximum sizeMXFP4 \(native\)~35%Better than Q4\_K\_Mgpt\-oss models: trained in MXFP4, not post\-quantized**UD \(Unsloth Dynamic\) quants**: allocate higher bits to attention\-sensitive layers and lower bits to robust layers\. Better perplexity than uniform quants at the same average bit width\. Generally the best choice when available\. **Rule of thumb**: use the highest quant that fits your VRAM \+ RAM budget\. Q5\_K\_XL or UD\-Q5\_K\_XL is a strong default\. Drop to Q4 only when necessary\. ### 9\.3 Quantization\-Aware Training \(QAT\) Standard Post\-Training Quantization \(PTQ\) quantizes weights*after*the model is fully trained\. When going down to 4\-bit, this rounding process can throw away critical precision, leading to regressions in reasoning, logic, and acrostic constraints\. Quantization\-Aware Training \(QAT\) bypasses this degradation by modeling low\-precision rounding noise*during*the training or fine\-tuning process\. This enables the model weights to adapt to the low\-bit limits\. - **Accuracy Recovery**: In the Gemma 4 QAT builds I tested, 4\-bit QAT behaved much closer to Q8 than ordinary post\-training 4\-bit quantization\. - **VRAM Savings**: A 26B MoE model in standard Q8\_0 or dynamic Q5 consumes ~18 GB, spilling heavily to system RAM on a 12GB card\. The QAT Q4 model size drops to ~14\.2 GB, allowing the vast majority of the model layers to load directly into VRAM for full GPU speed\. --- ## 10\. Layer Placement \- The Core Optimization for MoE > For dense models fully in VRAM: use`\-ngl 99`and skip to §11\. For MoE hybrid setups, layer placement is where most performance lives\. The goal: keep as many blocks as possible on GPU \(especially early layers and attention\), while offloading expert weights to RAM\. ### 10\.1`\-\-n\-gpu\-layers`\(`\-ngl`\) How many transformer blocks to load onto GPU\. Start at`99`\(all layers\)\. Drop if CUDA OOM\. ``` llama-server -m model.gguf --n-gpu-layers 99 # all on GPU llama-server -m model.gguf --n-gpu-layers 37 # 37 on GPU, rest on CPU ``` ### 10\.2`\-\-n\-cpu\-moe` Integer count: keep the named number of MoE layers' expert weights on CPU\. Quick coarse control\. ``` --n-cpu-moe 31 # first 31 MoE layer experts on CPU ``` > ⚠️**RAM ceiling**: for a ~60 GB model, putting all experts on CPU tries to load ~60 GB into RAM\. On a 64 GB system this will hard\-crash the machine\. Always use`llama\-fit\-params`\(§10\.4\) to find safe values before setting this high\. ### 10\.3`\-\-override\-tensor`\(`\-ot`\) \- Fine\-Grained Placement Per\-tensor, per\-layer placement via regex\. Most control, most complexity\. ``` # All expert projections in all layers → CPU: --override-tensor ".ffn_(up|down|gate)_(ch|)exps=CPU" # Layers 5+ experts → CPU; keep layers 0-4 fully on GPU: --override-tensor "blk\.(5|[6-9]|[0-9][0-9]+)\.ffn_(up|down|gate)_(ch|)exps=CPU" ``` **The shared expert gotcha**: some models \(Qwen3\.5\-122B, certain gpt\-oss variants\) have two expert tensor families: - Routed experts:`ffn\_\{up,down,gate\}\_exps` - Shared expert \(always active, 1 per layer\):`ffn\_\{up,down,gate\}\_shexp` A pattern matching only`\_exps`leaves`\_shexp`on GPU, silently consuming VRAM and causing CUDA OOM\. The safe pattern captures both: ``` # (ch|) matches both _exps (routed) and _shexp (shared): --override-tensor ".ffn_(up|down|gate)_(ch|)exps=CPU" ``` Safe to include`\(ch\|\)`even for models without shared experts \- it's harmless and future\-proofs the pattern\. ### 10\.4`\-\-fit on`\- Recommended Starting Point Auto\-probes free VRAM at startup, computes optimal`\-ngl`\+`\-ot`placement automatically\. Zero manual tuning\. ``` llama-server \ -m model.gguf \ --fit on \ --fit-ctx 65536 \ # minimum context to guarantee fits; KV cache for this ctx is accounted for --fit-target 512 \ # VRAM headroom in MiB to leave free ... ``` **`\-\-fit\-target`note**: CUDA's VMM pool grows in ~1 GiB chunks as context depth increases during a session\. Using`\-\-fit\-target 128`produces great bench numbers but can OOM mid\-session when the pool needs to grow\. Use`≥512 MiB`for production text servers\. Use`2048`for vision models where the mmproj allocation is large\. **Dry run without starting a server:** ``` llama-fit-params \ -m /path/to/model.gguf \ -fitt 512 \ # fit-target MiB -fitc 65536 # fit-ctx tokens # Outputs something like: -c 65536 -ngl 49 -ot "blk\.8\.ffn_...=CPU,..." ``` This output is what you hardcode for static placement\. ### 10\.5 Static vs Dynamic Placement `\-\-fit on`Hardcoded`\-ngl`\+`\-ot`Startup\+1–5s probe delayInstantPlacementFresh each bootFixedTuning effortNoneOne\-time from`llama\-fit\-params`Best forExperimentationProductionReproducibilityGood \(if VRAM is free\)DeterministicFor a stable homelab server, derive placement once with`llama\-fit\-params`and hardcode it in your run script\. Use`\-\-fit on`when testing new models or after hardware changes\. --- ## 11\. Context and KV Cache ### 11\.1`\-\-ctx\-size`\- Choosing Context Length The KV cache grows linearly with context and lives in VRAM\. Large context on small VRAM can push expert layers off GPU\. Approximate KV VRAM usage \(varies by model, head count, and layer structure\): Contextf16 KVq8\_0 KV8k~0\.5 GB~0\.25 GB32k~2 GB~1 GB64k~4 GB~2 GB128k~8 GB~4 GBOn a 12 GB card at 128k context with f16 KV, 8 GB goes to KV cache \- leaving only ~4 GB for model weights and attention\. Context choice directly affects how many GPU layers you can afford\. **Practical guidance:**coding sessions work well at 64k; long\-context RAG may need 128k\+; vision inference is typically safest at 64k on 12 GB VRAM\. ### 11\.2 KV Cache Quantization \(`\-ctk`,`\-ctv`\) \[CUDA\] Quantizing the KV cache halves \(q8\_0\) or further reduces \(q4\_0\) its VRAM footprint\. ``` -ctk q8_0 -ctv q8_0 ``` KV quantVRAM vs f16QualityRecommendation`f16`1×LosslessDefault**`q8\_0`****0\.5×****Effectively lossless****Default recommendation**`q4\_0`~0\.33×Some degradation at long contextOnly under extreme VRAM pressure**The compounding effect**: on a 12 GB card with 64k context, switching f16 → q8\_0 KV frees ~2 GB\. That 2 GB lets`llama\-fit\-params`keep one to two additional GPU layers \- translating directly to higher TG\. Confirmed on Qwen3\-Coder\-Next: q8\_0 KV at 64k unlocked 2 extra GPU layers and added ~2 t/s TG vs f16 KV\. > At short bench contexts \(512 tokens\), the KV cache is tiny and this effect is near\-zero\. Always test at your real serving context length\. ### 11\.3`\-\-parallel`\- Concurrent Inference Slots Each slot maintains its own KV cache\.`\-\-parallel 4`multiplies KV VRAM by 4\. ``` --parallel 1 # single user homelab: reclaims KV VRAM for model weights ``` On gpt\-oss\-120b, dropping`\-\-parallel 4`→`\-\-parallel 1`freed ~540 MiB VRAM \- enough for one more GPU layer and \+1 t/s TG\. ### 11\.4`\-\-flash\-attn on`\(`\-fa`\) Reduces attention memory traffic and avoids materializing the full attention matrix, which makes long\-context inference much more practical on constrained VRAM\. No meaningful downside on CUDA in my testing\. ``` --flash-attn on ``` Always enable\. Required for certain KV quantization types on some configurations\. > \[Vulkan\]: Flash attention support varies by driver version\. Verify before relying on it\. --- ## 12\. Batch Sizes ### 12\.1`\-\-batch\-size`\(`\-b`\) \- Prompt Processing Throughput Controls how many tokens are processed in one forward pass during prefill\. Higher = better PP throughput; more VRAM required\. ``` --batch-size 2048 # high throughput (~maximum for most models) --batch-size 1024 # balanced; good default --batch-size 512 # conservative; use for vision or tight VRAM ``` Reduce if you hit CUDA OOM during the prefill phase specifically\. ### 12\.2`\-\-ubatch\-size`\(`\-ub`\) \- Physical Micro\-Batch Physical sub\-batch within a logical batch\. Must be ≤`\-\-batch\-size`\. ``` --ubatch-size 512 # typical default ``` **Vision/multimodal critical**: an image tokenizes to several hundred tokens\. If`\-\-ubatch\-size`< image token count, llama\.cpp throws an assertion during vision inference\. Use`\-\-ubatch\-size 512`or higher and test with your actual image sizes\. On 12 GB VRAM,`\-\-batch\-size 256 \-\-ubatch\-size 512`is a stable vision baseline\. --- ## 13\. Sampling Parameters Sampling controls the probability distribution at each decode step\. These affect output quality and \- via vocabulary truncation \(`top\-k`\) \- slightly affect speed\. **Start with model card defaults\.**Most GGUF releases specify tested values\. Use those before experimenting\. ### 13\.1`\-\-temp`\- Temperature - `0\.0`: greedy / deterministic\. Best for coding agents where reproducibility matters\. - `0\.7`: standard creative chat\. - `1\.0`: no rescaling; follows raw model distribution\. Most modern instruction\-tuned models are calibrated for this\. ### 13\.2`\-\-top\-k`\- Vocabulary Truncation Keeps only top K most probable tokens before sampling\. - `0`: full vocabulary \(default; most diverse; slowest to compute\) - `100`: safe performance cap \- confirmed no measurable quality loss on coding tasks \(gpt\-oss\-120b\) - `20–64`: model\-specific tighter caps ### 13\.3`\-\-top\-p`\- Nucleus Sampling Filters to tokens whose cumulative probability ≥ p\. Applied after`top\-k`\. - `1\.0`: no filtering \(gpt\-oss, Sarvam defaults\) - `0\.95`: standard for chat/code \(Gemma 4, Qwen3 defaults\) ### 13\.4`\-\-min\-p` Filters tokens below`min\-p × max\_token\_probability`\. - `0\.0`off \(most models\) - `0\.01`light floor \(Qwen3\-Coder\-Next\) ### 13\.5`\-\-repeat\-penalty` - `1\.0`: no penalty\. Recommended for code \- code naturally repeats patterns \(variable names, keywords\) and penalizing them degrades output\. - `1\.1–1\.3`: mild penalty for prose\. ### Recommended Defaults by Model Family Modeltemptop\-ktop\-pmin\-prepeat\-penaltyGemma 41\.0640\.950\.01\.0Qwen3 / Qwen3\-Coder1\.0400\.950\.011\.0gpt\-oss\-120b1\.01001\.00\.01\.0Sarvam1\.0201\.00\.01\.0 --- ## 14\. Threading and CPU Control ### 14\.1`\-\-threads`\(`\-t`\) CPU threads for the token generation phase \(expert compute in hybrid MoE setups\)\. ``` --threads 10 # P-core count minus 1-2 for OS headroom ``` More threads than available P\-cores is counterproductive \- they contend for the same memory bus and typically reduce TG\. ### 14\.2`\-\-threads\-batch` CPU threads for the PP \(prefill\) phase\. PP is a burst workload; you can set this to full thread count\. ``` --threads-batch 12 ``` ### 14\.3`taskset`\- P\-Core Pinning \[Linux\] Most reliable way to keep inference off E\-cores on Intel 12th gen\+ \(and other hybrid architectures\): ``` taskset -c 0-11 llama-server ... # pin all process threads to P-cores ``` Verify your P\-core range from CPU documentation or`lstopo`\. On Intel 12600K, cores 0–11 \(6 P\-cores × 2 threads\) are the P\-cores; 12–15 are E\-cores\. ### 14\.4`\-\-poll` Controls CPU spin aggressiveness while waiting for GPU kernel completion\. - `0`: yield/sleep - `100`: busy spin **On hybrid CPU\+GPU inference, this is flat\.**GPU kernel execution and PCIe transfer dominate synchronization\. Confirmed across multiple sweeps \- within noise at all poll levels\. Leave at default`50`or set to`0`to reduce idle CPU load\. Do not tune this\. ### 14\.5`\-\-numa` NUMA affinity modes:`distribute`,`isolate`,`numactl`\. **Single\-socket systems: skip\.**There is only one NUMA node; these modes provide no benefit and can hurt performance\. Use`taskset \-c`for affinity instead\. Relevant only on dual\-socket server hardware \(AMD EPYC, Intel Xeon\) where NUMA topology is real\. --- ## 15\. Memory Control ### 15\.1`\-\-no\-mmap` Without this, llama\.cpp uses memory\-mapped I/O\. Expert weight accesses during decode are non\-sequential \- the OS page fault handler triggers repeatedly for cold pages, adding latency jitter to TG\. With`\-\-no\-mmap`, the entire model loads into RAM before inference begins\. No page faults\. ``` --no-mmap # recommended for all hybrid MoE and persistent server setups ``` Tradeoff: longer startup\. Worth it for any persistent server\. ### 15\.2`\-\-mlock` Pins model pages in RAM, preventing the OS from swapping them under memory pressure\. ``` --mlock ``` Important when`vm\.swappiness`is high \(many Linux distributions default to 60–150 with ZRAM\) or when running close to the RAM ceiling\. Without it, a swap event mid\-session can make TG appear to stall\. Skip only if RAM is critically tight\. --- ## 16\. Priority and Process Settings ### 16\.1`\-\-prio` Scheduling priority for the inference process\. Scale 0–3\. ``` --prio 2 # high priority; reduces OS scheduling jitter on TG ``` ### 16\.2`\-\-no\-warmup` Skips initial kernel warmup pass at startup \(compiles CUDA kernels on first real request instead\)\. ``` --no-warmup # reduces startup time; safe for persistent servers ``` --- ## 17\. CUDA\-Specific Settings \[CUDA\] ### 17\.1 Environment Variables ``` export LLAMA_SET_ROWS=1 # Improves CPU cache locality for MoE expert row access export GGML_CUDA_GRAPH_OPT=1 # Batches CUDA kernel launches; reduces dispatch overhead ``` **`GGML\_CUDA\_GRAPH\_OPT`caveat**: CUDA graph optimization captures the kernel graph at a specific context depth\. When depth increases significantly \(long agentic sessions\), CUDA triggers a re\-capture, growing the VMM pool\. On tight\-fit configs \(< 512 MiB headroom\), this causes mid\-session OOM\. If you observe intermittent CUDA OOM on long sessions, set`GGML\_CUDA\_GRAPH\_OPT=0`\. ### 17\.2 Notable Build Flags FlagEffect`GGML\_CUDA\_GRAPHS=ON`Enables CUDA graph capture at build time`GGML\_CUDA\_FA=ON`Compiles CUDA Flash Attention kernels`GGML\_CUDA\_FA\_ALL\_QUANTS=ON`Flash attention for all quant types`GGML\_NATIVE=ON`CPU\-native optimization; don't use for distributed binaries`GGML\_LTO=ON`Link\-time optimization; slower build, faster runtime### 17\.3 cuBLAS \- Tested and Closed `GGML\_CUDA\_FORCE\_CUBLAS=ON`forces CUDA BLAS routines over the default GGML MMQ \(mixed\-precision matrix quantization\) kernels\. Tested on mxfp4 and Q4 models:**slower**than default\. GGML MMQ has native mxfp4/Q4 paths tuned for consumer decode batch sizes \(1–16 tokens\)\. cuBLAS is optimized for large datacenter batches\. Result: ~45 t/s PP regression, no TG improvement\. Default build wins on consumer hardware\. May be worth re\-evaluating on 24\+ GB cards where larger batch sizes make cuBLAS more competitive\. --- ## 18\. Speculative Decoding & MTP \(Multi\-Token Prediction\) Autoregressive token generation \(TG\) is memory\-bandwidth bound: the GPU must read all active model weights from memory for every single token it generates\. Speculative decoding bypasses this bottleneck by utilizing a lightweight "draft" model to guess upcoming tokens, which the base model verifies in a single forward pass\. On models trained with Multi\-Token Prediction \(MTP\) heads \(like Gemma 4 or Qwen 3\.6\), we use native MTP speculative drafting to achieve massive speedups\. ### 18\.1 MTP Drafting Configuration Flags Instead of pairing the base model with an unrelated draft model, mainline`llama\.cpp`supports native companion MTP draft models: - `\-\-spec\-draft\-model`: Path to the companion MTP GGUF file \(e\.g\.`mtp\-gemma\-4\-26B\-A4B\-it\.gguf`~460MB\)\. - `\-\-spec\-type draft\-mtp`: Tells llama\-server to run in MTP verification mode\. - `\-\-spec\-draft\-n\-max`: The maximum candidate sequence length drafted per iteration\.- For larger models \(e\.g\., Gemma 4 26B\), set to`2`\. Higher values introduce computational overhead that hurts TG\. - For lighter models \(e\.g\., Gemma 4 12B\), set to`4`to capture longer draft runs\. ### 18\.2 The KV Cache constraint \(`\-ctk f16 \-ctv f16`\) MTP draft verification relies on high\-fidelity attention metrics to validate proposed tokens\. In my Gemma 4 MTP tests, quantizing the KV cache \(`\-ctk q8\_0 \-ctv q8\_0`\) introduced enough noise to drive draft acceptance close to zero\. - **MTP Speculative Rule**: For the Gemma 4 MTP configs tested here, leave the KV cache at full precision \(`\-ctk f16 \-ctv f16`\) unless you have benchmarked acceptance rate and throughput on your exact build\. - Switching to`f16`KV cache increases VRAM usage but maintained draft acceptance rates of 70%\+ in these tests, resulting in a massive net speedup\. ### 18\.3 Speculative Performance Gains Tested on a single RTX 4070 12GB: - Gemma 4 26B Baseline: 38\.5 tok/s - Gemma 4 26B QAT \+ MTP:**100\.60 tok/s**\(2\.6x speedup\) - Gemma 4 12B QAT \+ MTP:**120\.80 tok/s**\(2\.0x speedup\) --- ## 19\. Vision / Multimodal ### 19\.1`\-\-mmproj` Path to the multimodal projector file: ``` --mmproj /path/to/mmproj-BF16.gguf ``` Typically 1–3 GB\. Allocates in VRAM at startup alongside the model\. ### 19\.2 OOM Failure Modes on Constrained VRAM **Failure 1 \- mmproj allocation**: the projector needs contiguous VRAM at load time\. If`\-\-fit\-target`left only a small margin, the allocation fails\. Symptom: crash at model load \(not during inference\)\. Fix: use`\-\-fit\-target 2048`for vision models\. **Failure 2 \- batch assertion**: image token count exceeds`\-\-ubatch\-size`\. An image can tokenize to several hundred tokens; if the batch is too small, llama\.cpp asserts\. Fix: use`\-\-ubatch\-size 512`or higher\. ### 19\.3 Safe Vision Profile \(12 GB VRAM\) ``` llama-server \ -m model.gguf \ --mmproj mmproj.gguf \ --ctx-size 65536 \ --fit on --fit-ctx 65536 --fit-target 2048 \ -ctk q8_0 -ctv q8_0 \ --flash-attn on \ --batch-size 256 --ubatch-size 512 \ --no-mmap --mlock \ --parallel 1 ``` Separate text and vision servers on different ports if running both workloads from the same GPU\. --- ## 20\. ik\_llama\.cpp Fork \[Advanced\] [ikawrakow/ik\_llama\.cpp](https://github.com/ikawrakow/ik_llama.cpp)is a fork with MoE\-specific kernel optimizations not yet upstream\. Worth evaluating once you've hit the ceiling on stock llama\.cpp\. ### 20\.1 Key Flags FlagEffectCost`\-fmoe`\(fused\-moe;**default on**\)Fused MoE expert kernel: significant TG upliftPP drops ~50%`\-muge`\(merge\-up\-gate\)Repacks up\+gate projections; better read locality\+25–30 GB RAM; slow startup`\-mqkv`\(merge\-qkv\)Merges Q, K, V projectionsSmall TG gain; no RAM cost`\-ger`\(grouped\-expert\-routing\)Groups token\-expert assignments for cache localityVariable; sweep to confirm### 20\.2 Real Numbers \(Qwen3\-Coder\-Next, RTX 4070 12 GB\) Configpp512 \(t/s\)tg128 \(t/s\)NotesUpstream baseline \(N\_CPU\_MOE=40\)**453**40\.6Referenceik fused\-moe \(default on\)21540\.5PP halves; TG same as upstreamik fused\-moe \+ merge\-qkv21740\.9Marginal gain over fused aloneik fused\-moe \+ merge\-up\-gate215**41\.1**Best TG; \+27 GB RAM overheadik fused\-moe \+ gcr \+ merge\-up\-gate21541\.2Diminishing returnsOn gpt\-oss\-120b, ik\_llama was also tested: CUDA graph compilation overhead dominated; PP collapsed to ~98 t/s \(vs 428 upstream\) and TG dropped to 20 t/s \(vs 28 upstream\)\. Not competitive for that model\. Performance is model\-architecture\-dependent\. ### 20\.3 When to Use - **Use ik\_llama**if TG is your sole bottleneck, you have RAM headroom, and PP regression is acceptable \(generation\-heavy agent loops\) - **Stay on upstream**for RAG, long\-prompt workloads, or when PP matters\. Upstream also has better startup predictability and no extra RAM cost\. --- ## 21\. Security Notes Local inference servers are still HTTP services\. Do not expose`llama\-server`, LM Studio, or any model gateway directly to the public internet without authentication, firewalling, and rate limits\. Minimum safe defaults: - Bind to localhost for local tools unless you explicitly need LAN access\. - Put a reverse proxy with authentication in front of anything reachable outside the machine\. - Assume prompts, outputs, and tool calls may appear in app logs, shell history, reverse proxy logs, or frontend histories\. - Treat model files like software dependencies: check license terms, source, and expected file hashes when possible\. - Keep separate endpoints for trusted local agent workflows and anything exposed to other devices\. If you need remote access, prefer a private VPN, Tailscale, WireGuard, or a locked\-down tunnel over opening the raw inference port\. --- ## 22\. Diagnostic Checklist Run before benchmarking or when TG is unexpectedly low\. ``` # 1. RAM speed - most common culprit for MoE TG underperformance sudo dmidecode -t memory | grep -E "Speed|Configured" # "Configured Memory Speed" must match your XMP/EXPO rated speed # 2. CPU governor cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # Expected: performance # 3. EPP - must be "performance" (governor alone is not enough on intel_pstate) cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference # Expected: performance (not balance_performance, not powersave) # 4. Actual CPU frequency - P-cores should be near rated max boost grep "cpu MHz" /proc/cpuinfo | sort -rn | head -6 # 5. Free VRAM - start inference from near-empty nvidia-smi | grep MiB # 6. Thermal - sustained load under throttle temperature cat /sys/class/thermal/thermal_zone*/temp # In millidegrees; 80000 = 80°C; throttle typically starts 85–105°C depending on chip # 7. Background CPU hogs ps aux --sort=-%cpu | head -10 # 8. Swap activity (high vm.swappiness systems) cat /proc/vmstat | grep -E "pswpin|pswpout" # Growing non-zero values = model weights being swapped mid-session # 9. PCIe link speed [CUDA] nvidia-smi -q | grep -A 3 "PCIe Generation" # Expected: Current Gen = 3 or 4 # 10. Active tuned profile (if using tuned-ppd) sudo tuned-adm active # Expected: throughput-performance ``` ### Known TG Variability Root Causes Root causeSymptomFix`power\-profiles\-daemon`degrading HWPTG varies 20–30% between boots; all sysfs checks look fineReplace with`tuned\-ppd`\+`throughput\-performance`RAM not at rated speed \(XMP/EXPO off\)TG 2–3× below expected; stable but lowEnable XMP/EXPO in BIOSE\-cores included in thread rangeTG lower than P\-core\-only baseline`taskset \-c <p\-cores\-only\>``vm\.swappiness`\+ model too large for RAMTG stalls mid\-session`\-\-mlock`; reduce model or add RAM`GGML\_CUDA\_GRAPH\_OPT=1`with varying contextIntermittent OOM at long promptsSet`GGML\_CUDA\_GRAPH\_OPT=0``\-\-fit\-target`too smallOOM mid\-session \(not at startup\)Increase`\-\-fit\-target`to ≥512 MiBVision`\-\-fit\-target`too smallCrash at model load with mmprojUse`\-\-fit\-target 2048`for vision --- ## Changelog DateNote2026\-06\-21Added a problem\-based reading map, centralized safe starting profiles, and consolidated the easy\-to\-miss vision, CUDA graph, hybrid CPU, and MTP guardrails\.2026\-06\-17Reworked opening structure with a TL;DR, moved priority checklist, added measurement and security sections, updated llama\.cpp/LM Studio guidance, tightened QAT/MTP wording, and fixed stale internal links\.2026\-06\-12Updated optimization priority checklist, renumbered sections, and added dedicated guides for QAT quantization and MTP speculative decoding\.2026\-04\-04Initial post \- synthesized from l3ms scripts, bench\-runbook, and model posts\.## References - [l3ms \- homelab LLM toolkit \(scripts, bench runbook, run scripts\)](https://github.com/carteakey/l3ms) - [llama\.cpp](https://github.com/ggml-org/llama.cpp) - [ik\_llama\.cpp fork](https://github.com/ikawrakow/ik_llama.cpp) - [llama\.cpp server README](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) - [gpt\-oss\-120b optimization post](https://carteakey.dev/blog/local-inference/optimizing-gpt-oss-120b-local-inference/) - [Qwen3\-Coder\-Next 40 t/s post](https://carteakey.dev/blog/local-inference/optimizing-qwen3-coder-next-local-inference/) - [Gemma 4 26B local post](https://carteakey.dev/blog/local-inference/running-gemma-4-26b-a4b-locally/) - [Qwen3\.6\-35B\-A3B local post](https://carteakey.dev/blog/local-inference/running-qwen3-6-35b-a3b-locally/) - [Gemma 4 MTP local post](https://carteakey.dev/blog/running-gemma-4-mtp-locally/)

Similar Articles

Inference Engines for LLMs & Local AI Hardware (2026 Edition)

X AI KOLs

This article provides a comprehensive guide to LLM inference engines for local AI hardware in 2026, explaining how to choose based on hardware strategy, workload, and serving model, and covering engines like llama.cpp, MLX, ExLlamaV2/3, vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo.

LLMs 101: A Practical Guide (2026 Edition)

X AI KOLs

A comprehensive practical guide to LLMs covering inference mechanics, tokens, Transformers, KV cache, local deployment hardware, and quantization as of May 2026.