Tip: use this llama.cpp PR to improve PP on Intel ARC

Reddit r/LocalLLaMA Tools

Summary

A llama.cpp PR significantly improves prompt processing speed on Intel ARC GPUs; in the poster's benchmark on a B580, throughput rose from 245 t/s to 462 t/s. The improvement currently applies only to an F16 KV cache, with support for other KV cache quants planned.

https://github.com/ggml-org/llama.cpp/pull/25222

Another win for Intel ARC users (all 4 of us). The community keeps improving llama.cpp for Intel ARC. This time, the hero behind that pull request (with the help of Claude) improved prompt processing speed by a lot.

For comparison, I have a B580 and a 116k-context conversation. It used to take 510 seconds to process everything from scratch, at 245 t/s; now it takes 262 seconds, at 462 t/s.

Model: Qwen3.6 35B A3B Q5_K_XL, launched with:

    ./llama-server --host 0.0.0.0 --port 8080 \
      --model /models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
      --jinja --threads 8 --ctx-size 262144 --cache-ram 0 --parallel 1 \
      --temperature 0.0 --top-p 0.2 --top-k 20 --no-mmap \
      --spec-type draft-mtp --spec-draft-n-max 3 \
      --batch-size 2700 --ubatch-size 2700 \
      --n-gpu-layers 99 --n-cpu-moe 99

The only catch is that it works only with an F16 KV cache for now, but the contributor said he will work on other quants later.

You see, Intel's hardware is very capable of doing great things, and each contribution by the community and Intel brings us closer to achieving the full speed of the hardware.
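To put the reported numbers in perspective, here is a small Python sketch that works out the speedup from the figures in the post. It uses only the four values quoted (510 s, 262 s, 245 t/s, 462 t/s); the function and variable names are made up for illustration, and the "implied tokens" lines are just rate times duration, so treat them as a rough sanity check rather than a measurement.

```python
def speedup(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Figures as reported in the post (B580, roughly 116k tokens of context).
OLD_SECONDS, NEW_SECONDS = 510, 262
OLD_TPS, NEW_TPS = 245, 462

# Wall-clock speedup: old duration divided by new duration.
time_speedup = speedup(NEW_SECONDS, OLD_SECONDS)
# Throughput speedup: new rate divided by old rate.
rate_speedup = speedup(OLD_TPS, NEW_TPS)

# Tokens implied by rate * duration (an assumption-laden cross-check:
# the reported rates are presumably rounded averages).
implied_old = OLD_TPS * OLD_SECONDS
implied_new = NEW_TPS * NEW_SECONDS

print(f"time speedup:  {time_speedup:.2f}x")
print(f"rate speedup:  {rate_speedup:.2f}x")
print(f"implied tokens: {implied_old} (before), {implied_new} (after)")
```

Both ratios come out a little under 2x, and the implied token counts land in the same ballpark as the stated 116k context, so the reported figures are roughly consistent with each other.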
