Tip: use this llama.cpp PR to improve PP on Intel ARC

Reddit r/LocalLLaMA 07/02/26, 09:29 PM Tools

llama-cpp intel-arc prompt-processing speed-improvement open-source gpu

Summary

A llama.cpp PR significantly improves prompt processing speed on Intel ARC GPUs. In the poster's benchmark on a B580, throughput rose from 245 t/s to 462 t/s. The improvement currently applies only with an F16 KV cache, and the contributor plans to support other quants later.

https://github.com/ggml-org/llama.cpp/pull/25222

Another win for Intel ARC users (all 4 of us). The community keeps improving llama.cpp for Intel ARC. This time, the hero behind that pull request (with the help of Claude) improved prompt processing speed by a lot.

For comparison, I have a B580 and a 116k context conversation. Processing everything from scratch used to take 510 seconds at 245 t/s; now it takes 262 seconds at a very fast 462 t/s.

Model: Qwen3.6 35B A3B Q5_K_XL. Command:

    ./llama-server --host 0.0.0.0 --port 8080 \
      --model /models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
      --jinja --threads 8 --ctx-size 262144 --cache-ram 0 --parallel 1 \
      --temperature 0.0 --top-p 0.2 --top-k 20 --no-mmap \
      --spec-type draft-mtp --spec-draft-n-max 3 \
      --batch-size 2700 --ubatch-size 2700 \
      --n-gpu-layers 99 --n-cpu-moe 99

The only catch is that it works only with F16 KV for now, but the contributor said he will work on other quants later.

You see, Intel's hardware is very capable of doing great things, and each contribution by the community and Intel brings us closer to achieving the full speed of the hardware.
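If you want to sanity-check the numbers from the post, here is a quick sketch of the arithmetic. It only uses the four figures quoted above (510 s at 245 t/s before, 262 s at 462 t/s after); the variable names are made up for illustration and nothing here comes from the PR itself.

```python
# Figures quoted in the post (B580, same conversation processed from scratch).
before_s, before_tps = 510, 245   # seconds, tokens/second before the PR
after_s, after_tps = 262, 462     # seconds, tokens/second with the PR

# Speedup measured two ways: by throughput and by wall-clock time.
throughput_speedup = after_tps / before_tps
time_speedup = before_s / after_s

# Token count implied by each run (time * throughput).
implied_before = before_s * before_tps
implied_after = after_s * after_tps

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"time speedup:       {time_speedup:.2f}x")
print(f"implied tokens:     {implied_before} before, {implied_after} after")
```

Both ratios land just under 2x, and the implied token counts (roughly 121k to 125k) sit in the same ballpark as the 116k context mentioned in the post, so the quoted figures look like rounded values rather than exact measurements.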


Similar Articles

Intel Arc Pro B70 llama.cpp benchmarks posted

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

Dual GPU llama.cpp speedup

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.
