Tip: use this llama.cpp PR to improve PP on Intel Arc
Summary
A llama.cpp PR significantly improves prompt processing (PP) speed on Intel Arc GPUs: a benchmark on a B580 shows throughput rising from 245 t/s to 462 t/s. The improvement currently applies when the KV cache is F16, with support for quantized KV cache types planned.
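To put the two throughput figures in perspective, here is a minimal sketch that converts them into a speedup ratio and an estimated prompt-processing time. Only the 245 t/s and 462 t/s values come from the summary; the 8192-token prompt length and the function names are illustrative assumptions, and real timings will vary with model and context length.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after_tps / before_tps


def prompt_seconds(prompt_tokens: int, tps: float) -> float:
    """Estimated time to process a prompt at a constant tokens-per-second rate."""
    return prompt_tokens / tps


BEFORE_TPS = 245.0    # reported B580 prompt processing before the PR
AFTER_TPS = 462.0     # reported B580 prompt processing with the PR
PROMPT_TOKENS = 8192  # hypothetical prompt length, chosen for illustration

ratio = speedup(BEFORE_TPS, AFTER_TPS)
print(f"speedup: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
print(f"before: {prompt_seconds(PROMPT_TOKENS, BEFORE_TPS):.1f} s")
print(f"after:  {prompt_seconds(PROMPT_TOKENS, AFTER_TPS):.1f} s")
```

The ratio works out to roughly 1.9x, so under the constant-rate assumption a long prompt would take a little over half as long to process.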
Similar Articles
Intel Arc Pro B70 llama.cpp benchmarks posted
Benchmark results for the Intel Arc Pro B70 GPU running llama.cpp with the SYCL backend on Qwen models show 63 tokens per second.
sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp
A pull request for llama.cpp ports multi-column MMVQ from CUDA to SYCL, achieving approximately 45% speculative decoding speedup on Intel Arc GPUs.
Dual GPU llama.cpp speedup
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.
Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro
This article describes how to use the SYCL backend with llama.cpp to achieve over 60 tokens per second on the Qwen 3.6-35B-A3B model using an Intel Arc Pro B70 GPU, with the entire model and KV cache in VRAM.
Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.
A rejected PR for llama.cpp provides up to 30% faster prompt processing for MoE models on AMD Strix Halo hardware, with gains diminishing at higher context lengths.