Cached at:
06/02/26, 03:32 PM
# Taming the Battlemage: 63 Tokens/Sec Sustained on Qwen 3.6 with Intel SYCL
Source: [https://lemongravy.me/articles/intel-gpu-llamacpp/](https://lemongravy.me/articles/intel-gpu-llamacpp/)
Running massive local Large Language Models \(LLMs\) historically required a heavy investment in Nvidia hardware\. However, with the arrival of Intel's discrete GPU architectures like Battlemage and Arc Pro, consumer\-grade Intel hardware has become a highly competitive alternative\.
Achieving optimal performance on Intel GPUs is not plug\-and\-play\. Getting a 35\-billion parameter model like Qwen 3\.6 to sustain**63\+ tokens per second**requires choosing the right compute backend, understanding how hybrid model caching works, and tuning the containerized runtime down to the thread level\.
This guide breaks down the engineering behind this performance tier, detailing how to utilize the native SYCL compute backend, work with the context checkpoint system, and deploy a production\-ready Docker stack\.
> **Hardware note:**This guide was developed on an i5\-12400F with DDR4 memory and an Intel Arc Pro B70 \(32GB\)\. All benchmarks were run against llama\.cpp commit[`210a657`](https://github.com/ggml-org/llama.cpp/commit/210a6570ceda20c5d6439172c09ada08c3754cc9)\. With`\-\-n\-gpu\-layers 99`and`\-\-cache\-ram 0`, the entire model and KV cache live exclusively in VRAM — the PCIe bus only carries token IDs and logits during inference, so host memory bandwidth is effectively irrelevant for token generation\. This is particularly true for sparse MoE models like Qwen 3\.6\-35B\-A3B, where only 3B parameters activate per token\. For dense models or configurations that spill to host RAM, system bandwidth will matter significantly more\.
---
## 1\. Compute Architecture: Why SYCL Outperforms Vulkan \(When It Works\)
When configuring a compute engine like`llama\.cpp`for Intel hardware, developers generally choose between two backends: Vulkan or SYCL\. For compute\-bound machine learning workloads on Intel silicon,**SYCL generally offers higher peak performance**— but the gap depends heavily on your system configuration and driver maturity\.
FeatureVulkan BackendSYCL \(oneAPI\) Backend**Abstraction Layer**High\-level, cross\-vendor graphics APILow\-level, Intel\-native compute runtime**Hardware Hooks**Standard compute shadersDirect execution via Intel Level Zero driver**Matrix Operations**General\-purpose compute shadersCan leverage**Xe Matrix Extensions \(XMX\)**for prompt processing**Memory Pathways**Standard host\-to\-device mappingUnified Shared Memory \(USM\) with optimized allocation**Flash Attention**Supported \(software implementation\)Supported \(software implementation, optimized for Intel\)**Setup Complexity**Low — works out of the box on most systemsHigh — requires oneAPI toolkit, specific driver versions**Stability**More mature on consumer Battlemage cardsCan be sensitive to PCIe gen, RAM speed, and driver versionsVulkan is built for broad compatibility across multiple vendors, meaning it operates on a least\-common\-denominator model\. It cannot natively exploit hardware\-specific compute features built into Intel silicon\.
SYCL, powered by Intel's oneAPI toolkit, communicates directly with the hardware through the Level Zero driver\. This allows the runtime to bind execution queues into the hardware's XMX engines for prompt processing \(GEMM operations\), yielding significant throughput gains on compute\-bound workloads\. For token generation — which is memory\-bandwidth\-bound rather than compute\-bound — the advantage is narrower, but USM optimizations and immediate command list dispatch still provide measurable improvements\.
**A word of caution:**SYCL's advantage is not universal\. Some users have reported that Vulkan delivers better or more consistent performance on certain system configurations, particularly when running dense models that are sensitive to host memory bandwidth\. The reports of poor SYCL performance tend to involve dense models on older platforms — for fully\-offloaded sparse MoE workloads like this one, the backend choice matters more than the bus\. That said, benchmark both backends on your specific hardware before committing to either\.
---
## 2\. Understanding Hybrid Model Caching in llama\.cpp
Qwen 3\.6 is a hybrid model that blends standard Attention layers with Recurrent/Gated\-DeltaNet layers\. This architecture creates specific caching behaviors that are important to understand when optimizing for multi\-turn performance\.
### How Context Checkpoints Work
During inference,`llama\.cpp`saves periodic snapshots of the model's internal state — called context checkpoints — so that subsequent turns in the same conversation don't need to reprocess the entire history\. When a follow\-up message arrives, the server finds the most recent valid checkpoint and restores from it, only processing the new tokens\.
For hybrid models, this checkpoint system is especially important because the recurrent layers maintain a running state that's built sequentially from every previous token\. You can't partially roll back a recurrent state the way you can with a standard KV cache\.
### Multi\-Turn in the Same Conversation: Fast
When continuing an ongoing conversation, the checkpoint system works well\. The server finds the checkpoint from the previous turn, restores the full model state \(attention KV cache \+ recurrent state\), and only evaluates the new tokens:
```
slot update_slots: id 0 | Checking checkpoint with [17, 17] against 21...
slot update_slots: id 0 | restored context checkpoint (pos_min = 17, n_past = 18, size = 62.813 MiB)
slot print_timing: id 0 | prompt eval time = 89.49 ms / 4 tokens
```
89 milliseconds to process the new tokens, because the entire prior conversation state was restored from the checkpoint\.
### Switching Between Conversations: Full Reprocess Required
When the server switches to a*different*conversation, the existing checkpoint becomes invalid — the recurrent state was built from the previous conversation's tokens and is mathematically wrong for a new one\. This is a fundamental property of recurrent architectures, not a software bug:
```
slot update_slots: id 0 | Checking checkpoint with [17, 17] against 2...
slot update_slots: id 0 | forcing full prompt re-processing due to lack of cache data
(likely due to SWA or hybrid/recurrent memory)
```
The server correctly identifies that no checkpoint covers the new conversation's token sequence and reprocesses from scratch\. In practice, this costs a few hundred milliseconds thanks to the fast SYCL prompt processing pipeline \(330–530 t/s after JIT warmup\):
Prompt LengthReprocess Time11 tokens180 ms243 tokens740 ms327 tokens691 ms351 tokens657 msFor single\-user deployments with`\-\-parallel 1`, conversation switching is the norm\. The reprocessing cost scales linearly with prompt length but remains small relative to generation time\.
### The seq\_rm Patch \(PR \#22534 / Issue \#22746\)
There is a separate, genuine bug that affects hybrid models: when the context window fills up and`llama\.cpp`needs to truncate old tokens via`llama\_memory\_seq\_rm`, the recurrent layers can't perform partial sequence deletion\. In unpatched builds, this failure triggers a destructive cache flush:
```
// Unpatched behavior on seq_rm failure
SLT_WRN(slot, "failed to truncate target tokens");
slot.prompt_clear(true);
slot.n_prompt_tokens_cache = 0;
```
The community fix intercepts this failure, recognizes it as an expected limitation of hybrid architectures, and preserves the valid cache state rather than destroying it\. This patch is included in the Dockerfile below and matters most for long conversations that push against the context window limit\.
### First\-Request JIT Compilation
One more behavior worth noting: the very first request after container startup includes a one\-time cost for SYCL kernel JIT compilation\. On this hardware, that's roughly 27 seconds:
```
slot print_timing: id 0 | prompt eval time = 27384.24 ms / 22 tokens
```
This is not a caching issue — it's the SYCL runtime compiling GPU kernels on first use\. Every subsequent request runs at full speed regardless of whether a checkpoint is available\.
---
## 3\. Living on the Master Branch: The Double\-Edged Sword
To achieve 63\+ tokens per second on consumer Intel silicon, you cannot rely on stable, months\-old releases\. You have to compile directly from the latest`llama\.cpp`upstream`master`branch\. However, treating the bleeding edge as a production environment comes with a distinct set of trade\-offs\.
### The Upside: Raw Innovation
By compiling directly from source, you instantly inherit the latest low\-level optimizations engineered by the community\. For Intel hardware, this means getting immediate access to:
- **Upstream SYCL Refactors:**Rapid improvements to how Unified Shared Memory \(USM\) routes data across Intel's Level Zero compute runtimes\.
- **Bleeding\-Edge Architecture Support:**Native support for the complex tokenizing and layer schemes of freshly released models like Qwen 3\.6\.
- **Micro\-Optimizations:**Immediate integration of compiler\-specific flags \(`icx`/`icpx`\) that cut down CPU\-to\-GPU orchestration latency\.
### The Downside: Unvetted Regressions
The reason these optimizations aren't in a stable release yet is because they haven't been thoroughly vetted across every hardware combination\. When you build from`master`, you aren't just getting the latest features —**you are also inheriting the latest bugs\.**
The hybrid model`seq\_rm`issue is a good example — it's the kind of edge case that only surfaces under specific conditions \(long conversations on hybrid architectures\) and requires reading the source to diagnose\. When deploying local LLMs on alternative silicon, compilation from source is mandatory for maximum performance, but**continuous monitoring and targeted patching**are the taxes you pay for living on the frontier\.
---
## 4\. Production Deployment Blueprint
To deploy this setup without altering your host system dependencies or creating conflicting library versions, the entire execution environment is containerized\.
### The Dockerfile
This multi\-stage`Dockerfile`uses Intel's official`deep\-learning\-essentials`container as a foundation\. It pulls the latest`llama\.cpp`master branch, applies the`seq\_rm`hybrid memory patch via an`awk`script, and compiles the runtime using Intel's`icx`and`icpx`compilers\.
```
ARG ONEAPI_VERSION=2025.3.3-0-devel-ubuntu24.04
# ==========================================
# STAGE 1: Build & Patch Engine
# ==========================================
FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS build
ARG LEVEL_ZERO_VERSION=1.28.2
ARG LEVEL_ZERO_UBUNTU_VERSION=u24.04
RUN apt-get update && apt-get install -y git libssl-dev wget ca-certificates gawk && \
cd /tmp && \
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb && \
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb && \
apt-get -o Dpkg::Options::="--force-overwrite" install -y ./level-zero.deb ./level-zero-devel.deb && \
rm -f /tmp/level-zero.deb /tmp/level-zero-devel.deb
WORKDIR /app
# Pin to tested commit — update this when upgrading
ARG LLAMA_CPP_COMMIT=210a6570ceda20c5d6439172c09ada08c3754cc9
RUN git clone https://github.com/ggml-org/llama.cpp.git . && \
git checkout $LLAMA_CPP_COMMIT
# Inject seq_rm Hybrid Memory Patch (PR #22534 / Issue #22746)
# Prevents destructive cache flush when seq_rm fails on hybrid/recurrent layers
RUN cat << 'EOF' > patch.awk
/common_context_seq_rm\(ctx_tgt, slot\.id, p0, -1\);/ {
print " if (!llama_memory_seq_rm(llama_get_memory(ctx_tgt), slot.id, p0, -1)) {"
print " if (ctx_tgt_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL && slot.n_prompt_tokens_cache > 0) {"
print " SLT_INF(slot, \"seq_rm failed (expected for hybrid) - keeping %d cached tokens from checkpoint\\n\", slot.n_prompt_tokens_cache);"
print " } else {"
print " SLT_WRN(slot, \"failed to truncate target tokens with position >= %d - clearing the memory\\n\", p0);"
print " slot.prompt_clear(true);"
print " slot.n_prompt_tokens_cache = 0;"
print " }"
print " }"
print ""
print " if (ctx_dft) {"
print " if (!llama_memory_seq_rm(llama_get_memory(ctx_dft.get()), slot.id, p0, -1)) {"
print " if (ctx_dft_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL && slot.n_prompt_tokens_cache > 0) {"
print " SLT_INF(slot, \"draft seq_rm failed (expected for hybrid) - keeping %d cached tokens\\n\", slot.n_prompt_tokens_cache);"
print " } else {"
print " SLT_WRN(slot, \"failed to truncate draft tokens with position >= %d - clearing memory\\n\", p0);"
print " }"
print " }"
print " }"
getline; getline; getline
next
}
{ print }
EOF
RUN awk -f patch.awk tools/server/server-context.cpp > tmp.cpp && mv tmp.cpp tools/server/server-context.cpp
# Compile using native Intel OneAPI Compilers
RUN cmake -B build -DGGML_NATIVE=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && find build -name "*.so*" -exec cp -P {} /app/lib \;
RUN mkdir -p /app/full && cp build/bin/* /app/full
# ==========================================
# STAGE 2: Runtime Environment Setup
# ==========================================
FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS base
ARG IGC_VERSION=v2.20.5
ARG COMPUTE_RUNTIME_VERSION=25.40.35563.10
RUN mkdir /tmp/neo/ && cd /tmp/neo/ \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-core-2_2.20.5+19972_amd64.deb \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-opencl-2_2.20.5+19972_amd64.deb \
&& wget https://github.com/intel/compute-runtime/releases/download/$COMPUTE_RUNTIME_VERSION/intel-ocloc_25.40.35563.10-0_amd64.deb \
&& wget https://github.com/intel/compute-runtime/releases/download/$COMPUTE_RUNTIME_VERSION/intel-opencl-icd_25.40.35563.10-0_amd64.deb \
&& wget https://github.com/intel/compute-runtime/releases/download/$COMPUTE_RUNTIME_VERSION/libigdgmm12_22.8.2_amd64.deb \
&& wget https://github.com/intel/compute-runtime/releases/download/$COMPUTE_RUNTIME_VERSION/libze-intel-gpu1_25.40.35563.10-0_amd64.deb \
&& dpkg --install *.deb && apt-get update && apt-get install -y libgomp1 curl && apt clean -y && rm -rf /tmp/neo
# ==========================================
# STAGE 3: Final Server Image
# ==========================================
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/lib/ /app
COPY --from=build /app/full/llama-server /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-server" ]
```
### The`docker\-compose\.yaml`
This configuration optimizes performance by minimizing hardware virtualization and OS scheduler interference\.
```
###############################################################################
# llama.cpp SYCL – Qwen3.6-35B-A3B on Intel Arc Pro B70
# - Includes PR #22534 (seq_rm Hybrid Memory Patch)
# - CPU Pinned to Core 4, Threads locked to 1 (Zero Contention)
# - Full GPU Offload with VRAM-only KV Cache
###############################################################################
services:
llama:
build: .
image: llama-server-intel-patched
container_name: llama-qwen36
restart: unless-stopped
network_mode: host
ipc: host
# Low-level system hardware exposure
devices:
- /dev/dri:/dev/dri
privileged: true
# Core Pinning: Lock backend orchestration to single physical core
cpuset: "4"
volumes:
- /llm/models:/models
# SYCL Driver Optimizations
environment:
ONEAPI_DEVICE_SELECTOR: "level_zero:0"
UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS: "1"
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS: "1"
ZES_ENABLE_SYSMAN: "1"
command:
- --model
- /models/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
- --host
- "0.0.0.0"
- --port
- "8080"
# Hardware Offload Parameters
- --n-gpu-layers
- "99"
- --flash-attn
- "on"
# Memory Allocation & Quantized KV Cache
- --ctx-size
- "262144"
- --cache-type-k
- q8_0
- --cache-type-v
- q8_0
# Strict VRAM Retention: Absolute lock on memory swap bounds
- --cache-ram
- "0"
# Execution Concurrency Tuning
- --threads
- "1"
- --parallel
- "1"
# Sampling Architecture
- --temp
- "0.7"
- --top-p
- "0.8"
- --top-k
- "20"
- --presence-penalty
- "0.0"
- --repeat-penalty
- "1.05"
- --min-p
- "0.0"
healthcheck:
test: ["CMD-SHELL", "curl -sf http://localhost:8080/health || exit 1"]
interval: 30s
timeout: 10s
retries: 5
start_period: 60s
```
---
## 5\. Parameter Optimization Breakdown
Every configuration option implemented in the deployment stack serves a distinct purpose to eliminate bottlenecking:
### Operating System & Scheduler Isolation
- **`network\_mode: host`&`ipc: host`**: Eliminates Docker's virtualized network and IPC namespace translation tables\. The server communicates directly with the host OS network stack and shared memory with zero namespace\-crossing overhead\.
- **`cpuset: "4"`**: Hard\-locks the container to physical Core 4 on the host CPU\. By default, Linux's CFS scheduler migrates threads across cores for load balancing\. For a single\-threaded GPU dispatch loop, this core migration invalidates L1/L2 caches unnecessarily\. Pinning to a dedicated core keeps the instruction and data caches hot\.
- **`privileged: true`**: Grants the container full access to`/dev/dri`device paths without permission filtering\. A more targeted alternative is using`\-\-group\-add render`and`\-\-group\-add video`with the`devices`mount, which avoids running the entire container as privileged\.
### Intel Driver Tuning
- **`UR\_L0\_ENABLE\_RELAXED\_ALLOCATION\_LIMITS: "1"`**: Standard Intel compute drivers restrict single contiguous allocations to approximately 4GB\. Because LLM KV caches and weight tensors can exceed this, this parameter forces Level Zero to accept allocations up to the physical VRAM capacity\.
- **`SYCL\_PI\_LEVEL\_ZERO\_USE\_IMMEDIATE\_COMMANDLISTS: "1"`**: Bypasses the typical driver\-level command batching\. Instructions are dispatched directly to the hardware command queue rather than being buffered into regular command lists, reducing dispatch latency\.
### Engine Execution Tuning
- **`\-\-cache\-ram 0`**: Prevents`llama\.cpp`from spilling cold KV cache segments to host system RAM\. This forces the entire cache to remain in VRAM, avoiding the latency penalty of PCIe transfers during generation\.
- **`\-\-threads 1`&`\-\-parallel 1`**: When offloading 100% of model layers \(`\-\-n\-gpu\-layers 99`\) onto the GPU, the CPU's role is reduced to orchestrating the dispatch pipeline\. With a single pinned core, spawning multiple threads would cause contention rather than parallelism\. One thread, one core, one dispatch loop\.
- **`\-\-cache\-type\-k q8\_0`&`\-\-cache\-type\-v q8\_0`**: Quantizes the KV cache to 8\-bit precision\. For a 256K context window, this dramatically reduces the VRAM footprint compared to FP16, making the full context window viable on 32GB without measurable degradation in output quality\.
---
## 6\. Live Performance Data
All numbers below are from real production logs on the hardware described above\.
### Hardware Identification
The SYCL runtime binds to the Level Zero device and detects the full 32GB frame buffer:
```
llama-qwen36 | 0.00.180.301 I device_info:
llama-qwen36 | 0.00.180.317 I - SYCL0 : Intel(R) Graphics [0xe223] (31023 MiB, 31023 MiB free)
llama-qwen36 | 0.00.180.358 I system_info: n_threads = 1 (n_threads_batch = 1) / 12 | CPU : AVX2 = 1 | AVX_VNNI = 1 |
```
### Startup and First Request
Model loading takes ~10 seconds, followed by ~11 seconds of warmup\. The first inference request includes a one\-time SYCL JIT compilation cost of ~27 seconds:
```
llama-qwen36 | 0.00.181.639 I srv llama_server: loading model
llama-qwen36 | 0.10.892.639 I common_init_from_params: warming up the model with an empty run
llama-qwen36 | 0.22.061.656 I srv llama_server: model loaded
llama-qwen36 | 1.03.009.365 I slot print_timing: id 0 | prompt eval time = 27384.24 ms / 22 tokens
llama-qwen36 | 1.03.009.370 I slot print_timing: id 0 | eval time = 2018.40 ms / 128 tokens ( 15.77 ms per token, 63.42 tokens per second)
```
After JIT warmup, the 27\-second prompt eval never recurs\.
### Multi\-Turn Same Conversation: Checkpoint Restore
The second turn in the same conversation restores from a checkpoint and only processes new tokens:
```
llama-qwen36 | 1.05.750.839 W slot update_slots: id 0 | restored context checkpoint (pos_min = 17, n_tokens = 18, n_past = 18)
llama-qwen36 | 1.07.784.813 I slot print_timing: id 0 | prompt eval time = 89.49 ms / 4 tokens
llama-qwen36 | 1.07.784.830 I slot print_timing: id 0 | eval time = 1956.35 ms / 128 tokens ( 15.28 ms per token, 65.43 tokens per second)
```
### Sustained Generation Performance
Across longer generation runs, throughput settles into the 62–64 t/s range:
```
llama-qwen36 | eval time = 2018.40 ms / 128 tokens ( 15.77 ms per token, 63.42 t/s)
llama-qwen36 | eval time = 1956.35 ms / 128 tokens ( 15.28 ms per token, 65.43 t/s)
llama-qwen36 | eval time = 2499.40 ms / 162 tokens ( 15.43 ms per token, 64.82 t/s)
llama-qwen36 | eval time = 25690.01 ms / 1593 tokens ( 16.13 ms per token, 62.01 t/s)
llama-qwen36 | eval time = 15702.83 ms / 1000 tokens ( 15.70 ms per token, 63.68 t/s)
llama-qwen36 | eval time = 23990.09 ms / 1518 tokens ( 15.80 ms per token, 63.28 t/s)
llama-qwen36 | eval time = 15708.12 ms / 1000 tokens ( 15.71 ms per token, 63.66 t/s)
```
Short bursts peak at ~65 t/s; sustained generation over 1000\+ tokens holds at**62–64 t/s**\. The slight decay at longer sequences is expected as KV cache access patterns become less cache\-friendly\.
## llama\-bench Results
ComponentDetailGPUIntel Arc Pro B70BackendSYCL \(Level Zero\)Build`354ebac8c`\(9468\)modelsizeparamsbackendnglthreadstype\_ktype\_vfatestt/sqwen35moe 35B\.A3B Q4\_K \- Medium20\.81 GiB34\.66 BSYCL991q8\_0q8\_01pp512977\.40 ± 2\.02qwen35moe 35B\.A3B Q4\_K \- Medium20\.81 GiB34\.66 BSYCL991q8\_0q8\_01tg12870\.54 ± 0\.12## Conclusion
Maximizing inference performance on alternative silicon demands understanding both hardware behavior and the software stack above it\. By leveraging the SYCL backend for Intel\-native compute access, containerizing workloads with low\-level device mappings, and understanding the caching constraints of hybrid model architectures, Intel GPUs can deliver competitive performance for local LLM deployment\.
63 tokens per second sustained on an i5\-12400F with DDR4 and a $949 GPU — running a 35\-billion parameter model\. The Intel Arc ecosystem for AI inference is maturing rapidly but is not yet at parity with CUDA in terms of software stability and breadth of support\. The performance is real — but so is the engineering effort required to get there\.