Blackwell and PDL performance increase

Reddit r/LocalLLaMA 05/22/26, 09:09 PM Tools

llama-cpp pdl nvidia blackwell performance inference benchmark

Summary

Llama.cpp now supports Nvidia's Programmatic Dependent Launch (PDL) on Blackwell GPUs. In the benchmarks below, token generation improved by 2% to 9% (about 5% to 6% on average), while prompt processing was essentially unchanged. The feature is not enabled by default and requires a build flag.

Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), a feature of newer Nvidia GPUs (CC >= 9.0, which excludes Ada) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels and, as a result, better performance.

PDL is not enabled by default, so if you don't know about it, you will likely miss it. To enable PDL you need to build Llama.cpp with the '**-D GGML\_CUDA\_PDL=ON**' flag. It is not yet enabled for all kernels, so there is likely more performance to be had once more kernels are enabled with PDL.

(To later disable PDL, if needed, do '**export GGML\_CUDA\_PDL=0**' before starting llama.cpp.)

# Benchmarks

|Model|pp512 (t/s)|tg128 (t/s)|pp512 @ PDL (t/s)|tg128 @ PDL (t/s)|pp %|tg %|
|:-|:-|:-|:-|:-|:-|:-|
|Qwen 3.6 35B.A3B MXFP4|5412.39 ± 62.58|172.72 ± 3.94|5416.55 ± 58.92|183.03 ± 0.93|0|5.97|
|Qwen 3.6 35B.A3B UD-Q5\_K\_XL|4564.77 ± 47.55|162.24 ± 6.67|4582.22 ± 45.65|177.11 ± 1.29|0|9.17|
|Gemma 4 26B.A4B NVFP4|6728.74 ± 89.56|107.39 ± 2.44|6850.46 ± 97.86|112.71 ± 0.38|1.8|4.95|
|Qwen 3.6 27B NVFP4|2687.16 ± 70.18|41.31 ± 0.03|2708.97 ± 55.56|42.22 ± 0.05|0|2.2|

(All tests were run with b9282 on an RTX Pro 4500 Blackwell 32GB; results are the best of two runs.)

# Conclusion

There is virtually no difference on pre-fill. Token generation, however, improves by 5% to 6% on average in the tests above. According to the PR, an improvement of somewhere between 4% and 10% on token generation is expected.

As mentioned, PDL is not enabled by default when building. If you are on Blackwell, this is a free lunch and worth trying out.
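For anyone who wants to redo the percentage columns with their own numbers, here is a minimal Python sketch of the arithmetic. The throughput values are copied from the benchmark table in this post; the names `RESULTS`, `pct_gain` and `summarize` are made up for illustration and are not part of llama.cpp.

```python
# Mean throughput in tokens/s, copied from the benchmark table in the post.
# Tuple order: (pp512, tg128, pp512 with PDL, tg128 with PDL)
RESULTS = {
    "Qwen 3.6 35B.A3B MXFP4": (5412.39, 172.72, 5416.55, 183.03),
    "Qwen 3.6 35B.A3B UD-Q5_K_XL": (4564.77, 162.24, 4582.22, 177.11),
    "Gemma 4 26B.A4B NVFP4": (6728.74, 107.39, 6850.46, 112.71),
    "Qwen 3.6 27B NVFP4": (2687.16, 41.31, 2708.97, 42.22),
}


def pct_gain(baseline: float, with_pdl: float) -> float:
    """Relative change of with_pdl over baseline, in percent."""
    return (with_pdl - baseline) / baseline * 100.0


def summarize(results: dict) -> dict:
    """Map each model to its (pp gain %, tg gain %) pair."""
    return {
        model: (pct_gain(pp, pp_pdl), pct_gain(tg, tg_pdl))
        for model, (pp, tg, pp_pdl, tg_pdl) in results.items()
    }


GAINS = summarize(RESULTS)
# Unweighted mean of the token-generation gains across the four models.
MEAN_TG_GAIN = sum(tg for _, tg in GAINS.values()) / len(GAINS)

if __name__ == "__main__":
    for model, (pp, tg) in GAINS.items():
        print(f"{model:30s} pp {pp:5.2f}%  tg {tg:5.2f}%")
    print(f"mean tg gain: {MEAN_TG_GAIN:.2f}%")
```

This only compares the reported means and ignores the ± spread, so small differences (especially on pp512) are within run-to-run noise.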

Similar Articles

Build 9254 fixes my TG regression and adds PDL for NVIDIA GPUs

NCCL-Free Tensor Parallelism on Dual Blackwell PCIe llama.cpp b9095 released!

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

Blackwell LLM Toolkit - NVFP4 Config +Wheels + Benchmarks for Blackwell GPUs via TensorRT-LLM - 270 tk/s Nemotron 3 Omni

@populartourist: llama.cpp release b9235 added some new toys for boosting inference. Benchmarked Qwen3.6 27B on an RTX 5090 with llama.c…
