This article benchmarks vLLM, SGLang, and llama.cpp on a mixed Blackwell/Ada GPU cluster for long context prefill, finding vLLM significantly outperforms others on heterogeneous setups while SGLang crashes with Ada cards due to FP4 support limitations.
I have been running benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill with pipeline parallelism. My setup is a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and three modded 4090 48GB. All tests used 4-bit weights: NVFP4 for vLLM and SGLang, and MXFP4 for llama.cpp.

The main takeaway is that vLLM significantly outperforms the others on mixed multi-GPU setups for long context prefill.

llama.cpp struggles heavily with pipeline parallelism under these conditions. It falls behind vLLM by a factor of roughly 4 to 6 on the 4, 6, and 7 GPU runs, and by about 2.4x on the 2-GPU run. This appears to be due to how the execution graph is handled across multiple devices, with CPU-side embeddings causing graph splits and pipeline bubbles.

SGLang performs very well on a pure Blackwell setup, coming within about 6% of vLLM. However, it crashes immediately if you introduce an Ada card into the pipeline, because it currently lacks a software fallback for FP4 weights and requires Blackwell-class hardware (Compute Capability 10.0 or newer). vLLM handles this seamlessly by emulating FP4 on the older cards.

Another interesting finding is how well vLLM handles uneven GPU splits. By manually tweaking the layer distribution with the VLLM_PP_LAYER_PARTITION environment variable, I was able to balance the compute load between the fast Blackwells and the slower 4090s doing FP4 emulation. This eliminated pipeline bottlenecks and resulted in large speedups even on a 397B model.

Here is the summary of the benchmark results.
| Model and Context | GPU Setup | Engine | TTFT | Prefill Speed |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | vLLM | 10.2s | 18060 t/s |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | llama.cpp | 24.9s | 7405 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | vLLM | 13.2s | 6212 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | llama.cpp | 77.0s | 1065 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | SGLang | Crashed | N/A |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | vLLM | 5.0s | 15084 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | SGLang | 5.3s | 14177 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | llama.cpp | 20.6s | 3662 t/s |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | vLLM | 9.8s | 7683 t/s |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | llama.cpp | 57.2s | 1319 t/s |

If you are building a mixed cluster or relying heavily on pipeline parallelism for large models, vLLM's chunked prefill and manual layer partitioning are incredibly useful. I hope this data is helpful for anyone planning their hardware topologies or struggling with prefill times on multi-GPU setups.

I'm not a native English speaker, so I used an LLM to translate.

Edit: typo
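For anyone who wants to try the uneven split, here is a rough sketch of one way to derive a value for VLLM_PP_LAYER_PARTITION. It is not the exact procedure used for the runs above. The layer count (60) and the relative speed weights are made-up placeholders, and `partition_layers` is a hypothetical helper, not part of vLLM. It only balances compute, so you still have to check that each stage fits in that card's VRAM. As far as I know, vLLM expects one comma-separated integer per pipeline stage, and the integers must add up to the model's layer count.

```python
def partition_layers(num_layers, weights):
    """Split num_layers across pipeline stages in proportion to weights.

    Uses the largest-remainder method so the result always sums to num_layers.
    """
    if num_layers < len(weights):
        raise ValueError("need at least one layer per stage")
    total = sum(weights)
    exact = [num_layers * w / total for w in weights]
    counts = [int(x) for x in exact]
    # Give the leftover layers to the stages with the largest fractional part.
    leftover = num_layers - sum(counts)
    by_fraction = sorted(
        range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    if min(counts) < 1:
        raise ValueError("a stage ended up with zero layers; adjust the weights")
    return counts


# Assumed values for illustration only: 60 layers, four Blackwell cards at
# weight 1.0 and three Ada cards at weight 0.5 (FP4 emulation is slower).
NUM_LAYERS = 60
WEIGHTS = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5]

counts = partition_layers(NUM_LAYERS, WEIGHTS)
env_value = ",".join(str(c) for c in counts)
print(f"VLLM_PP_LAYER_PARTITION={env_value}")
```

Set the printed value in the environment before launching vLLM with a pipeline parallel size equal to the number of entries, then adjust the weights based on which stage is the bottleneck in practice.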
A user discusses the trade-offs between using vLLM and llama.cpp for local, single-user inference on AMD hardware, questioning if vLLM's performance benefits justify the complexity in non-enterprise settings.
llama.cpp is an open-source C/C++ library for efficient LLM inference on local hardware, supporting various quantization methods and multiple backends (CPU, GPU, etc.).
A technical guide on setting up local LLM autocomplete (Qwen2.5-Coder-7B) and agentic coding (Qwen3.6-35B-A3B) on a single 16GB GPU with 64GB+ RAM using llama.cpp, including commands and performance benchmarks.
Llama.cpp now supports Nvidia's Programmatic Dependent Launch (PDL) for Blackwell GPUs, offering a 5-10% performance boost on token generation. The feature is not enabled by default and requires a build flag.