@pupposandro: PFlash now runs @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090. - 111 tok/s decode @ short ctx - 128K TTFT…
Summary
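PFlash now runs @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090. It reaches 111 tok/s decode at short context and a 128K time-to-first-token (TTFT) of 15.91s, a 5.4x faster prefill than llama.cpp, and it passes needle-in-a-haystack (NIAH) tests at every (ctx, keep) point up to 131K context.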
Cached at: 05/15/26, 11:10 PM
PFlash now runs @poolsideai’s Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090.
- 111 tok/s decode @ short ctx
- 128K TTFT in 15.91s, 5.4x faster prefill vs llama.cpp
- NIAH passes every (ctx, keep) point up to 131K
- first MoE target supported by PFlash
- hand-rolled CUDA, ggml only, no libllama
great collab w/ @eisokant, @eric_alcaide, and the rest of the @poolsideai team. looking forward to working more on their great coding models.
repo + GGUF in first comment.
repo: http://github.com/Luce-Org/lucebox-hub…
GGUF: http://huggingface.co/Lucebox/Laguna-XS.2-GGUF…
Thanks!
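The prefill figures quoted in the post can be turned into rough throughput numbers. The sketch below is only a back-of-the-envelope check, not a benchmark: it assumes "128K" means 128 * 1024 tokens and that the 5.4x figure applies to TTFT at that same context length, neither of which the post states. It also treats TTFT as pure prefill time, which slightly understates prefill throughput. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the prefill figures quoted in the post.
# Assumptions (not stated in the post): "128K" means 128 * 1024 tokens, and
# the 5.4x speedup applies to time-to-first-token at that same context length.

CONTEXT_TOKENS = 128 * 1024     # assumed meaning of "128K"
PFLASH_TTFT_S = 15.91           # from the post
SPEEDUP_VS_LLAMA_CPP = 5.4      # from the post


def prefill_throughput(tokens: int, ttft_s: float) -> float:
    """Average prompt tokens processed per second, treating TTFT as prefill time."""
    return tokens / ttft_s


def implied_baseline_ttft(ttft_s: float, speedup: float) -> float:
    """TTFT the baseline would need for the quoted speedup to hold."""
    return ttft_s * speedup


pflash_tps = prefill_throughput(CONTEXT_TOKENS, PFLASH_TTFT_S)
baseline_ttft_s = implied_baseline_ttft(PFLASH_TTFT_S, SPEEDUP_VS_LLAMA_CPP)
baseline_tps = prefill_throughput(CONTEXT_TOKENS, baseline_ttft_s)

print(f"PFlash prefill:            {pflash_tps:,.0f} tok/s")
print(f"Implied llama.cpp TTFT:    {baseline_ttft_s:.1f} s")
print(f"Implied llama.cpp prefill: {baseline_tps:,.0f} tok/s")
```

Under these assumptions the post implies a PFlash prefill rate of roughly 8,200 tok/s and a llama.cpp TTFT of about 86 seconds at the same context length.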