@pupposandro: PFlash now runs @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090. - 111 tok/s decode @ short ctx - 128K TTFT…

X AI KOLs Following 05/14/26, 12:57 PM Tools

inference moe cuda ggml pflash poolside-ai rtx-3090

Summary

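PFlash now supports running @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090, reaching 111 tok/s decode at short context and a 128K-token TTFT of 15.91s (5.4x faster prefill than llama.cpp), with needle-in-a-haystack (NIAH) tests passing at every (ctx, keep) point up to 131K context.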

Original Article

Cached at: 05/15/26, 11:10 PM

PFlash now runs @poolsideai’s Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090.

- 111 tok/s decode @ short ctx
- 128K TTFT in 15.91s, 5.4x faster prefill vs llama.cpp
- NIAH passes every (ctx, keep) point up to 131K
- first MoE target supported by PFlash
- hand-rolled CUDA, ggml only, no libllama
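
As a rough sanity check on the figures above, the short sketch below derives what they imply: average prefill throughput, the llama.cpp TTFT implied by the 5.4x claim, and per-token decode latency. It rests on assumptions not stated in the post: that "128K" means 131,072 tokens, that TTFT is dominated by prefill, and that "5.4x faster prefill" is a ratio of TTFT values. The function names are illustrative only, and the results are derived estimates, not measurements.

```python
# Back-of-the-envelope estimates derived from the post's figures.
# Nothing here is measured; the assumptions are marked inline.

CTX_TOKENS = 128 * 1024   # assumption: "128K" means 131,072 tokens
TTFT_SECONDS = 15.91      # from the post
PREFILL_SPEEDUP = 5.4     # from the post; read here as a TTFT ratio (assumption)
DECODE_TOK_PER_S = 111.0  # from the post (short context)


def prefill_throughput(tokens: int, ttft_s: float) -> float:
    """Average prompt tokens per second, treating TTFT as pure prefill time."""
    return tokens / ttft_s


def implied_baseline_ttft(ttft_s: float, speedup: float) -> float:
    """TTFT the baseline would need for the stated speedup to hold."""
    return ttft_s * speedup


def decode_latency_ms(tok_per_s: float) -> float:
    """Average milliseconds per generated token at a given decode rate."""
    return 1000.0 / tok_per_s


if __name__ == "__main__":
    tput = prefill_throughput(CTX_TOKENS, TTFT_SECONDS)
    base = implied_baseline_ttft(TTFT_SECONDS, PREFILL_SPEEDUP)
    lat = decode_latency_ms(DECODE_TOK_PER_S)
    print(f"implied prefill throughput: {tput:.0f} tok/s")
    print(f"implied llama.cpp TTFT:     {base:.1f} s")
    print(f"decode latency:             {lat:.2f} ms/token")
```

Under these assumptions, the 128K prompt is processed at roughly eight thousand tokens per second, and the same prompt would take well over a minute at the implied baseline speed.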

great collab w/ @eisokant, @eric_alcaide, and the rest of the @poolsideai team. looking forward to working more on their great coding models.

repo + GGUF in first comment.

repo: http://github.com/Luce-Org/lucebox-hub…
GGUF: http://huggingface.co/Lucebox/Laguna-XS.2-GGUF…

Similar Articles

@pupposandro: https://x.com/pupposandro/status/2054241934164492328

X AI KOLs Timeline

The article announces support for DFlash and PFlash speculative decoding in llama.cpp for AMD Strix Halo iGPUs, demonstrating significant speedups in inference performance using ROCm.

@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …

X AI KOLs Following

A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Reddit r/LocalLLaMA

Luce releases DFlash and PFlash support for AMD Strix Halo APUs, achieving 2.23x decode and 3.05x prefill speedups over llama.cpp HIP on Qwen3.6-27B.

BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

Reddit r/LocalLLaMA

BeeLlama v0.2.0 introduces major DFlash speculative decoding improvements, achieving up to 4.93x speedup on single RTX 3090 for Gemma 4 31B and 4.40x for Qwen 3.6 27B, with prompt processing near baseline.

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
