@pupposandro: PFlash now runs @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090. - 111 tok/s decode @ short ctx - 128K TTFT…


Summary

PFlash now supports running @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090, achieving 111 tok/s decode and 5.4x faster prefill than llama.cpp, with NIAH passes up to 131K context.


Cached at: 05/15/26, 11:10 PM

PFlash now runs @poolsideai’s Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090.

  • 111 tok/s decode @ short ctx
  • 128K TTFT in 15.91s, 5.4x faster prefill vs llama.cpp
  • NIAH passes every (ctx, keep) point up to 131K
  • first MoE target supported by PFlash
  • hand-rolled CUDA, ggml only, no libllama
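For a rough sense of scale, the TTFT and speedup figures above can be turned into implied numbers. This is a back-of-the-envelope sketch, not a measurement: it assumes "128K" means 131,072 tokens, that TTFT is dominated by prefill, and that the 5.4x figure applies to the same 128K prompt. The function names are made up for illustration.

```python
def prefill_throughput(prompt_tokens: int, ttft_seconds: float) -> float:
    """Implied prefill rate in tokens/s, assuming TTFT is essentially all prefill."""
    return prompt_tokens / ttft_seconds


def baseline_ttft(ttft_seconds: float, speedup: float) -> float:
    """Implied baseline TTFT in seconds if the baseline is `speedup` times slower."""
    return ttft_seconds * speedup


# Figures quoted in the post; 131072 tokens for "128K" is an assumption.
PROMPT_TOKENS = 128 * 1024
PFLASH_TTFT_S = 15.91
SPEEDUP_VS_LLAMA_CPP = 5.4

if __name__ == "__main__":
    rate = prefill_throughput(PROMPT_TOKENS, PFLASH_TTFT_S)
    base = baseline_ttft(PFLASH_TTFT_S, SPEEDUP_VS_LLAMA_CPP)
    print(f"implied PFlash prefill rate: {rate:.0f} tok/s")
    print(f"implied llama.cpp TTFT:      {base:.1f} s")
```

Under those assumptions the post's numbers imply a prefill rate on the order of 8K tok/s and a llama.cpp TTFT of roughly a minute and a half for the same prompt.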

great collab w/ @eisokant, @eric_alcaide, and the rest of the @poolsideai team. looking forward to working more on their great coding models.

repo + GGUF in first comment.

repo: http://github.com/Luce-Org/lucebox-hub…

GGUF: http://huggingface.co/Lucebox/Laguna-XS.2-GGUF…

Thanks!
