@ying11231: Impressive performance on TPU.
Summary
A blog post from LMSYS Org details optimizing Ling-2.6-1T, a 1 trillion parameter hybrid MoE model, on TPU v7x using SGLang-JAX, achieving efficient inference by hiding MoE data movement behind computation with a single Pallas kernel.
View Cached Full Text
Cached at: 06/17/26, 10:03 PM
Impressive performance on TPU.
LMSYS Org (@lmsysorg): 🚀 Our new blog: Optimizing Ling-2.6-1T on TPU with SGLang-JAX: Hiding MoE Data Movement Behind Compute with One Pallas Kernel
Ling-2.6-1T, a 1T hybrid MoE model, now serves on TPU v7x with SGLang-JAX. The SGLang-JAX team worked together with @inclusionAI on two fronts:
Similar Articles
@leloykun: [WIP] Blog post on Lean4-to-TileLang Tensor Program Superoptimizer here:
A technical blog post introduces a Lean4-to-TileLang tensor program superoptimizer that automatically generates optimized GPU/TPU kernels and hyperparameter scaling laws, demonstrating performance gains over torch.compile.
@hank_aibtc: Family, local LLMs are incredibly impressive! I stumbled upon this gpt-oss-20b-tq3 on Hugging Face, and it's truly captivating! OpenAI's official open-source 20B+ parameter MoE model, optimized by the community using TurboQuant 3-bit quantization + MLX...
The article highlights the gpt-oss-20b-tq3 model, a quantized version of an OpenAI MoE model that runs efficiently on standard 16GB MacBook Airs using TurboQuant and MLX optimizations.
@witcheer: can’t believe gpt-oss-20b perfs on 8GB vRAM 21B total params, 3.6B active (MoE). OpenAI, Apache 2.0. uses only 1.8 GB V…
A new open-source MoE model, gpt-oss-20b (21B total, 3.6B active), runs on only 1.8GB VRAM and achieves perfect scores on agentic coding tasks, outperforming other local models like Gemma and Qwen.
@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…
A user reports achieving over 90 tokens per second inference speed with Qwen 3.6-35b-a3b MoE model on an RTX 3090 using llama.cpp, with prefill speeds exceeding 1000 t/s, indicating practical local deployment of large language models on consumer hardware.
@modal: We worked with @lmsysorg and http://z-lab.ai to - integrate DFlash spec into @sgl_project - make it faster with overlap…
Modal collaborated with LMSys and Z Lab to integrate DFlash speculative decoding into SGLang, achieving up to 4.3x throughput improvement over baseline and 1.5x over native multi-token prediction for large language models.