throughput

#throughput

@FinanceYF5: Kimi K3 went viral in two days. Deedy says there's a whole set of compute economics behind it. 1/ K3 has only been online for two days, but it has already reached #10 on OpenRouter, processing 140 billion tokens daily. However, the servers can't handle the load: throughput dropped from 30 tokens/s to 13 tokens/s, latency spiked to 72 seconds, and time to first token exceeded 20 seconds...

X AI KOLs Following ↗ · 6d ago Cached

Kimi K3 reached #10 on OpenRouter within two days of launch, processing 140 billion tokens daily, but high load caused throughput to drop and latency to spike.
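To put the daily volume in the same units as the per-stream figures, the reported 140 billion tokens per day can be converted to an average rate. This is a rough sketch: it assumes load is spread evenly over 24 hours, and the function name is hypothetical. The source does not say whether the daily total counts prompt tokens, completion tokens, or both, so the result is an aggregate average and not a decode rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tokens_per_second(tokens_per_day: float) -> float:
    """Average aggregate token rate, assuming perfectly even load over a day."""
    return tokens_per_day / SECONDS_PER_DAY


# 140 billion tokens/day, the figure reported in the post
rate = average_tokens_per_second(140e9)
print(f"{rate:,.0f} tokens/s on average")
```

Real traffic is bursty, so peak demand would sit above this average.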


@NVIDIAAI: As AI models continue to grow in scale and capability, shaping a model matters just as much as its size. We're introduc…

X AI KOLs Timeline ↗ · 2026-07-13 Cached

NVIDIA introduces a series on AI Model Co-Design, explaining how model dimensions affect GPU performance and the trade-offs between throughput and interactivity for LLM deployment. The first post provides a practical primer on designing hardware-friendly LLMs to improve system throughput and user responsiveness.


The 4-Bitter Lesson: Balancing Stability and Performance in NVFP4 RL

Hacker News Top ↗ · 2026-07-10 Cached

This article presents a recipe for low-precision (NVFP4) RL training that balances throughput and stability, addressing issues from forward and backward pass quantization errors.


@WescheNex1q: 16 people chatting with Qwen3.6-35B at once ONE DGX Spark. This is a real capture, not a mockup: every token you see re…

X AI KOLs Timeline ↗ · 2026-07-09 Cached

A real-time demo shows 16 concurrent users chatting with Qwen3.6-35B on a single DGX Spark, peaking at 440 tok/s in aggregate and 105 tok/s per user with NVFP4 and MTP-3 on vLLM.


@YRSM_Simon: Crazy

X AI KOLs Timeline ↗ · 2026-07-09 Cached

DeepSeek-V4-Flash-DSpark achieves 328 tok/s single inference and 1.7k tok/s batch throughput on 4x RTX PRO 6000 GPUs.


@Alacritic_Super: If you want to master LLM inference, start with these three papers. They introduced many of the ideas powering today's …

X AI KOLs Timeline ↗ · 2026-07-08 Cached

This thread recommends three key papers for mastering LLM inference: PagedAttention, Sarathi-Serve, and SGLang, which introduce efficient memory management, chunked prefills, and structured generation techniques used in modern inference engines like vLLM and TensorRT-LLM.


Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding

Hugging Face Daily Papers ↗ · 2026-07-07 Cached

The paper introduces Nemotron-Labs-Diffusion, a tri-mode language model that unifies autoregressive, diffusion, and self-speculation decoding, achieving superior throughput and efficiency compared to existing models.


DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation

Hugging Face Daily Papers ↗ · 2026-07-06 Cached

DSpark is a speculative decoding framework that combines semi-autoregressive draft generation with confidence-scheduled verification to accelerate LLM inference and improve throughput in high-concurrency settings.


@wafer_ai: BREAKING: these engineers figured out how to serve GLM 5.2 on @AMD MI355X at 2626 tok/s/node and 213 tok/s single strea…

X AI KOLs Timeline ↗ · 2026-07-03 Cached

Engineers successfully serve GLM 5.2 on AMD MI355X at 2626 tok/s per node and 213 tok/s single stream, achieving ~80% of B200 throughput at over 2x lower cost than Blackwell.


@AnjneyMidha: very cool

X AI KOLs Following ↗ · 2026-06-30 Cached

Etched announced it is coming out of stealth after successful A0 tapeout, with $1B+ customer contracts and $800M raised, claiming SOTA inference performance on its first racks shipping this summer.


@gabriel1: inference will be the biggest market in the world, intelligence is in infinite demand etched is bringing the AI Summer

X AI KOLs Timeline ↗ · 2026-06-30 Cached

Etched, an AI inference hardware startup, exited stealth after raising $800M and securing over $1B in customer contracts. Their first racks ship this summer, claiming state-of-the-art throughput, latency, and power efficiency.


@dzhulgakov: DSpark from @deepseek_ai ingeniously integrates many speculative decoding ideas to achieve 1.5x to 5x higher throughput…

X AI KOLs Following ↗ · 2026-06-27 Cached

DSpark from DeepSeek AI integrates speculative decoding ideas to achieve 1.5x to 5x higher throughput in production systems. This thread explains 10 key ideas from the basics.


@Hikari_07_jp: I got DeepSeek-V4-Flash MTP speculative decoding actually working on 2× RTX PRO 6000 +38% single-stream throughput. It …

X AI KOLs Timeline ↗ · 2026-06-24 Cached

The author got DeepSeek-V4-Flash MTP speculative decoding working on 2× RTX PRO 6000 GPUs, gaining 38% single-stream throughput after fixing a mis-routed quantization format.


@PyTorch: While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering …

X AI KOLs Following ↗ · 2026-06-23 Cached

SGLang provided Day-0 support for DeepSeek-V4, and collaboration between LMSys and NVIDIA engineering teams achieved up to 5x throughput increase in production, with improvements shown on the SemiAnalysis InferenceX dashboard.


8-16 MI50s Minimax M3 @19 tps TG (peak)

Reddit r/LocalLLaMA ↗ · 2026-06-21

Reports a peak throughput of 19 tokens per second for the Minimax M3 model running on 8-16 MI50 GPUs.


@seiji_________: Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in…

X AI KOLs Following ↗ · 2026-06-18 Cached

Ray Serve LLM in Ray 2.56 achieves up to 4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads, matching Rust-based routing frameworks such as vllm-router in production benchmarks. The release was announced in partnership with the Google Cloud GKE team.


Kimi K2.7 Code High Speed costs 2x for roughly 5x the throughput so I only route part of the agent to it

Reddit r/AI_Agents ↗ · 2026-06-18

The Kimi K2.7 Code High Speed model offers 5x throughput at 2x cost, leading to selective routing within an agent system.
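The trade-off behind partial routing can be sketched with a small calculation. This is an illustrative model only, not the poster's method. It assumes that "2x cost" means twice the per-token price, that "5x throughput" means tokens are generated five times faster, and that generation time scales linearly with token count. The function name `blended` and its parameters are hypothetical.

```python
def blended(fraction_fast: float,
            price_mult: float = 2.0,
            speed_mult: float = 5.0) -> tuple[float, float]:
    """Return (relative_cost, relative_time) versus sending every token
    to the standard tier, when `fraction_fast` of tokens use the fast tier."""
    if not 0.0 <= fraction_fast <= 1.0:
        raise ValueError("fraction_fast must be between 0 and 1")
    fraction_std = 1.0 - fraction_fast
    # Cost: standard-tier tokens at 1x price, fast-tier tokens at price_mult.
    relative_cost = fraction_std + fraction_fast * price_mult
    # Time: fast-tier tokens take 1/speed_mult as long to generate.
    relative_time = fraction_std + fraction_fast / speed_mult
    return relative_cost, relative_time


for f in (0.0, 0.25, 0.5, 1.0):
    cost, time = blended(f)
    print(f"fast fraction {f:.2f}: cost x{cost:.2f}, time x{time:.2f}")
```

Under these assumptions, routing half of the tokens to the fast tier raises cost by 50% while cutting generation time by 40%, which is one way to see why only the latency-sensitive part of an agent would be sent there.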


A Guide to AI Inference Engineering (17 minute read)

TLDR AI ↗ · 2026-06-16 Cached

This guide explains the discipline of AI inference engineering, covering the split between prefill and decoding phases, the shift from closed to open models, and optimization techniques for latency, throughput, and cost.


DFlash and Spec V2 Decoding (14 minute read)

TLDR AI ↗ · 2026-06-16 Cached

Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.


@charles_irl: Many are belatedly realizing that intelligence must be open. For open intelligence to succeed, developers must work tog…

X AI KOLs Following ↗ · 2026-06-15 Cached

A collaboration between Modal, SGLang, and Z Lab integrates DFlash speculation into SGLang, achieving up to 4.3x throughput improvement for Alibaba's Qwen 397B-A17B model, advancing open intelligence.
