@mweinbach: Who said TPUs can't be fast! This is roughly Groq speeds out of TPU 8i, but with Gemini Flash model so far more intelli…

X AI KOLs Timeline 05/19/26, 05:20 PM News

tpu gemini-flash groq inference-speed hardware performance

Summary

Google demonstrated a Gemini Flash model running at 600-1400 tokens per second on TPU 8i, peaking around 1480 tok/s and averaging around 800 tok/s, which is roughly in line with Groq's inference speeds.

Who said TPUs can't be fast! This is roughly Groq speeds out of TPU 8i, but with Gemini Flash model so far more intelligent https://t.co/J39FF2yCm6

Original Article

Cached at: 05/20/26, 04:35 PM

Max Weinbach (@mweinbach): Google just showed a demo, Gemini Flash model running between 600-1400 tokens per second on TPU 8i

It peaked out around 1480 tok/s, with average around 800 tok/s

Similar Articles

@analogalok: Gemma 4 12B QAT (dense) achieves 1000+ tokens/sec prefill on 8GB VRAM with 120k context Gemma 4 12B QAT (dense), TurboQ…

X AI KOLs Following

Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Reddit r/LocalLLaMA

Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide and a benchmark comparison showing a 2x speedup over running without MTP.

@onusoz: 16x parallel Gemma-4-26B-A4B-NVFP4 runs 18 output tokens/s, aggregate 300 tok/s 🫪 1 DGX Spark with 128 GB unified memo…

X AI KOLs Timeline

@onusoz demonstrates running 16 parallel instances of NVIDIA's quantized Gemma-4-26B-A4B-NVFP4 model on a single DGX Spark with 128GB unified memory, at about 18 output tokens/s each and roughly 300 tok/s in aggregate, showcasing high concurrency without FlashInfer.

Our eighth generation TPUs: two chips for the agentic era

Hacker News Top

Google unveils 8th-gen TPUs: TPU 8t for training and TPU 8i for inference, purpose-built for power-efficient, large-scale AI agent workloads and arriving later this year.