How fast is 10 tokens per second really?

Simon Willison's Blog 05/20/26, 05:57 PM News

Summary

Simon Willison explores what a generation speed of 10 tokens per second means in practice for large language models: how fast it feels to a reader and what that implies for usability.

Original Article

Cached at: 05/20/26, 06:38 PM

# How fast is 10 tokens per second really?

Source: [https://simonwillison.net/2026/May/20/tokens-per-second/](https://simonwillison.net/2026/May/20/tokens-per-second/)

This is a **link post** by Simon Willison, posted on [20th May 2026](https://simonwillison.net/2026/May/20/).

Tags: [ai](https://simonwillison.net/tags/ai/), [generative-ai](https://simonwillison.net/tags/generative-ai/), [llms](https://simonwillison.net/tags/llms/)

Similar Articles

How fast is N tokens per second really?

Hacker News Top

A web tool that lets users visually experience different LLM token generation rates (e.g., 5–800 tok/s) across code, text, reasoning, and agent modes, helping internalize performance numbers from benchmarks.

Getting a feel for how fast X tokens/second really is.

Reddit r/LocalLLaMA

The author introduces a web-based script designed to help users intuitively understand token-per-second speeds in local LLM setups by simulating text, code, and reasoning generation rates.

Token maxxing

Reddit r/singularity

Discusses strategies and techniques for maximizing token usage in large language models to improve efficiency and output quality.

Compute Optimal Tokenization (2 minute read)

TLDR AI

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

@RedHat_AI: 145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output qu…

X AI KOLs Timeline

Red Hat demonstrates that using speculative decoding can boost LLM inference speed from 145 to 424 tokens per second on the same H100 hardware with no quality loss, highlighting a significant optimization for production serving.
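
To put the rates quoted in these items side by side, here is a minimal arithmetic sketch. It assumes a constant generation rate and a hypothetical 500-token response; neither assumption comes from the articles, and `seconds_to_generate` is an illustrative helper rather than part of any of the tools mentioned. Only the 10, 145, and 424 tokens-per-second figures are taken from the text above.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Return wall-clock seconds to stream num_tokens at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Assumed response length for illustration only (not from the source).
RESPONSE_TOKENS = 500

# Rates quoted above: the 10 tok/s from the headline, and the two
# H100 figures (without and with speculative decoding).
for rate in (10, 145, 424):
    elapsed = seconds_to_generate(RESPONSE_TOKENS, rate)
    print(f"{rate:>4} tok/s -> {elapsed:.1f} s for {RESPONSE_TOKENS} tokens")

# Ratio of the two quoted H100 throughput figures.
speedup = 424 / 145
print(f"speculative decoding speedup: {speedup:.2f}x")
```

Real generation is rarely perfectly uniform, so these figures are best read as rough orders of magnitude rather than measurements.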