How fast is 10 tokens per second really?
Summary
Simon Willison explores what a generation speed of 10 tokens per second means in practice for large language models, giving context on how fast that rate feels and what it implies for usability.
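To put the number in context, here is a minimal sketch of the arithmetic behind a tokens-per-second figure: how long a response of a given length takes to finish streaming. The function name and the response lengths are illustrative assumptions, not taken from the article.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Return the wall-clock time needed to stream num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical response lengths, chosen only for illustration.
    for length in (50, 500, 2000):
        wait = seconds_to_generate(length, 10)
        print(f"{length} tokens at 10 tok/s: {wait:.0f} s")
```

Under these assumptions, a 500-token answer takes 50 seconds at 10 tokens per second, which is one way to judge whether a given rate is comfortable for interactive use.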
Cached at: 05/20/26, 06:38 PM
Similar Articles
How fast is N tokens per second really?
A web tool that lets users visually experience different LLM token generation rates (e.g., 5–800 tok/s) across code, text, reasoning, and agent modes, helping internalize performance numbers from benchmarks.
Getting a feel for how fast X tokens/second really is.
The author introduces a web-based script that helps users build an intuitive sense of tokens-per-second speeds in local LLM setups by simulating text, code, and reasoning generation at different rates.
Token maxxing
Discusses strategies and techniques for maximizing token usage in large language models to improve efficiency and output quality.
Compute Optimal Tokenization (2 minute read)
This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
@RedHat_AI: 145 tokens per second. Add speculative decoding. 424 tokens per second. Same model. Same H100. Zero change in output qu…
Red Hat reports that adding speculative decoding raises LLM inference speed from 145 to 424 tokens per second (roughly 2.9x) on the same model and the same H100 hardware with no quality loss, a significant optimization for production serving.