quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

Reddit r/MachineLearning Tools

Summary

quicktok is a fast and exact BPE tokenizer in C++ that is byte-identical with tiktoken, achieving 2–11x speedup over existing alternatives. It supports cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 encoders.

Been working on this a while! Should be useful for anyone trying to speed up their tokenization workflows.

**quicktok** is a fast/exact BPE tokenizer written in C++. Token ids are byte-identical to `tiktoken` and encoding runs **2–3.6×** faster than `bpe-openai` (the fastest alternative I know of) and **4–11×** faster than `tiktoken` itself. It ships cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3.

**Approach.** Same algorithm as `bpe-openai` (exact backtracking BPE) but I apply lots of data structure engineering to cut memory accesses:

* A 2-byte trie is used for the longest-match walk
* Dense exactly-keyed caches are used for merge-validity checks
* A hand-compiled pretokenizer is used instead of a general regex engine

**Benchmarks** (Apple M1, single thread, MB/s, cl100k\_base and every output verified token-for-token before timing):

|encoder|The Pile|Code|Common Crawl|
|:-|:-|:-|:-|
|**quicktok (native)**|**121.7**|**139.2**|**71.3**|
|**quicktok (Python)**|**77.9**|**83.6**|**49.7**|
|bpe-openai|36.6|38.7|28.9|
|rs-bpe|30.9|34.7|23.5|
|tiktoken-rs|15.4|13.8|13.3|
|tiktoken (Python)|13.6|12.8|12.3|
|TokenDagger|11.1|11.9|10.7|

o200k\_base is similar in ratios. Each encoder is called through its own raw API and benchmarks can be reproduced with `make bench-compare` in the repo.

`pip install quicktok-v1`

Repo: [https://github.com/dmatth1/quicktok](https://github.com/dmatth1/quicktok)
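For anyone who wants to sanity-check numbers like the ones in the benchmark table on their own data, here is a minimal sketch of the "verify token-for-token, then time" idea. It is not the repo's `make bench-compare` harness. The helper names `verify_identical` and `throughput_mb_per_s` are made up for illustration, and the encoders are passed in as plain callables, so nothing here assumes a particular quicktok Python API.

```python
import time
from typing import Callable, Iterable, List

# An encoder is any callable mapping text to a list of token ids.
Encoder = Callable[[str], List[int]]


def verify_identical(reference: Encoder, candidate: Encoder,
                     docs: Iterable[str]) -> None:
    """Raise AssertionError at the first document whose token ids differ."""
    for i, doc in enumerate(docs):
        expected = reference(doc)
        actual = candidate(doc)
        assert actual == expected, f"token mismatch in document {i}"


def throughput_mb_per_s(encode: Encoder, docs: List[str],
                        repeats: int = 3) -> float:
    """Best-of-N single-thread throughput, counting 1 MB as 1e6 UTF-8 bytes."""
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for d in docs:
            encode(d)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very small inputs.
    return total_bytes / max(best, 1e-12) / 1e6


if __name__ == "__main__":
    # Toy stand-in encoders (hypothetical): one "token" per UTF-8 byte.
    def reference(text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def candidate(text: str) -> List[int]:
        return [b for b in text.encode("utf-8")]

    corpus = ["hello world", "def f(x):\n    return x + 1\n"] * 1000
    verify_identical(reference, candidate, corpus)  # check before timing
    print(f"{throughput_mb_per_s(candidate, corpus):.1f} MB/s")
```

With real libraries, the reference would be something like `tiktoken.get_encoding("cl100k_base").encode`, and the candidate would be whatever encode function the quicktok bindings expose (the post does not show it, so it is left out here). Best-of-N timing is a choice made for this sketch, not necessarily what the repo does.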