quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]
Summary
quicktok is a fast, exact BPE tokenizer written in C++ whose output is byte-identical to tiktoken's, with a 2–11x speedup over existing alternatives. It supports the cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 encoders.
Similar Articles
ztok — a fast multithreaded tokenizer in Zig that loads tiktoken / HF / SentencePiece and is 2–5× faster
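ztok is a high-performance multithreaded tokenizer library written in Zig. It supports multiple formats (tiktoken, HF, SentencePiece, and others), runs 2–5× faster than existing solutions, and is suited to RAG chunking and dataset tokenization.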
Incremental BPE Tokenization
This paper introduces an incremental algorithm for Byte Pair Encoding (BPE) tokenization that processes each byte in O(log^2 t) time, enabling efficient partial tokenization in streaming settings and achieving speedups over existing implementations.
@no_stp_on_snek: @antirez Turbo3 BEATS fp8 by +5% decode tok/s at 32K context still tinkering but i've been cooking TQ+ in your kitchen
Turbo3 reportedly achieves about 5% higher decode throughput (tokens per second) than fp8 at 32K context. The author notes that the work is still being tuned.
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM
A quantized version of Qwen3.6 27B using a pure Q4_K_M method fits entirely in 16 GB of VRAM, generates up to 40 tok/s with MTP, and is significantly smaller than other GGUF variants.