speed-up

#speed-up

quicktok: a faster tokenizer (exact and byte-identical with tiktoken) [P]

Reddit r/MachineLearning · 5d ago

quicktok is a fast, exact BPE tokenizer written in C++ whose output is byte-identical to tiktoken's, with a reported 2–11x speedup over existing alternatives. It supports the cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 encoders.
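
Being "exact" here means reproducing tiktoken's merge rule: starting from raw bytes, repeatedly merge the adjacent pair whose concatenation has the lowest rank (leftmost pair on ties) until no pair is in the vocabulary. The sketch below illustrates only that rule in plain Python. The TOY_RANKS table is a hypothetical stand-in for a real vocabulary such as cl100k, and regex pre-tokenization is omitted. It is not quicktok's code, which is C++ and presumably uses far faster data structures than this quadratic loop.

```python
# Minimal sketch of rank-based byte-level BPE encoding (the merge rule that
# tiktoken-compatible encoders must reproduce). TOY_RANKS is a hypothetical toy
# vocabulary, not a real encoder; regex pre-tokenization is omitted.

TOY_RANKS: dict[bytes, int] = {
    b"a": 0, b"b": 1, b"c": 2,   # every single byte must have a rank
    b"ab": 3, b"bc": 4, b"abc": 5,
}


def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    # Start from individual bytes.
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        # Find the adjacent pair whose concatenation has the lowest rank;
        # strict "<" keeps the leftmost pair when ranks tie.
        best_rank = None
        best_idx = -1
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank = rank
                best_idx = i
        if best_idx < 0:
            break  # no adjacent pair is in the vocabulary
        # Replace the pair with its merged token.
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    # Map each final part to its token id (its rank).
    return [ranks[p] for p in parts]


print(bpe_encode(b"abcab", TOY_RANKS))  # [5, 3]
```

Because the result depends only on the ranks and the tie-breaking order, two implementations that agree on both produce identical token ids, which is what makes a byte-identical comparison against tiktoken meaningful.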

#speed-up

@hank_aibtc: https://x.com/ClementDelangue/status/2058672394865111544/video/1… Local LLM speed ceiling broken again! llama.cpp natively supports MTP (Multi-Token Prediction): - No extra draft model needed…

X AI KOLs Timeline · 2026-05-26

llama.cpp natively supports Multi-Token Prediction (MTP), so no separate draft model is required. Because it uses the model's own built-in prediction head to propose tokens, local models such as Qwen3.6-27B reportedly reach a 1.7x+ speedup, enough to make 27B models run smoothly on consumer GPUs.
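
Using a built-in head instead of a draft model is a form of self-speculative decoding: the MTP head proposes several tokens, the main model checks them all in one forward pass, and the longest agreeing prefix is kept. The toy sketch below shows only the greedy acceptance rule, assuming greedy (argmax) sampling. The token ids are made up, and the function name verify_greedy is hypothetical; llama.cpp's actual implementation is not reproduced here.

```python
# Toy sketch of greedy draft-and-verify, the acceptance rule behind
# self-speculative decoding with MTP heads. Token ids are made-up integers;
# this is not llama.cpp's implementation.

def verify_greedy(draft: list[int], target_argmax: list[int]) -> list[int]:
    """Return the tokens committed in one decoding step.

    draft: k tokens proposed by the (hypothetical) MTP head.
    target_argmax: k + 1 tokens the main model picks greedily at each
    position, computed in a single forward pass over context plus draft.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have exactly len(draft) + 1 entries")
    accepted: list[int] = []
    for proposed, expected in zip(draft, target_argmax):
        if proposed != expected:
            # First mismatch: keep the main model's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # All draft tokens matched, so the extra position yields a bonus token.
    accepted.append(target_argmax[-1])
    return accepted


print(verify_greedy([7, 8, 9], [7, 8, 9, 10]))  # [7, 8, 9, 10]
print(verify_greedy([7, 5, 9], [7, 8, 9, 10]))  # [7, 8]
```

Each main-model pass commits between 1 and k + 1 tokens, and under greedy sampling the output matches plain greedy decoding exactly; the speedup therefore depends on how often the head's guesses agree with the main model.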

#speed-up

@davideciffa: If you have an Nvidia RTX 4090 --ddtree-budget 36 is the best configuration that buys you 2.5x speed up during decoding…

X AI KOLs Timeline · 2026-05-24

A tweet recommending --ddtree-budget 36 as the best setting on an Nvidia RTX 4090, claiming a 2.5x decoding speedup for Qwen3.6_27B.
