@steeve: aaaaaand we're faster (i know i know)
Summary
Steeve Morin reports that after 5 days of work, his implementation runs within 10% of llama.cpp's speed (64 tok/s vs. 70 tok/s), with more work still to do.
Cached at: 06/08/26, 05:24 PM
aaaaaand we’re faster (i know i know) https://t.co/Yt4QUg6esp
Steeve Morin (@steeve): After 5 days of work, we are now within 10% of llama.cpp (64 tok/s vs 70 tok/s). More work to do, but momentum is great.
Similar Articles
@steeve: Progress: 26 tok/s (llama 3.1 3b) .@tenstorrent claims 33 tok/s so we’re not far off
Steeve Morin reports running Llama 3.1 3B on Tenstorrent hardware via ZML, achieving 26 tok/s, close to Tenstorrent's claimed 33 tok/s.
@leopardracer: THIS AMERICAN DEVELOPER SPENT WEEKS DEBUGGING TIMEOUT ERRORS IN OLLAMA. THEN HE LOOKED UNDER THE HOOD LM Studio is just…
A developer resolved persistent Ollama timeout errors by running llama.cpp directly instead of through wrappers such as LM Studio and Ollama, reaching 53 tok/s on an M1 Max with a 262K context.
@binsquares: omg, GPU acceleration on smolvm works way better than I thought. can run llama.cpp inside the smol machine with close t…
User @binsquares reports that GPU acceleration on smolvm achieves nearly 90% of host performance when running llama.cpp via the Vulkan backend.
Dual GPU llama.cpp speedup
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.
@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …
A new toolset (DFlash + PFlash) runs inference 2.5x faster than llama.cpp on the AMD Ryzen AI MAX+ 395 iGPU, with speedups reported for Qwen3.6-27B on a system with 128 GiB of unified memory.