@__tinygrad__: We are on the MLPerf board with AMD MI350X training Llama 8B. This is with our driver, runtime, kernels, and training l…
Summary
tinygrad announces that it is on the MLPerf board for training Llama 8B on AMD MI350X hardware, using its own driver, runtime, kernels, and training loop. Its current 8B time is 170 minutes; it plans to improve that time and to target 405B in the next MLPerf round.
Cached at: 06/16/26, 09:41 PM
We are on the MLPerf board with AMD MI350X training Llama 8B. This is with our driver, runtime, kernels, and training loop. 405B next MLPerf, along with a better time on 8B (tinygrad currently at 170 min). https://t.co/syPwte872y