@tinygrad: We are on the MLPerf board with AMD MI350X training Llama 8B. This is with our driver, runtime, kernels, and training l…

X AI KOLs Timeline 06/16/26, 04:22 PM News

mlperf benchmark tinygrad amd mi350x llama-8b training

Summary

tinygrad announces that it has placed on the MLPerf board by training Llama 8B on AMD MI350X hardware, using its own driver, runtime, kernels, and training loop. Its current 8B time is 170 minutes; it plans to improve that time and to take on 405B in the next MLPerf round.

We are on the MLPerf board with AMD MI350X training Llama 8B. This is with our driver, runtime, kernels, and training loop. 405B next MLPerf, along with a better time on 8B (tinygrad currently at 170 min). https://t.co/syPwte872y

Cached at: 06/16/26, 09:41 PM
