The NPU on AMD Strix Halo devices is now usable for AI inference, enabling hybrid mode that combines NPU and iGPU for faster prompt processing. Tools like Lemonade and AMD's ROCm software make this possible.
Admittedly this is news for me, but I'm hoping it could be of some use to others here as well! So, THE NPU IS USABLE!! I've owned an AMD Ryzen 395 Max AI+ (or whatever the naming is lol) for about a year now and have relied solely on GGUFs and Vulkan. I acknowledge that the AMD Ryzen AI team has been working hard to get their ROCm software up to speed w/ their hardware. https://kyuz0.github.io/amd-strix-halo-toolboxes/ This database did NOT look so ROCm friendly 6 months ago. Why should I care? If you own a device w/ both an NPU and a iGPU (like the strix halo series) then you WANT hybrid models. The NPU is CRAZY FAST at PromptProcessing, and can run parallel to gpu firing. Okay, What is Hybrid Mode? So, LLMs can run through the NPU only. If they're built for it. Check out "FastFlowLM NPU" models for examples that do that. BUT HYBRID mode combines the best of both, and FINALLY utilizes the hardware purchased nearly a year go (for some, more than that). What can i do to test this? Download Lemonade! Thanks to their efforts that focus primarily on Ryzen AI and working directly w AMD, I've FINALLY got my machine working in ways it couldn't a year ago and Lemonade made it happen. It's GUI is ultra bare-bones and I wouldn't recommend it for any actual agentic/chat/harness usage BUT being able to sanity-test software without investing days or weeks into it? 10/10 Here's the link: lemonade-server.ai Speaking of links, read more about Hybrid Mode and making your own Hybrid Models here: https://ryzenai.docs.amd.com/en/latest/llm/overview.htmlhttps://ryzenai.docs.amd.com/en/latest/llm/overview.html --- So, that's it. Just wanted to share. REALLY EXCITED that my year old computer is still advancing in the software science of it all. I have a single wishlist/request now: MTP-supported Hybrid Models. Qwen 3.6 has that speedup tech introduced by Unsloth, and AMD has a guide for "new processor shapes" since 3.6 GGUF can't simply be "converted to ONNX". Here's that guide: https://ryzenai.docs.amd.com/en/latest/oga_op_prepare.html If anyone attempts it, please share on huggingface! This was all written by hand btw, no llm assistance, just passionate dev obsessed w "new shiny".
xdna-top is a terminal monitor that shows both NPU and iGPU activity on Ryzen AI Max/Strix Halo systems, providing an honest view of NPU counter deltas instead of fake utilization percentages.
The article announces support for DFlash and PFlash speculative decoding in llama.cpp for AMD Strix Halo iGPUs, demonstrating significant speedups in inference performance using ROCm.
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.
A user details their modding and benchmarking of an AMD Strix Halo system with dual RTX 3090 eGPUs and NVLink, finding improvements in LLM inference speed for dense models, especially with vLLM, and discusses power efficiency trade-offs.