A user details their modding and benchmarking of an AMD Strix Halo system with dual RTX 3090 eGPUs and NVLink, finding improvements in LLM inference speed for dense models, especially with vLLM, and discusses power efficiency trade-offs.
https://preview.redd.it/kz66mxzseq2h1.jpg?width=4096&format=pjpg&auto=webp&s=da98623808c4bde0dc79b239c8cf8930c5572769 https://preview.redd.it/ocsigi0veq2h1.jpg?width=4096&format=pjpg&auto=webp&s=eb4b053e46e434b2c54de7fff6c584e01c80ea5e [This pic is not representing bench setup, just happily captured while I figured out running same model over 3 GPUs. Halo is always busy, 3090s are waiting Halo does his job.](https://preview.redd.it/rbedmn78pq2h1.png?width=1202&format=png&auto=webp&s=248d88c5f54c8e0b9c9ae2d4ae1caf04e6e5754b) **In short.** **1. Strix halo alone (124GB UMA VRAM) is already nice but adding 1 or 2 eGPUs is pretty good for running the recently popular 27B or 31B dense models.** **2. The native bandwidth limit of eGPUs can be mitigated. I tried scrambling a 2slot NVLink (cheaper than 3 slots) setup with a simple cooling mod on 3090s. You** ***might*** **experience up to several times better PP/s and TG/s on small densed models, depending on the situation, and it can be useful in multi coding agents scenarios.** **3. Basically using riser cable can achieve eGPU's slot flexibility to fit 2slot NVLink with small mod on typical motherboard pcie 3090 cards.** **4. Depending on KVcache types in vLLM, not only max context length and concurrent requests change but speed differs a lot in longer context. It might look good at beginning but not promising longer run.** **5. For power efficiency, 27B dense models get better PP/s and TG/s per watt on eGPU. But for 122B, running on Strix halo alone via llama cpp showed better power efficiency than combined 3 GPUs.** **6. NVLink does not do anything on llama.cpp's layer split, I have tried recent -sm tensor, gaining Tg/s was 30%ish but pp/s down performance was too big, so I stopped, and continue to vLLM on dual 3090.** I was getting a bit frustrated by the relatively slow PP/s on 27B, 31B densed models of my Bosgame M5 Strix Halo, So I decided to do some scrambling to overcome it. Recently, these dense models are getting much more attention than 70B+ MoE models. To run them better I bought single 3090 via local second hand market, after I saw improvement, then quickly moved to dual egpu setup via both nvme pcie 4x4. I was hesitated to try NVLink since no gurantee on my eGPU case, and 3 slot NVLink was too expensive(600USD+). Still I wanted to see if I could improve the eGPU's PHB speed which has to go through CPU. But most 3090 cards including mine are 3 slot thick, so I end up buying a 2slot bridge for around $250 including custom fees. For this, I removed the 3 fan shroud on the top 3090 and roughly attached 120mm fans with a 3D printed side blow duct to make it fit. Surprisingly, the temperature of this modded 3090 actually stays lower than the unmodded one on bottom. **Test Environment:** * Fedora 43 * llama cpp: Strix halo performance power mode, build 9221. * 122B test was split by `-sm layer` using rocm7.2.3 and cuda. * 27B test used rocm 7.2.3 as baseline. (Comparing rocm 7.2.3 and vulkan radv, rocm has better pp/s and vulkan has better tg/s). Benchmarks were repeated only 2 times. * *Note:* Since MTP is not fully implemented in llama cpp benchmarks yet, I borrowed the code\_python MTP metrics (-pp/s% and +tg/s%) from kyuz0's strix halo toolbox for the 27B and 122B (using 35B A3B Moe stats) to plot simulated MTP lines. *(*[*https://kyuz0.github.io/amd-strix-halo-toolboxes/mtp.html*](https://kyuz0.github.io/amd-strix-halo-toolboxes/mtp.html)*)* * vLLM: Nightly build. 3090s are power limited to 230W each. * vLLM benchmarks followed the Club 3090 direction: * Narrative: "Write a detailed 800-word essay explaining transformer attention." (max\_tokens=1000) * Code: "Write a Python implementation of quicksort with comments explaining each step." (max\_tokens=800) * Sampling: temp=0.6, top\_p=0.95, top\_k=20, presence\_penalty=0.0, enable\_thinking=false. Three warmups and five measured runs. * Since Club 3090 doesn't have benchmarks based on context depth, I added those tests. **Benched vLLM models - Qwen 3.6 27B** |Recipe|**Quantization**|**KV cache**|**Context**|**Concurrency**|**Drafter**| |:-|:-|:-|:-|:-|:-| |**docker-compose**\-dual *(small, INT4 Standard)*|AutoRound **INT4**|fp8\_e5m2|**131K**|**4** *(total \~524K)*|MTP=3| |**turbo** *(High-Concurrency)*|AutoRound **INT4**|TQ3 (3-bit)|**262K**|**4** *(total \~1048K)*|MTP=3| |**mixed-bf16** *(Precision,kinda Q6 feeling)*|Mixed **(INT4+8)**|bfloat16|**110K**|**2** *(total \~220K)*|MTP=3| |**mixed-fp8** *(Sweet Spot)*|Mixed **(INT4+8)**|fp8\_e5m2|**131K**|**2** *(total \~262K)*|MTP=2| |**autoround INT8** *(Largest)*|AutoRound **INT8**|fp8\_e5m2|**115K**|**1** *(total \~115K)*|MTP=3| Mixed bf16, Mixed fp8, Autoround INT8 recipes are small edited from Club 3090's recipe to look for better than Q4 level of quantization. (*I noticed MTP 2 on mixed-fp8 recipe while I am writing, too much work again to fix, so, keep it mind some different condition)* **Benched vLLM models - Qwen 3.6 27B** |Recipe|**KV cache**|**Context**|**Concurrency**|**Drafter**| |:-|:-|:-|:-|:-| |**awq-bf16** **(pure AWQ)**|bf16|**262K**|**262K × 1,** **131K × 2,** **65K × 4**|MTP=4| |**awq\_autoround** **(hybrid awq)**|bf16|**262K**|**262K × 1,** **131K × 2**, **65K × 4**|MTP=4| |**int8** **(larger context)**|INT8|**340K \~ 392K**|**262K × 1**, **170K × 2,** **98K × 4**|MTP=4| |**docker-compose-bf16** *(default)*|bf16|**60K**|**60K × 1**|MTP=4| Awq\_autoround recipe is also small edited from original. **Results:** Triple : dual 3090 + Strix halo 122B Q4 K XL unsloth, q8\_0, Strix Halo vs Triple https://preview.redd.it/k3owfjdupq2h1.png?width=1600&format=png&auto=webp&s=0ac542116870087ebdbeeb959ab7bb6e398b802b https://preview.redd.it/avlcn0hpoq2h1.png?width=1600&format=png&auto=webp&s=a824f6b42c48e2b4e3ae7690a36b473ca8d8c81c Strix halo (llama cpp 27B MTP Q6 K XL unsloth, 25GB including mmproj) vs Dual 3090, Qwen3.6-27B-Mixed-AutoRound Minachist 28.9GB) I chose these quants since considerably good enough quality and size wise close https://preview.redd.it/gl5xz5ufqq2h1.png?width=1600&format=png&auto=webp&s=4f14f93ffacd94fbb68c6bb52f462012fad0882f https://preview.redd.it/n93cgeshqq2h1.png?width=1600&format=png&auto=webp&s=98d219e97e13137db627d66d84124aae84275a74 **Power efficiency** Rough calculation, but for 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix halo alone running on llama cpp was actually more power efficient. https://preview.redd.it/s2ryohacsq2h1.png?width=1600&format=png&auto=webp&s=e0764be736283bb211e52ed67110b0b9e28fc8ad https://preview.redd.it/8xdltx0esq2h1.png?width=1600&format=png&auto=webp&s=2d0d2a8b637aae66c5c2511c95e2b1c6baae8ae5 **NVLink on / off** Tested NVLink on vs off. As concurrency and context go up, NVLink defends the bandwidth bottleneck pretty well. BF16 cache senario https://preview.redd.it/92qm9owysq2h1.png?width=1600&format=png&auto=webp&s=af40d019a444877c1d7128b30dbc5b0d80837c66 https://preview.redd.it/6zqs4g80tq2h1.png?width=1600&format=png&auto=webp&s=4951dc402159bd64d8959ebdf5fe1f42c8b5d9e2 fp8 cache case. https://preview.redd.it/yzcgl1wjtq2h1.png?width=1600&format=png&auto=webp&s=6b6e547721a6daeb480423b5928c5a30cdf98e51 https://preview.redd.it/zopa2nlktq2h1.png?width=1600&format=png&auto=webp&s=25f05e0a183ae75627f2ae1071ea9318f91dfe0a INT4 quant's fp8 senario https://preview.redd.it/6um96q5qtq2h1.png?width=1600&format=png&auto=webp&s=463dfd330cd6f783ab9d6e446f58dc15be568326 https://preview.redd.it/e4j0sj3stq2h1.png?width=1600&format=png&auto=webp&s=4655627f234372ea7d4c847aaaca9faeb2080f7b Gemma4 31B's case Gemma-4-31B-it-AutoRound-AWQ, mattbucci, BF16 cache https://preview.redd.it/rey8p3zytq2h1.png?width=1600&format=png&auto=webp&s=aa573c264af1e3fed6a87ec0837bca32066116b3 https://preview.redd.it/wera6hiztq2h1.png?width=1600&format=png&auto=webp&s=d8c92a6abffcbd0d866c17a7d3ecf2a19764a47c This shows differences based on quantization and KV cache types. You can see how much max context length and speed fluctuate just by changing the cache type. on Amphere card, TQ3 was pretty bad to keep Tg/s despite it can give more context amount.. https://preview.redd.it/j6y2cg6nvq2h1.png?width=1164&format=png&auto=webp&s=52eef18357c23d2341444e3e7e873902837fd87d https://preview.redd.it/jb917qmovq2h1.png?width=1164&format=png&auto=webp&s=e94a60d752d0ad6bf28c070015a15c1cb37a0759 Code vs Narrative MTP When concurrency is 1, code generation is always faster than narrative. But as you can see, when concurrency is 2 and it goes into deeper context, code speed drops and gets reversed by narrative. Seems like a weird load happens when concurrent requests and long context combine. https://preview.redd.it/pcw1duwdwq2h1.png?width=1600&format=png&auto=webp&s=f6366e31b70af3d3d3361288320b9ebba4cda5c8 Huge thanks to Club 3090 ([https://github.com/noonghunna/club-3090/tree/master](https://github.com/noonghunna/club-3090/tree/master)), kyuz0's toolbox ([https://github.com/kyuz0/amd-strix-halo-toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes)), and DasDigitaleMomentum's distrobox ([https://github.com/DasDigitaleMomentum/strix-halo-cuda-combined-toolbox](https://github.com/DasDigitaleMomentum/strix-halo-cuda-combined-toolbox))
The author ran 55 inference benchmark runs across Strix Halo, RTX 3090, and RTX 5070 with multiple backends, revealing that memory bandwidth dominates decode speed, the RTX 5070 beats the 3090 on small models, and reasoning models appear ~5x slower due to hidden reasoning content.
The article presents benchmark results for 8 local LLMs on an RTX 3090, showing that power efficiency peaks around 225W, with diminishing returns at maximum power.
A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode doubled Mistral 128B throughput from ~11 to ~24.7 tok/s.
A user shares power limit testing on a 4x RTX 3090 setup running Qwen3.6-27B with vLLM, finding 220W as the sweet spot for peak efficiency with minimal throughput loss.
This article provides a tutorial on fine-tuning Large Language Models (LLMs) using AMD Strix Halo hardware, covering both Linux and native Windows environments with SFT and LoRA methods.