If you're using Windows, disable memory compression to stop bottlenecks!
Summary
A user shares a fix for performance bottlenecks when running AI models on AMD GPUs in Windows 11: disabling memory compression with the PowerShell command `Disable-MMAgent -mc`.
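As a minimal sketch of how the fix might be applied, the commands below check the current setting, disable memory compression, and show how to revert it. They assume an elevated (Administrator) PowerShell session. The original post gives only the disable command; the check and revert steps are additions here, and whether a restart is needed for the change to fully take effect is an assumption to verify on your own system.

```powershell
# Run from an elevated (Administrator) PowerShell session.

# Show the current memory-management agent settings;
# the MemoryCompression field reports True or False.
Get-MMAgent

# Turn memory compression off.
# "-mc" in the original post is the short alias for -MemoryCompression.
Disable-MMAgent -MemoryCompression

# Not from the original post: re-enable compression if the change does not help.
Enable-MMAgent -MemoryCompression
```

Running `Get-MMAgent` again after the change is a quick way to confirm that `MemoryCompression` now reads `False`.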
Similar Articles
Memory Bandwidth for Local AI Hardware (2026 Edition)
The article breaks down memory bandwidth as the critical metric for local AI hardware performance, comparing current GPUs and unified memory systems from NVIDIA, Apple, AMD, Intel, and others across different performance tiers.
Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models
The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.
AMD's tiny AI PC points to a more local future for model inference
AMD's Ryzen AI Max platform with 128GB unified memory enables local inference of large models up to 200 billion parameters, aiming to shift AI workloads from cloud to compact personal hardware.
Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models
User benchmarks show no significant speed difference between Windows 11 and Linux when running large MoE models with llama.cpp, debunking a common myth. Tests on a multi-GPU setup with models like Qwen 3.5 122B, 397B, and MiniMax 2.7 yield nearly identical prompt processing and token generation speeds.
@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2062553418460479577
An open-source tool called Headroom compresses AI agent context by up to 90% using a reversible Compress-Cache-Retrieve architecture, enabling models to retrieve original details on demand instead of discarding them permanently.