If you're using Windows, disable memory compression to stop bottlenecks!

Reddit r/LocalLLaMA 05/14/26, 11:07 AM Tools

windows performance memory-compression ai-workload amd-gpu llm optimization

Summary

A user shares a fix for performance bottlenecks when running AI models on AMD GPUs in Windows 11: disabling memory compression by running the PowerShell command `Disable-mmagent -mc` from an administrator terminal.

This is a follow-up to this post: [https://www.reddit.com/r/LocalLLaMA/comments/1ta3ben/dont_you_have_issues_in_w11_with_amd_gpu_where/](https://www.reddit.com/r/LocalLLaMA/comments/1ta3ben/dont_you_have_issues_in_w11_with_amd_gpu_where/)

I fixed this never-ending issue by just disabling memory compression from an admin terminal: `Disable-mmagent -mc`

All issues have been resolved. I can open any game and my AI no longer slows down like it did before (even when the games are closed)!
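For anyone who wants to check the setting before and after, here is a minimal sketch using the long-form cmdlet names. It assumes an elevated (Administrator) PowerShell session on Windows 10/11. `-MemoryCompression` is the full name of the `-mc` parameter used in the post. Whether a reboot is needed for the change to fully take effect is an assumption on my part, so restart if the status does not change.

```powershell
# Run in an elevated (Administrator) PowerShell session.

# Show the current memory-management agent settings.
# The MemoryCompression field indicates whether compression is enabled.
Get-MMAgent

# Disable memory compression (long form of `Disable-mmagent -mc`).
Disable-MMAgent -MemoryCompression

# Check only the memory compression flag after the change.
(Get-MMAgent).MemoryCompression

# To undo the change later, re-enable it:
# Enable-MMAgent -MemoryCompression
```

Keeping the `Enable-MMAgent` line handy makes the tweak easy to revert if disabling compression increases memory pressure on systems with less RAM.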

