If you're using Windows, disable memory compression to stop bottlenecks!
Summary
A user shares a fix for performance bottlenecks when running AI models on AMD GPUs under Windows 11: disabling memory compression via the PowerShell cmdlet `Disable-MMAgent -mc`.
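The full cmdlet (the post abbreviates it; `-mc` is PowerShell shorthand for the `-MemoryCompression` switch) must be run from an elevated PowerShell prompt, and a reboot is typically needed for the change to take effect:

```
# Check whether memory compression is currently enabled
Get-MMAgent

# Disable it (requires Administrator); -mc abbreviates -MemoryCompression
Disable-MMAgent -MemoryCompression

# To restore the default behavior later:
# Enable-MMAgent -MemoryCompression
```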
Similar Articles
Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models
The article shares a performance optimization trick for llama.cpp, showing that increasing the micro-batch size (`-ub`) combined with partial CPU offloading (`--n-cpu-moe`) can drastically improve prompt processing speed for large models like gpt-oss-120b on consumer GPUs.
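A sketch of the flag combination described, with an assumed model file, offload count, and batch sizes rather than the article's exact values:

```
# Larger micro-batches (-ub) speed up prompt processing; --n-cpu-moe
# keeps the first N MoE expert layers on the CPU so the remainder
# fits in VRAM. -ub must not exceed the logical batch size -b.
./llama-server -m gpt-oss-120b-Q4_K_M.gguf \
  -ngl 99 --n-cpu-moe 24 \
  -b 4096 -ub 2048 -c 16384
```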
Running local models on an M4 with 24GB memory
A guide on running local AI models like Qwen 3.5-9B on an M4 MacBook with 24GB RAM using tools like LM Studio, Ollama, and pi, including specific configuration tips for optimal performance.
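For reference, a minimal Ollama invocation on such a machine might look like the following; the model tag is an assumed example, not necessarily the one the guide uses:

```
# Pull and run a quantized mid-size model (assumed tag)
ollama run qwen3:8b

# Inside the REPL, keep the context window modest so the KV cache
# fits alongside macOS and other apps:
# /set parameter num_ctx 8192
```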
Why your current hardware will choke on 2026 Multi-Agent workflows (Mac Studio vs. RTX 5090)
Comparison of hardware requirements for running multi-agent AI workflows locally, highlighting VRAM and KV Cache constraints.
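For scale, a back-of-envelope KV-cache estimate (illustrative numbers, not from the article): a 70B-class model with 80 layers, 8 KV heads (GQA), and head dimension 128 needs 2 × 80 × 8 × 128 × 2 bytes ≈ 320 KiB of FP16 KV cache per token, so a 128K-token context alone consumes roughly 40 GiB before any weights are loaded. That is the kind of constraint that separates a 32GB RTX 5090 from a high-memory Mac Studio.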
Stop wasting electricity
The author demonstrates how to reduce RTX 4090 power consumption by up to 40% while running quantized Qwen models via llama.cpp, without sacrificing inference speed. By capping the GPU power limit through nvidia-smi and adjusting llama-server parameters, users can significantly reduce heat and noise and extend hardware lifespan.
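The power-capping step plausibly looks like the following; the 270 W cap is an assumed example, roughly 60% of the 4090's 450 W default:

```
# Persistence mode keeps the setting applied between runs
sudo nvidia-smi -pm 1

# Cap board power; LLM inference is usually memory-bandwidth-bound,
# so tokens/s often drops far less than the power saved
sudo nvidia-smi -pl 270
```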
@leftcurvedev_: Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp…
Explains how the `-ncmoe` flag in llama.cpp improves performance for MoE models like Qwen3 30B A3B on limited VRAM (8-12GB) by offloading some expert layers to the CPU and system RAM, with benchmarks showing up to a 5x speedup on an RTX 3070 Ti.
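A minimal invocation using that flag; the model file and the number of offloaded expert layers are illustrative and should be tuned per GPU:

```
# -ncmoe is shorthand for --n-cpu-moe: run N MoE expert layers on the
# CPU while -ngl 99 keeps attention and shared weights on the GPU.
./llama-cli -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 -ncmoe 24 -p "Hello"
```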