The memory wall gets expensive: KV cache is why you should stop worshiping softmax attention

Reddit r/singularity News

Summary

The article discusses how rising DDR5 memory prices signal a broader memory bottleneck in AI, particularly the KV cache in softmax attention for LLMs, and highlights post-transformer architectures like linear attention and state space models that aim to reduce memory usage.

https://preview.redd.it/tbn5b21yl99h1.png?width=1230&format=png&auto=webp&s=6761bd2d18c1a7105c968fc1a594ccbfc3b029e2

A simple signal that says “memory is the bottleneck” is now showing up outside AI too. We have all seen DDR5 prices for a common 2×16GB kit jump really hard over the last 18 months (see the attached chart of PCPartPicker’s tracked listing).

What is important to notice, however, is that this chart is not about AI memory directly. GPUs do not use DDR5 for frontier model training or inference; they care much more about HBM. The broader signal still matters a lot: computer memory is becoming valuable enough that producers are shifting toward AI/HBM, and that makes memory optimization harder and harder to ignore.

For 2026, I would say the bigger question is: if memory is expensive, where is the bottleneck? In LLM inference, the analogous cost is the KV cache. In softmax attention, a longer context means keeping more past keys and values around, so memory use grows linearly with sequence length.

That is why post-transformer architectures are worth watching: linear attention variants, state space models, and hybrids that try to replace the growing KV cache with a fixed-size recurrent state. And if you are already catching up, kudos! We can see this across the spectrum already: Kimi Linear uses a hybrid linear/softmax design, Nemotron-style models mix Mamba-like blocks with attention, and Dragon Hatchling (BDH) takes a more radical route, where working memory lives in a fixed-size synaptic state rather than a KV cache that grows with context.

Ending my 2 cents here: if memory keeps getting more expensive, does architecture change faster than hardware catches up?
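To put rough numbers on the KV cache point, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) is made up for illustration and is not taken from any of the models named in the post. The function names `kv_cache_bytes` and `recurrent_state_bytes` are hypothetical too, and the fixed state is modeled as one key_dim × value_dim matrix per head, which is only one simple way such a state could be laid out.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for every past token."""
    # Leading 2 = one tensor for keys plus one tensor for values.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, key_dim, value_dim,
                          bytes_per_elem=2, batch=1):
    """Bytes for a fixed (key_dim x value_dim) state per head; no seq_len term."""
    return batch * n_layers * n_heads * key_dim * value_dim * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(seq_len, **cfg)
    print(f"{seq_len:>7} tokens: KV cache ~ {kv / 2**30:.2f} GiB")

state = recurrent_state_bytes(n_layers=32, n_heads=8, key_dim=128, value_dim=128)
print(f"fixed recurrent state ~ {state / 2**20:.2f} MiB, whatever the context length")
```

With these assumed numbers, the cache for a single sequence goes from about 0.5 GiB at 4K tokens to 16 GiB at 128K tokens, while the fixed state stays at 8 MiB. Real models differ in layout and precision, so treat this as an order-of-magnitude picture of why the growing cache is the thing these architectures try to get rid of.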