The memory wall gets expensive: KV cache is why you should stop worshiping softmax attention

Reddit r/singularity 06/24/26, 05:28 PM News

memory kv-cache attention transformers llm-inference hardware-architecture

Summary

The article discusses how rising DDR5 memory prices signal a broader memory bottleneck in AI, particularly the KV cache in softmax attention for LLMs, and highlights post-transformer architectures like linear attention and state space models that aim to reduce memory usage.

https://preview.redd.it/tbn5b21yl99h1.png?width=1230&format=png&auto=webp&s=6761bd2d18c1a7105c968fc1a594ccbfc3b029e2

The signal that "memory is the bottleneck" is now showing up outside AI too. We have all seen DDR5 prices for a common 2×16GB kit jump hard over the last 18 months (see the attached PCPartPicker tracked listing).

What is important to notice is that this chart is not about AI memory directly. GPUs do not use DDR5 for frontier model training or inference; they care much more about HBM. The broader signal still matters a lot: memory is becoming valuable enough that producers are shifting toward AI/HBM, and that makes memory optimization harder and harder to ignore.

For 2026, I would say the bigger question is: if memory is expensive, where is the bottleneck? In LLM inference, the analogous cost is the KV cache. In softmax attention, longer context means keeping more past keys and values around, so memory use grows with sequence length.

That is why post-transformer architectures are worth watching: linear attention variants, state space models, and hybrids that try to replace the growing KV cache with a fixed-size recurrent state. And if you are already catching up, kudos! We can already see this across the spectrum. Kimi Linear uses a hybrid linear/softmax design, Nemotron-style models mix Mamba-like blocks with attention, and Dragon Hatchling (BDH) takes a more radical route, where working memory lives in a fixed-size synaptic state rather than a KV cache that grows with context.

Ending my 2 cents here: if memory keeps getting more expensive, does architecture change faster than hardware catches up?
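For anyone who wants rough numbers on the KV cache point, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the config (32 layers, 8 KV heads, 32 attention heads, head dimension 128, fp16) is hypothetical and not taken from any model named in this post, and the "fixed state" formula is a simplified linear-attention-style state (one head_dim × head_dim matrix per head per layer). Real SSMs and hybrids use different state shapes, so treat this as an order-of-magnitude comparison only.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Approximate KV cache size for softmax attention.

    The leading 2 accounts for storing both keys and values.
    The size grows linearly with seq_len.
    """
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, head_dim,
                          bytes_per_elem=2, batch=1):
    """Approximate size of a fixed recurrent state (simplified assumption).

    Models one head_dim x head_dim matrix per head per layer, as in a
    basic linear-attention formulation. Independent of sequence length.
    """
    return batch * n_layers * n_heads * head_dim * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical config, chosen only to make the arithmetic easy to follow.
    layers, kv_heads, heads, dim = 32, 8, 32, 128
    state_mib = recurrent_state_bytes(layers, heads, dim) / 2**20
    for tokens in (4_096, 32_768, 131_072):
        kv_gib = kv_cache_bytes(tokens, layers, kv_heads, dim) / 2**30
        print(f"{tokens:>7} tokens: KV cache ~ {kv_gib:.2f} GiB, "
              f"fixed state ~ {state_mib:.0f} MiB")
```

Under these assumed numbers the KV cache costs 128 KiB per token per sequence, so it reaches about 16 GiB at 131,072 tokens, while the fixed state stays at about 32 MiB no matter how long the context gets. The gap is the whole argument; the exact figures will differ for any real architecture.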

Similar Articles

The KV-cache wall: why fixed-size memory sequence models keep coming back

Memory

@TheTuringPost: Why KV cache is one of the main reasons LLMs are fast? KV cache is what connects attention mechanism with generation st…

AI memory is starting to feel more important than model intelligence

@HaochengXiUCB: New blog post: The Forgetting Wall in Video and World Models Long-horizon video generation is not just limited by compu…
