This article analyzes Jan Chorowski's BDH architecture proposal, which explores embedding LLM memory directly into network weights using sparse high-dimensional key-query spaces as an alternative to traditional KV caches.
I've seen BDH come up in a few discussion threads, but I couldn't find a compact explanation of what the architecture is actually claiming. I found Jan Chorowski's seminar and took notes, so I'm posting the short version here in case it saves others the full watch. I'm exploring post-transformer architectures, so treat this as my understanding of one architecture, not a definitive take, and please correct me where I'm wrong.

I increasingly see "anterograde amnesia" used to characterize transformers' memory: they are unable to form new long-term memories, so they compensate with markdown notes. Transformer memory is thus a combination of static pre-training context compressed into the weights and very short-term context (the current user session) encoded in the KV cache.

The attention part was the most interesting to me. Standard attention retrieves values by comparing a query to past keys. Jan's idea is to stop treating keys and queries as small abstract vectors. In the (attached) photo of the slide, he sets keys and queries equal to neuron activations in a high-dimensional space, so sigma is the accumulated connectivity matrix and reading memory becomes graph propagation. So it's not just linearizing attention as in a vanilla SSM, which trades off performance for efficiency. His line was: "You cannot swap basically a non-linear attention layer for a linear attention layer and change nothing else in the model."

In other words: if you linearize attention, Jan's claim is that you also need to change the memory space. The key/query space becomes very large, sparse, and positive (neuron-like), because the model is working with non-negative activations. Another slide claims `>10^7` key-query dimensions for BDH versus `~10^3` for Transformers; the short-term memory states are thus projected into fixed, positive, and very high-dimensional spaces, becoming much more expressive and manipulable than a KV cache.

The practical issue is obvious: a full `Neurons x Neurons` connectivity matrix is too large.
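
To make the "sigma as accumulated connectivity, reading as graph propagation" picture concrete, here is a toy sketch of my own, not code from the talk. It assumes binary sparse codes, a plain outer-product (Hebbian-style) write, and a top-k readout; `N`, `K_ACTIVE`, and all function names are made up for illustration. It materializes the full matrix only because `N` is tiny.

```python
import random

N = 512        # toy neuron count; the post cites >10^7 key-query dimensions for BDH
K_ACTIVE = 8   # active neurons per sparse code (assumed for illustration)


def sparse_code(rng):
    """Return the indices of the active neurons; each has activation 1.0."""
    return rng.sample(range(N), K_ACTIVE)


def write(sigma, key, value):
    """Hebbian-style write: strengthen every synapse key neuron -> value neuron."""
    for k in key:
        for v in value:
            sigma[k][v] += 1.0


def read(sigma, query):
    """One hop of graph propagation: sum the outgoing rows of the active query neurons."""
    out = [0.0] * N
    for q in query:
        for v, weight in enumerate(sigma[q]):
            out[v] += weight
    return out


def top_k(scores, k):
    """Indices of the k most strongly activated neurons, sorted ascending."""
    return sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])


def demo(num_tokens=30, seed=0):
    rng = random.Random(seed)
    sigma = [[0.0] * N for _ in range(N)]  # fixed N x N state, allocated once
    kv_cache = []                          # for contrast: grows by one entry per token
    for _ in range(num_tokens):
        key, value = sparse_code(rng), sparse_code(rng)
        write(sigma, key, value)
        kv_cache.append((key, value))
    # Query with each stored key and check that the stored value comes back.
    recovered = sum(
        top_k(read(sigma, key), K_ACTIVE) == sorted(value) for key, value in kv_cache
    )
    return recovered, len(kv_cache), (len(sigma), len(sigma[0]))


if __name__ == "__main__":
    recovered, cache_len, sigma_shape = demo()
    print(f"recovered {recovered}/{cache_len} values; sigma stays {sigma_shape}")
```

With sizes like these, key overlaps are rare, so nearly every stored value should come back cleanly. Shrinking `N` or raising `num_tokens` should make the interference between overlapping keys visible, while the shape of `sigma` never changes.
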
The BDH implementation uses low-rank factorization plus ReLU thresholding, keeping the graph compressed and sparse instead of materializing `N x N`.

Other claims that seem important to put here but need follow-up:

* RNNs maybe had the wrong memory/compute ratio: O(N^2) transition parameters but only O(N) state
* BDH memory is more like a noisy fixed-size hash table: sparse keys write to a few buckets, collisions add noise, but memory does not grow one token at a time
* Recovered graphs show modular/heavy-tailed-looking structure
* A Europarl example shows a synapse activating after "US dollar" but not after "US"
* Repeated facts cause fewer active neurons and fewer writes over time, with roughly 6% active neurons dropping to about 2%

I would treat the results as interesting claims to inspect, not proof. The caveats matter:

* This is not a conversion of existing Transformer weights; Jan says BDH models train from scratch or at best distill
* Long-term weights still use backprop; the Hebbian-style part is the short-term synaptic memory
* Sparse hardware is still a limitation: current GPUs still do lots of work over zeros

I still have some questions:

* Is the recovered connectivity graph a real interpretability handle or a basis-dependent story?
* Does fixed-size noisy memory beat KV cache growth in practice?
* What benchmarks would convince people this is more than an elegant framing?

Curious what people here think, especially anyone following post-transformer architectures, SSMs, linear attention, or continual learning.
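
Addendum on the low-rank point: a toy sketch of how a factorized graph can be propagated through without ever forming `N x N`. This is my own illustration under stated assumptions, not the BDH code: I assume the connectivity is approximated as `decoder @ encoder` with rank `d`, and that "ReLU thresholding" means a shifted ReLU applied after the hop. The names `encoder`, `decoder`, `threshold`, and the sizes are hypothetical.

```python
import random


def relu(vec, threshold=0.0):
    """Shifted ReLU: entries at or below the threshold become exactly 0."""
    return [max(0.0, v - threshold) for v in vec]


def matvec(mat, vec):
    """Plain matrix-vector product on lists of lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def propagate_factored(decoder, encoder, x, threshold):
    """One hop through G ~= decoder @ encoder, costing O(N * d) instead of O(N^2)."""
    latent = matvec(encoder, x)                      # N -> d
    return relu(matvec(decoder, latent), threshold)  # d -> N, then sparsify


def propagate_dense(decoder, encoder, x, threshold):
    """Reference path: materialize the N x N graph first (only viable for toy N)."""
    n, d = len(decoder), len(encoder)
    graph = [
        [sum(decoder[i][r] * encoder[r][j] for r in range(d)) for j in range(n)]
        for i in range(n)
    ]
    return relu(matvec(graph, x), threshold)


def param_counts(n, d):
    """(dense, factored) parameter counts for an N x N graph of rank d."""
    return n * n, 2 * n * d


def demo(n=128, d=8, n_active=6, threshold=10.0, seed=0):
    rng = random.Random(seed)
    encoder = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]  # d x n
    decoder = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]  # n x d
    x = [0.0] * n
    for i in rng.sample(range(n), n_active):  # sparse, non-negative input
        x[i] = 1.0
    fast = propagate_factored(decoder, encoder, x, threshold)
    slow = propagate_dense(decoder, encoder, x, threshold)
    return fast, slow


if __name__ == "__main__":
    fast, slow = demo()
    print("max difference:", max(abs(a - b) for a, b in zip(fast, slow)))
    print("active outputs:", sum(v > 0 for v in fast), "of", len(fast))
    print("dense vs factored params:", param_counts(32768, 256))
```

Both paths compute the same thing up to floating-point rounding; the factored one just never allocates the graph. For the hypothetical sizes in `param_counts(32768, 256)`, that is 1,073,741,824 dense entries versus 16,777,216 factored ones.
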
Explains why LLM inference is increasingly memory-bandwidth bound as the KV cache scales with context length and concurrent users, and how systems like vLLM, through PagedAttention, improve memory utilization.
A new post highlights a drawback of diffusion LLMs: bidirectional attention causes keys and values to drift across steps, breaking KV caching. However, generation quality is robust to slight KV drift, and research has focused on maximizing stale KV reuse without quality degradation.
Sebastian Raschka reviews recent innovations in LLM architectures focused on long-context efficiency, including KV sharing, compressed convolutional attention, and layer-wise attention budgeting from models like Gemma 4, ZAYA1, Laguna XS.2, and DeepSeek V4.
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.