I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

Reddit r/MachineLearning Models

Summary

The author presents SM1, a variant of Mamba1 with d_state=1, using two native PyTorch ops to replace the selective scan, reducing memory by 16x compared to d_state=16. The closed-form solution eliminates the state dimension, enabling efficient inference with constant memory per token.

On windows mamba-ssm is not easily available and doesn't compile on sm\_120. SM1 (Scalar Mamba1) replaces the entire selective scan with two native PyTorch ops: `L = torch.cumprod(dA, dim=1)` `h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=1e-6), dim=1))` `y = h * C` This is the exact closed-form solution to the d\_state=1 recurrence via variation of parameters. Not an approximation, it is identical to sequential computation of floating point precision. d\_state=2 breaks it. d\_state=1 is the boundary where the closed form exists. The Mamba1 scan intermediates are (B, T, F, S). SM1 eliminates S entirely, there is 16x less scan memory than a Mamba1 with d\_state=16. The inference state for a 130M param model is about 14,080 floats, 56 KB, no KV cache, O(1) per token forever. I am currently training it on 163K MIDI files, which is 2.5B tokens roughly in my custom format. 130M params fits in under half of my 16 GB card which is an RTX 5060 Ti. d\_state scales expressivity only when the representation does not already encode structure. Thus if you encode structure in tokens, you do not need d\_state to be more than a scalar.
Original Article

Similar Articles

SimpleMem: Efficient Lifelong Memory for LLM Agents

Papers with Code Trending

Introduces SimpleMem, an efficient memory framework for LLM agents that uses semantic lossless compression to improve accuracy and reduce token consumption, achieving 26.4% F1 improvement and up to 30x reduction in inference-time token usage.

δ-mem: Efficient Online Memory for Large Language Models

Hugging Face Daily Papers

The paper introduces δ-mem, a lightweight memory mechanism that enhances large language models by augmenting a frozen attention backbone with a compact associative memory state. It demonstrates improved performance on memory-heavy benchmarks with minimal computational overhead.