I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

Reddit r/MachineLearning 05/23/26, 05:30 AM Models

mamba state-space-models pytorch blackwell efficient-inference sequence-modeling

Summary

The author presents SM1, a variant of Mamba1 with d_state=1, using two native PyTorch ops to replace the selective scan, reducing memory by 16x compared to d_state=16. The closed-form solution eliminates the state dimension, enabling efficient inference with constant memory per token.

On Windows, mamba-ssm is not easily available, and it does not compile for sm_120. SM1 (Scalar Mamba1) replaces the entire selective scan with two native PyTorch ops:

`L = torch.cumprod(dA, dim=1)`

`h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=1e-6), dim=1))`

`y = h * C`

This is the exact closed-form solution to the d_state=1 recurrence via variation of parameters. It is not an approximation: it matches the sequential computation up to floating-point precision. d_state=2 breaks it; d_state=1 is the boundary where the closed form exists.

The Mamba1 scan intermediates have shape (B, T, F, S). SM1 eliminates S entirely, so it uses 16x less scan memory than a Mamba1 with d_state=16. The inference state for a 130M-parameter model is about 14,080 floats (56 KB), with no KV cache and O(1) per token forever.

I am currently training it on 163K MIDI files, which is roughly 2.5B tokens in my custom format. The 130M-parameter model fits in under half of the 16 GB on my card, an RTX 5060 Ti.

d_state scales expressivity only when the representation does not already encode structure. So if you encode structure in the tokens, d_state does not need to be more than a scalar.
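To make the equivalence claim concrete, here is a minimal sketch that checks the cumprod/cumsum form against a step-by-step loop. It assumes the scalar recurrence is `h_t = dA_t * h_{t-1} + dBx_t` with `y_t = C_t * h_t`, and that `dA`, `dBx`, and `C` have shape (B, T, F) with `h0` of shape (B, F). The function names, shapes, and test values are illustrative assumptions, not taken from the SM1 code.

```python
import torch

def sm1_scan_closed_form(dA, dBx, C, h0, eps=1e-6):
    # Running product of decay factors along the time axis (dim=1).
    L = torch.cumprod(dA, dim=1)
    # Variation of parameters: h_t = L_t * (h0 + sum_{s<=t} dBx_s / L_s).
    h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=eps), dim=1))
    return h * C

def sm1_scan_sequential(dA, dBx, C, h0):
    # Reference loop: assumed recurrence h_t = dA_t * h_{t-1} + dBx_t.
    h = h0
    ys = []
    for t in range(dA.shape[1]):
        h = dA[:, t] * h + dBx[:, t]
        ys.append(h * C[:, t])
    return torch.stack(ys, dim=1)

torch.manual_seed(0)
B, T, F = 2, 64, 8  # hypothetical sizes, kept small on purpose
# Mild decay so that L stays well above the clamp floor for this T.
dA = torch.empty(B, T, F, dtype=torch.float64).uniform_(0.9, 1.0)
dBx = torch.randn(B, T, F, dtype=torch.float64)
C = torch.randn(B, T, F, dtype=torch.float64)
h0 = torch.randn(B, F, dtype=torch.float64)

y_closed = sm1_scan_closed_form(dA, dBx, C, h0)
y_seq = sm1_scan_sequential(dA, dBx, C, h0)
max_abs_diff = (y_closed - y_seq).abs().max().item()
print(f"max abs difference: {max_abs_diff:.3e}")
```

The two paths agree only while `L` stays above the `1e-6` clamp floor. With decay factors below 1, `L` shrinks geometrically with sequence length, so long sequences would presumably need chunking or a log-space variant; the post does not say how SM1 handles this.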


Similar Articles

@rshia_afz: 1/ SSMs struggle on recall benchmarks due to their fixed-size state. But are current models actually storing context “w…

X AI KOLs Timeline

The article introduces Raven, a new State Space Model (SSM) with selective memory allocation that achieves state-of-the-art performance on recall tasks and demonstrates superior length generalization compared to existing models like SWA.

Looped State-Space Language Models with Adaptive Exit-State Selection

arXiv cs.AI

This paper explores looped (recurrent) state-space language models using Mamba and hybrid Mamba-Transformer backbones, showing they outperform non-looped baselines on reasoning tasks and remain competitive under iso-parameter and iso-FLOPs pretraining, with adaptive exit-state selection improving intermediate-depth performance.

Training Hybrid Block Diffusion Language Models with Partial Bidirectionality

arXiv cs.LG

This paper proposes a hybrid Mamba-attention architecture for block diffusion language models that restricts reverse Mamba scans to the active denoising block, enabling exact caching across blocks and achieving high throughput for long-context generation.

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

arXiv cs.LG

This paper proposes Physics-Informed Multi-Scale Mamba (PIMSM), a state-space architecture that aligns model memory with physical timescales to improve robustness under distribution shift in scientific time series, demonstrating improvements on fMRI and weather forecasting tasks.

A Hybrid Mamba for Audio-Visual Navigation

arXiv cs.LG

This paper proposes Samba, a hybrid state-space architecture for audio-visual navigation that uses a Mamba State Encoder to replace GRUs and an Audio Mamba Encoder to better capture global time-frequency dependencies, achieving an 11.3% improvement in navigation success rate on the Matterport3D dataset.

