I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]
Summary
The author presents SM1, a variant of Mamba1 with d_state=1, using two native PyTorch ops to replace the selective scan, reducing memory by 16x compared to d_state=16. The closed-form solution eliminates the state dimension, enabling efficient inference with constant memory per token.
Similar Articles
@rshia_afz: 1/ SSMs struggle on recall benchmarks due to their fixed-size state. But are current models actually storing context “w…
The article introduces Raven, a new State Space Model (SSM) with selective memory allocation that achieves state-of-the-art performance on recall tasks and demonstrates superior length generalization compared to existing models like SWA.
PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift
This paper proposes Physics-Informed Multi-Scale Mamba (PIMSM), a state-space architecture that aligns model memory with physical timescales to improve robustness under distribution shift in scientific time series, demonstrating improvements on fMRI and weather forecasting tasks.
SimpleMem: Efficient Lifelong Memory for LLM Agents
Introduces SimpleMem, an efficient memory framework for LLM agents that uses semantic lossless compression to improve accuracy and reduce token consumption, achieving 26.4% F1 improvement and up to 30x reduction in inference-time token usage.
@no_stp_on_snek: first receipts: triattention v3 evicts safely with longctx. ✓HIT every rung 32k → 256k on qwen3.5-2b-4bit (hybrid mamba…
Introduces triattention v3, a new attention mechanism that enables safe eviction without recall loss for long-context inference, demonstrated on a hybrid mamba+attention model up to 256k tokens.
δ-mem: Efficient Online Memory for Large Language Models
The paper introduces δ-mem, a lightweight memory mechanism that enhances large language models by augmenting a frozen attention backbone with a compact associative memory state. It demonstrates improved performance on memory-heavy benchmarks with minimal computational overhead.