Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Source
- Raw Markdown: paper_mamba-2-2024.md
- PDF: paper_mamba-2-2024.pdf
- Preprint: arXiv 2405.21060
- Official code: state-spaces/mamba
Core Claim
Mamba-2 builds a bridge between selective SSMs and attention through structured state space duality: SSM sequence transformations can be viewed as semiseparable matrix mixers, making it possible to borrow attention-style algorithms and systems ideas while retaining recurrent inference.
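In the paper's notation, restated here for convenience, the duality comes from unrolling the selective SSM recurrence into a single lower-triangular matrix mixer:

$$
h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^{\top} h_t
\quad\Longrightarrow\quad
y = M x, \qquad M_{ji} = C_j^{\top} A_j A_{j-1} \cdots A_{i+1} B_i \quad (j \ge i).
$$

Every submatrix taken from the lower-triangular part of $M$ has rank at most the state size $N$ (the $N$-semiseparable property), so the attention-like quadratic form materializes $M$ directly, while the recurrence evaluates $Mx$ in linear time.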
Key Contributions
- Identifies structured SSM transformations with semiseparable matrices and uses that lens to connect the recurrent, parallel-scan, and attention-like quadratic computation modes.
- Introduces the SSD algorithm, a block-decomposed semiseparable matrix multiplication method that is more hardware friendly than Mamba’s selective scan (sketched below, after this list).
- Designs the Mamba-2 block for compatibility with Transformer-style training systems, supporting larger state sizes, tensor parallelism, and sequence parallelism.
- Shows that Mamba-2 Pareto-dominates Mamba and Transformer++ on perplexity versus training wall-clock time in the paper’s scaling setup.
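A minimal NumPy sketch of the block decomposition, assuming a single head and the scalar-identity transition that SSD uses; the function names, chunk size, and toy dimensions are illustrative and not taken from the official state-spaces/mamba code:

```python
import numpy as np

def ssm_reference(a, B, C, x):
    """Sequential recurrence: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = <C_t, h_t>."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_chunked(a, B, C, x, chunk=16):
    """SSD-style block decomposition: quadratic, attention-like work inside each
    chunk plus a cheap recurrence that carries the state across chunk boundaries."""
    T, N = B.shape
    y = np.zeros(T)
    h = np.zeros(N)                                  # state entering the current chunk
    for s in range(0, T, chunk):
        a_c, B_c, C_c, x_c = a[s:s+chunk], B[s:s+chunk], C[s:s+chunk], x[s:s+chunk]
        cum = np.cumprod(a_c)                        # cum[t] = a_s * ... * a_{s+t}
        # Intra-chunk term: lower-triangular "attention" matrix
        # M[j, i] = (C_j . B_i) * a_{i+1} * ... * a_j  for i <= j.
        decay = np.tril(cum[:, None] / cum[None, :])
        M = np.tril(C_c @ B_c.T) * decay
        # Inter-chunk term: the incoming state, decayed into every position.
        y[s:s+chunk] = M @ x_c + (C_c @ h) * cum
        # Carry the state to the end of the chunk in one vectorized step.
        h = cum[-1] * h + ((cum[-1] / cum)[:, None] * B_c * x_c[:, None]).sum(axis=0)
    return y

# Toy check that the two computations agree.
rng = np.random.default_rng(0)
T, N = 64, 8
a = rng.uniform(0.7, 1.0, size=T)                    # scalar-identity transition per step
B, C = rng.standard_normal((T, N)), rng.standard_normal((T, N))
x = rng.standard_normal(T)
assert np.allclose(ssm_reference(a, B, C, x), ssm_chunked(a, B, C, x))
```

Inside each chunk the work is a dense lower-triangular matrix multiply, which maps onto matmul hardware, and only one small state vector crosses chunk boundaries; that is the trade the paper makes relative to Mamba's elementwise selective scan.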
Evidence And Results
The paper reports that a dedicated SSD implementation is 2-8x faster than Mamba’s optimized selective scan and supports much larger recurrent state sizes with limited slowdown. It also reports that a 2.7B-parameter Mamba-2 trained on 300B Pile tokens outperforms Mamba-2.8B, Pythia-2.8B, and Pythia-6.9B in the paper’s downstream comparison.
Relevance To This Wiki
Mamba-2 is the main mathematical and systems bridge between attention and recurrent state-space models. For time-series and world-model pages, it supplies a clean vocabulary for talking about compact latent-state mixers as structured matrices rather than only as RNNs, convolutions, or attention approximations.
It is also the immediate background for ParaRNN: Mamba-2 keeps efficient parallel training by staying in a linear recurrent family, while ParaRNN asks whether nonlinear recurrent cells can be made parallel enough to compete at large scale.
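As one concrete instance of that structured-matrix vocabulary, stated in the paper's dual form with the symbols from the core-claim equation above: dropping the decay entirely recovers unnormalized causal linear attention,

$$
A_t = I \;\Longrightarrow\; M = L \circ \left(C B^{\top}\right), \qquad L_{ji} = \mathbf{1}[\, j \ge i \,],
$$

so $C$ plays the role of queries, $B$ of keys, and the recurrent state is the running key-value sum of linear attention.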
Limitations
- SSD restricts the per-head state transition to a scalar multiple of the identity, trading some transition expressivity (relative to Mamba-1’s diagonal transitions) for a more hardware-friendly semiseparable structure.
- The paper is mostly about token-sequence language modeling and retrieval-style synthetic tasks, not direct numeric time-series forecasting or action-conditioned dynamics.
- The state-space duality framing does not automatically cover arbitrary nonlinear recurrent dynamics; ParaRNN occupies that next step.
Links Into The Wiki
- Mamba-2
- Efficient Recurrent Sequence Models
- Time-Series Scaling And Efficiency
- Mamba
- Mamba-3
- ParaRNN
Open Questions
- Which semiseparable-matrix constraints are harmless for time-series passive dynamics, and which prevent useful state tracking?
- Can structured state space duality guide efficient non-causal or bidirectional time-series encoders?
- Where is the boundary between SSD-style recurrent state and attention-style memory for long-context event streams?