Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Source

Core Claim

Mamba-2 builds a bridge between selective SSMs and attention through structured state space duality: SSM sequence transformations can be viewed as semiseparable matrix mixers, making it possible to borrow attention-style algorithms and systems ideas while retaining recurrent inference.
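The duality can be checked numerically in a few lines. Below is a minimal NumPy sketch (illustrative variable names; scalar per-step decay `a[t]` as in SSD) that computes the same sequence transformation two ways: as a linear recurrence, and as multiplication by an explicit lower-triangular 1-semiseparable matrix mixer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state size
x = rng.standard_normal(T)       # one scalar input channel
a = rng.uniform(0.5, 1.0, T)     # selective per-step scalar decay (SSD-style)
B = rng.standard_normal((T, N))  # input projections
C = rng.standard_normal((T, N))  # output projections

# View 1 -- recurrent form: h_t = a_t * h_{t-1} + B_t x_t ;  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# View 2 -- matrix-mixer form: y = M x, with the semiseparable entries
# M[t, s] = (C_t . B_s) * prod_{k=s+1..t} a_k   for s <= t, else 0
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = (C[t] @ B[s]) * np.prod(a[s + 1 : t + 1])

y_mat = M @ x
print(np.allclose(y_rec, y_mat))  # the two views agree
```

The quadratic form `M` is what makes attention-style algorithms applicable, while View 1 is what makes O(1)-per-token recurrent inference possible.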

Key Contributions

  • Identifies structured SSM transformations with semiseparable matrices and uses that lens to connect recurrent, scan, and attention-like quadratic forms.
  • Introduces the SSD algorithm, a block-decomposed semiseparable matrix multiplication method that is more hardware friendly than Mamba’s selective scan.
  • Designs the Mamba-2 block to support larger state sizes and Transformer-style training systems, including tensor parallelism and sequence parallelism.
  • Shows that Mamba-2 Pareto-dominates Mamba and Transformer++ on perplexity versus wall-clock time in the paper’s scaling setup.
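The SSD algorithm from the second bullet can be sketched at toy scale. The idea is to split the semiseparable matrix multiplication into chunks: within each chunk, use the dense attention-like quadratic form; between chunks, carry a single recurrent state. This is an illustrative NumPy sketch (chunk size `Q`, scalar decay, hypothetical helper `reference`), not the paper's kernel, which blocks and batches these steps for the hardware.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, Q = 8, 3, 4                # sequence length, state size, chunk length
x = rng.standard_normal(T)
a = rng.uniform(0.5, 1.0, T)     # per-step scalar decay
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))

def reference(x, a, B, C):
    """Plain sequential recurrence, used only to check the block version."""
    h = np.zeros(B.shape[1])
    out = np.zeros(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        out[t] = C[t] @ h
    return out

y = np.zeros(T)
h = np.zeros(N)                  # state carried between chunks
for c0 in range(0, T, Q):
    # decay from the chunk boundary to each position t: prod_{k=c0..t} a_k
    decay_in = np.cumprod(a[c0 : c0 + Q])
    for j, t in enumerate(range(c0, c0 + Q)):
        # low-rank off-diagonal block: contribution of the carried state
        y[t] = decay_in[j] * (C[t] @ h)
        # dense diagonal block: intra-chunk attention-like quadratic form
        for s in range(c0, t + 1):
            y[t] += (C[t] @ B[s]) * np.prod(a[s + 1 : t + 1]) * x[s]
    # advance the carried state through the chunk
    for t in range(c0, c0 + Q):
        h = a[t] * h + B[t] * x[t]

print(np.allclose(y, reference(x, a, B, C)))
```

The diagonal blocks are plain matmuls (hardware friendly), and only the short inter-chunk recurrence is sequential, which is the source of SSD's speedup over an element-wise selective scan.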

Evidence And Results

The paper reports a dedicated SSD implementation that is 2-8x faster than Mamba’s optimized selective scan and supports much larger recurrent state sizes with limited slowdown. It also reports that a 2.7B-parameter Mamba-2 trained on 300B Pile tokens outperforms Mamba-2.8B, Pythia-2.8B, and Pythia-6.9B in the paper’s downstream comparison.

Relevance To This Wiki

Mamba-2 is the main mathematical and systems bridge between attention and recurrent state-space models. For time-series and world-model pages, it supplies a clean vocabulary for talking about compact latent-state mixers as structured matrices rather than only as RNNs, convolutions, or attention approximations.

It is also the immediate background for ParaRNN: Mamba-2 keeps efficient parallel training by staying in a linear recurrent family, while ParaRNN asks whether nonlinear recurrent cells can be made parallel enough to compete at large scale.

Limitations

  • SSD restricts the per-step state transition to a scalar times the identity, trading some transition expressivity for a more hardware-friendly semiseparable structure.
  • The paper is mostly about token-sequence language modeling and retrieval-style synthetic tasks, not direct numeric time-series forecasting or action-conditioned dynamics.
  • The state-space duality framing does not automatically cover arbitrary nonlinear recurrent dynamics; ParaRNN occupies that next step.
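The first limitation is easy to see in the recurrence itself. The sketch below (illustrative, not from the paper's code) contrasts a Mamba-1-style diagonal transition, with an independent decay per state channel, against the SSD scalar-times-identity transition, where every channel shares one decay per step:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 5, 3
x = rng.standard_normal(T)
B = rng.standard_normal((T, N))

# Mamba-1 style: diagonal transition, N free decays per time step
A_diag = rng.uniform(0.5, 1.0, (T, N))
# SSD: scalar-times-identity transition, 1 shared decay per time step
a_scalar = rng.uniform(0.5, 1.0, T)

h_diag = np.zeros(N)
h_ssd = np.zeros(N)
for t in range(T):
    h_diag = A_diag[t] * h_diag + B[t] * x[t]    # channels decay independently
    h_ssd = a_scalar[t] * h_ssd + B[t] * x[t]    # all channels decay together
```

The scalar form is what lets SSD factor the mixer into matmul-friendly blocks, but it also means a single step cannot forget one state channel while retaining another.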

Open Questions

  • Which semiseparable-matrix constraints are harmless for time-series passive dynamics, and which prevent useful state tracking?
  • Can structured state space duality guide efficient non-causal or bidirectional time-series encoders?
  • Where is the boundary between SSD-style recurrent state and attention-style memory for long-context event streams?