Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Source

Core Claim

Mamba-2 builds a bridge between selective SSMs and attention through structured state space duality: SSM sequence transformations can be viewed as semiseparable matrix mixers, making it possible to borrow attention-style algorithms and systems ideas while retaining recurrent inference.
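The duality can be checked numerically in a few lines. Below is a minimal NumPy sketch (illustrative variable names; scalar per-step decay `a[t]` as in SSD) that computes the same sequence transformation two ways: as a linear recurrence, and as multiplication by an explicit lower-triangular 1-semiseparable matrix mixer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state size
x = rng.standard_normal(T)       # one scalar input channel
a = rng.uniform(0.5, 1.0, T)     # selective per-step scalar decay (SSD-style)
B = rng.standard_normal((T, N))  # input projections
C = rng.standard_normal((T, N))  # output projections

# View 1 -- recurrent form: h_t = a_t * h_{t-1} + B_t x_t ;  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# View 2 -- matrix-mixer form: y = M x, with the semiseparable entries
# M[t, s] = (C_t . B_s) * prod_{k=s+1..t} a_k   for s <= t, else 0
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = (C[t] @ B[s]) * np.prod(a[s + 1 : t + 1])

y_mat = M @ x
print(np.allclose(y_rec, y_mat))  # the two views agree
```

The quadratic form `M` is what makes attention-style algorithms applicable, while View 1 is what makes O(1)-per-token recurrent inference possible.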

Key Contributions

  • Identifies structured SSM transformations with semiseparable matrices and uses that lens to connect recurrent, scan, and attention-like quadratic forms.
  • Introduces the SSD algorithm, a block-decomposed semiseparable matrix multiplication method that is more hardware friendly than Mamba’s selective scan.
  • Designs the Mamba-2 block to support larger state sizes and Transformer-style training systems, including tensor parallelism and sequence parallelism.
  • Shows that Mamba-2 Pareto-dominates Mamba and Transformer++ on perplexity versus wall-clock time in the paper’s scaling setup.
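The SSD algorithm from the second bullet can be sketched at toy scale. The idea is to split the semiseparable matrix multiplication into chunks: within each chunk, use the dense attention-like quadratic form; between chunks, carry a single recurrent state. This is an illustrative NumPy sketch (chunk size `Q`, scalar decay, hypothetical helper `reference`), not the paper's kernel, which blocks and batches these steps for the hardware.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, Q = 8, 3, 4                # sequence length, state size, chunk length
x = rng.standard_normal(T)
a = rng.uniform(0.5, 1.0, T)     # per-step scalar decay
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))

def reference(x, a, B, C):
    """Plain sequential recurrence, used only to check the block version."""
    h = np.zeros(B.shape[1])
    out = np.zeros(len(x))
    for t in range(len(x)):
        h = a[t] * h + B[t] * x[t]
        out[t] = C[t] @ h
    return out

y = np.zeros(T)
h = np.zeros(N)                  # state carried between chunks
for c0 in range(0, T, Q):
    # decay from the chunk boundary to each position t: prod_{k=c0..t} a_k
    decay_in = np.cumprod(a[c0 : c0 + Q])
    for j, t in enumerate(range(c0, c0 + Q)):
        # low-rank off-diagonal block: contribution of the carried state
        y[t] = decay_in[j] * (C[t] @ h)
        # dense diagonal block: intra-chunk attention-like quadratic form
        for s in range(c0, t + 1):
            y[t] += (C[t] @ B[s]) * np.prod(a[s + 1 : t + 1]) * x[s]
    # advance the carried state through the chunk
    for t in range(c0, c0 + Q):
        h = a[t] * h + B[t] * x[t]

print(np.allclose(y, reference(x, a, B, C)))
```

The diagonal blocks are plain matmuls (hardware friendly), and only the short inter-chunk recurrence is sequential, which is the source of SSD's speedup over an element-wise selective scan.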

Evidence And Results

The paper reports a dedicated SSD implementation that is 2-8x faster than Mamba’s optimized selective scan and supports much larger recurrent state sizes with limited slowdown. It also reports that a 2.7B-parameter Mamba-2 trained on 300B Pile tokens outperforms Mamba-2.8B, Pythia-2.8B, and Pythia-6.9B in the paper’s downstream comparison.

Relevance To This Wiki

Mamba-2 is the main mathematical and systems bridge between attention and recurrent state-space models. For time-series and world-model pages, it supplies a clean vocabulary for talking about compact latent-state mixers as structured matrices rather than only as RNNs, convolutions, or attention approximations.

It is also the immediate background for ParaRNN: Mamba-2 keeps efficient parallel training by staying in a linear recurrent family, while ParaRNN asks whether nonlinear recurrent cells can be made parallel enough to compete at large scale.

Limitations

  • SSD restricts the per-step state transition to a scalar times the identity, trading some transition expressivity for a more hardware-friendly semiseparable structure.
  • The paper is mostly about token-sequence language modeling and retrieval-style synthetic tasks, not direct numeric time-series forecasting or action-conditioned dynamics.
  • The state-space duality framing does not automatically cover arbitrary nonlinear recurrent dynamics; ParaRNN occupies that next step.
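The first limitation is easy to see in the recurrence itself. The sketch below (illustrative, not from the paper's code) contrasts a Mamba-1-style diagonal transition, with an independent decay per state channel, against the SSD scalar-times-identity transition, where every channel shares one decay per step:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 5, 3
x = rng.standard_normal(T)
B = rng.standard_normal((T, N))

# Mamba-1 style: diagonal transition, N free decays per time step
A_diag = rng.uniform(0.5, 1.0, (T, N))
# SSD: scalar-times-identity transition, 1 shared decay per time step
a_scalar = rng.uniform(0.5, 1.0, T)

h_diag = np.zeros(N)
h_ssd = np.zeros(N)
for t in range(T):
    h_diag = A_diag[t] * h_diag + B[t] * x[t]    # channels decay independently
    h_ssd = a_scalar[t] * h_ssd + B[t] * x[t]    # all channels decay together
```

The scalar form is what lets SSD factor the mixer into matmul-friendly blocks, but it also means a single step cannot forget one state channel while retaining another.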

Open Questions

  • Which semiseparable-matrix constraints are harmless for time-series passive dynamics, and which prevent useful state tracking?
  • Can structured state space duality guide efficient non-causal or bidirectional time-series encoders?
  • Where is the boundary between SSD-style recurrent state and attention-style memory for long-context event streams?