Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Source

Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arXiv:2312.00752, 2023.

Core Claim

Mamba introduces selective state space models: recurrent sequence mixers whose state-space parameters depend on the input token, letting the model selectively propagate, ignore, or reset information while keeping linear scaling in sequence length and constant-size autoregressive state.
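
To make the mechanism concrete, below is a minimal NumPy sketch of one selective SSM layer written as a sequential reference loop: the step size Δ and the B/C projections are computed from the current token, the recurrent state has a fixed size independent of sequence length, and the whole pass costs O(L · D · N), linear in L. The projection names, shapes, and the simplified exponential-Euler discretization are illustrative assumptions, not the paper's exact parameterization or its fused kernel.

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential reference pass over one sequence.

    x:        (L, D)  input sequence
    A:        (D, N)  learned state matrix (diagonal per channel, negative entries)
    W_delta:  (D, D)  projection producing the input-dependent step size
    W_B, W_C: (D, N)  projections producing the input-dependent B_t and C_t
    Returns y of shape (L, D); total cost is O(L * D * N), linear in L.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                                  # constant-size recurrent state
    y = np.zeros((L, D))
    for t in range(L):
        x_t = x[t]                                        # (D,)
        delta = np.logaddexp(0.0, x_t @ W_delta)          # softplus step size, (D,)
        B_t = x_t @ W_B                                   # (N,), input-dependent
        C_t = x_t @ W_C                                   # (N,), input-dependent
        A_bar = np.exp(delta[:, None] * A)                # discretized decay, (D, N)
        B_bar = delta[:, None] * B_t[None, :]             # discretized input map, (D, N)
        h = A_bar * h + B_bar * x_t[:, None]              # selective state update
        y[t] = (h * C_t[None, :]).sum(axis=-1)            # readout, (D,)
    return y
```

A production implementation would batch this, fuse the loop into a kernel, and use the paper's exact zero-order-hold discretization; the point here is only the input-dependent update and the constant-size state.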

Key Contributions

  • Adds input-dependent selectivity to structured SSMs by letting the discretization step size Δ and the input/output projections B and C depend on the current token.
  • Replaces convolutional SSM computation with a hardware-aware parallel scan, because input-dependent dynamics are no longer linear time-invariant (see the scan sketch after this list).
  • Packages the selective SSM into a simple attention-free Mamba block that combines sequence mixing and channel mixing.
  • Demonstrates strong results across language modeling, audio, and genomics, including language-model scaling up to the billion-parameter regime.
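
As noted in the second bullet above, once Δ_t, B_t, and C_t depend on the input, the layer is no longer a time-invariant convolution, but the recurrence h_t = a_t * h_{t-1} + b_t (elementwise) is still linear in h, so prefix states compose associatively and can be computed with a parallel scan. The NumPy sketch below shows that combine rule with a simple Hillis-Steele scan; it is an illustration of the idea, not Mamba's fused, recomputation-based CUDA kernel.

```python
import numpy as np

def combine(a1, b1, a2, b2):
    # Composition of two affine maps h -> a*h + b, applying (a1, b1) first.
    # This operator is associative, which is what makes a parallel prefix scan possible.
    return a2 * a1, a2 * b1 + b2

def parallel_linear_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t for all t (with h_{-1} = 0) in O(log L) steps.

    a, b: arrays of shape (L, ...) holding the per-step decay and input terms.
    Hillis-Steele style scan written with NumPy slicing for clarity.
    """
    L = a.shape[0]
    A, B = a.copy(), b.copy()
    shift = 1
    while shift < L:
        # Each position absorbs the segment ending `shift` steps earlier.
        A_new, B_new = A.copy(), B.copy()
        A_new[shift:], B_new[shift:] = combine(A[:-shift], B[:-shift], A[shift:], B[shift:])
        A, B = A_new, B_new
        shift *= 2
    return B   # B[t] now equals h_t
```

For the selective SSM sketched earlier, a_t is the discretized decay A_bar at step t and b_t is B_bar * x_t, so running this scan over the per-step terms reproduces the sequential states. The Hillis-Steele form does O(L log L) work for clarity; the paper's kernel uses a work-efficient scan fused with the projections and recomputes intermediates in the backward pass instead of storing them.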

Evidence And Results

Mamba reports that its selective SSM layer solves synthetic selective-copying and induction-head tasks and extrapolates to million-token sequences. In language modeling, the paper frames Mamba as the first attention-free linear-time sequence model to match a strong Transformer++ recipe under its scaling-law setup, and it reports 4-5x higher inference throughput than comparable Transformers in the benchmarked autoregressive setting.
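
For readers unfamiliar with the synthetic benchmark, the sketch below generates one selective-copying instance: content tokens are scattered among noise tokens, and the target is the content tokens in order, so the model must remember some inputs and ignore the rest. Vocabulary size, lengths, and the noise-token convention here are hypothetical choices for illustration, not the paper's exact task configuration.

```python
import numpy as np

def make_selective_copy_example(seq_len=64, n_content=8, vocab=16, noise_token=0, seed=None):
    """Return (inputs, targets) for one selective-copying instance."""
    rng = np.random.default_rng(seed)
    content = rng.integers(1, vocab, size=n_content)              # tokens to remember
    positions = np.sort(rng.choice(seq_len, n_content, replace=False))
    inputs = np.full(seq_len, noise_token)                        # everything else is noise
    inputs[positions] = content
    targets = content                                             # copy content, ignore noise
    return inputs, targets
```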

Relevance To This Wiki

Mamba is the main background source for efficient recurrent sequence models before ParaRNN. It shows how a selective recurrent latent state can compete with attention on information-dense token sequences while still preserving parallel training, because the recurrent update stays linear in the state and can be computed with a parallel scan.

For time-series and world-model readers, the important abstraction is a compact latent state that compresses history with input-dependent updates. The paper itself is not a numeric time-series foundation-model paper, so wiki pages should cite it as sequence-model architecture background rather than as direct evidence about forecasting accuracy.

Limitations

  • Selective SSMs recover expressivity through input-dependent parameters, but the state update remains linear in the hidden state.
  • Mamba’s efficient training relies on custom fused scan kernels and recomputation rather than a generic nonlinear recurrence solver.
  • The core results emphasize language, audio, and genomics, so transfer to multivariate time series, event streams, trajectories, or action-conditioned world models remains a separate question.

Open Questions

  • Which time-series and event-stream settings actually need Mamba-style selectivity rather than simpler linear recurrent or convolutional mixers?
  • How much of Mamba’s advantage comes from selective state dynamics versus the surrounding block design and optimized kernels?
  • Can selective SSM state resets be made action-aware for action-conditioned world models?