---
abstract: |
  `\noindent`{=latex} Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to *selectively* propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (**Mamba**). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
author:
- 'Albert Gu[^1]'
- 'Tri Dao`\samethanks`{=latex}'
bibliography:
- biblio.bib
title: 'Mamba: Linear-Time Sequence Modeling with Selective State Spaces'
---

```{=latex}
\newcommand{\citep}{\parencite}
```
```{=latex}
\newcommand{\citet}{\textcite}
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\newcommand{\paren}[1]{\left (#1 \right)}
```
```{=latex}
\newcommand{\dd}{\mathop{}\!d}
```
```{=latex}
\DeclareMathOperator*{\diag}{diag}
```
```{=latex}
\newcommand{\NAME}{Mamba}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\dt}{\Delta}
```
```{=latex}
\newcommand{\A}{\bm{A}}
```
```{=latex}
\newcommand{\B}{\bm{B}}
```
```{=latex}
\newcommand{\C}{\bm{C}}
```
```{=latex}
\newcommand{\D}{\bm{D}}
```
```{=latex}
\newcommand{\K}{\overline{\bm{K}}}
```
```{=latex}
\newcommand{\da}{\overline{\bm{A}}}
```
```{=latex}
\newcommand{\db}{\overline{\bm{B}}}
```
```{=latex}
\newcommand{\dA}{\overline{\bm{A}}}
```
```{=latex}
\newcommand{\dB}{\overline{\bm{B}}}
```
```{=latex}
\newcommand{\AB}{(\A, \B)}
```
```{=latex}
\newcommand{\ABC}{(\A, \B, \C)}
```
```{=latex}
\newcommand{\dtAB}{(\dt, \A, \B)}
```
```{=latex}
\newcommand{\dtABC}{(\dt, \A, \B, \C)}
```
```{=latex}
\newcommand{\dAB}{(\dA, \dB)}
```
```{=latex}
\newcommand{\para}[1]{\paragraph{#1}}
```
```{=latex}
\newcommand{\maybe}[1]{\textcolor{blue}{#1}}
```
```{=latex}
\newcommand*\samethanks[1][\value{footnote}]{\footnotemark[#1]}
```
```{=latex}
\maketitle
```
Introduction {#sec:intro}
============

Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbones of these FMs are often *sequence models*, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics [@sutskever2014sequence; @dosovitskiy2020image; @oord2016wavenet; @brown2020language; @ismail2019deep; @poli2023hyena]. While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer [@vaswani2017attention] and its core attention layer [@bahdanau2015neural]. The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks [@tay2022efficient], but often at the expense of the very properties that make it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.

Recently, structured state space sequence models (SSMs) [@gu2021combining; @gu2022efficiently] have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models [@kalman1960new]. This class of models can be computed very efficiently as either a recurrence or convolution, with linear or near-linear scaling in sequence length. Additionally, they have principled mechanisms for modeling long-range dependencies [@gu2020hippo] in certain data modalities, and have dominated benchmarks such as the Long Range Arena [@tay2021long]. Many flavors of SSMs  [@gu2022efficiently; @gupta2022diagonal; @gu2022parameterization; @li2023makes; @ma2023mega; @smith2023s5; @orvieto2023resurrecting] have been successful in domains involving continuous signal data such as audio and vision [@goel2022raw; @saon2023diagonal; @nguyen2022s4nd]. However, they have been less effective at modeling discrete and information-dense data such as text.

We propose a new class of **selective state space models** that improves on prior work along several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.

```{=latex}
\para{Selection Mechanism.}
```
First, we identify a key limitation of prior models: they lack the ability to efficiently *select* data in an input-dependent manner (i.e. focus on or ignore particular inputs). Building on intuition based on important synthetic tasks such as selective copy and induction heads, we design a simple selection mechanism by parameterizing the SSM parameters based on the input. This allows the model to filter out irrelevant information and remember relevant information indefinitely.

`\para{Hardware-aware Algorithm.}`{=latex} This simple change poses a technical challenge for the computation of the model; in fact, all prior SSM models must be time- and input-invariant in order to be computationally efficient. We overcome this with a hardware-aware algorithm that computes the model recurrently with a scan instead of convolution, but does not materialize the expanded state in order to avoid IO access between different levels of the GPU memory hierarchy. The resulting implementation is faster than previous methods both in theory (scaling linearly in sequence length, compared to pseudo-linear for all convolution-based SSMs) and on modern hardware (up to 3$\times$ faster on A100 GPUs).

```{=latex}
\para{Architecture.}
```
We simplify prior deep sequence model architectures by combining the design of prior SSM architectures [@dao2023hungry] with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design (**Mamba**) incorporating selective state spaces.

Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties that make them suitable as the backbone of general foundation models operating on sequences.

```{=latex}
\begin{enumerate*}[label=(\roman*)]\item High quality: selectivity brings strong performance on dense modalities such as language and genomics.
\item Fast training and inference: computation and memory scale linearly in sequence length during training, and unrolling the model autoregressively during inference requires only constant time per step since it does not require a cache of previous elements.
\item Long context: the quality and efficiency together yield performance improvements on real data up to sequence length 1M.
\end{enumerate*}
```
We empirically validate Mamba's potential as a general sequence FM backbone, in both pretraining quality and domain-specific task performance, on several types of modalities and settings:

-   **Synthetics.** On important synthetic tasks such as copying and induction heads that have been proposed as being key to large language models, Mamba not only solves them easily but can *extrapolate solutions indefinitely long* ($>$1M tokens).

-   **Audio and Genomics.** Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, both in pretraining quality and downstream metrics (e.g. reducing FID on a challenging speech generation dataset by more than half). In both settings, its *performance improves with longer context up to million-length sequences*.

-   **Language Modeling.** Mamba is the first *linear-time sequence model that truly achieves Transformer-quality performance*, both in pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa [@touvron2023llama]. Our Mamba language model has 5$\times$ generation throughput compared to Transformers of similar size, and Mamba-3B's quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B).

Model code and pre-trained checkpoints are open-sourced at <https://github.com/state-spaces/mamba>.

![ (**Overview**.) Structured SSMs independently map each channel (e.g. $D=5$) of an input $x$ to output $y$ through a higher dimensional latent state $h$ (e.g. $N=4$). Prior SSMs avoid materializing this large effective state ($DN$, times batch size $B$ and sequence length $L$) through clever alternate computation paths requiring time-invariance: the $(\dt, \A, \B, \C)$ parameters are constant across time. Our selection mechanism adds back input-dependent dynamics, which also requires a careful hardware-aware algorithm to only materialize the expanded states in more efficient levels of the GPU memory hierarchy. ](fig/selection.png){#fig:selection width="\\linewidth"}

State Space Models {#sec:background}
==================

Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are broadly related to RNNs, CNNs, and classical state space models. They are inspired by a particular continuous system `\eqref{eq:ssm}`{=latex} that maps a 1-dimensional function or sequence $x(t) \in \R \mapsto y(t) \in \R$ through an implicit latent state $h(t) \in \R^N$.

Concretely, S4 models are defined with four parameters $(\dt, \A, \B, \C)$, which define a sequence-to-sequence transformation in two stages.

```{=latex}
\begin{subequations}
    \label{eq:ssm}
    \begin{align}
      h'(t) &= \A h(t) + \B x(t) \\
      y(t) &= \C h(t)
    \end{align}
  \end{subequations}
```
```{=latex}
\begin{subequations}
    \label{eq:ssm:recurrence}
    \begin{align}
    \label{eq:ssm:recurrence:1}
      h_{t} &= \dA h_{t-1} + \dB x_t \\
    \label{eq:ssm:recurrence:2}
      y_t &= \C h_t
    \end{align}
  \end{subequations}
```
```{=latex}
\begin{subequations}%
    \label{eq:ssm:convolution}
    \begin{align}
      \label{eq:ssm:convolution:1}
      \bm{\overline{K}} &= (\bm{C}\bm{\overline{B}}, \bm{C}\bm{\overline{A}}\bm{\overline{B}}, \dots, \bm{C}\bm{\overline{A}}^{k}\bm{\overline{B}}, \dots) \\
      \label{eq:ssm:convolution:2}
      y &= x \ast \bm{\overline{K}}
    \end{align}
  \end{subequations}
```
```{=latex}
\para{Discretization.}
```
The first stage transforms the "continuous parameters" $\dtAB$ to "discrete parameters" $\dAB$ through fixed formulas $\dA = f_A(\dt, \A)$ and $\dB = f_B(\dt, \A, \B)$, where the pair $(f_A, f_B)$ is called a *discretization rule*.

Various rules can be used such as the zero-order hold (ZOH) defined in equation `\eqref{eq:zoh}`{=latex}. $$\label{eq:zoh}
    \dA = \exp(\dt \bm{A})
    \qquad
    \dB = (\dt \bm{A})^{-1} (\exp(\dt \bm{A}) - \bm{I}) \cdot \dt \bm{B}$$
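
For a diagonal $\A$, the ZOH rule `\eqref{eq:zoh}`{=latex} reduces to elementwise scalar formulas. The following is a minimal sketch under that assumption, written with plain Python lists so that it has no dependencies; the function name `discretize_zoh` is illustrative and not taken from the released code.

```python
import math

def discretize_zoh(dt, A, B):
    """Zero-order hold for a diagonal SSM (illustrative helper).

    dt: positive step size
    A:  diagonal entries of the state matrix, length N
    B:  entries of the input vector, length N
    Returns (A_bar, B_bar), each a list of length N.
    """
    A_bar = [math.exp(dt * a) for a in A]
    # Elementwise (dt*a)^{-1} (exp(dt*a) - 1) * dt*b.
    # expm1 keeps precision when dt*a is small; a == 0 uses the limit dt*b.
    B_bar = [
        (math.expm1(dt * a) / (dt * a)) * dt * b if a != 0.0 else dt * b
        for a, b in zip(A, B)
    ]
    return A_bar, B_bar

A_bar, B_bar = discretize_zoh(dt=0.1, A=[-1.0, -2.0], B=[1.0, 1.0])
```

As $\dt \to 0$ the result approaches $\dA \approx \bm{I}$ and $\dB \approx \dt \bm{B}$, so a small step size leaves the state nearly unchanged.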

Discretization has deep connections to continuous-time systems which can endow them with additional properties such as resolution invariance [@nguyen2022s4nd] and automatically ensuring that the model is properly normalized [@gu2023train; @orvieto2023resurrecting]. It also has connections to gating mechanisms of RNNs [@tallec2018can; @gu2020improving] which we will revisit in `\cref{sec:method:properties}`{=latex}. However, from a mechanical point of view discretization can simply be viewed as the first step of the computation graph in the forward pass of an SSM.

Alternate flavors of SSMs can bypass the discretization step and parameterize $\dAB$ directly instead [@zhang2023effectively], which may be easier to reason about.

```{=latex}
\para{Computation.}
```
After the parameters have been transformed from $(\dt, \A, \B, \C) \mapsto (\dA, \dB, \C)$, the model can be computed in two ways, either as a **linear recurrence** `\eqref{eq:ssm:recurrence}`{=latex} or a **global convolution** `\eqref{eq:ssm:convolution}`{=latex}.

Commonly, the model uses the convolutional mode `\eqref{eq:ssm:convolution}`{=latex} for efficient parallelizable training (where the whole input sequence is seen ahead of time), and switches to recurrent mode `\eqref{eq:ssm:recurrence}`{=latex} for efficient autoregressive inference (where the inputs are seen one timestep at a time).
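
The equivalence of the two modes can be checked numerically on a toy example. The sketch below assumes a single channel with diagonal $\dA$ and a zero initial state; the helper names are illustrative, and the convolution is written out directly in $O(L^2)$ time rather than with the FFT that a practical implementation would likely use.

```python
def ssm_recurrent(A_bar, B_bar, C, x):
    """Recurrent mode: h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t."""
    h = [0.0] * len(A_bar)
    ys = []
    for x_t in x:
        h = [a * h_n + b * x_t for a, b, h_n in zip(A_bar, B_bar, h)]
        ys.append(sum(c * h_n for c, h_n in zip(C, h)))
    return ys

def ssm_kernel(A_bar, B_bar, C, length):
    """Kernel entries K_k = C A_bar^k B_bar for k = 0 .. length-1."""
    return [sum(c * (a ** k) * b for a, b, c in zip(A_bar, B_bar, C))
            for k in range(length)]

def ssm_convolutional(A_bar, B_bar, C, x):
    """Convolutional mode: y = x * K as a direct causal convolution."""
    K = ssm_kernel(A_bar, B_bar, C, len(x))
    return [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

# Toy parameters (N = 2) and a short input sequence.
A_bar, B_bar, C = [0.9, 0.5], [0.1, 0.25], [1.0, -2.0]
x = [1.0, 0.0, 2.0, -1.0]
y_rec = ssm_recurrent(A_bar, B_bar, C, x)
y_conv = ssm_convolutional(A_bar, B_bar, C, x)
```

Both functions return the same sequence up to floating-point error, because the kernel depends only on the fixed $(\dA, \dB, \C)$.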

```{=latex}
\para{Linear Time Invariance (LTI).}
```
An important property of equations `\eqref{eq:ssm}`{=latex} to `\eqref{eq:ssm:convolution}`{=latex} is that the model's dynamics are constant through time. In other words, $\dtABC$, and consequently $\dAB$ as well, are fixed for all time-steps. This property is called *linear time invariance (LTI)*, which is deeply connected to recurrence and convolutions. Informally, we think of LTI SSMs as being equivalent to any linear recurrence `\eqref{eq:ssm:recurrence:1}`{=latex} or convolution `\eqref{eq:ssm:convolution:2}`{=latex}, and use LTI as an umbrella term for these classes of models.

Thus far, all structured SSMs have been LTI (e.g. computed as convolutions) because of fundamental efficiency constraints, discussed in `\cref{sec:method:scan}`{=latex}. However, a core insight of this work is that LTI models have fundamental limitations in modeling certain types of data, and our technical contributions involve removing the LTI constraint while overcoming the efficiency bottlenecks.

```{=latex}
\para{Structure and Dimensions.}
```
Finally, we note that structured SSMs are so named because computing them efficiently also requires imposing structure on the $\A$ matrix. The most popular form of structure is diagonal [@gupta2022diagonal; @gu2022parameterization; @smith2023s5], which we also use.

In this case, the $\A \in \R^{N \times N}, \B \in \R^{N \times 1}, \C \in \R^{1 \times N}$ matrices can all be represented by $N$ numbers. To operate over an input sequence $x$ of batch size $B$ and length $L$ with $D$ channels, the SSM is applied independently to each channel. Note that in this case, the total hidden state has dimension $DN$ per input, and computing it over the sequence length requires $O(BLDN)$ time and memory; this is the root of the fundamental efficiency bottleneck addressed in `\cref{sec:method:scan}`{=latex}.
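
To make the size gap concrete, the element counts can be tallied for a hypothetical configuration. The numbers below are illustrative assumptions, not settings reported in this paper.

```python
def ssm_tensor_sizes(B, L, D, N):
    """Element counts for the tensors discussed above (illustrative only)."""
    return {
        "x_or_y": B * L * D,           # input or output, shape (B, L, D)
        "state_per_input": D * N,      # hidden state at a single time step
        "state_full": B * L * D * N,   # state materialized over the sequence
    }

# Hypothetical sizes chosen only for the arithmetic.
sizes = ssm_tensor_sizes(B=8, L=2048, D=1024, N=16)
ratio = sizes["state_full"] // sizes["x_or_y"]  # equals N
```

The fully materialized state is exactly $N$ times larger than the input or output tensor, which is why avoiding that materialization matters.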

```{=latex}
\para{General State Space Models.}
```
We note that the term *state space model* has a very broad meaning which simply represents the notion of any recurrent process with a latent state. It has been used to refer to many disparate concepts in different disciplines, including Markov decision processes (MDP) (reinforcement learning [@hafner2020dream]), dynamic causal modeling (DCM) (computational neuroscience [@friston2003dynamic]), Kalman filters (controls [@kalman1960new]), hidden Markov models (HMM) and linear dynamical systems (LDS) (machine learning), and recurrent (and sometimes convolutional) models at large (deep learning).

Throughout this entire paper we use the term "SSM" to refer exclusively to the class of structured SSMs or S4 models [@gu2022efficiently; @gupta2022diagonal; @gu2022parameterization; @ma2023mega; @smith2023s5; @hasani2023liquid] and use these terms interchangeably. For convenience we may also include derivatives of such models, such as those focusing on either the linear-recurrence or global-convolution viewpoints [@orvieto2023resurrecting; @li2023makes; @poli2023hyena], and clarify nuances when necessary.

```{=latex}
\para{SSM Architectures.}
```
SSMs are standalone sequence transformations that can be incorporated into end-to-end neural network architectures.

(We also sometimes call SSM architectures SSNNs, which are to SSM layers as CNNs are to linear convolution layers.)

We discuss some of the most well-known SSM architectures, many of which will also serve as our primary baselines.

-   Linear attention [@katharopoulos2020transformers] is an approximation of self-attention involving a recurrence which can be viewed as a degenerate linear SSM.

-   H3 [@dao2023hungry] generalized this recurrence to use S4; it can be viewed as an architecture with an SSM sandwiched by two gated connections (`\cref{fig:architecture}`{=latex}). H3 also inserts a standard local convolution, which they frame as a shift-SSM, before the main SSM layer.

-   Hyena [@poli2023hyena] uses the same architecture as H3 but replaces the S4 layer with an MLP-parameterized global convolution [@romero2021ckconv].

-   RetNet [@sun2023retentive] adds an additional gate to the architecture and uses a simpler SSM, allowing an alternative parallelizable computation path, using a variant of multi-head attention (MHA) instead of convolutions.

-   RWKV [@peng2023rwkv] is a recent RNN designed for language modeling based on another linear attention approximation, the attention-free Transformer [@zhai2021attention]. Its main "WKV" mechanism involves LTI recurrences and can be viewed as the ratio of two SSMs.
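
As an illustration of the first item, causal linear attention can be computed either by summing over all previous positions or by carrying a running state, which is a linear recurrence whose state transition is the identity. The sketch below assumes the feature map $\mathrm{elu}(u) + 1$ and the usual normalization by the summed keys; these details and all function names are assumptions made for illustration.

```python
import math

def feature_map(u):
    """elu(u) + 1 applied elementwise; always positive (assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in u]

def linear_attention_recurrent(Q, K, V):
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = feature_map(q), feature_map(k)
        for i in range(d_k):               # S_t = S_{t-1} + phi(k_t) v_t^T
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * z[i] for i in range(d_k))
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out

def linear_attention_quadratic(Q, K, V):
    """Same quantity, computed with an explicit sum over positions s <= t."""
    out = []
    for t, q in enumerate(Q):
        fq = feature_map(q)
        w = [sum(a * b for a, b in zip(fq, feature_map(K[s])))
             for s in range(t + 1)]
        total = sum(w)
        out.append([sum(w[s] * V[s][j] for s in range(t + 1)) / total
                    for j in range(len(V[0]))])
    return out

Q = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.5], [0.9, -0.2]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
out_rec = linear_attention_recurrent(Q, K, V)
out_quad = linear_attention_quadratic(Q, K, V)
```

The recurrent form needs only the fixed-size state `(S, z)` per step, independent of how many tokens have been seen.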

Other closely related SSMs and architectures are discussed further in an extended related work (`\cref{sec:related}`{=latex}). We highlight in particular S5 [@smith2023s5], QRNN [@bradbury2016quasi], and SRU [@lei2017simple], which we view as the most closely related methods to our core selective SSM.

Selective State Space Models {#sec:method}
============================

We motivate our selection mechanism using intuition from synthetic tasks (`\cref{sec:method:motivation}`{=latex}), then explain how to incorporate this mechanism into state space models (`\cref{sec:method:selective}`{=latex}). The resulting time-varying SSMs cannot use convolutions, presenting a technical challenge of how to compute them efficiently. We overcome this with a hardware-aware algorithm that exploits the memory hierarchy on modern hardware (`\cref{sec:method:scan}`{=latex}). We then describe a simple SSM architecture without attention or even MLP blocks (`\cref{sec:method:architecture}`{=latex}). Finally, we discuss some additional properties of selection mechanisms (`\cref{sec:method:properties}`{=latex}).

Motivation: Selection as a Means of Compression {#sec:method:motivation}
-----------------------------------------------

We argue that a fundamental problem of sequence modeling is *compressing context into a smaller state*. In fact, we can view the tradeoffs of popular sequence models from this point of view. For example, attention is both effective and inefficient because it explicitly does not compress context at all. This can be seen from the fact that autoregressive inference requires explicitly storing the entire context (i.e. the KV cache), which directly causes the slow linear-time inference and quadratic-time training of Transformers. On the other hand, recurrent models are efficient because they have a finite state, implying constant-time inference and linear-time training. However, their effectiveness is limited by how well this state has compressed the context.

To understand this principle, we focus on two running examples of synthetic tasks (`\cref{fig:copying}`{=latex}).

-   The **Selective Copying** task modifies the popular Copying task [@arjovsky2016unitary] by varying the position of the tokens to memorize. It requires *content-aware* reasoning to be able to memorize the relevant tokens (*colored*) and filter out the irrelevant ones (*white*).

-   The **Induction Heads** task is a well-known mechanism hypothesized to explain the majority of in-context learning abilities of LLMs [@olsson2022context]. It requires *context-aware* reasoning to know when to produce the correct output in the appropriate context (*black*).

These tasks reveal the failure mode of LTI models. From the recurrent view, their constant dynamics (e.g. the $\dAB$ transitions in `\eqref{eq:ssm:recurrence}`{=latex}) cannot let them select the correct information from their context, or affect the hidden state passed along the sequence in an input-dependent way. From the convolutional view, it is known that global convolutions can solve the vanilla Copying task [@romero2021ckconv] because it only requires time-awareness, but that they have difficulty with the Selective Copying task because of lack of content-awareness (`\cref{fig:copying}`{=latex}). More concretely, the spacing between inputs-to-outputs is varying and cannot be modeled by static convolution kernels.
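
A toy generator makes the varying spacing explicit. This is a sketch of one plausible data format for Selective Copying; the token values, sequence length, and the function name are hypothetical and not the experimental setup used in this paper.

```python
import random

def selective_copying_example(seq_len, num_tokens, vocab, noise=0, rng=None):
    """Scatter `num_tokens` content tokens at random positions among noise.

    The target is the content tokens in their order of appearance.
    `vocab` must not contain the `noise` token.
    """
    rng = rng or random.Random()
    positions = sorted(rng.sample(range(seq_len), num_tokens))
    content = [rng.choice(vocab) for _ in range(num_tokens)]
    inputs = [noise] * seq_len
    for pos, tok in zip(positions, content):
        inputs[pos] = tok
    return inputs, content

inputs, target = selective_copying_example(
    seq_len=16, num_tokens=4, vocab=[1, 2, 3, 4, 5], rng=random.Random(0)
)
```

Because the positions are resampled for every example, the offset from each content token to its slot in the output differs across examples, which is what a fixed convolution kernel cannot express.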

In summary, the efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is **selectivity**, or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension (see `\cref{sec:method:properties}`{=latex} for more discussion).

```{=latex}
\begin{figure*}[!t]\centering
  \includegraphics[width=\linewidth]{fig/copying.pdf}
  \caption{
    (\textit{Left}) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions.
    (\textit{Right Top}) The Selective Copying task has random spacing in between inputs and requires time-varying models that can \textit{selectively} remember or ignore inputs depending on their content.
    (\textit{Right Bottom}) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.
  }
  \label{fig:copying}
  
\end{figure*}
```
Improving SSMs with Selection {#sec:method:selective}
-----------------------------

One method of incorporating a selection mechanism into models is by letting their parameters that affect interactions along the sequence (e.g. the recurrent dynamics of an RNN or the convolution kernel of a CNN) be input-dependent.

`\cref{alg:s4,alg:s6}`{=latex} illustrate the main selection mechanism that we use. The main difference is simply making several parameters $\dt, \B, \C$ functions of the input, along with the associated changes to tensor shapes throughout. In particular, we highlight that these parameters now have a length dimension $L$, meaning that the model has changed from time-invariant to time-varying. (Note that shape annotations were described in `\cref{sec:background}`{=latex}.) This loses the equivalence to convolutions `\eqref{eq:ssm:convolution}`{=latex} with implications for its efficiency, discussed next.

We specifically choose $s_B(x) = \mathsf{Linear}_N(x)$, $s_C(x) = \mathsf{Linear}_N(x)$, $s_\dt(x) = \mathsf{Broadcast}_D(\mathsf{Linear}_1(x))$, and $\tau_\dt = \mathsf{softplus}$, where $\mathsf{Linear}_d$ is a parameterized projection to dimension $d$. The choice of $s_\dt$ and $\tau_\dt$ is due to a connection to RNN gating mechanisms explained in `\cref{sec:method:properties}`{=latex}.
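
The selection mechanism of `\cref{alg:s6}`{=latex} can be sketched as a plain sequential loop for a single sequence. This is an unoptimized reference under stated assumptions: ZOH `\eqref{eq:zoh}`{=latex} is used for both $\dA$ and $\dB$, $\A$ is diagonal with nonzero entries, the linear maps have no bias, and every name below is illustrative rather than taken from the released implementation.

```python
import math

def softplus(v):
    """Numerically stable softplus, used as tau_Delta."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def linear(W, x):
    """Apply an (out, in) weight matrix W to a length-in vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def selective_ssm(x, A, W_B, W_C, w_dt, dt_bias):
    """Sequential S6 reference for one sequence x of shape (L, D).

    A: (D, N) diagonal state entries per channel (nonzero)
    W_B, W_C: (N, D) weights for s_B and s_C
    w_dt: (D,) weights for Linear_1 inside s_Delta
    dt_bias: (D,) the Parameter added before softplus
    """
    D, N = len(A), len(A[0])
    h = [[0.0] * N for _ in range(D)]
    ys = []
    for x_t in x:
        B_t = linear(W_B, x_t)                          # (N,), input-dependent
        C_t = linear(W_C, x_t)                          # (N,), input-dependent
        s_dt = sum(w * xi for w, xi in zip(w_dt, x_t))  # scalar, broadcast to D
        y_t = []
        for d in range(D):
            dt = softplus(dt_bias[d] + s_dt)
            for n in range(N):
                a = A[d][n]
                A_bar = math.exp(dt * a)
                B_bar = math.expm1(dt * a) / a * B_t[n]  # ZOH with diagonal A
                h[d][n] = A_bar * h[d][n] + B_bar * x_t[d]
            y_t.append(sum(C_t[n] * h[d][n] for n in range(N)))
        ys.append(y_t)
    return ys

# Toy shapes: L = 3, D = 2, N = 2.
x = [[1.0, -0.5], [0.0, 2.0], [0.3, 0.3]]
A = [[-1.0, -2.0], [-0.5, -1.5]]
W_B = [[0.2, -0.1], [0.4, 0.3]]
W_C = [[1.0, 0.0], [0.5, -0.5]]
y = selective_ssm(x, A, W_B, W_C, w_dt=[0.1, -0.2], dt_bias=[0.0, 0.5])
```

Since $\dA$ and $\dB$ are recomputed from $x_t$ at every step, no single convolution kernel reproduces this map, which is why only the recurrent (scan) form remains available.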

```{=latex}
\begin{figure*}[!t]\begin{minipage}{.49\linewidth}
    \begin{algorithm}[H]
      \small
      \algrenewcommand\algorithmicrequire{\textbf{Input: }}
      \algrenewcommand\algorithmicensure{\textbf{Output: }}
      \caption{SSM (S4)}
      \label{alg:s4}
      \begin{algorithmic}[1]
        \Require $x : \mathtt{(B, L, D)}$
        \Ensure $y : \mathtt{(B, L, D)}$
        \State $\A : \mathtt{(D, N)} \gets \mathsf{Parameter}$

        \Comment{Represents structured $N \times N$ matrix}
        \State $\B : \mathtt{(D, N)} \gets \mathsf{Parameter}$
        \State $\C : \mathtt{(D, N)} \gets \mathsf{Parameter}$
        \State $\dt : \mathtt{(D)} \gets \tau_\dt(\mathsf{Parameter})$
        \State $\dA, \dB : \mathtt{(D, N)} \gets \mathsf{discretize}(\dt, \A, \B)$
        \State $y \gets \mathsf{SSM}(\dA, \dB, \C)(x)$

        \Comment{Time-invariant: recurrence or convolution}
        \State \textbf{return} $y$
      \end{algorithmic}
    \end{algorithm}
  \end{minipage}
  \begin{minipage}{.49\linewidth}
    \begin{algorithm}[H]
      \small
      \algrenewcommand\algorithmicrequire{\textbf{Input: }}
      \algrenewcommand\algorithmicensure{\textbf{Output: }}
      \caption{SSM + Selection (S6)}
      \label{alg:s6}
      \begin{algorithmic}[1]
        \Require $x : \mathtt{(B, L, D)}$
        \Ensure $y : \mathtt{(B, L, D)}$
        \State $\A : \mathtt{(D, N)} \gets \mathsf{Parameter}$

        \Comment{Represents structured $N \times N$ matrix}
        \State $\B : \textcolor{BrickRed}{\mathtt{(B, L, N)}} \gets \textcolor{BrickRed}{s_B(x)}$
        \State $\C : \textcolor{BrickRed}{\mathtt{(B, L, N)}} \gets \textcolor{BrickRed}{s_C(x)}$
        \State $\dt : \textcolor{BrickRed}{\mathtt{(B, L, D)}} \gets \tau_\dt(\mathsf{Parameter} \textcolor{BrickRed}{+ s_\dt(x)})$
        \State $\dA, \dB : \textcolor{BrickRed}{\mathtt{(B, L, D, N)}} \gets \mathsf{discretize}(\dt, \A, \B)$
        \State $y \gets \mathsf{SSM}(\dA, \dB, \C)(x)$

        \Comment{\textcolor{BrickRed}{Time-varying}: recurrence (\textcolor{BrickRed}{\emph{scan}}) only}
        \State \textbf{return} $y$
      \end{algorithmic}
    \end{algorithm}
  \end{minipage}
  
\end{figure*}
```
Efficient Implementation of Selective SSMs {#sec:method:scan}
------------------------------------------

Hardware-friendly primitives such as convolutions [@krizhevsky2012imagenet] and attention [@bahdanau2015neural; @vaswani2017attention] enjoy widespread application. Here we aim to make selective SSMs efficient on modern hardware (GPUs) as well. The selection mechanism is quite natural, and earlier works attempted to incorporate special cases of selection, such as letting $\dt$ vary over time in recurrent SSMs [@gu2020hippo]. However, as previously mentioned, a core limitation in the usage of SSMs is their computational efficiency, which was why S4 and all derivatives used LTI (non-selective) models, most commonly in the form of global convolutions.

### Motivation of Prior Models

We first revisit this motivation and overview our approach to overcome limitations of prior methods.

-   At a high level, recurrent models such as SSMs always balance a tradeoff between expressivity and speed: as discussed in `\cref{sec:method:motivation}`{=latex}, models with larger hidden state dimension should be more effective but slower. Thus we want to *maximize hidden state dimension without paying speed and memory costs*.

-   Note that the recurrent mode is more flexible than the convolution mode, since the latter `\eqref{eq:ssm:convolution}`{=latex} is derived from expanding the former `\eqref{eq:ssm:recurrence}`{=latex} [@gu2021combining; @gu2022efficiently]. However, using the recurrent mode directly would require computing and materializing the latent state $h$ with shape $\mathtt{(B,L,D,N)}$, which is much larger (by a factor of $N$, the SSM state dimension) than the input $x$ and output $y$ of shape $\mathtt{(B,L,D)}$. Thus the more efficient convolution mode was introduced, which bypasses the state computation and materializes a convolution kernel `\eqref{eq:ssm:convolution:1}`{=latex} of size only $\mathtt{(B,L,D)}$.

-   Prior LTI state space models leverage the dual recurrent-convolutional forms to increase the effective state dimension by a factor of $N$ ($\approx 10$ to $100$), much larger than traditional RNNs, without efficiency penalties.

### Overview of Selective Scan: Hardware-Aware State Expansion

The selection mechanism is designed to overcome the limitations of LTI models; consequently, we need to revisit the computation problem of SSMs. We address this with three classical techniques: kernel fusion, parallel scan, and recomputation. We make two main observations:

-   The naive recurrent computation uses $O(BLDN)$ FLOPs while the convolutional computation uses $O(BLD\log(L))$ FLOPs, and the former has a lower constant factor. Thus for long sequences and not-too-large state dimension $N$, the recurrent mode can actually use fewer FLOPs.

-   The two challenges are the sequential nature of recurrence, and the large memory usage. To address the latter, just like the convolutional mode, we can attempt to not actually materialize the full state $h$.

The main idea is to leverage properties of modern accelerators (GPUs) to materialize the state $h$ only in more efficient levels of the memory hierarchy. In particular, most operations (except matrix multiplication) are bounded by memory bandwidth [@williams2009roofline; @ivanov2021data; @dao2022flashattention]. This includes our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.

Concretely, instead of preparing the scan input $\dAB$ of size $\mathtt{(B,L,D,N)}$ in GPU HBM (high-bandwidth memory), we load the SSM parameters $(\dt, \A, \B, \C)$ directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size $(\mathtt{B,L,D})$ back to HBM.

To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm [@blelloch1990prefix; @martin2018parallelizing; @smith2023s5].
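
The reason a scan applies is that each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and composing affine maps is associative. The sketch below demonstrates this for a scalar recurrence (one entry of the diagonal state) using recursive doubling in the style of Hillis and Steele. Note that this simpler variant performs $O(L \log L)$ work and is not the work-efficient algorithm cited above; it is shown only to make the associativity argument concrete.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_doubling(a, b):
    """Inclusive scan by recursive doubling.

    Runs ceil(log2 L) rounds; within a round every position is updated
    independently of the others, so the round could execute in parallel.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    # With a zero initial state, h_t is the offset of the composed map.
    return [b_t for _, b_t in elems]

a = [0.5, 0.9, 0.1, 1.0, 0.3]   # time-varying transition coefficients
b = [1.0, -2.0, 0.5, 0.0, 4.0]  # time-varying inputs
h_seq = scan_sequential(a, b)
h_par = scan_doubling(a, b)
```

Nothing in `combine` requires the coefficients to be constant over time, so the same operator works for the input-dependent $\dA_t$ and $\dB_t x_t$ of a selective SSM.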

Finally, we must also avoid saving the intermediate states, which are necessary for backpropagation. We carefully apply the classic technique of recomputation to reduce the memory requirements: the intermediate states are not stored but recomputed in the backward pass when the inputs are loaded from HBM to SRAM. As a result, the fused selective scan layer has the same memory requirements as an optimized Transformer implementation with FlashAttention.

Details of the fused kernel and recomputation are in `\cref{sec:hardware_aware_algo}`{=latex}.

The full Selective SSM layer and algorithm are illustrated in `\cref{fig:selection}`{=latex}.

A Simplified SSM Architecture {#sec:method:architecture}
-----------------------------

As with structured SSMs, selective SSMs are standalone sequence transformations that can be flexibly incorporated into neural networks. The H3 architecture is the basis for the most well-known SSM architectures (`\cref{sec:background}`{=latex}), which are generally comprised of a block inspired by linear attention interleaved with an MLP (multi-layer perceptron) block. We simplify this architecture by combining these two components into one, which is stacked homogeneously (`\cref{fig:architecture}`{=latex}). This is inspired by the gated attention unit (GAU) [@hua2022transformer], which did something similar for attention.

This architecture involves expanding the model dimension $D$ by a controllable expansion factor $E$. For each block, most of the parameters ($3ED^2$) are in the linear projections ($2ED^2$ for input projections, $ED^2$ for output projection) while the inner SSM contributes less.

The number of SSM parameters (projections for $\Delta, \B, \C$, and the matrix $\A$) is much smaller in comparison.

We repeat this block, interleaved with standard normalization and residual connections, to form the Mamba architecture. We always fix $E=2$ in our experiments and use two stacks of the block to match the $12D^2$ parameters of a Transformer's interleaved MHA (multi-head attention) and MLP blocks.
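As a quick check of this count, the arithmetic can be written out using the $3ED^2$ figure above. The Transformer split shown here assumes the standard layout of four $D \times D$ attention projections and an MLP with a $4\times$ hidden expansion; biases and normalization parameters are ignored on both sides.

$$
\underbrace{2 \times 3ED^2}_{\text{two Mamba blocks}} \Big|_{E=2} = 12D^2
= \underbrace{4D^2}_{\text{MHA}} + \underbrace{8D^2}_{\text{MLP}} .
$$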

We use the SiLU / Swish activation function [@hendrycks2016gaussian; @ramachandran2017swish], chosen so that the Gated MLP becomes the popular "SwiGLU" variant [@dauphin2017language; @shazeer2020glu; @chowdhery2022palm; @touvron2023llama].

Finally, we additionally use an optional normalization layer (we choose LayerNorm [@ba2016layer]), motivated by RetNet's usage of a normalization layer in a similar location [@sun2023retentive].

```{=latex}
\begin{figure*}[!t]\centering
  \includegraphics[width=0.9\linewidth]{fig/architecture.pdf}
  \caption{
    (\textbf{Architecture}.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogeneously.
    Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function.
    Compared to the MLP block, Mamba adds an SSM to the main branch.
    For $\sigma$ we use the SiLU / Swish activation~\citep{hendrycks2016gaussian,ramachandran2017swish}.
  }
  \label{fig:architecture}
  
\end{figure*}
```
Properties of Selection Mechanisms {#sec:method:properties}
----------------------------------

The selection mechanism is a broader concept that can be applied in different ways, such as to more traditional RNNs or CNNs, to different parameters (e.g. $\A$ in `\cref{alg:s6}`{=latex}), or using different transformations $s(x)$.

### Connection to Gating Mechanisms

We highlight the most important connection: the classical gating mechanism of RNNs is an instance of our selection mechanism for SSMs. We note that the connection between RNN gating and the discretization of continuous-time systems is well established [@funahashi1993approximation; @tallec2018can]. In fact, `\cref{thm:gating}`{=latex} is an improvement of @gu2021combining [Lemma 3.1] generalizing to the ZOH discretization and input-dependent gates (proof in `\cref{sec:mechanics}`{=latex}). More broadly, $\dt$ in SSMs can be seen to play a generalized role of the RNN gating mechanism. In line with prior work, we adopt the view that *discretization of SSMs is the principled foundation of heuristic gating mechanisms*.

```{=latex}
\begin{theorem}
  \label{thm:gating}
  When $N=1, \A=-1, \B=1, s_\dt=\mathsf{Linear}(x)$, and $\tau_\dt=\mathsf{softplus}$,
  then the selective SSM recurrence (\cref{alg:s6}) takes the form
  
    \begin{equation}%
      \label{eq:gates}
      \begin{aligned}
        g_t &= \sigma(\mathsf{Linear}(x_t)) \\
        h_{t} &= (1-g_t) h_{t-1} + g_t x_t
        .
      \end{aligned}
    \end{equation}
  
\end{theorem}
```
As mentioned in `\cref{sec:method:selective}`{=latex}, our specific choices of $s_\dt, \tau_\dt$ are motivated by this connection. In particular, note that if a given input $x_t$ should be completely ignored (as necessary in the synthetic tasks), all $D$ channels should ignore it, and so we project the input down to $1$ dimension before repeating/broadcasting with $\dt$.
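The identity in `\cref{thm:gating}`{=latex} can be checked numerically. The sketch below assumes the scalar zero-order-hold formulas $\dA = \exp(\dt \A)$ and $\overline{B} = (\exp(\dt \A) - 1)\A^{-1}\B$, and uses arbitrary test values `z` as stand-ins for $\mathsf{Linear}(x_t)$; the helper names are hypothetical.

```python
import math


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def zoh_scalar(delta: float, a: float, b: float):
    """Zero-order-hold discretization for a scalar system with a != 0."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


inputs = (-3.0, -0.5, 0.0, 1.2, 4.0)  # stand-ins for Linear(x_t); arbitrary values
max_gap = 0.0
for z in inputs:
    # Theorem setting: A = -1, B = 1, Delta = softplus(Linear(x_t)).
    a_bar, b_bar = zoh_scalar(softplus(z), a=-1.0, b=1.0)
    g = sigmoid(z)
    # The recurrence coefficients should equal (1 - g_t) and g_t respectively.
    max_gap = max(max_gap, abs(a_bar - (1.0 - g)), abs(b_bar - g))
print(max_gap)
```

The gap stays at the level of floating-point rounding, since $\exp(-\mathsf{softplus}(z)) = 1/(1+e^{z}) = 1 - \sigma(z)$.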

### Interpretation of Selection Mechanisms

We elaborate on three particular mechanistic effects of selection.

```{=latex}
\para{Variable Spacing.}
```
Selectivity allows filtering out irrelevant noise tokens that may occur between inputs of interest. This is exemplified by the Selective Copying task, but occurs ubiquitously in common data modalities, particularly for discrete data, for example the presence of language fillers such as "um". This property arises because the model can mechanistically filter out any particular input $x_t$, for example in the gated RNN case (`\cref{thm:gating}`{=latex}) when $g_t \to 0$.

```{=latex}
\para{Filtering Context.}
```
It has been empirically observed that many sequence models do not improve with longer context [@shi2023large], despite the principle that more context should lead to strictly better performance. An explanation is that many sequence models cannot effectively ignore irrelevant context when necessary; an intuitive example is global convolutions (and general LTI models). On the other hand, selective models can simply reset their state at any time to remove extraneous history, and thus their performance in principle improves monotonically with context length (e.g. `\cref{sec:exp:dna:length}`{=latex}).

```{=latex}
\para{Boundary Resetting.}
```
In settings where multiple independent sequences are stitched together, Transformers can keep them separate by instantiating a particular attention mask, while LTI models will bleed information between the sequences. Selective SSMs can also reset their state at boundaries (e.g. $\Delta_t \to \infty$, or `\cref{thm:gating}`{=latex} when $g_t \to 1$). These settings may occur artificially (e.g. packing documents together to improve hardware utilization) or naturally (e.g. episode boundaries in reinforcement learning [@lu2023structured]).
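A minimal sketch of boundary resetting, using the gated recurrence of `\cref{thm:gating}`{=latex}: when the gate equals $1$ at the first token of a new sequence, the states that follow no longer depend on anything before the boundary. The sequences and gate values are arbitrary illustrative numbers.

```python
def gated_recurrence(x, g, h0=0.0):
    """Run h_t = (1 - g_t) * h_{t-1} + g_t * x_t and return all states."""
    h, out = h0, []
    for x_t, g_t in zip(x, g):
        h = (1.0 - g_t) * h + g_t * x_t
        out.append(h)
    return out


seq_a, seq_b = [3.0, -1.0, 2.0], [0.5, 4.0]
gates_a = [0.9, 0.2, 0.6]
gates_b = [1.0, 0.3]  # g = 1 at the first token of seq_b resets the state

# Process the two sequences packed together, then seq_b on its own.
packed = gated_recurrence(seq_a + seq_b, gates_a + gates_b)
alone = gated_recurrence(seq_b, gates_b)
```

The states for the second sequence are identical in both runs, so no information from `seq_a` leaks across the boundary.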

Additionally, we elaborate on effects of each selective parameter.

#### Interpretation of $\dt$.

In general, $\dt$ controls the balance between how much to focus on or ignore the current input $x_t$. It generalizes RNN gates (e.g. $g_t$ in `\cref{thm:gating}`{=latex}): mechanically, a large $\dt$ resets the state $h$ and focuses on the current input $x$, while a small $\dt$ persists the state and ignores the current input. SSMs `\eqref{eq:ssm}`{=latex}-`\eqref{eq:ssm:recurrence}`{=latex} can be interpreted as a continuous system discretized by a timestep $\dt$, and in this context the intuition is that large $\dt \to \infty$ represents the system focusing on the current input for longer (thus "selecting" it and forgetting its current state) while a small $\dt \to 0$ represents a transient input that is ignored.
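These two limits can be written out explicitly in the scalar setting of `\cref{thm:gating}`{=latex} ($\A=-1$, $\B=1$), where the zero-order-hold discretization `\eqref{eq:zoh}`{=latex} gives coefficients $e^{-\Delta_t}$ and $1 - e^{-\Delta_t}$:

$$
h_t = e^{-\Delta_t}\, h_{t-1} + \left(1 - e^{-\Delta_t}\right) x_t
\;\longrightarrow\;
\begin{cases}
h_{t-1} & \text{as } \Delta_t \to 0, \\
x_t & \text{as } \Delta_t \to \infty .
\end{cases}
$$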

#### Interpretation of $\A$.

We remark that while the $\A$ parameter could also be selective, it ultimately affects the model only through its interaction with $\dt$ via $\dA = \exp(\dt \A)$ (the discretization `\eqref{eq:zoh}`{=latex}). Thus selectivity in $\dt$ is enough to ensure selectivity in $\dAB$, and is the main source of improvement. We hypothesize that making $\A$ selective in addition to (or instead of) $\dt$ would have similar performance, and leave it out for simplicity.

#### Interpretation of $\B$ and $\C$.

As discussed in `\cref{sec:method:motivation}`{=latex}, the most important property of selectivity is filtering out irrelevant information so that a sequence model's context can be compressed into an efficient state. In an SSM, modifying $\B$ and $\C$ to be selective allows finer-grained control over whether to let an input $x_t$ into the state $h_t$, or the state into the output $y_t$. These can be interpreted as allowing the model to modulate the recurrent dynamics based on content (input) and context (hidden states) respectively.
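The roles of $\B$ and $\C$ can be illustrated with a sequential reference for a single channel. This is a sketch under stated assumptions (diagonal $\A$ with negative entries, scalar zero-order-hold formulas, plain Python lists); it is not the fused kernel, and the function name is hypothetical.

```python
import math


def selective_ssm_channel(x, delta, A, B, C):
    """Sequential reference for one channel with a diagonal state of size N.

    x, delta: length-L lists of floats (inputs and per-step timesteps).
    A: length-N list of negative floats (diagonal state matrix).
    B, C: length-L lists of length-N lists (input-dependent projections).
    """
    h = [0.0] * len(A)
    y = []
    for x_t, d_t, B_t, C_t in zip(x, delta, B, C):
        for i, a in enumerate(A):
            a_bar = math.exp(d_t * a)           # discretized A (ZOH)
            b_bar = (a_bar - 1.0) / a * B_t[i]  # discretized B (ZOH)
            h[i] = a_bar * h[i] + b_bar * x_t   # B_t controls input -> state
        y.append(sum(c * s for c, s in zip(C_t, h)))  # C_t controls state -> output
    return y


A = [-1.0, -2.0]
delta = [0.5, 0.5, 0.5]
C = [[1.0, 1.0]] * 3
B_skip = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]  # B_t = 0 at the middle step
y_with = selective_ssm_channel([1.0, 5.0, 2.0], delta, A, B_skip, C)
y_without = selective_ssm_channel([1.0, 0.0, 2.0], delta, A, B_skip, C)
```

Because $\B_t = 0$ at the middle step, the value of $x_t$ there never enters the state, and the two output sequences coincide.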

Additional Model Details {#sec:method:details}
------------------------

#### Real vs. Complex.

Most prior SSMs use complex numbers in their state $h$, which is necessary for strong performance on many tasks in perceptual modalities [@gu2022efficiently]. However, it has been empirically observed that completely real-valued SSMs seem to work fine, and possibly even better, in some settings [@ma2023mega]. We use real values as the default, which work well for all but one of our tasks; we hypothesize that the complex-real tradeoff is related to the continuous-discrete spectrum in data modalities, where complex numbers are helpful for continuous modalities (e.g. audio, video) but not discrete (e.g. text, DNA).

#### Initialization.

Most prior SSMs also suggest special initializations, particularly in the complex-valued case, which can help in several settings such as low-data regimes. Our default initialization for the complex case is S4D-Lin and for the real case is S4D-Real [@gu2022parameterization], which is based on the HIPPO theory [@gu2020hippo]. These define the $n$-th element of $\A$ as $-1/2 + n i$ and $-(n+1)$ respectively. However, we expect many initializations to work fine, particularly in the large-data and real-valued SSM regimes; some ablations are considered in `\cref{sec:exp:ablations}`{=latex}.
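The two formulas can be sketched directly. The code assumes a zero-based index $n = 0, \dots, N-1$, which is our reading of the expressions above, and the function names are hypothetical.

```python
def s4d_lin(n_state: int):
    """Complex initialization: n-th element is -1/2 + n*i (zero-based n assumed)."""
    return [complex(-0.5, n) for n in range(n_state)]


def s4d_real(n_state: int):
    """Real initialization: n-th element is -(n + 1) (zero-based n assumed)."""
    return [-(n + 1.0) for n in range(n_state)]


print(s4d_lin(4))
print(s4d_real(4))
```

In both cases every element has a negative real part, so $\exp(\dt \A)$ has magnitude below $1$ for any $\dt > 0$.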

#### Parameterization of $\dt$.

We defined the selective adjustment to $\dt$ as $s_\dt(x) = \mathsf{Broadcast}_D(\mathsf{Linear}_1(x))$, which was motivated by the mechanics of $\dt$ (`\cref{sec:method:properties}`{=latex}). We observe that it can be generalized from dimension $1$ to a larger dimension $\mathtt{R}$. We set this to be a small fraction of $\mathtt{D}$, which uses a negligible number of parameters compared to the main Linear projections in the block. We additionally note that the broadcasting operation can instead be viewed as another Linear projection, initialized to a specific pattern of $1$'s and $0$'s; if this projection is trainable, this leads to the alternative $s_\dt(x) = \mathsf{Linear}_D(\mathsf{Linear}_R(x))$, which can be viewed as a low-rank projection.
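The equivalence between broadcasting and a fixed linear projection can be seen in a small sketch. The dimensions and weights below are arbitrary illustrative values, biases are omitted, and the helper names are hypothetical.

```python
def linear(weight, x):
    """Apply a bias-free linear map; `weight` has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def broadcast(value, dim):
    """Repeat a scalar across `dim` channels."""
    return [value] * dim


def low_rank_params(dim, rank):
    """Weights in Linear_R (D -> R) followed by Linear_D (R -> D)."""
    return dim * rank + rank * dim


D = 6
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
w_down = [[0.1, 0.2, -0.3, 0.4, 0.0, 0.5]]  # Linear_1: D -> 1
w_up = [[1.0] for _ in range(D)]            # Linear_D initialized to all ones: 1 -> D

via_broadcast = broadcast(linear(w_down, x)[0], D)
via_linear = linear(w_up, linear(w_down, x))
```

With a hypothetical $\mathtt{D}=1024$ and $\mathtt{R}=64$, `low_rank_params` gives $131072$ weights, compared with $1024^2 = 1048576$ for a full $\mathtt{D} \times \mathtt{D}$ projection.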

In our experiments, the $\dt$ parameter (which can be viewed as a bias term) is initialized to $\tau_\dt^{-1}(\mathsf{Uniform}([0.001, 0.1]))$, following prior work on SSMs [@gu2023train].
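A minimal sketch of this initialization, assuming $\tau_\dt = \mathsf{softplus}$ as in `\cref{thm:gating}`{=latex}, so that $\tau_\dt^{-1}(y) = \log(e^{y} - 1)$. The function names and the use of Python's `random` module are illustrative choices.

```python
import math
import random


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def inverse_softplus(y: float) -> float:
    """Solve softplus(z) = y for z, valid for y > 0."""
    # log(e^y - 1) = y + log(1 - e^{-y}); expm1 keeps precision for small y.
    return y + math.log(-math.expm1(-y))


def init_delta_bias(dim: int, low: float = 0.001, high: float = 0.1, seed: int = 0):
    """Sample target timesteps uniformly, then map them through the inverse."""
    rng = random.Random(seed)
    return [inverse_softplus(rng.uniform(low, high)) for _ in range(dim)]


bias = init_delta_bias(16)
timesteps = [softplus(b) for b in bias]  # values recovered after applying tau
```

Applying the softplus to the sampled biases recovers timesteps in $[0.001, 0.1]$, which is the intended range at initialization.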

```{=latex}
\begin{remark}
  For brevity in our experimental results, we sometimes abbreviate selective SSMs as \emph{S6 models}, because they are S4 models with a \emph{selection} mechanism and computed with a \emph{scan}.
\end{remark}
```
Empirical Evaluation {#sec:exps}
====================

In `\cref{sec:exp:synthetic}`{=latex} we test Mamba's ability to solve the two synthetic tasks motivated in `\cref{sec:method:motivation}`{=latex}. We then evaluate on three domains, each on autoregressive pretraining as well as downstream tasks.

-   `\cref{sec:exp:language}`{=latex}: language model pretraining (scaling laws), and zero-shot downstream evaluation.

-   `\cref{sec:exp:genomics}`{=latex}: DNA sequence pretraining, and fine-tuning on a long-sequence classification task.

-   `\cref{sec:exp:audio}`{=latex}: audio waveform pretraining, and the quality of autoregressively generated speech clips.

Finally, `\cref{sec:exp:benchmark}`{=latex} shows Mamba's computational efficiency at both training and inference time, and `\cref{sec:exp:ablations}`{=latex} ablates various components of the architecture and selective SSMs.

Synthetic Tasks {#sec:exp:synthetic}
---------------

Full experiment details for these tasks including task details and training protocol are in `\cref{sec:exp-details:synthetics}`{=latex}.

### Selective Copying

The Copying task is one of the most well-studied synthetic tasks for sequence modeling, originally designed to test the memorization abilities of recurrent models. As discussed in `\cref{sec:method:motivation}`{=latex}, LTI SSMs (linear recurrences and global convolutions) can easily solve this task by only keeping track of time instead of reasoning about the data; for example, by constructing a convolution kernel of exactly the right length (`\cref{fig:copying}`{=latex}). This was explicitly validated in earlier work on global convolutions [@romero2021ckconv]. The Selective Copying task prevents this shortcut by randomizing the spacing between tokens. Note that this task has been introduced before as the Denoising task [@jing2019gated].
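For intuition, one way to generate a Selective Copying example is sketched below. The sequence length, number of data tokens, vocabulary size, and the choice of token `0` as the noise token are all illustrative assumptions; the exact task configuration is in `\cref{sec:exp-details:synthetics}`{=latex}.

```python
import random


def selective_copying_example(seq_len, n_data, vocab, noise_token=0, seed=None):
    """Place `n_data` random tokens at random positions among noise tokens."""
    rng = random.Random(seed)
    # Random, sorted positions: the spacing between data tokens varies per example.
    positions = sorted(rng.sample(range(seq_len), n_data))
    data = [rng.randrange(1, vocab) for _ in range(n_data)]
    inputs = [noise_token] * seq_len
    for pos, tok in zip(positions, data):
        inputs[pos] = tok
    # The target is the data tokens in their original order, with noise removed.
    return inputs, data


inputs, target = selective_copying_example(seq_len=32, n_data=4, vocab=8, seed=0)
```

Because the positions change from example to example, a fixed convolution kernel of a particular length cannot solve the task by tracking time alone.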

Note that many previous works argue that adding architecture gating (multiplicative interactions) can endow models with "data-dependence" and solve related tasks [@dao2023hungry; @poli2023hyena]. However, we find this explanation insufficient intuitively because such gating does not interact along the sequence axis, and cannot affect the spacing between tokens. In particular, architecture gating is not an instance of a selection mechanism (`\cref{sec:discussion:selection}`{=latex}).

`\cref{tab:copying}`{=latex} confirms that gated architectures such as H3 and Mamba only partially improve performance, while the selection mechanism (modifying S4 to S6) easily solves this task, particularly when combined with these more powerful architectures.

### Induction Heads

Induction heads [@olsson2022context] is a simple task from the mechanistic interpretability lens [@elhage2021mathematical] that is surprisingly predictive of the in-context learning ability of LLMs. It requires models to perform associative recall and copy: for example, if the model has seen a bigram such as "Harry Potter" in the sequence, then the next time "Harry" appears in the same sequence, the model should be able to predict "Potter" by copying from history.

#### Dataset.

We train a 2-layer model on the induction heads task at sequence length $256$, with a vocab size of $16$, which is comparable to prior work on this task [@dao2023hungry] but with longer sequences. We additionally investigate generalization and extrapolation abilities by evaluating on a range of sequence lengths from $2^6 = 64$ up to $2^{20} = 1048576$ at test time.
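A sketch of one possible example format for this task follows. It assumes a single special token (given the id `vocab`, outside the regular vocabulary) that appears once in the sequence and again at the end; this construction is an illustrative assumption, not a specification of our dataset.

```python
import random


def induction_example(seq_len, vocab, seed=None):
    """Random tokens with a special token planted twice.

    The target is the token that followed the first occurrence of the special token.
    """
    rng = random.Random(seed)
    special = vocab  # extra token id outside the regular vocabulary
    tokens = [rng.randrange(vocab) for _ in range(seq_len)]
    # Leave room so the token after the first occurrence is a regular token.
    first = rng.randrange(0, seq_len - 2)
    tokens[first] = special
    tokens[-1] = special
    target = tokens[first + 1]
    return tokens, target


tokens, target = induction_example(seq_len=256, vocab=16, seed=0)
```

Evaluating at a longer length only requires changing `seq_len`, which is how extrapolation beyond the training length can be probed.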

#### Models.

Following established work on induction heads, we use 2-layer models, which allows attention to mechanistically solve the induction heads task [@olsson2022context]. We test both multi-head attention (8 heads, with various positional encodings) and SSM variants. We use a model dimension $D$ of $64$ for Mamba and $128$ for the other models.

#### Results.

`\cref{fig:induction}`{=latex} shows that Mamba---or more precisely, its selective SSM layer---has the ability to solve the task perfectly because of its ability to selectively remember the relevant token while ignoring everything else in between. **It generalizes perfectly to million-length sequences, or $4000\times$ longer than it saw during training**, while no other method goes beyond $2\times$.

Out of positional encoding variants for attention models, xPos (which was designed for length extrapolation) is slightly better than the others; also note that all attention models were only tested up to sequence length $2^{14}=16384$ due to memory limitations. Out of other SSMs, H3 and Hyena are similar, contrary to the findings in @poli2023hyena.

```{=latex}
\begin{figure*}
  \begin{minipage}{0.33\linewidth}
    \small
    \centering
    \begin{tabular}{@{}llll@{}}
      \toprule
      \textsc{Model}          & \textsc{Arch.}        & \textsc{Layer}      & \textsc{Acc.} \\
      \midrule
      S4                       & No gate               & S4                  & 18.3 \\
      -                       & No gate               & S6                  & \textbf{97.0} \\
      \midrule
      H3                      & H3                    & S4                  & 57.0 \\
      Hyena                   & H3                    & Hyena               & 30.1 \\
      -                       & H3                    & S6                  & \textbf{99.7} \\
      \midrule
      -                       & Mamba                 & S4                  & 56.4 \\
      -                       & Mamba                 & Hyena               & 28.4 \\
      Mamba                   & Mamba                 & S6                  & \textbf{99.8} \\
      \bottomrule
    \end{tabular}
    
      \captionsetup{type=table,skip=12pt}
    
    \caption{
      (\textbf{Selective Copying}.) \\
      Accuracy for combinations of architectures and inner sequence layers.
    }
    \label{tab:copying}
  \end{minipage}
  \hfill
  \begin{minipage}{0.55\linewidth}
    \centering
    \includegraphics[width=\linewidth]{fig/induction.pdf}
    
      \captionsetup{type=table,skip=-6pt}
    
    \caption{
      (\textbf{Induction Heads}.)
      Models are trained on sequence length $2^8=256$, and tested on increasing sequence lengths of $2^6=64$ up to $2^{20}=1048576$.
      Full numbers in \cref{tab:induction}.
    }
    \label{fig:induction}
  \end{minipage}
\end{figure*}
```
Language Modeling {#sec:exp:language}
-----------------

We evaluate the Mamba architecture on standard autoregressive language modeling against other architectures, on both pretraining metrics (perplexity) and zero-shot evaluations. We set the model sizes (depth and width) to mirror GPT3 specifications. We use the Pile dataset [@pile], and follow the training recipe described in @brown2020language. All training details are in `\cref{sec:exp-details:lm}`{=latex}.

### Scaling Laws

For baselines, we compare against the standard Transformer architecture (GPT3 architecture), as well as the strongest Transformer recipe we know of (here referred to as Transformer++), based on the PaLM and LLaMa architectures (e.g. rotary embedding, SwiGLU MLP, RMSNorm instead of LayerNorm, no linear bias, and higher learning rates). We also compare against other recent subquadratic architectures (`\cref{fig:lm-scaling}`{=latex}). All model details are in `\cref{sec:exp-details:lm}`{=latex}.

`\cref{fig:lm-scaling}`{=latex} shows scaling laws under the standard Chinchilla [@hoffmann2022empirical] protocol, on models from $\approx 125M$ to $\approx 1.3B$ parameters. **Mamba is the first attention-free model to match the performance of a very strong Transformer recipe (Transformer++) that has now become standard, particularly as the sequence length grows.** (We note that full results on context length 8k are missing for the RWKV and RetNet baselines, prior strong recurrent models that can also be interpreted as SSMs, because of a lack of efficient implementations leading to out-of-memory or unrealistic computation requirements.)

```{=latex}
\begin{figure*}[!t]\centering
  \begin{subfigure}[t]{0.49\linewidth}
    \centering
    \includegraphics[width=\textwidth]{fig/pile_2k.pdf}
  \end{subfigure}
  \begin{subfigure}[t]{0.49\linewidth}
    \centering
    \includegraphics[width=\textwidth]{fig/pile_8k.pdf}
  \end{subfigure}
  \caption{
    (\textbf{Scaling Laws}.) %
    Models of size $\approx 125M$ to $\approx 1.3B$ parameters, trained on the Pile.
    Mamba scales better than all other attention-free models and is the first to match the performance of a very strong ``Transformer++'' recipe that has now become standard,
    particularly as the sequence length grows.
  }
  \label{fig:lm-scaling}
  
\end{figure*}
```
### Downstream Evaluations

`\cref{table:downstream_zeroshot}`{=latex} shows the performance of Mamba on a range of popular downstream zero-shot evaluation tasks. We compare against the most well-known open source models at these sizes, most importantly Pythia [@biderman2023pythia] and RWKV [@peng2023rwkv] which were trained with the same tokenizer, dataset, and training length (300B tokens) as our models. (Note that Mamba and Pythia are trained with context length 2048, while RWKV was trained with context length 1024.)

```{=latex}
\begin{table*}[!th]\small
  \centering
  \captionsetup{font=small}
  \caption{
    (\textbf{Zero-shot Evaluations}.) Best results for each size in bold.
    We compare against open source LMs with various tokenizers, trained for up to 300B tokens.
    Pile refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B).
    For each model size, Mamba is best-in-class on every single evaluation result,
    and generally matches baselines at twice the model size.
    %
  }
  \resizebox{0.99\linewidth}{!}
  {
    \begin{tabular}{@{}lllllllllll@{}}
      \toprule
      \sc{Model}                                        & \sc{Token.} & \sc{Pile}             & \sc{LAMBADA}          & \sc{LAMBADA}         & \sc{HellaSwag}       & \sc{PIQA}            & \sc{Arc-E}           & \sc{Arc-C}           & \sc{WinoGrande}      & \sc{Average} \\
                                                        &             & \sc{ppl $\downarrow$} & \sc{ppl $\downarrow$} & \sc{acc $\uparrow$}  & \sc{acc $\uparrow$}  & \sc{acc $\uparrow$}  & \sc{acc $\uparrow$}  & \sc{acc $\uparrow$}  & \sc{acc $\uparrow$}  & \sc{acc $\uparrow$} \\
                                        \midrule
      
      Hybrid H3-130M                                           & GPT2        & ---                   & 89.48                 & 25.77                & 31.7                 & 64.2                 & 44.4                 & 24.2                 & 50.6                 & 40.1 \\
      Pythia-160M                                       & NeoX        & 29.64                 & 38.10                 & 33.0                 & 30.2                 & 61.4                 & 43.2                 & 24.1                 & \textbf{51.9}        & 40.6 \\
      \textbf{Mamba-130M}                               & NeoX        & \textbf{10.56}        & \textbf{16.07}        & \textbf{44.3}        & \textbf{35.3}        & \textbf{64.5}        & \textbf{48.0}        & \textbf{24.3}        & \textbf{51.9}        & \textbf{44.7} \\
      \midrule
      
      Hybrid H3-360M                                           & GPT2        & ---                   & 12.58                 & 48.0                 & 41.5                 & 68.1                 & 51.4                 & 24.7                 & 54.1                 & 48.0 \\
      Pythia-410M                                       & NeoX        & 9.95                  & 10.84                 & 51.4                 & 40.6                 & 66.9                 & 52.1                 & 24.6                 & 53.8                 & 48.2 \\
      \textbf{Mamba-370M}                               & NeoX        & \textbf{8.28}         & \textbf{8.14}         & \textbf{55.6}        & \textbf{46.5}        & \textbf{69.5}        & \textbf{55.1}        & \textbf{28.0}        & \textbf{55.3}        & \textbf{50.0} \\
      \midrule
      Pythia-1B                                         & NeoX        & 7.82                  & 7.92                  & 56.1                 & 47.2                 & 70.7                 & 57.0                 & 27.1                 & 53.5                 & 51.9 \\
      \textbf{Mamba-790M}                               & NeoX        & \textbf{7.33}         & \textbf{6.02}         & \textbf{62.7}        & \textbf{55.1}        & \textbf{72.1}        & \textbf{61.2}        & \textbf{29.5}        & \textbf{56.1}        & \textbf{57.1} \\
      \midrule
      GPT-Neo 1.3B                                      & GPT2        & ---                   & 7.50                  & 57.2                 & 48.9                 & 71.1                 & 56.2                 & 25.9                 & 54.9                 & 52.4 \\
      Hybrid H3-1.3B                                           & GPT2        & ---                   & 11.25                 & 49.6                 & 52.6                 & 71.3                 & 59.2                 & 28.1                 & 56.9                 & 53.0 \\
      OPT-1.3B                                          & OPT         & ---                   & 6.64                  & 58.0                 & 53.7                 & 72.4                 & 56.7                 & 29.6                 & 59.5                 & 55.0 \\
      Pythia-1.4B                                       & NeoX        & 7.51                  & 6.08                  & 61.7                 & 52.1                 & 71.0                 & 60.5                 & 28.5                 & 57.2                 & 55.2 \\
      RWKV-1.5B                                         & NeoX        & 7.70                  & 7.04                  & 56.4                 & 52.5                 & 72.4                 & 60.5                 & 29.4                 & 54.6                 & 54.3 \\
      \textbf{Mamba-1.4B}                               & NeoX        & \textbf{6.80}         & \textbf{5.04}         & \textbf{64.9}        & \textbf{59.1}        & \textbf{74.2}        & \textbf{65.5}        & \textbf{32.8}        & \textbf{61.5}        & \textbf{59.7} \\
      \midrule
      GPT-Neo 2.7B                                      & GPT2        & ---                   & 5.63                  & 62.2                 & 55.8                 & 72.1                 & 61.1                 & 30.2                 & 57.6                 & 56.5 \\
      Hybrid H3-2.7B                                           & GPT2        & ---                   & 7.92                  & 55.7                 & 59.7                 & 73.3                 & 65.6                 & 32.3                 & 61.4                 & 58.0 \\
      OPT-2.7B                                          & OPT         & ---                   & 5.12                  & 63.6                 & 60.6                 & 74.8                 & 60.8                 & 31.3                 & 61.0                 & 58.7 \\
      Pythia-2.8B                                       & NeoX        & 6.73                  & 5.04                  & 64.7                 & 59.3                 & 74.0                 & 64.1                 & 32.9                 & 59.7                 & 59.1 \\
      RWKV-3B                                           & NeoX        & 7.00                  & 5.24                  & 63.9                 & 59.6                 & 73.7                 & 67.8                 & 33.1                 & 59.6                 & 59.6 \\
      \textbf{Mamba-2.8B}                               & NeoX        & \textbf{6.22}         & \textbf{4.23}         & \textbf{69.2}        & \textbf{66.1}        & \textbf{75.2}        & \textbf{69.7}        & \textbf{36.3}        & \textbf{63.5}        & \textbf{63.3} \\
      \midrule
      GPT-J-6B                                          & GPT2        & ---                   & 4.10                  & 68.3                 & 66.3                 & 75.4                 & 67.0                 & 36.6                 & 64.1                 & 63.0 \\
      OPT-6.7B                                          & OPT         & ---                   & 4.25                  & 67.7                 & 67.2                 & 76.3                 & 65.6                 & 34.9                 & 65.5                 & 62.9 \\
      Pythia-6.9B                                       & NeoX        & 6.51                  & 4.45                  & 67.1                 & 64.0                 & 75.2                 & 67.3                 & 35.5                 & 61.3                 & 61.7 \\
      RWKV-7.4B                                         & NeoX        & 6.31                  & 4.38                  & 67.2                 & 65.5                 & 76.1                 & 67.8                 & 37.5                 & 61.0                 & 62.5 \\
      \bottomrule
    \end{tabular}
  }
  \label{table:downstream_zeroshot}
\end{table*}
```
DNA Modeling {#sec:exp:genomics}
------------

Motivated by the success of large language models, there has been recent exploration into using the foundation model paradigm for genomics. DNA has been likened to language in that it consists of sequences of discrete tokens with a finite vocabulary. It is also known for requiring long-range dependencies to model [@avsec2021effective]. We investigate Mamba as an FM backbone for pretraining and fine-tuning in the same setting as recent works on long-sequence models for DNA [@nguyen2023hyenadna]. In particular, we focus on two explorations of scaling laws across model size and sequence length (`\cref{fig:dna}`{=latex}), and a difficult downstream synthetic classification task requiring long context (`\cref{fig:species}`{=latex}).

For pretraining, we largely follow a standard causal language modeling (next token prediction) setup for the training and model details (see also `\cref{sec:exp-details:lm}`{=latex}). For the dataset, we largely follow the setup of HyenaDNA [@nguyen2023hyenadna], which uses the HG38 dataset for pretraining consisting of a single human genome with about 4.5 billion tokens (DNA base pairs) in the training split.

### Scaling: Model Size

In this experiment, we investigate the scaling properties of genomics foundation models with various model backbones (`\cref{fig:dna}`{=latex} *Left*).

```{=latex}
\para{Training.}
```
To advantage the baselines, we train on a short sequence length of $1024$; as shown in `\cref{sec:exp:dna:length}`{=latex}, we expect results to favor Mamba even more at longer sequence lengths. We fix a global batch size of $1024$, for a total of $2^{20} \approx 1M$ tokens per batch. Models were trained for $10K$ gradient steps for a total of $10B$ tokens.
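The token budget follows directly from these settings:

$$
\underbrace{1024}_{\text{batch size}} \times \underbrace{1024}_{\text{sequence length}} = 2^{20} \text{ tokens per step},
\qquad
10^{4} \times 2^{20} \approx 1.05 \times 10^{10} \approx 10B \text{ tokens}.
$$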

```{=latex}
\para{Results.}
```
`\cref{fig:dna}`{=latex} (*Left*) shows that Mamba's pretraining perplexity improves smoothly with model size, and that Mamba scales better than both HyenaDNA and Transformer++. For example, at the largest model size of $\approx 40M$ parameters, the curve shows that **Mamba can match the Transformer++ and HyenaDNA models with roughly $3\times$ to $4\times$ fewer parameters**.

### Scaling: Context Length {#sec:exp:dna:length}

In the next DNA experiment, we investigate the scaling properties of models with respect to sequence length. We only compare the HyenaDNA and Mamba models, as quadratic attention becomes prohibitively expensive at longer sequence lengths. We pretrain models on sequence lengths $2^{10}=1024$, $2^{12}=4096$, $2^{14}=16384$, $2^{16}=65536$, $2^{18}=262144$, $2^{20}=1048576$. We fix a model size of 6 layers by width $128$ (about 1.3M-1.4M parameters). Models were trained for $20K$ gradient steps for a total of $\approx 330B$ tokens. The longer sequence lengths used sequence length warmup similar to @nguyen2023hyenadna.

```{=latex}
\para{Results.}
```
`\cref{fig:dna}`{=latex} (*Right*) shows that **Mamba is able to make use of longer context even up to extremely long sequences of length 1M**, and its pretraining perplexity improves as the context increases. On the other hand, the HyenaDNA model gets worse with sequence length. This is intuitive from the discussion in `\cref{sec:method:properties}`{=latex} on properties of the selection mechanism. In particular, LTI models cannot selectively ignore information; from a convolutional perspective, a very long convolution kernel is aggregating all information across a long sequence which may be very noisy. Note that while HyenaDNA claims to improve with longer context, their results do not control for computation time.

### Synthetic Species Classification

We evaluate models on a downstream task of classifying between 5 different species by randomly sampling a contiguous segment of their DNA. This task is adapted from HyenaDNA, which used the species $\{ \texttt{human}, \texttt{lemur}, \texttt{mouse}, \texttt{pig}, \texttt{hippo} \}$. We modify the task to be significantly more challenging by classifying between the five *great apes* species\
$\{ \texttt{human}, \texttt{chimpanzee}, \texttt{gorilla}, \texttt{orangutan}, \texttt{bonobo} \}$, which are known to share 99% of their DNA.

```{=latex}
\begin{figure*}[!t]\centering
\begin{subfigure}{.49\linewidth}%
  \centering
  \includegraphics[width=\linewidth]{fig/dna_scaling.pdf}
\end{subfigure}
\begin{subfigure}{.49\linewidth}%
  \centering
  \includegraphics[width=\linewidth]{fig/dna_length.pdf}
\end{subfigure}
\caption{
  (\textbf{DNA Scaling Laws}.) Pretraining on the HG38 (human genome) dataset.
  (\textit{Left}) Fixing short context length $2^{10}=1024$ and increasing size from $\approx200K$ to $\approx 40M$ parameters, Mamba scales better than baselines.
  (\textit{Right}) Fixing model size and increasing sequence lengths while keeping tokens/batch and total training tokens fixed.
  Unlike baselines, the selection mechanism of Mamba facilitates better performance with increasing context length.
}
\label{fig:dna}
\end{figure*}
```
```{=latex}
\begin{figure*}[!ht]\begin{minipage}[t]{.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{fig/species.pdf}
    \captionsetup{type=figure}
    \caption{
      (\textbf{Great Apes DNA Classification}.)
      Accuracy after fine-tuning on sequences of length $2^{10}=1024$ up to $2^{20}=1048576$ using pretrained models of the same context length.
      Numerical results in \cref{tab:species}.
    }
    \label{fig:species}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{.49\linewidth}
    \centering
    \includegraphics[width=\linewidth]{fig/youtubemix.pdf}
    \captionsetup{type=figure}
    \caption{
      (\textbf{Audio Pretraining}.) Mamba improves performance over prior state-of-the-art (SaShiMi) in autoregressive audio modeling, while improving up to minute-long context or million-length sequences (controlling for computation).
    }
    \label{fig:youtubemix}
  \end{minipage}
\end{figure*}
```
Audio Modeling and Generation {#sec:exp:audio}
-----------------------------

For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols [@goel2022raw].

This model comprises:

1.  a U-Net backbone with two stages of pooling by a factor $p$ that doubles the model dimension $D$ per stage,

2.  alternating S4 and MLP blocks in each stage.

We consider replacing the S4+MLP blocks with Mamba blocks.

Experiment details are in `\cref{sec:exp-details:audio}`{=latex}.

### Long-Context Autoregressive Pretraining

We evaluate pretraining quality (autoregressive next-sample prediction) on YouTubeMix [@deepsound], a standard piano music dataset used by prior work consisting of $4$ hours of solo piano music, sampled at a rate of 16000 Hz. Pretraining details largely follow the standard language modeling setup (`\cref{sec:exp:language}`{=latex}). `\cref{fig:youtubemix}`{=latex} evaluates the effect of increasing training sequence lengths from $2^{13}=8192$ to $2^{20}\approx 10^6$, while keeping computation fixed.

(There are some slight edge cases to the way the data is curated, which may lead to kinks in the scaling curves. For example, only minute-long clips were available so the maximum sequence length is actually bounded by $60s \cdot 16000Hz = 960000$.)

**Both Mamba and the SaShiMi (S4+MLP) baseline improve consistently with longer context lengths; Mamba is better throughout, and the gap widens at longer lengths.** The main metric is bits per byte (BPB), which is a constant factor $\log(2)$ of the standard negative log-likelihood (NLL) loss for pretraining other modalities.
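
As a small illustration of the metric and of the clip-length bound noted above, the following sketch converts a loss from nats to bits and checks the maximum clip length. It assumes the NLL is reported in nats and that each audio sample is quantized to one byte, so that bits per sample coincide with bits per byte; neither assumption is stated in this section, and the loss value used is hypothetical.

```python
import math

SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the text
CLIP_SECONDS = 60         # only minute-long clips are available


def nats_to_bits(nll_nats: float) -> float:
    """Convert a negative log-likelihood from nats to bits (divide by ln 2)."""
    return nll_nats / math.log(2)


# Maximum usable sequence length for one clip, versus the nominal 2^20.
max_clip_samples = CLIP_SECONDS * SAMPLE_RATE_HZ
nominal_max_length = 2 ** 20

# Hypothetical loss value, for illustration only (not a reported result).
example_nll_nats = 1.0
example_bpb = nats_to_bits(example_nll_nats)

print(max_clip_samples, nominal_max_length)  # 960000 1048576
print(f"{example_bpb:.4f}")                  # 1.4427
```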

We note one important detail: this is the only experiment in this paper in which we switched from the real parameterization to complex (`\cref{sec:method:details}`{=latex}). We show additional ablations in `\cref{sec:exp-details:audio}`{=latex}.

### Autoregressive Speech Generation

SC09 is a benchmark speech generation dataset [@Warden2018SpeechCA; @donahue2019adversarial], consisting of $1$-second clips sampled at 16000 Hz of the digits "zero" through "nine" with highly variable characteristics. We largely follow the autoregressive training setup and generation protocol of @goel2022raw.

`\cref{tab:sc09}`{=latex} shows automated metrics of the Mamba-UNet model compared to a variety of baselines from @goel2022raw: WaveNet [@oord2016wavenet], SampleRNN [@mehri2017samplernn], WaveGAN [@donahue2019adversarial], DiffWave [@kong2021diffwave], and SaShiMi. **A small Mamba model outperforms the state-of-the-art (and much larger) GAN- and diffusion-based models.** A larger model parameter-matched to the baselines further improves on fidelity metrics dramatically.

`\cref{tab:sc09-ablations}`{=latex} takes the small Mamba model and investigates combinations of different architectures for the outer stages and center stage. It shows that Mamba is consistently better than S4+MLP in the outer blocks, and Mamba $>$ S4+MLP $>$ MHA+MLP in the center blocks.

```{=latex}
\begin{figure*}[!ht]\begin{minipage}[t]{.50\linewidth}
    \centering
    \captionsetup{type=table}
    \caption{
      (\textbf{SC09}) Automated metrics for unconditional generation on a challenging dataset of fixed-length speech clips.
      (\emph{Top to Bottom}) Autoregressive baselines, non-autoregressive baselines, Mamba, and dataset metrics.
    }
    \scriptsize
    \begin{tabular}{@{}lllllll@{}}
      \toprule
      \textsc{Model} & \textsc{Params} & \textsc{NLL $\downarrow$} & \textsc{FID $\downarrow$} & \textsc{IS $\uparrow$} & \textsc{mIS $\uparrow$} & \textsc{AM $\downarrow$}      \\
      \midrule
      SampleRNN      & 35.0M           & 2.042                     & 8.96                      & 1.71                   & 3.02                    & 1.76                                \\
      WaveNet        & 4.2M            & 1.925                     & 5.08                      & 2.27                   & 5.80                    & 1.47                                \\
      SaShiMi        & 5.8M            & 1.873                     & 1.99                      & 5.13                   & 42.57                   & 0.74                           \\
      \midrule
      WaveGAN        & 19.1M           & -                         & 2.03                      & 4.90                   & 36.10                   & 0.80                                \\
      DiffWave       & 24.1M           & -                         & 1.92                      & 5.26                   & 51.21                   & 0.68                                \\
      \;\; + SaShiMi & 23.0M           & -                         & 1.42                      & 5.94                   & 69.17                   & 0.59                           \\
      \midrule
      \textbf{Mamba} & 6.1M            & \textbf{1.852}            & \underline{0.94}          & \underline{6.26}       & \underline{88.54}       & \underline{0.52}                           \\
      \textbf{Mamba} & 24.3M           & \underline{1.860}         & \textbf{0.67}             & \textbf{7.33}          & \textbf{144.9}          & \textbf{0.36}                           \\
      \midrule
      Train          & -               & -                         & $0.00$                    & $8.56$                 & $292.5$                 & $0.16$                                \\
      Test           & -               & -                         & $0.02$                    & $8.33$                 & $257.6$                 & $0.19$                                \\
      \bottomrule
    \end{tabular}
    \label{tab:sc09}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{.49\linewidth}
    \centering
    \captionsetup{type=table}
    \caption{
      (\textbf{SC09 Model Ablations}) Models with 6M parameters. In SaShiMi's U-Net backbone, there are 8 center blocks operating on sequence length $1000$, sandwiched on each side by 8 outer blocks on sequence length $4000$, sandwiched by 8 outer blocks on sequence length $16000$ (40 blocks total). The architecture of the 8 center blocks is ablated independently of the rest. Note that Transformers (MHA+MLP) were not tested in the more important outer blocks because of efficiency constraints.
    }
    \scriptsize
    \begin{tabular}{@{}lllllll@{}}
      \toprule
      \textsc{Outer} & \textsc{Center} & \textsc{NLL $\downarrow$} & \textsc{FID $\downarrow$} & \textsc{IS $\uparrow$} & \textsc{mIS $\uparrow$} & \textsc{AM $\downarrow$}      \\
      \midrule
      S4+MLP         & MHA+MLP         & 1.859                     & 1.45                      & 5.06                   & 47.03                   & 0.70 \\
      S4+MLP         & S4+MLP          & 1.867                     & 1.43                      & 5.42                   & 53.54                   & 0.65 \\
      S4+MLP         & Mamba           & 1.859                     & 1.42                      & 5.71                   & 56.51                   & 0.64 \\
      Mamba          & MHA+MLP         & \textbf{1.850}            & 1.37                      & 5.63                   & 58.23                   & 0.62 \\
      Mamba          & S4+MLP          & 1.853                     & \underline{1.07}          & \underline{6.05}       & \underline{73.34}       & \underline{0.55} \\
      Mamba          & Mamba           & \underline{1.852}         & \textbf{0.94}             & \textbf{6.26}          & \textbf{88.54}          & \textbf{0.52} \\
      \bottomrule
    \end{tabular}
    \label{tab:sc09-ablations}
  \end{minipage}
\end{figure*}
```
Speed and Memory Benchmarks {#sec:exp:benchmark}
---------------------------

We benchmark the speed of the SSM scan operation (state expansion $N=16$), as well as the end-to-end inference throughput of Mamba, in `\cref{fig:scan_benchmark}`{=latex}. Our efficient SSM scan is faster than the best attention implementation that we know of (FlashAttention-2 [@dao2023flashattention2]) beyond sequence length 2K, and up to 20-40$\times$ faster than a standard scan implementation in PyTorch. Mamba achieves 4-5$\times$ higher inference throughput than a Transformer of similar size, since without the KV cache it can use much higher batch sizes. For example, a Mamba-6.9B (untrained) would have higher inference throughput than a $5\times$ smaller Transformer-1.3B. Details in `\cref{sec:exp-details:benchmark}`{=latex}, which additionally includes a benchmark of memory consumption.
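
For reference, the recurrence that a scan evaluates can be written as a plain sequential loop. The sketch below is a minimal pure-Python illustration for a single channel with a diagonal state of size $N$; the function name `sequential_scan` and the argument layout are assumptions made for illustration, and this is neither the fused hardware-aware kernel nor the PyTorch baseline benchmarked here.

```python
from typing import List, Sequence


def sequential_scan(
    a_bar: Sequence[Sequence[float]],  # (L, N) discretized diagonal transition per step
    b_bar: Sequence[Sequence[float]],  # (L, N) discretized input projection per step
    c: Sequence[Sequence[float]],      # (L, N) output projection per step
    x: Sequence[float],                # (L,)   scalar input per step
) -> List[float]:
    """Evaluate h_t = a_bar_t * h_{t-1} + b_bar_t * x_t and y_t = <c_t, h_t>."""
    n_state = len(a_bar[0])
    h = [0.0] * n_state  # the only state carried across time steps
    ys = []
    for a_t, b_t, c_t, x_t in zip(a_bar, b_bar, c, x):
        # Elementwise update, because the transition is diagonal.
        h = [a * h_n + b * x_t for a, b, h_n in zip(a_t, b_t, h)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c_t, h)))
    return ys


# An impulse input decays geometrically when a_bar = 0.5 at every step.
demo = sequential_scan(
    a_bar=[[0.5, 0.5]] * 3,
    b_bar=[[1.0, 1.0]] * 3,
    c=[[1.0, 1.0]] * 3,
    x=[1.0, 0.0, 0.0],
)
print(demo)  # [2.0, 1.0, 0.5]
```

The loop carries only the length-$N$ state `h` between steps, so its memory does not grow with the sequence length, in contrast to a KV cache.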

```{=latex}
\begin{figure*}[!th]\centering
  \begin{subfigure}{.5\textwidth}
    \centering
    \includegraphics[width=.95\linewidth]{fig/ssm_scan.pdf}
  \end{subfigure}%
  \begin{subfigure}{.5\textwidth}
    \centering
    \includegraphics[width=.95\linewidth]{fig/mamba_inference.pdf}
  \end{subfigure}
  \caption{
    (\textbf{Efficiency Benchmarks}.)
    (\emph{Left}) Training: our efficient scan is $40\times$ faster than a standard implementation.
    (\emph{Right}) Inference: as a recurrent model, Mamba can achieve $5\times$ higher throughput than Transformers.
  }
  \label{fig:scan_benchmark}
\end{figure*}
```
Model Ablations {#sec:exp:ablations}
---------------

We perform a series of detailed ablations on components of our model, focusing on the setting of language modeling with size $\approx 350$M models at Chinchilla token counts (same setting as `\cref{fig:lm-scaling}`{=latex}).

### Architecture

`\cref{tab:ablations-arch}`{=latex} investigates the effects of the architecture (block) and its inner SSM layer (`\cref{fig:architecture}`{=latex}). We find that

-   Among previous non-selective (LTI) SSMs, which are equivalent to global convolutions, performance is very similar.

-   Replacing the complex-valued S4 variant from previous work with a real-valued one does not affect performance much, suggesting that (at least for LM) real-valued SSMs may be a better choice when accounting for hardware efficiency.

-   Replacing any of these with a selective SSM (S6) significantly improves performance, validating the motivation of `\cref{sec:method}`{=latex}.

-   The Mamba architecture performs similarly to the H3 architecture (and seems slightly better when using a selective layer).

We also investigate interleaving the Mamba block with other blocks such as MLP (a traditional architecture) and MHA (a hybrid attention architecture) in `\cref{sec:exp-details:lm:scaling-ablations}`{=latex}.

### Selective SSM

`\cref{tab:ablations-variable}`{=latex} ablates the selective SSM layer by considering different combinations of selective $\dt$, $\B$, and $\C$ parameters (`\cref{alg:s6}`{=latex}), showing that $\dt$ is the most important parameter due to its connection to RNN gating (`\cref{thm:gating}`{=latex}).
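
The connection between $\dt$ and RNN gating can be checked numerically. The sketch below assumes a particular special case, namely $N=1$, $\A=-1$, $\B=1$, and $\dt = \mathrm{softplus}(z)$ with zero-order-hold discretization; this is an assumed reading of the setting of `\cref{thm:gating}`{=latex}, which is stated elsewhere in the paper, and the function names are hypothetical. Under these assumptions the discretized parameters reduce to $1-\sigma(z)$ and $\sigma(z)$, i.e. a sigmoid gate.

```python
import math


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def zoh_discretize(delta: float, a: float = -1.0, b: float = 1.0):
    """Zero-order hold for a scalar SSM: returns (a_bar, b_bar)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)
    return a_bar, b_bar


def gate_from_preactivation(z: float):
    """Map a pre-activation z to (a_bar, b_bar) via delta = softplus(z)."""
    return zoh_discretize(softplus(z))


for z in (-3.0, -1.0, 0.0, 0.5, 2.0):
    a_bar, b_bar = gate_from_preactivation(z)
    g = sigmoid(z)
    # a_bar plays the role of (1 - g) and b_bar the role of g in
    # the gated update h_t = (1 - g) * h_{t-1} + g * x_t.
    assert abs(a_bar - (1.0 - g)) < 1e-9
    assert abs(b_bar - g) < 1e-9
```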

`\cref{tab:ablations-init}`{=latex} considers different initializations of the SSM, which have been shown to make a large difference in some data modalities and settings [@gu2022efficiently; @gu2022parameterization]. On language modeling, we find that simpler real-valued diagonal initializations (S4D-Real, row 3) instead of more standard complex-valued parameterizations (S4D-Lin, row 1) perform better. Random initializations also work well, consistent with findings from prior work [@mehta2023long].

`\cref{tab:ablations-dt}`{=latex} and `\cref{tab:ablations-N}`{=latex} consider varying the dimension of the $\dt$ and $(\B, \C)$ projections respectively. Changing them from static to selective provides the most benefit, while increasing the dimensions further generally improves performance modestly with a small increase in parameter count.

```{=latex}
\centering
```
  [Model]{.smallcaps}   [Arch.]{.smallcaps}   [SSM Layer]{.smallcaps}   `\sc{Perplexity}`{=latex}
  --------------------- --------------------- ------------------------- ---------------------------
  Hyena                 H3                    Hyena                     $10.24$
  H3                    H3                    S4 (complex)              $10.30$
  \-                    H3                    S4 (real)                 $10.34$
  \-                    H3                    S6                        $\mathbf{8.95}$

  :  (**Ablations: Architecture and SSM layer**.) The Mamba block performs similarly to H3 while being simpler. In the inner layer, there is little difference among different parameterizations of LTI models, while selective SSMs (S6) provide a large improvement. More specifically, the S4 (real) variant is S4D-Real and the S4 (complex) variant is S4D-Lin.

```{=latex}
\qquad
```
  [Model]{.smallcaps}   [Arch.]{.smallcaps}   [SSM Layer]{.smallcaps}   `\sc{Perplexity}`{=latex}
  --------------------- --------------------- ------------------------- ---------------------------
  \-                    Mamba                 Hyena                     $10.75$
  \-                    Mamba                 S4 (complex)              $10.54$
  \-                    Mamba                 S4 (real)                 $10.56$
  Mamba                 Mamba                 S6                        $\mathbf{8.69}$

`\label{tab:ablations-arch}`{=latex}

```{=latex}
\centering
```
```{=latex}
\captionsetup{type=table}
```
  `\sc{Selective}`{=latex} $\dt$    `\sc{Selective}`{=latex} $\B$    `\sc{Selective}`{=latex} $\C$    `\sc{Perplexity}`{=latex}
  --------------------------------- -------------------------------- -------------------------------- ---------------------------
  `\xmark               `{=latex}   `\xmark              `{=latex}   `\xmark              `{=latex}   10.93
  `\xmark               `{=latex}   `\cmark              `{=latex}   `\xmark              `{=latex}   10.15
  `\xmark               `{=latex}   `\xmark              `{=latex}   `\cmark              `{=latex}   9.98
  `\cmark               `{=latex}   `\xmark              `{=latex}   `\xmark              `{=latex}   9.81
  `\cmark               `{=latex}   `\cmark              `{=latex}   `\cmark              `{=latex}   8.71

`\label{tab:ablations-variable}`{=latex}

```{=latex}
\hfill
```
```{=latex}
\centering
```
```{=latex}
\captionsetup{type=table}
```
  $\A_n$ `\sc{Initialization}`{=latex}   `\sc{Field}`{=latex}   `\sc{Perplexity}`{=latex}
  -------------------------------------- ---------------------- ---------------------------
  $\A_n = -\frac{1}{2} + n i$            Complex                9.16
  $\A_n = -1/2$                          Real                   8.85
  $\A_n = -(n+1)$                        Real                   8.71
  $\A_n \sim \exp(\mathcal{N}(0, 1))$    Real                   8.71

`\label{tab:ablations-init}`{=latex}

```{=latex}
\captionsetup{type=table}
```
```{=latex}
\centering
```
  `\sc{Size of $\dt$ proj.}`{=latex}   `\sc{Params (M)}`{=latex}   `\sc{Perplexity}`{=latex}
  ------------------------------------ --------------------------- ---------------------------
  \-                                   358.9                       9.12
  $1$                                  359.1                       8.97
  $2$                                  359.3                       8.97
  $4$                                  359.7                       8.91
  $8$                                  360.5                       8.83
  $16$                                 362.1                       8.84
  $32$                                 365.2                       8.80
  $64$                                 371.5                       8.71

`\label{tab:ablations-dt}`{=latex}

```{=latex}
\hfill
```
```{=latex}
\centering
```
```{=latex}
\captionsetup{type=table}
```
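  `\sc{Selective}`{=latex} $\B$, $\C$   `\sc{State dimension}`{=latex} $N$   `\sc{Params (M)}`{=latex}   `\sc{Perplexity}`{=latex}
  ------------------------------------- ------------------------------------ --------------------------- ---------------------------
  `\xmark`{=latex}                      $1$                                  367.1                       9.88
  `\xmark`{=latex}                      $2$                                  367.4                       9.86
  `\xmark`{=latex}                      $4$                                  368.0                       9.82
  `\xmark`{=latex}                      $8$                                  369.1                       9.82
  `\xmark`{=latex}                      $16$                                 371.5                       9.81
  `\cmark`{=latex}                      $1$                                  367.1                       9.73
  `\cmark`{=latex}                      $2$                                  367.4                       9.40
  `\cmark`{=latex}                      $4$                                  368.0                       9.09
  `\cmark`{=latex}                      $8$                                  369.1                       8.84
  `\cmark`{=latex}                      $16$                                 371.5                       8.71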

`\label{tab:ablations-N}`{=latex}

Of particular note is the dramatic improvement of the selective SSM when the state size $N$ is increased, with over a 1.0 perplexity improvement for a cost of only 1% additional parameters. This validates our core motivation in `\cref{sec:method:motivation,sec:method:scan}`{=latex}.
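
The arithmetic behind this observation can be reproduced directly from `\cref{tab:ablations-N}`{=latex}. The sketch below uses the $N=1$ and $N=16$ rows with the lower perplexities, which are taken here to be the selective $\B$, $\C$ rows (an assumption based on their agreement with the fully selective result of 8.71 reported in the other ablations).

```python
# (parameters in millions, perplexity), copied from the state-dimension table.
params_n1, ppl_n1 = 367.1, 9.73
params_n16, ppl_n16 = 371.5, 8.71

perplexity_gain = ppl_n1 - ppl_n16                       # absolute improvement
param_overhead = (params_n16 - params_n1) / params_n1    # relative parameter cost

print(f"perplexity improvement: {perplexity_gain:.2f}")  # 1.02
print(f"parameter overhead: {param_overhead:.1%}")       # 1.2%
```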

Discussion {#sec:discussion}
==========

We discuss related work, limitations, and some future directions.

#### Related Work.

`\cref{sec:discussion:selection}`{=latex} discusses how the selection mechanism relates to similar concepts. `\cref{sec:related}`{=latex} gives an extended discussion of related work on SSMs and other related models.

#### No Free Lunch: Continuous-Discrete Spectrum.

Structured SSMs were originally defined as discretizations of continuous systems `\eqref{eq:ssm}`{=latex}, and have had a strong inductive bias toward continuous-time data modalities such as perceptual signals (e.g. audio, video). As discussed in `\cref{sec:method:motivation,sec:method:properties}`{=latex}, the selection mechanism overcomes their weaknesses on discrete modalities such as text and DNA; but this conversely can impede their performance on data that LTI SSMs excel on. Our ablations on audio waveforms examine this tradeoff in more detail.

#### Downstream Affordances.

Transformer-based foundation models (particularly LLMs) have a rich ecosystem of properties and modes of interaction with pretrained models, such as fine-tuning, adaptation, prompting, in-context learning, instruction tuning, RLHF, quantization, and so on. We are particularly interested in whether Transformer alternatives such as SSMs have similar properties and affordances.

#### Scaling.

Our empirical evaluation is limited to small model sizes, below the threshold of most strong open source LLMs (e.g. Llama [@touvron2023llama]) as well as other recurrent models such as RWKV [@peng2023rwkv] and RetNet [@sun2023retentive], which have been evaluated at the 7B parameter scale and beyond. It remains to assess whether Mamba still compares favorably at these larger sizes. We also note that scaling SSMs may involve further engineering challenges and adjustments to the model that are not discussed in this paper.

Conclusion {#sec:conclusion}
==========

We introduce a selection mechanism to structured state space models, allowing them to perform context-dependent reasoning while scaling linearly in sequence length. When incorporated into a simple attention-free architecture, Mamba achieves state-of-the-art results on a diverse set of domains, where it matches or exceeds the performance of strong Transformer models. We are excited about the broad applications of selective state space models to build foundation models for different domains, especially in emerging modalities requiring long context such as genomics, audio, and video. Our results suggest that Mamba is a strong candidate to be a general sequence model backbone.

### Acknowledgments {#acknowledgments .unnumbered}

We thank Karan Goel, Arjun Desai, and Kush Bhatia for helpful feedback on the draft.

```{=latex}
\printbibliography
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
```{=latex}
\onecolumn
```
Discussion: Selection Mechanism {#sec:discussion:selection}
===============================

Our selection mechanism is inspired by and related to concepts such as gating, hypernetworks, and data-dependence. It can also be viewed as related to "fast weights" [@schmidhuber1992learning; @ba2016using], which connects classical RNNs with the mechanism of linear attention [@schlag2021linear]. However, we believe that it is a distinct concept that is worth clarifying.

#### Gating.

Gating originally referred to the gating mechanisms of RNNs such as the LSTM [@lstm] and GRU [@chung2014empirical], or the gated equation `\eqref{eq:gates}`{=latex} in `\cref{thm:gating}`{=latex}. This was interpreted as a particular mechanism for controlling whether to let an input into the hidden state of an RNN. In particular, this affects the propagation of signal through time and causes inputs to interact along the sequence length dimension.

However, the concept of gating has since been relaxed in popular usage to simply mean any multiplicative interaction (often with an activation function). For example, *elementwise* multiplicative components of neural network architectures (that do not interact along sequence length) are now commonly referred to as gated architectures [@hua2022transformer; @mehta2023long], despite having a very different meaning from the original RNN sense. Thus we believe the original concept of *RNN gating* and the popular usage of *multiplicative gating* actually have very different semantic meanings.

#### Hypernetworks.

Hypernetworks refer to neural networks whose parameters are themselves generated by smaller neural networks. The original idea [@ha2017hypernetworks] used it in a narrow sense to define a large RNN whose recurrent parameters are generated by a smaller RNN, and other variants have been around for a long time [@schmidhuber1992learning].

#### Data-dependence.

Similar to hypernetworks, data-dependence can refer to any notion where some parameters of the model depend on the data [@poli2023hyena].

#### Example: GLU Activation.

To illustrate the issues with these concepts, consider a simple diagonal linear layer $y = \bm{D}x$, where $\bm{D}$ is a diagonal weight parameter. Now suppose that $\bm{D}$ is itself generated from a linear transformation of $x$, with an optional nonlinearity: $\bm{D} = \sigma(\bm{W} x)$. Since it is diagonal, the multiplication becomes an elementwise product: $y = \sigma(\bm{W} x) \circ x$.

This is a rather trivial transformation, yet it technically satisfies the common meanings of gating (since it has a multiplicative "branch"), hypernetworks (since the parameter $\bm{D}$ is generated by another layer), and data-dependence (since $\bm{D}$ depends on the data $x$). However, this in fact simply defines a GLU function, which is so simple that it is often considered just an activation function [@dauphin2017language; @shazeer2020glu] instead of a meaningful layer.
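
The equivalence in this example can be verified with a few lines of code. The sketch below is a minimal pure-Python illustration that takes $\sigma$ to be the logistic sigmoid and $\bm{W}$ to be square (both assumptions for concreteness); the helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def glu_elementwise(w: Matrix, x: Vector) -> Vector:
    """y = sigmoid(W x) * x, computed as an elementwise product."""
    gates = [sigmoid(u) for u in matvec(w, x)]
    return [g * x_i for g, x_i in zip(gates, x)]


def glu_as_diagonal_layer(w: Matrix, x: Vector) -> Vector:
    """y = D x, where the diagonal matrix D = diag(sigmoid(W x)) is generated from x."""
    gates = [sigmoid(u) for u in matvec(w, x)]
    n = len(x)
    d = [[gates[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matvec(d, x)


w = [[0.5, -1.0], [2.0, 0.0]]
x = [1.0, 2.0]
y_elementwise = glu_elementwise(w, x)
y_diagonal = glu_as_diagonal_layer(w, x)
# Both views produce the same output.
assert all(abs(a - b) < 1e-12 for a, b in zip(y_elementwise, y_diagonal))
```

Note that each output coordinate depends only on the current input $x$; nothing here interacts along the sequence length.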

#### Selection.

Thus, while selection mechanisms could be considered a special case of ideas such as architectural gating, hypernetworks, or data-dependence, so can an enormous range of other constructions---essentially anything with a multiplication, including standard attention mechanisms [@bahdanau2015neural; @vaswani2017attention] as well---and we find it uninformative to think of them as such.

Instead, we view it as most closely related to the gating mechanism of traditional RNNs, which is a special case (`\cref{thm:gating}`{=latex}) and also has a deeper history of connections to SSMs through variable (input-dependent) discretization of $\dt$ [@funahashi1993approximation; @tallec2018can; @gu2020hippo]. We also eschew the term "gating" in favor of *selection* to clarify the overloaded use of the former. More narrowly, we use selection to refer to the *mechanistic* action of a model to select or ignore inputs and facilitate data interaction along the sequence length (`\cref{sec:method:motivation}`{=latex}). Beyond selective SSMs and gated RNNs, other examples may include input-dependent convolutions [@yang2019condconv; @lioutas2020time; @kosma2023time; @lutati2023focus] and even attention.

Related Work {#sec:related}
============

We overview several prior works related to our methods. We mention that some of the most closely related models include recurrent layers such as S4, S5, and quasi-RNNs; as well as end-to-end architectures such as H3, RetNet, and RWKV.

S4 Variants and Derivatives
---------------------------

We give a brief overview of some structured SSMs from past work, particularly those that have a relation to our method.

-   S4 [@gu2021combining; @gu2022efficiently] introduced the first structured SSM, describing diagonal structure and diagonal plus low-rank (DPLR). It focused on efficient convolutional algorithms for DPLR SSMs due to a connection to continuous-time online memorization (HIPPO) [@gu2020hippo].

-   DSS [@gupta2022diagonal] first discovered the empirical effectiveness of diagonal structured SSMs by approximating the HIPPO initialization. This was expanded on theoretically in S4D [@gu2022parameterization].

-   S5 [@smith2023s5] independently discovered the diagonal SSM approximation, and is the first S4 model to be computed recurrently with the parallel scan. However, this required lowering the effective state dimension, which they accomplished by switching the SSM dimensions from a SISO (single-input single-output) to MIMO (multi-input multi-output) formulation. Our proposed S6 shares the scan, but differs by (i) keeping the SISO dimensions, which provides a larger effective recurrent state, (ii) using a hardware-aware algorithm to overcome the computation issue, (iii) adding the selection mechanism.

    @lu2023structured applied S5 to meta-RL in order to handle resetting the SSM state between episode trajectories. Their mechanism can be viewed as a particular hard-coded instance of a selection mechanism, where $\dA$ is manually set to $0$, instead of our learnable mechanism that depends on the input. It would be interesting to apply selective SSMs generically to this setting and probe if the model has learned to automatically reset its state on episode boundaries.

-   Mega [@ma2023mega] introduced a simplification of S4 to be real- instead of complex-valued, giving it an interpretation of being an exponential moving average (EMA). They additionally make an interesting connection of the discretization step of SSMs to an EMA *damping* term. Contrary to findings in the original S4 papers, this was the first model to show that real-valued SSMs are empirically effective in certain settings or when combined with different architectural components.

-   Liquid S4 [@hasani2023liquid] is also motivated by augmenting S4 with an input-dependent state transition. From this perspective it shares similarity to selection mechanisms, although in a limited form which is still computed convolutionally and close to LTI.

-   SGConv [@li2023makes], Hyena [@poli2023hyena], LongConv [@fu2023simple], MultiresConv [@shi2023sequence], and Toeplitz Neural Network [@qin2023toeplitz] all focus on the convolutional representation of S4 and create global or long convolution kernels with different parameterizations. However, these methods cannot do fast autoregressive inference directly.

Notably, all of these methods, and all other structured SSMs that we are aware of, have been non-selective and usually strictly LTI (linear time invariant).
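
To make the state-reset remark in the S5 item concrete, the following sketch runs a scalar linear recurrence $h_t = \dA_t h_{t-1} + \dB_t x_t$ and sets $\dA_t = 0$ at an episode boundary. The function name, the fixed decay of $0.5$, and the boundary flags are hypothetical; the sketch illustrates only a hard-coded reset, not a learned input-dependent one.

```python
from typing import List, Sequence


def run_with_resets(
    xs: Sequence[float],
    boundaries: Sequence[bool],
    decay: float = 0.5,   # hypothetical transition used away from boundaries
    b_bar: float = 1.0,   # hypothetical input coefficient
) -> List[float]:
    """Scalar recurrence whose transition is forced to 0 at flagged steps."""
    h = 0.0
    states = []
    for x_t, is_boundary in zip(xs, boundaries):
        a_bar = 0.0 if is_boundary else decay  # a_bar = 0 discards the past state
        h = a_bar * h + b_bar * x_t
        states.append(h)
    return states


states = run_with_resets(
    xs=[1.0, 1.0, 1.0, 1.0],
    boundaries=[False, False, True, False],
)
print(states)  # [1.0, 1.5, 1.0, 1.5]
```

At the flagged step the state depends only on the current input, exactly as it does at the first step of the sequence.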

SSM Architectures
-----------------

We use SSM architectures or state space neural networks (SSNN) to refer to deep neural network architectures incorporating one of the previous SSMs as a black box layer.

-   GSS [@mehta2023long] was the first gated neural network architecture incorporating SSMs. It is motivated by the gated attention unit (GAU) of @hua2022transformer and looks quite similar to our block, except with additional projections. Most importantly, its projection *contracts* the model dimension to reduce the state size of the SSM, while ours *expands* the model dimension in order to increase the state size, based on the motivation in `\cref{sec:method:motivation}`{=latex}.

-   Mega [@ma2023mega] combined the EMA simplification of S4 described above into a hybrid architecture using an efficient attention approximation.

-   H3 [@dao2023hungry] is motivated by combining S4 with linear attention [@katharopoulos2020transformers]. It is the first to generalize this formulation of linear attention to more general recurrences, which is also the basis of later architectures.

-   Selective S4 [@wang2023selective] incorporates S4 as a black box to generate a binary mask which is multiplied on the input. While sharing the "selection" name, we consider this an architectural modification that is closer to architectural gating than a selection mechanism (`\cref{sec:discussion:selection}`{=latex}). For example, we hypothesize that it would not solve the Selective Copying task because simply masking out the irrelevant inputs does not affect the spacing between the relevant ones (indeed, the Selective Copying task can even be viewed as coming pre-masked if the noise tokens are embedded to 0).

-   RetNet [@sun2023retentive] is also based on Linear Attention and very similar to H3, but reduces the inner S4 layer to a special case where the state dimension is $N=1$. Although not framed as such, its recurrence can be viewed as a special case of a linear SSM.

    Its primary source of improvement is using a linear attention with large *head dimension*, which can be viewed as another method to perform input-dependent state expansion. Using a larger head dimension in the context of linear attention variants was first done by H3, but not extensively used since this requires a proportional amount of extra computation. RetNet avoids this with an alternate way to parallelize the computation with a variant of standard multi-head attention instead of convolutions, made feasible by their particular special case of SSMs which acts as a simple EMA.

-   RWKV [@peng2023rwkv] is another recent RNN designed for language modeling. It is based on AFT (attention-free Transformer [@zhai2021attention]), another variant of linear attention. Its main "WKV" mechanism involves LTI recurrences and can be seen as the ratio of two SSMs.

We also highlight the gated attention unit (GAU) from @hua2022transformer, which was motivated by combining the Transformer's MHA and MLP blocks together and was an inspiration for our architecture (`\cref{sec:method:architecture}`{=latex}) combining the H3 and MLP blocks.

Relationship to RNNs
--------------------

RNNs and SSMs are broadly related, as they both involve the concepts of *recurrence* on a latent *state*.

Several older RNNs such as the strongly typed RNN [@balduzzi2016strongly], quasi-RNN (QRNN) [@bradbury2016quasi], and simple recurrent unit (SRU) [@lei2017simple; @lei2021attention] involve forms of gated RNNs without time-wise nonlinearities. Because of the connections of gating mechanisms and selection mechanisms, these can be viewed as cases of selective SSMs, and are thus more powerful in a sense than the family of LTI structured SSMs above. The main differences are:

-   They do not use state expansion ($N=1$) or selective $\B, \C$ parameters, both of which are important for performance (`\cref{sec:exp:ablations}`{=latex}).

-   They use a heuristic gating mechanism, which we generalize as a consequence of the selection mechanism + discretization (`\cref{thm:gating}`{=latex}). The connections to principled SSM theory provide better parameterizations and initializations (`\cref{sec:method:details}`{=latex}).

Additionally, older RNNs famously suffered from efficiency issues and the vanishing gradients problem [@hochreiter1991untersuchungen; @hochreiter2001gradient; @pascanu2013difficulty], both caused by their sequential nature. The former could be solved for some of the above RNNs by leveraging the parallel scan [@martin2018parallelizing], but the latter was difficult without theory later developed for SSMs. For example, modern structured SSMs differ in their more careful parameterization of the recurrent dynamics, inspired by classical SSM theory (e.g. through discretization [@gu2021combining; @gu2023train]) or by direct analysis [@orvieto2023resurrecting; @kaul2020linear; @gupta2022simplifying].

We also note that there is a long line of work on orthogonal RNNs [@arjovsky2016unitary; @henaff2016recurrent; @mhammedi2017efficient; @vorontsov2017orthogonality; @lezcano2019cheap], which are motivated by constraining the $\dA$ transition matrix to be orthogonal or unitary in order to control its eigenvalues and prevent the vanishing gradient problem. However, these had other limitations; we believe that these stem from the fact that orthogonal/unitary RNNs are also LTI. For example, they are almost always evaluated on the Copying task, which they can solve perfectly, but have been observed to struggle on the Selective Copying task [@jing2019gated].

Linear Attention
----------------

The Linear Attention (LA) [@katharopoulos2020transformers] framework is an important result popularizing kernel attention and showing how it relates to recurrent autoregressive models. Many variants have proposed alternative kernels and other modifications. Random Feature Attention (RFA) [@peng2021random] chooses the kernel feature map to approximate softmax attention (i.e. the $\exp$ feature map) using the random Fourier feature approximation of Gaussian kernels [@rahimi2007random]. Performer [@choromanski2021rethinking] finds an approximation to the exponential kernel involving only positive features, which also allows the softmax normalization term. TransNormer [@qin2022devil] showed that the LA denominator term can be unstable and proposed replacing it with a LayerNorm. cosFormer [@qin2022cosformer] augments RFA with a cosine reweighting mechanism that incorporates positional information to emphasize locality. Linear Randomized Attention [@zheng2022linear] generalizes RFA from the perspective of importance sampling, and extends it to provide better estimates of the full softmax kernel (rather than just the $\exp$-transformed numerator).
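
To make the link between kernel attention and recurrent autoregressive models concrete, the following is a minimal NumPy sketch rather than any of the cited implementations. It assumes the $\mathrm{elu}(x)+1$ feature map and single-head, unbatched inputs; the function names are illustrative. It checks that the causal "attention" form and the recurrent form with a fixed-size state give the same outputs.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map (an illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(Q, K, V):
    """Causal kernel attention computed with an explicit (L, L) score matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    scores = np.tril(Qf @ Kf.T)  # zero out attention to future positions
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(Q, K, V):
    """The same outputs computed by a recurrence over a fixed-size state."""
    Qf, Kf = feature_map(Q), feature_map(K)
    S = np.zeros((Q.shape[1], V.shape[1]))  # running sum of outer(phi(k), v)
    z = np.zeros(Q.shape[1])                # running sum of phi(k) (denominator)
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 4))  # three arrays of shape (L=6, d=4)
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The recurrent form carries only the state $(S, z)$, whose size is independent of the sequence length.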

Aside from kernel attention, many other variants of efficient attention exist; the survey @tay2022efficient offers an extensive categorization of many of these.

Long Context Models
-------------------

Long context has become a popular subject, and several recent models have claimed to scale to longer and longer sequences. However, these claims are often made from a computational standpoint and have not been extensively validated. These include:

-   Recurrent Memory Transformer [@bulatov2023scaling], a lightweight wrapper around a Transformer backbone. It showed the ability to generalize to sequences of length up to 1M, but only on synthetic memorization tasks; their main result is similar to our Induction Heads extrapolation experiment (`\cref{fig:induction}`{=latex}).

-   LongNet [@ding2023longnet], which claimed to scale to 1B length but only evaluated on length $<100K$ for actual tasks.

-   Hyena and HyenaDNA [@poli2023hyena; @nguyen2023hyenadna], which claimed to leverage up to 1M context. However, their experiments trained on proportionally more data at longer contexts, making it hard to conclude if quality improvements at 1M context are due to context length or due to more data and computation.

-   Sparse Transformer [@child2019generating] showed a proof-of-concept of using a strided sparse attention Transformer to model audio waveforms of length $2^{20}=1048576$, although it did not discuss performance tradeoffs when controlling for computation and model size.

In contrast, we believe this work presents one of the first approaches to meaningfully demonstrate increasing performance with longer context.

Mechanics of Selective SSMs {#sec:mechanics}
===========================

```{=latex}
\begin{proof}[Proof of \cref{thm:gating}]Consider a selective SSM (\cref{alg:s6}) with
$N=1, \A=-1, \B=1, s_\dt=\mathsf{Linear}(x), \tau_\dt=\mathsf{softplus}$.
The corresponding continuous-time SSM \eqref{eq:ssm} is
\begin{align*}%
  h'(t) = -h(t) + x(t)
\end{align*}
which is also called a \emph{leaky integrator}.

The discretization step size is
\begin{align*}%
  \dt_t &= \tau_\dt(\mathsf{Parameter} + s_\dt(x_t)) \\
      &= \mathsf{softplus}(\mathsf{Parameter} + \mathsf{Linear}(x_t)) \\
      &= \mathsf{softplus}(\mathsf{Linear}(x_t))
\end{align*}
where we observe that the parameter can be viewed as a learnable bias and folded into the linear projection.

Now applying the zero-order hold (ZOH) discretization formulas:
\begin{align*}%
  \dA_t &= \exp(\dt \A) = \frac{1}{1 + \exp(\mathsf{Linear}(x_t))} = \sigma(-\mathsf{Linear}(x_t))
    \\&= 1 - \sigma(\mathsf{Linear}(x_t))
    \\
  \dB_t &= (\dt \bm{A})^{-1} (\exp(\dt \bm{A}) - \bm{I}) \cdot \dt \bm{B} = -(\exp(\dt \bm{A}) - \bm{I}) = 1 - \dA
    \\&= \sigma(\mathsf{Linear}(x_t))
    .
\end{align*}

Thus the final discrete recurrence \eqref{eq:ssm:recurrence:1} is
\begin{align*}%
  g_t &= \sigma(\mathsf{Linear}(x_t)) \\
  h_{t} &= (1-g_t) h_{t-1} + g_t x_t
\end{align*}
as desired.
\end{proof}
```
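
The algebra in the proof can be checked numerically. The following is a minimal NumPy sketch under the proof's assumptions ($N=1$, $\A=-1$, $\B=1$, softplus step size); the array `z` stands in for $\mathsf{Linear}(x_t)$ and the helper names are illustrative.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # log(1 + exp(z)), computed stably

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zoh_discretize(dt, A, B):
    """Zero-order hold for a scalar SSM: returns (A_bar, B_bar)."""
    dA = np.exp(dt * A)
    dB = (dA - 1.0) / (dt * A) * (dt * B)
    return dA, dB

z = np.linspace(-5.0, 5.0, 11)   # stands in for Linear(x_t)
dt = softplus(z)                 # step size, always positive
dA, dB = zoh_discretize(dt, A=-1.0, B=1.0)
g = sigmoid(z)

assert np.allclose(dA, 1.0 - g)  # A_bar = 1 - g_t
assert np.allclose(dB, g)        # B_bar = g_t
```

With these values the recurrence $h_t = \dA_t h_{t-1} + \dB_t x_t$ is exactly the gated update $h_t = (1-g_t) h_{t-1} + g_t x_t$.
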
Hardware-aware Algorithm For Selective SSMs {#sec:hardware_aware_algo}
===========================================

Without input-dependent selectivity, SSMs can be efficiently implemented as a convolution [@gu2022efficiently; @dao2023hungry], which leverages the fast Fourier transform (FFT) as a primitive. With selectivity, SSMs are no longer equivalent to a convolution, but we can instead leverage the parallel associative scan. While SSM scans are theoretically efficient ($O(B L D N)$ FLOPs, scaling linearly in $L$), training foundation models with selective SSMs requires them to be efficient on modern hardware (GPUs) as well. We describe how we use *kernel fusion* and *recomputation* to make the SSM scan fast and memory-efficient. We evaluate the speed of our scan implementation compared to convolution and attention in `\cref{sec:exp:benchmark}`{=latex}, showing that it is up to 7$\times$ faster than attention at sequence length 32K, and is as memory-efficient as the best attention implementation (FlashAttention).

#### Speed.

On modern hardware accelerators (GPUs), most operations (except matrix multiply) are bounded by memory bandwidth [@williams2009roofline; @ivanov2021data; @dao2022flashattention]. This is the case with our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.

The standard way to implement the scan algorithm in `\cref{sec:method:selective}`{=latex} is to prepare the scan input $\dA, \dB$ of size $(B, L, D, N)$ in GPU HBM (high-bandwidth memory, commonly referred to as GPU memory), call a parallel associative scan implementation to write the scan output of size $(B, L, D, N)$ to GPU HBM, then multiply that scan output with $\C$ to produce an output of size $(B, L, D)$. However, this requires $O(BLDN)$ memory reads/writes. We can instead fuse the discretization step, the scan, and the multiplication with $\C$ into one kernel:

1.  We read in $O(BLD + DN)$ bytes of memory ($\dt, \A, \B, \C$) from slow HBM to fast SRAM.

2.  We discretize to produce $\dA, \dB$ of size $(B, L, D, N)$ in SRAM.

3.  We perform a parallel associative scan, yielding intermediate states of size $(B, L, D, N)$ in SRAM.

4.  We multiply and sum with $\C$, producing outputs of size $(B, L, D)$ and write it to HBM.

This way, we reduce IOs by a factor of $O(N)$ (the state dimension), which in practice speeds up the operation by 20-40 times (`\cref{sec:exp:benchmark}`{=latex}).

When the sequence length $L$ is too long to fit the sequence in SRAM (which is much smaller than HBM), we split the sequence into chunks and perform the fused scan on each chunk. As long as we have the intermediate scan states, we can continue the scan with the next chunk.
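
The chunking argument can be illustrated on the scalar recurrence $h_t = a_t h_{t-1} + b_t$. The following is a CPU-only NumPy sketch, not the fused GPU kernel: `scan` is a sequential reference, `chunked_scan` carries the last state from one chunk into the next, and `combine` is the associative operator that a parallel scan would apply. All names are illustrative.

```python
from functools import reduce
import numpy as np

def combine(left, right):
    """Compose two steps h -> a*h + b (left first, then right); associative."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def scan(a, b, h0=0.0):
    """Sequential reference for h_t = a_t * h_{t-1} + b_t."""
    h = np.empty_like(b)
    prev = h0
    for t in range(len(b)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

def chunked_scan(a, b, chunk):
    """Scan chunk by chunk, carrying only the last state between chunks."""
    out, carry = [], 0.0
    for s in range(0, len(b), chunk):
        h = scan(a[s:s + chunk], b[s:s + chunk], h0=carry)
        carry = h[-1]
        out.append(h)
    return np.concatenate(out)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=37)
b = rng.normal(size=37)
full = scan(a, b)

assert np.allclose(chunked_scan(a, b, chunk=8), full)
# Folding the associative operator over all steps recovers the final state.
assert np.isclose(reduce(combine, zip(a, b))[1], full[-1])
```

Only the carried state crosses a chunk boundary, which is why the chunk size can be chosen to fit fast memory without changing the result.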

#### Memory.

We describe how we use the classical technique of *recomputation* to reduce the total amount of memory required to train selective SSM layers.

From the way we fuse the forward pass, we do not save the intermediate states of size $(B, L, D, N)$ to avoid memory blowup. However, these intermediate states are necessary for the backward pass to compute gradients. We instead recompute those intermediate states in the backward pass. Since the inputs $\dt, \A, \B, \C$ and output gradient read from HBM to SRAM are of size $O(BLN + DN)$, and the input gradients are also of size $O(BLN + DN)$, recomputation avoids the cost of reading $O(BLND)$ elements from HBM. This means that recomputation of the SSM states in the backward pass speeds up the computation compared to storing them and reading them from HBM.

Beyond optimizing for the memory requirement of just the scan operation, we also use recomputation to optimize the memory requirement of the entire selective SSM block (input projection, convolution, activation, scan, output projection). In particular, we do not save intermediate activations that take a lot of memory but are fast to recompute (e.g. the output of the activation function or the short convolution). As a result, the selective SSM layer has the same memory requirement as an optimized Transformer implementation with FlashAttention. Specifically, each attention layer (FlashAttention) stores around 12 bytes of activations per token, and each MLP layer stores around 20 bytes of activations per token, for a total of 32 bytes (assuming mixed-precision training in FP16 or BF16). Each selective SSM stores around 16 bytes of activations per token. Hence two layers of selective SSMs have around the same activation memory as an attention layer and an MLP layer.

Experimental Details and Additional Results {#sec:exp-details}
===========================================

Synthetic Tasks {#sec:exp-details:synthetics}
---------------

#### Selective Copying.

Our setting is on sequences of length 4096, with a vocab size of 16 possible tokens (including the white "noise" token from `\cref{fig:copying}`{=latex}) and requiring models to memorize 16 "data" tokens. We use 2-layer models with a model dimension of $D = 64$.

Models are trained for 400K steps at a constant learning rate of $0.0001$ with a batch size of $64$.

#### Induction Heads.

```{=latex}
\footnotesize
```
```{=latex}
\centering
```
  Model      Params   $2^{6}$          $2^{7}$          $\mathbf{2^{8}}$ $2^{9}$          $2^{10}$         $2^{11}$         $2^{12}$         $2^{13}$         $2^{14}$         $2^{15}$         $2^{16}$         $2^{17}$         $2^{18}$         $2^{19}$         $2^{20}$
  ---------- -------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ---------------- ----------------
  MHA-Abs    137K     `\cmark`{=latex} 99.6             100.0            58.6             26.6             18.8             9.8              10.9             7.8              `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex}
  MHA-RoPE   137K     `\cmark`{=latex} `\cmark`{=latex} 100.0            83.6             31.3             18.4             8.6              9.0              5.5              `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex}
  MHA-xPos   137K     `\cmark`{=latex} `\cmark`{=latex} 100.0            99.6             67.6             25.4             7.0              9.0              7.8              `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex} `\xmark`{=latex}
  H3         153K     `\cmark`{=latex} `\cmark`{=latex} 100.0            80.9             39.5             23.8             14.8             8.2              5.9              6.6              8.2              4.7              8.2              6.3              7.4
  Hyena      69M$^*$  97.7             `\cmark`{=latex} 100.0            `\cmark`{=latex} 44.1             12.5             6.6              5.1              7.0              5.9              6.6              6.6              5.9              6.3              9.8
  Mamba      74K      `\cmark`{=latex} `\cmark`{=latex} 100.0            `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex} `\cmark`{=latex}

:  (**Induction heads**.) Models are trained on sequence length $2^8=256$, and tested on various sequence lengths of $2^6=64$ up to $2^{20}=1048576$. `\cmark`{=latex} denotes perfect generalization accuracy, while `\xmark`{=latex} denotes out of memory.

`\label{tab:induction}`{=latex}

Training consists of randomly generating data every step, with a batch size of $8$. We choose an "epoch" size of 8192 steps, and track the accuracy on fixed validation sets (also randomly generated) of each target sequence length. For the MHA-Abs and Mamba models, results are reported after the 25th epoch ($8192 \times 25 = 204800$ steps). For the MHA-RoPE and MHA-xPos models, results are reported after the 50th epoch ($8192 \times 50 = 409600$ steps). For the LTI H3 and Hyena models, results are reported after the 10th epoch ($81920$ steps) because they had converged by then and failed to improve further.

We use the Adam optimizer with no weight decay. All models are trained at constant learning rates $2e-4$ and $1e-3$, and the better results are reported for each model ($2e-4$ for all models except Mamba). The attention and Hyena models did not learn at LR $1e-3$. H3 learned at both LRs, but interestingly generalized better to shorter sequences at the smaller LR of $2e-4$. Mamba learned at both LRs, but extrapolated better at the larger LR of $1e-3$.

Language Modeling {#sec:exp-details:lm}
-----------------

### Scaling Law Details {#sec:exp-details:lm:scaling}

Scaling law experiments generally followed the GPT3 recipe. All models were trained on the Pile with the GPT2 tokenizer.

#### Model Sizes.

`\cref{tab:gpt3}`{=latex} specifies the model sizes we use for scaling laws. This is taken directly from the GPT3 specifications [@brown2020language], with very minor modifications. First, we changed the batch size of the 1.3B model from 1M tokens to 0.5M tokens, since we did not use enough parallelization to require the larger batch size. Second, we changed the number of training steps and total tokens to roughly match Chinchilla scaling laws [@hoffmann2022empirical], which specify that training tokens should increase proportionally to model size.

```{=latex}
\centering
```
```{=latex}
\small
```
  `\sc{Params}`{=latex}   $\mathtt{n\_layers}$   $\mathtt{d\_model}$   $\mathtt{n\_heads}$ / $\mathtt{d\_head}$   `\sc{Training steps}`{=latex}   `\sc{Learning Rate}`{=latex}   `\sc{Batch Size}`{=latex}   `\sc{Tokens}`{=latex}
  ----------------------- ---------------------- --------------------- ------------------------------------------ ------------------------------- ------------------------------ --------------------------- -----------------------
  125M                    12                     768                   12 / 64                                    4800                            6e-4                           0.5M tokens                 2.5B
  350M                    24                     1024                  16 / 64                                    13500                           3e-4                           0.5M tokens                 7B
  760M                    24                     1536                  16 / 96                                    29000                           2.5e-4                         0.5M tokens                 15B
  1.3B                    24                     2048                  32 / 64                                    50000                           2e-4                           0.5M tokens                 26B

  :  (**Scaling Law Model Sizes**.) Our model sizes and hyperparameters for scaling experiments. (Model dimension and number of heads apply only to Transformer models.)

`\label{tab:gpt3}`{=latex}
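
As a quick consistency check on the table, the token column can be recomputed from the number of training steps and the batch size. This sketch assumes that "0.5M tokens" means $2^{19} = 524288$ tokens per step (an assumption, not stated in the table) and treats the parameter counts as the nominal model sizes.

```python
BATCH_TOKENS = 2**19  # assumption: "0.5M tokens" is 2^19 tokens per step

# name: (nominal params, training steps, tokens listed in the table)
CONFIGS = {
    "125M": (125e6, 4800, 2.5e9),
    "350M": (350e6, 13500, 7e9),
    "760M": (760e6, 29000, 15e9),
    "1.3B": (1.3e9, 50000, 26e9),
}

def total_tokens(steps, batch_tokens=BATCH_TOKENS):
    return steps * batch_tokens

for name, (params, steps, listed) in CONFIGS.items():
    tokens = total_tokens(steps)
    # The computed total should agree with the rounded table entry.
    assert abs(tokens - listed) / listed < 0.05
    print(f"{name}: {tokens / 1e9:.2f}B tokens, "
          f"{tokens / params:.1f} tokens per parameter")
```

Under this assumption the tokens-per-parameter ratio is roughly constant across the four sizes, which is the proportional scaling described above.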

#### Training Recipes.

All models used the AdamW optimizer with

-   gradient clip value $1.0$

-   weight decay $0.1$

-   no dropout

-   linear learning rate warmup with cosine decay

By default, the peak learning rate is the GPT3 specification.

We give several models an "improved recipe", inspired by changes adopted by popular large language models such as PaLM [@chowdhery2022palm] and LLaMa [@touvron2023llama]. These include:

-   linear learning rate warmup with cosine decay to $1e-5$, with a peak value of $5\times$ the GPT3 value

-   no linear bias terms

-   RMSNorm instead of LayerNorm

-   AdamW hyperparameter $\beta=(.9, .95)$ (the GPT3 value) instead of the PyTorch default of $\beta=(.9, .999)$
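
The learning rate schedule in the improved recipe can be sketched as follows. This is a minimal illustration rather than the training code: the function name and the `warmup_steps` argument are hypothetical, and the warmup length used in the example is an arbitrary placeholder because it is not specified above. The peak is set to $5\times$ the GPT3 value and the cosine decays to $1e-5$.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example for the 125M row: GPT3 value 6e-4, so the improved peak is 3e-3.
gpt3_lr = 6e-4
peak = 5 * gpt3_lr
total = 4800
warmup = 480  # placeholder warmup length (assumption)

schedule = [lr_at(s, total, peak, warmup) for s in range(total + 1)]
assert math.isclose(max(schedule), peak)
assert math.isclose(schedule[-1], 1e-5)
```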

#### Architecture and Training Details.

Our models are:

-   **Transformer**: The standard Transformer based on GPT3 (`\cref{tab:gpt3}`{=latex}).

-   **Transformer++**: A Transformer with an improved architecture, namely rotary positional encodings [@su2021roformer] and SwiGLU MLP [@shazeer2020glu], and the improved training recipe above.

-   **Hyena**: Interleaving a Hyena block (the H3 block with S4 replaced by a global convolution parameterized by an MLP) with standard MLP blocks. The MLP blocks have expansion factor $2$ instead of $4$ and the number of layers is correspondingly increased by $1.5\times$ to preserve parameter count.

-   **H3++**: The H3 architecture with a few modifications, including (i) using the same "thin" Hyena dimensions above, (ii) the improved training recipe above, and (iii) a linear attention *head dimension* of 8.

-   **RWKV**: The default RWKV model from @peng2023rwkv, including its modified MLP block. We also used as much of its specified training recipe as possible, such as increasing the learning rates by $2\times$ or $3\times$ on certain parameters.

-   **RetNet**: The default RetNet model from @sun2023retentive. We also gave it the improved training recipe above.

-   **Mamba**: The standard Mamba architecture, with the improved training recipe.

### Additional Scaling Law Ablations {#sec:exp-details:lm:scaling-ablations}

We perform additional ablations on the architecture using the same protocol as the 2k context length scaling laws in `\cref{fig:lm-scaling}`{=latex} (*Left*).

#### Mamba Architecture: Interleaving Blocks.

We test the effect of different architectural blocks combined with the Mamba block. We focus on the viewpoint that the Mamba block is simply the standard SwiGLU block with an extra $\mathsf{conv} \to \mathsf{SSM}$ path added. This leads to two natural ablations:

-   What if the Mamba block is interleaved with a standard MLP block, instead of stacked homogeneously? This can also be interpreted as taking Mamba and removing half of the SSMs.

-   What if the Mamba block is interleaved with MHA (multi-head attention) blocks? This can also be interpreted as taking a Transformer with SwiGLU MLPs (i.e. what we call Transformer++) and simply adding SSMs to the MLP blocks.

`\cref{fig:lm-scaling-ablations}`{=latex} (*Left*) shows these variants compared to the original (homogeneous) Mamba architecture. Interestingly, neither change matters too much. The Mamba-MLP architecture is only slightly worse, and still better than all models except Transformer++. The Mamba-MHA architecture is only slightly better, which is somewhat surprising in light of the fact that many recent works have found that combining (LTI) SSMs with Attention can lead to substantial improvements [@dao2023hungry; @fathullah2023multi; @saon2023diagonal; @zuo2022efficient; @fathi2023block].

#### H3 Architecture: Training Recipes.

Next we ablate differences between the Hyena and H3++ models, our weakest and strongest models outside of Transformer++ and Mamba, particularly to isolate the effect of training recipes.

-   **Hyena**: The Hyena block with its original architecture and GPT3 training recipe (same as `\cref{fig:lm-scaling}`{=latex}).

-   **Hyena+**: The same architecture but with the improved training recipe described above.

-   **H3+**: The same architecture as Hyena+ but with the Hyena convolution kernel swapped out for the S4D convolution kernel.

-   **H3++**: The same as H3+, but with a linear attention *head dimension* of 8. This increases computation inside the SSM recurrence but does not increase parameters.

Our general convention is that "Model+" represents the base model with the improved training recipe, and "Model++" also allows for architectural changes.

`\cref{fig:lm-scaling-ablations}`{=latex} (*Right*) shows that

-   A large improvement is achieved by the improved training recipe, which was used for many of the models in the main `\cref{fig:lm-scaling}`{=latex} (RetNet, H3++, Transformer++, Mamba).

-   The choice of the inner LTI SSM does not matter (e.g. Hyena vs. S4), consistent with findings throughout this paper.

-   The head dimension expansion improves performance, consistent with one of our main themes that expanded state dimension improves performance for SSMs (`\cref{sec:method}`{=latex}).

```{=latex}
\centering
```
```{=latex}
\centering
```
![ (**Scaling laws: extra ablations**.) (*Left*) Instead of (*Right*) Instead of ](fig/pile_2k_ablations_mamba.png){#fig:lm-scaling-ablations width="\\textwidth"}

```{=latex}
\centering
```
![ (**Scaling laws: extra ablations**.) (*Left*) Instead of (*Right*) Instead of ](fig/pile_2k_ablations_h3.png){#fig:lm-scaling-ablations width="\\textwidth"}

### Downstream Evaluation Details

This pretraining procedure is the same as the scaling law protocol, but extended to 300B tokens and with the GPT-NeoX tokenizer [@black2022gpt] instead of the GPT2 tokenizer. For the 1.3B model, we use a batch size of 1M tokens to be consistent with the GPT3 specifications. We report the perplexity on the Pile validation set, and for this metric only compare to models trained on the same dataset and with the same tokenizer, in particular Pythia and RWKV.

For downstream evaluation, we use the LM evaluation harness from EleutherAI [@eval-harness], as done by most work in this area. We evaluate on the following tasks/datasets that measure common sense reasoning:

-   LAMBADA [@paperno2016lambada]

-   HellaSwag [@zellers2019hellaswag]

-   PIQA [@bisk2020piqa]

-   ARC-challenge [@clark2018think]

-   ARC-easy: an easy subset of ARC-challenge

-   WinoGrande [@sakaguchi2021winogrande]

We report accuracy for LAMBADA, WinoGrande, PIQA, and ARC-easy, and accuracy normalized by sequence length for HellaSwag and ARC-challenge (since normalized accuracy is higher for almost all models on these tasks).

DNA Modeling {#dna-modeling}
------------

### Pretraining Details

We describe the dataset and training procedure of the HG38 pretraining task in more detail.

The dataset follows the splits from the prior Enformer work on genomics [@avsec2021effective]; the training split contains a total of $S=34021$ segments of length $2^{17}=131072$ that cover the genome, for a total of approximately 4.5 billion tokens (DNA base pairs). These segments are tuples of (chromosome number, starting index, ending index), and can be extended if necessary (e.g. to get longer segments).

We deviate from HyenaDNA when the training sequence length is not $2^{17}$. HyenaDNA always takes a fixed sub-segment (e.g. the beginning or middle of the prescribed segment), and thus for any training sequence length each epoch is fixed to $34021$ samples and doesn't necessarily go through the whole genome. On the other hand, we use the entire training data:

-   When the context length $L$ is less than (or equal to) $2^{17}$, we divide up each segment into non-overlapping sub-segments of length $L$, so that there are $S \times \frac{2^{17}}{L}$ total samples and $S \times 2^{17} \approx 4.5B$ tokens per epoch.

-   When the context length $L$ is greater than $2^{17}$, we turn each segment into two samples, one that begins with the prescribed segment and one that ends with the prescribed segment. Thus each epoch has $2S$ samples and $2SL$ tokens. For example, at sequence length $2^{18}=262144$ there are $4\times$ as many tokens as the default, and at sequence length $2^{20}$ there are $16\times$ as many tokens.
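
The two cases above can be summarized in a few lines. This is an illustrative sketch of the counting only (not the data loader); the function name is hypothetical and it assumes $L$ is a power of two.

```python
S = 34021          # number of training segments
SEG_LEN = 2**17    # prescribed segment length

def epoch_stats(L):
    """Return (samples per epoch, tokens per epoch) at context length L."""
    if L <= SEG_LEN:
        samples = S * (SEG_LEN // L)  # non-overlapping sub-segments
    else:
        samples = 2 * S               # one sample starting and one ending with the segment
    return samples, samples * L

default_tokens = epoch_stats(2**17)[1]
assert default_tokens == S * 2**17                     # about 4.5B tokens
assert epoch_stats(2**10)[1] == default_tokens         # shorter contexts see the same tokens
assert epoch_stats(2**18)[1] == 4 * default_tokens
assert epoch_stats(2**20)[1] == 16 * default_tokens
```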

Other training details generally follow the same protocol as our language modeling experiments (`\cref{sec:exp-details:lm}`{=latex}). For example, we use AdamW with $(\beta_1, \beta_2) = (0.9, 0.95)$, no dropout, and weight decay $0.1$. We use a cosine learning rate scheduler with linear warmup for 10% of total steps.

### Scaling: Model Size Details

#### Models.

The models we consider are:

-   Transformer++: a Transformer with improved architecture, notably the usage of RoPE positional encodings [@su2021roformer]. Informally, we found these to be noticeably better than the vanilla positional encodings from @vaswani2017attention.

-   HyenaDNA: the Hyena model from @poli2023hyena [@nguyen2023hyenadna], which is roughly a Transformer with the MHA block replaced by an H3 block using a global convolution parameterized by an MLP.

-   Mamba: the standard Mamba architecture.

#### Model Sizes.

We use the following model sizes.

  --------------------------------- ------ ------ ------ ------ ------ ------- -------
  `\sc{Blocks}`{=latex}             4      5      6      7      8      10      12
  `\sc{Model Dimension}`{=latex}    64     96     128    192    256    384     512
  `\sc{Params (Approx.)}`{=latex}   250K   700K   1.4M   3.5M   7.0M   19.3M   40.7M
  --------------------------------- ------ ------ ------ ------ ------ ------- -------

Note that the number of blocks for Mamba is doubled, because one Transformer "layer" includes both the MHA and MLP blocks (and similarly for Hyena), which requires two Mamba blocks to match parameters (`\cref{sec:method:architecture}`{=latex}).

#### Training.

For each model (Transformer++, HyenaDNA, Mamba), we swept the learning rate across $\{1e-3, 2e-3, 4e-3, 8e-3\}$. The optimal Transformer and HyenaDNA learning rates were 2e-3 across all sizes. The optimal Mamba learning rate was 8e-3; note that Mamba performed better than baselines with matched learning rates (2e-3), but was more stable and improved even more at higher learning rates. (Furthermore, as this LR is on the upper range of the sweep, it is possible that our results are still suboptimal.)

Note that, in contrast to standard LM scaling laws (`\cref{tab:gpt3}`{=latex}), our LR was held constant across model sizes for simplicity. The optimal LR should go down for larger models, but we didn't find a noticeable effect at the small model sizes (at most a few million parameters) we considered.

### Scaling: Context Length Details

We use a total batch size of $2^{24}\approx 16M$ tokens per training step, for every sequence length (e.g. at length $2^{20}$ there are $16$ segments per batch and at length $2^{10}$ there are $16384$ segments per batch). This is a large batch size relative to the model size by usual LM standards, but note that a batch size of $2^{23}$ is the minimum possible on a machine with 8 GPUs and sequence length of $2^{20}$, and that HyenaDNA used much larger batches of $2^{28}$.
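The batch arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the names `TOKENS_PER_STEP`, `segments_per_batch`, and `min_tokens_per_step` are ours and do not come from the training code, and the minimum assumes at least one full sequence per GPU.

```python
# Illustrative helpers; names are assumptions, not taken from the training code.
TOKENS_PER_STEP = 2 ** 24  # total tokens per training step, as stated in the text


def segments_per_batch(seq_len: int, tokens_per_step: int = TOKENS_PER_STEP) -> int:
    """Number of sequences per step so that batch_size * seq_len stays fixed."""
    if tokens_per_step % seq_len != 0:
        raise ValueError("sequence length must divide the token budget")
    return tokens_per_step // seq_len


def min_tokens_per_step(num_gpus: int, seq_len: int) -> int:
    """Smallest token budget when every GPU holds at least one full sequence."""
    return num_gpus * seq_len


if __name__ == "__main__":
    for exponent in range(10, 21):
        print(f"length 2^{exponent}: {segments_per_batch(2 ** exponent)} segments per batch")
```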

The learning rate used was $0.008$ for Mamba and $0.001$ for HyenaDNA; we initially attempted to use the same learning rate of $0.002$ from the previous section for HyenaDNA, but found that it was unstable at the longest context length.

#### Sequence Length Warmup.

Following [@nguyen2023hyenadna], we use sequence length warmup (SLW) during pretraining. We choose a simple schedule of 2 epochs at each power-of-two sequence length starting from $2^{10}=1024$. (Note that because of how data is curated, at the longest sequence lengths more steps and tokens are spent proportionally. In particular, each stage up to length $2^{17}$ processes the same number of tokens, but $4\times$ as many tokens are processed at length $2^{18}$, $8\times$ as many at length $2^{19}$, and $16\times$ as many at length $2^{20}$.)

Unlike HyenaDNA, we always control for the number of tokens per gradient update, so the batch size is successively halved as the sequence lengths are doubled in each stage.

```{=latex}
\begin{remark}
  We also note that the schedule was not tuned, and we never experimented with turning off sequence length warmup for these pretraining experiments.
  We later found that SLW did not help noticeably for audio pretraining at similar lengths (\cref{sec:exp:audio}), and it is possible that it is not necessary for DNA pretraining either.
\end{remark}
```
### Species (Great Apes) Classification

Models are causal and therefore only the last element (across the sequence length) of the model's output is used for the classification head. Note that we control for the total number of elements in the loss function per gradient step. The pretraining objective includes all positions across the sequence length, so that $\mathtt{batch\_size} \times \mathtt{sequence\_length}$ is held constant; in other words, the batch size decreases as the sequence length increases. However, for a classification task, since only the last position enters the loss, the batch size itself is held constant. Note that this also means that fine-tuning models with longer sequence lengths is more computationally expensive.

Training consists of 10 epochs, each of which has 1024 gradient steps. Each gradient step uses a batch of 64 sequences, each drawn independently at random by uniformly picking a species, uniformly picking a chromosome, and then uniformly picking a contiguous segment of DNA.
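The sampling procedure can be sketched as follows. The data layout (a nested dictionary from species to chromosome to DNA string) and the function names are hypothetical and chosen only for illustration, as is the choice to return the species as the label; the actual data pipeline may differ.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical layout: species name -> chromosome name -> DNA string.
Genomes = Dict[str, Dict[str, str]]


def sample_example(genomes: Genomes, seq_len: int, rng: random.Random) -> Tuple[str, str]:
    """Draw one (segment, species) pair with the three uniform choices described above."""
    species = rng.choice(sorted(genomes))              # uniform over species
    chromosome = rng.choice(sorted(genomes[species]))  # uniform over that species' chromosomes
    dna = genomes[species][chromosome]
    if len(dna) < seq_len:
        raise ValueError("chromosome shorter than the requested segment length")
    start = rng.randrange(len(dna) - seq_len + 1)      # uniform over valid start positions
    return dna[start:start + seq_len], species


def sample_batch(genomes: Genomes, seq_len: int, batch_size: int = 64,
                 seed: int = 0) -> List[Tuple[str, str]]:
    """Draw `batch_size` independent examples."""
    rng = random.Random(seed)
    return [sample_example(genomes, seq_len, rng) for _ in range(batch_size)]
```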

Following [@nguyen2023hyenadna], models with a maximum context length greater than $2^{14} = 16384$ use sequence length warmup with 1 epoch at length $2^{14}=16384$, 1 epoch at length $2^{15}=32768$, 1 epoch at length $2^{16}=65536$, and so on up to the maximum sequence length. For example, the model with $2^{20}=1048576$ context undergoes $6$ epochs of sequence length warmup before $4$ more epochs at its maximum sequence length.
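The warmup schedule described above can be written out explicitly. The following is a minimal sketch under the stated protocol (10 epochs, one warmup epoch per power of two starting at $2^{14}$); the function name and signature are illustrative.

```python
from typing import List


def slw_epoch_lengths(max_exp: int, total_epochs: int = 10, start_exp: int = 14) -> List[int]:
    """Sequence length used in each epoch for a model with context length 2**max_exp."""
    # One warmup epoch at each power of two strictly below the maximum length.
    # For max_exp <= start_exp the range is empty, so there is no warmup.
    warmup = [2 ** e for e in range(start_exp, max_exp)]
    remaining = total_epochs - len(warmup)
    if remaining <= 0:
        raise ValueError("warmup would consume every epoch")
    return warmup + [2 ** max_exp] * remaining


if __name__ == "__main__":
    print(slw_epoch_lengths(20))
```

For the $2^{20}$ model this gives six warmup epochs followed by four epochs at full length, matching the example in the text.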

The learning rate for all Hyena models is $\mathtt{4e-5}$, while the learning rate for all Mamba models is $\mathtt{1e-4}$. These were found by performing learning rate sweeps for each model among $\{1e-5, 2e-5, 4e-5, 1e-4, 2e-4\}$ for the smaller sequence lengths $(2^{10}, 2^{12}, 2^{14}, 2^{16})$, and these values were consistently found to be the best for each model. An abridged learning rate sweep was done at length $2^{18}$, which agreed with these values, and a single run at length $2^{20}$ was performed (as described above, the computational cost of these experiments is proportional to the sequence length). The learning rate followed a schedule of 5 epochs of linear warmup to the maximum learning rate, followed by 5 epochs of cosine decay down to $1e-6$. The unusually long learning rate warmup schedule was chosen because the sequence length warmup was also long (e.g. comprising 6 out of 10 epochs for the model with context length $2^{20}$); we did not experiment with this choice.
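A minimal sketch of such a schedule is given below, using the Mamba maximum learning rate of $1e-4$ and 1024 steps per epoch from the text. The exact warmup starting value and step indexing are not specified above, so the choices here (linear ramp from near zero, zero-based steps) are assumptions, and the function name is illustrative.

```python
import math


def learning_rate(step: int, max_lr: float = 1e-4, min_lr: float = 1e-6,
                  steps_per_epoch: int = 1024, warmup_epochs: int = 5,
                  decay_epochs: int = 5) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (zero-based step)."""
    warmup_steps = warmup_epochs * steps_per_epoch
    decay_steps = decay_epochs * steps_per_epoch
    if step < warmup_steps:
        # Assumed ramp: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```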

Results for the Species classification task are in `\cref{tab:species}`{=latex}.

```{=latex}
\centering
```
  `\sc{Model}`{=latex}   `\sc{Params}`{=latex}   $2^{10}$   $2^{12}$   $2^{14}$   $2^{16}$   $2^{18}$   $2^{20}$
  ---------------------- ----------------------- ---------- ---------- ---------- ---------- ---------- -----------
  HyenaDNA               1.4M                    28.04      28.43      41.17      42.22      31.10      54.87
  Mamba                  1.4M                    31.47      27.50      27.66      40.72      42.41      **71.67**
  Mamba                  7M                      30.00      29.01      31.48      43.73      56.60      **81.31**

  :  (**Great Apes DNA Classification**.) Accuracy (%) after fine-tuning on sequences of length $2^{10}=1024$ up to $2^{20}=1048576$ using pretrained models of the same context length. Each numeric column corresponds to one sequence length. Random guessing is 20%.

`\label{tab:species}`{=latex}

Audio Details {#sec:exp-details:audio}
-------------

### YouTubeMix Audio Pretraining

#### Model.

We use a model with 3 blocks per stage ($3\times5=15$ total Mamba blocks), pooling factor $p=16$, and outer dimension $D=64$, for about 3.5M parameters.

#### Dataset.

The data is mu-law encoded at 8 bits, so the model is modeling discrete tokens with a vocab size of $256$.

The dataset consists of clips up to 1 minute long, or length $960000$, which are subsampled and divided into segments of any desired sequence length. Since the architecture involves two stages of pooling by a factor of $16$, and we want the resulting sequence length to be a multiple of $8$ for hardware efficiency, the longest possible sequence is $468 \times 2048 = 958464$. The rest of our sequence lengths are defined by successively halving this and rounding up to the nearest multiple of $2048$.

`\cref{tab:youtubemix-lengths}`{=latex} lists the specifications used in `\cref{fig:youtubemix}`{=latex}. Beyond the varying batch sizes, the number of valid segments in the training set varied between different sequence lengths (e.g. the number of training steps per epoch was not constant for different points in the graph), which may have contributed to kinks in the scaling curves.

```{=latex}
\centering
```
    `\sc{Sequence length}`{=latex} `\sc{Batch size}`{=latex}   `\sc{Tokens / batch}`{=latex}
  -------------------------------- --------------------------- -------------------------------
        $468 \times 2048 = 958464$ $1$                         $958464$
        $234 \times 2048 = 479232$ $2$                         $958464$
        $117 \times 2048 = 239616$ $4$                         $958464$
         $59 \times 2048 = 120832$ $8$                         $966656$
          $30 \times 2048 = 61440$ $16$                        $983040$
          $15 \times 2048 = 30720$ $32$                        $983040$
           $8 \times 2048 = 16384$ $64$                        $1048576$
            $4 \times 2048 = 8192$ $128$                       $1048576$

  : YouTubeMix length scaling sequence lengths and batch sizes.

`\label{tab:youtubemix-lengths}`{=latex}
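The sequence lengths and batch sizes in the table can be reproduced with a short sketch. It assumes the unit $2048 = 16 \times 16 \times 8$ (two pooling stages of $16$ times the multiple of $8$) and that the batch size doubles at each halving, as the table shows; the names are illustrative.

```python
from typing import List, Tuple

UNIT = 16 * 16 * 8   # two pooling stages of 16, times a multiple of 8 -> 2048
CLIP_LEN = 960000    # maximum clip length stated in the text


def youtubemix_schedule(clip_len: int = CLIP_LEN, unit: int = UNIT,
                        count: int = 8) -> List[Tuple[int, int, int]]:
    """Return (sequence_length, batch_size, tokens_per_batch) rows."""
    multiple = clip_len // unit  # largest whole number of units in a clip: 468
    batch_size = 1
    rows = []
    for _ in range(count):
        seq_len = multiple * unit
        rows.append((seq_len, batch_size, seq_len * batch_size))
        multiple = (multiple + 1) // 2  # halve, rounding up to a whole unit
        batch_size *= 2
    return rows


if __name__ == "__main__":
    for seq_len, batch_size, tokens in youtubemix_schedule():
        print(seq_len, batch_size, tokens)
```

Because of the rounding, the tokens per batch drift upward from $958464$ to $1048576$ at the shortest lengths rather than staying exactly constant.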

#### Training.

Models were trained for $200K$ training steps with a maximum learning rate of $0.002$, $20K$ (10%) warmup steps, and weight decay $0.1$ (similar to our general pretraining recipe across domains).

#### Additional Ablations: SSM Parameterizations.

We investigate SSM parameterizations on long-form audio waveform pretraining in the setting of `\cref{fig:youtubemix}`{=latex}. The setting is modified slightly to use larger models ($8$ layers and $D=64$ for 6M params, the SaShiMi default), shorter sequences ($2^{11}=2048$ to $2^{18}=262144$ instead of $2^{13}$ to $2^{20}$), lower LR ($0.001$ from $0.002$), and shorter training cycles (100K instead of 200K steps).

`\cref{fig:youtubemix-ablations}`{=latex} shows that the change from S4 $\to$ S6 (i.e. the selection mechanism) is not always beneficial. On long-form audio waveforms, it in fact significantly hampers performance, which may be intuitive from the point of view that audio is uniformly sampled and very smooth, and therefore benefits from continuous linear time-invariant (LTI) methods. After ablating away the selection mechanism, note that the resulting model is the S4 layer inside the Mamba block. To disambiguate, we call this Mamba-S4 as opposed to the default Mamba architecture, Mamba-S6.

However, on the right side, we keep the outer layers of the U-Net as Mamba-S4 and ablate only the inner layers. The performance differences shrink dramatically; this reinforces the hypothesis that layers closer to the *raw* audio signal should be LTI, but once the signal is "tokenized" and compressed by the outer layers, the inner layers no longer need to be LTI. In this setting, however, the real-valued SSM still underperforms the complex-valued one.

```{=latex}
\centering
```
![ (**Audio Pretraining (YouTubeMix) Ablations**.) As a uniformly-sampled "continuous" signal modality, audio waveforms actually benefit from LTI models which have matching inductive bias. (*Left*) Homogeneous models (all blocks have the same parameterization). (*Right*) Only the center U-Net blocks are ablated; the outer blocks are Mamba-S4. The purple line is the same as in the left panel. ](fig/youtubemix_ablations.png){#fig:youtubemix-ablations width="\\linewidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![ (*Right panel*.) Only the center U-Net blocks are ablated; the outer blocks are Mamba-S4. The purple line is the same as in the left panel. ](fig/youtubemix_ablations_center.png){#fig:youtubemix-ablations width="\\linewidth"}

```{=latex}
\captionsetup{type=figure}
```
### SC09 Speech Generation

Autoregressive training largely followed the autoregressive language modeling protocol, including:

-   Weight decay $0.1$

-   Learning rate warmup for 10% of total steps

-   AdamW optimizer with $\beta=(0.9, 0.95)$

-   Gradient clip value $0.1$

We used a learning rate of $0.002$ and $200000$ training steps at a batch size of $16$.

The large Mamba model in `\cref{tab:sc09}`{=latex} has 15 layers per stage with an outer dimension of $D=96$ and pooling factor $4$. We note that this dataset is small (training went through 100 epochs) and for this large model, there was significant overfitting of the BPB or NLL. However, automated metrics of generated samples continually improved throughout training.

The models in the architecture ablations in `\cref{tab:sc09-ablations}`{=latex} all have 8 layers per stage with an outer dimension of $\mathtt{D}=64$ and pooling factor $4$. The S4+MLP block has roughly $2D^2 + 4D^2$ parameters (expansion factor $2$ in the MLP). The Transformer block has $4D^2 + 2D^2$ parameters (expansion factor $1$ in the MLP). The Mamba block has the usual $\approx 6D^2$ parameters. All models have roughly 6M total parameters.
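As a rough check of the parameter matching above, the per-block counts can be tallied as follows. The sketch assumes that an MLP with expansion factor $E$ has about $2ED^2$ weights and that attention has about $4D^2$ (query, key, value, and output projections); the S4 and Mamba figures are taken directly from the text. Biases, normalization, and other small terms are ignored, and the function names are illustrative.

```python
from typing import Dict


def mlp_params(d: int, expansion: int) -> int:
    """Up- and down-projection of an MLP with the given expansion factor."""
    return 2 * expansion * d * d


def attention_params(d: int) -> int:
    """Query, key, value, and output projections."""
    return 4 * d * d


def s4_params(d: int) -> int:
    """Approximate S4 layer size quoted in the text (about 2 D^2)."""
    return 2 * d * d


def mamba_params(d: int) -> int:
    """Approximate Mamba block size quoted in the text (about 6 D^2)."""
    return 6 * d * d


def block_params(d: int) -> Dict[str, int]:
    """Approximate parameters per block for the three ablated block types."""
    return {
        "S4+MLP": s4_params(d) + mlp_params(d, expansion=2),
        "Transformer": attention_params(d) + mlp_params(d, expansion=1),
        "Mamba": mamba_params(d),
    }


if __name__ == "__main__":
    print(block_params(64))
```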

Efficiency Benchmark {#sec:exp-details:benchmark}
--------------------

#### Scan Operation.

We compare the core operation of selective SSMs, which is the parallel scan (`\cref{sec:method:scan}`{=latex}), against convolution and attention, measured on an A100 80GB PCIe GPU. Note that these do not include the cost of other operations outside of this core operation, such as computing the convolutional kernel in global-convolution models, or computing the QKV projections in attention.

As a baseline, we implement a standard parallel scan in PyTorch with no kernel fusion. This requires materializing the parameters $\dA, \dB, \C$ in HBM.

Our scan implementation fuses the discretization step and the parallel scan, avoiding the cost of materializing all the large parameters in HBM.
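To make the scan concrete, here is a minimal pure-Python sketch of the associative operator behind a parallel scan for a scalar recurrence $h_t = a_t h_{t-1} + b_t$, where $a_t$ stands in for the discretized state matrix and $b_t$ for the discretized input term. It is neither the benchmarked PyTorch baseline nor the fused kernel; it uses a simple doubling (Hillis-Steele style) sweep, and all names are illustrative.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (a_t, b_t) for the step h_t = a_t * h_{t-1} + b_t


def combine(left: Pair, right: Pair) -> Pair:
    """Compose two affine steps into one; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential_scan(pairs: Sequence[Pair]) -> List[float]:
    """Reference recurrence with initial state h = 0."""
    h = 0.0
    out = []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out


def doubling_scan(pairs: Sequence[Pair]) -> List[float]:
    """Inclusive scan using log2(L) sweeps; each sweep is independent across positions."""
    prefix = list(pairs)
    offset = 1
    while offset < len(prefix):
        # Every position reads only values from the previous sweep.
        prefix = [
            combine(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
    # With h_{-1} = 0, the state is the offset term of the composed prefix.
    return [b for _, b in prefix]
```

Each sweep of the `while` loop touches all positions independently, which is what a parallel implementation would exploit; the Python version simply runs them one after another.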

For convolution, we use the standard implementation in PyTorch, which separately performs FFTs on the inputs and the filters, multiplies them in the frequency domain, then performs an inverse FFT to obtain the result. The theoretical complexity is $O(L \log (L))$ for sequence length $L$.

For attention, we compare against the fastest implementation that we are aware of (FlashAttention-2 [@dao2023flashattention2]), with causal mask. Note that FlashAttention-2 with causal mask is about 1.7$\times$ faster than without causal mask, since approximately only half of the attention entries are computed.

We use a batch size of 1 and increase the sequence length from $2^9=512$, $2^{10}\approx 1K$, $2^{11}\approx 2K$, up to $2^{19} \approx 500K$ (some of the baselines run out of memory before reaching 500K). We use a model dimension of $D = 1024$ and state dimension $N = 16$. We measure with BF16 inputs, which is the data type most commonly used for large scale training.

#### End-to-end Inference.

We measure the inference throughput of a Mamba 1.4B model and an untrained Mamba 6.9B model, against a standard Transformer (GPT3 architecture) at 1.3B and 6.7B size. We use the standard Transformer implementation in the Huggingface `transformers` library.

We set the prompt length to be 2048 and the generation length to be 128. We vary the batch size from 1, 2, 4, 8, 16, 32, 64, to 128, and measure the time taken to generate 128 tokens. We then calculate the throughput (tokens/s) as $\text{batch size} \times 128 / \text{time taken}$. We repeat the measurements 3 times and take the average. Measurements are done on an A100 80GB PCIe GPU.
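The throughput calculation can be sketched as a small timing harness. The `generate` callable is a hypothetical stand-in for whatever runs the model, and averaging the elapsed times (rather than the per-run throughputs) is our assumption, since the text does not say which is averaged. On a GPU, the device would also need to be synchronized before each timer read.

```python
import time
from statistics import mean
from typing import Callable


def measure_throughput(generate: Callable[[int, int, int], object], batch_size: int,
                       prompt_len: int = 2048, gen_len: int = 128, repeats: int = 3,
                       timer: Callable[[], float] = time.perf_counter) -> float:
    """Tokens per second: batch_size * gen_len / mean elapsed time over `repeats` runs."""
    elapsed = []
    for _ in range(repeats):
        start = timer()
        generate(batch_size, prompt_len, gen_len)  # hypothetical model call
        elapsed.append(timer() - start)
    return batch_size * gen_len / mean(elapsed)


if __name__ == "__main__":
    def fake_generate(batch_size: int, prompt_len: int, gen_len: int) -> None:
        time.sleep(0.01)  # placeholder workload, not a real model

    for bs in (1, 2, 4, 8, 16, 32, 64, 128):
        print(bs, round(measure_throughput(fake_generate, bs), 1))
```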

#### Memory Benchmark.

The memory usage simply scales proportionally to the size of the activation tensors, as with most deep sequence models. We report measurements of the training memory requirements of 125M models on 1 A100 80GB GPU. Each batch consists of sequences of length 2048. We compare to the most memory-efficient Transformer implementation we are aware of (with kernel fusion from `torch.compile` and with FlashAttention-2). `\cref{tab:memory}`{=latex} shows that Mamba's memory requirement is comparable to a similar-sized Transformer with an extremely optimized implementation, and we expect further improvement in Mamba's memory footprint in the future.

```{=latex}
\centering
```
  Batch size   Transformer (w/ FlashAttention-2)   Mamba
  ------------ ----------------------------------- --------
  1            4.6GB                               4.8GB
  2            5.2GB                               5.8GB
  4            6.9GB                               7.3GB
  8            11.5GB                              12.3GB
  16           20.7GB                              23.1GB
  32           34.5GB                              38.2GB

  : (**Memory benchmark**.) Mamba's memory footprint is comparable to the most optimized Transformer. Results for 125M models.

`\label{tab:memory}`{=latex}

[^1]: Alphabetical by first name.