Mixture-of-Depths Attention

Source

Credibility

This is a recent arXiv preprint, submitted on 2026-03-16. The official repository lists authors from Huazhong University of Science and Technology and ByteDance Seed, includes a code release, and links the paper, blog post, and X/Twitter announcement thread. It is credible enough to track as an important architecture source, but it has not been peer reviewed, so benchmark and efficiency claims should be treated as author-reported until independently replicated.

Core Claim

MoDA reframes inter-layer communication in deep Transformers as retrieval rather than accumulation. Instead of forcing every layer to inherit only a residual-stream blend, each attention head can jointly attend to current sequence key/value pairs and depth key/value pairs from preceding layers under one softmax. The paper argues that this mitigates depth-wise information dilution while keeping the operator practical through a FlashAttention-style fused implementation.

Method

The paper contrasts three depth-stream interfaces:

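  • Depth residual: the standard residual stack compresses all prior computation into one hidden state, in its simplest form $h_{l+1} = h_l + f_l(h_l)$, where $f_l$ is the block at layer $l$ (notation assumed here).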

  • Depth dense: prior layer states are explicitly preserved and projected back, but parameter and compute costs scale poorly for large models.

  • Depth attention / MoDA: each layer writes depth key/value memories, and later layers retrieve from them with content-based attention.

For one head, the useful abstraction is:
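
One way to write this, with notation assumed for this note rather than copied from the paper: $q$ is the query at the current layer and position, $K_{\text{seq}}, V_{\text{seq}}$ are the sequence keys/values visible to it, $K_{\text{dep}}, V_{\text{dep}}$ are the depth keys/values written by preceding layers, $[A; B]$ stacks rows, and $d$ is the head dimension. The $\sqrt{d}$ scaling is the usual attention convention and is also an assumption.

```latex
\[
o \;=\; \operatorname{softmax}\!\left(
      \frac{q\,[K_{\text{seq}};\,K_{\text{dep}}]^{\top}}{\sqrt{d}}
    \right)\,[V_{\text{seq}};\,V_{\text{dep}}]
\]
```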

Here sequence keys/values and depth keys/values compete in the same probability space. This makes the layer-depth interface closer to retrieval over available state than to fixed layer summation or residual accumulation.
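
A minimal pure-Python sketch of that shared probability space follows. It is an illustration under stated assumptions, not the paper's fused kernel: the function name `unified_attention`, the toy vectors, and the `1/sqrt(d)` scaling are all hypothetical, and causal masking, batching, and multiple heads are left out.

```python
import math

def unified_attention(q, seq_k, seq_v, depth_k, depth_v):
    """One query attends to sequence and depth KV slots under a single softmax.

    Returns (output vector, weights on sequence slots, weights on depth slots).
    Hypothetical helper for illustration; not the paper's implementation.
    """
    keys = seq_k + depth_k        # concatenate the two slot lists
    values = seq_v + depth_v
    scale = 1.0 / math.sqrt(len(q))  # assumed standard attention scaling
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)                # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    n_seq = len(seq_k)
    return out, weights[:n_seq], weights[n_seq:]

# Toy inputs (made up): two sequence slots and one depth slot.
q = [1.0, 0.0]
seq_k = [[1.0, 0.0], [0.0, 1.0]]
seq_v = [[1.0, 0.0], [0.0, 1.0]]
depth_k = [[2.0, 0.0]]
depth_v = [[5.0, 5.0]]

out, w_seq, w_dep = unified_attention(q, seq_k, seq_v, depth_k, depth_v)
print("sequence weights:", w_seq)
print("depth weights:", w_dep)
```

Because there is one normalizer, any mass the depth slot gains is taken from the sequence slots, which is the competition the paragraph above describes.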

flowchart LR
  Hidden[Current hidden state]
  SeqKV[Sequence KV]
  DepthKV[Depth KV from previous layers]
  Softmax[Unified softmax]
  Output[MoDA output]
  Write[Write current KV to depth stream]
  Hidden --> SeqKV
  Hidden --> Softmax
  SeqKV --> Softmax
  DepthKV --> Softmax
  Softmax --> Output
  Output --> Write
  Write --> DepthKV

Evidence And Results

  • The paper reports that the fused hardware-aware kernel reaches 97.3% of FlashAttention-2 efficiency at sequence length 64K.
  • The main language-model experiments train 700M and 1.5B decoder-only models with the OLMo2 400B-token recipe.
  • In the main 1.5B setting, the paper reports an average perplexity reduction of 0.2 across 10 validation benchmarks and an average downstream improvement of 2.11 percentage points across 10 tasks.
  • The layer-number ablation reports gains for both 48-layer and 24-layer small models, with stronger gains for post-norm in the deeper setting.
  • The attention visualizations show non-trivial probability mass assigned to depth-KV slots in middle and late layers, and the paper reports reduced attention-sink behavior.
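
The last point can be made concrete with a small diagnostic. This is a sketch under assumptions, not the paper's measurement code: `attention_mass_report` is a hypothetical helper, the example row is made up, and "sink mass" is taken here to mean the weight on the first sequence position.

```python
def attention_mass_report(weights, n_seq):
    """Split one unified-softmax row into depth mass and first-token (sink) mass.

    weights: attention weights over sequence slots followed by depth slots.
    n_seq:   number of sequence slots at the front of the row.
    """
    if not 0 < n_seq <= len(weights):
        raise ValueError("n_seq must be between 1 and len(weights)")
    total = sum(weights)
    return {
        "depth_mass": sum(weights[n_seq:]) / total,  # share on depth-KV slots
        "sink_mass": weights[0] / total,             # share on the first token
    }

# Made-up row: three sequence slots followed by two depth slots.
row = [0.50, 0.10, 0.10, 0.20, 0.10]
report = attention_mass_report(row, n_seq=3)
print(report)
```

Averaging these two numbers per layer would give the kind of middle-versus-late-layer comparison the visualizations are describing.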

X Thread Notes

The official X thread is useful because it gives the authors’ conceptual framing and follow-up boundary conditions.

  • The thread frames residual connections as a decade-old communication bottleneck: deep models may have many layers on paper, but residual accumulation can bury earlier signals in one blended stream.
  • The thread distinguishes MoDA from DenseNet, DenseFormer, Hyper-Connections, and MUDDFormer by saying those methods improve layer blending, while MoDA changes the interface to retrieval: query means “what do I need?”, key means “what do I have?”
  • The author describes Flash Depth Attention as the hardware step that makes depth attention trainable at scale, then describes MoDA as the fusion of sequence retrieval and depth retrieval into one operator.
  • The thread’s result claim is narrower than “solved depth scaling”: it says the model actively uses cross-layer retrieval, the attention-sink phenomenon diminishes, and OLMo2 baselines improve under the paper’s recipe.
  • In an author reply, Lianghui Zhu distinguishes MoDA/FDA from Attention Residuals by saying Attention Residuals use fixed depth queries and simplified key/value transformations for efficiency, while MoDA keeps data-dependent representations through a hardware-aware operator. The exact reply URL is preserved in the local X thread snapshot.
  • Other author replies say larger models, MoE variants, and sparse looping are planned follow-ups; the current paper should not be treated as final evidence for those settings. The exact reply URLs are preserved in the local X thread snapshot.
  • A reply about DeiT vision experiments says only a partial ImageNet training run was done, with MoDA stably outperforming the original DeiT after about two fifths of the epochs. That is a weak exploratory signal, not a completed vision benchmark. The exact reply URL is preserved in the local X thread snapshot.

Limitations

  • The source is a preprint with author-reported results.
  • Evidence is language-model centered; no numeric time-series, observability, robotics, or control benchmark is tested.
  • The mechanism adds a depth-KV cache whose memory and bandwidth cost grow with depth; the paper itself proposes bounded depth-KV slot caching as future work.
  • MoDA is not token-wise Mixture-of-Depths routing in the older dynamic-depth sense. It is a unified attention operator over sequence and layer-depth memories.
  • The current implementation is still presented as research software; the discussion explicitly says additional CUDA engineering is needed for industrial-scale training.
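
To give the depth-KV cost a rough scale, here is a back-of-the-envelope estimate. It assumes every layer writes one key and one value vector per token and per KV head, which is an assumption of this note and may not match the paper's layout; the function name and the configuration values are illustrative, not taken from the paper.

```python
def depth_kv_bytes(n_layers, n_tokens, n_kv_heads, head_dim,
                   bytes_per_elem=2, max_slots=None):
    """Estimate the size of a depth-KV cache in bytes.

    Assumes one key and one value vector per layer, token, and KV head.
    max_slots caps how many layer slots are kept (a bounded-slot variant).
    """
    slots = n_layers if max_slots is None else min(n_layers, max_slots)
    per_slot = 2 * n_tokens * n_kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return slots * per_slot

# Illustrative configuration: 4096 tokens, 8 KV heads, head_dim 128, 2-byte elements.
full = depth_kv_bytes(n_layers=48, n_tokens=4096, n_kv_heads=8, head_dim=128)
half = depth_kv_bytes(n_layers=24, n_tokens=4096, n_kv_heads=8, head_dim=128)
bounded = depth_kv_bytes(n_layers=48, n_tokens=4096, n_kv_heads=8, head_dim=128,
                         max_slots=8)

mib = 1024 * 1024
print(full // mib, half // mib, bounded // mib)  # sizes in MiB
```

Under this assumption the cost is linear in the number of retained layer slots, so capping the slots is what bounds it.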

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute and layer allocation | adjacent | MoDA lets each token/head retrieve from prior layer states instead of only inheriting a residual blend. | It does not allocate variable compute per time-series window, channel, or candidate future; every evaluated model still pays the MoDA path. |
| Representation quality: intermediate state access | adjacent | Depth-KV retrieval makes intermediate layer state an explicit API inside the model rather than only a probing target. | No evidence that this preserves dense numeric state, action effects, or rare regimes in time-series models. |
| Streaming state and serving efficiency | adjacent | The hardware-aware kernel and depth-KV layout are evaluated against FlashAttention-2-style long-context execution. | Depth cache growth, bounded slots, batching, and latency must be re-evaluated for always-on numeric streams. |
| Control and counterfactuals | insufficient evidence | The paper speculates that MoDA can transfer to world models. | No action-conditioned rollout, candidate-action ranking, intervention modeling, or planner-in-the-loop evaluation. |
| Benchmarks: what level of modeling is tested? | warning | Strong author-reported OLMo2 comparisons, ablations, and kernel timings are useful architecture evidence. | Needs independent replication and matched-budget comparisons against deeper/wider/looped baselines. |

Open Questions

  • Can bounded depth-KV slot selection become a reusable memory interface for time-series models, or does it mainly help token-level language modeling?
  • Under matched latency and memory, is content-based depth retrieval better than adding unique layers, looping a shared block, using recurrent memory, or routing sparse experts?
  • Can MoDA-style depth retrieval make intermediate representations specialize more cleanly, reducing the need for manual layer selection or fixed layer aggregation?
  • Which depth slots should be kept when cache budget is fixed: recent layers, high-attention layers, task-specialized layers, or learned persistent slots?
  • Does reduced attention-sink behavior improve long-horizon temporal state tracking, or is it mostly a language-model attention-pattern artifact?
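
The slot-retention question can be framed as a small selection problem. The sketch below compares two of the candidate policies, keeping the most recent layers versus keeping the layers with the highest attention mass. Everything in it is hypothetical: the function `select_depth_slots`, the policy names, and the mass values are invented for illustration and do not come from the paper.

```python
def select_depth_slots(attn_mass, budget, policy="recent"):
    """Choose which earlier-layer depth slots to keep under a fixed budget.

    attn_mass[i] is a measured or assumed attention mass on the slot written
    by layer i. Returns the kept layer indices in ascending order.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    n = len(attn_mass)
    budget = min(budget, n)
    if policy == "recent":
        kept = list(range(n - budget, n))        # last `budget` layers
    elif policy == "top_mass":
        ranked = sorted(range(n), key=lambda i: attn_mass[i], reverse=True)
        kept = ranked[:budget]                   # layers with the most mass
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(kept)

# Made-up per-layer masses for six earlier layers.
mass = [0.30, 0.05, 0.02, 0.25, 0.08, 0.10]
recent = select_depth_slots(mass, budget=3, policy="recent")
top = select_depth_slots(mass, budget=3, policy="top_mass")
print(recent, top)
```

When the two policies disagree, as they do on this toy input, the gap between them is the quantity a matched-budget ablation would need to measure.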