Looped Transformers And Test-Time Memory

Summary

This topic tracks the post-Universal Transformer branch where models reuse computation over depth, retrieve from prior layer state, carry explicit segment memory, maintain explicit memory at test time, or combine these mechanisms. The common question is not just “can a model think longer?” It is which state is refined, retrieved, or carried forward by extra computation: token representations, depth key/value memories, learned memory slots, associative key-value memory, recurrent segment state, or recursive puzzle state.

For the wiki’s time-series and world-model agenda, this branch is architecture background. These sources should be cited as upstream evidence about dynamic compute, recurrent depth, test-time memory, or matched-budget evaluation, not as direct proof for numeric telemetry or action-conditioned world models.

flowchart LR
  UT[Universal Transformer]
  Depth[Looped and recurrent-depth Transformers]
  Memory[Test-time memory]
  Recursive[Small recursive reasoning models]
  UT --> Depth
  UT --> Recursive
  UT --> Memory
  Depth --> MoDA[MoDA / depth retrieval]
  Memory --> RMT[RMT / segment memory tokens]
  RMT --> ARMT[ARMT / associative memory]
  RMT --> RATE[RATE / action trajectories]
  Depth --> Huginn[Huginn / latent reasoning]
  Depth --> Sleep[LLM Sleep / consolidation loops]
  Depth --> Hyperloop[Hyperloop / loop-level hyper-connections]
  Depth --> ELT[ELT / visual generation]
  Depth --> LoopFormer[LoopFormer / elastic depth]
  Depth --> Parcae[Parcae / stable loops]
  Depth --> Sparse[Sparse looped MoE]
  Depth --> Samplers[parallel recurrent-depth samplers]
  ResidualStreams[mHC / matrix residual streams] --> Hyperloop
  Memory --> Titans[Titans]
  Memory --> ATLAS[ATLAS]
  Memory --> MIRAS[MIRAS]
  Memory --> MesaNet[MesaNet]
  Recursive --> HRM[HRM]
  Recursive --> TRM[TRM]
  Recursive --> URM[URM]
  Recursive --> UTM[UT Need Memory]

Main Branches

Recurrent Depth And Latent Reasoning

Universal Transformers provide the root interface: reuse a Transformer block across depth, optionally halt per position, and treat repeated state refinement as a recurrent computation. The modern looped-language-model branch revisits that idea with large-scale pretraining and test-time compute.
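The interface above (one shared block, repeated over depth, with optional per-position halting) can be sketched in a few lines. This is a minimal illustration, not the Universal Transformer implementation: `shared_block` is a hypothetical elementwise map standing in for attention plus MLP, there is no cross-position mixing, and the convergence threshold `tol` is an assumed stand-in for a learned halting unit.

```python
import math

def shared_block(h, w=0.5, b=0.1):
    # Toy stand-in for the shared Transformer block; the same weights are reused on every loop.
    return [math.tanh(w * x + b) for x in h]

def looped_forward(tokens, max_loops=8, tol=1e-3):
    """Refine each position with the same block until it halts or the loop budget runs out."""
    states = [list(t) for t in tokens]
    loops_used = [0] * len(states)
    active = [True] * len(states)
    for _ in range(max_loops):
        if not any(active):
            break
        for i in range(len(states)):
            if not active[i]:
                continue  # halted positions keep their last state
            new_h = shared_block(states[i])
            delta = max(abs(n - o) for n, o in zip(new_h, states[i]))
            states[i] = new_h
            loops_used[i] += 1
            # Assumed halting rule: stop once the update is small.
            if delta < tol:
                active[i] = False
    return states, loops_used

# An "easy" position near the fixed point and a "hard" position far from it.
states, loops_used = looped_forward([[0.2, 0.2], [5.0, -5.0]])
print(loops_used)
```

In this toy the easy position halts after a few loops while the hard one uses most of the budget, which is the per-position dynamic-compute behavior the paragraph describes.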

Scaling up Test-Time Compute with Latent Reasoning (the Huginn model) is the large-scale demonstration in this batch: a recurrent-depth language model can use extra loops at inference to improve reasoning without producing more chain-of-thought tokens. Reasoning with Latent Thoughts supplies the matching theory and small-scale evidence that loops can simulate hidden reasoning steps for iterative problems.

LoopFormer makes loop count a budget-conditioned interface through shortcut consistency. Parcae focuses on stable looped-model scaling laws. Sparse Layers are Critical to Scaling Looped Language Models argues that dense looping under-scales, while Looped-MoE can recover diversity through different expert routing across passes.
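The claim that sparse routing can restore diversity across passes can be made concrete with a toy. The sketch below is an assumption-laden illustration, not the Looped-MoE architecture: the two experts, their keys, and the top-1 `route` function are all hypothetical. It only shows that a shared router can pick different experts on different passes because the state it sees has changed.

```python
def route(state, expert_keys):
    """Top-1 routing: pick the expert whose key has the largest dot product with the state."""
    scores = [sum(s * k for s, k in zip(state, key)) for key in expert_keys]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical experts; each is a fixed toy transform of a 2-d state.
EXPERT_KEYS = [[1.0, 0.0], [0.0, 1.0]]
EXPERTS = [
    lambda h: [0.4 * h[0], h[1] + 1.0],
    lambda h: [h[0] + 1.0, 0.4 * h[1]],
]

def looped_moe(state, num_loops):
    """Apply the same router and expert pool on every pass and record which expert ran."""
    routes = []
    for _ in range(num_loops):
        idx = route(state, EXPERT_KEYS)
        routes.append(idx)
        state = EXPERTS[idx](state)
    return state, routes

final_state, routes = looped_moe([2.0, 0.0], num_loops=3)
print(routes)
```

A dense looped block would apply the identical transform on every pass; here the recorded route alternates between experts even though no parameters differ across loops.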

ELT extends the recurrent-depth branch into visual generation. Its important contribution is not only parameter sharing but also loop-boundary supervision: Intra-Loop Self Distillation (ILSD) makes intermediate loops useful enough for any-time image and video generation. For the wiki, that is an adjacent dynamic-compute pattern to compare with LoopFormer-style budget conditioning and early exits.

Efficient Parallel Samplers for Recurrent-Depth Models shifts from architecture to inference: if extra recurrent depth is useful, generation still needs a sampler that does not turn every additional loop into fully serial latency. The Recurrent Transformer is a related but distinct design where each layer maintains layerwise recurrent memory rather than looping a whole block.

Language Models Need Sleep adds a hybrid memory-consolidation branch. The model loops during the sleep phase before cache eviction, updates SSM fast weights, and then answers later with a single wake-time forward pass. That makes it a useful contrast to Huginn-style prediction-time recurrent depth: extra compute is spent on forming persistent fast state, not on refining the answer token at prediction time.

DiffusionBlocks adds a training-side recurrent-depth bridge. In its Huginn-style experiment, the repeated-depth dynamics are treated as a denoising process so training can use a single forward pass rather than BPTT through many recurrent iterations. This is not a test-time-memory method, but it changes the cost model for training looped-depth systems.

Depth Retrieval Without Looping

MoDA is adjacent to recurrent depth but solves a different interface problem. It does not loop the same block or add halting depth. Instead, it gives later layers content-based access to previous layer key/value memories and fuses sequence retrieval with depth retrieval in one softmax. This makes “which earlier layer should I use?” an attention decision inside the forward pass.

For the wiki, MoDA is useful because it separates two questions that are often merged: spending more depth compute and communicating across depth. A model can have many layers but weak inter-layer access, or fewer/looped layers with a stronger depth-memory interface. The caution is that MoDA introduces a depth-KV cache, so matched-budget comparison must include memory bandwidth and serving latency, not only nominal FLOPs.
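The "one softmax over sequence and depth" idea can be sketched as follows. This is a minimal single-query illustration under stated assumptions: `joint_attention`, `seq_kv`, and `depth_kv` are hypothetical names, and MoDA's actual masking, head layout, and cache format are not reproduced here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_attention(query, seq_kv, depth_kv):
    """Attend over sequence KV pairs and depth KV pairs with a single shared softmax."""
    entries = [("seq", k, v) for k, v in seq_kv] + [("depth", k, v) for k, v in depth_kv]
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for _, k, _ in entries]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    value_dim = len(entries[0][2])
    out = [sum(w * e[2][d] for w, e in zip(weights, entries)) for d in range(value_dim)]
    # Share of attention mass that went to earlier-layer (depth) memories.
    depth_mass = sum(w for w, e in zip(weights, entries) if e[0] == "depth")
    return out, weights, depth_mass

seq_kv = [([0.0, 1.0], [1.0, 0.0]), ([0.0, 1.0], [1.0, 0.0])]
depth_kv = [([4.0, 0.0], [0.0, 1.0])]  # an earlier-layer key aligned with the query
out, weights, depth_mass = joint_attention([1.0, 0.0], seq_kv, depth_kv)
print(round(depth_mass, 3))
```

Because both sources compete inside one normalization, `depth_mass` is a direct readout of how much the layer relied on earlier-layer state. Every depth entry also occupies a cache slot, which is where the memory-bandwidth caution applies.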

Matrix Residual Streams For Looped Depth

mHC is not a looped Transformer paper, but it adds an important residual-state mechanism to this topic. It widens the residual stream into parallel streams and constrains residual mixing through the Birkhoff-polytope/Sinkhorn route, making the “state carried across layers” richer than a single hidden vector while trying to preserve residual stability.
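The Birkhoff-polytope constraint can be illustrated with a plain Sinkhorn normalization: alternately rescaling rows and columns of a positive matrix drives it toward a doubly stochastic matrix. The sketch below is a generic version of that procedure, not mHC's parameterization; `sinkhorn`, `mix_streams`, and the example logits are assumptions made for illustration.

```python
import math

def sinkhorn(logits, iters=100):
    """Alternate row and column normalization so the matrix approaches doubly stochastic."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        row_sums = [sum(row) for row in m]
        m = [[m[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

def mix_streams(streams, mixing):
    """new_stream[i] = sum_j mixing[i][j] * streams[j], applied per feature."""
    n, width = len(streams), len(streams[0])
    return [[sum(mixing[i][j] * streams[j][d] for j in range(n)) for d in range(width)]
            for i in range(n)]

logits = [[0.0, 1.0, -1.0], [0.5, 0.0, 0.2], [-0.3, 0.8, 0.1]]
mixing = sinkhorn(logits)
streams = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
mixed = mix_streams(streams, mixing)
print([round(sum(row), 6) for row in mixing])
```

With unit row sums each output stream is a convex combination of the inputs, and with unit column sums the per-feature total across streams is preserved. That is one way to read the stability goal in the paragraph above.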

Hyperloop Transformers applies that idea at loop boundaries in a middle-cycle looped Transformer. The result is a useful connection between recurrent depth and state capacity: a shared block can be parameter-efficient, but it may need matrix-valued residual state so repeated passes do not collapse into overly similar representations.

For the wiki, this is adjacent dynamic-compute evidence rather than time-series evidence. The right comparison is against unique-depth layers, MoDA-style depth retrieval, explicit memory tokens, and wider ordinary hidden states under matched parameter memory, KV/cache traffic, latency, and expected FLOPs.
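A small accounting helper makes the matched-budget framing concrete. It is a back-of-envelope sketch, not a measurement tool: `DepthBudget` is a hypothetical name, the roughly 2 FLOPs per weight per token rule covers only the matrix multiplies, and attention-score FLOPs, KV/cache traffic, and latency are deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class DepthBudget:
    unique_blocks: int     # distinct parameter sets that must be stored
    loops: int             # how many times the stack of unique blocks is applied
    params_per_block: int

    def stored_params(self) -> int:
        return self.unique_blocks * self.params_per_block

    def flops_per_token(self) -> int:
        # Rule of thumb: about 2 FLOPs per weight per token for the matmuls.
        return 2 * self.params_per_block * self.unique_blocks * self.loops

unique = DepthBudget(unique_blocks=12, loops=1, params_per_block=10_000_000)
looped = DepthBudget(unique_blocks=3, loops=4, params_per_block=10_000_000)

print(unique.stored_params() // looped.stored_params())
print(unique.flops_per_token() == looped.flops_per_token())
```

Under these assumptions the looped configuration stores a quarter of the weights but spends the same matmul FLOPs per token, so any extra state mechanism (parallel streams, depth-KV, memory tokens) has to be charged against the memory it saved.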

Segment-Level Recurrent Memory

Recurrent Memory Transformer (RMT) is the root segment-memory source: a Transformer segment is wrapped with read/write memory tokens, and the updated memory block is passed to the next segment. Unlike recurrent-depth models, the recurrence is over sequence segments rather than repeated refinement of the same tokens.
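The control flow of segment-level recurrence can be sketched as follows. This shows only the interface (read memory, process a segment, write memory, hand off), not the RMT model: `process_segment` is a hypothetical arithmetic stand-in for a Transformer run over memory tokens plus segment tokens, and the averaging write rule is an assumption.

```python
def process_segment(memory, segment):
    """Stand-in for a Transformer pass over [memory tokens + segment tokens]."""
    context = sum(memory) / len(memory)            # "read": segment tokens see the memory
    outputs = [x + context for x in segment]
    summary = sum(segment) / len(segment)
    new_memory = [0.5 * m + 0.5 * summary for m in memory]  # "write": assumed update rule
    return outputs, new_memory

def run_segments(sequence, segment_len, num_memory_tokens=2):
    """Process the sequence segment by segment, carrying memory forward."""
    memory = [0.0] * num_memory_tokens
    outputs = []
    for start in range(0, len(sequence), segment_len):
        segment = sequence[start:start + segment_len]
        seg_out, memory = process_segment(memory, segment)
        outputs.extend(seg_out)
    return outputs, memory

outputs, final_memory = run_segments([1.0, 1.0, 0.0, 0.0], segment_len=2)
print(outputs, final_memory)
```

The second segment's outputs depend on the first segment only through the carried memory, so the segments must be processed in order. This is the sequential-processing cost that distinguishes this branch from looping over depth.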

Associative Recurrent Memory Transformer (ARMT) strengthens the memory-capacity branch by adding layerwise associative memory to RMT. Its BABILong and associative-retrieval evidence makes it useful for thinking about mutable key/value state, but it remains language/long-context evidence rather than direct time-series evidence.

Recurrent Action Transformer with Memory (RATE) adapts the segment-memory interface to offline RL trajectories. It matters here because its Memory Retention Valve is an explicit attempt to avoid losing sparse decision-relevant information across many trajectory segments.

Test-Time Memory And Online Optimization

Titans anchors the memory branch by combining attention as short-term memory with a learned long-term neural memory updated at test time. ATLAS extends this direction with a higher-capacity memory module that optimizes memory from current and past tokens, while MIRAS reframes attention, Titans, and modern recurrent models as associative memory systems defined by an attentional-bias objective, retention gate, and memory learning rule.
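The three ingredients named above (an objective, a retention gate, and a learning rule) can be shown with a linear associative memory updated by gradient steps at inference time. This is a generic sketch, not the Titans, ATLAS, or MIRAS formulation: the squared-error objective, the scalar `retain` factor, and the plain gradient step are assumptions chosen for brevity.

```python
def read(memory, key):
    """Retrieve a value as memory @ key."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]

def update(memory, key, value, lr=0.5, retain=1.0):
    """One test-time step on 0.5 * ||memory @ key - value||^2 with a retention factor."""
    err = [p - v for p, v in zip(read(memory, key), value)]
    # The gradient with respect to the memory matrix is the outer product of err and key.
    return [[retain * memory[i][j] - lr * err[i] * key[j] for j in range(len(key))]
            for i in range(len(memory))]

memory = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(20):
    memory = update(memory, [1.0, 0.0], [2.0, 3.0])   # store the first association
for _ in range(20):
    memory = update(memory, [0.0, 1.0], [-1.0, 4.0])  # store a second, orthogonal key

print([round(x, 3) for x in read(memory, [1.0, 0.0])])
```

With orthogonal keys and `retain=1.0` the second association does not disturb the first. Setting `retain` below 1 trades that persistence for forgetting, and each update is extra inference-time compute that has to be counted as serving cost.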

MesaNet is the online-optimization variant: the model spends inference compute to solve local in-context regression objectives. Titans Revisited is the cautionary reimplementation source: neural memory can help, but the full Titans stack does not automatically dominate strong baselines.
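Solving a local regression objective in context, rather than taking one gradient step per token, can be sketched with a small ridge regression over the key/value pairs seen so far. This is a simplified illustration and not MesaNet's layer: values are scalars, the solver is plain Gauss-Jordan elimination, and `ridge_readout` and `lam` are hypothetical names.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def ridge_readout(keys, values, query, lam=1e-6):
    """Fit w = argmin sum (w.k - v)^2 + lam * |w|^2 on the context, then predict w.query."""
    d = len(query)
    gram = [[sum(k[i] * k[j] for k in keys) + (lam if i == j else 0.0) for j in range(d)]
            for i in range(d)]
    rhs = [sum(k[i] * v for k, v in zip(keys, values)) for i in range(d)]
    w = solve(gram, rhs)
    return sum(wi * qi for wi, qi in zip(w, query))

# Context generated by the hidden rule v = 2*k0 - k1.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, -1.0, 1.0]
prediction = ridge_readout(keys, values, [3.0, 2.0])
print(round(prediction, 3))
```

The solve cost grows with the key dimension and recurs as context changes, which is the sense in which this variant spends inference compute on optimization.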

Dragon Hatchling (BDH) belongs here as a boundary case: it is not a memory add-on to a Transformer, but the architecture itself updates a large recurrent fast state during inference. Its useful transfer is the state contract, sparse positive update pattern, and interpretability claim; its current evidence is language/translation rather than multivariate time series or action-conditioned trajectories.

Scaling Test-Time Compute for Agentic Coding adds an external-memory analogue. It does not loop a Transformer block or train a neural memory module. Instead, it turns prior agent rollouts into structured summaries, then uses those summaries as bounded memory for selection and refinement. This belongs on the topic page as a boundary case: test-time memory can live outside the hidden state, but then summary fidelity, provenance, and omission errors become first-class risks.

TurboQuant is a serving-memory compression boundary case. It does not add neural memory, memory tokens, or looped depth, but it compresses KV-cache and vector-search state while preserving inner-product geometry. The vLLM implementation critique makes this a useful caution for memory methods: compressed state can increase capacity while reducing throughput or adding per-token latency if the serving path must dequantize before attention. For time-series models, the transfer question is whether cached attention, retrieval memory, or latent-state stores can be quantized without erasing rare regimes, cross-channel deviations, exogenous variables, or action history, and whether that quantized path beats FP8/BF16-style baselines in wall-clock serving.
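The transfer question above can be probed with a toy quantizer. The sketch uses plain per-vector symmetric int8 quantization, which is explicitly not TurboQuant's algorithm. It only illustrates two points from the paragraph: inner products can survive quantization well, and a single large component can still erase small ones. All names and values are made up for illustration.

```python
def quantize(vec, bits=8):
    """Symmetric per-vector quantization: integer codes plus one float scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Case 1: a well-scaled cached key keeps its inner product with a query.
key = [0.5, -1.0, 0.25, 0.75]
query = [1.0, 2.0, -1.0, 0.5]
key_codes, key_scale = quantize(key)
ip_error = abs(dot(query, key) - dot(query, dequantize(key_codes, key_scale)))

# Case 2: one dominant channel sets the scale and small deviations round to zero.
spiky = [100.0, 0.3, -0.2]
spiky_codes, spiky_scale = quantize(spiky)

print(ip_error < 0.05, spiky_codes)
```

The dequantize step in case 1 is also where the serving-path cost lives: if attention cannot consume the codes directly, that conversion runs per token.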

Overlap Map

This topic intentionally groups mechanisms that can all look like “more memory” or “more thinking” from a distance. The wiki boundary is the state contract:

Cluster	Shared Question	Important Difference	Route Detailed Comparisons Through
RMT, ARMT, RATE	Can explicit memory carried across chunks replace full attention over long history?	RMT/ARMT carry memory across sequence segments; RATE applies the interface to offline RL trajectories and remains a policy model.	Efficient Recurrent Sequence Models, World Models for RATE boundary cases.
Titans, ATLAS, MIRAS, MesaNet	Can a model update memory or solve an objective during inference?	Titans/ATLAS emphasize neural long-term memory, MIRAS reframes memory objectives and retention, MesaNet spends compute on local optimization.	This page plus Time-Series Benchmark Hygiene for update cost, chunking, and baseline sensitivity.
Dragon Hatchling	Can a sequence model make mutable fast state the core architecture state?	BDH’s memory is a large recurrent state with sparse positive activations, not an external memory module or looped-depth schedule.	Efficient Recurrent Sequence Models plus this page for state size, update cost, and interpretability probes.
MoDA	Can later layers retrieve useful prior layer state directly?	MoDA stores depth KV within a forward pass; it is inter-layer retrieval, not segment memory or test-time memory.	Intermediate-Layer Representations, this page, and Time-Series Scaling And Efficiency.
mHC, Hyperloop Transformers	Can a richer residual stream make repeated depth less parameter-constrained?	mHC is a residual-stream mechanism; Hyperloop applies it at loop boundaries. This is state capacity across depth, not memory carried across time.	This page and Time-Series Scaling And Efficiency.
ELT	Can loop-boundary outputs remain useful at variable inference budgets?	ELT applies looped depth inside visual generation steps and trains intermediate exits with ILSD. This is visual-generation evidence, not TSFM evidence.	This page, Time-Series Scaling And Efficiency, and Vision Foundation Models.
DiffusionBlocks	Can recurrent-depth training avoid BPTT by using local denoising objectives?	DiffusionBlocks trains a Huginn-style recurrent-depth model as a denoiser with single-pass training, while keeping recurrent-depth inference.	This page plus Training Dynamics and Time-Series Scaling And Efficiency.
Mamba, Mamba-2, Mamba-3, ParaRNN, RWKV-TS	Can compact recurrent state replace quadratic attention for serving?	These models keep state mostly as hidden recurrent dynamics rather than explicit memory slots.	Efficient Recurrent Sequence Models, Time-Series Scaling And Efficiency.
Universal-Transformer descendants and recursive small models	Can repeated computation refine the same representation?	The recurrence is over depth/refinement, not necessarily over streaming time or persistent memory.	This page; compare against memory methods only under matched FLOPs, latency, and state size.
LLM Sleep	Can extra recurrent compute make evicted context useful after the KV cache is cleared?	Recurrence happens during memory consolidation and updates SSM fast weights, not during every wake-time prediction.	This page and Efficient Recurrent Sequence Models.

Recursive Reasoning Branch

Hierarchical Reasoning Model, Tiny Recursive Model, Universal Reasoning Model, and Universal Transformers Need Memory form a small-model recursive reasoning cluster. They matter here because they test repeated state refinement under tight parameter budgets.

The lesson for time-series modeling is not that puzzle solvers transfer directly. The useful interface is the separation of latent state, update depth, supervision, memory slots, and halting or recursion schedule. Universal Transformers Need Memory is especially relevant because it makes the depth-state tradeoff explicit: extra recurrent passes and explicit memory tokens can substitute for one another up to a point.

Hyperloop Transformers adds the parallel residual-stream counterpart to that memory-token story. It is not evidence for time-series or reasoning performance, but it suggests that looped-depth models may need an explicit state-capacity budget even when the state is internal to the residual stream rather than exposed as memory tokens.

Time-Series Reading Frame

For time-series and digital-world agents, the key comparison is:

Mechanism	State being refined	Candidate TSFM use	Caution
Looped depth	token or patch representations	spend more compute on hard windows, rare regimes, or candidate futures	must beat unique-depth baselines under matched FLOPs, latency, and memory
Depth-KV retrieval	previous layer key/value memories	retrieve intermediate state without manually choosing a layer	cache growth and memory bandwidth may dominate if depth slots are unbounded
Matrix residual streams	parallel residual streams across depth or loop boundaries	add state capacity to repeated-depth models without doubling unique layer parameters	memory access, fused kernels, and state-capacity accounting become part of the serving contract
Segment-level recurrent memory	learned memory tokens or associative memory carried between segments	retain state across long windows without attending over all prior observations	sequential segment processing, memory overwrite, and BPTT stability become central
Test-time memory	persistent learned memory state	retain long context, cross-variate history, and system context	update cost and memory capacity become serving constraints
Core fast-state architecture	large mutable recurrent state	keep streaming latent state inside the model rather than external memory	current evidence is language/translation; numeric streams and action channels are untested
Sleep-time consolidation	SSM fast-weight state before eviction	refresh latent state at finite-window boundaries in infinite streams	scheduling, training stability, and matched wall-clock latency dominate the claim
Block-wise denoising training	loop or depth update treated as denoising step	reduce BPTT or full-depth training memory for recurrent-depth models	local objectives must preserve global state and pretrained conversion is unproven
Structured rollout summaries	external compressed experience	reuse prior action/observation evidence without replaying raw traces	summary omission can erase action-relevant state
KV-cache and retrieval-memory quantization	compressed cached vectors	reduce memory footprint and indexing cost for retained sequence or retrieval state	inner-product fidelity may protect recall while still losing rare or action-relevant state; dequantization can erase serving gains
Online optimization	in-context objective solution	adapt to local regimes without full fine-tuning	extra inference FLOPs can erase efficiency gains
Recursive small models	compact latent puzzle state	study minimal recurrence and supervision structure	puzzle evidence is not telemetry evidence
Early exits	loop-boundary outputs	fast path for easy windows, uncertainty signal for hard windows	convergence is a diagnostic, not automatically calibrated uncertainty

Relation To Existing Topics

Time-Series Scaling And Efficiency should cite this branch when discussing dynamic compute, looped depth, early exits, and memory/latency tradeoffs.
Efficient Recurrent Sequence Models covers compact recurrent state, segment-level recurrent memory, and parallel training. This page covers how that branch relates to repeated Transformer depth and explicit test-time memory.
Foundation Time-Series Model Research Agenda should treat this branch as adjacent dynamic-compute evidence unless a paper directly tests multivariate time-series latent state or action-conditioned trajectories.
Mixture Of Experts becomes relevant when looped models use sparse experts to recover expressivity across repeated passes.

Open Questions

When does extra loop compute outperform a wider or deeper unique-weight model at the same expected training FLOPs, serving latency, and memory footprint?
What should a time-series memory store: recent values, channel relationships, latent regimes, exogenous context, interventions, or candidate futures?
Should memory in action-conditioned time series be ordinary hidden state, explicit memory tokens, associative key/value memory, or a hybrid with a retention valve?
Can loop convergence, memory-update magnitude, or halting depth become calibrated uncertainty signals for time-series systems?
Does test-time memory preserve rare regimes and cross-channel deviations, or mostly improve recall-style tasks?
Can KV-cache or retrieval-memory quantization preserve rare regimes and action-relevant latent state while improving wall-clock serving, or does inner-product fidelity mostly protect recall-style tasks?
Which interface is easier to serve in always-on systems: compact recurrent state, explicit memory slots, cached attention, or elastic looped depth?
Which recursive-reasoning mechanisms transfer from discrete puzzle grids to numeric trajectories, event streams, or action-conditioned world models, and which only work because the puzzle state is small, discrete, and fully observed?
Are explicit memory tokens, matrix residual streams, depth-KV retrieval, wider hidden states, and unique-depth layers additive or substitutable ways to increase state capacity in recurrent-depth models under matched latency, memory, and expected-FLOPs budgets?
Can ILSD-style loop-boundary supervision make recurrent-depth time-series models genuinely useful at multiple inference budgets, or does it only stabilize visual generation exits?
Can DiffusionBlocks-style denoising objectives train recurrent-depth time-series models without erasing long-range state that BPTT would otherwise expose?
