Recurrent Action Transformer with Memory

Source

Credibility

RATE was first submitted to arXiv in 2023 and the current arXiv page resolves to a 2026 revision. OpenReview lists it as an ICLR 2026 Poster, which makes it a current tier-1 accepted source for memory-augmented offline RL.

Core Claim

RATE adapts RMT-style recurrent memory to offline reinforcement learning by processing trajectories as segmented sequences of return-to-go, observation, and action tokens, then passing learned memory embeddings between segments.

The key contribution for this wiki is not “a world model.” RATE is a memory-augmented policy/decision model: it preserves sparse historical cues across long partially observed episodes so action prediction can condition on information outside the Transformer context window.

Key Contributions

  • Introduces Recurrent Action Transformer with Memory for offline RL in partially observable environments.
  • Combines memory embeddings, cached hidden states, and a Memory Retention Valve (MRV).
  • Uses a read/write memory-token layout around trajectory segments, inherited from RMT-style causal memory.
  • Evaluates memory-intensive environments including T-Maze, ViZDoom-Two-Colors, Memory Maze, Minigrid-Memory, and POPGym.
  • Reports competitive or strong results on standard Atari and MuJoCo offline RL benchmarks.
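
The read/write layout mentioned above can be illustrated with a small sketch. It assumes the RMT-style convention for causal models, where memory tokens are placed before the segment (read) and again after it (write). The function name `build_segment_input` and the tuple-based token tags are hypothetical and are not taken from the RATE code base.

```python
def build_segment_input(memory_tokens, triplets):
    """Wrap one segment's tokens with read and write memory tokens.

    memory_tokens: list of memory embeddings (any objects in this sketch).
    triplets: list of (return_to_go, observation, action) tuples.
    The tags "mem_read", "mem_write", "rtg", "obs", "act" are illustrative only.
    """
    body = []
    for rtg, obs, act in triplets:
        # Each timestep contributes three tokens, in this fixed order.
        body.extend([("rtg", rtg), ("obs", obs), ("act", act)])
    read = [("mem_read", m) for m in memory_tokens]    # prefix: segment tokens can attend to it
    write = [("mem_write", m) for m in memory_tokens]  # suffix: can attend to the whole segment
    return read + body + write


sequence = build_segment_input(["m0", "m1"], [(1.0, "o0", "a0"), (0.5, "o1", "a1")])
```

Under causal attention, only the trailing write positions can see every token in the segment, which is why the updated memory is read off the end of the sequence in this layout.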

Method Notes

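RATE represents each trajectory as triplets $(R_t, o_t, a_t)$, where $R_t$ is return-to-go, $o_t$ is the observation, and $a_t$ is the action. The encoded trajectory is split into segments:

$$\tau = \{(R_t, o_t, a_t)\}_{t=1}^{T} \;\to\; [S_1, S_2, \dots, S_N]$$

The Transformer predicts actions for the segment and produces updated memory:

$$\hat{a}_n,\ \tilde{M}_{n+1} = \mathrm{Transformer}([M_n;\ S_n;\ M_n])$$

MRV then filters the updated memory through cross-attention from the previous memory to the candidate memory:

$$M_{n+1} = \mathrm{MRV}(M_n, \tilde{M}_{n+1}) = \mathrm{CrossAttention}(Q = M_n,\ K = V = \tilde{M}_{n+1})$$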
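
The segment recurrence and the MRV step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_segment_model` is a hypothetical stand-in for the Transformer, and the cross-attention has a single head with no learned projections. Only the data flow follows the description above: queries come from the previous memory, keys and values from the candidate memory.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(len(values[0]))])
    return output


def memory_retention_valve(prev_memory, candidate_memory):
    """MRV sketch: previous memory queries the candidate memory."""
    return cross_attention(prev_memory, candidate_memory, candidate_memory)


def toy_segment_model(memory, segment):
    """Hypothetical stand-in for the Transformer; returns (actions, candidate memory)."""
    dim = len(memory[0])
    seg_mean = [sum(tok[j] for tok in segment) / len(segment) for j in range(dim)]
    candidate = [[0.5 * m[j] + 0.5 * seg_mean[j] for j in range(dim)] for m in memory]
    actions = [sum(tok) for tok in segment]  # placeholder "action" per token
    return actions, candidate


def run_segments(segments, memory):
    """Process segments in order, passing filtered memory from one to the next."""
    all_actions = []
    for segment in segments:
        actions, candidate = toy_segment_model(memory, segment)
        memory = memory_retention_valve(memory, candidate)
        all_actions.extend(actions)
    return all_actions, memory
```

Because each output row is a softmax-weighted average of candidate rows, the filtered memory in this sketch always stays inside the range spanned by the candidate memory.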

In time-series/world-model terms, RATE provides direct evidence that action prediction over trajectories benefits from explicit long-horizon memory under partial observability. It does not, however, learn an explicit next-state dynamics model that could be used for planning.

Evidence And Results

The strongest evidence is on sparse memory tasks. In T-Maze, RATE is designed to retain an initial cue until a later decision point; the paper reports that it extrapolates to much longer inference lengths than the training setup, while context-limited Decision Transformer collapses once the cue leaves the window.
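
The failure mode described above can be reproduced with a toy example. It is not the paper's T-Maze environment or either model: the cue is a string shown only at step 0, `windowed_policy` is a hypothetical stand-in for a context-limited model, and `memory_policy` is a stand-in for a model that carries state across segments.

```python
CUES = ("left", "right")


def make_tmaze_observations(cue, corridor_length):
    """Cue is visible only at step 0; the final step is the junction."""
    return [cue] + ["blank"] * corridor_length + ["junction"]


def windowed_policy(observations, context_size, default="left"):
    """Sees only the last `context_size` observations; guesses `default` otherwise."""
    for obs in observations[-context_size:]:
        if obs in CUES:
            return obs
    return default


def memory_policy(observations, segment_size, default="left"):
    """Processes the episode segment by segment and keeps the cue in memory."""
    memory = None
    for start in range(0, len(observations), segment_size):
        for obs in observations[start:start + segment_size]:
            if obs in CUES:
                memory = obs  # write the cue into the carried memory
    return memory if memory is not None else default


short_episode = make_tmaze_observations("right", corridor_length=5)
long_episode = make_tmaze_observations("right", corridor_length=50)
short_choice = windowed_policy(short_episode, context_size=10)
long_choice = windowed_policy(long_episode, context_size=10)
memory_choice = memory_policy(long_episode, segment_size=10)
```

With a window of 10, the windowed policy still sees the cue in the short episode but falls back to its default in the long one, while the memory policy's choice does not depend on corridor length.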

On ViZDoom-Two-Colors and POPGym, the reported advantage concentrates on memory-dependent settings. The paper’s own interpretation is useful for this wiki: memory embeddings matter most for sparse discrete cues, while cached hidden states matter more for continuous feedback.

The Atari and MuJoCo results serve as a breadth check: they show RATE is not only a narrow T-Maze trick, but they do not by themselves establish a general action-conditioned world model.

Limitations

  • RATE is a decision/policy model trained from offline trajectories, not an explicit simulator or world model.
  • It conditions on return-to-go, which is useful for Decision Transformer-style control but differs from learning autonomous transition dynamics.
  • The memory machinery may be unnecessary for fully observable or short-horizon tasks.
  • The source is RL-heavy; transferring the design to observability or business time series would require typed actions, observations, rewards, and safety constraints.
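
For readers unfamiliar with return-to-go conditioning, the quantity can be computed as below. The sketch assumes the usual Decision Transformer convention of an undiscounted sum of future rewards; the source does not spell out the formula, and the function name `returns_to_go` is illustrative.

```python
def returns_to_go(rewards):
    """Return-to-go at step t is the sum of rewards from t to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each entry is a suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


sparse_episode = returns_to_go([0.0, 0.0, 1.0])
```

In a sparse-reward episode like this one, every step carries the same return-to-go until the reward arrives, so the token tells the model what outcome to aim for but not how to reach it. This is one reason the conditioning signal is distinct from a learned transition model.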

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control and counterfactuals | adjacent | Uses explicit action tokens and return-conditioned trajectory modeling in offline RL. | No learned next-state or counterfactual dynamics interface; policy model rather than world model. |
| Streaming state, long context, and constant updates | partially closes | Carries memory embeddings across trajectory segments and uses MRV to prevent important information from leaking out. | Needs telemetry-style always-on serving state and memory-update evaluation. |
| Event streams and action history | adjacent | Trajectories include observation/action/reward structure and sparse cues. | Needs mixed numeric features, event streams, graph context, and typed interventions. |
| Benchmarks for action-conditioned models | adjacent | Evaluates POMDP-style memory tasks and standard offline RL benchmarks. | Not a benchmark for observability, telecom, or business-process control. |

Open Questions

  • What is the corresponding reset -> observe -> step(action) -> reward contract for observability or digital operations, where actions are interventions rather than game controls?
  • Can MRV-like memory filtering become an abstention or escalation signal when memory is unstable?
  • How should return-to-go conditioning be replaced or complemented in domains where reward is delayed, noisy, or only available through human/operator feedback?
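
As a starting point for the first question, one possible shape for such a contract is sketched below. Everything here is hypothetical: `Intervention`, `StepResult`, `OpsEnvironment`, and `ToyServiceEnv` are illustrative names, and the optional reward is only one way to model feedback that is delayed or not yet available.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Intervention:
    """A typed action: what is done, and to which target."""
    kind: str
    target: str


@dataclass(frozen=True)
class StepResult:
    observation: dict
    reward: Optional[float]  # None means feedback is not available yet
    done: bool


class OpsEnvironment(Protocol):
    """reset -> observe -> step(action) -> reward, with interventions as actions."""

    def reset(self) -> dict: ...

    def step(self, action: Intervention) -> StepResult: ...


class ToyServiceEnv:
    """Toy environment: a service stays unhealthy until it is restarted."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps = 0
        self.healthy = False

    def reset(self) -> dict:
        self.steps = 0
        self.healthy = False
        return {"service": "api", "healthy": self.healthy}

    def step(self, action: Intervention) -> StepResult:
        self.steps += 1
        if action.kind == "restart" and action.target == "api":
            self.healthy = True
        done = self.healthy or self.steps >= self.max_steps
        reward = 1.0 if self.healthy else None
        return StepResult({"service": "api", "healthy": self.healthy}, reward, done)
```

Making the reward optional keeps the third question visible in the interface itself: a return-to-go token cannot be computed for steps whose reward is still `None`.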