Recurrent Action Transformer with Memory

Source

Raw Markdown: paper_rate-2023.md
PDF: paper_rate-2023.pdf
Preprint: arXiv 2306.09459
OpenReview: ICLR 2026 Poster
Official code: CognitiveAISystems/RATE
Project page: RATE model

Credibility

RATE was first submitted to arXiv in 2023; the current arXiv page resolves to a 2026 revision. OpenReview lists it as an ICLR 2026 Poster, so it counts as a current, peer-reviewed tier-1 source for memory-augmented offline RL.

Core Claim

RATE adapts RMT-style recurrent memory to offline reinforcement learning by processing trajectories as segmented sequences of return-to-go, observation, and action tokens, then passing learned memory embeddings between segments.

The key contribution for this wiki is not “a world model.” RATE is a memory-augmented policy/decision model: it preserves sparse historical cues across long partially observed episodes so action prediction can condition on information outside the Transformer context window.

Key Contributions

Introduces Recurrent Action Transformer with Memory for offline RL in partially observable environments.
Combines memory embeddings, cached hidden states, and a Memory Retention Valve (MRV).
Uses a read/write memory-token layout around trajectory segments, inherited from RMT-style causal memory.
Evaluates memory-intensive environments including T-Maze, ViZDoom-Two-Colors, Memory Maze, Minigrid-Memory, and POPGym.
Reports competitive or strong results on standard Atari and MuJoCo offline RL benchmarks.

Method Notes

RATE represents each trajectory as a sequence of triplets $(R_{t}, o_{t}, a_{t})$, where $R_{t}$ is the return-to-go, $o_{t}$ is the observation, and $a_{t}$ is the action. The encoded trajectory is split into segments:

$$\tau_{0:T-1} = \{(R_{t}, o_{t}, a_{t})\}_{t=0}^{T-1}, \qquad \tilde{S}_{n} = \operatorname{concat}(M_{n}, S_{n}, M_{n}).$$
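The segmentation step can be sketched in plain Python. This is a minimal illustration, not the authors' code: the helper names (`returns_to_go`, `split_into_segments`, `wrap_with_memory`), the undiscounted return-to-go, the segment length, and the string placeholders for memory embeddings are all assumptions made for the example. Each triplet is kept as one list element here, whereas a real model would embed $R_{t}$, $o_{t}$, and $a_{t}$ as separate tokens.

```python
from typing import List, Tuple

Triplet = Tuple[float, str, str]  # (return-to-go, observation, action)


def returns_to_go(rewards: List[float]) -> List[float]:
    """R_t = sum of rewards from step t to the end of the episode.

    Assumption: undiscounted, as in Decision Transformer-style conditioning.
    """
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def split_into_segments(traj: List[Triplet], segment_len: int) -> List[List[Triplet]]:
    """Cut tau_{0:T-1} into consecutive segments S_0, S_1, ... of at most segment_len steps."""
    return [traj[i:i + segment_len] for i in range(0, len(traj), segment_len)]


def wrap_with_memory(memory: list, segment: list) -> list:
    """S~_n = concat(M_n, S_n, M_n): memory slots in front (read) and at the back (write)."""
    return list(memory) + list(segment) + list(memory)


# Hypothetical toy episode: a single reward at the final step.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
observations = ["cue", "hall", "hall", "hall", "junction"]
actions = ["fwd", "fwd", "fwd", "fwd", "left"]

trajectory = list(zip(returns_to_go(rewards), observations, actions))
segments = split_into_segments(trajectory, segment_len=2)
memory = ["<mem0>", "<mem1>"]  # placeholders standing in for learned embeddings
wrapped = wrap_with_memory(memory, segments[0])
print(len(segments), len(wrapped))  # 3 6
```

With five steps and a segment length of two, the episode yields three segments, and the first wrapped segment holds two memory slots, two triplets, and two more memory slots.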

The Transformer predicts actions for the segment and produces updated memory:

$$\hat{a}_{n}, M_{n+1} = \operatorname{Transformer}(\tilde{S}_{n}).$$
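The recurrence implied by this update can be sketched as a loop that feeds each $M_{n+1}$ back in as the next segment's $M_{n}$. The sketch below is only a control-flow illustration: `run_recurrent` and `toy_step` are hypothetical names, and `toy_step` is a hand-written stand-in for the Transformer, not a learned model.

```python
from typing import Callable, List, Sequence, Tuple

Memory = List[float]
Segment = Sequence[Tuple[float, float, int]]  # (return-to-go, observation, action)


def run_recurrent(
    segments: Sequence[Segment],
    initial_memory: Memory,
    step_fn: Callable[[Memory, Segment], Tuple[List[int], Memory]],
) -> Tuple[List[int], Memory]:
    """Apply step_fn segment by segment, passing M_{n+1} on as the next M_n."""
    memory = list(initial_memory)
    predicted: List[int] = []
    for segment in segments:
        actions, memory = step_fn(memory, segment)
        predicted.extend(actions)
    return predicted, memory


def toy_step(memory: Memory, segment: Segment) -> Tuple[List[int], Memory]:
    """Hypothetical stand-in for the Transformer (hand-written, not learned).

    Keeps the largest observation seen so far in memory slot 0 and predicts
    action 1 from the first step at which that value is positive.
    """
    seen = memory[0]
    actions: List[int] = []
    for _, obs, _ in segment:
        seen = max(seen, obs)
        actions.append(1 if seen > 0 else 0)
    return actions, [seen]


# The cue (observation 5.0) appears only in the first segment.
segments = [
    [(1.0, 0.0, 0), (1.0, 5.0, 0)],
    [(1.0, 0.0, 0), (1.0, 0.0, 0)],
]
predicted, final_memory = run_recurrent(segments, [0.0], toy_step)
print(predicted, final_memory)  # [0, 1, 1, 1] [5.0]
```

The second segment never observes the cue, yet its predictions still depend on it because the memory slot carries the value across the segment boundary.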

MRV then filters the updated memory through cross-attention from the previous memory to the candidate memory:

$$\operatorname{MRV}(M_{n}, M_{n+1}) = \operatorname{FFN}(\operatorname{MultiHead}(Q = M_{n}, K = M_{n+1}, V = M_{n+1})).$$
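A minimal NumPy sketch of the MRV formula follows. It covers only the shape of the computation: queries come from the previous memory, keys and values from the candidate memory, and the result passes through a feed-forward layer. The class name, the ReLU two-layer FFN, the absence of biases, residual connections, and normalization, and the random (untrained) weights are all assumptions for illustration; the official implementation may differ on each of these points.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MemoryRetentionValve:
    """Sketch of MRV(M_n, M_{n+1}) = FFN(MultiHead(Q=M_n, K=M_{n+1}, V=M_{n+1}))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, seed: int = 0) -> None:
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Randomly initialised projections; in a trained model these are learned.
        self.w_q = rng.normal(0, scale, (d_model, d_model))
        self.w_k = rng.normal(0, scale, (d_model, d_model))
        self.w_v = rng.normal(0, scale, (d_model, d_model))
        self.w_o = rng.normal(0, scale, (d_model, d_model))
        self.w_1 = rng.normal(0, scale, (d_model, d_ff))
        self.w_2 = rng.normal(0, 1.0 / np.sqrt(d_ff), (d_ff, d_model))

    def _split_heads(self, x: np.ndarray) -> np.ndarray:
        # (num_tokens, d_model) -> (n_heads, num_tokens, d_head)
        n = x.shape[0]
        return x.reshape(n, self.n_heads, self.d_head).transpose(1, 0, 2)

    def attention_weights(self, m_prev: np.ndarray, m_new: np.ndarray) -> np.ndarray:
        q = self._split_heads(m_prev @ self.w_q)  # queries from previous memory M_n
        k = self._split_heads(m_new @ self.w_k)   # keys from candidate memory M_{n+1}
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.d_head)
        return softmax(scores, axis=-1)           # (n_heads, n_prev, n_new)

    def __call__(self, m_prev: np.ndarray, m_new: np.ndarray) -> np.ndarray:
        weights = self.attention_weights(m_prev, m_new)
        v = self._split_heads(m_new @ self.w_v)   # values from candidate memory M_{n+1}
        heads = weights @ v                       # (n_heads, n_prev, d_head)
        merged = heads.transpose(1, 0, 2).reshape(m_prev.shape[0], -1)
        attended = merged @ self.w_o
        hidden = np.maximum(attended @ self.w_1, 0.0)  # ReLU feed-forward layer
        return hidden @ self.w_2


rng = np.random.default_rng(1)
m_n = rng.normal(size=(4, 8))     # previous memory: 4 slots of width 8
m_next = rng.normal(size=(4, 8))  # candidate memory produced by the Transformer
mrv = MemoryRetentionValve(d_model=8, n_heads=2, d_ff=16)
filtered = mrv(m_n, m_next)
print(filtered.shape)  # (4, 8)
```

Because the queries come from $M_{n}$, the output has one row per previous memory slot, so the filtered memory keeps the same number of slots as the memory it replaces.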

In time-series/world-model terms, RATE is direct evidence that action trajectories benefit from explicit long-horizon memory under partial observability, but it does not learn an explicit next-state dynamics model for planning.

Evidence And Results

The strongest evidence is on sparse memory tasks. In T-Maze, RATE is designed to retain an initial cue until a later decision point; the paper reports that it extrapolates to much longer inference lengths than the training setup, while context-limited Decision Transformer collapses once the cue leaves the window.
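The failure mode described here can be shown with a toy example. This is not the paper's T-Maze environment and reproduces none of its results; `make_corridor`, `windowed_policy`, and `memory_policy` are hypothetical hand-written functions, and the memory write rule is hard-coded rather than learned. The sketch only shows why a fixed context window loses the cue while a slot carried across segments does not.

```python
from typing import List


def make_corridor(cue: str, length: int) -> List[str]:
    """Toy T-Maze-like episode: the cue is visible only at step 0."""
    return [cue] + ["hall"] * (length - 2) + ["junction"]


def windowed_policy(observations: List[str], window: int) -> str:
    """Context-limited agent: decides from the last `window` observations only."""
    for obs in observations[-window:]:
        if obs in ("left", "right"):
            return obs
    return "unknown"  # the cue has already left the window


def memory_policy(observations: List[str], segment_len: int) -> str:
    """Segment-recurrent agent: carries one memory slot across segments."""
    memory = "unknown"
    for start in range(0, len(observations), segment_len):
        for obs in observations[start:start + segment_len]:
            if obs in ("left", "right"):
                memory = obs  # hard-coded write of the cue into memory
    return memory


short_ep = make_corridor("left", length=5)
long_ep = make_corridor("right", length=50)
print(windowed_policy(short_ep, window=8), windowed_policy(long_ep, window=8))
# left unknown
print(memory_policy(short_ep, segment_len=8), memory_policy(long_ep, segment_len=8))
# left right
```

Both agents solve the short corridor, but only the memory-carrying agent still knows the cue when the corridor is longer than the window.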

On ViZDoom-Two-Colors and POPGym, the reported advantage concentrates on memory-dependent settings. The paper’s own interpretation is useful for this wiki: memory embeddings matter most for sparse discrete cues, while cached hidden states matter more for continuous feedback.

The Atari and MuJoCo results are a breadth check: RATE is not only a narrow T-Maze trick, but those benchmarks do not by themselves prove a general action-conditioned world model.

Limitations

RATE is a decision/policy model trained from offline trajectories, not an explicit simulator or world model.
It conditions on return-to-go, which is useful for Decision Transformer-style control but differs from learning autonomous transition dynamics.
The memory machinery may be unnecessary for fully observable or short-horizon tasks.
The source is RL-heavy; transferring the design to observability or business time series would require typed actions, observations, rewards, and safety constraints.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Control and counterfactuals	adjacent	Uses explicit action tokens and return-conditioned trajectory modeling in offline RL.	No learned next-state or counterfactual dynamics interface; policy model rather than world model.
Streaming state, long context, and constant updates	partially closes	Carries memory embeddings across trajectory segments and uses MRV to prevent important information from leaking out.	Needs telemetry-style always-on serving state and memory-update evaluation.
Event streams and action history	adjacent	Trajectories include observation/action/reward structure and sparse cues.	Needs mixed numeric features, event streams, graph context, and typed interventions.
Benchmarks for action-conditioned models	adjacent	Evaluates POMDP-style memory tasks and standard offline RL benchmarks.	Not a benchmark for observability, telecom, or business-process control.

Links Into The Wiki

Open Questions

What is the corresponding reset -> observe -> step(action) -> reward contract for observability or digital operations, where actions are interventions rather than game controls?
Can MRV-like memory filtering become an abstention or escalation signal when memory is unstable?
How should return-to-go conditioning be replaced or complemented in domains where reward is delayed, noisy, or only available through human/operator feedback?
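To make the first question concrete, the shape of a reset -> observe -> step(action) -> reward contract can be written down as a typed interface. Everything below is hypothetical and purely illustrative: the `Intervention` type, the `OperationsEnv` protocol, the toy queue environment, and its reward definition are assumptions for the sketch, not a proposal from the source.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Tuple

Observation = Dict[str, float]


@dataclass(frozen=True)
class Intervention:
    """Hypothetical typed action, e.g. an operational change rather than a game control."""
    kind: str
    target: str


class OperationsEnv(Protocol):
    """Hypothetical contract: reset() returns an observation; step() returns (obs, reward, done)."""

    def reset(self) -> Observation: ...

    def step(self, action: Intervention) -> Tuple[Observation, float, bool]: ...


class ToyQueueEnv:
    """Toy stand-in: a queue that grows by one item per step unless it is drained."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.depth = 0.0
        self.t = 0

    def reset(self) -> Observation:
        self.depth, self.t = 5.0, 0
        return {"queue_depth": self.depth}

    def step(self, action: Intervention) -> Tuple[Observation, float, bool]:
        self.t += 1
        if action.kind == "drain":
            self.depth = 0.0
        else:
            self.depth += 1.0
        reward = -self.depth  # assumed reward: penalise backlog
        return {"queue_depth": self.depth}, reward, self.t >= self.horizon


def rollout(env: OperationsEnv, action: Intervention) -> float:
    """Repeat one intervention until the episode ends and return the total reward."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done = env.step(action)
        total += reward
    return total


print(rollout(ToyQueueEnv(), Intervention("noop", "queue-a")))   # -21.0
print(rollout(ToyQueueEnv(), Intervention("drain", "queue-a")))  # 0.0
```

A logged sequence of such `(observation, intervention, reward)` steps has the same triplet structure that RATE consumes once rewards are converted to return-to-go.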
