Recurrent Action Transformer with Memory
Source
- Raw Markdown: paper_rate-2023.md
- PDF: paper_rate-2023.pdf
- Preprint: arXiv 2306.09459
- OpenReview: ICLR 2026 Poster
- Official code: CognitiveAISystems/RATE
- Project page: RATE model
Credibility
RATE was first submitted to arXiv in 2023 and the current arXiv page resolves to a 2026 revision. OpenReview lists it as an ICLR 2026 Poster, which makes it a current tier-1 accepted source for memory-augmented offline RL.
Core Claim
RATE adapts RMT-style recurrent memory to offline reinforcement learning by processing trajectories as segmented sequences of return-to-go, observation, and action tokens, then passing learned memory embeddings between segments.
The key contribution for this wiki is not “a world model.” RATE is a memory-augmented policy/decision model: it preserves sparse historical cues across long partially observed episodes so action prediction can condition on information outside the Transformer context window.
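The trajectory layout described above can be sketched in a few lines of plain Python. This is only an illustration of the data shape, not the RATE code: the helper names `returns_to_go` and `to_segments` are hypothetical, the return-to-go is the undiscounted form used by Decision Transformer-style models, and leaving the last segment shorter (rather than padding it) is an assumption made for brevity.

```python
from typing import List, Sequence, Tuple

# One token triplet: (return-to-go, observation, action).
Triplet = Tuple[float, object, object]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted return-to-go: R_t = r_t + r_{t+1} + ... + r_{T-1}."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def to_segments(
    rewards: Sequence[float],
    observations: Sequence[object],
    actions: Sequence[object],
    segment_len: int,
) -> List[List[Triplet]]:
    """Build (R_t, o_t, a_t) triplets and cut them into consecutive segments."""
    if not (len(rewards) == len(observations) == len(actions)):
        raise ValueError("rewards, observations and actions must align in time")
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    triplets = list(zip(returns_to_go(rewards), observations, actions))
    # Assumption: the final segment may be shorter; a real pipeline might pad it.
    return [triplets[i:i + segment_len] for i in range(0, len(triplets), segment_len)]


# Toy episode: the reward arrives only at the final decision.
segments = to_segments(
    rewards=[0.0, 0.0, 0.0, 0.0, 1.0],
    observations=["cue", "hall", "hall", "hall", "junction"],
    actions=["fwd", "fwd", "fwd", "fwd", "left"],
    segment_len=2,
)
```

With a segment length of 2, the cue lands in the first segment and the decision in the last, which is the situation where information has to be carried across segments.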
Key Contributions
- Introduces Recurrent Action Transformer with Memory for offline RL in partially observable environments.
- Combines memory embeddings, cached hidden states, and a Memory Retention Valve (MRV).
- Uses a read/write memory-token layout around trajectory segments, inherited from RMT-style causal memory.
- Evaluates memory-intensive environments including T-Maze, ViZDoom-Two-Colors, Memory Maze, Minigrid-Memory, and POPGym.
- Reports competitive or strong results on standard Atari and MuJoCo offline RL benchmarks.
Method Notes
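RATE represents each trajectory as triplets $(R_t, o_t, a_t)$, where $R_t$ is return-to-go, $o_t$ is the observation, and $a_t$ is the action. The encoded trajectory is split into segments:

$$S_n = \big((R_t, o_t, a_t)\big)_{t = nK}^{(n+1)K - 1}, \qquad n = 0, \dots, N - 1,$$

where $K$ is the segment length and $N$ is the number of segments.

The Transformer predicts actions for the segment and produces updated memory, with $M_n$ denoting the memory that enters segment $n$ and $\tilde{M}_{n+1}$ the candidate memory written after it:

$$\hat{a}_n,\ \tilde{M}_{n+1} = \mathrm{Transformer}\big([M_n;\ S_n;\ M_n]\big)$$

MRV then filters the updated memory through cross-attention from the previous memory to the candidate memory:

$$M_{n+1} = \mathrm{MRV}(M_n, \tilde{M}_{n+1})$$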
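The recurrence described in this section can be sketched with NumPy as follows. It is a minimal toy under stated assumptions, not the official implementation: a single causal self-attention layer stands in for the full Transformer, the valve is a single-head cross-attention with queries from the previous memory and keys and values from the candidate memory, and every name here (`attention`, `init_params`, `run_segments`, the parameter keys) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, context, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product attention of `queries` over `context`."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:  # only valid when queries and context are the same sequence
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


def init_params(dim, rng):
    # "b*" weights: toy backbone; "m*" weights: toy memory valve.
    names = ["bq", "bk", "bv", "mq", "mk", "mv"]
    return {n: rng.normal(scale=dim ** -0.5, size=(dim, dim)) for n in names}


def run_segments(segments, memory, p):
    """Process segments in order, carrying memory embeddings between them."""
    num_mem = memory.shape[0]
    features = []
    for seg in segments:
        # Read/write layout: memory tokens before and after the segment.
        x = np.concatenate([memory, seg, memory], axis=0)
        # Stand-in for the Transformer: one residual causal self-attention layer.
        h = x + attention(x, x, p["bq"], p["bk"], p["bv"], causal=True)
        features.append(h[num_mem:-num_mem])  # per-token features for action heads
        candidate = h[-num_mem:]              # trailing slots hold the written memory
        # Valve: previous memory queries the candidate memory.
        memory = attention(memory, candidate, p["mq"], p["mk"], p["mv"])
    return features, memory


rng = np.random.default_rng(0)
dim, num_mem, seg_len = 8, 2, 4
params = init_params(dim, rng)
segments = [rng.normal(size=(seg_len, dim)) for _ in range(3)]
memory0 = rng.normal(size=(num_mem, dim))
features, final_memory = run_segments(segments, memory0, params)
print(len(features), features[0].shape, final_memory.shape)  # 3 (4, 8) (2, 8)
```

Because the trailing memory slots sit at the end of a causally masked sequence, they can attend to the whole segment, which is why they are the ones used as the candidate memory in this sketch.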
In time-series/world-model terms, RATE is direct evidence that action trajectories benefit from explicit long-horizon memory under partial observability, but it does not learn an explicit next-state dynamics model for planning.
Evidence And Results
The strongest evidence is on sparse memory tasks. In T-Maze, RATE is designed to retain an initial cue until a later decision point; the paper reports that it extrapolates to much longer inference lengths than the training setup, while context-limited Decision Transformer collapses once the cue leaves the window.
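The failure mode described here can be made concrete with a toy simulation. This is a hand-coded illustration of the argument, not the paper's T-Maze environment or either model: `windowed_policy` stands in for a context-limited policy that sees only the last `span` observations, `memory_policy` stands in for a segment-recurrent policy whose memory write is hard-coded rather than learned, and all names are hypothetical.

```python
import random
from typing import Callable, List

CUES = ("left", "right")


def episode(cue: str, corridor_len: int) -> List[str]:
    """The cue is visible only at t=0, then a corridor, then the junction."""
    return [cue] + ["corridor"] * corridor_len + ["junction"]


def windowed_policy(obs: List[str], span: int, rng: random.Random) -> str:
    """Decide at the junction from the last `span` observations only."""
    for o in obs[-span:]:
        if o in CUES:
            return o
    return rng.choice(CUES)  # cue has left the window: forced to guess


def memory_policy(obs: List[str], span: int, rng: random.Random) -> str:
    """Read the episode in segments of `span`, carrying one memory slot."""
    memory = None
    for start in range(0, len(obs), span):
        for o in obs[start:start + span]:
            if o in CUES:
                memory = o  # hand-coded write; a trained model must learn this
    return memory if memory is not None else rng.choice(CUES)


def success_rate(
    policy: Callable[[List[str], int, random.Random], str],
    span: int,
    corridor_len: int,
    trials: int = 1000,
    seed: int = 0,
) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        cue = rng.choice(CUES)
        wins += policy(episode(cue, corridor_len), span, rng) == cue
    return wins / trials


short = success_rate(windowed_policy, span=10, corridor_len=5)
long_windowed = success_rate(windowed_policy, span=10, corridor_len=500)
long_memory = success_rate(memory_policy, span=10, corridor_len=500)
```

When the corridor fits inside the window, both policies recover the cue; once the corridor is longer than the window, the windowed policy falls to chance while the memory-carrying policy is unaffected by corridor length.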
On ViZDoom-Two-Colors and POPGym, the reported advantage concentrates on memory-dependent settings. The paper’s own interpretation is useful for this wiki: memory embeddings matter most for sparse discrete cues, while cached hidden states matter more for continuous feedback.
The Atari and MuJoCo results are a breadth check: RATE is not only a narrow T-Maze trick, but those benchmarks do not by themselves prove a general action-conditioned world model.
Limitations
- RATE is a decision/policy model trained from offline trajectories, not an explicit simulator or world model.
- It conditions on return-to-go, which is useful for Decision Transformer-style control but differs from learning autonomous transition dynamics.
- The memory machinery may be unnecessary for fully observable or short-horizon tasks.
- The source is RL-heavy; transferring the design to observability or business time series would require typed actions, observations, rewards, and safety constraints.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Control and counterfactuals | adjacent | Uses explicit action tokens and return-conditioned trajectory modeling in offline RL. | No learned next-state or counterfactual dynamics interface; policy model rather than world model. |
| Streaming state, long context, and constant updates | partially closes | Carries memory embeddings across trajectory segments and uses MRV to prevent important information from leaking out. | Needs telemetry-style always-on serving state and memory-update evaluation. |
| Event streams and action history | adjacent | Trajectories include observation/action/reward structure and sparse cues. | Needs mixed numeric features, event streams, graph context, and typed interventions. |
| Benchmarks for action-conditioned models | adjacent | Evaluates POMDP-style memory tasks and standard offline RL benchmarks. | Not a benchmark for observability, telecom, or business-process control. |
Links Into The Wiki
- Recurrent Action Transformer with Memory
- Recurrent Memory Transformer
- World Models
- Efficient Recurrent Sequence Models
- Looped Transformers And Test-Time Memory
- Action-Conditioned Time-Series Datasets
- Foundation Time-Series Model Research Agenda
- D4RL
Open Questions
- What is the corresponding `reset -> observe -> step(action) -> reward` contract for observability or digital operations, where actions are interventions rather than game controls?
- Can MRV-like memory filtering become an abstention or escalation signal when memory is unstable?
- How should return-to-go conditioning be replaced or complemented in domains where reward is delayed, noisy, or only available through human/operator feedback?
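One possible shape for the `reset -> observe -> step(action) -> reward` contract in the first question is sketched below, purely as a thinking aid. Nothing here comes from the paper: `Observation`, `Intervention`, `OpsEnv`, and `ToyQueueEnv` are hypothetical names, the single `backlog` field stands in for real telemetry, and the queue dynamics are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


@dataclass
class Observation:
    backlog: int  # hypothetical telemetry field


@dataclass
class Intervention:
    kind: str  # "noop" or "scale_out" in this toy


class OpsEnv(Protocol):
    """Hypothetical contract: typed observations in, typed interventions out."""

    def reset(self) -> Observation: ...

    def step(self, action: Intervention) -> Tuple[Observation, float, bool]: ...


class ToyQueueEnv:
    """Backlog grows by 2 per step; "scale_out" drains 5. Reward is -backlog."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0
        self.backlog = 0

    def reset(self) -> Observation:
        self.t, self.backlog = 0, 4
        return Observation(self.backlog)

    def step(self, action: Intervention) -> Tuple[Observation, float, bool]:
        drain = 5 if action.kind == "scale_out" else 0
        self.backlog = max(0, self.backlog + 2 - drain)
        self.t += 1
        done = self.t >= self.horizon
        return Observation(self.backlog), -float(self.backlog), done


def rollout(
    env: OpsEnv, policy: Callable[[Observation], Intervention]
) -> List[Tuple[Observation, Intervention, float]]:
    """Log (observation, intervention, reward) steps for offline training."""
    obs = env.reset()
    done = False
    log: List[Tuple[Observation, Intervention, float]] = []
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        log.append((obs, action, reward))
        obs = next_obs
    return log


def threshold_policy(obs: Observation) -> Intervention:
    return Intervention("scale_out" if obs.backlog >= 5 else "noop")


log = rollout(ToyQueueEnv(), threshold_policy)
```

A log in this form has the same observation/action/reward structure as an offline RL trajectory, so returns-to-go could be derived from it wherever a scalar reward is actually available.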