---
abstract: |
  Transformers have become increasingly popular in offline reinforcement learning (RL) due to their ability to treat agent trajectories as sequences, reframing policy learning as a sequence modeling task. However, in partially observable environments (POMDPs), effective decision-making depends on retaining information about past events -- something that standard transformers struggle with due to the quadratic complexity of self-attention, which limits their context length. One solution to this problem is to extend transformers with memory mechanisms. We propose the **Recurrent Action Transformer with Memory (RATE)**, a novel transformer-based architecture for offline RL that incorporates a recurrent memory mechanism designed to regulate information retention. We evaluate RATE across a diverse set of environments: memory-intensive tasks (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory, and POPGym), as well as standard Atari and MuJoCo benchmarks. Our comprehensive experiments demonstrate that RATE significantly improves performance in memory-dependent settings while remaining competitive on standard tasks across a broad range of baselines. These findings underscore the pivotal role of integrated memory mechanisms in offline RL and establish RATE as a unified, high-capacity architecture for effective decision-making over extended horizons. Code: `https://sites.google.com/view/rate-model/`.
author:
- |
  Egor Cherepanov$^{1,2}$, Aleksei Staroverov$^{1,2}$, Alexey K. Kovalev$^{1,2}$, Aleksandr I. Panov$^{1,2}$\
  $^{1}$AXXX, $^{2}$MIRIAI\
  `cherepanov@axxx.tech`
bibliography:
- iclr2026\_conference.bib
title: Recurrent Action Transformer with Memory
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\renewcommand \thepart{}
```
```{=latex}
\renewcommand \partname{}
```
```{=latex}
\newcommand{\todo}[1]{\textbf{\textcolor{red}{#1}}}
```
```{=latex}
\renewcommand{\sectionautorefname}{Section}
```
```{=latex}
\renewcommand{\subsectionautorefname}{Section}
```
```{=latex}
\renewcommand{\subsubsectionautorefname}{Section}
```
```{=latex}
\providecommand*{\algorithmautorefname}{Algorithm}
```
```{=latex}
\providecommand*{\definitionautorefname}{Definition}
```
```{=latex}
\newcommand{\ga}[1]{\gradientcelld{#1}{0.25}{0.75}{1}{Red}{white}{DarkGreen}{70}}
```
```{=latex}
\newcommand{\gb}[1]{\gradientcelld{#1}{0.25}{0.66}{0.65}{Red}{white}{DarkGreen}{70}}
```
```{=latex}
\newcommand{\gc}[1]{\gradientcelld{#1}{0.25}{0.65}{1}{Red}{white}{DarkGreen}{70}}
```
```{=latex}
\newcommand{\gd}[1]{\gradientcelld{#1}{0.25}{0.61}{1}{Red}{white}{DarkGreen}{70}}
```
```{=latex}
\newcommand{\gtopa}[1]{\cellcolor{DarkGreen!75} #1}
```
```{=latex}
\newcommand{\gtopb}[1]{\cellcolor{LightGreen!75} #1}
```
```{=latex}
\renewcommand\qedsymbol{$\blacksquare$}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\doparttoc
```
```{=latex}
\faketableofcontents
```
```{=latex}
\maketitle
```
### Introduction {#sec:introduction}

Originally developed for Natural Language Processing (NLP), transformers [@vaswani2017attention] have recently demonstrated strong performance across a wide range of Reinforcement Learning (RL) settings [@agarwal2023transformers; @li2023a]. They have been successfully applied to online [@parisotto2020stabilizing; @esslinger2022dtqn], offline [@chen2021decision; @jiang2023efficient; @wu2023elastic; @zhuang2024reinformer; @wang2025longshort], model-based [@chen2022transdreamer; @robine2023transformerbased], and in-context RL [@polubarov2025vintix; @grigsby2024amago; @schmied2024retrieval]. In particular, transformers show promise for tackling long-horizon credit assignment and operating in memory-intensive environments [@ni2023when; @grigsby2024amago; @esslinger2022dtqn; @parisotto2020stabilizing], provided the full trajectory fits within the model context. Despite their success, transformers face fundamental limitations when applied to long sequences due to the quadratic complexity of self-attention [@keles2023computational], which restricts their applicability in long-horizon inference tasks. While various techniques have been proposed to extend the context window [@dai2019transformerxl; @bulatov2022rmt], these approaches often suffer from training instability [@zhang2022opt] or rely on task-specific sparse attention patterns that may not generalize well beyond NLP [@beltagy2020longformer; @zaheer2020big]. Memory-augmented transformers offer a promising alternative by enabling access to past information without expanding the context length. Motivated by advances in memory mechanisms for NLP models [@dai2019transformerxl; @bulatov2022rmt], we investigate how such approaches can be adapted to RL. Unlike NLP, RL involves structured and modality-rich inputs -- observations, actions, and rewards -- that require domain-specific encoding, and frequently exhibit high sparsity in both reward signal and observations.

In RL, memory usually refers either to using past information within an episode [@lampinen2021towards; @ni2023when] or to transferring experience across environments [@kang2023think; @team2023human], which aids generalization, sample efficiency, and Meta-RL [@duan2016rl; @wang2016learning]. We focus on the former.

![Recurrent Action Transformer with Memory (RATE). The model processes a trajectory divided into segments $S_n$, where $R$ denotes returns-to-go (sums of future rewards), $o$ observations, and $a$ actions. Memory embeddings $M_n$ are attached to each segment $S_n$ to retain important historical information.](plots/RATE_scheme.png){#fig:RATE_scheme width="0.7\\linewidth"}

We introduce the **Recurrent Action Transformer with Memory** (**RATE**; see `\autoref{fig:RATE_scheme}`{=latex}), a memory-augmented transformer that incorporates three complementary mechanisms: learned memory embeddings, recurrent caching of past hidden states, and a novel **Memory Retention Valve** (**MRV**) for selective information flow. We empirically show that these memory mechanisms preserve information from previous steps, allowing the model to use it when making decisions at the current step. The MRV controls how memory embeddings are updated and prevents the loss of important information when processing long sequences, which makes it possible to handle tasks with highly sparse signals. To assess the effectiveness of our memory mechanisms, we conduct extensive experiments across a diverse set of memory-intensive environments, including ViZDoom-Two-Colors [@memup], Memory Maze [@pasukonis2022memmaze], Minigrid-Memory [@MinigridMiniworld23], Passive T-Maze [@ni2023when], and POPGym [@morad2023popgym], as well as standard RL benchmarks such as Atari [@bellemare2013arcade] and MuJoCo [@d4rl]. We also study the impact of memory on the performance of the proposed model. RATE interpolates and extrapolates well outside the transformer context and retains important information over long horizons in highly sparse environments.

Our main contributions are as follows:

1.  We propose **Recurrent Action Transformer with Memory** (RATE), a new transformer for offline RL that combines three complementary memory mechanisms: (i) memory embeddings, (ii) caching of hidden states, and (iii) a **Memory Retention Valve** (MRV), which uses cross-attention to retain key information over long horizons (`\autoref{sec:rate}`{=latex}).

2.  We conduct extensive evaluations on memory-intensive tasks -- including ViZDoom-Two-Colors, Memory Maze, Minigrid-Memory, POPGym, and Passive T-Maze -- showing that RATE consistently outperforms strong baselines (`\autoref{sec:exp_res}`{=latex}).

3.  We further show that RATE matches or surpasses standard baselines on the Atari and MuJoCo benchmarks, demonstrating strong generalization across task types and highlighting the model's versatility (`\autoref{sec:exp_res}`{=latex}).

### Background

**Offline RL.** In RL [@sutton2018reinforcement], a task is formalized as a Markov Decision Process (MDP): $\langle \mathcal{S},\mathcal{A}, \mathcal{P}, \mathcal{R}\rangle$, where $s \in \mathcal{S}$ are states, $a \in \mathcal{A}$ are actions, $\mathcal{P}(s'|s,a)$ is a transition function, and $r = \mathcal{R}(s,a)$ is a reward function. States satisfy the Markov property: $\mathbb{P}(s_{t+1}|s_t) = \mathbb{P}(s_{t+1}|s_0,\dots, s_t)$. A trajectory $\tau$ of length $T$ is a sequence $(s_0, a_0, r_0,\dots, s_{T-1}, a_{T-1}, r_{T-1})$, where $r_t = \mathcal{R}(s_t,a_t)$ is the immediate reward at timestep $t$. The return-to-go [@chen2021decision] $R_t = \sum^{T-1}_{t'=t}r_{t'}$ is the sum of future rewards from $t$ onward. The goal is to learn a policy $\pi$ maximizing the expected return. While online RL iteratively collects trajectories through environment interaction, offline RL uses a fixed dataset of trajectories, making it suitable for scenarios where environment interaction is costly or risky. A popular offline RL method, Decision Transformer (DT) [@chen2021decision], models return-conditioned trajectories with a GPT-style architecture, avoiding value estimation. However, its fixed context window limits performance in tasks with delayed rewards or long-term dependencies, motivating memory-augmented models.
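
As a concrete illustration of the return-to-go defined above, the following minimal Python sketch computes $R_t$ for every timestep of a toy sparse-reward episode. The function name `returns_to_go` and the reward values are illustrative assumptions, not part of the RATE codebase.

```python
from itertools import accumulate


def returns_to_go(rewards):
    """Return [R_0, ..., R_{T-1}] with R_t = sum of rewards from step t to T-1."""
    # Suffix sums: accumulate over the reversed rewards, then reverse back.
    return list(accumulate(reversed(rewards)))[::-1]


# Toy sparse-reward episode (assumed values): the only reward arrives at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
rtg = returns_to_go(rewards)
print(rtg)  # [1.0, 1.0, 1.0, 1.0]
```

In such a sparse episode the return-to-go is constant until the reward is collected, so it carries no information about which action earns that reward.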

**POMDP.** In real-world settings, agents often receive partial observations rather than full states, which breaks the Markov property; examples include a robot that relies only on camera input or an agent that must act on information seen earlier in the episode. Such cases are modeled as Partially Observable MDPs (POMDPs): $\langle \mathcal{S},\mathcal{A}, \mathcal{O},\mathcal{P}, \mathcal{R},\mathcal{Z}\rangle$, where $o\in \mathcal{O}$ are observations and $\mathcal{Z}^{a}_{s'o} = \mathbb{P}(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$ defines the observation function. Since a single observation is insufficient, agents must use the history to infer useful state representations.

![ Attention maps of RATE and DT on the T-Maze [@ni2023when] task with corridor length $T=8$. DT is trained on full 8-step trajectories, while RATE processes the sequence in three segments of length 3 recurrently, passing information between segments through memory embeddings. ](plots/new-plots/rate_demo.png){#fig:attn_rate_vs_dt width="1.0\\columnwidth"}

```{=latex}
\begin{wrapfigure}[26]{r}{0.45\textwidth}


\begin{algorithm}[H]
\caption{RATE}
\label{alg:rate}
\begin{algorithmic}[1]
\REQUIRE $R \in \mathbb{R}^T, o \in \mathbb{R}^{d_o \times T}, a \in \mathbb{R}^T$
\STATE $\tilde{R} \leftarrow \texttt{Encoder}_{R}(R)$\\
       $\tilde{o} \leftarrow \texttt{Encoder}_{o}(o)$\\
       $\tilde{a} \leftarrow \texttt{Encoder}_{a}(a)$
\STATE $\tau_{0:T-1} \leftarrow \{(\tilde{R}_t, \tilde{o}_t, \tilde{a}_t)\}_{t=0}^{T-1}$
\STATE $M_0 \sim \mathcal{N}(0,1)$
\FOR{$n$ in $[0, T // K - 1]$}
    \STATE $S_n \leftarrow \tau_{nK:(n+1)K}$
    \STATE $\tilde{S}_n \leftarrow \texttt{concat}(M_n, S_n, M_n)$
    \STATE $\hat{a}_n, M_{n+1} \leftarrow \texttt{Transformer}(\tilde{S}_n)$
    \STATE $M_{n+1} \leftarrow \texttt{MRV}(M_n, M_{n+1})$\\
           $\hat{a}_n \rightarrow \mathcal{L}(a_n, \hat{a}_n),\ M_{n+1}$
\ENDFOR
\end{algorithmic}
\end{algorithm}




\begin{algorithm}[H]
\caption{Memory Retention Valve}
\label{alg:mrv}
\begin{algorithmic}[1]
\REQUIRE $M_n, M_{n+1} \in \mathbb{R}^{m \times d}$
\STATE $\mathbf{Q}_h \leftarrow M_n \mathbf{W}_Q^{h\ T}$
\STATE $\mathbf{K}_h \leftarrow M_{n+1} \mathbf{W}_K^{h\ T}$
\STATE $\mathbf{V}_h \leftarrow M_{n+1} \mathbf{W}_V^{h\ T}$
\STATE $M_{n+1}^h \leftarrow \texttt{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d}}\right)\mathbf{V}_h$
\STATE $M_{n+1} \leftarrow \texttt{concat}(M_{n+1}^0, \ldots, M_{n+1}^h)$
\STATE $M_{n+1} \leftarrow M_{n+1} \mathbf{W}_M^T$\\
\textbf{Output:} $M_{n+1}$
\end{algorithmic}
\end{algorithm}

% 
\end{wrapfigure}
```
### Recurrent Action Transformer with Memory {#sec:rate}

Transformers excel at sequence modeling, including offline RL [@chen2021decision; @janner2021offline], but struggle with long-horizon tasks due to fixed context and quadratic attention cost. In memory tasks, agents must recall information seen thousands of steps earlier -- something models like DT cannot do once cues fall outside context. We propose the **Recurrent Action Transformer with Memory (RATE)**, which introduces segment-level recurrence and dynamic memory control. RATE processes trajectories in segments, using lightweight memory and a learnable **Memory Retention Valve (MRV)** to decide what to retain or discard. In T-Maze [@ni2023when], the agent receives a one-bit cue $o_0$ at the first step indicating whether to turn left or right at the end of a maze. Solving the task requires remembering this cue despite sparse rewards. DT fails once $o_0$ leaves the context, making retrieval at inference impossible. `\autoref{fig:attn_rate_vs_dt}`{=latex} shows this: DT attends to $o_0$ only when it fits the context, while RATE segments the input and propagates the memory embeddings, preserving the cue to the end and enabling explicit memory retention.
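
The failure mode described above can be illustrated with a toy calculation that counts context in timesteps rather than tokens. The helpers `visible_steps` and `cue_in_context` are hypothetical names introduced only for this sketch; they model a fixed sliding window and are not the actual DT implementation.

```python
def visible_steps(t, context_len):
    """Timesteps visible at time t to a model that sees only the last `context_len` steps."""
    start = max(0, t - context_len + 1)
    return list(range(start, t + 1))


def cue_in_context(t, context_len, cue_step=0):
    """True if the cue observed at `cue_step` is still inside the window at time t."""
    return cue_step in visible_steps(t, context_len)


# Corridor of T = 8 steps: the cue is observed at step 0 and needed at step T - 1.
T = 8
print(cue_in_context(T - 1, context_len=8))  # True: the whole episode fits in the window
print(cue_in_context(T - 1, context_len=3))  # False: the cue has left the window
```

With a window shorter than the episode, the cue must be carried forward by some other mechanism, which is the role of the memory embeddings in RATE.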

RATE combines memory embeddings [@bulatov2022rmt], cached hidden states[^1] [@dai2019transformerxl], and a novel MRV to handle long and sparse sequences. The architecture is shown in `\autoref{fig:RATE_scheme}`{=latex}. Let a trajectory $\tau_{0:T-1}$ of length $T$ be represented by triplets $(R_t, o_t, a_t)$, where $R_t$ is the return-to-go, $o_t$ the observation, and $a_t$ the action. Each modality is passed through its own encoder (Algorithm `\ref{alg:rate}`{=latex}): $\tilde{R}_t = \texttt{Encoder}_R(R_t),\  \tilde{o}_t = \texttt{Encoder}_o(o_t), \ \tilde{a}_t = \texttt{Encoder}_a(a_t)$. The encoded sequence is split into $N = T // K$ non-overlapping segments $S_n$ of $K$ timesteps each, i.e. $3K$ tokens per segment. Thus, the effective context is $K_{\text{eff}} = N \times K$ timesteps, $N$ times the attention window of a single segment. The same memory embeddings $M_n \in \mathbb{R}^{m \times d}$ are prepended and appended to each segment, where $m$ is the number of memory tokens and $d$ is the embedding dimension. This design follows from the use of causal self-attention in the decoder: the prefix copy of $M_n$ provides read access, since every token in the segment $S_n$ can attend backward to the incoming memory, while the suffix copy provides write access, since these memory tokens appear after the segment in the causal ordering and can therefore attend to all of $S_n$, so that their final-layer outputs form the updated memory. Using only the prefix would make memory readable but not updatable, whereas using only the suffix would prevent the segment from accessing previously stored information. Both copies are therefore required for RATE's recurrent memory mechanism:

$$\tilde{S}_n = \texttt{concat}(M_n, S_n, M_n) \in \mathbb{R}^{(3K + 2m) \times d}.$$

Each segment is then processed by the transformer, $$\hat{a}_n,\, M_{n+1} = \texttt{Transformer}(\tilde{S}_n),$$ and the resulting memory $M_{n+1}$ is refined via the MRV before being passed to the next segment.
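
The segment-level recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`encode_triplets`, `run_segments`, `toy_causal_mixer`) are hypothetical, the transformer is replaced by a causal running mean so that the example runs without a trained model, and the MRV refinement and the cached hidden states are omitted. It only shows how each segment of $3K$ tokens is wrapped into a sequence of length $3K + 2m$ and how the suffix positions become the next memory.

```python
import numpy as np

def encode_triplets(R, o, a):
    """Interleave per-step embeddings into a (3T, d) sequence: R_0, o_0, a_0, R_1, ..."""
    T, d = R.shape
    return np.stack([R, o, a], axis=1).reshape(3 * T, d)

def toy_causal_mixer(x):
    """Stand-in for the transformer (assumption): causal running mean over positions."""
    steps = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / steps

def run_segments(tokens, memory, K, transformer):
    """Process a (3T, d) token sequence in N = T // K segments with recurrent memory."""
    m, _ = memory.shape
    T = tokens.shape[0] // 3
    N = T // K
    outputs = []
    for n in range(N):
        segment = tokens[3 * n * K : 3 * (n + 1) * K]         # (3K, d)
        wrapped = np.concatenate([memory, segment, memory])   # (3K + 2m, d)
        assert wrapped.shape[0] == 3 * K + 2 * m
        hidden = transformer(wrapped)                         # same shape as input
        outputs.append(hidden[m : m + 3 * K])                 # segment positions
        memory = hidden[-m:]                                  # suffix copy -> new memory
    return np.concatenate(outputs), memory
```

Because the stand-in mixer is causal, the suffix memory positions aggregate information from the whole wrapped segment, which mirrors the write role of the suffix copy.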

Naively forwarding memory embeddings leads to error accumulation or overwriting of relevant information. To address this, we introduce the **Memory Retention Valve (MRV)**, a cross-attention module that filters new memory tokens through the lens of the previous ones (Algorithm `\ref{alg:mrv}`{=latex}): $$\text{MRV}(M_n, M_{n+1}) = \texttt{FFN}\left( \texttt{MultiHead}(\text{Query}=M_n,\ \text{Key}=M_{n+1},\ \text{Value}=M_{n+1}) \right)$$

This mechanism allows $M_n$ to control what to retain or overwrite when updating to $M_{n+1}$. Unlike static recurrence[^2], it preserves sparse, long-range information. RATE overcomes DT's limits by extending context with recurrence, preserving early cues via MRV, and retaining key events in sparse settings. As a result, RATE solves tasks where DT fails, generalizes beyond training, and remains competitive on standard MDPs.
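
As a rough illustration of the MRV update, the following NumPy sketch implements a single-head cross-attention in which the previous memory $M_n$ supplies the queries and the new memory $M_{n+1}$ supplies the keys and values. It is a simplification under stated assumptions: the function name `mrv` and the weight names are hypothetical, a single head replaces `MultiHead`, and a linear output projection `W_o` replaces the `FFN`.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mrv(M_prev, M_new, W_q, W_k, W_v, W_o):
    """Single-head cross-attention: previous memory queries the new memory."""
    d = M_prev.shape[-1]
    Q = M_prev @ W_q                       # queries from M_n
    K = M_new @ W_k                        # keys from M_{n+1}
    V = M_new @ W_v                        # values from M_{n+1}
    A = softmax(Q @ K.T / np.sqrt(d))      # (m, m); each row sums to 1
    return A @ V @ W_o, A
```

Each output row is a convex combination of the projected rows of the new memory, with mixing weights chosen by the previous memory.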

###### Attention pattern analysis.

`\autoref{fig:attn_rate_vs_dt}`{=latex} compares attention maps of RATE and DT on a T-Maze sequence. DT (right) attends only within a fixed window, focusing on recent tokens while losing early cues like $o_0$. RATE (left) segments the input and uses memory tokens to propagate information across segments. These tokens retain access to $o_0$ even in later segments, demonstrating RATE's ability to model long-range dependencies beyond the context window through structured memory.

#### Preservation Properties of MRV

We formalize the intuition that the *cross-attention--based* MRV prevents catastrophic overwriting of memory by preserving alignment between consecutive memory states. All vectors are row-vectors. We use $\|\cdot\|_F$ for the Frobenius norm and $\|\cdot\|_2$ for the $\ell_2$ norm.

Let $M_n \in \mathbb{R}^{m \times d}$ and $\tilde{M}_{n+1} \in \mathbb{R}^{m \times d}$ denote the incoming and updated memory embeddings at segment $n$, where $m$ is the number of memory tokens and $d$ is the model dimension. We assume that each row $i$ of $M_n$ is $\ell_2$-normalized: $\|M_{n,i}\|_2 = 1$. The MRV computes the next memory state as: $$Q = M_n W_Q, \quad K = \tilde{M}_{n+1} W_K, \quad V = \tilde{M}_{n+1} W_V, \quad A = \texttt{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right), \quad M_{n+1} = A V W_M.$$

###### $\alpha$-alignment condition.

The memory embeddings are said to satisfy *$\alpha$-alignment*[^3] if there exists a constant $\alpha \in (0,1]$ such that for every row $M_{n,i}$, there exists a row $V_j$ for which: $\langle V_j W_M,\ M_{n,i} \rangle \ge \alpha.$ Provided $\|V_j W_M\|_2 \le 1$, this implies that the angle between $V_j W_M$ and $M_{n,i}$ is at most $\arccos \alpha$. Empirically, this condition holds in trained models, as the transformer tends to preserve useful memory content and avoids orthogonal rotations between segments.

```{=latex}
\begin{theorem}[\textbf{On memory loss bounds}]\label{thm:mrv_preservation}
Let each memory row be $\ell_2$-normalized, the $\alpha$-alignment condition hold, and $A = \texttt{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)$ be the MRV attention matrix. Then:
\begin{align}
    \|M_{n+1} - M_n\|_F 
    &\le \sqrt{2\left(1 - \frac{\alpha}{m}\right)} \cdot \|M_n\|_F,
    \
    \|M_{n+1}\|_F \ge \left(1 - \sqrt{2\left(1 - \frac{\alpha}{m}\right)}\right) \cdot \|M_n\|_F.
    \label{eq:energy_lower_bound}
\end{align}
\noindent In words: at least a $\left(1 - \sqrt{2\left(1 - \frac{\alpha}{m}\right)}\right)$ part of the memory is guaranteed to be preserved after a single MRV update (\Eqref{eq:energy_lower_bound} (right)), and the memory loss is upper bounded by~\Eqref{eq:energy_lower_bound} (left).
\end{theorem}
```
```{=latex}
\begin{proof}
    Since each row of the attention matrix $A$ is a probability distribution, we have $\sum_{j} A_{ij} = 1$ for every $i$.  
    By the pigeonhole principle, there exists an index $j^*$ such that $A_{ij^*} \ge \frac{1}{m}$.
    
    By assumption, for each $M_{n,i}$ there exists a $V_j$ such that $\langle V_j W_M,\ M_{n,i} \rangle \ge \alpha$. In particular, this holds for $j^*$: $\langle V_{j^*} W_M,\ M_{n,i} \rangle \ge \alpha$. Using the MRV definition $M_{n+1,i} = \sum_{j} A_{ij} V_j W_M$, we write:
    \begin{equation}
        \langle M_{n+1,i},\ M_{n,i} \rangle = \sum_{j} A_{ij} \langle V_j W_M,\ M_{n,i} \rangle \ge A_{ij^*} \langle V_{j^*} W_M,\ M_{n,i} \rangle \ge \frac{\alpha}{m}.
    \end{equation} 
    Let $\theta_i$ be the angle between $M_{n+1,i}$ and $M_{n,i}$. Since both vectors are $\ell_2$-normalized, we have:
    $
        \cos \theta_i = \frac{\langle M_{n+1,i}, M_{n,i} \rangle}{\|M_{n+1,i}\|_2 \cdot \|M_{n,i}\|_2} \ge \frac{\alpha}{m}.
    $
    Using the identity $\|u - v\|_2^2 = 2(1 - \cos \theta)$ for unit vectors:
    \begin{equation}
        \|M_{n+1,i} - M_{n,i}\|_2^2 \le 2\left(1 - \frac{\alpha}{m}\right), \quad
        \Rightarrow \
        \|M_{n+1,i} - M_{n,i}\|_2 \le \sqrt{2\left(1 - \frac{\alpha}{m}\right)}.
    \end{equation}
    Summing over all memory tokens and applying the previous bound:
    $
        \|M_{n+1} - M_n\|_F^2 = \sum_{i=1}^{m} \|M_{n+1,i} - M_{n,i}\|_2^2 \le 2m \left(1 - \frac{\alpha}{m} \right),
    $
    which simplifies to:
    $
        \|M_{n+1} - M_n\|_F \le \sqrt{2m\left(1 - \frac{\alpha}{m}\right)}.
    $
    Consequently, since $\|M_n\|_F = \sqrt{m}$ due to row normalization, we conclude:
    $
        \|M_{n+1} - M_n\|_F \le \sqrt{2\left(1 - \frac{\alpha}{m}\right)} \cdot \|M_n\|_F.
    $

    We now derive the lower bound~\Eqref{eq:energy_lower_bound} (right) using the reverse triangle inequality. For any matrices $M_{n+1}, M_n \in \mathbb{R}^{m \times d}$, we have:
    $
        \|M_{n+1}\|_F \ge \|M_n\|_F - \|M_{n+1} - M_n\|_F.
    $
    Substituting the upper bound from~\Eqref{eq:energy_lower_bound} (left):
    $
        \|M_{n+1} - M_n\|_F \le \sqrt{2\left(1 - \frac{\alpha}{m}\right)} \cdot \|M_n\|_F,
    $
    we obtain:
    $
        \|M_{n+1}\|_F \ge \left(1 - \sqrt{2\left(1 - \frac{\alpha}{m}\right)}\right) \cdot \|M_n\|_F,
    $
    which completes the proof of~\Eqref{eq:energy_lower_bound}.
\end{proof}
```
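
To make the two coefficients in Theorem `\ref{thm:mrv_preservation}`{=latex} concrete, the short sketch below evaluates them for given $\alpha$ and $m$. The helper name `mrv_bound_coefficients` is hypothetical and the code only evaluates the stated formulas. One thing the numbers make visible: the lower-bound coefficient $1 - \sqrt{2(1 - \alpha/m)}$ is positive only when $\alpha/m > 1/2$, so for larger $m$ the second inequality holds trivially.

```python
import math

def mrv_bound_coefficients(alpha, m):
    """Return (change, preserved) coefficients from the theorem.

    change    = sqrt(2 * (1 - alpha / m))   # bounds ||M_{n+1} - M_n||_F / ||M_n||_F
    preserved = 1 - change                  # bounds ||M_{n+1}||_F / ||M_n||_F from below
    """
    if not (0.0 < alpha <= 1.0):
        raise ValueError("alpha must lie in (0, 1]")
    if m < 1:
        raise ValueError("m must be a positive integer")
    change = math.sqrt(2.0 * (1.0 - alpha / m))
    return change, 1.0 - change

# Perfect alignment with a single memory token: no change is allowed by the bound.
print(mrv_bound_coefficients(1.0, 1))
# Four memory tokens with alpha = 0.8: the 'preserved' coefficient is negative.
print(mrv_bound_coefficients(0.8, 4))
```
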
```{=latex}
\begin{figure*}[t!]
    
    \begin{center}
    \includegraphics[width=\textwidth]{plots/rate_vizdoom_main.png}
    
    \caption{
        Comparison of RATE with transformer baselines (DT, RMT, TrXL) on ViZDoom-Two-Colors
        trained on the first $T_{\text{train}}=90$ steps of the episode:
        with (\textbf{a}) and without (\textbf{b}) pillar in the first 45 steps of the episode; 
        calculated at environment steps $0$ -- $89$ (\textbf{c}) and $90$ -- $179$ (\textbf{d}) 
        with pillar in the first 45 steps; depending on the return-to-go (\textbf{e, f, g}).
        Episode timeout -- 2100 steps.}
    \label{fig:vizdoom_pil_no_pil}
    \end{center}
    
\end{figure*}
```
### Experimental Evaluation {#sec:exp_eval}

We designed our experiments to achieve two main goals: (a) to showcase the strengths of the RATE model in memory-intensive environments (T-Maze, ViZDoom-Two-Colors, Memory Maze, Minigrid-Memory, POPGym), and (b) to assess its effectiveness in standard MDPs, demonstrating its versatility across domains.

**Baselines.** To evaluate the performance of RATE, we compare it against a diverse set of baselines spanning several categories: **transformer-based models** including *Decision Transformer* (DT) [@chen2021decision], *Recurrent Memory Transformer* (RMT) [@bulatov2022rmt] and *Transformer-XL* (TrXL) [@dai2019transformerxl], which we adapted for offline RL, and *Long-Short Decision Transformer* (LSDT) [@wang2025longshort]; **classic baselines** such as *Behavior Cloning with an MLP backbone*

```{=latex}
\begin{wrapfigure}[16]{r}{0.6\textwidth}
    
    
    \includegraphics[width=1.0\linewidth]{plots/new-plots/vizdoom_150.pdf}
    
    \caption{
    ViZDoom-Two-Colors results with $T_{\text{train}}{=}150$. The top plot shows average return across all episodes (yellow), and separately for red (red) and green (green) pillars. The bottom plot shows the imbalance metric -- absolute difference between red and green performance. Lower imbalance indicates more consistent behavior and is as important as average return.
    }
    \label{fig:vizdoom_150}
    % 
\end{wrapfigure}
```
(BC-MLP) and *Conservative Q-Learning* [@cql] with an MLP backbone (CQL-MLP); **recurrent models** including *Behavior Cloning with an LSTM backbone* [@hochreiter1997long] (BC-LSTM), *CQL with LSTM* (CQL-LSTM), *Decision LSTM* (DLSTM) [@dlstm], and its *GRU-based variant* [@gru] (DGRU); and a **state space model** baseline, *Decision Mamba* (DMamba) [@ota2024decision; @lv2024decision].

**Memory-intensive tasks.** We evaluate RATE in tasks that require agents to retain information over time (`\autoref{fig:memory_envs}`{=latex}); full details are in `\autoref{app:envs}`{=latex}. **ViZDoom-Two-Colors**: the agent must recall a briefly visible pillar color to collect matching items; **T-Maze**: a cue at the start indicates the correct turn at the end, testing sparse long-term memory; **Minigrid-Memory**: like T-Maze, but the cue must be located first, combining memory and credit assignment [@ni2023when]; **Memory Maze**: the agent searches for objects matching a changing target color, requiring spatial memory; **POPGym**: a suite of 48 partially observable tasks [@morad2023popgym] designed to probe different aspects of memory.

```{=latex}
\begin{wrapfigure}[13]{r}{0.6\textwidth}
    
    \begin{center}
    \includegraphics[width=\linewidth]{plots/new-plots/tmaze_limits.pdf}
    
    \caption{
        T-Maze generalization task.
    }
    \label{fig:tmaze_limits}
    \end{center}
    % 
\end{wrapfigure}
```
#### Experimental Results {#sec:exp_res}

###### ViZDoom-Two-Colors.

`\autoref{fig:vizdoom_150}`{=latex} shows training with $T_{\text{train}}{=}150$ and inference up to 2100 steps, where the pillar disappears at step 45. RATE achieves the highest return and lowest imbalance between the red and green pillars, indicating strong and consistent memory use. `\autoref{fig:vizdoom_pil_no_pil}`{=latex} further tests transformer models trained with $T_{\text{train}}=90$ on their ability to retain early cues. With the pillar present (a), RATE again yields the highest and most stable return. DT and TrXL underperform and show a higher imbalance. Removing the pillar (b) degrades the memory-augmented models, confirming their reliance on the initial cue. DT's unchanged performance across (a) and (b) highlights its failure to leverage long-term dependencies.

This limitation is clearer in `\autoref{fig:vizdoom_pil_no_pil}`{=latex} (c, d), which separates performance within and beyond the 90-step context. DT's return drops by nearly 50$\%$ in red-pillar episodes once the cue leaves the window, while memory models (RATE, RMT, TrXL) remain stable, demonstrating their ability to retain and use information over long horizons.

`\autoref{fig:vizdoom_pil_no_pil}`{=latex} (e, f, g) shows model performance across target reward levels. RATE consistently outperforms all baselines overall (e), and this advantage is even clearer when separating red (f) and green (g) pillar episodes. While other models show large disparities, RATE maintains stable performance across both conditions, demonstrating effective use of initial cues and validating the strength of its memory architecture.
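
The imbalance metric used for ViZDoom-Two-Colors is the absolute difference between red-pillar and green-pillar performance. A minimal sketch follows, assuming that "performance" means the mean return per pillar color; the function name `imbalance` is hypothetical, and the example numbers are the DT and RATE rows of `\autoref{tab:oracle_dt}`{=latex}, used here purely for illustration.

```python
def imbalance(red_return, green_return):
    """Absolute gap between mean returns on red-pillar and green-pillar episodes."""
    return abs(red_return - green_return)

# (red, green) mean returns as reported in the OracleDT comparison table.
reported = {"DT": (7.2, 42.3), "RATE": (38.2, 44.7)}
gaps = {name: imbalance(red, green) for name, (red, green) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: imbalance = {gap:.1f}")
```

A smaller gap means the policy behaves similarly regardless of the pillar color it had to remember.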

###### T-Maze.

`\autoref{fig:tmaze_limits}`{=latex} shows how the models generalize in Passive T-Maze as the inference length grows from 9 to 9600 steps. All models were trained on episodes of up to 900 steps; extrapolation beyond this requires long-horizon generalization. RATE achieves 100$\%$ success across all in-distribution lengths and performs well even at 9600-step inference, corresponding to trajectories of $3\times9600=28800$ tokens due to the $(R,o,a)$ triplets. This highlights RATE's ability to retain and leverage sparse cues

```{=latex}
\begin{wrapfigure}[23]{r}{0.6\textwidth}
    
    \begin{center}
    \includegraphics[width=\linewidth]{plots/new-plots/tmaze-heatmaps.pdf}
    
    \caption{
        Heatmaps of success rates on T-Maze tasks. The black dashed line separates 
        \textit{in-distribution inference} (with $T_{val}\leq T_{train}$) from 
        \textit{out-of-distribution inference} (with $T_{val}>T_{train}$).
        Results for other baselines can be found in Appendix,~\autoref{fig:all_heatmaps}.}
    
    \label{fig:tmaze_heatmaps}
    \end{center}
\end{wrapfigure}
```
over extremely long horizons. Other transformers (e.g., DT, LSDT) match RATE on training-length sequences but degrade sharply beyond them. DT collapses to $\sim 50\%$ even at moderate lengths due to its lack of memory. Memory-augmented models like RMT generalize slightly further but eventually deteriorate as well. TrXL performs similarly to DT, suggesting that hidden-state caching alone is insufficient for long-range recall of sparse information. RNNs and SSMs (e.g., BC-LSTM, DMamba) show flat curves and fail to learn from sparse long sequences.

RATE both interpolates within training and extrapolates well beyond, a key strength for solving sparse POMDPs. Notably, poor performance of some memory baselines in `\autoref{fig:tmaze_limits}`{=latex} is due to difficulty modeling long sequences during training, not just generalization failure: even for $T_{\text{val}} \leq T_{\text{train}}$, they may fail. However, when trained on shorter sequences, some models learn generalizable behaviors. `\autoref{fig:tmaze_heatmaps}`{=latex} visualizes inference performance for RATE (top), DT (middle), and BC-LSTM (bottom) across training/validation lengths. The black dashed line separates *in-distribution* ($T_{\text{val}} \leq T_{\text{train}}$) from *out-of-distribution* ($T_{\text{val}} > T_{\text{train}}$). From `\autoref{fig:tmaze_heatmaps}`{=latex} (bottom), BC-LSTM generalizes well when trained on short sequences ($\leq150$), but degrades as training lengths grow, reaching $\sim$0.5 when trained on $T \geq 600$, likely due to vanishing gradients or limited capacity [@pascanu2013difficulty; @trinh2018learning]. DT (`\autoref{fig:tmaze_heatmaps}`{=latex} (middle)) handles long training sequences via attention, but fails on longer validation sequences due to fixed context. In contrast, RATE (`\autoref{fig:tmaze_heatmaps}`{=latex} (top)) maintains high success across all validation lengths, enabled by its combination of attention and recurrent memory, which overcomes the limitations of both DT and RNNs.

```{=latex}
\begin{wrapfigure}[12]{r}{0.5\linewidth}
    
    \begin{center}
    \includegraphics[width=\linewidth]{plots/new-plots/minigrid_boxplot.pdf}
    
    \caption{
        Minigrid-Memory generalization task.
    }
    \label{fig:minigrid_boxplot}
    \end{center}
    % 
\end{wrapfigure}
```
**Minigrid-Memory.** `\autoref{fig:minigrid_boxplot}`{=latex} presents average returns on Minigrid-Memory, where all models were trained on grids of fixed size $41 \times 41$ and evaluated on a wide range of unseen grid sizes from $11 \times 11$ to $501 \times 501$. RATE achieves consistently high performance across the entire spectrum, demonstrating both strong interpolation and extrapolation capabilities. While TrXL also performs well on average, its variance is notably higher, indicating sensitivity to grid scale.

```{=latex}
\small
```
```{=latex}
\begin{adjustbox}{width=1\columnwidth}
    \begin{tabular}{lccccccc}
\toprule
    \textbf{Method} & \textbf{Random} &\textbf{BC-LSTM} & \textbf{CQL-LSTM} & \textbf{DT} & \textbf{RMT} & \textbf{TrXL} & \textbf{RATE} \\
    \midrule
    \textbf{Return} & 0.00{\scriptsize$\pm$0.00}
                    & 4.75{\scriptsize$\pm$0.15} 
                    & 0.19{\scriptsize$\pm$0.02} 
                    & 6.83{\scriptsize$\pm$0.51} 
                    & 7.27{\scriptsize$\pm$0.21} 
                    & 7.12{\scriptsize$\pm$0.24} 
                    & \textbf{7.64}{\scriptsize$\pm$0.41} \\
\bottomrule
\end{tabular}
\end{adjustbox}
```
**Memory Maze.** `\autoref{tab:memory_maze}`{=latex} presents results on the Memory Maze task. RATE achieves higher average episode returns by effectively capturing implicit structure, such as maze layout. For reference, the dataset's average return is 4.69. All models were trained on 90-step trajectory subsequences, while full episodes span 1000 steps.

**POPGym.** To further assess generalization and memory capabilities, we evaluated models on all 48 tasks from the POPGym benchmark suite, which covers a wide range of partially observable RL scenarios. The benchmark is split into 33 *memory puzzle tasks* and 15 *reactive POMDP tasks*.

```{=latex}
\begin{wraptable}[8]{r}{0.5\textwidth}
    
    \small
    
    \caption{Aggregated average returns on 48 POPGym tasks, split into memory and reactive subsets.}
    \label{tab:popgym_summary}
    % \setlength{\tabcolsep}{3pt}
    
    \begin{tabular}{@{}l@{\hskip -4pt}r@{\hskip 4pt}r@{\hskip 4pt}r@{\hskip 4pt}r@{\hskip 4pt}r@{}}
    \toprule
    \textbf{Tasks} & \textbf{Rand.} & \textbf{BC-MLP} & \textbf{DT} & \textbf{BC-LSTM} & \textbf{RATE} \\
    \midrule
    All (48)      & -12.2 & -6.8  & 5.8  & 9.0  & \textbf{9.5} \\
    Memory (33)   & -14.6 & -11.9 & -3.5 & -0.2 & \textbf{0.5} \\
    Reactive (15) & 2.3   & 5.1   & \textbf{9.3} & \textbf{9.1} & \textbf{9.1} \\
    \bottomrule
    \end{tabular}
    
\end{wraptable}
```
`\autoref{tab:popgym_summary}`{=latex} reports average normalized scores across all tasks and subsets. RATE achieves the highest overall score ($9.54$), outperforming all baselines. On the challenging memory tasks, RATE maintains a positive average score ($0.45$), while all other models fall below zero -- indicating a consistent failure to exploit long-term dependencies. Notably, DT scores $-3.49$ and BC-MLP drops to $-11.91$, showing the limitations of both context-limited transformers and non-recurrent policies.

On reactive tasks, all models perform better, and the gap between memory-based and non-memory models narrows. RATE, DT, and BC-LSTM achieve almost the same results, suggesting that the greatest performance gains from RATE's memory mechanisms occur on memory puzzle tasks. For simpler reactive POMDPs, lightweight memory mechanisms appear sufficient. These results also underscore RATE's ability to generalize across both puzzle and reactive settings, confirming that its memory architecture does not hinder performance in simpler tasks while offering clear benefits in those with temporal dependencies. More details are provided in Appendix, `\autoref{tab:popgym_full}`{=latex}.

```{=latex}
\small
```
```{=latex}
\begin{adjustbox}{width=1\columnwidth}
    \begin{tabular}{ll|cccccccc}
    \toprule

    \textbf{Dataset} 
    & \textbf{Environment} 
    & \textbf{CQL} 
    & \textbf{DT}
    & \textbf{TAP}
    & \textbf{TT}
    & 
    \makecell{\textbf{DMamba}\\ \citep{ota2024decision}}
    
    & 
    \makecell{\textbf{DMamba}\\ \citep{lv2024decision}}
    
    & \textbf{MambaDM}
    & \textbf{RATE (ours)}\\ 

    \midrule

    ME & HalfCheetah    
        & 91.6
        & 86.8{\scriptsize$\pm$1.3}
        & 91.8{\scriptsize$\pm$0.8}
        & \textbf{\gtopa{95.0}{\scriptsize$\pm$0.2}}
        & 91.9{\scriptsize$\pm$0.6}
        & \textbf{\gtopb{93.5}{\scriptsize$\pm$0.1}}
        & 86.5{\scriptsize$\pm$1.2}
        & 87.4{\scriptsize$\pm$0.1} \\ 

    ME & Hopper
    & 105.4 
        & \textbf{\gtopb{107.6}{\scriptsize$\pm$1.8}}~~~
        & 105.5{\scriptsize$\pm$1.7}~ 
        & \textbf{\gtopa{110.0}{\scriptsize$\pm$2.7}}~~
        & \textbf{\gtopb{111.1}{\scriptsize$\pm$0.3}}~~
        & \textbf{\gtopa{111.9}{\scriptsize$\pm$1.8}}~~
        & \textbf{\gtopb{110.5}{\scriptsize$\pm$0.3}}~~~
        & \textbf{\gtopa{112.5}{\scriptsize$\pm$0.2}}~~~ \\ 

    ME & Walker2d
    & \textbf{\gtopa{108.8}}
        & \textbf{\gtopb{108.1}{\scriptsize$\pm$0.2}}~~~
        & \textbf{\gtopb{107.4}{\scriptsize$\pm$0.9}}~~
        & \textbf{\gtopb{101.9}{\scriptsize$\pm$6.8}}~~
        & \textbf{\gtopa{108.3}{\scriptsize$\pm$0.5}}~~
        & \textbf{\gtopa{111.6}{\scriptsize$\pm$1.2}}~~
        & \textbf{\gtopa{108.8}{\scriptsize$\pm$0.1}}~~~
        & \textbf{\gtopa{108.7}{\scriptsize$\pm$0.5}}~~~ \\ 

    \midrule

    M & HalfCheetah
        & 44.4
        & ~42.6{\scriptsize$\pm$0.1}~
        & \textbf{\gtopb{45.0}{\scriptsize$\pm$0.1}}
        & \textbf{\gtopa{46.9}{\scriptsize$\pm$0.4}}
        & ~~42.8{\scriptsize$\pm$0.1}~
        & 43.8{\scriptsize$\pm$0.2}
        & ~~42.8{\scriptsize$\pm$0.1}~~
        & 43.5{\scriptsize$\pm$0.3} \\ 

    M & Hopper
    & 58.0 
        & \textbf{\gtopb{~67.6}{\scriptsize$\pm$1.0}}~~
        & ~~63.4{\scriptsize$\pm$1.4}~
        & ~61.1{\scriptsize$\pm$3.6}
        & ~~\textbf{\gtopa{83.5}{\scriptsize$\pm$12.5}}
        & \textbf{\gtopa{98.5}{\scriptsize$\pm$8.2}}
        & \textbf{\gtopa{85.7}{\scriptsize$\pm$7.8}}~
        & \textbf{\gtopb{77.4}{\scriptsize$\pm$1.4}}~ \\ 

    M & Walker2d
    & 72.5 
        & ~74.0{\scriptsize$\pm$1.4}~
        & ~~64.9{\scriptsize$\pm$2.1}~
        & \textbf{\gtopa{79.0}{\scriptsize$\pm$2.8}}
        & \textbf{\gtopb{78.2}{\scriptsize$\pm$0.6}}
        & \textbf{\gtopa{80.3}{\scriptsize$\pm$0.1}}
        & \textbf{\gtopb{78.2}{\scriptsize$\pm$0.6}}~
        & \textbf{\gtopa{80.7}{\scriptsize$\pm$0.7}}~ \\ 

    \midrule

    MR & HalfCheetah
        & \textbf{\gtopa{45.5}}
        & ~36.6{\scriptsize$\pm$0.8}~
        & ~\textbf{\gtopa{40.8}{\scriptsize$\pm$0.6}}~
        & \textbf{\gtopa{41.9}{\scriptsize$\pm$2.5}}
        & \textbf{\gtopb{~~39.6}{\scriptsize$\pm$0.1}}~~
        & \textbf{\gtopa{40.8}{\scriptsize$\pm$0.4}}
        & \textbf{\gtopb{~~39.1}{\scriptsize$\pm$0.1}}~~~
        & \textbf{\gtopb{~~39.0}{\scriptsize$\pm$0.6}}~~~ \\

    MR & Hopper
    & \textbf{\gtopa{95.0}}
        & \textbf{\gtopb{82.7}{\scriptsize$\pm$7.0}}~
        & ~\textbf{\gtopb{87.3}{\scriptsize$\pm$2.3}}~
        & \textbf{\gtopa{91.5}{\scriptsize$\pm$3.6}}
        & \textbf{\gtopb{82.6}{\scriptsize$\pm$4.6}}
        & \textbf{\gtopa{89.1}{\scriptsize$\pm$4.3}}
        & \textbf{\gtopb{86.1}{\scriptsize$\pm$2.5}}~
        & \textbf{\gtopb{83.7}{\scriptsize$\pm$8.2}}~ \\ 

    MR & Walker2d
    & \textbf{\gtopa{77.2}}
        & ~66.6{\scriptsize$\pm$3.0}~
        & ~~66.8{\scriptsize$\pm$3.1}~
        & \textbf{\gtopa{82.6}{\scriptsize$\pm$6.9}}
        & ~~70.9{\scriptsize$\pm$4.3}~
        & \textbf{\gtopa{79.3}{\scriptsize$\pm$1.9}}
        & \textbf{\gtopb{73.4}{\scriptsize$\pm$2.6}}~
        & \textbf{\gtopb{73.7}{\scriptsize$\pm$1.4}}~ \\ 


    \midrule

    & \textbf{Average}
    & \textbf{\gtopb{77.6}}
    & 74.7
    & 74.8
    & \textbf{\gtopa{78.9}}
    & \textbf{\gtopa{78.8}}
    & \textbf{\gtopa{83.2}}
    & \textbf{\gtopa{79.0}}
    & \textbf{\gtopa{78.5}} \\ 

    \bottomrule

    \end{tabular}
    \end{adjustbox}
```
**Atari and MuJoCo.** We evaluate RATE on standard RL benchmarks: Atari games and MuJoCo control tasks (`\autoref{tab:mujococompare}`{=latex}, `\autoref{tab:ataricompare}`{=latex}). For comparison, we include results from recent state-of-the-art methods: Decision Mamba (DMamba) [@ota2024decision; @lv2024decision], Mamba as Decision Maker (MambaDM) [@cao2024mamba], Conservative Q-Learning (CQL) [@cql], Trajectory Transformer (TT) [@janner2021offline], and TAP [@jiang2023efficient], as reported in their original papers. The results show that RATE is competitive with specialized offline RL algorithms on both benchmarks and attains the best or tied-best score on several tasks (Hopper-ME, Walker2d-M, Breakout, and Pong). Combined with its strong performance on memory-intensive tasks, this highlights RATE's versatility as a general-purpose offline RL model. See `\autoref{app:training}`{=latex} for full training details and `\autoref{tab:stats}`{=latex} for the evaluation protocol.
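
For readers checking the MuJoCo table, the Average row can be reproduced from the per-task scores. The sketch below assumes that the average is the unweighted mean over the nine dataset-environment pairs (ME, M, MR for HalfCheetah, Hopper, Walker2d), rounded to one decimal; the helper name `table_average` is hypothetical, and the score lists are copied from the RATE and DMamba [@lv2024decision] columns.

```python
# Per-task normalized scores, in table order (ME, M, MR; HalfCheetah, Hopper, Walker2d).
rate_scores = [87.4, 112.5, 108.7, 43.5, 77.4, 80.7, 39.0, 83.7, 73.7]
dmamba_lv_scores = [93.5, 111.9, 111.6, 43.8, 98.5, 80.3, 40.8, 89.1, 79.3]

def table_average(scores):
    """Unweighted mean over tasks, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)

print("RATE:", table_average(rate_scores))
print("DMamba:", table_average(dmamba_lv_scores))
```
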

```{=latex}
\small
```
```{=latex}
\begin{adjustbox}{width=1\columnwidth}
\begin{tabular}{lcccccc}
\toprule

\textbf{Environment} 
& \textbf{CQL}
& \textbf{BC}
& \textbf{DT} 
& \makecell{\textbf{DMamba} \\ \citep{ota2024decision}}
& \textbf{MambaDM}
& \textbf{RATE (Ours)} \\

\midrule

Breakout    
& 62.5
& 42.8
& 76.9{\scriptsize$\pm$27.3}
& 70.6{\scriptsize$\pm$9.3}
& \textbf{\gtopb{106.9}{\scriptsize$\pm$5.8}}
& \textbf{\gtopa{111.0}{\scriptsize$\pm$2.9}} \\

Qbert
& \textbf{\gtopa{14013.2}}
& 2862.0
& 2215.8{\scriptsize$\pm$1523.7}~
& ~~5786.0{\scriptsize$\pm$1295.2}~~
& ~10052.5{\scriptsize$\pm$1116.5}
& \textbf{\gtopb{12486.9}{\scriptsize$\pm$280.4}}~ \\

SeaQuest
& 782.2
& \textbf{\gtopb{992.1}}
& \textbf{\gtopa{1129.3}{\scriptsize$\pm$189.0}}~~~
& \textbf{\gtopb{992.1}{\scriptsize$\pm$57.7}}~~
& \textbf{\gtopa{1286.0}{\scriptsize$\pm$42.0}}~
& \textbf{\gtopb{1037.9}{\scriptsize$\pm$53.7}}~ \\

Pong
& \textbf{\gtopa{18.8}}
& 6.4
& \textbf{\gtopb{17.1}{\scriptsize$\pm$2.9}}~~
& ~~~1.6{\scriptsize$\pm$15.3} 
& ~~\textbf{\gtopa{18.4}{\scriptsize$\pm$0.8}}
& ~~\textbf{\gtopa{18.8}{\scriptsize$\pm$0.3}} \\

\bottomrule
\end{tabular}
\end{adjustbox}
```
### Ablation Study {#sec:abstudy}

We conduct a comprehensive ablation study to assess the contributions of individual components and architectural choices in RATE, structured around three key research questions.

1.  *How do different components of RATE influence performance on memory tasks?* (RQ1)

2.  *What are the upper-bound results RATE can achieve with access to perfect memory?* (RQ2)

3.  *What role does the MRV play, and which configuration is most effective?* (RQ3)

Further ablations exploring key transformer parameters, the number of memory tokens, and sequence segmentation strategies are provided in `\autoref{app:add_app}`{=latex} and `\autoref{app:trans_abb_studies}`{=latex}.

**RQ1: Impact of RATE components.** `\label{app:rate_components}`{=latex} To assess the contribution of individual memory mechanisms in RATE, we performed inference-time ablations by replacing memory components with random noise. In T-Maze ($K=30$, $N=3$ segments), corrupting memory embeddings $M$ sharply reduced performance to 50$\%$ success (see `\autoref{fig:vd-tmaze-noise}`{=latex}, right). The agent still reached the decision point but failed

```{=latex}
\begin{wrapfigure}[14]{r}{0.5\textwidth}
    
    
    
        \includegraphics[width=\linewidth]{plots/rate_vizdoom_noise_solo.png}
        % \caption*{\small \textbf{(a)} ViZDoom-Two-Colors: performance drop when memory tokens and cached hidden states are replaced with noise.}
    %
    \hfill
    
        \includegraphics[width=\linewidth]{plots/tmaze-noised.pdf}
        % \caption*{\small \textbf{(b)} T-Maze: SR degrades when memory embeddings are noised.}
    
    
    % \caption{Impact of memory corruption on RATE performance at inference time. 
    % \textbf{(left)} ViZDoom-Two-Colors: performance drop when memory tokens and cached hidden states are replaced with noise.
    % \textbf{(right)} T-Maze: SR degrades when memory embeddings are noised.
    % }
    \caption{Effect of memory corruption on RATE at inference. 
    \textbf{(left)} ViZDoom: performance drops when memory tokens or cached states are noised. 
    \textbf{(right)} T-Maze: SR degrades when memory embeddings are corrupted.
    }
    \label{fig:vd-tmaze-noise}
\end{wrapfigure}
```
to turn correctly -- showing it retained navigation skills but lost the initial cue. Thus, memory embeddings act as dedicated storage for task-relevant information, while transformer layers encode general behavior. In ViZDoom-Two-Colors (see `\autoref{fig:vd-tmaze-noise}`{=latex}, left), adding noise separately to embeddings and cached hidden states showed performance was more sensitive to hidden-state corruption, highlighting their role in continuous rewards and long dependencies. Overall, memory embeddings matter most for sparse, discrete decision points (e.g., T-Maze), while cached representations are crucial in dense, continuous-feedback tasks like ViZDoom.
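
The inference-time corruption used in this ablation can be sketched as follows. This is an illustrative sketch only: the layout of the recurrent state (a dict with a `memory` array and a list of cached hidden states), the function names, and the choice of zero-mean Gaussian noise are all assumptions, since the text states only that the components are replaced with random noise.

```python
import numpy as np

def corrupt(array, rng, scale=1.0):
    """Return Gaussian noise (assumed distribution) with the shape and dtype of `array`."""
    return rng.normal(0.0, scale, size=array.shape).astype(array.dtype)

def ablate_memory(state, rng, noise_embeddings=False, noise_cache=False):
    """Copy the recurrent state, replacing the selected components with noise."""
    out = {
        "memory": state["memory"].copy(),           # memory embeddings M, shape (m, d)
        "cache": [h.copy() for h in state["cache"]] # cached hidden states, one per layer
    }
    if noise_embeddings:
        out["memory"] = corrupt(out["memory"], rng)
    if noise_cache:
        out["cache"] = [corrupt(h, rng) for h in out["cache"]]
    return out
```

Corrupting one component at a time, as in the ViZDoom-Two-Colors experiment, isolates how much the policy depends on each memory pathway.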

```{=latex}
\begin{wraptable}[15]{r}{0.5\textwidth}
    
    
    \small
    
    \caption{
    Performance comparison between DT, RATE, and OracleDT. OracleDT is an oracle-informed variant used solely to approximate the upper bound and is not a feasible baseline.
    }
    \label{tab:oracle_dt}
    
    \begin{adjustbox}{width=0.5\textwidth}
    \begin{tabular}{lccc}
    \toprule
    \textbf{T-Maze} \\
    \midrule
    \textbf{Success Rate} & \textbf{OracleDT} & \textbf{DT} & \textbf{RATE} \\
    \midrule
    $T = 90$   & \textbf{1.00{\scriptsize$\pm$0.00}} & \textbf{1.00{\scriptsize$\pm$0.00}} & \textbf{1.00{\scriptsize$\pm$0.00}} \\
    $T = 480$  & \textbf{1.00{\scriptsize$\pm$0.00}} & 0.50{\scriptsize$\pm$0.00}          & 0.90{\scriptsize$\pm$0.07}          \\
    $T = 900$  & \textbf{1.00{\scriptsize$\pm$0.00}} & 0.50{\scriptsize$\pm$0.00}          & 0.90{\scriptsize$\pm$0.07}          \\
    \midrule
    \textbf{ViZDoom-Two-Colors} \\
    \midrule
    \textbf{Total Reward}  & \textbf{56.5{\scriptsize$\pm$0.8}}  & 24.8{\scriptsize$\pm$1.4}  & 41.5{\scriptsize$\pm$1.0}  \\
    \textbf{Red Pillars}   & \textbf{55.3{\scriptsize$\pm$1.6}}  & 7.2{\scriptsize$\pm$0.4}   & 38.2{\scriptsize$\pm$5.1}  \\
    \textbf{Green Pillars} & \textbf{57.2{\scriptsize$\pm$0.5}}  & 42.3{\scriptsize$\pm$3.3}  & 44.7{\scriptsize$\pm$5.8}  \\
    \bottomrule
    \end{tabular}
    \end{adjustbox}
    
\end{wraptable}
```
**RQ2: Performance upper-bound estimate.** `\label{app:oracle_dt}`{=latex} To estimate the upper-bound performance achievable by RATE, we introduce *OracleDT* -- a variant of Decision Transformer augmented with perfect prior knowledge about the environment. Specifically, OracleDT receives an additional input vector $v \in \mathbb{R}^{1 \times \texttt{d\_model}}$ prepended and appended to the context sequence, i.e., $S' = \texttt{concat}(v, S, v)$. This vector encodes one bit of environment-critical information known in advance. In T-Maze, $v$ represents the initial clue ($v_i = 0$ if left, $v_i = 1$ if right); in ViZDoom-Two-Colors, it encodes the pillar color ($v_i = 0$ for red, $v_i = 1$ for green). This setup mirrors a context augmented with perfectly trained memory embeddings, i.e., $\texttt{concat}(M, S, M)$, where $M$ encodes all relevant information. As a result, OracleDT provides an empirical upper bound on achievable performance when key information is available explicitly. In such settings, we expect the relation $R[\text{OracleDT}] \geq R[\text{RATE}] \geq R[\text{DT}]$ to hold (see `\autoref{tab:oracle_dt}`{=latex}). Since this privileged information is not generally accessible during training, OracleDT is not a viable baseline but serves as a useful reference. The gap between OracleDT and RATE quantifies the effectiveness of RATE's memory mechanisms in autonomously discovering, storing, and utilizing task-relevant information.
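
The OracleDT input construction $S' = \texttt{concat}(v, S, v)$ can be illustrated with a short NumPy sketch. The function name `oracle_context` is hypothetical; the sketch assumes, as described above, that every component of $v$ equals the privileged bit (0 or 1) and that $S$ is an already-embedded context of shape $(L, \texttt{d\_model})$.

```python
import numpy as np

def oracle_context(segment, bit):
    """Wrap an embedded context S with the oracle vector v on both sides."""
    if bit not in (0, 1):
        raise ValueError("the oracle cue is a single bit")
    d_model = segment.shape[1]
    v = np.full((1, d_model), float(bit))            # v in R^{1 x d_model}, v_i = bit
    return np.concatenate([v, segment, v], axis=0)   # shape (L + 2, d_model)

# Example: a T-Maze context of 6 tokens where the correct turn is "right" (bit = 1).
S = np.zeros((6, 4))
S_prime = oracle_context(S, bit=1)
print(S_prime.shape)
```

The layout mirrors $\texttt{concat}(M, S, M)$ with a single memory token whose content is given rather than learned.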

**RQ3: Memory Retention Valve scheme ablation.** `\label{app:mrv_ablation}`{=latex} In the T-Maze environment, we observed that without MRV, RATE's performance deteriorates on long corridors ($L \gg K$), degrading toward chance level (SR $= 50\%$; see `\autoref{tab:mrv_ablation}`{=latex}). This degradation occurs because the critical information is written into the memory embeddings while the first segment of the sequence is processed and must be retrieved when making the decision in the last segment. Meanwhile, due to the recurrent structure of the architecture, the memory embeddings continue to be updated while intermediate segments are processed, even though no new information needs to be memorized, causing important information to leak out of memory. To address this information loss, we introduced the **Memory Retention Valve (MRV)** and evaluated five variants: **MRV-CA-1**: Cross-attention mechanism where updated embeddings ($M_{n+1}$) query incoming ones ($M_n$); **MRV-CA-2**: Reversed variant where incoming embeddings ($M_n$) query updated ones ($M_{n+1}$); **MRV-G**: Gating mechanism inspired by GTrXL [@parisotto2020stabilizing]; **MRV-GRU**: GRU-based [@gru] memory processing with

```{=latex}
\begin{wraptable}{r}{0.5\textwidth}
    
    
    \small
    \caption{
    Ablation of MRV configurations in T-Maze ($K_{\text{eff}}{=}30{\times}5{=}150$). 
    Baseline without MRV is marked $\dagger$. Default: \textbf{MRV-CA-2}.
    }
    \label{tab:mrv_ablation}
    
    \begin{adjustbox}{width=0.5\textwidth}
    \begin{tabular}{@{}lcccc@{\hskip 5pt}}
    \toprule
    \textbf{Model} & \textbf{150} & \textbf{360} & \textbf{600} & \textbf{900} \\
    \midrule
    w/o MRV$^\dagger$    
    & \ga{1.00}{\scriptsize$\pm$0.00} 
    & \gb{0.66}{\scriptsize$\pm$0.08} 
    & \gc{0.65}{\scriptsize$\pm$0.07} 
    & \gd{0.61}{\scriptsize$\pm$0.07} \\
    \midrule
    \textbf{MRV-CA-2} 
    & \ga{1.00}{\scriptsize$\pm$0.00} 
    & \gb{0.95}{\scriptsize$\pm$0.05} 
    & \gc{0.90}{\scriptsize$\pm$0.07} 
    & \gd{0.90}{\scriptsize$\pm$0.07} \\
    MRV-G 
    & \ga{0.86}{\scriptsize$\pm$0.07} 
    & \gb{0.77}{\scriptsize$\pm$0.08} 
    & \gc{0.66}{\scriptsize$\pm$0.07} 
    & \gd{0.65}{\scriptsize$\pm$0.08} \\
    MRV-GRU 
    & \ga{0.99}{\scriptsize$\pm$0.01} 
    & \gb{0.74}{\scriptsize$\pm$0.07} 
    & \gc{0.56}{\scriptsize$\pm$0.11} 
    & \gd{0.55}{\scriptsize$\pm$0.12} \\
    MRV-LSTM 
    & \ga{0.85}{\scriptsize$\pm$0.06} 
    & \gb{0.64}{\scriptsize$\pm$0.10} 
    & \gc{0.51}{\scriptsize$\pm$0.11} 
    & \gd{0.47}{\scriptsize$\pm$0.11} \\
    MRV-CA-1 
    & \ga{0.51}{\scriptsize$\pm$0.01} 
    & \gb{0.51}{\scriptsize$\pm$0.01} 
    & \gc{0.49}{\scriptsize$\pm$0.02} 
    & \gd{0.49}{\scriptsize$\pm$0.01} \\
    \bottomrule
    \end{tabular}
    \end{adjustbox}
    
    \end{wraptable}
```
hidden states; **MRV-LSTM**: LSTM-based [@hochreiter1997long] memory processing with cell states.

Among all tested configurations, MRV-CA-2 demonstrated the best performance (see `\autoref{tab:mrv_ablation}`{=latex}). This cross-attention scheme uses incoming memory tokens ($M_n$) as queries and updated tokens ($M_{n+1}$) as keys and values. This configuration, referred to simply as MRV throughout the paper, effectively controls information flow through memory. By allowing the model to selectively update its memory based on the relevance of new information, it prevents the loss of important context over long sequences.
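The query/key/value assignment of MRV-CA-2 can be sketched in a few lines. This is only an illustration under simplifying assumptions: a single attention head, no learned projection matrices, no residual connection or normalization, and plain Python lists to stay dependency-free. The function name `mrv_cross_attention` is hypothetical and the actual MRV module may differ in all of these respects.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mrv_cross_attention(m_in, m_upd):
    """Scaled dot-product cross-attention between memory token sets.

    Queries are the incoming memory tokens M_n (`m_in`); keys and values are
    the updated tokens M_{n+1} (`m_upd`). Learned projections are omitted.
    """
    d = len(m_in[0])
    out = []
    for q in m_in:
        # Similarity of this incoming token to every updated token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in m_upd]
        weights = softmax(scores)
        # Each output token is a convex combination of the updated tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, m_upd)) for j in range(d)])
    return out


m_n = [[1.0, 0.0], [0.0, 1.0]]      # toy incoming memory M_n
m_np1 = [[0.5, 0.5], [0.2, 0.8]]    # toy updated memory M_{n+1}
new_memory = mrv_cross_attention(m_n, m_np1)
```

Because the incoming tokens act as queries, what is read from $M_{n+1}$ is conditioned on what the memory already holds; swapping the two arguments gives the MRV-CA-1 assignment.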

### Related Work

**Transformers in RL:** Transformers have been applied to online [@parisotto2020stabilizing; @lampinen2021towards; @morad2023reinforcement; @le2024stable], offline [@chen2021decision; @janner2021offline; @wang2025longshort], and model-based RL [@chen2022transdreamer]. Prior work often assumes compact observations or known dynamics [@lee2022multi; @jiang2023efficient], whereas RATE targets long-horizon credit assignment and memory in partially observable environments, using DT [@chen2021decision] as the baseline. The *Long-Short Decision Transformer* (LSDT) [@wang2025longshort] augments DT with dual context windows but still lacks explicit, learnable memory. Fast and Forgetful Memory (FFM) [@morad2023reinforcement] and Stable Hadamard Memory (SHM) [@le2024stable] instead explore lightweight recurrent slots with greater stability.

**RNNs in RL:** Recurrent models like LSTM [@hochreiter1997long] and GRU [@gru] have long supported memory in RL. DLSTM [@dlstm] replaces transformers with LSTM for sequential control, but RNNs often struggle with long-term dependencies, especially under sparse rewards [@ni2023when].

**SSMs in RL:** SSMs such as S4 [@gu2021efficiently] and Mamba [@gu2023mamba] offer efficient alternatives to attention, showing strong offline RL results [@bar2023decision; @ota2024decision; @lv2024decision; @cao2024mamba], though their ability to handle memory-intensive generalization remains unclear.

**Memory-Augmented Transformers:** Extensions like Transformer-XL [@dai2019transformerxl], Compressive Transformer [@rae2019compressive], and RMT [@bulatov2022rmt] extend context via caching or compression. RATE combines token-level memory, hidden-state caching, and a novel MRV gate. Approximate Gated Linear Transformer [@pramanik2023agalite] replaces full attention with a gated, low-rank recurrent update that approximates outer-product memory via cosine features, enabling efficient long-range credit assignment at constant cost. Retrieval-Augmented Decision Transformer (RA-DT) [@schmied2024retrieval] augments DT with an external retrieval memory that stores past sub-trajectories, retrieves relevant ones by vector search, reweights them by utility, and integrates them through cross-attention to guide action prediction in sparse-reward RL.

### Limitations {#sec:lim}

While RATE is tailored for long-horizon, memory-intensive tasks, its complexity may be unnecessary in fully observable or short-term settings where simpler recurrent models suffice. Nonetheless, RATE matches or exceeds their performance across all tasks. Future work may explore adaptive variants that scale memory based on task complexity.

### Conclusion

We propose the **Recurrent Action Transformer with Memory** (**RATE**), a transformer-based architecture for offline RL that combines attention with recurrence for long-horizon decision-making. RATE integrates memory embeddings, hidden state caching, and a **Memory Retention Valve** (**MRV**) to selectively retain critical information across segments. RATE achieves state-of-the-art results on memory-intensive tasks such as T-Maze, Minigrid-Memory, ViZDoom-Two-Colors, Memory Maze, and 48 POPGym tasks, generalizing up to 9600-step sequences and outperforming both recurrent and transformer baselines. Theoretical analysis shows that MRV guarantees lower-bounded memory preservation across updates, and ablation studies confirm its importance for long-horizon stability. Despite its memory focus, RATE also performs competitively on standard benchmarks like Atari and MuJoCo, demonstrating broad versatility. These results establish RATE as a unified, general-purpose offline RL model that excels across both short and long temporal contexts.

### Acknowledgments {#acknowledgments .unnumbered}

The study was supported by the Ministry of Economic Development of the Russian Federation (agreement No. 139-15-2025-013, dated June 20, 2025, IGK 000000C313925P4B0002).

### Reproducibility Statement {#reproducibility-statement .unnumbered}

We have taken several measures to ensure the reproducibility of our results. **Model details:** A full description of the RATE architecture, including pseudocode for both the model and the Memory Retention Valve (MRV), is provided in `\autoref{sec:rate}`{=latex} and Algorithms `\ref{alg:rate}`{=latex} -- `\ref{alg:mrv}`{=latex}. **Theoretical results:** Formal assumptions and complete proofs for our preservation theorem are given in `\autoref{sec:rate}`{=latex}. **Experimental setup:** Details of environments, training procedures, and evaluation protocols are reported in `\autoref{sec:exp_eval}`{=latex}, with additional specifications (hyperparameters, dataset preprocessing, random seeds, and hardware setup) in `\autoref{app:training}`{=latex} and `\autoref{app:envs}`{=latex}. **Baselines:** All baseline implementations are either drawn from widely used open-source libraries or re-implemented with hyperparameters matched to their original publications, as described in `\autoref{sec:exp_eval}`{=latex} and `\autoref{app:trans_abb_studies}`{=latex}. **Code and data:** An anonymous repository with the implementation of RATE, training scripts, and configuration files is submitted as supplementary material. Together, these resources allow for full replication of our theoretical analyses and empirical results.

```{=latex}
\bibliographystyle{iclr2026_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
```{=latex}
\onecolumn
```
```{=latex}
\addcontentsline{toc}{section}{Appendix}
```
```{=latex}
\setcounter{tocdepth}{3}
```

```{=latex}
\parttoc
```
### Discussion: Are RNNs Still Better for Memory?

Our experiments provide a systematic comparison between recurrent and transformer-based architectures in memory-intensive tasks. When trained on short sequences, recurrent models such as BC-LSTM perform competitively. For example, in the T-Maze environment, BC-LSTM achieves perfect success rates when trained on sequences up to 150 steps, effectively capturing short-term dependencies via its internal state dynamics.

However, this advantage quickly fades as training sequences grow longer. Increasing the training horizon from 150 to 600 steps causes BC-LSTM's performance to collapse to a 50% success rate across all inference lengths -- even those shorter than the training context -- indicating difficulty with gradient stability and information retention over long spans (`\autoref{fig:tmaze_heatmaps}`{=latex}). In contrast, RATE maintains consistently high performance under the same conditions, demonstrating stronger scalability with sequence length. RATE generalizes robustly to inference horizons up to 9600 steps (28,800 tokens), reflecting the effectiveness of its hybrid memory design. The architecture combines token-based recurrence with gated memory updates via the Memory Retention Valve (MRV), enabling reliable propagation of sparse information across long temporal distances.

These findings extend to more complex environments. In ViZDoom-Two-Colors and Memory Maze (`\autoref{fig:vizdoom_150}`{=latex}, `\autoref{tab:memory_maze}`{=latex}), RATE significantly outperforms BC-LSTM. In ViZDoom, RATE maintains balanced performance across red and green cues, whereas BC-LSTM exhibits instability and higher variance. In Memory Maze, RATE achieves substantially higher returns, benefiting from its capacity to encode and retrieve spatial-temporal patterns over long episodes.

In conclusion, while RNNs remain effective for short-range temporal dependencies, their performance degrades in long-horizon, sparse-reward, and generalization-critical settings. RATE bridges this gap by integrating attention with recurrence, offering a scalable and robust memory solution. These results underscore the architectural promise of combining transformer attention with recurrent dynamics for long-term tasks in RL.

### Decision Transformer

```{=latex}
\begin{wrapfigure}{r}{0.5\linewidth}
% 



\begin{algorithm}[H]
    \caption{Decision Transformer}
    \label{alg:dt}
    \textbf{Require}: $R \in \mathbb{R}^{1\times T}, o \in \mathbb{R}^{d_{o}\times T}, a \in \mathbb{R}^{1\times T}$\\
    % \textbf{Parameters}: $K$ -- context length \\
    
\begin{algorithmic}[1] %[1] enables line numbers
    \STATE $\tilde{R} \in \mathbb{R}^{T\times d} \leftarrow \texttt{Encoder}_{R}(R)$ \\
        $ \ \tilde{o} \in \mathbb{R}^{T\times d} \leftarrow \texttt{Encoder}_{o}(o)$ \\
        $ \ \tilde{a} \in \mathbb{R}^{T\times d} \leftarrow \texttt{Encoder}_{a}(a)$ \\

    \STATE $\tau_{0..T} \leftarrow \{(\tilde{R}_0, \tilde{o}_0, \tilde{a}_0),\ldots,(\tilde{R}_T, \tilde{o}_T, \tilde{a}_T)\}$ \\

    \STATE $n = \texttt{random}(0, T-K)$ \\
    
    \STATE $\hat{a}_n \leftarrow \texttt{Transformer}(\tau_{n..n+K})$\\

\end{algorithmic}
\textbf{Output}: $\hat{a}_n \rightarrow \mathcal{L}(a_n, \hat{a}_n)$
\end{algorithm}

% 
\end{wrapfigure}
```
Decision Transformer (DT) [@chen2021decision] is an offline RL algorithm that reduces the RL task to a sequence modeling task. In DT, whose scheme is presented in Algorithm `\ref{alg:dt}`{=latex}, the trajectory $\tau$ is not divided into segments as in RATE. Instead, random fragments of length $K$ are sampled from the trajectory, since the architecture was originally designed to work only with MDPs. The predicted actions $\hat{a}$ are sampled autoregressively.
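The difference between the two data-preparation schemes can be made concrete with a short sketch. The helper names `sample_dt_fragment` and `split_into_segments` are hypothetical, and a trajectory is represented simply as a list of step indices; real training code would operate on (return, observation, action) tuples.

```python
import random


def sample_dt_fragment(trajectory, K, rng):
    """DT-style: pick one random window of K consecutive steps."""
    start = rng.randint(0, len(trajectory) - K)  # inclusive bounds, as in n = random(0, T - K)
    return trajectory[start:start + K]


def split_into_segments(trajectory, K):
    """RATE-style: cut the trajectory into consecutive segments of K steps."""
    return [trajectory[i:i + K] for i in range(0, len(trajectory), K)]


rng = random.Random(0)
steps = list(range(90))  # toy trajectory: step indices 0..89
fragment = sample_dt_fragment(steps, K=30, rng=rng)
segments = split_into_segments(steps, K=30)
```

With $T = 90$ and $K = 30$, the DT fragment covers an arbitrary 30-step window, while the segmented version covers the whole trajectory as three consecutive blocks that a recurrent model can process in order.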

![Memory-intensive environments used to evaluate RATE memory mechanisms.](plots/envs.png){#fig:memory_envs width="1\\columnwidth"}

### Environments {#app:envs}

#### Memory-intensive environments

In this section, we provide an extended description of the environments used in this paper, as well as the methodology used to collect the trajectories. `\autoref{tab:envs_description}`{=latex} summarizes the observation, reward, and action types for each of the environments considered in this paper.

##### ViZDoom-Two-Colors

We used a modified ViZDoom-Two-Colors environment from [@memup] to assess the model's memory abilities. The agent starts with $100$ hit points (HP) in an acid-filled room without inner walls. At each step in the environment, the agent loses a fixed amount of health ($10/32$ HP per step). In the center of the environment, there is a pillar of either green or red color, which disappears after $45$ environment steps. Objects of two colors (green and red) are generated throughout the environment. When the agent interacts with an object of the same color as the pillar, it gains $+25$ health and a reward of $+1$. When the agent interacts with an object of the opposite color, it loses a similar amount of health. The agent receives an additional reward of $+0.02$ for each step it survives. The episode ends when the agent has zero health. Thus, the agent needs to remember the color of the pillar to select items of the correct color, even if the pillar is out of sight or has disappeared. The agent does not receive information about its current health or rewards, as these observations essentially convey the same information as the color of the pillar but persist beyond step $45$.
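A back-of-the-envelope calculation shows why correct pickups matter for survival. The sketch below uses the figures stated above ($100$ HP, $10/32$ HP lost per step, $+25$ HP per correct object) and makes two simplifying assumptions that are not taken from the source: a wrong-color object costs exactly $25$ HP, and pickup timing and any cap on maximum health are ignored. The function name `steps_until_death` is hypothetical.

```python
from fractions import Fraction

START_HP = Fraction(100)
DRAIN_PER_STEP = Fraction(10, 32)  # health lost at every environment step
PICKUP_HP = Fraction(25)           # assumed magnitude for both correct and wrong objects


def steps_until_death(correct_pickups=0, wrong_pickups=0):
    """Steps survived if pickups only shift the total health budget.

    Timing of pickups and any maximum-health cap are ignored (assumption).
    """
    budget = START_HP + PICKUP_HP * (correct_pickups - wrong_pickups)
    return max(Fraction(0), budget / DRAIN_PER_STEP)


baseline = steps_until_death()                      # no objects collected
one_correct = steps_until_death(correct_pickups=1)  # one object of the pillar's color
one_wrong = steps_until_death(wrong_pickups=1)      # one object of the opposite color
```

Under these assumptions an agent that collects nothing survives $320$ steps, each correct object adds $80$ steps, and each wrong object removes $80$ steps.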

We collected a dataset of $5000$ trajectories of $90$ steps each using a trained A2C [@DBLP:journals/corr/abs-1904-01806] agent (an agent trained with a non-disappearing pillar). The average reward over these $90$ steps is $4.46$. To ensure that the agent saw the pillar before it disappeared, during data collection the agent always spawned facing the pillar at the same location, midway between the pillar and the nearest wall. To complete this task successfully, the agent needs to remember the color of the pillar. This environment tests the long-term memory mechanism, since the agent needs to retain information about the pillar for much longer than the pillar is present in the environment. A policy that relies only on short-term memory, for example collecting the next item of the same color as the previously collected item, is extremely unstable and does not allow the agent to survive for long. This is because the agent in the training dataset occasionally makes a mistake and picks up an object of the opposite color. Thus, incorrect information about the desired color may enter the transformer context, after which the agent starts collecting items of the opposite color and quickly fails.

##### T-Maze

To investigate the agent's long-term memory in very long episodes (where the inference trajectory length is much longer than the effective context length $K_{eff}$), we used a modified version of the T-Maze environment [@ni2023when]. The agent's objective in this environment is to navigate from the beginning of the T-shaped maze to the junction and choose the correct direction based on a signal given at the beginning of the trajectory, using four possible actions $a \in \{left, up, right, down\}$. This signal, represented by the $clue$ variable, which is equal to zero everywhere except in the first observation, dictates whether the agent should turn up ($clue=1$) or down ($clue=-1$). Additionally, a constraint on the episode duration, $T = L + 2$, where the maximum duration is determined by the length $L$ of the corridor to the junction, adds complexity to the problem. To address this, the observation vector includes a binary $flag$ variable that is equal to $1$ one step before the junction and $0$ otherwise, indicating the arrival of the agent at the junction. A noise channel with random integer values from the set $\{-1, 0, +1\}$ is also added to the observation vector. The observation vector is thus defined as $o = [y, clue, flag, noise]$, where $y$ represents the vertical coordinate. The reward $r$ is given only at the end of the episode and depends on the correctness of the agent's turn at the junction: $1$ for a correct turn and $0$ otherwise. This formulation deviates from the traditional Passive T-Maze environment [@ni2023when] (different observations and reward functions) and presents a more intricate set of conditions for the agent to navigate and learn within the given time constraint.
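The observation layout $o = [y, clue, flag, noise]$ and the behavior of an oracle can be sketched as follows. This is an illustrative reconstruction, not the authors' environment code: the function name `oracle_episode`, the exact step at which the flag is raised, and the convention that the agent walks right for `corridor_length` steps before turning are assumptions.

```python
import random

UP, DOWN, RIGHT = "up", "down", "right"


def oracle_episode(corridor_length, clue, rng):
    """Generate observations o = [y, clue, flag, noise] and oracle actions.

    Assumed layout: the agent starts at x = 0, walks right for
    `corridor_length` steps, and turns at the junction (x = corridor_length).
    """
    if clue not in (1, -1):
        raise ValueError("clue must be +1 (up) or -1 (down)")
    observations, actions = [], []
    for x in range(corridor_length + 1):
        obs = [
            0,                                     # y: stays 0 inside the corridor
            clue if x == 0 else 0,                 # clue is visible only in the first observation
            1 if x == corridor_length - 1 else 0,  # flag fires one step before the junction
            rng.choice([-1, 0, 1]),                # noise channel
        ]
        observations.append(obs)
        if x == corridor_length:
            actions.append(UP if clue == 1 else DOWN)  # turn according to the remembered clue
        else:
            actions.append(RIGHT)
    return observations, actions


obs, acts = oracle_episode(5, clue=-1, rng=random.Random(0))
```

The only observation that carries the clue is the first one, so any policy that turns correctly at the junction must carry that single value across the whole corridor.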

The dataset consists of $2000$ trajectories for each segment of length $30$ (i.e., $6000$ trajectories for $K_{eff}=3\times 30 = 90$) and contains only successful episodes. An artificial oracle with a priori information about the environment was used to generate the dataset.

```{=latex}
\begin{table*}[t!]\caption{Description of observations and reward functions for the considered environments.}
\label{tab:envs_description}

\small
\begin{adjustbox}{width=1\textwidth}
\begin{tabular}{@{}lcccc@{}}
\toprule
\textbf{Environment} & \textbf{Obs. Type} & \textbf{Rew. Type} & \textbf{Act. Space} & \textbf{Obs. Details} \\
\midrule
ViZDoom-Two-Colors         & Image      & Continuous                  & Discrete   & First-person view \\
T-Maze                     & Vector     & Sparse \& Discrete          & Discrete   & Low-dimensional vector \\
Memory Maze                & Image      & Sparse \& Discrete          & Discrete   & First-person view \\
Minigrid-Memory            & Image      & Sparse                      & Discrete   & $3{\times}3$ grid centered on agent \\
POPGym                     & Vector/Image & Discrete/Continuous       & Discrete/Continuous & Vector or 2D grid \\
Action Assoc. Retrieval    & Vector     & Sparse \& Discrete          & Discrete   & Symbolic vector input \\
Atari                      & Image      & Sparse \& Discrete          & Discrete   & Full game screen \\
MuJoCo                     & Vector     & Continuous                  & Continuous & Low-dimensional state vector \\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table*}
```
##### Memory Maze

In this first-person view 3D environment [@pasukonis2022memmaze], the agent appears in a randomly generated maze containing several objects of different colors at random locations. The agent's task is to find the object in the maze whose color matches the outline around its observation image. After the agent finds an object of the desired color and steps on it, the color of the outline changes and the agent must find another object. The agent receives a $+1$ reward for stepping on the correct object. Otherwise, it receives no reward. The duration of an episode is fixed at $1000$ steps. Thus, the agent's task is to find as many objects of the desired color as possible in a limited time. The agent's effectiveness in this environment depends on its ability to memorize the structure of the maze and the location of objects in it in order to find the desired objects faster. The dataset of $5000$ trajectories was collected with the Dreamer model [@hafner2019dream], which achieved an average reward of only $4.7$ per episode, i.e., the dataset is rather sparse.

##### Minigrid-Memory

Minigrid-Memory [@MinigridMiniworld23] is a 2D grid environment designed to test an agent's long-term memory and credit assignment [@ni2023when]. The environment map is a T-shaped maze with a small room containing an object at the beginning of the corridor. The agent appears at a random coordinate in the corridor. The agent's task is to reach the room with the object and memorize it, then reach the junction at the end of the maze and turn in the direction where the same object is located as in the room at the beginning of the maze. A reward $r = 1 - 0.9 \times \frac{t}{T}$ is given for success, and $0$ for failure. The episode ends after the agent makes any turn at the junction or after a limited amount of time (95 steps) has elapsed. The agent's observations are limited to a $3\times 3$ frame. $10000$ trajectories with grid size $41\times 41$ were collected using PPO [@schulman2017proximal] with Transformer-XL (TrXL) [@pleines2023trxlppo] with a context length equal to the maximum episode duration.
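The reward formula can be evaluated with a one-line helper. The sketch assumes that $t$ is the number of steps taken when the episode ends and $T$ is the maximum episode length, which the text above does not state explicitly; the function name `minigrid_memory_reward` is hypothetical.

```python
def minigrid_memory_reward(success, t, max_steps):
    """Return r = 1 - 0.9 * t / T on success and 0 on failure.

    `t` is assumed to be the step count at termination and `max_steps` the
    episode time limit T.
    """
    if not 0 <= t <= max_steps:
        raise ValueError("t must lie in [0, max_steps]")
    return 1.0 - 0.9 * (t / max_steps) if success else 0.0


fast = minigrid_memory_reward(True, t=19, max_steps=95)    # early success
slow = minigrid_memory_reward(True, t=95, max_steps=95)    # success at the time limit
failed = minigrid_memory_reward(False, t=40, max_steps=95)
```

Under this reading, a successful episode always earns at least $0.1$, and faster solutions earn more, so the return encodes both correctness and speed.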

##### POPGym

POPGym [@morad2023popgym] is a benchmark suite consisting of 46 diverse partially observable environments designed to isolate different aspects of memory use and generalization in reinforcement learning. The tasks include both short-horizon reactive scenarios and long-horizon memory puzzles that require the agent to remember information across extended delays or infer hidden states from past observations. The environments vary in observation modality (image vs. vector), reward sparsity, and temporal dependencies. For our dataset, we followed the original POPGym evaluation protocol and used a PPO [@schulman2017proximal] agent with a GRU [@gru] backbone (PPO-GRU), which showed the best performance in the original benchmark. We collected trajectories using this policy for all 46 environments. The collected dataset reflects the diverse difficulty and memory requirements of the benchmark and serves as a challenging testbed for evaluating general-purpose memory architectures like RATE.

#### Standard benchmarks

##### Atari games

For the Atari game environments [@bellemare2013arcade], we used the same dataset as in DT, namely the DQN replay dataset with grayscale state images [@agarwal2020optimistic]. This dataset contains $500$ thousand of the $50$ million steps of an online DQN [@dqn] agent for each game. We use the following set of games: SeaQuest, Breakout, Pong and Qbert.

```{=latex}
\begin{wrapfigure}[9]{r}{0.4\textwidth}
    
    \begin{center}
    \includegraphics[width=0.4\textwidth]{plots/aar.drawio.png}
    
    
    \caption{Action Associative Retrieval.}
    \label{fig:aar_scheme}
    \end{center}
    % 
\end{wrapfigure}
```
##### MuJoCo

Although memory is not required for decision-making in control environments such as MuJoCo [@d4rl], we conducted additional experiments in this environment to compare with DT. For the continuous control tasks, we selected a standard MuJoCo locomotion environment and a set of trajectories from the D4RL benchmark [@d4rl]. Since we chose DT and TAP as the main models for comparison on this data, we focused on the environments used in both works (HalfCheetah, Hopper, and Walker). We used three different dataset settings: 1) **Medium** -- $1$ million timesteps generated by a "medium" policy that achieves about a third of the score of an expert policy; 2) **Medium-Replay** -- the replay buffer of an agent trained to the performance of a medium policy (about $200$k--$400$k timesteps in our environments); 3) **Medium-Expert** -- $1$ million timesteps generated by the medium policy concatenated with $1$ million timesteps generated by an expert policy. The scores for the MuJoCo experiments are normalized such that 100 represents an expert policy, following the benchmark protocol outlined in [@d4rl]. The performance metrics for Conservative Q-Learning (CQL) and Trajectory Autoencoding Planner (TAP) are reported from the TAP paper [@jiang2023efficient], and for DT from the DT paper [@chen2021decision], as they use the same dataset and evaluation protocol.
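The score normalization can be sketched as a linear rescaling. The text above only states that 100 corresponds to an expert policy; anchoring 0 at a random policy follows the D4RL convention as we understand it and should be treated as an assumption here, as should the function name `normalized_score`. The reference returns in the example are placeholders, not values from the benchmark.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100.

    The reference returns are per-environment constants; the zero anchor at
    the random policy is an assumption of this sketch.
    """
    if expert_return == random_return:
        raise ValueError("reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Placeholder reference values, for illustration only.
RANDOM_RETURN, EXPERT_RETURN = 0.0, 4000.0
medium_like = normalized_score(1500.0, RANDOM_RETURN, EXPERT_RETURN)
```

A policy that exceeds the expert reference obtains a score above 100, so normalized scores are not bounded by 100.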

```{=latex}
\begin{figure*}[t]
    

    % Row 1
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/RATE_heatmap.pdf}
        \caption*{\small \textbf{RATE}}
    
    \hfill
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/DT_heatmap.pdf}
        \caption*{\small \textbf{DT}}
    

    % Row 2
    
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/CQL-LSTM_heatmap.pdf}
        \caption*{\small \textbf{CQL-LSTM}}
    %
    \hfill
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/CQL-MLP_heatmap.pdf}
        \caption*{\small \textbf{CQL-MLP}}
    

    % Row 3
    
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/DGRU_heatmap.pdf}
        \caption*{\small \textbf{DGRU}}
    %
    \hfill
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/DLSTM_heatmap.pdf}
        \caption*{\small \textbf{DLSTM}}
    

    % Row 4
    
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/DMamba_heatmap.pdf}
        \caption*{\small \textbf{DMamba}}
    %
    \hfill
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/BC-MLP_heatmap.pdf}
        \caption*{\small \textbf{BC-MLP}}
    

    % Row 5
    
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/LSDT_heatmap.pdf}
        \caption*{\small \textbf{LSDT}}
    %
    \hfill
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/BC-LSTM_heatmap.pdf}
        \caption*{\small \textbf{BC-LSTM}}
    %

    % Row 6
    
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/RMT_heatmap.pdf}
        \caption*{\small \textbf{RMT}}
    %
    \hfill
    
        
        \includegraphics[width=\linewidth]{plots/new-plots/heatmaps/TrXL_heatmap.pdf}
        \caption*{\small \textbf{TrXL}}
    

    \caption{Results for all models in the T-Maze generalization task.}
    \label{fig:all_heatmaps}
\end{figure*}
```
### Action Associative Retrieval {#app:aar}

As shown in `\autoref{fig:tmaze_heatmaps}`{=latex}, DT attains SR $=50\%$ at inference on corridors longer than the transformer context length. This is because even a DT trained on balanced data has a slight bias in the predicted probability towards one of the two required actions, so that for $t > K$ the agent constantly produces only one action: up or down. Equipping the agent with memory addresses this problem.

To check how the agent's performance changes during training, we design the **Action Associative Retrieval** (**AAR**) environment (`\autoref{fig:aar_scheme}`{=latex}).

![Experimental results with RATE and DT in the AAR environment. The graphs show the 10-runs average results of training on trajectories of length $T = 90$ and validation on trajectories of length $T=180$, for RATE with $K_{eff} = 3\times 30 = 90$ and for DT with $K=90$. ](plots/cross_plots.png){#fig:aar_cross_plots width="100%"}

There are two states in this environment: $S_0$ and $S_1$. The agent starts in state $S_0$ and moves to state $S_1$ by performing an action $a_0 \in \{0, 1\}$. Next, the agent must take $N-2$ steps that keep it in state $S_1$ by performing action $a=2$ (no-op). At the end of the episode, the agent must perform the same action that moved it from state $S_0$ to state $S_1$ in order to return from state $S_1$ to state $S_0$. Thus, the action space is $a \in \{0, 1, 2\}$. The agent's observations are $o = [state, flag, noise]$, where $state \in \{0, 1\}$ is the index of the current state, $flag \in \{0, 1\}$ is equal to $1$ if the next step requires returning to the initial state and $0$ otherwise, and $noise \in \{-1, 0, +1\}$ is the noise channel. The agent receives a $+1$ reward if it returns to the initial state $S_0$ by performing the action that took it from $S_0$ to $S_1$, and $-1$ otherwise. The training dataset consists of $6000$ oracle-generated trajectories with positive reward.
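The environment dynamics described above fit in a small class. The sketch below is a hypothetical reimplementation for illustration: the class name, the `(observation, reward, done)` step interface, the zero reward on intermediate steps, and the choice to end the episode with $-1$ on any invalid action are assumptions rather than details taken from the authors' code.

```python
import random


class ActionAssociativeRetrieval:
    """Toy two-state AAR environment (hypothetical reimplementation)."""

    NO_OP = 2

    def __init__(self, episode_length, seed=0):
        if episode_length < 2:
            raise ValueError("need at least the first and the final action")
        self.n = episode_length
        self.rng = random.Random(seed)

    def _obs(self):
        flag = 1 if self.t == self.n - 1 else 0  # next action must return to S_0
        return [self.state, flag, self.rng.choice([-1, 0, 1])]

    def reset(self):
        self.t, self.state, self.first_action = 0, 0, None
        return self._obs()

    def step(self, action):
        """Apply `action` and return (observation, reward, done)."""
        if self.t == self.n - 1:  # final step: must repeat the first action
            ok = action == self.first_action
            self.state = 0 if ok else self.state
            self.t += 1
            return self._obs(), (1 if ok else -1), True
        if self.t == 0:  # first step: choose a_0 in {0, 1}
            ok = action in (0, 1)
            self.first_action = action
        else:  # intermediate steps: only no-op is valid
            ok = action == self.NO_OP
        if not ok:  # assumed: an invalid action ends the episode with -1
            return self._obs(), -1, True
        self.state = 1
        self.t += 1
        return self._obs(), 0, False


def run_oracle(env, first_action):
    """Play one episode with an oracle that remembers its first action."""
    obs, total, done, action = env.reset(), 0, False, first_action
    while not done:
        obs, reward, done = env.step(action)
        total += reward
        action = first_action if obs[1] == 1 else env.NO_OP
    return total
```

The oracle succeeds only because it keeps `first_action` outside the observation stream; an agent that sees only the most recent observations has no way to recover it once the first step leaves its context.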

More formally, we say that an agent has memory when solving AAR (T-Maze-like) tasks if the following condition holds: $$\forall t > K: \frac{1}{N_0}\sum_{i=1}^{N_0}p_i(a_t = a^0|a_0 = a^0) + \frac{1}{N_1}\sum_{i=1}^{N_1}p_i(a_t = a^1|a_0 = a^1) > 1$$

where $a^0, a^1 \in \mathcal{A}$ are two mutually exclusive actions leading to a reward, $t$ is the step at which the final action is required, and $N_0, N_1$ are the numbers of experiments in environments where the target action is $a_t = a^0$ and $a_t = a^1$, respectively.

This condition means that if the agent has memory, the sum of the average conditional probabilities over all experiments is greater than one, i.e., both probabilities can be high at the same time instead of trading off against each other. If the sum of these probabilities is less than or equal to one, the agent will at best choose the same target action in most experiments, even when the other action is required.
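The criterion can be checked numerically from per-experiment probabilities. In the sketch below the function name `has_memory` and the example probability values are illustrative assumptions; the inputs are the probabilities $p_i(a_t = a^0|a_0 = a^0)$ and $p_i(a_t = a^1|a_0 = a^1)$ measured at some step $t > K$.

```python
def has_memory(p_a0_given_a0, p_a1_given_a1):
    """Return True if mean p(a^0 | a^0) + mean p(a^1 | a^1) exceeds 1."""
    mean0 = sum(p_a0_given_a0) / len(p_a0_given_a0)  # average over the N_0 experiments
    mean1 = sum(p_a1_given_a1) / len(p_a1_given_a1)  # average over the N_1 experiments
    return mean0 + mean1 > 1.0


# An agent that conditions on the first action: both probabilities are high.
with_memory = has_memory([0.9, 0.95], [0.9, 0.85])

# A biased agent that outputs a^1 with probability 0.75 regardless of a_0:
# p(a^0 | a^0) = 0.25 and p(a^1 | a^1) = 0.75, so the sum is exactly 1.
biased = has_memory([0.25, 0.25], [0.75, 0.75])
```

The second case illustrates why the threshold is one: a policy whose output ignores $a_0$ assigns complementary probabilities to the two target actions, so the two averages can never sum to more than one.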

In `\autoref{fig:aar_cross_plots}`{=latex}, the first $1\%$ of training steps is omitted because it corresponds to the very beginning of training and is unrepresentative. Blue dots correspond to the beginning of training, red dots to the end of training. As can be seen from `\autoref{fig:aar_cross_plots}`{=latex}, during training, the probabilities $p_i(a_t = a^0|a_0 = a^0)$ and $p_i(a_t = a^1|a_0 = a^1)$ on the training trajectories have a strong positive correlation ($R_{train}^{DT}=1.00$ and $R_{train}^{RATE}=0.97$), where $R$ is the correlation coefficient. This indicates that within the (effective) context, both DT and RATE are able to predict actions $a^0$ and $a^1$ equally well.

During validation, this pattern is preserved for the RATE model: the red points corresponding to the probabilities of choosing actions $a^0$ and $a^1$ are in the upper right part of the graph, and the positive correlation persists ($R_{val}^{RATE}=0.80$). In the DT case, by contrast, the cluster of red dots is skewed toward choosing actions $a^1$ and $a^0$ with equal probabilities of $0.5$. Thus, the sum of these probabilities is less than or equal to one, as evidenced by a strong negative correlation ($R_{val}^{DT}=-0.97$). The results confirm that DT cannot generalize to trajectories whose lengths exceed the context length, whereas RATE can handle such tasks.

### Training {#app:training}

This section provides additional details on the training process of the baselines considered in the paper. We treated the inclusion of the feed-forward network (FFN) block in RATE's transformer decoder as a hyperparameter, as RATE performed slightly better without FFN in some environments. In contrast, other transformer-based baselines were trained with the standard transformer decoder including FFN.

#### ViZDoom-Two-Colors

Since the pillar disappears at time $t{=}45$, all trajectories span from $t{=}0$ to $t{=}90$ to ensure that the cue remains available during training. In this setting, we compare DT with context length $K{=}90$ to RATE, RMT, and TrXL models using $K{=}30$ and $N{=}3$ segments. Thus, RATE processes sequences of the same total length $K_{\text{eff}}{=}N{\times}K{=}90$ but accesses only $K{=}30$ tokens at a time. Additionally, we ran experiments with $N{=}3$, $K{=}50$, and $T{=}150$ to validate model robustness under longer and more complex configurations.

#### Passive T-Maze

We trained models on sequences of length $T_{\text{train}} \in \{9,\ 30,\ 90,\ 150,\ 300,\ 600,\ 900\}$ and evaluated them on $T_{\text{val}} \in \{9,\ 30,\ 90,\ 150,\ 300,\ 600,\ 900,\ 1200,\ 2400,\ 4800,\ 9600\}$. For RATE, each sequence was split into $N=3$ segments, yielding a context length of $K = T_{\text{train}} / 3$. All training trajectories started from $t=0$, ensuring the cue was always included.

`\noindent `{=latex}In what follows, we adopt the notation **MODEL-N**, where $N=3$ indicates segmentation into three recurrent blocks (e.g., RATE-3 is trained on full sequences of length $T=90$ with $K=30$). This convention is used throughout the ablation studies.

#### Memory Maze

To train RATE, DT, RMT, and TrXL on Memory Maze, we used the same approach as for the ViZDoom-Two-Colors environment, but instead of using fixed trajectories starting at $t=0$, we sampled consecutive 90-step subsequences from the original 1000-step trajectories. Subsequences were sampled with a stride of 90 steps, resulting in approximately 11 training sequences per original trajectory. As in the ViZDoom-Two-Colors case, training for DT was performed with a context length of $K=90$ and for RATE, RMT, and TrXL with a context length of $K=30$ and number of segments $N=3$, i.e., effective context length $K_{eff} = N\times K = 3\times 30 = 90$.
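The figure of roughly 11 sequences per trajectory follows directly from the window and stride sizes. The helper below is a hypothetical sketch that enumerates the full 90-step windows of a 1000-step trajectory; dropping the incomplete final window is an assumption, since the text does not say how the remainder is handled.

```python
def strided_windows(trajectory_length, window=90, stride=90):
    """Return (start, end) index pairs of all full windows.

    A trailing window shorter than `window` is dropped (assumption).
    """
    last_start = trajectory_length - window
    return [(s, s + window) for s in range(0, last_start + 1, stride)]


windows = strided_windows(1000)
num_sequences = len(windows)            # number of 90-step training sequences
unused_steps = 1000 - windows[-1][1]    # steps left over at the end of the trajectory
```

With a 1000-step trajectory this yields 11 non-overlapping windows and leaves the last 10 steps unused.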

#### Minigrid-Memory

To train baselines in this environment, we used only mazes of fixed size $41\times41$, ensuring a consistent corridor length during training. For evaluation, models were validated on mazes ranging from $11\times11$ to $501\times501$, where corridor lengths vary within each grid, enabling assessment of both interpolation and extrapolation capabilities. All training trajectories used an episode timeout of 96 steps, while validation trajectories across all maze sizes used a longer timeout of 500 steps. As in T-Maze, each trajectory began at $t=0$, ensuring the cue was always observed. During training, RATE used a context length of $K=30$ with $N=3$ segments, while other baselines (except RMT and TrXL) used $K=90$.

#### POPGym Suite

POPGym [@morad2023popgym] comprises 46 tasks of varying memory complexity, including both memory puzzles and reactive POMDPs. Since episode lengths vary widely across tasks -- from as short as 12 steps to as long as 1000 -- we ensured a consistent and fair memory evaluation for RATE by setting the context length $K = T / 3$ and using $N = 3$ segments for every environment, where $T$ denotes the maximum episode length of each task. This uniform configuration allowed RATE to process full trajectories with recurrent segmentation, ensuring its memory capacity was equally tested across tasks of different lengths and difficulties.

#### Atari and MuJoCo

When training RATE on Atari games and MuJoCo control tasks, sequences of length $T=90$ (Atari) and $T=60$ (MuJoCo) were sampled randomly from the original trajectories in the dataset. These trajectories were then divided into $N=3$ segments of length $K=30$ (Atari) and $K=20$ (MuJoCo), forming an effective context of length $K_{eff}=N\times K = 90$ ($60$ for MuJoCo).

For Atari, we used the identical experimental design described in the DT paper [@chen2021decision]. It is worth noting that we presented raw scores for Atari, rather than gamer-normalized scores as described in the DT paper. `\autoref{tab:ataricompare}`{=latex} shows the results for Atari environments. RATE outperforms DT significantly in environments like Breakout and Qbert. We attribute this to the observation that, although these environments do not explicitly demand memory, intricate dynamics from the past exert a greater influence on agent behavior than in environments such as SeaQuest. Actions executed in the past notably alter the present state of the environment in Breakout and Qbert, whereas in SeaQuest, such actions hold little significance. For instance, the emergence of enemies and divers in SeaQuest is entirely independent of the agent's prior actions.

For MuJoCo, our findings suggest that the conventional return-conditioning strategy is not well suited to our segment-based scheme. Over the course of a trajectory, the return on which the agent is conditioned steadily decreases, yet the true value of the agent's state at the start and at the end of the episode may be the same if the policy performs consistently well. To address this discrepancy, we propose a different evaluation strategy for MuJoCo tasks: each segment starts from the maximum return, as if the agent were beginning the trajectory anew. Our MuJoCo experiments in `\autoref{tab:mujococompare}`{=latex} show that this significantly improves performance in some environments. With this strategy, RATE obtained the best metrics among all baselines in 3/9 MuJoCo cases and outperformed DT in 9/9 tasks.
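The difference between the conventional return schedule and the per-segment reset can be illustrated with a small sketch. The function `conditioning_returns` and its exact semantics (subtracting each received reward and resetting at segment boundaries) are assumptions made for illustration; the paper does not spell out the implementation.

```python
from typing import List


def conditioning_returns(rewards: List[float], target_return: float,
                         context_length: int, reset_each_segment: bool) -> List[float]:
    """Return the return value the model is conditioned on at every step.

    Assumed semantics: the value decreases by each received reward. With
    reset_each_segment=True it is set back to target_return at the first
    step of every segment of length context_length.
    """
    values = []
    remaining = target_return
    for step, reward in enumerate(rewards):
        if reset_each_segment and step % context_length == 0:
            remaining = target_return
        values.append(remaining)
        remaining -= reward
    return values


rewards = [1.0] * 6
conventional = conditioning_returns(rewards, 10.0, context_length=3, reset_each_segment=False)
per_segment = conditioning_returns(rewards, 10.0, context_length=3, reset_each_segment=True)
```

With the conventional schedule the conditioning value keeps shrinking across the whole episode, whereas with the reset every segment begins from the same maximum value.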

```{=latex}
\begin{adjustbox}{width=\textwidth}
\begin{tabular}{lccccccc}
\toprule
\textbf{Hyperparameter} &
\textbf{ViZDoom2C} &
\textbf{Memory Maze} &
\textbf{T-Maze} &
\textbf{Minigrid-Memory} &
\textbf{POPGym} &
\textbf{Atari} &
\textbf{MuJoCo} \\
\midrule
\textit{Memory-specific parameters} \\
\midrule
Number of memory tokens      & 15    & 15    & 10    & 10    & 30   & 15   & 5 \\
Number of cached tokens      & 100   & 360   & 0     & 180   & 100  & 360  & 60 \\
Number of MRV heads          & 2     & 0     & 2     & 4     & 2    & 1    & 1 \\
MRV activation               & ReLU  & ReLU  & ReLU  & ReLU  & ReLU & ReLU$^\ddagger$ & ReLU \\
\midrule
\textit{Transformer architecture} \\
\midrule
Number of layers             & 6     & 6     & 8     & 4     & 10   & 6    & 3 \\
Number of attention heads    & 8     & 8     & 8     & 4     & 2    & 8    & 1 \\
Embedding dimension          & 64    & 64    & 64    & 128   & 32   & 128  & 128 \\
Context length $K$           & 50    & 30    & 50    & 30    & 18   & 30   & 20 \\
Number of segments           & 3     & 3     & 3     & 3     & 3    & 3    & 3 \\
Skip dec FFN                 & False & True  & True  & False & True & True & True \\
\midrule
\textit{Regularization} \\
\midrule
Hidden dropout               & 0.2   & 0.5   & 0.2   & 0.3   & 0.1  & 0.2  & 0.2 \\
Attention dropout            & 0.05  & 0.2   & 0.1   & 0.1   & 0.05 & 0.05 & 0.05 \\
Weight decay                 & 0.001 & 0.1   & 0.001 & 0.001 & 0.001& 0.1  & 0.1 \\
\midrule
\textit{Training configuration} \\
\midrule
Max epochs                   & 150   & 80    & 200   & 500   & 200  & 10   & 10 \\
Batch size                   & 128   & 64    & 64    & 64    & 32   & 128  & 4096 \\
Loss function                & CE    & CE    & CE    & CE    & CE   & CE   & MSE \\
Optimizer                    & AdamW & AdamW & AdamW & AdamW & AdamW & AdamW & AdamW \\
Learning rate                & 3e-4  & 3e-4  & 1e-4  & 1e-4  & 3e-4 & 3e-4 & 6e-5 \\
Grad norm clip               & 5.0   & 1.0   & 1.0   & 5.0   & 5.0  & 1.0  & 1.0 \\
Cosine decay                 & False & True  & False & False & False& True & False \\
Linear warmup                & True  & True  & True  & True  & True & True & True \\
$(\beta_1, \beta_2)$         & (0.9, 0.999) & (0.9, 0.95) & (0.9, 0.999) & (0.9, 0.999) & (0.9, 0.999) & (0.9, 0.95) & (0.9, 0.95) \\
\bottomrule
\end{tabular}
\end{adjustbox}
```
```{=latex}
\begin{table*}[t]

\caption{Performance on POPGym tasks (mean$\pm$sem over three runs, 100 seeds each).}
\label{tab:popgym_full}
\renewcommand{\arraystretch}{1.05}
\setlength{\tabcolsep}{4pt}
\begin{adjustbox}{width=\textwidth}

\begin{tabular}{lcccccc}
\toprule
\textbf{Environment} &
\textbf{RATE} &
\textbf{DT} &
\textbf{Random} &
\textbf{BC-MLP} &
\textbf{BC-LSTM} &
\makecell{\textbf{Dataset}\\ \textbf{Average}\\ \textbf{Return}}\\
\midrule
AutoencodeEasy-v0              & $-0.29\pm0.00$ & $-0.47\pm0.00$ & $-0.50\pm0.00$ & $-0.47\pm0.00$ & $-0.32\pm0.00$ & $-0.26$\\
AutoencodeMedium-v0            & $-0.47\pm0.00$ & $-0.49\pm0.00$ & $-0.50\pm0.00$ & $-0.49\pm0.00$ & $-0.47\pm0.00$ & $-0.48$\\
AutoencodeHard-v0              & $-0.46\pm0.00$ & $-0.49\pm0.00$ & $-0.50\pm0.01$ & $-0.50\pm0.00$ & $-0.44\pm0.00$ & $-0.43$\\
BattleshipEasy-v0              & $-0.81\pm0.02$ & $-0.93\pm0.03$ & $-0.46\pm0.01$ & $-1.00\pm0.00$ & $-0.49\pm0.01$ & $-0.35$\\
BattleshipMedium-v0            & $-0.91\pm0.02$ & $-0.91\pm0.03$ & $-0.39\pm0.01$ & $-1.00\pm0.00$ & $-0.81\pm0.02$ & $-0.43$\\
BattleshipHard-v0              & $-0.92\pm0.01$ & $-0.97\pm0.01$ & $-0.41\pm0.00$ & $-1.00\pm0.00$ & $-0.67\pm0.01$ & $-0.40$\\
ConcentrationEasy-v0           & $-0.06\pm0.02$ & $-0.05\pm0.01$ & $-0.19\pm0.01$ & $-0.92\pm0.00$ & $-0.14\pm0.00$ & $-0.12$\\
ConcentrationMedium-v0         & $-0.84\pm0.00$ & $-0.84\pm0.00$ & $-0.84\pm0.00$ & $-0.88\pm0.00$ & $-0.84\pm0.00$ & $-0.87$\\
ConcentrationHard-v0           & $-0.25\pm0.00$ & $-0.25\pm0.01$ & $-0.19\pm0.00$ & $-0.92\pm0.00$ & $-0.19\pm0.01$ & $-0.44$\\
CountRecallEasy-v0             & $ 0.07\pm0.01$ & $-0.46\pm0.01$ & $-0.93\pm0.00$ & $-0.92\pm0.00$ & $ 0.05\pm0.00$ & $ 0.22$\\
CountRecallMedium-v0           & $-0.47\pm0.01$ & $-0.75\pm0.03$ & $-0.88\pm0.00$ & $-0.88\pm0.00$ & $-0.47\pm0.00$ & $-0.48$\\
CountRecallHard-v0             & $-0.54\pm0.00$ & $-0.81\pm0.02$ & $-0.93\pm0.00$ & $-0.92\pm0.00$ & $-0.56\pm0.00$ & $-0.55$\\
HigherLowerEasy-v0             & $ 0.50\pm0.00$ & $ 0.50\pm0.00$ & $ 0.00\pm0.01$ & $ 0.47\pm0.00$ & $ 0.50\pm0.00$ & $ 0.51$\\
HigherLowerMedium-v0           & $ 0.50\pm0.00$ & $ 0.50\pm0.00$ & $-0.01\pm0.00$ & $ 0.49\pm0.00$ & $ 0.50\pm0.00$ & $ 0.49$\\
HigherLowerHard-v0             & $ 0.52\pm0.00$ & $ 0.51\pm0.00$ & $ 0.01\pm0.01$ & $ 0.50\pm0.00$ & $ 0.51\pm0.01$ & $ 0.49$\\
LabyrinthEscapeEasy-v0         & $ 0.95\pm0.00$ & $ 0.80\pm0.01$ & $-0.39\pm0.00$ & $ 0.72\pm0.05$ & $ 0.92\pm0.01$ & $ 0.95$\\
LabyrinthEscapeMedium-v0       & $-0.81\pm0.01$ & $-0.82\pm0.01$ & $-0.94\pm0.01$ & $-0.89\pm0.01$ & $-0.86\pm0.00$ & $-0.94$\\
LabyrinthEscapeHard-v0         & $-0.56\pm0.01$ & $-0.67\pm0.04$ & $-0.84\pm0.04$ & $-0.71\pm0.03$ & $-0.69\pm0.02$ & $-0.49$\\
LabyrinthExploreEasy-v0        & $ 0.95\pm0.00$ & $ 0.88\pm0.06$ & $-0.34\pm0.01$ & $ 0.87\pm0.01$ & $ 0.93\pm0.00$ & $ 0.96$\\
LabyrinthExploreMedium-v0      & $ 0.79\pm0.00$ & $ 0.77\pm0.01$ & $-0.73\pm0.00$ & $ 0.26\pm0.01$ & $ 0.71\pm0.01$ & $ 0.79$\\
LabyrinthExploreHard-v0        & $ 0.88\pm0.00$ & $ 0.86\pm0.01$ & $-0.61\pm0.00$ & $ 0.45\pm0.01$ & $ 0.82\pm0.01$ & $ 0.87$\\
MineSweeperEasy-v0             & $ 0.15\pm0.03$ & $-0.33\pm0.04$ & $-0.26\pm0.03$ & $-0.47\pm0.01$ & $ 0.20\pm0.00$ & $ 0.28$\\
MineSweeperMedium-v0           & $-0.44\pm0.00$ & $-0.40\pm0.01$ & $-0.43\pm0.00$ & $-0.49\pm0.00$ & $-0.35\pm0.01$ & $-0.27$\\
MineSweeperHard-v0             & $-0.20\pm0.00$ & $-0.37\pm0.02$ & $-0.39\pm0.01$ & $-0.48\pm0.00$ & $-0.16\pm0.00$ & $-0.10$\\
MultiarmedBanditEasy-v0        & $ 0.37\pm0.01$ & $ 0.27\pm0.01$ & $ 0.02\pm0.00$ & $ 0.05\pm0.00$ & $ 0.17\pm0.02$ & $ 0.62$\\
MultiarmedBanditMedium-v0      & $ 0.22\pm0.03$ & $ 0.27\pm0.01$ & $ 0.01\pm0.00$ & $ 0.01\pm0.00$ & $ 0.17\pm0.01$ & $ 0.43$\\
MultiarmedBanditHard-v0        & $ 0.32\pm0.01$ & $ 0.35\pm0.01$ & $ 0.01\pm0.00$ & $ 0.21\pm0.01$ & $ 0.14\pm0.00$ & $ 0.59$\\
NoisyPositionOnlyCartPoleEasy-v0   & $ 0.88\pm0.03$ & $ 0.87\pm0.02$ & $ 0.11\pm0.00$ & $ 0.23\pm0.00$ & $ 0.44\pm0.01$ & $ 0.98$\\
NoisyPositionOnlyCartPoleMedium-v0 & $ 0.18\pm0.01$ & $ 0.17\pm0.01$ & $ 0.11\pm0.00$ & $ 0.16\pm0.00$ & $ 0.22\pm0.01$ & $ 0.36$\\
NoisyPositionOnlyCartPoleHard-v0   & $ 0.33\pm0.01$ & $ 0.34\pm0.00$ & $ 0.12\pm0.01$ & $ 0.18\pm0.00$ & $ 0.25\pm0.01$ & $ 0.57$\\
NoisyPositionOnlyPendulumEasy-v0   & $ 0.87\pm0.00$ & $ 0.84\pm0.01$ & $ 0.27\pm0.01$ & $ 0.31\pm0.00$ & $ 0.88\pm0.00$ & $ 0.90$\\
NoisyPositionOnlyPendulumMedium-v0 & $ 0.60\pm0.01$ & $ 0.56\pm0.01$ & $ 0.26\pm0.00$ & $ 0.28\pm0.00$ & $ 0.66\pm0.00$ & $ 0.67$\\
NoisyPositionOnlyPendulumHard-v0   & $ 0.68\pm0.00$ & $ 0.63\pm0.01$ & $ 0.27\pm0.01$ & $ 0.30\pm0.00$ & $ 0.72\pm0.00$ & $ 0.73$\\
PositionOnlyCartPoleEasy-v0    & $ 0.93\pm0.03$ & $ 1.00\pm0.00$ & $ 0.12\pm0.00$ & $ 0.15\pm0.00$ & $ 0.17\pm0.00$ & $ 1.00$\\
PositionOnlyCartPoleMedium-v0  & $ 0.05\pm0.01$ & $ 0.03\pm0.00$ & $ 0.04\pm0.00$ & $ 0.05\pm0.00$ & $ 0.06\pm0.00$ & $ 1.00$\\
PositionOnlyCartPoleHard-v0     & $ 0.07\pm0.00$ & $ 0.34\pm0.08$ & $ 0.05\pm0.00$ & $ 0.09\pm0.00$ & $ 0.12\pm0.00$ & $ 1.00$\\
PositionOnlyPendulumEasy-v0     & $ 0.54\pm0.02$ & $ 0.51\pm0.03$ & $ 0.27\pm0.00$ & $ 0.29\pm0.00$ & $ 0.91\pm0.00$ & $ 0.92$\\
PositionOnlyPendulumMedium-v0   & $ 0.47\pm0.01$ & $ 0.49\pm0.01$ & $ 0.26\pm0.00$ & $ 0.28\pm0.00$ & $ 0.82\pm0.00$ & $ 0.82$\\
PositionOnlyPendulumHard-v0     & $ 0.49\pm0.01$ & $ 0.55\pm0.01$ & $ 0.26\pm0.00$ & $ 0.30\pm0.00$ & $ 0.89\pm0.00$ & $ 0.88$\\
RepeatFirstEasy-v0              & $ 1.00\pm0.00$ & $ 0.45\pm0.16$ & $-0.49\pm0.01$ & $-0.50\pm0.00$ & $ 1.00\pm0.00$ & $ 1.00$\\
RepeatFirstMedium-v0            & $ 0.10\pm0.02$ & $ 0.42\pm0.14$ & $-0.50\pm0.00$ & $-0.50\pm0.00$ & $-0.50\pm0.00$ & $ 0.99$\\
RepeatFirstHard-v0              & $ 0.99\pm0.01$ & $-0.21\pm0.18$ & $-0.50\pm0.00$ & $-0.50\pm0.00$ & $ 0.99\pm0.01$ & $ 1.00$\\
RepeatPreviousEasy-v0           & $ 1.00\pm0.00$ & $ 1.00\pm0.00$ & $-0.49\pm0.01$ & $-0.52\pm0.00$ & $ 1.00\pm0.00$ & $ 1.00$\\
RepeatPreviousMedium-v0         & $-0.46\pm0.00$ & $-0.47\pm0.00$ & $-0.51\pm0.00$ & $-0.48\pm0.00$ & $-0.45\pm0.00$ & $-0.48$\\
RepeatPreviousHard-v0           & $-0.38\pm0.01$ & $-0.38\pm0.00$ & $-0.50\pm0.01$ & $-0.50\pm0.00$ & $-0.38\pm0.00$ & $-0.39$\\
VelocityOnlyCartPoleEasy-v0     & $ 1.00\pm0.00$ & $ 1.00\pm0.00$ & $ 0.11\pm0.00$ & $ 0.99\pm0.00$ & $ 1.00\pm0.00$ & $ 1.00$\\
VelocityOnlyCartPoleMedium-v0   & $ 1.00\pm0.00$ & $ 0.96\pm0.02$ & $ 0.04\pm0.00$ & $ 0.63\pm0.00$ & $ 1.00\pm0.00$ & $ 0.99$\\
VelocityOnlyCartPoleHard-v0     & $ 1.00\pm0.00$ & $ 1.00\pm0.00$ & $ 0.06\pm0.00$ & $ 0.83\pm0.01$ & $ 1.00\pm0.00$ & $ 1.00$\\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table*}
```
```{=latex}
\begin{table*}[t!]\caption{Experimental setup and evaluation metrics across different environments. $N_{runs}$ denotes the number of model runs; $N_{seeds}$ denotes the number of inference episodes with different seeds; sem denotes standard error of the mean, and std denotes standard deviation.}
\label{tab:stats}
\begin{center}
\begin{tabular}{lcccc}
\toprule
\textbf{Environment} & \multicolumn{2}{c}{\textbf{Experiment Setup}} & \multicolumn{2}{c}{\textbf{Results}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
& $N_{\text{runs}}$ & $N_{\text{seeds}}$ & \textbf{Metric} & \textbf{Notation} \\
\midrule
\textit{Memory-intensive environments} \\
\midrule
ViZDoom-Two-Colors       & 6  & 100 & Return    & mean$\pm$sem \\
T-Maze                   & 4 & 100 & Success Rate    & mean$\pm$sem \\
Memory Maze              & 3  & 100 & Return    & mean$\pm$sem \\
Minigrid-Memory          & 3  & 100 & Return    & mean$\pm$sem \\
POPGym                   & 3  & 100 & Return    & mean$\pm$sem \\
\midrule
\textit{Diagnostic environment} \\
\midrule
Action Associative Retrieval & 10 & --- & Success Rate & mean$\pm$sem \\
\bottomrule
\end{tabular}
\end{center}
\end{table*}
```
### Additional ablation studies {#app:add_app}

To determine suitable hyperparameters for the memory mechanisms, we performed additional ablation studies in the ViZDoom-Two-Colors and T-Maze environments; the results are presented in `\autoref{fig:doom_ablation}`{=latex} and `\autoref{fig:abbl_doom_tmaze}`{=latex} (right). For environments such as ViZDoom-Two-Colors, with a continuous reward signal and image observations, the best results were obtained with the number of cached memory tokens set to $\texttt{mem\_len} = (K\times 3 + 2 \times \texttt{num\_mem\_tokens})\times N$, where $K$ is the context length and $N$ is the number of segments.

In contrast, for environments with sparse events such as T-Maze, caching the hidden states of previous tokens ($\texttt{mem\_len} > 0$) prevents the model from remembering important information.

```{=latex}
\begin{figure*}[h]
    
    \includegraphics[width=0.49\textwidth]{plots/rate_noise_rtg.png}
    \hfill
    \raisebox{20pt}{\includegraphics[width=0.49\textwidth]{plots/ablation_tmaze.png}}
    \caption{(\textbf{left}) Effect of adding noise to RATE memory tokens in ViZDoom-Two-Colors. (\textbf{right}) Results of ablation studies for RATE-3 (trained on corridor lengths $\leq 90$) in the T-Maze environment. $\texttt{n\_head\_ca}$ is the number of MRV attention heads and $\texttt{num\_mem\_tokens}$ is the number of memory tokens.}
    \label{fig:abbl_doom_tmaze}
\end{figure*}
```
![Results of RATE ablation studies in the ViZDoom-Two-Colors environment.](plots/rate_vizdoom_main_vizdoom_ablation.png){#fig:doom_ablation width="100%"}

#### Additional ViZDoom-Two-Colors ablation

The effect of mixing memory tokens with noise is shown in `\autoref{fig:abbl_doom_tmaze}`{=latex} (left). The noise was applied as a convex combination: $\texttt{memory\_tokens} = (1-\alpha)\times \texttt{memory\_tokens} + \alpha \times \texttt{noise}$. With the caching of hidden states from previous steps left unchanged, increasing the noise parameter $\alpha$ first degrades inference performance on green pillars (up to $\alpha=0.5$) and only then degrades performance on red pillars. This can be explained by the memory embeddings being trained to record mostly information about red pillars, which helps to counteract bias in the training data.
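The convex combination used in this ablation can be written as a short NumPy sketch. The function name `mix_memory_with_noise` is hypothetical, and the choice of standard Gaussian noise is an assumption, since the noise distribution is not specified here.

```python
import numpy as np


def mix_memory_with_noise(memory_tokens: np.ndarray, alpha: float,
                          rng: np.random.Generator) -> np.ndarray:
    """Blend memory tokens with noise: (1 - alpha) * memory + alpha * noise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Assumption: standard Gaussian noise with the same shape as the memory.
    noise = rng.standard_normal(memory_tokens.shape)
    return (1.0 - alpha) * memory_tokens + alpha * noise


rng = np.random.default_rng(0)
memory = np.ones((15, 64))  # 15 memory tokens, embedding dimension 64
unchanged = mix_memory_with_noise(memory, alpha=0.0, rng=rng)
pure_noise = mix_memory_with_noise(memory, alpha=1.0, rng=rng)
```

Setting $\alpha=0$ leaves the memory untouched and $\alpha=1$ replaces it entirely with noise, so sweeping $\alpha$ between these extremes probes how much the policy relies on the stored content.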

```{=latex}
\begin{adjustbox}{width=\textwidth}

\begin{tabular}{lcccc}
\toprule
\textbf{Environment} & \multicolumn{4}{c}{\textbf{Encoder Configuration}} \\
\cmidrule{2-5}
& \textbf{Return} & \textbf{Observation} & \textbf{Conv. params$^\ddagger$} & \textbf{Action} \\
\midrule
\textit{Image-based environments} \\
\midrule
ViZDoom-Two-Colors & Linear & Conv2D $\times$ 3 & (32, 64, 64) / (8, 4, 3) / 0 & Embedding \\
Memory Maze & Linear & Conv2D $\times$ 3 & (32, 64, 64) / (8, 4, 3) / 2 & Embedding \\
Minigrid-Memory & Linear & Conv2D $\times$ 3 & (32, 64, 64) / (8, 4, 3) / 0 & Embedding \\
Atari & Linear & Conv2D $\times$ 3 & (32, 64, 64) / (8, 4, 3) / 0 & Embedding \\
\midrule
\textit{Vector-based environments} \\
\midrule
T-Maze & Linear & Linear & --- & Embedding \\
MuJoCo & Linear & Linear & --- & Linear \\
Action Associative Retrieval & Linear & Linear & --- & Embedding \\
POPGym & Linear & Linear & --- & Embedding / Linear \\
\bottomrule
\end{tabular}
\end{adjustbox}
```
```{=latex}
\begin{figure*}[t]
    \begin{center}
    \includegraphics[width=1.0\textwidth]{plots/gates_v3.png}
    % 
    \caption{Memory Retention Valve configurations used in the ablation study.
    \textbf{MRV-CA-2}: cross-attention-based MRV which uses an attention mechanism to control the updating of memory embeddings and which is used in the work as the main mechanism.
    \textbf{MRV-CA-1}: uses the same mechanism as MRV-CA-2 but the updated memory embeddings $M_{n+1}$ are fed to Query, and the incoming memory embeddings $M_n$ are fed to Key and Value.
    \textbf{MRV-G}: gated MRV which uses a gating mechanism similar to the one used in Gated Transformer-XL~\citep{parisotto2020stabilizing}.
    \textbf{MRV-GRU}: uses a GRU~\citep{gru} block to process updated memory embeddings with hidden states.
    \textbf{MRV-LSTM}: uses an LSTM~\citep{hochreiter1997long} block to process updated memory embeddings with cached states.}
    \label{fig:mrv_variants}
    \end{center}
\end{figure*}
```
#### Curriculum Learning

In the T-Maze environment, the ratio of actions taken at the junction to actions taken while moving straight along the corridor is $\frac{1}{L}$, which tends to $0$ as the corridor length $L$ increases. This creates a significant imbalance in the agent's action distribution, which can cause problems when predicting the rare class (turning actions). In theory, this situation can be remedied through curriculum learning.

Curriculum learning (CL) is a technique in which a model is trained on examples of increasing difficulty. In this approach, the model is first trained on the set of trajectories $Q_1 = q_1$ of length $K \times 1$, then the trained model is re-trained on the set of trajectories $Q_2 = q_1 \cup q_2$, where the set $q_2$ is formed by trajectories of length $K \times 2$, and so on (in order of increasing complexity of the trajectories). Thus, for the $N$ segments considered during training, the set $Q_{N}= \bigcup_{i=1}^{N}q_i$ is used.
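The construction of the cumulative training sets $Q_1, \dots, Q_N$ can be sketched as follows. The function `curriculum_stages` and the use of string identifiers for trajectories are hypothetical choices for illustration; the sketch only shows how each stage adds the trajectories of the next length to those already seen.

```python
from typing import Dict, List


def curriculum_stages(trajectories_by_segments: Dict[int, List[str]],
                      num_segments: int) -> List[List[str]]:
    """Build cumulative sets Q_1, ..., Q_N, where Q_i is the union of q_1..q_i.

    trajectories_by_segments[i] holds q_i, the trajectories of length K * i.
    """
    stages = []
    current: List[str] = []
    for i in range(1, num_segments + 1):
        # A new list is created at each stage, so earlier stages stay intact.
        current = current + trajectories_by_segments.get(i, [])
        stages.append(current)
    return stages


q = {1: ["short_a", "short_b"], 2: ["medium_a"], 3: ["long_a"]}
stages = curriculum_stages(q, num_segments=3)
```

The model would then be trained on `stages[0]`, re-trained on `stages[1]`, and so on, in order of increasing trajectory length.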

In the T-Maze environment, DT, RATE, RMT, and TrXL were trained both with and without curriculum learning, since this approach should in theory produce better results. However, RATE solves the T-Maze task successfully without curriculum learning, and using it even slightly degraded performance on long corridors. For TrXL, by contrast, CL yielded slightly better results. Overall, CL does not lead to significantly better performance on the T-Maze task. The results of using CL in the T-Maze environment are presented in `\autoref{fig:curr}`{=latex} (left), and the results of applying noise to memory embeddings to assess their importance are presented in `\autoref{fig:curr}`{=latex} (right).

```{=latex}
\begin{figure*}[h]
    
    \includegraphics[width=0.49\textwidth]{plots/rate_tmaze_noise.png}
    \hfill
    \raisebox{0.0cm}{\includegraphics[width=0.49\textwidth]{plots/rate_tmaze_main_curr.png}}
    \caption{(\textbf{left}) Results with and without the use of curriculum learning and (\textbf{right}) results of replacing RATE memory tokens with white noise at inference in T-Maze.}
    \label{fig:curr}
\end{figure*}
```
#### Supplemental MRV ablation {#app:mrvs}

One option for implementing the memory gating mechanism follows the approach proposed in Gated Transformer-XL (GTrXL) [@parisotto2020stabilizing]. The MRV-G scheme is inspired by the GTrXL gating mechanism and is implemented as follows:

$$\label{app:mrv}
    r = \sigma(M_n W_r + M_{n+1} U_r)$$

$$z = \sigma(M_n W_z + M_{n+1} U_z - \texttt{bias})$$

$$h = \texttt{tanh}(M_n W_g + (M_{n+1} \times r) U_r)$$

$$\tilde{M}_{n+1} = \sigma(M_n (1 - z) + z \times h)$$

The results of RATE inference (trained on corridor lengths of $\leq 150$) in the T-Maze environment with these MRV configurations are shown in `\autoref{fig:mrv_tmaze}`{=latex} and in `\autoref{tab:mrv_ablation}`{=latex}. The results presented in `\autoref{fig:mrv_tmaze}`{=latex} confirm the high stability of RATE when using the cross-attention-based MRV (MRV-CA-2), as well as the model's ability to hold important information in memory embeddings during inference on long tasks.
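The MRV-G update defined by the equations above can be sketched in NumPy. This is an illustrative transcription under several assumptions, not the authors' implementation: the function name `mrv_gated`, the random weight initialization, and the value `bias=2.0` are hypothetical, and the sketch mirrors the equations exactly as written, including the reuse of $U_r$ in the candidate term.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def mrv_gated(m_old: np.ndarray, m_new: np.ndarray, params: dict,
              bias: float = 2.0) -> np.ndarray:
    """Gated memory update, with m_old = M_n and m_new = M_{n+1}.

    Both inputs have shape (num_mem_tokens, d_model); each weight matrix in
    params has shape (d_model, d_model). The bias value is an assumption.
    """
    r = sigmoid(m_old @ params["W_r"] + m_new @ params["U_r"])
    z = sigmoid(m_old @ params["W_z"] + m_new @ params["U_z"] - bias)
    # The candidate term reuses U_r, following the equation as written.
    h = np.tanh(m_old @ params["W_g"] + (m_new * r) @ params["U_r"])
    return sigmoid(m_old * (1.0 - z) + z * h)


rng = np.random.default_rng(0)
d_model, num_mem_tokens = 64, 10
params = {name: 0.1 * rng.standard_normal((d_model, d_model))
          for name in ("W_r", "U_r", "W_z", "U_z", "W_g")}
m_old = rng.standard_normal((num_mem_tokens, d_model))
m_new = rng.standard_normal((num_mem_tokens, d_model))
updated = mrv_gated(m_old, m_new, params)
```

Because the final step applies a sigmoid, the updated memory in this sketch is bounded to the interval $(0, 1)$, and a positive bias shifts $z$ toward zero so that the update initially stays close to the incoming memory $M_n$.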

```{=latex}
\begin{figure*}[t]
    \begin{center}
    \includegraphics[width=\textwidth]{plots/RATE_MEM_gates_ablation.png}
    % 
    \caption{Results of RATE inference with different MRV configurations on the T-Maze environment. Training was performed with the number of segments $N=5$ and context length $K=30$, i.e. on trajectories of length $\leq 150$. MRV-CA-2 is the final MRV configuration that is used throughout the work and is designated as MRV.}
    \label{fig:mrv_tmaze}
\end{center}
% 
\end{figure*}
```
#### Ablation on number of segments and segment length {#app:abl_seg_len}

Partitioning the trajectories into fixed-length segments allows the RATE model to train on long trajectories without increasing the context size. This makes the parameters $N$ (the number of segments into which the training trajectories are divided) and $K$ (the context length, i.e., the size of a single segment) critical, because together they determine the length of the effective context $K_{\text{eff}}=K\times N$. `\autoref{fig:sweep_k_eff}`{=latex} presents the results of ablation studies for parameters $N$ and $K$ at fixed $K_{\text{eff}} = 90$.

```{=latex}
\begin{figure*}[t]
    
    
        
        \includegraphics[width=\textwidth]{plots/rate-sweep/k_eff.png}
        \caption{Ablation of segment size $K$ and segment count $N$ with fixed effective context $K_{\text{eff}} = K{\uparrow} \times N{\downarrow} = 90$.}
        \label{fig:sweep_k_eff}
    
    \hfill
    
        
        
        \includegraphics[width=\textwidth]{plots/rate-sweep/dec_ffn.png}
        \caption{Ablation of feed-forward block usage in the decoder.}
        \label{fig:sweep_dec_ffn}
    
\end{figure*}
```
### Transformer Ablation Studies {#app:trans_abb_studies}

###### Transformer core hyperparameters.

This section presents the results of ablation studies on the main hyperparameters of the RATE transformer. The RATE configuration for the T-Maze environment specified in `\autoref{tab:rate_hyperparams}`{=latex} was chosen for the ablation studies. The ablation studies focus on understanding the impact of key hyperparameters by systematically varying one parameter while keeping others constant. The results are shown in `\autoref{fig:sweep_n_layers}`{=latex}, `\autoref{fig:sweep_n_heads}`{=latex}, and `\autoref{fig:sweep_d_model}`{=latex}.

```{=latex}
\begin{figure*}[t]
    
    
        
        \includegraphics[width=\textwidth]{plots/rate-sweep/n_layers.png}
        \caption{Results of the ablation over the number of layers of the RATE model in the T-Maze environment.}
        \label{fig:sweep_n_layers}
    
    \hfill
    
        
        \includegraphics[width=\textwidth]{plots/rate-sweep/n_heads.png}
        \caption{Results of the ablation over the number of attention heads of the RATE model in the T-Maze environment.}
        \label{fig:sweep_n_heads}
    
    \hfill
    
        
        
        \includegraphics[width=\textwidth]{plots/rate-sweep/d_model.png}
        \caption{Results of the ablation over the embedding dimension of the RATE model in the T-Maze environment.}
        \label{fig:sweep_d_model}
    
\end{figure*}
```
###### Feed-Forward Network.

For RATE, the inclusion of the decoder feed-forward block is treated as a tunable hyperparameter. In most environments, we disable it, as doing so often leads to better performance (see `\autoref{fig:sweep_dec_ffn}`{=latex}). However, for ViZDoom-Two-Colors and Minigrid-Memory, we found that retaining the feed-forward block yields slightly improved results, and thus it is enabled in those settings.

### Recommendations for Hyperparameter Settings {#app:params_setting}

Transformer-based models require careful hyperparameter tuning, and the addition of memory mechanisms in RATE introduces a few more components. However, **configuring RATE remains largely similar to tuning a standard transformer**. Based on extensive empirical evaluation, we provide the following **practical guidelines** to simplify the setup process.

###### Step-by-step configuration:

1.  **Segment setup.** Divide each trajectory into $N=3$ segments. For a trajectory of length $T$, set the context length to $K = \lfloor T / 3 \rfloor$.

2.  **Memory configuration.** Use the following default parameters for RATE's memory mechanisms:

    -   `num_mem_tokens` = 5

    -   `n_head_ca` = 1

    -   `mrv_act` = ReLU

    -   `mem_len` =

        -   $(3 \times K + 2 \times \texttt{num\_mem\_tokens}) \times N$ for dense reward environments (e.g., ViZDoom-Two-Colors, Minigrid-Memory)

        -   $0$ for sparse reward environments (e.g., T-Maze)

3.  **Transformer core.** Set the standard architecture parameters (number of layers, attention heads, embedding dimension, etc.) based on the task complexity and computational constraints.

4.  **Memory tuning.** After adjusting the transformer core, fine-tune memory-related parameters if needed (e.g., `num_mem_tokens`, `mem_len`, dropout rates).

This configuration provides a strong default setup and has consistently performed well across all evaluated tasks.
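The default memory settings from the steps above can be collected in a small helper. The function name `default_rate_memory_config` and the dictionary keys `context_length` and `num_segments` are hypothetical; `num_mem_tokens`, `n_head_ca`, `mrv_act`, and `mem_len` follow the parameter names used in this appendix. This is a sketch of the guidelines, not an interface of the released code.

```python
from typing import Dict, Union


def default_rate_memory_config(trajectory_length: int, dense_reward: bool,
                               num_segments: int = 3,
                               num_mem_tokens: int = 5) -> Dict[str, Union[int, str]]:
    """Assemble the recommended default memory settings for a task."""
    context_length = trajectory_length // num_segments  # K = floor(T / N)
    if dense_reward:
        # mem_len = (3 * K + 2 * num_mem_tokens) * N for dense rewards.
        mem_len = (3 * context_length + 2 * num_mem_tokens) * num_segments
    else:
        # No cached hidden states for sparse-reward tasks such as T-Maze.
        mem_len = 0
    return {
        "context_length": context_length,
        "num_segments": num_segments,
        "num_mem_tokens": num_mem_tokens,
        "n_head_ca": 1,
        "mrv_act": "ReLU",
        "mem_len": mem_len,
    }


dense = default_rate_memory_config(trajectory_length=90, dense_reward=True)
sparse = default_rate_memory_config(trajectory_length=90, dense_reward=False)
tuned = default_rate_memory_config(trajectory_length=90, dense_reward=True,
                                   num_mem_tokens=15)
```

For $T=90$ the defaults give $K=30$ and a cache of 300 tokens in the dense-reward case. With `num_mem_tokens` raised to 15, the same formula gives 360, which matches the number of cached tokens listed for Memory Maze and Atari in the hyperparameter table.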

### Technical details {#app:iron}

`\autoref{tab:model_parameters}`{=latex} and `\autoref{tab:training_time_size}`{=latex} show the technical parameters of the trained models. Note that the difference between the number of DT and RATE parameters is small. Training RATE with trajectories split into $N$ segments requires $\sim N$ times less GPU memory than training DT. Training was conducted on a single NVIDIA A100 GPU with 80 GB of memory.

```{=latex}
\begin{table*}[t!]
    
    \caption{Comparison of RATE and DT model parameters. RATE has 1.0--2.9\% fewer parameters than DT because RATE does not use a feed-forward network in the transformer decoder by default.}
    \begin{tabular}{lccc}
        \toprule
        \textbf{Environment} & \textbf{RATE} & \textbf{DT} & \textbf{diff, \%} \\
        \midrule
        T-Maze & 1,723,840 & 1,775,488 & -2.91 \\
        ViZDoom-Two-Colors & 4,537,504 & 4,672,032 & -2.88 \\
        Minigrid-Memory & 2,000,864 & 2,051,872 & -2.49 \\
        Memory Maze & 1,639,840 & 1,673,696 & -2.02 \\
        POPGym & 6,760,192 & 6,827,008 & -0.98 \\
        % MIKASA-Robo & 1,412,520 & 1,529,896 & -7.67 \\
        \bottomrule
    \end{tabular}
    \label{tab:model_parameters}
\end{table*}
```
```{=latex}
\begin{table*}[h]
    
    \caption{
    Computational efficiency comparison between RATE and DT models across different memory-intensive environments. We report three key metrics: (1) training time per epoch (mean±std, in seconds), (2) inference latency per step (mean±sem, in milliseconds), and (3) GPU memory footprint (in MiB). Lower values indicate better efficiency.
    }
    \renewcommand{\arraystretch}{1.2}
    \begin{adjustbox}{width=1\columnwidth}
    \begin{tabular}{@{}lccccccc@{}}
        \toprule
        & \multicolumn{3}{c}{\textbf{RATE}} & \multicolumn{3}{c}{\textbf{DT}} \\
        \cmidrule(lr){2-4} \cmidrule(lr){5-7}
        \textbf{Environment} & Train (s) & Test (ms) & Size (MiB) & Train (s) & Test (ms) & Size (MiB) \\
        \midrule
        T-Maze & 16.17±2.75 & 7.20±0.31 & 3,148 & 95.75±0.49 & 10.69±0.14 & 8,608 \\
        ViZDoom-Two-Colors & 77.44±3.56 & 10.35±0.52 & 7,750 & 68.18±1.56 & 10.45±0.41 & 14,046 \\
        Minigrid-Memory & 33.74±2.65 & 9.94±2.24 & 4,102 & 16.77±1.37 & 10.43±2.84 & 4,298 \\
        Memory Maze & 110.26±2.97 & 38.98±0.62 & 6,638 & 82.69±1.56 & 40.36±0.46 & 10,386 \\
        POPGym & 3.37±0.25 & 8.91±0.37 & 5,948 & 3.64±0.53 & 8.98±0.32 & 10,696 \\
        % MIKASA-Robo & 71.30±8.08 & 485.67±8.75 & 10,396 & 44.90±6.16 & 473.29±5.97 & 29,902 \\
        \bottomrule
    \end{tabular}
    \end{adjustbox}
    \label{tab:training_time_size}
\end{table*}
```

[^1]: In our setting, *cached hidden states* refers to the mechanism introduced in Transformer-XL [@dai2019transformerxl], in which the hidden activations computed for preceding segments are stored and reused as an extended key-value context when processing the next segment. Concretely, instead of recomputing all past representations from scratch, the model concatenates the fixed, non-trainable hidden states from earlier segments with the current segment's inputs, thereby enabling segment-level recurrence and information flow across boundaries without backpropagating gradients through the cached states.

[^2]: *Static recurrence* denotes the practice of forwarding cached hidden states from one segment to the next without any gating or content-based filtering. For instance, Transformer-XL uses this mechanism by directly reusing past hidden states as extended key-value context, which increases the effective horizon but provides no control over what information is retained or overwritten.

[^3]: Throughout the paper, *alignment* refers only to a geometric notion: two vectors are "$\alpha$-aligned" when the angle between them does not exceed a specified threshold. This has no relation to alignment or preference tuning in large language models.
