---
bibliography:
- reference.bib
---

```{=latex}
\newcommand{\promptbox}[3]{%
  \colorlet{promptbg}{#3}%
  \colorlet{promptframe}{#3}%
  \par\noindent
  \begingroup
  \setlength{\fboxsep}{6pt}%
  \colorbox{promptframe!85!black}{%
    \parbox{\dimexpr\linewidth-2\fboxsep\relax}{%
      \color{white}\bfseries\small
      \rule[-4pt]{0pt}{14pt}#1\strut}%
  }\par\nointerlineskip
  \endgroup
  \lstinputlisting[style=gepaprompt]{#2}%
  \medskip%
}
```
```{=latex}
\providecommand{\methodname}{\textsc{FST}\xspace}
```
```{=latex}
\begin{tcolorbox}[abstractbox, width=\textwidth]\centering

  {\LARGE\titleFont \textcolor{titleblue}{Learning, Fast and Slow: Towards LLMs\\ That Adapt Continually}\par}
  \vspace{0.5em}

{\normalsize\authorFont
  \textbf{Rishabh Tiwari}$^{*\,1, 4}$\hspace{0.2em}
  \textbf{Kusha Sareen}$^{*\,2}$\hspace{0.2em}
  \textbf{Lakshya A Agrawal}$^{*\,1}$\\[0.2em]
  \textbf{Joseph E. Gonzalez}$^{\,1}$\hspace{0.2em}
  \textbf{Matei Zaharia}$^{\,1}$\hspace{0.2em}
  \textbf{Kurt Keutzer}$^{\,1}$\hspace{0.2em}
  \textbf{Inderjit S Dhillon}$^{\,3}$\\[0.2em]
  \textbf{Rishabh Agarwal}$^{\dagger\,2, 5}$\hspace{0.2em}
  \textbf{Devvrit Khatri}$^{\dagger\,3, 6}$
}\\[0.8em]
{\normalsize\authorFont $^{1}${UC Berkeley}\hspace{1.5em}$^{2}${Mila}\hspace{1.5em}$^{3}${UT Austin}\hspace{1.5em}$^{4}${Eragon}\hspace{1.5em}$^{5}${Periodic Labs}\hspace{1.5em}$^{6}${Mirendil}}\\[0.3em]
  {\normalsize\authorFont\hypersetup{urlcolor=titleblue} \href{https://rishabhtiwari.ai/projects/fst/video.mp4}{Video} \hspace{0.5em}|\hspace{0.5em} \href{https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/}{Blog} \hspace{0.5em}|\hspace{0.5em} \href{https://rishabhtiwari.ai/projects/fst/code/}{Code}}\\[0.1em]

  \vspace{0.1em}
  \begin{justify}
  \authorFont
  \setstretch{1.1}
  \setlength{\parindent}{0pt}
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity.
In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly \emph{adapt} to task-specific requirements (e.g., via prompt optimization), but typically cannot by itself match the performance gains available through updating LLM parameters.
There is no good reason to \emph{restrict} learning to being either in-context or in-weights; moreover, humans likely also learn at different time scales (e.g., System 1 vs. System 2).
To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as ``slow'' weights and optimized context as ``fast'' weights. These fast ``weights'' learn from textual feedback to absorb task-specific information, allowing the slow weights to stay closer to the base model and preserve general reasoning behaviors.
\textbf{Fast-Slow Training} (\methodname{}) is up to $3\times$ more sample-efficient than slow-only learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to $70\%$ lower KL divergence), resulting in less catastrophic forgetting than RL training.
This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models.
In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
   \end{justify}

  \vspace{0.3em}
  {\centering\small\authorFont \textbf{Correspondence}: \texttt{\href{mailto:rishabhtiwari@berkeley.edu,lakshyaaagrawal@berkeley.edu}{\{rishabhtiwari, lakshyaaagrawal\}@berkeley.edu}, \href{mailto:kusha.sareen@mila.quebec}{kusha.sareen@mila.quebec}, \href{mailto:devvrit.03@gmail.com,rishabhagarwal.467@gmail.com}{\{devvrit.03, rishabhagarwal.467\}@gmail.com}\\[0.5em]$^{*}$Equal contribution\hspace{0.1em} $^{\dagger}$Equal advising.}\par}
\end{tcolorbox}
```
```{=latex}
\vspace{1.0em}
```
```{=latex}
\vspace{-1.5mm}
```
Introduction
============

```{=latex}
\vspace{-1.5mm}
```
```{=latex}
\centering
```
![**Fast-Slow Training (`\methodname{}`{=latex}) learns faster, stays adaptable, and forgets less.** Comparison of `\methodname{}`{=latex} (slow-weight RL interleaved with fast-weight prompt optimization), RL alone (slow weights only), and GEPA alone (fast weights only), averaged over `CodeIO`, `Math (Polaris)`, and `HoVer-hard`. **Left:** Evaluation accuracy on the trained task as training samples accumulate. `\methodname{}`{=latex} reaches RL's peak with substantially fewer samples and converges to a higher ceiling than either RL or GEPA alone. **Middle:** Plasticity, the model's remaining ability to learn a new skill. After training on a first task, we continue with a fresh round of RL on a second task and report the final accuracy from each initialization. The RL-trained checkpoint barely learns the new task, while the `\methodname{}`{=latex}-trained checkpoint roughly matches the base model. **Right:** How far each training run drifts from the base model, measured by KL$(\pi_{\mathrm{train}}\,\|\,\pi_{\mathrm{base}})$. Smaller drift correlates with less forgetting of prior abilities; at matched accuracy, `\methodname{}`{=latex} sits well to the left of RL.](paper_plots/fig_main_hero_v4.png "fig:"){#fig:hero width="\\linewidth"} `\vspace{-4.5mm}`{=latex}

```{=latex}
\centering
```
![**Slow weights and fast weights co-evolve through interleaved updates.** The slow loop (top) updates $\theta$ from the scalar reward alone ($\theta_c \to \theta_{c+1}$). The fast loop (bottom) updates $\Phi$ via reflective optimization, additionally consuming the rollout's full text including thoughts, tool calls, errors, and rich feedback ($\Phi_c \to \Phi_{c+1}$). Maintaining $\Phi$ as a Pareto-frontier population (rather than a single best prompt) preserves diversity: different contexts specialize to different problem slices, exposing the slow update rule to rich conditioning during training.](figures/D9.png "fig:"){#fig:method width="\\linewidth"} `\vspace{-4.5mm}`{=latex}

Large language models (LLMs) are commonly adapted to specialized domains such as math and coding through supervised finetuning (SFT) or reinforcement learning (RL), both of which modify the model parameters [@deepseek_r1; @tulu3; @deepseekmath; @deepseekcoder; @instructgpt]. However, treating parameter updates as the sole mechanism of adaptation creates a fundamental bottleneck: every improvement, whether it be a reusable reasoning skill, a task-specific heuristic, or a transient lesson from recent rollouts, must be written into the same persistent set of model weights. Since the entire policy is parameterized by these weights, an update that improves in-domain reward simultaneously moves the model away from its base behavior [@gao2022scalinglawsrewardmodel; @mitigatingalignmenttaxrlhf], reducing entropy [@entropymechanism; @li2026choicedivergenceneglectedkey], hurting out-of-distribution generalization [@instructgpt; @scalinglawsforgettingfinetuning; @empiricalstudycatastrophicforgetting; @beyondreasoninggains], and degrading its ability to adapt to future tasks, a phenomenon known as plasticity loss [@Frati_2024; @understandingpreventingcapacityloss; @Dohare2024; @mitigatingplasticitylosscontinual].

LLM systems also possess another powerful adaptation mechanism: prompts, instructions, and contextual information [@languagemodelsfewshotlearners; @cot]. Unlike model parameters, these textual components can be modified cheaply, frequently, and per task. Prompt optimization methods demonstrate that substantial behavioral improvements can be obtained by improving the textual context under which the model operates [@ape; @opro; @dspy; @miprov2; @gepa].

In this work, we introduce **Fast-Slow Training (FST)**, where we view LLM adaptation as occurring through two complementary components (Figure `\ref{fig:method}`{=latex}). The first is a *slow parametric component*: the model weights, which are expensive to update, persist across tasks, and encode long-lived behavior. The second is a *fast textual component*: prompts, instructions, and task context, which can be changed cheaply and frequently and influence behavior immediately, capturing task-level adaptation without permanently modifying the model.

The fast-slow distinction we draw above has a long history in neural networks [@hinton1987using; @schmidhuber1992learning; @ba2016usingfastweightsattend; @anand2023predictioncontrolcontinualreinforcement], motivated by separating temporary, task-specific adaptations in fast weights from persistent, broadly useful behaviors in slow weights. We instantiate this idea in RLVR [@deepseekmath; @scalerl] by interleaving slow reinforcement learning updates with fast context optimization using GEPA [@gepa]. Rather than first training a policy and then optimizing a prompt for the final checkpoint, our method allows the context and the policy to co-evolve. The fast textual weights quickly incorporate lessons from rollouts, steering the model toward better reasoning behavior, while the slow parametric weights are updated under this evolving context. This produces a training process in which performance gains are distributed appropriately across both elements, instead of being forced entirely into the model parameters.

This division of labor has several consequences, which we evaluate in RLVR settings spanning math, code, and general reasoning tasks.

1.  **Fast textual adaptation improves data efficiency.** Fast weights incorporate task-level signal rapidly, so the system improves without waiting for slow parameter updates. Empirically, fast-slow training matches RL reward with up to $3\times$ fewer rollouts and consistently reaches a higher performance ceiling (Sections `\ref{sec:de}`{=latex} and `\ref{sec:asymptote}`{=latex}, Advantages 1 and 2).

2.  **Fast-slow training induces smaller slow-weight displacement.** With the textual channel carrying part of the adaptation, the parameters need not move as far from the base policy. At matched reward, our models have up to 70% lower KL to the base policy than RL-only baselines (Section `\ref{sec:displacement}`{=latex}, Advantage 3).

3.  **Fast-slow training preserves plasticity.** We test this by training on one task using RL-only and FST, then continuing training on a second task from the resulting checkpoints; fast-slow trained models adapt effectively in the second phase while RL-trained models collapse to near $0\%$, suggesting FST retains greater capacity for future learning (Section `\ref{sec:plasticity}`{=latex}, Advantage 4).

4.  **Fast-slow training enables continual learning.** We test our method in a setting where tasks change on the fly and observe that it adapts more quickly to changing objectives (Section `\ref{sec:cl}`{=latex}, Advantage 5).

Overall, our results suggest that effective LLM post-training should not be viewed as parameter learning followed by prompt tuning. Instead, it should be viewed as optimization over multiple adaptation channels, where fast textual weights and slow parametric weights are trained together to achieve rapid and task-specific improvements while preserving the generality and plasticity of the base model.

Preliminaries {#sec:prelim}
=============

#### Fast and slow weights: a general framework

We model the *slow weights* (model parameters) as $\theta$, and *fast weights* (textual scaffolds) as $\phi$ drawn from a discrete text space $\Sigma^\ast$. Given a query $x$, the system produces a response by sampling $$y \sim \pi_\theta(\cdot \mid x, \phi),
    \label{eq:policy}$$

where $\pi_\theta(y \mid x, \phi)$ denotes the policy induced by parameters $\theta$ when conditioned on textual context $\phi$ and query $x$. For a task distribution $\mathcal{D}$ and reward $r$, the natural joint objective is $$\max_{\theta,\, \phi}\;\; J(\theta, \phi) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x, \phi)}\!\big[\,r(x, y)\,\big].
    \label{eq:joint-base}$$ Each factor admits many concrete optimizers. On the slow side, $\theta$ can be updated by SFT, preference optimization [@dpo], or policy-gradient methods such as PPO [@ppo] and GRPO [@deepseekmath], frequently under verifiable rewards [@tulu3]. On the fast side, $\phi$ can be updated by automated prompt-optimization methods such as APE [@ape], OPRO [@opro], DSPy/MIPROv2 [@dspy; @miprov2], and GEPA [@gepa]. Our framework is agnostic to these choices; we instantiate it with RL with verifiable rewards (RLVR) for $\theta$ and reflective evolutionary prompt optimization (GEPA) for $\phi$.

#### Slow weights: RL with verifiable rewards

We follow the ScaleRL recipe [@scalerl] for slow-weight updates. The reward $r(x, y) \in [0, 1]$ is given by an automatic verifier on $(x, y)$ [@tulu3] (e.g., rule-based correctness for math, code, and science tasks). For each query $x$, the current policy generates a *group* of $G$ rollouts $\{y_i\}_{i=1}^G$ under the current $(\theta, \phi)$, from which group-relative advantages [@deepseekmath] are computed, $$A_i \;=\; \frac{r(x, y_i) \;-\; \bar r_g}{\sigma_g \;+\; \varepsilon}, \qquad \bar r_g \;=\; \tfrac{1}{G}\!\sum_{j=1}^{G} r(x, y_j), \qquad \sigma_g^2 \;=\; \tfrac{1}{G}\!\sum_{j=1}^{G}\!\big(r(x, y_j) - \bar r_g\big)^2, \label{eq:adv}$$ and normalized at the batch level. The policy is updated using the truncated importance-sampling REINFORCE objective [cispo]{.smallcaps} [@minimaxm1; @scalerl], $$\mathcal{L}_{\textsc{cispo}}(\theta) \;=\; -\,\mathbb{E}\!\left[\, \mathrm{sg}\!\big(\min(\rho_t,\,\tau)\big)\,\cdot\, A \,\cdot\, \nabla_\theta \log \pi_\theta(y_t \mid x, \phi, y_{<t})\,\right],
    \label{eq:cispo}$$ where $\rho_t = \pi_\theta(y_t \mid x, \phi, y_{<t}) / \pi_{\theta_{\text{old}}}(y_t \mid x, \phi, y_{<t})$ is the per-token importance ratio between the current and behavior policies, $\tau$ is a truncation threshold, $\mathrm{sg}(\cdot)$ is the stop-gradient operator, and the loss is aggregated at the prompt level. In conventional RLVR training, $\phi$ is fixed to a generic system prompt and only $\theta$ is updated.
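For concreteness, the two computations above can be sketched in a few lines of PyTorch. This is an illustrative sketch of Eq. `\eqref{eq:adv}`{=latex} and Eq. `\eqref{eq:cispo}`{=latex}, not our training code; tensor shapes and input names are assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (G,) scalar rewards for the G rollouts of one problem.
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

def cispo_loss(logp_new, logp_old, advantages, mask, tau: float = 3.0):
    # logp_new/logp_old: (G, T) per-token log-probs under the current/behavior
    # policy; advantages: (G,); mask: (G, T) marking response tokens.
    rho = torch.exp(logp_new - logp_old)          # per-token importance ratio
    weight = torch.clamp(rho, max=tau).detach()   # sg(min(rho, tau))
    per_token = -weight * advantages.unsqueeze(-1) * logp_new
    return (per_token * mask).sum() / mask.sum()  # aggregate over response tokens
```

Because the truncated ratio is wrapped in a stop-gradient, the gradient flows only through $\log \pi_\theta$, matching the REINFORCE-style form of Eq. `\eqref{eq:cispo}`{=latex}.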

#### Fast weights: reflective prompt evolution

We optimize the fast weights $\phi$ using GEPA [@gepa], a reflective evolutionary procedure over textual prompts $\phi \in \Sigma^\ast$. For a fixed policy $\pi_\theta$, the fitness of a prompt on instance $x$ is its expected reward, $$s(\phi; x) \;=\; 
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \phi)}
    \!\left[ r(x,y) \right].
    \label{eq:fitness}$$ GEPA maintains a population of prompts, uses rollouts to elicit natural-language critiques from a frozen reflection LM, and proposes textual mutations that improve performance on an anchor set from $\mathcal{D}$. Rather than returning a single prompt, GEPA retains a Pareto frontier of complementary prompts and returns the top-$m$ candidates, which we use as fast weights. We defer the details of parent selection, mutation, pruning, and prompt examples to Appendix `\ref{appendix:gepa}`{=latex}.

Fast-Slow Training (FST) {#sec:method}
========================

We now describe `\methodname`{=latex}, which jointly optimizes slow weights $\theta$ through RL and fast weights $\Phi$ through GEPA. The method maintains a population of $K$ textual prompts, $\Phi=\{\phi^{(1)},\ldots,\phi^{(K)}\}$, and optimizes $$\max_{\theta,\,\Phi}\;\;
    J(\theta,\Phi)
    =
    \mathbb{E}_{x\sim\mathcal{D},\;\phi\sim U(\Phi),\;y\sim\pi_\theta(\cdot\mid x,\phi)}
    \!\left[r(x,y)\right],
    \label{eq:joint-pop}$$ where $U(\Phi)$ is uniform over the prompt population. We keep a population rather than a single best prompt because GEPA returns a Pareto frontier of complementary prompts: different prompts perform best on different subsets of $\mathcal{D}$. Sampling across this frontier during RL gives the policy access to multiple conditioning behaviors and lets group-relative advantages compare both prompt-induced and sampling-induced variation on the same problem.

Training proceeds in cycles of $T$ slow-weight updates. At the start of cycle $c$, we pre-fetch the next $T$ RL batches and denote their union by the lookahead batch $\mathcal{L}_c$. We run GEPA with the current policy $\pi_{\theta_c}$ as the rollout model, a frozen reflection LM $\pi_{\mathrm{ref}}$ as the proposer, $\mathcal{L}_c$ or a fixed-size subset as the anchor set, and the previous population $\Phi_c$ as the seed. GEPA returns the top-$K$ candidates from its Pareto frontier, yielding the fast weights $\Phi_{c+1}$.

For the next $T$ steps, we update $\theta$ on minibatches from $\mathcal{L}_c$ while holding $\Phi_{c+1}$ fixed. For each problem $p$, we form a rollout group of size $G$ by sampling each prompt $\phi^{(k)}\in\Phi_{c+1}$ exactly $G/K$ times. That is, each group consists of $K$ mini-groups of $G/K$ rollouts, where all rollouts in a mini-group share the same prompt. Cumulatively, they are treated as one group for $p$; rewards are normalized by the per-problem statistics $(\bar r_g,\sigma_g)$ as in Eq. `\eqref{eq:adv}`{=latex}, mixing prompt and sampling variation within the same advantage computation. We then apply the [cispo]{.smallcaps} update in Eq. `\eqref{eq:cispo}`{=latex}. After $T$ updates, the procedure repeats with a new GEPA phase under the updated policy. Pseudocode of `\methodname{}`{=latex} is given in Appendix `\ref{app:algorithm}`{=latex}.
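A schematic Python sketch of one cycle, paralleling Algorithm `\ref{alg:method}`{=latex} in Appendix `\ref{app:algorithm}`{=latex}, is given below; `run_gepa`, `rollout`, and `rl_update` are placeholder hooks standing in for the components described above, not a released API.

```python
def fst_cycle(theta, Phi, data_stream, run_gepa, rollout, rl_update, T=6, K=8, G=8):
    # Pre-fetch the lookahead batch L_c for this cycle.
    lookahead = [next(data_stream) for _ in range(T)]
    anchors = [p for batch in lookahead for p in batch]

    # Fast phase: evolve the prompt population on the lookahead batch.
    Phi = run_gepa(theta, Phi, anchors, top_k=K)

    # Slow phase: T slow-weight updates with the population held fixed.
    for batch in lookahead:
        groups = []
        for problem in batch:
            rollouts = []
            for phi in Phi:                      # each prompt used G/K times
                rollouts += [rollout(theta, problem, phi) for _ in range(G // K)]
            groups.append(rollouts)              # one mixed-prompt group of size G
        theta = rl_update(theta, groups)         # CISPO with group-relative advantages
    return theta, Phi
```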

Advantages of Fast-Slow Training {#sec:benefits}
================================

The textual fast weights $\Phi$ carry part of the task-level information that RL would otherwise force into $\theta$, so the slow weights move less to reach the same reward. The downstream signature of this division of labor is consistent across our settings: training reaches matched reward more quickly, $\theta$ drifts less from the base policy at convergence, the model retains greater plasticity to adapt to subsequent tasks, and the method exhibits stronger continual-learning capability. We demonstrate each of these in the following sections.

#### Advantage 1: Fast-Slow Training Improves Data Efficiency {#sec:de}

```{=latex}
\centering
```
![**Data efficiency across three training families.** **Top row**: matched-step validation accuracy (running max, mean\@4); dash-dot GEPA-only reference rises from the step-0 base accuracy to the prompt-only ceiling within GEPA's inference budget. `\methodname{}`{=latex} reaches RL's running peak in substantially fewer training steps ($3.0\times$ on `CodeIO`, $1.4\times$ on `Math` (Polaris), $3.0\times$ on `HoVer-hard`). **Bottom row**: out-of-distribution accuracy averaged across cross-domain (and easy$\to$hard, where available) benchmarks for each family, evaluated with no GEPA prompt. `\methodname{}`{=latex} matches RL on OOD averages while reaching the in-distribution peak with substantially fewer steps.](paper_plots/fig_data_efficiency_combined_bars_avg_ood.png "fig:"){#fig:data-efficiency width="\\linewidth"} `\vspace{-4.5mm}`{=latex}

We evaluate `\methodname{}`{=latex} on three training families: code-output prediction (`CodeIO`) [@codeio; @reasoning_gym], math (`Polaris`) [@polaris], and multi-hop fact verification (`HoVer-hard`) [@hover]. All experiments use `Qwen3-8B` [@qwen3], except for the Math run, where we first SFT `Qwen3-8B-Base` on Nemotron data [@nemotronmath] because `Qwen3-8B` is already saturated on math benchmarks. `\methodname{}`{=latex} uses cycle length $T{=}6$ and $K \in \{4, 8\}$ candidate prompts per cycle. Training-time performance is measured on a held-out in-distribution validation set. RL is trained until step $1500$ or in-distribution saturation (whichever comes first); `\methodname{}`{=latex} is trained at least until it matches RL's running peak. Full hyperparameters and dataset details are deferred to Appendix `\ref{app:hyperparams}`{=latex}.

The matched-step training curves (Figure `\ref{fig:data-efficiency}`{=latex} Top) show that `\methodname{}`{=latex} reaches RL's running peak in substantially fewer optimizer steps: $\mathbf{3.0\times}$ fewer on `CodeIO`, $\mathbf{1.4\times}$ on `Math`, and $\mathbf{3.0\times}$ on `HoVer-hard`. Continuing past the crossover, `\methodname{}`{=latex}'s running peak also exceeds RL's on all three tasks.

To check that the in-distribution data efficiency does not come at the cost of out-of-distribution behavior, for each training task we compare Base, RL's final checkpoint, and `\methodname{}`{=latex}'s validation matched-performance checkpoint on a family of cross-domain and easy-to-hard generalization datasets, with no GEPA prompt at inference (Figure `\ref{fig:data-efficiency}`{=latex} bottom).

Across all three training families, the OOD average is essentially flat between `\methodname{}`{=latex} and RL ($-1.4$, $+0.3$, $+0.1$ pp on `CodeIO`, `Math (Polaris)`, and `HoVer-hard` respectively) despite `\methodname{}`{=latex} reaching the in-distribution peak in $1.4\times$ to $3.0\times$ fewer training samples. Concretely, `\methodname{}`{=latex} reaches $49.7\%$ versus RL's $51.1\%$ on `CodeIO`-trained OOD (avg over Polaris, Physics, LCB-Hard), $36.5\%$ versus $36.2\%$ on `Math`-trained OOD (avg over CodeIO, Physics, HMMT25), and $48.5\%$ versus $48.4\%$ on `HoVer-hard`-trained OOD (avg over Physics, CodeIO). All three are well above the corresponding Base reference ($41.0\%$, $26.9\%$, $47.5\%$). The in-distribution data-efficiency advantage therefore comes at no measurable cost in out-of-distribution behavior.

#### Advantage 2: Fast-Slow Training Raises the Performance Asymptote {#sec:asymptote}

```{=latex}
\centering
```
![**Performance asymptote** on `CodeIO`, `Math` (Polaris), and `HoVer-hard`. For each run we fit a 4-parameter sigmoid $R-R_0 = \frac{A - R_0}{1 + (C_{mid}/C)^B}$ to the validation-accuracy trajectory and annotate the upper asymptote $A$. `\methodname{}`{=latex}'s asymptote (green) is higher than RL's (blue) on all three tasks. Solid curves cover the fit window; dotted curves are extrapolation past the last training step.](paper_plots/fig_sigmoid_asymptotes_v2.png "fig:"){#fig:performance-asymptote width="\\linewidth"} `\vspace{-4.5mm}`{=latex}

Following @scalerl, we compare RL and `\methodname{}`{=latex} by the saturation level of their validation-accuracy curves rather than at any single training step. Unlike final-step or matched-step accuracy, which depends on where each run was stopped, the asymptote of a fitted curve reads off the level the run is converging to. For each (task, method) we fit a sigmoid curve $$\Delta R = \frac{A - R_0}{1 + (C_{mid}/C)^B}$$ to the validation-accuracy trajectory, where $A$ is the upper asymptote, $B$ a scaling exponent, $C_{mid}$ the step at which half of the asymptotic gain is reached, and $R_0$ the initial reward at step 0.
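The fit itself is a standard nonlinear least-squares problem; a sketch with `scipy.optimize.curve_fit` follows, where the trajectory arrays are hypothetical placeholders rather than our measured curves.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(C, A, B, C_mid, R0):
    # R(C) = R0 + (A - R0) / (1 + (C_mid / C)^B); A is the upper asymptote.
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical validation-accuracy trajectory (training step, accuracy).
steps = np.array([50, 100, 200, 400, 800, 1500], dtype=float)
accs = np.array([0.21, 0.27, 0.35, 0.41, 0.44, 0.45])

params, _ = curve_fit(sigmoid, steps, accs,
                      p0=[accs.max(), 1.0, float(np.median(steps)), accs[0]],
                      maxfev=20000)
A_hat = params[0]  # fitted upper asymptote A
```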

Across all three tasks (Figure `\ref{fig:performance-asymptote}`{=latex}), `\methodname{}`{=latex}'s fitted asymptote exceeds RL's: $A{=}47.4\%$ vs $43.0\%$ on `CodeIO` ($+4.4$pp), $49.2\%$ vs $46.4\%$ on `Math` (Polaris) ($+2.9$pp), and $25.0\%$ vs $17.3\%$ on `HoVer-hard` ($+7.7$pp). Pushing part of the task adaptation into the textual fast-weight channel $\Phi$ in addition to the slow weights $\theta$ helps the overall method converge to a higher accuracy ceiling than RL alone reaches.

#### Advantage 3: Fast-Slow Training Remains Close to the Base Model {#sec:displacement}

```{=latex}
\centering
```
![**Validation reward versus $\mathrm{KL}(\pi_{\text{train}}\,\|\,\pi_{\text{base}})$ trajectories** on `CodeIO`, `HoVer`, and `Physics`. Translucent markers are per-checkpoint measurements; the line is a rolling-mean smoothing along training step. At matched reward, `\methodname{}`{=latex} (green) sits to the left of RL (blue) on every task, reaching the same reward at a significantly lower KL from the base policy. Full figure in Appendix `\ref{app:polaris-kl}`{=latex}.](paper_plots/fig_kl_vs_acc_v2_no_polaris.png "fig:"){#fig:kl-vs-acc width="\\linewidth"} `\vspace{-4.5mm}`{=latex}

The KL divergence $\mathrm{KL}(\pi_{\text{train}}\,\|\,\pi_{\text{base}})$ between the post-trained policy and the base measures how far the slow weights have moved away from their base configuration; larger displacement is associated with reduced entropy, weaker OOD generalization, and lower plasticity for future tasks [@Frati_2024; @understandingpreventingcapacityloss; @Dohare2024; @mitigatingplasticitylosscontinual]. We track this directly: at each training checkpoint we compute token-level KL from the base on the held-out validation prompts and plot it against the same checkpoint's validation accuracy, for both `\methodname{}`{=latex} and RL across `Physics`, `Math` (Polaris), `HoVer`, and `CodeIO`.
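A sketch of the per-checkpoint measurement, assuming HuggingFace-style causal LMs: the Monte-Carlo estimate averages $\log \pi_{\text{train}} - \log \pi_{\text{base}}$ over response tokens sampled from the trained policy.

```python
import torch

@torch.no_grad()
def token_kl_to_base(policy, base, input_ids, response_mask):
    # input_ids: (B, T) prompts + responses sampled from the trained policy;
    # response_mask: (B, T) marking response tokens.
    lp_train = torch.log_softmax(policy(input_ids).logits, dim=-1)
    lp_base = torch.log_softmax(base(input_ids).logits, dim=-1)
    tok = input_ids[:, 1:].unsqueeze(-1)             # next-token targets
    diff = (lp_train[:, :-1].gather(-1, tok)
            - lp_base[:, :-1].gather(-1, tok)).squeeze(-1)
    mask = response_mask[:, 1:]
    return (diff * mask).sum() / mask.sum()          # mean per-token KL estimate
```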

Across all four tasks (Figure `\ref{fig:kl-vs-acc}`{=latex}), `\methodname{}`{=latex} achieves higher performance at lower KL than RL. @shenfeld2025rlrazor recently showed that on-policy RL is already biased toward KL-minimal solutions on a new task, and that the size of this shift correlates with how much prior knowledge is forgotten. Even relative to this strong baseline, `\methodname{}`{=latex} shifts the accuracy/KL frontier further left. We next demonstrate that this reduced displacement preserves plasticity and enables continual learning in the models trained with `\methodname{}`{=latex}.

#### Advantage 4: Fast-Slow Training Preserves Plasticity {#sec:plasticity}

```{=latex}
\centering
```
![**Plasticity probe**: starting from a Math (left) or Physics (right) checkpoint trained with either RL or `\methodname{}`{=latex}, we run a *fresh* RL pass on `HoVer-hard` and plot HoVer validation accuracy over 400 steps. Base init (dotted) is the no-prior-training reference. `\methodname{}`{=latex}-init (green) preserves more capacity for the new task than RL-init (blue) on both arms; on the Math arm, prior RL collapses `HoVer-hard` learnability to near-zero.](paper_plots/fig_plasticity_v2_bars.png "fig:"){#fig:plasticity width="\\linewidth"} `\vspace{-4.5mm}`{=latex}

Continued post-training has been observed to hamper a model's ability to learn future tasks, a phenomenon commonly called *plasticity loss* [@Dohare2024; @understandingpreventingcapacityloss; @Frati_2024; @mitigatingplasticitylosscontinual]: the slow weights become specialized to the trained task and lose responsiveness to gradient signals from new ones. We probe this directly in two phases. *Phase 1* trains a base model on task $X$ using either standard RL or `\methodname{}`{=latex}. *Phase 2* takes the Phase-1 checkpoint as initialization and runs standard RL on a different task $Y$. Throughout Phase 2 we track validation accuracy on $Y$. As a no-prior-training reference, we also run Phase 2 starting from the base model. We test $\texttt{Math} \to \texttt{HoVer-hard}$ and $\texttt{Physics} \to \texttt{HoVer-hard}$.

Figure `\ref{fig:plasticity}`{=latex} shows that in Phase 2, `\methodname{}`{=latex}-init outperforms RL-init throughout the 400-step probe in both settings. The contrast is sharpest in `Math` $\to$ `HoVer-hard`: prior RL collapses `HoVer-hard` learnability to near-zero; the RL-init curve drops to $\sim$$0\%$ within 40 steps and stays flat for the rest of the run. In contrast, `\methodname{}`{=latex}-init reaches performance close to the base-init reference. On `Physics` $\to$ `HoVer-hard`, `\methodname{}`{=latex}-init finishes at $24.2\%$ and is still climbing, versus RL-init's $19.9\%$ at step 400. This indicates that, unlike RL, `\methodname{}`{=latex} does not over-specialize the slow weights to task $X$: the resulting checkpoint retains capacity to learn a new task $Y$, exhibiting higher plasticity.

#### Advantage 5: Fast-Slow Training Improves Continual Learning {#sec:cl}

```{=latex}
\centering
```
![**Continual learning across `HoVer` $\to$ `CodeIO` $\to$ `Physics`**: a single uninterrupted training run that switches task every 200 steps. The y-axis is per-task validation accuracy normalized with respect to the peak accuracy reached across methods within each stage. `\methodname{}`{=latex} (solid) reaches near-peak on every stage; RL (dashed) acquires `HoVer` but completely stalls on `CodeIO` and only partially recovers on `Physics`.](paper_plots/fig_cl_v2_split.png "fig:"){#fig:cl width="\\linewidth"} `\vspace{-5mm}`{=latex}

A continual learning algorithm must keep absorbing new tasks as training proceeds, without losing the capacity to absorb later ones [@Dohare2024; @mitigatingplasticitylosscontinual; @understandingpreventingcapacityloss]. To test this we run a single uninterrupted training pass over three tasks, swapping the task every 200 steps: first 200 steps with `HoVer` (multi-hop fact verification), then `CodeIO` (code-output prediction), and finally `Physics` (multiple-choice from `sciknoweval`). In this setting, the same live training trajectory must absorb three task changes back-to-back, mirroring how a deployed model would actually be trained on a stream of incoming tasks.

Figure `\ref{fig:cl}`{=latex} shows evaluation on all three tasks at different points across the full 600-step training run, normalized within each stage so that $0$ is the stage's starting accuracy and $100\%$ is peak performance on the task across methods. `\methodname{}`{=latex} reaches near-peak in every stage while learning faster within each stage, mirroring the data-efficiency gap of Section `\ref{sec:de}`{=latex} Advantage 1. The contrast is sharpest in the second stage, `CodeIO`: across the full 200-step budget, RL barely lifts off its starting accuracy, peaking at $20.7\%$ mean\@16 (a $+2.5$pp gain over its $18.3\%$ stage-start), while `\methodname{}`{=latex} climbs to near-peak in just $\sim$80 steps (less than half the budget) and finishes the stage at $37.7\%$, a $+19.6$pp gain (a $\sim$$8\times$ within-stage acquisition rate over RL, and a $+17.0$pp absolute lead at step 400). This demonstrates that `\methodname{}`{=latex} is a promising continual-learning algorithm for LLMs: by routing task-level adaptation through both the textual fast-weight channel $\Phi$ in addition to $\theta$, the method remains capable of acquiring later tasks under continued optimization.
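For clarity, the per-stage normalization can be written as follows; this is a hypothetical helper, where `start` and `peak` denote the stage-start accuracy and the cross-method stage peak.

```python
def normalize_stage(accs, start, peak):
    # Map raw accuracy so the stage start is 0 and the cross-method peak is 100.
    return [100.0 * (a - start) / (peak - start) for a in accs]
```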

```{=latex}
\vspace{-1mm}
```
Why Does Fast-Slow Training Work? {#sec:ablations}
=================================

```{=latex}
\vspace{-2mm}
```
The empirical benefits in Section `\ref{sec:benefits}`{=latex} raise two questions: where exactly do the benefits come from, and which component does the majority of the work in which setting? The two studies below address these questions in isolation.

#### Observation 1: Fast Weights Acquire Task Signal Faster Than Slow Weights {#sec:abl:star-graph}

```{=latex}
\centering
```
![**Star Graph Search Task.** `\methodname{}`{=latex} escapes the zero-reward regime by step $\sim$50, well before RL begins to move signal at $\sim$250.](paper_plots/fig_star_graph_reward_v2_no_gepa.png "fig:"){#fig:star-graph width="\\linewidth"} `\vspace{-5mm}`{=latex}

To explore how `\methodname{}`{=latex} and RL behave when the base model obtains near-zero rewards, we run both `\methodname{}`{=latex} and an RL baseline on a synthetic star-graph reasoning task. Given a star-shaped graph in context, the goal is to find a path between two labeled nodes.

The two methods exhibit qualitatively different early-training behavior (Figure `\ref{fig:star-graph}`{=latex}). Parameter-only RL produces near-zero reward for roughly the first $\sim$300 steps before reward begins to rise. In contrast, `\methodname{}`{=latex} reaches measurable reward by around step $\sim$50, driven almost entirely by the first few GEPA cycles, before $\theta$ has had time to move appreciably. This is amplified by the ability of `\methodname{}`{=latex} to leverage *text feedback*: the task provides informative feedback on failures, detailing where exactly a submitted path went wrong. The interpretation is direct: slow weights are slow in the number of updates they require before they begin to move reward at all. The fast channel does not have this latency: GEPA can extract task structure from a handful of rollouts and inject it through $\Phi$ immediately. While GEPA alone only aids in solving a few problems early on, it provides enough gradient signal for FST to climb rewards quickly.

#### Observation 2: Jointly Optimizing Fast and Slow Weights for Reward Raises the Performance Ceiling {#sec:abl:explicit-distill}

```{=latex}
\centering
```
![**In-distribution gain decomposed into slow- and fast-weight contributions** (pass\@1, %). For each task, we evaluate four combinations: base or `\methodname{}`{=latex}-trained weights, with the original prompt or the `\methodname{}`{=latex}-evolved prompt. On `HoVer-hard` and `CodeIO`, both channels contribute and the joint cell (`\methodname{}`{=latex} weights + `\methodname{}`{=latex} prompt) dominates. On `Math (Polaris)`, almost all of the gain is carried by the slow weights.](paper_plots/fig_indist_quadrants.png "fig:"){#fig:indist-quadrants width="\\linewidth"} `\vspace{-1em}`{=latex}

Figure `\ref{fig:indist-quadrants}`{=latex} decomposes the in-distribution gain on each training task into slow-weight and fast-weight contributions, evaluating every combination of {base, `\methodname{}`{=latex}-trained weights} with {original prompt, `\methodname{}`{=latex}-evolved prompt}. On `HoVer-hard`, the slow channel alone lifts pass\@1 from $2.0\%$ to $11.6\%$, the fast channel alone lifts it to $10.6\%$, and combining the two reaches $21.2\%$. The same pattern holds on `CodeIO`, where the joint cell reaches $43.3\%$ versus $25.1\%$ (slow only) and $34.5\%$ (fast only); a finer-grained CodeIO decomposition appears in Figure `\ref{fig:fast-slow-decomp}`{=latex}. On `Math (Polaris)` almost all of the gain is carried by the slow weights ($20.0 \to 47.2$), consistent with the weaker instruction-following of the custom SFT base used for Polaris (see Appendix `\ref{app:polaris-kl}`{=latex}). `\methodname{}`{=latex} does not assume a fixed division of labor between the two channels, and lets each task draw on whichever channel pays off while still combining them when both contribute. In Appendix `\ref{app:explicit-distill}`{=latex} we further ask whether an explicit fast-to-slow distillation algorithm can substitute for direct RL on the slow weights. Our initial results using naive distillation suggest that it cannot. Distillation alone plateaus well below `\methodname{}`{=latex}, confirming that both channels need to optimize against reward jointly to lift the ceiling.

Discussion {#sec:discussion}
==========

In Sections `\ref{sec:benefits}`{=latex} and `\ref{sec:ablations}`{=latex}, we describe several benefits of training reasoning models with fast-slow updates. As models with finite capacity are trained across ever more diverse sets of environments, we argue that not all task-specific information need be distilled into the weights of the model. We observe some encouraging properties of the new paradigm. First and foremost, `\methodname{}`{=latex} maintains proximity to the base model, enabling a set of features suited to continual learning: plasticity and lack of forgetting. Second, the framework allows for data-efficient learning, in part due to the ability to learn from text feedback in the context update, overcoming the widely accepted 1-bit-per-episode information limit of binary RLVR. Finally, we observe healthy diversity during training due to a wide prompt pool. The distinction between context and weight optimization reflects a broader split between declarative and procedural knowledge, an important distinction for any general-purpose reasoner.

Limitations and future work {#sec:limitations}
---------------------------

While this study focuses primarily on investigating a particular instantiation of the fast-slow paradigm, taking CISPO and GEPA as highly capable methods for weight and prompt optimization, the framework is highly general. Studying the impact of changing the prompt or the weight optimizer is an interesting avenue for future work. Additionally, we believe there is potential to make the method more compute efficient and better reuse trajectories across prompt and weight optimization. Finally, though we present an initial exploration of applying this paradigm to distillation-based approaches in Figure `\ref{fig:beyond-fast-weights}`{=latex}, we believe a more comprehensive study of this direction to be an exciting avenue for future work.

```{=latex}
\vspace{-1mm}
```
Related Work {#sec:related}
============

```{=latex}
\vspace{-1mm}
```
#### Slow learning: RL for LLM reasoning.

Verifiable-reward LLM post-training writes every improvement into the model parameters via policy-gradient methods such as PPO, DPO, GRPO, and CISPO [@ppo; @dpo; @deepseekmath; @minimaxm1], used in most reasoning-RL pipelines [@tulu3; @scalerl; @qwen3]. Prolonged parametric adaptation shrinks output entropy, raises KL to the base policy, and erodes the model's ability to absorb new tasks, a phenomenon called *plasticity loss* [@Dohare2024; @understandingpreventingcapacityloss; @Frati_2024; @mitigatingplasticitylosscontinual; @shenfeld2025rlrazor]. We share this diagnosis but add a fast textual channel that absorbs much of the task-specific adaptation the slow weights would otherwise carry.

#### Fast learning: prompt and context optimization.

A parallel literature shows that substantial behavioral gains can come from editing the textual context alone, via discrete-prompt search [@shin2020autoprompt; @wen2023pez; @prasad2023grips], LLM-driven prompt proposers [@ape; @opro; @pryzant2023apo; @deng2022rlprompt], evolutionary methods [@fernando2023promptbreeder; @guo2024evoprompt; @agarwal2024promptwizard], compound LM programs [@dspy; @miprov2; @sordoni2023dln; @yuksekgonul2024textgrad; @cheng2024trace; @wu2025optimas], evolving agent context [@suzgun2025dynamiccheatsheet; @wang2024agentworkflowmemory; @xu2025amem; @zhang2025ace], and reflective self-feedback [@reflexion; @selfrefine; @gepa]. We use GEPA, which maintains a per-instance Pareto frontier of candidate prompts. All these methods are typically applied post-hoc to a frozen checkpoint, leaving the slow and fast channels disjoint in time.

#### Fast and slow weights: complementary learning systems.

The fast/slow decomposition predates deep learning, with roots in the neuroscience of complementary learning systems [@mcclelland1995cls; @kumaran2016cls] and a long line of fast-weight architectures and dual-timescale learners in neural networks [@hinton1987using; @schmidhuber1992learning; @ba2016usingfastweightsattend; @schlag2021lineartransformerssecretlyfast; @anthony2017thinking; @pham2022dualnet]. We adopt this decomposition for LLM post-training, instantiating the fast channel as an evolving population of textual prompts and the slow channel as the model parameters.

#### Modern fast--slow methods for LLM RL.

A small but growing body of work combines textual feedback with reward-driven weight updates. BetterTogether [@soylu2024bettertogether] alternates SFT with prompt optimization over a DSPy pipeline, albeit instantiated with prompt optimizers that don't use textual feedback; we extend this to a new training paradigm instantiated in verifiable-reward RL in which a Pareto-frontier prompt population created with textual feedback co-evolves with the policy. LANPO [@lanpo2025] interleaves language and numerical feedback via per-instance reflections; we instead maintain a cross-problem Pareto-frontier population. Recent work E-SPL [@zhang2026evolutionarypromptlearningreinforcement] explores prompt optimization and RL at a smaller scale with focus on performance rather than adaptation. mmGRPO [@mmgrpo2025] runs prompt optimization once and then RL on a DSPy program; we interleave the two across cycles. POPE [@pope2025blog] prefixes hard prompts with partial reference solutions; `\methodname{}`{=latex} learns task-level prompts that condition any rollout, and combining the two is a natural direction for future work.

Conclusion {#sec:conclusion}
==========

We present a fast-slow framework for LLM post-training that jointly optimizes the slow model parameters $\theta$ via RL and a fast textual-context population $\Phi$ via reflective prompt evolution, interleaving the two channels. Across CodeIO, Math, and HoVer-hard, this co-optimization reaches matched performance with $1.4$--$3\times$ fewer optimizer steps, attains a higher asymptote, and incurs lower KL displacement than RL alone, which in turn translates into preserved plasticity and stronger continual-learning behavior on new tasks. More broadly, our results suggest that effective post-training should not ask model parameters to absorb all forms of adaptation. Fast textual weights can capture task-specific and rapidly evolving improvements, while slow weights can focus on consolidating persistent behavior. This division of labor offers a path toward post-training methods that are more data-efficient, less destructive, and more amenable to continual learning.

Acknowledgements {#sec:acknowledgements}
================

The authors acknowledge the gracious support from Furiosa AI, Apple, NVIDIA, Macronix, the Mozilla team, Open Philanthropy / Coefficient Giving, and Amazon Research. Furthermore, we appreciate the support from Google Cloud, the Google TRC team, and Prof. David Patterson, along with support from the Google Gemini team and Divy Thakkar. Lakshya A Agrawal is also supported by a Laude Slingshot grant and a Laude residency provided by the Laude Institute, and an Amazon AI PhD Fellowship. We would like to thank Harman Singh, Nishanth Anand, and Reza Bayat for useful discussions related to continual learning. Finally, the authors would also like to thank Josh Sirota and the Eragon team for infrastructure and compute support. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

```{=latex}
\newpage
```
```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\appendix
```
GEPA {#appendix:gepa}
====

We optimize the fast weights $\phi$ using GEPA [@gepa], a reflective evolutionary procedure that searches the text space $\Sigma^\ast$ guided by a frozen *reflection LM* $\pi_{\text{ref}}$, a separate capable model that proposes textual mutations from natural-language critiques of rollouts. For a candidate $\phi$ and query $x$, define the per-instance fitness $$s(\phi; x) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \phi)}\!\big[\,r(x, y)\,\big]. \label{eq:fitness-app}$$ GEPA maintains a population $\mathcal{P}$ of candidate prompts whose fitness vectors $\big(s(\phi; x_1), \ldots, s(\phi; x_n)\big)$ on an anchor set $\{x_i\}_{i=1}^{n} \subset \mathcal{D}$ are tracked. One generation proceeds in four steps: (i) select a parent $\phi_p$ from the per-instance Pareto frontier of $\mathcal{P}$; (ii) sample a minibatch of rollouts under $\phi_p$ from $\pi_\theta$; (iii) elicit a child $\phi_c \leftarrow \pi_{\text{ref}}\!\big(\phi_p,\,\text{traces}\big)$ by having the reflection LM diagnose failures and propose a textual edit; (iv) evaluate $s(\phi_c; \cdot)$ on the anchor set, add $\phi_c$ to $\mathcal{P}$, and prune dominated candidates. After a fixed metric-call budget, GEPA returns the top-$m$ candidates of the resulting Pareto frontier. The frontier preserves diversity: different candidates are best on different slices of $\mathcal{D}$, and this diversity is precisely what allows the RL phase to exploit several complementary fast weights simultaneously. GEPA is related to other LLM-as-optimizer methods that use natural-language reflection or self-feedback [@reflexion; @selfrefine; @opro], and is distinguished by its per-instance Pareto population and explicit prompt-mutation operator.
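A condensed Python sketch of one generation follows; `score`, `reflect`, and `sample_traces` are placeholder callables, and the pruning below uses plain Pareto dominance over the anchor fitness vectors rather than GEPA's exact per-instance frontier bookkeeping.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str
    scores: list  # per-instance fitness vector (s(phi; x_1), ..., s(phi; x_n))

def dominated(a: Candidate, b: Candidate) -> bool:
    # a is dominated by b if b is at least as good everywhere and better somewhere.
    return all(y >= x for x, y in zip(a.scores, b.scores)) and b.scores != a.scores

def gepa_generation(population, anchors, score, reflect, sample_traces):
    frontier = [c for c in population
                if not any(dominated(c, o) for o in population if o is not c)]
    parent = random.choice(frontier)               # (i) select a parent
    traces = sample_traces(parent.prompt)          # (ii) rollouts under the parent
    child_prompt = reflect(parent.prompt, traces)  # (iii) reflective textual edit
    child = Candidate(child_prompt,
                      [score(child_prompt, x) for x in anchors])  # (iv) evaluate
    population = population + [child]
    return [c for c in population                  # prune dominated candidates
            if not any(dominated(c, o) for o in population if o is not c)]
```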

Algorithm pseudocode {#app:algorithm}
====================

We present the algorithm pseudocode in Algorithm `\ref{alg:method}`{=latex}.

```{=latex}
\begin{algorithm}[h]\caption{\methodname{}: interleaved RL with population-based prompt evolution.}
\label{alg:method}
\begin{algorithmic}[1]
\Require initial slow weights $\theta_0$; seed prompt $\phi_{\text{seed}}$; data stream $\mathcal{D}$; cycle length $T$; population size $K$; GRPO group size $G$ with $K \mid G$; reflection LM $\pi_{\text{ref}}$
\State $\Phi_0 \leftarrow \{\phi_{\text{seed}}\}$
\For{cycle $c = 0, 1, 2, \dots$}
    \State $\mathcal{L}_c \leftarrow$ pre-fetch the next $T$ minibatches from $\mathcal{D}$ \Comment{lookahead batch}
    \State $\Phi_{c+1} \leftarrow$ \textsc{Gepa}$\big(\pi_{\theta_c},\, \pi_{\text{ref}},\, \mathcal{L}_c,\, \Phi_c,\, K\big)$
    \For{$t = 1, \dots, T$}
        \State sample minibatch $\mathcal{B}_t \subset \mathcal{L}_c$
        \ForAll{$p \in \mathcal{B}_t$}
            \ForAll{$\phi^{(k)} \in \Phi_{c+1}$}
                \State assemble $G/K$ rollouts under $(p, \phi^{(k)})$, sampling from $\pi_\theta(\cdot \mid p, \phi^{(k)})$
            \EndFor
            \State place all $G$ rollouts in one group; compute group-relative advantages
        \EndFor
        \State update $\theta$ with $\mathcal{L}_{\textsc{cispo}}$ (Eq.~\eqref{eq:cispo})
    \EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
```
Star-graph dataset construction {#app:star-graph}
===============================

Star-graph search is a planning task introduced by @prakash2025zerorewards. We adopt their problem definition and procedurally generate our train and test splits.

#### Graph instance.

Each instance is parameterized by a triple $(d, p, n)$: source degree $d$, path length $p$, and node-pool size $n$. We sample without replacement from $\{0, 1, \ldots, n-1\}$ to draw the source $s$ and goal $g$ ($s \neq g$), then $p-2$ distinct intermediate nodes that together form the unique gold path $s \!\to\! v_1 \!\to\! \cdots \!\to\! v_{p-2} \!\to\! g$ (yielding $p-1$ edges of length 1 each). On top of the gold path we attach $d-1$ *decoy branches* rooted at $s$, each itself a chain of length $p$, with all decoy nodes drawn fresh from the unused pool so that no decoy intersects the gold path or another decoy. The full edge set is then shuffled uniformly at random and serialized as a flat space-separated list of comma-separated pairs. The graph is treated as undirected at scoring time but presented as a list with no ordering hint.
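A sketch of a generator consistent with this construction is below; the serialization details are our own illustration, and $d \cdot p \le n$ must hold for the node pool to suffice (it does for the headline setting, $25 \times 20 = 500$).

```python
import random

def make_star_graph(d=25, p=20, n=500, seed=0):
    # Requires d * p <= n so all nodes can be drawn without replacement.
    rng = random.Random(seed)
    nodes = rng.sample(range(n), d * p)          # gold path + (d-1) decoy chains
    gold, rest = nodes[:p], nodes[p:]            # gold: s, v_1, ..., v_{p-2}, g
    edges = list(zip(gold, gold[1:]))            # p-1 gold edges
    for arm in range(d - 1):
        chain = [gold[0]] + rest[arm * p:(arm + 1) * p]
        edges += list(zip(chain, chain[1:]))     # decoy chain of length p rooted at s
    rng.shuffle(edges)                           # no ordering hint
    graph = " ".join(f"{a},{b}" for a, b in edges)
    return graph, gold[0], gold[-1], gold        # serialized edges, s, g, gold path
```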

#### Why this is a hard exploration problem.

The source $s$ is the only node with degree $d$; every other node sits on a chain and has degree $2$ (its predecessor and successor along that arm). The first hop is therefore the only real branching decision: picking a decoy arm commits the model to a chain that never reaches $g$, and the path-listing output format gives no built-in way to backtrack. With $d=25$, a uniformly random first hop would succeed only $4\%$ of the time, and empirically Qwen3-4B-Instruct does worse than uniform: pass@$16$ is $0/50$ on the seed prompt before any RL update, because the model's strong path-finding prior is miscalibrated for this synthetic layout and concentrates probability on a wrong arm. This is the "RL stuck at zero" regime that motivates fast-weight prompt evolution.

#### Prompt template.

Every example is formatted with the verbatim template from the original reference implementation:

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> Given a bi-directional graph in the form of space separated edges, output a path from source node to the destination node in the form of comma separated integers.\
> For this question the graph is {graph}\
> The source node is {source}\
> The destination node is {destination}\
> Please reason step by step, and put your final answer within \\boxed{}.

The seed system prompt used during RL and as the initial GEPA candidate is:

> `\small`{=latex} *You are solving a graph path-finding task. You will be given a list of edges and a source and destination node. Output one valid path from source to destination. Inspect the source node's neighbors first, identify which neighbor leads to the destination via a sequence of valid edges, then commit to that branch. Each consecutive pair in your output path must be a valid edge in the graph. Put your final answer comma-separated inside boxed braces.*

#### Scoring.

The reward function extracts the contents of the last `\boxed{…}` from the post-`</think>` body of the rollout, strips whitespace, and applies an exact-match comparison against the gold path $s, v_1, \ldots, v_{p-2}, g$ rendered as a comma-separated string. The reward is $1.0$ on exact match and $0.0$ otherwise. The task admits no partial credit, so even a single wrong intermediate node zeros the reward.
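A sketch of this reward function follows; the regex and whitespace handling are illustrative assumptions, not the exact implementation.

```python
import re

def star_graph_reward(rollout: str, gold_path: list) -> float:
    body = rollout.split("</think>")[-1]              # post-</think> body only
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", body)
    if not boxed:
        return 0.0
    answer = boxed[-1].replace(" ", "").strip()       # contents of the last \boxed{...}
    gold = ",".join(str(v) for v in gold_path)
    return 1.0 if answer == gold else 0.0             # exact match, no partial credit
```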

#### Splits used in the paper.

We sweep difficulty by varying $(d, p, n)$. The headline experiments use $(d,p,n) = (25, 20, 500)$ with $10{,}000$ training examples and $200$ held-out test examples.

Hyperparameters and compute {#app:hyperparams}
===========================

This appendix lists all training hyperparameters needed to reproduce the numbers in the main paper. Settings shared across domains are described first; per-domain overrides follow in Tables `\ref{tab:hyper-rl}`{=latex}--`\ref{tab:hyper-gepa}`{=latex}.

#### Shared RL configuration.

All RL runs are GRPO [@deepseekmath] with the [cispo]{.smallcaps} surrogate loss [@minimaxm1; @scalerl] ($\text{clip\_low}=1.0$, $\text{clip\_high}=3.0$), advantage normalization by per-group standard deviation, and a small KL-to-reference penalty ($\text{coef}=10^{-3}$). The actor optimizer is AdamW (PyTorch defaults $\beta_1=0.9$, $\beta_2=0.999$, weight decay $0$) at learning rate $10^{-6}$ with a $10$-step linear warm-up; we use no learning-rate decay. Each RL step samples $G=8$ rollouts per problem with `train_batch_size`$\,=32$ problems (so $256$ rollouts per step), runs PPO updates with `ppo_mini_batch_size`$\,=32$, and uses tensor-parallel size $1$ for the rollout engine (vLLM). At evaluation time we report mean@$4$ over four rollouts per validation prompt at temperature $0.6$, top-$p$ $0.95$. We checkpoint every $50$ training steps and keep all checkpoints. All runs use Qwen3 *thinking* mode except star-graph (which uses the Instruct base, no thinking) and the no-thinking baselines reported in the main text.

#### Shared GEPA configuration.

GEPA cycles use the same settings for all domains: cycle length $T=6$ RL steps between GEPA optimizations (equivalently, $\texttt{warmstart\_steps}=\texttt{rl\_steps\_per\_cycle}=6$); all $K$ population prompts are scored on every question (`prompts_per_question`$\,=K$) with advantage grouping *Option B* (`advantage_grouping="question"`), so a problem's $G$ rollouts are split $G/K$ per prompt and a single group statistic $(\bar r_g, \sigma_g)$ is computed across all of them. The reflection LM is OpenAI `gpt-5.2`, accessed through LiteLLM. Per-cycle GEPA budgets are $\texttt{num\_eval\_examples}=192$ and $\texttt{max\_metric\_calls}=960$ across all domains except Polaris ($96$ and $1500$) and star-graph ($200$ and $960$). Candidate prompts are injected as a system message; for datasets whose raw prompt already contains a system message (HoVer, Physics, Math) we *merge* the GEPA prompt into the existing system role rather than stack a second one. Each GEPA cycle outputs a Pareto-frontier population of size $K$ which seeds the next cycle.

#### Per-domain overrides.

Tables `\ref{tab:hyper-rl}`{=latex} and `\ref{tab:hyper-gepa}`{=latex} summarize the parameters that vary across training domains.

```{=latex}
\centering
```
```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
::: {#tab:hyper-rl}
  Domain           Base model                           $L_{\text{ctx}}$ / $L_p$ / $L_r$        Batch   Rollouts/step  GPU util.
  ---------------- -------------------------------- ----------------------------------------- ------- --------------- -----------
  HoVer-hard       Qwen3-8B (think)                            18944 / 4096 / 8192                 32             256     0.6
  Physics          Qwen3-8B (think)                            18944 / 4096 / 8192                 32             256     0.7
  CodeIO           Qwen3-8B (think)                            18944 / 4096 / 8192                 32             256     0.6
  Math (Polaris)   Qwen3-8B-SFT$^\dagger$ (think)              12288 / 4096 / 8192                 64             512     0.7
  Star-graph       Qwen3-4B-Instruct (no think)      `\phantom{0}`{=latex}8192 / 4096 / 4096       32             256     0.6

  : Per-domain RL hyperparameters and base models. $L_{\text{ctx}}$/$L_p$/$L_r$ are the maximum context, prompt, and response lengths (tokens). Train batch is the number of problems per RL step; rollouts/step is $\text{batch} \times G$; GPU util. is the rollout engine's (vLLM) GPU memory utilization fraction.
:::

\
`\footnotesize `{=latex}$^\dagger$ Polaris uses our own continued-SFT base (Qwen3-8B base further SFT'd on Nemotron to recover math performance, since Qwen3-8B-Instruct is already saturated on math). For Polaris only, GRPO advantages are not normalized by group std (`norm_adv_by_std_in_grpo=false`) and the train batch is doubled to $64$ problems/step.

```{=latex}
\centering
```
```{=latex}
\small
```
::: {#tab:hyper-gepa}
  Domain            $K$   $G/K$   Cycle $T$                   Eval set               Metric calls Reflection LM
  ---------------- ----- ------- ----------- ------------------------- -------------------------- ---------------
  HoVer-hard         8      1         6                            192   `\phantom{0}`{=latex}960 `gpt-5.2`
  Physics            4      2         6                            192   `\phantom{0}`{=latex}960 `gpt-5.2`
  CodeIO             8      1         6                            192   `\phantom{0}`{=latex}960 `gpt-5.2`
  Math (Polaris)     4      2         6        `\phantom{0}`{=latex}96                       1500 `gpt-5.2`
  Star-graph         8      1         6                            200   `\phantom{0}`{=latex}960 `gpt-5.2`

  : Per-domain GEPA hyperparameters. $K$ is the population size; $G/K$ is the number of rollouts each candidate prompt produces per problem within an RL group. Eval set / metric calls are the per-cycle GEPA budgets.
:::

#### Compute and wall-clock.

All runs are submitted to a SLURM cluster with $8 \times$ H100 (80GB) per node. The headline runs in the main paper are single-node ($1\times8$ GPUs); the Polaris $K{=}8$ ablation is the only multi-node configuration ($4\times8 = 32$ GPUs). Mean per-RL-step wall-clock under the headline configuration is $\sim 60$ s for RL-only and $\sim 100$ s for `\methodname{}`{=latex} without rollout reuse (HoVer-hard, $K=8$). Enabling rollout reuse (Section `\ref{app:rollout-reuse}`{=latex}) brings the RL-step cost down to $\sim 47$ s, slightly faster than RL-only at the step level, because $\sim 1/3$ of the RL group's rollouts are served from the GEPA evaluation cache rather than freshly generated. This figure covers the RL training loop only: `\methodname{}`{=latex} additionally runs periodic GEPA cycles (rollouts for $K$ candidate prompts plus a reflection call), which add wall-clock time on top, so end-to-end a `\methodname{}`{=latex} run is more expensive than an RL-only run of equal step count. RL training continues until either $1500$ steps or validation-accuracy saturation, whichever comes first; a single headline RL+`\methodname{}`{=latex} run consumes on the order of $25$--$40$ H100-GPU-hours, of which the GEPA cycles are a sizable fraction. GEPA reflection calls to `gpt-5.2` are billed separately; at the per-cycle metric-call budget above this costs $\lesssim$ \$10 per training run.

Design ablations {#app:codeio-ablations}
================

This appendix collects the four design-choice ablations. All sweeps are run on CodeIO (Qwen3-8B thinking, light-recipe defaults from Appendix `\ref{app:hyperparams}`{=latex}) and reported at a matched RL step ($500$) on the held-out CodeIO mean@$4$. The unmodified RL-only baseline at the same step ($\mathbf{39.65\%}$) is included in every panel for reference.

```{=latex}
\centering
```
![CodeIO design ablations (val mean@$4$ at training step $500$). Sage bars mark the headline configuration in each panel; gray bars are the alternatives swept; the dashed line indicates RL-only at the same matched step. (**a**) Population size $K$. (**b**) Advantage baseline at $K{=}2$: per-prompt (Prompt baseline) vs. per-problem (Problem baseline). (**c**) Cycle length $T$ at $K{=}8$, Problem baseline. (**d**) Light vs. full GEPA recipe.](paper_plots/fig_codeio_ablation_panels.png){#fig:codeio-ablations width="\\linewidth"}

#### Population size $K$ (Fig. `\ref{fig:codeio-ablations}`{=latex}a).

We sweep $K \in \{1, 2, 4, 8\}$, holding the rest of the recipe at light / Problem baseline / cycle $T{=}6$. Every swept $K$ improves on RL-only ($39.65\%$), indicating that even a single optimized prompt carries useful task signal into RL. Performance is non-monotonic in $K$: $K{=}1$ already buys $+1.5$ pp ($41.10\%$), $K{=}2$ (Problem baseline) and $K{=}4$ are roughly tied at $\sim$$40.4$--$40.7\%$, and the largest gain comes at $K{=}8$ with $\mathbf{42.84\%}$ (cycle $T{=}6$), the headline configuration. Larger $K$ widens the within-group prompt distribution and gives GRPO advantages a richer signal to compare against, but only when paired with a tight cycle (see panel c): $K{=}8$ with cycle $T{=}12$ retains less than half of the gain ($41.13\%$).

#### Advantage baseline at $K{=}2$ (Fig. `\ref{fig:codeio-ablations}`{=latex}b).

With a fixed population of $K{=}2$ prompts, the choice of GRPO advantage baseline matters. The *Prompt baseline* (per-prompt: rollouts under each $\phi^{(k)}$ are normalized against their own group mean and std) yields $39.30\%$, slightly *below* RL-only. The *Problem baseline* (per-problem: all $G$ rollouts under both prompts share a single group statistic) yields $\mathbf{40.65\%}$, $+1.4$ pp over the Prompt baseline and $+1.0$ pp over RL-only. The intuition: the prompt baseline makes prompts compete only with themselves and discards the cross-prompt comparison entirely, while the problem baseline exposes the policy gradient to which prompt a stronger response came from on the same problem. The problem baseline is the default for all main-text experiments.
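A toy numerical illustration of the two baselines (rewards are made up; one problem, $G{=}8$ rollouts split across $K{=}2$ prompts):

```python
import numpy as np

# Hypothetical binary rewards for one problem under two prompts.
r = {"phi_0": np.array([1., 1., 0., 1.]),   # stronger prompt
     "phi_1": np.array([0., 0., 1., 0.])}   # weaker prompt

def std_adv(x, mu, sd):
    return (x - mu) / (sd + 1e-6)

# Prompt baseline: each prompt's rollouts are normalized against their
# own mean/std, so the cross-prompt comparison cancels out entirely.
adv_prompt = {k: std_adv(v, v.mean(), v.std()) for k, v in r.items()}

# Problem baseline: one statistic over all G rollouts for the problem.
all_r = np.concatenate(list(r.values()))
adv_problem = {k: std_adv(v, all_r.mean(), all_r.std()) for k, v in r.items()}

# Under the prompt baseline, phi_1's lone success is *amplified*
# (adv ~ +1.73) while phi_0's routine successes shrink (~ +0.58);
# under the problem baseline every success shares adv = +1, so the
# stronger prompt's more frequent successes dominate the gradient.
```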

#### Cycle length $T$ at $K{=}8$ (Fig. `\ref{fig:codeio-ablations}`{=latex}c).

Holding $K{=}8$ Problem-baseline fixed, we compare $T{=}6$ vs. $T{=}12$ RL steps between successive GEPA optimizations. Cycle $T{=}6$ reaches $\mathbf{42.84\%}$; doubling the cycle to $T{=}12$ drops mean@$4$ by $1.7$ pp to $41.13\%$, giving up more than half of the advantage over RL-only. This is the expected staleness story: as $\theta$ moves between GEPA cycles, the prompts in $\Phi$ become increasingly mistuned to the current policy, and the per-question rollout-group signal degrades. $T{=}6$ is short enough that the population $\Phi_c$ remains close to optimal across the cycle, though it requires twice as many GEPA optimizations as $T{=}12$.

#### Light vs. full GEPA recipe (Fig. `\ref{fig:codeio-ablations}`{=latex}d).

The \`\`light" recipe uses $K{=}4$ candidates, a smaller per-cycle GEPA budget (`num_eval_examples`$=192$, `max_metric_calls`$=960$), and a proposer prompt that asks `gpt-5.2` for incremental edits to the current best prompt rather than rewrites from scratch. The \`\`full" recipe is the original configuration from @gepa: $K{=}1$, doubled metric budget (`max_metric_calls`$=1922$), and an open-ended proposer that allows full rewrites. Within our setup the full recipe gives essentially no lift over RL-only ($39.85\%$ vs. $39.65\%$); the light recipe at the same $K{=}1$ yields $+1.5$ pp ($41.10\%$); and scaling light to $K{=}8$ gives a further $+1.7$ pp ($\mathbf{42.84\%}$). Two factors contribute. First, full's $K{=}1$ pinches the population channel that turns out to matter most in panel a. Second, the open-ended rewrite proposer is more prone to drift on a moving policy: incremental edits keep the population's induction biases coherent across cycles, while wholesale rewrites tend to discard structure each round. The combined effect is a recipe that runs on roughly half the GEPA budget and outperforms the original by $\sim$$3$ pp on this task.

Rollout reuse: same accuracy at lower wall-time and rollout cost {#app:rollout-reuse}
================================================================

```{=latex}
\centering
```
![Rollout reuse on HoVer-hard, training step $\le$300. **Left:** mean wall-time per RL step. `\methodname{}`{=latex}+reuse drops from $\sim$66 s to $\sim$47 s (about $30\%$ faster); the saving comes almost entirely from the generation phase. **Middle:** cumulative RL trajectories. The shaded region splits each step's $256$ trajectories into live-generated (lavender) and from-cache (amber). By step 300 about $\sim$17,000 of the $\sim$77,000 RL trajectories were served from cache. **Right:** HoVer-hard val accuracy mean\@4. `\methodname{}`{=latex}+reuse tracks `\methodname{}`{=latex}; the no-prompt RL baseline lags both.](paper_plots/fig_rollout_reuse.png){#fig:rollout-reuse width="\\linewidth"}

#### Why reuse is possible.

`\methodname{}`{=latex} interleaves GEPA prompt optimization with RL updates: every $T$ RL steps GEPA runs a cycle that scores each candidate prompt $\phi^{(k)}$ on a small evaluation pool drawn from the same training distribution the next RL phase will see. Each evaluation entry is a tuple $(p, \phi^{(k)}, y, r)$ -- a problem $p$, the prompt $\phi^{(k)}$ used, the sampled response $y$, and its scalar reward $r$. Without reuse, those tuples are thrown away once GEPA picks a new population $\Phi_{c+1}$; the next RL step then re-rolls $G$ fresh trajectories per problem from scratch under the new prompts. But because the population $\Phi_{c+1}$ is a Pareto-frontier subset of the prompts GEPA *already evaluated*, many of those discarded tuples are valid on-policy samples under one of the active prompts: exactly on-policy at the cycle boundary, and at most $T$ RL steps stale thereafter (see below). The natural optimization is to splice them into the RL group instead of regenerating them.

#### Cache mechanics.

We maintain a per-(problem, prompt) cache of GEPA evaluation tuples produced during the most recent cycle. When the RL phase forms its rollout group of size $\text{batch} \times G$ across the prompts in $\Phi_{c+1}$, it first claims any cached $(p, \phi, y, r)$ matching a (problem, prompt) slot and only generates the remaining slots live with vLLM. Cached and live trajectories are then concatenated into a single GRPO group; the policy gradient does not distinguish the two sources because both were sampled under prompts in the active population. The cache is cleared when GEPA produces the next $\Phi_{c+2}$, so reused trajectories are at most $T$ RL steps old ($T{=}6$ in our headline configuration).
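In sketch form (identifiers are ours; `generate_and_score` stands in for the live vLLM rollout path):

```python
# (problem_id, prompt_id) -> [(response, reward), ...]
cache: dict[tuple[int, int], list] = {}

def on_gepa_eval(problem_id: int, prompt_id: int, y: str, r: float) -> None:
    """Record every (problem, prompt, response, reward) tuple GEPA scores."""
    cache.setdefault((problem_id, prompt_id), []).append((y, r))

def form_rl_group(problem, prompts, G, generate_and_score):
    """Fill a GRPO group from the cache first; generate the rest live."""
    group, per_prompt = [], G // len(prompts)
    for prompt in prompts:
        slots = cache.get((problem.id, prompt.id), [])
        # Consume each cached tuple at most once.
        reused, cache[(problem.id, prompt.id)] = (
            slots[:per_prompt], slots[per_prompt:])
        group.extend(reused)
        for _ in range(per_prompt - len(reused)):
            group.append(generate_and_score(problem, prompt))
    # Cached and live rollouts share one GRPO group; the policy
    # gradient does not distinguish the two sources.
    return group

def on_new_population() -> None:
    cache.clear()   # so reused trajectories are at most T RL steps stale
```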

#### Empirical effect.

Figure `\ref{fig:rollout-reuse}`{=latex} measures the impact on HoVer-hard. We compare two otherwise identical `\methodname{}`{=latex} runs -- one with reuse enabled and one without -- through the first $300$ training steps, with a no-prompt RL baseline as a third reference. Wall-time per RL step (left panel) drops from $\sim$66 s to $\sim$47 s, a $\sim$$29\%$ speedup that comes almost entirely out of the generation phase: GEPA's own per-cycle cost is unchanged, but a substantial fraction of the following RL phase's rollouts no longer need to be sampled. The middle panel decomposes the cumulative trajectory budget into the live and from-cache components: by step $300$, roughly $17$k of the $77$k total trajectories ($\sim$$22\%$) were served from cache, with the cache hit rate concentrated in the first few RL steps after each GEPA cycle and tapering as the live group fills in problems not in GEPA's evaluation pool. The right panel shows that this comes at no accuracy cost: the reuse and non-reuse `\methodname{}`{=latex} curves are within sampling noise of each other on HoVer-hard val mean@$4$, and both lead the no-prompt RL baseline throughout.

KL-vs-reward, full four-task results {#app:polaris-kl}
====================================

```{=latex}
\centering
```
![Validation reward versus $\mathrm{KL}(\pi_{\mathrm{train}}\,\|\,\pi_{\mathrm{base}})$ on all four training tasks: `CodeIO`, `Math` (Polaris), `HoVer`, and `Physics`. Same axes, smoothing, and conventions as Figure `\ref{fig:kl-vs-acc}`{=latex} in the main text. The three-task variant in the main text drops the Polaris panel for the reasons discussed below.](paper_plots/fig_kl_vs_acc_v2.png){#fig:kl-vs-acc-full width="\\linewidth"}

The Polaris (math) trajectories in Figure `\ref{fig:kl-vs-acc-full}`{=latex} look qualitatively different from the other three tasks: RL and `\methodname{}`{=latex} sit on top of each other in $(\mathrm{KL}, \text{reward})$-space rather than `\methodname{}`{=latex} pulling the frontier to the left. We attribute this to the base model used for the Polaris runs. Unlike the other three tasks, which start from `Qwen3-8B` (an instruction-tuned model with strong format following), Polaris was trained on top of a model built by SFT'ing `Qwen3-8B-Base` on Nemotron, since the public `Qwen3-8B` is already saturated on Polaris. That custom SFT base has noticeably weaker instruction-following than the public Instruct checkpoint: the GEPA-evolved prompts that drive `\methodname{}`{=latex}'s KL gain on the other three tasks rely on the policy actually following format and self-checking instructions in the prompt, and on this base much of that signal is lost. The model still learns the math task from RL reward, so the reward axis behaves normally; it simply does not benefit from the prompt channel as strongly, and the two trajectories collapse onto each other. We expect the Polaris curve to look more like the other three with a stronger instruction-tuned base, but isolating that requires retraining and is left for future work.

Explicit fast-to-slow distillation {#app:explicit-distill}
==================================

The ceiling claim in Section `\ref{sec:abl:explicit-distill}`{=latex} raises a natural question: can the gains from the fast textual channel be folded into the parameters explicitly, without ever doing RL on the slow weights? We test this by replacing the slow-weight policy-gradient update with an on-policy reverse-KL distillation loss $$\label{eq:distill}
\mathcal{L}_{\text{distill}}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \!\left[\, \sum_{t=1}^{|y|} \mathrm{KL}\!\Big(\pi_\theta(\cdot \mid x, y_{<t}) \,\Big\Vert\, \pi_{\bar{\theta}}(\cdot \mid x, \phi, y_{<t}) \Big) \,\right],$$ where the teacher $\pi_{\bar{\theta}}(\cdot \mid x, \phi, \cdot)$ is the same model evaluated with a `\methodname{}`{=latex}-evolved fast-weight prompt $\phi$ and frozen parameters $\bar{\theta}$, and the student $\pi_\theta(\cdot \mid x, \cdot)$ is conditioned only on the problem $x$. Sampling $y$ on-policy from the student and minimizing the per-token reverse KL toward the teacher follows recent work on self-distillation [@hubotter2026sdpo]. We call this variant `\methodname{}`{=latex}-distill: the slow weights move only by chasing the teacher distribution induced by the fast-weight prompt, with no direct exposure to scalar reward.
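A minimal PyTorch sketch of the per-token loss in Eq. `\ref{eq:distill}`{=latex}. Tensor shapes are our assumptions, and, following common self-distillation practice, the sketch differentiates only through the student's token distributions and ignores the score-function term from the on-policy sampling of $y$.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher), averaged over response tokens.

    student_logits: [B, T, V] from pi_theta(. | x, y_<t)            (no prompt)
    teacher_logits: [B, T, V] from frozen pi_theta-bar(. | x, phi, y_<t)
    mask:           [B, T] float mask over the on-policy sample y's tokens.
    """
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits.detach(), dim=-1)  # frozen teacher
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v)) per token.
    kl = (logp_s.exp() * (logp_s - logp_t)).sum(-1)
    return (kl * mask).sum() / mask.sum()
```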

```{=latex}
\centering
```
![**Explicit fast-to-slow distillation on `HoVer`.** `\methodname{}`{=latex} (green) is compared to `\methodname{}`{=latex}-distill (orange), which updates $\theta$ only via the on-policy reverse-KL loss in Eq. `\ref{eq:distill}`{=latex} using a `\methodname{}`{=latex}-evolved prompt $\phi$ as the teacher. **Left:** Validation reward. `\methodname{}`{=latex}-distill rises above the prompt-only level by transferring fast-weight signal into the parameters across multiple updates, but plateaus well below `\methodname{}`{=latex}, which has both channels optimizing reward jointly. **Right:** Actor entropy. Both methods preserve healthy entropy throughout training.](paper_plots/fig_beyond_fast_weights_no_rl_no_gepa.png "fig:"){#fig:beyond-fast-weights width="\\linewidth"} `\vspace{-1em}`{=latex}

Figure `\ref{fig:beyond-fast-weights}`{=latex} (left) shows that `\methodname{}`{=latex}-distill iteratively transfers signal from the fast weights into the slow weights and rises above the prompt-only ceiling, but does not reach `\methodname{}`{=latex}'s reward. This is consistent with Observation 2: the fast channel alone is not enough to saturate the joint ceiling, and direct policy-gradient updates on $\theta$ against reward, run alongside the fast channel, account for the remaining gain. The entropy panel (right) also shows that the diversity benefits of training under a Pareto-frontier prompt population persist even when the slow update is purely distillation-based.

```{=latex}
\centering
```
![**Decomposing the Fast vs. Slow gain on CodeIO.** Step-matched (training step $650$) validation accuracy (pass@$1$, computed from $n{=}8$ rollouts) on the held-out CodeIO set. *Slow only* isolates the parametric channel (RL- or `\methodname{}`{=latex}-trained weights, evaluated without any GEPA prompt). *Fast only* isolates the textual channel (base weights with a GEPA-evolved prompt). *Slow + Fast* combines them. Both channels contribute, and `\methodname{}`{=latex} compounds the two: its slow weights are stronger than RL's even without the prompt ($25.1\%$ vs. $20.9\%$), the full Slow + Fast (FST) configuration reaches $43.1\%$, and an extra GEPA pass on top lifts it further to $43.3\%$.](paper_plots/fig_codeio_step650_match_mean8.png){#fig:fast-slow-decomp width="\\linewidth"}

Evolved GEPA prompts during `\methodname{}`{=latex} training {#app:gepa-prompts}
============================================================

This appendix shows, for each `\methodname{}`{=latex} training task, the *seed* prompt that GEPA started from and the *evolved* prompt at the matched-step `\methodname{}`{=latex} checkpoint used in §`\ref{sec:de}`{=latex}. The evolved prompt is the population's lead candidate (`gepa_state.json:current_prompts[0]`) of each Problem-baseline $K{=}\{4,8\}$ training run at that step, so it is a prompt that was *co-evolved with the slow-weight RL update*, not a prompt obtained by running GEPA on the untrained base policy in isolation. `\methodname{}`{=latex}'s rollout group at that step also draws from the rest of the $K$-prompt population, not just the single one shown here.

Across tasks, two patterns are consistent. First, GEPA almost never *rewrites* the seed. The $K{=}\{4,8\}$ Problem-baseline recipe constrains the proposer (`gpt-5.2`) to small targeted edits, so the evolved prompt keeps the seed's basic role and output format and adds layered guidance on top. Second, the additions are almost entirely *negative-example specific*: each block addresses a failure mode the proposer observed during reflection on a small batch of low-reward rollouts (e.g., "do not invent placeholder numbers" for CodeIO, "do not skip pages with parenthetical disambiguators" for HoVer-hard, "re-check off-by-one in process/recurrence problems" for Polaris). The result is a long instruction list that reads less like a generic system prompt and more like a checklist of don'ts assembled from the specific population of mistakes the policy was making.

CodeIO
------

```{=latex}
\promptbox{Seed Prompt -- CodeIO}{gepa_prompts/codeio_seed.txt}{task_codeio}
```
```{=latex}
\promptbox{Evolved Prompt -- CodeIO ($K{=}8$ Problem-baseline, training step 650)}{gepa_prompts/codeio_K8_optB_best_step650.txt}{task_codeio}
```
Math (Polaris)
--------------

```{=latex}
\promptbox{Seed Prompt -- Math (Polaris)}{gepa_prompts/polaris_seed.txt}{task_polaris}
```
```{=latex}
\promptbox{Evolved Prompt -- Math (Polaris) ($K{=}4$ Problem-baseline, training step 1050)}{gepa_prompts/polaris_K4_optB_best_step1050.txt}{task_polaris}
```
Physics
-------

```{=latex}
\promptbox{Seed Prompt -- Physics}{gepa_prompts/physics_seed.txt}{task_physics}
```
```{=latex}
\promptbox{Evolved Prompt -- Physics ($K{=}4$ Problem-baseline, training step 500)}{gepa_prompts/physics_K4_optB_best_step500.txt}{task_physics}
```
HoVer-hard
----------

```{=latex}
\promptbox{Seed Prompt -- HoVer-hard}{gepa_prompts/hover_strict_seed.txt}{task_hover}
```
```{=latex}
\promptbox{Evolved Prompt -- HoVer-hard ($K{=}8$ Problem-baseline, training step 550)}{gepa_prompts/hover_strict_K8_optB_best_step550.txt}{task_hover}
```
