Learning, Fast and Slow: Towards LLMs That Adapt Continually
Source
- Raw Markdown: paper_fast-slow-training-2026.md
- PDF: paper_fast-slow-training-2026.pdf
- Preprint: arXiv:2605.12484
- Official blog post: Learning, Fast and Slow: LLMs That Adapt Continually
- Official code page: Code Coming Soon - FST
- Official X thread: Kusha Sareen thread
Core Claim
Fast-Slow Training argues that LLM post-training should optimize two adaptation channels at once: model parameters as slow weights and textual context or prompts as fast weights. In the paper’s instantiation, slow weights are updated with CISPO-style RLVR while fast weights are updated with GEPA prompt optimization. The result is a model that reaches target-task gains with fewer slow-weight updates, less KL drift from the base policy, and better ability to learn later tasks.
Why It Matters
This source is the strongest current wiki example of multi-timescale LLM post-training. Instead of treating prompts as a deployment wrapper around a finished checkpoint, FST treats context as an actively trained state variable that can absorb task-specific information while the parameters consolidate more durable reasoning behavior.
For the wiki’s time-series and world-model framing, the useful analogy is an adaptive controller with both fast working state and slow persistent dynamics. FST is not a time-series method, but it is directly relevant to models that must keep learning from changing task streams without overwriting their base competencies.
Alex Context
Alex’s main takeaway is that FST is compelling because the prompt format a model wants to see may itself change during training. If the model’s weights are moving, it is natural that the best input interface also drifts; FST makes that prompt co-adaptation an explicit part of the training loop instead of holding the prompt fixed.
Alex also flags the “push information out of weights into context” claim as a useful idea to remember. The method gives task-specific details, formatting preferences, and transient lessons an editable place to live, rather than forcing all of them into persistent parameters. That makes the fast/slow split feel logically aligned with the goal of preserving base-model capabilities while still adapting.
Key Contributions
- Frames prompt/context optimization as fast weights and model parameters as slow weights for LLM adaptation.
- Instantiates the framework by interleaving GEPA prompt optimization with CISPO-style RLVR updates.
- Maintains a Pareto-frontier population of prompts rather than a single best prompt, so different contexts can specialize to different problem slices.
- Reports that FST reaches RL’s running-peak performance in fewer optimizer steps: 3.0x fewer on CodeIO, 1.4x fewer on Math/Polaris, and 3.0x fewer on HoVer-hard.
- Reports higher fitted performance asymptotes than RL on CodeIO, Math/Polaris, and HoVer-hard.
- Shows lower KL displacement from the base model at matched reward, with the abstract reporting up to 70% less KL divergence.
- Tests plasticity by continuing RL from FST- and RL-trained checkpoints onto HoVer-hard; FST-trained checkpoints remain more learnable.
- Tests a continual task stream HoVer → CodeIO → Physics; FST keeps acquiring later tasks while RL stalls on the task switch.
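The Pareto-frontier population from the contributions above can be illustrated with a small sketch, assuming each prompt has a tuple of per-slice scores. The dominance rule here is a generic one chosen for illustration, not necessarily GEPA's exact criterion, and the function name is an assumption.

```python
def pareto_front(prompt_scores):
    """Keep every prompt that is not dominated across problem slices.

    prompt_scores: dict mapping prompt id -> tuple of per-slice scores.
    A prompt is dominated if some other prompt scores >= on every slice
    and strictly > on at least one. (Illustrative rule, not the paper's.)
    """
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return {
        pid: scores
        for pid, scores in prompt_scores.items()
        if not any(dominates(other, scores)
                   for oid, other in prompt_scores.items() if oid != pid)
    }

# Two specialists survive; the prompt that is worse on both slices is dropped.
front = pareto_front({"a": (0.9, 0.2), "b": (0.3, 0.8), "c": (0.2, 0.2)})
```

Keeping the whole front, rather than a single argmax prompt, is what lets different contexts specialize to different problem slices.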
Main Takeaways
FST makes the “where should adaptation live?” question explicit. Task-specific instructions, failure-mode checklists, and domain rules can live in fast textual state instead of being permanently written into parameters. Slow weights can then move less and preserve more of the base model’s broad behavior.
The result is not prompt optimization alone. The paper’s decomposition shows both channels can contribute: on CodeIO and HoVer-hard, fast-only, slow-only, and fast+slow configurations differ, with the combined FST configuration strongest. Their naive FST-distill variant also plateaus below full FST, suggesting that merely distilling a prompt-conditioned teacher is not enough to replace reward-optimized slow weights.
The most important wiki link is to parameter-drift and retention. FST complements the ES forgetting source and DFT source: all three ask how to gain target behavior without over-specializing the base model, but FST’s answer is to route part of adaptation through mutable context rather than only changing the gradient estimator or the parameter-search method.
Method Notes
FST uses a population of prompts Phi, samples several prompts per problem, and computes one per-problem group-relative advantage across the resulting rollouts. This is important: the slow-weight update can see which prompt-conditioned rollout solved the same problem better, rather than normalizing each prompt in isolation.
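A minimal sketch of that pooled baseline, assuming a simple mean/std normalization (the function name and exact estimator are assumptions, not the paper's code): all rollouts for one problem share a single baseline, regardless of which prompt produced them.

```python
import statistics

def group_relative_advantages(rewards_by_prompt):
    """Per-problem group-relative advantage across prompt-conditioned rollouts.

    rewards_by_prompt: dict mapping prompt id -> list of rollout rewards for
    ONE problem. Pooling rollouts from every prompt into one baseline lets
    the slow-weight update see which prompt solved the problem better,
    instead of normalizing each prompt's rollouts in isolation.
    """
    all_rewards = [r for rs in rewards_by_prompt.values() for r in rs]
    mean = statistics.mean(all_rewards)
    std = statistics.pstdev(all_rewards) or 1.0  # guard against zero spread
    return {
        pid: [(r - mean) / std for r in rs]
        for pid, rs in rewards_by_prompt.items()
    }

# Rollouts from two prompts on the same problem share one pooled baseline,
# so prompt p2's consistent successes earn positive advantage relative to p1.
adv = group_relative_advantages({"p1": [1.0, 0.0], "p2": [1.0, 1.0]})
```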
GEPA cycles are frequent in the headline setup: cycle length T=6 RL steps. The appendix reports that longer cycles make prompts stale as the policy moves. The method is therefore a co-evolution loop, not a one-time prompt search before or after RL.
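The co-evolution loop can be sketched with toy stand-ins: a scalar "policy" as the slow weight, scalar "prompts" as fast weights, and a synthetic reward. None of this is the paper's implementation; it only shows the interleaving pattern of T slow-weight steps per fast-weight refresh.

```python
import random

random.seed(0)
TARGET = 3.0  # hidden task optimum in this toy setup

def reward(policy, prompt):
    # Higher reward when the policy/prompt combination is near the target.
    return -abs(TARGET - (policy + prompt))

def rl_step(policy, prompts, lr=0.1):
    # Slow-weight update: nudge the policy toward what the best
    # prompt-conditioned rollout achieved (a crude RLVR stand-in).
    best = max(prompts, key=lambda p: reward(policy, p))
    return policy + lr * (TARGET - (policy + best))

def gepa_cycle(prompts, policy, pool_size=4):
    # Fast-weight update: keep the prompts that work best for the CURRENT
    # policy and add mutated variants (a crude GEPA stand-in).
    ranked = sorted(prompts, key=lambda p: reward(policy, p), reverse=True)
    survivors = ranked[: pool_size // 2]
    mutants = [p + random.gauss(0, 0.2) for p in survivors]
    return survivors + mutants

def fast_slow_training(steps=60, cycle_len=6):
    policy, prompts = 0.0, [0.0, 0.5, 1.0, 1.5]
    for step in range(1, steps + 1):
        policy = rl_step(policy, prompts)          # slow weights every step
        if step % cycle_len == 0:                  # T=6 in the headline setup
            prompts = gepa_cycle(prompts, policy)  # fast weights each cycle
    return policy, prompts
```

Because the prompt pool is refreshed against the current policy every cycle, the fast weights track the moving slow weights instead of going stale, which is the failure mode the appendix reports for longer cycle lengths.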
The code page exists but currently says code is coming soon, so the local reproducibility anchor is the converted paper and its detailed appendix rather than an available repository.
Gotchas
- FST is more compute-expensive end-to-end than equal-step RL because GEPA cycles add rollout and reflection costs, even though rollout reuse can reduce the per-RL-step generation cost.
- The main evidence is on reasoning-style RLVR tasks: CodeIO, Math/Polaris, HoVer-hard, Physics, and a synthetic star-graph task. It should not be treated as direct evidence for passive time-series forecasting.
- The Polaris math setting is a weaker case for the prompt channel because the custom SFT base has weaker instruction following; the paper itself flags that its KL/reward curve behaves differently.
- The method depends on a capable prompt optimizer and a reflection LM; in the paper’s setup, GEPA reflection uses gpt-5.2.
- “Fast weights” are textual context, not a learned neural fast-weight tensor. The analogy is useful, but the implementation is prompt/context optimization.
Links Into The Wiki
- LLM Post-Training
- Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
- Dynamic Fine-Tuning
- World Models
Open Questions
- Can FST-style fast context be made durable across many task switches without prompt bloat or brittle failure-mode checklists?
- Which parts of fast context should eventually be distilled into slow weights, and which should remain external, editable state?
- Does the fast/slow split help time-series reasoning models adapt to changing domains, sensors, event streams, or operator policies without damaging base numeric priors?
- Can rollout reuse and shared evaluation make FST practical for larger models or longer task streams?