---
author:
- Joongwon (Daniel) Kim
- Winnie Yang
- Kelvin Niu
- Hongming Zhang
- Yun Zhu
- Eryk Helenowski
- Ruan Silva
- Zhengxing Chen
- Srini Iyer
- Manzil Zaheer
- Daniel Fried
- Hannaneh Hajishirzi
- Sanjeev Arora
- Gabriel Synnaeve
- Ruslan Salakhutdinov
- Anirudh Goyal
bibliography:
- paper.bib
date: '`\today`{=latex}'
title: 'Scaling Test-Time Compute for Agentic Coding'
---

```{=latex}
\newcommand{\printappendixtoc}{%
  \begingroup
  \par\medskip
  \begin{tcolorbox}[
    colback=metabg,
    colframe=metablue,
    arc=2mm,
    boxrule=0.8pt,
    left=6pt,right=6pt,top=4pt,bottom=4pt
  ]
    {\sffamily\bfseries Appendix -- Table of Contents}\par\smallskip
    {\small\@starttoc{atoc}}%
  \end{tcolorbox}
  \endgroup
}
```
```{=latex}
\newcommand{\appendixreturntotoc}{\hfill\hyperref[sec:appendix_toc]{\footnotesize[return to table of contents]}}
```
```{=latex}
\newcommand{\swe}{\texttt{SWE-Bench Verified}}
```
```{=latex}
\newcommand{\terminal}{\texttt{Terminal-Bench v2.0}}
```
```{=latex}
\newcommand{\mle}{\texttt{MLE-Bench}}
```
```{=latex}
\newcommand{\opus}{\texttt{Claude-4.5-Opus}}
```
```{=latex}
\newcommand{\pro}{\texttt{Gemini-3.1-Pro}}
```
```{=latex}
\newcommand{\sonnet}{\texttt{Claude-4.5-Sonnet}}
```
```{=latex}
\newcommand{\flash}{\texttt{Gemini-3-Flash}}
```
```{=latex}
\newcommand{\gpt}{\texttt{GPT-5-0825}}
```
```{=latex}
\maketitle
```
![ Main results of our agentic **PDR+RTV** test-time scaling method. We improve `\opus{}`{=latex} from 70.9% $\rightarrow$ 77.6% on `\swe{}`{=latex} (`mini-SWE-agent`) and from 47.0% $\rightarrow$ 59.1% on `\terminal{}`{=latex} (`Terminus 1`), and `\pro{}`{=latex} from 72.3% $\rightarrow$ 76.6% on `\swe{}`{=latex} and from 52.5% $\rightarrow$ 64.8% on `\terminal{}`{=latex}. ](img/cover_figure_colm_v1.png){#fig:teaser_figure}

Introduction {#sec:introduction}
============

Test-time scaling has become a major driver of progress in large language models  [@openaio1pro; @anthropicclaude4; @google2026geminideepthink; @feng2026towards]. Rather than treating a model as a one-shot predictor, one can allocate additional inference-time compute to sample multiple candidates, aggregate or compare them, or refine later attempts using information from earlier ones  [@self-consistency-paper; @self-refine-paper; @scaling-test-time-compute-paper; @pdr-paper]. This paradigm has proven effective in domains such as mathematical reasoning and single-turn code generation, where model outputs are compact enough to be manipulated directly in context  [@test-time-scaling-survey-paper; @s*-test-time-scaling-for-code-paper; @test-time-recursive-thinking-paper; @rsa-paper].

However, agentic coding is a different regime from these domains. Instead of producing a single bounded completion, the model interacts with an external environment over many steps: reading files, editing code, inspecting logs, executing commands, and responding to intermediate failures. Each attempt therefore produces a long trajectory rather than a short response. These trajectories contain useful signal, but they are also noisy and verbose, which makes them difficult to compare or reuse directly [@agentic-coding-trajectory-reduction-paper; @benchmark-test-time-scaling-agents-paper; @ehrlich2025codemonkeys; @antoniades2024swe; @ahmed2025otter]. As a result, standard test-time scaling methods do not transfer cleanly to long-horizon agents.

In this work, we argue that the central bottleneck in scaling agentic coding is *representation*. To scale test-time compute effectively, inference should operate not on raw trajectories, but on their compact representations. We therefore summarize each rollout into a structured artifact that captures its key hypotheses, decisions, progress, and failure modes while omitting low-value trace detail [@complexity-trap-summarization-paper; @summarization-based-context-management-paper]. These summaries become the interface through which prior attempts can be selected from and reused.

We apply these artifacts along two orthogonal dimensions of test-time scaling. First, we explore the *parallel* dimension [@scalable-best-of-n-paper; @scaling-test-time-compute-for-llm-agents-paper; @rl-training-for-solution-aggregation-paper] and introduce **Recursive Tournament Voting** (**RTV**) to select the strongest attempt from a population of rollouts without access to ground-truth outcomes (Section `\ref{sec:parallel_aggregation}`{=latex}). Second, we explore the *sequential* dimension [@recursive-introspection-paper; @swe-replay-paper; @test-time-recursive-thinking-paper] and adapt **Parallel-Distill-Refine** (**PDR**; @pdr-paper) to agentic coding by executing new rollouts conditioned on compact summaries distilled from prior rollouts (Section `\ref{sec:sequential_refinement}`{=latex}).

We perform a set of preliminary experiments to formulate our inference scaling method. First, we find that structured summaries outperform raw agentic trajectories as objects of comparison during parallel aggregation, showing that bounded summaries are a better substrate for rollout selection than full interaction logs. We also observe that recursive selection from small groups outperforms flatter selection from large groups (Section `\ref{sec:parallel_aggregation_ablations}`{=latex}). Moreover, we find that sequential refinement works better when new rollouts are conditioned on multiple prior summaries instead of a single prior attempt. We further observe that the quality of the refinement context is strongly correlated with the performance of the subsequent rollouts (Section `\ref{sec:sequential_refinement_ablations}`{=latex}).

Building on these results, we combine **RTV** and **PDR** into a unified test-time scaling recipe for agentic coding. Our recipe consistently improves over single-attempt baselines across frontier LLMs {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}, `\gpt{}`{=latex}} and agentic coding benchmarks {`\swe{}`{=latex}, `\terminal{}`{=latex}}. Using our method, on `\swe{}`{=latex}, `\opus{}`{=latex} improves from 70.9% to 77.6% and `\pro{}`{=latex} from 72.3% to 76.6%, and on `\terminal{}`{=latex}, `\opus{}`{=latex} improves from 46.9% to 59.1% and `\pro{}`{=latex} from 52.5% to 64.8% (Section `\ref{sec:main_benchmark_results}`{=latex}). Moreover, our method allows agents to aggregate and refine their solutions to complete tasks that were not solved by any of the 16 initial rollouts (Appendix `\ref{sec:new_solution_discoveries_appendix}`{=latex}), such as `\opus{}`{=latex} solving `gpt2-codegolf` and `\pro{}`{=latex} solving `large-scale-text-editing` in `\terminal{}`{=latex} (Appendix `\ref{sec:example_refined_rollout_trajectories}`{=latex}). Our results suggest that scaling long-horizon agentic systems depends not only on stronger base models, but also on high-quality trajectory representations and robust inference techniques.

Our contributions are threefold:

1.  We identify the *representation of prior agent experience* as a central scientific bottleneck in inference-time scaling for long-horizon agentic coding through a series of targeted experiments.

2.  We propose a unified framework in which compact structured rollout summaries serve as the interface for both *parallel aggregation* and *sequential refinement*, instantiated through **RTV** and agentic **PDR**.

3.  We show empirically that this representation-centric approach yields strong gains on challenging agentic coding benchmarks across frontier models.

Taken together, these results suggest a simple view of the problem: for long-horizon agents, inference-time scaling is fundamentally a problem of *representation, selection, and reuse*.

Methodology {#sec:methodology}
===========

Problem Formulation {#sec:problem_formulation}
-------------------

We scale test-time compute for *agentic coding* tasks, where a language model must solve a coding problem by interacting with an external bash environment over multiple steps. Given an agentic coding problem $P_{\text{in}}$, an agent $\Pi_{\text{LM}}$ produces a *rollout* $\mathcal{R}$, a trajectory consisting of a series of interleaved pairs of *actions* $\mathcal{A}_i$ taken in the bash environment $\mathcal{E}$, and the associated *observation* $\mathcal{O}_i$ returned by the terminal upon executing the agent's action in $\mathcal{E}$, for a given step index $i$. Each action consists of (1) the agent's *thought* $\mathcal{T}_i$ which details the agent's cognitive process, and (2) one or more bash commands $\mathcal{B}_i = \{b_1,\ldots,b_{|\mathcal{B}_i|}\}$ to be executed in the terminal, generated by $\Pi_{\text{LM}}$. Denoting the context accumulated from the previous step $i-1$ as $\mathcal{C}_{i-1}$ and the action generation prompt as $\mathcal{P}_{\text{action}}$, we have: $$\begin{aligned}
    \mathcal{A}_i = (\mathcal{T}_i, \mathcal{B}_i) = \Pi_{\text{LM}}\left[\mathcal{P}_{\text{action}}(P_{\text{in}};\mathcal{C}_{i-1})\right]
    \label{eq:action_generation}\end{aligned}$$ After parsing and executing the agent's bash command in the environment $\mathcal{E}$, we gather the observation $\mathcal{O}_i$ consisting of both the expected outputs and unexpected error messages: $$\begin{aligned}
    \mathcal{O}_i = \mathcal{E}(\mathcal{C}_{i-1};\mathcal{B}_i)
    \label{eq:env_transition}\end{aligned}$$ We then update $\Pi_{\text{LM}}$'s context $\mathcal{C}_{i-1}$ by appending the (action, observation) pair for step $i$: $$\begin{aligned}
    \mathcal{C}_i = [\mathcal{C}_{i-1};(\mathcal{A}_i, \mathcal{O}_i)]
    \label{eq:context_update}\end{aligned}$$ Our goal is to improve performance by scaling the number of rollouts rather than the number of tokens in a single generation. We treat a rollout as the primary scaling unit since it interleaves model outputs with observations, making token-level scaling difficult to interpret consistently across steps. Refer to Appendix `\ref{sec:example_rollout_trajectories_appendix}`{=latex} for examples of agentic coding rollout trajectories. We scale our rollouts along two orthogonal dimensions:

-   **Parallel scaling:** execute multiple rollouts in separate containers and run selection.

-   **Sequential scaling:** execute new sets of rollouts in freshly initialized containers while conditioning on information extracted from earlier attempts.
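
The rollout dynamics defined by the three update equations above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the implementation behind the reported results: `policy` and `environment` are hypothetical callables standing in for $\Pi_{\text{LM}}$ and $\mathcal{E}$, and both the `max_steps` cap and the convention that an empty command list ends the rollout are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    thought: str         # T_i: the agent's reasoning for this step
    commands: List[str]  # B_i: one or more bash commands


# One step of the trajectory: the pair (A_i, O_i).
Step = Tuple[Action, str]


def run_rollout(
    problem: str,
    policy: Callable[[str, List[Step]], Action],
    environment: Callable[[List[Step], List[str]], str],
    max_steps: int = 50,  # hypothetical step budget
) -> List[Step]:
    """Produce one rollout by alternating action generation and execution."""
    context: List[Step] = []  # C_0 is empty
    for _ in range(max_steps):
        # A_i = Pi_LM[P_action(P_in; C_{i-1})]
        action = policy(problem, context)
        if not action.commands:
            # Assumed stop convention: no commands means the agent is done.
            break
        # O_i = E(C_{i-1}; B_i)
        observation = environment(context, action.commands)
        # C_i = [C_{i-1}; (A_i, O_i)]
        context = context + [(action, observation)]
    return context
```

Under this view, parallel scaling amounts to calling `run_rollout` several times with independent environments, and sequential scaling changes only what the policy is conditioned on at the start.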

Rollout Summaries as Reusable Representations {#sec:summary_representation}
---------------------------------------------

A rollout trajectory is typically too long and noisy to compare or reuse directly. While it contains useful signal such as promising diagnoses, partial fixes or failure modes, it also contains low-signal detail such as repeated terminal output or dead-end local explorations [@agentic-coding-trajectory-reduction-paper; @benchmark-test-time-scaling-agents-paper; @ehrlich2025codemonkeys]. Given a rollout $\mathcal{R}_i$, we therefore first convert it into a compact, structured summary: $$\begin{aligned}
S_i = \Pi_{\mathrm{LM}}\!\left[\mathcal{P}_{\mathrm{sum}}(\mathcal{R}_i)\right]
\label{eq:summary_generation}\end{aligned}$$ where $\mathcal{P}_{\mathrm{sum}}$ is a summarization prompt. These summaries serve as the common interface for both components of our framework. In the parallel setting, they provide bounded objects over which multiple rollouts can be compared. In the sequential setting, they provide reusable context that can guide future rollouts without replaying the full interaction histories of the previous-iteration rollouts. Refer to Appendix `\ref{sec:example_summaries_appendix}`{=latex} for examples of the compact, structured summaries generated by $\Pi_{\mathrm{LM}}$.

Parallel Selection via Recursive Tournament Voting {#sec:parallel_aggregation}
--------------------------------------------------

![ Overview of **RTV** (Recursive Tournament Voting). **RTV** is our parallel aggregation technique designed for agentic tasks -- it (1) executes $N$ parallel, independent rollouts with the agent, (2) produces a structured summary of each rollout, and (3) divides the summaries into groups to compare the summaries and select a rollout from each group, repeating the process in a tournament-like manner until one rollout remains from the entire population of $N$ rollouts. ](img/rtv_overview_v3.png){#fig:parallel_thinking_overview}

We first study the *parallel* dimension of inference-time scaling. Suppose we execute $N$ parallel, independent rollouts with $\Pi_{\mathrm{LM}}$ for the given agentic coding problem $P_{\text{in}}$: $$\mathcal{P}_0 = \{\mathcal{R}_1,\ldots,\mathcal{R}_N\}$$ Our goal is to select the highest-quality rollout possible *without* access to any ground-truth outcomes, test cases or test samples. Note that for agentic coding tasks such as `\swe{}`{=latex} [@swe-bench-paper] and `\terminal{}`{=latex} [@terminal-bench-paper] that involve binary pass or fail metrics, the upper bound of the post-aggregation pass\@1 score is the pass\@N score, where the selection matches oracle performance.
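
To make the relationship between these metrics concrete, the sketch below computes the average pass\@1 score, the oracle pass\@N score, and the score of an arbitrary selection rule from a matrix of binary outcomes. The function names and the toy outcome matrix are illustrative assumptions, not data from our experiments.

```python
from typing import List


def pass_at_1(outcomes: List[List[bool]]) -> float:
    """Mean pass rate of one randomly chosen rollout, averaged over tasks."""
    return sum(sum(task) / len(task) for task in outcomes) / len(outcomes)


def pass_at_n(outcomes: List[List[bool]]) -> float:
    """Fraction of tasks where at least one of the N rollouts passes."""
    return sum(any(task) for task in outcomes) / len(outcomes)


def selected_pass_rate(outcomes: List[List[bool]], chosen: List[int]) -> float:
    """Pass rate when a selector picks rollout chosen[t] for task t."""
    return sum(task[c] for task, c in zip(outcomes, chosen)) / len(outcomes)


# Toy example: 3 tasks, N = 4 rollouts each (made-up outcomes).
toy = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, True],
]
oracle_choice = [0, 0, 0]  # picks a passing rollout whenever one exists
```

On any outcome matrix, no selection rule can exceed `pass_at_n`, and a selector that always finds a passing rollout when one exists attains it.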

We introduce **Recursive Tournament Voting** (**RTV**) to perform this selection in three stages. Refer to Figure `\ref{fig:parallel_thinking_overview}`{=latex} for a visual overview of **RTV**:

1.  Execute $N$ parallel rollouts.

2.  Summarize each rollout into a compact, structured summary.

3.  Recursively compare summaries in small groups until a single rollout remains.

After generating summaries $\{S_1,\ldots,S_N\}$, **RTV** executes stage 3 as a recursive procedure consisting of multiple *rounds* indexed with $r$, where each round reduces a population of rollouts into a subset by dividing the population into groups of size $G$ and selecting a rollout from each group. More formally, for each round $r$, **RTV** begins with a population of $N^{(r)}$ remaining rollouts such that $N^{(0)} = N$. Within each group indexed by $j$, **RTV** applies a comparison prompt $\mathcal{P}_{\mathrm{comp}}$ and aggregates $V$ comparison votes to select one summary: $$\begin{aligned}
g_j^{(r)} = \arg\max_{g \in \{1,\ldots,G\}}
\sum_{v=1}^{V}
\mathds{1}\!\left[
\Pi_{\mathrm{LM}}\!\left[
\mathcal{P}_{\mathrm{comp}}(P_{\mathrm{in}}; S_{(j,1)}^{(r)}, \ldots, S_{(j,G)}^{(r)})
\right] = g
\right],
\label{eq:rtv_selection}\end{aligned}$$ where $S_{(j,g)}^{(r)}$ denotes the $g$-th summary in group $j$ at round $r$.

The selected rollouts form the population for the next round: $$\begin{aligned}
\mathcal{P}_{r+1}
=
\left\{
\mathcal{R}_{k}
\;\middle|\;
k = (j-1)G + g_j^{(r)}
\text{ for } j \in \left\{1,\ldots,\left\lceil \frac{|\mathcal{P}_r|}{G}\right\rceil\right\}
\right\}.
\label{eq:rtv_population_update}\end{aligned}$$ After iterating Eqs. `\ref{eq:rtv_selection}`{=latex} and `\ref{eq:rtv_population_update}`{=latex}, **RTV** selects the final remaining rollout as its output.
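
The selection and population-update rules above can be sketched as follows. This is a minimal illustration rather than the implementation used for the reported results: `judge` is a hypothetical callable standing in for one call to $\Pi_{\mathrm{LM}}$ with $\mathcal{P}_{\mathrm{comp}}$, and the tie-breaking rule, the handling of a leftover singleton group, and the `keep` parameter (which stops the tournament early once at most `keep` rollouts remain) are assumptions.

```python
from collections import Counter
from typing import Callable, List, Sequence

# judge(problem, group_summaries) returns the position (0-based) of the
# summary it prefers within the group. Hypothetical interface.
Judge = Callable[[str, Sequence[str]], int]


def rtv_select(
    problem: str,
    summaries: List[str],
    judge: Judge,
    group_size: int = 2,  # G
    votes: int = 8,       # V
    keep: int = 1,        # stop once at most this many rollouts remain
) -> List[int]:
    """Return the indices (into `summaries`) of the surviving rollouts."""
    population = list(range(len(summaries)))
    while len(population) > keep:
        survivors = []
        for start in range(0, len(population), group_size):
            group = population[start:start + group_size]
            if len(group) == 1:
                # Assumption: a leftover singleton advances unchallenged.
                survivors.append(group[0])
                continue
            group_summaries = [summaries[i] for i in group]
            # Collect V independent comparison votes for this group.
            tally = Counter(judge(problem, group_summaries) for _ in range(votes))
            # Majority vote; ties go to the earliest position (assumption).
            best = max(range(len(group)), key=lambda g: (tally[g], -g))
            survivors.append(group[best])
        population = survivors
    return population
```

With $N=16$ and $G=2$, calling `rtv_select(..., keep=1)` runs four rounds (16, 8, 4, 2, 1), while `keep=4` stops after two rounds and returns four indices.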

Sequential Reuse via Parallel-Distill-Refine {#sec:sequential_refinement}
--------------------------------------------

We next study the *sequential* dimension of inference-time scaling. Here, the objective is to improve future rollouts using information extracted from earlier ones.

To implement sequential refinement, we apply the random-$K$ variant of **PDR** to an agentic setting. We refer to the execution of parallel rollouts as an *iteration*, such that iteration-1 rollouts are refined from iteration-0 rollouts. Starting from an iteration-$0$ population $$\mathcal{P}_0 = \{\mathcal{R}_1,\ldots,\mathcal{R}_N\},$$ we first summarize each rollout into $\{S_1^{(0)},\ldots,S_N^{(0)}\}$ where each $S_{i}^{(t)}$ denotes the summary of the $i$th rollout of iteration $t$. Then, for each rollout in the next iteration, we sample $K$ summaries from the previous iteration and use them as refinement context.

More formally, for rollout $i$ in iteration $t+1$, let $$\begin{aligned}
J_i^{(t+1)} \subseteq \{1,\ldots,N\}, \qquad |J_i^{(t+1)}| = K\end{aligned}$$ denote the set of indices associated with the selected summaries, and define the refinement context as the concatenation of the $K$ selected summaries: $$\begin{aligned}
\mathcal{C}_i^{(t+1)} = \{S_j^{(t)} \mid j \in J_i^{(t+1)}\}\end{aligned}$$ **PDR** then executes each next-iteration rollout in a *freshly initialized environment*, generating the first action conditioned on both the original problem and this distilled prior experience: $$\begin{aligned}
\mathcal{A}_{i,0}^{(t+1)}
=
\Pi_{\mathrm{LM}}
\!\left[
\mathcal{P}_{\mathrm{action}}(P_{\mathrm{in}};\mathcal{C}_i^{(t+1)})
\right],
\label{eq:pdr_first_action}\end{aligned}$$ where $\mathcal{A}_{i,m}^{(t+1)}$ is the $m$th action taken for the $i$th rollout in iteration $t+1$. Subsequent actions follow the usual rollout dynamics from Eqs. `\ref{eq:action_generation}`{=latex} to `\ref{eq:context_update}`{=latex}, except with the refinement context added. As the original implementation of **PDR** selects its final rollout by executing a single rollout during its final iteration [@pdr-paper], in this work we estimate its performance by averaging the scores of the $N$ final-iteration rollouts.
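
The random-$K$ construction of the refinement context can be sketched as below. This is an illustrative sketch under assumptions: the prompt layout in `build_refinement_prompt` is hypothetical, and whether a rollout may draw the summary with its own index is not specified above, so the sketch simply samples $K$ distinct indices uniformly.

```python
import random
from typing import List


def sample_refinement_contexts(
    prev_summaries: List[str],
    k: int,
    rng: random.Random,
) -> List[List[str]]:
    """For each of the N next-iteration rollouts, draw K distinct prior summaries."""
    n = len(prev_summaries)
    contexts = []
    for _ in range(n):
        # J_i^{(t+1)}: K distinct indices, sampled without replacement.
        indices = rng.sample(range(n), k)
        contexts.append([prev_summaries[j] for j in indices])
    return contexts


def build_refinement_prompt(problem: str, context: List[str]) -> str:
    """Hypothetical prompt layout: the problem followed by the K summaries."""
    parts = [f"Task:\n{problem}", "Summaries of earlier attempts:"]
    for idx, summary in enumerate(context, start=1):
        parts.append(f"[Attempt {idx}]\n{summary}")
    return "\n\n".join(parts)
```

Each next-iteration rollout then starts in a fresh environment with a prompt built from its own sampled context.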

![ **Unified PDR + RTV inference-time scaling recipe for agentic coding.** The agent first executes $N$ independent rollouts in parallel (iteration 0), and each rollout is converted into a compact structured summary. **RTV** is then applied to these summaries to select the top-$K$ summaries, which define the refinement context for the next iteration. Conditioned on this selected prior experience, the agent then executes a fresh set of $N$ rollouts in newly initialized environments (iteration 1). Finally, **RTV** aggregates the refined rollouts and returns the top-1 rollout. In this way, our method combines *sequential reuse* of prior experience via **PDR** with *parallel aggregation* among candidate attempts via **RTV**. ](img/pdr_rtv_overview.png){#fig:pdr_rtv_overview}

A unified pipeline: selection, then reuse, then selection {#sec:unifying_parallel_and_sequential_scaling}
---------------------------------------------------------

**RTV** and **PDR** address complementary aspects of inference-time scaling. **RTV** improves *selection* within a population of rollouts. **PDR** improves *reuse* of information across iterations. Our full recipe integrates these operators into a single pipeline. The full pipeline consists of:

1.  **Iteration 0**: execute $N$ independent rollouts and perform summarization;

2.  **Select-$K$**: apply **RTV** to obtain a high-quality subset of $K$ summaries;

3.  **Iteration 1**: execute $N$ fresh rollouts conditioned on the selected summaries;

4.  **Final RTV**: apply **RTV** to the refined rollouts and return the final top-$1$ rollout.

This design balances *exploitation* and *exploration*: we exploit by using **RTV** to narrow the search space to a subset of $K < N$ higher-quality rollouts, and we explore by keeping $K > 1$ rollouts, which maintains diversity and cross-pollinates potentially different solution approaches in the next-iteration rollouts. Moreover, instead of performing a single rollout for the final iteration as in vanilla **PDR**, we use the parallel selection capability of **RTV** to exploit the residual diversity among the refined rollouts and select the final rollout. Figure `\ref{fig:pdr_rtv_overview}`{=latex} illustrates the full pipeline of our **PDR + RTV** method.
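
The four pipeline steps listed above can be sketched end to end as follows. This is a minimal sketch, not the implementation behind the reported results: `run_rollout`, `summarize`, and `judge` are hypothetical callables standing in for the agent, the summarization prompt, and one comparison vote, and the sketch assumes that all $N$ iteration-1 rollouts share the same $K$ selected summaries.

```python
from collections import Counter
from typing import Any, Callable, List, Sequence

Judge = Callable[[str, Sequence[str]], int]


def tournament(problem: str, summaries: List[str], judge: Judge,
               keep: int, group_size: int = 2, votes: int = 8) -> List[int]:
    """Recursive small-group voting; returns indices of surviving summaries."""
    population = list(range(len(summaries)))
    while len(population) > keep:
        survivors = []
        for start in range(0, len(population), group_size):
            group = population[start:start + group_size]
            n_votes = votes if len(group) > 1 else 0  # singletons advance
            tally = Counter(judge(problem, [summaries[i] for i in group])
                            for _ in range(n_votes))
            best = max(range(len(group)), key=lambda g: (tally[g], -g))
            survivors.append(group[best])
        population = survivors
    return population


def pdr_rtv(problem: str,
            run_rollout: Callable[[str, List[str]], Any],
            summarize: Callable[[Any], str],
            judge: Judge,
            n: int = 16, k: int = 4,
            group_size: int = 2, votes: int = 8) -> Any:
    # Iteration 0: N independent rollouts with empty refinement context.
    rollouts0 = [run_rollout(problem, []) for _ in range(n)]
    summaries0 = [summarize(r) for r in rollouts0]
    # Select-K: stop the tournament once K summaries remain.
    top_k = tournament(problem, summaries0, judge, k, group_size, votes)
    context = [summaries0[i] for i in top_k]
    # Iteration 1: N fresh rollouts conditioned on the selected summaries.
    rollouts1 = [run_rollout(problem, context) for _ in range(n)]
    summaries1 = [summarize(r) for r in rollouts1]
    # Final RTV: return the single surviving refined rollout.
    winner = tournament(problem, summaries1, judge, 1, group_size, votes)[0]
    return rollouts1[winner]
```

In a real system each `run_rollout` call would execute in its own freshly initialized container, and the rollouts within an iteration would run in parallel rather than in a list comprehension.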

Results and Analyses {#sec:experiments_and_results}
====================

We present our results in three stages: (1) the *method ablations and design choices* that motivate our framework, (2) our *main results*, and (3) an analysis of the *sequential & parallel dynamics* of our method.

A consistent picture emerges from our experiments. First, compact structured summaries function as better interfaces than full rollout traces for both selection and reuse. Second, Recursive Tournament Voting (**RTV**) is effective at selecting among long-horizon agentic trajectories, and its benefits persist after refinement. Third, refinement contexts built from stronger rollouts lead to stronger next-iteration rollouts, indicating that the gains from sequential refinement are driven by the quality of the prior experience that is reused.

Experimental Setup {#sec:experimental_setup}
------------------

We evaluate on two agentic coding benchmarks -- `\swe{}`{=latex} [@swe-bench-paper] and `\terminal{}`{=latex} [@terminal-bench-paper] -- using `\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex} and `\gpt{}`{=latex}. Our main experiments in Section `\ref{sec:main_benchmark_results}`{=latex} use $N=16$ rollouts per iteration, $T=2$ iterations, $K=4$ summaries in the refinement context, a group size of $G=2$, and $V=8$ votes per comparison. For `\swe{}`{=latex}, we evaluate on the full test set. For `\terminal{}`{=latex}, we evaluate on 88 of the 89 tasks available to us. Refer to Appendix `\ref{sec:rollout_statistics_appendix}`{=latex} for more details on the model capabilities, Appendix `\ref{sec:model_comparisons_appendix}`{=latex} for head-to-head model comparisons, and Appendix `\ref{sec:example_initial_rollout_trajectories}`{=latex} for example rollout trajectories on these benchmarks.

Method Ablations and Design Choices {#sec:method_ablations}
-----------------------------------

We begin by presenting the ablations that motivate the final design of our framework. First, we study parallel aggregation in isolation. Then we turn to sequential refinement.

### Parallel Aggregation Ablations {#sec:parallel_aggregation_ablations}

We first identify the best substrate for comparing agentic rollouts during **RTV**. A direct approach is to compare the full rollout traces $\{\mathcal{R}_1,\ldots,\mathcal{R}_N\}$. Alternatively, the rollouts can be represented as compact, structured summaries $\{S_1,\ldots,S_N\}$. Figure `\ref{fig:rtv_summary_vs_rollout}`{=latex} shows the result of our investigation on `\swe{}`{=latex} and `\terminal{}`{=latex} using `\sonnet{}`{=latex} and `\flash{}`{=latex}.

![ Parallel aggregation results based on generating structured summaries (**[blue]{style="color: 0849BF"}**) vs. directly using rollout traces (**[orange]{style="color: FFAA00"}**) on `\swe{}`{=latex} and `\terminal{}`{=latex}. We find across `\flash{}`{=latex} and `\sonnet{}`{=latex} that using structured summaries as representations, instead of full rollout traces, leads to better final performances. ](img/rtv_summary_vs_rollouts_colm_v1.png){#fig:rtv_summary_vs_rollout}

![ **(Left)** Effect of scaling the group size $G\in\{16, 8, 4, 2\}$ for parallel aggregation, using `\flash{}`{=latex} on `\swe{}`{=latex} and `\terminal{}`{=latex}, respectively. $G=16$ requires a single round of a 16-way comparison. $G=8$ yields two remaining candidates after the first round of 8-way comparisons, followed by a pairwise comparison. $G=4$ requires two rounds of 4-way comparisons, and $G=2$ requires four successive rounds of pairwise comparisons. We find that $G=2$ yields the best performance. **(Right)** Effect of scaling the vote count $V \in \{1,2,4,8,16\}$, on the final **RTV** performance using `\flash{}`{=latex}. For both `\swe{}`{=latex} and `\terminal{}`{=latex}, we observe noticeable improvements as we scale $V$, the number of candidate comparison votes aggregated for each group. ](img/rtv_parameter_search_colm_v1.png){#fig:rtv_parameter_search}

```{=latex}
\begin{findingbox}
\textbf{Finding 1.} Compact, structured summaries function as better substrates than full rollout trajectories for selecting among long-horizon agentic rollouts.
\end{findingbox}
```
Across both models and both benchmarks, we find that comparing compact, structured summaries consistently outperforms directly comparing full rollout trajectories: full trajectories are too long and noisy to serve as reliable objects of comparison, whereas bounded structured summaries preserve the decisive information while discarding low-signal detail. More specifically, the tournament eliminates the "easier" pairs in the earlier rounds and is left with more difficult candidates whose accurate selection requires attention to nuanced details. Using compact summaries in the final round (Round 4 in Figure `\ref{fig:rtv_summary_vs_rollout}`{=latex}) therefore provides a decisive advantage over full rollout traces for selecting higher-quality rollouts.

We next investigate the architecture of **RTV**. A natural baseline is to load all $N$ summaries into context and perform a single comparison. Alternatively, our proposed **RTV** architecture decomposes the selection procedure into repeated smaller comparisons. Figure `\ref{fig:rtv_parameter_search}`{=latex} reports the effect of varying the group size $G \in \{16, 8, 4, 2\}$ and the number of candidate comparison votes $V \in \{1,2,4,8,16\}$.
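
As a worked illustration of what each configuration costs, the sketch below counts the tournament rounds and the number of comparison calls for a given $N$, $G$, and $V$. It assumes one model call per vote per group, skips groups that contain a single leftover candidate, and ignores the cost of summarization; these accounting choices are assumptions made for the example.

```python
import math
from typing import Tuple


def rtv_cost(n: int, group_size: int, votes: int) -> Tuple[int, int]:
    """Return (number of rounds, number of comparison calls) for one tournament."""
    rounds = 0
    comparisons = 0
    population = n
    while population > 1:
        full_groups, remainder = divmod(population, group_size)
        # A leftover group needs a comparison only if it has 2+ candidates.
        groups_compared = full_groups + (1 if remainder >= 2 else 0)
        comparisons += groups_compared * votes
        # Every group (including a leftover singleton) sends one survivor on.
        population = math.ceil(population / group_size)
        rounds += 1
    return rounds, comparisons


for g in (16, 8, 4, 2):
    print(g, rtv_cost(16, g, votes=8))
# 16 (1, 8)
# 8 (2, 24)
# 4 (2, 40)
# 2 (4, 120)
```

Under these assumptions, the pairwise configuration ($G=2$) makes the most comparison calls, but each call involves only two summaries.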

```{=latex}
\begin{findingbox}
\textbf{Finding 2.} Selection of agentic rollouts is best achieved through recursive small-group comparisons with vote aggregation, rather than flat comparisons over many candidates.
\end{findingbox}
```
Figure `\ref{fig:rtv_parameter_search}`{=latex} **(Left)** shows that smaller-group recursive comparison performs better than flatter selection schemes, with pairwise comparisons ($G=2$) yielding the strongest results. This suggests that selecting among long-horizon trajectories is easier when decomposed into a sequence of local decisions rather than a single global ranking over many candidates. Meanwhile, Figure `\ref{fig:rtv_parameter_search}`{=latex} **(Right)** shows that vote aggregation improves the reliability of these local decisions, with clear gains as $V$ increases and diminishing returns beginning around $V=8$. These observations motivate our default **RTV** configuration of $G=2$ and $V=8$.

![ Main **RTV** results with `\opus`{=latex}, `\pro`{=latex}, `\sonnet`{=latex}, `\flash{}`{=latex} and `\gpt{}`{=latex} on `\swe{}`{=latex} and `\terminal{}`{=latex} using $N=16$, $G=2$, $V=8$. **RTV** yields notable performance gains, with pass\@1 scores improving on average by 5-6% on `\swe{}`{=latex} and 8-12% on `\terminal{}`{=latex}. ](img/rtv_main_results_colm_v1.png){#fig:rtv_main_results}

We then evaluate **RTV** as a standalone selection mechanism with $N=16$, with the results in Figure `\ref{fig:rtv_main_results}`{=latex}. Across all five models, **RTV** improves performance over the average rollout in the initial population on both `\swe{}`{=latex} and `\terminal{}`{=latex}, with the largest gains in `\terminal{}`{=latex}. For example, `\sonnet{}`{=latex} improves from $67.4\%\rightarrow73.6\%$ on `\swe{}`{=latex} and from $40.6\%\rightarrow54.6\%$ on `\terminal{}`{=latex}. These results show that recursive selection over structured summaries is already a strong test-time scaling method in isolation, even without performing sequential refinement.

### Sequential Refinement Ablations {#sec:sequential_refinement_ablations}

We next probe our design choices for sequential refinement. First, we study how the composition of prior rollouts in the refinement context affects the next-iteration rollouts. Table `\ref{tab:prelim_avg_pass_rates}`{=latex} compares three methods on 100 `\swe{}`{=latex} tasks using `\sonnet{}`{=latex} and `\pro{}`{=latex}: {*single-rollout*, *random-$K$*, *select-$K$*}-refinement. For *single-rollout*, we take $N$ parallel rollouts and refine each iteration-0 rollout by using its own structured summary to execute the associated iteration-1 rollout. For *random-$K$*, we follow **PDR** and randomly sample $K$ summaries from the previous-iteration rollouts to provide as refinement context for executing each next-iteration rollout. For *select-$K$*, we run **RTV** on the $N$ parallel rollouts via tournament voting until $K$ rollouts remain, and use their summaries as the refinement context. We use $N=16$ rollouts for each iteration, and $K=4$ prior rollouts for building the refinement context.

```{=latex}
\setlength{\tabcolsep}{3pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
::: {#tab:prelim_avg_pass_rates}
  **Model**               **Single-rollout, Iter 0**   **Single-rollout, Iter 1**   **Random-$K$, Iter 0**   **Random-$K$, Iter 1**   **Select-$K$, Iter 0**   **Select-$K$, Iter 1**
  --------------------- -------------------------- -------------------------- ---------------------- ---------------------- ---------------------- ----------------------
  `Claude-4.5-Sonnet`                        69.87                      70.87                  69.87                  75.06                  69.87              **78.06**
  `Gemini-3.1-Pro`                           72.69                      73.75                  72.69                  76.94                  72.69              **79.25**
  --------------------- -------------------------- -------------------------- ---------------------- ---------------------- ---------------------- ----------------------

  : Comparison of refinement methods on 100 randomly sampled tasks from `\swe`{=latex}, based on the average pass rate (%) across $N=16$ rollouts for iterations 0 and 1 for `\sonnet{}`{=latex} and `\pro{}`{=latex}, using *Single-rollout* vs. *Random-K rollout* vs. *Select-K rollout* (selection via **RTV**) refinement setups ($K=4$).
:::

![ Pass count distributions for iterations 0 & 1 under single-rollout vs. $K$ rollout refinement for `\sonnet{}`{=latex} and `\pro{}`{=latex}, measured across 100 randomly sampled `\swe{}`{=latex} tasks. Using $K$ randomly sampled summaries from previous iteration rollouts outperforms using a single summary for the refinement context. ](img/pdr_pass_count_random_k.png){#fig:pdr_pass_count_random_k}

```{=latex}
\begin{findingbox}
\textbf{Finding 3.} Sequential refinement benefits from conditioning on multiple prior rollouts, and improves further when those rollouts are \emph{selected} rather than sampled at random.
\end{findingbox}
```
For building the refinement context, using $K$=4 prior rollouts even via random selection clearly outperforms using a single prior rollout. For example, the average pass\@1 score of `\pro{}`{=latex} improves only from $72.69\%\rightarrow73.75\%$ under single-rollout refinement, but improves from $72.69\%\rightarrow76.94\%$ under random-$K$ refinement. Moreover, selecting the $K$=4 rollouts using **RTV** (select-$K$) further improves performance, reaching $79.25\%$ for `\pro{}`{=latex} and $78.06\%$ for `\sonnet{}`{=latex} during iteration 1.

Figure `\ref{fig:pdr_pass_count_random_k}`{=latex} provides a more detailed view by comparing the pass-count distributions under single-rollout and random-$K$ refinement. For both `\sonnet{}`{=latex} and `\pro{}`{=latex}, the distribution under random-$K$ refinement shifts toward larger numbers of passing rollouts relative to single-rollout refinement. For example, using `\sonnet{}`{=latex} with a single rollout as refinement context yields 40/100 tasks with 16/16 passes, while using random-$K$ rollouts as refinement context yields 51/100 such tasks. These results indicate that using multiple prior trajectories improves not only the mean but the full distribution of next-iteration outcomes.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{5pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.15}
```
```{=latex}
\resizebox{\linewidth}{!}{%
    \begin{tabular}{l|ccccc|ccccc}
        \hline
        \textbf{Model} & \multicolumn{5}{c|}{\textbf{PDR (random-$K$) refinement}} & \multicolumn{5}{c}{\textbf{PDR + RTV (select-$K$) refinement}} \\
        \cline{2-11}
        & \multicolumn{5}{c|}{\textbf{\# pass out of $K$ -- pass rate (\# avg. tasks)}} & \multicolumn{5}{c}{\textbf{\# pass out of $K$ -- pass rate (\# avg. tasks)}} \\
        \cline{2-11}
        & \textbf{0/4} & \textbf{1/4} & \textbf{2/4} & \textbf{3/4} & \textbf{4/4} & \textbf{0/4} & \textbf{1/4} & \textbf{2/4} & \textbf{3/4} & \textbf{4/4} \\
        \hline
        \texttt{Claude-4.5-Sonnet} & 2.5 (17.8)  & 47.1 (6.5) & 80.4 (6.7) & 88.8 (16.2) & 98.1 (52.8) & 1.2 (16) & 12.5 (3) & - (0) & 90.5 (19) & 97.3 (62) \\
        \texttt{Gemini-3.1-Pro}    & 1.9 (16.2)  & 50.0 (6.4) & 66.4 (7.2) & 90.6 (10.6) & 99.1 (58.5) & 2.2 (17) & 40.6 (2) & 52.1 (3) & 81.2 (7) & 99.7 (71) \\
        \hline
    \end{tabular}%
    }
```
We also study how the quality of the refinement context affects next-iteration rollouts. Table `\ref{tab:pdr_k4_passcount_vs_iter1_pass1}`{=latex} stratifies the results for random-$K$ and select-$K$ refinement by how many of the $K$=4 context rollouts succeed. In both setups, the success rate of the iteration-1 rollouts increases monotonically with the number of passing context rollouts, rising from near zero for 0/4 contexts to above 97% for 4/4 contexts. The table also shows that **RTV** places more tasks in the 4/4 bucket than random selection does, with 62 vs. 52.8 tasks for `\sonnet{}`{=latex} and 71 vs. 58.5 tasks for `\pro{}`{=latex}. Figure `\ref{fig:pdr_top_k_pass_rate_analysis}`{=latex} in Appendix `\ref{sec:context_quality_analysis_appendix}`{=latex} visualizes the stratified pass rate distributions. The distributions shift rightward with increasing context pass rate for both `\sonnet{}`{=latex} and `\pro{}`{=latex}, indicating that even small improvements in context quality translate into measurable performance gains in next-iteration rollouts. These results motivate our approach of combining **RTV** with **PDR**, so that the refinement context is built from the highest-quality rollouts available.

```{=latex}
\begin{table*}[t]

\small
\resizebox{\textwidth}{!}{%
\begin{tabular}{l|cccc|cccc}
\toprule
& \multicolumn{4}{c|}{\textbf{\swe{}}} & \multicolumn{4}{c}{\textbf{\terminal{}}} \\
\textbf{Model} & Iter 0 & Sel-K & Iter 1 & Final & Iter 0 & Sel-K & Iter 1 & Final \\
\midrule
\texttt{Claude 4.5 Opus}   & 70.94 & 75.00 & 76.04 & \textbf{77.60} & 46.95 & 54.26 & 52.49 & 59.09 \\
\texttt{Gemini 3.1 Pro}    & 72.25 & 75.30 & 76.16 & 76.60 & 52.49 & 59.66 & 56.89 & \textbf{64.77} \\
\texttt{Claude 4.5 Sonnet} & 67.41 & 72.60 & 74.01 & 75.60 & 40.62 & 50.85 & 50.00 & 56.82 \\
\texttt{Gemini 3 Flash}    & 70.79 & 73.55 & 74.28 & 76.00 & 37.93 & 45.45 & 43.68 & 48.86 \\
\texttt{GPT-5 (0825)}      & 61.41 & 65.25 & 67.73 & 69.80 & 31.32 & 35.23 & 35.30 & 38.64 \\
\bottomrule
\end{tabular}%
}
\caption{
Main results for \textbf{PDR+RTV} on \swe{} and \terminal{}. We report the average pass@1 score at each stage: Iter 0 (iteration-0 rollouts), Sel-K (top-$K$ selected rollouts), Iter 1 (iteration-1 rollouts), and Final (final RTV-selected rollout).
The Final score improves over Iter 0 for every benchmark and model.
}
\label{tab:deepthink-main-results}
\end{table*}
```
Main Results {#sec:main_benchmark_results}
------------

We now turn to the full **PDR+RTV** pipeline. Table `\ref{tab:deepthink-main-results}`{=latex} and Figure `\ref{fig:teaser_figure}`{=latex} summarize the performance of our full **PDR+RTV** recipe across our agentic coding benchmarks and frontier models.

On `\swe{}`{=latex}, our method yields consistent gains over the single-attempt baselines (Iter 0). In particular, the final **RTV**-selected candidate improves `\opus{}`{=latex} from $70.94\%\rightarrow 77.60\%$ (+6.66), `\pro{}`{=latex} from $72.25\%\rightarrow 76.60\%$ (+4.35), `\sonnet{}`{=latex} from $67.41\%\rightarrow 75.60\%$ (+8.19), `\flash{}`{=latex} from $70.79\%\rightarrow 76.00\%$ (+5.21), and `\gpt{}`{=latex} from $61.41\%\rightarrow 69.80\%$ (+8.39). These improvements are visible in Figure `\ref{fig:teaser_figure}`{=latex} and reflect complementary contributions from (i) improved rollout quality after selecting the $K$ candidates and refining the rollouts (Iter 1), and (ii) an additional selection boost from the final **RTV** on top of the iteration-1 rollouts (Final).

On `\terminal{}`{=latex}, we observe even larger absolute improvements, consistent with the benchmark's higher variance and the importance of selecting among long-horizon trajectories. The final candidate improves `\opus{}`{=latex} from $46.95\%\rightarrow 59.09\%$ (+12.14), `\pro{}`{=latex} from $52.49\%\rightarrow 64.77\%$ (+12.28), `\sonnet{}`{=latex} from $40.62\%\rightarrow 56.82\%$ (+16.20), `\flash{}`{=latex} from $37.93\%\rightarrow 48.86\%$ (+10.93), and `\gpt{}`{=latex} from $31.32\%\rightarrow 38.64\%$ (+7.32). Notably, **RTV** provides a substantial lift over the raw iteration-1 rollouts (e.g., 52.49% $\rightarrow$ 59.09% for `\opus{}`{=latex} and 56.89% $\rightarrow$ 64.77% for `\pro{}`{=latex}), indicating that even after sequential refinement, there remains meaningful diversity among rollouts that can be exploited by parallel aggregation. Our method also helps coding agents complete tasks that none of their 16 initial rollouts solved, as shown in Appendix `\ref{sec:new_solution_discoveries_appendix}`{=latex}. As examples of this new-solution discovery, Figures `\ref{fig:example_refinement_rollout_gpt2_codegolf}`{=latex} and `\ref{fig:example_refinement_rollout_large_scale_text_editing}`{=latex} in Appendix `\ref{sec:example_refined_rollout_trajectories}`{=latex} show `\opus{}`{=latex} solving `gpt2-codegolf` and `\pro{}`{=latex} solving `large-scale-text-editing` in `\terminal{}`{=latex} under the `Terminus 1` scaffold.

Meanwhile, for our main experiments in `\swe{}`{=latex} and `\terminal{}`{=latex} we also measure the average number of steps taken by the agent for each iteration, both across all rollouts in each iteration and further divided into passing and failing rollouts. Table `\ref{tab:steps_iter0_iter1_swe_terminal}`{=latex} shows the results of our analyses. Across both benchmarks and all of our models, iteration 1 rollouts take substantially fewer steps than iteration 0. For example, upon refinement `\opus{}`{=latex} drops from 41.23 $\rightarrow$ 14.31 steps on `\swe{}`{=latex} and 24.43 $\rightarrow$ 12.14 steps on `\terminal{}`{=latex}. Similarly, `\pro{}`{=latex} drops from 35.56 $\rightarrow$ 17.95 steps on `\swe{}`{=latex} and 21.57 $\rightarrow$ 10.95 steps on `\terminal{}`{=latex}.
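
The reductions quoted above can also be read as relative savings. The following is a minimal sketch, not part of our evaluation code: the helper name `relative_step_reduction` is hypothetical, and the step counts are copied from the paragraph above.

```python
def relative_step_reduction(steps_iter0: float, steps_iter1: float) -> float:
    """Percentage of steps saved in iteration 1 relative to iteration 0."""
    return 100.0 * (steps_iter0 - steps_iter1) / steps_iter0


# Average step counts quoted in the text: (iteration 0, iteration 1).
quoted_steps = {
    ("Claude-4.5-Opus", "SWE-Bench Verified"): (41.23, 14.31),
    ("Claude-4.5-Opus", "Terminal-Bench v2.0"): (24.43, 12.14),
    ("Gemini-3.1-Pro", "SWE-Bench Verified"): (35.56, 17.95),
    ("Gemini-3.1-Pro", "Terminal-Bench v2.0"): (21.57, 10.95),
}

for (model, benchmark), (before, after) in quoted_steps.items():
    saved = relative_step_reduction(before, after)
    print(f"{model} on {benchmark}: {saved:.1f}% fewer steps")
```

For these four (model, benchmark) pairs the relative saving is roughly 65% for `\opus{}`{=latex} on `\swe{}`{=latex} and close to 50% for the other three.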

These results suggest that the refinement context helps agents navigate to solutions more directly and thus increases their *search efficiency*: it cuts down on the steps that would otherwise be spent understanding the directory structure, exploring file contents, and trying erroneous approaches. We also observe a largely consistent gap between the lengths of passing and failing trajectories within each iteration, with failing rollouts consuming more steps on average than passing rollouts. The one exception in Table `\ref{tab:steps_iter0_iter1_swe_terminal}`{=latex} is `\opus{}`{=latex} on `\terminal{}`{=latex} in iteration 0, where the two are nearly equal (24.66 steps for passing vs. 24.23 for failing rollouts). The general gap is expected, as (1) failing rollouts are more likely to be associated with difficult problems that require a large number of steps to address, and (2) the agent is more likely to make mistakes in failing rollouts, which triggers recovery actions that consume more steps. Refer to Appendix `\ref{sec:pdr_rtv_examples_appendix}`{=latex} for examples of refinement behavior while running **PDR+RTV** on our benchmarks.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{6pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.15}
```
```{=latex}
\resizebox{\linewidth}{!}{%
    \begin{tabular}{l|ccc|ccc|ccc|ccc}
        \hline
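        \textbf{Model} & \multicolumn{6}{c|}{\textbf{SWE-Bench Verified}} & \multicolumn{6}{c}{\textbf{Terminal-Bench v2.0}} \\
        \cline{2-13}
        & \multicolumn{3}{c|}{\textbf{Iter 0}} & \multicolumn{3}{c|}{\textbf{Iter 1}} & \multicolumn{3}{c|}{\textbf{Iter 0}} & \multicolumn{3}{c}{\textbf{Iter 1}} \\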
        \cline{2-13}
        & \textbf{All} & \textbf{Pass} & \textbf{Fail} & \textbf{All} & \textbf{Pass} & \textbf{Fail} & \textbf{All} & \textbf{Pass} & \textbf{Fail} & \textbf{All} & \textbf{Pass} & \textbf{Fail} \\
        \hline
        \texttt{Claude-4.5-Opus}   & 41.23 & 33.83 & 59.30 & 14.31 & 13.16 & 17.97 & 24.43 & 24.66 & 24.23 & 12.14 & 10.96 & 13.45 \\
        \texttt{Gemini-3.1-Pro}    & 35.56 & 33.88 & 39.92 & 17.95 & 17.05 & 20.82 & 21.57 & 17.47 & 26.09 & 10.95 & 9.20 & 13.25 \\
        \texttt{Claude-4.5-Sonnet} & 49.24 & 46.13 & 55.67 & 25.02 & 24.39 & 26.84 & 21.74 & 19.47 & 23.30 & 7.78 & 6.29 & 9.26 \\
        \texttt{Gemini-3-Flash}    & 51.10 & 48.39 & 57.65 & 28.80 & 27.37 & 32.94 & 16.01 & 15.68 & 16.22 & 7.80 & 6.16 & 9.07 \\
        \hline
    \end{tabular}%
    }
```
Analysis
========

We now perform a deeper analysis of our **PDR+RTV** test-time scaling method across two dimensions: sequential refinement and parallel aggregation. First, we study *sequential refinement dynamics*: how the pass rate of each task transitions from iteration 0 to 1, and how the quality of the refinement context with the $K$ selected rollouts influences iteration-1 behavior and efficiency. Second, we study *parallel aggregation dynamics*: how recursive small-group comparisons in **RTV** translate rollout diversity into reliable final-candidate selection.

Sequential Refinement Dynamics {#sec:sequential_refinement_dynamics_main}
------------------------------

We next analyze how the rollout populations change from iteration 0 to iteration 1 under the full **PDR+RTV** pipeline. Figure `\ref{fig:pdr_rtv_confusion_matrices_main}`{=latex} visualizes the pass-count transition matrices for each (benchmark, model) combination, where the benchmark is either `\swe{}`{=latex} (**Top**) or `\terminal{}`{=latex} (**Bottom**) and the model is one of {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. Each row corresponds to the number of passing rollouts out of $N=16$ rollouts in iteration 0, and each column corresponds to the number of passing rollouts in iteration 1. The mass above the diagonal corresponds to tasks with improvements, while mass below corresponds to tasks with regressions. Across both `\swe{}`{=latex} and `\terminal{}`{=latex}, the matrices exhibit a pronounced upward shift, indicating that sequential refinement improves the overall pass rates for a substantial fraction of tasks.
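
A minimal sketch of how such a pass-count transition matrix can be assembled is given below. It is illustrative only: the function names and the toy task identifiers are assumptions, and the inputs are simply per-task counts of passing rollouts out of $N=16$.

```python
from typing import Dict, List, Tuple


def transition_matrix(
    iter0_passes: Dict[str, int], iter1_passes: Dict[str, int], n: int = 16
) -> List[List[int]]:
    """matrix[r][c] = number of tasks with r passing rollouts in iteration 0
    and c passing rollouts in iteration 1 (both out of n)."""
    matrix = [[0] * (n + 1) for _ in range(n + 1)]
    for task_id, before in iter0_passes.items():
        after = iter1_passes[task_id]
        matrix[before][after] += 1
    return matrix


def improved_and_regressed(matrix: List[List[int]]) -> Tuple[int, int]:
    """Mass above the diagonal (column > row) and below it (column < row)."""
    size = len(matrix)
    improved = sum(matrix[r][c] for r in range(size) for c in range(size) if c > r)
    regressed = sum(matrix[r][c] for r in range(size) for c in range(size) if c < r)
    return improved, regressed


# Toy data (hypothetical task ids): passing rollouts out of 16 per task.
iter0 = {"task-a": 3, "task-b": 16, "task-c": 1, "task-d": 8}
iter1 = {"task-a": 12, "task-b": 16, "task-c": 0, "task-d": 15}

matrix = transition_matrix(iter0, iter1)
print(improved_and_regressed(matrix))
```

In the toy data, two tasks move above the diagonal, one moves below it, and one stays on it.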

![Transition matrices of pass-count transitions from iteration 0 to iteration 1 in our **PDR+RTV** main experiments on `\swe{}`{=latex} (top row) and `\terminal{}`{=latex} (bottom row), for each model in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. Each cell counts tasks whose number of passing rollouts (out of $N=16$) changes from the row value (Iter 0) to the column value (Iter 1); the mass above the diagonal indicates net improvement under sequential refinement, while the mass below indicates regression.](img/pdr_rtv_confusion_matrices.png){#fig:pdr_rtv_confusion_matrices_main}

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{5pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.15}
```
```{=latex}
\resizebox{\linewidth}{!}{%
    \begin{tabular}{l|ccccc|ccccc}
        \hline
        \textbf{Model} & \multicolumn{5}{c|}{\textbf{SWE-Bench Verified}} & \multicolumn{5}{c}{\textbf{Terminal-Bench v2.0}} \\
        \cline{2-11}
        & \multicolumn{5}{c|}{\textbf{\# pass out of $K$ -- pass rate (\# tasks)}} & \multicolumn{5}{c}{\textbf{\# pass out of $K$ -- pass rate (\# tasks)}} \\
        \cline{2-11}
        & \textbf{0/4} & \textbf{1/4} & \textbf{2/4} & \textbf{3/4} & \textbf{4/4} & \textbf{0/4} & \textbf{1/4} & \textbf{2/4} & \textbf{3/4} & \textbf{4/4} \\
        \hline
        \texttt{Claude-4.5-Opus}    & 0.1 (81) & 33.4 (29) & 55.5 (25) & 85.4 (39) & 99.2 (326) & 1.8 (31) & 31.2 (4) & 43.0 (8) & 78.5 (9) & 94.1 (36) \\
        \texttt{Gemini-3.1-Pro}    & 0.6 (87) & 36.9 (22) & 38.4 (22) & 87.0 (36) & 99.8 (333) & 3.8 (23) & 37.5 (7) & 45.8 (12) & 57.5 (5) & 93.1 (41) \\
        \texttt{Claude-4.5-Sonnet} & 1.7 (94) & 18.8 (16) & 65.4 (28) & 88.1 (68) & 99.7 (294) & 3.0 (33) & 30.2 (6) & 58.0 (7) & 76.4 (9) & 91.7 (33) \\
        \texttt{Gemini-3-Flash}    & 0.0 (93) & 34.7 (18) & 73.1 (20) & 88.1 (63) & 96.4 (306) & 1.0 (37) & 23.2 (7) & 68.1 (9) & 60.0 (5) & 91.0 (30) \\
        \hline
    \end{tabular}%
    }
```
To better understand the factors leading to improved pass rates upon sequential refinement, we bucket the tasks in our `\swe{}`{=latex} and `\terminal{}`{=latex} main experiments by how many of the $K=4$ iteration-0 rollouts selected into the refinement context are passing. Within each bucket, we then report the resulting iteration-1 pass\@1 (along with the number of tasks in that bucket).
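
The bucketing described above can be sketched as follows. This is an illustrative reimplementation under assumed data structures (per-task lists of boolean pass/fail outcomes); the function name and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def stratify_by_context_quality(
    context_outcomes: Dict[str, List[bool]],
    iter1_outcomes: Dict[str, List[bool]],
) -> Dict[int, Tuple[float, int]]:
    """Map (# passing context rollouts) -> (mean iteration-1 pass@1 in %, # tasks)."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for task_id, context in context_outcomes.items():
        rollouts = iter1_outcomes[task_id]
        task_pass_rate = sum(rollouts) / len(rollouts)
        buckets[sum(context)].append(task_pass_rate)
    return {
        k: (100.0 * sum(rates) / len(rates), len(rates))
        for k, rates in sorted(buckets.items())
    }


# Toy data with K=4 context rollouts and 4 iteration-1 rollouts per task.
context = {
    "t1": [True, True, True, True],
    "t2": [False, False, False, False],
    "t3": [True, True, True, True],
}
iteration1 = {
    "t1": [True, True, True, True],
    "t2": [False, False, False, False],
    "t3": [True, True, True, False],
}

table = stratify_by_context_quality(context, iteration1)
print(table)
```

Each entry of `table` corresponds to one cell of the stratified table: a pass rate together with the number of tasks falling into that bucket.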

```{=latex}
\begin{findingbox}
\textbf{Finding 4.} The quality of the selected refinement context is strongly predictive of the quality of the next-iteration rollouts.
\end{findingbox}
```
Table `\ref{tab:pdr_rtv_k4_passcount_vs_iter1_pass1}`{=latex} reports the results of our analysis. Across all four models, the average pass\@1 score of the iteration-1 rollouts increases sharply as the quality of the refinement context improves. For example, on `\swe{}`{=latex}, `\opus{}`{=latex} scores 0.1% on tasks with 0 out of 4 passing rollouts in the refinement context, 33.4% with 1 out of 4, 55.5% with 2 out of 4, 85.4% with 3 out of 4, and 99.2% with 4 out of 4. The same overall pattern holds for the other models and on `\terminal{}`{=latex}, with one exception among the sparsely populated middle buckets: for `\flash{}`{=latex} on `\terminal{}`{=latex}, the 2/4 bucket (68.1%, 9 tasks) scores higher than the 3/4 bucket (60.0%, 5 tasks). These results strongly suggest that *selected prior experience directly drives next-iteration performance*.

We also examine how the pass-count distributions shift from iteration $0\rightarrow1$ across the four models. Figures `\ref{fig:pdr_rtv_pass_count_distributions_swe_bench}`{=latex} and `\ref{fig:pdr_rtv_pass_count_distributions_terminal_bench}`{=latex} show these shifts on `\swe{}`{=latex} and `\terminal{}`{=latex}, respectively. Across both benchmarks, the number of tasks with 16/16 passing rollouts increases from iteration $0\rightarrow1$: the transitions into the rightmost column of Figure `\ref{fig:pdr_rtv_confusion_matrices_main}`{=latex} correspond to tasks whose refinement context contains high-quality selected rollouts, which yield high refinement success rates. Meanwhile, we also observe an increase in the number of tasks with 0/16 passing rollouts from iteration $0\rightarrow1$, associated with the transitions into the leftmost column of Figure `\ref{fig:pdr_rtv_confusion_matrices_main}`{=latex}. This is because, for tasks whose parallel rollouts have low success rates, **RTV** often retains one or fewer successful candidates among its top-4 candidates, which, as observed in Table `\ref{tab:pdr_rtv_k4_passcount_vs_iter1_pass1}`{=latex}, leads to next-iteration rollouts with high failure rates. As a result, the iteration-1 rollouts form a sharp, bimodal distribution over tasks, with a much larger increase in the all-passing category than in the all-failing category. For example, in Figure `\ref{fig:pdr_rtv_pass_count_distributions_swe_bench}`{=latex}, during sequential refinement `\opus{}`{=latex} increases the number of tasks with 16/16 passing rollouts from 209/500 to 350/500 (+141) on `\swe{}`{=latex}, while increasing the number of tasks with 0/16 passing rollouts from 73/500 to 94/500 (+21).
Hence the average pass\@1 score undergoes a net improvement from iteration $0\rightarrow1$, and this refinement dynamic sets the stage for a final round of **RTV**, as we investigate in the next section.

![Pass count distribution shift from iterations 0 to 1 for each model in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}} based on our **PDR+RTV** method, across all tasks from `\swe{}`{=latex}.](img/pdr_rtv_pass_count_distributions_swe_bench.png){#fig:pdr_rtv_pass_count_distributions_swe_bench}

![Pass count distribution shift from iterations 0 to 1 for each model in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}} based on our **PDR+RTV** method, across all tasks from `\terminal{}`{=latex}.](img/pdr_rtv_pass_count_distributions_terminal_bench.png){#fig:pdr_rtv_pass_count_distributions_terminal_bench}

Parallel Aggregation Dynamics {#sec:parallel_aggregation_dynamics_main}
-----------------------------

![**(Top)** Evolution of pass\@1 scores across **RTV** rounds for both iterations of our **PDR+RTV** method, across models in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. **(Bottom)** Evolution of pass\@N scores across **RTV** rounds for both iterations of our **PDR+RTV** method, across models in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. All experiments are conducted for `\swe{}`{=latex}.](img/pdr_rtv_parallel_analysis_swe_bench.png){#fig:pdr_rtv_parallel_analysis_swe_bench}

![**(Top)** Evolution of pass\@1 scores across **RTV** rounds for both iterations of our **PDR+RTV** method, across models in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. **(Bottom)** Evolution of pass\@N scores across **RTV** rounds for both iterations of our **PDR+RTV** method, across models in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. All experiments are conducted for `\terminal{}`{=latex}.](img/pdr_rtv_parallel_analysis_terminal_bench.png){#fig:pdr_rtv_parallel_analysis_terminal_bench}

We next study the parallel aggregation dynamics that occur during the execution of our **PDR+RTV** method. In particular, we investigate how **RTV** continues to improve performance even after sequential refinement. Given $T=2$ iterations, there are two occurrences of **RTV**: the first after the completion of the iteration-0 rollouts, which runs tournament voting down to the top-4 candidates, and the second after the completion of the iteration-1 rollouts, which runs tournament voting down to the final candidate. While our final method only uses the top-4 candidates from the first instance of **RTV**, we still run it down to a single candidate for a comprehensive analysis of both **RTV** instances. To analyze both instances of **RTV**, we first monitor two evaluation metrics across the tournament rounds:

1.  *Average pass\@1*: This metric computes the pass\@1 score of the candidates remaining in each tournament round. It generally increases over the tournament rounds as $\Pi_{\text{LM}}$ selects a higher proportion of successful rollouts over failing rollouts during each round.

2.  *pass\@N*: This metric computes the pass\@N score of the candidates remaining in each tournament round. It decreases or stays the same over the tournament rounds, since $\Pi_{\text{LM}}$ selects failing rollouts from some groups in each round, which may leave only failing rollouts among the remaining candidates for a task.
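
The two metrics above can be computed from the pass/fail outcomes of the candidates that survive each round. The sketch below is illustrative: the function name, the dictionary layout, and the toy survivors are assumptions, not our evaluation harness.

```python
from typing import Dict, List, Tuple


def round_metrics(survivors: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (average pass@1 in %, pass@N in %) over the candidates that
    remain for each task in a given tournament round."""
    n_tasks = len(survivors)
    avg_pass1 = sum(sum(c) / len(c) for c in survivors.values()) / n_tasks
    pass_at_n = sum(1 for c in survivors.values() if any(c)) / n_tasks
    return 100.0 * avg_pass1, 100.0 * pass_at_n


# Toy tournament: 4 candidates per task before voting, 2 after one round.
round0 = {"t1": [True, True, False, False], "t2": [False, False, False, True]}
round1 = {"t1": [True, True], "t2": [False, False]}  # t2 lost its only success

print(round_metrics(round0))
print(round_metrics(round1))
```

In this toy example the average pass\@1 rises from round 0 to round 1 while pass\@N falls, because the judge discards the only successful candidate of `t2`.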

Figures `\ref{fig:pdr_rtv_parallel_analysis_swe_bench}`{=latex} and `\ref{fig:pdr_rtv_parallel_analysis_terminal_bench}`{=latex} show the results of our analyses performed for `\swe{}`{=latex} and `\terminal{}`{=latex}, respectively. Each figure contains results spanning over the four models in {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}}. The blue lines in the first row report the average pass\@1 scores across tournament rounds, while the orange lines in the second row report the pass\@N scores, for both instances of **RTV** performed after iterations 0 and 1.

```{=latex}
\begin{findingbox}
\textbf{Finding 5.} Parallel aggregation remains valuable after refinement because refined rollout populations retain useful intra-task diversity exploited by recursive selection.
\end{findingbox}
```
We make the following observations across both benchmarks and all frontier models. First, the second instance of **RTV** over the iteration-1 rollouts still provides improvements via parallel aggregation, albeit at a lower rate than the first instance of **RTV** over the iteration-0 rollouts. This follows from the distributions observed in Figures `\ref{fig:pdr_rtv_pass_count_distributions_swe_bench}`{=latex} and `\ref{fig:pdr_rtv_pass_count_distributions_terminal_bench}`{=latex}, where the number of tasks with mixed pass/fail test outcomes decreases significantly, leaving less room for improvement via tournament-based selection. Second, the pass\@N scores of the iteration-1 rollouts start lower than those of the iteration-0 rollouts, since some successful rollouts are incorrectly removed by the first instance of **RTV**, yet they converge to a higher final pass\@1 score. This is because, as described above, some tasks regress to 0/16 successful rollouts in iteration 1, and the remaining tasks have iteration-1 rollouts that are, on average, easier for **RTV** to select from.

Moreover, we measure the accuracy of the groupwise comparisons performed during both instances of **RTV** across all of our experiments. Within each instance of **RTV**, across the tournament rounds we extract all groups containing both successful *and* failing rollouts, and we measure the selection accuracy, i.e., the fraction of these groups from which a successful rollout is selected. In other words, we measure the performance of $\Pi_{\text{LM}}$ as an *LLM-as-a-Judge* in terms of how accurately it selects a successful rollout during each groupwise comparison.
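
A minimal sketch of this selection-accuracy computation is shown below. The representation of a group as a list of outcomes plus the index picked by the judge is an assumption made for illustration, as are the function name and the toy groups (whose size is arbitrary here).

```python
from typing import List, Optional, Tuple

# (pass/fail outcome of each rollout in the group, index chosen by the judge)
Group = Tuple[List[bool], int]


def selection_accuracy(groups: List[Group]) -> Optional[float]:
    """Percentage of mixed groups (at least one pass and one fail) from which
    the judge selected a passing rollout; None if there are no mixed groups."""
    mixed = [(o, pick) for o, pick in groups if any(o) and not all(o)]
    if not mixed:
        return None
    correct = sum(1 for outcomes, pick in mixed if outcomes[pick])
    return 100.0 * correct / len(mixed)


toy_groups: List[Group] = [
    ([True, False, False, False], 0),   # mixed, judge picks a passing rollout
    ([False, True, False, False], 2),   # mixed, judge picks a failing rollout
    ([True, True, True, True], 1),      # all pass: excluded from the metric
    ([False, False, False, False], 3),  # all fail: excluded from the metric
    ([True, False, True, False], 2),    # mixed, judge picks a passing rollout
]

accuracy = selection_accuracy(toy_groups)
print(accuracy)
```

Groups in which every rollout passes or every rollout fails are excluded, since the judge's choice cannot change the outcome there.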

```{=latex}
\scriptsize
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{cl|ccccc|ccccc}
\toprule
 & & \multicolumn{5}{c|}{\textbf{SWE-Bench Verified}} & \multicolumn{5}{c}{\textbf{Terminal-Bench v2.0}} \\
\cmidrule(lr){3-7} \cmidrule(lr){8-12}
\textbf{Iter} & \textbf{Model} & R0 & R1 & R2 & R3 & Avg & R0 & R1 & R2 & R3 & Avg \\
\midrule
0 & \opus{} & 69.1 & 68.3 & 60.6 & 55.6 & 67.0 & 81.4 & 78.9 & 67.9 & 64.3 & 77.9 \\
 & \pro{} & 66.9 & 64.5 & 51.1 & 65.1 & 64.5 & 84.2 & 81.4 & 72.2 & 62.5 & 80.7 \\
 & \sonnet{} & 70.6 & 66.2 & 53.3 & 55.6 & 66.6 & 80.5 & 87.0 & 78.6 & 62.5 & 81.7 \\
 & \flash{} & 66.3 & 55.6 & 53.3 & 54.5 & 61.3 & 85.1 & 79.4 & 80.0 & 73.7 & 82.3 \\
\midrule
1 & \opus{} & 54.4 & 62.9 & 60.0 & 71.4 & 58.2 & 75.4 & 74.3 & 84.2 & 83.3 & 77.2 \\
 & \pro{} & 43.4 & 47.4 & 65.8 & 60.0 & 48.3 & 77.4 & 68.6 & 85.7 & 88.9 & 76.7 \\
 & \sonnet{} & 63.2 & 57.5 & 59.0 & 60.0 & 60.9 & 78.9 & 77.4 & 76.2 & 77.8 & 78.0 \\
 & \flash{} & 67.3 & 51.5 & 55.8 & 50.0 & 62.1 & 77.9 & 71.1 & 72.4 & 90.9 & 75.8 \\
\bottomrule
\end{tabular}}
```
Table `\ref{tab:pdr_rtv_judge_accuracy}`{=latex} shows the results of our analyses. Note that the group selection accuracies in Table `\ref{tab:pdr_rtv_judge_accuracy}`{=latex} should not be interpreted as a controlled, head-to-head comparison across judges/models. Each judge is applied to a *different* set of candidate rollouts, produced by a *different* generator model (i.e., itself), and it makes decisions based on summaries that are themselves generated from those rollouts. As a result, differences in selection accuracy reflect not only the judge's decision quality, but also (i) the difficulty and diversity of the underlying rollout pools, and (ii) the informativeness of the corresponding summaries. Nevertheless, we make several observations. First, $\Pi_{\text{LM}}$ selects rollouts more accurately on `\terminal{}`{=latex} than on `\swe{}`{=latex}. This gap likely reflects that `\swe{}`{=latex} rollouts require judging subtle code diffs and hidden test outcomes from partial summaries, whereas `\terminal{}`{=latex} rollouts expose more directly verifiable command--output evidence, making successful trajectories easier for the judge to identify. Second, there is room for improving the groupwise comparison accuracies: for example, `\pro{}`{=latex} achieves the lowest accuracy as a judge[^1], which coincides with the smallest improvement from the average pass\@1 of its iteration-1 rollouts (76.16%) to its final pass\@1 (76.60%, +0.44 points). We expect that training $\Pi_{\text{LM}}$ to make better group-level rollout selections via SFT or RL, or even deploying a dedicated judge for tournament voting, could significantly improve the performance of **RTV** in our setup.

Conclusion {#sec:conclusion}
==========

We propose a new inference-time scaling method for agentic coding, a setting in which each attempt is not a short completion but a long interactive trajectory of reasoning, actions, observations, and partial progress. For agentic tasks, we find that leveraging additional computation is useful only if the information produced by prior attempts can be exposed in a bounded form that later computation can reliably compare and reuse.

Our central claim is that the key bottleneck of leveraging inference-time compute for long-horizon tasks is the *representation* of prior experience. For agentic coding tasks, we address this bottleneck by converting each rollout into a compact, structured summary which functions as an interface for two complementary forms of test-time scaling. For parallel aggregation, we introduce **Recursive Tournament Voting** (**RTV**), which recursively selects successful rollouts by aggregating small-group comparisons of the structured summaries. For sequential refinement, we adapt **Parallel-Distill-Refine** (**PDR**) to the agentic setting by conditioning fresh rollouts upon summaries distilled from prior attempts. Combining these two operators yields a simple unified pipeline that balances *exploitation* by refining high-quality rollouts and *exploration* by executing multiple parallel rollouts.

Across different frontier LLMs {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}, `\gpt{}`{=latex}} and challenging agentic coding benchmarks {`\swe{}`{=latex}, `\terminal{}`{=latex}}, our framework consistently improves over single-attempt baselines. Our analyses further show that these gains are not simply explained by more compute: we find that (1) structured summaries outperform raw trajectories as comparison inputs, (2) refinement contexts with higher-quality summaries lead to stronger next-iteration rollouts, and (3) a final **RTV** stage continues to add value even after refinement by exploiting residual rollout diversity. Moreover, we find that sequential refinement via agentic **PDR** also improves the *efficiency* of the next-iteration rollouts, which complete the same tasks using about 50% fewer steps while achieving higher success rates. Taken together, these results support a simple view: for long-horizon agents, test-time scaling works when prior experience is transformed into representations that support both *selection* and *reuse*.

A natural direction for future work is to extend this framework beyond textual rollout summaries to *persistent external artifacts*. In our current setting, prior experience is reused through compact summaries provided to fresh rollouts in newly initialized environments. A richer formulation would allow agents to retain and build upon persistent workspace state across attempts, including notes, partial patches, derived tests, debugging scripts, and reusable tools constructed by the agent itself. This would move inference-time scaling from reusing *descriptions* of prior experience to reusing the *artifacts* produced by prior experience. An important open question is then how to represent, select, refine, and maintain such persistent artifacts so that agents can accumulate useful external state without being overwhelmed by stale or low-value information.

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\bibliographystyle{assets/plainnat}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
`\phantomsection`{=latex}`\label{sec:appendix_toc}`{=latex} `\appendix`{=latex} `\printappendixtoc`{=latex}

Rollout statistics`\appendixreturntotoc`{=latex}
================================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}Rollout statistics}
```
`\label{sec:rollout_statistics_appendix}`{=latex}

We compute statistics regarding the rollouts performed by all of our models {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}, `\gpt{}`{=latex}} for our main experiments and share the results in Tables `\ref{tab:swe_bench_rollout_stats}`{=latex} and `\ref{tab:terminal_bench_rollout_stats}`{=latex} for `\swe{}`{=latex} and `\terminal{}`{=latex}, respectively.

We measure, for each iteration of our main experiment, (1) the average pass\@1 score, (2) the pass\@16 score ($N$=16), and (3) the number of tasks which contain mixed passing and failing rollouts for the given iteration. Note that for `\swe{}`{=latex} we use the bash-only `mini-SWE-agent` harness, and for `\terminal{}`{=latex} we use the `Terminus 1` harness.

```{=latex}
\small
```
::: {#tab:swe_bench_rollout_stats}
  **Model**             **Iter 0 Avg Pass\@1**   **Iter 0 Pass\@16**   **Iter 0 \# Mixed**   **Iter 1 Avg Pass\@1**   **Iter 1 Pass\@16**   **Iter 1 \# Mixed**
  --------------------- ------------------------ --------------------- --------------------- ------------------------ --------------------- ---------------------
  `\opus{}`{=latex}     70.94                    85.40                 218                   76.04                    81.20                 56
  `\pro{}`{=latex}      **72.25**                **86.00**             200                   **76.16**                **82.00**             59
  `\sonnet{}`{=latex}   67.41                    83.40                 259                   74.01                    79.20                 71
  `\flash{}`{=latex}    70.79                    84.00                 251                   74.28                    79.80                 179
  `\gpt{}`{=latex}      61.41                    79.00                 257                   67.73                    73.40                 96

  : Rollout statistics for `\swe{}`{=latex} based on our main experiments.
:::
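
As a consistency check (a derivation from the reported numbers, not an additional measurement), the `\opus{}`{=latex} row above can be related to the task counts quoted in Section `\ref{sec:sequential_refinement_dynamics_main}`{=latex}. With 500 tasks, the number of tasks with no passing rollout is $(1-\text{pass@16})\times 500$, and the number of tasks with 16/16 passing rollouts is what remains after also removing the mixed tasks:

$$
\begin{aligned}
\text{Iteration 0:}\quad & n_{0/16} = (1 - 0.8540)\times 500 = 73, & n_{16/16} &= 500 - 73 - 218 = 209,\\
\text{Iteration 1:}\quad & n_{0/16} = (1 - 0.8120)\times 500 = 94, & n_{16/16} &= 500 - 94 - 56 = 350.
\end{aligned}
$$

These values match the 73/500 $\rightarrow$ 94/500 and 209/500 $\rightarrow$ 350/500 shifts reported for `\opus{}`{=latex} in the main text.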

```{=latex}
\small
```
::: {#tab:terminal_bench_rollout_stats}
  **Model**             **Iter 0 Avg Pass\@1**   **Iter 0 Pass\@16**   **Iter 0 \# Mixed**   **Iter 1 Avg Pass\@1**   **Iter 1 Pass\@16**   **Iter 1 \# Mixed**
  --------------------- ------------------------ --------------------- --------------------- ------------------------ --------------------- ---------------------
  `\opus{}`{=latex}     46.95                    70.45                 36                    52.49                    65.91                 22
  `\pro{}`{=latex}      **52.49**                **76.14**             35                    **56.89**                **72.73**             23
  `\sonnet{}`{=latex}   40.62                    67.05                 38                    50.00                    62.50                 19
  `\flash{}`{=latex}    37.93                    60.23                 37                    43.68                    56.82                 18
  `\gpt{}`{=latex}      31.32                    51.14                 30                    35.30                    43.18                 10

  : Rollout statistics for `\terminal{}`{=latex} based on our main experiments.
:::

We find that `\pro{}`{=latex} performs best on both benchmarks, closely followed by `\opus{}`{=latex}. `\flash{}`{=latex} outperforms `\sonnet{}`{=latex} on `\swe{}`{=latex}, whereas `\sonnet{}`{=latex} outperforms `\flash{}`{=latex} on `\terminal{}`{=latex}. `\gpt{}`{=latex} performs the worst, which is expected given that it was released earlier than the other models.
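As a point of reference, the three statistics reported in the tables above can be computed from per-task rollout outcomes along the following lines. This is a minimal sketch rather than our evaluation code: the `rollout_stats` helper and the toy data are hypothetical, and the sketch assumes that "\# Mixed" counts the tasks whose rollouts include both passing and failing outcomes.

```python
def rollout_stats(results):
    """Summarize rollout outcomes for one model and one iteration.

    `results` maps each task ID to a list of booleans, one per rollout
    (True means the rollout passed all tests).
    """
    n_tasks = len(results)
    # Average pass@1: mean over tasks of the fraction of passing rollouts.
    avg_pass_at_1 = sum(sum(r) / len(r) for r in results.values()) / n_tasks
    # Pass@N: fraction of tasks with at least one passing rollout.
    pass_at_n = sum(any(r) for r in results.values()) / n_tasks
    # Mixed (assumed definition): tasks with both passing and failing rollouts.
    n_mixed = sum(any(r) and not all(r) for r in results.values())
    return avg_pass_at_1, pass_at_n, n_mixed


# Hypothetical toy data with N=4 rollouts per task.
toy = {
    "task-a": [True, True, False, False],
    "task-b": [False, False, False, False],
    "task-c": [True, True, True, True],
    "task-d": [True, False, False, False],
}
avg_pass_at_1, pass_at_n, n_mixed = rollout_stats(toy)
print(avg_pass_at_1, pass_at_n, n_mixed)
```

Under this reading of the mixed count, the tasks with at least one passing rollout split into mixed tasks and tasks on which every rollout passes.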

Model comparisons`\appendixreturntotoc`{=latex}
===============================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}Model comparisons}
```
`\label{sec:model_comparisons_appendix}`{=latex}

We compare the capabilities of the frontier language models used in our experiments, {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}, `\gpt{}`{=latex}}, on both `\swe{}`{=latex} and `\terminal{}`{=latex}. To this end, for each ordered pair of models ($M_i$, $M_j$), we count the tasks for which, given $N$=16 parallel rollouts in iteration 0, $M_i$ succeeds in at least one of its rollouts while $M_j$ fails all of its rollouts.
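The pairwise count described above can be sketched as follows. This is an illustrative sketch, not our analysis code: the function names, model names, and toy outcomes are hypothetical, and it assumes that every model was run on the same set of tasks.

```python
from itertools import permutations


def solved_tasks(results):
    """Return the task IDs with at least one passing rollout.

    `results` maps each task ID to a list of booleans, one per rollout.
    """
    return {task for task, outcomes in results.items() if any(outcomes)}


def pairwise_wins(results_by_model):
    """Count, for each ordered pair (M_i, M_j), the tasks M_i solves but M_j does not."""
    solved = {model: solved_tasks(r) for model, r in results_by_model.items()}
    return {
        (m_i, m_j): len(solved[m_i] - solved[m_j])
        for m_i, m_j in permutations(solved, 2)
    }


# Hypothetical toy data with 3 rollouts per task instead of N=16.
toy = {
    "model_a": {
        "t1": [True, False, False],
        "t2": [False, False, False],
        "t3": [True, True, True],
    },
    "model_b": {
        "t1": [False, False, False],
        "t2": [False, True, False],
        "t3": [True, False, True],
    },
}
wins = pairwise_wins(toy)
print(wins)
```

Because the count is asymmetric, each ordered pair gets its own entry, matching one cell of a matchup matrix.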

Figure `\ref{fig:rollout_matchups}`{=latex} visualizes the results. We find that `\pro{}`{=latex} is the most competitive model overall across both benchmarks in terms of win rates, closely followed by `\opus{}`{=latex}. `\flash{}`{=latex} has slightly better win rates than `\sonnet{}`{=latex} on `\swe{}`{=latex} but does worse than `\sonnet{}`{=latex} on `\terminal{}`{=latex}. `\gpt{}`{=latex} is the least competitive, which is expected given that it was released earlier than the other models. We also observe that smaller models rarely win against the larger models from the same family: `\flash{}`{=latex} wins over `\pro{}`{=latex} on only 10/500 tasks on `\swe{}`{=latex} and only 1/88 tasks on `\terminal{}`{=latex}. We observe a similar pattern for `\sonnet{}`{=latex} and `\opus{}`{=latex}, although it is less pronounced than in the Gemini family of models.

![Pairwise model comparisons for iteration 0 with $N$=16 rollouts. Each cell reports the number of tasks for which model $M_i$ produces at least one successful rollout while model $M_j$ fails all of its rollouts (left: `\swe{}`{=latex}; right: `\terminal{}`{=latex}).](img/rollout_matchups_iter0_colm_v1.png){#fig:rollout_matchups}

Sequential refinement dynamics`\appendixreturntotoc`{=latex} {#sequential-refinement-dynamics}
============================================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}Sequential refinement dynamics}
```
`\label{sec:sequential_refinement_dynamics_appendix}`{=latex}

We provide additional results for the sequential refinement analyses of our **PDR+RTV** main experiments, complementing Section `\ref{sec:sequential_refinement_dynamics_main}`{=latex}.

Context quality analysis {#sec:context_quality_analysis_appendix}
------------------------

In Section `\ref{sec:sequential_refinement_ablations}`{=latex}, Table `\ref{tab:pdr_k4_passcount_vs_iter1_pass1}`{=latex} shows the effect of refinement context management on the performance of the next-iteration rollouts. It compares random-$K$ refinement with select-$K$ (**RTV**) refinement by stratifying the tasks by the number of passing rollouts in the refinement context. The results show that select-$K$ refinement yields more refinement contexts that consist entirely of passing rollouts, which contributes to its higher average pass\@1 score relative to random-$K$ refinement.

![Impact of the average performance of the $K$ rollouts selected for the **PDR** refinement context on the iteration-1 rollout pass rates, executed for 100 tasks from `\swe{}`{=latex}, based on `\sonnet{}`{=latex} and `\pro{}`{=latex}.](img/pdr_top_k_pass_rate_analysis.png){#fig:pdr_top_k_pass_rate_analysis}

In Figure `\ref{fig:pdr_top_k_pass_rate_analysis}`{=latex} we explore this further and visualize the task-level pass rate distribution of 100 randomly sampled tasks from `\swe{}`{=latex} for `\sonnet{}`{=latex} and `\pro{}`{=latex}. As before, for each model we stratify the tasks by the number of passing rollouts in the refinement context, but here we report, for each bucket, the distribution of the tasks' average iteration-1 pass rates.
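The stratification described above can be sketched as follows. This is a minimal illustration under assumed data structures: the record fields `context` and `iter1`, the helper name, and the toy values are hypothetical and do not reflect our actual analysis code.

```python
from collections import defaultdict


def bucket_iter1_pass_rates(task_records):
    """Group per-task iteration-1 pass rates by context quality.

    Each record is a dict with:
      - "context": K booleans, whether each rollout in the refinement
        context passed in the previous iteration;
      - "iter1": booleans, whether each iteration-1 rollout passed.
    Returns a dict mapping the number of passing context rollouts to the
    list of per-task average iteration-1 pass rates in that bucket.
    """
    buckets = defaultdict(list)
    for record in task_records:
        n_passing_in_context = sum(record["context"])
        iter1_pass_rate = sum(record["iter1"]) / len(record["iter1"])
        buckets[n_passing_in_context].append(iter1_pass_rate)
    return dict(buckets)


# Hypothetical toy data with K=4 context rollouts and 4 iteration-1 rollouts.
records = [
    {"context": [True, True, True, True], "iter1": [True, True, True, False]},
    {"context": [False, False, False, False], "iter1": [False, False, False, False]},
    {"context": [True, True, True, True], "iter1": [True, True, True, True]},
    {"context": [True, False, False, True], "iter1": [True, False, True, False]},
]
buckets = bucket_iter1_pass_rates(records)
bucket_means = {k: sum(v) / len(v) for k, v in buckets.items()}
print(bucket_means)
```

Keeping the full list of per-task pass rates in each bucket, rather than only the mean, is what allows the per-bucket distribution to be plotted.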

New solution discoveries`\appendixreturntotoc`{=latex}
======================================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}New solution discoveries}
```
`\label{sec:new_solution_discoveries_appendix}`{=latex}

We identify the tasks that the frontier models used in our experiments, {`\opus{}`{=latex}, `\pro{}`{=latex}, `\sonnet{}`{=latex}, `\flash{}`{=latex}, `\gpt{}`{=latex}}, are initially unable to solve in any of their $N$=16 parallel rollouts but are able to solve after the select-$K$ refinement of our **PDR+RTV** method. In other words, despite being given a refinement context that contains only summaries of failing rollouts, the models produce next-iteration rollouts that pass all test cases. Tables `\ref{tab:new_solution_discoveries_swe_bench}`{=latex} and `\ref{tab:new_solution_discoveries_terminal_bench}`{=latex} list, for each model, the IDs of the tasks for which it discovers a new solution upon refinement on `\swe{}`{=latex} and `\terminal{}`{=latex}, respectively.
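The selection criterion described above can be written compactly. The sketch below is illustrative only: `new_solution_discoveries` and the toy outcomes are hypothetical, and each argument is assumed to map task IDs to per-rollout pass/fail booleans for a single model.

```python
def new_solution_discoveries(iter0, iter1):
    """Return tasks that fail every iteration-0 rollout but pass at least once in iteration 1.

    `iter0` and `iter1` map each task ID to a list of booleans, one per rollout.
    """
    return sorted(
        task
        for task, outcomes in iter0.items()
        if not any(outcomes) and any(iter1.get(task, []))
    )


# Hypothetical toy data with 3 rollouts per iteration instead of N=16.
iter0 = {
    "t1": [False, False, False],  # all fail, then solved in iteration 1
    "t2": [True, False, False],   # already solved in iteration 0
    "t3": [False, False, False],  # all fail in both iterations
}
iter1 = {
    "t1": [False, True, False],
    "t2": [True, True, True],
    "t3": [False, False, False],
}
discovered = new_solution_discoveries(iter0, iter1)
print(discovered)
```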

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{6pt}
```
::: {#tab:new_solution_discoveries_swe_bench}
  **Model**             **Task IDs w/ fail $\rightarrow$ pass**
  --------------------- ----------------------------------------------------------
  `\opus{}`{=latex}     `django_django-11951`
  `\pro{}`{=latex}      `sphinx-doc_sphinx-9602`
  `\sonnet{}`{=latex}   `django_django-13964`, `scikit-learn_scikit-learn-25102`
  `\flash{}`{=latex}    N/A
  `\gpt{}`{=latex}      `pydata_xarray-4687`

  : Tasks in `\swe{}`{=latex} that each model fails to solve in all $N$=16 parallel rollouts of iteration 0 but solves in at least one of the $N$=16 rollouts of iteration 1 (i.e., succeeds after the select-$K$ refinement during **PDR+RTV**).
:::

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{6pt}
```
::: {#tab:new_solution_discoveries_terminal_bench}
  **Model**             **Task IDs w/ fail $\rightarrow$ pass**
  --------------------- ------------------------------------------------------------------------------------------------------------------------
  `\opus{}`{=latex}     `caffe-cifar-10`, `chess-best-move`, `gpt2-codegolf`, `nginx-request-logging`, `vulnerable-secret`
  `\pro{}`{=latex}      `configure-git-webserver`, `gcode-to-text`, `git-leak-recovery`, `large-scale-text-editing`, `openssl-selfsigned-cert`
  `\sonnet{}`{=latex}   `mcmc-sampling-stan`, `sparql-university`
  `\flash{}`{=latex}    `mcmc-sampling-stan`, `regex-chess`
  `\gpt{}`{=latex}      `mcmc-sampling-stan`, `schemelike-metacircular-eval`, `vulnerable-secret`

  : Tasks in `\terminal{}`{=latex} that each model fails to solve in all $N$=16 parallel rollouts of iteration 0 but solves in at least one of the $N$=16 rollouts of iteration 1 (i.e., succeeds after the select-$K$ refinement during **PDR+RTV**).
:::

We observe that although `\swe{}`{=latex} contains almost six times as many tasks as `\terminal{}`{=latex} (500 vs. 88), 14 distinct `\terminal{}`{=latex} tasks improve from fail $\rightarrow$ pass for at least one model, compared to only 5 tasks for `\swe{}`{=latex}.

Moreover, we analyze whether the tasks that improve from fail $\rightarrow$ pass between iteration 0 and iteration 1 had already been solved by other models in their iteration-0 rollouts. Tables `\ref{tab:new_solution_discovery_matrix_swe_bench}`{=latex} and `\ref{tab:new_solution_discovery_matrix_terminal_bench}`{=latex} show the results of this analysis for all the frontier models on `\swe{}`{=latex} and `\terminal{}`{=latex}, respectively.
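A cross-model check of this kind can be sketched as follows. The helper names and toy data are hypothetical, and the sketch assumes that a task counts as already solved by a model if at least one of that model's iteration-0 rollouts passes.

```python
def iteration0_solve_matrix(discovered_tasks, iter0_by_model):
    """For each task, record which models solve it in at least one iteration-0 rollout.

    `iter0_by_model` maps each model name to a dict from task ID to a list
    of booleans, one per iteration-0 rollout.
    """
    return {
        task: {
            model: any(results.get(task, []))
            for model, results in iter0_by_model.items()
        }
        for task in discovered_tasks
    }


def unsolved_by_all(matrix):
    """Return the tasks that no model solves in iteration 0."""
    return sorted(task for task, row in matrix.items() if not any(row.values()))


# Hypothetical toy data with 2 rollouts per task instead of N=16.
iter0_by_model = {
    "model_a": {"t1": [False, False], "t2": [False, False]},
    "model_b": {"t1": [True, False], "t2": [False, False]},
}
matrix = iteration0_solve_matrix(["t1", "t2"], iter0_by_model)
print(matrix)
print(unsolved_by_all(matrix))
```

Tasks returned by `unsolved_by_all` are the ones for which a fail $\rightarrow$ pass improvement cannot be attributed to a solution that any other model had already found in iteration 0.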

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{5pt}
```
::: {#tab:new_solution_discovery_matrix_swe_bench}
  **Task ID**                          **`Claude 4.5 Opus`**   **`Gemini 3.1 Pro`**   **`Claude 4.5 Sonnet`**   **`Gemini 3 Flash`**   **`GPT-5`**
  ----------------------------------- ----------------------- ---------------------- ------------------------- ---------------------- --------------
  `django_django-11951`                      $\times$                $\times$              $\checkmark$             $\checkmark$       $\checkmark$
  `sphinx-doc_sphinx-9602`                   $\times$                $\times$              $\checkmark$             $\checkmark$         $\times$
  `django_django-13964`                    $\checkmark$            $\checkmark$              $\times$               $\checkmark$       $\checkmark$
  `scikit-learn_scikit-learn-25102`        $\checkmark$              $\times$              $\checkmark$               $\times$         $\checkmark$
  `pydata_xarray-4687`                     $\checkmark$            $\checkmark$            $\checkmark$             $\checkmark$         $\times$

  : Per-task "new solution discovery" matrix for `\swe{}`{=latex}, indicating which models are already able to solve each task in iteration 0. A $\checkmark$ indicates that the model solves the task in at least one of its iteration-0 rollouts; $\times$ indicates that it fails all of them.
:::

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{5pt}
```
::: {#tab:new_solution_discovery_matrix_terminal_bench}
  **Task ID**                       **`Claude 4.5 Opus`**   **`Gemini 3.1 Pro`**   **`Claude 4.5 Sonnet`**   **`Gemini 3 Flash`**   **`GPT-5`**
  -------------------------------- ----------------------- ---------------------- ------------------------- ---------------------- --------------
  `caffe-cifar-10`                        $\times$              $\checkmark$              $\times$               $\checkmark$       $\checkmark$
  `chess-best-move`                       $\times$              $\checkmark$            $\checkmark$             $\checkmark$       $\checkmark$
  `configure-git-webserver`               $\times$                $\times$              $\checkmark$               $\times$         $\checkmark$
  `gcode-to-text`                       $\checkmark$              $\times$                $\times$                 $\times$           $\times$
  `git-leak-recovery`                     $\times$                $\times$              $\checkmark$               $\times$           $\times$
  `gpt2-codegolf`                         $\times$              $\checkmark$              $\times$                 $\times$           $\times$
  `large-scale-text-editing`              $\times$                $\times$                $\times$                 $\times$           $\times$
  `mcmc-sampling-stan`                    $\times$              $\checkmark$              $\times$                 $\times$           $\times$
  `nginx-request-logging`                 $\times$              $\checkmark$              $\times$                 $\times$           $\times$
  `openssl-selfsigned-cert`             $\checkmark$              $\times$                $\times$                 $\times$           $\times$
  `regex-chess`                           $\times$                $\times$              $\checkmark$               $\times$           $\times$
  `schemelike-metacircular-eval`        $\checkmark$            $\checkmark$            $\checkmark$             $\checkmark$         $\times$
  `sparql-university`                   $\checkmark$            $\checkmark$              $\times$               $\checkmark$         $\times$
  `vulnerable-secret`                     $\times$              $\checkmark$              $\times$                 $\times$           $\times$

  : Per-task "new solution discovery" matrix for `\terminal{}`{=latex}, indicating which models are already able to solve each task in iteration 0. A $\checkmark$ indicates that the model solves the task in at least one of its iteration-0 rollouts; $\times$ indicates that it fails all of them.
:::

We find that all the `\terminal{}`{=latex} tasks on which `\opus{}`{=latex} self-improves are already solved by `\pro{}`{=latex} in at least one of its iteration-0 rollouts. Likewise, all but one of the `\terminal{}`{=latex} tasks on which `\pro{}`{=latex} self-improves are already solved by either `\opus{}`{=latex} or `\sonnet{}`{=latex} in at least one of their iteration-0 rollouts. Interestingly, the `mcmc-sampling-stan` task, on which `\sonnet{}`{=latex}, `\flash{}`{=latex} and `\gpt{}`{=latex} all self-improve, is originally solved only by `\pro{}`{=latex}. Finally, while none of our models is able to solve `large-scale-text-editing` initially, `\pro{}`{=latex} can leverage **PDR+RTV** to execute a successful rollout during iteration 1.

PDR + RTV qualitative examples`\appendixreturntotoc`{=latex}
============================================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}PDR + RTV qualitative examples}
```
`\label{sec:pdr_rtv_examples_appendix}`{=latex}

We present excerpts of several example rollout trajectories generated using our **PDR+RTV** method, where the agent leverages the structured summaries of previous rollouts, synthesizes common approaches and reconciles conflicting details in order to execute a new, successful rollout. Figure `\ref{fig:pdr_rtv_example_1}`{=latex} shows an example of `\opus{}`{=latex} solving the `django_django-13033` task in `\swe{}`{=latex}. Figure `\ref{fig:pdr_rtv_example_2}`{=latex} shows an example of `\pro{}`{=latex} solving the `sympy_sympy-17318` task in `\swe{}`{=latex}. Figure `\ref{fig:pdr_rtv_example_3}`{=latex} shows an example of `\opus{}`{=latex} solving the `sparql-university` task in `\terminal{}`{=latex}. Figure `\ref{fig:pdr_rtv_example_4}`{=latex} shows an example of `\sonnet{}`{=latex} solving the `sqlite-db-truncate` task in `\terminal{}`{=latex}.

Figure `\ref{fig:pdr_rtv_example_1}`{=latex} contains an excerpt of a rollout trajectory generated by `\opus{}`{=latex} for completing the `django_django-13033` task in `\swe{}`{=latex}. In Step 1, the agent draws on the analyses of all four previous attempts, synthesizing the commonly identified root cause (line 730 in `find_ordering_name`), the diagnosis that the name is the full path instead of the last piece, and the fix (`pieces[-1]`). Moreover, rather than repeating the extensive exploration performed by all four previous rollouts, it proceeds directly to the precise file and line number identified in the previous summaries. In Step 2, the agent refers back to the synthesized findings from the prior rollouts with the phrase "According to the analysis". It applies both substitutions (`name` → `pieces[-1]` in both the `attname` check and the `pk` check), directly reusing the `sed` command from the two previous rollouts that replaced both occurrences. Finally, in Step 5, the agent preemptively installs all three Python dependencies (`asgiref`, `pytz`, `sqlparse`) at once, in contrast to the four previous rollouts, which all encountered `ModuleNotFoundError`.

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/pdr_rtv_example_1_rollout.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Excerpt of a rollout executed by \opus{} on \swe{} (\texttt{django\_django-13033}).}
```
`\label{fig:pdr_rtv_example_1}`{=latex}

Figure `\ref{fig:pdr_rtv_example_2}`{=latex} contains an excerpt of a rollout trajectory generated by `\pro{}`{=latex} for completing the `sympy_sympy-17318` task in `\swe{}`{=latex}. In Step 1, the agent mentions the previous findings about the call chain (`split_surds` → `_split_gcd(*surds)` → `empty tuple` → `IndexError`), which was traced by all four prior rollouts. The phrase "agreed-upon fix from the successful parallel attempts" explicitly acknowledges that the agent is consolidating the consensus from the summaries, where all four rollouts proposed the same guard clause in `_split_gcd`. Furthermore, the phrase "Another approach adds an early return in \..." shows that the refined rollout is aware of the divergence between attempts: one rollout modified `split_surds` with an early return, while another rollout additionally modified `_sqrt_match`. The refined rollout acknowledges both alternatives before choosing its final strategy. In Step 8, the agent directly refers to a specific finding from a prior attempt, which discovered the `TypeError` in `sqrt_biquadratic_denest`. The refined rollout reproduces the exact same test to confirm that the issue exists, rather than discovering it from scratch. In Step 9, the agent reasons about the `TypeError` to justify a two-file fix. Building on the discovery made by one of the previous rollouts, the refined rollout explains why a second fix in `_sqrt_match` (returning `[]` when `b == S.Zero`) is necessary. This is exactly the two-file approach taken by the only passing prior rollout.

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/pdr_rtv_example_2_rollout.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Excerpt of a rollout executed by \pro{} on \swe{} (\texttt{sympy\_sympy-17318}).}
```
`\label{fig:pdr_rtv_example_2}`{=latex}

Figure `\ref{fig:pdr_rtv_example_3}`{=latex} contains an excerpt of a rollout trajectory generated by `\opus{}`{=latex} for completing the `sparql-university` task in `\terminal{}`{=latex}. In Step 1, the agent explicitly states "Based on the analysis of 4 previous parallel attempts" and then enumerates four "Key insights from previous attempts." Each insight maps to specific discoveries from the prior summaries. For example, it leverages an observation from a passing attempt that "the \>10 students department doesn't need to be in EU, just need to work in SOME EU department and SOME department with \>10 students (they can be different)" and switches from a combined to a separate approach. In Step 4, the agent correctly synthesizes the differences between the approaches taken by previous attempts and selects the correct one. Two of the failing previous attempts combined the EU check and the \>10 students check, producing incorrect results, whereas a successful previous attempt independently discovered this separation by analyzing the test expectations. The refined rollout explicitly calls this out as a "Key insight", which is directly informed by the prior attempts' divergent outcomes.

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/pdr_rtv_example_3_rollout.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Excerpt of a rollout executed by \opus{} on \terminal{} (\texttt{sparql-university}).}
```
`\label{fig:pdr_rtv_example_3}`{=latex}

Figure `\ref{fig:pdr_rtv_example_4}`{=latex} contains an excerpt of a rollout trajectory generated by `\sonnet{}`{=latex} for completing the `sqlite-db-truncate` task in `\terminal{}`{=latex}. The preamble in Step 1 draws on the four prior summaries and enumerates their shared findings. Moreover, the agent synthesizes the "key insights from all attempts". It also resolves a critical conflict between the previous attempts about data types by choosing the interpretation taken by the successful attempts ("seems more accurate based on the SQLite serial type parsing"), which is the key technical distinction between the passing and failing rollouts. In Step 2, the agent directly references the approach taken by the successful previous attempts ("I will create a comprehensive Python script that properly parses the SQLite B-tree leaf page structure\..."), skipping all dead ends from prior attempts and going directly to a complete SQLite B-tree leaf page parser. It produces the final working script in a single step.

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/pdr_rtv_example_4_rollout.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Excerpt of a rollout executed by \sonnet{} on \terminal{} (\texttt{sqlite-db-truncate}).}
```
`\label{fig:pdr_rtv_example_4}`{=latex}

Example rollout trajectories`\appendixreturntotoc`{=latex}
==========================================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}Example rollout trajectories}
```
`\label{sec:example_rollout_trajectories_appendix}`{=latex}

We provide example rollout trajectories for `\swe{}`{=latex} and `\terminal{}`{=latex}. Note that we only provide a *subset* of each trajectory, since the full trajectories are too long to include. We provide both initial rollout traces (Section `\ref{sec:example_initial_rollout_trajectories}`{=latex}) and refined rollout traces (Section `\ref{sec:example_refined_rollout_trajectories}`{=latex}).

Initial rollout trajectories {#sec:example_initial_rollout_trajectories}
----------------------------

We provide examples of initial rollout trajectories from `\swe{}`{=latex} and `\terminal{}`{=latex}. Figure `\ref{fig:example_initial_rollout_sympy__sympy_17630}`{=latex} contains an example rollout trajectory from `\swe{}`{=latex} (`sympy_sympy_17630`), and Figure `\ref{fig:example_initial_rollout_openssl_selfsigned_cert}`{=latex} contains an example rollout trajectory from `\terminal{}`{=latex} (`openssl_selfsigned_cert`).

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/sympy__sympy_17630_gemini_3_1_pro_arxiv.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Subset of a successful initial rollout trajectory executed by \pro{} on \swe{} (\texttt{sympy\_sympy\_17630}).}
```
`\label{fig:example_initial_rollout_sympy__sympy_17630}`{=latex}

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/openssl_selfsigned_cert_claude_4_5_opus_arxiv.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Subset of a successful initial rollout trajectory executed by \opus{} on \terminal{} (\texttt{openssl\_selfsigned\_cert}).}
```
`\label{fig:example_initial_rollout_openssl_selfsigned_cert}`{=latex}

Refined rollout trajectories {#sec:example_refined_rollout_trajectories}
----------------------------

We provide two interesting examples of refined rollout trajectories from `\terminal{}`{=latex}: (1) the well-known `gpt2-codegolf` task, which `\opus{}`{=latex} solves after failing all of its iteration-0 rollouts (Figure `\ref{fig:example_refinement_rollout_gpt2_codegolf}`{=latex}), and (2) the `large-scale-text-editing` task, which no model in this work solves in any iteration-0 rollout and which `\pro{}`{=latex} solves after refinement (Figure `\ref{fig:example_refinement_rollout_large_scale_text_editing}`{=latex}).

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/gpt2_codegolf_claude_4_5_opus_arxiv.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Subset of a successful refinement rollout trajectory executed by \opus{} on \terminal{} (\texttt{gpt2-codegolf}).}
```
`\label{fig:example_refinement_rollout_gpt2_codegolf}`{=latex}

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/large_scale_text_editing_gemini_3_1_pro_arxiv.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Subset of a successful refinement rollout trajectory executed by \pro{} on \terminal{} (\texttt{large-scale-text-editing}).}
```
`\label{fig:example_refinement_rollout_large_scale_text_editing}`{=latex}

Example summaries`\appendixreturntotoc`{=latex}
===============================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}Example summaries}
```
`\label{sec:example_summaries_appendix}`{=latex}

We provide examples of structured summaries for `\swe{}`{=latex} and `\terminal{}`{=latex}. Figure `\ref{fig:example_summary_sympy__sympy_17630}`{=latex} is the structured summary generated by `\pro{}`{=latex} for `\swe{}`{=latex} (`sympy__sympy_17630`), associated with the rollout trace in Figure `\ref{fig:example_initial_rollout_sympy__sympy_17630}`{=latex}. Meanwhile, Figure `\ref{fig:example_summary_openssl_selfsigned_cert}`{=latex} is the structured summary generated by `\opus{}`{=latex} for `\terminal{}`{=latex} (`openssl_selfsigned_cert`), associated with the rollout trace in Figure `\ref{fig:example_initial_rollout_openssl_selfsigned_cert}`{=latex}.

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/swe_bench_summary_example.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Structured summary generated by \pro{} on \swe{} (\texttt{sympy\_\_sympy\_17630}).}
```
`\label{fig:example_summary_sympy__sympy_17630}`{=latex}

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/terminal_bench_summary_example.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Structured summary generated by \opus{} on \terminal{} (\texttt{openssl\_selfsigned\_cert}).}
```
`\label{fig:example_summary_openssl_selfsigned_cert}`{=latex}

Example groupwise comparisons`\appendixreturntotoc`{=latex}
===========================================================

```{=latex}
\addcontentsline{atoc}{section}{\protect\numberline{\thesection}Example groupwise comparisons}
```
`\label{sec:example_groupwise_comparisons_appendix}`{=latex}

We provide examples of groupwise comparisons performed for `\swe{}`{=latex} and `\terminal{}`{=latex} in Figures `\ref{fig:example_comparison_swe_bench}`{=latex} and `\ref{fig:example_comparison_terminal_bench}`{=latex}, respectively.

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/swe_bench_comparison_example.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Groupwise comparison generated by \pro{} on \swe{} (\texttt{django\_\_django-15973}).}
```
`\label{fig:example_comparison_swe_bench}`{=latex}

```{=latex}
\tcbinputlisting{rollouttrajectory, listing options app={basicstyle=\footnotesize\ttfamily}, listing file={examples/terminal_bench_comparison_example.txt}}
```
```{=latex}
\captionsetup{hypcap=false}
```
```{=latex}
\captionof{figure}{Groupwise comparison generated by \opus{} on \terminal{} (\texttt{video-processing}).}
```
`\label{fig:example_comparison_terminal_bench}`{=latex}

[^1]: We experienced frequent API failures with `\pro{}`{=latex} during our final **RTV** experiments due to infrastructure issues, which may be related to its abnormally low success rate as a judge.