Scaling Test-Time Compute for Agentic Coding

Source

Status And Credibility

arXiv preprint dated April 16, 2026, from a large agentic-coding and ML team. Treat as credible preliminary systems evidence for long-horizon coding agents, but keep the claims tied to the reported harnesses, frontier-model APIs, and benchmark setup until there is peer review or independent replication.

Core Claim

Long-horizon agentic coding does not scale cleanly by reusing raw interaction traces. Each rollout mixes useful diagnoses, failed commands, repeated terminal output, partial fixes, and environment-specific noise. The paper argues that prior agent experience must first be compressed into structured rollout summaries, and then those summaries can become the reusable interface for selection and refinement.

The source is directly relevant to Alex’s compression thesis:

raw action/observation trajectory
  -> compact structured summary
  -> selection, reuse, and next-rollout conditioning

The important lesson is not merely “summarize logs for readability.” It is that the agentic system works better when intermediate experience is transformed into a bounded representation before downstream computation consumes it.
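A minimal sketch of what "bounded representation" could mean in practice. The field names, the `Step` and `RolloutSummary` types, and the rule-based `summarize` function are all hypothetical; this note does not record the paper's actual summary schema, and the paper's summaries are generated by LLM prompts rather than by rules like these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str        # command or tool call the agent issued
    observation: str   # raw terminal output or tool result
    ok: bool           # whether the step succeeded (e.g. exit code 0)


@dataclass(frozen=True)
class RolloutSummary:
    # Hypothetical schema: every field has a fixed or capped size,
    # so the summary stays bounded however long the rollout was.
    diagnosis: str
    failed_actions: tuple[str, ...]
    final_action: str
    last_step_ok: bool


def summarize(steps: list[Step], diagnosis: str, max_failed: int = 3) -> RolloutSummary:
    """Compress a raw trajectory into a bounded summary (rule-based stand-in)."""
    # Deduplicate failed actions while preserving first-seen order.
    failed = list(dict.fromkeys(s.action for s in steps if not s.ok))
    return RolloutSummary(
        diagnosis=diagnosis[:200],                   # cap free text
        failed_actions=tuple(failed[-max_failed:]),  # keep only the latest few
        final_action=steps[-1].action if steps else "",
        last_step_ok=bool(steps) and steps[-1].ok,
    )
```

Raw observations are dropped entirely here, which is the point of the sketch: downstream selection sees the diagnosis and outcome, not repeated terminal output.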

Method

flowchart LR
  Problem[problem]
  R0[parallel rollouts]
  S0[structured summaries]
  RTV[RTV selection]
  PDR[PDR refinement context]
  R1[fresh refined rollouts]
  Final[final RTV]

  Problem --> R0 --> S0 --> RTV --> PDR --> R1 --> Final

The paper combines two mechanisms:

  • Recursive Tournament Voting (RTV). Run many independent rollouts, summarize each rollout, and recursively compare summaries in small groups until the system selects a final candidate.
  • Agentic Parallel-Distill-Refine (PDR). Run a new generation of rollouts in fresh environments conditioned on selected summaries from previous attempts.
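The recursive small-group selection in RTV can be sketched as below. This is an illustration under assumptions, not the paper's implementation: `recursive_tournament`, `judge`, and `group_size` are hypothetical names, and any voting inside a group is collapsed into a single `judge` callable that returns the preferred candidate.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def recursive_tournament(
    candidates: Sequence[T],
    judge: Callable[[Sequence[T]], T],
    group_size: int = 4,
) -> T:
    """Pick one candidate by repeatedly judging small groups.

    `judge` stands in for whatever compares summaries within a group
    (in the paper's setting, an LLM-based comparison).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            # A leftover group of one advances without a comparison.
            winners.append(group[0] if len(group) == 1 else judge(group))
        pool = winners  # each round shrinks the pool, so the loop terminates
    return pool[0]
```

For example, `recursive_tournament(scores, judge=max, group_size=2)` runs a bracket over numeric scores. With a real judge, the candidates would be rollout summaries rather than numbers.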

RTV provides the parallel selection path. PDR provides the sequential reuse path. The full recipe first selects useful prior summaries, then conditions fresh attempts on them, then selects again.
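The select, refine, select-again ordering can be sketched as a small driver. Everything here is a hypothetical stand-in: `pdr_rtv`, `run_rollout`, `summarize`, `select`, and `n_rollouts` are illustrative names; the sketch assumes a single selected summary as the refinement context and runs rollouts sequentially rather than in parallel.

```python
from typing import Callable, Optional, Sequence, TypeVar

Trajectory = TypeVar("Trajectory")
Summary = TypeVar("Summary")


def pdr_rtv(
    problem: str,
    run_rollout: Callable[[str, Optional[Summary]], Trajectory],
    summarize: Callable[[Trajectory], Summary],
    select: Callable[[Sequence[Summary]], Summary],
    n_rollouts: int = 4,
) -> Summary:
    """One round of refinement followed by final selection (sketch)."""
    # Generation 0: independent rollouts with no prior context.
    first = [summarize(run_rollout(problem, None)) for _ in range(n_rollouts)]

    # Selection chooses which prior summary becomes refinement context.
    context = select(first)

    # Generation 1: fresh rollouts conditioned on the selected summary.
    refined = [summarize(run_rollout(problem, context)) for _ in range(n_rollouts)]

    # Final selection over the refined generation only.
    return select(refined)
```

Only summaries cross the generation boundary in this sketch; raw trajectories never reach `select` or the next rollout.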

Key Contributions

  • Identifies representation of prior agent experience as the bottleneck in test-time scaling for long-horizon agentic coding.
  • Shows compact structured summaries outperform full rollout traces as the substrate for comparing agent attempts.
  • Shows recursive small-group selection can outperform one flat all-at-once comparison over many candidates.
  • Shows next-iteration rollout quality is strongly tied to the quality of selected refinement summaries.
  • Reports consistent gains from PDR+RTV on SWE-Bench Verified and Terminal-Bench v2.0 across Claude, Gemini, and GPT-family frontier models.
  • Reports fewer steps in refined rollouts, suggesting the summaries help agents skip repeated exploration and go directly to useful hypotheses or fixes.

Results To Remember

The headline examples are agentic coding benchmarks, not time series. On SWE-Bench Verified, the final PDR+RTV candidate improves Claude 4.5 Opus from 70.94% to 77.60%, Gemini 3.1 Pro from 72.25% to 76.60%, Claude 4.5 Sonnet from 67.41% to 75.60%, Gemini 3 Flash from 70.79% to 76.00%, and GPT-5 (0825) from 61.41% to 69.80%.

On Terminal-Bench v2.0, the reported absolute gains are larger for every model except GPT-5 (0825): Claude 4.5 Opus from 46.95% to 59.09%, Gemini 3.1 Pro from 52.49% to 64.77%, Claude 4.5 Sonnet from 40.62% to 56.82%, Gemini 3 Flash from 37.93% to 48.86%, and GPT-5 (0825) from 31.32% to 38.64%.
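The percentage-point gains implied by these figures can be tabulated directly. The numbers are copied from the two paragraphs above; `RESULTS`, `gains`, and `GAINS` are illustrative names only.

```python
# (baseline %, PDR+RTV %) per model, as reported in this note.
RESULTS = {
    "SWE-Bench Verified": {
        "Claude 4.5 Opus": (70.94, 77.60),
        "Gemini 3.1 Pro": (72.25, 76.60),
        "Claude 4.5 Sonnet": (67.41, 75.60),
        "Gemini 3 Flash": (70.79, 76.00),
        "GPT-5 (0825)": (61.41, 69.80),
    },
    "Terminal-Bench v2.0": {
        "Claude 4.5 Opus": (46.95, 59.09),
        "Gemini 3.1 Pro": (52.49, 64.77),
        "Claude 4.5 Sonnet": (40.62, 56.82),
        "Gemini 3 Flash": (37.93, 48.86),
        "GPT-5 (0825)": (31.32, 38.64),
    },
}


def gains(results):
    """Absolute gain in percentage points for each benchmark and model."""
    return {
        bench: {model: round(after - before, 2) for model, (before, after) in rows.items()}
        for bench, rows in results.items()
    }


GAINS = gains(RESULTS)

for bench, rows in GAINS.items():
    mean_gain = sum(rows.values()) / len(rows)
    print(f"{bench}: mean gain {mean_gain:.2f} points")
    for model, delta in rows.items():
        print(f"  {model}: +{delta:.2f}")
```

These are absolute differences, not relative improvements. GPT-5 (0825) gains 8.39 points on SWE-Bench Verified and 7.32 on Terminal-Bench v2.0, so it is the one model whose absolute gain is smaller on the second benchmark.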

The ablations matter more than the headline leaderboard numbers for this wiki. The paper reports that structured summaries are a better comparison substrate than raw rollout traces, and that the quality of the selected prior-experience summaries predicts refined rollout success.

Alex Context

This source supports the local idea that end-to-end systems need to compress data as processing proceeds. For agents, the raw stream is not only too large; it is also the wrong abstraction. A model that must learn or act end to end should not repeatedly consume raw logs, traces, or command transcripts when a structured state summary can preserve the decision-relevant parts.

The time-series and observability analogue is:

raw telemetry/logs/traces/events
  -> typed state summary with uncertainty and provenance
  -> action scoring, planning, or next-state prediction

The caveat is equally important. Summaries are useful only if they preserve the state variables that future decisions need. For operations, a summary that removes timing, magnitude, topology, delayed effects, or failed-action status may look compact while destroying the control signal.
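One way to make this caveat checkable is a preservation test on the summary itself. The sketch below is an assumption, not something from the paper: `IncidentSummary` and `missing_fields` are hypothetical, and the fields simply mirror the state variables named above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentSummary:
    # None means "the summarizer did not record this", which is
    # different from an empty value such as [] for failed_actions.
    timing: Optional[str] = None
    magnitude: Optional[float] = None
    topology: Optional[str] = None
    delayed_effects: Optional[str] = None
    failed_actions: Optional[list] = None


def missing_fields(summary: IncidentSummary) -> list[str]:
    """Names of state variables the summary failed to preserve."""
    return [f.name for f in fields(summary) if getattr(summary, f.name) is None]


summary = IncidentSummary(
    timing="spike began 14:02 UTC, lasted 6 min",
    magnitude=3.2,
    failed_actions=[],  # recorded: no actions failed
)
print(missing_fields(summary))
```

A non-empty result is a signal that the summary is compact but may have dropped part of the control signal, so a downstream consumer could refuse it or fall back to raw evidence.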

Gotchas

  • This is not a trained world model. It is a test-time scaling and experience-reuse recipe around frontier LLM agents.
  • The summaries are generated by LLM prompts, so summary fidelity and judge quality become system bottlenecks.
  • The results are specific to coding environments where rollouts are partly verifiable through files, commands, tests, and terminal output.
  • The paper shows that compression can help agentic reuse, but it does not prove that arbitrary compression preserves action-relevant state.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute allocation | adjacent | Allocates test-time compute across rollouts, recursive selection, and refinement rather than only one long attempt. | Needs numeric time-series or action-conditioned trajectory evidence under matched latency and cost. |
| Streaming state and long context | adjacent | Converts long agent trajectories into bounded reusable summaries. | Needs learned state updates for continuous streams, not one-off prompt summaries. |
| Context interface | partially closes | Shows structured summaries can be a better agent interface than raw traces. | Need schema and preservation tests for telemetry, graph state, actions, rewards, and uncertainty. |
| Control and counterfactuals | adjacent | Refined rollouts reuse prior action-observation evidence to choose better next actions. | Does not model counterfactual futures directly or estimate action-conditioned outcome distributions. |

Open Questions

  • What summary schema preserves decision-relevant state for software agents without copying raw traces?
  • Can structured summaries be trained as latent state rather than generated by prompts after each rollout?
  • For observability, which fields must survive compression: timing, magnitude, topology, action latency, error status, ownership, or policy constraints?
  • How should summary uncertainty be represented so downstream agents know when to inspect raw evidence again?
  • Which prior-experience artifacts should persist across rollouts: summaries, notes, tests, patches, scripts, tool outputs, or learned memory state?