V1: Unifying Generation and Self-Verification for Parallel Reasoners

Harman Singh*1    Xiuyu Li*1
Kusha Sareen2   Monishwaran Maheswaran1   Sijun Tan1   Xiaoxia Wu3   Junxiong Wang3   Alpay Ariyak3
Qingyang Wu3   Samir Khaki1   Rishabh Tiwari1   Long Lian1   Yucheng Lu3   Boyi Li1
Alane Suhr1    Ben Athiwaratkun3    Kurt Keutzer1
* Equal contribution
1UC Berkeley    2Mila    3Together AI
Overview: V1-Infer enables accurate self-verification for parallel reasoners, and V1-PairRL improves pairwise self-verification via RL.

Abstract

Test-time scaling for complex reasoning tasks shows that leveraging inference compute, for example by independently sampling and aggregating multiple solutions, yields significantly better task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce V1, a framework that unifies generation and verification through efficient pairwise ranking. V1 comprises two components: V1-Infer, an uncertainty-guided, tournament-based ranking algorithm that dynamically allocates self-verification compute to candidate pairs whose relative correctness is most uncertain; and V1-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-bench) and math reasoning (AIME, HMMT) benchmarks, V1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, V1-PairRL achieves 7–9% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL in a code-generation setting.

Key Results

+3.3% to +10%
Pass@1 improvement over pointwise verification with V1-Infer across code (CodeContests, LiveCodeBench) and math (AIME, HMMT) benchmarks
+7–9%
Test-time scaling gains with V1-PairRL over standard RL and pointwise joint training
+8.7%
Base Pass@1 improvement over standard RL (V1-PairRL, CodeContests)
+23.7%
Accuracy gain on hard problems with pairwise verification (LCB-v6, 3x budget)

Method

V1-Infer: Uncertainty-Guided Pairwise Verification

V1-Infer selects the best solution among N parallel-generated candidates by performing pairwise comparisons instead of independent scoring. It operates in two phases: (1) Topology Coverage, which ensures every solution is compared a minimum number of times via degree-constrained pairing, and (2) Swiss Refinement, which dynamically pairs solutions with similar quality scores to resolve ambiguous rankings. This concentrates the limited verification budget on the most informative comparisons. Scores are aggregated using confidence-weighted win rates derived from the magnitude of rating differences in each pairwise judgment.
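The two phases can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the names `judge`, `min_degree`, and `swiss_rounds` are hypothetical, and `judge(a, b)` is assumed to return a signed margin in [-1, 1] whose sign picks the winner and whose magnitude acts as the judgment's confidence.

```python
import random
from itertools import combinations

def v1_infer_sketch(candidates, judge, min_degree=2, swiss_rounds=2):
    # judge(a, b) -> signed margin in [-1, 1]; positive means a is judged
    # better, and the magnitude serves as the judgment's confidence.
    n = len(candidates)
    wins = [0.0] * n    # confidence-weighted win totals
    degree = [0] * n    # number of comparisons per candidate

    def compare(i, j):
        margin = judge(candidates[i], candidates[j])
        wins[i] += max(margin, 0.0)
        wins[j] += max(-margin, 0.0)
        degree[i] += 1
        degree[j] += 1

    # Phase 1: topology coverage -- every candidate is compared at least
    # min_degree times via degree-constrained pairing.
    pairs = list(combinations(range(n), 2))
    random.shuffle(pairs)
    for i, j in pairs:
        if degree[i] < min_degree or degree[j] < min_degree:
            compare(i, j)

    # Phase 2: Swiss refinement -- re-pair candidates with similar scores,
    # concentrating the budget where relative correctness is most uncertain.
    for _ in range(swiss_rounds):
        order = sorted(range(n), key=lambda k: wins[k], reverse=True)
        for i, j in zip(order[0::2], order[1::2]):
            compare(i, j)

    # Aggregate by confidence-weighted win rate and return the top candidate.
    rates = [wins[k] / max(degree[k], 1) for k in range(n)]
    return candidates[max(range(n), key=lambda k: rates[k])]
```

With a perfectly reliable judge this recovers the best candidate; in practice the judge is the model itself, so margins are noisy and the Swiss rounds spend extra comparisons on near-tied pairs.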

V1-PairRL: Co-Evolving Generation and Verification

V1-PairRL is a unified RL framework that jointly trains a single LLM to be both a strong solution generator and an accurate pairwise self-verifier. The training objective combines a standard GRPO-based generation loss (JGen) with a pairwise verification loss (JPairVerif) that rewards the model for correctly rating pairs of its own solutions. Critically, V1-PairRL uses an online co-evolving setup where verification training data comes from the model's own current-iteration rollouts, ensuring the verifier always trains on in-distribution data as the generator improves.
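The co-evolving data step can be sketched as below, assuming outcome-labeled rollouts; the function name, pair format, and `max_pairs` cap are hypothetical illustrations, not the paper's interface. Each verification training pair contrasts a correct and an incorrect solution drawn from the model's own current-iteration rollouts.

```python
import random

def build_pair_verification_batch(rollouts, max_pairs=8, seed=0):
    # rollouts: list of (solution_text, is_correct) from the current
    # generation step, so verification data stays in-distribution
    # with the generator as it improves.
    rng = random.Random(seed)
    correct = [s for s, ok in rollouts if ok]
    incorrect = [s for s, ok in rollouts if not ok]
    pairs = []
    if not correct or not incorrect:
        return pairs  # need both outcomes to form a contrastive pair
    while len(pairs) < max_pairs:
        a, b = rng.choice(correct), rng.choice(incorrect)
        # Randomize presentation order so the verifier cannot exploit
        # position bias; the label records which side is correct.
        if rng.random() < 0.5:
            pairs.append({"first": a, "second": b, "label": "first"})
        else:
            pairs.append({"first": b, "second": a, "label": "second"})
    return pairs
```

The verification reward then checks whether the model's pairwise judgment matches the label, and that reward is combined with the GRPO generation objective during joint training.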

Motivation: Limitations of Current Self-Verification and Aggregation

Pointwise self-verification suffers from calibration collapse. Standard pointwise verification assigns scalar scores to solutions in isolation. This approach is fundamentally limited by the lack of a comparative reference set. Absolute scores lack a globally comparable scale, leading to high variance and poor cross-context calibration, often over-scoring plausible but incorrect solutions. Pairwise judgments simplify the task to a well-posed relative comparison, yielding significantly higher top-1 self-ranking accuracy.

Self-aggregation reduces Pass@N and collapses diversity. Self-aggregation-based methods prompt the same LLM to consolidate parallel solutions into one. While they can improve Pass@1, they risk diversity collapse: for Recursive Self-Aggregation (RSA), the Pass@N score (the probability that at least one correct solution exists among N candidates) monotonically decreases as aggregation steps increase. This indicates that RSA frequently discards or degrades correct outlier solutions during refinement.
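Pass@N as defined above can be estimated with the standard unbiased estimator of Chen et al. (2021), which is not specific to this work: given n generations of which c are correct, the probability that a random size-k subset contains at least one correct solution is 1 - C(n-c, k)/C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn without replacement
    # from n generations (c of them correct) is correct.
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Diversity collapse shows up directly in this quantity: if aggregation steps shrink the number of surviving correct solutions c, Pass@k falls for every k.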

Ranking accuracy: pairwise vs pointwise
Pairwise self-verification achieves +4.7% to +6.1% higher top-1 ranking accuracy than pointwise.
RSA collapse: GPT-OSS-20B
RSA shows declining Pass@N (diversity collapse) on LiveCodeBench for GPT-OSS-20B.
RSA collapse: Qwen3-4B-Instruct
Same diversity collapse pattern for Qwen3-4B-Instruct.

V1-Infer Results

V1-Infer consistently outperforms pointwise self-verification across all benchmarks and models at N=16 candidate solutions. On code generation, V1-Infer achieves gains of +7.3% on CodeContests (GPT-OSS-20B), +8.6% on LiveCodeBench-v5, and +4.6% on LiveCodeBench-v6. On math reasoning, gains reach +10.0% on both AIME (Qwen3-4B-Instruct) and HMMT (GPT-OSS-20B).

V1-Infer results across benchmarks
Performance after self-verification using V1-Infer compared with pointwise self-verification across benchmarks and models at N=16 base generations.

Budget efficiency. V1-Infer consistently outperforms pointwise self-verification at equivalent total budgets (generation + verification calls) and scales monotonically with compute. Even V1-Infer with N=8 outperforms pointwise verification with N=16 at comparable budgets, demonstrating superior compute efficiency.

Budget vs accuracy
Accuracy vs. total budget (generation + verification calls). V1-Infer consistently outperforms pointwise at equivalent budgets and shows monotonic scaling.

Comparison with Recursive Self-Aggregation (RSA). On LiveCodeBench-v6 (N=16), V1-Infer achieves higher accuracy with fewer model calls compared to RSA. While RSA accuracy plateaus or degrades with additional aggregation steps due to diversity collapse, V1-Infer scales monotonically with verification budget.

V1-Infer vs RSA comparison
Comparison with Recursive Self-Aggregation (RSA) on LCB-v6. V1-Infer achieves higher accuracy with fewer model calls.

Largest Gains on Hard Problems

Pairwise verification provides the most significant improvements on harder problems. On hard problems in LiveCodeBench-v6 (GPT-OSS-20B, N=16), pairwise verification with 3x budget achieves +23.7% improvement over Pass@1, compared to +15.4% on medium and +0.7% on easy problems (which are already near ceiling at 96%).

Difficulty-level accuracy breakdown
Accuracy by problem difficulty on LCB-v6 (GPT-OSS-20B, N=16) as verification budget increases. Hard problems benefit the most, gaining +23.7% at 3x budget.

Generalization to Software Engineering

V1-Infer generalizes beyond competitive programming and math to real-world software engineering tasks. On SWE-bench Lite (300 instances, 8 candidates, Gemini 2.5 Flash), pairwise verification achieves a 33.3% resolve rate, outperforming pointwise (28.3%) and vanilla Pass@1 (26.3%). Notably, the verifier uses only issue descriptions and patch diffs, with no access to repository context or agent trajectories.

SWE-bench Lite results
Resolve rate on SWE-bench Lite (300 instances, N=8 candidates, Gemini 2.5 Flash). Pairwise verification achieves 33.3%, a +5.0% improvement over pointwise and +7.0% over vanilla.

V1-PairRL: Joint Training Improves Both Generation and Verification

V1-PairRL jointly trains a single model as both solution generator and pairwise verifier using reinforcement learning. This co-evolving training yields three key benefits, summarized in the results below:

V1-PairRL training results
V1-PairRL training results (Qwen3-4B-Instruct, N=16). Left: Test-time scaling gains over V1-PointRL at 1x and 2x budget. Middle: Comparison with RL baseline using V1-Infer at 2x budget. Right: Base Pass@1 improvement showing co-training improves generation quality.

Co-Evolving Training is Critical

An important ablation compares co-evolving V1-PairRL (where verification data comes from the model's own current rollouts) with a non-co-evolving variant (where verification is trained on fixed, offline data). Co-evolving training consistently outperforms across all benchmarks, with gains of +4.2% to +6.1%, confirming that keeping the verifier in-distribution with the generator's evolving output is crucial.

Co-evolving vs non-co-evolving ablation
Co-evolving V1-PairRL consistently outperforms non-co-evolving training at 2x budget: +5.2% on LCB-v5, +4.2% on LCB-v6, and +6.1% on CodeContests.