V_1: Unifying Generation and Self-Verification for Parallel Reasoners

Abstract

Work on test-time scaling for complex reasoning tasks shows that spending more inference-time compute, for example by independently sampling multiple solutions and aggregating them, significantly improves task outcomes. The critical bottleneck is verification: sampling helps only if correct solutions can be reliably identified among the candidates. Existing approaches typically score each candidate independently with a scalar, but we demonstrate that models are substantially stronger at pairwise self-verification. Building on this insight, we introduce V_1, a framework that unifies generation and verification through efficient pairwise ranking. V_1 comprises two components. V_1-Infer is an uncertainty-guided algorithm that uses tournament-based ranking to dynamically allocate self-verification compute to the candidate pairs whose relative correctness is most uncertain. V_1-PairRL is an RL framework that jointly trains a single model as both generator and pairwise self-verifier, so that the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, V_1-PairRL achieves 7-9% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL in a code-generation setting.
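To make the idea of uncertainty-guided pairwise verification concrete, the following is a minimal illustrative sketch, not the V_1-Infer algorithm itself, whose details the abstract does not specify. It assumes a hypothetical `compare(a, b)` verifier that returns `True` when candidate `a` is judged better than `b`. The uncertainty heuristic, the win-count scoring, and the names `rank_candidates` and `budget` are all assumptions made for illustration.

```python
import itertools
import random


def rank_candidates(candidates, compare, budget, seed=0):
    """Pick a winner using a fixed budget of pairwise comparisons.

    `compare(a, b)` is a hypothetical verifier returning True if `a` is
    judged better than `b`. Each step spends one comparison on the pair
    whose outcome currently looks least certain (an assumed heuristic).
    """
    rng = random.Random(seed)
    n = len(candidates)
    wins = [[0] * n for _ in range(n)]  # wins[i][j]: times i beat j

    def uncertainty(i, j):
        # Laplace-smoothed estimate of P(i beats j); 0.5 means undecided.
        total = wins[i][j] + wins[j][i]
        p = (wins[i][j] + 1) / (total + 2)
        # Closer to 0.5 and fewer past comparisons -> more uncertain.
        return (1 - abs(2 * p - 1)) / (total + 1)

    pairs = list(itertools.combinations(range(n), 2))
    for _ in range(budget):
        if not pairs:  # fewer than two candidates: nothing to compare
            break
        best = max(uncertainty(i, j) for i, j in pairs)
        # Break ties between equally uncertain pairs at random.
        i, j = rng.choice([p for p in pairs if uncertainty(*p) == best])
        if compare(candidates[i], candidates[j]):
            wins[i][j] += 1
        else:
            wins[j][i] += 1

    # Score each candidate by its total number of pairwise wins.
    scores = [sum(row) for row in wins]
    return candidates[max(range(n), key=scores.__getitem__)]


if __name__ == "__main__":
    # Toy usage: integers stand in for solutions, and "larger is better"
    # stands in for a noise-free verifier.
    print(rank_candidates([3, 7, 5, 1], lambda a, b: a > b, budget=12))
```

With a noisy verifier, pairs whose comparisons disagree keep an estimate near 0.5 and therefore attract more of the budget, which is the intuition behind allocating compute to the most uncertain pairs. A pointwise baseline would instead score each candidate once, in isolation.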
