V_1: Unifying Generation and Self-Verification for Parallel Reasoners

Abstract

Test-time scaling for complex reasoning tasks shows that leveraging inference-time compute, for example by independently sampling and aggregating multiple solutions, significantly improves task outcomes. However, a critical bottleneck is verification: sampling is only effective if correct solutions can be reliably identified among the candidates. While existing approaches typically evaluate candidates independently via scalar scoring, we demonstrate that models are substantially stronger at pairwise self-verification. Leveraging this insight, we introduce V_1, a framework that unifies generation and verification through efficient pairwise ranking. V_1 comprises two components: V_1-Infer, an uncertainty-guided algorithm that uses tournament-based ranking to dynamically allocate self-verification compute to the candidate pairs whose relative correctness is most uncertain; and V_1-PairRL, an RL framework that jointly trains a single model as both generator and pairwise self-verifier, ensuring that the verifier adapts to the generator's evolving distribution. On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being significantly more efficient. Furthermore, V_1-PairRL achieves 7-9% test-time scaling gains over standard RL and pointwise joint training, and improves base Pass@1 by up to 8.7% over standard RL in a code-generation setting.
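The idea of spending pairwise comparisons where the ranking is least certain can be illustrated with a small sketch. This is not the V_1-Infer algorithm itself, whose details are not given in the abstract. It is a simplified stand-in that assumes a hypothetical `compare(a, b)` verifier returning a preference in [0, 1], uses a smoothed win rate as the score, and treats the pair with the smallest score gap as the most uncertain one. The function name `rank_by_uncertain_pairs` and the `budget` parameter are likewise illustrative assumptions.

```python
import itertools
from typing import Callable, List, Sequence


def rank_by_uncertain_pairs(
    candidates: Sequence[str],
    compare: Callable[[str, str], float],
    budget: int,
) -> List[int]:
    """Return candidate indices sorted best-first.

    `compare(a, b)` is a hypothetical pairwise verifier: it returns a value
    in [0, 1] expressing how strongly `a` is preferred over `b`.
    `budget` caps the number of pairwise comparisons performed.
    """
    n = len(candidates)
    wins = [0.0] * n   # accumulated preference mass per candidate
    games = [0] * n    # number of comparisons each candidate took part in
    seen = set()       # pairs that were already compared

    def score(i: int) -> float:
        # Smoothed win rate; a candidate with no comparisons scores 0.5.
        return (wins[i] + 1.0) / (games[i] + 2.0)

    for _ in range(budget):
        pairs = [p for p in itertools.combinations(range(n), 2) if p not in seen]
        if not pairs:
            break  # every pair has been compared once
        # Assumed uncertainty heuristic: smallest score gap first,
        # breaking ties in favour of the least-compared candidates.
        i, j = min(
            pairs,
            key=lambda p: (abs(score(p[0]) - score(p[1])), games[p[0]] + games[p[1]]),
        )
        seen.add((i, j))
        pref = compare(candidates[i], candidates[j])
        wins[i] += pref
        wins[j] += 1.0 - pref
        games[i] += 1
        games[j] += 1

    return sorted(range(n), key=score, reverse=True)
```

In practice `compare` would wrap a model call that judges two candidate solutions side by side, and the first index of the returned list would be the selected answer. With a budget smaller than the number of pairs, the ranking is only an approximation driven by the comparisons that were actually made.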
