Budget-aware Test-time Scaling via Discriminative Verification

Abstract

Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin, achieving up to 15.3% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a "free" upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.
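The abstract does not spell out how the discriminative verifier is combined with self-consistency. As a minimal sketch of one plausible reading, the Python below assumes score-weighted voting: each sampled solution contributes its verifier score to its final answer, and the answer with the largest total wins. The function names, the toy answers, and the scores are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def majority_vote(answers):
    """Plain self-consistency: return the most frequent final answer."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def best_of_n(answers, scores):
    """Verifier in isolation: return the answer of the single top-scored sample."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_vote(answers, scores):
    """Assumed hybrid: sum verifier scores per distinct answer, pick the largest total."""
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy case 1: the majority answer is wrong, but the verifier scores it low.
answers_a = ["42", "42", "17", "17", "17"]
scores_a = [0.9, 0.8, 0.2, 0.3, 0.1]

# Toy case 2: one wrong sample gets the single highest score,
# but the consistent answer has more total verifier mass.
answers_b = ["42", "42", "42", "17"]
scores_b = [0.6, 0.6, 0.6, 0.7]

if __name__ == "__main__":
    print(majority_vote(answers_a), weighted_vote(answers_a, scores_a))
    print(best_of_n(answers_b, scores_b), weighted_vote(answers_b, scores_b))
```

In the first toy case plain voting follows the three low-scored samples while the weighted vote follows the verifier; in the second, picking the single top-scored sample is misled by one outlier while the weighted vote is not. Under this assumption the extra cost over self-consistency is one scalar score per sample, which is the sense in which such a hybrid could be a cheap upgrade.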
