When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning

Abstract

Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method scores each solution with a reward model (verifier) and chooses the best one. Recent advances in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should the budget be spent on scaling the number of solutions via SC, or on generating fewer solutions and allocating compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
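To make the trade-off concrete, here is a minimal Python sketch of the two selection rules and a crude budget count. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, verification scores are assumed to be numbers in [0, 1] that are averaged per solution, and every generation (solution or verification) is assumed to cost one unit of compute.

```python
from collections import Counter
from typing import Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def genrm_best_of_n(
    answers: Sequence[str],
    verification_scores: Sequence[Sequence[float]],
) -> str:
    """Return the answer of the solution with the highest mean verification score.

    Assumption: each solution has at least one score, and averaging is the
    aggregation rule (the abstract only says multiple verifications are used).
    """
    means = [sum(scores) / len(scores) for scores in verification_scores]
    best_index = max(range(len(answers)), key=lambda i: means[i])
    return answers[best_index]


def generation_calls(num_solutions: int, num_verifications: int = 0) -> int:
    """Count LLM generations: one per solution plus one per verification.

    Assumption: a verification costs the same as a solution.
    """
    return num_solutions * (1 + num_verifications)


answers = ["4", "5", "4"]
scores = [[0.2, 0.4], [0.9, 0.7], [0.3, 0.1]]
sc_answer = self_consistency(answers)
genrm_answer = genrm_best_of_n(answers, scores)
```

Under this simplified accounting, a budget of 8 generations buys either 8 solutions for SC (`generation_calls(8)`) or only 2 solutions with 3 verifications each for GenRM (`generation_calls(2, 3)`), which is the tension the abstract describes.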
