Off-the-Shelf LLMs as Process Scorers: A Training-Free Alternative to PRMs for Mathematical Reasoning

Abstract

Selecting the best response from multiple small-model samples with a stronger scorer is a simple inference-time strategy, but it fails when the small model has already committed to an incorrect reasoning path in every sample. Search guided by a process reward model (PRM) avoids this by scoring candidate continuations during generation, but it requires a reward model trained on step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as the process scorer. At each step, a small model samples k fixed-length candidate chunks, and the large model scores them by likelihood without generating any text. The selected chunk is committed before the next step, so generation is steered before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable because of a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 percentage points (pp) and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B-guided search on most benchmarks without any reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4 to 6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM-guided search.
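The two selection rules and the commit-then-continue loop can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `sample_chunks` and `score_large` callables, and the unweighted form of the CGS score (mean large-model log-probability minus mean small-model log-probability) are assumptions made for illustration, since the abstract does not specify them. The models are represented only by the per-token log-probabilities they would return.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[int]
# Hypothetical interfaces (assumptions, not from the paper):
#   sample_chunks(prefix, k) -> k candidates, each (chunk_tokens, small_model_logprobs)
#   score_large(prefix, chunk) -> large-model log-probability of each chunk token
Sampler = Callable[[Tokens, int], List[Tuple[Tokens, List[float]]]]
Scorer = Callable[[Tokens, Tokens], List[float]]


def mean_logprob(logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    return sum(logprobs) / len(logprobs)


def lgs_score(large_lp: Sequence[float]) -> float:
    """LGS: length-normalized large-model log-probability."""
    return mean_logprob(large_lp)


def cgs_score(large_lp: Sequence[float], small_lp: Sequence[float]) -> float:
    """CGS (assumed unweighted form): large-model score minus small-model score."""
    return mean_logprob(large_lp) - mean_logprob(small_lp)


def select_chunk(large: Sequence[Sequence[float]],
                 small: Sequence[Sequence[float]],
                 rule: str = "cgs") -> int:
    """Return the index of the best candidate under the chosen rule."""
    if rule == "lgs":
        scores = [lgs_score(l) for l in large]
    elif rule == "cgs":
        scores = [cgs_score(l, s) for l, s in zip(large, small)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    # On ties, max() keeps the first candidate with the top score.
    return max(range(len(scores)), key=scores.__getitem__)


def guided_generate(prompt: Tokens, sample_chunks: Sampler, score_large: Scorer,
                    k: int, max_steps: int, rule: str = "cgs") -> Tokens:
    """Sample k chunks, score them, commit the winner, then repeat."""
    tokens = list(prompt)
    for _ in range(max_steps):
        candidates = sample_chunks(tokens, k)
        large = [score_large(tokens, chunk) for chunk, _ in candidates]
        small = [lp for _, lp in candidates]
        best = select_chunk(large, small, rule)
        tokens.extend(candidates[best][0])  # commit before the next step
    return tokens


# Toy log-probabilities for three two-token candidates.
large_lp = [[-1.0, -1.0], [-0.5, -0.7], [-2.0, -0.2]]
small_lp = [[-0.9, -0.9], [-0.1, -0.1], [-2.5, -1.5]]
lgs_choice = select_chunk(large_lp, small_lp, "lgs")
cgs_choice = select_chunk(large_lp, small_lp, "cgs")
```

In the toy data the two rules disagree: LGS picks the candidate the large model finds most likely, while CGS picks the one the large model rates far higher than the small model does. A real system would add a stopping condition (for example, an end-of-sequence token) and would obtain the log-probabilities from forward passes of the two models.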