Reasoning with Sampling: Your Base Model is Smarter Than You Think

October 16, 2025
Authors: Aayush Karan, Yilun Du
cs.AI

Abstract

Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL post-training. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
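The abstract describes the method only at a high level (MCMC-style sampling from a sharpened distribution using the base model's own likelihoods), so the snippet below is a minimal illustrative sketch of that general idea, not the paper's actual algorithm. It shows an independence Metropolis-Hastings loop targeting the power distribution p(x | prompt)^alpha, with the base model serving both as the proposal and as the scorer; the names `sample_fn`, `loglik_fn`, `alpha`, and `n_iters` are assumptions introduced here for illustration.

```python
import math
import random

def sharpened_mh_sample(prompt, sample_fn, loglik_fn, alpha=4.0, n_iters=20, rng=None):
    """Illustrative sketch (not the paper's algorithm): independence
    Metropolis-Hastings targeting the sharpened distribution
    p(x | prompt)^alpha, using only the base model.

    sample_fn(prompt)             -> a completion sampled from the base model
    loglik_fn(prompt, completion) -> log p(completion | prompt) under the base model

    With the base model itself as the proposal, the acceptance ratio for the
    power distribution simplifies to (p(x') / p(x))^(alpha - 1).
    """
    rng = rng or random.Random()
    current = sample_fn(prompt)
    current_ll = loglik_fn(prompt, current)
    for _ in range(n_iters):
        proposal = sample_fn(prompt)              # fresh draw from the base model
        proposal_ll = loglik_fn(prompt, proposal)
        # log acceptance probability: (alpha - 1) * (log p(x') - log p(x))
        log_accept = (alpha - 1.0) * (proposal_ll - current_ll)
        if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
            current, current_ll = proposal, proposal_ll
    return current
```

In this sketch, larger values of `alpha` concentrate the sampler on completions the base model itself assigns high likelihood, which is one simple way to "sharpen" the base distribution without any additional training or verifier.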