Reasoning with Sampling: Your Base Model is Smarter Than You Think
October 16, 2025
Authors: Aayush Karan, Yilun Du
cs.AI
Abstract
Frontier reasoning models have exhibited incredible capabilities across a
wide array of disciplines, driven by posttraining large language models (LLMs)
with reinforcement learning (RL). However, despite the widespread success of
this paradigm, much of the literature has been devoted to disentangling truly
novel behaviors that emerge during RL but are not present in the base models.
In our work, we approach this question from a different angle, instead asking
whether comparable reasoning capabilities can be elicited from base models at
inference time by pure sampling, without any additional training. Inspired by
Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened
distributions, we propose a simple iterative sampling algorithm leveraging the
base models' own likelihoods. Over different base models, we show that our
algorithm offers substantial boosts in reasoning that nearly match and even
outperform those from RL on a wide variety of single-shot tasks, including
MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in
diversity over multiple samples that is characteristic of RL-posttraining.
Crucially, our method does not require training, curated datasets, or a
verifier, suggesting broad applicability beyond easily verifiable domains.
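The abstract describes the method only at a high level. As an illustration of the underlying idea it names, MCMC sampling from a power-sharpened distribution p(x)^alpha using only the base model's own likelihoods, the sketch below is a minimal, self-contained Python example. The toy Markov-chain "base model", the sharpening exponent alpha, the suffix-resampling proposal, and all function names are assumptions made for illustration, not the paper's actual algorithm.

```python
import math
import random

# Toy "base model": a first-order Markov chain over a tiny vocabulary.
# In the paper's setting this role is played by an LLM's autoregressive
# distribution; the stand-in here just keeps the sketch runnable on its own.
TRANS = {
    None: {"a": 0.5, "b": 0.3, "c": 0.2},   # start-of-sequence distribution
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.4, "b": 0.2, "c": 0.4},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}

def sample_suffix(prefix, length):
    """Sample a continuation of `length` tokens from the base model given `prefix`."""
    seq = list(prefix)
    for _ in range(length):
        prev = seq[-1] if seq else None
        probs = TRANS[prev]
        seq.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return seq

def suffix_logprob(seq, start):
    """Base-model log-likelihood of seq[start:] conditioned on seq[:start]."""
    lp = 0.0
    for i in range(start, len(seq)):
        prev = seq[i - 1] if i > 0 else None
        lp += math.log(TRANS[prev][seq[i]])
    return lp

def sharpened_mcmc(seq_len=20, alpha=4.0, n_steps=500):
    """Metropolis-Hastings targeting the sharpened distribution p(x)^alpha.

    Proposal: pick a random position t and resample the suffix x[t:] from the
    base model itself. For this proposal, the MH acceptance probability reduces
    to the suffix likelihood ratio raised to the power (alpha - 1), so the
    sampler only ever queries the base model's own likelihoods.
    """
    x = sample_suffix([], seq_len)                  # initialize from the base model
    for _ in range(n_steps):
        t = random.randrange(seq_len)               # position to resample from
        proposal = x[:t] + sample_suffix(x[:t], seq_len - t)
        log_ratio = suffix_logprob(proposal, t) - suffix_logprob(x, t)
        accept_logprob = (alpha - 1.0) * log_ratio  # MH ratio for target p^alpha
        if math.log(random.random()) < min(0.0, accept_logprob):
            x = proposal
    return x

if __name__ == "__main__":
    random.seed(0)
    print("sharpened sample:", "".join(sharpened_mcmc()))
```

With alpha = 1 the acceptance probability is always 1 and the sampler reduces to plain base-model sampling; larger alpha concentrates mass on higher-likelihood sequences while still drawing proposals from the base model, which is the intuition behind sharpening without any additional training or verifier.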