Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
January 29, 2026
Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
cs.AI
Abstract
Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
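To make the abstract's central claim concrete, the following is a minimal mathematical sketch of the factorization it alludes to. The notation here (p_theta, pi_alpha, V) is illustrative and not taken from the paper, and the paper's actual estimator of the scaling factor is not reproduced. Writing y = (y_1, ..., y_T) for a response to prompt x, the power distribution is

\[
\pi_\alpha(y \mid x) \;\propto\; p_\theta(y \mid x)^{\alpha}
\;=\; \prod_{t=1}^{T} p_\theta(y_t \mid x, y_{<t})^{\alpha},
\qquad \alpha > 1,
\]

and its exact token-level conditionals take the form

\[
\pi_\alpha(y_t \mid x, y_{<t})
\;=\;
\frac{p_\theta(y_t \mid x, y_{<t})^{\alpha}\, V(x, y_{\le t})}
     {V(x, y_{<t})},
\qquad
V(x, y_{\le t}) \;=\; \sum_{y_{>t}} p_\theta(y_{>t} \mid x, y_{\le t})^{\alpha},
\]

where V sums the power mass of all possible continuations and thus plays the role of the "future trajectory quality" factor described above. Dropping V recovers plain low-temperature (temperature 1/alpha) sampling; per the abstract, the proposed algorithm approximates this scaling factor autoregressively, without MCMC and without an external verifier or reward.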