Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

January 29, 2026
Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
cs.AI

Abstract

Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
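For concreteness, the claim about approximating the global power distribution can be unpacked with the standard identity below. This is a generic sketch whose notation (base model \pi, exponent \alpha > 1) is assumed here and may not match the paper's own formulation: the sequence-level power distribution factorizes exactly into a low-temperature token term multiplied by a ratio of normalizers over future continuations.

```latex
% Generic sketch; notation (base model \pi, prompt x, response y = (y_1,...,y_T),
% exponent \alpha > 1) is assumed and may not match the paper's formulation.
\[
  \pi_\alpha(y \mid x) \;=\; \frac{\pi(y \mid x)^{\alpha}}{\sum_{y'} \pi(y' \mid x)^{\alpha}}
  \qquad \text{(sequence-level power distribution)}
\]
% Its exact token-level conditionals are a temperature-$1/\alpha$ term times a
% ratio of normalizers over all future continuations:
\[
  \pi_\alpha(y_t \mid x, y_{<t})
  \;=\; \pi(y_t \mid x, y_{<t})^{\alpha}\,
        \frac{Z_\alpha(x, y_{\le t})}{Z_\alpha(x, y_{<t})},
  \qquad
  Z_\alpha(x, y_{<t}) \;=\; \sum_{y_{t:T}} \pi(y_{t:T} \mid x, y_{<t})^{\alpha}.
\]
```

Computing Z_\alpha exactly requires summing over all continuations, which is why naive per-token temperature scaling differs from sequence-level power sampling; the ratio Z_\alpha(x, y_{\le t}) / Z_\alpha(x, y_{<t}) is plausibly what the abstract's "scaling factor capturing future trajectory quality" refers to, with the paper's contribution being an efficient approximation of it.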
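To make the autoregressive-sharpening idea tangible, here is a self-contained toy sketch in Python. It samples from the token-level form above by brute force, so it should be read only as a reference implementation of the target distribution, not of the paper's efficient, training-free approximation (which the abstract does not spell out); the toy model, the Monte Carlo estimator, and all names and constants are assumptions made for illustration.

```python
# Purely illustrative sketch of token-level power sampling on a TOY model.
# This is NOT the paper's algorithm: the future-continuation mass Z_alpha is
# estimated with brute-force Monte Carlo rollouts, exactly the kind of cost
# the paper aims to avoid.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS, MAX_LEN = 8, 0, 6
ALPHA = 4.0  # sharpening exponent; temperature 1/ALPHA at the token level

# Toy "base model": next-token distribution depends only on prefix length.
_W = rng.normal(size=(MAX_LEN, VOCAB))

def base_dist(prefix):
    logits = _W[len(prefix)]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def estimate_Z(prefix, n_rollouts=64):
    """Monte Carlo estimate of Z_alpha(prefix) = E_{c ~ pi(.|prefix)}[pi(c|prefix)^(alpha-1)],
    i.e. the power-alpha mass of all continuations of the prefix."""
    total = 0.0
    for _ in range(n_rollouts):
        seq, logp = list(prefix), 0.0
        while len(seq) < MAX_LEN and seq[-1] != EOS:
            p = base_dist(seq)
            tok = int(rng.choice(VOCAB, p=p))
            logp += np.log(p[tok])
            seq.append(tok)
        total += np.exp((ALPHA - 1.0) * logp)
    return total / n_rollouts

def sharpened_sample():
    """Autoregressive sampling from (an estimate of) the sequence-level power
    distribution: low-temperature token term times future-mass ratio."""
    seq = []
    while len(seq) < MAX_LEN and (not seq or seq[-1] != EOS):
        p = base_dist(seq)
        w = np.array([p[t] ** ALPHA * estimate_Z(seq + [t]) for t in range(VOCAB)])
        w /= w.sum()
        seq.append(int(rng.choice(VOCAB, p=w)))
    return seq

if __name__ == "__main__":
    print("sharpened sample:", sharpened_sample())
```

The per-step cost here is VOCAB × n_rollouts extra rollouts, which is precisely why a cheap approximation of the scaling factor, as the abstract describes, is the crux of making power sampling practical.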