Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

May 23, 2025
Authors: Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu
cs.AI

Abstract

Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals: token-level confidence and step novelty. A critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.
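The abstract does not spell out how the intrinsic signals are combined, but the core idea, ranking candidate reasoning steps by the model's own token-level confidence plus a novelty term instead of querying an external verifier, can be sketched as below. This is a minimal illustration under stated assumptions: the geometric-mean confidence, the n-gram-overlap novelty measure, and the weight `alpha` are hypothetical choices, not the paper's actual formulation.

```python
import math
from typing import List, Set, Tuple


def step_confidence(token_logprobs: List[float]) -> float:
    """Intrinsic confidence of a candidate step: geometric mean of token
    probabilities, i.e. exp(mean log-prob). Illustrative aggregation; the
    paper may combine token-level confidence differently."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def step_novelty(step_tokens: List[str],
                 seen_ngrams: Set[Tuple[str, ...]], n: int = 3) -> float:
    """Novelty of a step: fraction of its n-grams not seen in previously
    expanded steps (an assumed proxy for the paper's 'step novelty')."""
    ngrams = {tuple(step_tokens[i:i + n]) for i in range(len(step_tokens) - n + 1)}
    if not ngrams:
        return 1.0
    return len(ngrams - seen_ngrams) / len(ngrams)


def score_step(token_logprobs: List[float], step_tokens: List[str],
               seen_ngrams: Set[Tuple[str, ...]], alpha: float = 0.7) -> float:
    """Self-guided score for ranking children in a lightweight tree search,
    with no external PRM. alpha is a hypothetical confidence/novelty weight."""
    return (alpha * step_confidence(token_logprobs)
            + (1.0 - alpha) * step_novelty(step_tokens, seen_ngrams))


# Toy usage: pick the more promising of two candidate next steps.
seen: Set[Tuple[str, ...]] = set()
cand_a = {"logprobs": [-0.10, -0.20, -0.05], "tokens": "so x equals 4".split()}
cand_b = {"logprobs": [-1.50, -2.00, -0.90], "tokens": "so x equals 7".split()}
best = max((cand_a, cand_b),
           key=lambda c: score_step(c["logprobs"], c["tokens"], seen))
print(" ".join(best["tokens"]))  # the higher-confidence candidate is expanded first
```

In the full method, scores of this kind would decide which partial solutions the tree search expands next, and the reinforcement-learning fine-tuning stage described in the abstract is what makes the model's confidence signal reliable enough to be used this way.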
