Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence
May 23, 2025
Authors: Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu
cs.AI
Abstract
Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM)
reasoning often incur substantial computational costs, primarily due to
extensive reliance on external Process Reward Models (PRMs) or sampling methods
like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient
self-guided TTS framework that achieves PRM-level performance without costly
external verifier models. Our method employs a lightweight tree search guided
solely by intrinsic LLM signals: token-level confidence and step novelty. A
key innovation is improving the reliability of internal confidence
estimates via a targeted reinforcement learning fine-tuning phase. Empirical
evaluations on challenging mathematical reasoning benchmarks demonstrate that
GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching
or surpassing significantly larger models (e.g., 32B-70B parameters), while
reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG
achieves comparable accuracy with 8x faster inference speeds and 4-5x lower
memory usage. Additionally, GG reduces KV cache memory usage by approximately
50% compared to the BoN strategy, facilitating more efficient and practical
deployment of TTS techniques.
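The abstract's two intrinsic signals, token-level confidence and step novelty, can be illustrated with a minimal sketch. The exact scoring and tree-search procedure are defined in the paper; here, confidence is approximated as the geometric mean of the step's token probabilities, novelty as the fraction of unseen tokens, and the weighting `alpha` is a hypothetical parameter introduced only for illustration.

```python
import math

def step_confidence(token_logprobs):
    # Intrinsic confidence: exp(mean token log-probability), i.e. the
    # geometric mean of the step's token probabilities. Higher means the
    # model was more certain while generating this reasoning step.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def step_novelty(step_tokens, prior_steps):
    # Novelty: fraction of the step's tokens not seen in earlier steps,
    # penalizing repetitive reasoning loops.
    seen = set()
    for step in prior_steps:
        seen.update(step)
    if not step_tokens:
        return 0.0
    return sum(1 for t in step_tokens if t not in seen) / len(step_tokens)

def pick_next_step(candidates, prior_steps, alpha=0.7):
    # Score each candidate step by a weighted mix of the two intrinsic
    # signals; the best-scoring candidate would expand the search tree.
    # `alpha` is an assumed hyperparameter, not from the paper.
    best_score, best_tokens = float("-inf"), None
    for tokens, logprobs in candidates:
        score = (alpha * step_confidence(logprobs)
                 + (1 - alpha) * step_novelty(tokens, prior_steps))
        if score > best_score:
            best_score, best_tokens = score, tokens
    return best_tokens, best_score

# Example: a repetitive candidate loses to a confident, novel one.
prior = [["let", "x", "=", "2"]]
candidates = [
    (["then", "x", "+", "1", "=", "3"], [-0.1, -0.2, -0.1, -0.3, -0.2, -0.1]),
    (["let", "x", "=", "2"], [-0.05, -0.05, -0.05, -0.05]),  # repeats prior step
]
best, score = pick_next_step(candidates, prior)
print(best[0])  # the novel step wins despite slightly lower raw confidence
```

In a real deployment the log-probabilities would come from the LLM's own decoding step, so no external verifier model or extra forward pass is needed, which is the source of the memory and speed savings the abstract reports.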