Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

Abstract

Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily because they rely heavily on external Process Reward Models (PRMs) or on sampling methods such as Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. The method employs a lightweight tree search guided solely by two intrinsic LLM signals: token-level confidence and step novelty. A key innovation is a targeted reinforcement learning fine-tuning phase that improves the reliability of the model's internal confidence estimates. Empirical evaluations on challenging mathematical reasoning benchmarks show that GG enables smaller models (e.g., 1.5B parameters) to match or surpass the accuracy of significantly larger models (e.g., 32B-70B parameters) while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, enabling more efficient and practical deployment of TTS techniques.
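
To make the idea of a search guided by intrinsic signals more concrete, the following is a minimal sketch of how candidate reasoning steps could be ranked by token-level confidence plus a step-novelty bonus. It is an illustration under stated assumptions, not the method described in the paper: the function names, the use of mean token probability as the confidence measure, the Jaccard-based novelty measure, and the `novelty_weight` parameter are all hypothetical choices made here for the example.

```python
import math


def step_confidence(token_logprobs):
    """Mean token probability of one step (assumed aggregation, not from the paper)."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def step_novelty(step_text, previous_steps):
    """1 minus the highest Jaccard word overlap with any earlier step (assumed measure)."""
    words = set(step_text.lower().split())
    if not words or not previous_steps:
        return 1.0
    best_overlap = 0.0
    for prev in previous_steps:
        prev_words = set(prev.lower().split())
        union = words | prev_words
        if union:
            best_overlap = max(best_overlap, len(words & prev_words) / len(union))
    return 1.0 - best_overlap


def select_step(candidates, previous_steps, novelty_weight=0.2):
    """Pick the candidate (text, token_logprobs) with the highest combined score.

    novelty_weight is a hypothetical hyperparameter for this sketch.
    """
    def score(candidate):
        text, logprobs = candidate
        return step_confidence(logprobs) + novelty_weight * step_novelty(text, previous_steps)

    return max(candidates, key=score)
```

In a tree search of this kind, a selection function like `select_step` would be called at each expansion, so no external reward model has to be loaded or queried; the only inputs are log-probabilities the generating model already produces.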
