Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Abstract

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on a CPU. We further show that the resulting framework can be interpreted as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves better trade-offs among answer correctness, number of sampling rounds, and total number of samples.
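The round-by-round stop-or-continue decision described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the state features (rounds so far, samples so far, vote share of the leading answer), the batch size, the penalty weights `lam_rounds` and `lam_samples`, and in particular `threshold_policy`, which is a hand-written stand-in for the learned RL controller.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical state: (rounds so far, samples so far, vote share of the leading answer).
State = Tuple[int, int, float]


def summarize(answers: List[str], rounds: int) -> State:
    """Build the controller's state from final-answer statistics only."""
    counts = Counter(answers)
    top_share = counts.most_common(1)[0][1] / len(answers) if answers else 0.0
    return (rounds, len(answers), top_share)


def run_episode(
    sample_batch: Callable[[int], List[str]],
    policy: Callable[[State], str],
    batch_size: int = 4,
    max_rounds: int = 8,
) -> Tuple[str, int, int]:
    """Sample in rounds until the policy says "stop" or the round limit is hit.

    Returns (majority answer, rounds used, total samples drawn).
    """
    answers: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        answers.extend(sample_batch(batch_size))  # one round = one parallel batch
        rounds += 1
        if policy(summarize(answers, rounds)) == "stop":
            break
    if not answers:
        raise ValueError("no samples were drawn")
    final_answer = Counter(answers).most_common(1)[0][0]
    return final_answer, rounds, len(answers)


def episode_reward(
    correct: bool,
    rounds: int,
    samples: int,
    lam_rounds: float = 0.05,   # assumed latency penalty per round
    lam_samples: float = 0.01,  # assumed compute penalty per sample
) -> float:
    """Correctness minus weighted latency (rounds) and compute (samples) costs."""
    return float(correct) - lam_rounds * rounds - lam_samples * samples


def threshold_policy(state: State) -> str:
    """Stand-in for the learned controller: stop once the leading answer
    holds at least 75% of the votes. The threshold is arbitrary."""
    _, _, top_share = state
    return "stop" if top_share >= 0.75 else "continue"
```

In this sketch the two penalty weights play the role that Lagrange multipliers would play in a budget-constrained formulation: raising `lam_rounds` or `lam_samples` pushes a reward-maximizing controller toward stopping earlier. A trained controller would replace `threshold_policy` with a policy learned from `episode_reward` signals.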