Small RL Controller, Large Language Models: RL-Guided Adaptive Sampling for Test-Time Scaling

Abstract

Test-time scaling improves the reasoning performance of large language models but substantially increases both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on a CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. In experiments against strong baselines such as ASC and ESC, our method achieves better trade-offs among answer correctness, number of sampling rounds, and total number of samples.
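The round-by-round stop-or-continue loop described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: `threshold_policy` is a hand-written stand-in for the trained RL controller, the three answer statistics are one plausible choice of state features, and the penalty weights in `episode_reward` are placeholder values showing how a Lagrangian-style reward might trade correctness against rounds (latency) and samples (compute).

```python
from collections import Counter
from typing import Callable, List, Tuple

Stats = Tuple[float, int, int]


def answer_stats(answers: List[str]) -> Stats:
    """Return (share of the most common answer, distinct answers, total samples)."""
    counts = Counter(answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers), len(counts), len(answers)


def threshold_policy(stats: Stats, max_samples: int = 16, min_share: float = 0.75) -> bool:
    """Hypothetical stand-in for the learned controller; True means stop."""
    share, _, total = stats
    return share >= min_share or total >= max_samples


def adaptive_sample(
    sample_fn: Callable[[], str],
    policy: Callable[[Stats], bool],
    batch_size: int = 4,
    max_rounds: int = 10,
) -> Tuple[str, int, int]:
    """Draw batches of final answers until the policy stops; return
    (majority answer, rounds used, samples used)."""
    answers: List[str] = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # One round = one batch of samples, which could be generated in parallel.
        answers.extend(sample_fn() for _ in range(batch_size))
        if policy(answer_stats(answers)):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def episode_reward(
    correct: bool,
    rounds: int,
    samples: int,
    lam_latency: float = 0.05,  # assumed multiplier on sampling rounds
    lam_compute: float = 0.01,  # assumed multiplier on total samples
) -> float:
    """Correctness minus penalties on rounds and samples (assumed reward form)."""
    return float(correct) - lam_latency * rounds - lam_compute * samples
```

In this sketch the controller sees only statistics of the final answers, so the policy itself is cheap enough to evaluate on a CPU; a learned policy would replace `threshold_policy` while the surrounding loop stays unchanged.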