Small RL Controller, Large Language Models: RL-Guided Adaptive Sampling for Test-Time Scaling

Abstract

Test-time scaling improves the reasoning performance of large language models but substantially increases both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. The method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on a CPU. We further show that the resulting framework can be interpreted as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves better trade-offs among answer correctness, number of sampling rounds, and total number of samples required.
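To make the round-by-round stop-or-continue decision concrete, the following is a minimal Python sketch, not the controller described above. It assumes a state built from three answer statistics (sample count, share of the most frequent answer, and its margin over the runner-up), a hand-written threshold rule standing in for the learned policy, and a reward that subtracts hypothetical penalty weights for rounds and samples from a correctness term, in the spirit of a Lagrangian-penalized objective. The names `answer_stats`, `threshold_policy`, `adaptive_sample`, `reward`, and all default values are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, float, float]  # (num samples, top-answer share, margin over runner-up)


def answer_stats(answers: Sequence[str]) -> State:
    """Summarize the final answers seen so far (assumed state representation)."""
    n = len(answers)
    top_two = Counter(answers).most_common(2)
    top_share = top_two[0][1] / n
    second_share = top_two[1][1] / n if len(top_two) > 1 else 0.0
    return n, top_share, top_share - second_share


def threshold_policy(state: State, min_samples: int = 4, margin: float = 0.5) -> bool:
    """Hypothetical stand-in for a learned controller: True means stop sampling."""
    n, _, gap = state
    return n >= min_samples and gap >= margin


def adaptive_sample(
    sample_batch: Callable[[int], List[str]],
    policy: Callable[[State], bool],
    batch_size: int = 4,
    max_rounds: int = 5,
) -> Tuple[str, int, int]:
    """Draw batches until the policy stops or the round budget is exhausted."""
    answers: List[str] = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        answers.extend(sample_batch(batch_size))
        if policy(answer_stats(answers)):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def reward(correct: bool, rounds: int, n_samples: int,
           lam_latency: float = 0.1, lam_compute: float = 0.02) -> float:
    """Assumed reward shape: correctness minus penalties on rounds and samples."""
    return float(correct) - lam_latency * rounds - lam_compute * n_samples
```

In this sketch, `sample_batch` would wrap calls to the language model and return extracted final answers; an RL-trained policy would replace `threshold_policy` while the surrounding loop stays the same.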