Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
February 2, 2026
Authors: Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
cs.AI
Abstract
Reinforcement learning enhances the reasoning capabilities of large language models but often incurs high computational costs due to rollout-intensive optimization. Online prompt selection offers a plausible remedy: prioritizing informative prompts improves training efficiency. However, current methods either depend on costly exact evaluations or construct prompt-specific predictive models that do not generalize across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference over prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test time, enabling efficient computational allocation. Experiments across varied reasoning benchmarks show that GPS substantially improves training efficiency, final performance, and test-time efficiency over strong baseline methods.
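The batch acquisition principle described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes the predictive model outputs a per-prompt success probability, scores informativeness by Bernoulli variance (which peaks at intermediate difficulty, p = 0.5), and adds a hypothetical diversity bonus measured as distance from the centroid of previously selected prompt embeddings (standing in for the "history anchor"). The function name and all parameters are invented for illustration.

```python
import numpy as np

def select_prompt_batch(pred_success_prob, batch_size,
                        prompt_embeddings=None, history_embeddings=None,
                        diversity_weight=0.1):
    """Toy batch acquisition: rank prompts by intermediate-difficulty
    informativeness plus an optional history-anchored diversity bonus."""
    p = np.asarray(pred_success_prob, dtype=float)
    # Bernoulli variance p(1 - p) is maximal at p = 0.5, so it rewards
    # prompts of intermediate predicted difficulty.
    info = p * (1.0 - p)
    if prompt_embeddings is None or history_embeddings is None:
        return np.argsort(-info)[:batch_size]
    # Diversity bonus: distance from the mean embedding of prompts
    # selected in earlier iterations (the assumed history anchor).
    anchor = np.asarray(history_embeddings).mean(axis=0)
    dist = np.linalg.norm(np.asarray(prompt_embeddings) - anchor, axis=1)
    score = info + diversity_weight * dist / (dist.max() + 1e-8)
    return np.argsort(-score)[:batch_size]
```

In practice the paper's acquisition would operate on the posterior from the generative model rather than point probabilities; this sketch only conveys why near-0.5 predicted success rates are the most informative for policy-gradient updates.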