Model-Task Alignment Drives Distinct RL Outcomes
August 28, 2025
Authors: Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
cs.AI
Abstract
Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold, and, critically, when they fail, remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong Model-Task Alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
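The abstract measures Model-Task Alignment via pass@k accuracy but does not spell out the estimator. A minimal sketch, assuming the standard unbiased pass@k estimator from the code-generation literature (Chen et al., 2021), which this paper may or may not use verbatim, with `n` sampled generations per problem of which `c` are correct:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k completions, drawn without replacement from n samples with c correct,
    is correct (estimator of Chen et al., 2021)."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 16 samples per problem, 4 of them correct, k = 8
print(round(pass_at_k(n=16, c=4, k=8), 3))  # 0.962
```

In this reading, high pass@k on the evaluated task indicates that the pretrained model already solves the task within a modest sampling budget, which is the sense of strong Model-Task Alignment used in the abstract.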