Model-Task Alignment Drives Distinct RL Outcomes
August 28, 2025
Authors: Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
cs.AI
Abstract
Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold, and, critically, when they fail, remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong Model-Task Alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
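The abstract measures Model-Task Alignment via pass@k accuracy but does not spell out the estimator. A minimal sketch, assuming the standard unbiased pass@k estimator from the code-generation literature (Chen et al., 2021), which this paper may or may not use verbatim, with `n` sampled generations per problem of which `c` are correct:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of
    k completions, drawn without replacement from n samples with c correct,
    is correct (estimator of Chen et al., 2021)."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 16 samples per problem, 4 of them correct, k = 8
print(round(pass_at_k(n=16, c=4, k=8), 3))  # 0.962
```

In this reading, high pass@k on the evaluated task indicates that the pretrained model already solves the task within a modest sampling budget, which is the sense of strong Model-Task Alignment used in the abstract.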