From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

September 29, 2025
Authors: Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
cs.AI

Abstract

Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target's atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
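To make the synthetic framework concrete, below is a minimal sketch of the kind of task it describes: each atomic skill is inferring the output of a string transformation f(x) given x, and the composed task is h(x) = g(f(x)). The specific functions used here (reverse and a Caesar shift) are illustrative assumptions, not the paper's actual transformations.

```python
# Illustrative sketch of the paper's synthetic setup: atomic string-transformation
# skills f and g, and their unseen composition h(x) = g(f(x)).
# The concrete functions below are hypothetical examples, not the paper's own.

def reverse(s: str) -> str:
    """Atomic skill f: reverse the string."""
    return s[::-1]

def caesar_shift(s: str, k: int = 1) -> str:
    """Atomic skill g: shift each lowercase letter forward by k positions."""
    return "".join(
        chr((ord(c) - ord("a") + k) % 26 + ord("a")) if c.islower() else c
        for c in s
    )

def compose(*fns):
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    def h(x):
        for fn in fns:
            x = fn(x)
        return x
    return h

if __name__ == "__main__":
    h = compose(reverse, caesar_shift)  # h(x) = g(f(x)), unseen during pre-RL training
    print(h("abc"))  # reverse -> "cba", shift by 1 -> "dcb"
```

In this framing, the model is assumed to have learned f and g individually before RL; the paper's claim is that RL then lets it solve the composed task h, and even longer compositions of more than two functions, without ever seeing those compositions during training.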