From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
September 29, 2025
Authors: Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
cs.AI
Abstract
Does RL teach LLMs genuinely new skills, or does it merely activate existing
ones? This question lies at the core of ongoing debates about the role of RL in
LLM post-training. On one side, strong empirical results can be achieved with
RL even without preceding supervised finetuning; on the other, critics argue
that RL contributes little beyond reweighting existing reasoning strategies.
This work provides concrete evidence that LLMs can acquire genuinely new skills
during RL by composing existing ones, mirroring one of the central mechanisms
by which humans acquire new cognitive skills. To mitigate data contamination
and other confounding factors, and to allow precise control over task
complexity, we develop a synthetic framework for our investigation.
Specifically, we define a skill as the ability to infer the output of a string
transformation function f(x) given x. When an LLM has already learned f and g
prior to RL, our experiments reveal that RL enables it to learn their unseen
composition h(x)=g(f(x)). Further, this compositional ability
generalizes to more difficult problems such as compositions of >2 functions
unseen during RL training. Surprisingly, our experiments show that
compositional skill acquired on a source task transfers to a different target
task. This transfer happens even without compositional training on the target,
requiring only prior knowledge of the target's atomic skills. Our qualitative
analysis shows that RL fundamentally changes the reasoning behaviors of the
models. In contrast, next-token training with the same data yields none of
these findings. Our systematic experiments provide fresh insights into LLM
learning, suggesting the value of first building base models with basic skills,
then using RL to incentivize advanced, generalizable skills for complex
problems.
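
To make the abstract's synthetic framework concrete, the sketch below illustrates what a "skill as string transformation" task with compositions h(x) = g(f(x)) might look like. The specific function library, prompt format, and helper names (ATOMIC_FUNCTIONS, compose, make_example) are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of the synthetic setup described in the abstract:
# an atomic "skill" is inferring the output of a string transformation
# f(x) given x; a composite task asks for h(x) = g(f(x)).
# Function library and prompt format below are assumptions, not from the paper.

import random
import string

# Example atomic string-transformation functions (hypothetical choices).
ATOMIC_FUNCTIONS = {
    "reverse": lambda s: s[::-1],
    "swap_case": lambda s: s.swapcase(),
    "drop_vowels": lambda s: "".join(c for c in s if c.lower() not in "aeiou"),
    "duplicate": lambda s: s + s,
}

def compose(names):
    """Return h(x) = f_k(...f_2(f_1(x))...) applied in the listed order."""
    def h(x):
        for name in names:
            x = ATOMIC_FUNCTIONS[name](x)
        return x
    return h

def make_example(depth=2, length=6, rng=random):
    """Sample a composition of `depth` atomic functions and an input string,
    returning a (prompt, answer) pair for RL training or evaluation."""
    names = [rng.choice(list(ATOMIC_FUNCTIONS)) for _ in range(depth)]
    x = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    answer = compose(names)(x)
    prompt = (
        f"Apply the functions {', then '.join(names)} to the string '{x}'. "
        "What is the final output?"
    )
    return prompt, answer

if __name__ == "__main__":
    prompt, answer = make_example(depth=2)
    print(prompt)
    print("Expected answer:", answer)
```

In such a setup, training would use depth-2 compositions of functions the model already knows individually, while evaluation at depth > 2 (e.g., make_example(depth=3)) probes the generalization to longer, unseen compositions that the abstract reports.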