From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

Abstract

Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, RL achieves strong empirical results even without preceding supervised fine-tuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn their unseen composition, h(x) = g(f(x)). Further, this compositional ability generalizes to more difficult problems, such as compositions of more than two functions, that were unseen during RL training. Surprisingly, our experiments show that a compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target and requires only prior knowledge of the target's atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training on the same data yields none of these outcomes. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
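To make the setup concrete, the following is a minimal sketch of what "atomic skills" and their compositions could look like as plain string functions. The specific transformations (`rotate_left`, `reverse`), the `compose` helper, and the binary exact-match reward are all illustrative assumptions; the abstract does not specify which functions or reward the framework actually uses.

```python
from typing import Callable

StrFn = Callable[[str], str]


# Hypothetical atomic skills: each maps an input string to an output string.
def rotate_left(x: str) -> str:
    """Move the first character to the end of the string."""
    return x[1:] + x[:1]


def reverse(x: str) -> str:
    """Reverse the string."""
    return x[::-1]


def compose(*fns: StrFn) -> StrFn:
    """Apply fns from left to right, so compose(f, g)(x) == g(f(x))."""
    def h(x: str) -> str:
        for fn in fns:
            x = fn(x)
        return x
    return h


def exact_match_reward(prediction: str, fn: StrFn, x: str) -> float:
    """Assumed reward: 1.0 if the prediction equals fn(x), otherwise 0.0."""
    return 1.0 if prediction == fn(x) else 0.0


# Depth-2 composition: h(x) = reverse(rotate_left(x)).
h = compose(rotate_left, reverse)
print(h("abcd"))  # adcb

# Depth-3 composition, the kind of harder problem held out from RL training.
h3 = compose(rotate_left, reverse, rotate_left)
print(h3("abcd"))  # dcba
```

Under these assumptions, a model that already knows `rotate_left` and `reverse` individually would be rewarded only when its final answer matches `h(x)`, so it must chain the two skills itself rather than recall a memorized mapping.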
