From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

Abstract

Does reinforcement learning (RL) teach large language models (LLMs) genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them, h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems, such as compositions of more than two functions, that were unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target's atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these effects. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
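To make the setup concrete, the following is a minimal sketch of what an atomic skill and a composed skill could look like in code. The two transformations (`rotate_left`, `reverse`) and the `compose` helper are hypothetical illustrations chosen for brevity; the abstract does not specify which string functions the framework actually uses.

```python
from functools import reduce
from typing import Callable

StringFn = Callable[[str], str]


def rotate_left(x: str) -> str:
    """Hypothetical atomic skill f: move the first character to the end."""
    return x[1:] + x[:1]


def reverse(x: str) -> str:
    """Hypothetical atomic skill g: reverse the string."""
    return x[::-1]


def compose(*fns: StringFn) -> StringFn:
    """Chain functions left to right, so compose(f, g)(x) == g(f(x))."""
    def composed(x: str) -> str:
        return reduce(lambda acc, fn: fn(acc), fns, x)
    return composed


# Depth-2 composition, matching h(x) = g(f(x)) in the abstract.
h = compose(rotate_left, reverse)
print(h("abcd"))  # rotate_left -> "bcda", then reverse -> "adcb"

# A deeper composition of three functions, the kind of harder problem
# the abstract describes as unseen during RL training.
h3 = compose(rotate_left, reverse, rotate_left)
print(h3("abcd"))  # "adcb" rotated left -> "dcba"
```

In this framing, the model is asked to predict `h(x)` from `x` and the function definitions without executing code, so the ground-truth output above would serve only as the reference answer.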
