Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Abstract

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends heavily on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be solved in multiple ways that rely on different forms of reasoning, and exposure to only a narrow range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training, an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches to generate multiple variants of correct answers for each question in the training data, and then fine-tune on this data. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and on out-of-distribution (OOD) tasks such as code generation and narrative reasoning. Overall, our study shows that having a language model learn multiple problem-solving approaches through self-generated data helps subsequent RL.