Language Self-Play For Data-Free Training

Abstract

Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: models need ever more data from which to continue learning. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, in which a model's capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself. We call this process Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only improve their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.
