Language Self-Play For Data-Free Training

Abstract

Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, in which a model's capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself, a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only improve their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.
