Language Self-Play For Data-Free Training
September 9, 2025
Authors: Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan
cs.AI
Abstract
Large language models (LLMs) have advanced rapidly in recent years, driven by
scale, abundant high-quality training data, and reinforcement learning. Yet
this progress faces a fundamental bottleneck: the need for ever more data from
which models can continue to learn. In this work, we propose a reinforcement
learning approach that removes this dependency by enabling models to improve
without additional data. Our method leverages a game-theoretic framework of
self-play, where a model's capabilities are cast as performance in a
competitive game and stronger policies emerge by having the model play against
itself - a process we call Language Self-Play (LSP). Experiments with
Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained
models can not only enhance their performance on challenging tasks through
self-play alone, but can also do so more effectively than data-driven
baselines.
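
The abstract describes the self-play loop only at a high level. The following is a minimal, hypothetical sketch of how such a loop could be structured, assuming a single shared policy alternates between a "challenger" role that proposes a task and a "solver" role that answers it, with a scalar reward driving a reinforcement-learning update. The function names (generate, reward, update) and the dictionary-based policy are illustrative placeholders, not the paper's implementation.

```python
"""Structural sketch of a language self-play step (hypothetical, not the
authors' code). One policy plays both roles; the solver's reward is used
to update the shared weights."""

import random


def generate(policy, role_prompt):
    # Hypothetical stand-in for sampling text from the shared policy.
    return f"{role_prompt} -> sample#{random.randint(0, 9)}"


def reward(task, answer):
    # Hypothetical self-assigned score, e.g. a judge score from the same model.
    return random.random()


def update(policy, trajectory, advantage):
    # Placeholder for an RL update (e.g. a policy-gradient step on both roles).
    policy["steps"] += 1


def self_play_step(policy):
    # 1) Challenger role: the policy proposes a task for itself.
    task = generate(policy, "Propose a hard instruction:")
    # 2) Solver role: the same policy answers the proposed task.
    answer = generate(policy, f"Answer: {task}")
    # 3) Score the answer; because challenger and solver share one set of
    #    weights, a single update can improve both roles without external data.
    r = reward(task, answer)
    update(policy, (task, answer), advantage=r)
    return r


if __name__ == "__main__":
    policy = {"steps": 0}
    rewards = [self_play_step(policy) for _ in range(100)]
    print(f"updates: {policy['steps']}, mean reward: {sum(rewards) / len(rewards):.3f}")
```

In a real system the placeholders would be replaced by sampling from the language model, a learned or rule-based reward, and an RL optimizer; the sketch only illustrates that no external training data enters the loop.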