Language Self-Play For Data-Free Training
September 9, 2025
Authors: Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan
cs.AI
Abstract
Large language models (LLMs) have advanced rapidly in recent years, driven by
scale, abundant high-quality training data, and reinforcement learning. Yet
this progress faces a fundamental bottleneck: the need for ever more data from
which models can continue to learn. In this work, we propose a reinforcement
learning approach that removes this dependency by enabling models to improve
without additional data. Our method leverages a game-theoretic framework of
self-play, where a model's capabilities are cast as performance in a
competitive game and stronger policies emerge by having the model play against
itself - a process we call Language Self-Play (LSP). Experiments with
Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained
models can not only enhance their performance on challenging tasks through
self-play alone, but can also do so more effectively than data-driven
baselines.
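
The abstract describes the self-play loop only at a high level. The following is a minimal, hypothetical sketch of how such a loop could be structured, assuming a single shared policy alternates between a "challenger" role that proposes a task and a "solver" role that answers it, with a scalar reward driving a reinforcement-learning update. The function names (generate, reward, update) and the dictionary-based policy are illustrative placeholders, not the paper's implementation.

```python
"""Structural sketch of a language self-play step (hypothetical, not the
authors' code). One policy plays both roles; the solver's reward is used
to update the shared weights."""

import random


def generate(policy, role_prompt):
    # Hypothetical stand-in for sampling text from the shared policy.
    return f"{role_prompt} -> sample#{random.randint(0, 9)}"


def reward(task, answer):
    # Hypothetical self-assigned score, e.g. a judge score from the same model.
    return random.random()


def update(policy, trajectory, advantage):
    # Placeholder for an RL update (e.g. a policy-gradient step on both roles).
    policy["steps"] += 1


def self_play_step(policy):
    # 1) Challenger role: the policy proposes a task for itself.
    task = generate(policy, "Propose a hard instruction:")
    # 2) Solver role: the same policy answers the proposed task.
    answer = generate(policy, f"Answer: {task}")
    # 3) Score the answer; because challenger and solver share one set of
    #    weights, a single update can improve both roles without external data.
    r = reward(task, answer)
    update(policy, (task, answer), advantage=r)
    return r


if __name__ == "__main__":
    policy = {"steps": 0}
    rewards = [self_play_step(policy) for _ in range(100)]
    print(f"updates: {policy['steps']}, mean reward: {sum(rewards) / len(rewards):.3f}")
```

In a real system the placeholders would be replaced by sampling from the language model, a learned or rule-based reward, and an RL optimizer; the sketch only illustrates that no external training data enters the loop.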