
Language Self-Play For Data-Free Training

September 9, 2025
Authors: Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan
cs.AI

Abstract

Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model's capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself - a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only enhance their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.
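The abstract leaves the mechanics of the self-play game implicit. The sketch below illustrates one plausible reading: a single policy alternates between a "challenger" role that proposes queries and a "solver" role that answers them, with a scalar reward defining a zero-sum game between the two roles. Every name here (generate, reward_model, policy_gradient_step, the canned strings) is a hypothetical stub standing in for an LLM and an RL trainer; this is not the paper's actual implementation.

```python
"""Minimal sketch of a Language Self-Play (LSP) style loop.

Assumption (not from the paper): the game is a challenger/solver
decomposition of one shared policy. Stubs replace the LLM and the RL
update so the loop runs end to end as a toy.
"""

import random

CANNED_QUERIES = [
    "Summarize the plot of a novel in two sentences.",
    "Explain gradient descent to a high-school student.",
    "Write a limerick about distributed systems.",
]
CANNED_ANSWERS = [
    "A short, vague reply.",
    "A longer reply that addresses the question with concrete detail.",
    "A thorough, well-structured answer with examples and a summary.",
]

def generate(role: str, context: str) -> str:
    """Stub for sampling from the shared policy in the given role."""
    pool = CANNED_QUERIES if role == "challenger" else CANNED_ANSWERS
    return random.choice(pool)

def reward_model(query: str, answer: str) -> float:
    """Stub quality score; a real setup would use a learned reward model."""
    return min(len(answer) / 60.0, 1.0)

def policy_gradient_step(samples) -> None:
    """Stub update. A real setup would apply an RL step (e.g. PPO) that
    raises the likelihood of high-reward solver responses while steering
    the challenger toward queries the current solver handles poorly."""
    pass

def lsp_iteration(batch_size: int = 4) -> float:
    """One self-play round: the same policy generates queries and answers."""
    samples = []
    for _ in range(batch_size):
        query = generate("challenger", context="")   # model as its own opponent
        answer = generate("solver", context=query)   # same model answers
        r = reward_model(query, answer)
        # Zero-sum framing: the solver maximizes r, the challenger gets -r,
        # so harder queries and better answers co-evolve.
        samples.append((query, answer, r))
    policy_gradient_step(samples)
    return sum(r for _, _, r in samples) / batch_size

if __name__ == "__main__":
    random.seed(0)
    for step in range(3):
        print(f"step {step}: mean solver reward = {lsp_iteration():.3f}")
```

Under this reading, the zero-sum pairing is what removes the need for external data: rewarding the challenger with -r pushes it toward queries at the frontier of the solver's current ability, so the model generates its own curriculum as it improves.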