Language Self-Play For Data-Free Training
September 9, 2025
Authors: Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan
cs.AI
Abstract
Large language models (LLMs) have advanced rapidly in recent years, driven by
scale, abundant high-quality training data, and reinforcement learning. Yet
this progress faces a fundamental bottleneck: the need for ever more data from
which models can continue to learn. In this work, we propose a reinforcement
learning approach that removes this dependency by enabling models to improve
without additional data. Our method leverages a game-theoretic framework of
self-play, where a model's capabilities are cast as performance in a
competitive game and stronger policies emerge by having the model play against
itself, a process we call Language Self-Play (LSP). Experiments with
Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained
models can not only enhance their performance on challenging tasks through
self-play alone, but can also do so more effectively than data-driven
baselines.
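
The abstract describes the training loop only at a high level. The sketch below illustrates one plausible reading of it: a single model alternates between a challenger role that invents instructions and a solver role that answers them, with opposing rewards turning self-improvement into a competitive game. Everything beyond the abstract here is an assumption: the challenger prompt, the judge, the rl_update signature, and all names are hypothetical placeholders, not the paper's actual algorithm or API.

```python
import random

# --- Hypothetical stubs so the sketch runs end to end. In practice,
# the model would be an LLM (e.g. Llama-3.2-3B-Instruct) and the judge
# a reward model; these names are illustrative, not the authors' API. ---

class ToyModel:
    """Stands in for the single LLM that plays both roles."""
    def generate(self, prompt: str) -> str:
        return f"<completion of: {prompt[:30]}...>"

def judge(instruction: str, answer: str) -> float:
    """Stands in for a reward model scoring answer quality in [0, 1]."""
    return random.random()

def rl_update(model, instruction, answer, solver_reward, challenger_reward):
    """Stands in for a policy-gradient update applied to the shared
    weights with both reward signals. A no-op in this toy sketch."""
    pass

CHALLENGER_PROMPT = (
    "You are the challenger. Write one difficult instruction "
    "for an assistant to follow."
)

def self_play_step(model):
    # Role 1: challenger mode -- the model invents a new instruction,
    # so no external training data is needed.
    instruction = model.generate(CHALLENGER_PROMPT)

    # Role 2: solver mode -- the same model answers its own instruction.
    answer = model.generate(instruction)

    # Score the answer; in a zero-sum framing the solver is rewarded
    # for quality and the challenger for exposing weaknesses.
    quality = judge(instruction, answer)
    rl_update(model, instruction, answer,
              solver_reward=quality, challenger_reward=-quality)

model = ToyModel()
for _ in range(3):  # a few illustrative self-play iterations
    self_play_step(model)
```

The zero-sum reward shaping is what keeps the loop data-free: as the solver improves, easy instructions stop paying off for the challenger, which pushes it toward a moving frontier of harder tasks without any external dataset. How the paper actually balances the two roles and performs the updates is not specified in the abstract.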