
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

December 11, 2023
作者: Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, Kathleen Kenealy, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, Noah Fiedel
cs.AI

Abstract

Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST^{EM}, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST^{EM} scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
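The abstract's three-step loop — (1) sample from the model and filter with binary feedback, (2) fine-tune on the filtered samples, (3) repeat — can be sketched as a toy self-training loop. This is a minimal illustrative sketch, not the paper's PaLM-2 pipeline: `generate`, `verify`, and `finetune` are hypothetical stand-ins, and the "model" is a single integer estimate on an arithmetic-style task with verifiable answers.

```python
def generate(model, problem):
    """Step (1a): deterministically propose candidate solutions near the
    model's current estimate (a stand-in for sampling from an LM)."""
    return [model["estimate"] + delta for delta in range(-2, 3)]

def verify(problem, solution):
    """Step (1b): binary feedback -- 1 if the solution is provably
    correct (as on math problems where correctness can be checked)."""
    return solution == problem["answer"]

def finetune(model, correct_samples):
    """Step (2): fit the model to its own verified outputs
    (the M-step on data collected in the E-step)."""
    if correct_samples:
        model["estimate"] = round(sum(correct_samples) / len(correct_samples))
    return model

def rest_em(model, problems, iterations=3):
    """Step (3): repeat the generate-filter-finetune cycle a few times."""
    for _ in range(iterations):
        filtered = [s for p in problems
                    for s in generate(model, p) if verify(p, s)]
        model = finetune(model, filtered)
    return model

model = rest_em({"estimate": 40}, [{"answer": 42}])
print(model["estimate"])  # the estimate converges onto the verified answer: 42
```

The key property the sketch mirrors is that no human-written solutions enter the loop: training data comes entirely from the model's own outputs, kept only when an external check confirms correctness.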