Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
May 30, 2025
作者: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
cs.AI
Abstract
We explore a method for improving the performance of large language models
through self-reflection and reinforcement learning. By incentivizing the model
to generate better self-reflections when it answers incorrectly, we demonstrate
that a model's ability to solve complex, verifiable tasks can be enhanced even
when generating synthetic data is infeasible and only binary feedback is
available. Our framework operates in two stages: first, upon failing a given
task, the model generates a self-reflective commentary analyzing its previous
attempt; second, the model is given another attempt at the task with the
self-reflection in context. If the subsequent attempt succeeds, the tokens
generated during the self-reflection phase are rewarded. Our experimental
results show substantial performance gains across a variety of model
architectures, as high as a 34.7% improvement at math equation writing and an 18.1%
improvement at function calling. Notably, smaller fine-tuned models (1.5
billion to 7 billion parameters) outperform models in the same family that are
10 times larger. Our novel paradigm is thus an exciting pathway to more useful
and reliable language models that can self-improve on challenging tasks with
limited external feedback.
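
Below is a minimal Python sketch of the two-stage loop the abstract describes, written under stated assumptions: generate, verify, and reward_reflection are hypothetical stand-ins for the model's sampling routine, the task's binary verifier, and the reinforcement-learning update applied to the self-reflection tokens; none of these names come from the paper's implementation.

```python
# Minimal sketch of one Reflect-Retry-Reward episode as described in the abstract.
# `generate`, `verify`, and `reward_reflection` are hypothetical placeholders,
# not functions from the paper's codebase.
from typing import Callable


def reflect_retry_reward(
    task_prompt: str,
    generate: Callable[[str], str],                    # LLM sampling (assumed)
    verify: Callable[[str], bool],                     # binary task verifier (assumed)
    reward_reflection: Callable[[str, float], None],   # RL update on reflection tokens (assumed)
) -> bool:
    """Try the task once; on failure, reflect, retry, and reward the reflection if the retry succeeds."""
    # First attempt at the task.
    first_answer = generate(task_prompt)
    if verify(first_answer):
        return True  # solved outright; no self-reflection is trained

    # Stage 1: the model critiques its own failed attempt.
    reflection = generate(
        f"{task_prompt}\n"
        f"Previous attempt (incorrect): {first_answer}\n"
        "Briefly explain what went wrong and how to do better."
    )

    # Stage 2: retry with the self-reflection in context.
    second_answer = generate(
        f"{task_prompt}\n"
        f"Self-reflection: {reflection}\n"
        "Attempt the task again."
    )
    success = verify(second_answer)

    # Only the self-reflection tokens are credited, and only when the retry
    # succeeds (a reward of 0 otherwise).
    reward_reflection(reflection, 1.0 if success else 0.0)
    return success


# Toy usage with stand-in components (illustrative only).
if __name__ == "__main__":
    canned = iter([
        "3 + 4 + 6 + 8",
        "I produced a sum instead of an expression equal to 24.",
        "(8 - 4) * 6 * 3 / 3",
    ])
    solved = reflect_retry_reward(
        task_prompt="Write an equation using 3, 4, 6, 8 that evaluates to 24.",
        generate=lambda prompt: next(canned),
        verify=lambda answer: answer == "(8 - 4) * 6 * 3 / 3",
        reward_reflection=lambda refl, r: print(f"reflection reward = {r}"),
    )
    print("solved after retry:", solved)
```

In an actual training run, the binary outcome would feed a policy-gradient style update restricted to the self-reflection tokens; the sketch only records that outcome and leaves the optimizer abstract.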