Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
May 30, 2025
Authors: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
cs.AI
Abstract
We explore a method for improving the performance of large language models
through self-reflection and reinforcement learning. By incentivizing the model
to generate better self-reflections when it answers incorrectly, we demonstrate
that a model's ability to solve complex, verifiable tasks can be enhanced even
when generating synthetic data is infeasible and only binary feedback is
available. Our framework operates in two stages: first, upon failing a given
task, the model generates a self-reflective commentary analyzing its previous
attempt; second, the model is given another attempt at the task with the
self-reflection in context. If the subsequent attempt succeeds, the tokens
generated during the self-reflection phase are rewarded. Our experimental
results show substantial performance gains across a variety of model
architectures, as high as 34.7% improvement at math equation writing and 18.1%
improvement at function calling. Notably, smaller fine-tuned models (1.5
billion to 7 billion parameters) outperform models in the same family that are
10 times larger. Our novel paradigm is thus an exciting pathway to more useful
and reliable language models that can self-improve on challenging tasks with
limited external feedback.
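To make the two-stage loop concrete, below is a minimal Python sketch of one reflect-retry-reward episode as described in the abstract. It is not the authors' implementation: the function names `generate_fn` and `verify_fn`, the prompt wording, and the 1.0/0.0 reward values are all illustrative assumptions; in the paper's setting, the rewarded self-reflection tokens would feed a reinforcement-learning update of the model.

```python
# Minimal sketch (not the authors' code) of the reflect-retry-reward episode.
# Assumptions: generate_fn is an LLM call (prompt -> completion) and verify_fn
# is a task-specific binary verifier; prompts and reward values are illustrative.
from typing import Callable, Optional, Tuple

REFLECT_PROMPT = (
    "Your previous answer was incorrect. "
    "Write a brief self-reflection on what may have gone wrong."
)


def reflect_retry_reward(
    task: str,
    generate_fn: Callable[[str], str],      # LLM call: prompt -> completion
    verify_fn: Callable[[str, str], bool],  # binary feedback: (task, answer) -> pass/fail
) -> Optional[Tuple[str, str, float]]:
    """Run one episode; return (reflection prompt, reflection, reward) for the
    RL update, or None if the first attempt already succeeded."""
    # First attempt, with no reflection in context.
    first_answer = generate_fn(task)
    if verify_fn(task, first_answer):
        return None  # nothing to train on for this episode

    # Stage 1: after failing, generate a self-reflective commentary
    # analyzing the previous attempt.
    reflection_prompt = (
        f"{task}\n\nPrevious attempt:\n{first_answer}\n\n{REFLECT_PROMPT}"
    )
    reflection = generate_fn(reflection_prompt)

    # Stage 2: retry the task with the self-reflection in context.
    retry_prompt = (
        f"{task}\n\nSelf-reflection on a failed attempt:\n{reflection}\n\nTry again."
    )
    second_answer = generate_fn(retry_prompt)

    # Only the reflection tokens are credited: reward 1.0 if the retry
    # passes the binary verifier, 0.0 otherwise.
    reward = 1.0 if verify_fn(task, second_answer) else 0.0
    return reflection_prompt, reflection, reward


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real setup would call a
    # language model and a task-specific verifier (e.g., an equation checker).
    def toy_generate(prompt: str) -> str:
        if "Self-reflection on a failed attempt" in prompt:
            return "4"  # the retry, "helped" by the reflection
        if "what may have gone wrong" in prompt:
            return "I added the operands instead of multiplying them."
        return "3"      # the (wrong) first attempt

    def toy_verify(task: str, answer: str) -> bool:
        return answer.strip() == "4"

    print(reflect_retry_reward("Compute 2 * 2.", toy_generate, toy_verify))
```

The returned triple is what a training step would consume: the self-reflection tokens are reinforced only when the second attempt passes the verifier, so the model is optimized to produce reflections that actually help it recover from failures.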