Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
May 30, 2025
作者: Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
cs.AI
Abstract
We explore a method for improving the performance of large language models
through self-reflection and reinforcement learning. By incentivizing the model
to generate better self-reflections when it answers incorrectly, we demonstrate
that a model's ability to solve complex, verifiable tasks can be enhanced even
when generating synthetic data is infeasible and only binary feedback is
available. Our framework operates in two stages: first, upon failing a given
task, the model generates a self-reflective commentary analyzing its previous
attempt; second, the model is given another attempt at the task with the
self-reflection in context. If the subsequent attempt succeeds, the tokens
generated during the self-reflection phase are rewarded. Our experimental
results show substantial performance gains across a variety of model
architectures, as high as a 34.7% improvement at math equation writing and an 18.1%
improvement at function calling. Notably, smaller fine-tuned models (1.5
billion to 7 billion parameters) outperform models in the same family that are
10 times larger. Our novel paradigm is thus an exciting pathway to more useful
and reliable language models that can self-improve on challenging tasks with
limited external feedback.
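
Below is a minimal Python sketch of the two-stage loop the abstract describes, written under stated assumptions: generate, verify, and reward_reflection are hypothetical stand-ins for the model's sampling routine, the task's binary verifier, and the reinforcement-learning update applied to the self-reflection tokens; none of these names come from the paper's implementation.

```python
# Minimal sketch of one Reflect-Retry-Reward episode as described in the abstract.
# `generate`, `verify`, and `reward_reflection` are hypothetical placeholders,
# not functions from the paper's codebase.
from typing import Callable


def reflect_retry_reward(
    task_prompt: str,
    generate: Callable[[str], str],                    # LLM sampling (assumed)
    verify: Callable[[str], bool],                     # binary task verifier (assumed)
    reward_reflection: Callable[[str, float], None],   # RL update on reflection tokens (assumed)
) -> bool:
    """Try the task once; on failure, reflect, retry, and reward the reflection if the retry succeeds."""
    # First attempt at the task.
    first_answer = generate(task_prompt)
    if verify(first_answer):
        return True  # solved outright; no self-reflection is trained

    # Stage 1: the model critiques its own failed attempt.
    reflection = generate(
        f"{task_prompt}\n"
        f"Previous attempt (incorrect): {first_answer}\n"
        "Briefly explain what went wrong and how to do better."
    )

    # Stage 2: retry with the self-reflection in context.
    second_answer = generate(
        f"{task_prompt}\n"
        f"Self-reflection: {reflection}\n"
        "Attempt the task again."
    )
    success = verify(second_answer)

    # Only the self-reflection tokens are credited, and only when the retry
    # succeeds (a reward of 0 otherwise).
    reward_reflection(reflection, 1.0 if success else 0.0)
    return success


# Toy usage with stand-in components (illustrative only).
if __name__ == "__main__":
    canned = iter([
        "3 + 4 + 6 + 8",
        "I produced a sum instead of an expression equal to 24.",
        "(8 - 4) * 6 * 3 / 3",
    ])
    solved = reflect_retry_reward(
        task_prompt="Write an equation using 3, 4, 6, 8 that evaluates to 24.",
        generate=lambda prompt: next(canned),
        verify=lambda answer: answer == "(8 - 4) * 6 * 3 / 3",
        reward_reflection=lambda refl, r: print(f"reflection reward = {r}"),
    )
    print("solved after retry:", solved)
```

In an actual training run, the binary outcome would feed a policy-gradient style update restricted to the self-reflection tokens; the sketch only records that outcome and leaves the optimizer abstract.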