Teaching Language Models to Critique via Reinforcement Learning
February 5, 2025
Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
cs.AI
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is
crucial for building systems that can iteratively improve, yet it is
fundamentally limited by the ability to provide accurate judgments and
actionable suggestions. In this work, we study LLM critics for code generation
and propose CTRL, a framework for Critic
Training via Reinforcement Learning, which
trains a critic model to generate feedback that maximizes correction
performance for a fixed generator model without human supervision. Our results
demonstrate that critics trained with CTRL significantly enhance
pass rates and mitigate compounding errors across both base and stronger
generator models. Furthermore, we show that these critic models act as accurate
generative reward models and enable test-time scaling through iterative
critique-revision, achieving up to 106.1% relative improvements across
challenging code generation benchmarks.
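
To make the test-time procedure concrete, the sketch below illustrates the iterative critique-revision loop the abstract describes. It is a minimal illustration, not the paper's code: the callables `generate`, `critique`, and `is_correct` are hypothetical stand-ins for the fixed generator model, the CTRL-trained critic, and a unit-test check.

```python
from typing import Callable

# Minimal sketch of iterative critique-revision (illustrative placeholders,
# not the paper's actual API):
#   generate(problem, prior, feedback) -> candidate solution (fixed generator)
#   critique(problem, solution)        -> natural-language feedback (critic)
#   is_correct(problem, solution)      -> unit-test verdict
def critique_revision(
    problem: str,
    generate: Callable[[str, str, str], str],
    critique: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> str:
    solution = generate(problem, "", "")  # initial attempt, no feedback yet
    for _ in range(max_rounds):
        if is_correct(problem, solution):
            break  # stop once the revision passes the tests
        feedback = critique(problem, solution)            # judge and suggest fixes
        solution = generate(problem, solution, feedback)  # generator revises
    return solution
```

Under the abstract's framing, the critic's training reward would be derived from the correction performance of the fixed generator's revision (e.g., whether the revised program passes its tests); the exact reward design is not specified here.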