Teaching Language Models to Critique via Reinforcement Learning
February 5, 2025
Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
cs.AI
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is
crucial for building systems that can iteratively improve, yet it is
fundamentally limited by the ability to provide accurate judgments and
actionable suggestions. In this work, we study LLM critics for code generation
and propose CTRL, a framework for Critic
Training via Reinforcement Learning, which
trains a critic model to generate feedback that maximizes correction
performance for a fixed generator model without human supervision. Our results
demonstrate that critics trained with CTRL significantly enhance
pass rates and mitigate compounding errors across both base and stronger
generator models. Furthermore, we show that these critic models act as accurate
generative reward models and enable test-time scaling through iterative
critique-revision, achieving up to 106.1% relative improvements across
challenging code generation benchmarks.
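
To make the test-time procedure concrete, the sketch below illustrates the iterative critique-revision loop the abstract describes. It is a minimal illustration, not the paper's code: the callables `generate`, `critique`, and `is_correct` are hypothetical stand-ins for the fixed generator model, the CTRL-trained critic, and a unit-test check.

```python
from typing import Callable

# Minimal sketch of iterative critique-revision (illustrative placeholders,
# not the paper's actual API):
#   generate(problem, prior, feedback) -> candidate solution (fixed generator)
#   critique(problem, solution)        -> natural-language feedback (critic)
#   is_correct(problem, solution)      -> unit-test verdict
def critique_revision(
    problem: str,
    generate: Callable[[str, str, str], str],
    critique: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> str:
    solution = generate(problem, "", "")  # initial attempt, no feedback yet
    for _ in range(max_rounds):
        if is_correct(problem, solution):
            break  # stop once the revision passes the tests
        feedback = critique(problem, solution)            # judge and suggest fixes
        solution = generate(problem, solution, feedback)  # generator revises
    return solution
```

Under the abstract's framing, the critic's training reward would be derived from the correction performance of the fixed generator's revision (e.g., whether the revised program passes its tests); the exact reward design is not specified here.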