Teaching Language Models to Critique via Reinforcement Learning

Abstract

Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.
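The iterative critique-revision loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the CTRL implementation, and it does not show the reinforcement learning used to train the critic. The names `critique_revise`, `toy_generate`, `toy_critique`, and `toy_revise` are hypothetical stand-ins: in practice the generator and critic would be language models, and the critic would produce natural-language feedback rather than run a hard-coded check.

```python
from typing import Callable, Tuple


def critique_revise(
    problem: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return the final solution and the number of revisions performed.

    All three callables are assumptions made for illustration only.
    """
    solution = generate(problem)
    for revisions in range(max_rounds):
        # The critic judges the solution and, if needed, explains what to fix.
        accepted, feedback = critique(problem, solution)
        if accepted:
            return solution, revisions
        # The generator is asked to revise its solution using the feedback.
        solution = revise(problem, solution, feedback)
    return solution, max_rounds


# Toy stand-ins so the sketch runs end to end (hypothetical, not real models).
def toy_generate(problem: str) -> str:
    return "def add(a, b): return a - b"  # deliberately buggy first attempt


def toy_critique(problem: str, solution: str) -> Tuple[bool, str]:
    scope: dict = {}
    exec(solution, scope)  # acceptable only for this trusted toy string
    if scope["add"](2, 3) == 5:
        return True, ""
    return False, "add(2, 3) should be 5; check the operator."


def toy_revise(problem: str, solution: str, feedback: str) -> str:
    return solution.replace("a - b", "a + b")


final_solution, revisions_used = critique_revise(
    "Write add(a, b).", toy_generate, toy_critique, toy_revise
)
```

In this toy run the first attempt is rejected once and then repaired. Raising `max_rounds` is one simple way to spend more test-time compute on further critique-revision rounds, under the assumption that the critic's accept/reject judgment is reliable.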
