Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Abstract

Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signal to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key factor is reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivates us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g., from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code is available at https://github.com/Jun-Kai-Zhang/rubrics.git.
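
To make the notion of a rubric-based reward concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a rubric is a list of weighted criteria and that the reward is the weighted fraction of criteria a response satisfies; this scoring rule, the names `Criterion`, `rubric_reward`, and `keyword_judge`, and the example rubric are all hypothetical. In practice the judge would likely be an LLM grader, which is replaced here by a toy keyword check so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item: a textual check and an assumed importance weight."""
    description: str
    weight: float


# A judge maps (prompt, response, criterion description) to pass/fail.
# Assumption: a real system would call an LLM grader here.
Judge = Callable[[str, str, str], bool]


def rubric_reward(prompt: str, response: str,
                  rubric: Sequence[Criterion], judge: Judge) -> float:
    """Return the weighted fraction of rubric criteria the response satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric
                 if judge(prompt, response, c.description))
    return earned / total


def keyword_judge(prompt: str, response: str, criterion: str) -> bool:
    """Toy stand-in judge: passes if the criterion text appears in the response."""
    return criterion.lower() in response.lower()


# Hypothetical rubric for a medication question; each description is a keyword
# only because the toy judge does substring matching.
rubric = [
    Criterion("dosage", 2.0),
    Criterion("side effects", 1.0),
    Criterion("contraindications", 1.0),
]

great = "The answer covers dosage and side effects."
excellent = "The answer covers dosage, side effects, and contraindications."

great_score = rubric_reward("q", great, rubric, keyword_judge)
excellent_score = rubric_reward("q", excellent, rubric, keyword_judge)
print(great_score, excellent_score)
```

Because the score depends only on which criteria are met, a criterion that only the strongest responses satisfy is what separates the two example responses here, regardless of which model produced them.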
