Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Abstract

Large language models (LLMs) are rapidly surpassing human knowledge in many domains. Improving these models has traditionally relied on costly human data, but recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, in which the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability both to judge and to follow instructions: the win rate of Llama-3-8B-Instruct rises from 22.9% to 39.4% on AlpacaEval 2 and from 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
