Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Abstract

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. Improving these models has traditionally relied on costly human data, but recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have focused primarily on improving model responses rather than judgment capabilities, which leads to rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, in which the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability both to judge and to follow instructions: the win rate of Llama-3-8B-Instruct rises from 22.9% to 39.4% on AlpacaEval 2, and from 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
