Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
July 28, 2024
作者: Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and from 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
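
The abstract describes the training loop only at a high level, so the following is a minimal, runnable Python sketch of what one Meta-Rewarding iteration could look like. The generation, judge, and meta-judge calls are stubbed with placeholders; the function names, scoring format, and pair construction here are illustrative assumptions, not the paper's exact prompts or hyperparameters.

```python
# Minimal sketch of one Meta-Rewarding iteration (illustrative assumptions only).
# In the method described by the paper, the same model acts as actor, judge, and
# meta-judge; here those calls are stubbed with random placeholders.
import random

def generate_responses(prompt: str, n: int = 4) -> list[str]:
    """Placeholder: sample n candidate responses from the current model."""
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def judge(prompt: str, response: str) -> str:
    """Placeholder: the model writes a judgement of one of its own responses
    (LLM-as-a-Judge). Here it is just a string with a random score."""
    return f"judgement of '{response}': score {random.randint(1, 5)}/5"

def score_from_judgement(judgement: str) -> int:
    """Extract the numeric score from the placeholder judgement format above."""
    return int(judgement.split("score ")[1].split("/")[0])

def meta_judge(prompt: str, response: str, judgement_a: str, judgement_b: str) -> str:
    """Placeholder: the model compares two judgements of the same response and
    picks the better evaluation (LLM-as-a-Meta-Judge)."""
    return random.choice(["A", "B"])

def meta_rewarding_step(prompts: list[str]):
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        responses = generate_responses(prompt)

        # Actor data: judge every response, prefer the best-scored over the worst.
        judgements = {r: judge(prompt, r) for r in responses}
        ranked = sorted(responses, key=lambda r: score_from_judgement(judgements[r]))
        actor_pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})

        # Judge data: meta-judge two independent judgements of the same response,
        # preferring the judgement the meta-judge selects.
        response = responses[0]
        j_a, j_b = judge(prompt, response), judge(prompt, response)
        winner = meta_judge(prompt, response, j_a, j_b)
        chosen, rejected = (j_a, j_b) if winner == "A" else (j_b, j_a)
        judge_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})

    # Both pair sets would then feed a preference-optimization step (e.g. DPO),
    # improving response quality and judging ability in the same iteration.
    return actor_pairs, judge_pairs

if __name__ == "__main__":
    actor, judges = meta_rewarding_step(["Explain self-rewarding LLMs."])
    print(actor[0], judges[0], sep="\n")
```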