Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
July 28, 2024
Authors: Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
Large Language Models (LLMs) are rapidly surpassing human knowledge in many
domains. While improving these models traditionally relies on costly human
data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs
can improve by judging their own responses instead of relying on human
labelers. However, existing methods have primarily focused on improving model
responses rather than judgment capabilities, resulting in rapid saturation
during iterative training. To address this issue, we introduce a novel
Meta-Rewarding step to the self-improvement process, where the model judges its
own judgments and uses that feedback to refine its judgment skills.
Surprisingly, this unsupervised approach improves the model's ability to judge
and follow instructions, as demonstrated by a win rate improvement of
Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on
Arena-Hard. These results strongly suggest the potential for self-improving
models without human supervision.