Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Abstract

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. Improving these models has traditionally relied on costly human data, but recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have focused primarily on improving model responses rather than judgment capabilities, which leads to rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, in which the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability both to judge and to follow instructions: the win rate of Llama-3-8B-Instruct rises from 22.9% to 39.4% on AlpacaEval 2 and from 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
