JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Abstract

Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) that evaluate LLMs efficiently and effectively on open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at three scales (7B, 13B, and 33B parameters) and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases that arise when fine-tuning an LLM as a judge and categorize them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques, including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM is also efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, exceeding 90%, which even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in judging single answers, multimodal models, multiple answers, and multi-turn chat.
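
The abstract names swap augmentation and judge-teacher agreement without specifying either, so the following is only a minimal sketch under stated assumptions. It assumes a pairwise sample holds two answers with one score each, that swap augmentation mirrors the answer order together with the scores, and that agreement is the fraction of samples on which two judges reach the same verdict. The names `PairSample`, `swap_augment`, `verdict`, and `agreement` are hypothetical and are not taken from the JudgeLM code.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass(frozen=True)
class PairSample:
    """Hypothetical pairwise judging sample; field names are illustrative."""
    question: str
    answer_a: str
    answer_b: str
    score_a: float
    score_b: float


def swap_augment(sample: PairSample) -> PairSample:
    """Return a copy with the two answers and their scores in swapped order.

    Assumption: training on both orders discourages a judge from
    favoring an answer merely because of its position.
    """
    return replace(
        sample,
        answer_a=sample.answer_b,
        answer_b=sample.answer_a,
        score_a=sample.score_b,
        score_b=sample.score_a,
    )


def verdict(sample: PairSample) -> str:
    """Reduce the two scores to a verdict: 'a', 'b', or 'tie'."""
    if sample.score_a > sample.score_b:
        return "a"
    if sample.score_a < sample.score_b:
        return "b"
    return "tie"


def agreement(judge: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of samples where the judge's verdict matches the teacher's."""
    if len(judge) != len(teacher) or not judge:
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(j == t for j, t in zip(judge, teacher))
    return matches / len(judge)
```

Under these assumptions, a sample and its swapped copy carry opposite verdicts, so a judge that always prefers the first position cannot be right on both.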

