Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

Abstract

As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high cost of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data, such as GPT-4 and Claude-3, on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x fewer training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider on 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.
