Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

Abstract

As large language models (LLMs) advance, reliably evaluating their output becomes more challenging because human evaluation is costly. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data, such as GPT-4 and Claude-3, on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x fewer training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider on 8 out of 12 autorater evaluation benchmarks, which together encompass 53 quality assessment tasks and include RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.
