RubricBench: Aligning Model-Generated Rubrics with Human Standards

Abstract

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, reward models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark for assessing this evaluation paradigm, because existing benchmarks offer neither the discriminative complexity nor the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark of 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline that targets hard samples featuring nuanced input complexity and misleading surface bias, and augments each sample with expert-annotated, atomic rubrics derived strictly from the instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics: even state-of-the-art models struggle to autonomously specify valid evaluation criteria and lag considerably behind human-guided performance.

