RubricBench: Aligning Model-Generated Rubrics with Human Standards
March 2, 2026
Authors: Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma
cs.AI
Abstract
As Large Language Model (LLM) alignment evolves from simple completions to complex, sophisticated generation, reward models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark for assessing this evaluation paradigm: existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark of 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtering pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from the instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
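To make the evaluation paradigm concrete, the following is a minimal sketch (not the paper's actual protocol) of rubric-based pairwise comparison: an "atomic" rubric is modeled as a list of independent yes/no criteria, each response is scored by how many criteria it satisfies, and the higher-scoring response wins. All names (`Criterion`, `rubric_score`, `pairwise_winner`) and the toy checks are hypothetical illustrations, not from RubricBench.

```python
# Illustrative sketch of atomic-rubric pairwise evaluation.
# In practice each criterion would be judged by an LLM; here we use
# simple callables so the example is self-contained and runnable.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str                # human-readable atomic criterion
    check: Callable[[str], bool]    # judge for this single criterion


def rubric_score(response: str, rubric: List[Criterion]) -> int:
    """Number of atomic criteria the response satisfies."""
    return sum(c.check(response) for c in rubric)


def pairwise_winner(resp_a: str, resp_b: str, rubric: List[Criterion]) -> str:
    """Return 'A', 'B', or 'tie' by comparing rubric scores."""
    score_a = rubric_score(resp_a, rubric)
    score_b = rubric_score(resp_b, rubric)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Toy rubric for an instruction like "answer in at most 20 words and mention LLMs"
rubric = [
    Criterion("is at most 20 words", lambda r: len(r.split()) <= 20),
    Criterion("mentions LLMs", lambda r: "LLM" in r),
]

print(pairwise_winner("LLMs are large language models.",
                      "A very long answer " * 10,
                      rubric))  # → A
```

Because each criterion is checked independently, per-criterion judgments can be audited individually, which is the property that lets a benchmark attribute evaluation failures to specific criteria rather than to a single opaque preference score.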