RubricBench: Aligning Model-Generated Rubrics with Human Standards
March 2, 2026
Authors: Qiyuan Zhang, Junyi Zhou, Yufei Wang, Fuyuan Lyu, Yidong Ming, Can Xu, Qingfeng Sun, Kai Zheng, Peng Kang, Xue Liu, Chen Ma
cs.AI
Abstract
As Large Language Model (LLM) alignment evolves from simple completions to complex, sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
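The rubric-based pairwise evaluation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual protocol: the `judge` backend, the rubric items, and the sample responses are all hypothetical stand-ins (in practice the judge would typically be an LLM grader or a human annotator checking each atomic criterion).

```python
# Hypothetical sketch: score two responses against a list of atomic
# rubric criteria and declare a pairwise winner. Each criterion is
# checked independently; the response satisfying more criteria wins.

def score(response: str, rubric: list[str], judge) -> int:
    """Count how many atomic rubric criteria the response satisfies.
    `judge(response, criterion) -> bool` is any grading backend."""
    return sum(bool(judge(response, criterion)) for criterion in rubric)

def pairwise_winner(resp_a: str, resp_b: str, rubric: list[str], judge) -> str:
    """Return "A", "B", or "tie" based on rubric-criterion counts."""
    sa = score(resp_a, rubric, judge)
    sb = score(resp_b, rubric, judge)
    return "A" if sa > sb else "B" if sb > sa else "tie"

def contains_judge(response: str, criterion: str) -> bool:
    # Trivial stand-in judge: criterion counts as satisfied if its
    # key phrase appears verbatim in the response (case-insensitive).
    return criterion.lower() in response.lower()

# Toy example with instruction-derived criteria:
rubric = ["sorted", "O(n log n)"]
a = "Use merge sort: it runs in O(n log n) and returns a sorted list."
b = "Just loop over the items."
print(pairwise_winner(a, b, rubric, contains_judge))  # -> A
```

The design choice of atomic, per-criterion checks (rather than a single holistic score) is what lets rubric quality be analyzed: a weak model-generated rubric fails by listing criteria that do not discriminate between the paired responses.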