RubricBench: Aligning Model-Generated Rubrics with Human Standards

Abstract

As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, reward models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community has no unified benchmark for assessing this evaluation paradigm, because existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark of 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, and augments each sample with expert-annotated, atomic rubrics derived strictly from the instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria and lag considerably behind human-guided performance.

