IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Abstract

Instruction-following is a foundational capability of large language models (LLMs), and improving it hinges on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following evaluation remains underexplored because existing meta-evaluation benchmarks have several deficiencies, such as insufficient data coverage and oversimplified pairwise evaluation paradigms that are misaligned with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the ability of judge models to rank multiple responses, which is essential for guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance than existing benchmarks. Our code and data are available at https://github.com/thu-coai/IF-RewardBench.
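
To make the preference-graph idea concrete, here is a minimal sketch, not the benchmark's actual implementation. It assumes each response carries a hypothetical integer quality label (for example, the number of constraints it satisfies), that tied responses contribute no edge, and that the judge emits one scalar score per response. The function names and the agreement measure are illustrative assumptions; the abstract does not specify the metric the benchmark uses.

```python
from itertools import combinations


def build_preference_graph(quality):
    """Return directed edges (better, worse) for every pair with distinct quality.

    Assumption: ties produce no edge; the benchmark may treat them differently.
    """
    edges = set()
    for a, b in combinations(sorted(quality), 2):
        if quality[a] > quality[b]:
            edges.add((a, b))
        elif quality[b] > quality[a]:
            edges.add((b, a))
    return edges


def pairwise_agreement(edges, judge_scores):
    """Fraction of gold edges whose direction the judge's scores reproduce."""
    if not edges:
        return 0.0
    correct = sum(judge_scores[better] > judge_scores[worse]
                  for better, worse in edges)
    return correct / len(edges)


# Hypothetical gold labels for four responses to one instruction.
gold_quality = {"r1": 3, "r2": 2, "r3": 2, "r4": 0}
# Hypothetical scalar scores produced by a judge model.
judge_scores = {"r1": 0.9, "r2": 0.4, "r3": 0.7, "r4": 0.5}

graph = build_preference_graph(gold_quality)
agreement = pairwise_agreement(graph, judge_scores)
print(f"{len(graph)} gold edges, agreement = {agreement:.2f}")
```

Because the graph stores every pairwise preference among the responses, a single judge ranking can be checked against all of them at once, rather than against one isolated chosen/rejected pair.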
