RankJudge: A Synthetic Benchmark Generator for Multi-Turn LLM-as-a-Judge

Abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complex systems such as conversational chatbots, the volume of generated text can overwhelm human annotation resources. Model developers have therefore begun to rely heavily on auto-evaluation, in which LLMs themselves judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations in which one conversation has a single flaw injected into one turn. This construction allows the paired conversations to be labeled unambiguously as better or worse, and it isolates each failure category to an individual turn, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also assigns each conversation pair a difficulty rating, which we use to dynamically curate the evaluation slice and reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
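
To make the scoring idea concrete, the sketch below shows one way a strict joint correctness check could feed a Bradley-Terry fit. It is an illustration under stated assumptions, not the implementation behind the results: the abstract does not list the fields of the joint criterion or say how comparisons are formed. Here we assume a verdict must match the label on three hypothetical fields (`worse`, `turn`, `category`), that each judgment is treated as a match between a judge and a conversation pair (the judge wins when jointly correct), and that a small virtual-game prior keeps the estimates finite. All names and data are made up for the example.

```python
from collections import defaultdict

# Hypothetical field names; the abstract does not specify the criterion's fields.
FIELDS = ("worse", "turn", "category")

def jointly_correct(verdict, label):
    """Strict joint criterion (assumed): every field must match the label."""
    return all(verdict[k] == label[k] for k in FIELDS)

def fit_bradley_terry(outcomes, iters=500, prior=0.5):
    """Estimate Bradley-Terry strengths from (winner, loser) tuples.

    Uses the standard minorization-maximization update. Each player also
    gets `prior` virtual wins and `prior` virtual losses against an anchor
    of fixed strength 1.0, so unbeaten or winless players stay finite.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(a, b)]: number of a-vs-b matches
    for winner, loser in outcomes:
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0
    opponents = defaultdict(list)
    for (a, b), count in games.items():
        opponents[a].append((b, count))
    players = sorted(opponents)
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            denom = 2.0 * prior / (strength[i] + 1.0)  # virtual anchor games
            for j, count in opponents[i]:
                denom += count / (strength[i] + strength[j])
            updated[i] = (wins[i] + prior) / denom
        strength = updated
    return strength

# Toy data: two judges, two conversation pairs (all values invented).
labels = {
    "pair-1": {"worse": "B", "turn": 3, "category": "hallucination"},
    "pair-2": {"worse": "A", "turn": 5, "category": "contradiction"},
}
verdicts = [
    ("judge-A", "pair-1", {"worse": "B", "turn": 3, "category": "hallucination"}),
    ("judge-A", "pair-2", {"worse": "A", "turn": 5, "category": "contradiction"}),
    ("judge-B", "pair-1", {"worse": "B", "turn": 3, "category": "hallucination"}),
    # Right conversation, wrong turn: fails the strict joint criterion.
    ("judge-B", "pair-2", {"worse": "A", "turn": 4, "category": "contradiction"}),
]

outcomes = []
for judge, pair_id, verdict in verdicts:
    if jointly_correct(verdict, labels[pair_id]):
        outcomes.append((judge, pair_id))   # judge "beats" the pair
    else:
        outcomes.append((pair_id, judge))   # pair "beats" the judge

strength = fit_bradley_terry(outcomes)
for name in sorted(strength, key=strength.get, reverse=True):
    print(f"{name}: {strength[name]:.3f}")
```

Under this framing, a higher strength for a judge means it is jointly correct more often against comparable pairs, and a higher strength for a conversation pair means it defeats judges more often, which can be read as a difficulty rating.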