RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
May 20, 2026
Authors: Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell
cs.AI
Abstract
As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
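To make the ranking idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each judge "plays" against each conversation pair: the judge wins when its verdict is jointly correct, and the pair wins otherwise. Under that assumption, a Bradley-Terry fit yields a skill rating per judge and a difficulty rating per pair. The field names (`better`, `turn`, `category`), the function names, and the gradient-ascent fit with L2 regularisation are all hypothetical choices made for illustration.

```python
import math


def joint_correct(verdict, gold):
    """Assumed form of a strict joint criterion: the judge must pick the
    better conversation AND locate the flawed turn AND name its category."""
    return (
        verdict["better"] == gold["better"]
        and verdict["turn"] == gold["turn"]
        and verdict["category"] == gold["category"]
    )


def fit_bradley_terry(outcomes, n_iter=500, lr=0.1, l2=0.01):
    """Fit Bradley-Terry ratings from (judge, pair, correct) tuples.

    Model (an assumption of this sketch):
        P(judge is correct on pair) = sigmoid(skill[judge] - difficulty[pair])
    Parameters are fit by gradient ascent on the L2-regularised
    log-likelihood; the regulariser keeps ratings finite when a judge is
    correct on every pair it sees.
    """
    skill = {j: 0.0 for j, _, _ in outcomes}
    difficulty = {p: 0.0 for _, p, _ in outcomes}
    for _ in range(n_iter):
        # Start each gradient with the regularisation term.
        g_skill = {j: -l2 * s for j, s in skill.items()}
        g_diff = {p: -l2 * d for p, d in difficulty.items()}
        for j, p, correct in outcomes:
            prob = 1.0 / (1.0 + math.exp(-(skill[j] - difficulty[p])))
            err = (1.0 if correct else 0.0) - prob
            g_skill[j] += err   # a correct verdict pushes judge skill up
            g_diff[p] -= err    # and pushes pair difficulty down
        for j in skill:
            skill[j] += lr * g_skill[j]
        for p in difficulty:
            difficulty[p] += lr * g_diff[p]
    return skill, difficulty
```

Sorting judges by `skill` gives a leaderboard, and sorting pairs by `difficulty` gives one possible basis for selecting an evaluation slice. How RankJudge actually performs that selection is not specified in the abstract.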