RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, in which LLMs themselves judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations in which one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also assigns each conversation pair a difficulty rating, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
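To make the ranking step concrete, here is a minimal sketch of how a Bradley-Terry fit can rate judges and conversation pairs on one scale. It is not the paper's implementation, and the abstract specifies none of its details. The sketch assumes that each judge verdict is treated as a "match" between a judge and a conversation pair, which the judge wins when its verdict is correct. It also assumes that "joint correctness" means getting the worse conversation, the flawed turn, and the failure category all right. The names `jointly_correct` and `fit_bradley_terry`, the field names, the toy data, and the small L2 penalty (which keeps the ratings finite) are all hypothetical.

```python
import math

def jointly_correct(pred, gold):
    # Assumed reading of the strict criterion: every field must match.
    return all(pred[k] == gold[k] for k in ("worse", "turn", "category"))

def fit_bradley_terry(outcomes, n_iters=2000, lr=0.1, l2=0.01):
    """Fit P(judge correct on pair) = sigmoid(skill - difficulty).

    outcomes: list of (judge, pair_id, correct) tuples.
    Uses plain gradient ascent on the L2-penalised log-likelihood.
    """
    skill = {j: 0.0 for j, _, _ in outcomes}
    difficulty = {p: 0.0 for _, p, _ in outcomes}
    for _ in range(n_iters):
        g_skill = {j: -l2 * s for j, s in skill.items()}
        g_diff = {p: -l2 * d for p, d in difficulty.items()}
        for j, p, correct in outcomes:
            prob = 1.0 / (1.0 + math.exp(-(skill[j] - difficulty[p])))
            err = (1.0 if correct else 0.0) - prob
            g_skill[j] += err   # a correct verdict raises the judge's rating
            g_diff[p] -= err    # and lowers the pair's difficulty
        for j in skill:
            skill[j] += lr * g_skill[j]
        for p in difficulty:
            difficulty[p] += lr * g_diff[p]
    return skill, difficulty

# Toy data (invented): which pairs each hypothetical judge got jointly right.
results = {
    "judge_a": {"p1": True, "p2": True, "p3": True, "p4": True},
    "judge_b": {"p1": True, "p2": True, "p3": False, "p4": False},
    "judge_c": {"p1": True, "p2": False, "p3": False, "p4": False},
}
outcomes = [(j, p, ok) for j, row in results.items() for p, ok in row.items()]

skill, difficulty = fit_bradley_terry(outcomes)
ranking = sorted(skill, key=skill.get, reverse=True)
print(ranking)

# Example of the assumed joint criterion on a single verdict.
gold = {"worse": "B", "turn": 3, "category": "hallucination"}
pred = {"worse": "B", "turn": 2, "category": "hallucination"}
print(jointly_correct(pred, gold))  # right conversation, wrong turn
```

Because judges and pairs share one scale in this sketch, the fitted `difficulty` values could in principle be used to filter the evaluation slice, for example by dropping pairs whose difficulty is implausibly high. The filtering rule actually used in the paper is not stated in the abstract.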