
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

July 24, 2025
Authors: Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer
cs.AI

Abstract

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that supports comprehensive error analysis through aggregate visualizations, interactive filters for isolating specific issues or score ranges, and drill-downs to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.
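To make the described flow concrete, below is a minimal Python sketch of the three stages the abstract names: per-instance judge feedback, distillation of system-level issues, and prevalence counting. The `judge` callable, the prompts, and the `analyze_errors` function are hypothetical illustrations under these assumptions, not CLEAR's actual API; consult the open-source package for the real interface.

```python
# Hypothetical sketch of the three-stage analysis described in the abstract:
# (1) per-instance judge feedback, (2) grouping feedback into system-level
# issues, (3) counting how prevalent each issue is across instances.
# Function names and prompts are illustrative, not CLEAR's actual API.
from collections import Counter
from typing import Callable


def analyze_errors(
    records: list[dict],          # each record: {"input": ..., "output": ...}
    judge: Callable[[str], str],  # wraps an LLM call: prompt text -> response text
) -> Counter:
    # Stage 1: per-instance textual feedback from an LLM judge.
    feedback = [
        judge(f"Critique this response.\nInput: {r['input']}\nOutput: {r['output']}")
        for r in records
    ]

    # Stage 2: distill recurring, system-level issues from the critiques
    # (a single aggregation call in this simplified sketch).
    issues_text = judge(
        "List the recurring error types in these critiques, one per line:\n"
        + "\n---\n".join(feedback)
    )
    issues = [line.strip("- ").strip() for line in issues_text.splitlines() if line.strip()]

    # Stage 3: quantify prevalence by labeling each instance with the issues it exhibits.
    prevalence: Counter = Counter()
    for fb in feedback:
        verdicts = judge(
            "Which of these issues apply to the critique below? "
            f"Answer with the matching issue names.\nIssues: {issues}\nCritique: {fb}"
        )
        for issue in issues:
            if issue and issue in verdicts:
                prevalence[issue] += 1
    return prevalence
```

The returned `prevalence` counter is the kind of aggregate an interactive dashboard can visualize, with the per-instance `feedback` supporting drill-downs from an issue to the examples that exhibit it.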