
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

July 24, 2025
Authors: Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer
cs.AI

Abstract

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then derives a set of system-level error issues from that feedback and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that supports comprehensive error analysis through aggregate visualizations, interactive filters that isolate specific issues or score ranges, and drill-downs to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR's analysis on RAG and Math benchmarks, and showcase its utility through a user case study.
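The three-stage workflow the abstract describes (per-instance feedback, system-level issue discovery, prevalence quantification) can be pictured with a minimal sketch. The function names and the keyword-based issue grouping below are hypothetical illustrations, not the CLEAR API; in the actual package an LLM judge produces the feedback and an LLM derives the issue labels.

```python
from collections import Counter

# Hypothetical stand-in for an LLM judge: returns free-text feedback per instance.
# CLEAR performs this step with an actual LLM-as-a-Judge; here it is stubbed.
def judge_instance(question: str, answer: str) -> str:
    if "source" not in answer:
        return "The answer does not cite the retrieved source."
    if len(answer.split()) > 50:
        return "The answer is overly verbose."
    return "No issue detected."

# Hypothetical stand-in for issue discovery: map free-text feedback to a
# system-level issue label. CLEAR derives these labels with an LLM; simple
# keyword matching is used here purely for illustration.
def discover_issue(feedback: str) -> str:
    if "cite" in feedback:
        return "Missing citation of retrieved context"
    if "verbose" in feedback:
        return "Overly verbose answers"
    return "No issue"

def analyze(dataset: list[dict]) -> Counter:
    """Run the three-stage sketch: feedback -> issue labels -> prevalence counts."""
    feedback = [judge_instance(d["question"], d["answer"]) for d in dataset]
    issues = [discover_issue(f) for f in feedback]
    return Counter(issues)  # prevalence of each identified issue

if __name__ == "__main__":
    toy_data = [
        {"question": "Q1", "answer": "Short answer without attribution."},
        {"question": "Q2", "answer": "According to the source, the value is 42."},
    ]
    for issue, count in analyze(toy_data).most_common():
        print(f"{issue}: {count}")
```

The per-issue counts produced in the last step are what the interactive dashboard aggregates and visualizes, with filters and drill-downs leading back to the individual instances behind each count.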