CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Abstract

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then derives a set of system-level error issues, and finally quantifies the prevalence of each identified issue. The package also provides an interactive dashboard that lets users perform a comprehensive error analysis through aggregate visualizations, apply interactive filters to isolate specific issues or score ranges, and drill down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis on RAG and Math benchmarks, and showcase its utility through a user case study.
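
The last stage of the pipeline described above, quantifying how often each issue occurs, and the filtering by issue or score range can be illustrated with a minimal Python sketch. It assumes the judge has already assigned each instance a score and zero or more issue labels. The function names `issue_prevalence` and `filter_by_issue`, the record layout, and the sample data are hypothetical and are not taken from the CLEAR package.

```python
from collections import Counter


def issue_prevalence(records):
    """Return {issue: fraction of instances flagged with that issue}.

    `records` is a list of dicts whose "issues" key holds the labels
    assigned to one instance (hypothetical layout, not CLEAR's API).
    """
    total = len(records)
    if total == 0:
        return {}
    counts = Counter()
    for record in records:
        # Count each issue at most once per instance.
        counts.update(set(record["issues"]))
    return {issue: n / total for issue, n in counts.items()}


def filter_by_issue(records, issue, min_score=0.0, max_score=1.0):
    """Keep instances flagged with `issue` whose score lies in the range."""
    return [
        r for r in records
        if issue in r["issues"] and min_score <= r["score"] <= max_score
    ]


# Made-up judge output for four instances (illustrative data only).
records = [
    {"id": 1, "score": 0.2, "issues": ["missing citation", "incomplete answer"]},
    {"id": 2, "score": 0.9, "issues": []},
    {"id": 3, "score": 0.5, "issues": ["missing citation"]},
    {"id": 4, "score": 0.4, "issues": ["arithmetic slip"]},
]

prevalence = issue_prevalence(records)
for issue, share in sorted(prevalence.items(), key=lambda kv: -kv[1]):
    print(f"{issue}: {share:.0%}")

# Drill down: low-scoring instances that exhibit one particular issue.
low_scoring = filter_by_issue(records, "missing citation", max_score=0.3)
print([r["id"] for r in low_scoring])
```

Counting each issue once per instance keeps the result interpretable as "share of instances affected", which is the quantity an aggregate view of this kind would plot.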
