CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Abstract

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then derives a set of system-level error issues from that feedback, and finally quantifies the prevalence of each identified issue. The package also provides an interactive dashboard that supports comprehensive error analysis through aggregate visualizations, lets users apply interactive filters to isolate specific issues or score ranges, and lets them drill down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis on RAG and Math benchmarks, and showcase its utility through a user case study.
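
The last two stages described above (quantifying issue prevalence, then filtering by issue or score range) can be illustrated with a minimal sketch. This is not CLEAR's actual API: the `JudgedInstance` schema, the function names, the issue labels, and the toy data are all assumptions made for illustration, and the judge calls that would produce the feedback and the issue assignments are taken as already done.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class JudgedInstance:
    """One evaluated example (hypothetical schema, not CLEAR's own)."""
    instance_id: str
    score: float                              # judge score, assumed to lie in [0, 1]
    feedback: str                             # per-instance textual feedback
    issues: set = field(default_factory=set)  # system-level issues matched to this instance


def issue_prevalence(instances):
    """Return the fraction of instances that exhibit each issue."""
    total = len(instances)
    if total == 0:
        return {}
    counts = Counter(issue for inst in instances for issue in inst.issues)
    return {issue: n / total for issue, n in counts.items()}


def filter_instances(instances, issue=None, min_score=0.0, max_score=1.0):
    """Keep instances within a score range, optionally restricted to one issue."""
    return [
        inst for inst in instances
        if min_score <= inst.score <= max_score
        and (issue is None or issue in inst.issues)
    ]


# Toy data standing in for judge output; the issue names are invented.
judged = [
    JudgedInstance("q1", 0.2, "Answer ignores the retrieved passage.", {"ignores context"}),
    JudgedInstance("q2", 0.9, "Correct and well grounded."),
    JudgedInstance("q3", 0.4, "Arithmetic slip; context unused.",
                   {"calculation error", "ignores context"}),
    JudgedInstance("q4", 0.5, "Final sum is wrong.", {"calculation error"}),
]

prevalence = issue_prevalence(judged)
low_scoring_context_failures = filter_instances(
    judged, issue="ignores context", max_score=0.5
)
```

Here `prevalence` maps each issue to the share of instances showing it, and `low_scoring_context_failures` plays the role of a drill-down: the subset of instances a user would inspect after selecting one issue and a score range.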
