CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Abstract

The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, which answers which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then creates a set of system-level error issues, and finally quantifies the prevalence of each identified issue. The package also provides an interactive dashboard that supports comprehensive error analysis: users can view aggregate visualizations, apply interactive filters to isolate specific issues or score ranges, and drill down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis on RAG and Math benchmarks and showcase its utility through a user case study.
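The three stages described above (per-instance feedback, system-level issues, prevalence) can be illustrated with a minimal Python sketch. This is not CLEAR's actual API: `toy_judge`, `discover_issues`, and `issue_prevalence` are hypothetical names, and a simple rule-based function stands in for the LLM judge so the example runs without any model access.

```python
from collections import Counter


def toy_judge(answer: str) -> dict:
    """Hypothetical stand-in for an LLM judge: a score plus critique notes."""
    notes = []
    if not answer.strip():
        notes.append("empty answer")
    if len(answer.split()) > 30:
        notes.append("overly verbose")
    if "i think" in answer.lower():
        notes.append("unsupported hedging")
    # Assumption: each critique costs half a point, floored at zero.
    return {"score": max(0.0, 1.0 - 0.5 * len(notes)), "notes": notes}


def discover_issues(feedback: list[dict], min_count: int = 2) -> list[str]:
    """Stage 2 (toy version): keep critiques that recur across instances."""
    counts = Counter(note for fb in feedback for note in fb["notes"])
    return sorted(issue for issue, n in counts.items() if n >= min_count)


def issue_prevalence(feedback: list[dict], issues: list[str]) -> dict:
    """Stage 3: fraction of instances flagged with each discovered issue."""
    total = len(feedback)
    return {
        issue: sum(issue in fb["notes"] for fb in feedback) / total
        for issue in issues
    }


answers = ["I think it is 42.", "", "I think the answer is Paris.", "Paris."]

feedback = [toy_judge(a) for a in answers]        # Stage 1: per-instance feedback
issues = discover_issues(feedback)                # Stage 2: system-level issues
prevalence = issue_prevalence(feedback, issues)   # Stage 3: prevalence per issue

# Drill-down: the instances that exhibit one particular issue.
flagged = [a for a, fb in zip(answers, feedback) if "unsupported hedging" in fb["notes"]]
```

In a real setting the rule-based judge would be replaced by LLM calls, and issue discovery would need to merge differently worded critiques that describe the same underlying problem, which exact string matching cannot do.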
