Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Abstract

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. Our code is available at: https://aka.ms/Debug-XAI.
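To make the contrastive formulation concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical bias-free linear readout over summed token vectors, a setting in which the logit difference between the incorrect and the correct token decomposes exactly into one additive score per input token. The function name, vectors, and weights are all illustrative assumptions; real LRP must additionally propagate relevance through attention and nonlinear layers, which this sketch omits.

```python
# Toy illustration only: a bias-free linear readout over summed token vectors.
# All names and numbers are hypothetical and not taken from the paper's code.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_attribution(token_vecs, w_wrong, w_right):
    """Split logit(wrong) - logit(right) into one score per input token.

    A positive score means the token pushes the model toward the
    incorrect output; a negative score means it favors the correct one.
    """
    # Contrast direction in representation space.
    w_diff = [a - b for a, b in zip(w_wrong, w_right)]
    return [dot(w_diff, x) for x in token_vecs]

# Three input tokens with 2-dimensional representations (made-up values).
token_vecs = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
w_wrong = [0.5, 1.0]    # readout weights of the incorrect output token
w_right = [1.0, -0.5]   # readout weights of the correct alternative

scores = contrastive_attribution(token_vecs, w_wrong, w_right)

# Conservation check: the per-token scores sum to the logit difference.
pooled = [sum(col) for col in zip(*token_vecs)]
logit_diff = dot(w_wrong, pooled) - dot(w_right, pooled)

print(scores)      # [-0.5, 2.75, 2.0]
print(logit_diff)  # 4.25
```

In this toy case the second and third tokens account for the preference for the incorrect output, while the first token slightly favors the correct one. Reading off such per-token signs and magnitudes is the kind of signal that token-level contrastive attribution aims to provide in a full model.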

