HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs

Abstract

The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. Existing detection methods, however, usually address only one source and rely on task-specific heuristics, which limits their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, a score based on the neural tangent kernel (NTK) that leverages the NTK's induced geometry and captured representations to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, against 11 competitive baselines, and across 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.
