HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs

Abstract

The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one of these sources and rely on task-specific heuristics, which limits their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into a data-driven component, linked to training-time mismatches, and a reasoning-driven component, linked to inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, a score based on the neural tangent kernel (NTK) that leverages the geometry induced by the NTK and the representations it captures to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard across 10 diverse benchmarks, against 11 competitive baselines, and with 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.
