LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

Abstract

As large language models (LLMs) enter practical use, methods that can reliably detect factual inconsistencies are crucial for reducing the spread of misinformation and improving trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few LLMs perform competitively with traditional non-LLM methods on classification benchmarks for factual inconsistency detection. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task, and it exposes issues with existing evaluation benchmarks that affect evaluation precision. To address this, we propose a new protocol for creating inconsistency detection benchmarks and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, with inter-annotator agreement estimated at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
