LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

Abstract

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively with traditional non-LLM methods on classification benchmarks for factual inconsistency detection. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task, and it exposes issues with existing evaluation benchmarks that affect evaluation precision. To address this, we propose a new protocol for creating inconsistency detection benchmarks and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, with inter-annotator agreement estimated at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.

