Can Large Language Models Infer Causal Relationships from Real-World Text?
May 25, 2025
Authors: Ryan Saklad, Aman Chadha, Oleg Pavlov, Raha Moraffah
cs.AI
Abstract
Understanding and inferring causal relationships from texts is a core aspect
of human cognition and is essential for advancing large language models (LLMs)
towards artificial general intelligence. Existing work primarily focuses on
synthetically generated texts which involve simple causal relationships
explicitly mentioned in the text. This fails to reflect the complexities of
real-world tasks. In this paper, we investigate whether LLMs are capable of
inferring causal relationships from real-world texts. We develop a benchmark
drawn from real-world academic literature which includes diverse texts with
respect to length, complexity of relationships (different levels of
explicitness, number of events, and causal relationships), and domains and
sub-domains. To the best of our knowledge, our benchmark is the first-ever
real-world dataset for this task. Experiments with state-of-the-art LLMs on our
benchmark demonstrate significant challenges: the best-performing model achieves
an average F1 score of only 0.477. Analysis reveals common pitfalls: difficulty
handling implicitly stated information, distinguishing relevant causal factors
from surrounding contextual details, and connecting causally relevant
information spread across lengthy textual passages. By systematically
characterizing these deficiencies, our benchmark offers targeted insights for
further research into advancing LLM causal reasoning.
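The abstract reports results as an average F1 score. As a rough illustration only, the sketch below assumes that each document's causal relationships are represented as a set of (cause, effect) pairs, that a predicted pair counts as correct only on an exact string match, and that "average" means the mean of per-document scores. The names `Edge`, `f1_score`, and `average_f1` are hypothetical, and the paper's actual matching and averaging procedure may differ.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: one causal relationship as a (cause, effect) pair.
Edge = Tuple[str, str]


def f1_score(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """F1 between predicted and gold causal edges, using exact matching (assumption)."""
    if not predicted and not gold:
        return 1.0  # assumption: an empty prediction for an empty gold set is perfect
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def average_f1(pairs: Iterable[Tuple[Set[Edge], Set[Edge]]]) -> float:
    """Macro average (assumption): mean of per-document F1 scores."""
    scores = [f1_score(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy, made-up example: one partially correct document, one fully correct.
    gold = {("rainfall", "crop yield"), ("crop yield", "food prices")}
    pred = {("rainfall", "crop yield"), ("temperature", "crop yield")}
    print(average_f1([(pred, gold), (gold, gold)]))  # prints 0.75
```

In the toy example, the first document has precision and recall of 0.5 (F1 = 0.5) and the second is exact (F1 = 1.0), so the macro average is 0.75. Under exact matching, a model that paraphrases a cause or effect scores nothing for that pair, which is one reason real evaluations often use softer matching.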