Can Large Language Models Infer Causal Relationships from Real-World Text?
May 25, 2025
Authors: Ryan Saklad, Aman Chadha, Oleg Pavlov, Raha Moraffah
cs.AI
Abstract
Understanding and inferring causal relationships from texts is a core aspect
of human cognition and is essential for advancing large language models (LLMs)
towards artificial general intelligence. Existing work primarily focuses on
synthetically generated texts which involve simple causal relationships
explicitly mentioned in the text. This fails to reflect the complexities of
real-world tasks. In this paper, we investigate whether LLMs are capable of
inferring causal relationships from real-world texts. We develop a benchmark
drawn from real-world academic literature which includes diverse texts with
respect to length, complexity of relationships (different levels of
explicitness, number of events, and causal relationships), and domains and
sub-domains. To the best of our knowledge, our benchmark is the first-ever
real-world dataset for this task. Experiments with state-of-the-art LLMs on
our benchmark reveal significant challenges, with the best-performing model
achieving an average F1 score of only 0.477. Analysis reveals common pitfalls:
difficulty handling implicitly stated information, distinguishing relevant
causal factors from surrounding contextual details, and connecting causally
relevant information spread across lengthy textual passages. By
systematically characterizing these deficiencies, our benchmark
offers targeted insights for further research into advancing LLM causal
reasoning.
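The abstract reports an average F1 score but does not say how predictions are matched against the reference causal relationships. As a rough illustration only, the sketch below assumes that each causal relationship is a directed (cause, effect) pair and that a prediction counts as correct only on an exact match. The function name `edge_f1` and the example edges are hypothetical and are not taken from the paper.

```python
def edge_f1(predicted, gold):
    """Return (precision, recall, F1) over sets of directed (cause, effect) edges.

    Assumption: an edge is correct only if it exactly matches a gold edge,
    including direction. The paper may use a different matching scheme.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three gold edges, two predicted, one of them correct.
gold_edges = {
    ("smoking", "tar buildup"),
    ("tar buildup", "lung damage"),
    ("lung damage", "reduced capacity"),
}
predicted_edges = {
    ("smoking", "tar buildup"),
    ("smoking", "lung damage"),  # skips the intermediate event, so it does not match
}

precision, recall, f1 = edge_f1(predicted_edges, gold_edges)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under exact matching, a model that skips an intermediate event or reverses an edge receives no credit for that edge, which is one way the pitfalls listed in the abstract could lower the score.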