How FaR Are Large Language Models From Agents with Theory-of-Mind?
October 4, 2023
Authors: Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui
cs.AI
Abstract
"Thinking is for Doing." Humans can infer other people's mental states from
observations--an ability called Theory-of-Mind (ToM)--and subsequently act
pragmatically on those inferences. Existing question answering benchmarks such
as ToMi ask models questions to make inferences about beliefs of characters in
a story, but do not test whether models can then use these inferences to guide
their actions. We propose a new evaluation paradigm for large language models
(LLMs): Thinking for Doing (T4D), which requires models to connect inferences
about others' mental states to actions in social scenarios. Experiments on T4D
demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking
characters' beliefs in stories, but they struggle to translate this capability
into strategic action. Our analysis reveals that the core challenge for LLMs lies in
identifying the implicit inferences about mental states that lead to choosing the
correct action in T4D, without being explicitly asked about them as in ToMi. To
bridge this gap, we introduce a zero-shot prompting framework, Foresee
and Reflect (FaR), which provides a reasoning structure that encourages LLMs to
anticipate future challenges and reason about potential actions. FaR boosts
GPT-4's performance from 50% to 71% on T4D, outperforming other prompting
methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to
diverse out-of-distribution story structures and scenarios that also require
ToM inferences to choose an action, consistently outperforming other methods
including few-shot in-context learning.
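
To make the idea of a two-step reasoning structure concrete, the following is a minimal sketch of how a zero-shot prompt with a foresee step and a reflect step might be assembled for a multiple-choice action question. The abstract does not give the prompt wording used by FaR, so the function name `build_far_style_prompt`, the step instructions, and the example story are all hypothetical illustrations rather than the authors' template.

```python
from typing import List


def build_far_style_prompt(story: str, question: str, options: List[str]) -> str:
    """Assemble a zero-shot prompt with a foresee step and a reflect step.

    Hypothetical wording: the step instructions below are assumptions made
    for illustration, not the prompt text from the paper.
    """
    # Label the candidate actions A, B, C, ... so the model can answer by letter.
    labeled = [f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(options)]
    parts = [
        "Story:",
        story.strip(),
        "",
        f"Question: {question.strip()}",
        "Options:",
        *labeled,
        "",
        # Foresee: anticipate what each character may do next and what
        # challenge they could run into given what they believe.
        "Step 1 (Foresee): For each character, state what they likely believe, "
        "what they may do next, and what challenge they could face.",
        # Reflect: connect those anticipated challenges to the available actions.
        "Step 2 (Reflect): For each option, explain whether it would help "
        "with a challenge identified in Step 1.",
        "Answer with the letter of one option.",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up example story in the style of a false-belief task.
    prompt = build_far_style_prompt(
        story="Tom put the apple in the basket and left. Ella moved it to the box.",
        question="Who would benefit from being told where the apple is?",
        options=["Tom", "Ella", "Neither"],
    )
    print(prompt)
```

The resulting string would be sent to a model as a single zero-shot prompt; no in-context examples are included, which matches the zero-shot setting described above.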