How FaR Are Large Language Models From Agents with Theory-of-Mind?

Abstract

"Thinking is for Doing." Humans can infer other people's mental states from observations, an ability called Theory-of-Mind (ToM), and then act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models to make inferences about the beliefs of characters in a story, but do not test whether models can use these inferences to guide their actions. We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters' beliefs in stories, but struggle to translate this capability into strategic action. Our analysis reveals that the core challenge for LLMs lies in identifying the implicit inferences about mental states that lead to choosing the correct action in T4D, without being explicitly asked about them as in ToMi. To bridge this gap, we introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions. FaR boosts GPT-4's performance on T4D from 50% to 71%, outperforming other prompting methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM inferences to choose an action, consistently outperforming other methods, including few-shot in-context learning.

