How FaR Are Large Language Models From Agents with Theory-of-Mind?

Abstract

"Thinking is for Doing." Humans can infer other people's mental states from observations, an ability called Theory-of-Mind (ToM), and subsequently act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models to infer the beliefs of characters in a story, but do not test whether models can then use these inferences to guide their actions. We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters' beliefs in stories, but struggle to translate this capability into strategic action. Our analysis reveals that the core challenge for LLMs lies in identifying, without being explicitly asked as in ToMi, the implicit inferences about mental states that lead to choosing the correct action in T4D. To bridge this gap, we introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions. FaR boosts GPT-4's performance on T4D from 50% to 71%, outperforming other prompting methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM inferences to choose an action, consistently outperforming other methods, including few-shot in-context learning.
