How FaR Are Large Language Models From Agents with Theory-of-Mind?

Abstract

"Thinking is for Doing." Humans can infer other people's mental states from observations, an ability called Theory-of-Mind (ToM), and then act pragmatically on those inferences. Existing question answering benchmarks such as ToMi ask models to make inferences about the beliefs of characters in a story, but do not test whether models can then use these inferences to guide their actions. We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking characters' beliefs in stories, but struggle to translate this capability into strategic action. Our analysis reveals that the core challenge for LLMs lies in identifying the implicit inferences about mental states that lead to the correct action in T4D, since, unlike in ToMi, the models are not explicitly asked about them. To bridge this gap, we introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions. FaR boosts GPT-4's performance from 50% to 71% on T4D, outperforming other prompting methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM inferences to choose an action, consistently outperforming other methods, including few-shot in-context learning.
