Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
October 21, 2025
作者: Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Dogruoz, Najoung Kim, Alice Oh
cs.AI
Abstract
As large language models (LLMs) are increasingly used in human-AI
interactions, their social reasoning capabilities in interpersonal contexts are
critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean,
sourced from movie scripts. The task involves evaluating models' social
reasoning capability to infer the interpersonal relationships (e.g., friends,
sisters, lovers) between speakers in each dialogue. Each dialogue is annotated
with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by
native (or equivalent) Korean and English speakers from Korea and the U.S.
Across the nine models we evaluate on our task, current proprietary LLMs
achieve around 75-80% accuracy on the English dataset, whereas their
performance on Korean drops to 58-69%. More strikingly, models select
Unlikely relationships in 10-25% of
their responses. Furthermore, we find that thinking models and chain-of-thought
prompting, effective for general reasoning, provide minimal benefits for social
reasoning and occasionally amplify social biases. Our findings reveal
significant limitations in current LLMs' social reasoning capabilities,
highlighting the need for efforts to develop socially-aware language models.
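The evaluation described above can be sketched in a few lines: each dialogue carries probabilistic relationship labels (Highly Likely, Less Likely, Unlikely), and a model's chosen relationship is scored by which label bucket it falls into. This is a minimal illustrative sketch, not the paper's released code; the function name, data layout, and the rule of treating unannotated relations as Unlikely are all assumptions.

```python
# Hypothetical sketch of a SCRIPTS-style tally: how often does a model's
# predicted relationship land in each annotated likelihood bucket?
from collections import Counter

def score_predictions(dialogues, predictions):
    """Tally predictions by annotated likelihood bucket.

    dialogues: list of dicts mapping relationship -> label, e.g.
        {"friends": "Highly Likely", "lovers": "Unlikely"}
    predictions: list of relationship strings, one per dialogue.
    Relations absent from a dialogue's annotations are treated as
    Unlikely (an assumption for this sketch).
    """
    tally = Counter()
    for labels, pred in zip(dialogues, predictions):
        tally[labels.get(pred, "Unlikely")] += 1
    total = len(predictions)
    return {bucket: count / total for bucket, count in tally.items()}

# Toy example: 4 dialogues, invented annotations
dialogues = [
    {"friends": "Highly Likely", "lovers": "Unlikely"},
    {"sisters": "Highly Likely", "friends": "Less Likely"},
    {"lovers": "Highly Likely", "friends": "Less Likely"},
    {"friends": "Highly Likely", "lovers": "Unlikely"},
]
predictions = ["friends", "sisters", "friends", "lovers"]
print(score_predictions(dialogues, predictions))
# {'Highly Likely': 0.5, 'Less Likely': 0.25, 'Unlikely': 0.25}
```

Under this scheme, the paper's headline numbers correspond to the Highly Likely share (75-80% in English, 58-69% in Korean) and the Unlikely share (10-25%).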