LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

Abstract

Online group chats are social spaces governed by local conversational norms that are rarely stated explicitly. Whether LLM-based agents are able and willing to recognize and adapt to these norms remains largely unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate an implicit local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary in how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Under naive prompting, performance remains limited for most models. Explicit norm-aware prompting helps unevenly: Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%, while several other models show only small gains or even regressions. LoSoNA responds to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and apply them in a one-turn group-chat response.
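
The scenario layout described above (a transcript, a final elicitor turn, and an optional instruction that varies by prompting condition) could be represented roughly as follows. This is only an illustrative sketch: the names `Turn`, `Scenario`, and `render_prompt`, the prompt layout, and the example chat are all hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class Scenario:
    # Hypothetical structure, assumed for illustration only.
    transcript: list[Turn]  # prior turns in which non-subject participants demonstrate the norm
    elicitor: Turn          # final turn that forces the subject model to respond
    subject: str            # display name of the subject model in the chat


def render_prompt(scenario: Scenario, instruction: str = "") -> str:
    """Flatten a scenario into one prompt string.

    `instruction` stands in for a prompting condition: an empty string
    gives a bare transcript, while a non-empty string is prepended as
    explicit guidance. The actual condition wording is not given in the
    abstract, so any text passed here is an assumption.
    """
    lines = [f"{t.speaker}: {t.text}" for t in scenario.transcript]
    lines.append(f"{scenario.elicitor.speaker}: {scenario.elicitor.text}")
    lines.append(f"{scenario.subject}:")  # open slot for the subject's single reply
    chat = "\n".join(lines)
    return f"{instruction}\n\n{chat}" if instruction else chat


# Invented example: the chat content below is made up for illustration.
example = Scenario(
    transcript=[Turn("ana", "lunch at noon?"), Turn("ben", "works for me")],
    elicitor=Turn("ana", "@bot does noon work for you?"),
    subject="bot",
)
print(render_prompt(example))
```

Keeping the instruction separate from the transcript means the same scenario can be rendered under each prompting condition without changing the chat content.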