LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

Abstract

Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remain mostly unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate a hidden local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Under naive prompting, performance remains limited for most models. Explicit norm-aware prompting helps unevenly: Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%, while several other models show only small gains or even regressions. LoSoNA contributes to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and apply them in a one-turn group-chat response.
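
To make the scenario structure concrete, the following is a minimal sketch of how one scenario and two of the prompting conditions might be represented. It is an illustration under stated assumptions, not the benchmark's actual implementation: the names `Turn`, `Scenario`, `INSTRUCTIONS`, and `build_prompt`, the instruction wording, and the example chat (a group that signs every message with "o7") are all hypothetical. Only the "naive" and "norm-aware" conditions are sketched, since the other two conditions are not described here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str


@dataclass(frozen=True)
class Scenario:
    subject: str             # chat handle assigned to the subject model
    transcript: list[Turn]   # prior turns in which non-subject participants show the norm
    elicitor: Turn           # final turn that forces the subject to respond


# Hypothetical instruction wording; the benchmark's real prompts may differ.
INSTRUCTIONS = {
    "naive": "You are {subject} in this group chat. Write your next message.",
    "norm_aware": (
        "You are {subject} in this group chat. Treat the earlier messages as "
        "evidence of how members of this chat write, and follow the same "
        "conventions in your next message."
    ),
}


def build_prompt(scenario: Scenario, condition: str) -> str:
    """Assemble a one-turn prompt: instruction, transcript, elicitor, subject cue."""
    if condition not in INSTRUCTIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    lines = [INSTRUCTIONS[condition].format(subject=scenario.subject), ""]
    lines += [f"{t.speaker}: {t.text}" for t in scenario.transcript]
    lines.append(f"{scenario.elicitor.speaker}: {scenario.elicitor.text}")
    lines.append(f"{scenario.subject}:")  # the subject model continues from here
    return "\n".join(lines)


# Invented example: the hidden local norm is ending every message with "o7".
example = Scenario(
    subject="sam",
    transcript=[
        Turn("ana", "Build is green again. o7"),
        Turn("raj", "Nice, deploying after lunch. o7"),
    ],
    elicitor=Turn("ana", "sam, can you review the PR today? o7"),
)

if __name__ == "__main__":
    print(build_prompt(example, "norm_aware"))
```

In this sketch, the only difference between conditions is the instruction line, so any change in whether the subject's reply follows the demonstrated norm can be attributed to how explicitly the prompt points at the prior conversation.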