BenSyc: Benchmarking Conversational Sycophancy and Human Alignment of Large Language Models in Bengali Contexts

Abstract

Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.
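
For readers unfamiliar with the metric, the following is a minimal sketch of how Macro-F1 could be computed over the five taxonomy levels named above. It is not the authors' evaluation code: the function name `macro_f1`, the `LABELS` list ordering, and the choice to score a class with no gold or predicted instances as 0 are all assumptions made for illustration.

```python
from typing import Sequence

# The five taxonomy levels named in the abstract (ordering is an assumption).
LABELS = ["Invalidation", "Neutral", "Support", "Validation", "Escalation"]


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Return the unweighted mean of per-class F1 scores.

    Assumption: a class with no true positives, false positives, or
    false negatives contributes an F1 of 0.0 to the average.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN), which equals the harmonic mean
        # of precision and recall whenever both are defined.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy labels: one Validation response is mistaken for Support.
    gold = ["Support", "Validation", "Support", "Escalation"]
    pred = ["Support", "Support", "Support", "Escalation"]
    print(f"Macro-F1: {macro_f1(gold, pred):.3f}")
```

Because every class carries equal weight in the average, a model that systematically labels Validation as Support is penalized heavily even when Support is the more frequent class, which is why the metric suits the support-versus-validation distinction the abstract highlights.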