BenSyc: Benchmarking Conversational Sycophancy and Human Alignment of Large Language Models in Bengali Contexts

Abstract

Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.