BenSyc: Benchmarking LLM Conversational Sycophancy and Human Alignment in Bengali Contexts

Abstract

Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses can shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research focuses primarily on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy: Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on two tasks, conversational alignment classification and response generation. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.
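For readers unfamiliar with the reported metric, the following is a minimal sketch of how Macro-F1 is conventionally computed over the five taxonomy labels. It is not the authors' evaluation code: the function name `macro_f1`, the 0-100 scaling, and the choice to score a class with no gold and no predicted instances as 0 are assumptions made for illustration.

```python
from typing import Sequence

# The five taxonomy labels, in the order listed in the abstract.
LABELS = ["Invalidation", "Neutral", "Support", "Validation", "Escalation"]


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1 scores, on a 0-100 scale.

    Assumption: a class with no gold and no predicted instances
    contributes an F1 of 0 rather than being skipped.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    per_class = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)

    return 100.0 * sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not benchmark data.
    gold = ["Support", "Validation", "Support", "Escalation"]
    pred = ["Support", "Support", "Support", "Escalation"]
    print(macro_f1(gold, pred))
```

Because every class carries equal weight regardless of frequency, confusing a rarer class such as Validation with Support lowers the score more than it would lower plain accuracy. The same function covers the binary setting when `labels` is set to whichever two label names the dataset uses.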