Beacon:大语言模型中潜在迎合行为的单轮诊断与缓解
Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
October 19, 2025
作者: Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal
cs.AI
摘要
大型语言模型在真实性与恭维奉承之间形成了一种内在的结构性权衡,这种权衡源于奖励优化过程中将帮助性与礼貌顺从混为一谈。这种潜在的偏见,被称为谄媚性,表现为对用户认同的偏好而非基于原则的推理。我们引入了Beacon,一个单轮强制选择基准测试,它能够在独立于对话上下文的情况下隔离这种偏见,从而精确测量事实准确性与顺从偏见之间的张力。对十二个最先进模型的评估显示,谄媚性可分解为稳定的语言和情感子偏见,每个子偏见都随模型能力的提升而增强。我们进一步提出了提示层面和激活层面的干预措施,这些措施能在相反方向上调节这些偏见,揭示了对齐内部几何结构作为真实性与社会合规判断之间动态流形的特性。Beacon将谄媚性重新定义为一种可测量的规范性泛化错误,为研究和缓解大规模生成系统中的对齐漂移提供了可复现的基础。
English
Large language models internalize a structural trade-off between truthfulness
and obsequious flattery, emerging from reward optimization that conflates
helpfulness with polite submission. This latent bias, known as sycophancy,
manifests as a preference for user agreement over principled reasoning. We
introduce Beacon, a single-turn forced-choice benchmark that isolates this bias
independent of conversational context, enabling precise measurement of the
tension between factual accuracy and submissive bias. Evaluations across twelve
state-of-the-art models reveal that sycophancy decomposes into stable
linguistic and affective sub-biases, each scaling with model capacity. We
further propose prompt-level and activation-level interventions that modulate
these biases in opposing directions, exposing the internal geometry of
alignment as a dynamic manifold between truthfulness and socially compliant
judgment. Beacon reframes sycophancy as a measurable form of normative
misgeneralization, providing a reproducible foundation for studying and
mitigating alignment drift in large-scale generative systems.