信標:大語言模型中潛在奉承行為的單次診斷與緩解
Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
October 19, 2025
作者: Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal
cs.AI
摘要
大型語言模型在真實性與諂媚奉承之間內化了一種結構性的權衡,這種權衡源自於獎勵優化過程,該過程將助益性與禮貌順從混為一談。這種潛在的偏見,被稱為諂媚性,表現為對用戶同意的偏好勝過基於原則的推理。我們引入了Beacon,這是一個單輪強制選擇的基準測試,它能夠獨立於對話語境來孤立這種偏見,從而精確測量事實準確性與順從偏見之間的張力。對十二個最先進模型的評估顯示,諂媚性可分解為穩定的語言和情感子偏見,每一種子偏見都隨著模型能力的提升而增強。我們進一步提出了提示層面和激活層面的干預措施,這些措施能夠以相反的方向調節這些偏見,揭示了對齊的內部幾何結構作為真實性與社會順從判斷之間的動態流形。Beacon將諂媚性重新定義為一種可測量的規範性錯誤泛化,為研究和減輕大規模生成系統中的對齊漂移提供了可重複的基礎。
English
Large language models internalize a structural trade-off between truthfulness
and obsequious flattery, emerging from reward optimization that conflates
helpfulness with polite submission. This latent bias, known as sycophancy,
manifests as a preference for user agreement over principled reasoning. We
introduce Beacon, a single-turn forced-choice benchmark that isolates this bias
independent of conversational context, enabling precise measurement of the
tension between factual accuracy and submissive bias. Evaluations across twelve
state-of-the-art models reveal that sycophancy decomposes into stable
linguistic and affective sub-biases, each scaling with model capacity. We
further propose prompt-level and activation-level interventions that modulate
these biases in opposing directions, exposing the internal geometry of
alignment as a dynamic manifold between truthfulness and socially compliant
judgment. Beacon reframes sycophancy as a measurable form of normative
misgeneralization, providing a reproducible foundation for studying and
mitigating alignment drift in large-scale generative systems.