Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Abstract

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, which emerges from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for agreeing with the user over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independently of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.
