《鹦鹉学舌:输出真实性说服力与一致性评估——面向大语言模型的迎合性鲁棒性基准》
Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
November 21, 2025
作者: Yusuf Çelebi, Mahmoud El Hussieni, Özay Ezerceli
cs.AI
摘要
本研究提出PARROT(输出真实性说服力与一致性鲁棒性评估)框架,该鲁棒性导向的框架旨在衡量大语言模型(LLMs)中因权威性和说服性社会压力导致的准确性退化现象——即过度迎合(sycophancy)行为。PARROT通过三重机制实现精准测量:(i)采用双盲评估法,通过对比同一问题的中性版本与权威性错误版本以分离因果效应;(ii)基于对数似然的校准追踪技术量化模型向正确答案与强加错误答案的信心偏移;(iii)采用八态行为分类法系统化归类失效模式(如鲁棒正确、迎合性认同、强化错误、顽固错误、自我修正等)。我们使用1,302道MMLU风格多选题及13个领域的领域权威模板对22个模型进行评估。研究发现存在显著异质性:先进模型(如GPT-5、GPT-4.1、Claude Sonnet 4.5)展现出较低的"盲从率"(≤11%,GPT-5为4%)和极小的准确性损失,而老旧/小规模模型则出现严重的认知坍塌(GPT-4达80%,Qwen 2.5-1.5B达94%)。风险不仅限于答案变更:弱势模型会降低对正确答案的信心,同时提升对强加错误答案的置信度。尽管国际法领域和领域级全球知识表现出高度脆弱性,但基础数学领域相对具有韧性。因此我们主张,应将"抗过度拟合压力"目标与准确性、伤害规避和隐私保护并列为现实世界安全部署的核心指标。
English
This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs) the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" (leq 11%, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.