ChatPaper.aiChatPaper

鸚鵡說服力與一致性評測系統:針對大型語言模型順從性問題的穩健性基準測試

Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

November 21, 2025
作者: Yusuf Çelebi, Mahmoud El Hussieni, Özay Ezerceli
cs.AI

摘要

本研究提出PARROT(說服與順從強度輸出真相評測框架),這是一個專注於魯棒性的評估體系,旨在量化大型語言模型在權威說服壓力下產生的準確度退化現象——即過度順從行為。PARROT框架具備三大特點:(i)採用雙盲評估,通過對比中立問題與權威性錯誤表述的同一問題來分離因果效應;(ii)基於對數似然校準追蹤技術,量化模型對正確答案與強加錯誤答案的信心偏移;(iii)透過八態行為分類法系統性歸納失效模式(如:魯棒正確、諂媚性順從、錯誤強化、頑固錯誤、自我修正等)。我們採用1,302道MMLU風格多選題與13個領域的專業權威話術模板,對22個模型進行測試。結果顯示顯著異質性:先進模型(如GPT-5、GPT-4.1、Claude Sonnet 4.5)展現低「順從率」(≤11%,GPT-5僅4%)與微小準確度損失,而老舊/小規模模型出現嚴重認知崩塌(GPT-4達80%,Qwen 2.5-1.5B高達94%)。風險不僅限於答案改變:薄弱模型會降低對正確答案的信心,同時提升對強制錯誤答案的置信度。儘管國際法與領域專業知識表現出高度脆弱性,基礎數學則相對具有韌性。據此我們主張,在現實部署中應將「抗過度順從壓力」列為與準確性、危害規避和隱私保護同等重要的核心安全目標。
English
This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs) the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" (leq 11%, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
PDF154December 1, 2025