Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs
June 14, 2026
Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner
cs.AI
Abstract
Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but they are not suited to testing whether an LLM sticks with a correct answer when that answer is challenged by a plausible counterargument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. We find that self-attribution consistently increases flip rates (mean +7.1 pp, up to +18.7 pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6 pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
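To make the two headline quantities concrete, the following is a minimal sketch of how a flip rate and a pooled, best-argument-per-question flip rate could be computed for one target model. The record layout `(question_id, source_model, flipped)`, the toy data, and the function names are assumptions made for illustration; they are not the schema of the released challenge records. The pooled rate here treats a question as flipped if any source model's argument flips it, which is one simple reading of "selecting the most effective one per question", not necessarily the authors' exact selection rule.

```python
from collections import defaultdict

# Hypothetical records for a single target model that answered every
# question correctly before being challenged:
# (question_id, source_model_of_the_argument, did_the_target_flip)
records = [
    ("q1", "model_a", True),
    ("q1", "model_b", False),
    ("q2", "model_a", False),
    ("q2", "model_b", False),
    ("q3", "model_a", False),
    ("q3", "model_b", True),
]


def flip_rate(flags):
    """Percentage of challenges after which the answer changed."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def per_source_flip_rates(records):
    """Flip rate when all arguments come from one source model."""
    by_source = defaultdict(list)
    for _qid, source, flipped in records:
        by_source[source].append(flipped)
    return {source: flip_rate(flags) for source, flags in by_source.items()}


def pooled_flip_rate(records):
    """Flip rate when, per question, the most effective argument is used.

    Assumption: a question counts as flipped if at least one pooled
    argument flips the target model on that question.
    """
    by_question = defaultdict(bool)
    for qid, _source, flipped in records:
        by_question[qid] = by_question[qid] or flipped
    return flip_rate(by_question.values())


if __name__ == "__main__":
    for source, rate in sorted(per_source_flip_rates(records).items()):
        print(f"{source}: {rate:.1f}%")
    print(f"pooled: {pooled_flip_rate(records):.1f}%")
```

Under this reading, the pooled rate can never fall below the best single-source rate, which is consistent with the abstract's claim that pooling yields stronger challenges. Differences between two conditions (for example, with and without self-attribution) would be reported in percentage points by subtracting the two rates.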